Thursday, July 31, 2008

Stemming, Part 14: Verb Endings

In today’s posting, we’ll take a look at the initial step in the Porter stemmer.

An Overview

The first step in stemming tokens involves removing plural -s and verb endings -ed and -ing. It also turns -y to -i, so it will be recognized as a suffix in later steps. The documentation in the C source code lists some examples:

caresses -> caress ponies -> poni ties -> ti caress -> caress cats -> cat feed -> feed agreed -> agree disabled -> disable matting -> mat mating -> mate meeting -> meet milling -> mill messing -> mess meetings -> meet

Plurals

The first step is to strip off plurals. If the word ends in a -sses or -ies, this removes the -es; otherwise, if it ends in -s, but not -ss, it takes off the -s. If none of those apply, it returns the original stemmer.

(defn stem-plural
  "This is part of step 1ab. It removes plurals (-s) from a stem."
  [stemmer]
  (if (= (last (:word stemmer)) \s)
    (cond-ends? st stemmer
      "sses" (reset-index (pop (pop (:word st))))
      "ies" (set-to st "i")
      :else (if (and (>= (count (:word st)) 2)
                     (not= (nth (:word st) (- (count (:word st)) 2))
                           \s))
              (assoc st :word (pop (:word st)))
              st))
    stemmer))

It’s been a while since we’ve seen some of these functions. Here are a few reminders about what they do:

reset-index returns a new stemmer from a word vector (here with the last two letters removed from the word).

set-to removes the -ies from the end of the word and replaces it with an -i.

cond-ends? is the macro we created in the last few postings. I just wanted to point that out.

Verb Endings

Another part of step 1 involves removing the verb suffixes -ed and -ing from the word. It first looks for -eed if it is long enough (that is, if it has more than one internal sequence of consonants), it removes the -d; if it ends in -ed, it removes that; and if it ends in -ing, it removes that. In either of the last two cases, it also expands truncated suffixes.

Expanding suffixes just tests for certain endings and, if they are found, appends an -e to the word. Specifically, it looks for -at, -bl, and -iz. It also checks for a double-consonant ending. Some are all right in English (-ll, -ss, and -zz), but most others should be removed. Finally, it removes the final consonant in words that end in CVC (with exceptions). This process is handled by the stem-expand-suffix function, which is listed first.

(defn stem-expand-suffix
  "This is part of step 1ab. It expands -at, -bl,
  and -iz by adding an -e in certain circumstances."
  [stemmer]
  (cond-ends? st stemmer
    "at" (set-to st "ate")
    "bl" (set-to st "ble")
    "iz" (set-to st "ize")
    :else (cond
            (double-c? st (dec (count (:word st))))
              (if (#{\l \s \z} (last (:word st)))
                st
                (assoc st :word (pop (:word st))))
            (and (= (m st) 1) (cvc? st (dec (count (:word st)))))
              (set-to st "e")
            :else
              st)))

(defn stem-verb-ending
  "This is part of step 1ab. It removes verb endings -ed
  and -ing."
  [stemmer]
  (cond-ends? st stemmer
    "eed" (if (pos? (m st))
            (assoc st :word (pop (:word st)))
            stemmer)
    "ed"  (if (vowel-in-stem? st)
            (stem-expand-suffix (assoc st :word (subword st)))
            stemmer)
    "ing" (if (vowel-in-stem? st)
            (stem-expand-suffix (assoc st :word (subword st)))
            stemmer)))

double-c? returns true if the word ends in a double consonant. For example, -ll or -ss or something.

cvc? returns true if the word ends in a consonant-vowel-consonant sequence and if the final consonant is not w, x, or y.

m returns the number of consonant sequences between the start of a word and the index position. But if the word starts with a consonant sequence, it isn’t counted.

Step 1AB

The actual function for step 1AB (A calls stem-plural and B calls stem-verb-ending) is simple. It just passes its input through the two functions:

(defn step-1ab
  [stemmer]
  (stem-verb-ending (stem-plural stemmer)))

One thing about functional languages is that they often have to be read from right to left. The stemmer gets passed to stem-plural first and stem-verb-ending second. The way it is written, however, is counter-intuitive, and it makes functional languages harder to read.

Clojure provides an improvement to this. It uses the -> macro to build expressions like above. Let’s spend some time understanding what this macro does, first by using it and then by looking at the expressions it outputs.

The first parameter to -> is an expression. The remaining parameters are functions. The expression parameter gets passed as the first parameter to the first function, and the output of this gets passed as the first parameter to the second function parameter. This continues until all the functions have been chained together.

To play with this, let’s define some functions that operate on strings.

porter=> (defn to-lower-case [string] (.toLowerCase string))
#'porter/to-lower-case
porter=> (defn trim [string] (.trim string))
#'porter/trim
porter=> (trim (to-lower-case "   ThIs NeEdS ClEaNiNg  "))
"this needs cleaning"

If we make that last call with ->, it looks like this:

porter=> (-> "   ThIs NeEdS ClEaNiNg   " to-lower-case trim)
"this needs cleaning"

This is really handy if there aren’t any other arguments. But what if you want to use a function that needs more than one argument. As long as the first argument is the expression parameter or the result of the previous function in the sequence, it’s no problem. Just enclose the function and the remaining parameters to that function in parentheses. For example, this defines a wrapper for String.substring that returns everything from the second parameter on.

porter=> (defn substring [string index] (.substring string index))
#'porter/substring
porter=> (-> "   ThIs NeEdS ClEaNiNg   " to-lower-case trim (substring 11))
"cleaning"

We can see what this does using macroexpand-1:

porter=> (macroexpand-1 '(-> "   ThIs NeEdS ClEaNiNg   " to-lower-case trim (substring 11)))
(clojure/-> (clojure/-> "   ThIs NeEdS ClEaNiNg   " to-lower-case) trim (substring 11))

Umm. It changed, but not by much. Let’s just take the inner, second -> and try expanding it:

porter=> (macroexpand-1 '(-> "   ThIs NeEdS ClEaNiNg   " to-lower-case))
(to-lower-case "   ThIs NeEdS ClEaNiNg   ")

That seems like a normal function call. If we keep breaking it down, we get this:

(substring (trim (to-lower-case "   ThIs NeEdS ClEaNiNg   ")) 11)

Which is what we want and expect.

Now, we can rewrite step-1ab to use this macro and be much more readable.

(defn step-1ab
  "step-1ab gets rid of plurals and -ed or -ing. E.g.,
    caresses -> caress
    ponies -> poni
    ties -> ti
    caress -> caress
    cats -> cat
    feed -> feed
    agreed -> agree
    disabled -> disable
    matting -> mat
    mating -> mate
    meeting -> meet
    milling -> mill
    messing -> mess
    meetings -> meet
  "
  [stemmer]
  (-> stemmer stem-plural stem-verb-ending))

Step 1C

The rest of step one just tests to see if the word ends in -y. If it does, and if there is a vowel in the stem, it removes the -y and adds a -i to the word.

(defn step-1c
  "Turns terminal y to i when there is another vowel in the stem."
  [stemmer]
  (if-ends? st stemmer "y"
            (if (vowel-in-stem? st)
              (reset-index (conj (pop (:word st)) \i))
              stemmer)))

That’s pretty straightforward. It uses if-ends? to test for the -y, and vowel-in-stem? looks for a vowel before the -y. If either of these is not the case, the original stemmer is returned.

Those two functions comprise step 1. In the next posting, we’ll look at the next several steps.

Tuesday, July 29, 2008

Stemming, Part 13: The Steps for Processing

We finally have all the pieces in place to actually put the Porter stemmer together. But it’s been so long, I’ve certainly forgotten what goes next, so let’s take a moment to remember where we are going with this.

Earlier I outlined the process that the stemmer will perform in five steps:

  1. Get rid of plurals, -ed, and -ing, and turn -y to -i, so it will be recognized as a suffix in later steps;
  2. Collapse multiple suffixes, such as -ational, -ator, -iveness, and others, to a single suffix, such as -ate, -ate, and -ive, respectively;
  3. Collapse a different set of multiple suffixes or remove a small set of single suffixes;
  4. Remove a set of suffixes including -ance, -ic, and -ive; and
  5. Remove final -e and change -ll to -l in some circumstances.

In the next posting, we’ll pick apart what needs to be done for step 1.


(Sorry this posting isn’t longer. I’m still taking a breath after the macro death march.)

Monday, July 28, 2008

Stemming, Part 12: Macros and Moving on

All right, let’s review what we’ve been doing for the last few days.

What Have We Done?

In a real sense, we’ve been writing a program that writes programs. A macro is a function that accepts Clojure code—represented as lists, vectors, symbols, strings, numbers, and other Clojure data types—and returns another list, vector, etc., that represents Clojure code. A macro can change its input any way it wants to, but of course the output needs to be valid Clojure code.

Creating Abstractions

All computer languages allow us to make abstractions. We can abstract the data “given-name,” “surname,” “age,” and “address” into a class (or structure in Clojure) called “Person.” We can abstract the action of printing “Greeting” and a person’s name into the function “hello-world.” All languages in common use let us do that. Object-oriented languages also give us other abstractions for associating actions with data.

But lisps, including Clojure, go a step beyond that. They allow us to create abstractions of the syntax of the language. This allows us to control when and how expressions get executed.

Does this allow us to do things we cannot do in other languages? Strictly speaking, no.

But it allows us to do things more concisely and readably than we otherwise could. But because we can abstract more things away and worry less about them, we can build and understand programs that are more complex than we otherwise could be able to comprehend.

Domain Specific Languages

Another benefit of macros is that they allow us to create a mini-language with Clojure. If you look at Clojure’s source code, most of it is written in Clojure. And you have that same power. Once written, your code isn’t really that different than the code that makes up Clojure’s core.

You can use that ability to extend Clojure, to build up from its foundation, to create your own language ideally suited to the work you need to do.

You can do this in other languages, of course. It’s called creating domain specific languages. But in lisp, it is natural in a way it is in no other language.

The Spiderman Clause

Of course, as Peter Parker was told, “With great power comes great responsibility.” Like operator overloading or multiple inheritance, macros make it easy for you to shoot yourself in the foot. If you’re not careful, you can create something that is so far from Clojure that others have trouble understanding it, and you will have trouble getting your bearings when you come back to it in six months.

Remember: a little macros go a long way.

In the next posting, we’ll finally start on the steps involved in stemming.

Friday, July 25, 2008

Stemming, Part 11: the cond-ends? Macro

In the last posting, we created the macro if-ends?, which was a combination of let, if and the function ends?. Today, we’ll look at another similar macro, one that combines let, cond, and ends?.

The Purpose of cond-ends?

The problem with if-ends? is that sometimes we’ll want to look for twenty or more different endings on a single stemmer and handle each case differently. We can do that with if-ends?, but it would be a lot cleaner if we had a construct like cond.

Input Patterns

For example, we might want to do something like this:

(cond-ends? st (make-stemmer "names")
  "ing" (do (println "ends with -ing") st)
  "ed" (do (println "ends with -ed") st)
  "s" (do (println "ends with -s") st)
  :else (do (println "no ending") st))

First, notice that this starts out like if-ends?. It has a variable (st) that the new stemmer will be assigned to, followed by a stemmer instance to test.

These are followed by a series of ending/expression pairs. The endings are tested one by one until a match is found. Once it is, the expression following that ending is executed with the variable st available to it and assigned to the new stemmer, with the index changed to reflect the ending. The return value of that expression is the value of the entire cond-ends? expression.

Finally, if no ending matches, the cond-ends? should return the original stemmer unchanged. Or, optionally, the last ending can be the keyword :else. If no ending before that point matched, the expression following :else is executed and its value returned.

Before we start on the expected output for this macro, let’s look at a short expression that uses the default value if no ending matches:

(cond-ends? st (make-stemmer "names")
  "ing" (do (println "ends with -ing") st))

Of course, that would be better expressed with if-ends?, but using it as an input pattern will help us build the default ending.

Output Patterns

Here’s what the macro should construct from the input above.

;(cond-ends? st (make-stemmer "names")
;  "ing" (do (println "ends with -ing") st)
;  "ed" (do (println "ends with -ed") st)
;  "s" (do (println "ends with -s") st)
;  :else (do (println "no ending") st))
;; Set up the environment
(let [stemmer# (make-stemmer "names"),
      word# (subword stemmer#)]
  ;; -ing
  (let [j# (- (count word#) 3)]
    (if (and (pos? j#) (= (subvec word# j#) [\i \n \g]))
      (let [st (assoc stemmer# :index (dec j#))]
        (do (println "ends with -ing") st))
      ;; -ed
      (let [j# (- (count word#) 2)]
        (if (and (pos? j#) (= (subvec word# j#) [\e \d]))
          (let [st (assoc stemmer# :index (dec j#))]
            (do (println "ends with -ed") st))
          ;; -s
          (let [j# (- (count word#) 1)]
            (if (and (pos? j#) (= (subvec word# j#) [\s]))
              (let [st (assoc stemmer# :index (dec j#))]
                (do (println "ends with -s") st))
              ;; :else
              (let [st stemmer#]
                (do (println "no ending") st)))))))))

That’s scary.

Actually, it’s not that bad. Notice that after the first let, it is a series of (let ... (if ...)) constructions. (I’ve added comments to separate each of these sections.) Each of those three let/if constructions are almost identical, until the :else expression. Also, notice that I’ve not only pre-compiled the vector itself, but also the length of the vector, and I’ve inserted them into the result example as their literal values. (Pre-computing the length of the vector was suggested by Holger Durer in the comments to the last posting.)

Before we start writing the macro, let’s look at the expected output for the second example above.

;(cond-ends? st (make-stemmer "names")
;  "ing" (do (println "ends with -ing") st))
;; Set up the environment
(let [stemmer# (make-stemmer "names"),
      word# (subword stemmer#)]
  ;; -ing
  (let [j# (- (count word#) 3)]
    (if (and (pos? j#) (= (subvec word# j#) [\i \n \g]))
      (let [st (assoc stemmer# :index (dec j#))]
        (do (println "ends with -ing") st))
      ;; default :else
      stemmer#)))

One of the things to notice about these output expressions is that a lot of the state for it can be computed once, and used throughout all the conditionals in the cond-ends? body. That observation will be useful later.

The Macro

Looking at the output above, we really have three tasks: set up the environment for the cond-ends? in the outer let; for each test generate a let/if construction; and finally generate an :else value, maybe using the default.

One way to tackle this problem would be to separate it into two macros:

  1. One macro sets up the environment and calls another macro to build the test expressions;

  2. The second macro builds one let/if pair, similar to what if-ends? did, for one test expression, then it calls itself on the rest of the test expressions. It does this until it reaches the last test expression and that test is :else or until it is out of test expressions.

Let’s start with the second macro. That will allow us to see what the first macro needs to provide to the test macro, and it has to be defined first anyway, because that’s the way Clojure likes things.

Generating the Tests

The test macro will be overloaded. Either it will take one ending/expression pair, or it will take many. If there is only one, the ending could be an :else, or it could be the last ending to test for, and the :else will be a default. So the first, shorter version of this will return one of these two constructions (taken from above):

  ;;; With :else
              ;; :else
              (let [st stemmer#]
                (do (println "no ending") st))

  ;;; With a test + the default :else
  ;; -ing
  (let [j# (- (count word#) 3)]
    (if (and (pos? j#) (= (subvec word# j#) [\i \n \g]))
      (let [st (assoc stemmer# :index (dec j#))]
        (do (println "ends with -ing") st))
      ;; default :else
      stemmer#))

Another point to make about the test expressions is that—as an astute commenter pointed out (thanks, Holger)—there is a lot of duplication between the two versions of the macro that builds the expressions. We can capture that duplication in a function and just call that function with the difference between the two structures. The function will then create the macro’s output structure.

All right. Let’s roll up our sleeves. The helper function will be called make-cond-ends-test. It take the variable symbol, the stemmer variable, an ending string, a true expression, and a false expression.

(defn make-cond-ends-test
  [var stemmer word end true-expr false-expr]
  (let [vend (vec end)]
    `(let [j# (- (count ~word) ~(count vend))]
       (if (and (pos? j#) (= (subvec ~word j#) ~vend))
         (let [~var (assoc ~stemmer :index (dec j#))]
           ~true-expr)
         ~false-expr))))

Notice that this function requires several parameters beyond the ending (end) and the result expressions (true-expr and false-expr)? It also has var, stemmer, and word. But the stemmer and the subword should have been stored in variables in the first, outer let by this point. That’s right. In this function, these are the variables that those values were stored in, represented here as symbols. Throughout this helper function, ~stemmer inserts the variable that the stemmer will have been assigned to in the outer let.

Now we can define the secondary macro. We’ll call it cond-ends-helper, because I’m not feeling creative. It’s main purpose is to examine the input and call make-cond-ends-test with the appropriate false expression. The first overload for it will handle the last expression by testing for :else and deciding whether to use the default :else:

(defmacro cond-ends-helper
  "This helps cond-ends? by processing the "
  test-exprs pairs in cond-ends?'s environment."
  ([var stemmer word end true-expr]
   (if (= end :else)
     `(let [~var ~stemmer]
        ~true-expr)
     (make-cond-ends-test var stemmer word end true-expr stemmer)))
  ;; The overload will go here ...
  )

What happens here? First, at compile time (before the quasi-quote or call to make-cond-ends-test), it tests whether the ending is :else. If it is, it just outputs the shorter version of the ending. If it isn’t :else, it calls the helper function, with all the same parameters plus the false expression as a data structure. In this case, the false expression is the stemmer variable.

If your head is starting to hurt, take a moment. Unless you have programmed in lisp before, this isn’t like any programming you’ve ever done. We’re writing a program to write programs. Like any meta-activity, it can make your brain explode. Unfortunately, this is a particularly convoluted macro, because it has these two levels of macro, plus a function. Don’t worry if you don’t quite get it at first. There’s nothing wrong with copying and pasting the macro’s code into your source file and moving on. Start building macros like debug and if-ends?, and when you’re comfortable there, come back and read this through again. It will make a lot more sense.

Now, if you’re still with me, here’s the overloaded version of the helper macro. This takes an ending/expression pair and a list of the rest of the pairs in the expression. It then calls make-cond-ends-test with a new version of

(defmacro cond-ends-helper
  "This helps cond-ends? by processing the
  test-exprs pairs in cond-ends?'s environment."
  ;; The earlier override goes here....
  ([var stemmer word end true-expr & more]
   (make-cond-ends-test
      var stemmer word end true-expr
      `(cond-ends-helper ~var ~stemmer ~word ~@more))))

In some ways, this part of cond-ends-helper is much simpler. There are just a few things to comment on.

  1. In the parameter list, & more collects the values of the rest of the parameters and stores them in a variable I’ve named more. This works with functions too.

  2. The false expression that this passes to make-cond-ends-test begins with a quasi-quote (back quote). This just generates an expression to pass as the false expression. This expression gets inserted into the expression that make-conds-ends-test creates.

  3. This quasi-quoted expression, when finally evaluated, is a macro call to the macro that created it. It starts the process all over again, passing the new macro expansion the values of var, stemmer, and word unchanged, and splicing all of the rest of the parameters that haven’t been handled yet into the expression using ~@more.

Creating the Environment

So we know that the cond-ends? sets up the environment by creating an expression like this:

;; Set up the environment
(let [stemmer# (make-stemmer "names"),
      word# (subword stemmer#)]
  ; ...
  )

And it replaces the ellipses by calling the macro cond-ends-helper with the variable and the variables stemmer# and word#. Put like that, it almost writes itself.

(defmacro cond-ends?
  "This is the same as a stacked series of if-ends?.
  This just sets up the environment for cond-ends-helper."
  [var stemmer & test-exprs]
  `(let [stemmer# ~stemmer, word# (subword stemmer#)]
     (cond-ends-helper ~var stemmer# word# ~@test-exprs)))

Notice that we use & test-exprs to collect all of the test expressions, and we use ~@test-exprs to splice them into the cond-ends-helper call.

Summary

For posterity’s sake, here is all the code for the cond-ends? macro, in one place:

(defn make-cond-ends-test
  [var stemmer word end true-expr false-expr]
  (let [vend (vec end)]
    `(let [j# (- (count ~word) ~(count vend))]
       (if (and (pos? j#) (= (subvec ~word j#) ~vend))
         (let [~var (assoc ~stemmer :index (dec j#))]
           ~true-expr)
         ~false-expr))))

(defmacro cond-ends-helper
  "This helps cond-ends? by processing the
  test-exprs pairs in cond-ends?'s environment."
  ([var stemmer word end true-expr]
   (if (= end :else)
     `(let [~var ~stemmer]
        ~true-expr)
     (make-cond-ends-test var stemmer word end
                          true-expr stemmer)))
  ([var stemmer word end true-expr & more]
   (make-cond-ends-test
      var stemmer word end true-expr
      `(cond-ends-helper ~var ~stemmer ~word ~@more))))

(defmacro cond-ends?
  "This is the same as a stacked series of if-ends?.
  This just sets up the environment for cond-ends-helper."
  [var stemmer & test-exprs]
  `(let [stemmer# ~stemmer, word# (subword stemmer#)]
     (cond-ends-helper ~var stemmer# word# ~@test-exprs)))

In the next posting, I’ll to wrap up macros by pointing out how they’re used and why.

Tuesday, July 22, 2008

Stemming, Part 10: The if-ends? Macro

For the last two postings, we’ve looked at macros. Today, we’ll finally put them to use for the Porter Stemmer. Remember that we’re looking to simplify the ends? function, so let’s start by reviewing it.

The ends? Function

The ends? function tests whether a stemmer ends with a given suffix. If it does, it moves the working :index to exclude that ending. If it does not, it leaves the :index where it was originally. Moveover, code that calls ends? needs to know whether that ending was found or not. If it was, it may want to take special action; if it wasn’t, it may want to test the stemmer against another ending. Currently, ends? returns a vector containing the resulting stemmer (which may be the original one) and a flag indicating whether the ending was found or not:

(defn ends?
  "true if the word ends with s."
  [stemmer s]
  (let [word (subword stemmer),
        sv (vec s),
        j (- (count word) (count sv))]
    (if (and (pos? j) (= (subvec word j) sv))
      [(assoc stemmer :index (dec j)) true]
      [stemmer false])))

A quick explanation:

  • (subword stemmer) is the part of word before the working index;
  • (vec s) turns the ending into a vector, if it isn’t already; and
  • (- (count word) (count sv)) is the index where the ending starts in the word. It will be the new value of the working index if the ending is found.

First, this tests whether the ending is shorter than the word and whether the word ends with the ending. If both are true, it returns a new stemmer with the working index (:index) set to the position just before the ending in the word.

Let’s see what how what calling it looks like.

porter=> (ends? (make-stemmer "walking") "ing")
[{:word [\w \a \l \k \i \n \g], :index 3} true]
porter=> (ends? (make-stemmer "walking") "s")
[{:word [\w \a \l \k \i \n \g], :index 6} false]

The Purpose of if-ends?

The reason why we want to use a macro in place of a function for ends? is because returning the extra value, the flag indicating whether the ending was actually found, is messy. Instead, we want something like an if statement. We want a conditional that tests whether a stemmer ends with a suffix, but instead of returning true or false, it executes a different expression, depending on the result of ends?.

Another wrinkle is that we’ll want to have access to the newly created stemmer in the true- or false-expressions. In this sense, if-ends? is like a let. We need to provide a variable that will be used in the bodies of the true- and false-expressions. The altered stemmer will be assigned to it.

Sample Inputs

What would calling this look like? Here are some examples:

(if-ends? x (make-stemmer "names") "s"
  (println x "YES: had a plural suffix")
  (println x "NO : never had a plural suffix"))

(if-ends? x (make-stemmer "walking") "ing"
  (println "with -ing:" x)
  (println "without -ing:" x))

(if-ends? x (make-stemmer "walked") "ed"
  x)

From these expressions, we can see that if-ends? will need to take five or six parameters:

  1. The variable that the resulting stemmer will be assigned to and that can be used in the body expressions of the if-ends?. In all of these expressions, that variable is x.

  2. The next parameter is the input stemmer. This could be an expression, as it is here, so we’ll need to evaluate this first and store it in a variable.

  3. Next is the ending.

  4. Next is the expression to evaluate if the stemmer does end with the ending. In it, the altered stemmer is assigned to the variable name given in the first parameter.

  5. Finally, the last parameter is the expression to evaluate if the stemmer does not end with the ending. Just as with if, this branch is optional. If it is not given, whatever expression the if-ends? creates should the original, unchanged stemmer.

Sample Outputs

Given these constraints, what would the output for each of these if-ends? look like?

;(if-ends? x (make-stemmer "names") "s"
;  (println x "YES: had a plural suffix")
;  (println x "NO : never had a plural suffix"))
(let [stemmer-var (make-stemmer "names"),
      end-var (vec "s"),
      word-var (subword stemmer-var),
      j-var (- (count word-var) (count end-var))]
  (if (and (pos? j-var) (= (subvec word-var j-var) end-var))
    (let [x (assoc stemmer-var :index (dec j-var))]
      (println x "YES: had a plural suffix"))
    (let [x stemmer-var]
      (println x "NO : never had a plural suffix"))))

(Here I’ve used -var in place of the # ending.)

First, this assigns the input stemmer and ending to temporary variables. Second, it assigns word-var and j-var to values, just as ends? does. Finally, it performs the test. In either branch, it uses a let to assign stemmer—new or old—to the variable x. Finally, it inserts the body of the branch expression into each branch.

;(if-ends? x (make-stemmer "walking") "ing"
;  (println "with -ing:" x)
;  (println "without -ing:" x))
(let [stemmer-var (make-stemmer "walking"),
      end-var (vec "ing"),
      word-var (subword stemmer-var),
      j-var (- (count word-var) (count end-var))]
  (if (and (pos? j-var) (= (subvec word-var j-var) end-var))
    (let [x (assoc stemmer-var :index (dec j-var))]
      (println "with -ing:" x))
    (let [x stemmer-var]
      (println "without -ing:" x))))

:::clj
;(if-ends? x (make-stemmer "walked") "ed"
;  x)
(let [stemmer-var (make-stemmer "walked"),
      end-var (vec "ed"),
      word-var (subword stemmer-var),
      j-var (- (count word-var) (count end-var))]
  (if (and (pos? j-var) (= (subvec word-var j-var) end-var))
    (let [x (assoc stemmer-var :index (dec j-var))]
      x)
    stemmer-var))

The last output shows what it should return when the optional false-expression is omitted.

Writing the Macro

Actually, let’s start writing the macro with the shorter version. You can overload a macro’s parameter list just as you do with functions. Just enclose each set of parameters and body in its own set of parentheses.

(defmacro if-ends?
  "Instead of the function ends?, I'm using this:
  (if-ends? x (make-stemmer \"names\") [\\s]
            (println x \"no longer has a plural suffix\")
            (println x \"never had a plural suffix\"))
  "
  ([var stemmer end true-expr]
   `(let [stemmer# ~stemmer,
          end# ~(vec ~end),
          word# (subword stemmer#),
          j# (- (count word#) (count end#))]
      (if (and (pos? j#) (= (subvec word# j#) end#))
        (let [~var (assoc stemmer# :index (dec j#))]
          ~true-expr)
        stemmer#)))
  ;; ... the full version of if-ends? goes here
  )

This is a fairly straightforward transformation. The -var variables are replaced with their # counterparts. Otherwise, the parameters are inserted into the macro result with tilde expansion (~expression).

The full version of if-ends?.

(defmacro if-ends?
  ;; ... the short version of if-ends? goes here
  ([var stemmer end true-expr false-expr]
   `(let [stemmer# ~stemmer,
          end# ~(vec ~end),
          word# (subword stemmer#),
          j# (- (count word#) (count end#))]
      (if (and (pos? j#) (= (subvec word# j#) end#))
        (let [~var (assoc stemmer# :index (dec j#))]
          ~true-expr)
        (let [~var stemmer#]
          ~false-expr)))))

Compile-Time Optimization

We’ve got all the pieces in place, but we’re still not quite finished. Let’s make one assumption about our input, an assumption that will hold as we use this macro in the stemmer code. This will allow us to make an optimization that also leverages one of the advantages of macros: they are expanded at compile-time.

The assumption is this: the ending will always be a string literal. That is, we will always start if-ends? like this:

(if-ends? v stemmer "ing" ...)

"ing" will get passed to the macro as a string literal, and we can convert it to a vector when the macro is translated at compile time. In the macros that I’ve outlined above, the ending is converted to a vector in the outer let. But we could also move it to a new let outside the macro expansion expression. The value of that pre-converted ending is used in the macro result.

With this optimization, the entirety of if-ends? looks like this:

(defmacro if-ends?
  "Instead of the function ends?, I'm using this:
  (if-ends? x (make-stemmer \"names\") [\\s]
            (println x \"no longer has a plural suffix\")
            (println x \"never had a plural suffix\"))
  "
  ([var stemmer end true-expr]
   (let [vend (vec end)]
     `(let [stemmer# ~stemmer,
            end# ~vend,
            word# (subword stemmer#),
            j# (- (count word#) (count end#))]
        (if (and (pos? j#) (= (subvec word# j#) end#))
          (let [~var (assoc stemmer# :index (dec j#))]
            ~true-expr)
          stemmer#))))
  ([var stemmer end true-expr false-expr]
   (let [vend (vec end)]
     `(let [stemmer# ~stemmer,
            end# ~vend,
            word# (subword stemmer#),
            j# (- (count word#) (count end#))]
        (if (and (pos? j#) (= (subvec word# j#) end#))
          (let [~var (assoc stemmer# :index (dec j#))]
            ~true-expr)
          (let [~var stemmer#]
            ~false-expr))))))

Here, in both versions of if-ends?, we first convert the ending to a vector and assign it to vend. Inside the macro expansion expression, vend is what gets assigned to end#. Clojure calculates vend when the macro is expanded at compile time. The time saved by the optimization won’t be much, but it does illustrate another benefit of macros.

This simplifies things a lot, but it doesn’t cover all the use cases for ends?. We will also want to use it in a nested set of twenty or more comparisons. That is less like an if and more like a cond expression. In the next posting, we’ll create cond-ends?.