Friday, July 25, 2008

Stemming, Part 11: the cond-ends? Macro

In the last posting, we created the macro if-ends?, which was a combination of let, if and the function ends?. Today, we’ll look at another similar macro, one that combines let, cond, and ends?.

The Purpose of cond-ends?

The problem with if-ends? is that sometimes we’ll want to look for twenty or more different endings on a single stemmer and handle each case differently. We can do that with if-ends?, but it would be a lot cleaner if we had a construct like cond.

Input Patterns

For example, we might want to do something like this:

(cond-ends? st (make-stemmer "names")
  "ing" (do (println "ends with -ing") st)
  "ed" (do (println "ends with -ed") st)
  "s" (do (println "ends with -s") st)
  :else (do (println "no ending") st))

First, notice that this starts out like if-ends?. It has a variable (st) that the new stemmer will be assigned to, followed by a stemmer instance to test.

These are followed by a series of ending/expression pairs. The endings are tested one by one until a match is found. Once it is, the expression following that ending is executed with the variable st available to it and assigned to the new stemmer, with the index changed to reflect the ending. The return value of that expression is the value of the entire cond-ends? expression.

Finally, if no ending matches, the cond-ends? should return the original stemmer unchanged. Or, optionally, the last ending can be the keyword :else. If no ending before that point matched, the expression following :else is executed and its value returned.

Before we start on the expected output for this macro, let’s look at a short expression that uses the default value if no ending matches:

(cond-ends? st (make-stemmer "names")
  "ing" (do (println "ends with -ing") st))

Of course, that would be better expressed with if-ends?, but using it as an input pattern will help us build the default ending.

Output Patterns

Here’s what the macro should construct from the input above.

;(cond-ends? st (make-stemmer "names")
;  "ing" (do (println "ends with -ing") st)
;  "ed" (do (println "ends with -ed") st)
;  "s" (do (println "ends with -s") st)
;  :else (do (println "no ending") st))
;; Set up the environment
(let [stemmer# (make-stemmer "names"),
      word# (subword stemmer#)]
  ;; -ing
  (let [j# (- (count word#) 3)]
    (if (and (pos? j#) (= (subvec word# j#) [\i \n \g]))
      (let [st (assoc stemmer# :index (dec j#))]
        (do (println "ends with -ing") st))
      ;; -ed
      (let [j# (- (count word#) 2)]
        (if (and (pos? j#) (= (subvec word# j#) [\e \d]))
          (let [st (assoc stemmer# :index (dec j#))]
            (do (println "ends with -ed") st))
          ;; -s
          (let [j# (- (count word#) 1)]
            (if (and (pos? j#) (= (subvec word# j#) [\s]))
              (let [st (assoc stemmer# :index (dec j#))]
                (do (println "ends with -s") st))
              ;; :else
              (let [st stemmer#]
                (do (println "no ending") st)))))))))

That’s scary.

Actually, it’s not that bad. Notice that after the first let, it is a series of (let ... (if ...)) constructions. (I’ve added comments to separate each of these sections.) Each of those three let/if constructions are almost identical, until the :else expression. Also, notice that I’ve not only pre-compiled the vector itself, but also the length of the vector, and I’ve inserted them into the result example as their literal values. (Pre-computing the length of the vector was suggested by Holger Durer in the comments to the last posting.)

Before we start writing the macro, let’s look at the expected output for the second example above.

;(cond-ends? st (make-stemmer "names")
;  "ing" (do (println "ends with -ing") st))
;; Set up the environment
(let [stemmer# (make-stemmer "names"),
      word# (subword stemmer#)]
  ;; -ing
  (let [j# (- (count word#) 3)]
    (if (and (pos? j#) (= (subvec word# j#) [\i \n \g]))
      (let [st (assoc stemmer# :index (dec j#))]
        (do (println "ends with -ing") st))
      ;; default :else
      stemmer#)))

One of the things to notice about these output expressions is that a lot of the state for it can be computed once, and used throughout all the conditionals in the cond-ends? body. That observation will be useful later.

The Macro

Looking at the output above, we really have three tasks: set up the environment for the cond-ends? in the outer let; for each test generate a let/if construction; and finally generate an :else value, maybe using the default.

One way to tackle this problem would be to separate it into two macros:

  1. One macro sets up the environment and calls another macro to build the test expressions;

  2. The second macro builds one let/if pair, similar to what if-ends? did, for one test expression, then it calls itself on the rest of the test expressions. It does this until it reaches the last test expression and that test is :else or until it is out of test expressions.

Let’s start with the second macro. That will allow us to see what the first macro needs to provide to the test macro, and it has to be defined first anyway, because that’s the way Clojure likes things.

Generating the Tests

The test macro will be overloaded. Either it will take one ending/expression pair, or it will take many. If there is only one, the ending could be an :else, or it could be the last ending to test for, and the :else will be a default. So the first, shorter version of this will return one of these two constructions (taken from above):

  ;;; With :else
              ;; :else
              (let [st stemmer#]
                (do (println "no ending") st))

  ;;; With a test + the default :else
  ;; -ing
  (let [j# (- (count word#) 3)]
    (if (and (pos? j#) (= (subvec word# j#) [\i \n \g]))
      (let [st (assoc stemmer# :index (dec j#))]
        (do (println "ends with -ing") st))
      ;; default :else
      stemmer#))

Another point to make about the test expressions is that—as an astute commenter pointed out (thanks, Holger)—there is a lot of duplication between the two versions of the macro that builds the expressions. We can capture that duplication in a function and just call that function with the difference between the two structures. The function will then create the macro’s output structure.

All right. Let’s roll up our sleeves. The helper function will be called make-cond-ends-test. It take the variable symbol, the stemmer variable, an ending string, a true expression, and a false expression.

(defn make-cond-ends-test
  [var stemmer word end true-expr false-expr]
  (let [vend (vec end)]
    `(let [j# (- (count ~word) ~(count vend))]
       (if (and (pos? j#) (= (subvec ~word j#) ~vend))
         (let [~var (assoc ~stemmer :index (dec j#))]
           ~true-expr)
         ~false-expr))))

Notice that this function requires several parameters beyond the ending (end) and the result expressions (true-expr and false-expr)? It also has var, stemmer, and word. But the stemmer and the subword should have been stored in variables in the first, outer let by this point. That’s right. In this function, these are the variables that those values were stored in, represented here as symbols. Throughout this helper function, ~stemmer inserts the variable that the stemmer will have been assigned to in the outer let.

Now we can define the secondary macro. We’ll call it cond-ends-helper, because I’m not feeling creative. It’s main purpose is to examine the input and call make-cond-ends-test with the appropriate false expression. The first overload for it will handle the last expression by testing for :else and deciding whether to use the default :else:

(defmacro cond-ends-helper
  "This helps cond-ends? by processing the "
  test-exprs pairs in cond-ends?'s environment."
  ([var stemmer word end true-expr]
   (if (= end :else)
     `(let [~var ~stemmer]
        ~true-expr)
     (make-cond-ends-test var stemmer word end true-expr stemmer)))
  ;; The overload will go here ...
  )

What happens here? First, at compile time (before the quasi-quote or call to make-cond-ends-test), it tests whether the ending is :else. If it is, it just outputs the shorter version of the ending. If it isn’t :else, it calls the helper function, with all the same parameters plus the false expression as a data structure. In this case, the false expression is the stemmer variable.

If your head is starting to hurt, take a moment. Unless you have programmed in lisp before, this isn’t like any programming you’ve ever done. We’re writing a program to write programs. Like any meta-activity, it can make your brain explode. Unfortunately, this is a particularly convoluted macro, because it has these two levels of macro, plus a function. Don’t worry if you don’t quite get it at first. There’s nothing wrong with copying and pasting the macro’s code into your source file and moving on. Start building macros like debug and if-ends?, and when you’re comfortable there, come back and read this through again. It will make a lot more sense.

Now, if you’re still with me, here’s the overloaded version of the helper macro. This takes an ending/expression pair and a list of the rest of the pairs in the expression. It then calls make-cond-ends-test with a new version of

(defmacro cond-ends-helper
  "This helps cond-ends? by processing the
  test-exprs pairs in cond-ends?'s environment."
  ;; The earlier override goes here....
  ([var stemmer word end true-expr & more]
   (make-cond-ends-test
      var stemmer word end true-expr
      `(cond-ends-helper ~var ~stemmer ~word ~@more))))

In some ways, this part of cond-ends-helper is much simpler. There are just a few things to comment on.

  1. In the parameter list, & more collects the values of the rest of the parameters and stores them in a variable I’ve named more. This works with functions too.

  2. The false expression that this passes to make-cond-ends-test begins with a quasi-quote (back quote). This just generates an expression to pass as the false expression. This expression gets inserted into the expression that make-conds-ends-test creates.

  3. This quasi-quoted expression, when finally evaluated, is a macro call to the macro that created it. It starts the process all over again, passing the new macro expansion the values of var, stemmer, and word unchanged, and splicing all of the rest of the parameters that haven’t been handled yet into the expression using ~@more.

Creating the Environment

So we know that the cond-ends? sets up the environment by creating an expression like this:

;; Set up the environment
(let [stemmer# (make-stemmer "names"),
      word# (subword stemmer#)]
  ; ...
  )

And it replaces the ellipses by calling the macro cond-ends-helper with the variable and the variables stemmer# and word#. Put like that, it almost writes itself.

(defmacro cond-ends?
  "This is the same as a stacked series of if-ends?.
  This just sets up the environment for cond-ends-helper."
  [var stemmer & test-exprs]
  `(let [stemmer# ~stemmer, word# (subword stemmer#)]
     (cond-ends-helper ~var stemmer# word# ~@test-exprs)))

Notice that we use & test-exprs to collect all of the test expressions, and we use ~@test-exprs to splice them into the cond-ends-helper call.

Summary

For posterity’s sake, here is all the code for the cond-ends? macro, in one place:

(defn make-cond-ends-test
  [var stemmer word end true-expr false-expr]
  (let [vend (vec end)]
    `(let [j# (- (count ~word) ~(count vend))]
       (if (and (pos? j#) (= (subvec ~word j#) ~vend))
         (let [~var (assoc ~stemmer :index (dec j#))]
           ~true-expr)
         ~false-expr))))

(defmacro cond-ends-helper
  "This helps cond-ends? by processing the
  test-exprs pairs in cond-ends?'s environment."
  ([var stemmer word end true-expr]
   (if (= end :else)
     `(let [~var ~stemmer]
        ~true-expr)
     (make-cond-ends-test var stemmer word end
                          true-expr stemmer)))
  ([var stemmer word end true-expr & more]
   (make-cond-ends-test
      var stemmer word end true-expr
      `(cond-ends-helper ~var ~stemmer ~word ~@more))))

(defmacro cond-ends?
  "This is the same as a stacked series of if-ends?.
  This just sets up the environment for cond-ends-helper."
  [var stemmer & test-exprs]
  `(let [stemmer# ~stemmer, word# (subword stemmer#)]
     (cond-ends-helper ~var stemmer# word# ~@test-exprs)))

In the next posting, I’ll to wrap up macros by pointing out how they’re used and why.

4 comments:

Anonymous said...

Hi Eric,

First of all, the obligatory, but extremely well-deserved, thanks for this series. It's *really* interesting! I've just discovered it and read all the posts up to here in the last couple of days.

Not being steeped in macros, perhaps I oughtn't comment, but it seems to me that:

1. if-ends? should make use of the ends? function rather than include the logic directly in the macro

2. cond-ends? should use cond (and ends?) rather than tackling the harder task of building nested lets and ifs.

The reason I suggest this is to make it easier for beginners to understand. If there are performance reasons to do as much work as you've done, that's probably not enough justification in a pedagogical example. If there are "correctness" reasons for it, then maybe they should be discussed.

Thanks again. Looking forward to reading the rest, and any other series you feel like producing :)

Gavin Sinclair

Eric Rochester said...

Hi Gavin,

Glad you're getting something out of it. Unfortunately, it's far from finished at this point, although it does have a lot of useful information. (I also need to update the code to reflect the latest changes in Clojure, but that's another story.)

Now for your questions. (I've answered them together, since they're really the same question.)

I'm not sure that using ends? would make cond-ends? that much simpler, since ends? isn't really a straight predicate function. You'd still have to test at every stage. You might be able to wrap ends? in another function and use "or" to short-cut the tests. That would work, although it might also be a little too clever. You'd also still need to test for the ":else" condition.

You're right, though, the primary reason why I approached things this way was for performance. It does also illustrate one of the advantages of macros, though: they can "insert" expressions into the code at compile time, so that whenever it's executed, it's just a constant. For example, in "cond-ends?", the length of each of the endings is pre-computed ("~(count vend)") and inserted into the resulting expression as a constant integer. Similarly, the ending vector itself is a constant, although I'd have to think about whether that gains anything or not.

Pedagogically, I really should have stuck with a simpler stemming algorithm, as someone suggested (and you commented on) on another posting. That's one of the joys of blogs or any other serial publishing format: by the time realize what a thicket you're in, you may be too far along to turn back.

But you're probably right, using a simpler macro here would have made the explanation much simpler, too.

Glad you're enjoying it, though.
Eric

GS said...

Thanks for your detailed response. One unfortunate thing I find about Clojure is that it's often hard for me to understand unless I'm writing it (and thereby experimenting at the REPL). Accordingly, I don't fully follow what you've written because it's been some time (a week or so) since I read and grokked the code involved. The solution to this is for me to type in all the code and try to implement if-ends? and cond-ends? differently myself. I'll let you know how I go (won't be very soon though).

Cheers,
Gavin

Eric Rochester said...

LOL. I know exactly what you mean. It's been months since I last looked at this code, and it took me a while to remember what i was thinking at the time. I'm still not sure that I did.

And I started to implement the suggestions I made in my last post, but I didn't really have time right then.

Life does sometimes get in the way.

Eric