Tuesday, July 22, 2008

Stemming, Part 10: The if-ends? Macro

For the last two postings, we’ve looked at macros. Today, we’ll finally put them to use for the Porter Stemmer. Remember that we’re looking to simplify the ends? function, so let’s start by reviewing it.

The ends? Function

The ends? function tests whether a stemmer ends with a given suffix. If it does, it moves the working :index to exclude that ending. If it does not, it leaves the :index where it was originally. Moveover, code that calls ends? needs to know whether that ending was found or not. If it was, it may want to take special action; if it wasn’t, it may want to test the stemmer against another ending. Currently, ends? returns a vector containing the resulting stemmer (which may be the original one) and a flag indicating whether the ending was found or not:

(defn ends?
  "true if the word ends with s."
  [stemmer s]
  (let [word (subword stemmer),
        sv (vec s),
        j (- (count word) (count sv))]
    (if (and (pos? j) (= (subvec word j) sv))
      [(assoc stemmer :index (dec j)) true]
      [stemmer false])))

A quick explanation:

  • (subword stemmer) is the part of word before the working index;
  • (vec s) turns the ending into a vector, if it isn’t already; and
  • (- (count word) (count sv)) is the index where the ending starts in the word. It will be the new value of the working index if the ending is found.

First, this tests whether the ending is shorter than the word and whether the word ends with the ending. If both are true, it returns a new stemmer with the working index (:index) set to the position just before the ending in the word.

Let’s see what how what calling it looks like.

porter=> (ends? (make-stemmer "walking") "ing")
[{:word [\w \a \l \k \i \n \g], :index 3} true]
porter=> (ends? (make-stemmer "walking") "s")
[{:word [\w \a \l \k \i \n \g], :index 6} false]

The Purpose of if-ends?

The reason why we want to use a macro in place of a function for ends? is because returning the extra value, the flag indicating whether the ending was actually found, is messy. Instead, we want something like an if statement. We want a conditional that tests whether a stemmer ends with a suffix, but instead of returning true or false, it executes a different expression, depending on the result of ends?.

Another wrinkle is that we’ll want to have access to the newly created stemmer in the true- or false-expressions. In this sense, if-ends? is like a let. We need to provide a variable that will be used in the bodies of the true- and false-expressions. The altered stemmer will be assigned to it.

Sample Inputs

What would calling this look like? Here are some examples:

(if-ends? x (make-stemmer "names") "s"
  (println x "YES: had a plural suffix")
  (println x "NO : never had a plural suffix"))

(if-ends? x (make-stemmer "walking") "ing"
  (println "with -ing:" x)
  (println "without -ing:" x))

(if-ends? x (make-stemmer "walked") "ed"
  x)

From these expressions, we can see that if-ends? will need to take five or six parameters:

  1. The variable that the resulting stemmer will be assigned to and that can be used in the body expressions of the if-ends?. In all of these expressions, that variable is x.

  2. The next parameter is the input stemmer. This could be an expression, as it is here, so we’ll need to evaluate this first and store it in a variable.

  3. Next is the ending.

  4. Next is the expression to evaluate if the stemmer does end with the ending. In it, the altered stemmer is assigned to the variable name given in the first parameter.

  5. Finally, the last parameter is the expression to evaluate if the stemmer does not end with the ending. Just as with if, this branch is optional. If it is not given, whatever expression the if-ends? creates should the original, unchanged stemmer.

Sample Outputs

Given these constraints, what would the output for each of these if-ends? look like?

;(if-ends? x (make-stemmer "names") "s"
;  (println x "YES: had a plural suffix")
;  (println x "NO : never had a plural suffix"))
(let [stemmer-var (make-stemmer "names"),
      end-var (vec "s"),
      word-var (subword stemmer-var),
      j-var (- (count word-var) (count end-var))]
  (if (and (pos? j-var) (= (subvec word-var j-var) end-var))
    (let [x (assoc stemmer-var :index (dec j-var))]
      (println x "YES: had a plural suffix"))
    (let [x stemmer-var]
      (println x "NO : never had a plural suffix"))))

(Here I’ve used -var in place of the # ending.)

First, this assigns the input stemmer and ending to temporary variables. Second, it assigns word-var and j-var to values, just as ends? does. Finally, it performs the test. In either branch, it uses a let to assign stemmer—new or old—to the variable x. Finally, it inserts the body of the branch expression into each branch.

;(if-ends? x (make-stemmer "walking") "ing"
;  (println "with -ing:" x)
;  (println "without -ing:" x))
(let [stemmer-var (make-stemmer "walking"),
      end-var (vec "ing"),
      word-var (subword stemmer-var),
      j-var (- (count word-var) (count end-var))]
  (if (and (pos? j-var) (= (subvec word-var j-var) end-var))
    (let [x (assoc stemmer-var :index (dec j-var))]
      (println "with -ing:" x))
    (let [x stemmer-var]
      (println "without -ing:" x))))

:::clj
;(if-ends? x (make-stemmer "walked") "ed"
;  x)
(let [stemmer-var (make-stemmer "walked"),
      end-var (vec "ed"),
      word-var (subword stemmer-var),
      j-var (- (count word-var) (count end-var))]
  (if (and (pos? j-var) (= (subvec word-var j-var) end-var))
    (let [x (assoc stemmer-var :index (dec j-var))]
      x)
    stemmer-var))

The last output shows what it should return when the optional false-expression is omitted.

Writing the Macro

Actually, let’s start writing the macro with the shorter version. You can overload a macro’s parameter list just as you do with functions. Just enclose each set of parameters and body in its own set of parentheses.

(defmacro if-ends?
  "Instead of the function ends?, I'm using this:
  (if-ends? x (make-stemmer \"names\") [\\s]
            (println x \"no longer has a plural suffix\")
            (println x \"never had a plural suffix\"))
  "
  ([var stemmer end true-expr]
   `(let [stemmer# ~stemmer,
          end# ~(vec ~end),
          word# (subword stemmer#),
          j# (- (count word#) (count end#))]
      (if (and (pos? j#) (= (subvec word# j#) end#))
        (let [~var (assoc stemmer# :index (dec j#))]
          ~true-expr)
        stemmer#)))
  ;; ... the full version of if-ends? goes here
  )

This is a fairly straightforward transformation. The -var variables are replaced with their # counterparts. Otherwise, the parameters are inserted into the macro result with tilde expansion (~expression).

The full version of if-ends?.

(defmacro if-ends?
  ;; ... the short version of if-ends? goes here
  ([var stemmer end true-expr false-expr]
   `(let [stemmer# ~stemmer,
          end# ~(vec ~end),
          word# (subword stemmer#),
          j# (- (count word#) (count end#))]
      (if (and (pos? j#) (= (subvec word# j#) end#))
        (let [~var (assoc stemmer# :index (dec j#))]
          ~true-expr)
        (let [~var stemmer#]
          ~false-expr)))))

Compile-Time Optimization

We’ve got all the pieces in place, but we’re still not quite finished. Let’s make one assumption about our input, an assumption that will hold as we use this macro in the stemmer code. This will allow us to make an optimization that also leverages one of the advantages of macros: they are expanded at compile-time.

The assumption is this: the ending will always be a string literal. That is, we will always start if-ends? like this:

(if-ends? v stemmer "ing" ...)

"ing" will get passed to the macro as a string literal, and we can convert it to a vector when the macro is translated at compile time. In the macros that I’ve outlined above, the ending is converted to a vector in the outer let. But we could also move it to a new let outside the macro expansion expression. The value of that pre-converted ending is used in the macro result.

With this optimization, the entirety of if-ends? looks like this:

(defmacro if-ends?
  "Instead of the function ends?, I'm using this:
  (if-ends? x (make-stemmer \"names\") [\\s]
            (println x \"no longer has a plural suffix\")
            (println x \"never had a plural suffix\"))
  "
  ([var stemmer end true-expr]
   (let [vend (vec end)]
     `(let [stemmer# ~stemmer,
            end# ~vend,
            word# (subword stemmer#),
            j# (- (count word#) (count end#))]
        (if (and (pos? j#) (= (subvec word# j#) end#))
          (let [~var (assoc stemmer# :index (dec j#))]
            ~true-expr)
          stemmer#))))
  ([var stemmer end true-expr false-expr]
   (let [vend (vec end)]
     `(let [stemmer# ~stemmer,
            end# ~vend,
            word# (subword stemmer#),
            j# (- (count word#) (count end#))]
        (if (and (pos? j#) (= (subvec word# j#) end#))
          (let [~var (assoc stemmer# :index (dec j#))]
            ~true-expr)
          (let [~var stemmer#]
            ~false-expr))))))

Here, in both versions of if-ends?, we first convert the ending to a vector and assign it to vend. Inside the macro expansion expression, vend is what gets assigned to end#. Clojure calculates vend when the macro is expanded at compile time. The time saved by the optimization won’t be much, but it does illustrate another benefit of macros.

This simplifies things a lot, but it doesn’t cover all the use cases for ends?. We will also want to use it in a nested set of twenty or more comparisons. That is less like an if and more like a cond expression. In the next posting, we’ll create cond-ends?.

4 comments:

Holger Durer said...

Cool blog post series, thanks!

A few questions about that if-ends? macros:

Firstly: In the final, optimised version, why not pre-calculate the length of the end vector as well?

And then the question why not go further with the macro calculation:
- The four and five parameter forms are almost identical, why not factor out the common part and pass the only differing part (i.e. the false branch) to that common code?
- The optimised and unoptimised versions are also almost identical. Why not detect if the end parameter is really constant and do the optimisation depending on that?
Would this have been too complicated for the intended audience?

Eric Rochester said...

Hi Holger,

Good questions/points, all.

In all of these cases, I was trying to keep it simple for the intended audience. But there are a couple of other reasons too:

- In fact, I kept asking myself if I should pre-calculate the length of the end vector, and the audience was the main reason not to. But also, because getting the length of a vector is a constant-time operation, it didn't seem like a great priority. As it was, I almost didn't include the optimization at all.

- Factoring out the common parts would have introduced a new level of abstraction (macros calling functions to create code structures), and macros are abstract enough for those being introduced to it for the first time. But that's a good idea for a future improvement.

- And detecting the type of the end didn't really occur to me, because none of my use cases called for it :). In all cases I'm using a constant value for the ending to test. That's another one to add to my todo list.

Trying to keep this simpler for the audience has been a major problem. To be honest, I really didn't intend to cover macros yet, but when I sat down to code up the stemmer, I found that really needed them. So I jumped in. If I'd known then what I know now, I may have put off including the stemmer or dropped it entirely (maybe).

These are some really good ideas, though. Hmm. Maybe I need to do another posting covering them in more detail.

Thanks,
Eric

Holger Durer said...

I suspected it was to do with the audience. I have zero experience trying to teach programming and have always failed in trying to teach people anything so I don't try anymore. :-)

Still I find that eventually people should be shown that macros are more than text patterns with maybe a little code around it. The beauty of the Lisp languages is that macros are code to be executed at compile time but otherwise it is the same code as you use for runtime. In fact, you use the same code for compile and runtime -- and that is what people should eventually grasp to see the full power.

Eric Rochester said...

Well, I don't know how well I'm doing at teaching programming, either. We'll see.

What I was trying to do is to introduce macros gradually, at first as simply as possible, and only gradually add more abstractions and complexity. As it is, I felt things were already moving too fast for one posting.

The next posting, on cond-ends?, bites off a bit more. It's still probably too much, but it's a little better since I've already covered if-ends?.

I agree with you 100% on the power and beauty of macros, though. I should have the opportunity to keep returning to macros, so hopefully that will come across.