Thursday, July 31, 2008

Stemming, Part 14: Verb Endings

In today’s posting, we’ll take a look at the initial step in the Porter stemmer.

An Overview

The first step in stemming tokens involves removing plural -s and verb endings -ed and -ing. It also turns -y to -i, so it will be recognized as a suffix in later steps. The documentation in the C source code lists some examples:

caresses -> caress ponies -> poni ties -> ti caress -> caress cats -> cat feed -> feed agreed -> agree disabled -> disable matting -> mat mating -> mate meeting -> meet milling -> mill messing -> mess meetings -> meet

Plurals

The first step is to strip off plurals. If the word ends in a -sses or -ies, this removes the -es; otherwise, if it ends in -s, but not -ss, it takes off the -s. If none of those apply, it returns the original stemmer.

(defn stem-plural
  "This is part of step 1ab. It removes plurals (-s) from a stem."
  [stemmer]
  (if (= (last (:word stemmer)) \s)
    (cond-ends? st stemmer
      "sses" (reset-index (pop (pop (:word st))))
      "ies" (set-to st "i")
      :else (if (and (>= (count (:word st)) 2)
                     (not= (nth (:word st) (- (count (:word st)) 2))
                           \s))
              (assoc st :word (pop (:word st)))
              st))
    stemmer))

It’s been a while since we’ve seen some of these functions. Here are a few reminders about what they do:

reset-index returns a new stemmer from a word vector (here with the last two letters removed from the word).

set-to removes the -ies from the end of the word and replaces it with an -i.

cond-ends? is the macro we created in the last few postings. I just wanted to point that out.

Verb Endings

Another part of step 1 involves removing the verb suffixes -ed and -ing from the word. It first looks for -eed if it is long enough (that is, if it has more than one internal sequence of consonants), it removes the -d; if it ends in -ed, it removes that; and if it ends in -ing, it removes that. In either of the last two cases, it also expands truncated suffixes.

Expanding suffixes just tests for certain endings and, if they are found, appends an -e to the word. Specifically, it looks for -at, -bl, and -iz. It also checks for a double-consonant ending. Some are all right in English (-ll, -ss, and -zz), but most others should be removed. Finally, it removes the final consonant in words that end in CVC (with exceptions). This process is handled by the stem-expand-suffix function, which is listed first.

(defn stem-expand-suffix
  "This is part of step 1ab. It expands -at, -bl,
  and -iz by adding an -e in certain circumstances."
  [stemmer]
  (cond-ends? st stemmer
    "at" (set-to st "ate")
    "bl" (set-to st "ble")
    "iz" (set-to st "ize")
    :else (cond
            (double-c? st (dec (count (:word st))))
              (if (#{\l \s \z} (last (:word st)))
                st
                (assoc st :word (pop (:word st))))
            (and (= (m st) 1) (cvc? st (dec (count (:word st)))))
              (set-to st "e")
            :else
              st)))

(defn stem-verb-ending
  "This is part of step 1ab. It removes verb endings -ed
  and -ing."
  [stemmer]
  (cond-ends? st stemmer
    "eed" (if (pos? (m st))
            (assoc st :word (pop (:word st)))
            stemmer)
    "ed"  (if (vowel-in-stem? st)
            (stem-expand-suffix (assoc st :word (subword st)))
            stemmer)
    "ing" (if (vowel-in-stem? st)
            (stem-expand-suffix (assoc st :word (subword st)))
            stemmer)))

double-c? returns true if the word ends in a double consonant. For example, -ll or -ss or something.

cvc? returns true if the word ends in a consonant-vowel-consonant sequence and if the final consonant is not w, x, or y.

m returns the number of consonant sequences between the start of a word and the index position. But if the word starts with a consonant sequence, it isn’t counted.

Step 1AB

The actual function for step 1AB (A calls stem-plural and B calls stem-verb-ending) is simple. It just passes its input through the two functions:

(defn step-1ab
  [stemmer]
  (stem-verb-ending (stem-plural stemmer)))

One thing about functional languages is that they often have to be read from right to left. The stemmer gets passed to stem-plural first and stem-verb-ending second. The way it is written, however, is counter-intuitive, and it makes functional languages harder to read.

Clojure provides an improvement to this. It uses the -> macro to build expressions like above. Let’s spend some time understanding what this macro does, first by using it and then by looking at the expressions it outputs.

The first parameter to -> is an expression. The remaining parameters are functions. The expression parameter gets passed as the first parameter to the first function, and the output of this gets passed as the first parameter to the second function parameter. This continues until all the functions have been chained together.

To play with this, let’s define some functions that operate on strings.

porter=> (defn to-lower-case [string] (.toLowerCase string))
#'porter/to-lower-case
porter=> (defn trim [string] (.trim string))
#'porter/trim
porter=> (trim (to-lower-case "   ThIs NeEdS ClEaNiNg  "))
"this needs cleaning"

If we make that last call with ->, it looks like this:

porter=> (-> "   ThIs NeEdS ClEaNiNg   " to-lower-case trim)
"this needs cleaning"

This is really handy if there aren’t any other arguments. But what if you want to use a function that needs more than one argument. As long as the first argument is the expression parameter or the result of the previous function in the sequence, it’s no problem. Just enclose the function and the remaining parameters to that function in parentheses. For example, this defines a wrapper for String.substring that returns everything from the second parameter on.

porter=> (defn substring [string index] (.substring string index))
#'porter/substring
porter=> (-> "   ThIs NeEdS ClEaNiNg   " to-lower-case trim (substring 11))
"cleaning"

We can see what this does using macroexpand-1:

porter=> (macroexpand-1 '(-> "   ThIs NeEdS ClEaNiNg   " to-lower-case trim (substring 11)))
(clojure/-> (clojure/-> "   ThIs NeEdS ClEaNiNg   " to-lower-case) trim (substring 11))

Umm. It changed, but not by much. Let’s just take the inner, second -> and try expanding it:

porter=> (macroexpand-1 '(-> "   ThIs NeEdS ClEaNiNg   " to-lower-case))
(to-lower-case "   ThIs NeEdS ClEaNiNg   ")

That seems like a normal function call. If we keep breaking it down, we get this:

(substring (trim (to-lower-case "   ThIs NeEdS ClEaNiNg   ")) 11)

Which is what we want and expect.

Now, we can rewrite step-1ab to use this macro and be much more readable.

(defn step-1ab
  "step-1ab gets rid of plurals and -ed or -ing. E.g.,
    caresses -> caress
    ponies -> poni
    ties -> ti
    caress -> caress
    cats -> cat
    feed -> feed
    agreed -> agree
    disabled -> disable
    matting -> mat
    mating -> mate
    meeting -> meet
    milling -> mill
    messing -> mess
    meetings -> meet
  "
  [stemmer]
  (-> stemmer stem-plural stem-verb-ending))

Step 1C

The rest of step one just tests to see if the word ends in -y. If it does, and if there is a vowel in the stem, it removes the -y and adds a -i to the word.

(defn step-1c
  "Turns terminal y to i when there is another vowel in the stem."
  [stemmer]
  (if-ends? st stemmer "y"
            (if (vowel-in-stem? st)
              (reset-index (conj (pop (:word st)) \i))
              stemmer)))

That’s pretty straightforward. It uses if-ends? to test for the -y, and vowel-in-stem? looks for a vowel before the -y. If either of these is not the case, the original stemmer is returned.

Those two functions comprise step 1. In the next posting, we’ll look at the next several steps.

2 comments:

Anonymous said...

(assoc st :word (pop (:word st))))

Code like that keeps popping up, and it seems low-level in the context of nice macros like cond-ends?. Might it be a worthwhile motivation for a new function, used as follows?

(pop-word st) ; pop one char
(pop-word st 2) ; pop two chars

Gavin Sinclair

Eric Rochester said...

Hi Gavin,

Good call. The version of this with one parameter is already there (http://code.google.com/p/word-clj/source/browse/trunk/porter.clj#43).

The second would be helpful too, though.

Thanks,
Eric