Tuesday, August 5, 2008

Stemming, Part 16: More Suffixes

Today, we’ll look at step four of the Porter stemmer. This step is a little different from the steps before and after it, so we’ll focus on it alone today.

Utilities

Before we outline the step itself, however, we need to define another utility function. This tests whether the stem (the part of the stemmer’s word before the ending) has more than one internal consonant cluster, that is, whether its measure m is greater than one. If it does, the function strips off the ending; otherwise, it returns the original stemmer unchanged.

(defn chop
  "If there is more than one internal
  consonant cluster in the stem, this chops
  the ending (as identified by the index)."
  [stemmer]
  (if (> (m stemmer) 1)
    (assoc stemmer :word (subword stemmer))
    stemmer))
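
To make the test concrete, here is a rough, self-contained sketch of the same idea on plain strings instead of the stemmer structure. The names consonant?, measure, and chop-suffix are hypothetical stand-ins made up for this illustration; they are not the m and subword functions from the earlier postings, and the sketch assumes Clojure 1.8 or later for clojure.string/ends-with?.

```clojure
(require '[clojure.string :as str])

(defn consonant?
  "Hypothetical helper: true if the character at position i of s is a
  consonant in Porter's sense. The letters a, e, i, o, u are vowels, and
  y is a vowel only when it follows a consonant."
  [s i]
  (let [c (nth s i)]
    (cond
      (#{\a \e \i \o \u} c) false
      (= c \y) (or (zero? i) (not (consonant? s (dec i))))
      :else true)))

(defn measure
  "Hypothetical stand-in for m: counts the places where a vowel is
  immediately followed by a consonant."
  [s]
  (let [flags (map #(consonant? s %) (range (count s)))]
    (count (filter (fn [[a b]] (and (not a) b))
                   (partition 2 1 flags)))))

(defn chop-suffix
  "Hypothetical string version of chop: removes suffix from word only if
  the word ends with it and the remaining stem has a measure above one."
  [word suffix]
  (if (str/ends-with? word suffix)
    (let [stem (subs word 0 (- (count word) (count suffix)))]
      (if (> (measure stem) 1) stem word))
    word))

(chop-suffix "revival" "al")   ;; => "reviv"  (the stem has measure 2)
(chop-suffix "final" "al")     ;; => "final"  (the stem "fin" has measure 1)
```

The real chop does not need the suffix argument, because the stemmer it receives already carries the index that marks where the ending begins.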

Step Four

Once chop is defined, the rest of step four pretty much defines itself. Like steps two and three, it is a cond-ends? form that tests for a variety of endings and strips them off.

There is one special case: if a word ends in -ion, the ending is removed only when it is preceded by -s- or -t-, and the -s- or -t- itself is kept. You can see how that is handled about halfway through step-4.

(defn step-4
  "takes off -ant, -ence, etc., in context <c>vcvc<v>."
  [stemmer]
  (cond-ends? st stemmer
              "al" (chop st)
              "ance" (chop st)
              "ence" (chop st)
              "er" (chop st)
              "ic" (chop st)
              "able" (chop st)
              "ible" (chop st)
              "ant" (chop st)
              "ement" (chop st)
              "ment" (chop st)
              "ent" (chop st)
              ;; Special case: -ion is removed only when the letter
              ;; before it is s or t; otherwise the word is left alone.
              "ion" (if (#{\s \t} (index-char st))
                      (chop st)
                      stemmer)
              "ou" (chop st)
              "ism" (chop st)
              "ate" (chop st)
              "iti" (chop st)
              "ous" (chop st)
              "ive" (chop st)
              "ize" (chop st)))

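To see the -ion rule by itself, here is a small sketch that checks only the preceding letter, working on plain strings. The name ion-removable? is a hypothetical helper invented for this illustration, not part of the stemmer, and it deliberately skips the measure test that chop performs. It assumes Clojure 1.8 or later for clojure.string/ends-with?.

```clojure
(require '[clojure.string :as str])

(defn ion-removable?
  "Hypothetical helper: true when word ends in -ion and the letter just
  before that ending is s or t. The measure test is left out here."
  [word]
  (let [before (- (count word) 4)]   ; position of the letter before "ion"
    (boolean
      (and (str/ends-with? word "ion")
           (>= before 0)
           (#{\s \t} (nth word before))))))

(ion-removable? "adoption")   ;; => true   (the t before -ion is kept)
(ion-removable? "opinion")    ;; => false  (n precedes -ion)
```

In step-4, the set #{\s \t} plays the same role: it is applied as a function to the character before the ending, and only a match lets chop go ahead.
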
That’s it for step four. In the next posting, I’ll outline step five, which is, if anything, more like step one than steps two to four.
