Friday, August 1, 2008

Stemming, Part 15: Morphological Suffix

Now on to steps two and three. Here we strip off a bunch of morphological suffixes.

What Is a Morphological Suffix?

A morphological suffix changes a word from one part of speech to another. These can be joined together almost infinitely:

  • sense (verb)
  • sense + -ate = sensate (adjective)
  • sensate + -tion = sensation (noun)
  • sensation + al = sensational (adjective)
  • sensational + -ly = sensationally (adverb)
  • sensational + -ize = sensationalize (verb)

You get the idea. We could play this all day.

(I should drop in a warning here. The derivations above are purely morphological. I’m not making any statement about how a word developed historically or how it came into the language. There, I feel better. Thanks for putting up with my moment of pedantry.)

There are a set of rules to how morphological suffixes can be combined. You can’t just stick sensation and -ize together to make a verb. Also, different morphological suffixes change the stem’s root in different ways. Sense + -ate (sensate) is very different than sense + -ible (sensible), even though both -ate and -ible turn sense into an adjective.

The Porter stemmer leverages these rules to test for two different sets of endings in two different steps. The two steps are structured almost identically as two large cond-ends? expressions. In each case, they test for an ending, and if it is found, they replace it with another ending using r. The r function only makes the change if the word has an internal consonant cluster inside the stem. If it doesn’t have an internal consonant cluster, the ending is assumed to be part of the word and left alone.

For example, step 1c changes sensationally to sensationalli. Step 2 changes sensationalli to sensational.

On the other hand, the name calli, which also ends in -alli, should not be truncated to cal. Because the stemmer only truncates the ending if the stem has an internal consonant cluster, which calli does not, calli is left the way it is.

Our Functions for Today

Today, we’ll look at the functions for steps 2 and 3. As I said before, they are almost structurally identical, so I’ll show them both and comment on them together.

(defn step-2
  [stemmer]
  (cond-ends? st stemmer
              "ational" (r st stemmer "ate")
              "tional" (r st stemmer "tion")
              "enci" (r st stemmer "ence")
              "anci" (r st stemmer "ance")
              "izer" (r st stemmer "ize")
              "bli" (r st stemmer "ble")
              "alli" (r st stemmer "al")
              "entli" (r st stemmer "ent")
              "eli" (r st stemmer "e")
              "ousli" (r st stemmer "ous")
              "ization" (r st stemmer "ize")
              "ation" (r st stemmer "ate")
              "ator" (r st stemmer "ate")
              "alism" (r st stemmer "al")
              "iveness" (r st stemmer "ive")
              "fulness" (r st stemmer "ful")
              "ousness" (r st stemmer "ous")
              "fulness" (r st stemmer "ful")
              "ousness" (r st stemmer "ous")
              "aliti" (r st stemmer "al")
              "iviti" (r st stemmer "ive")
              "biliti" (r st stemmer "ble")
              "logi" (r st stemmer "log")))

(defn step-3
  "deals with -ic-, -full, -ness, etc., using
  a similar strategy to step-2."
  [stemmer]
  (cond-ends? st stemmer
              "icate" (r st stemmer "ic")
              "ative" (r st stemmer "")
              "alize" (r st stemmer "al")
              "iciti" (r st stemmer "ic")
              "ical" (r st stemmer "ic")
              "ful" (r st stemmer "")
              "ness" (r st stemmer "")))

These are nothing but cond-ends?. Each tests the input stemmer for a series of endings, and on the first ending found, it tests for an internal consonant cluster and changes the ending. If either of those conditions are false, the original stemmer is returned.

Possible Improvements

One obvious improvement would be to make a macro that takes an input stemmer and a sequence of ending/replacement pairs. It would expand into the cond-ends? above. It might look something like:

(replace-ending stemmer
                "icate" "ic"
                "ative" ""
                "alize" "al"
                "iciti" "ic"
                "ful" ""
                "ness" "")

I’ll leave that as an exercise to the reader.

In the next posting, we’ll look at stem 4, which is slightly different than steps 2 and 3.

No comments: