Now on to steps two and three. Here we strip off a bunch of morphological suffixes.
What Is a Morphological Suffix?
A morphological suffix changes a word from one part of speech to another. These can be joined together almost infinitely:
- sense (verb)
- sense + -ate = sensate (adjective)
- sensate + -tion = sensation (noun)
- sensation + -al = sensational (adjective)
- sensational + -ly = sensationally (adverb)
- sensational + -ize = sensationalize (verb)
You get the idea. We could play this all day.
(I should drop in a warning here. The derivations above are purely morphological. I’m not making any statement about how a word developed historically or how it came into the language. There, I feel better. Thanks for putting up with my moment of pedantry.)
There is a set of rules for how morphological suffixes can be combined. You can’t just stick sensation and -ize together to make a verb. Also, different morphological suffixes change the stem’s root in different ways. Sense + -ate (sensate) is very different from sense + -ible (sensible), even though both -ate and -ible turn sense into an adjective.
The Porter stemmer leverages these rules to test for two different sets of endings in two different steps. The two steps are structured almost identically, as two large cond-ends? expressions. In each case, they test for an ending, and if it is found, they replace it with another ending using the r function. The r function only makes the change if the word has an internal consonant cluster inside the stem. If it doesn’t, the ending is assumed to be part of the word and left alone.
For example, step 1c changes sensationally to sensationalli. Step 2 changes sensationalli to sensational.
On the other hand, the name calli, which also ends in -alli, should not be truncated to cal. Because the stemmer only truncates the ending if the stem has an internal consonant cluster, which calli does not, calli is left the way it is.
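To make the sensationalli/calli contrast concrete, here is a minimal sketch that works on plain strings rather than on the stemmer structure used in this series. It assumes that the "internal consonant cluster" test corresponds to Porter's measure condition (m > 0, meaning at least one vowel-to-consonant transition in the stem). The names consonant?, measure, and swap-ending are hypothetical and are not the functions from the earlier posts.

```clojure
(require '[clojure.string :as str])

(defn consonant?
  "True if the character at index i of word is a consonant. As in Porter's
  definition, y is a consonant unless it follows another consonant."
  [word i]
  (let [c (nth word i)]
    (cond
      (#{\a \e \i \o \u} c) false
      (= c \y) (or (zero? i) (not (consonant? word (dec i))))
      :else true)))

(defn measure
  "Counts the places where a vowel is immediately followed by a consonant.
  This is a stand-in for the stemmer's own measure function."
  [stem]
  (let [kinds (map #(consonant? stem %) (range (count stem)))]
    (count (filter (fn [[a b]] (and (not a) b))
                   (partition 2 1 kinds)))))

(defn swap-ending
  "If word ends with ending and the remaining stem has a positive measure,
  swaps ending for replacement. Otherwise returns word unchanged."
  [word ending replacement]
  (if (str/ends-with? word ending)
    (let [stem (subs word 0 (- (count word) (count ending)))]
      (if (pos? (measure stem))
        (str stem replacement)
        word))
    word))

(swap-ending "sensationalli" "alli" "al") ;=> "sensational"
(swap-ending "calli" "alli" "al")         ;=> "calli"
```

With sensationalli, the stem left after removing -alli is sensation, which has several vowel-to-consonant transitions, so the ending is replaced. With calli, the stem is just c, which has none, so the word comes back untouched.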
Our Functions for Today
Today, we’ll look at the functions for steps 2 and 3. As I said before, they are almost structurally identical, so I’ll show them both and comment on them together.
(defn step-2
  [stemmer]
  ;; Each clause pairs an ending with its replacement; the first ending found wins.
  (cond-ends? st stemmer
    "ational" (r st stemmer "ate")
    "tional"  (r st stemmer "tion")
    "enci"    (r st stemmer "ence")
    "anci"    (r st stemmer "ance")
    "izer"    (r st stemmer "ize")
    "bli"     (r st stemmer "ble")
    "alli"    (r st stemmer "al")
    "entli"   (r st stemmer "ent")
    "eli"     (r st stemmer "e")
    "ousli"   (r st stemmer "ous")
    "ization" (r st stemmer "ize")
    "ation"   (r st stemmer "ate")
    "ator"    (r st stemmer "ate")
    "alism"   (r st stemmer "al")
    "iveness" (r st stemmer "ive")
    "fulness" (r st stemmer "ful")
    "ousness" (r st stemmer "ous")
    "aliti"   (r st stemmer "al")
    "iviti"   (r st stemmer "ive")
    "biliti"  (r st stemmer "ble")
    "logi"    (r st stemmer "log")))

(defn step-3
  "deals with -ic-, -full, -ness, etc., using a similar strategy to step-2."
  [stemmer]
  (cond-ends? st stemmer
    "icate" (r st stemmer "ic")
    "ative" (r st stemmer "")
    "alize" (r st stemmer "al")
    "iciti" (r st stemmer "ic")
    "ical"  (r st stemmer "ic")
    "ful"   (r st stemmer "")
    "ness"  (r st stemmer "")))
These are nothing but cond-ends? expressions. Each tests the input stemmer for a series of endings, and on the first ending found, it tests for an internal consonant cluster and changes the ending. If either of those conditions is false, the original stemmer is returned.
One obvious improvement would be to make a macro that takes an input stemmer and a sequence of ending/replacement pairs. It would expand into the cond-ends? expressions above. It might look something like:
(replace-ending stemmer
  "icate" "ic"
  "ative" ""
  "alize" "al"
  "iciti" "ic"
  "ful"   ""
  "ness"  "")
I’ll leave that as an exercise to the reader.
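For anyone who wants to compare notes after trying it, here is one possible sketch of such a macro. It is only an illustration: it assumes the cond-ends? macro and r function from this series are visible where the macro is used, that cond-ends? binds the stripped stemmer to the symbol st as in the functions above, and that the stemmer argument is a simple expression such as a symbol (it is repeated in every clause of the expansion).

```clojure
(defmacro replace-ending
  "Expands a flat list of ending/replacement pairs into a cond-ends? form.
  The symbols cond-ends?, r, and st are emitted unqualified on purpose, so
  they resolve in the namespace where the macro is used."
  [stemmer & pairs]
  (let [clauses (mapcat (fn [[ending replacement]]
                          [ending (list 'r 'st stemmer replacement)])
                        (partition 2 pairs))]
    (list* 'cond-ends? 'st stemmer clauses)))

;; Inspect the expansion without needing cond-ends? or r to be defined:
(macroexpand-1 '(replace-ending stemmer "icate" "ic" "ful" ""))
;=> (cond-ends? st stemmer "icate" (r st stemmer "ic") "ful" (r st stemmer ""))
```

Building the form with list and quoted symbols, rather than syntax-quote, keeps st from being namespace-qualified. The trade-off is that the macro is not hygienic, so it only works where cond-ends? and r mean what they mean in this series.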
In the next posting, we’ll look at step 4, which is slightly different from steps 2 and 3.