Today, we’ll look at step four of the Porter stemmer. This step is a little different from the previous and later steps, so we’ll focus on it alone today.
Utilities
Before we outline the step itself, however, we need to define another utility function. It tests whether the stem (the part of the stemmer’s word before the current ending) has more than one internal consonant cluster. If it does, the function strips off the ending; otherwise, it returns the stemmer unchanged.
(defn chop
  "If there is more than one internal consonant cluster in the stem, this
  chops the ending (as identified by the index)."
  [stemmer]
  (if (> (m stemmer) 1)
    (assoc stemmer :word (subword stemmer))
    stemmer))
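To see the test in isolation, here is a minimal sketch that works on plain strings instead of the stemmer map. The names vowel?, measure, and chop-suffix are hypothetical stand-ins invented for this example; they are not the m, subword, or chop used in this series, and they assume lower-case ASCII input.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: is the character at index i acting as a vowel?
;; A y counts as a vowel only when it follows a consonant.
(defn vowel?
  [word i]
  (let [c (nth word i)]
    (cond
      (#{\a \e \i \o \u} c) true
      (= c \y) (and (pos? i) (not (vowel? word (dec i))))
      :else false)))

;; Hypothetical stand-in for m: counts vowel-to-consonant transitions,
;; i.e. the m in the pattern [C](VC)^m[V].
(defn measure
  [stem]
  (let [flags (map #(vowel? stem %) (range (count stem)))]
    (count (filter (fn [[a b]] (and a (not b)))
                   (partition 2 1 flags)))))

;; Hypothetical stand-in for chop: drops suffix only when the remaining
;; stem has a measure greater than 1; otherwise returns word unchanged.
(defn chop-suffix
  [word suffix]
  (if (str/ends-with? word suffix)
    (let [stem (subs word 0 (- (count word) (count suffix)))]
      (if (> (measure stem) 1) stem word))
    word))

(chop-suffix "revival" "al") ;; => "reviv"
(chop-suffix "metal" "al")   ;; => "metal"
```

In "revival" the stem "reviv" has a measure of 2, so the ending goes. In "metal" the stem "met" has a measure of only 1, so the word is left alone.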
Step Four
Once chop is defined, the rest of step four pretty much defines itself. Like steps two and three, it is a cond-ends? that tests for a variety of endings and strips them off.
There is one special case: if a word ends in -ion, preceded by an -s- or -t-, the -ion is removed, but the -s- or -t- is left. You can see how that is handled about halfway through step-4.
(defn step-4
  "takes off -ant, -ence, etc., in context <c>vcvc<v>."
  [stemmer]
  (cond-ends? st stemmer
    "al" (chop st)
    "ance" (chop st)
    "ence" (chop st)
    "er" (chop st)
    "ic" (chop st)
    "able" (chop st)
    "ible" (chop st)
    "ant" (chop st)
    "ement" (chop st)
    "ment" (chop st)
    "ent" (chop st)
    "ion" (if (#{\s \t} (index-char st))
            (chop st)
            stemmer)
    "ou" (chop st)
    "ism" (chop st)
    "ate" (chop st)
    "iti" (chop st)
    "ous" (chop st)
    "ive" (chop st)
    "ize" (chop st)))
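Two details of step-4 are easy to miss: the endings -ement, -ment, and -ent are listed longest first, and the -ion branch uses a set as a predicate on the preceding letter. The sketch below illustrates both on plain strings. It assumes cond-ends? takes the first ending that matches, the way cond does, and the names endings, first-ending, and ion-stem are made up for this example. It also leaves out the consonant-cluster test that chop performs.

```clojure
(require '[clojure.string :as str])

;; A subset of the step-4 endings, in the same relative order.
(def endings ["ement" "ment" "ent" "ion"])

;; Hypothetical helper: the first listed ending that word ends with, or nil.
;; Listing "ement" before "ment" before "ent" lets the longest one win.
(defn first-ending
  [word]
  (first (filter #(str/ends-with? word %) endings)))

;; Hypothetical helper: the stem with -ion removed, but only when the letter
;; before -ion is s or t. Returns nil otherwise. The set #{\s \t} acts as a
;; predicate, as in the "ion" branch of step-4.
(defn ion-stem
  [word]
  (when (str/ends-with? word "ion")
    (let [stem (subs word 0 (- (count word) 3))]
      (when (#{\s \t} (last stem))
        stem))))

(first-ending "replacement") ;; => "ement"
(first-ending "adjustment")  ;; => "ment"
(ion-stem "adoption")        ;; => "adopt"
(ion-stem "opinion")         ;; => nil
```

If "ent" came first in the list, "replacement" would match it and the longer endings would never be tried.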
That’s it for step four. In the next posting, I’ll outline step five, which is, if anything, more like step one than steps two to four.