In today’s posting, we’ll take a look at the initial step in the Porter stemmer.
An Overview
The first step in stemming tokens involves removing plural -s and verb endings -ed and -ing. It also turns -y to -i, so it will be recognized as a suffix in later steps. The documentation in the C source code lists some examples:
caresses -> caress ponies -> poni ties -> ti caress -> caress cats -> cat feed -> feed agreed -> agree disabled -> disable matting -> mat mating -> mate meeting -> meet milling -> mill messing -> mess meetings -> meet
Plurals
The first step is to strip off plurals. If the word ends in a -sses or -ies, this removes the -es; otherwise, if it ends in -s, but not -ss, it takes off the -s. If none of those apply, it returns the original stemmer.
(defn stem-plural "This is part of step 1ab. It removes plurals (-s) from a stem." [stemmer] (if (= (last (:word stemmer)) \s) (cond-ends? st stemmer "sses" (reset-index (pop (pop (:word st)))) "ies" (set-to st "i") :else (if (and (>= (count (:word st)) 2) (not= (nth (:word st) (- (count (:word st)) 2)) \s)) (assoc st :word (pop (:word st))) st)) stemmer))
It’s been a while since we’ve seen some of these functions. Here are a few reminders about what they do:
reset-index
returns a new stemmer from a word vector (here with the last two
letters removed from the word).
set-to
removes the -ies from the end of the word and replaces it with an
-i.
cond-ends?
is the macro we created in the last few postings. I just wanted
to point that out.
Verb Endings
Another part of step 1 involves removing the verb suffixes -ed and -ing from the word. It first looks for -eed if it is long enough (that is, if it has more than one internal sequence of consonants), it removes the -d; if it ends in -ed, it removes that; and if it ends in -ing, it removes that. In either of the last two cases, it also expands truncated suffixes.
Expanding suffixes just tests for certain endings and, if they are found,
appends an -e to the word. Specifically, it looks for -at, -bl, and
-iz. It also checks for a double-consonant ending. Some are all right in
English (-ll, -ss, and -zz), but most others should be removed. Finally,
it removes the final consonant in words that end in CVC (with exceptions).
This process is handled by the stem-expand-suffix
function, which is listed
first.
(defn stem-expand-suffix "This is part of step 1ab. It expands -at, -bl, and -iz by adding an -e in certain circumstances." [stemmer] (cond-ends? st stemmer "at" (set-to st "ate") "bl" (set-to st "ble") "iz" (set-to st "ize") :else (cond (double-c? st (dec (count (:word st)))) (if (#{\l \s \z} (last (:word st))) st (assoc st :word (pop (:word st)))) (and (= (m st) 1) (cvc? st (dec (count (:word st))))) (set-to st "e") :else st))) (defn stem-verb-ending "This is part of step 1ab. It removes verb endings -ed and -ing." [stemmer] (cond-ends? st stemmer "eed" (if (pos? (m st)) (assoc st :word (pop (:word st))) stemmer) "ed" (if (vowel-in-stem? st) (stem-expand-suffix (assoc st :word (subword st))) stemmer) "ing" (if (vowel-in-stem? st) (stem-expand-suffix (assoc st :word (subword st))) stemmer)))
double-c?
returns true if the word ends in a double consonant. For example,
-ll or -ss or something.
cvc?
returns true if the word ends in a consonant-vowel-consonant sequence
and if the final consonant is not w, x, or y.
m
returns the number of consonant sequences between the start of a word and
the index position. But if the word starts with a consonant sequence, it isn’t
counted.
Step 1AB
The actual function for step 1AB (A calls stem-plural
and B calls
stem-verb-ending
) is simple. It just passes its input through the two
functions:
(defn step-1ab [stemmer] (stem-verb-ending (stem-plural stemmer)))
One thing about functional languages is that they often have to be read from
right to left. The stemmer gets passed to stem-plural
first and
stem-verb-ending
second. The way it is written, however, is
counter-intuitive, and it makes functional languages harder to read.
Clojure provides an improvement to this. It uses the ->
macro to build
expressions like above. Let’s spend some time understanding what this macro
does, first by using it and then by looking at the expressions it outputs.
The first parameter to ->
is an expression. The remaining parameters are
functions. The expression parameter gets passed as the first parameter to the
first function, and the output of this gets passed as the first parameter to
the second function parameter. This continues until all the functions have
been chained together.
To play with this, let’s define some functions that operate on strings.
porter=> (defn to-lower-case [string] (.toLowerCase string)) #'porter/to-lower-case porter=> (defn trim [string] (.trim string)) #'porter/trim porter=> (trim (to-lower-case " ThIs NeEdS ClEaNiNg ")) "this needs cleaning"
If we make that last call with ->
, it looks like this:
porter=> (-> " ThIs NeEdS ClEaNiNg " to-lower-case trim) "this needs cleaning"
This is really handy if there aren’t any other arguments. But what if you want
to use a function that needs more than one argument. As long as the first
argument is the expression parameter or the result of the previous function in
the sequence, it’s no problem. Just enclose the function and the remaining
parameters to that function in parentheses. For example, this defines a
wrapper for String.substring
that returns everything from the second
parameter on.
porter=> (defn substring [string index] (.substring string index)) #'porter/substring porter=> (-> " ThIs NeEdS ClEaNiNg " to-lower-case trim (substring 11)) "cleaning"
We can see what this does using macroexpand-1
:
porter=> (macroexpand-1 '(-> " ThIs NeEdS ClEaNiNg " to-lower-case trim (substring 11))) (clojure/-> (clojure/-> " ThIs NeEdS ClEaNiNg " to-lower-case) trim (substring 11))
Umm. It changed, but not by much. Let’s just take the inner, second ->
and
try expanding it:
porter=> (macroexpand-1 '(-> " ThIs NeEdS ClEaNiNg " to-lower-case)) (to-lower-case " ThIs NeEdS ClEaNiNg ")
That seems like a normal function call. If we keep breaking it down, we get this:
(substring (trim (to-lower-case " ThIs NeEdS ClEaNiNg ")) 11)
Which is what we want and expect.
Now, we can rewrite step-1ab
to use this macro and be much more readable.
(defn step-1ab "step-1ab gets rid of plurals and -ed or -ing. E.g., caresses -> caress ponies -> poni ties -> ti caress -> caress cats -> cat feed -> feed agreed -> agree disabled -> disable matting -> mat mating -> mate meeting -> meet milling -> mill messing -> mess meetings -> meet " [stemmer] (-> stemmer stem-plural stem-verb-ending))
Step 1C
The rest of step one just tests to see if the word ends in -y. If it does, and if there is a vowel in the stem, it removes the -y and adds a -i to the word.
(defn step-1c "Turns terminal y to i when there is another vowel in the stem." [stemmer] (if-ends? st stemmer "y" (if (vowel-in-stem? st) (reset-index (conj (pop (:word st)) \i)) stemmer)))
That’s pretty straightforward. It uses if-ends?
to test for the -y, and
vowel-in-stem?
looks for a vowel before the -y. If either of these is not
the case, the original stemmer is returned.
Those two functions comprise step 1. In the next posting, we’ll look at the next several steps.
2 comments:
(assoc st :word (pop (:word st))))
Code like that keeps popping up, and it seems low-level in the context of nice macros like cond-ends?. Might it be a worthwhile motivation for a new function, used as follows?
(pop-word st) ; pop one char
(pop-word st 2) ; pop two chars
Gavin Sinclair
Hi Gavin,
Good call. The version of this with one parameter is already there (http://code.google.com/p/word-clj/source/browse/trunk/porter.clj#43).
The second would be helpful too, though.
Thanks,
Eric
Post a Comment