Thursday, August 7, 2008

Stemming, Part 18: Processing and Testing

All the pieces are in place, now here is the final piece. Also, I’ll describe how I tested this to make sure it was working correctly.

The stem Function

Everything that we’ve written so far happens under the hood. This function is finally the one function that will be called in other code. Without further ado, here it is.

(defn stem
  [word]
  (if (<= (count word) 2)
    word
    (apply str (-> word make-stemmer
                   step-1ab step-1c step-2 step-3 step-4 step-5
                   :word))))

If the word has one or two letters, just return it. If it is longer, use the -> macro to thread the word through make-stemmer and the steps, and extract the stem vector.

The word vector gets passed to the apply function. This is a special higher-order function that takes a function and its arguments as a sequence. It applies the arguments to the function and returns the result. Let’s look at how it works.

user=> (+ 1 2 3)
6
user=> (apply + '(1 2 3))
6
user=> (apply + 1 '(2 3))
6

You can see that only the last argument to apply has to be a sequence of arguments to pass to the function. The other arguments can be listed individually before the final sequence, and they are put before the sequence. For example, you can’t do this:

user=> (apply + 1 2 3)   
java.lang.IllegalArgumentException: Don't know how to create ISeq from: Integer

Of course, if you’re doing that, you already know how many arguments you’re calling the function with, and in that case, you should just call it as is (that is, just call (+ 1 2 3)).

So in stem, we take a word vector and pass all of the characters in it to the str function. str converts all of this arguments to a string and concatenates them.

user=> (str \w \o \r \d)
"word"
user=> (apply str [\w \o \r \d])
"word"

Well, we have a new toy now, so let’s play with it:

porter=> (stem "porter")
"porter"
porter=> (stem "porting")
"port"
porter=> (stem "ports")
"port"
porter=> (stem "ported")
"port"
porter=> (stem "stemming")
"stem"

Testing

I’ve been presenting the code here as a finished product, perfect (I guess) as written. But it didn’t begin that way. In fact, I originally wrote something very close to the C version of the algorithm and made sure that worked right. Then I gradually changed it to make it more lispy. The is the result I have presented here.

To make sure it worked correctly, I downloaded the test input data and expected output from the Porter Stemmer web site. The first file contains 23,531 words for a test set. The second contains those same words after they’ve been run through the stemmer.

Next, I wrote a function that reads from both files, stems the input, and compares it to the output. I don’t always need to test every item in the test set. Sometimes I can get by with only testing the first so many words, so I’ve included a parameter to limit how many words to test. Also, sometimes I may want to see the output from every word in the test set, but most of the time, I really only want to see the errors. Finally, this returns the total number of words tested, the number the stemmer got right, and the number it got wrong.

(defn read-lines
  [filename]
  (with-open reader (new java.io.BufferedReader (new java.io.FileReader filename))
    (doall (line-seq reader))))

(defn test-porter
  ([]
   (test-porter (.MAX_VALUE Integer) false))
  ([n output-all?]
   (loop [input (take n (read-lines "porter-test/voc.txt")),
          expected (take n (read-lines "porter-test/output.txt")),
          total 0, correct 0, error 0]
     (if (and input expected)
       (let [i (first input), e (first expected), a (stem i)]
         (if (= a e)
           (do
             (when output-all?
               (println "OK:" (pr-str i)))
             (recur (rest input)
                    (rest expected)
                    (inc total)
                    (inc correct)
                    error))
           (do
             (println "ERROR:" (pr-str i)
                      "=> (" (pr-str a) "!=" (pr-str e) ")")
             (recur (rest input)
                    (rest expected)
                    (inc total)
                    correct
                    (inc error)))))
       [total correct error]))))

The highlights of this are:

  • read-lines is a utility that opens a file using a Java BufferedReader and assigns that to reader. with-open always calls (. reader close) when it exits. line-seq takes a reader and returns a lazy sequence on the lines in the reader, and doall forces Clojure to read all the items in a lazy sequence. Basically, read-lines reads all the lines in a file and returns them in a sequence.

  • As we’ve seen before, take pulls the first n items from a list, which limits the number of words to be tested.

  • The loop continues while there is input from input and expected.

  • The input is stemmed and stored as the variable a (short for actual).

  • If the actual is the same as the expected, optionally output that, and loop, incrementing the number of total words tested and the number of words stemmed correctly.

  • If the actual and expected are not the same, always write this out and loop, incrementing the number of total words tested and the number of errors.

Tomorrow, I’ll talk about how I tracked down bugs that cropped up during testing.

Wednesday, August 6, 2008

Stemming, Part 17: The Final Step

Today, we pick apart step five. This will involve removing the final -e and -l from words. But each of these cases will be handled in a different function.

Final E

Silent -e is one of the more, um, endearing quirks of English spelling. Invariably, whenever someone suggests spelling reform, it is the first letter on the chopping block. Stemming isn’t spelling reform, but this step does get rid of silent and final -e in all words with more than one internal consonant cluster.

(defn rule-e
  "This removes the final -e from a word if
    - there is more than one internal consonant cluster; or
    - there is exactly one final consonant cluster and
      it is not preceded by a CVC sequence.
  "
  [stemmer]
  (if (= (last (:word stemmer)) \e)
    (let [a (m stemmer)]
      (if (or (> a 1)
              (and (= a 1)
                   (not (cvc? stemmer (dec (:index stemmer))))))
        (pop-word stemmer)
        stemmer))
    stemmer))

Final L

Handling final -l just changes -ll to -l if there is more than one consonant cluster in the word. This cleans up any double-l’s that may be left around.

(defn rule-l
  "This changes -ll to -l if (> (m) 1)."
  [stemmer]
  (if (and (= (last (:word stemmer)) \l)
           (double-c? stemmer (dec (count (:word stemmer))))
           (> (m stemmer) 1))
    (pop-word stemmer)
    stemmer))

Step 5

Once again, the function for step five is pretty simple:

  1. It pulls the word from the input stemmer and creates a new one with an index pointing to the end of the word;
  2. It runs that through both of the functions listed above.

Again, notice that we used -> to make it easier to read.

(defn step-5
  "Removes a final -e and changes -ll to -l if (> (m) 1)."
  [stemmer]
  (-> stemmer :word reset-index rule-e rule-l))

Surprise

One quirk with this algorithm is that it sometimes removes letters from standard English words. For example, locate becomes locat. The output it produces isn’t standard, but it is correct. That is, all forms of locate collapse down to one stem: locat. So if you see a word that isn’t a word, don’t worry, it’s still correct. Remember, we’re identifying stems, not words.

Next, we’ll look at the stem function and at how to test a system like this.

Tuesday, August 5, 2008

Stemming, Part 16: More Suffixes

Today, we’ll look at step four of the Porter stemmer. This is a little different than previous and later steps, so we’ll just focus on it today.

Utilities

Before we outline the step itself, however, we need to define another utility function. This tests whether the stemmer’s word has more than one internal consonant cluster. If it does, it strips off the ending; otherwise, it returns the original stemmer.

(defn chop
  "If there is more than one internal
  consonant cluster in the stem, this chops
  the ending (as identified by the index)."
  [stemmer]
  (if (> (m stemmer) 1)
    (assoc stemmer :word (subword stemmer))
    stemmer))

Step Four

Once chop is defined, the rest of step four pretty much defines itself. Like steps two and three, it is a cond-ends? that tests for a variety of endings and strips them off.

There is one special case: If a word ends in -ion, preceded by a -s- or -t-, the -ion is removed, but the -s- or -t- is left. You can see how that is handled about half way through step-4.

(defn step-4
  "takes off -ant, -ence, etc., in context <c>vcvc<v>."
  [stemmer]
  (cond-ends? st stemmer
              "al" (chop st)
              "ance" (chop st)
              "ence" (chop st)
              "er" (chop st)
              "ic" (chop st)
              "able" (chop st)
              "ible" (chop st)
              "ant" (chop st)
              "ement" (chop st)
              "ment" (chop st)
              "ent" (chop st)
              "ion" (if (#{\s \t} (index-char st))
                      (chop st)
                      stemmer)
              "ou" (chop st)
              "ism" (chop st)
              "ate" (chop st)
              "iti" (chop st)
              "ous" (chop st)
              "ive" (chop st)
              "ize" (chop st)))

That’s it for step four. In the posting, I’ll outline step five, which is, if anything, more like step one than steps two to four.

Friday, August 1, 2008

Stemming, Part 15: Morphological Suffix

Now on to steps two and three. Here we strip off a bunch of morphological suffixes.

What Is a Morphological Suffix?

A morphological suffix changes a word from one part of speech to another. These can be joined together almost infinitely:

  • sense (verb)
  • sense + -ate = sensate (adjective)
  • sensate + -tion = sensation (noun)
  • sensation + al = sensational (adjective)
  • sensational + -ly = sensationally (adverb)
  • sensational + -ize = sensationalize (verb)

You get the idea. We could play this all day.

(I should drop in a warning here. The derivations above are purely morphological. I’m not making any statement about how a word developed historically or how it came into the language. There, I feel better. Thanks for putting up with my moment of pedantry.)

There are a set of rules to how morphological suffixes can be combined. You can’t just stick sensation and -ize together to make a verb. Also, different morphological suffixes change the stem’s root in different ways. Sense + -ate (sensate) is very different than sense + -ible (sensible), even though both -ate and -ible turn sense into an adjective.

The Porter stemmer leverages these rules to test for two different sets of endings in two different steps. The two steps are structured almost identically as two large cond-ends? expressions. In each case, they test for an ending, and if it is found, they replace it with another ending using r. The r function only makes the change if the word has an internal consonant cluster inside the stem. If it doesn’t have an internal consonant cluster, the ending is assumed to be part of the word and left alone.

For example, step 1c changes sensationally to sensationalli. Step 2 changes sensationalli to sensational.

On the other hand, the name calli, which also ends in -alli, should not be truncated to cal. Because the stemmer only truncates the ending if the stem has an internal consonant cluster, which calli does not, calli is left the way it is.

Our Functions for Today

Today, we’ll look at the functions for steps 2 and 3. As I said before, they are almost structurally identical, so I’ll show them both and comment on them together.

(defn step-2
  [stemmer]
  (cond-ends? st stemmer
              "ational" (r st stemmer "ate")
              "tional" (r st stemmer "tion")
              "enci" (r st stemmer "ence")
              "anci" (r st stemmer "ance")
              "izer" (r st stemmer "ize")
              "bli" (r st stemmer "ble")
              "alli" (r st stemmer "al")
              "entli" (r st stemmer "ent")
              "eli" (r st stemmer "e")
              "ousli" (r st stemmer "ous")
              "ization" (r st stemmer "ize")
              "ation" (r st stemmer "ate")
              "ator" (r st stemmer "ate")
              "alism" (r st stemmer "al")
              "iveness" (r st stemmer "ive")
              "fulness" (r st stemmer "ful")
              "ousness" (r st stemmer "ous")
              "fulness" (r st stemmer "ful")
              "ousness" (r st stemmer "ous")
              "aliti" (r st stemmer "al")
              "iviti" (r st stemmer "ive")
              "biliti" (r st stemmer "ble")
              "logi" (r st stemmer "log")))

(defn step-3
  "deals with -ic-, -full, -ness, etc., using
  a similar strategy to step-2."
  [stemmer]
  (cond-ends? st stemmer
              "icate" (r st stemmer "ic")
              "ative" (r st stemmer "")
              "alize" (r st stemmer "al")
              "iciti" (r st stemmer "ic")
              "ical" (r st stemmer "ic")
              "ful" (r st stemmer "")
              "ness" (r st stemmer "")))

These are nothing but cond-ends?. Each tests the input stemmer for a series of endings, and on the first ending found, it tests for an internal consonant cluster and changes the ending. If either of those conditions are false, the original stemmer is returned.

Possible Improvements

One obvious improvement would be to make a macro that takes an input stemmer and a sequence of ending/replacement pairs. It would expand into the cond-ends? above. It might look something like:

(replace-ending stemmer
                "icate" "ic"
                "ative" ""
                "alize" "al"
                "iciti" "ic"
                "ful" ""
                "ness" "")

I’ll leave that as an exercise to the reader.

In the next posting, we’ll look at stem 4, which is slightly different than steps 2 and 3.