Thursday, August 7, 2008

Stemming, Part 18: Processing and Testing

All the pieces are in place, now here is the final piece. Also, I’ll describe how I tested this to make sure it was working correctly.

The stem Function

Everything that we’ve written so far happens under the hood. This function is finally the one function that will be called in other code. Without further ado, here it is.

(defn stem
  (if (<= (count word) 2)
    (apply str (-> word make-stemmer
                   step-1ab step-1c step-2 step-3 step-4 step-5

If the word has one or two letters, just return it. If it is longer, use the -> macro to thread the word through make-stemmer and the steps, and extract the stem vector.

The word vector gets passed to the apply function. This is a special higher-order function that takes a function and its arguments as a sequence. It applies the arguments to the function and returns the result. Let’s look at how it works.

user=> (+ 1 2 3)
user=> (apply + '(1 2 3))
user=> (apply + 1 '(2 3))

You can see that only the last argument to apply has to be a sequence of arguments to pass to the function. The other arguments can be listed individually before the final sequence, and they are put before the sequence. For example, you can’t do this:

user=> (apply + 1 2 3)   
java.lang.IllegalArgumentException: Don't know how to create ISeq from: Integer

Of course, if you’re doing that, you already know how many arguments you’re calling the function with, and in that case, you should just call it as is (that is, just call (+ 1 2 3)).

So in stem, we take a word vector and pass all of the characters in it to the str function. str converts all of this arguments to a string and concatenates them.

user=> (str \w \o \r \d)
user=> (apply str [\w \o \r \d])

Well, we have a new toy now, so let’s play with it:

porter=> (stem "porter")
porter=> (stem "porting")
porter=> (stem "ports")
porter=> (stem "ported")
porter=> (stem "stemming")


I’ve been presenting the code here as a finished product, perfect (I guess) as written. But it didn’t begin that way. In fact, I originally wrote something very close to the C version of the algorithm and made sure that worked right. Then I gradually changed it to make it more lispy. The is the result I have presented here.

To make sure it worked correctly, I downloaded the test input data and expected output from the Porter Stemmer web site. The first file contains 23,531 words for a test set. The second contains those same words after they’ve been run through the stemmer.

Next, I wrote a function that reads from both files, stems the input, and compares it to the output. I don’t always need to test every item in the test set. Sometimes I can get by with only testing the first so many words, so I’ve included a parameter to limit how many words to test. Also, sometimes I may want to see the output from every word in the test set, but most of the time, I really only want to see the errors. Finally, this returns the total number of words tested, the number the stemmer got right, and the number it got wrong.

(defn read-lines
  (with-open reader (new (new filename))
    (doall (line-seq reader))))

(defn test-porter
   (test-porter (.MAX_VALUE Integer) false))
  ([n output-all?]
   (loop [input (take n (read-lines "porter-test/voc.txt")),
          expected (take n (read-lines "porter-test/output.txt")),
          total 0, correct 0, error 0]
     (if (and input expected)
       (let [i (first input), e (first expected), a (stem i)]
         (if (= a e)
             (when output-all?
               (println "OK:" (pr-str i)))
             (recur (rest input)
                    (rest expected)
                    (inc total)
                    (inc correct)
             (println "ERROR:" (pr-str i)
                      "=> (" (pr-str a) "!=" (pr-str e) ")")
             (recur (rest input)
                    (rest expected)
                    (inc total)
                    (inc error)))))
       [total correct error]))))

The highlights of this are:

  • read-lines is a utility that opens a file using a Java BufferedReader and assigns that to reader. with-open always calls (. reader close) when it exits. line-seq takes a reader and returns a lazy sequence on the lines in the reader, and doall forces Clojure to read all the items in a lazy sequence. Basically, read-lines reads all the lines in a file and returns them in a sequence.

  • As we’ve seen before, take pulls the first n items from a list, which limits the number of words to be tested.

  • The loop continues while there is input from input and expected.

  • The input is stemmed and stored as the variable a (short for actual).

  • If the actual is the same as the expected, optionally output that, and loop, incrementing the number of total words tested and the number of words stemmed correctly.

  • If the actual and expected are not the same, always write this out and loop, incrementing the number of total words tested and the number of errors.

Tomorrow, I’ll talk about how I tracked down bugs that cropped up during testing.

No comments: