All the pieces are in place, so here is the final one. I’ll also describe how I tested the stemmer to make sure it was working correctly.
The stem Function
Everything that we’ve written so far happens under the hood. This, finally, is the one function that other code will call. Without further ado, here it is.
(defn stem [word]
  (if (<= (count word) 2)
    word
    (apply str (-> word
                   make-stemmer
                   step-1ab step-1c step-2 step-3 step-4 step-5
                   :word))))
If the word has one or two letters, just return it. If it is longer, use the -> macro to thread the word through make-stemmer and the steps, and extract the stem vector with the :word key.
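To make the threading concrete, here is a minimal sketch of the same shape on a much smaller scale. The names make-demo-stemmer and drop-final-s are hypothetical stand-ins invented for this illustration, and the sketch assumes (as the :word lookup suggests) that the stemmer is a map holding the word as a vector of characters. The real make-stemmer and step functions do far more.

```clojure
;; Hypothetical stand-in for make-stemmer: wraps the word's characters
;; in a map under :word. This structure is an assumption for the demo.
(defn make-demo-stemmer [word]
  {:word (vec word)})

;; Hypothetical stand-in for a step: drops a single trailing \s, if any.
(defn drop-final-s [stemmer]
  (if (= \s (peek (:word stemmer)))
    (update stemmer :word pop)
    stemmer))

;; Threaded form: reads left to right, in the order the steps run.
(def threaded
  (-> "ports" make-demo-stemmer drop-final-s :word))

;; Equivalent nested form: reads from the inside out.
(def nested
  (:word (drop-final-s (make-demo-stemmer "ports"))))

(= threaded nested)
;; => true
```

Each form in the thread receives the result of the previous one as its first argument, which is why a keyword like :word can sit at the end and act as the final lookup.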
The word vector gets passed to the apply function. This is a higher-order function that takes a function and a sequence of arguments. It calls the function with those arguments and returns the result. Let’s look at how it works.
user=> (+ 1 2 3)
6
user=> (apply + '(1 2 3))
6
user=> (apply + 1 '(2 3))
6
You can see that only the last argument to apply has to be a sequence of arguments to pass to the function. Any other arguments can be listed individually before that final sequence, and they are passed to the function ahead of the items in the sequence. The last argument must still be a sequence, however.
For example, you can’t do this:
user=> (apply + 1 2 3)
java.lang.IllegalArgumentException: Don't know how to create ISeq from: Integer
Of course, if you’re doing that, you already know how many arguments you’re calling the function with, and in that case, you should just call it as is (that is, just call (+ 1 2 3)).
So in stem, we take a word vector and pass all of the characters in it to the str function. str converts all of its arguments to strings and concatenates them.
user=> (str \w \o \r \d)
"word"
user=> (apply str [\w \o \r \d])
"word"
Well, we have a new toy now, so let’s play with it:
porter=> (stem "porter")
"porter"
porter=> (stem "porting")
"port"
porter=> (stem "ports")
"port"
porter=> (stem "ported")
"port"
porter=> (stem "stemming")
"stem"
Testing
I’ve been presenting the code here as a finished product, perfect (I guess) as written. But it didn’t begin that way. In fact, I originally wrote something very close to the C version of the algorithm and made sure that worked right. Then I gradually changed it to make it more lispy. That is the result I have presented here.
To make sure it worked correctly, I downloaded the test input data and expected output from the Porter Stemmer web site. The first file contains 23,531 words for a test set. The second contains those same words after they’ve been run through the stemmer.
Next, I wrote a function that reads from both files, stems the input, and compares it to the expected output. I don’t always need to test every item in the test set. Sometimes I can get by with testing only the first few words, so I’ve included a parameter to limit how many words to test. Also, sometimes I may want to see the output for every word in the test set, but most of the time, I really only want to see the errors, so a second parameter controls that. Finally, the function returns the total number of words tested, the number the stemmer got right, and the number it got wrong.
(defn read-lines [filename]
  ;; with-open takes its binding in a vector and closes the reader on exit.
  (with-open [reader (new java.io.BufferedReader
                          (new java.io.FileReader filename))]
    ;; doall reads every line before the reader is closed.
    (doall (line-seq reader))))

(defn test-porter
  ([] (test-porter Integer/MAX_VALUE false))
  ([n output-all?]
   ;; seq and next return nil for an exhausted sequence, which ends the loop.
   (loop [input (seq (take n (read-lines "porter-test/voc.txt"))),
          expected (seq (take n (read-lines "porter-test/output.txt"))),
          total 0, correct 0, error 0]
     (if (and input expected)
       (let [i (first input), e (first expected), a (stem i)]
         (if (= a e)
           (do
             (when output-all?
               (println "OK:" (pr-str i)))
             (recur (next input) (next expected)
                    (inc total) (inc correct) error))
           (do
             (println "ERROR:" (pr-str i) "=> (" (pr-str a) "!=" (pr-str e) ")")
             (recur (next input) (next expected)
                    (inc total) correct (inc error)))))
       [total correct error]))))
The highlights of this are:
- read-lines is a utility that opens a file using a Java BufferedReader and assigns that to reader. with-open always calls (. reader close) when it exits. line-seq takes a reader and returns a lazy sequence of the lines in the reader, and doall forces Clojure to read all the items in a lazy sequence. Basically, read-lines reads all the lines in a file and returns them in a sequence.
- As we’ve seen before, take pulls the first n items from a list, which limits the number of words to be tested.
- The loop continues while there is input from both input and expected.
- The input is stemmed and stored as the variable a (short for actual).
- If the actual is the same as the expected, optionally output that, and loop, incrementing the number of total words tested and the number of words stemmed correctly.
- If the actual and expected are not the same, always write this out and loop, incrementing the number of total words tested and the number of errors.
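The counting logic described above can also be tried without the test files. The following is only a sketch under stated assumptions: tally-results and toy-stem are hypothetical names made up for this illustration, and the reduce-based shape is an alternative to the loop in test-porter, not the code I actually used. It takes any one-argument stemming function plus a sequence of [input expected] pairs and returns the same [total correct error] vector.

```clojure
;; Hypothetical helper: stem-fn is any one-argument stemming function,
;; and pairs is a sequence of [input expected] vectors.
(defn tally-results [stem-fn pairs]
  (reduce (fn [[total correct error] [input expected]]
            (if (= (stem-fn input) expected)
              [(inc total) (inc correct) error]
              [(inc total) correct (inc error)]))
          [0 0 0]
          pairs))

;; A toy "stemmer" that only strips one trailing s, so that one of the
;; sample pairs below fails on purpose.
(defn toy-stem [word]
  (if (= \s (last word))
    (subs word 0 (dec (count word)))
    word))

(tally-results toy-stem [["ports" "port"] ["port" "port"] ["ported" "port"]])
;; => [3 2 1]
```

With the real data, the pairs could be built by combining the two line sequences with (map vector inputs expecteds), and stem would be passed in place of toy-stem.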
Tomorrow, I’ll talk about how I tracked down bugs that cropped up during testing.