Today, we’ll define some more utilities for the Stemmer.
Internal Functions
One facet of functions that these utilities will use is internal functions.
Remember that we can declare a function literal using fn
. Also, we can
declare variables using let
. Internal functions combine both of these to
create functions that are only visible and usable within a function. For
example, in the last posting we redefined count-item
to use cond
. We could
also rewrite it to use an internal function instead of loop
.
(defn count-item [sequence item] (let [ci (fn [sq accum] (cond (not (seq sq)) accum (= (peek sq) item) (recur (pop sq) (inc accum)) :else (recur (pop sq) accum)))] (ci sequence 0))) #'user/count-item
In this case, the loop is redefined as a recursive function. It is assigned to
the variable ci
, which is called in the body of the let
.
In this case, using an internal function instead of a loop
isn’t really a
win, but if you have several internal loops that interact, defining them as
internal functions can greatly clarify what the function does.
Stemmer Utilities
(m stemmer)
One of the utilities that will use internal functions is m
. It counts how
many consonant sequences are between the beginning of a word and the stemmer’s
index. It uses a function called count-v
, which skips letters while they are
still vowels; count-c
, which skips letters while they are still consonants;
and count-cluster
, which walks over the vowel and consonant clusters in the
word, counting the consonants.
count-v
and count-c
both return vectors. The first item in the vector
indicates what the caller should do after the function returns.
:return
means that the function should just return immediately; :break
means that it
should continue processing. The second and the third items in the vectors are
the current consonant cluster count and the current index of letter that is
being considered.
(defn m "Measures the number of consonant sequences between the start of word and position j. If c is a consonant sequence and v a vowel sequence, and <...> indicates arbitrary presence, <c><v> -> 0 <c>vc<v> -> 1 <c>vcvc<v> -> 2 <c>vcvcvc<v> -> 3 ... " [stemmer] (let [ j (get-index stemmer) count-v (fn [n i] (cond (> i j) [:return n i] (vowel? stemmer i) [:break n i] :else (recur n (inc i)))) count-c (fn [n i] (cond (> i j) [:return n i] (consonant? stemmer i) [:break n i] :else (recur n (inc i)))) count-cluster (fn [n i] (let [[stage1 n1 i1] (count-c n i)] (if (= stage1 :return) n1 (let [[stage2 n2 i2] (count-v (inc n1) (inc i1))] (if (= stage2 :return) n2 (recur n2 (inc i2))))))) [stage n i] (count-v 0 0) ] (if (= stage :return) n (count-cluster n (inc i)))))
(ends? stemmer suffix)
ends?
tests whether the stemmer ends with a given suffix. If it does, it
moves the stemmer’s current :index
and returns the new stemmer. The
processor also needs to know whether the ending was actually found. To
accommodate this, ends?
returns a vector containing the new (or old) stemmer
and true
or false
.
(defn ends? "true if the word ends with s." [stemmer s] (let [word (subword stemmer), sv (vec s), j (- (count word) (count sv))] (if (and (pos? j) (= (subvec word j) sv)) [(assoc stemmer :index (dec j)) true] [stemmer false])))
(set-to stemmer new-ending)
set-to
sets the stemmer’s word to the prefix of the word (everything before
the stemmer’s index) and the new ending.
(defn set-to "This sets the last j+1 characters to x and readjusts the length of b." [stemmer new-end] (reset-index (into (subword stemmer) new-end)))
(r stemmer orig-stemmer suffix)
r
tests whether there are any consonant clusters in the stem. If so, the
ending is set to suffix
. Otherwise, the original stemmer is returned. This
is used in some of the steps to add a suffix only if the stem is long enough.
For example, you want to replace the ending of “restive” with nothing (to
produce “rest”); but you don’t want to strip the “-ive” off “five.”
(defn r "This is used further down." [stemmer orig-stemmer s] (if (pos? (m stemmer)) (set-to stemmer s) orig-stemmer))
Some of these are pretty messy. For instance, returning the vector of multiple
values may be fine for a collection of internal functions, but it creates a
complicated interface to the ends?
predicate. In the next posting, we’ll
look at how we can simplify ends?
.
7 comments:
Wonderful series, thank you! You're very much responsible for my getting to know Clojure. Already I think I could probably replace my mainstay Python for most tasks.
I think the Clojure community greatly needs more literature like this. I was rewriting some example ML code in Clojure the other day and it would have been made easier if I'd seen this part 7 -- specifically a little of the destructuring binding -- but there simply aren't any good, extant Beginner's Guides to Functional Programming in Clojure (or something). So thank you, again.
The one thing that could improve this series is something of a blueprint as to where the individual pieces fit in. These most recent functions, for instance, are a little unmotivated -- at least for someone who hasn't had any prior linguistic aspirations.
Hi James,
Glad this is helping. There's a lot more to come.
And you're right. I really didn't do a great job of explaining what those functions are for. Some of them are pretty random, even from a linguistic perspective. Hopefully, it will fall in place as I post more of the stemmer, but in the meantime, I'll try to do a better job of explaining the pieces as they're being posted.
Right now, I'm thinking of this as a first draft, and once the series is complete I'll clean it up, tighten it up, and fill in any gaps. Maybe I'll post it somewhere in a format like, say, Python's tutorial. We'll see.
Thanks,
Eric
Is there anything that can be done about the truncation of listings in the left half of the display?
Thanks,
..jim
Jim here, again.
I see that I can get a better listing by turning off css (whic is easy to do in Opera or Firefox).
Maybe this will be of help to someone else.
Thanks again for your nice tutorials,
..jim
Hi Jim,
Hmm. I'm using Chrome, and the listings wrap. It still isn't easy to read. Probably a bug in Chrome....
You can also get the code from http://code.google.com/p/word-clj/
Glad you're liking it, though. Let me know if you see any other typos or bugs or if anything could be clearer.
Eric
Hi Eric,
first and foremost, this is an exceptionally great tutorial - I've already read a lot about Clojure, but I just wasn't getting used to it; typing and understanding your code character by character finally broke the dam.
Regarding the "m" function, I think found a more concise way to solve it, but I'm unsure if it is an elegant and/or efficient solution:
https://gist.github.com/2627485
Hi Manuel,
I'm glad you've found this so useful. The syntax is a little out-of-date, but hopefully the concepts hold up better.
You're solution is a lot more concise, but it doesn't actually follow the algorithm from the original paper (http://tartarus.org/~martin/PorterStemmer/def.txt). For one thing, the paper defines a consonant sequence as being one or more consonants, but you are counting sequences of more than one consonant. You also need to skip the initial consonant cluster, if it's there (for instance, don't count the initial tr in trouble).
Looking back (and with a lot more experience writing functional code), I'd implement this completely functionally, and only back off from that selectively if the performance is just atrocious.
Here's a different version of m written functionally. https://gist.github.com/2628865
Again, glad this helped!
Eric
Post a Comment