Today, we’ll define some more utilities for the Stemmer.
One feature that these utilities rely on is internal functions.
Remember that we can declare a function literal using
fn. Also, we can
declare variables using
let. Internal functions combine both of these to
create functions that are only visible and usable within a function. For
example, in the last posting we redefined
count-item to use
cond. We could
also rewrite it to use an internal function instead of a loop:
```clojure
(defn count-item
  [sequence item]
  (let [ci (fn [sq accum]
             (cond
               (not (seq sq)) accum
               (= (peek sq) item) (recur (pop sq) (inc accum))
               :else (recur (pop sq) accum)))]
    (ci sequence 0)))
;; => #'user/count-item
```
Here, the loop is redefined as a recursive function. It is bound to
ci, which is called in the body of the let.
In this case, using an internal function instead of a
loop isn’t really a
win, but if you have several internal loops that interact, defining them as
internal functions can greatly clarify what the function does.
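When internal functions need to refer to each other, Clojure's letfn form is an alternative to binding fn literals with let: every name declared in a letfn is visible inside all of the other function bodies, not just the later ones. As a minimal sketch (this variant is not from the original posting), here is count-item written that way:

```clojure
;; Same logic as count-item above, but the internal function is declared
;; with letfn instead of let + fn.
(defn count-item
  [sequence item]
  (letfn [(ci [sq accum]
            (cond
              (not (seq sq)) accum                              ; nothing left to check
              (= (peek sq) item) (recur (pop sq) (inc accum))   ; match: count it
              :else (recur (pop sq) accum)))]                   ; no match: move on
    (ci sequence 0)))

(count-item [1 2 1 3 1] 1)
;; => 3
```

For a single helper like this the two forms behave the same; letfn mainly pays off when the helpers call one another.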
One of the utilities that will use internal functions is m. It counts how
many consonant sequences are between the beginning of a word and the stemmer's
index. It uses three internal functions: count-v, which skips letters until it
reaches a vowel; count-c, which skips letters until it reaches a consonant; and
count-cluster, which walks over the vowel and consonant clusters in the
word, counting the consonant sequences.
count-v and count-c both return vectors. The first item in the vector
indicates what the caller should do after the function returns: :return
means that the calling function should just return immediately; :break means
that it should continue processing. The second and third items in the vectors
are the current consonant cluster count and the index of the current letter.
```clojure
(defn m
  "Measures the number of consonant sequences between the start of word
  and position j. If c is a consonant sequence and v a vowel sequence,
  and <...> indicates arbitrary presence,
    <c><v>       -> 0
    <c>vc<v>     -> 1
    <c>vcvc<v>   -> 2
    <c>vcvcvc<v> -> 3
    ...
  "
  [stemmer]
  (let [j (get-index stemmer)
        count-v (fn [n i]
                  (cond
                    (> i j) [:return n i]
                    (vowel? stemmer i) [:break n i]
                    :else (recur n (inc i))))
        count-c (fn [n i]
                  (cond
                    (> i j) [:return n i]
                    (consonant? stemmer i) [:break n i]
                    :else (recur n (inc i))))
        count-cluster (fn [n i]
                        (let [[stage1 n1 i1] (count-c n i)]
                          (if (= stage1 :return)
                            n1
                            (let [[stage2 n2 i2] (count-v (inc n1) (inc i1))]
                              (if (= stage2 :return)
                                n2
                                (recur n2 (inc i2)))))))
        [stage n i] (count-v 0 0)]
    (if (= stage :return)
      n
      (count-cluster n (inc i)))))
```
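To get a feel for the <c>vc<v> notation in the docstring, it can help to compute the same measure a different way. The sketch below is not how m works; it is a simplified cross-check that operates on a plain lowercase string, looks at the whole word rather than stopping at the stemmer's index, and treats only a, e, i, o, and u as vowels (so it ignores any special handling of y). The names vowels, skeleton, and measure are made up for this illustration.

```clojure
;; Simplifying assumption: only these five letters count as vowels.
(def vowels #{\a \e \i \o \u})

(defn skeleton
  "Collapses a lowercase word into its alternating pattern of
  consonant runs (c) and vowel runs (v), e.g. \"trouble\" -> \"cvcv\"."
  [word]
  (->> word
       (map #(if (vowels %) \v \c))   ; classify each letter
       (partition-by identity)        ; group runs of the same class
       (map first)                    ; keep one marker per run
       (apply str)))

(defn measure
  "Counts how many times a vowel run is followed by a consonant run."
  [word]
  (count (re-seq #"vc" (skeleton word))))

(map (juxt identity skeleton measure) ["tree" "trouble" "oaten"])
;; => (["tree" "cv" 0] ["trouble" "cvcv" 1] ["oaten" "vcvc" 2])
```

Because the skeleton strictly alternates between c and v, counting the "vc" pairs gives the number of consonant sequences that follow a vowel sequence, which is the quantity the docstring describes.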
(ends? stemmer suffix)
ends? tests whether the stemmer ends with a given suffix. If it does, it
moves the stemmer's current :index to the letter just before the suffix and
returns the new stemmer. The processor also needs to know whether the ending
was actually found. To communicate both, ends? returns a vector containing the
new (or old) stemmer and a flag that says whether the suffix was found.
```clojure
(defn ends?
  "true if the word ends with s."
  [stemmer s]
  (let [word (subword stemmer),
        sv (vec s),
        j (- (count word) (count sv))]
    (if (and (pos? j) (= (subvec word j) sv))
      [(assoc stemmer :index (dec j)) true]
      [stemmer false])))
```
(set-to stemmer new-ending)
set-to sets the stemmer’s word to the prefix of the word (everything before
the stemmer’s index) and the new ending.
```clojure
(defn set-to
  "This sets the last j+1 characters to x and readjusts the length of b."
  [stemmer new-end]
  (reset-index (into (subword stemmer) new-end)))
```
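Here is a rough sketch of how ends? and set-to fit together. The helpers from earlier postings are not shown in this post, so make-stemmer, subword, and reset-index below are assumed stand-ins: they treat a stemmer as a map with a :word vector of characters and an :index. The swap-ending function is also hypothetical and exists only to show the destructuring of the vector that ends? returns. The ends? and set-to definitions are repeated from above so that the snippet runs on its own.

```clojure
;; Assumed stand-ins for helpers defined in earlier postings.
(defn make-stemmer [word]
  {:word (vec word) :index (dec (count word))})

(defn subword [stemmer]
  (subvec (:word stemmer) 0 (inc (:index stemmer))))

(defn reset-index [word-vec]
  {:word word-vec :index (dec (count word-vec))})

;; Repeated from the post.
(defn ends? [stemmer s]
  (let [word (subword stemmer)
        sv (vec s)
        j (- (count word) (count sv))]
    (if (and (pos? j) (= (subvec word j) sv))
      [(assoc stemmer :index (dec j)) true]
      [stemmer false])))

(defn set-to [stemmer new-end]
  (reset-index (into (subword stemmer) new-end)))

;; Hypothetical helper: replace old-end with new-end if old-end is present.
(defn swap-ending [stemmer old-end new-end]
  (let [[found-stemmer found?] (ends? stemmer old-end)]
    (if found?
      (set-to found-stemmer new-end)
      stemmer)))

(apply str (:word (swap-ending (make-stemmer "relational") "ational" "ate")))
;; => "relate"
```

Every caller has to unpack the vector and check the flag before it can use the stemmer, which is the awkwardness the return value introduces.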
(r stemmer orig-stemmer suffix)
r tests whether there are any consonant clusters in the stem. If so, the
ending is set to
suffix. Otherwise, the original stemmer is returned. This
is used in some of the steps to add a suffix only if the stem is long enough.
For example, you want to replace the ending of “restive” with nothing (to
produce “rest”); but you don’t want to strip the “-ive” off “five.”
```clojure
(defn r
  "This is used further down."
  [stemmer orig-stemmer s]
  (if (pos? (m stemmer))
    (set-to stemmer s)
    orig-stemmer))
```
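The "restive" versus "five" example can be tried out with a string-based sketch of the same decision. This is not the stemmer's code: measure is a simplified stand-in for m that only treats a, e, i, o, and u as vowels, and replace-suffix is a hypothetical function that folds the suffix test, the measure check, and the replacement into one step.

```clojure
(require '[clojure.string :as str])

;; Simplified stand-in for m: counts vowel runs followed by consonant runs.
(defn measure [stem]
  (count (re-seq #"[aeiou]+[^aeiou]+" stem)))

;; Hypothetical helper: swap suffix for new-end only when the remaining
;; stem contains at least one consonant sequence after a vowel sequence.
(defn replace-suffix [word suffix new-end]
  (if (str/ends-with? word suffix)
    (let [stem (subs word 0 (- (count word) (count suffix)))]
      (if (pos? (measure stem))
        (str stem new-end)
        word))
    word))

(replace-suffix "restive" "ive" "")
;; => "rest"

(replace-suffix "five" "ive" "")
;; => "five"
```

The stem "rest" has a measure of 1, so the suffix is removed, while the stem "f" has a measure of 0, so "five" is left alone.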
Some of these are pretty messy. For instance, returning the vector of multiple
values may be fine for a collection of internal functions, but it creates a
complicated interface to the
ends? predicate. In the next posting, we’ll
look at how we can simplify