Thursday, July 17, 2008

Stemming, Part 7: More Functions

Today, we’ll define some more utilities for the Stemmer.

Internal Functions

One facet of functions that these utilities will use is internal functions. Remember that we can declare a function literal using fn. Also, we can declare variables using let. Internal functions combine both of these to create functions that are only visible and usable within a function. For example, in the last posting we redefined count-item to use cond. We could also rewrite it to use an internal function instead of loop.

(defn count-item [sequence item]
  (let [ci (fn [sq accum]
             (cond (not (seq sq)) accum
                   (= (peek sq) item) (recur (pop sq) (inc accum))
                   :else (recur (pop sq) accum)))]
    (ci sequence 0)))
#'user/count-item

In this case, the loop is redefined as a recursive function. It is assigned to the variable ci, which is called in the body of the let.

In this case, using an internal function instead of a loop isn’t really a win, but if you have several internal loops that interact, defining them as internal functions can greatly clarify what the function does.

Stemmer Utilities

(m stemmer)

One of the utilities that will use internal functions is m. It counts how many consonant sequences are between the beginning of a word and the stemmer’s index. It uses a function called count-v, which skips letters while they are still vowels; count-c, which skips letters while they are still consonants; and count-cluster, which walks over the vowel and consonant clusters in the word, counting the consonants.

count-v and count-c both return vectors. The first item in the vector indicates what the caller should do after the function returns.
:return means that the function should just return immediately; :break means that it should continue processing. The second and the third items in the vectors are the current consonant cluster count and the current index of letter that is being considered.

(defn m
  "Measures the number of consonant sequences between
  the start of word and position j. If c is a consonant
  sequence and v a vowel sequence, and <...> indicates
  arbitrary presence,
    <c><v>       -> 0
    <c>vc<v>     -> 1
    <c>vcvc<v>   -> 2
    <c>vcvcvc<v> -> 3
    ...
  "
  [stemmer]
  (let [
        j (get-index stemmer)
        count-v (fn [n i]
                  (cond (> i j) [:return n i]
                        (vowel? stemmer i) [:break n i]
                        :else (recur n (inc i))))
        count-c (fn [n i]
                  (cond (> i j) [:return n i]
                        (consonant? stemmer i) [:break n i]
                        :else (recur n (inc i))))
        count-cluster (fn [n i]
                        (let [[stage1 n1 i1] (count-c n i)]
                          (if (= stage1 :return)
                            n1
                            (let [[stage2 n2 i2] (count-v (inc n1) (inc i1))]
                              (if (= stage2 :return)
                                n2
                                (recur n2 (inc i2)))))))
        [stage n i] (count-v 0 0)
        ]
    (if (= stage :return)
      n
      (count-cluster n (inc i)))))

(ends? stemmer suffix)

ends? tests whether the stemmer ends with a given suffix. If it does, it moves the stemmer’s current :index and returns the new stemmer. The processor also needs to know whether the ending was actually found. To accommodate this, ends? returns a vector containing the new (or old) stemmer and true or false.

(defn ends?
  "true if the word ends with s."
  [stemmer s]
  (let [word (subword stemmer), sv (vec s), j (- (count word) (count sv))]
    (if (and (pos? j) (= (subvec word j) sv))
      [(assoc stemmer :index (dec j)) true]
      [stemmer false])))

(set-to stemmer new-ending)

set-to sets the stemmer’s word to the prefix of the word (everything before the stemmer’s index) and the new ending.

(defn set-to
  "This sets the last j+1 characters to x and readjusts the length of b."
  [stemmer new-end]
  (reset-index (into (subword stemmer) new-end)))

(r stemmer orig-stemmer suffix)

r tests whether there are any consonant clusters in the stem. If so, the ending is set to suffix. Otherwise, the original stemmer is returned. This is used in some of the steps to add a suffix only if the stem is long enough. For example, you want to replace the ending of “restive” with nothing (to produce “rest”); but you don’t want to strip the “-ive” off “five.”

(defn r
  "This is used further down."
  [stemmer orig-stemmer s]
  (if (pos? (m stemmer))
    (set-to stemmer s)
    orig-stemmer))

Some of these are pretty messy. For instance, returning the vector of multiple values may be fine for a collection of internal functions, but it creates a complicated interface to the ends? predicate. In the next posting, we’ll look at how we can simplify ends?.

7 comments:

james said...

Wonderful series, thank you! You're very much responsible for my getting to know Clojure. Already I think I could probably replace my mainstay Python for most tasks.

I think the Clojure community greatly needs more literature like this. I was rewriting some example ML code in Clojure the other day and it would have been made easier if I'd seen this part 7 -- specifically a little of the destructuring binding -- but there simply aren't any good, extant Beginner's Guides to Functional Programming in Clojure (or something). So thank you, again.

The one thing that could improve this series is something of a blueprint as to where the individual pieces fit in. These most recent functions, for instance, are a little unmotivated -- at least for someone who hasn't had any prior linguistic aspirations.

Eric Rochester said...

Hi James,

Glad this is helping. There's a lot more to come.

And you're right. I really didn't do a great job of explaining what those functions are for. Some of them are pretty random, even from a linguistic perspective. Hopefully, it will fall in place as I post more of the stemmer, but in the meantime, I'll try to do a better job of explaining the pieces as they're being posted.

Right now, I'm thinking of this as a first draft, and once the series is complete I'll clean it up, tighten it up, and fill in any gaps. Maybe I'll post it somewhere in a format like, say, Python's tutorial. We'll see.

Thanks,
Eric

Anonymous said...

Is there anything that can be done about the truncation of listings in the left half of the display?

Thanks,
..jim

Anonymous said...

Jim here, again.

I see that I can get a better listing by turning off css (whic is easy to do in Opera or Firefox).

Maybe this will be of help to someone else.

Thanks again for your nice tutorials,
..jim

Eric Rochester said...

Hi Jim,

Hmm. I'm using Chrome, and the listings wrap. It still isn't easy to read. Probably a bug in Chrome....

You can also get the code from http://code.google.com/p/word-clj/

Glad you're liking it, though. Let me know if you see any other typos or bugs or if anything could be clearer.

Eric

Manuel Kiessling said...

Hi Eric,

first and foremost, this is an exceptionally great tutorial - I've already read a lot about Clojure, but I just wasn't getting used to it; typing and understanding your code character by character finally broke the dam.

Regarding the "m" function, I think found a more concise way to solve it, but I'm unsure if it is an elegant and/or efficient solution:

https://gist.github.com/2627485

Eric Rochester said...

Hi Manuel,

I'm glad you've found this so useful. The syntax is a little out-of-date, but hopefully the concepts hold up better.

You're solution is a lot more concise, but it doesn't actually follow the algorithm from the original paper (http://tartarus.org/~martin/PorterStemmer/def.txt). For one thing, the paper defines a consonant sequence as being one or more consonants, but you are counting sequences of more than one consonant. You also need to skip the initial consonant cluster, if it's there (for instance, don't count the initial tr in trouble).

Looking back (and with a lot more experience writing functional code), I'd implement this completely functionally, and only back off from that selectively if the performance is just atrocious.

Here's a different version of m written functionally. https://gist.github.com/2628865

Again, glad this helped!
Eric