Friday, July 11, 2008

Stemming, Part 4: Tracking the Stemmer’s Data

After the last several postings, we finally have seen enough of Clojure’s native data structures and the functions associated with them to define the data structure that the Porter Stemmer will use, as well as some of the functions that will operate on it.

The Stemmer Structure

Recall that the stemmer needs to track two pieces of data: the word, whose end it will need to manipulate, and an index into that word. A struct is an obvious way to keep those two pieces of data together.

For the word, a string is probably not the best option. Clojure strings are Java strings, which are immutable, so we would have to copy and change the whole string every time we wanted to remove a letter. Instead, a vector gives us two advantages:

  1. We can derive a changed vector without copying the whole thing every time, because the new vector shares structure with the old one; and
  2. It is still an immutable data structure, so we’re staying within Clojure’s functional framework, where things are easiest.
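
As a quick illustration of those two points, here is a small sketch that uses only core functions. The names word and shorter are made up for this example and are not part of porter.clj.

```clojure
;; Illustrative names only; not part of porter.clj.
(def word (vec "hopping"))   ; a vector of the word's characters
(def shorter (pop word))     ; a new vector without the last character

(count word)        ; 7, the original vector is unchanged
(count shorter)     ; 6
(apply str shorter) ; "hoppin", back to a string when we need one
```

Because pop hands back a new value instead of modifying word, we can keep passing the old vector around without worrying that something changed it behind our backs.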

So open up porter.clj and add these lines to the bottom. This defines the stemmer structure to have two fields, :word and :index.

;; :word = input word, stored as a vector of characters
;; :index = general offset into the word
(defstruct stemmer :word :index)

Creating A Stemmer Structure

Like other data structures, a stemmer structure is defined by the functions that operate on it.

The first function we’ll need is one to create a stemmer structure from a word. It converts the word to a vector and sets the index to the index of the last character (one less than the number of characters in the word).

(defn make-stemmer
  "This returns a stemmer structure for the given word."
  [word]
  (struct stemmer (vec word) (dec (count word))))

Notice the string between the function name and the list of parameters ([word]). This is a documentation string. You can use the doc function to retrieve this later:

porter=> (doc make-stemmer)
  This returns a stemmer structure for the given word.
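
To see what make-stemmer produces, here is a minimal sketch that repeats the definitions so it can be pasted into a fresh REPL. The name s is only for illustration.

```clojure
(defstruct stemmer :word :index)

(defn make-stemmer
  "This returns a stemmer structure for the given word."
  [word]
  (struct stemmer (vec word) (dec (count word))))

;; s is an illustrative name, not something porter.clj defines.
(def s (make-stemmer "running"))

(:word s)   ; [\r \u \n \n \i \n \g]
(:index s)  ; 6, the index of the final \g
```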

Resetting the Index

Occasionally, we’ll need to reset the index to the last character. Generally, we’ll only need to do this after making a change to the word vector, so this function takes a word vector and creates a new stemmer structure with the correct index value from it.

(defn reset-index
  "This returns a new stemmer with the :word vector and
  :index set to the last index."
  [word-vec]
  (struct stemmer word-vec (dec (count word-vec))))
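
A typical use might look like the sketch below: shorten a word vector, then wrap the result in a fresh stemmer. The word "cats" and the name trimmed are just examples.

```clojure
(defstruct stemmer :word :index)

(defn reset-index
  "This returns a new stemmer with the :word vector and
  :index set to the last index."
  [word-vec]
  (struct stemmer word-vec (dec (count word-vec))))

;; Drop the final \s from "cats" and rebuild the stemmer around
;; what is left. trimmed is an illustrative name.
(def trimmed (reset-index (pop (vec "cats"))))

(:word trimmed)   ; [\c \a \t]
(:index trimmed)  ; 2
```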

Retrieving the Index

We will also need to retrieve the index sometimes. Of course, there’s a chance that the index was not set and is nil or that it has gotten out of sync and points beyond the end of the word. get-index will check for both of these, and it will either return the index or the index of the last character.

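(defn get-index
  "This returns a valid value of j."
  [stemmer]
  ;; Later Clojure versions need a binding vector here:
  ;; (if-let [j (:index stemmer)] ...)
  (if-let j (:index stemmer)
    (min j (dec (count (:word stemmer))))
    (dec (count (:word stemmer)))))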
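
The three cases get-index handles can be checked with a short sketch. It is written with the binding-vector form of if-let that current Clojure versions require, and the sample word and index values are only examples.

```clojure
(defstruct stemmer :word :index)

(defn get-index
  "This returns a valid value of j."
  [stemmer]
  (if-let [j (:index stemmer)]
    (min j (dec (count (:word stemmer))))
    (dec (count (:word stemmer)))))

;; word is an illustrative name.
(def word (vec "cat"))

(get-index (struct stemmer word 1))    ; 1, already valid
(get-index (struct stemmer word 10))   ; 2, clamped to the last character
(get-index (struct stemmer word nil))  ; 2, nil falls back to the last character
```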

Retrieving the Word

A major role of the index is to mark a subsection of the word for later consideration. subword returns the part of the word up to and including the index.

(defn subword
  "This returns the subword in the stemmer from 0..j."
  [stemmer]
  (let [b (:word stemmer), j (inc (get-index stemmer))]
    (if (< j (count b))
      (subvec b 0 j)
      b)))

If the index points to the last character in the word, it just returns the original word vector. Otherwise, it returns the part of the word up to and including the index. subvec returns the elements up to, but not including, its end index, so the index has to be incremented before getting passed to subvec.
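
The off-by-one is easy to see side by side. This sketch repeats the helper definitions (with the binding-vector form of if-let) so it runs on its own; the word "hopping" and the index values are arbitrary.

```clojure
(defstruct stemmer :word :index)

(defn get-index
  [stemmer]
  (if-let [j (:index stemmer)]
    (min j (dec (count (:word stemmer))))
    (dec (count (:word stemmer)))))

(defn subword
  [stemmer]
  (let [b (:word stemmer), j (inc (get-index stemmer))]
    (if (< j (count b))
      (subvec b 0 j)
      b)))

;; word is an illustrative name.
(def word (vec "hopping"))

(subvec word 0 3)                  ; [\h \o \p], index 3 is excluded
(subword (struct stemmer word 3))  ; [\h \o \p \p], index 3 is included
(subword (struct stemmer word 6))  ; the whole word vector, unchanged
```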

Retrieving a Character

Sometimes we just want the single character that the index points to. index-char handles this.

(defn index-char
  "This returns the character in the word at the current index."
  [stemmer]
  (nth (:word stemmer) (get-index stemmer)))

Removing a Character

So far, we’ve just been messing with the index. The most common operation we’ll perform on the word itself is removing the last character. pop-word handles this by popping a letter off the stemmer’s word and creating a new structure that associates :word with that new, shorter word.

(defn pop-word
  "This returns the stemmer with one character popped from the end of the
  word."
  [stemmer]
  (assoc stemmer :word (pop (:word stemmer))))
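
One thing worth noticing is that pop-word leaves :index alone, so the stored index can end up pointing past the end of the shorter word. The sketch below shows get-index pulling it back into range. The names s and popped are illustrative, and if-let is written with the binding vector that current Clojure versions require.

```clojure
(defstruct stemmer :word :index)

(defn get-index
  [stemmer]
  (if-let [j (:index stemmer)]
    (min j (dec (count (:word stemmer))))
    (dec (count (:word stemmer)))))

(defn pop-word
  [stemmer]
  (assoc stemmer :word (pop (:word stemmer))))

;; s and popped are illustrative names.
(def s (struct stemmer (vec "cats") 3))
(def popped (pop-word s))

(:word popped)      ; [\c \a \t]
(:index popped)     ; still 3, now one past the end of the word
(get-index popped)  ; 2, clamped back into range
(:word s)           ; [\c \a \t \s], the original stemmer is unchanged
```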

For the next posting, I’ll review functions again in more detail.


Anonymous said...

Hi Eric, excellent series, thank you very much. Just one issue: in the current version (1.4/1.5), if-let requires a vector for its binding, therefore:
(if-let [j (:index stemmer)]

Eric Rochester said...


You're right. If you look in the comments for other posts in this series, I say that it's out of date. That's exactly what the problem is.