After the last several postings, we finally have seen enough of Clojure’s native data structures and the functions associated with them to define the data structure that the Porter Stemmer will use, as well as some of the functions that will operate on it.
The Stemmer Structure
Recall that the stemmer needs to track two pieces of data: the word (it will need to manipulate the end of it) and an index into that word. A
struct is an obvious way to keep those two pieces of data together.
For the word, a string is probably not the best option, because it is a Java string. We would have to copy-and-change every time we wanted to remove a letter. Instead, a vector gives us several advantages:
- We can make changes to the vector without having to copy it every time, because operations such as pop return a new vector that shares structure with the old one; and
- It is still an immutable data structure, so we’re staying within Clojure’s functional framework, where things are easiest.
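As a rough illustration of those two points, here is a small sketch you could try at the REPL. The names word and shorter are only for this example and are not part of porter.clj.

```clojure
;; A word as a persistent vector of characters.
(def word (vec "running"))

;; pop returns a new vector without the last element.
;; The original vector is left untouched.
(def shorter (pop word))

(count word)         ;; => 7
(count shorter)      ;; => 6

;; apply str turns a character vector back into a string.
(apply str shorter)  ;; => "runnin"
(apply str word)     ;; => "running"
```

The original word is still intact after the pop, which is what lets us keep treating the stemmer as an immutable value.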
So open up porter.clj and add these lines to the bottom. This defines the
stemmer structure to have two fields, :word and :index.
;; :word = input string
;; :index = general offset into string
(defstruct stemmer :word :index)
Creating A Stemmer Structure
Like other data structures, a
stemmer structure is defined by the functions
that operate on it.
The first function we’ll need is one to create a
stemmer structure from a
word. It converts the word to a vector and sets the index to the index of the
last character (one less than the number of characters in the word).
(defn make-stemmer
  "This returns a stemmer structure for the given word."
  [word]
  (struct stemmer (vec word) (dec (count word))))
Notice the string between the function name and the list of parameters
([word]). This is a documentation string. You can use the doc function
to retrieve this later:
porter=> (doc make-stemmer)
-------------------------
porter/make-stemmer
([word])
  This returns a stemmer structure for the given word.
nil
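To see what make-stemmer actually builds, here is a minimal sketch that repeats the definitions so it runs on its own. The name s and the sample word "cats" are just for illustration.

```clojure
(defstruct stemmer :word :index)

(defn make-stemmer
  "This returns a stemmer structure for the given word."
  [word]
  (struct stemmer (vec word) (dec (count word))))

;; Illustrative value, not part of porter.clj.
(def s (make-stemmer "cats"))

;; Keywords act as accessor functions on the struct.
(:word s)   ;; => [\c \a \t \s]
(:index s)  ;; => 3
```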
Resetting the Index
Occasionally, we’ll need to reset the index to the last character. Generally,
we’ll only need to do this after making a change to the word vector, so this
function takes a word vector and creates a new
stemmer structure with the
correct index value from it.
(defn reset-index
  "This returns a new stemmer with the :word vector and :index set to the last index."
  [word-vec]
  (struct stemmer word-vec (dec (count word-vec))))
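A quick sketch of how this might be used after shortening a word. The definitions are repeated so the snippet stands alone, and shortened and s are names made up for the example.

```clojure
(defstruct stemmer :word :index)

(defn reset-index
  "This returns a new stemmer with the :word vector and :index set to the last index."
  [word-vec]
  (struct stemmer word-vec (dec (count word-vec))))

;; Drop the last letter of "cats", then build a fresh stemmer from the result.
(def shortened (pop (vec "cats")))
(def s (reset-index shortened))

(:word s)   ;; => [\c \a \t]
(:index s)  ;; => 2
```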
Retrieving the Index
We will also need to retrieve the index sometimes. Of course, there’s a chance
that the index was not set and is
nil or that it has gotten out of sync and
points beyond the end of the word.
get-index will check for both of these,
and it will either return the index or the index of the last character.
(defn get-index
  "This returns a valid value of j."
  [stemmer]
  ;; if-let takes its binding in a vector
  (if-let [j (:index stemmer)]
    (min j (dec (count (:word stemmer))))
    (dec (count (:word stemmer)))))
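The three cases (a valid index, a nil index, and an index past the end) can be checked with a sketch like the one below. It repeats the definitions, writing if-let with a binding vector as current Clojure requires, and the sample values are made up for the example.

```clojure
(defstruct stemmer :word :index)

(defn get-index
  "This returns a valid value of j."
  [stemmer]
  (if-let [j (:index stemmer)]
    (min j (dec (count (:word stemmer))))
    (dec (count (:word stemmer)))))

;; A valid index is returned unchanged.
(get-index (struct stemmer (vec "cats") 1))    ;; => 1

;; A nil index falls back to the last character.
(get-index (struct stemmer (vec "cats") nil))  ;; => 3

;; An index past the end is clamped to the last character.
(get-index (struct stemmer (vec "cats") 10))   ;; => 3
```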
Retrieving the Word
A major role of the index is to mark a subsection of the word for later
use. subword returns the part of the word up to and including the index.
(defn subword
  "This returns the subword in the stemmer from 0..j."
  [stemmer]
  (let [b (:word stemmer),
        j (inc (get-index stemmer))]
    (if (< j (count b))
      (subvec b 0 j)
      b)))
If the index points to the last character in the word, it just returns the
original word vector. Otherwise, it returns the part of the word up to and
including the index. Because subvec excludes the element at its end index,
the index has to be incremented before being passed to subvec.
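Here is a small sketch of that behavior, with the definitions repeated so it runs by itself. The sample word and index values are only illustrative.

```clojure
(defstruct stemmer :word :index)

(defn get-index
  "This returns a valid value of j."
  [stemmer]
  (if-let [j (:index stemmer)]
    (min j (dec (count (:word stemmer))))
    (dec (count (:word stemmer)))))

(defn subword
  "This returns the subword in the stemmer from 0..j."
  [stemmer]
  (let [b (:word stemmer)
        j (inc (get-index stemmer))]
    (if (< j (count b))
      (subvec b 0 j)
      b)))

;; subvec's end index is exclusive: 0..2 yields two characters.
(subvec (vec "cats") 0 2)                    ;; => [\c \a]

;; With the index at 1, subword includes the character at index 1.
(subword (struct stemmer (vec "cats") 1))    ;; => [\c \a]

;; With the index on the last character, the whole word comes back.
(subword (struct stemmer (vec "cats") 3))    ;; => [\c \a \t \s]
```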
Retrieving a Character
Sometimes we just want the single character that the index points to.
index-char handles this.
(defn index-char
  "This returns the index-char character in the word."
  [stemmer]
  (nth (:word stemmer) (get-index stemmer)))
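For example, assuming the definitions so far (repeated here so the sketch is self-contained), index-char could be exercised like this. The sample inputs are made up.

```clojure
(defstruct stemmer :word :index)

(defn get-index
  "This returns a valid value of j."
  [stemmer]
  (if-let [j (:index stemmer)]
    (min j (dec (count (:word stemmer))))
    (dec (count (:word stemmer)))))

(defn index-char
  "This returns the index-char character in the word."
  [stemmer]
  (nth (:word stemmer) (get-index stemmer)))

;; The character at index 1 of "cats".
(index-char (struct stemmer (vec "cats") 1))    ;; => \a

;; With no index set, get-index falls back to the last character.
(index-char (struct stemmer (vec "cats") nil))  ;; => \s
```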
Removing a Character
So far, we’ve just been messing with the index. The most common operation
we’ll perform on the word itself is removing the last character. pop-word
handles this by popping a letter off the stemmer’s word and creating a new
structure that associates :word with that new, shorter word.
(defn pop-word
  "This returns the stemmer with one character popped from the end of the list."
  [stemmer]
  (assoc stemmer :word (pop (:word stemmer))))
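One thing worth noticing is that pop-word leaves :index alone, so the stored index can end up pointing past the end of the shortened word. The sketch below, which repeats the relevant definitions and uses made-up names s and popped, shows how get-index compensates.

```clojure
(defstruct stemmer :word :index)

(defn make-stemmer
  "This returns a stemmer structure for the given word."
  [word]
  (struct stemmer (vec word) (dec (count word))))

(defn get-index
  "This returns a valid value of j."
  [stemmer]
  (if-let [j (:index stemmer)]
    (min j (dec (count (:word stemmer))))
    (dec (count (:word stemmer)))))

(defn pop-word
  "This returns the stemmer with one character popped from the end of the list."
  [stemmer]
  (assoc stemmer :word (pop (:word stemmer))))

(def s (make-stemmer "cats"))
(def popped (pop-word s))

(:word popped)      ;; => [\c \a \t]
(:index popped)     ;; => 3, unchanged and now one past the last character
(get-index popped)  ;; => 2, clamped to the last valid index
(:word s)           ;; => [\c \a \t \s], the original stemmer is untouched
```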
For the next posting, I’ll review functions again in more detail.