After the last several postings, we finally have seen enough of Clojure’s native data structures and the functions associated with them to define the data structure that the Porter Stemmer will use, as well as some of the functions that will operate on it.
The Stemmer Structure
Recall that the stemmer needs to track two things: the word, whose end it will need to manipulate, and an index into that word. A struct is an obvious way to keep those two pieces of data together.
For the word, a string is probably not the best option, because it is a Java string. We would have to copy-and-change every time we wanted to remove a letter. Instead, a vector gives us several advantages:
- We can make changes to the vector without having to copy it every time; and
- It is still an immutable data structure, so we’re staying within Clojure’s functional framework, where things are easiest.
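To make those two points concrete, here is a minimal sketch at the REPL. The word "caresses" is only an illustrative input, not something taken from the stemmer code.

```clojure
;; A vector of characters built from a string.
(def word (vec "caresses"))

;; pop returns a new vector without the last element;
;; the original vector is left untouched.
(def shorter (pop word))

(println (apply str word))     ; prints caresses
(println (apply str shorter))  ; prints caresse
```

Both vectors remain usable afterwards, which is what lets us keep working in a functional style.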
So open up porter.clj and add these lines to the bottom. This defines the stemmer structure to have two fields, :word and :index.
;; :word = input string
;; :index = general offset into string
(defstruct stemmer :word :index)
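As a quick illustration of how a struct like this behaves, the sketch below creates one by hand and reads its fields back. The names s and s2 and the word "cat" are just for the example.

```clojure
(defstruct stemmer :word :index)

;; struct fills the fields positionally, in the order given to defstruct.
(def s (struct stemmer (vec "cat") 2))

(:word s)    ; => [\c \a \t]
(:index s)   ; => 2

;; assoc returns a new struct map; s itself is unchanged.
(def s2 (assoc s :index 1))

(:index s2)  ; => 1
(:index s)   ; => 2
```

A struct map can be read with keywords and updated with assoc just like an ordinary map, which is all the functions below rely on.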
Creating A Stemmer Structure
Like other data structures, a stemmer structure is defined by the functions that operate on it.
The first function we’ll need is one to create a stemmer structure from a word. It converts the word to a vector and sets the index to the index of the last character (one less than the number of characters in the word).
(defn make-stemmer
  "This returns a stemmer structure for the given word."
  [word]
  (struct stemmer (vec word) (dec (count word))))
Notice the string between the function name and the list of parameters ([word]). This is a documentation string. You can use the doc function to retrieve this later:
porter=> (doc make-stemmer)
-------------------------
porter/make-stemmer
([word])
  This returns a stemmer structure for the given word.
nil
Resetting the Index
Occasionally, we’ll need to reset the index to the last character. Generally,
we’ll only need to do this after making a change to the word vector, so this
function takes a word vector and creates a new stemmer structure with the correct index value from it.
(defn reset-index
  "This returns a new stemmer with the :word vector and :index set to the last index."
  [word-vec]
  (struct stemmer word-vec (dec (count word-vec))))
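Here is a small sketch of how make-stemmer and reset-index might be used together. The definitions are repeated (without docstrings) so the snippet runs on its own, and the word "ponies" is only an example input.

```clojure
(defstruct stemmer :word :index)

(defn make-stemmer [word]
  (struct stemmer (vec word) (dec (count word))))

(defn reset-index [word-vec]
  (struct stemmer word-vec (dec (count word-vec))))

(def s (make-stemmer "ponies"))
(:index s)               ; => 5

;; Drop the last letter, then rebuild the stemmer so :index matches.
(def s2 (reset-index (pop (:word s))))
(apply str (:word s2))   ; => "ponie"
(:index s2)              ; => 4
```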
Retrieving the Index
We will also need to retrieve the index sometimes. Of course, there’s a chance that the index was not set and is nil, or that it has gotten out of sync and points beyond the end of the word. get-index checks for both cases: it returns the stored index if it is valid, and the index of the last character otherwise.
(defn get-index
  "This returns a valid value of j."
  [stemmer]
  ;; if-let takes its binding in a vector in current Clojure versions.
  (if-let [j (:index stemmer)]
    (min j (dec (count (:word stemmer))))
    (dec (count (:word stemmer)))))
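The three cases get-index has to handle can be checked directly. This sketch repeats the definition so it is self-contained and writes if-let with the vector binding form that current Clojure versions expect; the word "hopping" and the index values are made up for illustration.

```clojure
(defstruct stemmer :word :index)

(defn get-index [stemmer]
  (if-let [j (:index stemmer)]
    (min j (dec (count (:word stemmer))))
    (dec (count (:word stemmer)))))

;; A valid index is returned as-is.
(get-index (struct stemmer (vec "hopping") 3))    ; => 3

;; A nil index falls back to the last character.
(get-index (struct stemmer (vec "hopping") nil))  ; => 6

;; An index past the end is clamped to the last character.
(get-index (struct stemmer (vec "hopping") 99))   ; => 6
```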
Retrieving the Word
A major role of the index is to mark a subsection of the word for later consideration. subword returns the part of the word up to and including the index.
(defn subword
  "This returns the subword in the stemmer from 0..j."
  [stemmer]
  (let [b (:word stemmer), j (inc (get-index stemmer))]
    (if (< j (count b))
      (subvec b 0 j)
      b)))
If the index points to the last character in the word, subword just returns the original word vector. Otherwise, it returns the part of the word up to and including the index. The end index that subvec takes is exclusive, so the index has to be incremented before it is passed to subvec.
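The off-by-one is easy to see with subvec on its own. This is just a sketch; the word "relate" and the index 3 are arbitrary.

```clojure
(def b (vec "relate"))

;; subvec takes a start index (inclusive) and an end index (exclusive).
(apply str (subvec b 0 3))           ; => "rel"  (indices 0, 1, 2)

;; To include the character at index j, pass (inc j) as the end.
(let [j 3]
  (apply str (subvec b 0 (inc j))))  ; => "rela"
```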
Retrieving a Character
Sometimes we just want the single character that the index points to. index-char handles this.
(defn index-char
  "This returns the character in the word at the index."
  [stemmer]
  (nth (:word stemmer) (get-index stemmer)))
Removing a Character
So far, we’ve just been messing with the index. The most common operation
we’ll perform on the word itself is removing the last character. pop-word handles this by popping a letter off the stemmer’s word and creating a new structure that associates :word with that new, shorter word.
(defn pop-word
  "This returns the stemmer with one character popped from the end of the list."
  [stemmer]
  (assoc stemmer :word (pop (:word stemmer))))
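One thing worth noticing is that pop-word leaves :index alone, so the stored index can end up pointing one past the end of the shortened word. The sketch below, which repeats the earlier definitions so it runs standalone and uses "cats" as an arbitrary input, shows get-index covering for that.

```clojure
(defstruct stemmer :word :index)

(defn make-stemmer [word]
  (struct stemmer (vec word) (dec (count word))))

(defn get-index [stemmer]
  (if-let [j (:index stemmer)]
    (min j (dec (count (:word stemmer))))
    (dec (count (:word stemmer)))))

(defn pop-word [stemmer]
  (assoc stemmer :word (pop (:word stemmer))))

(def s (make-stemmer "cats"))
(def s2 (pop-word s))

(apply str (:word s2))  ; => "cat"
(:index s2)             ; => 3, unchanged and now past the end
(get-index s2)          ; => 2, clamped to the last character
(apply str (:word s))   ; => "cats", the original stemmer is untouched
```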
For the next posting, I’ll review functions again in more detail.
2 comments:
Hi Eric, excellent series, thank you very much. Just one issue: in the current version (1.4/1.5), if-let requires a vector for its binding, therefore:
(if-let [j (:index stemmer)]
Hi,
You're right. If you look in the comments for other posts in this series, I say that it's out of date. That's exactly what the problem is.
Eric