Wednesday, July 9, 2008

Stemming, Part 2: Functional Programming

In the last posting, we looked at what the Porter Stemmer will need to do, and we glanced at functional programming. Now, we’ll consider what information the stemmer will need to keep processing a word, and we’ll start examining what data structures Clojure provides to keep track of that information.

What We Need to Keep Track Of

The primary data that the stemmer needs to track is the word itself. It strips characters off the end of the word, and occasionally it adds a character back on or changes the last character.

Also, the stemmer needs an index into the word. Sometimes, one function will analyze the word and mark a location. A later function may then change the word based upon that index.

Of course, to keep the functions from getting too complicated, the stemmer should package those two data—the word and the index—together.

Clojure Data Structures

So what does Clojure give us to manage this data?

I’ll list a few of Clojure’s data structures here. Also, since in a functional language a data type is defined by the functions that operate on it, we’ll look at them too. And as we go along, we’ll try everything out.


The most basic data type is almost any lisp is the list. It is a sequence of elements that is optimized for adding items onto and remove items from the beginning of the list.

List literals have parentheses around the items. Of course, this also looks like a function call, so we have to quote the list to indicate that we want it treated as a list:

user=> '(1 2 3)
(1 2 3)

Lists can also be constructed using list:

user=> (list 1 2 3)
(1 2 3)

Finally, you can convert a list or vector to a list using the seq function. (The thing in square brackets below is a vector. See the next section for details.)

user=> (seq [1 2 3]) 
(1 2 3)


Vectors are conceptually similar to lists, except that they make it easy to add items to or remove items from the end of the vector.

Vector literals also look like lists, except they use square brackets ([ and ]). Since these can’t be confused with function calls, you don’t need to quote them. You can also create a vector using the vector function. And finally, you can convert a list to a vector using vec:

user=> [1 2 3 4]
[1 2 3 4]
user=> (vector 1 2 3 4)
[1 2 3 4]
user=> (vec '(1 2 3 4))
[1 2 3 4]

You can also get part of a vector using subvec. It takes one or two indexes into the vector. If called with one index, it returns everything from that index to the end of the vector; if called with two, it returns everything from the starting index up to, but not including, the second. The index for the first item is always zero, because that’s the way computers do things. (I could explain, but trust me: you don’t want to know.)

user=> (subvec [1 2 3 4 5] 2)
[3 4 5]
user=> (subvec [1 2 3 4 5] 2 4)
[3 4]


Both lists and vectors (as well as some other things) are sequences. Sequence functions take any type of sequence and operate on it. Some of these functions always return lists; some of them return the same type that was passed in to it; some of the functions return information about the sequence or about what it contains. Here are some of the more important sequence functions:

count returns the number of items in the sequence.

user=> (count [1 2 3])

nth returns an element from the sequence. Remember that the index of the first item is zero.

user=> (nth [1 2 3] 1)

pop returns a sequence without one item. In lists, this is the first item; in vectors, it is the last.

user=> (pop '(1 2 3))
(2 3)
user=> (pop [1 2 3])
[1 2]

peek returns one item from the sequence. In lists, this is the first item; in vectors, it is the last.

user=> (peek '(1 2 3))
user=> (peek [1 2 3])

take returns the first n items in the sequence as a list.

user=> (take 2 [1 2 3 4 5])
(1 2)

conj returns the sequence with one item added to it. In lists, the item is added to the beginning of the list; in vectors, it is added to the end.

user=> (conj '(1 2 3) 4)
(4 1 2 3)
user=> (conj [1 2 3] 4)
[1 2 3 4]

into takes two sequences. It returns the results of calling conj on the first sequence with each item in the second sequence.

user=> (into '(1 2 3) [4 5 6])
(6 5 4 1 2 3)
user=> (into [1 2 3] '(4 5 6))
[1 2 3 4 5 6]

You can see that different types of sequences share functions; that the functions do conceptually similar things; but that what exactly happens for each function- and sequence-type-combination may differ.

Also, note that vectors make it easy to add and remove items from the end of the list. Because of this, are ideal for storing the word while the stemmer processes it.


A mapping is the same as a hash table in Perl or a dictionary in Python. Actually, just as Clojure has different types of sequences, it also has a couple of different types of mappings. Hash maps are the most common, and that is the one I’ll be referring to below.

Literals for hash maps use curly braces ({ and }), and they don’t require any punctuation between keys and values or between key/value pairs. You can include a comma, but that’s whitespace and is just ignored. Notice that when Clojure prints the mapping, it adds the commas, because they make it easier to read. We’ll use them too for that reason.

user=> {"a" 1 "the" 2 "but" 3}
{"a" 1, "but" 3, "the" 2}

hash-map also creates a mapping using a function.

user=> (hash-map "a" 1 "the" 2 "but" 3)
{"a" 1, "but" 3, "the" 2}

assoc associates a key with a value in the mapping.

user=> (assoc {"a" 1, "the" 2, "but" 3} "and" 4)
{"a" 1, "but" 3, "the" 2, "and" 4}

dissoc removes a key from the mapping.

user=> (dissoc {"a" 1, "the" 2, "but" 3} "but") 
{"a" 1, "the" 2}

count, like with sequences, returns how many key/value pairs are in the hash map.

user=> (count {"a" 1, "the" 2, "but" 3})       

For the next posting, I’ll introduce some more data types, and we’ll apply them to the stemmer.


Anonymous said...

in "Vectors are conceptually similar to lists, except that they make it easy to add items to or remove items from the beginning of the vector."

did you intend to say end (instead of beginning)"?


Anonymous said...

in "Vectors are conceptually similar to lists, except that they make it easy to add items to or remove items from the beginning of the vector.",

did you intend 'end' (instead of 'beginning')


Eric Rochester said...

I sure did. Thanks.


Emeka said...
This comment has been removed by the author.