Tuesday, July 8, 2008

Stemming, Part 1: Introduction

What Is Stemming?

One problem with the tokenizer we’ve developed is that it doesn’t recognize any relationship between the same word with different endings, between, say, cat and cats or between walk, walks, walked, and walking. One way to identify those relationships is by using stemming, a process that removes the suffixes from a word (-s, -ed, or -ing) and returns just the word’s base (cat or walk).

Porter Stemmer

There are various algorithms to do this automatically. One of the most common is the Porter Stemmer, developed by Martin Porter in 1980, and available online for a variety of programming languages. This web site also has some test data and its expected output, and we’ll test our implementation against that.

In this section of the series, we’ll build a stemmer using the original C version of the algorithm as a guide. Don’t worry, though: you don’t need to know C to follow along here.

Although stemming may seem a minor process, it’s actually more complex than the simple tokenization we implemented earlier. Essentially, it walks the word being stemmed through five steps:

  1. Get rid of plurals, -ed, and -ing, and turn -y to -i, so it will be recognized as a suffix in later steps;
  2. Collapse multiple suffixes, such as -ational, -ator, -iveness, and others, to a single suffix, such as -ate, -ate, and -ive, respectively;
  3. Collapse a different set of multiple suffixes or remove a small set of single suffixes;
  4. Remove a set of suffixes including -ance, -ic, and -ive; and
  5. Remove final -e and change -ll to -l in some circumstances.

This process is going to require more programming constructs and data structures than we’ve used so far. But then, what we’ve used up to now is pretty basic, and the new constructs will be necessary for anything but the most simple programs.

Functional Programming

I said earlier that I’m using the C implementation of the Porter Stemmer as a reference implementation. But C is a very different programming language than Clojure. C is imperative: Programs in C are composed of a series of statements that operate primarily by changing data.

However, as I’ve mentioned before, Clojure is a [functional language][functional]: Programs in Clojure are composed of functions that are put together and that transform data without changing it.

A Contrived Example

For example, a word processor written in C would make changes directly to the document. Each change would obliterate what had been there before it. The final draft of a paper in this program might be perfect, but the program would have no record of how it got to be that way.

A word processor written in Clojure would be different, though. Instead of obliterating parts of a document as it made changes, it would keep the old version and make the changes on a new copy of the document. The final document, in this case, is the sum of the original document plus all the changes that had been made to it.

All of Clojure’s data structures actually keep a reference to the old document and just keep a track of the changes that have been made to it.

(For the rest of this posting, and probably for the rest of this series, I’m going to talk about Clojure making copies of things and changing the copies. That’s not correct. Clojure actually keeps as much of the original uncopied as it can and passes around the original, plus changes. Of course, copy-and-change sounds inefficient—and it is—but what Clojure does is far more efficient. Still, generally, if you think of what Clojure does as copy-and-change, you’ll get the gist of it, and I think it’s less complicated to reason about. It’s certainly less cumbersome to write. )

Now, from the perspective of a user of the word processor, there might not be any difference between the C word processor and the Clojure one. But the Clojure program would work better in certain contexts. For example, imagine that you were editing a document in a collaborative environment, and someone else were editing it simultaneously. With the C word processor, if someone makes a change right after you did, their change would erase any record of yours.

However, with the Clojure, the word processor would be able to see that your friend’s change was to a different version of the document. It could then reconcile these changes.

In fact, Clojure has features for building exactly this kind of collaborative environment. We’ll talk about that later. First, let’s glance at some of the characteristics of functional programming, which we’ll explore them in more detail in coming posts.

Side-Effect-Free Programming

One of the primary rules of functional programming is to avoid side effects. A side effect is a change to a data structure or any other part of the computer’s environment.

To continue the word processor example, say you have a function that adds “The End” to the bottom of a document. An imperative version of the function would simply take the input document, add “The End,” and return the changed document. That modification is a side effect.

The functional version would create a new document that was exactly the same as the original, except it had “The End” at the bottom. It would return the new document, but leave the old one unchanged. That is side-effect free.

Now imagine we have another function that prints a hard copy of a document. Printing—either to the screen or to a printer—is a side effect, but it’s a desirable one. Avoiding side effects is an ideal, but we should be reasonable about it.

Immutable Data Structures

I’ve been talking about creating new data instead of changing them. In the C word processor, the document can be changed. In the Clojure program, it can’t be changed. Not please don’t, not you shouldn’t, not we really recommend not, but you can’t. If you want to change a Clojure data structure, you have to copy-and-change.

Fortunately Clojure makes this easy. Instead, Clojure’s built-in functions never modify a data structure. Instead, they return a modified copy of the data structures they operate on.

Thus, if you limit yourself to use Clojure’s native, immutable data structures, you will be avoiding side effects. In the code for this series, we’re going to stick with Clojure’s data structures every chance we have.

Caveat: Dysfunctional Java

Of course, Java’s data structures are anything but immutable. Any time you use them, you’re risking losing the benefits of functional programming. This isn’t critical now, but once we start working in a concurrent (i.e., collaborative) environment, using Java data structures invites problems.

I’m sure I’ll harp on this more later.

First-Order Functions

At its heart, functional programming is about using functions as data.

In Clojure, the word processing function to add “The End” to a document is a function that can be passed to another function. It can be wrapped in a function that takes a directory of documents and adds “The End” to all of them.

We’ve already seen how functions are data in their own right. Remember how we used map and filter to implement stop words.

The benefits to using functions in this way is that each function is usually small enough to see what it does easily, and programs are easier to think about, to test, and to understand.

Getting Started

Enough talk. Let’s code.

To keep things neat, put the stemmer in its own file and its own namespace: Create a new file in the same directory as word.clj, and name this file porter.clj. Now add this to the top of the file:

(in-ns 'porter)
(clojure/refer 'clojure)

That’s enough programming for one day, don’t you think?

For the next posting, we’ll take tour of some of the primary Clojure data structures, and we’ll consider what would be the best way to represent the stemmer’s work-in-progress, its working state.

No comments: