Tuesday, July 29, 2008

Stemming, Part 13: The Steps for Processing

We finally have all the pieces in place to put the Porter stemmer together. But it’s been so long that I’ve certainly forgotten what comes next, so let’s take a moment to remember where we are going with this.

Earlier I outlined the five steps that the stemmer performs, in order:

  1. Get rid of plurals, -ed, and -ing, and turn a final -y into -i, so that the ending will be recognized as a suffix in later steps;
  2. Collapse compound suffixes, such as -ational, -ator, and -iveness, to single suffixes (-ate, -ate, and -ive, respectively);
  3. Collapse a different set of multiple suffixes or remove a small set of single suffixes;
  4. Remove a set of suffixes including -ance, -ic, and -ive; and
  5. Remove final -e and change -ll to -l in some circumstances.
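
To make the shape of that pipeline concrete, here is a rough sketch in Python (not the language this series uses) of five steps applied one after another. Everything in it is an assumption for illustration: the names `replace_suffix`, `step1` through `step5`, `STEPS`, and `stem` are hypothetical, the rule tables hold only a handful of suffixes, and the conditions the real algorithm checks before applying a rule are left out entirely. It shows the order of the steps, not a working Porter stemmer.

```python
def replace_suffix(word, rules):
    """Rewrite the first matching suffix; rules are (suffix, replacement) pairs."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


# NOTE: toy rule tables. The real algorithm guards most rules with
# conditions on the remaining stem; this sketch applies them blindly.

def step1(word):
    # Plurals first, then -ed / -ing, then a final -y becomes -i.
    word = replace_suffix(word, [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")])
    word = replace_suffix(word, [("ed", ""), ("ing", "")])
    return replace_suffix(word, [("y", "i")])


def step2(word):
    # Collapse compound suffixes to single suffixes.
    return replace_suffix(word, [("ational", "ate"), ("ator", "ate"), ("iveness", "ive")])


def step3(word):
    # A different, smaller set of suffixes (only two shown here).
    return replace_suffix(word, [("ness", ""), ("ful", "")])


def step4(word):
    # Remove suffixes such as -ance, -ic, and -ive.
    return replace_suffix(word, [("ance", ""), ("ic", ""), ("ive", "")])


def step5(word):
    # Remove a final -e, then change -ll to -l.
    word = replace_suffix(word, [("e", "")])
    return replace_suffix(word, [("ll", "l")])


STEPS = [step1, step2, step3, step4, step5]


def stem(word):
    """Run the word through each step in order."""
    for step in STEPS:
        word = step(word)
    return word
```

With these toy rules, `stem("relational")` passes through step 2 as "relate" and leaves step 5 as "relat". Because the guards are missing, plenty of other words will come out wrong; the point is only that each step hands its result to the next.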

In the next posting, we’ll pick apart what needs to be done for step 1.


(Sorry this posting isn’t longer. I’m still taking a breath after the macro death march.)

2 comments:

Anonymous said...

(Sorry this posting isn’t longer. I’m still taking a breath after the macro death march.)

:-) I think your readers are even more out of breath.

Looking forward to how this continues...

Eric Rochester said...

LOL. You're probably right.

The next few postings should be a bit less meaty.