Wednesday, August 6, 2008

Stemming, Part 17: The Final Step

Today, we pick apart step five. This will involve removing the final -e and -l from words. But each of these cases will be handled in a different function.

Final E

Silent -e is one of the more, um, endearing quirks of English spelling. Invariably, whenever someone suggests spelling reform, it is the first letter on the chopping block. Stemming isn’t spelling reform, but this step does get rid of silent and final -e in all words with more than one internal consonant cluster.

(defn rule-e
  "This removes the final -e from a word if
    - there is more than one internal consonant cluster; or
    - there is exactly one final consonant cluster and
      it is not preceded by a CVC sequence.
  "
  [stemmer]
  (if (= (last (:word stemmer)) \e)
    (let [a (m stemmer)]
      (if (or (> a 1)
              (and (= a 1)
                   (not (cvc? stemmer (dec (:index stemmer))))))
        (pop-word stemmer)
        stemmer))
    stemmer))

Final L

Handling final -l just changes -ll to -l if there is more than one consonant cluster in the word. This cleans up any double-l’s that may be left around.

(defn rule-l
  "This changes -ll to -l if (> (m) 1)."
  [stemmer]
  (if (and (= (last (:word stemmer)) \l)
           (double-c? stemmer (dec (count (:word stemmer))))
           (> (m stemmer) 1))
    (pop-word stemmer)
    stemmer))

Step 5

Once again, the function for step five is pretty simple:

  1. It pulls the word from the input stemmer and creates a new one with an index pointing to the end of the word;
  2. It runs that through both of the functions listed above.

Again, notice that we used -> to make it easier to read.

(defn step-5
  "Removes a final -e and changes -ll to -l if (> (m) 1)."
  [stemmer]
  (-> stemmer :word reset-index rule-e rule-l))

Surprise

One quirk with this algorithm is that it sometimes removes letters from standard English words. For example, locate becomes locat. The output it produces isn’t standard, but it is correct. That is, all forms of locate collapse down to one stem: locat. So if you see a word that isn’t a word, don’t worry, it’s still correct. Remember, we’re identifying stems, not words.

Next, we’ll look at the stem function and at how to test a system like this.

2 comments:

Anonymous said...

The code in the section "Step 5" doesn't display correctly, at least on my computer. I see no other code display problems, so I'd say something is wrong.

Gavin Sinclair

Eric Rochester said...

Fixed, thanks.

The Frankenstein system I'm using to get syntax color highlighting struck again.