Tuesday, July 15, 2008

Stemming, Part 6: Stemmer Predicates

In the last few postings we’ve been looking at functions and how they’re used in Clojure. One of the fundamental kinds of functions is the predicate: a function that tests something and returns true or false. By convention, the names of these functions end in a ?. Clojure has a number of these, and for the Porter Stemmer, we’ll define a few.

Sets

Sets can act as predicates. As we saw when we were discussing stop words, sets are also functions that test for membership: calling a set with an item returns that item if it is a member and nil if it is not.
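
Here is a quick sketch of a set used as a predicate. The name stop-word? and the words in it are made up for illustration; they are not the stop list from the earlier posting.

```clojure
;; A set literal bound to a predicate-style name (illustrative only).
(def stop-word? #{"a" "an" "the"})

;; Calling the set returns the member itself (truthy) or nil (falsey).
(stop-word? "the")   ; => "the"
(stop-word? "stem")  ; => nil

;; Because the set is a function, it can go anywhere a predicate can.
(filter stop-word? ["the" "porter" "stemmer"])  ; => ("the")
(remove stop-word? ["the" "porter" "stemmer"])  ; => ("porter" "stemmer")
```

One caveat: since the set returns the member rather than true, a set that contains nil or false will not work as a predicate for those values.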

Built-Ins

Clojure defines a number of built-in predicates, and higher-order functions are often useful for creating other predicates.

(zero? num) Returns whether its argument is zero.

(pos? num) Returns whether its argument is a positive number.

(neg? num) Returns whether its argument is a negative number.

(complement fn) Returns a new function that returns the opposite of the predicate function passed into it. For example, (complement zero?) returns a predicate that tests whether its argument is not zero.

(cond test expression ...) A structure that acts as a series of nested if statements. Each test is followed by one expression. The tests are tried in order; for the first test that evaluates as true, its expression is evaluated and its value is returned by the cond expression. An optional final test, by convention :else, acts as a default if no previous test evaluated as true. If no default test is provided and no test matches, cond returns nil.
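
To get a feel for these, here is a short REPL-style sketch. The names nonzero? and sign are hypothetical, chosen only for this illustration.

```clojure
;; The numeric built-ins.
(zero? 0)   ; => true
(pos? -3)   ; => false
(neg? -3)   ; => true

;; complement builds a new predicate from an existing one.
(def nonzero? (complement zero?))
(nonzero? 5)  ; => true
(nonzero? 0)  ; => false

;; cond tries each test in order and returns the matching expression.
(defn sign [n]
  (cond (zero? n) 0
        (pos? n)  1
        :else     -1))
(sign -7)  ; => -1

;; With no matching test and no default, cond returns nil.
(cond (pos? -1) :positive)  ; => nil
```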

For example, in the last post, we defined count-item, which used two nested if expressions:

(defn count-item [sequence item]
  (loop [sq sequence, accum 0]
    (if (not (seq sq))
      accum
      (if (= (peek sq) item)
        (recur (pop sq) (inc accum))
        (recur (pop sq) accum)))))
#'user/count-item

This could be defined more simply using cond:

(defn count-item [sequence item]
  (loop [sq sequence, accum 0]
    (cond (not (seq sq)) accum
          (= (peek sq) item) (recur (pop sq) (inc accum))
          :else (recur (pop sq) accum))))
#'user/count-item

In the last post, we also defined member?. How would you define it using cond?
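
If you want to check your answer, here is one possible version. It assumes member? takes the same arguments as count-item (a sequence and an item) and walks the sequence with peek and pop in the same way; the definition from the last post may differ in its details.

```clojure
;; One possible cond-based member? (a sketch, not necessarily
;; the definition from the earlier posting).
(defn member? [sequence item]
  (loop [sq sequence]
    (cond (not (seq sq)) false         ; ran out of items
          (= (peek sq) item) true      ; found it
          :else (recur (pop sq)))))    ; keep looking

(member? [1 2 3] 2)  ; => true
(member? [1 2 3] 9)  ; => false
```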

Stemmer Predicates

With all that we’ve learned, we’re ready to define a number of predicates that we can use later in the Porter Stemmer.

vowel-letter? is a set of the standard vowel letters. This will only be used to define consonant?.

(def vowel-letter? #{\a \e \i \o \u})

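consonant? returns true if the character at the stemmer’s index is a consonant. Given a second argument, it tests the character at that position instead. The letter y counts as a consonant at the start of a word or after a vowel, and as a vowel after a consonant.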

(defn consonant?
  "Returns true if the ith character in a stemmer
  is a consonant. i defaults to :index."
  ([stemmer]
   (consonant? stemmer (get-index stemmer)))
  ([stemmer i]
   (let [c (nth (:word stemmer) i)]
     (cond (vowel-letter? c) false
           (= c \y) (if (zero? i)
                      true
                      (not (consonant? stemmer (dec i))))
           :else true))))
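
The handling of y is the only subtle part, so it may help to see it in isolation. The sketch below applies the same rule to a plain string; consonant-at? is a hypothetical stand-in that does not use the stemmer structure.

```clojure
(def vowel-letter? #{\a \e \i \o \u})

;; Hypothetical string-based version of the rule, for experimenting.
(defn consonant-at? [word i]
  (let [c (nth word i)]
    (cond (vowel-letter? c) false
          ;; y is a consonant at position 0 or after a vowel.
          (= c \y) (or (zero? i)
                       (not (consonant-at? word (dec i))))
          :else true)))

(consonant-at? "yes" 0)  ; => true  (y starts the word)
(consonant-at? "toy" 2)  ; => true  (y follows the vowel o)
(consonant-at? "by" 1)   ; => false (y follows the consonant b)

(map #(consonant-at? "syzygy" %) (range 6))
; => (true false true false true false)
```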

vowel? is the logical opposite of consonant?.

(def vowel? (complement consonant?))
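
The function returned by complement accepts the same arguments as the function it wraps, which is why vowel? can be called with one or two arguments just like consonant?. A small sketch with a hypothetical two-arity predicate, small?, shows the idea:

```clojure
;; small? is made up for this example; limit defaults to 10.
(defn small?
  ([n] (small? n 10))
  ([n limit] (< n limit)))

(def big? (complement small?))

(big? 50)      ; => true  (50 is not below the default limit of 10)
(big? 50 100)  ; => false (50 is below 100)
```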

vowel-in-stem? returns true if any of the characters from the start of the word up to and including the index is a vowel.

(defn vowel-in-stem?
  "true iff 0 ... j contains a vowel"
  [stemmer]
  (let [j (get-index stemmer)]
    (loop [i 0]
      (cond (> i j) false
            (consonant? stemmer i) (recur (inc i))
            :else true))))

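double-c? returns true if the character at the index (or at another given position) is a consonant and is the same as the character before it, that is, if it is the second letter of a double-consonant pair.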

(defn double-c?
  "returns true if this is a double consonant."
  ([stemmer]
   (double-c? stemmer (get-index stemmer)))
  ([stemmer j]
   (and (>= j 1)
        (= (nth (:word stemmer) j)
           (nth (:word stemmer) (dec j)))
        (consonant? stemmer j))))

cvc? returns true if the three characters ending at the index (or at another given position) form a CVC (consonant-vowel-consonant) sequence and the final consonant is not w, x, or y.

(defn cvc?
  "true if (i-2 i-1 i) has the form CVC and
  also if the second C is not w, x, or y.
  This is used when trying to restore an *e*
  at the end of a short word.
  E.g.,
    cav(e), lov(e), hop(e), crim(e)
    but snow, box, tray
  "
  ([stemmer]
   (cvc? stemmer (get-index stemmer)))
  ([stemmer i]
   (and (>= i 2)
        (consonant? stemmer (- i 2))
        (vowel? stemmer (dec i))
        (consonant? stemmer i)
        (not (#{\w \x \y} (nth (:word stemmer) i))))))
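
Two idioms in double-c? and cvc? are worth a closer look: a set literal called directly as a predicate, and and used as a guard so that nth never sees a bad index. A minimal sketch on plain values:

```clojure
;; The set literal is called like a function; not flips the result.
(not (#{\w \x \y} \p))  ; => true  (p is not w, x, or y)
(not (#{\w \x \y} \w))  ; => false

;; and stops at the first falsey value, so when the index check
;; fails, nth is never called with a negative position.
(and (>= 0 2) (nth "ab" (- 0 2)))  ; => false
(and (>= 2 2) (nth "abc" (- 2 2))) ; => \a
```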

Notice that we’ve established a pattern here: these all take one or two arguments. With one argument, they test against the :index character in the stemmer. With two arguments, they test against any character:

porter=> (consonant? (make-stemmer "secrets"))
true
porter=> (consonant? (make-stemmer "secrets") 4)
false
porter=> (nth "secrets" 4)
\e

Read over these and make sure you understand them. There’s nothing in them that we haven’t covered already. And if you have any questions, feel free to ask in the comments.

In the next posting, we’ll define some more utilities for the stemmer.

2 comments:

Anonymous said...

Well written and informative! The use of funcallable lists as predicates is a neat feature of clojure.

Is there a way to specify which predicate will be used to determine equality when a list is funcalled with a single argument?

I guess it's off to take an ever closer look at clojure. :)

drewc

Eric Rochester said...

Hi Drew,

Thanks!

My understanding is that "(= obj1 obj2)" is the same as "obj1.equals(obj2)" in Java. For Clojure types, though, all comparisons in Clojure are done by value, not by identity.

But no, there's no way to specify an equality operator to use in that case.

Eric