Monday, July 14, 2008

Stemming, Part 5: Functions and Recursion

So far in this Clojure tutorial/NLP tutorial, we’ve mainly been looking at Clojure’s functions, but the last posting actually included a good chunk of code for the Porter Stemmer. Today, we’ll review functions. Some of this we’ve seen before, but some of it will be new.

Functions

As with any functional language, functions are the building blocks of Clojure. I’m going to briefly summarize what we’ve learned about functions so far, and then we’ll explore one important topic—recursion—in more detail.

Defining Functions

Generally, functions are defined using defn, followed by the name of the function, a vector listing the parameters it takes, and the expressions in the body of the function. You can also include a documentation string between the function name and the parameter vector:

user=> (defn greetings 
         "Say hello."
         [name]
         (str "Hello, " name))
#'user/greetings
user=> (greetings 'Eric)
"Hello, Eric"

(The str function here just creates strings out of all its arguments and concatenates them together.)

Higher-Order Functions

Higher-order functions are functions that take other functions as values. These may use the other function in a calculation, or they may create a new function from the existing one. For example, map calls a function on each element in a sequence:

user=> (map greetings '(Eric Elsa))
("Hello, Eric" "Hello, Elsa")

complement, however, creates a new function that is equivalent to (not (original-function *args...*)):

user=> (def not-zero? (complement zero?))
#'user/not-zero?
user=> (not-zero? 0)
false
user=> (not-zero? 1)
true

Function Literals

Sometimes, particularly when you’re using a higher-order function, you may want to create a short function just for that one place, and giving it a name would clutter up your program. Or you may want to define a function inside another function, in a let expression, to using only within that function.

For either of these, use a function literal. A function literal looks like a regular function definition, except instead of defn, use fn; the name is optional; and generally it cannot have a documentation string. For example, suppose that greeting, which we defined above, doesn’t exist, and we want to greet a list of people. We could do this by creating a function literal and passing it to map.

user=> (map (fn [n] (str "Hello, " n)) '(Eric Elsa))                  
("Hello, Eric" "Hello, Elsa")

Or, we could temporarily define greet using let, and pass that to map.

user=> (let [greet (fn [n] (str "Hello, " n))]
         (map greet '(Eric Elsa)))
("Hello, Eric" "Hello, Elsa")

Overriding Functions

I’ve already mentioned that Clojure allows you to provide different versions of a function for different argument lists. Do this by grouping each set of parameter vector and expressions in its own list. The documentation string, if there is one, comes before all of the groups:

user=> (defn count-parameters
         "This returns the number of parameters
         passed to the function."
         ([] 0)
         ([a] 1)
         ([a b] 2)
         ([a b c] 3)
         ([a b c d] 4)
         ([a b c d e] 5))
#'user/count-parameters
user=> (count-parameters 'p0 'p1 'p2)
3
user=> (count-parameters)
0

This also works in function literals. This creates a shortened version of count-parameters called cp and calls it twice within a vector. The two calls are collected in a vector because let only returns one value, and we want to see the value of both calls to cp.

user=> (let [cp (fn ([] 0)
                    ([a] 1)
                    ([a b] 2)
                    ([a b c] 3))]
         [(cp 'p0 'p1) (cp)])
[2 0]

Recursion

Many problems can be broken into smaller versions of the same problem.

For example, you can test whether a list contains a particular item by looking at the first element of the list. At each place in the list, ask yourself: is the list empty? If it is, the item is not in the list. However, if the first element is what you’re looking for, good; if it’s not, strip the first element off the list and start over again.

This way of attacking problems—by calling the solution within itself—is called recursion. Recursive problems are fundamental to functional programming, so let’s look at it in more detail.

General Recursion

The main pitfall in creating a recursive problem is to make sure that it will end eventually. To do this, you need to make sure that you have an end condition. In the list-membership problem I described above, the end conditions are the empty list and the first element being the item you are looking for.

Let’s look at what this would look like in Clojure.

user=> (defn member? [sequence item]
         (if (not (seq sequence))    
           nil
           (if (= (peek sequence) item)
             sequence
             (member? (pop sequence) item))))
#'user/member?

In this, the first if tests whether the sequence is empty (that is, if seq cannot create a sequence out of it; if it is, it returns nil, which evaluates as false. The second if tests whether the first element in the sequence is what we’re looking for; if it is, this returns the sequence at that point. This is a common idiom in Clojure. Since the sequence has at least one item, it will evaluate as true, fulfilling its contract as a test, but it returns more information than that. This function is useful for more than just determining whether an item is in a sequence: it also finds the item and returns it. Finally, if the first element is not the item being sought, this *calls member? again with everything except the first item of the list*. This is the recursive part of the function. Let’s test it.

user=> (member? '(1 2 3 4) 2)
(2 3 4)
user=> (member? '(1 2 3 4) 5)
nil

Sometimes it’s helpful to map this out. In the diagram below, a call within another call is indented under it. A function call returning is indicated by a “=>” and a value at the same level of indentation as the original call. Here’s the sequence of calls in the first example above.

(member? '(1 2 3 4) 2)
    (member? '(2 3 4) 2)
    => '(2 3 4)
=> '(2 3 4)

The second example call to member? would graph out like this:

(member? '(1 2 3 4) 5)
    (member? '(2 3 4) 5)
        (member? '(3 4) 5)
            (member? '(4) 5)
                (member? '() 5)
                => nil
            => nil
        => nil
    => nil
=> nil

In this case, we’re just returning the results of the membership test back unchanged, but we could modify the results as they’re being returned. For example, if we wanted to count how many times as item occurs in a sequence, we could do this:

user=> (defn count-item [sequence item]
         (if (not (seq sequence))
           0
           (if (= (peek sequence) item)
             (inc (count-item (pop sequence) item))
             (count-item (pop sequence) item))))
#'user/count-item

In many ways, this is very similar to member?. It first tests whether the sequence is empty, and if it is, it returns zero. If the first element is the item, this calls itself (it recurses) and increments the result by one, to count the current item. Otherwise, it recurses, but it does not increment the result. Let’s see this in action.

user=> (count-item '(1 2 3 2 1 2 3 4 3 2 1) 2)
4
user=> (count-item '(1 2 3 2 1 2 3 4 3 2 1) 3)
3
user=> (count-item '(1 2 3 2 1 2 3 4 3 2 1) 5)
0

Here, the first example (in shortened form) graphs out like so:

(count-item '(1 2 3 2 1) 2)
    (count-item '(2 3 2 1) 2)
        (count-item '(3 2 1) 2)
            (count-item '(2 1) 2)
                (count-item '(1) 2)
                    (count-item '() 2)
                    => 0
                => 0
            => 1
        => 1
    => 2
=> 2

You see that the results gets incremented as the computer leaves each function call where 2 is the first element in the input list.

Tail-Recursive Functions

While these two examples of recursion are superficially similar, on a deeper level they are very different. In member?, once you’ve computed the result, you can return it immediately back to the function that originally called member?. If Clojure had some way to jump completely out of a function from any level, you could do this. On the other hand, when count-item recurses, it is not finished with its calculation yet. It has to wait for itself to return and possibly add one to the result. (Sentences like that remind me why recursion can be confusing. Don’t worry. Eventually your brain will get used to being twisted into a pretzel.)

There’s a term to describe functions like member? that are finished with their calculations when they recurse. It’s called tail-call recursion or tail-recursive. This is a very good quality. The computer can optimize these calls to make them very fast and very efficient.

But the Java Virtual Machine doesn’t recognize tail recursion on its own, so Clojure needs a little help to make these optimizations. You signal a tail-recursive function call by using the recur built-in instead of the function name when you recurse. Thus, we could re-write member? to be tail-recursive like this.

user=> (defn member? [sequence item]          
         (if (not (seq sequence))               
           nil                                    
           (if (= (peek sequence) item)           
             sequence                               
             (recur (pop sequence) item))))         
#'user/member?

See the recur in the last line? That’s all that has changed.

This should work exactly the same (and it does—try it), but for long lists, it should be much more efficient.

In fact, it’s often worth putting in a little extra work to make non-tail-recursive functions tail-recursive. There’s a straightforward transformation you can use to make almost any function tail recursive. Just add an extra parameter and use it to accumulate the results before you make the recursive function call. The first time you call the function, you need to pass the base value into the function for that parameter. For example, count-item with tail recursion would look like this:

user=> (defn count-item [sequence item accum]
         (if (not (seq sequence))              
           accum                                 
           (if (= (peek sequence) item)          
             (recur (pop sequence) item (inc accum))
             (recur (pop sequence) item accum))))   
#'user/count-item

The difference here is the accum parameter. When item equals the first element of the sequence, accum is incremented before the recursive functional call is made. When the function starts again on the shorter list, accum is incremented. Finally, when the end of the list is reached, accum is returned. It contains the counts accumulated as count-item walked down the sequence.

Of course, now we have to call count-item differently also:

user=> (count-item '(1 2 3 2 3 4 3 2 1) 2 0)
3
user=> (count-item '(1 2 3 2 3 4 3 2 1) 1 0)
2
user=> (count-item '(1 2 3 2 3 4 3 2 1) 5 0)
0

Whenever we call count-item, we have to include a superfluous zero that we really don’t care about. That seems messy and error-prone. How can we get rid of it?

Essentially, we want to hide this version of count-item and replace it with a new version that handles the extra zero for us. Then, we call the new version and forget about this one. There are several ways to actually do this:

  1. Have the public function named count-item and create a private version named something like count-item-;
  2. Use let to define the private function inside count-item; or
  3. Use loop.

loop? Yes, this is new. loop is a cross between a function call and let. It looks a lot like let because it allows you to define variables. But it also acts as a target for recur. How would count-item look with loop?

(defn count-item [sequence item]
  (loop [sq sequence, accum 0]
    (if (not (seq sq))
      accum
      (if (= (peek sq) item)
        (recur (pop sq) (inc accum))
        (recur (pop sq) accum)))))
#'user/count-item

Notice that the first line of loop looks a lot like the first line of let. Both declare and initialize a series of variables. Just to keep things clear, I’ve renamed sequence to sq within the loop. Also, item isn’t included in the list of variables that loop declares, since it doesn’t change. Finally, at the end of the loop are two recur statements. When they are evaluated, they cause the program to jump back to the loop statement, but this time, instead of the original values used to initialize the loop variables, the values in the recur call are used.

Now we can again call count-item without the extra parameter:

user=> (count-item '(1 2 3 2 3 4 3 2 1) 2)
3
user=> (count-item '(1 2 3 2 3 4 3 2 1) 1)
2
user=> (count-item '(1 2 3 2 3 4 3 2 1) 5)
0

Stemmer Utility

With recursion, we can define another utility to use on the stemmer structures. This function will take a predicate and a stemmer. For each step, it will test the stemmer with the predicate, and if true, it will pop one character from the word and recurse. If there are no letters left in the word or if the predicate evaluates to false, it will return the stemmer the way it is:

(defn pop-stemmer-on
  "This is an amalgam of a number of
  different functions: pop (it walks
  through the :word sequence using pop);
  drop-while (it drops items off while
  testing the sequence against drop-while);
  and maplist from Common Lisp (the
  predicate is tested against the entire
  current stemmer, not just the first
  element)."
  [predicate stemmer]
  (if (and (seq (:word stemmer)) (predicate stemmer))
    (recur predicate (pop-word stemmer))
    stemmer))

I’m ready for a break. Next time, We’ll look at some more of the functions that Clojure provides, and we’ll define a few predicates of our own.

3 comments:

RD said...

Wonderful tutorial Eric,
I'm having a great time reading it.

RD said...

Hi Eric,
I'm just wondering, in the pop-stemmer-on example, why did u usesd seq in the if condition instead of (not (empty? ...)) ...??

Eric Rochester said...

Hi rdsr.

I used seq for a few reasons.

First, it's the idiom used in the Clojure community. It probably is the idiom for a couple of reasons....

Second, it's shorter to type.

Third, if you look in clojure/core.clj, "empty?" is defined as "(not (seq ...))". So "(not (empty? ...))" would become "(not (not (seq ...)))". I'm just being more direct.



Glad you're liking this.

Eric