## Monday, July 14, 2008

### Stemming, Part 5: Functions and Recursion

So far in this Clojure tutorial/NLP tutorial, we’ve mainly been looking at Clojure’s functions, but the last posting actually included a good chunk of code for the Porter Stemmer. Today, we’ll review functions. Some of this we’ve seen before, but some of it will be new.

## Functions

As with any functional language, functions are the building blocks of Clojure. I’m going to briefly summarize what we’ve learned about functions so far, and then we’ll explore one important topic—recursion—in more detail.

Defining Functions

Generally, functions are defined using `defn`, followed by the name of the function, a vector listing the parameters it takes, and the expressions in the body of the function. You can also include a documentation string between the function name and the parameter vector:

```user=> (defn greetings
"Say hello."
[name]
(str "Hello, " name))
#'user/greetings
user=> (greetings 'Eric)
"Hello, Eric"
```

(The `str` function here just creates strings out of all its arguments and concatenates them together.)

Higher-Order Functions

Higher-order functions are functions that take other functions as values. These may use the other function in a calculation, or they may create a new function from the existing one. For example, `map` calls a function on each element in a sequence:

```user=> (map greetings '(Eric Elsa))
("Hello, Eric" "Hello, Elsa")
```

`complement`, however, creates a new function that is equivalent to ```(not (original-function *args...*))```:

```user=> (def not-zero? (complement zero?))
#'user/not-zero?
user=> (not-zero? 0)
false
user=> (not-zero? 1)
true
```

Function Literals

Sometimes, particularly when you’re using a higher-order function, you may want to create a short function just for that one place, and giving it a name would clutter up your program. Or you may want to define a function inside another function, in a `let` expression, to using only within that function.

For either of these, use a function literal. A function literal looks like a regular function definition, except instead of `defn`, use `fn`; the name is optional; and generally it cannot have a documentation string. For example, suppose that `greeting`, which we defined above, doesn’t exist, and we want to greet a list of people. We could do this by creating a function literal and passing it to `map`.

```user=> (map (fn [n] (str "Hello, " n)) '(Eric Elsa))
("Hello, Eric" "Hello, Elsa")
```

Or, we could temporarily define `greet` using `let`, and pass that to `map`.

```user=> (let [greet (fn [n] (str "Hello, " n))]
(map greet '(Eric Elsa)))
("Hello, Eric" "Hello, Elsa")
```

Overriding Functions

I’ve already mentioned that Clojure allows you to provide different versions of a function for different argument lists. Do this by grouping each set of parameter vector and expressions in its own list. The documentation string, if there is one, comes before all of the groups:

```user=> (defn count-parameters
"This returns the number of parameters
passed to the function."
([] 0)
([a] 1)
([a b] 2)
([a b c] 3)
([a b c d] 4)
([a b c d e] 5))
#'user/count-parameters
user=> (count-parameters 'p0 'p1 'p2)
3
user=> (count-parameters)
0
```

This also works in function literals. This creates a shortened version of `count-parameters` called `cp` and calls it twice within a vector. The two calls are collected in a vector because `let` only returns one value, and we want to see the value of both calls to `cp`.

```user=> (let [cp (fn ([] 0)
([a] 1)
([a b] 2)
([a b c] 3))]
[(cp 'p0 'p1) (cp)])
[2 0]
```

## Recursion

Many problems can be broken into smaller versions of the same problem.

For example, you can test whether a list contains a particular item by looking at the first element of the list. At each place in the list, ask yourself: is the list empty? If it is, the item is not in the list. However, if the first element is what you’re looking for, good; if it’s not, strip the first element off the list and start over again.

This way of attacking problems—by calling the solution within itself—is called recursion. Recursive problems are fundamental to functional programming, so let’s look at it in more detail.

General Recursion

The main pitfall in creating a recursive problem is to make sure that it will end eventually. To do this, you need to make sure that you have an end condition. In the list-membership problem I described above, the end conditions are the empty list and the first element being the item you are looking for.

Let’s look at what this would look like in Clojure.

```user=> (defn member? [sequence item]
(if (not (seq sequence))
nil
(if (= (peek sequence) item)
sequence
(member? (pop sequence) item))))
#'user/member?
```

In this, the first `if` tests whether the sequence is empty (that is, if `seq` cannot create a sequence out of it; if it is, it returns `nil`, which evaluates as `false`. The second `if` tests whether the first element in the sequence is what we’re looking for; if it is, this returns the sequence at that point. This is a common idiom in Clojure. Since the sequence has at least one item, it will evaluate as true, fulfilling its contract as a test, but it returns more information than that. This function is useful for more than just determining whether an item is in a sequence: it also finds the item and returns it. Finally, if the first element is not the item being sought, this *calls `member?` again with everything except the first item of the list*. This is the recursive part of the function. Let’s test it.

```user=> (member? '(1 2 3 4) 2)
(2 3 4)
user=> (member? '(1 2 3 4) 5)
nil
```

Sometimes it’s helpful to map this out. In the diagram below, a call within another call is indented under it. A function call returning is indicated by a “=>” and a value at the same level of indentation as the original call. Here’s the sequence of calls in the first example above.

```(member? '(1 2 3 4) 2)
(member? '(2 3 4) 2)
=> '(2 3 4)
=> '(2 3 4)
```

The second example call to `member?` would graph out like this:

```(member? '(1 2 3 4) 5)
(member? '(2 3 4) 5)
(member? '(3 4) 5)
(member? '(4) 5)
(member? '() 5)
=> nil
=> nil
=> nil
=> nil
=> nil
```

In this case, we’re just returning the results of the membership test back unchanged, but we could modify the results as they’re being returned. For example, if we wanted to count how many times as item occurs in a sequence, we could do this:

```user=> (defn count-item [sequence item]
(if (not (seq sequence))
0
(if (= (peek sequence) item)
(inc (count-item (pop sequence) item))
(count-item (pop sequence) item))))
#'user/count-item
```

In many ways, this is very similar to `member?`. It first tests whether the sequence is empty, and if it is, it returns zero. If the first element is the item, this calls itself (it recurses) and increments the result by one, to count the current item. Otherwise, it recurses, but it does not increment the result. Let’s see this in action.

```user=> (count-item '(1 2 3 2 1 2 3 4 3 2 1) 2)
4
user=> (count-item '(1 2 3 2 1 2 3 4 3 2 1) 3)
3
user=> (count-item '(1 2 3 2 1 2 3 4 3 2 1) 5)
0
```

Here, the first example (in shortened form) graphs out like so:

```(count-item '(1 2 3 2 1) 2)
(count-item '(2 3 2 1) 2)
(count-item '(3 2 1) 2)
(count-item '(2 1) 2)
(count-item '(1) 2)
(count-item '() 2)
=> 0
=> 0
=> 1
=> 1
=> 2
=> 2
```

You see that the results gets incremented as the computer leaves each function call where 2 is the first element in the input list.

Tail-Recursive Functions

While these two examples of recursion are superficially similar, on a deeper level they are very different. In `member?`, once you’ve computed the result, you can return it immediately back to the function that originally called `member?`. If Clojure had some way to jump completely out of a function from any level, you could do this. On the other hand, when `count-item` recurses, it is not finished with its calculation yet. It has to wait for itself to return and possibly add one to the result. (Sentences like that remind me why recursion can be confusing. Don’t worry. Eventually your brain will get used to being twisted into a pretzel.)

There’s a term to describe functions like `member?` that are finished with their calculations when they recurse. It’s called tail-call recursion or tail-recursive. This is a very good quality. The computer can optimize these calls to make them very fast and very efficient.

But the Java Virtual Machine doesn’t recognize tail recursion on its own, so Clojure needs a little help to make these optimizations. You signal a tail-recursive function call by using the `recur` built-in instead of the function name when you recurse. Thus, we could re-write `member?` to be tail-recursive like this.

```user=> (defn member? [sequence item]
(if (not (seq sequence))
nil
(if (= (peek sequence) item)
sequence
(recur (pop sequence) item))))
#'user/member?
```

See the `recur` in the last line? That’s all that has changed.

This should work exactly the same (and it does—try it), but for long lists, it should be much more efficient.

In fact, it’s often worth putting in a little extra work to make non-tail-recursive functions tail-recursive. There’s a straightforward transformation you can use to make almost any function tail recursive. Just add an extra parameter and use it to accumulate the results before you make the recursive function call. The first time you call the function, you need to pass the base value into the function for that parameter. For example, `count-item` with tail recursion would look like this:

```user=> (defn count-item [sequence item accum]
(if (not (seq sequence))
accum
(if (= (peek sequence) item)
(recur (pop sequence) item (inc accum))
(recur (pop sequence) item accum))))
#'user/count-item
```

The difference here is the `accum` parameter. When `item` equals the first element of the sequence, `accum` is incremented before the recursive functional call is made. When the function starts again on the shorter list, `accum` is incremented. Finally, when the end of the list is reached, `accum` is returned. It contains the counts accumulated as `count-item` walked down the sequence.

Of course, now we have to call `count-item` differently also:

```user=> (count-item '(1 2 3 2 3 4 3 2 1) 2 0)
3
user=> (count-item '(1 2 3 2 3 4 3 2 1) 1 0)
2
user=> (count-item '(1 2 3 2 3 4 3 2 1) 5 0)
0
```

Whenever we call `count-item`, we have to include a superfluous zero that we really don’t care about. That seems messy and error-prone. How can we get rid of it?

Essentially, we want to hide this version of `count-item` and replace it with a new version that handles the extra zero for us. Then, we call the new version and forget about this one. There are several ways to actually do this:

1. Have the public function named `count-item` and create a private version named something like `count-item-`;
2. Use `let` to define the private function inside `count-item`; or
3. Use `loop`.

`loop`? Yes, this is new. `loop` is a cross between a function call and `let`. It looks a lot like `let` because it allows you to define variables. But it also acts as a target for `recur`. How would `count-item` look with `loop`?

```(defn count-item [sequence item]
(loop [sq sequence, accum 0]
(if (not (seq sq))
accum
(if (= (peek sq) item)
(recur (pop sq) (inc accum))
(recur (pop sq) accum)))))
#'user/count-item
```

Notice that the first line of `loop` looks a lot like the first line of `let`. Both declare and initialize a series of variables. Just to keep things clear, I’ve renamed `sequence` to `sq` within the `loop`. Also, `item` isn’t included in the list of variables that `loop` declares, since it doesn’t change. Finally, at the end of the `loop` are two `recur` statements. When they are evaluated, they cause the program to jump back to the `loop` statement, but this time, instead of the original values used to initialize the `loop` variables, the values in the `recur` call are used.

Now we can again call `count-item` without the extra parameter:

```user=> (count-item '(1 2 3 2 3 4 3 2 1) 2)
3
user=> (count-item '(1 2 3 2 3 4 3 2 1) 1)
2
user=> (count-item '(1 2 3 2 3 4 3 2 1) 5)
0
```

## Stemmer Utility

With recursion, we can define another utility to use on the `stemmer` structures. This function will take a predicate and a stemmer. For each step, it will test the stemmer with the predicate, and if true, it will pop one character from the word and recurse. If there are no letters left in the word or if the predicate evaluates to false, it will return the stemmer the way it is:

```(defn pop-stemmer-on
"This is an amalgam of a number of
different functions: pop (it walks
through the :word sequence using pop);
drop-while (it drops items off while
testing the sequence against drop-while);
and maplist from Common Lisp (the
predicate is tested against the entire
current stemmer, not just the first
element)."
[predicate stemmer]
(if (and (seq (:word stemmer)) (predicate stemmer))
(recur predicate (pop-word stemmer))
stemmer))
```

I’m ready for a break. Next time, We’ll look at some more of the functions that Clojure provides, and we’ll define a few predicates of our own.

rdsr said...

Wonderful tutorial Eric,
I'm having a great time reading it.

rdsr said...

Hi Eric,
I'm just wondering, in the pop-stemmer-on example, why did u usesd seq in the if condition instead of (not (empty? ...)) ...??

Eric Rochester said...

Hi rdsr.

I used seq for a few reasons.

First, it's the idiom used in the Clojure community. It probably is the idiom for a couple of reasons....

Second, it's shorter to type.

Third, if you look in clojure/core.clj, "empty?" is defined as "(not (seq ...))". So "(not (empty? ...))" would become "(not (not (seq ...)))". I'm just being more direct.