Friday, June 27, 2008

Tokenization, Part 6: Function Overloading

In the last posting, we added the ability to filter a list of input tokens with a list of stop words, usually words that are so common that they aren’t interesting for many types of analysis. But because we don’t want to use a stop list every time, we didn’t add such filtering to the tokenize-str function. Today, we’ll do that.

Function Overloading

Clojure allows what can be called function overloading. Basically, you can have several different versions of the same function all assigned to the same variable name. Each version has to have a different number of arguments. (The number of arguments is sometimes referred to as the function’s arity, so each different version of the function has to have a different arity.) Since the overloaded functions have different argument lists, Clojure can tell which version to use when you call that function.

(Actually, calling this function overloading is misleading. That makes it sound as if we’re assigning multiple function objects to the same variable. The overloading really happens inside a function object, and a single function object is assigned to a variable.)
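
To make that concrete, here's a minimal sketch of one function with two arities. (area is a made-up example, not part of our tokenizer.)

(defn area
  ;; One argument: the area of a square.
  ([side]
   (* side side))
  ;; Two arguments: the area of a rectangle.
  ([width height]
   (* width height)))

user=> (area 3)
9
user=> (area 3 4)
12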

For reference, here is the current version of tokenize-str and an example of using it and filtering with a stop words list:

(defn tokenize-str [input-string]
  (map to-lower-case (re-seq token-regex input-string)))

(filter (complement stop-words) (tokenize-str "Input string here."))

To define tokenize-str as an overloaded function, wrap each set of arguments and the function body that belongs with it in parentheses:

(defn tokenize-str
  ([input-string]
   (map to-lower-case (re-seq token-regex input-string))))

Next, add another argument list and function body at the same parenthesis level as the one already there:

(defn tokenize-str
  ;; This is the first, original version of the function.
  ([input-string]
   (map to-lower-case (re-seq token-regex input-string)))
  ;; This is the second, two-argument version of the function.
  ([input-string stop-word?]
   (filter (complement stop-word?)
           (map to-lower-case (re-seq token-regex input-string)))))

There are a couple of things that are new here. Let’s break them down.

First, I’ve included a couple of comments. When Clojure reads a semicolon (";"), it ignores everything from there to the end of the line. Comments are there for you, the human reader, not for the computer.

Second, I’ve included a new version of the function that takes two arguments (arity 2, also written tokenize-str/2). The second argument is a function that returns true if a token is a stop word. filter then applies the complement of that predicate to the tokens produced by what is essentially the body of the original version of tokenize-str (tokenize-str/1), keeping only the tokens that are not stop words.
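
If complement is new to you, it takes a function and returns a new function that gives the opposite truth value. Here's a quick sketch at the REPL, using a small literal set as the predicate (Clojure sets can be called like functions; the words in the set are just for illustration):

user=> (#{"the" "in"} "the")
"the"
user=> ((complement #{"the" "in"}) "the")
false
user=> ((complement #{"the" "in"}) "cat")
true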

(Hint: Functional code can be difficult to read. Sometimes it’s easier to read from right to left, from the inside out: first look at (re-seq ...), then work leftward to (map ...), and finally to (filter ...).)
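
To see that in action, here's roughly what each layer produces for a short input. (The intermediate values are just illustrative; token-regex and to-lower-case come from the earlier postings, and I'm using a small literal set in place of a real stop list.)

(re-seq token-regex "The cat in the hat.")
;; => ("The" "cat" "in" "the" "hat")

(map to-lower-case (re-seq token-regex "The cat in the hat."))
;; => ("the" "cat" "in" "the" "hat")

(filter (complement #{"the" "in"})
        (map to-lower-case (re-seq token-regex "The cat in the hat.")))
;; => ("cat" "hat")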

So how do we use this? Call (load-file "word.clj") to load the most recent code, and let’s run the example from yesterday with this function.

user=> (word/tokenize-str "The cat in the hat.")
("the" "cat" "in" "the" "hat")
user=> (word/tokenize-str "The cat in the hat." word/stop-words)
("cat" "hat")

Recursive Functions

This works, but it’s a little messy. What if we want to change how we tokenize at the most basic level? We’ll need to change our code in two places, and that’s twice as many places to make an error.

It would be useful if we could call the first version of the function from inside the second version, and we can. Just call it the way we would from any other place in the code. It would look like this:

(defn tokenize-str
  ;; This is the first, original version of the function.
  ([input-string]
   (map to-lower-case (re-seq token-regex input-string)))
  ;; This is the second, two-argument version of the function.
  ([input-string stop-word?]
   (filter (complement stop-word?) (tokenize-str input-string))))

Now, if we want to modify the way we tokenize, we only need to do it once, in tokenize-str/1 (the original version of the function). tokenize-str/2 only implements stop-word filtering.

Of course, a function calling itself can be dangerous: if you’re not careful, it could end up calling itself forever. Whenever a function calls itself, we need to make sure there is a path through the function that will eventually be taken and that does not make another recursive call. In this case, the first version acts as such a path: if we call the second version, it calls the first version, which just returns a list of tokens.
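
As a tiny illustration (a made-up example, nothing to do with tokenizing), here's a sketch of a recursive function where the zero case is the path that stops the recursion:

(defn count-down [n]
  (if (zero? n)
    '(0)                             ; base case: no recursive call
    (cons n (count-down (dec n)))))  ; otherwise, call ourselves with a smaller n

user=> (count-down 3)
(3 2 1 0)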

Recursion can be a little mind-bending. Don’t worry about that at this point. We’ll see recursion again many times, and soon it will become an old friend.

A little strange, but an old friend nevertheless.

So far, we’ve only been tokenizing strings that we type in. Of course, in practice we’ll want to read data in from a file. Next time, I’ll talk about how to do that.

3 comments:

Brian Doyle said...

In the tokenize-str function you put a question mark at the end of the stop-word? parameter. Why a question mark? I took it out of my code and it still runs just fine so I'm assuming that's some sort of parameter convention. Thanks.

Eric Rochester said...

Hi Brian,

You're right -- the question mark isn't needed. It's a convention from Scheme to end the names of predicates (functions that return a boolean) with a "?", just to indicate that the function answers a question.

In this case, I used a "?" to indicate that the parameter should be a predicate, and to make the parameter look "natural" when it's called in the function.

I should probably explain that better.

HTH,
Eric

Eric Rochester said...

Clojure also follows this convention: the functions zero?, pos?, and neg? all come to mind, and I'm sure there are others I'm just not thinking of.
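
For instance, at the REPL:

user=> (zero? 0)
true
user=> (pos? -3)
false
user=> (neg? -3)
true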

Eric