Wednesday, June 25, 2008

Tokenization, Part 4: Organization

Improved Tokenization

As I’ve already said, the way we’re tokenizing is lacking. One major problem at this point is that "The" and "the" are returned as two separate tokens. To take care of that, we need to convert all tokens to lower-case as they’re processed.

That’s a two-step process: first, convert a token to lower case; second, apply that conversion to each token as it’s read in.

How do we do this?

Java Interop

In Clojure, strings are the same as Java strings: a Clojure string takes the same methods as a Java string, and like a Java string, a Clojure string is immutable: once you’ve created it, you can’t change it.

So now the question is how do we call Java methods?

In fact, it’s fairly simple: to call the length method on a string, use .length (a dot and the method name) as the function to call, and pass the string as the first argument to that function. Any arguments to the method itself follow the string:

user=> (def a-string "My name")
#'user/a-string
user=> (.length a-string)
7
user=> (.substring a-string 3)
"name"
user=> (.substring a-string 3 5)
"na"

This also works for calling static methods: just pass in the name of the class instead of a class instance as the first argument. For example, the first line below calls the static method Runtime.getRuntime():

user=> (def rt (.getRuntime Runtime))
#'user/rt
user=> (.freeMemory rt)
30462304

(Runtime gets automatically imported from java.lang. Later I’ll show you how to import other Java classes and how to create instances of them.)
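
As the comments at the end of this post point out, later Clojure releases call static methods with a Class/method form instead. Here is a minimal sketch of that form, assuming a release where it is available; the String, Integer, and Math calls are extra illustrations that are not part of word.clj:

```clojure
;; Static methods: class name, a slash, then the method name.
(def rt (Runtime/getRuntime))

;; Instance methods still use the dot form shown above.
(.freeMemory rt)            ; a byte count; the value varies from run to run

;; A few more static calls on classes from java.lang.
(Integer/parseInt "42")     ; => 42
(String/valueOf 42)         ; => "42"
(Math/max 3 7)              ; => 7
```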

Java strings have a toLowerCase method, which will do exactly what we want. The only problem is that Java methods aren’t the same as Clojure functions, so we’ll need to wrap the method call in a function. Open the word.clj file you’ve been working on and add this before tokenize-str:

(defn to-lower-case [token-string]
  (.toLowerCase token-string))

That creates a function called to-lower-case, which just wraps a method call to String.toLowerCase.
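
To make the wrapping idea concrete, here is a small sketch of two other ways to get the same result. The name lower-fn is made up for this illustration, and clojure.string is assumed to be available, which is only true of Clojure releases newer than the one this post was written against:

```clojure
(require '[clojure.string :as string])

;; The wrapper from above, repeated so this snippet stands alone.
(defn to-lower-case [token-string]
  (.toLowerCase token-string))

;; The same wrapper written as an anonymous function bound to a name.
;; (lower-fn is a hypothetical name, not part of word.clj.)
(def lower-fn (fn [s] (.toLowerCase s)))

(to-lower-case "The")        ; => "the"
(lower-fn "The")             ; => "the"
(string/lower-case "The")    ; => "the"
```

All three are ordinary Clojure functions, so any of them can be passed around as a value, which a bare method name cannot.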

Higher-Order Functions

Now we need to solve the second part of the problem: applying to-lower-case to every token as we create it.

Clojure, like all lisps, Python, Ruby, and many other computer languages, treats functions as objects in their own right. That means you can take a function, assign it to a variable (which is what defn does), and pass it around as an argument. Plus, Clojure has a number of functions, called higher-order functions, that take functions as arguments and do interesting things with them, either creating new functions based on the original function or applying that function to a set of data.

map is one of those functions. It takes another function and a sequence, and it applies the function to every item in the sequence. Finally, it returns a new sequence containing the results of applying the function to each item in the input sequence. For example, let’s apply to-lower-case to a sequence of strings (make sure you call (load-file "word.clj") so that to-lower-case is defined in the REPL):

user=> (map to-lower-case '("This" "IS" "my" "name"))
("this" "is" "my" "name")
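
map is not tied to strings or to our own functions. As a quick sketch, here it is with two built-ins (inc and count) and with a small helper, exclaim, which is a made-up function for this example only:

```clojure
;; inc adds one to a number; count returns the length of a string or collection.
(map inc '(1 2 3))                 ; => (2 3 4)
(map count '("This" "IS" "my"))    ; => (4 2 2)

;; Any function of one argument will do, including one we define ourselves.
(defn exclaim [s]
  (str s "!"))

(map exclaim '("my" "name"))       ; => ("my!" "name!")
```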

Now let’s combine map with what’s already in tokenize-str to create a new version of the function that converts all tokens to lower-case:

(defn tokenize-str [input-string]
  (map to-lower-case (re-seq token-regex input-string)))

Call (load-file "word.clj") again and test the new version of tokenize-str:

user=> (tokenize-str "This is a LIST OF TOKENS.")
("this" "is" "a" "list" "of" "tokens")
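
Here is a self-contained sketch of the pieces working together on the "The"/"the" problem we started with. The token-regex pattern below is only a stand-in; the real one comes from earlier in word.clj and may differ:

```clojure
;; ASSUMPTION: stand-in for the token-regex already defined in word.clj.
(def token-regex #"\w+")

(defn to-lower-case [token-string]
  (.toLowerCase token-string))

(defn tokenize-str [input-string]
  (map to-lower-case (re-seq token-regex input-string)))

(tokenize-str "The cat and the hat.")
;; => ("the" "cat" "and" "the" "hat")

;; Both spellings of "the" now collapse into a single distinct token.
(count (distinct (tokenize-str "The cat and the hat.")))
;; => 4
```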

There we go: all lower-case tokens.

Sequence Literals

In the first map example above, I included something that we haven’t seen before: sequence literals. Sequences, or lists, are a big part of any lisp, including Clojure.

Lists in Clojure are printed as they are in all lisps: as space-delimited items surrounded by parentheses. They look just like function calls, which is awkward. It means that if you just type in a list, Clojure will try to call it like a function:

user=> ("my" "name")
java.lang.ClassCastException: java.lang.String cannot be cast to clojure.lang.IFn

That’s a poor way of saying that you can’t treat a string like it’s a function: Clojure took the first item, "my", as the function to call, and a string can’t be called.

To type the list and have Clojure recognize it as a list, you have to quote it: put a single-quote character in front of the list:

user=> '("my" "name")
("my" "name")
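
The single quote is shorthand for the quote form, and you can also build a list by calling the list function. A short sketch of how the three relate:

```clojure
;; '(...) is shorthand for (quote (...)).
(quote ("my" "name"))                    ; => ("my" "name")

;; list is an ordinary function that builds a list from its arguments.
(list "my" "name")                       ; => ("my" "name")
(= '("my" "name") (list "my" "name"))    ; => true

;; Quoting stops evaluation of everything inside the list;
;; list evaluates its arguments first.
'(1 (+ 1 1))                             ; => (1 (+ 1 1))
(list 1 (+ 1 1))                         ; => (1 2)
```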

Organizing Your Code

So far we haven’t had any problem with variable names clashing with each other. But if we start using different libraries and files written by different people, we could easily run across several that use the same variable name for different things.

How do we get around that? We need some way to keep all those names separate.

Namespaces

Clojure uses namespaces to keep variable names from clashing. (These are different from Java’s packages, if you’re familiar with those.) By default, all of the built-in functions are in the clojure namespace. When you’re in the REPL, everything is in the user namespace. Remember what the REPL prompt looks like?

user=>

The user at the beginning of the prompt tells which namespace we’re currently working in. Clojure indicates which namespace a variable is in by printing the namespace and a forward slash before the variable name. For example, the user/tokenize-str that is printed after loading word.clj indicates that the last variable read in was tokenize-str in the user namespace.

Here we want to define everything we’re working on in a namespace called word. To do that, just add these lines to the top of word.clj:

(in-ns 'word)
(clojure/refer 'clojure)

The first line creates a new namespace, word, and uses that to contain all the variables defined in the rest of the file.

Immediately after the first line, we can’t reference any of the functions that Clojure provides by their short names. The second line fixes that: the (clojure/refer 'clojure) call makes everything in the clojure namespace available in the word namespace. Remember that clojure/refer references the refer variable in the clojure namespace. So even though we can’t access Clojure’s built-ins directly at this point, we can still reference them using their full (namespace plus name) names.

(As an aside, notice that in the second line, the second clojure is quoted. That’s because a symbol can either name a variable or be a symbol object in its own right. Without the quote, Clojure thinks that the symbol is a variable; with the quote, it reads the second clojure as a symbol, which is what refer wants.)
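
The difference between a symbol used as a variable and a quoted symbol is easy to see at the REPL. In this sketch, greeting is just an example name:

```clojure
(def greeting "hello")

greeting               ; => "hello"   (the symbol is looked up as a variable)
'greeting              ; => greeting  (the symbol itself, not its value)

(string? greeting)     ; => true
(symbol? 'greeting)    ; => true
```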

Now, quit Clojure, go back in, and re-load the file. (We want to start with a clean slate; otherwise, the old function definitions will still be hanging around to confuse us.)

user=> (load-file "word.clj")
#'word/tokenize-str

The first thing to notice is that the variable returned at the end has a new namespace prefix, word. Let’s try to use it:

user=> (tokenize-str "This is a String.")
java.lang.Exception: Unable to resolve symbol: tokenize-str in this context

Oops. No tokenize-str is found. We have to tell Clojure to look in the word namespace:

user=> (word/tokenize-str "This is a String.")
("this" "is" "a" "string")
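
If typing the word/ prefix every time gets tedious, refer can pull a name into the current namespace. The sketch below assumes a later Clojure release, where the built-ins live in clojure.core (as the first comment below notes) and the ns macro sets up a namespace in one step. The token-regex pattern is again a stand-in for the one in word.clj:

```clojure
;; ASSUMPTION: a later Clojure release; ns creates the namespace
;; and refers the built-ins from clojure.core automatically.
(ns word)

(def token-regex #"\w+")   ; stand-in for the pattern in word.clj

(defn tokenize-str [input-string]
  (map (fn [s] (.toLowerCase s)) (re-seq token-regex input-string)))

;; Switch back to the REPL's default namespace.
(in-ns 'user)

;; The fully qualified name always works.
(word/tokenize-str "This is a String.")
;; => ("this" "is" "a" "string")

;; refer makes the short name available in the current namespace too.
(clojure.core/refer 'word :only '[tokenize-str])
(tokenize-str "This is a String.")
;; => ("this" "is" "a" "string")
```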

Remember to check out the new code from this posting in the Google Code project for word-clj.

We’ve covered a lot—probably too much—for today. Tomorrow, I’ll tackle just one topic: we’ll add stop-list filtering to our tokenizing.

6 comments:

Kelvin Pompey said...

It seems there have been some changes in the syntax for namespaces. I had to make the following change for the code to work.

(clojure.core/refer 'clojure.core)

Gout Cure said...

Was a bit confused by how (.getRuntime Runtime) no longer works. As far as I can tell the new syntax for calling static functions is now (Runtime/.getRuntime), so you have to do (.freeMemory (Runtime/.getRuntime)). See here: http://clojure.org/java_interop

Rich Hickey has great ideas, but he may be pushing Clojure on us a bit too fast. That such major syntactical changes are being made indicates that Clojure should still be far from a 1.0 release.

Still, I'm glad I discovered it, and I do hope more such changes come, because while the concepts behind Clojure sound solid, the syntactic sugar needs to mature much further.

Jeff Schwab said...

(Runtime/getRuntime) ; no dot

Unknown said...

I totally understand how the 'map' function is introduced here.

But on a design level, wouldn't it be more efficient to call 'to-lower-case' on the input string before splitting it ?

In that case, just a quick note indicating that this design choice is made only to introduce the 'map' function would be interesting here.

In the opposite case, I would be very curious to know why calling 'to-lower-case' on a sequence using 'map' is not inefficient.

Otherwise, this tutorial is just great.
It is the best I read so far (out of 4 others).
You did a great job ! Thank you !

Eric Rochester said...

Good point about map.

The other factor is memory usage, though. If the input is very long, you may not want to duplicate the entire string in memory, as lower-casing it would require. Instead, you may want to let Clojure's lazy seqs help you to process each token and allow it to be GCed without loading everything into memory.

As usual, there are trade-offs, but these seem pretty interesting in this case.