Improved Tokenization
As I’ve already said, our tokenizer is still lacking. One major problem at this point is that "The" and "the" are returned as two separate tokens.
To take care of that, we need to convert all tokens to lower case as they’re processed.
That’s a two-step process: first, convert a single token to lower case; second, apply that conversion to each token as it’s read in.
How do we do this?
Java Interop
In Clojure, strings are Java strings. A Clojure string supports the same methods as a Java string, and it is just as immutable: once you’ve created it, you can’t change it.
So now the question is how do we call Java methods?
In fact, it’s fairly simple: to call the length method on a string, use .length (a dot and the method name) as the function to call, and pass the string as the first argument to that function. Any other arguments follow the string:
user=> (def a-string "My name")
#'user/a-string
user=> (.length a-string)
7
user=> (.substring a-string 3)
"name"
user=> (.substring a-string 3 5)
"na"
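As a further sketch of the same call form, here are a few other java.lang.String methods called the same way. The string is the one from the transcript above; the particular methods are my own picks for illustration. The last line shows the longer dot form that the .method shorthand stands for.

```clojure
(def a-string "My name")

;; No extra arguments: just the method and the string.
(.toUpperCase a-string)        ;; => "MY NAME"

;; Extra arguments follow the string.
(.indexOf a-string "name")     ;; => 3
(.startsWith a-string "My")    ;; => true

;; The same kind of call written with the general dot special form.
(. a-string length)            ;; => 7
```

None of these calls changes a-string; each one returns a new value.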
Static methods use a slightly different form: write the class name, a slash, and the method name, with no leading dot. For example, the first line below calls the static method Runtime.getRuntime(). (Early Clojure releases also accepted (.getRuntime Runtime), passing the class as the first argument, but current releases do not.)
user=> (def rt (Runtime/getRuntime))
#'user/rt
user=> (.freeMemory rt)
30462304
(Runtime is automatically imported from java.lang. Later I’ll show you how to import other Java classes and how to create instances of them.)
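In current Clojure releases, static members are reached with the ClassName/member form. The sketch below uses a few java.lang classes that need no import; the specific methods are chosen only for illustration.

```clojure
;; Static methods: ClassName/methodName, with no leading dot.
(Math/abs -3)               ;; => 3
(Integer/parseInt "42")     ;; => 42

;; Static fields use the same slash form, without parentheses.
Math/PI

;; A static call that returns an object, followed by an instance method
;; call on that object.
(def rt (Runtime/getRuntime))
(.availableProcessors rt)   ;; a positive integer that varies by machine
```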
Java strings have a toLowerCase method, which will do exactly what we want.
The only problem is that Java methods aren’t Clojure functions: a method can’t be passed around as a value the way a function can. So we’ll need to wrap the method call in a function. Open the word.clj file you’ve been working on and add this before tokenize-str:
(defn to-lower-case [token-string]
  (.toLowerCase token-string))
That creates a function called to-lower-case, which just wraps a method call to String.toLowerCase.
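Two optional variations on that wrapper, offered as a sketch rather than as part of word.clj. The first adds a ^String type hint, which lets the compiler resolve the method call without runtime reflection. The second uses lower-case from the clojure.string namespace that ships with later Clojure releases; the str alias is my own choice.

```clojure
(require '[clojure.string :as str])

;; Type-hinted version of the wrapper from the text.
(defn to-lower-case [^String token-string]
  (.toLowerCase token-string))

(to-lower-case "The")     ;; => "the"

;; The library function gives the same result for this input.
(str/lower-case "The")    ;; => "the"
```

Writing the wrapper by hand is still a useful exercise, since the same pattern works for any Java method.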
Higher-Order Functions
Now we need to solve the second part of the problem: applying to-lower-case to every token as we create it.
Clojure, like all lisps, Python, Ruby, and many other languages, treats functions as objects in their own right. That means you can take a function, assign it to a variable (which is what defn does), and pass it around as an argument. Clojure also has a number of functions, called higher-order functions, that take functions as arguments and do interesting things with them, either creating new functions based on the original function or applying that function to a set of data.
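As a small sketch of the "creating new functions" half of that description, Clojure’s built-in comp and partial each take functions and return a new one. The names lower-length, add-five, and lc are hypothetical, chosen only for illustration.

```clojure
(defn to-lower-case [token-string]
  (.toLowerCase token-string))

;; comp builds a function that applies to-lower-case first, then count.
(def lower-length (comp count to-lower-case))
(lower-length "Hello")   ;; => 5

;; partial builds a function with the first argument of + already supplied.
(def add-five (partial + 5))
(add-five 10)            ;; => 15

;; Functions are ordinary values: this binds a second name to the same function.
(def lc to-lower-case)
(lc "LOUD")              ;; => "loud"
```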
map is one of those functions. It takes another function and a sequence, applies the function to every item in the sequence, and returns a new sequence containing the results. For example, let’s apply to-lower-case to a sequence of strings (make sure you call (load-file "word.clj") so that to-lower-case is defined in the REPL):
user=> (map to-lower-case '("This" "IS" "my" "name"))
("this" "is" "my" "name")
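map isn’t tied to to-lower-case; any function of one argument will do. The sketch below reuses the list from the transcript (bound to the hypothetical name words) with a built-in function and with two anonymous functions that wrap Java method calls inline.

```clojure
;; The wrapper from the text.
(defn to-lower-case [token-string]
  (.toLowerCase token-string))

(def words '("This" "IS" "my" "name"))

(map to-lower-case words)                 ;; => ("this" "is" "my" "name")

;; A built-in function: count returns the length of each string.
(map count words)                         ;; => (4 2 2 4)

;; Anonymous functions, in short and long form, wrapping method calls.
(map #(.toUpperCase %) words)             ;; => ("THIS" "IS" "MY" "NAME")
(map (fn [w] (.substring w 0 1)) words)   ;; => ("T" "I" "m" "n")
```

In every case words itself is left untouched; map always returns a new sequence.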
Now let’s combine map with what’s already in tokenize-str to create a new version of the function that converts all tokens to lower case:
(defn tokenize-str [input-string]
  (map to-lower-case (re-seq token-regex input-string)))
Call (load-file "word.clj") again and test the new version of tokenize-str:
user=> (tokenize-str "This is a LIST OF TOKENS.")
("this" "is" "a" "list" "of" "tokens")
There we go: all lower-case tokens.
Sequence Literals
In the first map example above, I included something that we haven’t seen before: sequence literals. Sequences, or lists, are a big part of any lisp, including Clojure.
A list in Clojure is printed as it is in all lisps: a space-delimited series of items surrounded by parentheses. That looks just like a function call, which is awkward. It means that if you just type in a list, Clojure will try to call it like a function:
user=> ("my" "name")
java.lang.ClassCastException: java.lang.String cannot be cast to clojure.lang.IFn
That’s a roundabout way of saying that you can’t treat a string as if it’s a function.
To type the list and have Clojure recognize it as a list, you have to quote it by putting a single-quote character in front of it:
user=> '("my" "name")
("my" "name")
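The single quote is shorthand for the quote special form, and it is only one of several ways to get a sequence of items. The sketch below compares it with the list function and with a vector literal; all the variable names are hypothetical.

```clojure
(def quoted   '("my" "name"))
(def spelled  (quote ("my" "name")))   ; what the ' shorthand stands for
(def built    (list "my" "name"))      ; an ordinary function call
(def a-vector ["my" "name"])           ; vector literal, no quote needed

(= quoted spelled built)   ;; => true
(= quoted a-vector)        ;; => true, sequential collections compare by content
(map count a-vector)       ;; => (2 4)

;; Quoting also stops evaluation of everything inside the list;
;; the list function evaluates its arguments first.
'(1 (+ 1 2))               ;; => (1 (+ 1 2))
(list 1 (+ 1 2))           ;; => (1 3)
```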
Organizing Your Code
So far we haven’t had any problem with variable names clashing with each other. But if we start using different libraries and files written by different people, we could easily run across several that use the same variable name for different things.
How do we get around that? We need some way to keep all those names separate.
Namespaces
Clojure uses namespaces to keep variable names from clashing. (These are different from Java’s packages, if you’re familiar with those.) By default, all of the built-in functions are in the clojure.core namespace. (Very early Clojure releases called this namespace clojure.) When you’re in the REPL, everything is in the user namespace. Remember what the REPL prompt looks like?
user=>
The user at the beginning of the prompt tells us which namespace we’re currently working in. Clojure indicates which namespace a variable is in by printing the namespace and a forward slash before the variable name. For example, the user/tokenize-str that is printed after loading word.clj indicates that the last variable read in was tokenize-str in the user namespace.
Here we want to define everything we’re working on in a namespace called word. To do that, just add these lines to the top of word.clj:
(in-ns 'word)
(clojure.core/refer 'clojure.core)
The first line creates a new namespace, word, and uses it to contain all the variables defined in the rest of the file.
Immediately after the first line, we can’t reference any of the functions that Clojure provides by their short names. The second line fixes that: the (clojure.core/refer 'clojure.core) call makes everything in the clojure.core namespace available in the word namespace.
Remember that clojure.core/refer references the refer variable in the clojure.core namespace. So even though we can’t access Clojure’s built-ins directly at this point, we can still reference them using their full (namespace plus name) names.
(As an aside, notice that in the second line, the clojure.core argument is quoted. That’s because a symbol can either name a variable or be a symbol object in its own right. Without the quote, Clojure treats the symbol as a variable name; with the quote, it reads clojure.core as a symbol, which is what refer wants.)
Now, quit Clojure, go back in, and reload the file. (We want to start with a clean slate; otherwise, the old function definitions will still be hanging around to confuse us.)
user=> (load-file "word.clj")
#'word/tokenize-str
The first thing to notice is that the variable returned at the end has a new namespace prefix, word. Let’s try to use it:
user=> (tokenize-str "This is a String.")
java.lang.Exception: Unable to resolve symbol: tokenize-str in this context
Oops. No tokenize-str is found. We have to tell Clojure to look in the word namespace:
user=> (word/tokenize-str "This is a String.")
("this" "is" "a" "string")
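Typing the full prefix every time gets tedious, so here is a sketch of three ways to reach the function from the user namespace. It is self-contained rather than a copy of word.clj: it uses the ns macro, which creates the namespace and refers clojure.core in one step, and it assumes a simple #"\w+" pattern as a stand-in for the token-regex defined earlier in this series. The alias w is also my own choice.

```clojure
;; Hypothetical stand-in for word.clj.
(ns word)

;; Assumption: a simple pattern standing in for the real token-regex.
(def token-regex #"\w+")

(defn to-lower-case [token-string]
  (.toLowerCase token-string))

(defn tokenize-str [input-string]
  (map to-lower-case (re-seq token-regex input-string)))

;; Switch back to the user namespace, as at a fresh REPL prompt.
(in-ns 'user)

;; 1. The fully qualified name always works.
(word/tokenize-str "This is a String.")   ;; => ("this" "is" "a" "string")

;; 2. alias gives the namespace a shorter prefix.
(alias 'w 'word)
(w/tokenize-str "This is a String.")      ;; => ("this" "is" "a" "string")

;; 3. refer makes the public names in word available without a prefix.
(refer 'word)
(tokenize-str "This is a String.")        ;; => ("this" "is" "a" "string")
```

The third option brings back the risk of name clashes that namespaces exist to prevent, so the prefix or an alias is usually the safer choice once several libraries are involved.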
Remember to check out the new code from this posting in the Google Code project for word-clj.
We’ve covered a lot—probably too much—for today. Tomorrow, I’ll tackle just one topic: we’ll add stop-list filtering to our tokenizing.
6 comments:
It seems there has been some changes in the syntax for namespaces. I had to make the following change for the code to work.
(clojure.core/refer 'clojure.core)
Was a bit confused by how (.getRuntime Runtime) no longer works. As far as I can tell the new syntax for calling static functions is now (Runtime/.getRuntime), so you have to do (.freeMemory (Runtime/.getRuntime)). See here: http://clojure.org/java_interop
Rich Hickey has great ideas, but he may be pushing Clojure on us a bit too fast. That such major syntactical changes are being made indicate that Clojure should still be far from a 1.0 release.
Still, I'm glad I discovered it, and I do hope more such changes come, because while the concepts behind Clojure sound solid, the syntactic sugar needs to mature much further.
(Runtime/getRuntime) ; no dot
I totally understand how the 'map' function is introduced here.
But on a design level, wouldn't it be more efficient to call 'to-lower-case' on the input string before splitting it ?
In that case, just a quick note indicating that this design choice is made only to introduce the 'map' function would be interesting here.
In the opposite case, I would be very curious to know why calling 'to-lower-case' on a sequence using 'map' is not inefficient.
Otherwise, this tutorial is just great.
It is the best I read so far (out of 4 others).
You did a great job ! Thank you !
Good point about map.
The other factor is memory usage, though. If the input is very long, you may not want to duplicate the entire string in memory, as lower-casing it would require. Instead, you may want to let Clojure's lazy seqs help you to process each token and allow it to be GCed without loading everything into memory.
As usual, there are trade-offs, but these seem pretty interesting in this case.