Improved Tokenization
As I’ve already said, the way we’re tokenizing is lacking. One major problem
at this point is that "The" and "the" are returned as two separate tokens.
To take care of that, we need to convert all tokens to lower-case as they’re
processed.
That’s a two-step process: first, convert a token to lower case; second, apply
that conversion to each token as it’s read in.
How do we do this?
Java Interop
In Clojure, strings are the same as Java strings: a Clojure string supports
the same methods as a Java string, and like a Java string, a Clojure
string is immutable: once you’ve created it, you can’t change it.
So now the question is how do we call Java methods?
In fact, it’s fairly simple: to call the length method on a string, use
.length (a dot and the method name) as the function to call, and pass the
string as the first argument to that function. Other arguments are passed in
after the string:
user=> (def a-string "My name")
#'user/a-string
user=> (.length a-string)
7
user=> (.substring a-string 3)
"name"
user=> (.substring a-string 3 5)
"na"
This also works for calling static methods: just pass in the name of the class
instead of a class instance as the first argument. For example, the first line
below calls the static method Runtime.getRuntime():
user=> (def rt (.getRuntime Runtime))
#'user/rt
user=> (.freeMemory rt)
30462304
(Runtime gets automatically imported from java.lang. Later I’ll show you
how to import other Java classes and how to create instances of them.)
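In more recent Clojure releases, static methods are usually written with the
class name, a slash, and the method name. The lines below are a minimal sketch
assuming such a release; the Math/abs call is not from this post and is only
there to show a static method that takes an argument.

```clojure
;; Assumes a current Clojure release, where static calls are written
;; ClassName/methodName.
(def rt (Runtime/getRuntime))   ; static method, no arguments

(.freeMemory rt)                ; instance method; the number varies per run

(Math/abs -7)                   ; static method with one argument
```

The instance call on rt is unchanged; only the way the static method is
spelled differs.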
Java strings have a toLowerCase method, which will do exactly what we want.
The only problem is that Java methods aren’t the same as Clojure functions, so
we’ll need to wrap the method call in a function. Open the word.clj file
you’ve been working on and add this before tokenize-str:
(defn to-lower-case [token-string]
  (.toLowerCase token-string))
That creates a function called to-lower-case, which just wraps a method call
to String.toLowerCase.
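As a quick check, here is a small sketch of the wrapper in use. The definition
is repeated so the snippet stands alone, and the name original is made up for
this example. It also shows the immutability mentioned earlier: the call
returns a new string and leaves its argument alone.

```clojure
;; Same definition as in word.clj, repeated so the snippet is self-contained.
(defn to-lower-case [token-string]
  (.toLowerCase token-string))

;; `original` is a hypothetical name used only for this sketch.
(def original "The")

(to-lower-case original)          ; => "the"
(= (to-lower-case "The") "the")   ; => true, so "The" and "the" now match
original                          ; => "The", the input string is unchanged
```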
Higher-Order Functions
Now we need to solve the second part of the problem: applying to-lower-case
to every token as we create it.
Clojure, like all lisps, Python, Ruby, and many other computer languages,
treats functions as objects in their own right. That means you can take a
function, assign it to a variable (which is what defn does), and pass it
around as an argument. Plus, Clojure has a number of functions—called
higher-order functions—that take functions as arguments and do
interesting things with them, either creating new functions based on the
original function or applying that function to a set of data.
map is one of those functions. It takes another function and a sequence, and
it applies the function to every item in the sequence. Finally, it returns a
new sequence containing the results of applying the function to each item in
the input sequence. For example, let’s apply to-lower-case to a sequence of
strings (make sure you call (load-file "word.clj") so that to-lower-case
is defined in the REPL):
user=> (map to-lower-case '("This" "IS" "my" "name"))
("this" "is" "my" "name")
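To make the "functions as objects" idea a little more concrete, here is a
short sketch of the three things mentioned above: binding a function to a
name, passing one as an argument, and building a new function out of existing
ones. The names lc and lower-then-count are invented for this example, and
comp is just one of the built-in higher-order functions that would do.

```clojure
(defn to-lower-case [token-string]
  (.toLowerCase token-string))

;; A function is a value, so it can be bound to another name.
;; `lc` is a hypothetical name.
(def lc to-lower-case)
(lc "NAME")                                            ; => "name"

;; A function can be passed as an argument; here an anonymous function
;; that calls the Java length method is handed to map.
(map (fn [s] (.length s)) '("This" "IS" "my" "name"))  ; => (4 2 2 4)

;; comp builds a new function that applies to-lower-case, then count.
;; `lower-then-count` is a hypothetical name.
(def lower-then-count (comp count to-lower-case))
(lower-then-count "NAME")                              ; => 4
```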
Now let’s combine map with what’s already in tokenize-str to create a new
version of the function that converts all tokens to lower-case:
(defn tokenize-str [input-string]
  (map to-lower-case (re-seq token-regex input-string)))
Call (load-file "word.clj") again and test the new version of tokenize-str:
user=> (tokenize-str "This is a LIST OF TOKENS.")
("this" "is" "a" "list" "of" "tokens")
There we go: all lower-case tokens.
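If you want to try this without the rest of word.clj, the sketch below puts
the pieces in one place. The token-regex here is an assumption: a simple
stand-in that matches runs of word characters, not necessarily the pattern
defined earlier in this series.

```clojure
;; ASSUMPTION: a stand-in pattern that matches runs of word characters.
;; The real token-regex in word.clj may differ.
(def token-regex #"\w+")

(defn to-lower-case [token-string]
  (.toLowerCase token-string))

(defn tokenize-str [input-string]
  ;; re-seq returns every match of the pattern; map lower-cases each one.
  (map to-lower-case (re-seq token-regex input-string)))

(tokenize-str "This is a LIST OF TOKENS.")
;; => ("this" "is" "a" "list" "of" "tokens")

(tokenize-str "The the")
;; => ("the" "the")
```

With the lower-casing in place, "The" and "the" come back as the same token.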
Sequence Literals
In the first map example above, I included something that we haven’t seen
before: sequence literals. Sequences, or lists, are a big part of any lisp,
including Clojure.
A list in Clojure is written the way it is in all lisps: a space-delimited
list of items surrounded by parentheses. It looks just like a function call,
which is awkward. It means that if you just type in a list, Clojure will try
to call it like a function:
user=> ("my" "name")
java.lang.ClassCastException: java.lang.String cannot be cast to clojure.lang.IFn
That’s a poor way of saying that you can’t treat a string like it’s a function.
To type the list and have Clojure recognize it as a list, you have to quote
it: put a single-quote character in front of the list:
user=> '("my" "name")
("my" "name")
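For comparison, here is a short sketch of a few equivalent ways to get the
same list. The quote form and the list function are standard Clojure, but
they aren’t used elsewhere in this post; they’re shown only to make clear
what the single-quote character is doing.

```clojure
'("my" "name")           ; => ("my" "name")

;; The single-quote character is shorthand for the quote form.
(quote ("my" "name"))    ; => ("my" "name")

;; The list function builds the same list from its arguments.
(list "my" "name")       ; => ("my" "name")

(= '("my" "name") (list "my" "name"))   ; => true
(first '("my" "name"))                  ; => "my"
```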
Organizing Your Code
So far we haven’t had any problem with variable names clashing with each
other. But if we start using different libraries and files written by
different people, we could easily run across several that use the same
variable name for different things.
How do we get around that? We need some way to keep all those names separate.
Namespaces
Clojure uses namespaces to keep variable names from clashing. (These are
different from Java’s packages, if you’re familiar with those.) By default,
all of the built-in functions are in the clojure namespace. When you’re in
the REPL, everything is in the user namespace. Remember what the REPL prompt
looks like?
user=>
The user at the beginning of the prompt tells us which namespace we’re
currently working in. Clojure indicates which namespace a variable is in by
printing the namespace and a forward slash before the variable name. For
example, the user/tokenize-str that is printed after loading word.clj
indicates that the last variable read in was tokenize-str in the user
namespace.
Here we want to define everything we’re working on in a namespace called
word. To do that, just add these lines to the top of word.clj:
(in-ns 'word)
(clojure/refer 'clojure)
The first line creates a new namespace, word, and uses that to contain all
the variables defined in the rest of the file.
Immediately after the first line, we can’t reference any of the functions that
Clojure provides. The second line fixes that: the (clojure/refer 'clojure)
call makes everything in the clojure namespace available in the word
namespace.
Remember that clojure/refer references the refer variable in the clojure
namespace. So even though we can’t access Clojure’s built-ins directly at
this point, we can still reference them using their full (namespace plus name)
names.
(As an aside, notice that in the second line, the second clojure is quoted.
That’s because symbols can either be variables or symbol objects in their own
right. Without the quote, Clojure thinks that the symbol is a variable; with
the quote, it reads the second clojure as a symbol, which is what refer
wants.)
Now, quit Clojure, go back in, and re-load the file. (We want to start with a
clear slate; otherwise, the old function definitions will still be hanging
around to confuse us.)
user=> (load-file "word.clj")
#'word/tokenize-str
The first thing to notice is that the variable returned at the end has a new
namespace prefix, word. Let’s try to use it:
user=> (tokenize-str "This is a String.")
java.lang.Exception: Unable to resolve symbol: tokenize-str in this context
Oops. No tokenize-str is found. We have to tell Clojure to look in the
word namespace:
user=> (word/tokenize-str "This is a String.")
("this" "is" "a" "string")
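In current Clojure releases the built-in functions live in clojure.core, and
a file usually opens with the ns macro, which creates the namespace and
refers clojure.core in one step. The sketch below assumes such a release.
The word-demo namespace is made up for this example and stands in for
"some other namespace" such as user.

```clojure
;; Assumes a current Clojure release.
;; ns creates the word namespace and makes clojure.core available in it.
(ns word)

(defn to-lower-case [token-string]
  (.toLowerCase token-string))

;; Switch to a second namespace. `word-demo` is a hypothetical name.
(ns word-demo)

;; The unqualified name is not visible from here...
(resolve 'to-lower-case)        ; => nil

;; ...but the namespace-qualified name is.
(word/to-lower-case "The")      ; => "the"
```

The idea is the same as in the in-ns version above: names defined in word
stay in word, and other namespaces reach them through the word/ prefix.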
Remember to check out the new code from this posting in the Google Code
project for word-clj.
We’ve covered a lot—probably too much—for today. Tomorrow, I’ll tackle just
one topic: we’ll add stop-list filtering to our tokenizing.