As I’ve already said, the way we’re tokenizing is lacking. One major problem at this point is that "The" and "the" are returned as two separate tokens.
To take care of that, we need to convert all tokens to lower-case as they’re read in. That’s a two-step process: first, convert a token to lower-case; second, apply that conversion to each token as it’s read in.
How do we do this?
In Clojure, strings are the same as Java strings: a Clojure string takes the same methods as a Java string, and like a Java string, a Clojure string is immutable: once you’ve created it, you can’t change it.
So now the question is how do we call Java methods?
In fact, it’s fairly simple: to call the
length method on a string, use
.length (a dot and the method name) as the function to call, and pass the
string as the first argument to that function. Other arguments are passed in
after the method:
    user=> (def a-string "My name")
    #'user/a-string
    user=> (.length a-string)
    7
    user=> (.substring a-string 3)
    "name"
    user=> (.substring a-string 3 5)
    "na"
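The same dot syntax works for any instance method on java.lang.String. Here are a few more calls as a small sketch; the name a-string is reused from the session above, and the choice of methods is only for illustration:

```clojure
(def a-string "My name")

;; No extra arguments: just the method and the string.
(.toUpperCase a-string)            ; => "MY NAME"

;; One extra argument, passed after the string.
(.indexOf a-string "name")         ; => 3
(.startsWith a-string "My")        ; => true

;; Two extra arguments.
(.replace a-string "name" "word")  ; => "My word"

;; Strings are immutable, so a-string itself is unchanged.
a-string                           ; => "My name"
```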
This also works for calling static methods: just pass in the name of the class instead of a class instance as the first argument. For example, the first line below calls the static method Runtime.getRuntime:
    user=> (def rt (.getRuntime Runtime))
    #'user/rt
    user=> (.freeMemory rt)
    30462304
(Runtime gets automatically imported from java.lang. Later I’ll show you how to import other Java classes and how to create instances of them.)
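A note for readers on Clojure 1.0 or later: static methods there are normally written with a slash between the class name and the method name, and the dot-first spelling used for getRuntime above may not work. A short sketch under that assumption (the name rt follows the session above):

```clojure
;; Class/method calls a static method.
(def rt (Runtime/getRuntime))

;; Instance methods still use the leading dot.
(.freeMemory rt)            ; a number of bytes; varies from run to run
(.availableProcessors rt)   ; depends on the machine

;; Other static methods on java.lang classes work the same way.
(Math/abs -3)               ; => 3
(Integer/parseInt "42")     ; => 42
```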
Java strings have a
toLowerCase method, which will do exactly what we want.
The only problem is that Java methods aren’t the same as Clojure functions, so
we’ll need to wrap the method call in a function. Open the word.clj file you’ve been working on and add this before tokenize-str:
    (defn to-lower-case [token-string]
      (.toLowerCase token-string))
That creates a function called to-lower-case, which just wraps a method call.
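Here is a quick check of the wrapper, plus two alternatives that behave the same way for plain strings. This is only a sketch: the anonymous-function form and clojure.string/lower-case are not used in this series, and the latter assumes Clojure 1.2 or later.

```clojure
(require '[clojure.string :as string])

(defn to-lower-case [token-string]
  (.toLowerCase token-string))

(to-lower-case "The")        ; => "the"

;; The same wrapper written as an anonymous function.
(#(.toLowerCase %) "The")    ; => "the"

;; Library function that lower-cases a string (Clojure 1.2+).
(string/lower-case "The")    ; => "the"
```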
Now we need to solve the second part of the problem: applying to-lower-case to every token as we create it.
Clojure, like all lisps, Python, Ruby, and many other computer languages,
treats functions as objects in their own right. That means you can take a
function, assign it to a variable (which is what
defn does), and pass it
around as an argument. Plus, Clojure has a number of functions—called
higher-order functions—that take functions as arguments and do
interesting things with them, either creating new functions based on the
original function or applying that function to a set of data.
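Both kinds of higher-order function can be sketched with a few built-ins. Here partial and comp create new functions, while filter and reduce apply a function to data. The names add-ten and inc-then-double are made up for this example.

```clojure
;; Functions are values: bind one to a name without defn.
(def add-ten (partial + 10))
(add-ten 5)                      ; => 15

;; comp builds a new function; the rightmost function runs first.
(def inc-then-double (comp #(* 2 %) inc))
(inc-then-double 3)              ; => 8

;; filter and reduce take a function and apply it to a sequence.
(filter even? '(1 2 3 4 5 6))    ; => (2 4 6)
(reduce + '(1 2 3 4))            ; => 10
```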
map is one of those functions. It takes another function and a sequence, and
it applies the function to every item in the sequence. Finally, it returns a
new sequence containing the results of applying the function to each item in
the input sequence. For example, let’s apply
to-lower-case to a sequence of
strings (make sure you call (load-file "word.clj") so that to-lower-case is defined in the REPL):
    user=> (map to-lower-case '("This" "IS" "my" "name"))
    ("this" "is" "my" "name")
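Given a single sequence, map works with any function of one argument, including built-ins and anonymous functions. A small sketch with arbitrary inputs:

```clojure
;; A built-in function over a list of numbers.
(map inc '(1 2 3))                       ; => (2 3 4)

;; count works on strings, so this yields each word's length.
(map count '("This" "IS" "my" "name"))   ; => (4 2 2 4)

;; An anonymous function calling a Java method on each item.
(map #(.toUpperCase %) '("my" "name"))   ; => ("MY" "NAME")

;; The input is left unchanged; map returns a new sequence.
(def words '("This" "IS"))
(map #(.toLowerCase %) words)            ; => ("this" "is")
words                                    ; => ("This" "IS")
```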
Now let’s combine
map with what’s already in
tokenize-str to create a new
version of the function that converts all tokens to lower-case:
    (defn tokenize-str [input-string]
      (map to-lower-case (re-seq token-regex input-string)))
Call (load-file "word.clj") again and test the new version of tokenize-str:
    user=> (tokenize-str "This is a LIST OF TOKENS.")
    ("this" "is" "a" "list" "of" "tokens")
There we go: all lower-case tokens.
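To try the whole thing in one paste, here is a self-contained sketch. The pattern bound to token-regex is an assumption made for this example; the real one is defined earlier in word.clj and may differ.

```clojure
;; Assumed pattern: runs of word characters. The real token-regex
;; lives in word.clj and may not match this one.
(def token-regex #"\w+")

(defn to-lower-case [token-string]
  (.toLowerCase token-string))

(defn tokenize-str [input-string]
  (map to-lower-case (re-seq token-regex input-string)))

(tokenize-str "The cat and the hat.")
;; => ("the" "cat" "and" "the" "hat")

;; "The" and "the" now count as the same token.
(count (distinct (tokenize-str "The cat and the hat.")))
;; => 4
```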
In the first
map example above, I included something that we haven’t seen
before: sequence literals. Sequences, or lists, are a big part of any lisp.
Lists in Clojure are printed the way they are in all lisps: as a space-delimited list of items surrounded by parentheses. They look just like function calls, which is awkward. It means that if you just type in a list, Clojure will try to call it like a function:
    user=> ("my" "name")
    java.lang.ClassCastException: java.lang.String cannot be cast to clojure.lang.IFn
That’s a poor way of saying that you can’t treat a string like it’s a function.
To type the list and have Clojure recognize it as a list, you have to quote it by putting a single-quote character in front of the list:
    user=> '("my" "name")
    ("my" "name")
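Quoting is one of several ways to get a list without calling it. As a sketch, the list function builds the same list, and a vector literal avoids the problem entirely:

```clojure
;; Quote: Clojure returns the list itself instead of calling it.
'("my" "name")               ; => ("my" "name")

;; (quote ...) is the long form of the single-quote character.
(quote ("my" "name"))        ; => ("my" "name")

;; list is an ordinary function that builds a list from its arguments.
(list "my" "name")           ; => ("my" "name")

;; A vector literal needs no quote, and map accepts it as well.
(map count ["my" "name"])    ; => (2 4)
```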
Organizing Your Code
So far we haven’t had any problem with variable names clashing with each other. But if we start using different libraries and files written by different people, we could easily run across several that use the same variable name for different things.
How do we get around that? We need some way to keep all those names separate.
Clojure uses namespaces to keep variable names from clashing. (These are
different than Java’s packages, if you’re familiar with those.) By default,
all of the built-in functions are in the
clojure namespace. When you’re in
the REPL, everything is in the
user namespace. Remember what the REPL prompt looks like: user=>. The user at the beginning of the prompt tells which namespace we’re currently working in. Clojure indicates which namespace a variable is in by printing the namespace and a forward slash before the variable name. For example, the #'user/tokenize-str that is printed after loading word.clj indicates that the last variable read in was tokenize-str in the user namespace.
Here we want to define everything we’re working on in a namespace called word. To do that, just add these lines to the top of word.clj:
    (in-ns 'word)
    (clojure/refer 'clojure)
The first line creates a new namespace,
word, and uses that to contain all
the variables defined in the rest of the file.
Immediately after the first line, we can’t reference any of the functions that Clojure provides. The second line fixes that: the (clojure/refer 'clojure) call makes everything in the clojure namespace available in the word namespace. clojure/refer references the refer variable in the clojure namespace. So even though we can’t access Clojure’s built-ins directly at this point, we can still reference them using their full (namespace plus name) names.
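A note for readers on Clojure 1.0 or later: the core namespace there is named clojure.core rather than clojure, so the two lines need adjusting. Here is a sketch under that assumption, still using the fully qualified name for refer as described above. The function shout is a made-up helper, only there to show that built-ins resolve afterwards.

```clojure
;; Switch to the word namespace, creating it if needed.
(in-ns 'word)

;; Built-ins are not referred yet, so use the fully qualified name.
(clojure.core/refer 'clojure.core)

;; From here on, built-ins such as defn resolve as usual.
(defn shout [s]              ; hypothetical helper for illustration
  (.toUpperCase s))

(shout "tokens")             ; => "TOKENS"

;; In current code the ns macro usually does both steps at once:
;; (ns word)
```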
(As an aside, notice that in the second line, the second clojure is quoted. That’s because symbols can either be variables or symbol objects in their own right. Without the quote, Clojure thinks that the symbol is a variable; with the quote, it reads the second clojure as a symbol, which is what we want here.)
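The difference between a symbol used as a name and a symbol used as an object is easy to see at the REPL. A small sketch (the variable answer is made up for this example):

```clojure
(def answer 42)

;; Unquoted: the symbol is looked up and its value is returned.
answer               ; => 42

;; Quoted: the symbol itself is returned, without any lookup.
'answer              ; => answer
(symbol? 'answer)    ; => true
(symbol? answer)     ; => false

;; A quoted symbol does not have to name anything that exists.
'no-such-variable    ; => no-such-variable
```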
Now, quit Clojure, go back in, and re-load the file. (We want to start with a clear slate; otherwise, the old function definitions will still be hanging around to confuse us.)
    user=> (load-file "word.clj")
    #'word/tokenize-str
The first thing to notice is that the variable returned at the end has a new namespace: word. Let’s try to use it:
    user=> (tokenize-str "This is a String.")
    java.lang.Exception: Unable to resolve symbol: tokenize-str in this context
No tokenize-str is found. We have to tell Clojure to look in the word namespace:
    user=> (word/tokenize-str "This is a String.")
    ("this" "is" "a" "string")
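Typing the namespace every time gets tedious in a long session. One way around it is refer, the same function used at the top of word.clj. The sketch below uses a stand-in word namespace so that it runs on its own; the pattern in it is an assumption, and the name clojure.core assumes Clojure 1.0 or later.

```clojure
;; Stand-in for loading word.clj.
(in-ns 'word)
(clojure.core/refer 'clojure.core)
(defn tokenize-str [input-string]
  (map #(.toLowerCase %) (re-seq #"\w+" input-string)))

;; Back in user, the fully qualified name always works.
(in-ns 'user)
(word/tokenize-str "This is a String.")   ; => ("this" "is" "a" "string")

;; refer makes word's public names available without the prefix.
(refer 'word)
(tokenize-str "This is a String.")        ; => ("this" "is" "a" "string")
```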
Remember to check out the new code from this posting in the Google Code project for word-clj.
We’ve covered a lot—probably too much—for today. Tomorrow, I’ll tackle just one topic: we’ll add stop-list filtering to our tokenizing.