Tuesday, June 24, 2008

Tokenization, Part 3: Functions


In the last post, we saved the regular expression that we used to tokenize a string to a variable. But it would be more convenient to be able to save the entire tokenization procedure to a variable. Pretty much all programming languages let us save a series of statements or expressions—a function—to evaluate later. How does Clojure do this?

In fact, creating a function looks a lot like creating a variable. First, start Clojure and make sure that token-regex is still defined:

user=> (def token-regex #"\w+")

Next, define the function, only instead of using def, use defn:

user=> (defn tokenize-str [input-string]
  (re-seq token-regex input-string))

Let’s break that apart:

  1. defn indicates that we’re defining a function, not a variable.
  2. tokenize-str is the name of the function. Functions and variables share the same set of names, so defining a variable named tokenize-str will replace the function named tokenize-str, and vice versa.
  3. [input-string] is the parameter list, written as a vector in square brackets. tokenize-str takes one argument, named input-string. Expressions inside the function can refer to the value passed into the function using that name.
  4. After you type in the first line and hit enter, nothing will happen. The parenthesis before defn is still open, so the Clojure REPL knows you’re not finished yet. You’ll need to enter the second line to complete the definition.
  5. The second line is just the re-seq function with both arguments as variables, like we used in the last posting. One variable is the regular expression from the previous def, and one is input-string from the function definition.
  6. Functions return the value of their last expression. In this case, that is the function call to re-seq.
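
As a rough sketch of the points above: defn is shorthand for using def to bind a name to a function created with fn, and a function body may hold several expressions, of which only the last supplies the return value. The names tokenize-str-2 and count-tokens are made up for this illustration and are not part of the code in this series.

```clojure
(def token-regex #"\w+")

;; The defn form from the post.
(defn tokenize-str [input-string]
  (re-seq token-regex input-string))

;; Roughly what defn is shorthand for: def plus an anonymous function (fn).
;; tokenize-str-2 is a hypothetical name used only for comparison.
(def tokenize-str-2
  (fn [input-string]
    (re-seq token-regex input-string)))

;; A function returns the value of its last expression.
;; count-tokens (also hypothetical) prints a message, then returns the count.
(defn count-tokens [input-string]
  (println "tokenizing:" input-string)   ; evaluated, but its value is discarded
  (count (tokenize-str input-string)))   ; last expression: the return value

(println (= (tokenize-str "one two") (tokenize-str-2 "one two"))) ; prints true
(println (count-tokens "one two three"))
```

Both definitions behave the same way when called; defn is simply the more convenient spelling.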

Now let’s give it a try:

user=> (tokenize-str "This is a new input string with different tokens.")
("This" "is" "a" "new" "input" "string" "with" "different" "tokens")

Sure enough. Now calling (tokenize-str ...) is the same as calling (re-seq token-regex ...).
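
One way to check that claim, as a small sketch: compare the two calls directly with =. The name sample-text is invented here just to avoid repeating the string.

```clojure
(def token-regex #"\w+")

(defn tokenize-str [input-string]
  (re-seq token-regex input-string))

;; sample-text is a made-up name for the input used in this check.
(def sample-text "This is a new input string with different tokens.")

;; Both calls produce the same sequence of tokens, so = returns true.
(println (= (tokenize-str sample-text)
            (re-seq token-regex sample-text)))
```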

Saving Your Work

We’re starting to get enough code that typing it in every time we want to use it would be painful, inefficient, and worst of all, boring. Fortunately, like most other programming languages, Clojure lets us save expressions to a file to execute all at once.

To do this, open your text editor and create a new file. Let’s call it word.clj and save it in whatever directory you’re currently working in. Next, enter all the code we’ve written so far:

(def token-regex #"\w+")
(defn tokenize-str [input-string]
  (re-seq token-regex input-string))

Now switch back to the Clojure REPL and load this file using the load-file function:

user=> (load-file "word.clj")

After loading the file, Clojure prints the result of the last expression in the file. In this case, that is the expression defining the tokenize-str function.
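
Here is a minimal sketch of that behavior, assuming a reasonably recent Clojure in which spit is part of clojure.core. To avoid touching a real word.clj, it writes the two definitions to a temporary file; demo-file and load-result are names invented for this example.

```clojure
;; Write the definitions to a temporary file. Inside a string literal the
;; backslash has to be doubled; the file itself ends up containing #"\w+".
(def demo-file (java.io.File/createTempFile "word-demo" ".clj"))

(spit demo-file
      (str "(def token-regex #\"\\w+\")\n"
           "(defn tokenize-str [input-string]\n"
           "  (re-seq token-regex input-string))\n"))

;; load-file evaluates every form in the file and returns the value of the
;; last one. defn returns the var it created, which is what the REPL prints.
(def load-result (load-file (.getPath demo-file)))

(println (var? load-result))   ; prints true
(println (tokenize-str "Another input string."))
```

At the REPL, the value returned by load-file is displayed in a form like #'user/tokenize-str, which is how Clojure prints a var.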

We can use the variables and functions defined in that file, just as if we had typed them into the REPL:

user=> (tokenize-str "Another input string.")
("Another" "input" "string")

I’ve set up a Google Code Project for the code in this series at http://code.google.com/p/word-clj/. As we go along, I’ll update the code in step with the postings here.

Also, if you find any bugs, you can let me know using the issues tracker there.

Next time we’ll improve the tokenization and talk about how to organize our code better.


Anonymous said...

Hi, excellent tutorial! But you should really remove the double \\ because beginners will be immediately stuck otherwise.

Regards, chris

Anonymous said...

this is a great tutorial - unfortunately i am not able to load the file. the following exception occurred:

user=> (load-file "word.clj")
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.UnsupportedOperationException: nth not supported on this type: PersistentHashSet (word.clj:6)

working with clojure: clojure_20081217

thanks in advance

Anonymous said...

sorry - i forgot to add the word.clj:

(in-ns 'word)
(clojure.core/refer 'clojure.core)

(def token-regex #"\w+")

(defn stop-words #{"one" "two" "three"})

(defn to-lower-case [#^String input-string] (.toLowerCase input-string))

(defn tokenize-str [input-string]
(map to-lower-case (re-seq token-regex input-string)))

Eric Rochester said...


Glad you were able to figure it out. I'll try to clarify that in the future.


Anonymous said...

do you know why the exception above occurred?

Eric Rochester said...

Sorry. I misunderstood the comment where you posted your file.

The problem is with the line that says:

> (defn stop-words #{"one" "two" "three"})

The problem is that you're not defining a function. You're defining a set (which can be used as a function, but that's beside the point here). "defn" is used to define functions. "def" is used to define everything else. Change that line to:

> (def stop-words #{"one" "two" "three"})

And you'll be good.
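
For what it's worth, here is a small sketch of the corrected line, along with the aside about a set being usable as a function. The sample words passed in are made up for illustration.

```clojure
;; def binds a name to any value; here, a set of strings.
(def stop-words #{"one" "two" "three"})

;; A set can be called like a function: it returns the element if it is a
;; member of the set, and nil otherwise.
(println (stop-words "one"))    ; prints one
(println (stop-words "four"))   ; prints nil

;; That makes a set usable as a predicate, for example to drop stop words.
(println (remove stop-words ["one" "fish" "two" "fish"])) ; prints (fish fish)
```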


Stuart Malcolm said...

In the article you define token-regex as "\\w+" however this was failing to re-seq for me (always returning nil)

Solution was to change to "\w+", ie.

(def token-regex #"\w+")


Eric Rochester said...

Hi Stuart,

Thanks for catching that! I've updated the post.


Anonymous said...

Thanks for the ramp-up. Good stuff here.

Being a bit pedantic here, but if it's immutable, it's not a variable, it's a value. (I think other Clojure documentation uses binding for any form of value-to-symbol assignment, some of which are mutable).

Eric Rochester said...

Good point. Binding is a more correct term for them.