Tokenization, Part 3: Functions
Functions
In the last post, we saved the regular expression that we used to tokenize a string to a variable. But it would be more convenient to be able to save the entire tokenization procedure to a variable. Pretty much all programming languages let us save a series of statements or expressions—a function—to evaluate later. How does Clojure do this?
In fact, creating a function looks a lot like creating a variable. First, start Clojure and make sure that token-regex is still defined:
user=> (def token-regex #"\w+")
#'user/token-regex
Next, define the function, only instead of using def, use defn:
user=> (defn tokenize-str [input-string]
         (re-seq token-regex input-string))
#'user/tokenize-str
Let’s break that apart:
- defn indicates that we’re defining a function, not a variable.
- tokenize-str is the name of the function. Functions and variables use the same set of names, so naming a variable tokenize-str will get rid of the function named tokenize-str, and vice versa.
- [input-string] is a square-bracket-delimited list of the parameters that this function accepts. In the case of tokenize-str, it takes one argument, named input-string. Expressions inside the function can refer to the value passed into the function using that name.
- After you type in that line and hit enter, nothing will happen. The first parenthesis before defn is still open, so the Clojure REPL knows you’re not finished yet. You’ll need to enter the second line to continue.
- The second line is just the re-seq function with both arguments as variables, like we used in the last posting. One variable is the regular expression from the previous def, and one is input-string from the function definition.
- Functions return the value of their last expression. In this case, that is the function call to re-seq.
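As a rough sketch of what the list above describes: defn is essentially shorthand for def combined with fn, which is why functions and variables share one set of names. The names tokenize-str-alt and last-expression-demo are made up for this illustration and are not part of the code for this series.

```clojure
(def token-regex #"\w+")

;; The function from this post.
(defn tokenize-str [input-string]
  (re-seq token-regex input-string))

;; Roughly what defn does for us: bind a name to an anonymous function.
;; tokenize-str-alt is a hypothetical name used only for comparison.
(def tokenize-str-alt
  (fn [input-string]
    (re-seq token-regex input-string)))

;; A function returns the value of its last expression. The value of the
;; earlier expression (count input-string) is computed and then discarded.
(defn last-expression-demo [input-string]
  (count input-string)
  (tokenize-str input-string))

(tokenize-str-alt "Two words")      ; ("Two" "words")
(last-expression-demo "Two words")  ; ("Two" "words")
```

Both calls give the same tokens as tokenize-str itself, since all three end by evaluating the same re-seq expression.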
Now let’s give it a try:
user=> (tokenize-str "This is a new input string with different tokens.")
("This" "is" "a" "new" "input" "string" "with" "different" "tokens")
Sure enough. Now calling (tokenize-str ...) is the same as calling (re-seq token-regex ...).
Saving Your Work
We’re starting to get enough code that typing it in every time we want to use it would be painful, inefficient, and worst of all, boring. Fortunately, like most other programming languages, Clojure lets us save expressions to a file and execute them all at once.
To do this, open your text editor and create a new file. Let’s call it word.clj and save it in whatever directory you’re currently working in. Next, enter all the code we’ve typed in so far:
(def token-regex #"\w+")

(defn tokenize-str [input-string]
  (re-seq token-regex input-string))
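When typing the regular expression into the file, write it with a single backslash. A Clojure regex literal passes its contents to the regex engine as written, so a doubled backslash means something different. A small sketch of the difference, using only re-seq and re-pattern from clojure.core:

```clojure
;; A single backslash: \w matches a word character, so this finds the tokens.
(re-seq #"\w+" "Another input string.")
;; => ("Another" "input" "string")

;; A doubled backslash: the pattern now means a literal backslash followed
;; by one or more "w" characters, so nothing in this input matches.
(re-seq #"\\w+" "Another input string.")
;; => nil

;; The doubled form is only needed inside an ordinary string, where the
;; string reader consumes one level of escaping before the regex sees it.
(re-seq (re-pattern "\\w+") "Another input string.")
;; => ("Another" "input" "string")
```

The doubling habit usually comes from Java, where patterns are always written as ordinary strings.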
Now switch back to the Clojure REPL and load this file using the load-file function:
user=> (load-file "word.clj")
#'user/tokenize-str
After loading the file, Clojure prints the result of the last expression in the file. In this case, that is the expression defining the tokenize-str function.
We can use the variables and functions defined in that file, just as if we had typed them into the REPL:
user=> (tokenize-str "Another input string.")
("Another" "input" "string")
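To see the load-file behavior in isolation, here is a minimal sketch that writes a throwaway file and loads it. The names demo-file, answer, double-it, and load-result are hypothetical and exist only for this example. It assumes you run it at the REPL or as a script, so each expression is evaluated in turn.

```clojure
;; Write a tiny two-expression file to a temporary location.
;; demo-file, answer, and double-it are made-up names for this sketch.
(def demo-file (java.io.File/createTempFile "load-file-demo" ".clj"))
(spit demo-file "(def answer 21)\n(defn double-it [n] (* 2 n))\n")

;; load-file evaluates every expression in the file, in order, and returns
;; the value of the last one: here, the var created by defn.
(def load-result (load-file (.getPath demo-file)))

(var? load-result)   ; true
(double-it answer)   ; 42
```

Everything defined in the file is available afterward, which is the same thing that happens with word.clj.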
I’ve set up a Google Code Project for the code in this series at http://code.google.com/p/word-clj/. As we go along, I’ll update the code in step with the postings here.
Also, if you find any bugs, you can let me know using the issue tracker there.
Next time we’ll improve the tokenization and talk about how to organize our code better.
10 comments:
Hi, excellent tutorial! But you should really remove the double \\ because beginner will be immediately stuck otherwise.
Regards, chris
this is a great tutorial - unfortunately i am not able to load the file. the following exception occurred:
user=> (load-file "word.clj")
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.UnsupportedOperationException: nth not supported on this type: PersistentHashSet (word.clj:6)
working with clojure: clojure_20081217
thanks in advance
sorry - i forgot to add the word.clj:
(in-ns 'word)
(clojure.core/refer 'clojure.core)
(def token-regex #"\w+")
(defn stop-words #{"one" "two" "three"})
(defn to-lower-case [#^String input-string] (.toLowerCase input-string))
(defn tokenize-str [input-string]
(map to-lower-case (re-seq token-regex input-string)))
Hi,
Glad you were able to figure it out. I'll try to clarify that in the future.
Eric
do you know why the exception above occurred?
Sorry. I misunderstood the comment where you posted your file.
The problem is with the line that says:
> (defn stop-words #{"one" "two" "three"})
The problem is that you're not defining a function. You're defining a set (which can be used as a function, but that's beside the point here). "defn" is used to define functions. "def" is used to define everything else. Change that line to:
> (def stop-words #{"one" "two" "three"})
And you'll be good.
Later,
Eric
In the article you define token-regex as "\\w+" however this was failing to re-seq for me (always returning nil)
Solution was to change to "\w+", ie.
(def token-regex #"\w+")
Thanks.
Hi Stuart,
Thanks for catching that! I've updated the post.
Eric
Thanks for the ramp-up. Good stuff here.
Being a bit pedantic here, but if it's immutable, it's not a variable, it's a value. (I think other Clojure documentation uses binding for any form of value-to-symbol assignment, some of which are mutable).
Good point. Binding is a more correct term for them.