Monday, June 23, 2008

Tokenization: Part 2

Tokenization, Part 2: Literals and Variables

Literals

In the last posting, I showed how to tokenize a string into separate tokens, or words. For reference, the code to do this in the Clojure REPL was:

user=> (re-seq #"\\w+" "This string contains some tokens. Yipee!")
("This" "string" "contains" "some" "tokens" "Yipee")

In the snippet above, both the regular expression and the input string are literals: direct representations of a value. #"\\w+" is a literal regular expression, and "This string contains some tokens. Yipee!" is a string literal. (In newer versions of Clojure the backslash in a regular expression literal is not doubled, so the same pattern is written #"\w+".) Some other literal expressions in Clojure are:

String "This is a string."
Regular expression #"\\w+"
Character \a
Integer 42
Float 3.1415

Clojure also has literal expressions for a variety of other types of data. I’ll introduce them as we need them.
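
As a quick check of the list above, here is a small sketch that asks Clojure for the Java class behind each kind of literal. It assumes a recent Clojure release (1.3 or later), where integer literals are read as Long and the regex backslash is written once. The names literal-examples and literal-classes are made up for this illustration.

```clojure
;; One example of each literal from the list above.
;; Assumes a recent Clojure, where the regex backslash is not doubled.
(def literal-examples
  ["This is a string." #"\w+" \a 42 3.1415])

;; class returns the Java class of each value.
(def literal-classes (vec (map class literal-examples)))

(println literal-classes)
```

On the assumed Clojure version, the vector holds String, java.util.regex.Pattern, Character, Long, and Double, in that order.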

Remembering

Remembering the regular expression we used above, #"\\w+", isn’t a big deal. A more complicated expression, though, may be difficult to read, much less to recall or to type correctly every time. Fortunately, we don’t have to remember it. Instead, we can use variables. Clojure creates a variable and assigns it a value with a def expression:

user=> (def token-regex #"\\w+")
#'user/token-regex

Now, whenever we use the name token-regex, Clojure substitutes the value #"\\w+" in its place:

user=> (re-seq token-regex "This string contains some tokens. Yipee!")
("This" "string" "contains" "some" "tokens" "Yipee")

For that matter, if we’re going to be using that input string frequently, we can assign it to a variable also:

user=> (def input-string "This string contains some tokens. Yipee!")
#'user/input-string
user=> (re-seq token-regex input-string)
("This" "string" "contains" "some" "tokens" "Yipee")
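
Naming pays off more as the pattern gets longer. The following is only a sketch: word-regex is a hypothetical pattern (not one used elsewhere in this series) that keeps contractions and hyphenated words together, and it assumes a recent Clojure where the regex backslash is written once.

```clojure
;; Hypothetical pattern: a run of word characters, optionally followed by
;; more runs joined by an apostrophe or a hyphen.
(def word-regex #"\w+(?:['-]\w+)*")

(def sample-text "Don't split well-known words.")

;; The group is non-capturing, so re-seq returns plain strings.
(def words (re-seq word-regex sample-text))
;; words => ("Don't" "split" "well-known" "words")
```

Once the pattern has a name, nobody has to retype or re-read it to use it.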

Benefits

Using variables gives us several advantages:

  1. We can name things according to their function in the program, not their value.
  2. We can reduce errors by removing duplication. Since we aren't typing in the value by hand every time, we have fewer opportunities to type it in wrong.
  3. For data that may take up a lot of memory, we only need to use up that memory once, and then we can refer to that single instance of the data as many times as we need to.
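
To make these points concrete, here is a minimal sketch that defines each value once and then reuses the names in several expressions. It assumes a recent Clojure (single backslash in the regex). The names tokens, token-count, and first-match are introduced here only for illustration.

```clojure
;; Each value is typed in exactly once...
(def token-regex #"\w+")
(def input-string "This string contains some tokens. Yipee!")

;; ...and then referred to by name wherever it is needed.
(def tokens (re-seq token-regex input-string))         ; all matches
(def token-count (count tokens))                       ; 6
(def first-match (re-find token-regex input-string))   ; "This"
```

If the pattern ever needs to change, only the def of token-regex has to be edited.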

Caveats

If you’re coming to Clojure from another, non-functional programming language (as I assume you are), be aware that variables in Clojure are immutable: you can’t change them. There is no assignment operator like the = found in many languages. (You can call def again on the same variable name, but technically that’s not assigning a new value to the old variable, and in general you don’t re-def a variable in Clojure.)

As we’ll see later, immutability is a good thing in Clojure. But it does take some getting used to.
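
One way to see immutability at work: a function that appears to "change" a value actually returns a new value and leaves the original alone. This is a minimal sketch, assuming a recent Clojure. The names tokens and more-tokens are invented for the example.

```clojure
;; Put the tokens in a vector so conj adds at the end.
(def tokens
  (vec (re-seq #"\w+" "This string contains some tokens. Yipee!")))

;; conj does not modify tokens; it returns a new vector.
(def more-tokens (conj tokens "Extra"))

(count tokens)       ; still 6
(count more-tokens)  ; 7
```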


Next, we’ll look at how to wrap up the entire protocol for tokenization into its own variable.

1 comment:

Anonymous said...

It is not necessary to escape '\' anymore in regexes in newer versions of Clojure.