Tokenization, Part 2: Literals and Variables
In the last posting, I showed how to tokenize a string into separate tokens/words. For reference, the code to do this (in the Clojure REPL) was:
user=> (re-seq #"\w+" "This string contains some tokens. Yipee!")
("This" "string" "contains" "some" "tokens" "Yipee")
In the snippet above, both the regular expression and the input string are literals: direct representations of a value. #"\w+" is a literal regular expression, and "This string contains some tokens. Yipee!" is a string literal. Some other literal expressions in Clojure are:
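A few common ones, as an illustrative sketch rather than a complete list (the particular values are arbitrary examples):

```clojure
;; Each form below is a literal: it is written directly as the value it denotes.
42        ; an integer
3.14      ; a floating-point number
\a        ; a character
:token    ; a keyword
true      ; a boolean
nil       ; nil, meaning "no value"
[1 2 3]   ; a vector
{:a 1}    ; a map
#{1 2}    ; a set
```

Typing any of these at the REPL simply echoes the value back, because a literal evaluates to the value it represents.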
Clojure also has literal expressions for a variety of other types of data. I’ll introduce them as we need them.
Remembering the regular expression we were using above, #"\w+", isn’t a big deal. A more complicated expression may be difficult to read, much less to recall or to type correctly all the time. Fortunately, we don’t have to remember it. Instead, we can use variables. Clojure creates a variable and assigns it a value with a def expression:
user=> (def token-regex #"\w+")
#'user/token-regex
Now, whenever we use the name token-regex, Clojure substitutes the value we assigned to it:
user=> (re-seq token-regex "This string contains some tokens. Yipee!")
("This" "string" "contains" "some" "tokens" "Yipee")
For that matter, if we’re going to be using that input string frequently, we can assign it to a variable also:
user=> (def input-string "This string contains some tokens. Yipee!")
#'user/input-string
user=> (re-seq token-regex input-string)
("This" "string" "contains" "some" "tokens" "Yipee")
Using variables gives us several advantages:
- We can name things according to what their function in the program is, not by their value.
- We can reduce errors by removing duplication. Since we aren't typing in the value by hand every time, we have fewer opportunities to type it in wrong.
- For data that may take up a lot of memory, we only need to use up that memory once, and then we can refer to that single instance of the data as many times as we need to.
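To make the point about duplication concrete, here is a small sketch that reuses token-regex on more than one input. The name other-string and its contents are made up for illustration.

```clojure
;; token-regex is the variable defined earlier in this post.
(def token-regex #"\w+")

;; other-string is a hypothetical second input, not part of the original example.
(def other-string "Another string, with more tokens.")

;; The regular expression is written once and reused by name.
(re-seq token-regex "This string contains some tokens. Yipee!")
;; => ("This" "string" "contains" "some" "tokens" "Yipee")

(re-seq token-regex other-string)
;; => ("Another" "string" "with" "more" "tokens")
```

If the pattern ever needs to change, only the def has to be edited, and every use picks up the new value.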
If you’re coming to Clojure from another, non-functional programming language (as I assume you are), be aware that variables in Clojure are immutable: you can’t change them. There is no assignment operator, like = in many languages. (You can def again on the same variable name, but that is not how Clojure programs normally update data, and in general, you don’t re-def a variable in Clojure.)
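One way to see what this means in practice, as a minimal sketch: functions hand back new values and leave the original alone. The name shouted is hypothetical, and clojure.string/upper-case is used only as a convenient example of a function that "changes" a string.

```clojure
(require '[clojure.string :as str])

(def input-string "This string contains some tokens. Yipee!")

;; upper-case returns a new string; it does not modify input-string.
(def shouted (str/upper-case input-string))

shouted
;; => "THIS STRING CONTAINS SOME TOKENS. YIPEE!"

input-string
;; => "This string contains some tokens. Yipee!"
```

Instead of changing input-string, we give the new value its own name and keep using both.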
As we’ll see later, immutability is a good thing in Clojure. But it does take some getting used to.
Next, we’ll look at how to wrap up the entire tokenization procedure into its own variable.