Tokenization, Part 2: Literals and Variables
Literals
In the last post, I showed how to tokenize a string into separate tokens (words). For reference, the code to do this in the Clojure REPL was:
user=> (re-seq #"\w+" "This string contains some tokens. Yipee!")
("This" "string" "contains" "some" "tokens" "Yipee")
In the snippet above, both the regular expression and the input string are literals: direct representations of a value. #"\w+" is a literal regular expression, and "This string contains some tokens. Yipee!" is a string literal. Some of the literal expressions in Clojure are:
Type | Example |
---|---|
String | "This is a string." |
Regular expression | #"\w+" |
Character | \a |
Integer | 42 |
Float | 3.1415 |
Clojure also has literal expressions for a variety of other types of data. I’ll introduce them as we need them.
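To check what each literal in the table actually produces, we can ask Clojure for the class of the value. This is only a sketch: the name literal-classes is made up for illustration, and the classes noted in the comments assume Clojure 1.3 or later running on the JVM.

```clojure
;; Ask Clojure for the underlying JVM class of each literal.
;; literal-classes is a hypothetical name used only for this example.
(def literal-classes
  [(class "This is a string.")   ; java.lang.String
   (class #"\w+")                ; java.util.regex.Pattern
   (class \a)                    ; java.lang.Character
   (class 42)                    ; java.lang.Long
   (class 3.1415)])              ; java.lang.Double

(println literal-classes)
```

Each literal is an ordinary Java object underneath, which is why Java's regular expression syntax works inside #"...".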
Remembering
Remembering the regular expression we were using above, #"\w+", isn’t a big deal. A more complicated expression, though, may be difficult to read, much less to recall or to type correctly every time. Fortunately, we don’t have to remember it. Instead, we can use variables. Clojure creates a variable and binds it to a value with a def expression:
user=> (def token-regex #"\w+")
#'user/token-regex
Now, whenever we use the name token-regex, Clojure substitutes the value #"\w+" in its place:
user=> (re-seq token-regex "This string contains some tokens. Yipee!")
("This" "string" "contains" "some" "tokens" "Yipee")
For that matter, if we’re going to be using that input string frequently, we can assign it to a variable also:
user=> (def input-string "This string contains some tokens. Yipee!")
#'user/input-string
user=> (re-seq token-regex input-string)
("This" "string" "contains" "some" "tokens" "Yipee")
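Naming pays off more as the pattern gets harder to read. As a rough illustration, here is a slightly more involved pattern that keeps contractions and hyphenated words together. The pattern and the names word-regex and words are assumptions made up for this example, not something from the earlier post.

```clojure
;; Hypothetical pattern: a run of word characters, optionally followed by
;; more runs joined by an apostrophe or a hyphen.
(def word-regex #"\w+(?:['-]\w+)*")

;; Use the name instead of retyping the pattern.
(def words (re-seq word-regex "Don't split well-known words."))
;; words => ("Don't" "split" "well-known" "words")

(println words)
```

The name word-regex says what the pattern is for, which the pattern itself does not.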
Benefits
Using variables gives us several advantages:
- We can name things according to their role in the program, not their value.
- We can reduce errors by removing duplication. Since we aren't typing in the value by hand every time, we have fewer opportunities to type it in wrong.
- For data that may take up a lot of memory, we only need to use up that memory once, and then we can refer to that single instance of the data as many times as we need to.
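A small sketch of the duplication point: the pattern is written once and then used by name in two different calls. The input strings and the names first-tokens and second-count are invented for this example.

```clojure
;; The pattern is typed exactly once.
(def token-regex #"\w+")

;; Both calls refer to the same named pattern, so a typo can only
;; happen (and only needs fixing) in one place.
(def first-tokens (re-seq token-regex "One fish, two fish."))
(def second-count (count (re-seq token-regex "Red fish, blue fish!")))
;; first-tokens => ("One" "fish" "two" "fish")
;; second-count => 4
```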
Caveats
If you’re coming to Clojure from a non-functional programming language (as I assume you are), be aware that variables in Clojure are meant to be treated as immutable: once you define one, you don’t change it. There is no assignment operator like the = found in many languages. (You can call def again on the same variable name, which rebinds the name to a new value, but that is mostly a convenience for working at the REPL; in general, you don’t re-def a variable in Clojure.)
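One way to see this in practice: a function that appears to "change" a value actually returns a new value, and the original name still refers to the original data. The sketch below uses upper-case from the standard clojure.string namespace (available since Clojure 1.2); the name shouted is made up for illustration.

```clojure
(require '[clojure.string :as str])

(def input-string "This string contains some tokens. Yipee!")

;; upper-case returns a new string; it does not modify its argument.
(def shouted (str/upper-case input-string))

;; shouted      => "THIS STRING CONTAINS SOME TOKENS. YIPEE!"
;; input-string => "This string contains some tokens. Yipee!"
(println shouted)
(println input-string)
```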
As we’ll see later, immutability is a good thing in Clojure, but it does take some getting used to.
Next, we’ll look at how to wrap up the entire tokenization procedure into its own variable.
1 comment:
It is not necessary to escape '\' anymore in regexes in newer versions of Clojure.