Tokenization, Part 1: Regular Expressions
Tokenization is the process of splitting a string of characters into tokens: useful sequences of characters that belong together. When analyzing natural language, tokens are often conflated with words, and defining what is and is not a word, from either a linguistic or a computational point of view, is problematic. Should a contraction be split into separate words? Should ice cream be one token or two?
But I’m going to keep this simple, so we're going to define a token/word in the simplest way possible, and we're going to use regular expressions to do it.
About Regular Expressions
Regular expressions (often shortened to regexes) are a mini-language used inside a full-fledged programming language or other tool to specify how to match text. For example, you can create a regex that says you want to match one or more a characters (a+) and use that pattern to find or replace a group of a’s in a string.
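To see that pattern in action, here is a quick sketch at the Clojure REPL, using the regex syntax explained a little further down:

```clojure
;; a+ finds every maximal run of one or more a's in the string.
user=> (re-seq #"a+" "aardvark banana")
("aa" "a" "a" "a" "a")
```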
Regexes also have shortcut ways to indicate classes of characters. For instance, the class \w matches any alphanumeric character or underscore.
Underscore? As it turns out, when regexes say “word characters,” you should usually add “as defined by the programming language C.” Words in C can contain non-accented ASCII letters, numbers, and—you guessed it—underscores.
\w is not ideal for tokenizing. It doesn’t handle any letters that aren’t used in American English; it doesn’t include apostrophes; and it does include underscores. But as I said, I’m trying to keep this simple. Later, you can muck things up with theoretically pure tokens that actually make sense.
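A quick REPL sketch makes those limitations concrete: \w+ splits the contraction at the apostrophe, breaks apart the accented letter in naïve (the default engine treats \w as ASCII-only), and happily keeps the underscore:

```clojure
;; The apostrophe and the ï are not \w characters, but _ is.
user=> (re-seq #"\w+" "don't_stop naïve")
("don" "t_stop" "na" "ve")
```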
Enough Talk Already!
Let’s fire up Clojure with the scripts we created in the last posting and see what damage we can cause.
Clojure has a simple syntax for creating regexes. To make a regular expression that finds groups of “word” characters, just put quotes around the pattern and put a hash mark (#) in front of it. Note that although the backslash has special meaning in ordinary Clojure strings (where you’d have to double it), regex literals pass the backslash straight through to the regex pattern, so you can write it as-is:
user=> #"\w+"
#"\w+"
That creates the regex. To use it, we’ll pass both it and a string to the re-seq function:
user=> (re-seq #"\w+" "This string contains some tokens. Yipee!")
("This" "string" "contains" "some" "tokens" "Yipee")
(Don’t worry about how functions are called. Just type in what you see above. I’ll explain what’s happening in the next few postings.)
That’s it. We’ve got the basics of how to tokenize a string down. Next, we’ll learn how to package it up so we know we’ll always be tokenizing consistently.
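As a preview of that packaging, here is one minimal sketch: wrapping the regex in a function so that every caller tokenizes the same way. The name tokenize is my own choice, not something from this posting:

```clojure
;; A sketch: one shared definition of what a token is.
(defn tokenize
  "Split a string into tokens: maximal runs of \\w characters."
  [s]
  (re-seq #"\w+" s))

user=> (tokenize "This string contains some tokens. Yipee!")
("This" "string" "contains" "some" "tokens" "Yipee")
```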
More about Regular Expressions
This is as much as I’m going to cover about regular expressions. That doesn’t mean that regular expressions aren’t insanely useful: they’re just also insanely complicated. If you do much text processing, however, you'll want to get intimate with all their useful craziness.
Here are some resources for learning regular expressions:
Java Tutorials: Regular Expressions This is a Java-oriented tutorial on regexes, so it applies directly to Clojure, whose regexes use the same underlying engine.
Regular Expressions Tutorial This is a good overview that focuses on Perl 5 regular expressions, which are used in many modern programming languages, including Java and therefore Clojure.
Regular Expression HOWTO This focuses on Python's implementation of regular expressions, which is also very similar to Perl 5 regexes.
Learning to Use Regular Expressions This again uses Python as the reference.