Tokenization, Part 1: Regular Expressions
About Tokenization
Tokenization is the process of splitting a string of characters into tokens: useful sequences of characters that all belong together. When analyzing natural language, tokens are often conflated with words, and defining what is and what is not a word, from either a linguistic or a computer's point of view, is problematic. Should a contraction be split into a separate word? Should ice cream be two tokens or one?
But I’m going to keep this simple: we’ll define a token (or word) in the simplest way possible, and we’ll use regular expressions to do it.
About Regular Expressions
Regular expressions (often shortened to regexes) are a mini-language used inside a full-fledged programming language or other tool to specify how to match text. For example, you can create a regex that says that you want to match one or more a characters (a+) and use that pattern to find or replace a group of a’s in a string.
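Here is a rough sketch of that a+ example in Clojure, assuming a current release and the standard clojure.string library. The name a-run and the sample string are made up for illustration.

```clojure
(require '[clojure.string :as string])

;; a+ matches one or more consecutive a characters.
(def a-run #"a+")

;; Find every run of a's in the string.
(println (re-seq a-run "baaad idea"))
;; prints (aaa a)

;; Replace each run of a's with a single a.
(println (string/replace "baaad idea" a-run "a"))
;; prints bad idea
```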
Regexes also have short-cut ways to indicate classes of characters. For instance, the class \w matches any alphanumeric character or underscore.
Underscore? As it turns out, when regexes say “word characters,” you should usually add “as defined by the programming language C.” Identifiers in C can contain non-accented ASCII letters, digits, and, you guessed it, underscores.
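If you want to see for yourself which characters count as “word” characters, something like the following should do it. This is only a sketch: word-char? is a helper name invented here, and it assumes the default behavior of the Java regex engine underneath Clojure.

```clojure
;; Hypothetical helper: true when the single character c matches \w.
(defn word-char? [c]
  (boolean (re-matches #"\w" (str c))))

;; Letters, digits, and the underscore match; the hyphen and the space do not.
(println (map word-char? [\a \Z \7 \_ \- \space]))
;; prints (true true true true false false)
```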
Problems
\w is not ideal for tokenizing. It doesn’t handle any letters that aren’t used in American English; it doesn’t include apostrophes; and it does include underscores. But as I said, I’m trying to keep this simple. Later, you can muck things up with theoretically pure tokens that actually make sense.
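To make those three problems concrete, here is a small sketch. It assumes a current Clojure release (where a regex literal takes a single backslash) and Java’s default regex settings, in which \w covers only ASCII letters, digits, and the underscore. The sample words are my own.

```clojure
;; The apostrophe is not a word character, so the contraction splits in two.
(println (re-seq #"\w+" "don't"))
;; prints (don t)

;; The accented e (written here as the escape \u00e9) is dropped.
(println (re-seq #"\w+" "caf\u00e9"))
;; prints (caf)

;; The underscore is a word character, so this stays one token.
(println (re-seq #"\w+" "snake_case"))
;; prints (snake_case)
```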
Enough Talk Already!
Let’s fire up Clojure with the scripts we created in the last posting and see what damage we can cause.
Clojure
user=>
Clojure has a simple syntax for creating regexes. To make a regular expression to find groups of “word” characters, just put double quotes around the pattern and put a hash mark (#) in front of it. In very early builds of Clojure you had to double the backslash, as you would inside an ordinary string; in current versions the regex literal passes a single backslash straight through to the regex pattern:
user=> #"\w+"
#"\w+"
That creates the regex. To use it, we’ll pass both it and a string to the re-seq function.
user=> (re-seq #"\w+" "This string contains some tokens. Yipee!")
("This" "string" "contains" "some" "tokens" "Yipee")
(Don’t worry about how functions are called. Just type in what you see above. I’ll explain what’s happening in the next few postings.)
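If the backslash business seems murky, this sketch may help. It assumes a current Clojure release, where a regex literal takes a single backslash, and it uses re-pattern, which builds a regex from an ordinary string, where the backslash still has to be doubled. The names word-literal and word-from-string are invented for the example.

```clojure
;; A regex literal: the reader hands \w straight to the regex engine.
(def word-literal #"\w+")

;; The same pattern built from an ordinary string with re-pattern.
;; Inside a string the backslash is an escape character, so it is doubled.
(def word-from-string (re-pattern "\\w+"))

;; Both hold the same pattern text.
(println (= (str word-literal) (str word-from-string)))
;; prints true

;; Either one can be handed to re-seq.
(println (re-seq word-from-string "This string contains some tokens. Yipee!"))
;; prints (This string contains some tokens Yipee)
```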
That’s it. We’ve got the basics of how to tokenize a string down. Next, we’ll learn how to package it up so we know we’ll always be tokenizing consistently.
More about Regular Expressions
This is as much as I’m going to cover about regular expressions. That doesn’t mean that regular expressions aren’t insanely useful: they’re just also insanely complicated. If you do much text processing, however, you'll want to get intimate with all their useful craziness.
Here are some resources for learning regular expressions:
Java Tutorials: Regular Expressions This is a Java-oriented tutorial on regexes, so its examples should translate exactly to the engine underlying Clojure’s regexes.
Regular Expressions Tutorial This is a good overview that focuses on Perl 5 regular expressions, which are used in many modern programming languages, including Java and therefore Clojure.
Regular Expression HOWTO This focuses on Python's implementation of regular expressions, which is also very similar to Perl 5 regexes.
Learning to Use Regular Expressions This again uses Python as the reference.
Regular Expressions: A Simple User Guide This is an introduction to the regular expressions used in Apache, PHP4, JavaScript, Vim, Emacs, and other tools. It’s a little different from Perl-style regexes, I think.
8 comments:
Great tutorials... You no longer need to escape the backslashes (double them) in the latest build of Clojure.
Thanks. I know. I need to go through and update all the code here to work with the latest version of Clojure. For instance, things like doseq and if-let now wrap their "assignment" clauses with a vector.
Eric
Indeed you need to change code here and there. Thanks for your effort by the way. I have learned a lot reading your posts! Clojure rocks, and more so because of your writings...
I suggest you update your code -- I tried the \\w+ RE and it failed. But thanks to Tunde \w+ works.
Good site, mind.
The first clojure tutorial that finally went "click".
THANK YOU!
Great! Glad to hear it.
thanks, dude. great info!!
Hello Eric..
Nice articles
Could you also make the table of contents visible on each page.
thanks