Sunday, June 22, 2008


Tokenization, Part 1: Regular Expressions

About Tokenization

Tokenization is the process of splitting a string of characters into tokens: useful sequences of characters that belong together. When analyzing natural language, tokens are often conflated with words, and defining what is and is not a word, from either a linguistic or a computer’s point of view, is problematic. Should a contraction be split into separate words? Should “ice cream” be two tokens or one?

But I’m going to keep this simple: we’ll define a token/word in the simplest way possible, and we’ll use regular expressions to do it.

About Regular Expressions

Regular expressions (often shortened to regexes) are a mini-language, used inside a full-fledged programming language or other tool, for specifying how to match text. For example, you can write a regex that matches one or more a characters (a+) and use that pattern to find or replace runs of a’s in a string.
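For instance, here’s a quick preview using the Clojure REPL we’ll fire up below (re-find returns the first match of a pattern in a string):

user=> (re-find #"a+" "baaad")
"aaa"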

Regexes also have shortcut ways to indicate classes of characters. For instance, the class \w matches any alphanumeric character or underscore.

Underscore? As it turns out, when regexes say “word characters,” you should usually add “as defined by the programming language C.” Words in C can contain non-accented ASCII letters, numbers, and—you guessed it—underscores.
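To see that in action (another quick preview of the REPL session below), \w+ happily swallows the underscore in a C-style identifier:

user=> (re-seq #"\w+" "my_var = 42")
("my_var" "42")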

Problems

\w is not ideal for tokenizing. It doesn’t handle letters that aren’t used in American English; it doesn’t include apostrophes; and it does include underscores. But as I said, I’m trying to keep this simple. Later, you can muck up this simplicity with theoretically pure tokens that actually make sense.
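Here’s a taste of those problems. Clojure uses Java’s regexes, where \w matches only ASCII word characters by default, so the apostrophe splits the contraction and the accented letter is silently dropped:

user=> (re-seq #"\w+" "don't touché")
("don" "t" "touch")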

Enough Talk Already!

Let’s fire up Clojure with the scripts we created in the last posting and see what damage we can cause.

Clojure
user=>

Clojure has a simple syntax for creating regexes. To make a regular expression that finds groups of “word” characters, just put quotes around the pattern and put a hash mark (#) in front of it. Unlike in an ordinary Clojure string, the backslash has no special meaning inside a regex literal, so you can write the pattern exactly as the regex engine will see it:

user=> #"\\w+"
\w+
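By contrast, if you build a pattern from an ordinary string with the re-pattern function, the string’s escape rules apply first, so there you do have to double the backslash:

user=> (re-pattern "\\w+")
#"\w+"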

That creates the regex. To use it, we’ll pass both it and a string to the re-seq function.

user=> (re-seq #"\w+" "This string contains some tokens. Yippee!")
("This" "string" "contains" "some" "tokens" "Yippee")

(Don’t worry about how functions are called. Just type in what you see above. I’ll explain what’s happening in the next few postings.)

That’s it. We’ve got the basics of tokenizing a string down. Next, we’ll learn how to package this up so that we’re always tokenizing consistently.
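If you’d like a sneak peek, a minimal sketch of that packaging might look like the function below (the name tokenize is my own choice here, not necessarily what we’ll end up with):

user=> (defn tokenize [s] (re-seq #"\w+" s))
#'user/tokenize
user=> (tokenize "This string contains some tokens. Yippee!")
("This" "string" "contains" "some" "tokens" "Yippee")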

More about Regular Expressions

This is as much as I’m going to cover about regular expressions. That doesn’t mean that regular expressions aren’t insanely useful: they’re just also insanely complicated. If you do much text processing, however, you'll want to get intimate with all their useful craziness.


8 comments:

Anonymous said...

Great tutorials... You no longer need to escape the backslashes (double backslashes) in the latest build of Clojure.

Eric Rochester said...

Thanks. I know. I need to go through and update all the code here to work with the latest version of Clojure. For instance, things like doseq and let-if now wrap their "assignment" clauses with a vector.

Eric

Peropaal said...

Indeed you need to change code here and there. Thanks for your effort by the way. I have learned a lot reading your posts! Clojure rocks, and more so because of your writings...

Anonymous said...

I suggest you update your code -- I tried the \\w+ RE and it failed. But thanks to Tunde \w+ works.

Good site, mind.

CP said...

The first clojure tutorial that finally went "click".

THANK YOU!

Eric Rochester said...

Great! Glad to hear it.

Unknown said...

thanks, dude. great info!!

Anonymous said...

Hello Eric..
Nice articles
Could you also make the table of contents visible on each page.
thanks