Writing/Coding: Tokenization: Part 1

Tokenization, Part 1: Regular Expressions

About Tokenization

Tokenization is the process of splitting a string of characters into tokens: useful sequences of characters that belong together. When analyzing natural language, tokens are often conflated with words, and defining what is and what is not a word, from either a linguistic or a computer's point of view, is problematic. Should a contraction be split into separate words? Should "ice cream" be two tokens or one?

But I’m going to keep this simple: we'll define a token/word in the simplest way possible, and we'll use regular expressions to do it.

About Regular Expressions

Regular expressions (often shortened to regexes) are a mini-language used inside a full-fledged programming language or other tool to specify how to match text. For example, you can create a regex that says that you want to match one or more a characters (a+) and use that pattern to find or replace a group of a’s in a string.
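
Here is a rough sketch of that a+ example in Clojure. The sample strings and the names first-run and collapsed are just ones I made up for illustration; re-find and clojure.string/replace are standard Clojure functions.

```clojure
(require '[clojure.string :as string])

;; Find the first run of one or more a's in the string.
(def first-run (re-find #"a+" "baaad"))
;; => "aaa"

;; Replace every run of a's with a single a.
(def collapsed (string/replace "baaad banaana" #"a+" "a"))
;; => "bad banana"
```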

Regexes also have short-cut ways to indicate classes of characters. For instance, the class \w matches any alphanumeric character or underscore.

Underscore? As it turns out, when regexes say “word characters,” you should usually add “as defined by the programming language C.” Words in C can contain non-accented ASCII letters, numbers, and—you guessed it—underscores.
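
To see which characters count as "word characters," something like the following should work. The helper word-char? is hypothetical (it is not part of Clojure), and the test characters are arbitrary.

```clojure
;; Hypothetical helper: true if the single character c matches \w.
(defn word-char? [c]
  (boolean (re-matches #"\w" (str c))))

;; Check a letter, a capital, a digit, an underscore,
;; an apostrophe, and a hyphen, in that order.
(def char-checks (mapv word-char? "aZ7_'-"))
;; => [true true true true false false]
```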

Problems

\w is not ideal for tokenizing. It doesn't handle any letters that aren’t used in American English; it doesn’t include apostrophes; and it does include underscores. But as I said, I’m trying to keep this simple. Later, you can muck things up with theoretically pure tokens that actually make sense.
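
A small sketch makes all three problems visible at once. The sample sentence is made up, and it assumes the default behavior of the Java regex engine, where \w only covers ASCII letters, digits, and the underscore.

```clojure
;; Made-up sample input. \u00e9 is an accented e, written as an
;; escape so the example doesn't depend on the file's encoding.
(def sample "don't use snake_case in a caf\u00e9")

(def problem-tokens (re-seq #"\w+" sample))
;; => ("don" "t" "use" "snake_case" "in" "a" "caf")
;; The apostrophe splits "don't" in two, the underscore stays inside
;; "snake_case", and the accented letter falls off the last word.
```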

Enough Talk Already!

Let’s fire up Clojure with the scripts we created in the last posting and see what damage we can cause.

Clojure
user=>

Clojure has a simple syntax for creating regexes. To make a regular expression to find groups of “word” characters, just put double quotes around the pattern and put a hash mark (#) in front of it. The backslash has special meaning in ordinary Clojure strings, but inside a regex literal it is passed straight through to the regex pattern, so you don't need to double it. (Early builds of Clojure did require a doubled backslash here; current versions do not.)

user=> #"\w+"
#"\w+"

That creates the regex. To use it, we’ll pass both it and a string to the re-seq function, which returns a sequence of all the matches:

user=> (re-seq #"\w+" "This string contains some tokens. Yipee!")
("This" "string" "contains" "some" "tokens" "Yipee")

(Don’t worry about how functions are called. Just type in what you see above. I’ll explain what’s happening in the next few postings.)
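
If you're curious where the backslash does need doubling, here is a minimal sketch comparing a regex literal with a pattern built from an ordinary string. The names literal-pattern, string-pattern, same-source?, tokens, and no-tokens are just for illustration; re-pattern and re-seq are standard Clojure functions.

```clojure
;; In a regex literal, one backslash is enough.
(def literal-pattern #"\w+")

;; In an ordinary string the backslash must be doubled before
;; re-pattern compiles the string into a regex.
(def string-pattern (re-pattern "\\w+"))

;; Pattern objects don't compare equal to each other,
;; so compare their source text instead.
(def same-source? (= (str literal-pattern) (str string-pattern)))
;; => true

(def tokens
  (re-seq literal-pattern "This string contains some tokens. Yipee!"))
;; => ("This" "string" "contains" "some" "tokens" "Yipee")

;; When nothing matches, re-seq returns nil rather than an empty list.
(def no-tokens (re-seq literal-pattern "?!..."))
;; => nil
```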

That’s it. We’ve got the basics of how to tokenize a string down. Next, we’ll learn how to package it up so we know we’ll always be tokenizing consistently.

More about Regular Expressions

This is as much as I’m going to cover about regular expressions. That doesn’t mean that regular expressions aren’t insanely useful: they’re just also insanely complicated. If you do much text processing, however, you'll want to get intimate with all their useful craziness.

Here are some resources for learning regular expressions:

Java Tutorials: Regular Expressions. This is a Java-oriented tutorial on regexes, so what it covers should carry over directly to the engine underlying Clojure’s regexes.
Regular Expressions Tutorial. This is a good overview that focuses on Perl 5 regular expressions, whose syntax is followed closely by many modern programming languages, including Java and therefore Clojure.
Regular Expression HOWTO. This focuses on Python's implementation of regular expressions, which is also very similar to Perl 5 regexes.
Learning to Use Regular Expressions. This again uses Python as the reference.
Regular Expressions: A Simple User Guide. This is an introduction to the regular expressions used in Apache, PHP4, JavaScript, Vim, Emacs, and other tools. They’re a little different from Perl-style regexes, I think.

8 comments:

Anonymous said...: Great tutorials... You no longer need to escape the forward slashes (double slashes) in the latest build of clojure.; February 6, 2009 at 12:01 PM
Eric Rochester said...: Thanks. I know. I need to go through and update all the code here to work with the latest version of Clojure. For instance, things like doseq and let-if now wrap their "assignment" clauses with a vector.

Eric; February 6, 2009 at 12:21 PM
Peropaal said...: Indeed you need to change code here and there. Thanks for your effort by the way. I have learned a lot reading your posts! Clojure rocks, and more so because of your writings...; September 12, 2009 at 2:24 PM
Anonymous said...: I suggest you update your code -- I tried the \\w+ RE and it failed. But thanks to Tunde \w+ works.

Good site, mind.; January 3, 2011 at 6:19 PM
CP said...: The first clojure tutorial that finally went "click".

THANK YOU!; June 5, 2013 at 1:42 AM
Eric Rochester said...: Great! Glad to hear it.; June 5, 2013 at 8:07 AM
Unknown said...: thanks, dude. great info!!; March 19, 2014 at 10:09 AM
Anonymous said...: Hello Eric..
Nice articles
Could you also make the table of contents visible on each page.
thanks; April 21, 2015 at 1:16 AM

Sunday, June 22, 2008
