Monday, June 30, 2008

Tokenization: Part 7, File Reading

So far, we’ve only tokenized data from strings that we’ve typed in. Today, let’s learn how to tokenize the text in a file.

slurp

The easiest way to read in the data from a file is using the slurp function. It takes a file name and returns the contents of that file as a string:

user=> (slurp "LICENSE.txt")
"Copyright (c) 2008, Eric Rochester\nAll ...."

We can then pass that string to tokenize-str to get the list of tokens in the file:

user=> (take 10 (word/tokenize-str (slurp "LICENSE.txt")))
("copyright" "c" "2008" "eric" "rochester" "all" "rights" "reserved" "redistribution" "and")

(take pulls the first n items from a list and returns them.)
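For instance, taking the first three items of a five-item list:

```clojure
user=> (take 3 '("a" "b" "c" "d" "e"))
("a" "b" "c")
```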

tokenize

Of course, it would be nice to have this wrapped into its own function, so let’s add this into word.clj after tokenize-str:

(defn tokenize
  ([filename]
   (tokenize-str (slurp filename)))
  ([filename stop-word?]
   (tokenize-str (slurp filename) stop-word?)))

This function works just like tokenize-str, except it takes a filename and returns the tokens in that file.
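After calling (load-file "word.clj") to pick up the new function, both versions can be called directly. The output below assumes the same LICENSE.txt as above:

```clojure
user=> (take 5 (word/tokenize "LICENSE.txt"))
("copyright" "c" "2008" "eric" "rochester")
user=> (take 5 (word/tokenize "LICENSE.txt" word/stop-words))
("copyright" "c" "2008" "eric" "rochester")
```

(The two calls return the same first five tokens because none of those particular tokens happen to be stop words.)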

Warning!

There’s one big problem with tokenize. Try to tokenize a file that’s several hundred gigabytes in size, and you’ll probably find that problem quickly: it reads the entire file into memory all at once.

There are ways to deal with this problem, such as reading the file a line at a time and tokenizing each line as it comes in, and Clojure actually makes it relatively easy to do this. However, to keep things simple, at this point we’re going to leave tokenize the way it is.
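For the curious, here is a sketch of what the line-at-a-time approach could look like. This isn’t part of word.clj, and it uses a couple of things we haven’t covered yet, like creating Java objects with new and the line-seq and mapcat functions:

```clojure
(defn tokenize-lines [filename]
  ;; line-seq returns the file's lines one at a time, and mapcat
  ;; tokenizes each line and splices the results into one sequence,
  ;; so the whole file never has to be in memory at once.
  (mapcat tokenize-str
          (line-seq (new java.io.BufferedReader
                         (new java.io.FileReader filename)))))
```

(This sketch also glosses over closing the file when we’re done; take it as an illustration of the idea, not finished code.)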

At this point, just keep this limitation in mind. You have been warned.

Friday, June 27, 2008

Tokenization, Part 6: Function Overloading

In the last posting, we added the ability to filter a list of input tokens with a list of stop words, usually words that are so common that they aren’t interesting for many types of analysis. But because we don’t want to use a stop list every time, we didn’t add such filtering to the tokenize-str function. Today, we’ll do that.

Function Overloading

Clojure allows what can be called function overloading. Basically, you can have several different versions of the same function all assigned to the same variable name. Each version has to have a different number of arguments. (The number of arguments is sometimes referred to as the function’s arity, so each different version of the function has to have a different arity.) Since the overloaded functions have different argument lists, Clojure can tell which version to use when you call that function.

(Actually, calling this function overloading is misleading. That makes it sound as if we’re assigning multiple function objects to the same variable. The overloading really happens inside a function object, and a single function object is assigned to a variable.)

For reference, here is the current version of tokenize-str and an example of using it and filtering with a stop words list:

(defn tokenize-str [input-string]
  (map to-lower-case (re-seq token-regex input-string)))

(filter (complement stop-words) (tokenize-str "Input string here."))

To define tokenize-str as an overloaded function, wrap each set of arguments and the function body that belongs with it in parentheses:

(defn tokenize-str
  ([input-string]
   (map to-lower-case (re-seq token-regex input-string))))

Next add another set of arguments plus function body on the same parentheses-level as the one already there:

(defn tokenize-str
  ;; This is the first, original version of the function.
  ([input-string]
   (map to-lower-case (re-seq token-regex input-string)))
  ;; This is the second, overloaded version of the function.
  ([input-string stop-word?]
   (filter (complement stop-word?)
           (map to-lower-case (re-seq token-regex input-string)))))

There are a couple of things that are new here. Let’s break them down.

First, I’ve included a couple of comments. When Clojure reads a semicolon (";"), it ignores everything to the end of the line. Comments are there for you to read.

Second, I’ve included a new overload of the function that takes two arguments (arity 2, also written tokenize-str/2). The second argument is a function that returns true if a token is a stop word. filter then calls the complement of this function on what is essentially the body of the original version of tokenize-str (tokenize-str/1).

(Hint: It can be difficult to read functional programming. Sometimes it’s easier to read from right to left. First, look at (re-seq ...), then read leftward to (map ...) and (filter ...).)

So how do we use this? Call (load-file "word.clj") to load the most recent code, and let’s run the example from yesterday with this function.

user=> (word/tokenize-str "The cat in the hat.")
("the" "cat" "in" "the" "hat")
user=> (word/tokenize-str "The cat in the hat." word/stop-words)
("cat" "hat")

Recursive Functions

This works, but it’s a little messy. What if we want to change how we tokenize at the most basic level? We’ll need to change our code in two places, and that’s twice as many places to make an error.

It would be useful if we could call the first version of the function from inside the second version, and we can. Just call it the way we would from any other place in the code. It would look like this:

(defn tokenize-str
  ;; This is the first, original version of the function.
  ([input-string]
   (map to-lower-case (re-seq token-regex input-string)))
  ;; This is the second, overloaded version of the function.
  ([input-string stop-word?]
   (filter (complement stop-word?) (tokenize-str input-string))))

Now, if we want to modify the way we tokenize, we only need to do it once, in tokenize-str/1 (the original version of the function). tokenize-str/2 only implements stop-word filtering.

Of course, a function calling itself can be dangerous: if you’re not careful, it could end up calling itself forever. Whenever we have a function call itself, we need to make sure that there is a path through the function that will eventually be taken and that does not call the function again. In this case, the first version acts as that path: if we call the second version, it calls the first version, which simply returns a list of tokens.
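To see the shape of a safe recursive function, here is a small example that isn’t part of word.clj: counting down from n to zero. The zero? test is the path that doesn’t call the function again:

```clojure
(defn count-down [n]
  (if (zero? n)
    '(0)                             ; base case: stop recursing
    (cons n (count-down (dec n)))))  ; recursive case: n, then recurse
```

Calling (count-down 3) returns (3 2 1 0): each call puts its n on the front of the list built by the next call down, until zero? stops the recursion.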

Recursion can be a little mind-bending. Don’t worry about that at this point. We’ll see recursion again many times, and soon it will become an old friend.

A little strange, but an old friend nevertheless.

So far, we’ve only been tokenizing strings that we type in. Of course, in practice we’ll want to read data in from a file. Next time, I’ll talk about how to do that.

Thursday, June 26, 2008

Tokenization, Part 5: Stop Words

Stop Words

For many types of analyses, you need to hang onto every token that comes through the pipeline. For other types, however, you may want to focus on words that carry meaning: names, nouns, and verbs. Plus, words like the, of, and and occur more than any other words in English, and if you aren’t going to be using them, why keep track of them? For example, in a million-word corpus (collection of texts) I have lying around, the alone accounts for 6.8% of the total number of tokens. The twelve most frequent tokens (see below) make up 25.6% of the total. Considering that you’ll often want to be dealing with huge amounts of text, dropping them can free up a tremendous amount of resources!

(I mentioned this above, but I want to reiterate: contrary to expectations, a lot of interesting linguistic stuff is happening with those frequent but “unimportant” items. If you’re interested in them or think you might be, you should probably forgo a stop list, bite the bullet, and process everything.)

To filter out those most-frequent tokens, many NLP applications use a list of stop words. Anything on the list is removed from the stream of input tokens before any further processing is done.

Which and how many stop words to use depends on the analysis you’re going to do. For the purposes of illustration, we’re going to use a list of the 12 most frequent tokens in the corpus I mentioned earlier. Here they are with their frequencies:

the 69969
of 36472
and 28935
to 26191
a 23529
in 21422
that 10789
is 10101
was 9815
he 9795
for 9498
it 9094

Sets

We’ll keep the stop words in a Clojure set. Sets allow you to store items, but no duplicates. You can create a set using the set function or by using a set literal, which is a hash mark and the items in the list surrounded by curly braces.

Open word.clj in your editor and add this near the top of the file, perhaps just after the definition of token-regex:

(def stop-words
  #{"a" "in" "that" "for" "was" "is"
    "it" "the" "of" "and" "to" "he"})

Sets as Functions

So far, we’ve only used functions as functions. That is, every time we’ve called something, it’s been a function object, like map or re-seq. But Clojure also allows some other data types to act like functions. A set is one of those. Used as a function, a set takes one argument and tests whether that argument is in the set.

For example, call (load-file "word.clj") to update your REPL and try this:

user=> word/stop-words
#{"a" "in" "that" "for" "was" "is" "it" "the" "of" "and" "to" "he"}
user=> (word/stop-words "clojure")
nil
user=> (word/stop-words "was")    
"was"

You can see that when we call stop-words with a word that isn’t in the set, it returns nil. nil is just a special value in Clojure and other lisps that means nothing (it’s like None in Python or null in Java), and it always tests false. If the word is in the set, like "was", it returns that item, which will test true.
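You can see the true/false behavior directly by using the set inside an if:

```clojure
user=> (if (word/stop-words "clojure") "stop word" "not a stop word")
"not a stop word"
user=> (if (word/stop-words "was") "stop word" "not a stop word")
"stop word"
```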

Great, now we have a list of the stop words and a function that tells whether a given word is a stop word, all in one object. Let’s put it to use.

Filtering Tokens

To filter out the stop words, we’ll use the filter function. It takes a predicate (a function of one argument that returns a true or false value) and a list. It calls the predicate on every item in the list and returns a new list made up of those items from the original list for which the predicate returned true.

This may make more sense with an example:

user=> (filter pos? '(-2 -1 0 1 2))
(1 2)

pos? is a predicate that returns true if a value is positive. You can see that filter here returns all the values in the original list that are positive, that is, for which pos? returned true.

Let’s try this with stop-words:

user=> (filter word/stop-words '("the" "cat" "in" "the" "hat"))
("the" "in" "the")

Hmm. That returns all the tokens that are stop words: the exact opposite of what we want. We need something that will return the opposite of what stop-words returns.

Fortunately for us, Clojure defines such a function: complement takes a function and returns a new one that always returns the logical opposite of the original function. Where the first function returns true, the new function returns false, and vice versa.
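For instance, here is the complement of pos? from the earlier filter example:

```clojure
user=> ((complement pos?) 1)
false
user=> ((complement pos?) -1)
true
```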

Let’s try the example above again, this time using the complement of stop-words:

user=> (filter (complement word/stop-words) '("the" "cat" "in" "the" "hat"))
("cat" "hat")

Exactly.

As I mentioned before, you won’t always want to use a list of stop words, so we won’t immediately make it part of the tokenize-str function. However, in the next posting, I’ll show you how to add a set of stop words as an optional argument to tokenize-str.

Wednesday, June 25, 2008

Tokenization, Part 4: Organization

Improved Tokenization

As I’ve already said, the way we’re tokenizing is lacking. One major problem at this point is that "The" and "the" are returned as two separate tokens. To take care of that, we need to convert all tokens to lower-case as they’re processed.

That’s a two-step process: first, convert a token to lower case; second, apply that conversion to each token as it’s read in.

How do we do this?

Java Interop

In Clojure, strings are the same as Java strings: a Clojure string takes the same methods as a Java string, and like a Java string, a Clojure string is immutable: once you’ve created it, you can’t change it.

So now the question is how do we call Java methods?

In fact, it’s fairly simple: to call the length method on a string, use .length (a dot and the method name) as the function to call, and pass the string as the first argument to that function. Any other arguments are passed in after the object:

user=> (def a-string "My name")
#'user/a-string
user=> (.length a-string)
7
user=> (.substring a-string 3)
"name"
user=> (.substring a-string 3 5)
"na"

This also works for calling static methods: just pass in the name of the class instead of a class instance as the first argument. For example, the first line below calls the static method Runtime.getRuntime():

user=> (def rt (.getRuntime Runtime))
#'user/rt
user=> (.freeMemory rt)
30462304

(Runtime gets automatically imported from java.lang. Later I’ll show you how to import other Java classes and how to create instances of them.)

Java strings have a toLowerCase method, which will do exactly what we want. The only problem is that Java methods aren’t the same as Clojure functions, so we’ll need to wrap the method call in a function. Open the word.clj file you’ve been working on and add this before tokenize-str:

(defn to-lower-case [token-string]
  (.toLowerCase token-string))

That creates a function called to-lower-case, which just wraps a method call to String.toLowerCase.

Higher-Order Functions

Now we need to solve the second part of the problem: applying to-lower-case to every token as we create it.

Clojure, like all lisps, Python, Ruby, and many other computer languages, treats functions as objects in their own right. That means you can take a function, assign it to a variable (which is what defn does), and pass it around as an argument. Plus, Clojure has a number of functions, called higher-order functions, that take functions as arguments and do interesting things with them, either creating new functions based on the original function or applying that function to a set of data.

map is one of those functions. It takes another function and a sequence, and it applies the function to every item in the sequence. Finally, it returns a new sequence containing the results of applying the function to each item in the input sequence. For example, let’s apply to-lower-case to a sequence of strings (make sure you call (load-file "word.clj") so that to-lower-case is defined in the REPL):

user=> (map to-lower-case '("This" "IS" "my" "name"))
("this" "is" "my" "name")

Now let’s combine map with what’s already in tokenize-str to create a new version of the function that converts all tokens to lower-case:

(defn tokenize-str [input-string]
  (map to-lower-case (re-seq token-regex input-string)))

Call (load-file "word.clj") again and test the new version of tokenize-str:

user=> (tokenize-str "This is a LIST OF TOKENS.")
("this" "is" "a" "list" "of" "tokens")

There we go: all lower-case tokens.

Sequence Literals

In the first map example above, I included something that we haven’t seen before: sequence literals. Sequences, or lists, are a big part of any lisp, including Clojure.

A list in Clojure is printed the way it is in all lisps: a space-delimited list of items surrounded by parentheses. That looks just like a function call, which is awkward: it means that if you just type in a list, Clojure will try to call it like a function:

user=> ("my" "name")
java.lang.ClassCastException: java.lang.String cannot be cast to clojure.lang.IFn

That’s a poor way of saying that you can’t treat a string like it’s a function.

To type the list and have Clojure recognize it as a list, you have to quote it: put a single-quote character in front of the list:

user=> '("my" "name")
("my" "name")

Organizing Your Code

So far we haven’t had any problem with variable names clashing with each other. But if we start using different libraries and files written by different people, we could easily run across several that use the same variable name for different things.

How do we get around that? We need some way to keep all those names separate.

Namespaces

Clojure uses namespaces to keep variable names from clashing. (These are different than Java’s packages, if you’re familiar with those.) By default, all of the built-in functions are in the clojure namespace. When you’re in the REPL, everything is in the user namespace. Remember what the REPL prompt looks like?

user=>

The user at the beginning of the prompt tells which namespace we’re currently working in. Clojure indicates which namespace a variable is in by printing the namespace and a forward slash before the variable name. For example, the user/tokenize-str that is printed after loading word.clj indicates that the last variable read in was tokenize-str in the user namespace.

Here we want to define everything we’re working on in a namespace called word. To do that, just add these lines to the top of word.clj:

(in-ns 'word)
(clojure/refer 'clojure)

The first line creates a new namespace, word, and uses that to contain all the variables defined in the rest of the file.

Immediately after the first line, we can’t reference any of the functions that Clojure provides by their short names. The second line fixes that: the (clojure/refer 'clojure) call makes everything in the clojure namespace available in the word namespace. Notice that clojure/refer itself references the refer variable in the clojure namespace: even though we can’t access Clojure’s built-ins directly at this point, we can still reference them by their full (namespace plus name) names.

(As an aside, notice that in the second line, the second clojure is quoted. That’s because symbols can either be variables or symbol objects in their own rights. Without the quote, Clojure thinks that the symbol is a variable; with the quote, it reads the second clojure as a symbol, which is what refer wants.)

Now, quit Clojure, go back in, and re-load the file. (We want to start with a clear slate; otherwise, the old function definitions will still be hanging around to confuse us.)

user=> (load-file "word.clj")                    
#'word/tokenize-str

The first thing to notice is that the variable returned at the end has a new namespace prefix, word. Let’s try to use it:

user=> (tokenize-str "This is a String.")
java.lang.Exception: Unable to resolve symbol: tokenize-str in this context

Oops. No tokenize-str is found. We have to tell Clojure to look in the word namespace:

user=> (word/tokenize-str "This is a String.")
("this" "is" "a" "string")

Remember to check out the new code from this posting in the Google Code project for word-clj.

We’ve covered a lot—probably too much—for today. Tomorrow, I’ll tackle just one topic: we’ll add stop-list filtering to our tokenizing.

Tuesday, June 24, 2008

Tokenization, Part 3: Functions


Functions

In the last post, we saved the regular expression that we used to tokenize a string to a variable. But it would be more convenient to be able to save the entire tokenization procedure to a variable. Pretty much all programming languages let us save a series of statements or expressions—a function—to evaluate later. How does Clojure do this?

In fact, creating a function looks a lot like creating a variable. First, start Clojure and make sure that token-regex is still defined:

user=> (def token-regex #"\\w+")
#'user/token-regex

Next, define the function, only instead of using def, use defn:

user=> (defn tokenize-str [input-string]
(re-seq token-regex input-string))
#'user/tokenize-str

Let’s break that apart:

  1. defn indicates that we’re defining a function, not a variable.
  2. tokenize-str is the name of the function. Functions and variables use the same set of names, so naming a variable tokenize-str will get rid of the function named tokenize-str, and vice versa.
  3. [input-string] is a square-bracket-delimited list of the parameters that this function accepts. In the case of tokenize-str, it takes one argument, named input-string. Expressions inside the function can refer to the value passed into the function using that name.
  4. After you type in that line and hit enter, nothing will happen. The first parenthesis before defn is still open, so the Clojure REPL knows you’re not finished yet. You’ll need to enter the second line to continue.
  5. The second line is just the re-seq function with both arguments as variables, like we used in the last posting. One variable is the regular expression from the previous def, and one is input-string from the function definition.
  6. Functions return the value of their last expression. In this case, that is the function call to re-seq.

Now let’s give it a try:

user=> (tokenize-str "This is a new input string with different tokens.")
("This" "is" "a" "new" "input" "string" "with" "different" "tokens")

Sure enough. Now calling (tokenize-str ...) is the same as calling (re-seq token-regex ...).

Saving Your Work

We’re starting to get enough code that typing it in every time we want to use it would be painful, inefficient, and worst of all, boring. Fortunately, like most other programming languages, Clojure lets us save expressions to a file to execute all at once.

To do this, open your text editor and create a new file. Let’s call it word.clj and save it in whatever directory you’re currently working in. Next enter in all the code we’ve entered so far:

(def token-regex #"\\w+")
(defn tokenize-str [input-string]
  (re-seq token-regex input-string))

Now switch back to the Clojure REPL and load this file using the load-file function:

user=> (load-file "word.clj")
#'user/tokenize-str

After loading the file, Clojure prints the result of the last expression in the file. In this case, that is the expression defining the tokenize-str function.

We can use the variables and functions defined in that file, just as if we had typed them into the REPL:

user=> (tokenize-str "Another input string.")
("Another" "input" "string")

I’ve set up a Google Code Project for the code in this series at http://code.google.com/p/word-clj/. As we go along, I’ll update the code in step with the postings here.

Also, if you find any bugs, you can let me know using the issues tracker there.


Next time we’ll improve the tokenization and talk about how to organize our code better.

Monday, June 23, 2008


Tokenization, Part 2: Literals and Variables

Literals

In the last posting, I showed how to tokenize a string into separate tokens/words. For reference, the code to do this (in the Clojure REPL), was:

user=> (re-seq #"\\w+" "This string contains some tokens. Yipee!")
("This" "string" "contains" "some" "tokens" "Yipee")

In the snippet above, both the regular expression and the input string are literals: a direct representation of a value. #"\\w+" is a literal regular expression, and "This string contains some tokens. Yipee!" is a string literal. Some other literal expressions in Clojure are:

String: "This is a string."
Regular expression: #"\\w+"
Character: \a
Integer: 42
Float: 3.1415

Clojure also has literal expressions for a variety of other types of data. I’ll introduce them as we need them.

Remembering

Remembering the regular expression we were using above—#"\\w+"—isn’t a big deal. A more complicated expression may be difficult to read, much less to recall or to type correctly all the time. Fortunately, we don’t have to remember it. Instead, we can use variables. Clojure creates a variable and assigns it a value with a def expression:

user=> (def token-regex #"\\w+")
#'user/token-regex

Now, whenever we use the name token-regex, Clojure substitutes the value #"\\w+" instead:

user=> (re-seq token-regex "This string contains some tokens. Yipee!")
("This" "string" "contains" "some" "tokens" "Yipee")

For that matter, if we’re going to be using that input string frequently, we can assign it to a variable also:

user=> (def input-string "This string contains some tokens. Yipee!")  
#'user/input-string
user=> (re-seq token-regex input-string)                              
("This" "string" "contains" "some" "tokens" "Yipee")

Benefits

Using variables allows us a couple of advantages:

  1. We can name things according to what their function in the program is, not by their value.
  2. We can reduce errors by removing duplication. Since we aren't typing in the value by hand every time, we have fewer opportunities to type it in wrong.
  3. For data that may take up a lot of memory, we only need to use up that memory once, and then we can refer to that single instance of the data as many times as we need to.

Caveats

If you’re coming to Clojure from another, non-functional programming language (as I assume you are), be warned: variables in Clojure are immutable. You can’t change them, and there is no assignment operator like the = of many languages. (You can call def again on the same variable name, but technically that’s not assigning a new value to the old variable, and in general, you don’t re-def a variable in Clojure.)

As we’ll see later, immutability is a good thing in Clojure. But it does take some getting used to.
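Here’s a quick illustration (don’t worry about the quoted-list syntax for now; I’ll explain it in a later posting). Functions that “add” to a value actually return a new value and leave the original untouched:

```clojure
user=> (def numbers '(1 2 3))
#'user/numbers
user=> (cons 0 numbers)   ; cons puts an item on the front of a list
(0 1 2 3)
user=> numbers            ; the original list hasn't changed
(1 2 3)
```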


Next, we’ll look at how to wrap up the entire protocol for tokenization into its own variable.

Sunday, June 22, 2008


Tokenization, Part 1: Regular Expressions

About Tokenization

Tokenization is the process of splitting a string of characters into tokens: useful sequences of characters that all belong together. When analyzing natural language, tokens are often conflated with words, and defining what is and what is not a word, from either a linguistic or a computer's point of view, is problematic. Should a contraction be split into a separate word? Should ice cream be two tokens or one?

But I’m going to keep this simple, so we're going to define a token/word in the simplest way possible, and we're going to use regular expressions to do it.

About Regular Expressions

Regular expressions (often shortened to regexes) are a mini-language used inside a full-fledged programming language or other tool to specify how to match text. For example, you can create a regex that says that you want to match one or more a characters (a+) and use that pattern to find or replace a group of a’s in a string.

Regexes also have short-cut ways to indicate classes of characters. For instance, the class \w matches any alphanumeric character or underscore.

Underscore? As it turns out, when regexes say “word characters,” you should usually add “as defined by the programming language C.” Words in C can contain non-accented ASCII letters, numbers, and—you guessed it—underscores.

Problems

\w is not ideal for tokenizing. It doesn't handle any letters that aren’t used in American English; it doesn’t include apostrophes; and it does include underscores. But as I said, I’m trying to keep this simple. Later, you can muck things up with theoretically pure tokens that actually make sense.

Enough Talk Already!

Let’s fire up Clojure with the scripts we created in the last posting and see what damage we can cause.

Clojure
user=>

Clojure has a simple syntax for creating regexes. To make a regular expression that finds groups of “word” characters, just put double quotes around the pattern and a hash mark (#) in front of it. Also, the backslash has special meaning in Clojure strings, so you’ll need to double it so that Clojure passes a single backslash through to the regex pattern:

user=> #"\\w+"
\w+

That creates the regex. To use it, we’ll pass both it and a string to the re-seq function.

user=> (re-seq #"\\w+" "This string contains some tokens. Yipee!")
("This" "string" "contains" "some" "tokens" "Yipee")

(Don’t worry about how functions are called. Just type in what you see above. I’ll explain what’s happening in the next few postings.)

That’s it. We’ve got the basics of how to tokenize a string down. Next, we’ll learn how to package it up so we know we’ll always be tokenizing consistently.
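In fact, you can already see the limitations mentioned above. An apostrophe splits a word in two:

```clojure
user=> (re-seq #"\\w+" "Don't panic!")
("Don" "t" "panic")
```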

More about Regular Expressions

This is as much as I’m going to cover about regular expressions. That doesn’t mean that regular expressions aren’t insanely useful: they’re just also insanely complicated. If you do much text processing, however, you'll want to get intimate with all their useful craziness.

Here are some resources for learning regular expressions:

Friday, June 20, 2008

Setting Up Clojure

(I added another reason why I like Clojure to my last post: community, reason #1 under "Why Clojure?". Check it out.)

Before we get started, we need to install Clojure and set up our environment to make the best use of it. This will involve installing Java, installing Clojure, creating some scripts to make calling Clojure easier, and setting up a text editor.

Below, I'll sketch out one quick way to get set up. There's more information on the Getting Started page and on the Clojure Wiki.

(These instructions are a little higher-level than I would like, for two reasons. First, how to do these things depends on your operating system. Second, if you have much computer experience, you probably already know how to do the tasks I gloss over below. But you may not. Grab a friend.)

Java

First, you may need to install a Java Runtime Environment. You may already have this installed, and if so, skip to the next section.

If you don't have Java installed or aren't sure whether you do, go to http://java.sun.com/javase/downloads/ and download the latest release of Java SE 6. Once you've downloaded it, double-click the installer to run it and install Java.

Depending on which installer you downloaded, at this point you'll either need to wait for Java to install or wait for it to download, then install. Either way, be patient.

Clojure

(If you're comfortable living on the edge and compiling Java with Maven, get the development code for Clojure from SVN and install that. I've been using the SVN version from the beginning, and it hasn't given me any trouble.)

Now you're ready to install Clojure. Go to its Sourceforge download page and select the latest release (20080612 at the time of this posting). Save it on your disk someplace you'll remember.

Now create a new directory named Clojure and unzip the file you downloaded into it. In that directory will be a file named clojure.jar. This contains the Clojure system.

Scripts

The Clojure JAR file isn't executable itself, so you'll want a script, which you can call from the command-line or double-click on, to run Clojure.

The easiest way to do this is to use the directory named Clojure, which contains clojure.jar, as your working directory. Under UNIX/Mac OS-X, create a file in Clojure named clj and add these lines:

#!/bin/bash
java -cp clojure.jar clojure.lang.Repl

Of course, you'll need to set the permissions on this file to make it executable.

On Windows, create a file in Clojure named clj.bat and add these lines:

@ECHO OFF
java -cp clojure.jar clojure.lang.Repl

(Both of these assume that java is in your PATH. If it's not, you'll need to add it.)

Your start-up script can be a lot more complicated, of course. The Clojure Wiki has one that loads other libraries and can also be used to run scripts written in Clojure.

Double-click on or execute the file you created above and you should get a console window with this:

Clojure
user=>

Congratulations! Hit Ctrl-D to exit.

(Using the Clojure directory as your working directory is a good way to get started quickly, but its long-term merit is debatable. Set up the script so you can work from another directory. If that means nothing to you, find a friend who knows Java, and they can help.)

Editors

Because of the variety of editors out there, I'm punting on this one. The Clojure Wiki has information on setting up Emacs and Vim, and there is also enclojure, a plugin for the NetBeans IDE.

So far, I've been using Vim, and I haven't had any complaints.

Tuesday, June 17, 2008

Introduction to Clojure

What is Clojure? The rest of this series will answer that question in an easy-to-understand way. If that's good enough for you, skip down to Why Clojure?, where I talk about what I like about Clojure. If you want the shorter, more difficult explanation, though, wade through this: In one jargon-heavy sentence, Clojure is a general-purpose, dynamic, compiled lisp that targets the Java Virtual Machine and that provides strong facilities for functional and concurrent programming. Pretty dense. What does it mean?
  • General Purpose It's suited to a wide variety of tasks. You might consider it for almost anything you'd use Java for.
  • Dynamic It shares many features, like dynamic typing, with Perl, Python, Ruby, and other agile languages. These traits make it easy to write programs quickly.
  • Lisp It's in the Lisp family of languages, and it shares much of the philosophy behind them. For example, Clojure programs are written in Clojure data structures, which enables a powerful macro facility.
  • Compiled Although Clojure is dynamic, it is not interpreted. Instead, each expression and function is immediately compiled to Java bytecode. Often functions written in Clojure run just as fast as their Java equivalents, especially when you give the compiler hints about what kind of values your variables refer to.
  • Java Virtual Machine Clojure runs on the JVM, and it can use the standard Java library, as well as all of the third-party libraries available for Java.
  • Functional Although you can access mutable Java types, Clojure has an array of immutable, fast data structures suited to functional programming. This is a style of programming that emphasizes functions that don't change their input and programs that are composed of many small functions, glued together.
  • Concurrent Clojure includes a robust implementation of software transactional memory (STM) and reactive Agents. When combined with functional programming, these systems make it easy to write clean, correct concurrent programs.
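Two of these points, immutable data structures and STM, can be seen in a few lines at the REPL. This is just a sketch; the names numbers and counter are made up for illustration:

```clojure
;; Immutable data: conj returns a new vector and leaves the original alone.
(def numbers [1 2 3])
(def more-numbers (conj numbers 4))   ; more-numbers is [1 2 3 4]
                                      ; numbers is still [1 2 3]

;; STM: a ref can only be changed inside a dosync transaction.
(def counter (ref 0))
(dosync (alter counter inc))          ; @counter is now 1
```

If you try to call alter outside of a dosync, Clojure throws an exception rather than letting you mutate the ref unsafely.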
Why Clojure? For me, Clojure hits a sweet spot. I like the power and flexibility that dynamic languages—especially lisps—provide, but I also want the speed, the library, and the community of a behemoth like Java.
  • The interactive development model it shares with Python, Ruby, and especially lisp makes it an excellent platform to experiment on.
  • Unlike most current dynamic languages, it lets me take advantage of multiprocessor machines across a variety of OS platforms.
  • It gives me access to the zillion libraries available for Java, and to the JIT capabilities of the JVM.
  • It provides speed in both development and execution.
  • I can deploy it wherever the JVM is available.
In a nutshell, Clojure lets me do interesting things quickly, without having to fight with it. I think you'll like it too.

Clojure Series: Table of Contents

Here are the postings I have published so far for this series.

Monday, June 16, 2008

New Toys

So what new toys have been occupying my time? Clojure. It's a functional, compiled lisp for the Java Virtual Machine (JVM). Like any good lisp, it has macros and first-class functions; plus it also has many nice features for concurrent programming, like software transactional memory and agents. It has fast, read-only data structures designed to be used in a functional language. So far, I've had a lot of fun with it.

I've mainly been using it to work through Programming Collective Intelligence by Toby Segaran (PCI, for the rest of this posting). Whenever I'm working through a programming book, I like to re-code the examples so I understand them better. And often it's better to use a language that's not at all like the one the author uses in the book. Since the code in PCI is in Python, most popular languages would be too similar to really help me understand the algorithms Segaran's describing. But since Clojure is a functional language, I have to step back, look at what he's trying to do, and think about how to accomplish the same thing with a different, functional approach. It's been good to help me understand the book better and to help me improve my functional chops.

To further explore this fun little language, I'm going to start a long series here on this blog. Here's the plan:
  • Its target audience will be people like myself: self-taught programmers whose primary education has been in the humanities.
  • This won't be a tutorial on how to program: The audience should have a little programming experience, preferably in Perl, Python, or Ruby.
  • But you won't need to have any experience in lisp, functional programming, or concurrent programming. I'll touch on those as we go along.
  • The problems I'm going to tackle will be oriented to processing text documents and analyzing the language in them. The techniques I'll cover will be helpful to those interested in stylistics and other literary studies, or to those interested in corpus linguistics.
  • Many of the things I'll describe—such as tokenization—will be very basic and far from the cutting edge. Since this is also an introduction to Clojure, I'll cover the basics as well.
  • But there will also be some odd gaps. For example, I'll explain regular expressions just enough to implement tokenization. Then I'll point you to one of the myriad excellent online tutorials if you want to learn more about them.
  • Since one of the main points of Clojure is concurrency and parallel processing, I'll cover that also.
  • The examples will build on each other, and in the end, we'll have a system for doing parallel processing of text documents. We'll build a variety of tasks to do some standard analyses, and we'll design it so creating more tasks and inserting them into the processing stream will be relatively easy.
Well, this is long enough for today. Tomorrow I'll start with a brief introduction to Clojure and a look at why you might want to learn it.

Friday, June 13, 2008

New Focus

I was beginning to think that this blog was seriously dead. Life has just been too much in the way. But lately, I've begun to suspect there is life here after all. Besides just being busy, part of why I thought this blog might be beyond hope was that I wasn't happy with its focus. Splitting my attention here between writing and coding was giving me a headache. I have also generally written longer postings, and that had become overwhelming. So I'm going to focus on shorter postings about programming and computer toys I'm playing with. (I'll probably start a writing blog elsewhere at some point. I'll keep you posted.) I was going to include a brief note about one of those toys, but I'll stop now and save that for later.