Monday, June 16, 2008

New Toys

So what new toys have been occupying my time? Clojure. It's a functional, compiled lisp for the Java Virtual Machine (JVM). Like any good lisp, it has macros and first-class functions. It also has nice features for concurrent programming, like software transactional memory and agents, and it has fast, immutable data structures designed for use in a functional language. So far, I've had a lot of fun with it.

I've mainly been using it to work through Programming Collective Intelligence by Toby Segaran (PCI, for the rest of this post). Whenever I work through a programming book, I like to re-code the examples so I understand them better, and it often helps to use a language that's not at all like the one the author uses. Since the code in PCI is in Python, most popular languages would be too similar to really help me understand the algorithms Segaran is describing. But since Clojure is a functional language, I have to step back, look at what he's trying to do, and think about how to accomplish the same thing with a different, functional approach. That has helped me understand the book better and improve my functional chops.

To explore this fun little language further, I'm going to start a long series here on this blog. Here's the plan:
  • Its target audience will be people like myself: self-taught programmers whose primary education has been in the humanities.
  • This won't be a tutorial on how to program: The audience should have a little programming experience, preferably in Perl, Python, or Ruby.
  • But you won't need to have any experience in lisp, functional programming, or concurrent programming. I'll touch on those as we go along.
  • The problems I'm going to tackle will be oriented to processing text documents and analyzing the language in them. The techniques I'll cover will be helpful to those interested in stylistics and other literary studies, or to those interested in corpus linguistics.
  • Many of the things I'll describe, such as tokenization, will be very basic and far from the cutting edge. Since this is also an introduction to Clojure, I'll cover the basics of the language too.
  • But there will also be some odd gaps. For example, I'll explain regular expressions just enough to implement tokenization. Then I'll point you to one of the many excellent online tutorials if you want to learn more about them.
  • Since one of the main points of Clojure is concurrency and parallel processing, I'll cover those too.
  • The examples will build on each other, and in the end, we'll have a system for doing parallel processing of text documents. We'll build a variety of tasks to do some standard analyses, and we'll design it so creating more tasks and inserting them into the processing stream will be relatively easy.
Well, this is long enough for today. Tomorrow I'll start with a brief introduction to Clojure and a look at why you might want to learn it.
