Friday, October 24, 2008

Concordances, Part 3: Positioning Tokens

Today, we're going to extend the processing to keep track of where each token appears in the original document.

Tokens Today

Currently, a token is just the string containing the token data.

Easy enough.

Tokens Tomorrow

To hold more information about the token, we'll need a richer data type. To accommodate that, here's a struct for tokens:

(defstruct token :text :raw :line :start :end :filename)

This gives us slots to hold the token's text; its original text before case normalization, stemming, or whatever; the line it occurred on; the start and end indices where it can be found on that line; and the name of the file the token was read from.
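
For example, here's roughly what one of these tokens might look like at the REPL (the values are made up for illustration):

user=> (struct-map token :text "break" :raw "Break" :line 1 :start 23 :end 28 :filename "entry.txt")
{:text "break", :raw "Break", :line 1, :start 23, :end 28, :filename "entry.txt"}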

Again, pretty simple.

Updating the Tokenization

The big changes happen in the tokenization procedure. Currently, it doesn't take lines into account.

Let's start with the highest-level functions and drill down to the lowest. First, these functions tokenize either a file or a string.

(defn split-lines [input-string]
  (.split #"\r\n|\r|\n" input-string))

(defn tokenize-str
  ([input-string]
   (tokenize-str-seq (split-lines input-string)))
  ([input-string stop-word?]
   (filter (comp (complement stop-word?) :text)
           (tokenize-str input-string))))

(defn tokenize
  ([filename]
   (with-open in (BufferedReader. (FileReader. filename))
     (doall
       (map (fn [tkn] (assoc tkn :filename filename))
            (tokenize-str-seq (line-seq in))))))
  ([filename stop-word?]
   (with-open in (BufferedReader. (FileReader. filename))
     (doall
       (map (fn [tkn] (assoc tkn :filename filename))
            (filter (comp (complement stop-word?) :text)
                    (tokenize-str-seq (line-seq in))))))))

split-lines breaks a string into lines using a regex of line endings.
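
For example (wrapping the result in seq only so the returned String array prints readably at the REPL):

user=> (seq (split-lines "one\ntwo\r\nthree"))
("one" "two" "three")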

tokenize-str uses split-lines to break its input into lines and calls tokenize-str-seq with them. The second overload then filters out any token whose text is a stop word.
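
Since Clojure sets are callable, a plain set of lowercased words can serve as the stop-word? predicate. A quick, made-up example:

user=> (map :raw (tokenize-str "A look at the data" #{"a" "an" "at" "the" "of"}))
("look" "data")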

tokenize opens a file with a java.io.BufferedReader, calls tokenize-str-seq with the file's lines (from line-seq), and sets the :filename key on each token structure. Its second overload filters out stop words the same way tokenize-str does.

doall is thrown in there because map is lazy, but with-open isn't. doall forces map to evaluate everything. Without it, with-open would close the file before its contents could be read. This is a common mistake, and it will probably bite you regularly. It does me.
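
Here's a minimal sketch of the pitfall with a pair of throwaway helpers (they're not part of the concordance code):

;; Broken: line-seq is lazy, so nothing is read until the sequence is
;; consumed -- by which point with-open has already closed the reader.
(defn lines-broken [filename]
  (with-open in (BufferedReader. (FileReader. filename))
    (line-seq in)))

;; Fixed: doall realizes the whole sequence while the reader is still open.
(defn lines-ok [filename]
  (with-open in (BufferedReader. (FileReader. filename))
    (doall (line-seq in))))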

We haven't seen tokenize-str-seq yet. What does it do?

(def token-regex #"\w+")

(defn- tokenize-str-seq
  "This tokenizes a sequence of strings."
  ([strings]
   (tokenize-str-seq strings 0))
  ([strings line-no]
   (when-first line strings
     (lazy-cat (tokenize-line line-no (re-matcher token-regex line) 0)
               (tokenize-str-seq (rest strings) (inc line-no))))))

This function tokenizes a sequence of strings. It walks through the sequence, numbering each line (line-no). For each input line, it constructs a lazy sequence by concatenating the tokens for that line (tokenize-line) with the tokens for the rest of the lines.

when-first is new. It is exactly equivalent to when plus let:

user=> (macroexpand-1 '(when-first line strings (println line)))
(clojure/when (clojure/seq strings)
  (clojure/let [line (clojure/first strings)]
    (println line)))

tokenize-line constructs a lazy sequence of the tokens in that line.

(defn- tokenize-line
  "This tokenizes a single line into a lazy sequence of tokens."
  ([line-no matcher]
   (tokenize-line line-no matcher 0))
  ([line-no matcher start]
   (when (.find matcher start)
     (lazy-cons (mk-token line-no matcher)
                (tokenize-line line-no matcher (.end matcher))))))

mk-token constructs a token struct from a regex matcher and a line number.

(defn- mk-token
  "This creates a token given a line number and regex matcher."
  [line-no matcher]
  (let [raw (.group matcher)]
    (struct token
            (.toLowerCase raw) raw
            line-no (.start matcher) (.end matcher))))

That's it. tokenize and tokenize-str create a sequence of strings of input data. Each item in the sequence is a line in the input.

tokenize-str-seq takes that input sequence and creates a lazy sequence of the tokens from the first line and the tokens from the rest of the input sequence.

tokenize-line takes a line and constructs a lazy sequence of the tokens in it, as defined by the regex held in token-regex.

Finally, mk-token constructs the token from the regex Matcher and the line number.
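
Putting it all together at the REPL looks something like this (output abridged and hand-formatted; line numbers start at zero, and :filename stays nil because we're tokenizing a plain string rather than a file):

user=> (tokenize-str "OK, that's enough of a break.\nLet's take a look.")
({:text "ok", :raw "OK", :line 0, :start 0, :end 2, :filename nil}
 {:text "that", :raw "that", :line 0, :start 4, :end 8, :filename nil}
 {:text "s", :raw "s", :line 0, :start 9, :end 10, :filename nil}
 ...)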


If you've made it this far, you've probably got Clojure up and running, but if not, Bill Clementson has a great post on how to set up Clojure+Emacs+SLIME. In the future, he'll be exploring Clojure in more detail. He's got a lot of good posts on Common Lisp and Scheme, and I'm looking forward to seeing what he does with Clojure.


I haven't really explained Clojure's laziness yet. Next, I'll talk about that.

Friday, October 17, 2008

Work in Progress, Draft 2....

Is done!

This is a SF/F novel I've been working on all year.

Is it any good? Who knows? It's certainly better than my last two efforts.

For draft three, I'm just going to decide what point of view to use. I wrote it in first person, but I rewrote the first half of draft two in third person. I'm going to read the whole thing and decide which half has more potential, and then I'll beat the other half into submission.

Then I will take a break to get some perspective on it, just in time for NaNoWriMo.

Wednesday, October 15, 2008

Blog Action Day

Lately, I've donated some time writing for a non-profit that promotes education in India to diminish poverty, hunger, child labor, illness, and other concerns. So when I saw that the topic for Blog Action Day 2008 was poverty, I thought I'd lend my voice.

When I sat down to write this entry, I looked on the Internet for some facts to throw at you.

Then I changed my mind.

In all those numbers and percent signs, it's easy to forget that poverty has a face.

It is the man who doesn't have the livestock and knowledge to get the most from his farm, and who thus cannot provide enough food for his family.

It is the single mother who doesn't have the education and resources to support her children.

It is the child who has to work in a factory to help her family survive.

The current world economic crisis makes poverty an even more pressing concern. Bill Gates and Warren Buffett will be fine. As worried as the stockbrokers and bankers are, they won't bear the full impact of the crisis. They'll be fine. I'm no Bill Gates, but I'll be fine, too, thanks for your concern.

But the economic downturn will fall first and hardest on those who already have trouble getting enough to eat each day.

I would say, "Remember them," but I don't want you to remember them. I want you to find a reputable charity, such as one of these, and give.

Now. Go. Do it.

Concordances, Part 2: Making a Plan

In the last entry, I showed what a concordance looks like. Today, I'm going to talk about the changes that I'll need to make to what we already have and about what I'll need to add to the system that's here now.

Position

The biggest underlying change is that tokens will need to know their position in the input document. The system that we're building can only handle plain text documents, so I get to pretend that position is a simple concept. But it's really not. For example, in an XML document, the position could be an XPath expression.

Even for a text document, position isn't entirely clear. Is the position of a token the byte offset where it starts in the original raw file? Or is it the token's character index in the file's text, after Unicode decoding?

For our case, I'm going to say that the position of a token is the line number and beginning and ending character in the Unicode data of the document. This is what my example yesterday had.

Input Text

To keep track of that, instead of slurping in a file's contents all at once, I'll now need to read in each line separately and keep track of the line numbers. That shouldn't be difficult, though.

Tokens

Tokens will also need to be more than just plain strings. They'll now need to be structures that keep track of their location: file names, line numbers, and starting and ending character indices.

As an added bonus, tokens will also be able to keep track of the original form of the word, as well as a normalized or a stemmed form.

Indexing

To display the words in alphabetical order and to pull all the occurrences of each word together easily, I'll need to index the tokens by word. The index won't be industrial-strength, really. It will probably bog down on very large documents. Other options would be to use Lucene or to store the index in a database of some kind. For now, though, I'll just keep things simple.
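
As a rough sketch of what that index might look like, grouping tokens into a sorted map keyed by word (the function name and details are mine and may change):

(defn- index-tokens
  "Groups a sequence of token structs into a sorted map keyed by token text."
  [tokens]
  (reduce (fn [index tkn]
            (let [word (:text tkn)]
              (assoc index word (conj (get index word []) tkn))))
          (sorted-map)
          tokens))

A sorted map keeps the words in alphabetical order for free, which is exactly what the display step will need.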

Display

I can imagine displaying a concordance a number of different ways: text files, HTML, a GUI form. To keep the system flexible, I'm going to defer that decision until later, and the core concordance generator will just return some basic Clojure data types. I'll also add a function that prints them to the screen.

That's it. This should be a lot simpler than the Stemmer we just worked on, but it should give us a good idea of how various words are being used in the documents.

Thursday, October 9, 2008

Concordances, Part 1: What's that?

OK, that's enough of a break, don't you think?

Now we've got some processing tools under our belts: tokenizing and stemming. We'll tweak those, and we'll add more later. But for now let's take a look at our data.

One of the earliest ways of analyzing texts is the concordance. Simply put, it's an alphabetical list of the words used in a document; each word is followed by every occurrence of that word, a little context around it (typically just a few words), and the location of that occurrence.

The first concordance was done in the thirteenth century for the Vulgate translation of the Christian Bible. However, because complete concordances are labor-intensive, they really weren't practical as a widespread tool. Because of this, concordances were only made for works deemed sufficiently important, such as religious texts or the works of Shakespeare or Chaucer.

But computers changed that. Now, given a text file, a computer can spit out a concordance in little time.

For example, here are the first few entries of a concordance of the first few paragraphs of this blog entry (although this may reflect an earlier draft). The numbers at the beginning of each line are the line number in the text file I copied the entry into, plus the start and end indices of the word on that line. This should give you a picture of what a concordance is, although a better one would pull context from surrounding lines. This one doesn't. Yet.

a
=
  1: 23: 24 K, that's enough of *a* break.
  5:  7:  8                take *a* look at our data.
 10: 35: 36 eren't practical as *a* wide-spread tool. B

add
===
  4: 17: 20      We'll probably *add* more later, and we'

analyzing
=========
  7: 30: 39 he earliest ways of *analyzing* texts is the [conco

and
===
  3: 66: 69 r belts: tokenizing *and* stemming.
  4: 33: 36 bly add more later, *and* we'll definitely tw

are
===
  9: 58: 61 mplete concordances *are* labor-intensive,

Next we'll analyze what we'll need to build the concordance and plan the next steps.