Friday, October 24, 2008

Concordances, Part 3: Positioning Tokens

Today, we're going to extend the processing to track where each token appears in the original document.

Tokens Today

Currently, a token is just the string containing the token data.

Easy enough.

Tokens Tomorrow

To hold more information about each token, we'll need a richer data type. Here's a struct for tokens:

(defstruct token :text :raw :line :start :end :filename)

This gives us slots to hold the token's text; its original text before case normalization, stemming, or whatever; the line it occurred on; the start and end indices where it can be found on that line; and the name of the file the token was read from.
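
Here's a quick REPL sketch of what one of these looks like, with made-up values:

user=> (struct-map token :text "hello" :raw "Hello" :line 3 :start 0 :end 5 :filename "sample.txt")
{:text "hello", :raw "Hello", :line 3, :start 0, :end 5, :filename "sample.txt"}
user=> (:raw *1)
"Hello"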

Again, pretty simple.

Updating the Tokenization

The big changes happen in the tokenization procedure. Currently, it doesn't take lines into account.

Let's start with the highest-level functions and drill down to the lowest. First, these functions tokenize either a file or a string.

(defn split-lines [input-string]
  ;; \r\n must come first in the alternation; otherwise a CRLF
  ;; ending would be split as two separate line breaks.
  (.split #"\r\n|\r|\n" input-string))

(defn tokenize-str
  ([input-string]
   (tokenize-str-seq (split-lines input-string)))
  ([input-string stop-word?]
   (filter (comp (complement stop-word?) :text)
           (tokenize-str input-string))))

(import '(java.io BufferedReader FileReader))

(defn tokenize
  ([filename]
   (with-open in (BufferedReader. (FileReader. filename))
     (doall
       (map (fn [tkn] (assoc tkn :filename filename))
            (tokenize-str-seq (line-seq in))))))
  ([filename stop-word?]
   (with-open in (BufferedReader. (FileReader. filename))
     (doall
       (map (fn [tkn] (assoc tkn :filename filename))
            (filter (comp (complement stop-word?) :text)
                    (tokenize-str-seq (line-seq in))))))))

split-lines breaks a string into lines based on a regex of line endings.
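
A quick check at the REPL shows mixed line endings coming out right (wrapping the call in seq, since .split returns a Java array):

user=> (seq (split-lines "one\ntwo\r\nthree"))
("one" "two" "three")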

tokenize-str uses split-lines to break its input into lines and passes them to tokenize-str-seq. The second overload then filters the tokens, dropping any whose :text is in a stop list.
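
For example, using a set as the stop-word predicate (sets are functions of their elements), you'd see something like:

user=> (map :text (tokenize-str "Hello, world!"))
("hello" "world")
user=> (map :text (tokenize-str "The quick fox" #{"the"}))
("quick" "fox")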

tokenize opens a file with a java.io.BufferedReader, passes its lines (from line-seq) to tokenize-str-seq, and sets the :filename key on each token structure it gets back.

doall is thrown in there because map is lazy, but with-open isn't. doall forces map to evaluate everything. Without it, with-open would close the file before its contents could be read. This is a common mistake, and it will probably bite you regularly. It does me.
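
Here's a minimal sketch of the failure mode; broken-read-lines and sample.txt are just stand-ins for illustration. Without doall, the lazy sequence escapes with-open unrealized, and by the time you force it, the reader is already closed:

(defn broken-read-lines [filename]
  (with-open in (BufferedReader. (FileReader. filename))
    (line-seq in)))  ; nothing is read before with-open closes the file

user=> (doall (broken-read-lines "sample.txt"))
java.io.IOException: Stream closed

Forcing it afterward is too late; the doall has to happen inside the with-open.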

We haven't seen tokenize-str-seq yet. What does it do?

(def token-regex #"\w+")

(defn- tokenize-str-seq
  "This tokenizes a sequence of strings."
  ([strings]
   (tokenize-str-seq strings 0))
  ([strings line-no]
   (when-first line strings
     (lazy-cat (tokenize-line line-no (re-matcher token-regex line) 0)
               (tokenize-str-seq (rest strings) (inc line-no))))))

This function tokenizes a sequence of strings. It walks through the sequence, numbering each line (line-no). For each input line, it constructs a lazy sequence by concatenating the tokens for that line (tokenize-line) with the tokens for the rest of the lines.
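
Line numbers start at zero. You can see the numbering through tokenize-str:

user=> (map #(vector (:line %) (:text %)) (tokenize-str "a b\nc"))
([0 "a"] [0 "b"] [1 "c"])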

when-first is new. It is exactly equivalent to when plus let:

user=> (macroexpand-1 '(when-first line strings (println line)))
(clojure/when (clojure/seq strings)
  (clojure/let [line (clojure/first strings)]
    (println line)))

tokenize-line constructs a lazy sequence of the tokens in that line.

(defn- tokenize-line
  "This tokenizes a single line into a lazy sequence of tokens."
  ([line-no matcher]
   (tokenize-line line-no matcher 0))
  ([line-no matcher start]
   (when (.find matcher start)
     (lazy-cons (mk-token line-no matcher)
                (tokenize-line line-no matcher (.end matcher))))))
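
tokenize-line is private, so you can only poke at it from within its own namespace, but there it behaves like this:

user=> (map :raw (tokenize-line 7 (re-matcher token-regex "foo, bar!")))
("foo" "bar")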

mk-token constructs a token struct from a regex matcher and a line number.

(defn- mk-token
  "This creates a token given a line number and regex matcher."
  [line-no matcher]
  (let [raw (.group matcher)]
    ;; :filename is left nil here; tokenize fills it in later.
    (struct token
            (.toLowerCase raw) raw
            line-no (.start matcher) (.end matcher))))
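
From inside the namespace, you can try mk-token directly by priming a matcher with .find first. Note that the :filename slot is nil at this point:

user=> (let [m (re-matcher token-regex "Hello")]
         (.find m)
         (mk-token 5 m))
{:text "hello", :raw "Hello", :line 5, :start 0, :end 5, :filename nil}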

That's it. tokenize and tokenize-str create a sequence of strings of input data. Each item in the sequence is a line in the input.

tokenize-str-seq takes that input sequence and creates a lazy sequence of the tokens from the first line and the tokens from the rest of the input sequence.

tokenize-line takes a line and constructs a lazy sequence of the tokens in it, as defined by the regex held in token-regex.

Finally, mk-token constructs the token from the regex Matcher and the line number.
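
To put it all together: if a file sample.txt (hypothetically) contained the single line "Hello world", tokenize would produce something like:

user=> (tokenize "sample.txt")
({:text "hello", :raw "Hello", :line 0, :start 0, :end 5, :filename "sample.txt"}
 {:text "world", :raw "world", :line 0, :start 6, :end 11, :filename "sample.txt"})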


If you've made it this far, you've probably got Clojure up and running, but if not, Bill Clementson has a great post on how to set up Clojure+Emacs+SLIME. In the future, he'll be exploring Clojure in more detail. He's got a lot of good posts on Common Lisp and Scheme, and I'm looking forward to seeing what he does with Clojure.


I haven't really explained Clojure's laziness yet. Next time, I'll talk about that.
