Monday, June 30, 2008

Tokenization: Part 7, File Reading

So far, we’ve only tokenized data from strings that we’ve typed in. Today, let’s learn how to tokenize the text in a file.


The easiest way to read the data in a file is to use the slurp function. It takes a file name and returns the contents of that file as a string:

user=> (slurp "LICENSE.txt")
"Copyright (c) 2008, Eric Rochester\nAll ...."

We can then pass that string to tokenize-str to get the list of tokens in the file:

user=> (take 10 (word/tokenize-str (slurp "LICENSE.txt")))
("copyright" "c" "2008" "eric" "rochester" "all" "rights" "reserved" "redistribution" "and")

(take returns the first n items of a sequence; here, the first 10 tokens.)


Of course, it would be nice to have this wrapped into its own function, so let’s add this into word.clj after tokenize-str:

(defn tokenize
  ([filename]
   (tokenize-str (slurp filename)))
  ([filename stop-word?]
   (tokenize-str (slurp filename) stop-word?)))

This function works just like tokenize-str, except it takes a filename and returns the tokens in that file.
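
To try both arities without the rest of word.clj loaded, here is a minimal, self-contained sketch. The tokenize-str below is only a hypothetical stand-in for the one built earlier in this series (I'm assuming it lower-cases tokens and uses stop-word? to filter them out), and sample.txt and sample-stop-words are names made up for the example. It also assumes a Clojure recent enough to have clojure.string and spit:

```clojure
(require '[clojure.string :as string])

;; Hypothetical stand-in for the series' tokenize-str (an assumption, not
;; the real definition): lower-cased runs of word characters, optionally
;; filtered by a stop-word? predicate.
(defn tokenize-str
  ([input]
   (map string/lower-case (re-seq #"\w+" input)))
  ([input stop-word?]
   (remove stop-word? (tokenize-str input))))

;; The function from this post.
(defn tokenize
  ([filename]
   (tokenize-str (slurp filename)))
  ([filename stop-word?]
   (tokenize-str (slurp filename) stop-word?)))

;; A set acts as a predicate on its members, so it can serve as stop-word?.
(def sample-stop-words #{"and" "all" "c"})

;; Write a small sample file so the example does not depend on LICENSE.txt.
(spit "sample.txt" "Copyright (c) 2008\nAll rights reserved")

(tokenize "sample.txt")
;; => ("copyright" "c" "2008" "all" "rights" "reserved")

(tokenize "sample.txt" sample-stop-words)
;; => ("copyright" "2008" "rights" "reserved")
```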


There’s one big problem with tokenize. Try to tokenize a file that’s several hundred gigabytes in size, and you’ll probably find that problem quickly: it reads the entire file into memory all at once.

There are ways to deal with this problem, such as reading the file a line at a time and tokenizing each line, and Clojure makes this relatively easy. However, to keep things simple, we're going to leave tokenize the way it is for now.

At this point, just keep this limitation in mind. You have been warned.
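
For the curious, here is a rough sketch of what the line-at-a-time approach could look like. It is not part of word.clj. It assumes a Clojure recent enough to have clojure.java.io and clojure.string, and line-tokens and reduce-tokens are names invented for this example (line-tokens is a simple stand-in for tokenize-str, not the real thing):

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as string])

;; Hypothetical stand-in for tokenize-str, applied to a single line.
(defn line-tokens
  [line]
  (map string/lower-case (re-seq #"\w+" line)))

;; Folds f over every token in the file, holding only one line in memory
;; at a time. The work happens inside with-open, so the reader is still
;; open while line-seq is being consumed.
(defn reduce-tokens
  [f init filename]
  (with-open [rdr (io/reader filename)]
    (reduce (fn [acc line]
              (reduce f acc (line-tokens line)))
            init
            (line-seq rdr))))

;; Example: count the tokens in a file without slurping it.
(defn count-tokens
  [filename]
  (reduce-tokens (fn [n _token] (inc n)) 0 filename))
```

The fold is done inside with-open on purpose: returning the lazy sequence from line-seq directly would let it escape after the reader has been closed, and reading from it later would fail.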


Brian Doyle said...

I was attempting to use slurp and get the error:
user=> (slurp 'license.txt')
java.lang.Exception: Unmatched delimiter: )
java.lang.Exception: ReaderError:(1,1) Unmatched delimiter: )
at clojure.lang.Repl.main(
Caused by: java.lang.Exception: Unmatched delimiter: )
at clojure.lang.LispReader$UnmatchedDelimiterReader.invoke(
at clojure.lang.LispReader$WrappingReader.invoke(
at clojure.lang.LispReader.readDelimitedList(
at clojure.lang.LispReader$ListReader.invoke(
... 1 more

Eric Rochester said...

Hi Brian,

You'll need to use double-quotes around license.txt:

(slurp "license.txt")

Strings in Clojure, like strings in Common Lisp, Scheme, or Java, are double-quoted. I got into the habit of using single-quoted strings in Python, and it is a hard habit to break.

Incidentally, the odd error message comes from the way Clojure's reader parses the expression: the first single quote quotes license.txt, which makes it a symbol rather than a string. The second single quote then appears to try to quote whatever form comes next, but the next thing the reader finds is the closing paren, hence "Unmatched delimiter: )". Interesting. I may need to check that out later.

Of course, you'll also need to have license.txt in the directory you're working in, but I assume you already know that!


Brian Doyle said...

Thanks, Eric, the double quotes worked like a charm. I'm REALLY enjoying your posts about Clojure. I've never used a functional programming language before, so I'm starting from the ground floor here. Do you recommend any good books on functional programming? Thanks.

Eric Rochester said...

Hey Brian, glad you're enjoying it. I'm having a lot of fun putting it together.

Yeah, functional programming takes some getting used to. I'm still not an expert on it by any means. I came to terms with it while learning and working on a big project in Erlang.

Unfortunately, I haven't read any books specifically on FP. You could ask in the Clojure group or on IRC. I'm sure they'd have some suggestions.

Now that I think about it, Structure and Interpretation of Computer Programs might be helpful too. It's been years since I read it, but I'm sure it would have some information about functional programming.

As for the tutorial I'm putting together, I'm going to keep talking about FP, but it will be a practical introduction: less "this is FP, this is why it's good" and a lot more FP-by-example. I'm hoping it will sink in by osmosis, I guess. (Which sounds doubtful when I put it that way. Oh well.)

Good luck,