So far, we’ve only tokenized data from strings that we’ve typed in. Today, let’s learn how to tokenize the text in a file.
slurp
The easiest way to read in the data from a file is using the slurp function. It takes a file name and returns the contents of that file as a string:
user=> (slurp "LICENSE.txt")
"Copyright (c) 2008, Eric Rochester\nAll ...."
We can then pass that string to tokenize-str to get the list of tokens in the file:
user=> (take 10 (word/tokenize-str (slurp "LICENSE.txt")))
("copyright" "c" "2008" "eric" "rochester" "all" "rights" "reserved" "redistribution" "and")
(take pulls the first n items from a list and returns them.)
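As a quick illustration of take on its own, here is a small sketch using a literal vector in place of real tokenizer output. The sample words are made up for the example.

```clojure
;; take returns the first n items of a collection as a sequence.
(take 3 ["copyright" "c" "2008" "eric" "rochester"])
;; => ("copyright" "c" "2008")

;; Asking for more items than there are just returns what is available.
(take 10 ["copyright" "c"])
;; => ("copyright" "c")
```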
tokenize
Of course, it would be nice to have this wrapped into its own function, so let’s add this into word.clj after tokenize-str:
(defn tokenize
  ([filename]
   (tokenize-str (slurp filename)))
  ([filename stop-word?]
   (tokenize-str (slurp filename) stop-word?)))
This function works just like tokenize-str, except it takes a filename and returns the tokens in that file.
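To try both arities without depending on the rest of word.clj, here is a self-contained sketch. The tokenize-str below is only a hypothetical stand-in (lowercase, split on word characters); the real one from earlier in this series may behave differently. The sample file and its contents are also made up for the example.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for the tokenize-str defined earlier in the series.
;; Assumption: it lowercases the input and splits it on runs of word characters.
(defn tokenize-str
  ([input]
   (map str/lower-case (re-seq #"\w+" input)))
  ([input stop-word?]
   (remove stop-word? (tokenize-str input))))

;; tokenize as defined in the post.
(defn tokenize
  ([filename]
   (tokenize-str (slurp filename)))
  ([filename stop-word?]
   (tokenize-str (slurp filename) stop-word?)))

;; Write a small temporary file so the example does not need LICENSE.txt.
(def sample-file
  (.getPath (java.io.File/createTempFile "sample" ".txt")))
(spit sample-file "All rights reserved.\nRedistribution and use")

(tokenize sample-file)
;; => ("all" "rights" "reserved" "redistribution" "and" "use")

;; A set works as the stop-word? predicate: it is truthy for its members.
(tokenize sample-file #{"and" "all"})
;; => ("rights" "reserved" "redistribution" "use")
```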
Warning!
There’s one big problem with tokenize. Try to tokenize a file that’s several hundred gigabytes in size, and you’ll probably find that problem quickly: it reads the entire file into memory all at once.
There are ways to deal with this problem, such as reading the file a line at a time and tokenizing each line, and Clojure actually makes this relatively easy to do. However, to keep things simple, for now we’re going to leave tokenize the way it is.
At this point, just keep this limitation in mind. You have been warned.
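For the curious, here is a rough sketch of what the line-at-a-time approach could look like. It is not part of word.clj. The name reduce-tokens is hypothetical, tokenize-str is again a stand-in for the real one, and the sketch assumes a Clojure version that ships clojure.java.io.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Hypothetical stand-in for the tokenize-str defined earlier in the series.
(defn tokenize-str [input]
  (map str/lower-case (re-seq #"\w+" input)))

;; Hypothetical helper: folds f over the tokens of a file, one line at a time,
;; so only the current line's tokens need to be in memory.
(defn reduce-tokens
  [f init filename]
  (with-open [rdr (io/reader filename)]
    ;; reduce is eager, so all the reading happens before with-open
    ;; closes the reader.
    (reduce f init (mapcat tokenize-str (line-seq rdr)))))

;; A small temporary file to run the example against.
(def sample-file
  (.getPath (java.io.File/createTempFile "sample" ".txt")))
(spit sample-file "the cat\nthe hat\n")

;; Count the tokens in the file.
(reduce-tokens (fn [n _] (inc n)) 0 sample-file)
;; => 4

;; Count how often each token occurs.
(reduce-tokens (fn [m t] (assoc m t (inc (get m t 0)))) {} sample-file)
;; => {"the" 2, "cat" 1, "hat" 1}
```

The helper takes a reducing function rather than returning the tokens because a lazy sequence returned from inside with-open would be read only after the reader had already been closed.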
4 comments:
I was attempting to use slurp and got this error:
user=> (slurp 'license.txt')
java.lang.Exception: Unmatched delimiter: )
java.lang.Exception: ReaderError:(1,1) Unmatched delimiter: )
at clojure.lang.LispReader.read(LispReader.java:160)
at clojure.lang.Repl.main(Repl.java:68)
Caused by: java.lang.Exception: Unmatched delimiter: )
at clojure.lang.LispReader$UnmatchedDelimiterReader.invoke(LispReader.java:798)
at clojure.lang.LispReader.read(LispReader.java:126)
at clojure.lang.LispReader$WrappingReader.invoke(LispReader.java:414)
at clojure.lang.LispReader.readDelimitedList(LispReader.java:822)
at clojure.lang.LispReader$ListReader.invoke(LispReader.java:751)
at clojure.lang.LispReader.read(LispReader.java:126)
... 1 more
Hi Brian,
You'll need to use double-quotes around license.txt:
(slurp "license.txt")
Strings in Clojure, like strings in Common Lisp, Scheme, or Java, are double-quoted. I got in the habit of using single-quoted strings in Python, and it is a hard habit to break.
Incidentally, the odd error message comes from the way Clojure attempts to parse the expression: the first single quote quotes license.txt, which is read as a symbol. The second single quote then expects another form to quote, but judging by the stack trace, the next thing the reader finds is the closing paren, which it reports as an unmatched delimiter. Interesting. I may need to check that out later.
Of course, you'll also need to have license.txt in the directory you're working in, but I assume you already know that!
HTH,
Eric
Thanks, Eric. The double quotes worked like a charm. I'm REALLY enjoying your posts about Clojure. I've never used a functional programming language before, so I'm starting from the ground floor here. Do you recommend any good books on functional programming? Thanks.
Hey Brian, glad you're enjoying it. I'm having a lot of fun putting it together.
Yeah, functional programming takes some getting used to. I'm still not an expert on it by any means. I came to terms with it while learning and working on a big project in Erlang.
Unfortunately, I haven't read any books specifically on FP. You could ask in the Clojure group (http://groups.google.com/group/clojure) or IRC. I'm sure they'd have some suggestions.
Now that I think about it, Structure and Interpretation of Computer Programs (http://mitpress.mit.edu/sicp/) might be helpful too. It's been years since I read it, but I'm sure it would have some information about functional programming.
As far as the tutorial I'm putting together, I'm going to keep talking about FP, but it will be a practical introduction. Less "this is FP, this is why it's good" and a lot of FP-by-example. I'm hoping it will sink in by osmosis, I guess. (Which sounds doubtful when I put it that way. Oh well.)
Good luck,
Eric