So far, we’ve only tokenized data from strings that we’ve typed in. Today, let’s learn how to tokenize the text in a file.
The easiest way to read in the data from a file is using the slurp function.
It takes a file name and returns the contents of that file as a string:
user=> (slurp "LICENSE.txt")
"Copyright (c) 2008, Eric Rochester\nAll ...."
We can then pass that string to tokenize-str to get the list of tokens in the file:
user=> (take 10 (word/tokenize-str (slurp "LICENSE.txt")))
("copyright" "c" "2008" "eric" "rochester" "all" "rights" "reserved" "redistribution" "and")
(take pulls the first n items from a list and returns them.)
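For instance, here's take on its own in the REPL, pulling the first three items from a range:

user=> (take 3 (range 10))
(0 1 2)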
Of course, it would be nice to have this wrapped into its own function, so let's add one to the word namespace:
(defn tokenize
  ([filename] (tokenize-str (slurp filename)))
  ([filename stop-word?] (tokenize-str (slurp filename) stop-word?)))
This function works just like
tokenize-str, except it takes a filename and
returns the tokens in that file.
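Calling it should look something like this (the first five tokens here just follow from the output we saw above):

user=> (take 5 (word/tokenize "LICENSE.txt"))
("copyright" "c" "2008" "eric" "rochester")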
There’s one big problem with
tokenize. Try to tokenize a file that’s several
hundred gigabytes in size, and you’ll probably find that problem quickly: it
reads the entire file into memory all at once.
There are ways to deal with this problem, such as reading the file a line at a time and tokenizing each line as we go, and Clojure actually makes it relatively easy to do. However, to keep things simple, we're going to leave tokenize the way it is for now.
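If you're curious, here's a minimal sketch of that line-at-a-time approach. It assumes tokenize-str works sensibly on a single line of text, and it counts tokens rather than collecting them, since collecting them all would put us right back into memory trouble. The name count-tokens is just illustrative:

(require '[clojure.java.io :as io])

(defn count-tokens
  "Counts the tokens in a file, reading it one line at a time
  instead of slurping the whole thing into memory."
  [filename]
  (with-open [rdr (io/reader filename)]
    ;; line-seq is lazy, so we only ever hold one line's worth
    ;; of text (and its tokens) at a time.
    (reduce + 0 (map (comp count tokenize-str) (line-seq rdr)))))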
At this point, just keep this limitation in mind. You have been warned.