Thursday, June 26, 2008

Tokenization, Part 5: Stop Words

Stop Words

For many types of analyses, you need to hang onto every token that comes through the pipeline. For other types of analyses, however, you may want to focus on words that carry meaning: names, nouns, and verbs. Plus, words like the, of, and and occur more often than any other words in English, and if you aren’t going to be using them, why keep track of them? For example, in a million-word corpus (collection of texts) I have lying around, the alone accounts for 6.8% of the total number of tokens. The twelve most frequent tokens (see below) account for 25.6% of the total. Considering that you’ll often be dealing with huge amounts of text, dropping these tokens can free up a tremendous amount of resources!

(I mentioned this above, but I want to reiterate: contrary to expectations, a lot of interesting linguistic stuff is happening with those frequent but “unimportant” items. If you’re interested in them or think you might be, you should probably forgo a stop list, bite the bullet, and process everything.)

To filter out those most-frequent tokens, many NLP applications use a list of stop words. Anything on the list is removed from the stream of input tokens before any further processing is done.

Which stop words to use, and how many, depends on the analysis you’re going to do. For the purposes of illustration, we’re going to use a list of the 12 most frequent tokens in the corpus I mentioned earlier. Here they are with their frequencies (a sketch of how counts like these can be computed follows the list):

the 69969
of 36472
and 28935
to 26191
a 23529
in 21422
that 10789
is 10101
was 9815
he 9795
for 9498
it 9094
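
If you’re curious how counts like these could be computed, here’s a sketch using Clojure’s frequencies function. It assumes the tokenize-str function from earlier in this series and a corpus-text string holding the corpus (that name is hypothetical):

user=> (take 12 (sort-by val > (frequencies (word/tokenize-str corpus-text))))
(["the" 69969] ["of" 36472] ["and" 28935] ...)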

Sets

We’ll keep the stop words in a Clojure set. Sets store items, but with no duplicates. You can create a set using the set function or with a set literal: a hash mark followed by the items, surrounded by curly braces.
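
For example, both of these produce the same set; the duplicate is silently dropped (note that the order in which a set prints is not guaranteed):

user=> (set ["the" "of" "the"])
#{"the" "of"}
user=> #{"the" "of"}
#{"the" "of"}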

Open word.clj in your editor and add this near the top of the file, after you define token-regex, maybe:

(def stop-words
  #{"a" "in" "that" "for" "was" "is"
    "it" "the" "of" "and" "to" "he"})

Sets as Functions

So far, we’ve only used functions as functions. That is, every time we’ve called something, the thing we called has been a function object, like map or re-seq. But Clojure also allows some other data types to act as functions. A set is one of them: called as a function with one argument, it tests whether that argument is a member of the set.

For example, call (load-file "word.clj") to update your REPL and try this:

user=> word/stop-words
#{"a" "in" "that" "for" "was" "is" "it" "the" "of" "and" "to" "he"}
user=> (word/stop-words "clojure")
nil
user=> (word/stop-words "was")    
"was"

You can see that when we call stop-words with a word that isn’t in the set, it returns nil. nil is a special value in Clojure and other Lisps that means nothing (it’s like None in Python or null in Java), and it always tests false. If the word is in the set, like "was", the set returns that item, which will test true.
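
This means the set can stand in anywhere a predicate is expected. For instance, using it in an if (these results assume the stop-words set defined above):

user=> (if (word/stop-words "the") "stop word" "content word")
"stop word"
user=> (if (word/stop-words "clojure") "stop word" "content word")
"content word"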

Great: now we have the set of stop words and a function that tells us whether a given word is a stop word, all in one object. Let’s put it to use.

Filtering Tokens

To filter out the stop words, we’ll use the filter function. It takes a predicate (a function of one argument that returns a true or false value) and a list. It calls the predicate on every item in the list and returns a new list made up of those items from the original list for which the predicate returned true.

This may make more sense with an example:

user=> (filter pos? '(-2 -1 0 1 2))
(1 2)

pos? is a predicate that returns true if a value is positive. You can see that filter here returns all the values in the original list that are positive, that is, for which pos? returned true.

Let’s try this with stop-words:

user=> (filter word/stop-words '("the" "cat" "in" "the" "hat"))
("the" "in" "the")

Hmm. That returns all the tokens that are stop words: the exact opposite of what we want. We need something that will return the opposite of what stop-words returns.

Fortunately for us, Clojure defines such a function: complement takes a function and returns a new one that always returns the logical opposite of the original. Where the first function returns true, the new function returns false, and vice versa.
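
You can see the flip directly at the REPL (again using our stop-words set):

user=> (word/stop-words "cat")
nil
user=> ((complement word/stop-words) "cat")
true
user=> ((complement word/stop-words) "the")
false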

Let’s try the example above again, this time using the complement of stop-words:

user=> (filter (complement word/stop-words) '("the" "cat" "in" "the" "hat"))
("cat" "hat")

Exactly.

As I mentioned before, you won’t always want to use a list of stop words, so we won’t immediately make it part of the tokenize-str function. However, in the next posting, I’ll show you how to add a set of stop words as an optional argument to tokenize-str.
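
In the meantime, here’s a minimal sketch of how you might wrap this up yourself, assuming the tokenize-str built earlier in this series (the name remove-stop-words is my own, not necessarily what the next post will use):

(defn remove-stop-words
  "Tokenizes the input and drops any token found in stop-set."
  [input stop-set]
  (filter (complement stop-set) (tokenize-str input)))

Calling (remove-stop-words "the cat in the hat" stop-words) would give ("cat" "hat"), and passing the empty set #{} leaves every token in place, since nothing is ever a member of it.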
