Stop Words
For many types of analyses, you need to hang onto every token that comes through the pipeline. For other types of analyses, however, you may want to focus on words that carry meaning: names, nouns, and verbs. Besides, words like the, of, and and occur more often than any other words in English, and if you aren’t going to use them, why keep track of them? For example, in a million-word corpus (collection of texts) I have lying around, the alone accounts for 6.8% of the total number of tokens, and the twelve most frequent tokens (see below) make up 25.6%. Considering that you’ll often be dealing with huge amounts of text, dropping them can free up a tremendous amount of resources!
(I mentioned this above, but I want to reiterate: contrary to expectations, a lot of interesting linguistic stuff is happening with those frequent but “unimportant” items. If you’re interested in them or think you might be, you should probably forgo a stop list, bite the bullet, and process everything.)
To filter out those most-frequent tokens, many NLP applications use a list of stop words. Anything on the list is removed from the stream of input tokens before any further processing is done.
Which stop words to use, and how many, depends on the analysis you’re going to do. For the purposes of illustration, we’re going to use a list of the 12 most frequent tokens in the corpus I mentioned earlier. Here they are with their frequencies:
| token | frequency |
|---|---|
| the | 69969 |
| of | 36472 |
| and | 28935 |
| to | 26191 |
| a | 23529 |
| in | 21422 |
| that | 10789 |
| is | 10101 |
| was | 9815 |
| he | 9795 |
| for | 9498 |
| it | 9094 |
Sets
We’ll keep the stop words in a Clojure set. Sets allow you to store items, but no duplicates. You can create a set using the set function or by using a set literal, which is a hash mark and the items in the list surrounded by curly braces.

Open word.clj in your editor and add this near the top of the file, after you define token-regex, maybe:
(def stop-words #{"a" "in" "that" "for" "was" "is" "it" "the" "of" "and" "to" "he"})
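As a quick aside, both ways of building a set arrive at the same value, and the set function quietly drops duplicates (the literal syntax, by contrast, rejects duplicate elements outright with a reader error). A minimal sketch:

```clojure
;; Building a set with the `set` function: duplicates are dropped.
(def from-fn (set ["the" "of" "the" "and"]))

;; Building the same set with the literal syntax.
(def from-literal #{"the" "of" "and"})

;; The two are equal, and the duplicate "the" has disappeared.
(= from-fn from-literal) ; => true
(count from-fn)          ; => 3
```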
Sets as Functions
So far, we’ve only used functions as functions. That is, every time we’ve called a function, we’ve called it on a function object, things like map and re-seq. But Clojure also allows some other data types to act like functions. A set is one of those: called as a function, a set takes one argument and tests whether that argument is in the set.
For example, call (load-file "word.clj") to update your REPL and try this:
user=> word/stop-words
#{"a" "in" "that" "for" "was" "is" "it" "the" "of" "and" "to" "he"}
user=> (word/stop-words "clojure")
nil
user=> (word/stop-words "was")
"was"
You can see that when we call stop-words with a word that isn’t in the set, it returns nil. nil is just a special value in Clojure and other lisps that means nothing (it’s like None in Python or null in Java), and it always tests false. If the word is in the set, like "was", it returns that item, which will test true.
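Because of that truthiness, the set drops straight into a conditional. A small sketch (the set is redefined here so the snippet stands alone; keep-token? is just an illustrative name, not part of word.clj):

```clojure
(def stop-words #{"a" "in" "that" "for" "was" "is" "it" "the" "of" "and" "to" "he"})

;; nil tests false, the returned item tests true, so the set
;; works directly as the condition of an `if`.
(defn keep-token? [token]
  (if (stop-words token)
    false   ; token is a stop word
    true))  ; token carries content

(keep-token? "cat") ; => true
(keep-token? "the") ; => false
```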
Great, now we have a list of the stop words and a function that tells whether a given word is a stop word, all in one object. Let’s put it to use.
Filtering Tokens
To filter out the stop words, we’ll use the filter function. It takes a predicate (a function of one argument that returns a true or false value) and a list. It calls the predicate on every item in the list and returns a new list made up of those items from the original list for which the predicate returned true.
This may make more sense with an example:
user=> (filter pos? '(-2 -1 0 1 2))
(1 2)
pos? is a predicate that returns true if a value is positive. You can see that filter here returns all the values in the original list that are positive, that is, those for which pos? returned true.
Let’s try this with stop-words:
user=> (filter word/stop-words '("the" "cat" "in" "the" "hat"))
("the" "in" "the")
Hmm. That returns all the tokens that are stop words: the exact opposite of what we want. We need something that will return the opposite of what stop-words would return.
Fortunately for us, Clojure defines such a function: complement takes a function and returns a new one that always returns the logical opposite of the original. Where the first function returns true, the new function returns false, and vice versa.
Let’s try the example above again, this time using the complement of stop-words:
user=> (filter (complement word/stop-words) '("the" "cat" "in" "the" "hat"))
("cat" "hat")
Exactly.
As I mentioned before, you won’t always want to use a list of stop words, so we won’t immediately make it part of the tokenize-str function. However, in the next posting, I’ll show you how to add a set of stop words as an optional argument to tokenize-str.