Before I leave the Porter Stemmer behind, I want to show you some of the tools I used to debug the code as I went along.
There are more modern options for debugging Clojure than what I'm presenting here. (Search the mailing list for details.) Personally, I generally use print statements for debugging. It's primitive, but effective. In some languages, it can also be painful. Fortunately, Lisp languages take much of the pain out of print-debugging.
Tracing
One common way to debug programs is to follow when a function is called and when it returns. This is called tracing, and the function and macro below handle that.
(defn trace-call [f tag]
  (fn [& input]
    (print tag ":" input "-> ")
    (flush)
    (let [result (apply f input)]
      (println result)
      (flush)
      result)))
trace-call returns a new function that prints the input arguments to a function, calls the function, prints the result, and returns it. It takes the function and a tag to identify what is being traced.
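To make that concrete, here is a minimal sketch that uses trace-call directly, without redefining anything. The name traced-add and the tag 'add are just labels chosen for this example.

```clojure
;; trace-call as defined above, repeated so this snippet runs on its own.
(defn trace-call [f tag]
  (fn [& input]
    (print tag ":" input "-> ")
    (flush)
    (let [result (apply f input)]
      (println result)
      (flush)
      result)))

;; Wrap + in a traced version; + itself is left untouched.
;; traced-add and the tag 'add are hypothetical names for illustration.
(def traced-add (trace-call + 'add))

(traced-add 1 2)
;; prints: add : (1 2) -> 3
;; returns: 3
```

Because the wrapper is an ordinary function, you can pass it anywhere the original function would go.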
(defmacro trace [fn-name]
  `(def ~fn-name (trace-call ~fn-name '~fn-name)))
The trace macro is syntactic sugar on trace-call. It replaces the function with a traced version of it that uses its own name as a tag. For example, this creates and traces a function that upper-cases strings:
user=> (defn upper-case [string] (.toUpperCase string))
#'user/upper-case
user=> (upper-case "name")
"NAME"
user=> (trace upper-case)
#'user/upper-case
user=> (upper-case "name")
upper-case : (name) -> NAME
"NAME"
The debug Macro
Another common trick in print-debugging is to print the value of an expression. The macro below evaluates an expression, prints both the expression and the result, and returns the result.
(defmacro debug [expr]
  `(let [value# ~expr]
     (println '~expr "=>" value#)
     (flush)
     value#))
For example:
user=> (debug (+ 1 2))
(+ 1 2) => 3
3
Lisp macros are especially helpful here, because they allow you to treat the expression both as data to print and as code to evaluate.
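Since debug returns the value it prints, it can be dropped around any sub-expression without changing the result. The sketch below shows that, and uses macroexpand-1 to show the expression appearing twice in the expansion, once quoted and once as code. The arithmetic is only an illustration.

```clojure
;; debug as defined above, repeated so this snippet runs on its own.
(defmacro debug [expr]
  `(let [value# ~expr]
     (println '~expr "=>" value#)
     (flush)
     value#))

;; Wrap just the inner expression; the outer computation is unaffected.
(* 2 (debug (+ 1 2)))
;; prints: (+ 1 2) => 3
;; returns: 6

;; Inspect the expansion: (+ 1 2) shows up both inside a quote (data to
;; print) and unquoted (code to evaluate). The generated symbol names vary.
(macroexpand-1 '(debug (+ 1 2)))
```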
The debug-stem Function
This function is a debugging version of stem. It uses binding to replace all the major functions of the stemmer with traced versions of them.
(We'll talk more about binding later, when we deal with concurrency. Right now, just understand that binding replaces the value of a top-level variable, like a function name, with a new value. But the variable only has that value for the duration of the binding. Afterward, it is returned to its former value.)
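A small sketch of that behavior, separate from the stemmer. The var *greeting* and the function greet are hypothetical names made up for this example. In Clojure 1.3 and later, a var has to be marked ^:dynamic before binding will accept it.

```clojure
;; *greeting* is a hypothetical var for illustration only.
;; ^:dynamic is required for binding in Clojure 1.3 and later.
(def ^:dynamic *greeting* "hello")

(defn greet []
  (str *greeting* ", world"))

(greet)                           ; "hello, world"

(binding [*greeting* "goodbye"]
  (greet))                        ; "goodbye, world"

(greet)                           ; "hello, world" again
```

The function greet never changes; only the value it sees for *greeting* does, and only while the binding form is running.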
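;; trace-call is used here rather than the trace macro: trace expands to a
;; def, which would redefine each function permanently instead of only for
;; the duration of the binding.
;; In Clojure 1.3 and later, these vars must also be declared ^:dynamic.
(defn debug-stem [word]
  (binding [stem (trace-call stem 'stem)
            make-stemmer (trace-call make-stemmer 'make-stemmer)
            step-1ab (trace-call step-1ab 'step-1ab)
            step-1c (trace-call step-1c 'step-1c)
            step-2 (trace-call step-2 'step-2)
            step-3 (trace-call step-3 'step-3)
            step-4 (trace-call step-4 'step-4)
            step-5 (trace-call step-5 'step-5)]
    (stem word)))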
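To try the binding-plus-tracing idea without the whole stemmer, here is a rough sketch with two toy functions. strip-s, toy-stem, and debug-toy-stem are hypothetical stand-ins, not part of the Porter Stemmer code, and the sketch assumes Clojure 1.8 or later for clojure.string/ends-with?. It calls trace-call directly inside binding so the tracing lasts only for that one call.

```clojure
(require '[clojure.string :as str])

;; trace-call as defined earlier, repeated so this snippet runs on its own.
(defn trace-call [f tag]
  (fn [& input]
    (print tag ":" input "-> ")
    (flush)
    (let [result (apply f input)]
      (println result)
      (flush)
      result)))

;; Hypothetical stand-ins for the stemmer's steps. They are marked
;; ^:dynamic so that binding can rebind them (Clojure 1.3 and later).
(defn ^:dynamic strip-s [word]
  (if (str/ends-with? word "s")
    (subs word 0 (dec (count word)))
    word))

(defn ^:dynamic toy-stem [word]
  (strip-s (str/lower-case word)))

;; Traced versions exist only inside the binding form.
(defn debug-toy-stem [word]
  (binding [strip-s  (trace-call strip-s 'strip-s)
            toy-stem (trace-call toy-stem 'toy-stem)]
    (toy-stem word)))

(debug-toy-stem "Cats")   ; returns "cat", printing a trace line per call
(toy-stem "Cats")         ; returns "cat", prints nothing
```

Because the inner call starts printing before the outer call has finished, the trace lines for nested calls run together on the same line. That is a limitation of this simple tracer.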
That's it. These were the main functions I used in debugging the stemmer as I ported it from C and made it more Clojure-native.
Next up, we'll create a concordance and look at other ways of presenting the texts that we're analyzing.
By the way, I've also finally updated the repository for sample code.
4 comments:
Having really been following your articles - but I thought I would mention that porter is not a very good stemmer. My own work uses CL with stemming, and on the TREC WSJ collection queries 251-300 and 151-200 perform worse when porter is used.
Weak stemming is much better. (ie. simply removing 's' and 'ing' and 'ed' from the ends of words.)
Anonymous,
Of course you're right. Generally, on my projects I don't stem at all.
The Porter Stemmer gets a lot of press as a "standard," though, whatever that means. So I wanted to cover it. It turned into a bigger project than I'd anticipated, though, and I probably would have been better just doing the weak stemmer you describe. Live and learn.
Thanks for checking in,
Eric
"I probably would have been better just doing the weak stemmer you describe."
Nonsense! Readers of this series don't care about obtaining a stemmer they can use; they want to see how Clojure techniques can be used in a range of simple and complex tasks.
Gavin Sinclair
LOL