Thursday, July 10, 2008

Stemming, Part 3: More Basics

In the last posting, I introduced a number of Clojure data structures. Today, I’ll introduce a few more; then I’ll show you some common functions.

More Data Structures


We’ve seen symbols before: every word in Clojure is represented as a symbol; every function name; anything that’s text, really, but isn’t a string. In programs, symbols act as variable names.

Clojure also has keyword symbols, which are like symbols, except they cannot be used as variables. Instead, a keyword always stands for itself. To write a keyword, put a colon (:) before its name:

user=> :word
:word

Keywords are used a lot in Clojure, particularly as keys for hash maps. There is good reason for this: a keyword is also a function that takes a hash map and returns the value associated with itself in the mapping.

user=> (:word {:frequency 4, :word "the"})
"the"

If you try to retrieve a keyword’s value from a mapping that doesn’t have the keyword, it returns nil.

user=> (:location {:frequency 4, :word "the"})
nil
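
When nil is not a useful answer, a keyword can also take a second argument that is returned when the key is missing. Here is a minimal sketch, reusing the map from above; the names entry and :unknown are made up for illustration:

```clojure
;; `entry` is a hypothetical name for the example map used above.
(def entry {:frequency 4, :word "the"})

;; A second argument to the keyword is returned when the key is missing.
(:location entry :unknown)      ; => :unknown

;; get performs the same lookup with the map in first position.
(get entry :word)               ; => "the"
(get entry :location :unknown)  ; => :unknown
```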


Keywords’ acting as functions makes both keywords and mappings incredibly useful as flexible, generic data structures in Clojure. This is so common in Clojure that Rich Hickey has added structures, which are mappings with predefined sets of keys, and which are very efficient.

To define a structure, use defstruct, giving it a name and a list of keyword fields. Here’s a structure that stores a word and its frequency:

user=> (defstruct word-data :word :frequency)

Now, you can define an instance of that data type (a word-data), using struct, which is called with the name of the structure and the values for the fields in the same order as they’re defined in the defstruct:

user=> (def the-word (struct word-data "the" 400))
user=> the-word
{:word "the", :frequency 400}

Use the keyword field names as functions to retrieve the value of the field from a structure:

user=> (:word the-word)
"the"
user=> (:frequency the-word)
400

Of course, the-word behaves just like a hash map, so you can add other fields to it and treat it like a hash map in other ways too:

user=> (assoc the-word :location 'here)
{:word "the", :frequency 400, :location here}
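
Note that assoc does not change the-word itself; it returns a new map with the extra key. The sketch below illustrates this, with located-word as a made-up name for the result:

```clojure
(defstruct word-data :word :frequency)
(def the-word (struct word-data "the" 400))

;; assoc returns a new map; `located-word` is a hypothetical name.
(def located-word (assoc the-word :location 'here))

(:location located-word)  ; => here
(:location the-word)      ; => nil, the original is unchanged
(:frequency located-word) ; => 400, the existing fields carry over
```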

Other Useful Functions

Of course, the functions we’ve just seen won’t do everything that we’ll need. Here are some useful functions, many of which we’ve already seen.

(dec n) Return one less than n. This is faster than (- n 1).

(inc n) Return one more than n. This is faster than (+ n 1).

user=> (dec 4)
3
user=> (inc 4)
5

(let [variables] expressions) Defines one or more variables. variables is a vector of variable/value pairs, arranged just like the key/value pairs in a hash mapping. expressions is one or more expressions. The entire let returns the value of the last expression.

user=> (let [x 4, y 5] (+ x y))
9
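
The pairs in a let are evaluated in order, so a later value can refer to an earlier variable. A small sketch (the name total is only for illustration):

```clojure
;; Bindings are evaluated in order, so y can be computed from x.
(def total            ; `total` is a hypothetical name
  (let [x 4
        y (* x 2)]
    (+ x y)))

total ; => 12
```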

(if test true-expression false-expression) Evaluates test, and if it returns a true value (anything but false or nil), it evaluates and returns the value of true-expression; otherwise, it evaluates and returns the value of false-expression.

user=> (let [x :name] (if (= x :name) :yes :no))
:yes
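
Because only false and nil count as false, values that are "empty" in some other languages still take the true branch. A quick sketch, using a made-up helper named yes-or-no:

```clojure
;; `yes-or-no` is a hypothetical helper that exposes how if treats a value.
(defn yes-or-no [x]
  (if x :yes :no))

(yes-or-no 0)     ; => :yes
(yes-or-no "")    ; => :yes
(yes-or-no [])    ; => :yes
(yes-or-no nil)   ; => :no
(yes-or-no false) ; => :no
```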

(if-let [var test] true-expression false-expression) Combines let and if, capturing a common pattern:

(let [age (:age person)]
  (if age
    (str "My age is " age)
    "No age given"))

Here, you define a variable from an expression. If its value is true, one expression is evaluated; if it’s false, another is evaluated instead. Here’s what this looks like in practice:

user=> (def person {:given "Eric" :surname "Rochester"})
user=> person
{:given "Eric", :surname "Rochester"}
user=> (:age person)
nil
user=> (if-let [age (:age person)]
  (str "My age is " age)
  "No age given")
"No age given"

(when test expressions) If test evaluates to a true value, evaluates expressions and returns the value of the last one; otherwise, returns nil.

user=> (when (= 41 42)
  (list 'expression 'one)
  (list 'expression 'two))
nil
user=> (when (= 42 42)
  (list 'expression 'one)
  (list 'expression 'two))
(expression two)

(min values...) Returns the least value among its arguments.

user=> (min 3 5)
3
user=> (min 5 7 3)
3
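
Since min takes its values as separate arguments, a collection has to be spread out with apply first. A minimal sketch, with lengths as a made-up vector of numbers:

```clojure
;; `lengths` is a hypothetical vector used only for this example.
(def lengths [5 7 3])

;; apply calls min with the elements of the vector as its arguments.
(apply min lengths) ; => 3
```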

(and expressions) Evaluates its expressions until one returns a false value (nil or false), at which point it returns that value; otherwise, it returns the value of the last expression.

(or expressions) Evaluates its expressions until one returns a true value, at which point it returns that; otherwise, it returns the value of the last expression (nil or false).

(not expression) Evaluates its one expression and returns the logical complement of it. It returns true if the expression evaluates to nil or false, and false for any other value.

user=> (and (:given person) (:surname person))
"Rochester"
user=> (and (:given person) (:surname person) (:age person))
nil
user=> (or (:given person) (:surname person))
"Eric"
user=> (or (:given person) (:surname person) (:age person))
"Eric"
user=> (not (:given person))
false
user=> (not (:age person))
true
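
Because and and or return one of their arguments rather than a strict boolean, or is often used to supply a fallback value. The sketch below shows the return values; :unknown is a placeholder chosen for illustration:

```clojure
(def person {:given "Eric" :surname "Rochester"})

;; and stops at the first false value and returns it as-is.
(and 1 false 2) ; => false
(and 1 nil 2)   ; => nil

;; or returns its last value when nothing is true.
(or nil false)  ; => false

;; or as a fallback: :unknown is a made-up default.
(or (:age person) :unknown) ; => :unknown
```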

+, -, *, / Perform arithmetic operations on their arguments.

user=> (- 5 3 1)
1
user=> (+ 5 3 1)
9
user=> (* 8 2)
16
user=> (/ 8 2)
4
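
One thing to watch for: dividing integers that don't divide evenly gives a ratio rather than a truncated integer. A short sketch of the alternatives:

```clojure
(/ 7 2)    ; => 7/2, an exact ratio
(quot 7 2) ; => 3, integer quotient
(rem 7 2)  ; => 1, remainder
(/ 7 2.0)  ; => 3.5, floating point when either argument is a float
(- 5)      ; => -5, with one argument - negates
```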

=, not=, <, >, <=, >= Compare their arguments, returning a boolean.

user=> (= 5 3)
false
user=> (not= 4 5)
true
user=> (not= 4 4)
false
user=> (< 5 3)
false
user=> (> 5 3)
true
user=> (<= 5 3)
false
user=> (>= 5 3)
true
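
These functions also accept more than two arguments, in which case the comparison has to hold all the way along the list. A brief sketch:

```clojure
(< 1 2 3)    ; => true, strictly increasing
(< 1 3 2)    ; => false, 3 is not less than 2
(<= 1 1 2)   ; => true, equal neighbors are allowed
(= 4 4 4)    ; => true, all arguments are equal
(not= 4 4 5) ; => true, not all arguments are equal
```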

For the next posting, we’ll apply what we’ve learned about Clojure and its data structures to the Porter Stemmer algorithm.


Anonymous said...

in the multiply example (*), there seems to be an extraneous backslash.

furthermore, it has a red box around it.


Eric Rochester said...

Thanks for catching that. I'm using Markdown and Pygments to edit my posts, and evidently they didn't play well together there.

Thanks for the feedback. I appreciate it.


Unknown said...

Hi, in (not expression) you say "An expression evaluating to nil or false will return true; a true expression will return nil." so why this:

user=> (not (:given person))

returns false (instead of nil)?

Thank you for the excellent tutorial! chris

Eric Rochester said...

Busted. What I put before isn't technically correct.

I wrote, "nil," but I really meant, "A value that tests as false." Of course, "not" returns a boolean value, so it must be "true" or "false."

Sorry about the confusion.

I'll have to go in and change that. Thanks for catching it.


Jeff Schwab said...

if-let requires a vector for its binding:

(if-let [age (:age person)]
(str "My age is " age)
"No age given")

Eric Rochester said...

Hi Jeff,

Thanks for the corrections. I wrote this just before Rich changed the syntax to use vectors for binding consistently. About that time I got really busy and didn't have time to add more to this series or this blog, much less to update the code to work with the latest version of Clojure.

So yep, you're going to find a lot of problems like that (and the dot-slash thing you pointed out on the other post).

Thanks for stopping by. I hope you find it useful.