Friday, July 18, 2008

Stemming, Part 8: Macros

At the end of the last posting, I mentioned that we could probably simplify the ends? function. Currently, it returns a vector with the result and a boolean indicating whether it had been changed.

Data as Code

Compared to most programming languages, Clojure (and lisp) programs are odd: Functions and expressions look like lists; parameter lists for functions or let expressions look like vectors; variables look like symbols.

In fact, those programming constructs start their lives as those data structures. Remember that our interactive environment is called a REPL—a read, evaluate, and print loop. In the read stage, a function or expression is read in as a list, vector, symbol, string, or number. It’s not actually compiled into computer code until the evaluate stage. Before then, it’s all data.

This characteristic of Clojure—that programs are data—is called homoiconicity. Part of what makes lisp so powerful and special is that it introduces a new step in the evaluation process. In most computer languages, the computer reads in an expression and immediately turns it into an action or byte code or whatever. In lisp, the computer reads in the expression as a native lisp data structure, and before it processes the data, it asks the programmer again how it should be handled.

This makes more sense with an example. Consider the Clojure built-in when. You’ll recall that when is just like if, except that the true expression can be a sequence of expressions and the false expression doesn’t exist. Consider this when expression:

user=> (when (= (nth "abc" 1) \b)
         (println "b")
         \b)
b
\b

(println prints its arguments, separated by spaces, and followed by a new line.)

The first “b” is what is printed. The “b” is the value returned by the when expression, from the third line of the expression.

Of course, that when expression is the same as this if expression:

user=> (if (= (nth "abc" 1) \b)
         (do
           (println "b")
           \b))
b
\b

(do just executes a list of expressions and returns the value of the last one. It’s a way of taking a sequence of expressions and using them in places that only want one expression.)

In fact, the when expression is exactly equal to the if expression. Clojure reads the when expression and transforms it into the if expression before evaluating it.

How is this done and how can we create our own code transformations?

Macros.

Code Transformations

A macro is a code transformation. If you know XSLT, a macro transforms a Clojure expression in almost the same way as XSLT transforms an XML document. (If you don’t know XSLT, forget I mentioned it.)

A macro definition looks a lot like a function definition. It has a macro name and a vector of parameters. But macros start with the symbol defmacro, and the body of the macro has a template that builds the output expression.

Here are the tools you use to create the template.

Quote

A quote tells Clojure to return the next expression as data, exactly the way it is written, not as an expression or as a macro template. For example, the first let evaluates x before returning, but the quote in the second let means that Clojure returns the symbol x, not the value that the variable x evaluates to.

user=> (let [x 1] x)
1
user=> (let [x 1] 'x)
x

Quasi-Quote

A quasi-quote is a back-quote character. It indicates that, when the macro is called, the expression it quotes should be scanned for the constructs below, and the list or expression generated should be returned. Except for the scanning bit, it’s a lot like a regular quote.

user=> `(let [x 1] x)
(clojure/let [user/x 1] user/x)

Tilde

Inside a macro template, a tilde forces Clojure to evaluate the expression that follows it and insert its value into that place in the result.

user=> (let [x 2]
         `(let [y ~x] y))
(clojure/let [user/y 2] user/y)

In this, the outer let sets x to 2. Inside that let, it defines a macro template, which returns a list that looks like another let expression. Inside the inner let it inserts the value of x from the outer let.

In many ways, macro templates are shortcuts for longer expressions that build lists and vectors. For example, the last expression is the same as this expression, using functions to build the output expression:

user=> (let [x 2]
         (list 'let (vector 'y x) 'y))
(let [y 2] y)

Tilde-At

The tilde-at sequence is a variation of tilde that evaluates its argument, which should return a sequence, and inserts the elements from that sequence into a list in the macro template.

user=> (let [x '(1 2 3)]
         `(println ~@(map (fn [y] (* 2 y)) x) 'done))
(clojure/println 2 4 6 (quote user/done))

In this expression, the elements from x are run through map, which doubles their values. The output of map is inserted into the list that starts with println and ends with 'done. Notice that 'done in the output is still quoted. It was read in as data—'done—and output as the same data.

gensym and #

Sometimes when you’re creating a macro template, you want to use create a variable. One danger is that it could shadow a variable in the surrounding code.

For example, in this code,

user=> (let [x 42]
         `(let [x 13]
           [x x]))
(clojure/let [user/x 13] [user/x user/x])

what do the two x variables refer to on the third line? The x from the outer let (42); or the x from the inner let (13)? In this case, they both refer to the inner x. Occasionally, you may do this on purpose, but usually it’s a mistake.

To avoid this mistake, in variables you declare in a macro template, you should append an # onto the end of the variable name. This makes sure that that variable name is never used elsewhere. It cannot clobber any other variable name.

user=> (let [x 42]
         `(let [x# 13]
           [x x#]))
(clojure/let [x__1888 13] [user/x x__1888])

In this case, the two x variables in the final line now refer to two different variables. The first refers to the variable in the outer let, and the second refers to the variable in the inner let, in the macro template. You can see that it’s different because Clojure has removed the # character and replaced it with __1888. (That number always changes. If you run this code, it will be different for you.) Moreover, Clojure guarantees that that variable name will never be used again during that run.

A Quick Macro

So let’s write a quick macro, just to see what one looks like.

I usually develop with just a REPL and the text editor Vim, and if I need to debug anything, often I just put in print statements. In languages like Python, many of these statements look like this:

print "x+4 =", x+4
print "counts[word] =", counts[word]

That’s a lot of duplication and typing, and aren’t computers supposed to get rid of that?

The problem is that we want to bypass the way the language normally evaluates expressions. We want the computer to print the first expression out, just like it would data; we want it to evaluate the second expression and print out the result.

Because we want to change Clojure’s normal evaluation rules, this is a perfect place for a macro. We will pass in an expression and have Clojure print both the expression and its value. As an added bonus, we can have it also return the value. Here’s what we want it to look like:

user=> (let [x (+ 42 13)]
         (println '(+ 42 13) "=>" x)
         (flush)
         x)
(+ 42 13) => 55
55

Let’s look at what this would look like as a macro named debug:

(defmacro debug
  [expr]
  `(let [value# ~expr]
     (println '~expr "=>" value#)
     (flush)
     value#))

Let’s pick this apart.

(defmacro debug: Create a macro named debug.

[expr]: The macro takes one parameter, which will be called expr in the body of this macro.

(let [value# ~expr]: Start the macro template with a let expression. The let will define one variable, named value#, which will be assigned the value of evaluating expr.

(println '~expr "=>" value#): Print the expression expr as data (because it is quoted, an arrow, and the value of the variable value#.

(flush): Flush the output buffer.

value#)): Return the value of the variable value#.

Before we actually try it, we can see what calling it would return using the function macroexpand-1. This takes a list expression and returns the list expression created by calling the first macro, if any, that needs to be executed on it.

user=> (macroexpand-1 '(debug (+ 42 13)))
(clojure/let [value__1892 (+ 42 13)]
  (clojure/println (quote (+ 42 13)) "=>" value__1892)
  (clojure/flush)
  value__1892)

(I reformatted this to make it readable.) With a little mangling, we can see that the debug macro will generate the expression

(let [value__1892 (+ 42 13)]
  (println '(+ 42 13) "=>" value__1892)
  (flush)
  value__1892)

Which is exactly what we want. If we then call that macro, we get this.

user=> (debug (+ 42 13))
(+ 42 13) => 55
55

(The first line is the printed output of the expression. The final 55 is the result of the expression.)

In the next posting, we’re going to look at the process of how to write macros and how to avoid the pitfalls and dangers of macro writing, and we’re going to write a number of macros for the Porter Stemmer.

14 comments:

Anonymous said...

Nice post. However, '~' is called 'tilde'. See http://en.wikipedia.org/wiki/Tilde.

Eric Rochester said...

Umm. Wasn't that what I called it?

Confusedly,
Eric

Brian Doyle said...

I believe the problem is that your tilde looks like the backquote.

Eric Rochester said...

You're probably right.

Looking at it, I should maybe make the font size larger in the code snippets.

Paul Barry said...

Excellent post! Keep 'em coming!

Brian Doyle said...

Agree with Paul, excellent post and thanks again for the whole series!

Eric Rochester said...

Thanks guys. I appreciate it.

Eric

Anonymous said...

I dont get the line (println '~expr "=>" value#). Above you state that tilde evaluates expr. Why do we want to evaluate here? We want to print the unevaluated expression don't we?

Eric Rochester said...

Hi Anon,

This is where macros get a little complicated. Remember that there are two levels of execution: first when the macro is executed, and second when the code that the macro generates is executed.

I'll step through what's happening, paying attention to the line you mentioned.

At macro expansion time, in (debug (+ 1 2)), expr gets bound to (+ 1 2). So (println '~expr "=>" value#) gets translated to (println '(+ 1 2) "=>" value#).

Without the tilde, it gets changed to (println 'expr "=>" value#).

At execution time, the quote in (println '(+ 1 2) "=>" value#) keeps the expression from being re-executed and has the form get passed to println untouched. value# is bound to 3, so the result is that it prints "(+ 1 2) => 3".

In the other, (println 'expr "=>" value#), because of the quote, expr gets passed through untouched, and the output is "expr => 3".

The difference here is also the quote, which comes into play at execution time and prevents the form (+ 1 2) from being executed a second time.

HTH,
Eric

Anonymous said...

Thank you Eric. It perfectly makes sense now ...

Anonymous said...

you say that ` and ' are basically the same, but it doesn't seem that way to me. if i do 'let vs `let at the repl...i get very different results

Eric Rochester said...

It looks like I said, "Except for the scanning bit, it’s a lot like a regular quote."

"The scanning bit" is an awfully big bit, though.

But you're right, I really did not stress how big of a difference that is.

Eric

Anonymous said...

Best explanation I've seen so far. Thanks!

Anonymous said...

Excellent post! You broke it in peaces so we can easily digest. Divider and conquer!!!