Writing/Coding: Erlang


Sunday, November 11, 2007

Erlang Nostalgia

A few months ago, I wrote a system for controlling distributed processing using Erlang, but since then, I haven't really done much with Erlang. Finishing up that series of articles on Erlang reminded me of what fun Erlang is. Looking at problems concurrently is interesting and fun, and while I'm not always fond of the language's syntax, I still enjoy hacking in it. I'll have to do more Erlang in the near future. Something else for my to-do list.

Wednesday, November 7, 2007

Concurrent Thinking, Part 3

When I decided to dust off and publish the old post I had written, I wasn't expecting it to turn into a week-long series. And it's not. This is the last.

If you'll remember, gentle reader, I've been foreshadowing trouble. Today's the payoff.

You'll remember that the reason I didn't just slurp the entire file into memory at one time and build lists from it is that I needed to be able to process files that were much larger than would fit in memory. If the process reading the file can read all of it before the transformation process can handle much of it, or before the output process can write the output, the file will be read into memory in its entirety, more or less, and this entire exercise has failed. Message sends in Erlang are asynchronous, so nothing makes a fast sender wait for a slow receiver.

To imagine how this could happen, suppose the input comes from a local file, but the output is being written across a network. The input process could read all the file, and the transformer process could handle the data, but the messages would still pile up in the output process's mailbox, filling memory and bringing the system to its knees. Or suppose the transformation being applied involves a time-intensive computation. Its mailbox would fill up memory when the input process sends it the entire file while the first few messages are being transformed. Again, the system dies, horribly, horribly.

I won't tell you how I solved this when it happened to me. Looking back, it was a nasty, nasty kludge that I only used because I was under time pressure. What I should have done was use the functional solution someone mentioned in a comment to Concurrent Thinking, Part 1.

This solution involves having the input and transformation functions return a tuple with two values: data and another function. The new function will return more data and yet another function. Eventually, when there's no more data, the function will just return a flag indicating that. I won't rewrite the entire example from yesterday in this style, but the input function would look like this:

input(Filename) ->
    {ok, F} = file:open(Filename, [read]),
    input_continuation(F).

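input_continuation(F) ->
    fun() ->
        case file:read(F, 1024) of
            eof ->
                file:close(F),
                eof;
            %% file:read/2 returns {ok, Data} on success, so unwrap the
            %% tuple before pairing the data with the next continuation.
            {ok, Data} ->
                {Data, input_continuation(F)}
        end
    end.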

Here, input/1 opens the file and calls input_continuation/1. This returns a function, A. When you call A, you get the first block of data and a new function, B. B will return the second block of data and a function C. This happens until there is no more data, at which point the file is closed and eof is returned. This solution is very Erlang-ish, and it throttles nicely so it doesn't accidentally fill up memory. End of series.
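To show how the rest of the chain could plug into a function like this, here is a minimal sketch of a transformation stage and a consumer written in the same continuation style. The module name lazy_pipe and the functions from_list/1, drop_every_other/1, and to_list/1 are made up for illustration. from_list/1 stands in for the file reader so the example runs without any files, and this is not the code I actually used.

```erlang
-module(lazy_pipe).
-export([from_list/1, drop_every_other/1, to_list/1]).

%% Hypothetical stand-in for the file reader: each call yields
%% {Item, NextFun} until the list runs out, then eof.
from_list([]) ->
    fun() -> eof end;
from_list([Item | Rest]) ->
    fun() -> {Item, from_list(Rest)} end.

%% Transformation stage: pulls two items from upstream, drops the
%% first, and passes the second along with a new continuation.
drop_every_other(Next) ->
    fun() ->
        case Next() of
            eof ->
                eof;
            {_Dropped, Next1} ->
                case Next1() of
                    eof ->
                        eof;
                    {Kept, Next2} ->
                        {Kept, drop_every_other(Next2)}
                end
        end
    end.

%% Consumer: drives the chain one item at a time. A real output
%% stage would write each item to a file instead of collecting it.
to_list(Next) ->
    case Next() of
        eof -> [];
        {Item, Next1} -> [Item | to_list(Next1)]
    end.
```

Calling lazy_pipe:to_list(lazy_pipe:drop_every_other(lazy_pipe:from_list([a, b, c, d, e]))) returns [b, d]. Nothing is read until the consumer asks for it, which is where the throttling comes from.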

Tuesday, November 6, 2007

Concurrent Thinking, Part 2

So, briefly, the problem involves piping large amounts of data through a series of transformations, usually just one. In other languages I would use some form of lazy lists to plug together the input reader, transformations, and output writer. In Erlang, I decided to use processes.

One interesting thing about Erlang processes: If you squint at them just right, they act like classes do in other languages, just without inheritance. The processes communicate by passing messages, and those messages trigger the processes to perform predefined actions. This seems very like the message-passing paradigm of object-oriented programming.

Basically, one process reads the input and sends it to the transformer process, which in turn sends it on to the output process. Here's the Erlang version of the pipeline I implemented in Python in Part 1 (it has to be saved in a file named "erltransform.erl" to work):

  -module(erltransform).

  -export([input/2, transform/1, output/1, main/2]).

  input(Filename, NextPid) ->
      {ok, F} = file:open(Filename, [read]),
      input(F, file:read(F, 1024), NextPid).

  input(F, {ok, Data}, Pid) ->
      Pid ! Data,
      input(F, file:read(F, 1024), Pid);
  input(F, eof, Pid) ->
      io:format("closing input.~n", []),
      ok = file:close(F),
      Pid !  eof.

  transform(OutPid) ->
      transform(OutPid, 0).

  transform(OutPid, 0) ->
      receive
          eof ->
              OutPid ! eof;
          _Data ->
              transform(OutPid, 1)
      end;
  transform(OutPid, 1) ->
      receive
          eof ->
              OutPid ! eof;
          Data ->
              OutPid ! Data,
              transform(OutPid, 0)
      end.

  output(Filename) ->
      {ok, F} = file:open(Filename, [write]),
      write_output(F).

  write_output(F) ->
      receive
          eof ->
              ok = file:close(F),
              io:format("closing output. done.~n", []),
              eof;
          Data ->
              ok = file:write(F, Data),
              write_output(F)
      end.

  main(InFilename, OutFilename) ->
      OutPid = spawn(erltransform, output, [OutFilename]),
      TransformPid = spawn(erltransform, transform, [OutPid]),
      spawn(erltransform, input, [InFilename, TransformPid]).

Hmm. This is a lot longer, and it doesn't do as much. Erlang's libraries are OK, but the standard distribution doesn't have Python's batteries-included approach. Since Erlang doesn't come with a library to read and write CSV files, this just reads blocks of 1024 bytes, and it drops every other block.

Some explanation: At the beginning of the code above, the input/2 function opens the input file, reads the first block, and calls input/3, which sends the data to the transformer PID and loops. When it reaches the end of the file, it prints a message, closes the file, passes eof along to the transformer, and exits.

The transform/1 function calls transform/2 with a flag indicating whether it should keep or drop the next block of data that's sent to it. Based on the value of this flag, it either sends the data on or ignores it. When it receives eof, it sends that on and exits.

The output/1 function opens the file and passes the file handle on to write_output/1. (Since both of these functions have the same arity, I had to give the second function a different name.) It just writes out the data it receives and closes the file when told to.

Finally, main/2 spawns all the processes, wiring them together by handing each one the PID of the next stage, then it gets out of the way. In this case, the messages being passed are simple: the data read, dropped, or written.

Yesterday, I hinted that this approach got me in trouble. How? That, gentle reader, is tomorrow's episode. Stay tuned.

Monday, November 5, 2007

Concurrent Thinking, Part 1

Note: I originally wrote this back in June, but never finished it. And now, presenting part one....

Concurrency is the key idea to come to terms with in Erlang and the key tool that it gives you. It's the modeling tool to use when solving a problem. I realized this when I was implementing a code transformation pipeline. The pipeline would go input to transformation to output. I wanted to be able to interchange any of those pieces, so I could read from multiple formats, transform the data in different ways, and write the data out in different formats. And it needed to be able to handle a lot of data.

On my first attempt, I approached the problem the way I might with Python: Input, transformation, and output are iterators which are just chained together. For example, the input would read CSV, the transformation would drop every other item, and the output would write it back to CSV:

  from __future__ import with_statement

  import csv

  def input(filename):
      with open(filename, 'rb') as f:
          for row in csv.reader(f):
              yield row

  def transform(rows):
      rows = iter(rows)
      while True:
          rows.next()
          yield rows.next()

  def output(rows, filename):
      with open(filename, 'wb') as f:
          writer = csv.writer(f)
          writer.writerows(rows)

That's not very useful, but you get an idea of the problem and what I was trying to do. Using iterators/generators is nice here because I don't have to worry about too much input filling up memory. Also, I don't build a list I don't need or use.

Unfortunately, Erlang doesn't have iterators. I thought I would use lists instead and work from there. I could have had the input function read everything into a list, the transformation function transform it into another list, and the output function write the list's contents to disk. It wouldn't have been pretty, but it would have worked. Part way through coding this, I realized that Erlang has a great solution to this: concurrency.
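For comparison, here is a rough sketch of what the list-based version could have looked like. The module name list_pipeline and both function names are made up for illustration, and it works on lines rather than CSV rows, so treat it as an assumption about the shape of the code rather than something I wrote at the time. The whole file sits in memory at once, which is exactly the limitation described above.

```erlang
-module(list_pipeline).
-export([run/2, drop_every_other/1]).

%% Read the whole input into memory, transform the list of lines,
%% and write the result back out. Empty lines are discarded, so a
%% trailing newline does not produce a phantom item.
run(InFilename, OutFilename) ->
    {ok, Bin} = file:read_file(InFilename),
    Lines = [L || L <- binary:split(Bin, <<"\n">>, [global]), L =/= <<>>],
    Kept = drop_every_other(Lines),
    file:write_file(OutFilename, [[L, $\n] || L <- Kept]).

%% Drop the first of every pair of items and keep the second.
%% A leftover single item at the end is dropped as well.
drop_every_other([_Dropped, Keep | Rest]) ->
    [Keep | drop_every_other(Rest)];
drop_every_other(_) ->
    [].
```

For example, list_pipeline:drop_every_other([a, b, c, d, e]) returns [b, d].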

Stay tuned tomorrow, when the hero attempts to use powerful Erlang concurrency and ends up hoist by his own petard.

Monday, May 28, 2007

Erlang: The Pros

I had meant to post this much sooner, but there's been this thing called work. It gets in the way. Finally, here is the second half of my overview of Erlang.

Concurrency

Given its background, it's no surprise that concurrency is easy and cheap in Erlang. Here's a contrived example:

12> PrintRandom = fun() ->              
        {A, B, C} = erlang:now(),           
        random:seed(A, B, C),
        io:format("~p~n", [random:uniform()])
        end.
#Fun

13> lists:foreach(fun(_) -> spawn(PrintRandom) end, lists:seq(1, 10)).
0.664105                                                             
0.664137
0.664170
0.664203
0.993887
0.993920
0.993953
0.993986
0.994019
0.994052
ok

A note on the seeding: the state for the random number generator is stored in each process, and unless you seed it, it's initialized from the same standard value every time, which is why each process seeds itself from the clock. The seeds are only microseconds apart, which presumably explains why the numbers cluster so tightly. (Actually, maybe the random module should be on my Erlang: Cons list, although having the random module not produce random numbers probably makes debugging easier.)

Line 13 spawns 10 processes, each of which prints a random number. Not very useful, but it should illustrate how easy it is to create processes.

Also, you can easily spawn a lot of processes. This code generates a random number and throws it away a million times in a million processes, and it executes in about 3.5 seconds on my machine:

-module(timespawn).

-export([make_random/0, spawn_random/1, ts/1]).

make_random() ->
    {A, B, C} = erlang:now(),
    random:seed(A, B, C),
    random:uniform().

spawn_random(0) ->
    ok;
spawn_random(X) ->
    spawn(fun make_random/0),
    spawn_random(X-1).

ts(X) ->
    timer:tc(?MODULE, spawn_random, [X]).
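Both examples above rely on erlang:now/0 and the random module. Newer Erlang/OTP releases (18 and later) deprecate erlang:now/0 and provide the rand module, which seeds itself in each process the first time it is used. A minimal sketch of the ten-process example on such a release might look like this. The module name rand_spawn and the function collect/1 are made up for illustration, and it sends the numbers back to the caller instead of printing them.

```erlang
-module(rand_spawn).
-export([collect/1]).

%% Spawn N processes, have each one generate a random float, and
%% gather the results in spawn order. No explicit seeding is needed:
%% rand seeds each process automatically on first use.
collect(N) ->
    Parent = self(),
    Pids = [spawn(fun() -> Parent ! {self(), rand:uniform()} end)
            || _ <- lists:seq(1, N)],
    %% Match on each Pid in turn so the result order is stable.
    [receive {Pid, X} -> X end || Pid <- Pids].
```

rand_spawn:collect(10) returns a list of ten floats, each at least 0.0 and less than 1.0.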

Distribution

Another nice aspect of Erlang is how easy it is to distribute processing. This is partially due to its open security model, which I listed as a negative. For example, in two separate console windows, I can start two different instances of Erlang. As long as I pass both the -sname argument (with different values) and the -setcookie argument (with the same value), they can talk to each other, even if they're on different computers in the same network.

Functional

I've worked with functional languages in the past, but I haven't really drunk the functional Kool-Aid until now. I'm enjoying it this time, though, and want to start working more with one of the Lisps (probably Scheme), Haskell, or ML. When I do start on one (or more likely, on all) of these, I'm sure I'll talk about it here.

OTP

The OTP (Open Telecom Platform) is the standard Erlang library, and it contains modules that make writing fault-tolerant, supervised, client-server, and distributed applications easy and fast.

Conclusion

I'm glad I've gotten to do a lot of Erlang recently. It's changed the way I think about concurrency and distribution, and it's raised my expectations of other systems that tackle the same problem. Even if I do most of my concurrent programming in other languages, this will give me a good basis. And really, what more can I ask for from a programming language?

Sunday, May 13, 2007

Erlang: The Cons

Today, I'm going to talk about the warts I've seen in Erlang so far. I'm sure I'll find other things I don't like, and I'm also sure that I'll make peace with some of the things I've listed here. Nevertheless, here is my list of the bad and the ugly so far. (This isn't a tutorial on Erlang, so I'm not going to explain the code snippets below. Sorry.)

Strings

The one thing that really made me go "Yuck" as I was learning Erlang is strings. Strings are just lists of integers, which are treated as strings for printing. For example, if you crank up the read-eval-print loop (REPL) and type in a list of integers, it gets printed as a string as long as the list contains only printable characters (no control characters, for example):

Eshell V5.5.4  (abort with ^G)
1> [69, 114, 105, 99].
"Eric"

Likewise, if you type in a string, it's interpreted as a list of integers:

4> lists:foreach(fun(X) -> io:format("~p ", [X]) end, "Rochester").
82 111 99 104 101 115 116 101 114 ok
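The same distinction shows up in the format directives: ~p prints a list as a string when it looks printable, while ~w always prints the raw integers. Here is a small sketch that makes the difference visible. The module name string_demo and the function show/1 are made up for illustration, while io_lib:format/2 and io_lib:printable_list/1 are standard library calls.

```erlang
-module(string_demo).
-export([show/1]).

%% Returns {PrettyPrinted, RawIntegers, IsPrintable} for a list:
%% ~p output, ~w output, and whether Erlang considers the list
%% a printable string.
show(List) ->
    {lists:flatten(io_lib:format("~p", [List])),
     lists:flatten(io_lib:format("~w", [List])),
     io_lib:printable_list(List)}.
```

string_demo:show([69, 114, 105, 99]) returns {"\"Eric\"", "[69,114,105,99]", true}, while string_demo:show([0, 42]) returns {"[0,42]", "[0,42]", false}, because zero is not a printable character.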

Initially, this just felt wrong to me. It still does in some respects, but I've also learned to appreciate how this makes string processing easy most of the time, particularly when it's coupled with Erlang's fantastic pattern matching features. For example, the function get_ints in the module below walks through a string and returns a list of all the integers in the string:

-module(get_ints).
-export([get_ints/1]).

get_ints(String) ->
    get_ints(String, [], []).

get_ints([], [], Ints) ->
    lists:reverse(Ints);
get_ints([], Current, Ints) ->
    CurrentInt = list_to_integer(lists:reverse(Current)),
    lists:reverse([CurrentInt|Ints]);
get_ints([Chr|Data], Current, Ints) when Chr >= $0, Chr =< $9 ->
    get_ints(Data, [Chr|Current], Ints);
get_ints([_Chr|Data], [], Ints) ->
    get_ints(Data, [], Ints);
get_ints([_Chr|Data], Current, Ints) ->
    CurrentInt = list_to_integer(lists:reverse(Current)),
    get_ints(Data, [], [CurrentInt|Ints]).

To use this, save it in a file named "get_ints.erl", compile it, and call the function:

8> c(get_ints).
{ok,get_ints}
9> get_ints:get_ints("This is the answer: 42").
"*"
10> get_ints:get_ints("This is the answer: 0, 42").
[0,42]
11> get_ints:get_ints("This is the answer: 0, 42, 23").
[0,42,23]

Notice that the first list of one integer ([42]) is interpreted as a string. I included the number zero (a control character) in the next test to force Erlang to print the results as a list of integers.

Another small complaint with strings involves Unicode. I've done enough XML processing and phonetic data processing that good Unicode handling is important to me, whether or not I'm using it much at the time. In one sense, Erlang handles Unicode just fine. A string containing a schwa character is just [601]. Unfortunately, this is the depth of its Unicode handling. It doesn't give you any information about the Unicode code points or a function to change a character from upper-case to lower-case or vice versa.

Security

Another complaint is Erlang's security model. On the one hand, it has the virtue of being easy to set up, but if two nodes can connect and communicate, there's nothing one can't do on the other node. Having more fine-grained control over things would be nice.

Speed

Also, there's the issue of speed. Erlang is generally fast enough, and where it's not, you can easily set up a node written in C, Java, or C#. Still, being able to deal with everything in Erlang would be more convenient.

REPL

Finally, there are restrictions in working from the REPL that I could do without. To create a function like get_ints above, I more or less have to create a new module in a file and put the function there. There are ways to work around this, but they seem unnatural. I'd rather not use them.

Nothing on this list is a deal-killer for me. I've been doing a lot of Erlang the past few weeks, and I've really enjoyed it. It's been productive and interesting. Still, I can always dream about a better world.

Sunday, May 6, 2007

Erlang: An Overview

Recently I've been programming in Erlang. It's a new computer language for me, and I thought I would talk about it. This will take several posts. Today I'll just provide an overview of the language and environment.

Erlang was developed by Ericsson telecommunications for developing soft real-time concurrent systems. I've never developed real-time systems before, soft or hard, so I won't talk about that much. I haven't done a lot with concurrency, either, except a touch of threading here and there. Nevertheless, the concurrency and distributed features of this language are what really attracted me to it.

Originally, Ericsson implemented Erlang in PROLOG, and this still shows in some parts of Erlang, such as the way it uses pattern-matching. As an undergraduate, I went through a period where I programmed more or less only in PROLOG for about two years, so a lot of this feels like a stroll down memory lane. Other parts are very new. It's primarily a functional language. It has a lot of features designed for doing telecom programming, such as processing binary data.

Erlang code is organized into modules, which are compiled and run from a read-eval-print loop (REPL). A short "hello world" style module would look like this:

-module(hello).
-export([hello/1]).

hello(Name) ->
    io:format("Hello, ~p.~n", [Name]).

The -module line identifies the module, whose name must match the name of the file containing it (here, hello.erl). The -export line lists the functions from this module that are exposed to the outside world, each with its arity. The hello function is the rest of the module. io:format is the standard way to print to the console.

Next, I want to look at some of the parts of Erlang that I'm less fond of.
