Thursday, October 9, 2008

Concordances, Part 1: What's that?

OK, that's enough of a break, don't you think?

Now we've got some processing tools under our belts: tokenizing and stemming. We'll tweak those, and we'll add more later. But for now let's take a look at our data.

One of the earliest ways of analyzing texts is the concordance. Simply put, it's an alphabetical list of the words used in a document, followed by every occurrence of that word, a little context around it (typically just a few words), and the location of that occurrence.

The first concordance was done in the thirteenth century for the Vulgate translation of the Christian Bible. However, because complete concordances are labor-intensive, they really weren't practical as a wide-spread tool. Because of this, concordances were only made for works deemed sufficiently important, such as religious texts or the works of Shakespeare or Chaucer.

But computers changed that. Now, given a text file, a computer can spit out a concordance in little time.

For example, here are the first few entries of a concordance of the first few paragraphs of this blog entry (although this may reflect an earlier draft). The numbers at the beginning are the line in a text file I copied the entry into and the start and end indices of the word in that line. This should give you a picture of what a concordance is, although a better one would pull context from surrounding lines. This one doesn't. Yet.

  1: 23: 24 K, that's enough of *a* break.
  5:  7:  8                take *a* look at our data.
 10: 35: 36 eren't practical as *a* wide-spread tool. B

  4: 17: 20      We'll probably *add* more later, and we'

  7: 30: 39 he earliest ways of *analyzing* texts is the [conco

  3: 66: 69 r belts: tokenizing *and* stemming.
  4: 33: 36 bly add more later, *and* we'll definitely tw

  9: 58: 61 mplete concordances *are* labor-intensive,

Next we'll analyze what we'll need to build the concordance and plan the next steps.

No comments: