## Visualizing text as paths

I was reminded recently that I actually have a blog. I have a Cellular Automata Part II post that’s been sitting around since April, but here’s something quick before I get around to polishing that up.

During a prescribed Art Night, I decided to come up with a visualization of long texts. I grabbed a few examples of literature from Project Gutenberg as plain text files, and converted them to binary using each letter’s ASCII character code.

$a\qquad\to\qquad 97\qquad\to\qquad \begin{array}{ccccccc} 1 & 1 & 0 & 0 & 0 & 0 & 1 \\ \end{array}$

An entire word then becomes a string of ones and zeroes 7 times as long as the word itself. The visualization comes in by turning that string into something like the output of a Lindenmayer system. A line is drawn, and at each step it reads the next character. A one instructs the drawing to take a right turn, and a zero instructs it to take a left turn.

$\text{baloney} \\ \strut\quad\downarrow\\ 98, 97, 108, 111, 110, 101, 121 \\ \strut\quad\downarrow \\ 1100010110000111011001101111110111011001011111001 \\ \strut\quad\downarrow \\ RRLLLRLRRLLLLRRRLRRLLRRLRRRRRRLRRRLRRLLRLRRRRRLLR$
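Here is a minimal Python sketch of the conversion. The post's own code was written in Mathematica and isn't shown, so the function names `text_to_bits` and `bits_to_turns` are made up for illustration. One assumption to flag loudly: characters with codes below 64 (spaces, digits, most punctuation) are zero-padded to 7 bits here, which the post doesn't specify.

```python
def text_to_bits(text, width=7):
    """Concatenate the binary ASCII code of every character.

    Assumption: each code is zero-padded to `width` bits, so a space
    (code 32) becomes 0100000. Letters already need exactly 7 bits.
    """
    return "".join(format(ord(ch), f"0{width}b") for ch in text)


def bits_to_turns(bits):
    """Map each 1 to a right turn (R) and each 0 to a left turn (L)."""
    return bits.replace("1", "R").replace("0", "L")
```

For example, `text_to_bits("a")` gives `"1100001"`, matching the expansion of 97 shown earlier.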

Starting from the bottom of an image, drawing upward, and taking these turns one at a time gives a blocky figure.
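The drawing step can be sketched in Python along these lines. This is not the original Mathematica code; the function name `walk`, the unit step length, and the choice to turn first and then step are all assumptions made for the sake of a concrete example.

```python
def walk(turns, start=(0, 0), heading=(0, 1)):
    """Return the list of grid points visited by a string of R/L turns.

    Assumptions: the walker starts out heading upward, and for every
    character it first turns 90 degrees and then moves one unit.
    """
    x, y = start
    dx, dy = heading
    points = [(x, y)]
    for t in turns:
        if t == "R":
            dx, dy = dy, -dx      # rotate the heading 90 degrees clockwise
        elif t == "L":
            dx, dy = -dy, dx      # rotate the heading 90 degrees counterclockwise
        else:
            raise ValueError(f"unexpected turn character: {t!r}")
        x, y = x + dx, y + dy
        points.append((x, y))
    return points
```

Four right turns in a row, `walk("RRRR")`, trace a unit square and end back at the starting point, which is a quick sanity check on the rotation logic.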

This was pretty quick to implement. My first version built the list of successive points along the path with repeated AppendTo calls in Mathematica, and took about 10 or 15 minutes to finish for books in the single-megabyte range. A rewrite using Reap/Sow made these finish in less than a minute. I have no training in optimization; all I know is that Reap/Sow is supposed to be the fast way to do this in Mathematica. I suspect that AppendTo copies the entire list each time, while Reap/Sow does not. There are likely 800 other better languages in which to write this kind of thing, but hey! Heyo! Woo!
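To make the suspected difference concrete, here is a toy cost model in Python. It is only an analogy, not a measurement of Mathematica: it counts how many elements get written if every append rebuilds the whole list, versus writing each element once into a growing collection. Both function names are hypothetical.

```python
def copies_with_rebuild(n):
    """Elements written if each append builds a brand-new sequence."""
    points = ()
    copied = 0
    for i in range(n):
        points = points + (i,)    # new tuple: all old items plus one
        copied += len(points)     # every item in the new tuple was written
    return copied


def copies_with_accumulate(n):
    """Elements written if each item is stored exactly once."""
    points = []
    for i in range(n):
        points.append(i)          # amortized constant work per item
    return len(points)
```

Under this model the rebuild strategy writes n(n+1)/2 elements against n for the accumulating one, so for a path with millions of points the gap is enormous. That is at least consistent with the speedup described above.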

Gutenberg adds a header and footer to each text file, with some standard stuff about where it was downloaded, when it was written, etc. For each book, I removed everything except the nitty-gritty, proper text. Leaving the header and footer in would only have tacked a bit onto the beginning and end of each path, without changing the shapes much. They would have been rotated, though.
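The post doesn't say how the trimming was done, and it may well have been by hand. As one possible way to automate it, here is a Python sketch that assumes the file contains marker lines beginning with `*** START OF` and `*** END OF`. The exact wording of those markers varies between Gutenberg files, so treat the prefixes (and the function name `strip_gutenberg`) as assumptions to check against your own downloads.

```python
def strip_gutenberg(raw, start_prefix="*** START OF", end_prefix="*** END OF"):
    """Keep only the lines between the start and end marker lines.

    Assumption: the markers begin with the given prefixes. If a marker
    is missing, that side of the text is left untrimmed.
    """
    lines = raw.splitlines()
    start = next((i for i, ln in enumerate(lines) if ln.startswith(start_prefix)), None)
    end = next((i for i, ln in enumerate(lines) if ln.startswith(end_prefix)), None)
    first = start + 1 if start is not None else 0
    last = end if end is not None else len(lines)
    return "\n".join(lines[first:last]).strip()
```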

Here are the results for four books. Here’s a zoomable PDF of the Frankenstein one. In each of these, a green point marks the beginning, and a red point the end. Frankenstein’s green point, for example, is in the lower right, and its red is on the ear of the evil head in the upper left. Boo!

It was really fun to get these results. I acknowledge that there is probably little meaning in these images from a stylometric point of view, but it is interesting to observe the differences. I’m sure these are very sensitive to formatting, the words themselves, and the ASCII codes for English letters. The Bible is loaded with paragraph breaks and colons for each verse, in contrast to the long paragraphs of Ulysses. Since every letter’s code begins with a 1 (a right turn), each image is guaranteed to have some kind of curliness. A single typo wouldn’t change the shape of the subsequent curve, but it could rotate the rest of the curve about that point.
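The rotation caused by a typo can be checked with a few lines of Python. Each character leaves the walker's heading rotated by some net number of quarter turns, and swapping one character for another changes the rest of the path only through that net rotation (plus a shift). The sketch below assumes 7-bit codes, and `net_quarter_turns` is a made-up name.

```python
def net_quarter_turns(ch, width=7):
    """Net clockwise quarter turns (0 to 3) produced by one character.

    Each 1 is a right turn and each 0 a left turn, so the net rotation
    is (ones - zeros) modulo 4. Assumes a zero-padded 7-bit code.
    """
    bits = format(ord(ch), f"0{width}b")
    return (bits.count("1") - bits.count("0")) % 4
```

Here `a` (1100001) and `b` (1100010) both give 3, so mistyping one for the other only shifts the rest of the curve. `e` (1100101) gives 1, so typing `e` for `a` turns everything after it by 180 degrees. With 7 bits per character, the difference between any two characters is always an even number of quarter turns, so a one-for-one substitution can only produce no rotation or a half turn. An inserted or dropped character is what produces a 90 degree rotation.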

So, is it just happenstance that Ulysses is so much more unfolded than Moby Dick? Is there something fundamentally different about the second half of Moby Dick that makes it hover around the end point?

An issue with the visualizations themselves is that they do not show any overlap as a line covers regions it has already followed. A way around this would be to assign an increasing height to each point, and show these paths as text tunnels in space. These could be seen as a 3D model or as projected images from different directions. Maybe I’ll try this out.
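The height idea is simple to prototype. A hedged Python sketch, assuming the path is already available as a list of `(x, y)` pairs; the name `lift` and the step size `dz` are placeholders.

```python
def lift(points, dz=1.0):
    """Give the i-th point of a 2D path the height i * dz.

    Revisited (x, y) locations end up at different heights, so
    overlapping stretches of the path no longer hide each other.
    """
    return [(x, y, i * dz) for i, (x, y) in enumerate(points)]
```

The resulting triples could then be handed to any 3D plotting tool, or projected along different directions to get the side views mentioned above.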

The full image files showing these walks are about 10x the size of the original text files. A text file storing the list of ordered pairs of the path is about 100x the original text file size. So, a foreseeable application of this technique could be to make text file sizes much larger, as well as difficult to read and process.

As a final thought, it was easy to call these random walks. Clearly these aren’t random in that they carry the meaning of the texts, but perhaps English letters appear random after being converted in this way. A way to test this is by looking at the distance from the starting point: a true random walk of $N$ steps ought to end up roughly $\sqrt{N}$ away from where it started.
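For comparison, here is a small Python simulation of the random baseline. It assumes the same kind of walk as the text paths (a fair coin picks a right or left turn, then one unit step) and averages over many trials. The function names and the default trial count are hypothetical choices.

```python
import math
import random


def random_turn_endpoint(n, rng):
    """End point of an n-step walk with a random right or left turn per step."""
    x = y = 0
    dx, dy = 0, 1                 # start out heading upward
    for _ in range(n):
        if rng.random() < 0.5:
            dx, dy = dy, -dx      # right turn
        else:
            dx, dy = -dy, dx      # left turn
        x, y = x + dx, y + dy
    return x, y


def rms_distance(n, trials=1000, seed=0):
    """Root-mean-square distance from the start over many random walks."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x, y = random_turn_endpoint(n, rng)
        total += x * x + y * y
    return math.sqrt(total / trials)
```

The $\sqrt{N}$ figure describes the root-mean-square distance over many walks. Any single random walk can land well inside or outside that radius, so one book's curve missing $\sqrt{N}$ at a given step is suggestive rather than conclusive.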

None of these look like plots of distance $= \sqrt{\text{steps}}$.