Overview

What do I do with this?

It's a tool for generating nonsense that looks a bit like someone's Twitter stream (which is how it came to be).

No, you're not limited to using it on the Brendans. But @SharkBrendan is pretty good.

You'll need a unix-style environment (Linux, Terminal on a Mac, or similar), with Python and Mercurial (hg) installed (and virtualenv if you follow my installation advice).

Installing (if you know enough to do this differently, you don't need my advice):

$ hg clone https://tikitu@bitbucket.org/tikitu/markov_brendan
$ cd markov_brendan
$ virtualenv --no-site-packages .
$ bin/pip install zc.buildout
$ bin/buildout

Running:

$ bin/markov -h # lists lots of options
$ bin/markov train @BrendanAdkins 500                    # trains a model on 500 tweets by the real Brendan, writes it to a file
$ bin/markov produce 10 --file markov_BrendanAdkins.dat  # produces 10 pseudo-Brendan tweets based on the model

$ bin/markov train @SharkBrendan 500 --file brendans.dat           # trains into a file you name ...
$ bin/markov train @SpringBrendan 500 --file brendans.dat --append # ... incrementally

A few technical details

The model uses n-grams, with trigrams as the default. Trigrams produce overly faithful output on sparse data, so the output mode includes a "backoff" component (turned on by default, disable it with --no-back-off) which tries to mix things up a little: whenever the current n-gram offers only a single choice for its prediction, it tries an (n-1)-gram instead, and so on until it either falls back to bigrams or gets to make a non-trivial choice.
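
In Python, that backoff step looks something like the sketch below. It's illustrative only: the function name, the `counts` structure mapping contexts to next-word counts, and the weighted pick are assumptions made up for this example, not the actual code.

import random

# Sketch of the backoff step described above (illustrative names and data
# layout, not the real module's API). `counts` maps a tuple of preceding
# words to {next_word: count}, for every order from bigrams up to n-grams.
def pick_next(counts, history, n=3):
    for order in range(n, 1, -1):
        context = tuple(history[-(order - 1):])
        choices = counts.get(context, {})
        # Accept this order only if it offers a real choice, or if we have
        # already backed off all the way down to bigrams.
        if len(choices) > 1 or (order == 2 and choices):
            words = list(choices)
            weights = [choices[w] for w in words]
            return random.choices(words, weights=weights)[0]
    return None  # context never seen in the training data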

In order to support this backoff system the model stores quite a lot of data: if you're working with 4-grams it will store every bigram, trigram and 4-gram in your corpus. The storage format is extremely dumb, so don't be surprised if the file size goes up pretty rapidly. You should expect something like twice the size of your corpus for bigrams, five times corpus size for trigrams (3 times for the trigrams plus 2 for the bigrams that support backoff), nine times corpus size for 4-grams, etc. (These are upper limits; actual size will depend on how much repetition there is in your corpus.) The good news is that on Twitter data even trigrams seem to be overkill, so this will very likely never become an issue for you.
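
If you want to sanity-check those multipliers, they're just the sum of the orders being stored; here's a back-of-the-envelope sketch of the arithmetic (not a measurement of the real file format):

# An order-k table repeats each corpus word roughly k times, so storing
# every order from 2 up to n costs about sum(2..n) times the corpus size,
# before repetition in the corpus collapses it.
def storage_multiplier(n):
    return sum(range(2, n + 1))

for n in (2, 3, 4):
    print(n, storage_multiplier(n))  # prints 2 2, 3 5, 4 9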

Contributors