Baker's Dozen

This is a boot-camp style series of one-day coding projects (originally thirteen, hence the name). The aim is to experiment with NLP in Haskell.


  1. Tokenizer and application framework (done);
  2. Type/token ratios, graph of changing ratio over the course of a text or corpus (done);
  3. Snap server (done);
  4. Frequency report (done);
  5. HTML 5, CoffeeScript, Compass/SASS (done);
  6. XML tokenizer (done);
  7. Re-work tokenizer to match ANC;
  8. Persistence as RDF in LevelDB or SQLite;
  9. Corpus management;
  10. Corpus processing;
  11. Search;
  12. Morphological tagger;
  13. POS tagger;
  14. Collocates, analysis and statistics;
  15. Clustering;
  16. Binary categorization (e.g., spam detection);
  17. Multi-label categorization;
  18. Hidden Markov model for corpus;
  19. Topic models;
  20. NER, date extraction;
  21. MM text generation.

Other topics:

  • Parallel or distributed processing.



Commands

bakers12 freq [OPTIONS] [FILES/DIRS]

This takes one or more files and calculates the frequency of each type in them.
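The type counting at the heart of freq could be sketched like this (frequencies is an illustrative name, not necessarily the project's actual function):

```haskell
import qualified Data.Map.Strict as M
import Data.List (sortBy)
import Data.Ord (comparing, Down (..))

-- Count how often each type occurs in a token stream,
-- most frequent types first.
frequencies :: [String] -> [(String, Int)]
frequencies =
    sortBy (comparing (Down . snd))   -- highest counts first
  . M.toList
  . M.fromListWith (+)                -- sum the per-token 1s by type
  . map (\t -> (t, 1))
```

For example, frequencies ["the", "cat", "the"] yields [("the", 2), ("cat", 1)].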


bakers12 serve [--port=INT]

This starts a Snap server for browsing information about the corpus.

At the moment, the only thing this server does is let you upload a file; the server tokenizes it and displays the tokens along with a graph of the running type-to-token ratio.


bakers12 tokenize [FILES/DIRS]

This tokenizes the files listed on the command line and prints each token. The output is formatted as CSV and includes these fields:

  • the normalized token;
  • the raw token;
  • the name of the file the token was from;
  • the character offset of the token in the file;
  • the raw length of the token; and
  • the running type-to-token ratio.
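The last field, the running type-to-token ratio, is the number of distinct types seen so far divided by the number of tokens seen so far, recomputed after each token. A minimal sketch (runningTTR and naiveTokens are illustrative names; the naive tokenizer below stands in for the project's real one):

```haskell
import qualified Data.Set as S
import Data.Char (isAlpha, toLower)

-- After each token, emit (distinct types so far) / (tokens so far).
runningTTR :: [String] -> [Double]
runningTTR = go S.empty (0 :: Int)
  where
    go _ _ [] = []
    go seen n (t:ts) =
      let seen' = S.insert t seen
          n'    = n + 1
      in  fromIntegral (S.size seen') / fromIntegral n' : go seen' n' ts

-- A naive stand-in tokenizer: split on whitespace, lowercase,
-- and strip non-letter characters.
naiveTokens :: String -> [String]
naiveTokens = filter (not . null)
            . map (filter isAlpha . map toLower)
            . words
```

For example, runningTTR (naiveTokens "the cat the") is [1.0, 1.0, 2/3]: the third token repeats a type, so the ratio drops below one.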

Future Commands

bakers12 init

Initialize a directory for analyzing documents.

bakers12 add [FILE-OR-DIRECTORY] ...

This adds a document or directory of documents to the corpus.

bakers12 info [DOCUMENT]

This prints information about a corpus or document.