This is a boot-camp-style series of thirteen one-day coding projects. The aim is to experiment with natural language processing (NLP) in Haskell.
- Tokenizer and application framework (done);
- Type/token ratios, graph of changing ratio over the course of a text or corpus (done);
- Snap server (done);
- Frequency report (done);
- HTML 5, CoffeeScript, Compass/SASS (done);
- XML tokenizer (done);
- Re-work tokenizer to match ANC;
- Persistence as RDF in LevelDB or SQLite;
- Corpus management;
- Corpus processing;
- Morphological tagger;
- POS tagger;
- Collocates, analysis and statistics;
- Binary categorization (e.g., spam detection);
- Multi-label categorization;
- Hidden Markov model for corpus;
- Topic models;
- NER, date extraction;
- MM text generation;
- Parallel or distributed processing.
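As a rough sketch of the first project, a naive tokenizer that treats maximal alphanumeric runs as tokens might look like the following. This is illustrative only: `tokenize` and `normalize` are hypothetical names, and the real tokenizer is slated to be re-worked to match ANC conventions.

```haskell
import Data.Char (isAlphaNum, toLower)

-- Treat maximal runs of alphanumeric characters as tokens;
-- everything else is a separator.
tokenize :: String -> [String]
tokenize = words . map (\c -> if isAlphaNum c then c else ' ')

-- Normalization here is just case-folding.
normalize :: String -> String
normalize = map toLower
```

For example, `tokenize "Hello, world!"` yields `["Hello", "world"]`, and mapping `normalize` over that gives the normalized types.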
This takes one or more files and calculates the frequency of each type they contain.
bakers12 freq [OPTIONS] [FILES/DIRS]
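The core of a frequency report can be sketched with `Data.Map.Strict` from the containers package. `frequencies` is a hypothetical name, not the actual bakers12 API; it assumes tokens are normalized by case-folding.

```haskell
import Data.Char (toLower)
import qualified Data.Map.Strict as M

-- Count how often each normalized type occurs in a token stream.
frequencies :: [String] -> M.Map String Int
frequencies toks = M.fromListWith (+) [ (map toLower t, 1) | t <- toks ]
```

`M.fromListWith (+)` folds duplicate keys together, so "The" and "the" collapse into a single count.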
bakers12 serve [--port=INT]
This starts a Snap server for browsing information about the corpus.
At the moment, the only thing this server does is let you upload a file; the server tokenizes it and displays the tokens and a graph of the running type-to-token ratio.
bakers12 tokenize [FILES/DIRS]
This tokenizes the files listed on the command line, and it prints each token out. The output is formatted as CSV, and it includes these fields:
- the normalized token;
- the raw token;
- the name of the file the token was from;
- the character offset of the token in the file;
- the raw length of the token; and
- the running type-to-token ratio.
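The running type-to-token ratio in the last field can be computed in a single pass over the token stream, tracking the set of types seen so far. This is a sketch using `Data.Set`; `runningTTR` is a hypothetical name.

```haskell
import qualified Data.Set as S

-- After each token, emit the ratio of distinct types seen so far
-- to the number of tokens seen so far.
runningTTR :: [String] -> [Double]
runningTTR = go S.empty (0 :: Int)
  where
    go _ _ [] = []
    go seen n (t:ts) =
      let seen' = S.insert t seen
          n'    = n + 1
          ratio = fromIntegral (S.size seen') / fromIntegral n'
      in ratio : go seen' n' ts
```

For `["a", "b", "a"]` the ratios are `1.0`, `1.0`, and `2/3`: the third token repeats a type, so the ratio drops below one.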
Initialize a directory for analyzing documents.
bakers12 add [FILE-OR-DIRECTORY] ...
This adds a document or directory of documents to the corpus.
bakers12 info [DOCUMENT]
This prints information about a corpus or document.