Baker's Dozen

This is a boot-camp style series of thirteen one-day coding projects. The aim is to experiment with NLP in Haskell.

Notes

  1. Tokenizer and application framework (done);
  2. Type/token ratios, graph of changing ratio over the course of a text or corpus (done);
  3. Snap server (done);
  4. Frequency report (done);
  5. HTML 5, CoffeeScript, Compass/SASS (done);
  6. XML tokenizer (done);
  7. Re-work tokenizer to match ANC;
  8. Persistence in LevelDB or SQLite in RDF;
  9. Corpus management;
  10. Corpus processing;
  11. Search;
  12. Morphological tagger;
  13. POS tagger;
  14. Collocates, analysis and statistics;
  15. Clustering;
  16. Binary categorization (e.g., spam detection);
  17. Multi-label categorization;
  18. Hidden Markov model for corpus;
  19. Topic models;
  20. Clustering;
  21. NER, date extraction;
  22. MM text generation.

Other topics:

  • Parallel or distributed processing.

Commands

freq

This takes one or more files and calculates the frequencies for their types.

bakers12 freq [OPTIONS] [FILES/DIRS]
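
As a rough illustration of what counting type frequencies involves, here is a minimal sketch. It is not the project's implementation: `normalizeTokens` and `typeFrequencies` are hypothetical names, and the normalization rule (lowercase, split on non-letters) is an assumption, since this README does not describe the real tokenizer.

```haskell
import           Data.Char       (isAlpha, toLower)
import qualified Data.Map.Strict as M

-- Hypothetical normalizer (an assumption, not the project's tokenizer):
-- lowercase everything and split on any non-letter character.
normalizeTokens :: String -> [String]
normalizeTokens = words . map (\c -> if isAlpha c then toLower c else ' ')

-- Count how many tokens belong to each type (distinct normalized token).
typeFrequencies :: [String] -> M.Map String Int
typeFrequencies = M.fromListWith (+) . map (\t -> (t, 1))
```

For example, `M.toList (typeFrequencies (normalizeTokens "The cat saw the dog."))` yields `[("cat",1),("dog",1),("saw",1),("the",2)]`.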

serve

bakers12 serve [--port=INT]

This starts a Snap server for browsing information about the corpus.

At the moment, the only thing this server does is let you upload a file. It tokenizes the file, then displays the tokens and a graph of the running type-to-token ratio.
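
The running type-to-token ratio can be read as: after each token, the number of distinct types seen so far divided by the number of tokens seen so far. The sketch below computes it under that assumption; `runningTypeTokenRatio` is a hypothetical name and may not match how the project calculates the value.

```haskell
import           Data.List (mapAccumL)
import qualified Data.Set  as S

-- For each token, the ratio of distinct types seen so far
-- to the total number of tokens seen so far.
runningTypeTokenRatio :: [String] -> [Double]
runningTypeTokenRatio = snd . mapAccumL step (S.empty, 0 :: Int)
  where
    step (seen, n) tok =
      let seen' = S.insert tok seen   -- types seen, including this token
          n'    = n + 1               -- tokens seen, including this token
      in ((seen', n'), fromIntegral (S.size seen') / fromIntegral n')
```

For the tokens `["the","cat","the","dog"]` the ratio starts at 1.0, stays at 1.0, drops to 2/3 when "the" repeats, and rises to 0.75 with the new type "dog".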

tokenize

bakers12 tokenize [FILES/DIRS]

This tokenizes the files listed on the command line, and it prints each token out. The output is formatted as CSV, and it includes these fields:

  • the normalized token;
  • the raw token;
  • the name of the file the token was from;
  • the character offset of the token in the file;
  • the raw length of the token; and
  • the running type-to-token ratio.
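
The six fields above could be modeled as a record and rendered one row per token. This is only a sketch: the `Token` type, its field names, and the quoting rule (string fields always quoted, embedded quotes doubled) are assumptions, and the actual column formatting of `bakers12 tokenize` may differ.

```haskell
import Data.List (intercalate)

-- Hypothetical record mirroring the six fields listed above.
data Token = Token
  { tokenNorm   :: String    -- normalized token
  , tokenRaw    :: String    -- raw token
  , tokenSource :: FilePath  -- file the token came from
  , tokenOffset :: Int       -- character offset in the file
  , tokenLength :: Int       -- raw length of the token
  , tokenRatio  :: Double    -- running type-to-token ratio
  } deriving (Show, Eq)

-- Quote a string field CSV-style, doubling any embedded quotes.
csvField :: String -> String
csvField s = "\"" ++ concatMap escape s ++ "\""
  where
    escape '"' = "\"\""
    escape c   = [c]

-- Render one token as a single CSV row, fields in the order listed above.
toCsvRow :: Token -> String
toCsvRow t = intercalate ","
  [ csvField (tokenNorm t)
  , csvField (tokenRaw t)
  , csvField (tokenSource t)
  , show (tokenOffset t)
  , show (tokenLength t)
  , show (tokenRatio t)
  ]
```

With these definitions, `toCsvRow (Token "the" "The" "a.txt" 0 3 1.0)` produces `"the","The","a.txt",0,3,1.0`.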

Future Commands

bakers12 init

Initialize a directory for analyzing documents.

bakers12 add [FILE-OR-DIRECTORY] ...

This adds a document or directory of documents to the corpus.

bakers12 info [DOCUMENT]

This prints information about a corpus or document.
