Jörg Tiedemann  committed 3987124

Edited online

  • Parent commits 33b4802

Files changed (1)

-= Welcome =
+= A Blacklist Classifier for Language Discrimination =
-Welcome to your wiki! This is the default page we've installed for your convenience. Go ahead and edit it.
+The blacklist classifier is a simple tool for discriminating between closely related languages. It relies on lists of blacklisted words, which can be trained on comparable data sets. The package comes with blacklists for distinguishing Bosnian, Croatian and Serbian. The training data for these lists is included as well, together with a test set for the three languages.
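The idea of classifying with blacklisted words can be illustrated with a minimal Python sketch. This is not the Perl script from the package: the word lists, the language IDs `hr` and `sr`, and the helper names `tokenize` and `classify` are assumptions chosen for illustration, and the real tool may score words differently.

```python
import re

# Hand-picked toy blacklists (an assumption, not the package's trained lists):
# each set holds words that should NOT appear in text of that language.
BLACKLISTS = {
    "hr": {"hleb", "vazduh"},
    "sr": {"kruh", "zrak"},
}


def tokenize(text):
    """Lowercase the text and keep only purely alphabetic tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def classify(text, blacklists=BLACKLISTS):
    """Return the language whose blacklist is hit least often by the text."""
    tokens = tokenize(text)
    hits = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in blacklists.items()
    }
    # Ties are broken alphabetically by language ID.
    return min(sorted(hits), key=hits.get)


if __name__ == "__main__":
    print(classify("Kruh i zrak"))
    print(classify("Hleb i vazduh"))
```

In this sketch a text is assigned to the language whose blacklist it violates least; lowercasing and dropping non-alphabetic tokens loosely mirror the defaults that the `-U` and `-A` options switch off.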
-== Wiki features ==
+== Download ==
-This wiki uses the [[|Creole]] syntax, and is fully compatible with the 1.0 specification.
+$ git clone
-The wiki itself is actually a git repository, which means you can clone it, edit it locally/offline, add images or any other file type, and push it back to us. It will be live immediately.
-Go ahead and try:
+== Usage ==
+Simply download all files and run the Perl script {{{}}}
+The basic operations are training, classification and running experiments:
+* classification:
-$ git clone
+ [OPTIONS] lang1 lang2 ... < file
-Wiki pages are normal files, with the .wiki extension. You can edit them locally, as well as creating new ones.
+* training:
+ -n [OPTIONS] text1 text2 > blacklist.txt
+ [OPTIONS] -t "t1.txt t2.txt ..." lang1 lang2 ...
-== Syntax highlighting ==
+* run experiments:
+ -t "t1.txt t2.txt ..." \
+                           -e "e1.txt e2.txt ..." \
+                           lang1 lang2 ...
-You can also highlight snippets of text, we use the excellent [[|Pygments]] library.
-Here's an example of some Python code:
+* lang1 lang2 ... are language IDs
+* blacklists are expected in <BlackListDir>/<lang1>-<lang2>.txt
+* t1.txt t2.txt ... are training data files (in UTF-8)
+* e1.txt e2.txt ... are evaluation data files (in UTF-8)
+* the order of languages must be the same in the training data, the evaluation data,
+   and the command-line arguments (lang1 lang2 ...)
-def wiki_rocks(text):
-    formatter = lambda t: "funky"+t
-    return formatter(text)
+Other command-line options:
+ -a <freq> ...... min freq for common words
+ -b <freq> ...... max freq for uncommon words
+ -c <score> ..... min difference score to be relevant
+ -d <dir> ....... directory of black lists
+ -i ............. classify each line separately
+ -m <number> .... use approximately <number> tokens to train/classify
+ -n ............. train a new black list
+ -v ............. verbose mode
+ -U ............. don't lowercase
+ -S ............. don't tokenize (use the string as it is)
+ -A ............. don't discard tokens with non-alphabetic characters
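How the three thresholds might interact during training can be sketched as follows. This is only one possible reading of `-a`, `-b` and `-c`, not the script's actual algorithm: the function name `train_blacklist`, the parameter names, the default values and the difference score used here are all assumptions.

```python
from collections import Counter


def train_blacklist(tokens1, tokens2, min_common=2, max_uncommon=0, min_diff=0.5):
    """Build a toy blacklist for one language pair.

    Returns {word: score}. A positive score marks a word typical of the first
    text (blacklisted for the second language), a negative score the reverse.
    min_common, max_uncommon and min_diff loosely correspond to -a, -b and -c;
    the scoring formula itself is an assumption.
    """
    counts1, counts2 = Counter(tokens1), Counter(tokens2)
    total1, total2 = max(len(tokens1), 1), max(len(tokens2), 1)
    blacklist = {}
    for word in set(counts1) | set(counts2):
        f1, f2 = counts1[word], counts2[word]
        # Keep only words that are common on one side and rare on the other.
        typical_of_1 = f1 >= min_common and f2 <= max_uncommon
        typical_of_2 = f2 >= min_common and f1 <= max_uncommon
        if not (typical_of_1 or typical_of_2):
            continue
        # Normalized difference of relative frequencies, in [-1, 1].
        p1, p2 = f1 / total1, f2 / total2
        score = (p1 - p2) / (p1 + p2)
        if abs(score) >= min_diff:
            blacklist[word] = score
    return blacklist


if __name__ == "__main__":
    text1 = "kruh zrak kruh zrak i".split()
    text2 = "hleb vazduh hleb vazduh i".split()
    for word, score in sorted(train_blacklist(text1, text2).items()):
        print(word, score)
```

Words shared by both texts, such as "i" in the toy data, are filtered out because they are not frequent enough on either side to count as common while being rare on the other.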
-You can check out the source of this page to see how that's done, and make sure to bookmark [[|the vast library of Pygment lexers]], we accept the 'short name' or the 'mimetype' of anything in there.
-Have fun!
+You can test the tool using the provided GNU Makefile:
+{{{make test}}}
+You can train blacklists using the provided training data:
+{{{make train}}}
+Run training and testing on incremental training data:
+{{{make learning_curve}}}
+The results are stored in the subdirectory {{{experiment}}}.
+== License ==
+The code is published under the [[|GNU Lesser General Public License]].