-----------------------------------------------------------------------------

Blacklist Classifier v 0.1

Classifier for language discrimination based on blacklists
Copyright 2012 Joerg Tiedemann

-----------------------------------------------------------------------------

This script requires Perl and nothing else. No special installation is
required. Test the script using the provided GNU Makefile:

  make test

Test training blacklists using the provided training data:

  make train

Run training and testing on incremental training data:

  make learning_curve

-----------------------------------------------------------------------------
 USAGE:
-----------------------------------------------------------------------------

 classification:
   blacklist_classifier.pl [OPTIONS] lang1 lang2 ... < file

 training:
   blacklist_classifier.pl -n [OPTIONS] text1 text2 > blacklist.txt
   blacklist_classifier.pl [OPTIONS] -t "t1.txt t2.txt ..." lang1 lang2 ...

 run experiments:
   blacklist_classifier.pl -t "t1.txt t2.txt ..." \
                           -e "e1.txt e2.txt ..." \
                           lang1 lang2 ...

-----------------------------------------------------------------------------

 - lang1 lang2 ... are language IDs
 - blacklists are expected in <BlackListDir>/<lang1>-<lang2>.txt
 - t1.txt t2.txt ... are training data files (in UTF-8)
 - e1.txt e2.txt ... are evaluation data files (in UTF-8)
 - the training and evaluation files must be listed in the same order as
   the languages given on the command line (lang1 lang2 ...)
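
 The ordering rule above can be illustrated with a small Python sketch.
 It is not part of the Perl script. It assumes one blacklist file per
 language pair, named in command-line order; the language IDs, file
 names, and helper functions are hypothetical.

```python
from itertools import combinations
from pathlib import PurePosixPath


def pair_inputs(langs, files):
    """Pair each language ID with the file given at the same position."""
    if len(langs) != len(files):
        raise ValueError("need exactly one file per language")
    return dict(zip(langs, files))


def blacklist_paths(langs, blacklist_dir):
    """Assumed layout: one <a>-<b>.txt file per pair, in argument order."""
    return [
        str(PurePosixPath(blacklist_dir) / f"{a}-{b}.txt")
        for a, b in combinations(langs, 2)
    ]


langs = ["bs", "hr", "sr"]  # hypothetical language IDs
train = pair_inputs(langs, "t1.txt t2.txt t3.txt".split())
paths = blacklist_paths(langs, "blacklists")
```

 Here t2.txt is taken as the training file for the second language, and
 three pairwise blacklist files would be looked up in the blacklist
 directory.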

-----------------------------------------------------------------------------

 OPTIONS:

 -a <freq> ...... min freq for common words
 -b <freq> ...... max freq for uncommon words
 -c <score> ..... min difference score to be relevant
 -d <dir> ....... directory of blacklists
 -i ............. classify each line separately
 -m <number> .... use approximately <number> tokens to train/classify
 -n ............. train a new blacklist
 -v ............. verbose mode

 -U ............. don't lowercase
 -S ............. don't tokenize (use the string as it is)
 -A ............. don't discard tokens with non-alphabetic characters
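
 To show how the options above could interact, here is a minimal Python
 sketch of blacklist training and classification for one language pair.
 It is an illustration only, not the script's actual implementation.
 The README does not give the scoring formula, so the score used here
 (normalized count difference) is an assumption, as are the function
 names and the tie-breaking rule. The parameters min_common,
 max_uncommon and min_diff stand in for -a, -b and -c; the three
 keyword flags of tokens() stand in for -U, -S and -A.

```python
import re
from collections import Counter


def tokens(text, lowercase=True, tokenize=True, alpha_only=True):
    """Preprocess text; each flag set to False skips one step."""
    if lowercase:
        text = text.lower()
    # Assumption: "don't tokenize" means plain whitespace splitting.
    toks = re.findall(r"\w+", text) if tokenize else text.split()
    if alpha_only:
        toks = [t for t in toks if t.isalpha()]
    return toks


def train_blacklist(text1, text2, min_common=2, max_uncommon=0, min_diff=0.8):
    """Keep words that are common in one text and uncommon in the other."""
    c1, c2 = Counter(tokens(text1)), Counter(tokens(text2))
    blacklist = {}
    for word in c1.keys() | c2.keys():
        f1, f2 = c1[word], c2[word]
        # Assumed score in [-1, 1]: positive means typical of text1.
        score = (f1 - f2) / (f1 + f2)
        if (max(f1, f2) >= min_common
                and min(f1, f2) <= max_uncommon
                and abs(score) >= min_diff):
            blacklist[word] = score
    return blacklist


def classify(text, blacklist, lang1, lang2):
    """Sum the scores of known words; ties go to lang1 (arbitrary choice)."""
    total = sum(blacklist.get(t, 0.0) for t in tokens(text))
    return lang1 if total >= 0 else lang2


# Toy data, invented for this example.
bl = train_blacklist(
    "the cat and the dog and the bird",
    "der hund und der vogel und der hund",
)
guess = classify("the dog and the cat", bl, "en", "de")
```

 With this toy data only words seen at least twice in one text and never
 in the other end up on the blacklist, so rare words such as "cat" do
 not influence the decision.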