-----------------------------------------------------------------------------

Blacklist Classifier

Classifier for language discrimination based on blacklists
Copyright 2012 Joerg Tiedemann

-----------------------------------------------------------------------------

The classifier can be downloaded as a Perl module (see
Lingua-Identify-Blacklists) or as a standalone script (see
standalone). The standalone script will not be maintained and will
probably become outdated soon. Please use the Perl module instead.

The module includes a script that implements the same functionality as
the standalone script. The subdirectory 'test' contains some examples
of using the script. The calls are stored in the GNU Makefile, and you
can run some tests using the following targets:

  make test

Test training blacklists using the provided training data:

  make train

Run training and testing on incremental training data:

  make learning_curve

-----------------------------------------------------------------------------
Installation
-----------------------------------------------------------------------------

See the README in Lingua-Identify-Blacklists for the installation of
the Perl module (it's easy!). The standalone script can be used without
any installation (but it will not be maintained!).


-----------------------------------------------------------------------------
Usage
-----------------------------------------------------------------------------

The standalone script can be used as follows:
(The script 'blacklist_classifier' provided by the Perl module works the same)


 classification:
   blacklist_classifier.pl [OPTIONS] lang1 lang2 ... < file

 training:
   blacklist_classifier.pl -n [OPTIONS] text1 text2 > blacklist.txt
   blacklist_classifier.pl [OPTIONS] -t "t1.txt t2.txt ..." lang1 lang2 ...

 run experiments:
   blacklist_classifier.pl -t "t1.txt t2.txt ..." \
                           -e "e1.txt e2.txt ..." \
                           lang1 lang2 ...

-----------------------------------------------------------------------------

 - lang1 lang2 ... are language IDs
 - blacklists are expected in <BlackListDir>/<lang1-lang2>.txt
 - t1.txt t2.txt ... are training data files (in UTF-8)
 - e1.txt e2.txt ... are evaluation data files (in UTF-8)
 - the order of languages must be the same in the training data, the
   evaluation data, and the command line arguments (lang1 lang2 ...)
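
One way to keep the training files, evaluation files, and language IDs in
the same order is to derive both file lists from a single list of
languages. The following is only a sketch: the helper 'file_list', the
directory layout (train/<lang>.txt, eval/<lang>.txt) and the language IDs
are hypothetical and not part of the classifier. It prints the resulting
command instead of running it.

```shell
#!/bin/sh
# Sketch only: the directory layout (train/<lang>.txt, eval/<lang>.txt) and
# the language IDs below are assumptions made for illustration.

# Print "<dir>/<lang>.txt" for each language ID, in the order given,
# separated by single spaces.
file_list() {
    dir=$1
    shift
    list=""
    for lang in "$@"; do
        list="${list:+$list }$dir/$lang.txt"
    done
    printf '%s\n' "$list"
}

LANGS="bs hr sr"

# $LANGS is deliberately left unquoted so that it splits into separate IDs.
TRAIN=$(file_list train $LANGS)
EVAL=$(file_list eval $LANGS)

# Dry run: show the command line that would be used for an experiment.
echo "blacklist_classifier.pl -t \"$TRAIN\" -e \"$EVAL\" $LANGS"
```

Because both lists are generated from LANGS, reordering or extending the
languages in one place keeps all three arguments aligned.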

-----------------------------------------------------------------------------

 OPTIONS:

 -a <freq> ...... min freq for common words
 -b <freq> ...... max freq for uncommon words
 -c <score> ..... min difference score to be relevant
 -d <dir> ....... directory of black lists
 -i ............. classify each line separately
 -m <number> .... use approximately <number> tokens to train/classify
 -n ............. train a new black list
 -v ............. verbose mode

 -U ............. don't lowercase
 -S ............. don't tokenize (use the string as it is)
 -A ............. don't discard tokens with non-alphabetic characters