A Blacklist Classifier for Language Identification
The blacklist classifier is a simple tool for discriminating between closely related languages. It relies on blacklists of words, which can be trained on comparable data sets. The package ships with blacklists for distinguishing Bosnian, Croatian and Serbian, together with the data these lists were trained on and a test set for the three languages. It is now also available as a proper Perl module (see the download and installation instructions below).
Download and Installation
$ git clone https://bitbucket.org/tiedemann/blacklist-classifier.git
There is a standalone script blacklist_classifier.pl in the standalone sub-directory. However, this script will not be maintained, and you should use the Perl module Lingua-Identify-Blacklists instead. Run the following commands to install the Perl module and all its files:
cd Lingua-Identify-Blacklists
perl Makefile.PL
make
make test
make install
(Note that you may have to run the last command with sudo.)
Usage
The Perl module includes a script blacklist_classifier that can be used in the same way as the standalone script.
The basic operations are training, classification and running experiments:
- classification:
blacklist_classifier [OPTIONS] lang1 lang2 ... < file
- training:
blacklist_classifier -n [OPTIONS] text1 text2 > blacklist.txt
blacklist_classifier [OPTIONS] -t "t1.txt t2.txt ..." lang1 lang2 ...
- run experiments:
blacklist_classifier -t "t1.txt t2.txt ..." \
                     -e "e1.txt e2.txt ..." \
                     lang1 lang2 ...
- lang1 lang2 ... are language IDs
- blacklists are expected in <BLACKLISTDIR>/<lang1>-<lang2>.txt
- t1.txt t2.txt ... are training data files (in UTF-8)
- e1.txt e2.txt ... are evaluation data files (in UTF-8)
- the training and evaluation files must be listed in the same order as the languages on the command line (lang1 lang2 ...)
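To illustrate how one blacklist per language pair could be combined to classify more than two languages, here is a minimal Python sketch. It is not the module's implementation: the pairwise voting scheme, the sign convention (positive scores speak for the first language of a pair) and the names `tokenize` and `classify` are assumptions made for illustration, and the toy words are invented.

```python
import re
from itertools import combinations


def tokenize(text):
    """Lowercase, split into tokens, and keep purely alphabetic tokens."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [tok for tok in tokens if tok.isalpha()]


def classify(text, languages, blacklists):
    """Pick a language by pairwise voting over blacklists.

    blacklists maps (lang1, lang2) to {word: score}. By assumption,
    positive scores speak for lang1 and negative scores for lang2.
    """
    tokens = tokenize(text)
    wins = {lang: 0 for lang in languages}
    for lang1, lang2 in combinations(languages, 2):
        scores = blacklists.get((lang1, lang2), {})
        total = sum(scores.get(tok, 0.0) for tok in tokens)
        # Ties go to the first language of the pair (arbitrary choice).
        wins[lang1 if total >= 0 else lang2] += 1
    return max(languages, key=lambda lang: wins[lang])


# Invented toy blacklists; real lists would be trained from data.
languages = ["bs", "hr", "sr"]
blacklists = {
    ("bs", "hr"): {"foo": 1.0, "bar": -1.0},
    ("bs", "sr"): {"foo": 1.0, "baz": -1.0},
    ("hr", "sr"): {"bar": 1.0, "baz": -1.0},
}
print(classify("Foo foo qux", languages, blacklists))
print(classify("baz baz bar", languages, blacklists))
```

Each pair of languages casts one vote, so three languages need three blacklists. The language with the most pairwise wins is returned.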
Other command-line options:
-a <freq> ...... min freq for common words
-b <freq> ...... max freq for uncommon words
-c <score> ..... min difference score to be relevant
-d <dir> ....... directory of black lists
-i ............. classify each line separately
-m <number> .... use approximately <number> tokens to train/classify
-n ............. train a new black list
-v ............. verbose mode
-U ............. don't lowercase
-S ............. don't tokenize (use the string as it is)
-A ............. don't discard tokens with non-alphabetic characters
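The three thresholds -a, -b and -c suggest how a blacklist might be built: keep words that are frequent in one language, rare in the other, and different enough in relative frequency. The Python sketch below shows one way such a filter could work. The score formula, the function name `train_blacklist` and its parameter names are assumptions for illustration, not the scoring the module actually uses.

```python
from collections import Counter


def train_blacklist(tokens1, tokens2, min_common=10, max_uncommon=2, min_diff=0.5):
    """Build {word: score} from two token lists.

    min_common, max_uncommon and min_diff loosely mirror -a, -b and -c.
    Positive scores speak for language 1, negative for language 2.
    """
    counts1, counts2 = Counter(tokens1), Counter(tokens2)
    size1, size2 = max(len(tokens1), 1), max(len(tokens2), 1)
    blacklist = {}
    for word in counts1.keys() | counts2.keys():
        freq1, freq2 = counts1[word], counts2[word]
        # Keep only words that are common on one side and uncommon on the other.
        one_sided = (freq1 >= min_common and freq2 <= max_uncommon) or (
            freq2 >= min_common and freq1 <= max_uncommon
        )
        if not one_sided:
            continue
        rel1, rel2 = freq1 / size1, freq2 / size2
        # Assumed score: normalized relative-frequency difference in [-1, 1].
        score = (rel1 - rel2) / (rel1 + rel2)
        if abs(score) >= min_diff:
            blacklist[word] = score
    return blacklist


# Invented toy data: "shared" occurs equally often in both languages and is dropped.
lang1_tokens = ["alpha"] * 5 + ["shared"] * 5
lang2_tokens = ["beta"] * 5 + ["shared"] * 5
print(train_blacklist(lang1_tokens, lang2_tokens, min_common=3, max_uncommon=0))
```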
You can test the tool using the provided GNU Makefile in the test sub-directory:
cd test
make test
You can train blacklists using the provided training data:
make train
Run training and testing on incremental training data:
make learning_curve
The results are stored in the sub-directory experiment.
Performance
The confusion matrix of classifying Bosnian, Croatian and Serbian given the training and test data provided in the package is as follows (columns = classifier decisions):
|    | bs  | hr  | sr  | accuracy |
|----|-----|-----|-----|----------|
| bs | 188 | 11  | 1   | 0.935    |
| hr | 4   | 196 | 0   | 0.975    |
| sr | 0   | 0   | 200 | 1.000    |
The overall accuracy is 97.3%.
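The overall figure can be recomputed from the counts in the confusion matrix: the correct decisions lie on the diagonal, and their sum is divided by the total number of test instances. The short Python check below uses only the counts shown above; the variable names are illustrative.

```python
# Rows are the true languages, columns the classifier decisions.
confusion = {
    "bs": {"bs": 188, "hr": 11, "sr": 1},
    "hr": {"bs": 4, "hr": 196, "sr": 0},
    "sr": {"bs": 0, "hr": 0, "sr": 200},
}

# Correct decisions are the diagonal entries.
correct = sum(confusion[lang][lang] for lang in confusion)
total = sum(sum(row.values()) for row in confusion.values())
overall_accuracy = correct / total

print(f"{correct} / {total} = {overall_accuracy:.1%}")  # 584 / 600 = 97.3%
```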
License
The code is published under the GNU Lesser General Public License.