Jörg Tiedemann committed 6617129

Edited online

Files changed (1)

 = A Blacklist Classifier for Language Discrimination =
 
-The blacklist classifier is a simple tool for discriminating related languages. It uses blacklisted words that can be trained on comparable data sets. The package comes with blacklists for distinguishing Bosnian, Croatian and Serbian. The data these lists are trained on is also included together with a test set for the three languages.
+The blacklist classifier is a simple tool for discriminating related languages. It uses blacklisted words that can be trained on comparable data sets. The package comes with blacklists for distinguishing Bosnian, Croatian and Serbian. The data these lists are trained on is also included, together with a test set for the three languages. It is now also available as a proper Perl module (see the download and installation instructions below).
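The idea of a word blacklist can be illustrated with a minimal Python sketch. This is not the package's implementation: the function name `classify`, the toy word lists and the simple counting rule are all assumptions made for illustration. Each language gets a set of words that count as evidence against it, and a text is assigned to the language whose blacklist it hits least often.

```python
import re

# Toy blacklists, invented for illustration (not taken from the package).
# BLACKLISTS[lang] holds words that count as evidence AGAINST lang.
BLACKLISTS = {
    "hr": {"ko", "hleb", "nedelja"},
    "sr": {"tko", "kruh", "tjedan"},
}


def classify(text, blacklists=BLACKLISTS):
    """Return the language whose blacklist is hit least often by the text."""
    # Lowercase the text and split it into word tokens.
    tokens = re.findall(r"\w+", text.lower())
    # Count, per language, how many tokens appear on that language's blacklist.
    hits = {
        lang: sum(token in words for token in tokens)
        for lang, words in blacklists.items()
    }
    # Fewest blacklist hits wins; ties go to the first language in the dict.
    return min(hits, key=hits.get)


print(classify("Tko je kupio kruh?"))  # hr
```

In this sketch a tie is resolved in favour of the first language listed, which is an arbitrary choice; a real classifier would need a more careful decision rule.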
 
-== Download ==
+== Download and Installation ==
 
 {{{
 $ git clone https://bitbucket.org/tiedemann/blacklist-classifier.git
 }}}
 
+There is a standalone script {{{blacklist_classifier.pl}}} in the {{{standalone}}} sub-directory. However, this script will not be maintained, so you should use the Perl module {{{Lingua-Identify-Blacklists}}} instead. Run the following commands to install the Perl module and all its files:
+
+{{{
+cd Lingua-Identify-Blacklists
+perl Makefile.PL
+make
+make test
+make install
+}}}
+
+(Note that you may have to run the last command with {{{sudo}}}.)
 
 == Usage ==
 
-Simply download all files and run the Perl script {{{blacklist_classifier.pl}}}
+The Perl module includes a script {{{blacklist_classifier}}} that can be used in the same way as the standalone script.
 The basic operations are training, classification and running experiments:
 
 * classification:
 {{{
-blacklist_classifier.pl [OPTIONS] lang1 lang2 ... < file
+blacklist_classifier [OPTIONS] lang1 lang2 ... < file
 }}}
 
 * training:
 {{{
-   blacklist_classifier.pl -n [OPTIONS] text1 text2 > blacklist.txt
-   blacklist_classifier.pl [OPTIONS] -t "t1.txt t2.txt ..." lang1 lang2 ...
+   blacklist_classifier -n [OPTIONS] text1 text2 > blacklist.txt
+   blacklist_classifier [OPTIONS] -t "t1.txt t2.txt ..." lang1 lang2 ...
 }}}
 
 * run experiments:
 {{{
-   blacklist_classifier.pl -t "t1.txt t2.txt ..." \
+   blacklist_classifier -t "t1.txt t2.txt ..." \
                            -e "e1.txt e2.txt ..." \
                            lang1 lang2 ...
 }}}
 
 
 * lang1 lang2 ... are language IDs
-* blacklists are expected in <BlackListDir>/<lang1-lang2.txt
+* blacklists are expected in <BLACKLISTDIR>/<lang1>-<lang2>.txt
 * t1.txt t2.txt ... are training data files (in UTF-8)
 * e1.txt e2.txt ... are evaluation data files (in UTF-8)
 * the order of languages needs to be the same for the training data, the evaluation data and the language IDs on the command line
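The file naming convention in the list above can be sketched in Python. This is only an illustration under stated assumptions: the helper `blacklist_paths` is hypothetical, and it assumes one blacklist file per unordered language pair, with the two IDs joined in the order in which the languages are given.

```python
from itertools import combinations
from pathlib import PurePosixPath


def blacklist_paths(blacklist_dir, langs):
    """List the blacklist files expected for every pair of language IDs.

    Assumption: one file per unordered pair, named lang1-lang2.txt,
    with the IDs in the same order as in `langs`.
    """
    return [
        str(PurePosixPath(blacklist_dir) / f"{first}-{second}.txt")
        for first, second in combinations(langs, 2)
    ]


print(blacklist_paths("blacklists", ["bs", "hr", "sr"]))
# ['blacklists/bs-hr.txt', 'blacklists/bs-sr.txt', 'blacklists/hr-sr.txt']
```

The directory name {{{blacklists}}} is a placeholder for whatever <BLACKLISTDIR> is on your system.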
 
 
 
-You can test the tool using the provided GNU Makefile:
+You can test the tool using the provided GNU Makefile in the {{{test}}} sub-directory:
 
 {{{make test}}}
 
 The confusion matrix of classifying Bosnian, Croatian and Serbian given the training and test data provided in the package is as follows (columns = classifier decisions):
 
 |     | bs  | hr  | sr  | accuracy |
-| bs  | 187 | 12  | 1   | 0.935 |
-| hr  | 5   | 195 | 0   | 0.975 |
+| bs  | 188 | 11  | 1   | 0.940 |
+| hr  | 4   | 196 | 0   | 0.980 |
 | sr  | 0   | 0   | 200 | 1.000 |
 
-The overall accuracy is 97%.
+The overall accuracy is 97.3%.
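The accuracy figures follow directly from the counts in the confusion matrix. The short Python sketch below recomputes them; the variable and function names are chosen for illustration and are not part of the package. Per-language accuracy is the diagonal cell divided by the row total, and overall accuracy is the sum of the diagonal divided by the number of test documents.

```python
# Rows are the true languages, columns are the classifier decisions,
# both in the order bs, hr, sr (counts copied from the table above).
LANGS = ["bs", "hr", "sr"]
CONFUSION = {
    "bs": [188, 11, 1],
    "hr": [4, 196, 0],
    "sr": [0, 0, 200],
}


def row_accuracy(matrix, langs):
    """Fraction of each language's test documents that were classified correctly."""
    return {
        lang: matrix[lang][langs.index(lang)] / sum(matrix[lang])
        for lang in langs
    }


def overall_accuracy(matrix, langs):
    """Correct decisions (the diagonal) divided by all test documents."""
    correct = sum(matrix[lang][i] for i, lang in enumerate(langs))
    total = sum(sum(row) for row in matrix.values())
    return correct / total


for lang, acc in row_accuracy(CONFUSION, LANGS).items():
    print(f"{lang}: {acc:.3f}")
print(f"overall: {overall_accuracy(CONFUSION, LANGS):.3f}")  # overall: 0.973
```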
 
 
 == License ==