Commits

tiedeman  committed ad3731a

renamed README again

  • Parent commits d64711d

Files changed (2)

+-----------------------------------------------------------------------------
+
+Blacklist Classifier v 0.1
+
+Classifier for language discrimination based on blacklists
+Copyright 2012 Joerg Tiedemann
+
+-----------------------------------------------------------------------------
+
+This script requires Perl and nothing else. No special installation is
+required. Test the script using the provided GNU Makefile:
+
+  make test
+
+Test training blacklists using the provided training data:
+
+  make train
+
+Run training and testing on incremental training data:
+
+  make learning_curve
+
+-----------------------------------------------------------------------------
+ USAGE:
+-----------------------------------------------------------------------------
+
+ classification:
+   blacklist_classifier.pl [OPTIONS] lang1 lang2 ... < file
+
+ training:
+   blacklist_classifier.pl -n [OPTIONS] text1 text2 > blacklist.txt
+   blacklist_classifier.pl [OPTIONS] -t "t1.txt t2.txt ..." lang1 lang2 ...
+
+ run experiments:
+   blacklist_classifier.pl -t "t1.txt t2.txt ..." \
+                           -e "e1.txt e2.txt ..." \
+                           lang1 lang2 ...
+
+-----------------------------------------------------------------------------
+
+ - lang1 lang2 ... are language IDs
+ - blacklists are expected in <BlackListDir>/<lang1-lang2>.txt
+ - t1.txt t2.txt ... are training data files (in UTF-8)
+ - e1.txt e2.txt ... are evaluation data files (in UTF-8)
+ - the languages must appear in the same order in the training data, the
+   eval data, and the command line arguments (lang1 lang2 ...)
+
+-----------------------------------------------------------------------------
+
+ OPTIONS:
+
+ -a <freq> ...... min freq for common words
+ -b <freq> ...... max freq for uncommon words
+ -c <score> ..... min difference score to be relevant
+ -d <dir> ....... directory of black lists
+ -i ............. classify each line separately
+ -m <number> .... use approximately <number> tokens to train/classify
+ -n ............. train a new black list
+ -v ............. verbose mode
+
+ -U ............. don't lowercase
+ -S ............. don't tokenize (use the string as it is)
+ -A ............. don't discard tokens with non-alphabetic characters
+
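
The README lists the threshold options (-a, -b) but does not say how a blacklist is built or applied. The sketch below is one plausible reading, not the script's actual implementation: a word counts against a language when it is common in the other language and uncommon in this one, and a text is assigned to the language whose blacklist it hits least. The function names, the parameters `min_common` and `max_uncommon` (loosely modelled on -a and -b), the tie-breaking rule, and the toy tokens are all assumptions made for illustration; the difference score of -c is not modelled.

```python
from collections import Counter


def train_blacklist(tokens1, tokens2, min_common=2, max_uncommon=0):
    """Return (bl1, bl2): words that count against language 1 and language 2.

    Hypothetical helper: a word is blacklisted for one language when it occurs
    at least `min_common` times in the other language's training tokens and at
    most `max_uncommon` times in its own.
    """
    c1, c2 = Counter(tokens1), Counter(tokens2)
    bl1 = {w for w, n in c2.items() if n >= min_common and c1[w] <= max_uncommon}
    bl2 = {w for w, n in c1.items() if n >= min_common and c2[w] <= max_uncommon}
    return bl1, bl2


def classify(tokens, bl1, bl2, lang1, lang2):
    """Pick the language whose blacklist is hit least often (ties go to lang1)."""
    hits1 = sum(t in bl1 for t in tokens)
    hits2 = sum(t in bl2 for t in tokens)
    return lang1 if hits1 <= hits2 else lang2


# Toy training tokens, invented for this example only.
bl1, bl2 = train_blacklist("tko tko je je to".split(), "ko ko je je to".split())
print(classify("ko je to".split(), bl1, bl2, "lang1", "lang2"))
```

Words shared by both languages ("je", "to" in the toy data) end up in neither blacklist, so only the discriminating words influence the decision.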

File README.md

------------------------------------------------------------------------------
-
-Blacklist Classifier v 0.1
-
-Classifier for language discrimination based on blacklists
-Copyright 2012 Joerg Tiedemann
-
------------------------------------------------------------------------------
-
-This script requires Perl and nothing else. No special installation is
-required. Test the script using the provided GNU Makefile:
-
-  make test
-
-Test training blacklists using the provided training data:
-
-  make train
-
-Run training and testing on incremental training data:
-
-  make learning_curve
-
------------------------------------------------------------------------------
- USAGE:
------------------------------------------------------------------------------
-
- classification:
-   blacklist_classifier.pl [OPTIONS] lang1 lang2 ... < file
-
- training:
-   blacklist_classifier.pl -n [OPTIONS] text1 text2 > blacklist.txt
-   blacklist_classifier.pl [OPTIONS] -t "t1.txt t2.txt ..." lang1 lang2 ...
-
- run experiments:
-   blacklist_classifier.pl -t "t1.txt t2.txt ..." \
-                           -e "e1.txt e2.txt ..." \
-                           lang1 lang2 ...
-
------------------------------------------------------------------------------
-
- - lang1 lang2 ... are language IDs
- - blacklists are expected in <BlackListDir>/<lang1-lang2>.txt
- - t1.txt t2.txt ... are training data files (in UTF-8)
- - e1.txt e2.txt ... are evaluation data files (in UTF-8)
- - the languages must appear in the same order in the training data, the
-   eval data, and the command line arguments (lang1 lang2 ...)
-
------------------------------------------------------------------------------
-
- OPTIONS:
-
- -a <freq> ...... min freq for common words
- -b <freq> ...... max freq for uncommon words
- -c <score> ..... min difference score to be relevant
- -d <dir> ....... directory of black lists
- -i ............. classify each line separately
- -m <number> .... use approximately <number> tokens to train/classify
- -n ............. train a new black list
- -v ............. verbose mode
-
- -U ............. don't lowercase
- -S ............. don't tokenize (use the string as it is)
- -A ............. don't discard tokens with non-alphabetic characters
-