Commits

tiedeman committed 45f465e

version 0.02 for cpan

Files changed (17)

Lingua-Identify-Blacklists/Changes

 
-0.01	Tue Oct  9 14:48:09 CEST 2012
+0.01	Fri Oct 12 15:42:53 CEST 2012
 	- first version of the blacklist classifier as a perl module
 
+0.02	Sat Oct 13 09:57:20 CEST 2012
+	- now with documentation

Lingua-Identify-Blacklists/Makefile.PL

 use inc::Module::Install;
 
-name          'Lingua::Identify::Blacklists';
+name          'Lingua-Identify-Blacklists';
 all_from      'lib/Lingua/Identify/Blacklists.pm';
 
 install_script 'bin/blacklist_classifier';
+
 install_share;
 
-# requires 'Getopt::Std'     => 0;
-# requires 'FindBin'     => 0;
+requires 'Getopt::Std'     => 0;
+requires 'FindBin'     => 0;
 
 WriteAll;

Lingua-Identify-Blacklists/README

+Lingua-Identify-Blacklists
+
+This package implements a simple tool for language discrimination based on 
+language-pair-specific blacklists. It can be used to learn blacklists from given
+texts and it can be used to classify texts according to the blacklists available.
+
+
+INSTALLATION
+
+To install this module, run the following commands:
+
+	perl Makefile.PL
+	make
+	make test
+	make install
+
+
+SUPPORT AND DOCUMENTATION
+
+  BitBucket:
+    http://bitbucket.org/tiedemann/blacklist-classifier
+
+
+LICENSE AND COPYRIGHT
+
+Copyright (c) 2012 Jörg Tiedemann.
+
+This program is free software: you can redistribute it and/or modify
+it under the terms of the GNU Lesser General Public License as published
+by the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU Lesser General Public License for more details.
+
+You should have received a copy of the GNU Lesser General Public License
+along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Lingua-Identify-Blacklists/bin/blacklist_classifier

 #!/usr/bin/env perl
 #-*-perl-*-
-#
-#    Classifier for language discrimination based on blacklists v0.1
-#    Copyright 2012 Joerg Tiedemann
-#
-#    This program is free software: you can redistribute it and/or modify
-#    it under the terms of the GNU Lesser General Public License as published
-#    by the Free Software Foundation, either version 3 of the License, or
-#    (at your option) any later version.
-#
-#    This program is distributed in the hope that it will be useful,
-#    but WITHOUT ANY WARRANTY; without even the implied warranty of
-#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
-#    GNU Lesser General Public License for more details.
-#
-#    You should have received a copy of the GNU Lesser General Public License
-#    along with this program.  If not, see <http://www.gnu.org/licenses/>.
-#
-#-----------------------------------------------------------------------------
-# USAGE:
-#-----------------------------------------------------------------------------
-# classification:
-#   blacklist_classifier.pl [OPTIONS] lang1 lang2 ... < file
-#
-# training:
-#   blacklist_classifier.pl -n [OPTIONS] text1 text2 > blacklist.txt
-#   blacklist_classifier.pl [OPTIONS] -t "t1.txt t2.txt ..." lang1 lang2 ...
-#
-# run experiments:
-#   blacklist_classifier.pl -t "t1.txt t2.txt ..." \
-#                           -e "e1.txt e2.txt ..." \
-#                           lang1 lang2 ...
-#
-#-----------------------------------------------------------------------------
-#
-# - lang1 lang2 ... are language ID's
-# - blacklists are expected in <BlackListDir>/<lang1-lang2.txt
-# - t1.txt t2.txt ... are training data files (in UTF-8)
-# - e1.txt e2.txt ... are training data files (in UTF-8)
-# - the order of languages needs to be the same for training data, eval data
-#   as given by the command line arguments (lang1 lang2 ..)
-#
-#-----------------------------------------------------------------------------
-#
-# OPTIONS:
-#
-# -a <freq> ...... min freq for common words
-# -b <freq> ...... max freq for uncommon words
-# -c <score> ..... min difference score to be relevant
-# -d <dir> ....... directory of black lists
-# -i ............. classify each line separately
-# -m <number> .... use approximately <number> tokens to traing/classify
-# -n ............. train a new black list
-# -v ............. verbose mode
-#
-# -U ............. don't lowercase
-# -S ............. don't tokenize (use the string as it is)
-# -A ............. don't discard tokens with non-alphabetic characters
 
+=head1 USAGE
+
+=head2 Classification:
+
+   blacklist_classifier [OPTIONS] lang1 lang2 ... < file
+
+=head2 Training:
+
+   blacklist_classifier -n [OPTIONS] text1 text2 > blacklist.txt
+   blacklist_classifier [OPTIONS] -t "t1.txt t2.txt ..." lang1 lang2 ...
+
+=head2 Run experiments:
+
+   blacklist_classifier -t "t1.txt t2.txt ..." \
+                           -e "e1.txt e2.txt ..." \
+                           lang1 lang2 ...
+
+=head2 Command line arguments:
+
+ lang1 lang2 ... are language IDs
+ blacklists are expected in <BlackListDir>/<lang1>-<lang2>.txt
+ t1.txt t2.txt ... are training data files (in UTF-8)
+ e1.txt e2.txt ... are evaluation data files (in UTF-8)
+ the order of languages needs to be the same for training data and eval data
+   as given by the command line arguments (lang1 lang2 ..)
+
+
+ -a <freq> ...... min freq for common words
+ -b <freq> ...... max freq for uncommon words
+ -c <score> ..... min difference score to be relevant
+ -d <dir> ....... directory of black lists
+ -i ............. classify each line separately
+ -m <number> .... use approximately <number> tokens to train/classify
+ -n ............. train a new black list
+ -v ............. verbose mode
+
+ -U ............. don't lowercase
+ -S ............. don't tokenize (use the string as it is)
+ -A ............. don't discard tokens with non-alphabetic characters
+
+=cut
 
 use strict;
 use vars qw($opt_a $opt_b $opt_c $opt_m $opt_n $opt_d $opt_v $opt_i
 binmode(STDOUT,":encoding(UTF-8)");
 binmode(STDERR,":encoding(UTF-8)");
 
-my %blacklists=();
-
 my $min_high = defined $opt_a ? $opt_a : 10;
 my $max_low  = defined $opt_b ? $opt_b : 3;
 my $min_diff = defined $opt_c ? $opt_c : 0.8;
     $Lingua::Identify::Blacklists::BLACKLISTDIR = "$Bin/../share/blacklists";
 }
 
-my $BlackListDir = $Lingua::Identify::Blacklists::BLACKLISTDIR;
-
-
-
+my %options = ( text_size => $opt_m,
+		min_high => $min_high, 
+		max_low  => $max_low, 
+		min_diff => $min_diff );
 
 
 # run experiments with a given set of training corpora and 
     # run an experiment with exponentially increasing training sizes
     # from $opt_F to $opt_T
     if ($opt_F && $opt_T && $opt_L){
-        $opt_m = $opt_F;
-        while ($opt_m < $opt_T){
-            %blacklists=();
+        $options{text_size} = $opt_F;
+        while ($options{text_size} < $opt_T){
             print "train with ca $opt_m tokens\n";
-            &run_experiment($opt_t,$opt_e,@langs);
-            $opt_m *= $opt_L;
+            &run_experiment($opt_t,$opt_e,\%options,@langs);
+            $options{text_size} *= $opt_L;
         }
     }
-    &run_experiment($opt_t,$opt_e,@langs);
+    &run_experiment($opt_t,$opt_e,\%options,@langs);
     exit;
 }
 
 if ($opt_t){
     my @traindata = split(/\s+/,$opt_t);
     my @langs = @ARGV;
-    &batch_train(\@langs,\@traindata);
+    my %trainset = ();
+    for (0..$#langs){ $trainset{$langs[$_]} = $traindata[$_]; }
+    &train( \%trainset, %options );
 }
 elsif ($opt_n){
     my $file1=shift(@ARGV);
     my $file2=shift(@ARGV);
-    &train($file1,$file2);
+    &train_blacklist( $file1,$file2, %options );
 }
 
 
 
 
 
+=head1 AUTHOR
 
-sub batch_train{
-    my $langs = shift;
-    my $traindata = shift;
+Jörg Tiedemann, L<https://bitbucket.org/tiedemann>
 
-    die "no languages given\n" unless (ref($langs) eq 'ARRAY');
-    die "no training data given\n" unless (ref($traindata) eq 'ARRAY');
-    die "number of languages and training data does not match\n" 
-        unless ($#{$traindata} == $#{$langs});
+=head1 BUGS
 
-    for my $s (0..$#{$langs}){
-        for my $t ($s+1..$#{$langs}){
-            print STDERR "traing blacklist for $$langs[$s]-$$langs[$t]\n";
-            &train($$traindata[$s],$$traindata[$t],
-                   "$Lingua::Identify::Blacklists::BLACKLISTDIR/$$langs[$s]-$$langs[$t].txt");
-        }
-    }
-}
+Please report any bugs or feature requests to
+L<https://bitbucket.org/tiedemann/blacklist-classifier>.  I will be notified,
+and then you'll automatically be notified of progress on your bug as I
+make changes.
 
+=head1 SUPPORT
 
-sub train{
-    my ($file1,$file2,$outfile) = @_;
+You can find documentation for this module with the perldoc command.
 
-    my %dic1=();
-    my %dic2=();
+    perldoc Lingua::Identify::Blacklists
 
-    my ($total1,$total2);
+=head1 LICENSE AND COPYRIGHT
 
-    my $total1 = &Lingua::Identify::Blacklists::read_file($file1,\%dic1,$opt_m);
-    my $total2 = &Lingua::Identify::Blacklists::read_file($file2,\%dic2,$opt_m);
+Copyright 2012 Jörg Tiedemann.
 
-    if ($outfile){
-        system("mkdir -p $Lingua::Identify::Blacklists::BLACKLISTDIR") unless (-d $Lingua::Identify::Blacklists::BLACKLISTDIR);
-        open O,">$outfile" || die "cannot write to $outfile\n";
-        binmode(O,":encoding(UTF-8)");
-    }
+This program is free software: you can redistribute it and/or modify
+it under the terms of the GNU Lesser General Public License as published
+by the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
 
-    foreach my $w (keys %dic1){
-	next if ($dic1{$w}<$min_high && $dic2{$w}<$min_high);
-	next if ($dic1{$w}>$max_low && $dic2{$w}>$max_low);
-
-        my $s1 = $dic1{$w} * $total2;
-        my $s2 = $dic2{$w} * $total1;
-        my $diff = ($s1 - $s2) / ($s1 + $s2);
-
-	if (abs($diff) > $min_diff){
-            if ($outfile){
-                print O "$diff\t$w\t$dic1{$w}\t$dic2{$w}\n";
-            }
-            else{
-                print "$diff\t$w\t$dic1{$w}\t$dic2{$w}\n";
-            }
-	}
-    }
-    # don't forget words that do NOT appear in dic1!!!
-    foreach my $w (keys %dic2){
-	next if (exists $dic1{$w});
-	next if ($dic2{$w}<10);
-        if ($outfile){
-            print O "-1\t$w\t$dic1{$w}\t$dic2{$w}\n";
-        }
-        else{
-            print "-1\t$w\t$dic1{$w}\t$dic2{$w}\n";
-        }
-    }
-    close O if ($outfile);
-}
-
-
-
-
-
-
-sub run_experiment{
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU Lesser General Public License for more details.
 
-    use Benchmark;
+You should have received a copy of the GNU Lesser General Public License
+along with this program.  If not, see L<http://www.gnu.org/licenses/>.
 
-    my $trainfiles = shift;
-    my $evalfiles = shift;
 
-    my @traindata = split(/\s+/,$trainfiles);
-    my @evaldata = split(/\s+/,$evalfiles);
-    my @langs = @_;
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
-    die "no languages given!\n" unless (@langs);
-    die "no training nor evaluation data given!\n" 
-        unless ($#traindata == $#evaldata || $#traindata == $#langs);
-
-    # train blacklists
-
-    if ($#traindata == $#langs){
-        $Lingua::Identify::Blacklists::BLACKLISTDIR = $opt_d || "blacklist-experiment";
-        my $t1 = new Benchmark;
-        &batch_train(\@langs,\@traindata);
-        print STDERR "training took: ".
-            timestr(timediff(new Benchmark, $t1)).".\n";
-    }
-
-    # classify test data
-
-    if ($#evaldata == $#langs){
-        print STDERR "classify ....\n";
-
-        my $correct=0;
-        my $count=0;
-        my %guesses=();
-
-        my %correct_lang=();
-        my %count_lang=();
-
-        my $t1 = new Benchmark;
-        foreach my $i (0..$#langs){
-            open IN,"<:encoding(UTF-8)",$evaldata[$i] || die "...";
-            while (<IN>){
-                chomp;
-                my %dic = ();
-                &Lingua::Identify::Blacklists::process_string($_,\%dic);
-                my $guess = &Lingua::Identify::Blacklists::classify(\%dic,@langs);
-                $count++;
-                $count_lang{$langs[$i]}++;
-                if ($guess eq $langs[$i]){
-                    $correct++;
-                    $correct_lang{$langs[$i]}++;
-                }
-                $guesses{$langs[$i]}{$guess}++;
-            }
-            close IN;
-        }
-        print STDERR "classification took: ".
-            timestr(timediff(new Benchmark, $t1)).".\n";
-
-        printf "accuracy: %6.4f\n   ",$correct/$count;
-        foreach my $c (@langs){
-            print "  $c";
-        }
-        print "\n";
-        foreach my $c (@langs){
-            print "$c ";
-            foreach my $g (@langs){
-                printf "%4d",$guesses{$c}{$g};
-            }
-            printf "  %6.4f",$correct_lang{$c}/$count_lang{$c};
-            print "\n";
-        }
-    }
-    system("wc -l $Lingua::Identify::Blacklists::BLACKLISTDIR/*.txt");
-}
+=cut
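
Note on the `-t` branch of the script: it pairs each language ID with the training file at the same position, and the removed `batch_train` checked that both lists have the same length. A minimal sketch of that pairing with the length check kept; `build_trainset` is a hypothetical helper name used only for illustration and is not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Identify::Blacklists):
# pair language IDs with training files by position.
sub build_trainset {
    my ($langs, $files) = @_;
    die "no languages given\n"
        unless ref($langs) eq 'ARRAY' && @{$langs};
    die "no training data given\n"
        unless ref($files) eq 'ARRAY';
    die "number of languages and training files does not match\n"
        unless @{$langs} == @{$files};

    my %trainset;
    @trainset{ @{$langs} } = @{$files};    # hash slice: lang ID => file name
    return %trainset;
}

# Illustrative file names only.
my %trainset = build_trainset( [qw(bs hr sr)], [qw(bs.txt hr.txt sr.txt)] );
print "$_ => $trainset{$_}\n" for sort keys %trainset;
```

The resulting hash has the shape that `train( \%trainset, %options )` expects, so a mismatch between languages and files fails early instead of producing undefined file names.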
 

Lingua-Identify-Blacklists/lib/Lingua/Identify/Blacklists.pm

 
 use 5.008;
 use strict;
-use warnings;
 
-use File::ShareDir 'dist_dir';
+use File::ShareDir qw/dist_dir/;
+use File::Basename qw/dirname/;
+
 
 use Exporter 'import';
-our @EXPORT = qw( identify identify_file identify_stdin );
+our @EXPORT = qw( identify identify_file identify_stdin 
+                  train train_blacklist run_experiment 
+                   available_languages available_blacklists );
 our %EXPORT_TAGS = ( all => \@EXPORT );
 
+our $VERSION = '0.02';
+
 =encoding UTF-8
 
 =head1 NAME
 
 =head1 VERSION
 
-Version 0.01
+Version 0.02
 
-=cut
+=head1 SYNOPSIS
 
-our $VERSION = '0.01';
-our $VERBOSE = 0;
+  use Lingua::Identify::Blacklists qw/:all/;
+
+  # detect language for a given text
+  # (discriminate between Bosnian, Croatian and Serbian)
+  my $lang = identify( ".... text to be classified ...",
+                       langs => ['bs','hr','sr']);
+
+  # use all languages available
+  $lang = identify( ".... text to be classified ...");
+
+  # detect language in the given file (Unicode UTF-8 is assumed)
+  $lang = identify_file( $filename );
+
+  # detect language for every line separately from the given file
+  # (return a list of lang-IDs)
+  my @langs = identify_file( $filename, every_line => 1 );
+
+
+  # learn classifiers (blacklists) for all pairs of languages 
+  # given some training data
+  train( { cs => $file_with_cs_text,
+           sk => $file_with_sk_text,
+           pl => $file_with_pl_text } );
+
+  # learn a blacklist from a given pair of texts (prints to STDOUT)
+  train_blacklist( $filename1, $filename2 );
+
+  # ... the same but write to outfile
+  train_blacklist( $filename1, $filename2, outfile => $outfilename );
+
+  # train and evaluate the classification using given training/test data
+  my @traindata = ($trainfile1, $trainfile2, $trainfile3);
+  my @evaldata  = ($testfile1, $testfile2, $testfile3);
+  run_experiment(\@traindata, \@evaldata, $lang1, $lang2, $lang3);
+
+  # train with different parameters (optional)
+  my %para = (
+      min_high => 5,      # minimal token frequency in one language
+      max_low  => 2,      # maximal token frequency in the other language
+      min_diff => 0.7 );  # score difference threshold
+
+  train( { cs => $file_with_cs_text, sk => $file_with_sk_text }, %para );
+
+
+
+=head1 Settings
+
+Module-internal variables that can be modified:
+
+ $BLACKLISTDIR   directory with all blacklists (default: module-share-dir)
+ $LOWERCASE      lowercase all data, yes/no (1/0), default: 1
+ $TOKENIZE       tokenize all data, yes/no (1/0), default: 1
+ $ALPHA_ONLY     don't use tokens with non-alphabetic characters, default: 1
+ $VERBOSE        verbose output, yes/no (1/0), default: 0
+
+Tokenization is very simple and replaces all non-alphabetic characters with a white-space character.
+
+=cut
 
 
 our $BLACKLISTDIR;
 
 my %blacklists = ();
 
+our $VERBOSE = 0;
+
+
+
 
+sub initialize{ %blacklists = (); }
+
+=head1 Exported Functions
+
+=head2 C<$langID = identify( $text[,%options] )>
+
+Analyses a given text and returns a language ID as the result of the classification. C<%options> can be used to change the behaviour of the classifier. Possible options are
+
+  langs => \@list_of_possible_langs
+  use_margin => $score
+
+=cut
 
 sub identify{
   my $text = shift;
 }
 
 
+=head2 C<$langID = identify_file( $filename[,%options] )>
+
+Does the same as C<identify> but reads text from a file. It also takes the same options as the 'identify' function but allows two extra options:
+
+  text_size  => $size,  # number of tokens to be used for classification
+  every_line => 1
+
+Using the C<every_line> option, the classifier checks every input line separately and returns a list of language IDs.
+
+ @langIDs = identify_file( $filename, every_line => 1, %options )
+
+=cut
+
+
+
 sub identify_file{
     my $file = shift;
     my %options = @_;
     
     my $fh = *STDIN;
     if (defined $file){
-	open $fh,"<$file" || die "cannot read from '$file'\n";
-	binmode($fh,":encoding(UTF-8)");
+	if (-e $file){
+	    open $fh,"<$file" or die "cannot read from '$file'\n";
+	    binmode($fh,":encoding(UTF-8)");
+	}
+	else{ print STDERR "Cannot find file '$file'! Read from STDIN\n"; }
     }
     
     while (<$fh>){
     unless ($options{every_line}){
 	push( @predictions, &classify( \%dic, %options ) );
     }
-    return @predictions;
+    return wantarray ? @predictions : $predictions[0];
 }
 
 
 
 
 
+=head2 C<train(\%traindata[,%options])>
 
+Trains classifiers by learning blacklisted words for pairwise language discrimination. Returns nothing. Blacklists are stored in the directory C<$Lingua::Identify::Blacklists::BLACKLISTDIR>. You may need to run the process as an administrator if you lack write permission for that directory.
 
+C<%traindata> is a hash of training data files associated with their corresponding language IDs:
 
+  'hr' => $croatian_text_file,
+  'sr' => $serbian_text_file,
+  ...
 
+C<%options> is a hash of optional parameters that change the behaviour of the learning algorithm. Possible parameters are:
 
+  min_high => $freq1,      # minimal token frequency in one language
+  max_low  => $freq2,      # maximal token frequency in the other language
+  min_diff => $score,      # score difference threshold
+  text_size => $size,      # maximum number of tokens to be used per text
 
 
-sub classify{
-    my $dic         = shift;
-    my %options     = @_;
-    $options{langs} = '' unless ($options{langs});
+=cut
 
-    my @langs = ref($options{langs}) eq 'ARRAY' ? 
-	@{$options{langs}} : split( /\s+/, $options{langs} ) ;
 
-    @langs = available_languages() unless (@langs);
+sub train{
+    my $traindata = shift;
+    my %options   = @_;
 
-    return &classify_with_margin( $dic, $options{use_margin}, @langs ) 
-	if ($options{use_margin});
-    return &classify_cascaded( $dic, @langs );
+    my @langs = keys %{$traindata};
+
+    for my $s (0..$#langs){
+        for my $t ($s+1..$#langs){
+            print "training blacklist for $langs[$s]-$langs[$t]\n";
+            &train_blacklist( $$traindata{$langs[$s]},$$traindata{$langs[$t]},
+		    outfile  => "$BLACKLISTDIR/$langs[$s]-$langs[$t].txt",
+		    %options );
+        }
+    }
 }
 
 
+=head2 C<train_blacklist($file1,$file2,%options)>
+
+This function learns a blacklist of words to discriminate between the language given in $file1 and the language given in $file2. It takes the same arguments (%options) as the C<train> function above with one additional parameter:
+
+ outfile => $output_file
+
+Using this parameter, the blacklist will be written to the specified file. Otherwise it will be printed to STDOUT.
+
+The function returns nothing otherwise.
+
+=cut
+
+
+
+sub train_blacklist{
+    my ($file1,$file2,%options) = @_;
+
+    my $min_high = exists $options{min_high} ? $options{min_high} : 10;
+    my $max_low  = exists $options{max_low}  ? $options{max_low}  : 3;
+    my $min_diff = exists $options{min_diff} ? $options{min_diff} : 0.8;
+
+    my %dic1=();
+    my %dic2=();
+
+    my $total1 = &read_file($file1,\%dic1,$options{text_size});
+    my $total2 = &read_file($file2,\%dic2,$options{text_size});
+
+    if ($options{outfile}){
+	mkdir dirname($options{outfile}) unless (-d dirname($options{outfile}));
+        open O,">$options{outfile}" or die "cannot write to $options{outfile}\n";
+        binmode(O,":encoding(UTF-8)");
+    }
+
+    foreach my $w (keys %dic1){
+	next if ((!exists $dic1{$w} || $dic1{$w}<$min_high) && 
+		 (!exists $dic2{$w} || $dic2{$w}<$min_high));
+	next if ((exists $dic1{$w} && $dic1{$w}>$max_low) && 
+		 (exists $dic2{$w} && $dic2{$w}>$max_low));
+
+	my $c1 = exists $dic1{$w} ? $dic1{$w} : 0;
+	my $c2 = exists $dic2{$w} ? $dic2{$w} : 0;
+
+        my $s1 = $c1 * $total2;
+        my $s2 = $c2 * $total1;
+        my $diff = ($s1 - $s2) / ($s1 + $s2);
+
+	if (abs($diff) > $min_diff){
+            if ($options{outfile}){
+                print O "$diff\t$w\t$c1\t$c2\n";
+            }
+            else{
+                print "$diff\t$w\t$c1\t$c2\n";
+            }
+	}
+    }
+    # don't forget words that do NOT appear in dic1!!!
+    foreach my $w (keys %dic2){
+	next if (exists $dic1{$w});
+	next if ($dic2{$w}<$min_high);
+	my $c1 = exists $dic1{$w} ? $dic1{$w} : 0;
+	my $c2 = exists $dic2{$w} ? $dic2{$w} : 0;
+        if ($options{outfile}){
+            print O "-1\t$w\t$c1\t$c2\n";
+        }
+        else{
+            print "-1\t$w\t$c1\t$c2\n";
+        }
+    }
+    close O if ($options{outfile});
+}
 
+=head2 C<@langs = available_languages()>
+
+Returns a list of languages covered by the blacklists in the BLACKLISTDIR.
+
+=cut
 
 sub available_languages{
     unless (keys %blacklists){
 }
 
 
+=head2 C<%pairs = available_blacklists()>
+
+Returns a hash of available language pairs (for which blacklists exist in the system).
+
+=cut
+
+
+sub available_blacklists{
+    unless (keys %blacklists){
+	&load_all_blacklists( $BLACKLISTDIR );
+    }
+    my %pairs = ();
+    foreach (keys %blacklists){
+	my ($lang1,$lang2) = split(/\-/);
+	$pairs{$lang1}{$lang2} = 1;
+	$pairs{$lang2}{$lang1} = 1;
+    }
+    return %pairs;
+}
+
+
+
+
+=head2 C<run_experiment(\@trainfiles,\@testfiles,\%options,@langs)>
+
+This function runs experiments, i.e. it trains and evaluates classifiers for the given languages (C<@langs>). The arrays of training data and test data need to be of the same size as C<@langs>. The function prints the overall accuracy and a confusion table given the data sets and the classification. C<%options> can be used to set classifier-specific parameters.
+
+=cut
+
+
+sub run_experiment{
+
+    use Benchmark;
+
+    my $trainfiles = shift;
+    my $evalfiles = shift;
+    my $options = ref($_[0]) eq 'HASH' ? shift : {};
+
+    my @traindata = ref($trainfiles) eq 'ARRAY' ? @{$trainfiles} : split(/\s+/,$trainfiles);
+    my @evaldata = ref($evalfiles) eq 'ARRAY' ? @{$evalfiles} : split(/\s+/,$evalfiles);
+    my @langs = @_;
+
+    die "no languages given!\n" unless (@langs);
+    die "no training nor evaluation data given!\n" 
+        unless ($#traindata == $#evaldata || $#traindata == $#langs);
+
+    my %trainset = ();
+    for (0..$#langs){ $trainset{$langs[$_]} = $traindata[$_]; }
+
+    # train blacklists
+
+    if ($#traindata == $#langs){
+        $BLACKLISTDIR = $$options{blacklist_dir} || "blacklist-experiment";
+        my $t1 = new Benchmark;
+        &train( \%trainset, %{$options} );
+        print STDERR "training took: ". 
+	    timestr(timediff(new Benchmark, $t1)).".\n";
+    }
+
+    &initialize();
+
+    # classify test data
+
+    if ($#evaldata == $#langs){
+        print STDERR "classify ....\n";
+
+        my $correct=0;
+        my $count=0;
+        my %guesses=();
+
+        my %correct_lang=();
+        my %count_lang=();
+
+        my $t1 = new Benchmark;
+        foreach my $i (0..$#langs){
+            open IN,"<:encoding(UTF-8)",$evaldata[$i] or die "...";
+            while (<IN>){
+                chomp;
+                my %dic = ();
+                &process_string($_,\%dic);
+                my $guess = &classify(\%dic, langs => \@langs);
+                $count++;
+                $count_lang{$langs[$i]}++;
+                if ($guess eq $langs[$i]){
+                    $correct++;
+                    $correct_lang{$langs[$i]}++;
+                }
+                $guesses{$langs[$i]}{$guess}++;
+            }
+            close IN;
+        }
+        print STDERR "classification took: ".
+            timestr(timediff(new Benchmark, $t1)).".\n";
+
+        printf "accuracy: %6.4f\n   ",$correct/$count;
+        foreach my $c (@langs){
+            print "  $c";
+        }
+        print "\n";
+        foreach my $c (@langs){
+            print "$c ";
+            foreach my $g (@langs){
+                printf "%4d",$guesses{$c}{$g};
+            }
+            printf "  %6.4f",$correct_lang{$c}/$count_lang{$c};
+            print "\n";
+        }
+    }
+    system("wc -l $Lingua::Identify::Blacklists::BLACKLISTDIR/*.txt");
+}
+
+
+=head2 Module-internal functions
+
+The following functions are not exported and are mainly used for internal purposes (but may be used from the outside if needed).
+
+ classify(\%dic,%options)          # run the classifier
+ classify_cascaded(\%dic,@langs)   # run a cascade of binary classifications
+
+ # run all versus all and return the one that wins most binary decisions
+ # (a score margin is used to adjust the reliability of the decisions)
+ classify_with_margin(\%dic,$margin,@langs) 
+
+ load_all_blacklists($dir)    # load all blacklists available in $dir
+ load_blacklist(\%list,$dir,$lang1,$lang2) # load a lang-pair specific blacklist
+ read_file($file,\%dic,$max)  # read a file and count token frequencies
+ process_string($string,\%dic) # process a given string (lowercasing ...)
+
+=cut
+
+
+
+
+sub classify{
+    my $dic         = shift;
+    my %options     = @_;
+    $options{langs} = '' unless ($options{langs});
+
+    my @langs = ref($options{langs}) eq 'ARRAY' ? 
+	@{$options{langs}} : split( /\s+/, $options{langs} ) ;
+
+    @langs = available_languages() unless (@langs);
+
+    return &classify_with_margin( $dic, $options{use_margin}, @langs ) 
+	if ($options{use_margin});
+    return &classify_cascaded( $dic, @langs );
+}
+
+
+
+
 
 sub classify_cascaded{
     my $dic = shift;
 
 
 
+
+
+
 sub load_all_blacklists{
     my $dir = shift;
 
 }
 
 
+
+
+
+
 1;
 
 
 
 =head1 LICENSE AND COPYRIGHT
 
-   Copyright 2012 Jörg Tiedemann.
-
-   This program is free software: you can redistribute it and/or modify
-   it under the terms of the GNU Lesser General Public License as published
-   by the Free Software Foundation, either version 3 of the License, or
-   (at your option) any later version.
-
-   This program is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
-   GNU Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public License
-   along with this program.  If not, see <http://www.gnu.org/licenses/>.
+Copyright 2012 Jörg Tiedemann.
+
+This program is free software: you can redistribute it and/or modify
+it under the terms of the GNU Lesser General Public License as published
+by the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU Lesser General Public License for more details.
+
+You should have received a copy of the GNU Lesser General Public License
+along with this program.  If not, see L<http://www.gnu.org/licenses/>.
+
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 =cut
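
Note on the score in `train_blacklist`: the value written to each blacklist line is the difference between a word's relative frequencies in the two texts, divided by their sum. Multiplying numerator and denominator by `total1 * total2` gives the form used in the diff above. A minimal sketch follows; `diff_score` is a hypothetical name, the counts are illustrative, and the guard against a zero denominator is an added assumption that the module does not make.

```perl
use strict;
use warnings;

# Hypothetical stand-alone version of the score computed in train_blacklist:
#   (c1/total1 - c2/total2) / (c1/total1 + c2/total2)
# rewritten without fractions, as in the module.
sub diff_score {
    my ($c1, $c2, $total1, $total2) = @_;
    my $s1 = $c1 * $total2;
    my $s2 = $c2 * $total1;
    return 0 if $s1 + $s2 == 0;    # assumption: unseen words count as neutral
    return ($s1 - $s2) / ($s1 + $s2);
}

# Illustrative counts only (not taken from the shipped blacklists).
my $min_diff = 0.8;                # default threshold used by train_blacklist
for my $case ( [84, 2], [1, 13], [5, 4] ) {
    my ($c1, $c2) = @{$case};
    my $score = diff_score($c1, $c2, 100_000, 100_000);
    printf "%d\t%d\t%.4f\t%s\n", $c1, $c2, $score,
        abs($score) > $min_diff ? 'keep' : 'skip';
}
```

The score lies between -1 and 1. Positive values mark words typical of the first text, negative values words typical of the second, and a word that occurs in only one text scores exactly 1 or -1.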

Lingua-Identify-Blacklists/share/blacklists/hr-bs.txt

--0.854289923116163	mečeve	1	13
+-0.854290007159691	mečeve	1	13
 1	španjolsku	28	
 1	ulagačko	12	
-1	prosinac	105	
--0.838978058301149	formalni	3	35
+1	prosinac	106	
+-0.838978150409425	formalni	3	35
 1	prometni	12	
-0.954446761962161	načelima	84	2
--0.805819330628547	dostavu	2	19
+0.954446734268499	načelima	84	2
+-0.805819439701568	dostavu	2	19
 1	pripala	15	
 1	crpki	28	
 1	boravilo	12	
 1	registarske	10	
--0.854289923116163	otkupi	1	13
--0.95287543252052	usuglasili	3	127
--0.892591258928404	eventualnu	1	18
+-0.854290007159691	otkupi	1	13
+-0.952875461146407	usuglasili	3	127
+-0.892591322159906	eventualnu	1	18
 1	proračune	16	
-0.849153121568153	promijenjen	12	1
+0.849153034802772	promijenjen	12	1
 1	županiji	10	
 1	kazališnih	35	
 1	sučele	11	
 1	iskazivanje	15	
--0.987983406998078	učešće	2	338
+-0.98801864923148	učešće	2	339
 1	manjkavosti	58	
 1	redoviti	15	
--0.950794704282894	porekao	2	81
-1	zemljopisno	11	
+-0.950794734140875	porekao	2	81
+1	zemljopisno	12	
 1	ocjenjivački	27	
-0.965938314280505	diplomat	113	2
--0.991000503092326	evropa	1	226
-0.836560421268163	nasljeđem	11	1
+0.96593829345127	diplomat	113	2
+-0.991039976390508	evropa	1	227
+0.836560327899804	nasljeđem	11	1
 1	razvidan	13	
 1	stječe	47	
 1	izvanredni	53	
--0.997618848069907	nivou	1	857
-0.869294351409612	nezabilježen	14	1
--0.936196708312082	obimu	1	31
-0.854748782851521	neslužbenog	25	2
--0.863989999047302	rodbini	1	14
+-0.997635384904278	nivou	1	863
+0.869294275410358	nezabilježen	14	1
+-0.936196746738472	obimu	1	31
+0.854748699051882	neslužbenog	25	2
+-0.863990077906338	rodbini	1	14
 1	lipanj	158	
-0.896841901522041	obiteljskog	18	1
+0.896841840656464	obiteljskog	18	1
 1	priključili	20	
-0.948449587107685	središnji	37	1
+0.948449555864262	središnji	37	1
 1	zahtijevat	18	
 1	nazočiti	78	
 1	izniman	27	
--0.975969858345676	centralnom	1	84
-0.836560421268163	slamanju	11	1
+-0.975969873115444	centralnom	1	84
+0.836560327899804	slamanju	11	1
 1	uzvikivali	14	
-0.849153121568153	iskaznicu	12	1
-0.803803435314514	prekomjernog	18	2
-0.821673741820662	stanjima	10	1
-0.996153418813361	dužnosnika	508	1
+0.849153034802772	iskaznicu	12	1
+0.803803325232117	prekomjernog	18	2
+0.821673640773722	stanjima	10	1
+0.996168472167825	dužnosnika	510	1
 1	prvotna	11	
--0.854289923116163	istupi	1	13
--0.973473918460848	prevazići	1	76
-0.977228966764921	inozemne	85	1
+-0.854290007159691	istupi	1	13
+-0.973473934744114	prevazići	1	76
+0.977228952760118	inozemne	85	1
 1	znanstvenicima	20	
 1	kavu	19	
 1	čimbenika	108	
 1	umirovljenih	21	
--0.81462847492659	razmiricama	1	10
--0.907223579184279	mis	1	21
-0.930024850453371	utaje	27	1
-0.810145297617624	učinku	28	3
+-0.814628579559378	razmiricama	1	10
+-0.907223634224005	mis	1	21
+0.930024808444208	utaje	27	1
+0.810145190719012	učinku	28	3
 1	ravnateljem	26	
-0.955482559200511	zanima	43	1
+0.955482532122209	zanima	43	1
 1	računalni	22	
 1	proračunskim	26	
 1	baštine	53	
-1	izgredima	12	
-0.906676356891532	popravak	20	1
-0.921620744268863	opskrbljivača	24	1
+1	izgredima	13	
+0.90667630154302	popravak	20	1
+0.92162069741924	opskrbljivača	24	1
 1	goleme	10	
--0.868371329846482	formalne	2	29
-0.983544742134121	njezini	118	1
--0.913140055412959	desetina	2	45
+-0.868371406344596	formalne	2	29
+0.983544731981364	njezini	118	1
+-0.91314010710258	desetina	2	45
 1	tisuća	222	
-0.955482559200511	izišlo	43	1
--0.89795591829138	zamisli	1	19
--0.830048423436161	gajiti	1	11
--0.956550133123887	zavisnosti	1	46
-0.891104174737053	izbornika	17	1
-1	priopćenju	607	
+0.955482532122209	izišlo	43	1
+-0.897955978534989	zamisli	1	19
+-0.830048520180336	gajiti	1	11
+-0.956550159567236	zavisnosti	1	46
+0.891104110680441	izbornika	17	1
+1	priopćenju	612	
 1	nabavu	30	
 1	gospodarsku	237	
--0.843099996358911	narastao	1	12
+-0.843100086310512	narastao	1	12
 1	konca	544	
-0.849153121568153	stanke	24	2
-0.940633924600966	citira	32	1
+0.849153034802772	stanke	24	2
+0.940633888764989	citira	32	1
 1	priopćena	17	
 1	ubojica	13	
-1	rujnu	205	
-0.906676356891532	pojedinosti	20	1
+1	rujnu	206	
+0.90667630154302	pojedinosti	20	1
 1	znanstveno	17	
-0.836560421268163	iziđe	11	1
+0.836560327899804	iziđe	11	1
 1	crpke	10	
 1	europsku	287	
 1	uporabi	44	
 1	poduprli	56	
 1	povijesnom	47	
 1	poduprijet	13	
-0.821673741820662	aktualni	10	1
-1	tečaja	30	
+0.821673640773722	aktualni	10	1
+1	tečaja	31	
 1	talijanskim	28	
--0.97239884970643	meču	1	73
-0.873516406185737	kolosijeka	29	2
+-0.972398866640406	meču	1	73
+0.873516332475301	kolosijeka	29	2
 1	naprosto	18	
 1	svibnju	271	
 1	židovi	11	
--0.969054347246457	učešća	2	130
-0.970326448450824	demokraciji	65	1
+-0.969054366200173	učešća	2	130
+0.970326430264497	demokraciji	65	1
 1	podupirala	14	
--0.89795591829138	omladinske	1	19
--0.836834792324048	rampama	2	23
-0.849153121568153	biračko	12	1
-0.9109224102492	djelovao	21	1
--0.918347165678883	zadnjoj	1	24
--0.938128880222607	demonstranti	3	96
+-0.897955978534989	omladinske	1	19
+-0.836834885549543	rampama	2	23
+0.849153034802772	biračko	12	1
+0.910922357301293	djelovao	21	1
+-0.91834721440205	zadnjoj	1	24
+-0.938128917522506	demonstranti	3	96
 1	msp-om	11	
 1	sustave	47	
-0.877474234326592	školstvu	30	2
-0.940633924600966	ahtisaaria	32	1
+0.877474162771805	školstvu	30	2
+0.940633888764989	ahtisaaria	32	1
 1	proračunom	40	
--0.843099996358911	minhenu	1	12
-0.983405265873	kojemu	117	1
--0.94248335922929	rangu	2	69
-0.959614839930182	pristaše	95	2
+-0.843100086310512	minhenu	1	12
+0.983405255634907	kojemu	117	1
+-0.942483393981935	rangu	2	69
+0.959614815313483	pristaše	95	2
 1	predsjedanje	54	
-0.949772107597061	kazalištu	38	1
--0.822673038757655	shodno	2	21
-0.930024850453371	znatna	27	1
+0.949772077134522	kazalištu	38	1
+-0.82267313929342	shodno	2	21
+0.930024808444208	znatna	27	1
 1	siječanj	145	
 1	al-qaeda	31	
-0.877474234326592	usporediti	15	1
+0.877474162771805	usporediti	15	1
 1	obiteljskih	16	
 1	proračunske	32	
 1	razina	158	
 1	idućim	20	
-0.924637777800384	brojke	75	3
+0.924637732683407	brojke	75	3
 1	zemljopisni	11	
 1	schengenskoj	34	
 1	tiskani	11	
-0.829441299475285	dosegnuvši	21	2
+0.829441202417667	dosegnuvši	21	2
 1	ocjenjivačkog	38	
-0.884690582974605	podignuo	32	2
+0.884690515375318	podignuo	32	2
 1	veleposlanik	363	
-0.906676356891532	dovršiti	40	2
-0.85994415115505	vancouveru	13	1
+0.90667630154302	dovršiti	40	2
+0.859944070126432	vancouveru	13	1
 1	utvrda	19	
--0.989836513801398	narednog	1	200
-0.854748782851521	prognozirao	25	2
--0.988053670621095	kog	1	170
+-0.989836520092071	narednog	1	200
+0.854748699051882	prognozirao	25	2
+-0.988053678008631	kog	1	170
 1	dužnosnica	26	
-1	nazočnošću	10	
--0.89795591829138	drevna	1	19
-0.948449587107685	uvjetom	37	1
+1	nazočnošću	11	
+-0.897955978534989	drevna	1	19
+0.948449555864262	uvjetom	37	1
 1	pomirbe	10	
--0.81462847492659	materijalnih	1	10
--0.830048423436161	stremiti	1	11
+-0.814628579559378	materijalnih	1	10
+-0.830048520180336	stremiti	1	11
 1	elektroničke	15	
 1	dopušta	10	
--0.975090937697287	liderom	1	81
+-0.97509095300046	liderom	1	81
 1	konzultacija	40	
-0.866319491657977	kaznene	41	3
+0.866319414052682	kaznene	41	3
 1	forenzični	10	
 1	nazočila	24	
 1	spilje	14	
 1	zajma	22	
--0.947642259765036	tračnicama	1	38
+-0.947642291484593	tračnicama	1	38
 1	konzultacije	47	
--0.961826964389014	prosto	2	105
+-0.96182698768357	prosto	2	105
 1	priključile	33	
 1	pričekati	28	
 1	nogometu	31	
--0.902810181231619	privrednih	3	60
+-0.902810238756176	privrednih	3	60
 1	redatelj	95	
 1	jednotjedni	12	
 1	oporbeni	85	
-0.98516559462633	zapreke	131	1
+0.985165585466146	zapreke	131	1
 1	snošljivost	16	
--0.956550133123887	generalnog	3	138
+-0.956550159567236	generalnog	3	138
 1	izvješćuju	42	
 1	priopćili	147	
 1	nazočit	10	
 1	predsjedao	23	
--0.914948014961509	kartama	1	23
+-0.91494806562306	kartama	1	23
 1	obvezujuća	13	
 1	obvezujući	30	
 1	priljev	57	
-0.955482559200511	suradnjom	43	1
+0.955482532122209	suradnjom	43	1
 1	tvrtka	715	
 1	nazočnosti	82	
 1	opozicijska	21	
-0.877474234326592	odgodom	15	1
--0.843099996358911	pusti	1	12
+0.877474162771805	odgodom	15	1
+-0.843100086310512	pusti	1	12
 1	temeljenog	30	
--0.854289923116163	daljeg	1	13
--0.970399390525685	zavisno	1	68
--0.902810181231619	pravičnih	1	20
+-0.854290007159691	daljeg	1	13
+-0.970399408667968	zavisno	1	68
+-0.902810238756176	pravičnih	1	20
 1	razvitku	183	
-0.836560421268163	fenomena	11	1
-0.965639416994525	neovisnog	56	1
+0.836560327899804	fenomena	11	1
+0.965639395985705	neovisnog	56	1
 1	prevoditelj	10	
--0.81462847492659	pravcem	1	10
+-0.814628579559378	pravcem	1	10
 1	dakako	17	
 1	tisuću	20	
 1	nizozemski	59	
--0.884494810323094	korumpiranosti	3	50
+-0.88449487803008	korumpiranosti	3	50
 1	tvrtku	113	
 1	zajamčiti	17	
 1	odljeva	16	
 1	utemeljenje	25	
 1	prvotnu	11	
--0.982237417019819	protesti	1	114
+-0.982237427971955	protesti	1	114
 1	točka	91	
-0.836560421268163	nestrpljenjem	11	1
+0.836560327899804	nestrpljenjem	11	1
 1	mimohodu	18	
--0.978941999381353	liderima	2	192
+-0.978942012343807	liderima	2	192
 1	ustrojila	20	
-0.896841901522041	ubojice	18	1
+0.896841840656464	ubojice	18	1
 1	pridonositi	10	
--0.863989999047302	reorganizaciji	1	14
--0.834633701229412	drevne	3	34
+-0.863990077906338	reorganizaciji	1	14
+-0.834633795599296	drevne	3	34
 1	promatra	40	
 1	uporabu	82	
 1	tijek	10	
-0.821673741820662	povjeren	10	1
--0.879970927020748	prosta	1	16
+0.821673640773722	povjeren	10	1
+-0.879970997210649	prosta	1	16
 1	zanijekao	42	
 1	glasovao	35	
--0.908606967646796	novostima	3	64
+-0.908607021905152	novostima	3	64
 1	tužiteljstava	12	
-0.927431152700466	odsutnosti	26	1
--0.81462847492659	zasnivanje	1	10
-0.991875518563188	njezine	240	1
+0.930024808444208	odsutnosti	27	1
+-0.814628579559378	zasnivanje	1	10
+0.991875513529397	njezine	240	1
 1	veleposlanici	52	
--0.914948014961509	redovne	2	46
+-0.91494806562306	redovne	2	46
 1	naputke	12	
 1	obrambenu	16	
 1	bruxellesu	137	
-0.821673741820662	težnju	10	1
+0.821673640773722	težnju	10	1
 1	izvorno	12	
-0.891104174737053	odredaba	17	1
--0.974145281777702	prvobitno	2	156
+0.891104110680441	odredaba	17	1
+-0.974145297654245	prvobitno	2	156
 1	sudionice	16	
 1	razvidno	25	
 1	antalyi	10	
 1	iskaznica	37	
-0.877474234326592	preokrenuti	15	1
+0.877474162771805	preokrenuti	15	1
 1	ustrajavati	36	
--0.916112080218501	demonstranata	3	70
+-0.91724476105407	demonstranata	3	71
 1	težeći	12	
-0.926060838828872	natječaja	51	2
--0.886631238745463	trustu	1	17
+0.926060794531108	natječaja	51	2
+-0.886631305275454	trustu	1	17
 1	iznimna	21	
 1	oporbena	139	
--0.848901850500422	neformalnom	2	25
--0.879970927020748	svuda	1	16
+-0.848901937398475	neformalnom	2	25
+-0.879970997210649	svuda	1	16
 1	iznimnom	12	
-0.836560421268163	specijalistica	11	1
-0.85994415115505	djelatnicima	13	1
-0.948449587107685	uzajamne	37	1
--0.830048423436161	ispitom	1	11
-0.816090008446126	samostana	29	3
+0.836560327899804	specijalistica	11	1
+0.859944070126432	djelatnicima	13	1
+1	ozračjem	10	
+0.948449555864262	uzajamne	37	1
+-0.830048520180336	ispitom	1	11
+0.816089904554639	samostana	29	3
 1	desetljeće	49	
--0.802693899306768	redovni	3	28
+-0.802694009943555	redovni	3	28
 1	zbilo	10	
 1	srednjovjekovnog	11	
 1	rumunje	31	
--0.830048423436161	dubinske	1	11
+-0.830048520180336	dubinske	1	11
 1	financirat	13	
 1	rumunja	77	
 1	ustrajavali	18	
 1	mksj-om	15	
 1	prodano	15	
-0.995590266444981	sveučilišta	443	1
--0.9341399667044	naznačili	1	30
-0.930024850453371	najvjerojatnije	27	1
--0.998324000373359	evropske	1	1218
--0.883396113225645	pravične	2	33
+0.995600173697004	sveučilišta	444	1
+-0.934140006327356	naznačili	1	30
+0.930024808444208	najvjerojatnije	27	1
+-0.998324001415141	evropske	1	1218
+-0.883396181536817	pravične	2	33
 1	liberalno	47	
-0.884690582974605	nogometna	32	2
+0.884690515375318	nogometna	32	2
 1	predano	16	
 1	potičući	15	
 1	zemljoradnici	10	
 1	uhićenih	38	
 1	eliminiranje	15	
-0.961968337118691	skupinama	101	2
+0.96196831390872	skupinama	101	2
 1	stožer	41	
-0.902005250071844	sabora	57	3
-0.849153121568153	zakladu	12	1
-0.989712544830327	rumunjski	568	3
-0.877474234326592	dužničkih	15	1
-0.973654023701537	svezi	220	3
--0.966790823970024	tačaka	2	121
--0.892591258928404	ujedinjavanje	1	18
-0.902005250071844	proturaketne	19	1
-0.821673741820662	izvanrednom	10	1
+0.902005192095369	sabora	57	3
+0.849153034802772	zakladu	12	1
+0.989712538463316	rumunjski	568	3
+0.877474162771805	dužničkih	15	1
+0.973654007527344	svezi	220	3
+-0.96679084428673	tačaka	2	121
+-0.892591322159906	ujedinjavanje	1	18
+0.902005192095369	proturaketne	19	1
+0.821673640773722	izvanrednom	10	1
 1	etape	14	
 1	uhićenjima	22	
--0.843099996358911	teritorijama	1	12
-0.992801542277693	odličje	271	1
--0.81462847492659	antalija	1	10
-0.914798902774578	složene	22	1
-0.927431152700466	sastat	52	2
-0.821673741820662	zadovoljava	20	2
--0.968578417995352	narednim	2	128
--0.919946850393991	pravca	2	49
--0.997073042121122	odbrane	1	697
-0.959193976300711	predsjedatelja	47	1
+-0.843100086310512	teritorijama	1	12
+0.992801537815578	odličje	271	1
+-0.814628579559378	antalija	1	10
+0.914798852028131	složene	22	1
+0.92743110919274	sastat	52	2
+0.821673640773722	zadovoljava	20	2
+-0.968578437235916	narednim	2	128
+-0.919946898202442	pravca	2	49
+-0.997081406413633	odbrane	1	699
+0.959193951432817	predsjedatelja	47	1
 1	učinkovitiji	14	
--0.843099996358911	izuzetak	2	24
+-0.843100086310512	izuzetak	2	24
 1	temeljeno	45	
--0.968578417995352	upotreba	1	64
-1	istodobno	326	
+-0.968578437235916	upotreba	1	64
+1	istodobno	327	
 1	zločinačkih	10	
--0.918347165678883	ispravnom	1	24
+-0.91834721440205	ispravnom	1	24
 1	sydneyu	10	
 1	atenskog	19	
-0.849153121568153	neovisan	12	1
-0.869294351409612	antičkoj	14	1
-0.877474234326592	teritorijem	15	1
+0.849153034802772	neovisan	12	1
+0.869294275410358	antičkoj	14	1
+0.877474162771805	teritorijem	15	1
 1	nazoči	35	
-0.896841901522041	alevi	18	1
+0.896841840656464	alevi	18	1
 1	bitci	11	
 1	bruxellesom	11	
 1	znanost	279	
 1	sustavi	19	
 1	navješćuje	46	
-0.932439542345611	proračunu	84	3
-0.803803435314514	sastavila	18	2
--0.980264207341701	direktno	2	205
+0.93243950173535	proračunu	84	3
+0.803803325232117	sastavila	18	2
+-0.980264219498374	direktno	2	205
 1	bruxellessa	10	
-0.821673741820662	prinuđena	10	1
-0.973172679621444	bodova	144	2
+0.821673640773722	prinuđena	10	1
+0.973535298061346	bodova	146	2
 1	znanstvene	52	
-0.906676356891532	protukandidata	20	1
-0.836560421268163	povjerenstava	11	1
-0.836560421268163	poštenih	11	1
-0.821673741820662	prekomjerno	10	1
+0.90667630154302	protukandidata	20	1
+0.836560327899804	povjerenstava	11	1
+0.836560327899804	poštenih	11	1
+0.821673640773722	prekomjerno	10	1
 1	ciparskog	61	
-0.821673741820662	sjajno	10	1
+0.821673640773722	sjajno	10	1
 1	kostur	23	
 1	obrani	48	
-0.85994415115505	informatičku	13	1
--0.854289923116163	nezavisnim	1	13
-1	prosvjedu	32	
+1	oess/odihr	10	
+0.859944070126432	informatičku	13	1
+-0.854290007159691	nezavisnim	1	13
+1	prosvjedu	33	
 1	tajnice	28	
--0.959156002537145	protestiraju	1	49
+-0.959156027427684	protestiraju	1	49
 1	kave	10	
--0.81462847492659	povike	1	10
-0.930024850453371	useljenika	54	2
-0.938777511505321	izravni	31	1
-0.99580745798971	tajnik	466	1
--0.886631238745463	stanovišta	2	34
--0.854289923116163	trojci	1	13
-0.942381069161655	unapređenja	33	1
-0.849153121568153	poduzimajući	12	1
+-0.814628579559378	povike	1	10
+0.930024808444208	useljenika	54	2
+0.938777474584084	izravni	31	1
+0.995816414219242	tajnik	467	1
+-0.886631305275454	stanovišta	2	34
+-0.854290007159691	trojci	1	13
+0.942381034349018	unapređenja	33	1
+0.849153034802772	poduzimajući	12	1
 1	ustrajavajući	51	
--0.805819330628547	usijane	2	19
-0.884690582974605	investicijskog	16	1
+-0.805819439701568	usijane	2	19
+0.884690515375318	investicijskog	16	1
 1	povijesna	40	
--0.952510308695931	prisustvuje	1	42
+-0.952510337538219	prisustvuje	1	42
 1	sučelio	22	
-0.877474234326592	internetskim	30	2
-0.980988062181774	neovisnih	102	1
--0.948304654758623	pristalica	2	77
+0.877474162771805	internetskim	30	2
+0.980988050466686	neovisnih	102	1
+-0.948950529737759	pristalica	2	78
 1	pothvata	24	
--0.976249208630545	pristalice	1	85
--0.911253562475137	parlamentarac	2	44
+-0.976522153093688	pristalice	1	86
+-0.911253615235317	parlamentarac	2	44
 1	facto	31	
-1	sveučilištima	61	
-1	gospodarstvu	153	
--0.830048423436161	obavještavati	1	11
-0.920560658951714	potaknuo	71	3
--0.995017140693404	januar	1	409
-0.869294351409612	presađivanje	14	1
-1	ec-a	62	
--0.99198859968052	izvještavaju	1	254
+1	sveučilištima	62	
+1	gospodarstvu	154	
+-0.830048520180336	obavještavati	1	11
+0.920560611494642	potaknuo	71	3
+-0.995017143785567	januar	1	409
+0.869294275410358	presađivanje	14	1
+1	ec-a	63	
+-0.991988604644526	izvještavaju	1	254
 1	teorijski	10	
-0.881191887171015	agency	31	2
+0.881191817649941	agency	31	2
 1	schengenskog	20	
 1	postrojba	74	
 1	svatko	63	
-0.954090701326701	svakako	125	3
+0.95409067342166	svakako	125	3
 1	martya	26	
-0.906676356891532	izišao	20	1
--0.918347165678883	dužine	1	24
--0.907223579184279	šou	1	21
+0.90667630154302	izišao	20	1
+-0.91834721440205	dužine	1	24
+-0.907223634224005	šou	1	21
 1	financirala	24	
-0.849153121568153	upoznavanje	12	1
-0.891104174737053	internetski	34	2
+0.849153034802772	upoznavanje	12	1
+0.891104110680441	internetski	34	2
 1	prijepore	19	
 1	prosvjed	133	
--0.886631238745463	dekret	1	17
-0.99638762936251	znanosti	541	1
--0.988458581663814	otpočeo	1	176
-0.924637777800384	atenske	25	1
-0.877474234326592	opservatorij	15	1
--0.946265205061861	djelotvorno	1	37
-0.869294351409612	internationala	28	2
+-0.886631305275454	dekret	1	17
+0.996394279998019	znanosti	542	1
+-0.98845858880241	otpočeo	1	176
+0.924637732683407	atenske	25	1
+0.877474162771805	opservatorij	15	1
+-0.946265237592653	djelotvorno	1	37
+0.869294275410358	internationala	28	2
 1	složeno	14	
--0.81462847492659	ostvarenih	1	10
+-0.814628579559378	ostvarenih	1	10
 1	sučeljavati	15	
 1	navijestili	31	
-0.99129772351355	povjerenstva	224	1
-0.957418987731759	jamstvo	45	1
-0.896841901522041	surađujemo	18	1
--0.89795591829138	neformalne	1	19
+0.991297718123332	povjerenstva	224	1
+0.957418961805666	jamstvo	45	1
+0.896841840656464	surađujemo	18	1
+-0.897955978534989	neformalne	1	19
 1	umirovljenici	34	
--0.921485059703493	ceku	1	25
-0.98604662993732	stajalište	418	3
+-0.921485106630878	ceku	1	25
+0.986079691152699	stajalište	419	3
 1	promidžbe	40	
--0.863989999047302	povrijeđena	1	14
+-0.863990077906338	povrijeđena	1	14
 1	dojam	100	
--0.944813758277223	princip	2	72
+-0.944813791661799	princip	2	72
 1	litri	12	
-0.927431152700466	upućivanju	26	1
--0.939947466920906	centralni	1	33
+0.92743110919274	upućivanju	26	1
+-0.939947503158414	centralni	1	33
 1	uhićena	32	
 1	haaškom	26	
-0.864780696402166	znatnu	27	2
-0.836560421268163	pozajmljivanja	11	1
-0.88798911212414	podignuli	33	2
-0.932439542345611	icty-em	28	1
+0.864780617968279	znatnu	27	2
+0.836560327899804	pozajmljivanja	11	1
+0.887989046343665	podignuli	33	2
+0.93243950173535	icty-em	28	1
 1	ustrojili	16	
-1	snimkama	10	
-1	obiteljima	84	
-0.829441299475285	obrazloženje	21	2
--0.848901850500422	naznačila	2	25
--0.863989999047302	zaposlenja	2	28
+-0.998647882692864	regionu	1	1510
+1	snimkama	11	
+1	obiteljima	85	
+0.829441202417667	obrazloženje	21	2
+-0.848901937398475	naznačila	2	25
+-0.863990077906338	zaposlenja	2	28
 1	kani	103	
 1	summita	86	
 1	citirali	15	
 1	nabavama	11	
--0.886631238745463	rival	1	17
--0.984524649172	lidere	1	131
--0.989985961053989	stejt	1	203
+-0.886631305275454	rival	1	17
+-0.984524658724873	lidere	1	131
+-0.989985967252627	stejt	1	203
 1	mirovinama	17	
-0.877474234326592	motrenje	15	1
--0.843099996358911	regionu.	1	12
+0.877474162771805	motrenje	15	1
 1	koalicijski	43	
 1	atena	179	
-0.930848700615428	otok	82	3
-0.877474234326592	suzdržana	15	1
+0.931653336302367	otok	83	3
+0.877474162771805	suzdržana	15	1
 1	veleposlanikom	27	
-0.940633924600966	zapovjedništvom	32	1
+0.944028282861588	zapovjedništvom	34	1
 1	bey	10	
-0.849153121568153	bizantske	12	1
-0.803803435314514	saznaje	18	2
+0.849153034802772	bizantske	12	1
+0.803803325232117	saznaje	18	2
 1	tajništva	14	
 1	točan	28	
 1	predprijamnog	10	
--0.863989999047302	primorava	1	14
--0.81462847492659	kursu	1	10
+-0.863990077906338	primorava	1	14
+-0.830048520180336	kursu	1	11
 1	računalu	10	
 1	znanstvenika	88	
-0.869294351409612	dui-a	28	2
+0.869294275410358	dui-a	28	2
 1	četvrtima	12	
 1	internetskog	21	
-0.877474234326592	krajeva	15	1
+0.877474162771805	krajeva	15	1
 1	temeljena	67	
 1	zemnog	105	
 1	složenija	10	
--0.921485059703493	zadnjim	2	50
-0.922652909365201	polovica	73	3
+-0.921485106630878	zadnjim	2	50
+0.922652863107701	polovica	73	3
 1	postrojbe	204	
 1	znanstvenici	134	
-1	izvješću	1161	
--0.928367133811719	neophodna	2	55
+1	izvješću	1206	
+-0.92836717677912	neophodna	2	55
 1	oess-a	185	
 1	priopćenje	189	
 1	značajke	13	
 1	ubojstva	259	
 1	njezinim	115	
-1	glazbe	104	
-0.891104174737053	sljedećem	51	3
+1	glazbe	105	
+0.891104110680441	sljedećem	51	3
 1	veleposlanstvu	21	
 1	ožujak	131	
 1	pljenidbu	10	
--0.830048423436161	assadovog	1	11
+-0.830048520180336	assadovog	1	11
 1	podrijetlo	23	
-0.894050663452574	poticaja	35	2
+0.894050601032093	poticaja	35	2
 1	južnoafričke	12	
 1	utemeljiti	17	
 1	ozljedama	11	
--0.843099996358911	djelotvornosti	1	12
+-0.843100086310512	djelotvornosti	1	12
 1	gospodarska	244	
--0.914948014961509	strahote	1	23
--0.900442185954277	generalnom	2	39
--0.9583228178267	nabavku	1	48
+-0.91494806562306	strahote	1	23
+-0.900442244807067	generalnom	2	39
+-0.958322843214184	nabavku	1	48
 1	izvješćima	249	
-1	veljače	1097	
+1	veljače	1102	
 1	dužnosniku	14	
-0.821673741820662	golem	10	1
--0.863989999047302	glasačka	2	28
--0.907223579184279	pravični	1	21
+0.821673640773722	golem	10	1
+-0.863990077906338	glasačka	2	28
+-0.91494806562306	pravični	1	23
 1	proeuropska	13	
 1	financijskih	191	
-0.990448633594843	izravno	204	1
--0.854289923116163	karakteriziraju	1	13
+0.990448627681219	izravno	204	1
+-0.854290007159691	karakteriziraju	1	13
 1	kreti	16	
 1	priopćen	12	
--0.954619951835236	podrže	1	44
+-0.954619979426037	podrže	1	44
 1	akcesiji	31	
-0.854748782851521	dvorac	25	2
-0.826928408062134	izvanredno	31	3
+0.854748699051882	dvorac	25	2
+0.826928309709818	izvanredno	31	3
 1	liječnicima	18	
-0.954446761962161	obnaša	42	1
--0.843099996358911	sahranjen	1	12
-0.934693141553614	učinke	29	1
+0.954446734268499	obnaša	42	1
+-0.843100086310512	sahranjen	1	12
+0.9346931022522	učinke	29	1
 1	navijestio	71	
-0.934693141553614	dvojbe	29	1
+0.936801210228771	dvojbe	30	1
 1	navješćuju	23	
-0.891104174737053	sprovodu	17	1
-0.869294351409612	kontaktne	14	1
--0.975390969018747	uhapšeno	1	82
-0.810145297617624	nadmeću	28	3
-0.869294351409612	okrivljava	14	1
+0.891104110680441	sprovodu	17	1
+0.869294275410358	kontaktne	14	1
+-0.975390984139889	uhapšeno	1	82
+0.810145190719012	nadmeću	28	3
+0.869294275410358	okrivljava	14	1
 1	nakane	29	
-1	priključenja	115	
-0.896841901522041	pothvatu	18	1
--0.80884728905423	gubici	3	29
--0.94895049878999	reorganizaciju	1	39
-1	sportašima	10	
-0.8131649282125	zakladi	19	2
+1	priključenja	117	
+0.896841840656464	pothvatu	18	1
+-0.80884739660646	gubici	3	29
+-0.948950529737759	reorganizaciju	1	39
+1	sportašima	11	
+0.813164822838618	zakladi	19	2
 1	prometnih	22	
 1	prosvjeda	348	
 1	temeljeni	21	
-0.873516406185737	položaje	29	2
-1	utemeljenja	26	
+0.873516332475301	položaje	29	2
+0.930024808444208	utemeljenja	27	1
 1	vjerovnicima	11	
 1	prijamu	129	
-0.803803435314514	ugljen	18	2
+0.803803325232117	ugljen	18	2
 1	dovršetka	12	
 1	europskom	302	
-0.836560421268163	oznakama	11	1
+0.836560327899804	oznakama	11	1
 1	imf-a	16	
-0.821673741820662	umirovljene	10	1
--0.830048423436161	omladinska	1	11
-0.969398844353542	suradnik	63	1
-0.869294351409612	prisilili	14	1
+0.821673640773722	umirovljene	10	1
+-0.830048520180336	omladinska	1	11
+0.969398825607534	suradnik	63	1
+0.869294275410358	prisilili	14	1
 1	uzajamnom	15	
 1	hendikepirane	13	
--0.848901850500422	rekonstrukciju	2	25
+-0.848901937398475	rekonstrukciju	2	25
 1	gospodarstvima	50	
-0.884690582974605	koštuničina	16	1
-0.994613638813621	suradnje	725	2
--0.914948014961509	izvještavanja	1	23
--0.843099996358911	odsluženju	1	12
-0.993823473336293	gospodarske	632	2
--0.863989999047302	podvukavši	1	14
-0.952223509406485	skopja	80	2
+0.884690515375318	koštuničina	16	1
+0.994613635471735	suradnje	725	2
+-0.91494806562306	izvještavanja	1	23
+-0.843100086310512	odsluženju	1	12
+0.99382346950568	gospodarske	632	2
+-0.863990077906338	podvukavši	1	14
+0.952799393718455	skopja	81	2
 1	dvojaka	14	
-0.938777511505321	udruzi	31	1
+0.940633888764989	udruzi	32	1
 1	podtajnik	28	
--0.830048423436161	pogledima	1	11
+-0.830048520180336	pogledima	1	11
 1	blagdanske	13	
 1	akcesija	13	
 1	glazbeni	49	
 1	preustrojiti	10	
 1	iznimnu	16	
-0.920019789778311	školstvo	47	2
--0.886631238745463	napravile	1	17
-0.982116842061835	uvjetima	217	2
-0.849153121568153	strojevi	12	1
+0.920019742011579	školstvo	47	2
+-0.886631305275454	napravile	1	17
+0.973535298061346	uvjetima	219	3
+0.849153034802772	strojevi	12	1
 1	pridonose	41	
--0.854289923116163	pravičnu	1	13
-1	promatrača	79	
-1	sučeljava	285	
+-0.854290007159691	pravičnu	1	13
+1	sučeljava	287	
+1	promatrača	80	
 1	oporbenim	16	
-0.896841901522041	složenim	18	1
--0.872479195619975	gradić	1	15
+0.896841840656464	složenim	18	1
+-0.87247926989368	gradić	1	15
 1	oporbenog	27	
--0.914948014961509	zapazivši	1	23
+-0.91494806562306	zapazivši	1	23
 1	nogometa	40	
--0.901244235615569	novinsku	3	59
-0.836560421268163	utaja	11	1
-0.85994415115505	interventne	13	1
-0.993032115164804	demokracije	280	1
+-0.901244294018872	novinsku	3	59
+0.836560327899804	utaja	11	1
+0.859944070126432	interventne	13	1
+0.993032110845114	demokracije	280	1
 1	čimbenici	58	
--0.81462847492659	povjerilaca	1	10
+-0.814628579559378	povjerilaca	1	10
 1	nadvladavanje	58	
--0.81462847492659	prisjetiti	1	10
+-0.814628579559378	prisjetiti	1	10