Source

uchardet-enhanced / langstats /

Filename Size Date modified Message
..
french
german
greek
hungarian
swedish
turkish
9.4 KB
1.1 KB
8.6 KB
1.5 KB
3.4 KB
Programs and data to determine the bigrams frequencies for extending
mozilla libcharsetdetect to other languages (for the "Two-Char Sequence
Distribution Method")

Steps:
 - Choose langage charset pair (ie: french/cp1252)

 - Assemble a big chunk of text in the appropriate language and charset
   (fetch from ebooks, wikipedia, whatever, use iconv as needed)

 - Produce character frequency table by running charstats on the chunk, as:
   mkcharstats french/french_cp1252.txt | sort -nr +2 > \
         french/charstats_french_cp1252.txt

 - Edit the resulting file, get rid of punctuation and numbers keep the rest

 - Run mkpairmodel.py to produce the c++ language model. There are two
   phases, to produce a correspondance table from code point to order in
   frequency list, then a 64x64 table listing the pair frequencies for the
   64 most common characters:
   
   mkpairmodel.py french/charstats_french_cp1252.txt \
                  french/french_cp1252.txt             > LangFrenchModel.cpp

 - Add header, license etc. to cpp file and integrate with the rest of the
   models