Overview:

This package implements improvements for the Mozilla universalchardet
module described in:
 http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

The interface uses the C wrapper described here:
  https://github.com/batterseapower/libcharsetdetect

Modifications and improvements affect the core universalchardet module, not
the C wrapper (which is nonetheless useful and necessary for integration and
testing).


Directory contents:

testdata/
 - Wikipedia index pages in target languages, sometimes in multiple
   encodings. The pages were manually stripped of English and boilerplate
   content, in the hope that the remaining text is significant and typical.

 - Used to check how the detection works.

langstats/
 - Data and code used to produce the bigram frequencies for a
   language/encoding pair, used for the "Two char Distribution Method"
   from the reference article (neither the article nor the Mozilla module
   publishes the scripts used to generate the tables, or the reference data).
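
   As a rough illustration of what "bigram frequencies for a
   language/encoding pair" means, here is a minimal sketch in Python. It is
   not the code in langstats/ and does not reproduce the table format used
   by universalchardet; the function name bigram_frequencies and the choice
   to count raw adjacent byte pairs are assumptions made for this example,
   and it only makes sense for single-byte encodings.

```python
from collections import Counter


def bigram_frequencies(text, encoding):
    """Relative frequency of each adjacent byte pair of `text` in `encoding`.

    Hypothetical helper for illustration only: it assumes a single-byte
    encoding and silently drops characters the encoding cannot represent.
    """
    data = text.encode(encoding, errors="ignore")
    # zip(data, data[1:]) yields every pair of neighbouring byte values.
    counts = Counter(zip(data, data[1:]))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


if __name__ == "__main__":
    # Print the most frequent byte pairs of a short Russian sample
    # as encoded in windows-1251.
    freqs = bigram_frequencies("да да нет", "windows-1251")
    for pair, freq in sorted(freqs.items(), key=lambda kv: -kv[1])[:3]:
        print(pair, round(freq, 3))
```

   Running the same text through a different encoding gives a different
   table of byte pairs, which is why the statistics are kept per
   language/encoding pair rather than per language alone.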


libcharsetdetect/
 - The C API from the reference above, with the modified Mozilla code
   inside libcharsetdetect/mozilla/extensions/universalchardet/src/base/
