
Issue #3 resolved

Limit languages

dbrgn
created an issue

Is it possible to limit the possible languages, so that false results can be reduced?

In my case, I'd like to limit the language detection to German, French, Italian and English. I hope that would improve the quality of the guesses.

Comments (9)

  1. dbrgn reporter

    Yes:

    ipdb> title
    u'Gem\xe4lde "Lady Diana"'
    ipdb> description
    u'Original Acryl-Gem\xe4lde 60 x 80cm auf Leinwand, gerahmt'
    ipdb> guess_language(title)
    u'UNKNOWN'
    ipdb> guess_language(description)
    u'de'
    ipdb> guess_language(description + ' ' + title)
    u'ro'
    

    Maybe it would guess correctly if Romanian weren't among the candidate languages.

  2. spirit repo owner

    I’ve just committed a workaround for the German dictionary not being used properly by PyEnchant: 1a3d542

    You can improve detection for short texts by installing Hunspell dictionaries. For your purpose, that means the dictionaries for German, French, Italian and English.

    Even Google can’t do it any better.

    That said, limiting possible languages still sounds like a useful feature. I’ll address that in another commit.

  3. spirit repo owner

    Add hints argument (list of language codes) to limit the possible languages (closes #3).

    This does not guarantee that the returned language code will be one of the hints. For instance, Japanese detection is still script-based.

    → <<cset 7de527659e8d>>
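Since the commit note says the result is not guaranteed to be one of the hints, a caller who needs a strict whitelist may want to filter the result afterwards. The following is only a sketch: `detect_within` and `stub_detect` are hypothetical names, and the detector is passed in as a callable so nothing is assumed about the library's exact signature. With the library from this thread, the callable would presumably wrap something like `guess_language(text, hints=...)`, but that call shape is inferred from the commit message, not verified here.

```python
from typing import Callable, Iterable, Optional, Sequence


def detect_within(
    text: str,
    detect: Callable[[str, Sequence[str]], str],
    allowed: Iterable[str],
    fallback: Optional[str] = None,
) -> Optional[str]:
    """Run a hint-aware detector and reject any answer outside `allowed`.

    `detect` is any callable taking (text, hints) and returning a language
    code. It is an assumption of this sketch, not the library's API.
    """
    allowed_codes = list(allowed)
    guess = detect(text, allowed_codes)
    # The hints only narrow the search; the detector may still return
    # something else (for example a script-based guess or 'UNKNOWN').
    return guess if guess in allowed_codes else fallback


def stub_detect(text: str, hints: Sequence[str]) -> str:
    """Hypothetical stand-in detector that ignores its hints entirely."""
    return "ro"


if __name__ == "__main__":
    wanted = ["de", "fr", "it", "en"]
    # The stub answers 'ro', which is not allowed, so the fallback is used.
    print(detect_within("Original Acryl-Gemälde", stub_detect, wanted))
```

Returning a fallback instead of raising keeps the call site simple; whether `None`, `'UNKNOWN'`, or a default language is the right fallback depends on the application.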

  4. dbrgn reporter

    So to improve the detection I have to install hunspell and the dictionaries? Or only the dictionaries? What directory should the dictionaries go into? Does it have to be in PATH or somewhere else? (I'm trying to set this up for Heroku, so I can't just install hunspell via package manager...)

  5. spirit repo owner

    Yes, hints is an iterable of ISO 639-1 language codes.

    To improve the detection, you'd need to install PyEnchant and some Hunspell dictionaries. Those are usually installed via a package manager.

    I don't know anything about Heroku, but if you have no package manager, maybe you can try to manually install PyEnchant and any required libraries. As for Hunspell dictionaries, they're usually located in /usr/share/hunspell/.
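To check which dictionaries are actually present on a host, one option is to look at the directory directly. This is a rough sketch using only the standard library. It assumes the usual Hunspell layout, where each dictionary is a pair of `.dic` and `.aff` files named after a locale such as `de_DE`. The function names are hypothetical, and the default path is the one mentioned above, which may differ on other systems.

```python
from pathlib import Path
from typing import Iterable, List


def installed_hunspell_dictionaries(directory: str = "/usr/share/hunspell") -> List[str]:
    """Return sorted dictionary names that have both a .dic and an .aff file."""
    root = Path(directory)
    if not root.is_dir():
        return []
    dic_names = {p.stem for p in root.glob("*.dic")}
    aff_names = {p.stem for p in root.glob("*.aff")}
    # A usable Hunspell dictionary needs both files.
    return sorted(dic_names & aff_names)


def missing_languages(wanted: Iterable[str], installed: Iterable[str]) -> List[str]:
    """Return the wanted ISO 639-1 codes that no installed dictionary covers.

    Assumes dictionary names start with the language code followed by an
    underscore, e.g. 'de_DE' or 'en_US'.
    """
    covered = {name.split("_")[0].lower() for name in installed}
    return [code for code in wanted if code.lower() not in covered]


if __name__ == "__main__":
    found = installed_hunspell_dictionaries()
    print("installed:", found)
    print("missing:", missing_languages(["de", "fr", "it", "en"], found))
```

This only confirms that the files exist. Whether PyEnchant picks them up also depends on how Enchant itself is configured on the machine.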
