Okapi validator incorrectly recognizes double words in several Indic languages

Issue #594 resolved
Former user created an issue

Steps to reproduce:

  • Start CheckMate
  • Validate the attached file

The validation reports "ल ल" as a double word, although the two characters actually belong to two neighboring words.

Similar issues happen for several other languages:

  • Bengali (bn), e.g. the string "চালাবেন না!" is reported with the double word "ন ন"
  • Gujarati (gu), e.g. the string "લિંક કૉપિ" is reported with the double word "ક ક"
  • Kannada (kn), e.g. the string "ಒಳಬರುವ ವೀಡಿಯೊ ಕರೆ" is reported with the double word "ವ ವ"
  • Marathi (mr), e.g. the string "साइन इन करा" is reported with the double word "इन इन"
  • Nepali (ne), e.g. the string "ड्राइभबाट टिम" is reported with the double word "ट ट"
  • Punjabi (pa), e.g. the string "ਕਲਿੱਕ ਕੀਤੀ" is reported with the double word "ਕ ਕ"
  • Sinhalese (si), e.g. the string "ඔබ ඔබේ" is reported with the double word "ඔබ ඔබ"

Expected behavior: No double-word issue should be reported for these strings.

Comments (3)

  1. Mihai Nita

    Okapi incorrectly treats "ल" as a word on its own because the character next to it (e.g. ॉ) is a "Mark", not a "Letter", in Unicode.
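
The mechanism described in this comment can be illustrated with a small sketch. This is not Okapi's code (Okapi is written in Java, and its actual pattern is not shown in this report); the functions `tokenize` and `find_double_words` are hypothetical names. The sketch assumes that a word is a run of characters from a chosen set of Unicode general categories and that a double word is two identical words separated only by whitespace.

```python
import unicodedata


def tokenize(text, word_categories):
    """Return (start, end, word) for each run of characters whose Unicode
    general category begins with one of the letters in word_categories
    (for example "L" for Letter, "M" for Mark)."""
    tokens = []
    start = None
    for i, ch in enumerate(text):
        if unicodedata.category(ch)[0] in word_categories:
            if start is None:
                start = i
        elif start is not None:
            tokens.append((start, i, text[start:i]))
            start = None
    if start is not None:
        tokens.append((start, len(text), text[start:]))
    return tokens


def find_double_words(text, word_categories):
    """Return adjacent identical words that are separated only by whitespace."""
    tokens = tokenize(text, word_categories)
    doubles = []
    for (_, end1, word1), (start2, _, word2) in zip(tokens, tokens[1:]):
        gap = text[end1:start2]
        if word1 == word2 and gap.isspace():
            doubles.append(f"{word1} {word2}")
    return doubles


sample = "साइन इन करा"  # Marathi example from this report

# Letters only: the vowel sign in "साइन" ends the word early, so "इन" is
# seen twice in a row and a false positive is reported.
print(find_double_words(sample, "L"))   # ['इन इन']

# Letters and marks: "साइन" stays one word, so nothing is reported.
print(find_double_words(sample, "LM"))  # []
```

Treating marks as part of a word removes the false positive in this sketch while still catching genuine repetitions such as "the the". Whether the same change is the right fix for Okapi's validator is an assumption that would need to be checked against its source.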
