Substring match in Inconsistency Check Step

Issue #365 new
Former user created an issue

Original issue 365 created by polytrans2... on 2013-09-14T04:27:57.000Z:

Hi Wei,

BTW, any plan to extend this check to include partial matches?
That is, if the source text of a TU is found within the source
text of another TU, then presumably the target of the former
TU should also be found within the target of the latter TU.
I am sure that this is true for most language pairs.

Possibly, as an option.
However, this would have a major impact on the implementation: we use hashmaps with the text to compare as the key,
and partial matching would require a very different approach.
We'll have to look at that.

You should create a new request for enhancement on the project page (https://code.google.com/p/okapi/issues/list) if you want to be
sure we don't forget about this.

Cheers,
-yves
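
To illustrate the difference discussed in the exchange above, here is a minimal sketch in Python. It is not the Okapi implementation: the TU representation (plain source/target string pairs), the function names, and the sample data are all assumptions made for illustration. The exact check groups targets by identical source text with a hash map, while the partial check has to compare every pair of TUs, which is why it needs a different approach.

```python
from collections import defaultdict

# Hypothetical TU representation: (source, target) pairs of plain strings.


def exact_inconsistencies(tus):
    """Group targets by identical source text, using a dict as the hash map."""
    targets_by_source = defaultdict(set)
    for source, target in tus:
        targets_by_source[source].add(target)
    # A source with more than one distinct target is inconsistent.
    return {s: t for s, t in targets_by_source.items() if len(t) > 1}


def partial_inconsistencies(tus):
    """Flag pairs where one source contains another but the targets do not."""
    flagged = []
    for i, (short_src, short_tgt) in enumerate(tus):
        for j, (long_src, long_tgt) in enumerate(tus):
            # Skip self-comparison and identical sources (the exact check covers those).
            if i == j or short_src == long_src:
                continue
            if short_src in long_src and short_tgt not in long_tgt:
                flagged.append((short_src, long_src))
    return flagged


# Illustrative data only.
tus = [
    ("Save", "Enregistrer"),
    ("Save", "Sauvegarder"),
    ("Save file", "Sauvegarder le fichier"),
]

print(exact_inconsistencies(tus))
print(partial_inconsistencies(tus))
```

The exact check is a single pass over the TUs, whereas this naive partial check is quadratic in the number of TUs; a real implementation would likely need some form of indexing to scale.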

Comments (3)

  1. Mihai Nita

    This is not true due to inflected forms, gender, number, case, etc. For instance, "new" in Spanish becomes nuevo / nueva / nuevos / nuevas (all the combinations of singular / plural and masculine / feminine).

    Then we have languages that join words, like German: "house cleaning" => "Wohnungsreinigung", "glass surface cleaning" => "Glasflächenreinigung". So "cleaning" ("reinigung") is joined with other words and becomes very hard to recognize.

    This is true for a lot of languages (all Romance languages, all Slavic, Finnish, Germanic).

    The longer the source fragment, the more reliable the check, so it might work for longer fragments (how long?). Another problem: translating from other languages (think German) into English.

    This starts to look a lot like what Machine Translation does "in the belly" :-)

    This looks a lot like term-extraction +
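
The objections in this comment can be reproduced with a small sketch. This is only an illustration under stated assumptions: the helper `target_contained`, its `min_chars` parameter, and the sample TUs are hypothetical and not part of Okapi, and the comment itself leaves open how long a fragment would need to be. The sketch shows a naive substring check raising false alarms on inflection and compounding, and a minimum source length that simply skips short fragments.

```python
def target_contained(short_tu, long_tu, min_chars=0):
    """Return None if the check does not apply, otherwise whether the
    short TU's target appears verbatim in the long TU's target."""
    short_src, short_tgt = short_tu
    long_src, long_tgt = long_tu
    # Skip fragments below the (hypothetical) length threshold,
    # and pairs where the short source is not part of the long source.
    if len(short_src) < min_chars or short_src not in long_src:
        return None
    return short_tgt in long_tgt


# Inflection: "nuevo" does not appear in "casas nuevas".
new = ("new", "nuevo")
new_houses = ("new houses", "casas nuevas")

# Compounding: "Reinigung" is lowercased and fused inside the compound.
cleaning = ("cleaning", "Reinigung")
house_cleaning = ("house cleaning", "Wohnungsreinigung")

print(target_contained(new, new_houses))            # False: a false alarm
print(target_contained(cleaning, house_cleaning))   # False: a false alarm
print(target_contained(new, new_houses, min_chars=10))  # None: check skipped
```

A length threshold only hides the short cases; it does not make the substring test aware of inflection or compounding.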
