Pensieve sometimes returns only the entry with codes when queried for a TextFragment without codes

When the TMX file contains similar entries such as the following, some with inline codes and one without:

    <tu>
      <tuv xml:lang="EN-US">
        <seg><bpt i="1" x="1"/>network<ept i="1"/></seg>
      </tuv>
      <tuv xml:lang="FR-FR">
        <seg><bpt i="1" x="1"/>network<ept i="1"/></seg>
      </tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN-US">
        <seg>network</seg>
      </tuv>
      <tuv xml:lang="FR-FR">
        <seg>network</seg>
      </tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN-US">
        <seg><bpt i="1" x="1"/>network<ept i="1"/></seg>
      </tuv>
      <tuv xml:lang="FR-FR">
        <seg><bpt i="1" x="1"/>network<ept i="1"/></seg>
      </tuv>
    </tu>

Calling net.sf.okapi.tm.pensieve.seeker.PensieveSeeker#searchFuzzy(new TextFragment("network") /* no codes */, 95, 5, null)
returns just one hit, which represents the entry with codes. Because this is a fuzzy match, returning the entry with codes is correct, but there should also be a hit representing the entry without codes, with a higher score.
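
To make the expected ordering concrete, here is a minimal, self-contained sketch. It does not use Pensieve or reproduce its scoring: the `Entry` and `Hit` classes, the `score` formula, and the one-point `CODE_MISMATCH_PENALTY` are all hypothetical, chosen only to show a plain-text query matching both entries, with the code-free entry ranked first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ExpectedRankingSketch {

    /** Hypothetical stand-in for a TM entry: source text plus its number of inline codes. */
    static final class Entry {
        final String text;
        final int codeCount;

        Entry(String text, int codeCount) {
            this.text = text;
            this.codeCount = codeCount;
        }
    }

    /** Hypothetical stand-in for a search hit: the matched entry and its score. */
    static final class Hit {
        final Entry entry;
        final float score;

        Hit(Entry entry, float score) {
            this.entry = entry;
            this.score = score;
        }
    }

    // Assumed penalty, for illustration only; not Pensieve's actual value.
    static final float CODE_MISMATCH_PENALTY = 1.0f;

    /** Scores 100 for identical text, minus a penalty when the code counts differ. */
    static float score(String queryText, int queryCodes, Entry entry) {
        float textScore = queryText.equals(entry.text) ? 100.0f : 0.0f;
        float penalty = (queryCodes == entry.codeCount) ? 0.0f : CODE_MISMATCH_PENALTY;
        return Math.max(0.0f, textScore - penalty);
    }

    /** Returns up to maxHits entries scoring at or above threshold, best score first. */
    static List<Hit> search(String queryText, int queryCodes, int threshold,
                            int maxHits, List<Entry> tm) {
        List<Hit> hits = new ArrayList<>();
        for (Entry entry : tm) {
            float s = score(queryText, queryCodes, entry);
            if (s >= threshold) {
                hits.add(new Hit(entry, s));
            }
        }
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return hits.size() > maxHits ? new ArrayList<>(hits.subList(0, maxHits)) : hits;
    }

    public static void main(String[] args) {
        List<Entry> tm = new ArrayList<>();
        tm.add(new Entry("network", 2)); // entry wrapped in a pair of codes
        tm.add(new Entry("network", 0)); // plain entry

        // Query without codes, threshold 95, at most 5 hits.
        for (Hit h : search("network", 0, 95, 5, tm)) {
            System.out.println(h.score + " codes=" + h.entry.codeCount);
        }
    }
}
```

Under these assumptions both entries clear the threshold of 95, and the entry without codes is listed first because it avoids the mismatch penalty.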

The following unit test case can be added to PensieveSeekerTest.java to demonstrate this:

    @Test
    public void searchResultsShouldntFavorSourceWithCodesWhenQueriedPlainText() throws Exception {
        // When queried with plain text (no inline codes), Pensieve sometimes favors the TM entry with codes.
        // It shouldn't: the entry without codes should rank first.
        PensieveWriter writer = getWriter();

        TextFragment tfWoCode = new TextFragment("network");
        TextFragment tfWithCodes = new TextFragment();
        tfWithCodes.append(TagType.OPENING, "Xpt", "");
        tfWithCodes.append("network");
        tfWithCodes.append(TagType.CLOSING, "Xpt", "");

        writer.indexTranslationUnit(new TranslationUnit(
                new TranslationUnitVariant(LocaleId.fromString("EN"), tfWithCodes),
                new TranslationUnitVariant(LocaleId.fromString("FR"), tfWithCodes)));
        writer.indexTranslationUnit(new TranslationUnit(
                new TranslationUnitVariant(LocaleId.fromString("EN"), tfWoCode),
                new TranslationUnitVariant(LocaleId.fromString("FR"), tfWoCode)));

        writer.close();

        tmhits = seeker.searchFuzzy(new TextFragment("network"), 95, 5, null);
        assertEquals("number of docs found", 2, tmhits.size());
        assertEquals("1st match should not include codes", 0, tmhits.get(0).getTu().getSource().getContent().getCodes().size());
        assertEquals("2nd match should include a pair of codes", 2, tmhits.get(1).getTu().getSource().getContent().getCodes().size());
    }
