Pensieve sometimes returns only an entry with codes when queried for a TextFragment without codes
Issue #1047
resolved
When two similar entries like the ones below exist in the TMX file:
<tu>
<tuv xml:lang="EN-US">
<seg><bpt i="1" x="1"/>network<ept i="1"/></seg>
</tuv>
<tuv xml:lang="FR-FR">
<seg><bpt i="1" x="1"/>network<ept i="1"/></seg>
</tuv>
</tu>
<tu>
<tuv xml:lang="EN-US">
<seg>network</seg>
</tuv>
<tuv xml:lang="FR-FR">
<seg>network</seg>
</tuv>
</tu>
<tu>
<tuv xml:lang="EN-US">
<seg><bpt i="1" x="1"/>network<ept i="1"/></seg>
</tuv>
<tuv xml:lang="FR-FR">
<seg><bpt i="1" x="1"/>network<ept i="1"/></seg>
</tuv>
</tu>
net.sf.okapi.tm.pensieve.seeker.PensieveSeeker#searchFuzzy(new TextFragment("network") /* no codes */, 95, 5, null)
returns just one hit, which represents the entry with codes. Since this is a fuzzy match, returning the entry with codes is correct, but there should also be a hit representing the entry without codes, and it should have the higher score.
The following unit test case can be added to PensieveSeekerTest.java to demonstrate this:
@Test
public void searchResultsShouldntFavorSourceWithTextWhenQueriedPlainText() throws Exception {
    // When queried for plain text (no inline codes), Pensieve sometimes favors the TM entry with codes.
    // It shouldn't.
    PensieveWriter writer = getWriter();

    TextFragment tfWoCode = new TextFragment("network");
    TextFragment tfWithCodes = new TextFragment();
    tfWithCodes.append(TagType.OPENING, "Xpt", "");
    tfWithCodes.append("network");
    tfWithCodes.append(TagType.CLOSING, "Xpt", "");

    // Index the entry with codes first, then the plain-text entry.
    writer.indexTranslationUnit(new TranslationUnit(
            new TranslationUnitVariant(LocaleId.fromString("EN"), tfWithCodes),
            new TranslationUnitVariant(LocaleId.fromString("FR"), tfWithCodes)));
    writer.indexTranslationUnit(new TranslationUnit(
            new TranslationUnitVariant(LocaleId.fromString("EN"), tfWoCode),
            new TranslationUnitVariant(LocaleId.fromString("FR"), tfWoCode)));
    writer.close();

    tmhits = seeker.searchFuzzy(new TextFragment("network"), 95, 5, null);
    assertEquals("number of docs found", 2, tmhits.size());
    assertEquals("1st match should not include codes", 0, tmhits.get(0).getTu().getSource().getContent().getCodes().size());
    assertEquals("2nd match should include a pair of codes", 2, tmhits.get(1).getTu().getSource().getContent().getCodes().size());
}
Comments (2)
reporter -
reporter - changed status to resolved
Fixed in pull request #515.
At the end of PensieveSeeker#getTopHits(Query, Metadata), it calls:
to remove duplicates. This relies on TmHit#equals(Object) to do the right thing. Unfortunately, the current implementation does not distinguish the plain-text entry from the entry with codes when the codes' data is an empty string. It treats the two as the same entry, and depending on the order in which the entries appear and how the sort method works, it can favor the entry with codes.
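The effect described above can be reproduced outside Okapi with a small, self-contained sketch. Everything in it is hypothetical: the Hit class, its key() helper, and the compareCodeCount switch are illustration-only stand-ins and are not the actual TmHit implementation or the fix from the pull request. It assumes an equality check built only from the text and the codes' data, so that codes with empty data contribute nothing to the comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;

public class DedupSketch {

    // Hypothetical stand-in for a TM hit: plain text plus the data of each inline code.
    static class Hit {
        final String text;
        final List<String> codeData;
        final boolean compareCodeCount; // switches between the two equality strategies

        Hit(String text, List<String> codeData, boolean compareCodeCount) {
            this.text = text;
            this.codeData = codeData;
            this.compareCodeCount = compareCodeCount;
        }

        // Key used for equality. Without the code count, codes whose data is ""
        // add nothing to the key, so the hit looks identical to a plain-text hit.
        private String key() {
            String joined = text + "|" + String.join("", codeData);
            return compareCodeCount ? joined + "|" + codeData.size() : joined;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Hit && key().equals(((Hit) o).key());
        }

        @Override
        public int hashCode() {
            return key().hashCode();
        }
    }

    // Removes duplicates while keeping first-seen order, relying on equals/hashCode.
    static List<Hit> dedupe(List<Hit> hits) {
        return new ArrayList<>(new LinkedHashSet<>(hits));
    }

    // Builds the two "network" hits from the report and returns how many survive.
    static int survivors(boolean compareCodeCount) {
        List<Hit> hits = new ArrayList<>();
        // Entry with an opening and a closing code, both with empty data.
        hits.add(new Hit("network", Arrays.asList("", ""), compareCodeCount));
        // Plain-text entry without codes.
        hits.add(new Hit("network", Collections.<String>emptyList(), compareCodeCount));
        return dedupe(hits).size();
    }

    public static void main(String[] args) {
        System.out.println("data-only equality: " + survivors(false) + " hit(s)");
        System.out.println("with code count:    " + survivors(true) + " hit(s)");
    }
}
```

With the data-only comparison the two hits collapse into one, and which one survives depends on their order in the list. Including any property that differs between the two entries (here, the number of codes) keeps both.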