Pensieve sometimes returns only an entry with codes when queried for TextFragment without codes

Issue #1047 resolved
Kuro Kurosaka (BH Lab) created an issue

When two similar entries like the ones below exist in the TMX file:

    <tu>
      <tuv xml:lang="EN-US">
        <seg><bpt i="1" x="1"/>network<ept i="1"/></seg>
      </tuv>
      <tuv xml:lang="FR-FR">
        <seg><bpt i="1" x="1"/>network<ept i="1"/></seg>
      </tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN-US">
        <seg>network</seg>
      </tuv>
      <tuv xml:lang="FR-FR">
        <seg>network</seg>
      </tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN-US">
        <seg><bpt i="1" x="1"/>network<ept i="1"/></seg>
      </tuv>
      <tuv xml:lang="FR-FR">
        <seg><bpt i="1" x="1"/>network<ept i="1"/></seg>
      </tuv>
    </tu>

net.sf.okapi.tm.pensieve.seeker.PensieveSeeker#searchFuzzy(new TextFragment("network") /* no codes */, 95, 5, null)
returns just one hit, which represents the entry with codes. Since this is a fuzzy match, it is correct to return the entry with codes, but there should also be a hit that represents the entry without codes, with a higher score.

The following unit test case can be added to PensieveSeekerTest.java to demonstrate this:

    @Test
    public void searchResultsShouldntFavorSourceWithTextWhenQueriedPlainText() throws Exception {
        // When queried for a plain text (no inline code) Pensieve sometimes favors the TM entry with codes.
        // It shouldn't.
        PensieveWriter writer = getWriter();

        TextFragment tfWoCode = new TextFragment("network");
        TextFragment tfWithCodes = new TextFragment();
        tfWithCodes.append(TagType.OPENING, "Xpt", "");
        tfWithCodes.append("network");
        tfWithCodes.append(TagType.CLOSING, "Xpt", "");

        writer.indexTranslationUnit(new TranslationUnit(
                new TranslationUnitVariant(LocaleId.fromString("EN"), tfWithCodes),
                new TranslationUnitVariant(LocaleId.fromString("FR"), tfWithCodes)));
        writer.indexTranslationUnit(new TranslationUnit(
                new TranslationUnitVariant(LocaleId.fromString("EN"), tfWoCode),
                new TranslationUnitVariant(LocaleId.fromString("FR"), tfWoCode)));

        writer.close();

        tmhits = seeker.searchFuzzy(new TextFragment("network"), 95, 5, null);
        assertEquals("number of docs found", 2, tmhits.size());
        assertEquals("1st match should not include codes", 0, tmhits.get(0).getTu().getSource().getContent().getCodes().size());
        assertEquals("2nd match should include a pair of codes", 2, tmhits.get(1).getTu().getSource().getContent().getCodes().size());
    }

Comments (2)

  1. Kuro Kurosaka (BH Lab) reporter

    At the end of PensieveSeeker#getTopHits(Query, Metadata), it calls:

            ArrayList<TmHit> noDups = new ArrayList<>(new LinkedHashSet<>(tmHitCandidates));
    

    to remove duplicates. This relies on TmHit#equals(Object) to do the right thing. Unfortunately, the current implementation does not distinguish the plain-text entry from the entry with codes when the codes' data is an empty string, because it compares only the toText() output of the source and target. It treats the two as the same entry, and depending on the order in which the entries appear and how the sort method works, the entry with codes can be the one that is kept.

        @Override
        public boolean equals(Object other) {
            if (this == other)
                return true;
            if (!(other instanceof TmHit))
                return false;
    
            TmHit otherHit = (TmHit) other;
            return (this.matchType == otherHit.getMatchType())
                    && (this.tu.getSource().getContent().toText().equals(otherHit
                            .getTu().getSource().getContent().toText()))
                    && (this.tu.getTarget().getContent().toText().equals(otherHit
                            .getTu().getTarget().getContent().toText()));
        }
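
The effect described above can be reproduced without Okapi. The following is only a minimal sketch: `LooseHit`, `StrictHit`, `codeCount`, and `dedupe` are hypothetical stand-ins, not Okapi classes or methods, and `StrictHit` is just one possible way to tell the entries apart, not necessarily how the issue was resolved. It relies on the fact that `LinkedHashSet` keeps the first of two equal elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;

public class DedupSketch {

    /** Hypothetical stand-in for TmHit: equality looks at the text only. */
    static final class LooseHit {
        final String text;
        final int codeCount; // number of inline codes; ignored by equals()

        LooseHit(String text, int codeCount) {
            this.text = text;
            this.codeCount = codeCount;
        }

        @Override
        public boolean equals(Object other) {
            if (this == other) return true;
            if (!(other instanceof LooseHit)) return false;
            return text.equals(((LooseHit) other).text);
        }

        @Override
        public int hashCode() {
            return text.hashCode();
        }
    }

    /** Hypothetical variant: equality also takes the inline codes into account. */
    static final class StrictHit {
        final String text;
        final int codeCount;

        StrictHit(String text, int codeCount) {
            this.text = text;
            this.codeCount = codeCount;
        }

        @Override
        public boolean equals(Object other) {
            if (this == other) return true;
            if (!(other instanceof StrictHit)) return false;
            StrictHit o = (StrictHit) other;
            return text.equals(o.text) && codeCount == o.codeCount;
        }

        @Override
        public int hashCode() {
            return Objects.hash(text, codeCount);
        }
    }

    /** Same idiom as in getTopHits: the first of two equal elements survives. */
    static <T> List<T> dedupe(List<T> hits) {
        return new ArrayList<>(new LinkedHashSet<>(hits));
    }

    public static void main(String[] args) {
        // The entry with codes comes first, as in the TMX example.
        List<LooseHit> loose = dedupe(Arrays.asList(
                new LooseHit("network", 2), new LooseHit("network", 0)));
        System.out.println(loose.size());           // prints 1
        System.out.println(loose.get(0).codeCount); // prints 2: the plain entry was dropped

        List<StrictHit> strict = dedupe(Arrays.asList(
                new StrictHit("network", 2), new StrictHit("network", 0)));
        System.out.println(strict.size());          // prints 2
    }
}
```

With the text-only equality, which entry survives depends purely on insertion order, which matches the order-dependent behavior reported here.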
    
