Updating Lucene version breaks okapi-tm-pensieve

Issue #837 new
Mihai Nita created an issue

Used by these artifacts:

  • org.apache.lucene:lucene-core

Current version used by Okapi: 3.3.0

Latest version (today) is 8.1.1

See latest version at https://mvnrepository.com/artifact/org.apache.lucene/lucene-core/

Comments (15)

  1. Kuro Kurosaka (BH Lab)

    @Mihai Nita , I thought I might be able to contribute to this issue because I worked on Lucene before. Could you tell me how to reproduce the issue? Is it a build error that you are concerned with, if we change the Lucene version to the latest?

  2. Chase Tingley

    Yes, update org.apache.lucene.version in superpom/pom.xml and then do a clean build, and things will start to break.
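For anyone reproducing this, the property looks roughly like the sketch below (the 3.3.0 value comes from the issue description; the exact surrounding XML in superpom/pom.xml may differ):

```xml
<properties>
  <!-- Bump this toward the latest 8.x release and run a clean build
       to reproduce the breakage described in this issue -->
  <org.apache.lucene.version>3.3.0</org.apache.lucene.version>
</properties>
```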

  3. Kuro Kurosaka (BH Lab)

    Does anyone know why Okapi defines its own N-gram analyzer/tokenizer, rather than using the builtin NGramTokenizer?

    In NgramAnalyzer.java, there is an array of n-grams that, as far as I can tell, are removed from the resulting tokens, like a stopword list. This doesn’t make much sense to me. Can anyone enlighten me?

    https://bitbucket.org/okapiframework/okapi/src/f58b7376e391c1a798c0bec144609abc8ecd31ee/okapi/libraries/lib-search/src/main/java/net/sf/okapi/lib/search/lucene/analysis/NgramAnalyzer.java#lines-38

    Even if this makes some sense, should the stopword technique be applied to a TM? We don’t really want TM to recognize “I am happy” and “I am not happy” (“not” being a stopword) as the same sentence and use the same translation, do we?

    And finally, the test TM I found at https://bitbucket.org/okapiframework/okapi/src/dev/okapi/tm/pensieve-integration-tests/src/test/resources/Paragraph_TM.tmx
    has more than one sentence per segment, like:

                    <seg><ph x="1">#![PG 0 1]</ph>1.<ph x="2">\t</ph>Confirm that the AC line is in the AC fluid detector.
                        <ph x="3">\n</ph>
                        <ph x="4">\n</ph>2.<ph x="5">\t</ph>Touch Retry.
                    </seg>
    

    Shouldn’t a seg have just one sentence to increase the chance of matching?
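For readers unfamiliar with the technique being debated above: a character n-gram tokenizer slides a fixed-size window over the text, which is also the kind of token Lucene’s builtin NGramTokenizer emits. A minimal stdlib sketch (illustrative only; this is neither Okapi nor Lucene code):

```java
import java.util.ArrayList;
import java.util.List;

public class NgramDemo {
    // Return all character n-grams of the given size, in order.
    static List<String> ngrams(String text, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("happy", 3)); // [hap, app, ppy]
    }
}
```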

  4. YvesS

    Hi Kuro,

    I cannot answer the question about why we have our own N-gram code. Maybe @Jim Hargrave (OLD) will be able to do that.

    I cannot really answer the stop-words question either, although I imagine it is meant to improve fuzziness. That said, stop-words have the distinct disadvantage of making some short segments impossible to find. Here again I expect @Jim Hargrave to have some information about it.

    As for segments containing more than one sentence: yes, ideally a segment is a sentence. But in the real world, segments are sometimes paragraphs; it depends on how the tool breaks them down. So it’s not unexpected for test data to have segments with two sentences.

    Cheers,
    -ys

  5. Kuro Kurosaka (BH Lab)

    @Martn Wunderl, I’m working on it (slowly). The goal is to upgrade to Lucene 8.8.x. I have a somewhat working version, but some unit test cases are failing due to a scoring issue: it doesn’t score 1.0 for an exact match. I am going to spend some time on it today.

    If you have a patch, would like to work with me, or want to take it over completely, please contact me.
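For context on the scoring issue mentioned above: since Lucene 8 the default similarity is BM25, whose raw scores are unbounded, so an exact match no longer naturally lands at 1.0. One common workaround (a sketch under that assumption, not the actual Pensieve code) is to normalize each hit’s score by the score of the query matched against itself:

```java
public class ScoreNormalizer {
    // rawScore:  score Lucene returned for a candidate segment.
    // selfScore: score of the query text matched against itself,
    //            i.e. the maximum achievable score for this query.
    static float normalize(float rawScore, float selfScore) {
        if (selfScore <= 0f) {
            return 0f;
        }
        // Clamp to 1.0 so rounding noise cannot push a match above exact.
        return Math.min(1.0f, rawScore / selfScore);
    }

    public static void main(String[] args) {
        System.out.println(normalize(4.0f, 4.0f)); // 1.0 (exact match)
        System.out.println(normalize(2.0f, 4.0f)); // 0.5 (partial match)
    }
}
```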

  6. Martn Wunderl

    @Kuro Kurosaka Thank you for the update, Kuro. Unfortunately, right now I probably won’t be able to work on this. I’ll let you know if or when I have capacity to join this effort.

  7. Kuro Kurosaka (BH Lab)

    Some unit test cases that use the Lucene index saved in the git repo have been moved to the new TmStepsIT test class in integration-test/pensieve.
