- edited description
Updating Lucene version breaks okapi-tm-pensieve
Used by these artifacts:
- org.apache.lucene:lucene-core
Current version used by Okapi: 3.3.0
Latest version (today) is 8.1.1
See latest version at https://mvnrepository.com/artifact/org.apache.lucene/lucene-core/
Comments (16)
-
reporter - changed title to Updating Lucene version breaks okapi-tm-pensieve
-
reporter - marked as minor
-
@Mihai Nita, I thought I might be able to contribute to this issue because I have worked on Lucene before. Could you tell me how to reproduce it? Is it a build error you are concerned with if we change the Lucene version to the latest?
-
Yes, update org.apache.lucene.version in superpom/pom.xml and then do a clean build, and things will start to break. -
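Assuming the property sits in the superpom’s standard Maven &lt;properties&gt; block (the usual convention; the exact layout of Okapi’s superpom may differ), the change would look something like this:

```xml
<!-- superpom/pom.xml (sketch): bumping this property from 3.3.0 to a
     4.x-or-later release triggers the API breakage discussed in this issue -->
<properties>
  <org.apache.lucene.version>8.1.1</org.apache.lucene.version>
</properties>
```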
It looks like there was a major API overhaul at Lucene 4.0. I’ll try to follow this migration guide: https://lucene.apache.org/core/4_0_0/MIGRATE.html
Can I assume the Pensieve TM uses Lucene index as its storage of translation pairs?
-
Yes, I believe Pensieve does store the entries directly in the index.
-
Does anyone know why Okapi defines its own N-gram analyzer/tokenizer, rather than using the builtin NGramTokenizer?
In NgramAnalyzer.java, there is an array of n-grams that appear to be removed from the resulting tokens. This doesn’t make much sense to me. Can anyone enlighten me?
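For readers unfamiliar with the technique: a character n-gram tokenizer (the idea behind Lucene’s builtin NGramTokenizer) slides a fixed-size window over the text and emits each substring as a token. The sketch below is plain Java, not Okapi’s NgramAnalyzer or Lucene code; it only illustrates what such a tokenizer produces.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of character n-gram tokenization, the same idea
// as Lucene's builtin NGramTokenizer. NOT Okapi's NgramAnalyzer.
public class NGramDemo {
    // Emit every contiguous substring of length n.
    static List<String> ngrams(String text, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        // "happy" with n=3 -> [hap, app, ppy]
        System.out.println(ngrams("happy", 3));
    }
}
```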
Even if this makes some sense, should the stopword technique be applied to a TM? We don’t really want TM to recognize “I am happy” and “I am not happy” (“not” being a stopword) as the same sentence and use the same translation, do we?
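The concern above can be made concrete with a small sketch: if “not” is treated as a stopword, two sentences with opposite meanings reduce to the same token stream. The stopword set and tokenization here are hypothetical for illustration, not Okapi’s actual list or analyzer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of the stopword concern: removing "not" makes opposite sentences
// indistinguishable to the TM. Hypothetical stopword list, not Okapi's.
public class StopwordDemo {
    static final Set<String> STOPWORDS = Set.of("i", "am", "not", "the", "a");

    // Lowercase, split on whitespace, drop stopwords.
    static List<String> tokenize(String sentence) {
        return Arrays.stream(sentence.toLowerCase().split("\\s+"))
                .filter(t -> !STOPWORDS.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Both print [happy]: the TM would treat the sentences as identical.
        System.out.println(tokenize("I am happy"));
        System.out.println(tokenize("I am not happy"));
    }
}
```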
And finally, the test TM I found at https://bitbucket.org/okapiframework/okapi/src/dev/okapi/tm/pensieve-integration-tests/src/test/resources/Paragraph_TM.tmx
has segments with more than one sentence, like: <seg><ph x="1">#![PG 0 1]</ph>1.<ph x="2">\t</ph>Confirm that the AC line is in the AC fluid detector. <ph x="3">\n</ph> <ph x="4">\n</ph>2.<ph x="5">\t</ph>Touch Retry. </seg>
Shouldn’t a seg contain just one sentence, to increase the chance of matching?
-
Hi Kuro,
I cannot answer the question about why we have our own N-gram code. Maybe @Jim Hargrave (OLD) will be able to do that.
I cannot really answer the stop-words question either, although I imagine it is an attempt to improve fuzziness. That said, stop-words do have the distinct disadvantage of making some short segments impossible to find. Here again I expect @Jim Hargrave to have some information about it.
As for the segments containing more than one sentence: yes, ideally a segment is a sentence. But in the real world, segments are sometimes paragraphs; it depends on how the tool breaks them down. So it’s not unexpected for test data to have segments with two sentences.
Cheers,
-ys
-
Hello all,
what is the current status of the Lucene update issue? I have just created a new issue for this, before realising that it is most likely a duplicate of this one:
https://bitbucket.org/okapiframework/okapi/issues/1067/updating-lucene-core-from-330
-
@Martn Wunderl, I’m working on it (slowly). The goal is to upgrade to Lucene 8.8.x. I have a somewhat working version, but some unit tests are failing due to a scoring issue: an exact match doesn’t score 1.0. I am going to spend some time on it today.
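For context on the 1.0 question: Lucene’s default similarity has been BM25 since 6.0, and raw BM25 scores are unbounded, unlike the roughly-normalized overlap scores older Pensieve code could rely on. One common workaround is to normalize raw scores by the top hit’s score so an exact match comes out at 1.0. The helper below is a hypothetical sketch of that normalization, not Pensieve’s actual code, and whether it is the right fix here is an open question.

```java
// Hypothetical helper, not Pensieve's actual code: normalize raw Lucene
// scores so the best hit (e.g. an exact match) scores 1.0. Raw BM25 scores
// (Lucene's default similarity since 6.0) are not bounded at 1.0.
public class ScoreNormalizer {
    static float[] normalize(float[] rawScores) {
        float max = 0f;
        for (float s : rawScores) max = Math.max(max, s);
        if (max == 0f) return rawScores.clone(); // nothing matched; leave as-is
        float[] out = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            out[i] = rawScores[i] / max; // top hit becomes exactly 1.0
        }
        return out;
    }
}
```

Note the caveat with this approach: the best fuzzy hit also scores 1.0 when no exact match exists, so a threshold on the raw score may still be needed.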
If you have a patch, would like to work with me, or want to take it over completely, please contact me.
-
@Kuro Kurosaka Thank you for the update, Kuro. Unfortunately, right now I probably won’t be able to work on this. I’ll let you know if or when I have capacity to join this effort.
-
https://bitbucket.org/okapiframework/okapi/pull-requests/544
Yet to do: Move a few unit tests to the integration test. These unit tests are currently skipped. (These unit tests succeeded if the “testtm” TM was copied from the connector test.)
I’d appreciate it if any of the reviewers could start reviewing the code now.
-
Some unit test cases that use the Lucene index saved in the git repo have been moved to the new TmStepsIT test class in integration-test/pensieve.
-
- changed status to resolved