Updating Lucene version breaks okapi-tm-pensieve

Issue #837 new
Mihai Nita created an issue

Used by these artifacts:

  • org.apache.lucene:lucene-core

Current version used by Okapi: 3.3.0

Latest version (today) is 8.1.1

See latest version at https://mvnrepository.com/artifact/org.apache.lucene/lucene-core/

Comments (15)

  1. Kuro Kurosaka (BH Lab)

    @Mihai Nita , I thought I might be able to contribute to this issue because I worked on Lucene before. Could you tell me how to reproduce the issue? Is it a build error that you are concerned with, if we change the Lucene version to the latest?

  2. Chase Tingley

    Yes, update org.apache.lucene.version in superpom/pom.xml and then do a clean build, and things will start to break.
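For anyone reproducing this, the property looks roughly like the sketch below (the 3.3.0 value comes from the issue description; the exact surrounding XML in superpom/pom.xml may differ):

```xml
<properties>
  <!-- Bump this toward the latest 8.x release and run a clean build
       to reproduce the breakage described in this issue -->
  <org.apache.lucene.version>3.3.0</org.apache.lucene.version>
</properties>
```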

  3. Kuro Kurosaka (BH Lab)

    Does anyone know why Okapi defines its own N-gram analyzer/tokenizer, rather than using the builtin NGramTokenizer?

    In NgramAnalyzer.java, there is an array of n-grams that, as far as I can tell, are removed from the resulting tokens, like a stopword list. This doesn’t make much sense to me. Can anyone enlighten me?

    https://bitbucket.org/okapiframework/okapi/src/f58b7376e391c1a798c0bec144609abc8ecd31ee/okapi/libraries/lib-search/src/main/java/net/sf/okapi/lib/search/lucene/analysis/NgramAnalyzer.java#lines-38

    Even if this makes some sense, should the stopword technique be applied to a TM? We don’t really want TM to recognize “I am happy” and “I am not happy” (“not” being a stopword) as the same sentence and use the same translation, do we?

    And finally, the test TM I found at https://bitbucket.org/okapiframework/okapi/src/dev/okapi/tm/pensieve-integration-tests/src/test/resources/Paragraph_TM.tmx
    has more than one sentence per segment, like:

                    <seg><ph x="1">#![PG 0 1]</ph>1.<ph x="2">\t</ph>Confirm that the AC line is in the AC fluid detector.
                        <ph x="3">\n</ph>
                        <ph x="4">\n</ph>2.<ph x="5">\t</ph>Touch Retry.
                    </seg>
    

    Shouldn’t a seg have just one sentence to increase the chance of matching?
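For readers unfamiliar with the technique being debated above: a character n-gram tokenizer slides a fixed-size window over the text, which is also the kind of token Lucene’s builtin NGramTokenizer emits. A minimal stdlib sketch (illustrative only; this is neither Okapi nor Lucene code):

```java
import java.util.ArrayList;
import java.util.List;

public class NgramDemo {
    // Return all character n-grams of the given size, in order.
    static List<String> ngrams(String text, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("happy", 3)); // [hap, app, ppy]
    }
}
```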

  4. YvesS

    Hi Kuro,

    I cannot answer the question about why we have our own N-gram code. Maybe @Jim Hargrave (OLD) will be able to do that.

    I cannot really answer the stop-words question either, although I imagine it is meant to improve fuzziness. That said, stop-words have the distinct disadvantage of making some short segments impossible to find. Here again I expect @Jim Hargrave to have some information about it.

    As for segments containing more than one sentence: yes, ideally a segment is a sentence. But in the real world, segments are sometimes paragraphs; it depends on how the tool breaks them down. So it’s not unexpected for test data to have segments with two sentences.

    Cheers,
    -ys

  5. Kuro Kurosaka (BH Lab)

    @Martn Wunderl, I’m working on it (slowly). The goal is to upgrade to Lucene 8.8.x. I have a somewhat working version, but some unit test cases are failing due to a scoring issue: it doesn’t score 1.0 for an exact match. I am going to spend some time on it today.

    If you have a patch, would like to work with me, or want to take it over completely, please contact me.
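For context on the scoring issue mentioned above: since Lucene 8 the default similarity is BM25, whose raw scores are unbounded, so an exact match no longer naturally lands at 1.0. One common workaround (a sketch under that assumption, not the actual Pensieve code) is to normalize each hit’s score by the score of the query matched against itself:

```java
public class ScoreNormalizer {
    // rawScore:  score Lucene returned for a candidate segment.
    // selfScore: score of the query text matched against itself,
    //            i.e. the maximum achievable score for this query.
    static float normalize(float rawScore, float selfScore) {
        if (selfScore <= 0f) {
            return 0f;
        }
        // Clamp to 1.0 so rounding noise cannot push a match above exact.
        return Math.min(1.0f, rawScore / selfScore);
    }

    public static void main(String[] args) {
        System.out.println(normalize(4.0f, 4.0f)); // 1.0 (exact match)
        System.out.println(normalize(2.0f, 4.0f)); // 0.5 (partial match)
    }
}
```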

  6. Martn Wunderl

    @Kuro Kurosaka Thank you for the update, Kuro. Unfortunately, right now I probably won’t be able to work on this. I’ll let you know if or when I have capacity to join this effort.

  7. Kuro Kurosaka (BH Lab)

    Some unit test cases that use the Lucene index saved in the git repo have been moved to the new TmStepsIT test class in integration-test/pensieve.
