PensieveTM duplicates entries with different values even with -over option

Issue #1100 resolved
Pablo Gómez created an issue

I suspected this behavior, so I built a test case, which I post here in simplified form. I tried to simplify it further, but the results were not consistent.

I am importing a TMX file (level 2, coming from OmegaT) via Tikal. Source file (simple.tmx):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd">
<tmx version="1.4">
  <header creationtool="OmegaT" o-tmf="OmegaT TMX" adminlang="EN-US" datatype="plaintext" creationtoolversion="5.5.0_0_a1fd6a4d" segtype="sentence" srclang="en"/>
  <body>
<!-- Default translations -->
    <tu>
      <tuv xml:lang="en">
        <seg><bpt i="1" x="1">&lt;g1&gt;</bpt>Alignment unit<ept i="1">&lt;/g1&gt;</ept></seg>
      </tuv>
      <tuv xml:lang="nl" changeid="PGO" changedate="20211014T130316Z" creationid="PGO" creationdate="20211014T130316Z">
        <seg><bpt i="1" x="1">&lt;g1&gt;</bpt>Uitlijntoestel<ept i="1">&lt;/g1&gt;</ept></seg>
      </tuv>
    </tu>
    <tu>
      <tuv xml:lang="en">
        <seg><bpt i="1" x="1">&lt;g1&gt;</bpt>Brake U/D movement<ept i="1">&lt;/g1&gt;</ept></seg>
      </tuv>
      <tuv xml:lang="nl" changeid="PGO" changedate="20211014T130252Z" creationid="PGO" creationdate="20211014T130252Z">
        <seg><bpt i="1" x="1">&lt;g1&gt;</bpt>Rem Boven/Onder beweging<ept i="1">&lt;/g1&gt;</ept></seg>
      </tuv>
    </tu>
    <tu>
      <tuv xml:lang="en">
        <seg><bpt i="1" x="1">&lt;g1&gt;</bpt>Connectors plate<ept i="1">&lt;/g1&gt;</ept></seg>
      </tuv>
      <tuv xml:lang="nl" changeid="PGO" changedate="20211014T122758Z" creationid="PGO" creationdate="20211014T122749Z">
        <seg><bpt i="1" x="1">&lt;g1&gt;</bpt>Connectorenplaat<ept i="1">&lt;/g1&gt;</ept></seg>
      </tuv>
    </tu>
  </body>
</tmx>

The first time, Tikal creates the PensieveTM database from scratch:

tikal.bat -imp pensieve simple.tmx -sl EN -tl NL -ie UTF8 -fc okf_tmx -over

Then I export it to check, and the result is almost 100% as expected. There is a minor detail regarding the tags; I still don’t know how important it is, but it does not affect this issue, and the TM is functional.

Now I modify the TMX file (modified.tmx), removing some entries, changing the translation of others, and adding new ones, as follows:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd">
<tmx version="1.4">
  <header creationtool="OmegaT" o-tmf="OmegaT TMX" adminlang="EN-US" datatype="plaintext" creationtoolversion="5.5.0_0_a1fd6a4d" segtype="sentence" srclang="en"/>
  <body>
<!-- Default translations -->
    <tu>
      <tuv xml:lang="en">
        <seg><bpt i="1" x="1">&lt;g1&gt;</bpt>Brake U/D movement<ept i="1">&lt;/g1&gt;</ept></seg>
      </tuv>
      <tuv xml:lang="nl" changeid="PGO" changedate="20211014T130252Z" creationid="PGO" creationdate="20211014T130252Z">
        <seg><bpt i="1" x="1">&lt;g1&gt;</bpt>Rem boven/onder beweging<ept i="1">&lt;/g1&gt;</ept></seg>
      </tuv>
    </tu>
    <tu>
      <tuv xml:lang="en">
        <seg><bpt i="1" x="1">&lt;g1&gt;</bpt>Connectors plate<ept i="1">&lt;/g1&gt;</ept></seg>
      </tuv>
      <tuv xml:lang="nl" changeid="PGO" changedate="20211014T122758Z" creationid="PGO" creationdate="20211014T122749Z">
        <seg><bpt i="1" x="1">&lt;g1&gt;</bpt>Connectorplaat<ept i="1">&lt;/g1&gt;</ept></seg>
      </tuv>
    </tu>

    <tu>
      <tuv xml:lang="en">
        <seg><bpt i="1" x="1">&lt;g1&gt;</bpt>Counter weight cables<ept i="1">&lt;/g1&gt;</ept></seg>
      </tuv>
      <tuv xml:lang="nl" changeid="PGO" changedate="20211014T130002Z" creationid="PGO" creationdate="20211014T130002Z">
        <seg><bpt i="1" x="1">&lt;g1&gt;</bpt>Kabels tegengewicht<ept i="1">&lt;/g1&gt;</ept></seg>
      </tuv>
    </tu>

    <tu>
      <tuv xml:lang="en">
        <seg><bpt i="1" x="1">&lt;g1&gt;</bpt>Flat wheel<ept i="1">&lt;/g1&gt;</ept></seg>
      </tuv>
      <tuv xml:lang="nl" changeid="PGO" changedate="20211014T123236Z" creationid="PGO" creationdate="20211014T123236Z">
        <seg><bpt i="1" x="1">&lt;g1&gt;</bpt>Vlak wiel<ept i="1">&lt;/g1&gt;</ept></seg>
      </tuv>
    </tu>
    <tu>
      <tuv xml:lang="en">
        <seg>Square wheel</seg>
      </tuv>
      <tuv xml:lang="nl" changeid="PGO" changedate="20211014T123236Z" creationid="PGO" creationdate="20211014T123236Z">
        <seg>Vierkantewiel</seg>
      </tuv>
    </tu>
  </body>
</tmx>
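
To make the difference between simple.tmx and modified.tmx explicit, here is a minimal Python sketch that compares two TMX files by source text. It is not part of Okapi or Tikal; the helper names `load_pairs` and `diff_pairs` are hypothetical, and it assumes that inline codes can be flattened to their text content for comparison purposes.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def load_pairs(root, src="en", trg="nl"):
    """Return {source text: target text} for every <tu> under root.

    Inline codes (<bpt>, <ept>, ...) are flattened to their text content.
    This is an assumption made for comparison only.
    """
    pairs = {}
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segs[tuv.get(XML_LANG, "").lower()] = "".join(seg.itertext())
        if src in segs and trg in segs:
            pairs[segs[src]] = segs[trg]
    return pairs


def diff_pairs(old, new):
    """Classify source texts as removed, added, or changed between two TMs."""
    removed = sorted(s for s in old if s not in new)
    added = sorted(s for s in new if s not in old)
    changed = sorted(s for s in old if s in new and old[s] != new[s])
    return removed, added, changed
```

Calling `diff_pairs(load_pairs(ET.parse("simple.tmx").getroot()), load_pairs(ET.parse("modified.tmx").getroot()))` on the two files above should report one removed source (Alignment unit), two sources with a changed translation (Brake U/D movement, Connectors plate), and three added sources (Counter weight cables, Flat wheel, Square wheel).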

Import it again:

tikal.bat -imp pensieve -sl EN -tl NL -ie UTF8 -fc okf_tmx -over modified.tmx

And check what’s in there by exporting it:

tikal.bat -sl EN -tl NL -ie UTF-8 -oe UTF-8 -fc okf_tmx -exp pensieve.pentm

This is the content of the output file, pensieve.pentm.tmx:

<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><header creationtool="unknown" creationtoolversion="unknown" segtype="paragraph" o-tmf="unknown" adminlang="en" srclang="en" datatype="text"></header><body>
<tu tuid="autoID1">
<tuv xml:lang="en"><seg><bpt i="1" type="Xpt">&lt;g1&gt;</bpt>Alignment unit<ept i="1">&lt;/g1&gt;</ept></seg></tuv>
<tuv xml:lang="nl"><seg><bpt i="1" type="Xpt">&lt;g1&gt;</bpt>Uitlijntoestel<ept i="1">&lt;/g1&gt;</ept></seg></tuv>
</tu>
<tu tuid="autoID2">
<tuv xml:lang="en"><seg><bpt i="1" type="Xpt">&lt;g1&gt;</bpt>Brake U/D movement<ept i="1">&lt;/g1&gt;</ept></seg></tuv>
<tuv xml:lang="nl"><seg><bpt i="1" type="Xpt">&lt;g1&gt;</bpt>Rem Boven/Onder beweging<ept i="1">&lt;/g1&gt;</ept></seg></tuv>
</tu>
<tu tuid="autoID3">
<tuv xml:lang="en"><seg><bpt i="1" type="Xpt">&lt;g1&gt;</bpt>Connectors plate<ept i="1">&lt;/g1&gt;</ept></seg></tuv>
<tuv xml:lang="nl"><seg><bpt i="1" type="Xpt">&lt;g1&gt;</bpt>Connectorenplaat<ept i="1">&lt;/g1&gt;</ept></seg></tuv>
</tu>
<tu tuid="autoID4">
<tuv xml:lang="en"><seg><bpt i="1" type="Xpt">&lt;g1&gt;</bpt>Brake U/D movement<ept i="1">&lt;/g1&gt;</ept></seg></tuv>
<tuv xml:lang="nl"><seg><bpt i="1" type="Xpt">&lt;g1&gt;</bpt>Rem boven/onder beweging<ept i="1">&lt;/g1&gt;</ept></seg></tuv>
</tu>
<tu tuid="autoID5">
<tuv xml:lang="en"><seg><bpt i="1" type="Xpt">&lt;g1&gt;</bpt>Connectors plate<ept i="1">&lt;/g1&gt;</ept></seg></tuv>
<tuv xml:lang="nl"><seg><bpt i="1" type="Xpt">&lt;g1&gt;</bpt>Connectorplaat<ept i="1">&lt;/g1&gt;</ept></seg></tuv>
</tu>
<tu tuid="autoID6">
<tuv xml:lang="en"><seg><bpt i="1" type="Xpt">&lt;g1&gt;</bpt>Counter weight cables<ept i="1">&lt;/g1&gt;</ept></seg></tuv>
<tuv xml:lang="nl"><seg><bpt i="1" type="Xpt">&lt;g1&gt;</bpt>Kabels tegengewicht<ept i="1">&lt;/g1&gt;</ept></seg></tuv>
</tu>
<tu tuid="autoID7">
<tuv xml:lang="en"><seg><bpt i="1" type="Xpt">&lt;g1&gt;</bpt>Flat wheel<ept i="1">&lt;/g1&gt;</ept></seg></tuv>
<tuv xml:lang="nl"><seg><bpt i="1" type="Xpt">&lt;g1&gt;</bpt>Vlak wiel<ept i="1">&lt;/g1&gt;</ept></seg></tuv>
</tu>
<tu tuid="autoID8">
<tuv xml:lang="en"><seg>Square wheel</seg></tuv>
<tuv xml:lang="nl"><seg>Vierkantewiel</seg></tuv>
</tu>
</body>
</tmx>

Some <tu> entries were duplicated (line numbers refer to the exported file above):

  • Line 15 (autoID4) duplicates the source of line 7 (autoID2)
  • Line 19 (autoID5) duplicates the source of line 11 (autoID3)

As I am using the -over option, I expected the <tu>s at lines 7 and 11 to disappear. This is the behavior in most other cases.
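
A check like this can be automated. The following Python sketch lists every source text that appears in more than one <tu> of an exported TMX file. It is not an Okapi API; the function name `find_duplicate_sources` is hypothetical, and it assumes that two entries count as duplicates when their source text matches exactly once inline codes are flattened to text.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def find_duplicate_sources(root, src="en", trg="nl"):
    """Map each source text found in more than one <tu> to its (tuid, target) list."""
    by_source = defaultdict(list)
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # Flatten inline codes to text (assumption, for comparison only).
                segs[tuv.get(XML_LANG, "").lower()] = "".join(seg.itertext())
        if src in segs:
            by_source[segs[src]].append((tu.get("tuid"), segs.get(trg)))
    # Keep only the sources that occur more than once.
    return {s: hits for s, hits in by_source.items() if len(hits) > 1}
```

Run as `find_duplicate_sources(ET.parse(path_to_export).getroot())`, where `path_to_export` is the exported TMX file. On the export above it should flag Brake U/D movement (autoID2 and autoID4) and Connectors plate (autoID3 and autoID5), and return an empty dict once -over replaces entries as expected.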

Comments (6)

  1. YvesS

    Hi Pablo,

    I have tried your steps with your examples on the latest snapshot of Okapi, and the problem seems to have been corrected. (See the attached file: the export after importing the modified TMX file.)
    We did have some changes in the underlying libraries that work with Pensieve, so the problem may have been corrected as part of that update.

    The snapshot version of Tikal, 2.1.41.0-SNAPSHOT, can be downloaded from https://gitlab.com/okapiframework/okapi/-/jobs/artifacts/dev/browse/deployment/maven/done?job=verification:jdk8
    That new version should be released soon (hopefully before the end of the month).
    I’ll mark the issue as fixed, but we can re-open it if you detect anything wrong.
