Extraction order in Excel files

Issue #165 resolved

Former user created an issue 2011-02-18

Original issue 165 created by @ysavourel on 2011-02-18T15:17:41.000Z:

The cells of Excel files are extracted in apparently random order. This makes translation difficult.

See report here:
http://tech.groups.yahoo.com/group/okapitools/message/1829

There is possibly an algorithm to remedy this:
http://sourceforge.net/tracker/index.php?func=detail&aid=1650344&group_id=68187&atid=520347

Comments (13)

Former user Account Deleted
Comment [1.](https://code.google.com/p/okapi/issues/detail?id=165#c1) originally posted by ar...@planet.nl on 2011-04-22T07:35:20.000Z:

Hello Okapi,

Any news on solving this issue? As we translate a lot of Excel files, this issue is a show stopper for us. We use Tikal as part of a newly developed automated workflow, but we're stuck for now.

Your effort is greatly appreciated.

Arne Leeman.
- 2011-04-22T07:35:20+00:00
Former user Account Deleted
Comment 2. originally posted by @ysavourel on 2012-12-04T22:32:17.000Z:

Didier's post about OmegaT is correct - it's the order that the strings are encountered in the actual sheets. However, because the sharedString table is *shared*, there is a decision to be made about how to expose the entries for translation.

For example, suppose a string appears in 3 separate places in the spreadsheet, but it only appears once in the shared string table. Either:
1) Expose the string for translation in each of the 3 places. If I'm reading Didier's post correctly, this is what he is advocating. The problem here is that each of those 3 strings could conceivably be translated differently. At that point, you will either need to update the sheet structure and shared string table (to insert new entries for each translation), or else discard some of the translations.
2) Expose the string for translation in only one of the 3 places, such as the first time you encounter that string going through the spreadsheet. The downside to this approach is that it may confuse translators who wonder why some strings aren't appearing for translation. WorldServer initially took this approach.
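
To make the two options concrete, here is a minimal Python sketch over a toy model. The list-of-tuples layout and the function names are assumptions made for illustration; they are not Okapi's or WorldServer's actual data structures.

```python
# Toy model (hypothetical): the shared string table is a list, and each
# cell stores an index into it. Cells are listed in sheet order.
shared_strings = ["Name", "Total", "Yes"]
cells = [("A1", 0), ("B1", 1), ("A2", 2), ("B2", 2), ("C2", 2)]


def expose_every_occurrence(cells, shared_strings):
    """Option 1: one translatable unit per cell, in sheet order."""
    return [(ref, shared_strings[idx]) for ref, idx in cells]


def expose_first_occurrence(cells, shared_strings):
    """Option 2: one unit per shared string, at its first use in sheet order."""
    seen = set()
    units = []
    for ref, idx in cells:
        if idx not in seen:
            seen.add(idx)
            units.append((ref, shared_strings[idx]))
    return units


option_1 = expose_every_occurrence(cells, shared_strings)
option_2 = expose_first_occurrence(cells, shared_strings)
```

With this data, option 1 yields five units ("Yes" three times) and option 2 yields three, which is why a translator would not see B2 and C2 under the second approach.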
- 2012-12-04T22:32:17+00:00
Former user Account Deleted
- attached ordering.xlsx
Comment 3. originally posted by @ysavourel on 2015-03-18T18:31:13.000Z:

Attaching a sample to demonstrate how odd the extraction order is.
- 2015-03-18T18:31:13+00:00
Former user Account Deleted
Comment 4. originally posted by m...@sebastianebert.com on 2015-03-18T18:58:24.000Z:

One remark that might be related to this: I find it confusing that OmegaT does not show any repetitions in Excel files even though there are cells with identical content. From that point of view, it would be great if you could change the behaviour so that we get multiple extractions. This would enable CAT tools to count the repetitions.
- 2015-03-18T18:58:24+00:00
Former user Account Deleted
Comment 5. originally posted by @ysavourel on 2015-03-18T19:25:33.000Z:

I agree; ultimately, that's where we need to get to.

Right now, the issue is that if we expose those repetitions, then people could translate them differently, and this would cause the filter to behave badly when it merged the translations into the target. (Basically, one of the two translations would win, and the other would be lost.)

To really fix this, I think the filterwriter needs the ability to un-inline string table entries, which would require:
* adding new entries to the sharedStrings xml
* updating the worksheet files to rewrite cells to point to the correct copy of the string

This is not impossible, just complicated.
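
A rough sketch of those two steps on the write side, in Python over a toy model. The function name `merge_translations` and the list/dict layout are hypothetical and only illustrate the idea; this is not the filterwriter's code.

```python
def merge_translations(shared_strings, cells, translations):
    """Write per-cell translations back, adding table entries on conflict.

    shared_strings: list of source strings
    cells: list of (cell_ref, index) pairs in sheet order
    translations: dict mapping cell_ref to translated text
    Returns (new_table, new_cells).
    """
    new_table = list(shared_strings)
    new_cells = []
    assigned = {}            # (original index, text) -> index in new_table
    original_slot_used = set()
    for ref, idx in cells:
        # Untranslated cells fall back to the source text.
        text = translations.get(ref, shared_strings[idx])
        key = (idx, text)
        if key not in assigned:
            if idx not in original_slot_used:
                # The first translation seen for an entry reuses its slot.
                original_slot_used.add(idx)
                new_table[idx] = text
                assigned[key] = idx
            else:
                # A different translation of the same source string:
                # append a new entry to the table.
                new_table.append(text)
                assigned[key] = len(new_table) - 1
        # Point the cell at the entry holding its own translation.
        new_cells.append((ref, assigned[key]))
    return new_table, new_cells


table, repointed = merge_translations(
    ["Open"],
    [("A1", 0), ("A2", 0), ("A3", 0)],
    {"A1": "Ouvrir", "A2": "Ouvert", "A3": "Ouvrir"},
)
```

Here the single source entry ends up as two target entries, and A2 is rewritten to reference the new one, so neither translation is lost.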
- 2015-03-18T19:25:33+00:00
Former user Account Deleted
Comment 6. originally posted by @ysavourel on 2015-03-25T20:17:36.000Z:

Just to continue this conversation: I've been working on a branch on getting column excludes working again, and it looks like there's some very old (and broken) code to try to deal with the scenario where a single shared string is referenced from both included and excluded contexts. This is a sub-case of the repetitions scenario, so I may need to solve that to fully handle the exclude problem.

The approach I'm leaning towards is to do an initial pass that denormalizes the string table, which would avoid a -lot- of complexity later on.
- 2015-03-25T20:17:36+00:00
Former user Account Deleted
Comment 7. originally posted by m...@sebastianebert.com on 2015-03-26T15:23:21.000Z:

Does your OpenXML Excel filter modification also involve the Word filter? If so, it would be fantastic to have the color extraction feature for Word files as well. Lots of people use, for example, a yellow background color to mark text that has been updated since the last version and needs a new translation. If it's a completely different case, I would probably open a new feature request. Just let me know...
- 2015-03-26T15:23:21+00:00
Former user Account Deleted
Comment 8. originally posted by @ysavourel on 2015-03-26T18:04:12.000Z:

No, I'm pretty sure fixing color-based excludes in Excel would only affect Excel, since the style data is encoded differently in the two formats. So it should be its own enhancement request, since it would be independent of the Excel work.

For the use case you're describing, couldn't you define a custom Word style to have a yellow background, then exclude that style by name?
- 2015-03-26T18:04:12+00:00
Former user Account Deleted
Comment 9. originally posted by m...@sebastianebert.com on 2015-03-27T07:51:17.000Z:

You are right, this would be an option. However, the exclusion feature for Word styles isn't working either. Moreover, you can only select from a fixed set of styles to exclude (it's not dynamic). Third, unfortunately, most users tend not to use styles but simply use the brush button to mark text.
- 2015-03-27T07:51:17+00:00
Former user Account Deleted
- attached exclude_include_colors_excel.png
Comment 10. originally posted by m...@sebastianebert.com on 2015-03-27T07:53:17.000Z:

I attached a possible UI design (not looking nice, but showing the behavior). Options could be like that. Does this cover all the common use cases?
- 2015-03-27T07:53:17+00:00
Former user Account Deleted
Comment 11. originally posted by m...@sebastianebert.com on 2015-03-27T07:54:26.000Z:

Colors as well as files and sheets would be dynamic in this case depending on the files on the input list. Don't know if this is possible. Hope so...
- 2015-03-27T07:54:26+00:00

Chase Tingley

changed status to resolved

Overhaul OOXML filter XLSX string processing

This fixes issue #165 and additional cases related to shared strings
that require distinct translations in different contexts.

The filter will now do an initial pass to de-duplify the shared
string entries and rewrite them in "document" order, updating
worksheet references as it goes.  This means that both translatable
(xl/sharedStrings.xml) and non-translatable (xl/worksheets/*) parts
will be modified during the course of translation.
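
As an illustration only, the pass described above might look roughly like the
following Python sketch over in-memory SpreadsheetML parts. The function name
`denormalize` and the sample XML are assumptions; this is not the filter's
actual Java implementation, and it ignores details such as inline strings.

```python
import copy
import xml.etree.ElementTree as ET

NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
ET.register_namespace("", NS)


def denormalize(shared_xml, sheet_xmls):
    """Give every string cell its own table entry, in document order."""
    sst = ET.fromstring(shared_xml)
    old_entries = sst.findall(f"{{{NS}}}si")
    for si in old_entries:
        sst.remove(si)
    sheets = [ET.fromstring(x) for x in sheet_xmls]
    count = 0
    for sheet in sheets:                      # sheets assumed in workbook order
        for cell in sheet.iter(f"{{{NS}}}c"):  # cells in document order
            if cell.get("t") != "s":          # only shared-string cells
                continue
            v = cell.find(f"{{{NS}}}v")
            if v is None or v.text is None:
                continue
            # Copy the referenced entry and repoint the cell at the copy.
            sst.append(copy.deepcopy(old_entries[int(v.text)]))
            v.text = str(count)
            count += 1
    # Simplification: both counters are set to the number of entries.
    sst.set("count", str(count))
    sst.set("uniqueCount", str(count))
    return (ET.tostring(sst, encoding="unicode"),
            [ET.tostring(s, encoding="unicode") for s in sheets])


SHARED = (f'<sst xmlns="{NS}" count="3" uniqueCount="2">'
          "<si><t>Yes</t></si><si><t>Name</t></si></sst>")
SHEET = (f'<worksheet xmlns="{NS}"><sheetData>'
         '<row r="1"><c r="A1" t="s"><v>1</v></c>'
         '<c r="B1" t="s"><v>0</v></c><c r="C1"><v>42</v></c></row>'
         '<row r="2"><c r="A2" t="s"><v>0</v></c></row>'
         "</sheetData></worksheet>")
new_shared, new_sheets = denormalize(SHARED, [SHEET])
```

In this sample the rewritten table reads Name, Yes, Yes, matching the order in
which the cells appear, and the numeric cell C1 is left untouched.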

This will change the trans-units exposed for translation by the
filter, in some cases radically, both by reordering them and by
producing multiple copies of strings that appear multiple times.

This also resolves remaining aspects of issue #390 related to string
duplication, i.e. cases where one copy of a string is excluded and
another is not.

In the test code, the ZipCompare class has been replaced with a new
implementation (OpenXMLPackageDiffer) that uses XMLUnit to perform
XML-aware comparisons of parts and provide better information about
test failures.
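
To illustrate what an XML-aware comparison buys over a byte-level one, here is
a small Python sketch using the standard library's canonicalizer. It is a
stand-in for the idea only; it uses neither XMLUnit nor OpenXMLPackageDiffer,
and the helper name `xml_parts_equal` is hypothetical.

```python
import xml.etree.ElementTree as ET


def xml_parts_equal(a: str, b: str) -> bool:
    """Compare two XML parts after canonicalization (Python 3.8+).

    Attribute order and whitespace around text are normalized, so only
    differences in structure or content make the parts unequal.
    """
    return (ET.canonicalize(a, strip_text=True)
            == ET.canonicalize(b, strip_text=True))


part_a = '<c r="A1" t="s"><v>0</v></c>'
part_b = '<c t="s"   r="A1">\n  <v>0</v>\n</c>'   # same content, other layout
part_c = '<c r="A1" t="s"><v>1</v></c>'           # different cell value

same = xml_parts_equal(part_a, part_b)
different = xml_parts_equal(part_a, part_c)
```

A byte comparison would report part_a and part_b as different, while the
canonical comparison treats them as equal and still flags part_c.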

→ <<cset c85f282ecb51>>

2015-04-25T21:24:05+00:00

Chase Tingley
- assigned issue to Chase Tingley
- edited description
- 2015-04-25T21:26:18+00:00

Assignee: Chase Tingley

Type: enhancement

Priority: minor

Status: resolved

Milestone: –

Version: –

Votes: 0

Watchers: 1