OpenXML filter: subfilter hard line breaks and HTML tags in XLSX documents

Issue #780 resolved
Denis Konovalyenko created an issue

Some Excel spreadsheets are exported from something like a product database and contain cells whose content is to be translated in place, but that content includes either:

  • Multi-line text content, using hard line breaks
  • Raw HTML tags in the content

Currently, neither of these cases is handled well: the hard line breaks and HTML tags are exposed for translation. For more details, please refer to the attached XLSX document.

With a basic subfiltering mechanism, we could use okf_html for tag content and okf_plaintext to split plain text with hard line breaks.

For now, the following restrictions are going to be presumed:

  • It only has to work on XLSX cell content; we don't need to worry about other OpenXML document types.
  • It only has to work on raw string content, so we don't need to do the tag splicing that caused so much complexity in the subfiltering step.
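
As a rough illustration of the idea above, here is a minimal Python sketch of how raw cell text could be routed to one of the two subfilters. It is not Okapi code: the helper names `choose_subfilter` and `split_lines` and both regular expressions are hypothetical, and only the filter IDs okf_html and okf_plaintext come from this issue. Auto-detecting the subfilter per cell is also an assumption made for the sketch; the comments on this issue suggest the actual change uses a single `subfilter` config parameter instead.

```python
import re

# Hypothetical patterns, not taken from Okapi.
HTML_TAG = re.compile(r"</?[A-Za-z][^<>]*>")   # something that looks like an HTML tag
LINE_BREAK = re.compile(r"\r\n|\r|\n")         # hard line breaks inside a cell


def choose_subfilter(cell_text):
    """Return the ID of the subfilter that fits the raw cell text, or None."""
    if HTML_TAG.search(cell_text):
        return "okf_html"        # raw HTML tags: hand the cell to the HTML filter
    if LINE_BREAK.search(cell_text):
        return "okf_plaintext"   # multi-line text: split it on hard line breaks
    return None                  # ordinary cell: no subfiltering needed


def split_lines(cell_text):
    """Split multi-line cell text into non-empty lines, one per translatable unit."""
    return [line for line in LINE_BREAK.split(cell_text) if line.strip()]


if __name__ == "__main__":
    for cell in ["<b>New</b> product", "Line one\nLine two", "Plain cell"]:
        print(repr(cell), "->", choose_subfilter(cell))
```

Because the sketch only inspects the raw string of a cell, it stays within both restrictions listed above: it never looks at other OpenXML document types and never has to splice formatting tags back together.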

Comments (6)

  1. Chase Tingley

    @DenisKonovalyenko Of course, as soon as I merge the pull request, I have a question :)

    Do you think it would make more sense to name the config parameter something like xlsxSubfilter or cellSubfilter instead of just subfilter, given its restricted use?

  2. Denis Konovalyenko reporter

    @tingley, in my opinion, it is enough to have just subfilter, as long as we do not have any other subfilters to apply during filtering. I would propose coming back to this question once we start making the subfilter aware of other types of OpenXML documents (or styled text, to be precise). What do you think?
