XLIFFFilter: `<ph></ph>` are coded as two TagType.PLACEHOLDER with TextFragment.MARKER_ISOLATED
Hi Okapi,
In short: `<ph></ph>` is coded as two `TagType.PLACEHOLDER` codes with `TextFragment.MARKER_ISOLATED`, and it’s not possible to recover the OPEN/CLOSE pairing from them when processing `TextFragment.codes` from Events.
Background:
We’re using Okapi to parse XLIFF, extract the text together with InlineElements (MARKER_OPENING/MARKER_CLOSING elements) and Markers (MARKER_ISOLATED), send them to our machine-translation service, get the results back, and merge the translation with the InlineElements and Markers back into Okapi.
Recently we found that the XLIFF tag pair `<ph></ph>` is coded as two TagType.PLACEHOLDER codes rather than as an OPENING and a CLOSING code. We understand that `<ph>` stands for PLACEHOLDER, but in this case (`<ph></ph>`), should it be an OPENING and a CLOSING element?
Details:
Xliff:
<source>
<g ctype="x-html-p" id="1" dgo:tag_name="p">
<ph ctype="image" id="2" htm:src="B9BD5C75F6951B0.gif" htm:width="350" htm:height="350" htm:border="2px solid rgb(255, 0, 0)" htm:float="left" htm:margin="10px">
<sub ctype="x-html-img-alt">display</sub>
</ph>
</g>
</source>
<target xml:lang="de-DE">
<ph ctype="image" id="2" htm:src="B9BD5C75F6951B0.gif" htm:width="350" htm:height="350" htm:border="2px solid rgb(255, 0, 0)" htm:float="left" htm:margin="10px">
<sub ctype="x-html-img-alt">
<g ctype="x-html-p" id="1" dgo:tag_name="p">Anzeige</sub>
</ph>
</g>
</target>
TextFragment.codes of the source section:
0 = {Code@21675} ""
tagType = {TextFragment$TagType@21682} "OPENING"
outerData = {StringBuilder@21685} "<g ctype="x-html-p" id="1" dgo:tag_name="p">"
1 = {Code@21656} ""
tagType = {TextFragment$TagType@21244} "PLACEHOLDER"
outerData = {StringBuilder@21693} "<ph ctype="image" id="2" htm:src="B9BD5C75F6951B0.gif" htm:width="350" htm:height="350" htm:border="2px solid rgb(255, 0, 0)" htm:float="left" htm:margin="10px"><sub ctype="x-html-img-alt">"
2 = {Code@21676} ""
tagType = {TextFragment$TagType@21244} "PLACEHOLDER"
outerData = {StringBuilder@21700} "</sub></ph>"
3 = {Code@21677} ""
tagType = {TextFragment$TagType@21721} "CLOSING"
outerData = {StringBuilder@21724} "</g>"
Questions:
- Should `<ph></ph>` be coded as "PLACEHOLDER" or as "OPENING" and "CLOSING" tags?
- If it’s right to make `<ph></ph>` PLACEHOLDERs, is it possible to retain the OPENING and CLOSING information from TextFragment?
Thank you very much!
Hao
Comments (11)
-
-
Thank you for your insights, @YvesS. Can you share some ideas on how to handle the situation? If <ph></ph> mapped to a single PLACEHOLDER, everything would be fine. But <ph><sub> and </sub></ph> each being parsed as a `PLACEHOLDER` makes it impossible for our application to know that they need to be treated as a pair in order for Okapi not to produce broken XML.
It seems that for us it would be ideal if
Begin <ph><sub>subflow</sub></ph> End
could be parsed as two segments (or a similar structure):
Begin <PLACEHOLDER/> End
and
subflow
With the current approach we get
Begin <PLACEHOLDER outerData="<ph><sub>"/>subflow<PLACEHOLDER outerData="</sub></ph>"/> End
and we can’t tell that swapping the two PLACEHOLDERs is an illegal operation.
An alternative could be an indicator inside the TextFragment that a subflow begins and ends.
When the inner <sub> is omitted, <ph></ph> is a single PLACEHOLDER, which works great for us.
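As long as the filter does not expose subflows, one possible client-side workaround is to infer the pairing from the outer data of the PLACEHOLDER codes. The following is a minimal sketch under stated assumptions, not Okapi code: `InlineCode` is a hypothetical stand-in for Okapi's `Code` (it carries only a tag type and the outer data), `SubflowPairSketch` and its method names are illustrative, and the regular expressions are a heuristic that assumes the outer data looks like the dump in this issue (the opening part ends with `<sub ...>`, the closing part starts with `</sub>`).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for an inline code; NOT Okapi's Code class. */
final class InlineCode {
    final String tagType;   // "OPENING", "CLOSING" or "PLACEHOLDER"
    final String outerData; // original XLIFF markup carried by the code

    InlineCode(String tagType, String outerData) {
        this.tagType = tagType;
        this.outerData = outerData;
    }
}

final class SubflowPairSketch {
    // Heuristic: outer data ending with a non-self-closing <sub ...> tag starts a subflow.
    private static final Pattern ENDS_WITH_SUB_OPEN =
            Pattern.compile("<sub(\\s[^>]*)?(?<!/)>\\s*$");
    // Heuristic: outer data starting with </sub> ends a subflow.
    private static final Pattern STARTS_WITH_SUB_CLOSE =
            Pattern.compile("^\\s*</sub\\s*>");

    static boolean opensSubflow(InlineCode c) {
        return "PLACEHOLDER".equals(c.tagType) && ENDS_WITH_SUB_OPEN.matcher(c.outerData).find();
    }

    static boolean closesSubflow(InlineCode c) {
        return "PLACEHOLDER".equals(c.tagType) && STARTS_WITH_SUB_CLOSE.matcher(c.outerData).find();
    }

    /** Returns {openIndex, closeIndex} for every pair of codes that wraps a subflow. */
    static List<int[]> findSubflowPairs(List<InlineCode> codes) {
        List<int[]> pairs = new ArrayList<>();
        Deque<Integer> open = new ArrayDeque<>();
        for (int i = 0; i < codes.size(); i++) {
            InlineCode c = codes.get(i);
            // Check "closes" first: one code may end a subflow and start the next one.
            if (closesSubflow(c) && !open.isEmpty()) {
                pairs.add(new int[] {open.pop(), i});
            }
            if (opensSubflow(c)) {
                open.push(i);
            }
        }
        return pairs;
    }

    /** True if no subflow is closed before it is opened and none is left open. */
    static boolean isWellNested(List<InlineCode> codes) {
        int depth = 0;
        for (InlineCode c : codes) {
            if (closesSubflow(c)) {
                if (depth == 0) {
                    return false;
                }
                depth--;
            }
            if (opensSubflow(c)) {
                depth++;
            }
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        List<InlineCode> codes = Arrays.asList(
                new InlineCode("OPENING", "<g ctype=\"x-html-p\" id=\"1\">"),
                new InlineCode("PLACEHOLDER", "<ph ctype=\"image\" id=\"2\"><sub ctype=\"x-html-img-alt\">"),
                new InlineCode("PLACEHOLDER", "</sub></ph>"),
                new InlineCode("CLOSING", "</g>"));
        for (int[] p : findSubflowPairs(codes)) {
            System.out.println("subflow pair: codes " + p[0] + " and " + p[1]);
        }
        System.out.println("well nested: " + isWellNested(codes));
    }
}
```

A check like `isWellNested` could be run on the code order that comes back from machine translation, so that a swapped pair is rejected before merging. It relies on string inspection of the outer data, so it is only a stopgap compared to the filter marking subflows itself.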
-
Yes, full support for `sub` as a separate segment would be ideal. It’s just not implemented in the current version of the XLIFF reader/filter. One of the reasons is that it’s a recursive feature. A half-way solution would be to support plain-text `sub` content. Not ideal, but better than nothing.
-
I also noticed that the Wiki here https://okapiframework.org/wiki/index.php/XLIFF_Filter states that
The content of the <sub> element is currently not supported as text. Any element found inside a <bpt>, <ept>, <ph>, and <it> (including <sub>) is included in the code of the parent inline element. A warning is generated when a <sub> element is detected.
This seems to differ from what we are seeing: the text is included “as text” but is not marked as being in a subflow in the Okapi object model.
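To illustrate why treating `sub` content as separate segments is a recursive problem, here is a minimal sketch that works directly on the raw XLIFF markup with the JDK's DOM parser instead of Okapi. It is not how XLIFFFilter is implemented. The class `SubflowExtractionSketch`, the method `extract`, and the `<PLACEHOLDER/>`, `<OPENING/>` and `<CLOSING/>` stand-in strings are all hypothetical, and treating every inline element other than `<g>` as a placeholder is a simplification.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SubflowExtractionSketch {

    /** Returns the parent segment first, followed by one entry per <sub> (depth-first). */
    static List<String> extract(String sourceXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(sourceXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML: " + e.getMessage(), e);
        }
        List<String> units = new ArrayList<>();
        units.add(""); // slot 0 is reserved for the parent segment
        units.set(0, flatten(doc.getDocumentElement(), units));
        return units;
    }

    /** Flattens translatable content; subflows found on the way become their own units. */
    private static String flatten(Node parent, List<String> units) {
        StringBuilder text = new StringBuilder();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                text.append(child.getNodeValue());
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                if ("g".equals(child.getNodeName())) {
                    // <g> wraps translatable text, so it stays a paired code.
                    text.append("<OPENING/>").append(flatten(child, units)).append("<CLOSING/>");
                } else {
                    // Simplification: any other inline element is one placeholder.
                    text.append("<PLACEHOLDER/>");
                    collectSubs(child, units);
                }
            }
        }
        return text.toString();
    }

    /** Each <sub> inside a code is a segment of its own and may contain codes again. */
    private static void collectSubs(Node code, List<String> units) {
        NodeList children = code.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE && "sub".equals(child.getNodeName())) {
                int slot = units.size();
                units.add("");                          // reserve the slot before recursing
                units.set(slot, flatten(child, units)); // recursion: sub -> codes -> sub ...
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<source>Begin <ph id=\"1\"><sub>subflow</sub></ph> End</source>";
        for (String unit : extract(xml)) {
            System.out.println("[" + unit + "]");
        }
    }
}
```

The mutual recursion between `flatten` and `collectSubs` is the point of the sketch: a subflow can contain inline codes, which can contain further subflows, so a reader has to be able to re-enter its own segment handling at arbitrary depth.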
-
Yes, we probably tried to improve the support a little bit, and the wiki note has not been updated.
-
I tried changing it so that Okapi creates OPENING/CLOSING codes for <sub>, and the results look reasonable to me. What do you think? Am I missing some major downside to doing this?
See https://bitbucket.org/PatrickDHuy/okapi/commits/9c276992f8db829b7a1c39a35dfe029c6e584104 where I tried it. I adjusted the XLIFFFilterTest to the changed behaviour as well. I’m not sure whether I should open a PR for it already.
-
Looks like that may work, but shouldn’t we have append(tagType… with tagType set to the proper marker by the calling statement, instead of overwriting it inside the function?
I have to say that I have not looked at this code in a long time and cannot remember all the intricacies of that part. But you should feel free to submit a PR; you have tests, which helps in trusting the change.
-
BTW, such a change may trigger some changes in the golden files of the big integration tests.
-
I made a PR which treats <sub> as a separate TextUnit: https://bitbucket.org/okapiframework/okapi/pull-requests/450/allow-xliff-tag-to-be-represented-as-a Could you have a look, please?
-
I think we can close this – Patrick/Yves, can you confirm?
-
- changed status to resolved
Yes, it seems we can close.
`<ph></ph>` should definitely be seen as a PLACEHOLDER code normally. I think in this case the `sub` element causes the split into two. But it should not be an OPENING and a CLOSING.