Character count and GMX-V

Issue #860 new
David Koot created an issue

I have set up a pipeline in Rainbow with ‘Raw document to Filter Event’ - ‘Character Count’ - ‘Word Count’ - ‘Scoping Report’, and no other special settings.

The document that I’m counting is a txt file containing only this segment:

Start<bpt><sub> Text </sub></bpt><ept>end</ept>.

I took this example from https://xtm.cloud/manuals/gmx-v/GMX-V-2.0.html#InlineElementsTransparency

According to the GMX-V standard, this segment contains 3 words and 12 characters. The internal elements in GMX-V are ‘transparent’, meaning that they don’t add to the count. For GMX-V, this segment would count the same as “Start Text end.”.
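To make the expected numbers concrete, here is a minimal sketch of the “transparent” counting described above. It is not Okapi code and not a full GMX-V implementation: the `transparent_counts` name is hypothetical, the regex assumes anything in angle brackets is an inline tag, and `\w+` is only a rough stand-in for proper Unicode word breaking. It also assumes that punctuation and whitespace are tallied separately from the character count, which is what makes the 12 add up.

```python
import re
import unicodedata

# Assumption: anything between angle brackets is an inline tag.
TAG = re.compile(r"<[^>]*>")

def transparent_counts(segment):
    """Count a segment with inline tags treated as transparent."""
    text = TAG.sub("", segment)  # drop the tags, keep what they enclose
    return {
        "words": len(re.findall(r"\w+", text)),
        "characters": sum(1 for c in text if c.isalnum()),
        "punctuation": sum(
            1 for c in text if unicodedata.category(c).startswith("P")
        ),
        "whitespace": sum(1 for c in text if c.isspace()),
    }

print(transparent_counts("Start<bpt><sub> Text </sub></bpt><ept>end</ept>."))
# {'words': 3, 'characters': 12, 'punctuation': 1, 'whitespace': 2}
```

With the tags removed the segment reads “Start Text end.”, which gives the 3 words and 12 characters quoted from the standard.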

However, the report from Rainbow returns 3 words and 30 characters. I agree with the 3 words: everything inside the inline element tags is excluded. But the 30 characters do include everything inside the inline element tags, which is inconsistent and contradicts the GMX-V standard.

Am I doing something wrong, or is the character counting off when it comes to inline elements?

Thanks!

David Koot

Comments (9)

  1. David Koot reporter

    Does anyone have any comments on this issue? We are currently checking the counting features of Okapi.

    David Koot, TAUS

  2. ysavourel

    You are saying the file is processed as a “text” file. But that is an XLIFF snippet; it should be part of an XLIFF document. I’m not sure how the text filter will process such input, so that may be the issue.

    And if you are processing the file within a valid XLIFF document, then the issue may come from the XLIFF filter, which does not have very good support for <sub> elements. I’m not exactly sure how such a file would be parsed and passed on to the counter class.

  3. David Koot reporter

    Thank you very much, Yves. I realise now that the example is not valid XLIFF. But actually we are just passing strings to the word and character count, not files. And I was happy to see that for the word count Okapi parses any XML-type tag (not limited to XLIFF tags) as an internal tag. I suppose this is how https://xtm.cloud/manuals/gmx-v/GMX-V-2.0.html#InlineElementsTransparency (from which I took my sample) is meant to work.

    But the character count does not work that way: it counts every character regardless of whether it is part of a tag. I guess the text is parsed differently for the two kinds of counting, and that does not seem right.

  4. Sergei Vasilyev

    I would suggest changing the step to count characters only in the words extracted by the tokenizer for word counts, not in the original string.
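A rough sketch of that suggestion, under stated assumptions: `tokenize` below is a hypothetical stand-in for whatever tokenizer the word count step uses (here it simply drops angle-bracket tags and splits on word characters), and `count_from_tokens` is an illustrative name, not an Okapi method. The point is only that both numbers come from the same token list, so they cannot disagree about what is inside a tag.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: replace tags with a space, then pick out words.
    return re.findall(r"\w+", re.sub(r"<[^>]*>", " ", text))

def count_from_tokens(tokens):
    """Derive the word count and the character count from one token list."""
    return len(tokens), sum(len(token) for token in tokens)

words, chars = count_from_tokens(
    tokenize("Start<bpt><sub> Text </sub></bpt><ept>end</ept>.")
)
print(words, chars)  # 3 12
```

Whatever the tokenizer decides about tags, the character count then follows the same decision.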

  5. Mihai Nita

    I think there is a contradiction here, or I don’t understand the use case:

    “I have set up a pipeline in Rainbow”

    “The document that I’m counting is a txt file”

    then

    “we are just passing strings to the word and character count, not files”

    There is no way to pass anything other than files to Rainbow. No way to pass strings.

    And if the file has the “txt” extension then the characters inside the tags are counted, as the tags don’t have any meaning in a text file.

    Would be interesting to do an extract to xliff (for example using tikal -x) and see what the extracted text looks like.

  6. David Koot reporter

    Thank you all very much, I appreciate it. I will need to look into this a bit further. But to give you a bit of background: we wanted a good way of counting characters and words in a GMX-V compliant way, but we don’t use the complete Okapi framework. We only use the okapi-step-wordcount component, which we use to count words and chars in strings. To make my point in this forum, I used this txt file, which shows the same counting behaviour with the pipeline described in the first post.

  7. Mihai Nita

    Thank you David for the patience.

    As we have probably all experienced, sometimes 90% of the work is to reproduce / understand the bug, and 10% is the fix.
    This is probably one of those :-)

    I think it is very important to understand what we should see in memory at runtime (in the Okapi data structures), after parse.

    So I took the text from your file and put it in an XLIFF file, so that we can see what the <bpt> / <ept> tags are doing.

    The result is 2 words, 9 characters, both in Rainbow (with the pipeline you described) and with tikal -sr

    And I think it makes sense.

    The XLIFF 1.2 spec says that bpt / ept contain “Code data / Zero, one or more <sub> elements.”

    And sub contains “Text / Zero, one or more of the following elements: <g>, <x/>, <bx/> , <ex/>, <bpt>, <ept> , <ph>, <it>, <mrk> , in any order.”

    So “Start” is localizable text (not inside of any tag) and “ Text ” is localizable text (inside <sub>). “end” is not localizable, it is code (inside ept).
    The dot is also localizable, but does not count:

    Start<bpt><sub> Text </sub></bpt><ept>end</ept>.

    That’s 2 words, 9 characters, as reported by tikal / rainbow.
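The reading above can be sketched as a small tree walk. This is not the Okapi XLIFF filter, only an illustration of the rule “text outside code elements, or inside <sub>, is localizable”. The wrapping `<source>` element, the `localizable_text` name, and the absence of an XML namespace are assumptions made for the example, and `\w+` stands in for real word breaking.

```python
import re
import xml.etree.ElementTree as ET

# XLIFF 1.2 inline elements whose content is code rather than text.
CODE_ELEMENTS = {"bpt", "ept", "ph", "it"}

def localizable_text(elem, in_code=False):
    """Yield text that sits outside code elements, or inside <sub>."""
    if elem.text and not in_code:
        yield elem.text
    for child in elem:
        if child.tag == "sub":
            child_in_code = False      # <sub> switches back to text
        elif child.tag in CODE_ELEMENTS:
            child_in_code = True       # content of bpt/ept/ph/it is code
        else:
            child_in_code = in_code    # e.g. <g>, <mrk> keep the context
        yield from localizable_text(child, child_in_code)
        # Text after a child belongs to the parent's context.
        if child.tail and not in_code:
            yield child.tail

source = ET.fromstring(
    "<source>Start<bpt><sub> Text </sub></bpt><ept>end</ept>.</source>"
)
words = re.findall(r"\w+", " ".join(localizable_text(source)))
print(words, len(words), sum(len(w) for w in words))
# ['Start', 'Text'] 2 9
```

“end” is skipped because it sits directly inside <ept>, and the trailing dot is yielded as text but is not a word character.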

    Mihai

  8. David Koot reporter

    Thank you very much Mihai! It is clear what happens with xliff. But I am still puzzled by what happens with txt. If txt is not parsed, and tags don’t have a meaning, then, in a txt file,

    Start<bpt><sub> Text </sub></bpt><ept>end</ept>.

    should be counted exactly the same as (where < and > have been switched):

    Start>bpt<>sub< Text >/sub<>/bpt<>ept<end>/ept<.

    This is not the case. If you put each of the above in a separate txt file and count with “tikal -sr”, the first counts 30 characters in three words, and the second counts 30 characters in seven words. This is confusing to me. It seems the tokenization for word counting is different from that for character counting. Treated as XML-tagged text (which would be my preference, as I usually count “general-purpose” XML strings outside of the context of a file), the first example should give a word count of 3 [“Start”, “Text”, “end”] and a character count of 12; treated as XLIFF, it should give 2 words [“Start”, “Text”] and 9 characters. If tags have no meaning, it would be 9 words [“Start”, “bpt”, “sub”, “Text”, “sub”, “bpt”, “ept”, “end”, “ept”] and 30 characters. The current counting is inconsistent.
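The “tags have no meaning” case can be checked with a few lines. This is a sketch of the expectation, not of what tikal does: `plain_counts` is a hypothetical helper that treats <, > and / as ordinary punctuation and counts runs of ASCII letters and digits.

```python
import re

def plain_counts(text):
    """Plain-text reading: < > / are punctuation, everything else is words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return len(tokens), sum(len(token) for token in tokens)

tagged = "Start<bpt><sub> Text </sub></bpt><ept>end</ept>."
swapped = "Start>bpt<>sub< Text >/sub<>/bpt<>ept<end>/ept<."

print(plain_counts(tagged))   # (9, 30)
print(plain_counts(swapped))  # (9, 30)
```

Under this reading both strings give 9 words and 30 characters, so a pure plain-text count should not distinguish between them.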

  9. Mihai Nita

    Sorry, took me a bit to get to it (saving the Yahoo group posts, releasing m38, and the regular work :-)

    And it took a while because I don’t have a quick answer, or a good explanation.
    It is weird, and it is very-very likely wrong (the plain text part).

    I tried using the ICU BreakIterator directly (which is used to do the word count), and the results are the same for both strings.

    On the other hand, I know that the word count step in Okapi is very convoluted: it does the calculation in more than one way, and then there is a reconcile step.
    I am not familiar with that code, and don’t understand what is going on.

    I think that part is a clear bug, worth investigating.


    “As xml tagged text” … “outside of the context of a file”, I don’t think we support “xml fragments”.

    In general for XML it is unclear what is localizable and what not. And some fragments might not even be valid xml
    (think mismatched open / close tags, normally found in templating systems, where the fragments might be invalid xml, but become valid when “assembled”)

    And what is localizable in the example below?

    <msg msgid="register" help="Click this to register">
      <text>Register</text>
      <alttext>Register</alttext>
      <description>Some long blurb for translators, but not only</description>
    </msg>
    

    You would probably need a custom filter, or maybe a text filter with an xml sub-filter.
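To illustrate why the format has to be described explicitly, here is a minimal sketch of such a rule-driven extraction. It is not an Okapi filter or its configuration: the rule sets and the `extract` function are hypothetical, and the choice that `text`, `alttext` and the `help` attribute are translatable (while `description` is not) is only one possible answer to the question above.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules for this one format; another format needs other rules.
TRANSLATABLE_ELEMENTS = {"text", "alttext"}
TRANSLATABLE_ATTRIBUTES = {("msg", "help")}

def extract(xml_string):
    """Return the strings that the rules above mark as translatable."""
    units = []
    for elem in ET.fromstring(xml_string).iter():
        for name, value in elem.attrib.items():
            if (elem.tag, name) in TRANSLATABLE_ATTRIBUTES:
                units.append(value)
        if elem.tag in TRANSLATABLE_ELEMENTS and elem.text:
            units.append(elem.text.strip())
    return units

doc = """<msg msgid="register" help="Click this to register">
  <text>Register</text>
  <alttext>Register</alttext>
  <description>Some long blurb for translators, but not only</description>
</msg>"""

print(extract(doc))
# ['Click this to register', 'Register', 'Register']
```

Only once such rules exist is there a well-defined set of strings to hand to a word or character counter.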
