HTML markup exposed for translation

Issue #335 open
Former user created an issue

Original issue 335 created by @ysavourel on 2013-05-08T17:34:30.000Z:

See attached testcase. The expected behavior when extracting text with okf_html is to get a single segment that says "Translate this." However, you actually get two segments, one of which is just a bunch of HTML markup:

<source xml:lang="en"><table style="font-family: Arial; font-size:13px;padding-left:23px;" width="240px"; height="230px"; border=0; padding=0; margin=0; cellspacing=0; cellpadding=0; bgcolor="#cc0044"></source>

The reason for this is that the HTML is malformed (the semicolons after the attributes seem to be the culprits). However, it's not badly malformed, and the page still renders fine in modern browsers. Jericho has an internal error counter that tracks how many errors have been encountered in a single tag. If the error count passes a certain threshold, Jericho gives up and emits the tag as text. That's what's happening here.

The threshold in Jericho is configurable, and a simple improvement might be to set that threshold to something higher, so that it doesn't give up so easily on questionable tags.
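
To make the effect of the threshold concrete, here is a toy model in Java. It is not Jericho's code: the class and method names are hypothetical, and the rule that every semicolon outside a quoted attribute value counts as one error is an assumption made for illustration. It only shows how the same tag can be rejected under a low error limit and accepted under a higher one.

```java
// Toy model only. This does not use or reproduce Jericho's parser.
class TagErrorThresholdSketch {

    // Assumption: each semicolon outside a double-quoted value is one parse error.
    static int countStraySemicolons(String tagSource) {
        int errors = 0;
        boolean insideQuotes = false;
        for (int i = 0; i < tagSource.length(); i++) {
            char c = tagSource.charAt(i);
            if (c == '"') {
                insideQuotes = !insideQuotes;   // entering or leaving a quoted value
            } else if (c == ';' && !insideQuotes) {
                errors++;                       // semicolon between attributes
            }
        }
        return errors;
    }

    // The tag is kept as markup only while the error count stays within the limit;
    // otherwise the parser would give up and emit the tag as text.
    static boolean isAcceptedAsTag(String tagSource, int maxErrorCount) {
        return countStraySemicolons(tagSource) <= maxErrorCount;
    }

    public static void main(String[] args) {
        // The tag from the report above.
        String tag = "<table style=\"font-family: Arial; font-size:13px;padding-left:23px;\""
                + " width=\"240px\"; height=\"230px\"; border=0; padding=0; margin=0;"
                + " cellspacing=0; cellpadding=0; bgcolor=\"#cc0044\">";
        System.out.println("stray semicolons: " + countStraySemicolons(tag));
        System.out.println("accepted with limit 2: " + isAcceptedAsTag(tag, 2));
        System.out.println("accepted with limit 10: " + isAcceptedAsTag(tag, 10));
    }
}
```

This can be run directly with `java TagErrorThresholdSketch.java`. In Jericho itself, the corresponding setting appears to be the static `Attributes.setDefaultMaxErrorCount(int)`; that should be checked against the Jericho Javadoc for the version in use.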

Comments (3)

  1. Jim Hargrave (OLD)
    • edited description
    • changed milestone to 1.42.0
    • removed responsible
    • changed version to M33

    We have increased the Jericho error threshold, so this might be fixed.
