HTML Filter: throws exception while parsing HTML encoding when input contains &quot;

Issue #1098 resolved
Wenjin created an issue

Hi team,

I have a service that uses the Okapi HTML filter (which uses the Jericho parser underneath, http://jericho.htmlparser.net/docs/index.html, as mentioned in this ticket’s first comment) to segment HTML.

When the input looks like this (the input sits inside an <iframe> HTML element):

<html>
   <head>  
      <meta http-equiv=&quot;Content-Type&quot; content=&quot;html; charset=UTF-8&quot;>
      </meta>
   </head>
</html>

Okapi throws:

Warning: Unsupported encoding UTF-8" specified in document

In the above text, the encoding is specified in the META tag, and Okapi uses this encoding to perform filtering. &quot; is the entity for the quotation mark ". Replacing &quot; with " gives <meta http-equiv="Content-Type" content="html; charset=UTF-8">. Okapi takes UTF-8" including the trailing quotation mark. Ideally it should take only UTF-8.
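To illustrate the replacement described above, here is a minimal Python sketch of a caller-side pre-processing step. It is not part of Okapi or Jericho. The function name unescape_quotes_in_meta and the regular expression are hypothetical, and the sketch assumes that only <meta ...> tags need the substitution and that no meta tag contains a literal > inside an attribute value.

```python
import re

# Hypothetical helper, not an Okapi or Jericho API.
# Matches a single <meta ...> tag up to its first closing '>'.
META_TAG = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)


def unescape_quotes_in_meta(html_text: str) -> str:
    """Replace &quot; with a literal double quote, but only inside <meta ...> tags.

    Text outside meta tags is left untouched, so entities in ordinary
    content keep their escaped form.
    """
    return META_TAG.sub(
        lambda match: match.group(0).replace("&quot;", '"'),
        html_text,
    )


sample = (
    "<meta http-equiv=&quot;Content-Type&quot; "
    "content=&quot;html; charset=UTF-8&quot;>"
)
print(unescape_quotes_in_meta(sample))
# prints: <meta http-equiv="Content-Type" content="html; charset=UTF-8">
```

Restricting the substitution to meta tags keeps &quot; entities in body text unchanged. Whether this is enough for a given input is an assumption to verify against real documents.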

This input works correctly: <meta http-equiv=&quot;Content-Type&quot; content=\"html; charset=UTF-8\";>. When the quote is written with an escape character, Okapi correctly takes only UTF-8.

This seems to be a bug in the HTML parser. Will it be fixed, or are there any suggestions for dealing with it on our end? Ideally we want to avoid hard-coded checks for inputs with &quot;.

Comments (2)

  1. jhargrave-straker

    This is a problem with

    StreamedSourceCopy.htmlTidiedRewrite
    

    It would have to be fixed in Jericho, as you mentioned above, but here is a workaround.

    You can disable the tidy preprocess with this new option:

    cleanupHtml: false
    
