HTML Filter: throws exception while parsing html encoding when input is w/ &quot;
Hi team,
I have a service that uses the Okapi HTML filter (which uses the Jericho parser underneath, http://jericho.htmlparser.net/docs/index.html, as mentioned in this ticket’s first comment) to segment HTML.
When the input looks like this (the input is inside an <iframe> HTML element):
<html>
<head>
<meta http-equiv="Content-Type" content="html; charset=UTF-8">
</meta>
</head>
</html>
Okapi throws:
Warning: Unsupported encoding UTF-8" specified in document
In the above text, the encoding is specified in the META tag, and Okapi uses this encoding to perform filtering. &quot; is the entity for the quotation mark ". After replacing &quot;, the tag reads <meta http-equiv="Content-Type" content="html; charset=UTF-8">. Okapi takes UTF-8&quot; along with the quotation mark. Ideally it should take only UTF-8.
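To illustrate why a charset name with a trailing quotation mark ends up "unsupported", here is a minimal sketch using only the standard java.nio.charset.Charset API. It is not Okapi's or Jericho's actual lookup code; the class and method names (CharsetLookupDemo, isUsable) are made up for this example, and I am only assuming that the filter eventually hands the declared name to a similar lookup.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

public class CharsetLookupDemo {
    // Hypothetical helper: true only when the JVM can resolve the declared name.
    public static boolean isUsable(String declaredName) {
        try {
            return Charset.isSupported(declaredName);
        } catch (IllegalCharsetNameException e) {
            // A name containing '"' is not a legal charset name, so the lookup rejects it.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isUsable("UTF-8"));    // the name as it should be extracted
        System.out.println(isUsable("UTF-8\""));  // the name with the stray quotation mark
    }
}
```

The first call resolves normally, while the second is rejected because the quotation mark is not a legal character in a charset name.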
This input works correctly: <meta http-equiv="Content-Type" content=\"html; charset=UTF-8\";>. When the quote has an escape character, it correctly takes only UTF-8.
This seems to be a bug in the HTML parser. I wonder whether this will be fixed, or whether there are any suggestions for dealing with it on our end. Ideally we want to avoid a hard-coded logic check for inputs w/ &quot;.
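For reference, the kind of check on our end that we would rather not maintain could look roughly like the sketch below. It uses only standard Java; MetaCharsetSanitizer, cleanCharsetName and resolveOrDefault are hypothetical names invented for this example, not part of Okapi or Jericho, and the set of characters stripped is an assumption about what stray debris can appear.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MetaCharsetSanitizer {
    // Hypothetical helper: strips quotation-mark entities, literal quotes,
    // backslashes, semicolons and whitespace from a declared charset name.
    public static String cleanCharsetName(String raw) {
        String name = raw.replace("&quot;", "").replace("&#34;", "");
        return name.replaceAll("[\"'\\\\;\\s]", "");
    }

    // Resolves the cleaned name, falling back when it is still illegal or unknown.
    public static Charset resolveOrDefault(String raw, Charset fallback) {
        try {
            return Charset.forName(cleanCharsetName(raw));
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and UnsupportedCharsetException.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveOrDefault("UTF-8\"", StandardCharsets.ISO_8859_1));
        System.out.println(resolveOrDefault("UTF-8&quot;", StandardCharsets.ISO_8859_1));
    }
}
```

The obvious drawback is that this duplicates charset handling outside the filter, which is why a fix or an option in the parser itself would be preferable.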
Comments (2)
- changed status to resolved
workaround provided
This is a problem with
it would have to be fixed in Jericho as you mentioned above - but here is a workaround.
You can disable the tidy preprocess with this new option: