HTML filter: encoding detection rejects StartTag meta

Issue #611 open
Sebastian Ebert created an issue

I am using Rainbow Version 6.0.33

Java Version 1.8.0_131

Windows 7

If I use the okf_html filter to process the files attached, I get error messages:

=== Start process
Input: /C:/Users/sebert/Desktop/rainbow/index.html
ERROR: StartTag meta at (r14,c1,p953) rejected because it has no closing '>' character

Error count: 1, Warning count: 0
Process duration: 0h 0m 0s 821ms
=== End process

It took me about 3 hours to find the probable cause. The original file is a UTF-8 encoded HTML5 file (it seems to be HTML5, at least). The head section does not contain any charset declarations.

If I add <meta charset="utf-8">, it still does not work.

If I add <meta http-equiv="content-type" content="text/html; charset=utf-8">, it works fine.

I suspect Rainbow is not able to handle HTML5 charset information, or the error message is completely wrong.

Please find the complete project including 3 different source files attached.

Comments (5)

  1. Chase Tingley

    Hi Sebastian,

    It looks to me like this error is non-fatal. With both tikal and Rainbow, the filter extracts the content it is supposed to, even though it logs that error.

    I did some debugging, and it turns out the error is related to encoding-declaration processing, as you suspected. However, it's a little stranger than that. Before the real parsing begins, the filter does a "first pass" in which it scans the start of the file to try to guess the encoding. It does this by taking the first 1024 characters of the file and passing them to the Jericho parser (on which the filter is built). Jericho then looks through the meta tags in the head for one that looks like a suitable encoding declaration.
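
    To make that concrete, here is a rough sketch of what such a first pass could look like. This is not the actual Okapi code; the class name, the preview-size constant, and the charset-parsing regex are illustrative assumptions.

        import java.util.List;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        import net.htmlparser.jericho.HTMLElementName;
        import net.htmlparser.jericho.Source;
        import net.htmlparser.jericho.StartTag;

        public class EncodingSniffer {

            // Hypothetical preview size; the filter currently uses 1024 characters.
            private static final int PREVIEW_SIZE = 1024;

            private static final Pattern CHARSET_PATTERN =
                    Pattern.compile("charset\\s*=\\s*([^;\\s]+)", Pattern.CASE_INSENSITIVE);

            /**
             * Scans only the first PREVIEW_SIZE characters of the document for a
             * charset declaration; returns null if none is found in that window.
             */
            public static String sniffEncoding(String htmlContent) {
                String preview = htmlContent.length() > PREVIEW_SIZE
                        ? htmlContent.substring(0, PREVIEW_SIZE) // may cut a tag in half
                        : htmlContent;
                Source source = new Source(preview);
                List<StartTag> metaTags = source.getAllStartTags(HTMLElementName.META);
                for (StartTag meta : metaTags) {
                    // HTML5 form: <meta charset="utf-8">
                    String charset = meta.getAttributeValue("charset");
                    if (charset != null) {
                        return charset;
                    }
                    // HTML4 form: <meta http-equiv="content-type" content="text/html; charset=utf-8">
                    String httpEquiv = meta.getAttributeValue("http-equiv");
                    String content = meta.getAttributeValue("content");
                    if ("content-type".equalsIgnoreCase(httpEquiv) && content != null) {
                        Matcher m = CHARSET_PATTERN.matcher(content);
                        if (m.find()) {
                            return m.group(1);
                        }
                    }
                }
                return null; // no declaration found in the preview window
            }
        }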

    In the not-working.html file in your package, there is no encoding declaration. However, the cause of the error is that the <head> section is unusually large -- 1024 characters doesn't even cover all of it. So the first pass through Jericho dies partway through a meta tag that lies on that 1024-character boundary.

    From further debugging, the <meta charset="utf-8"> syntax does get parsed properly, at least in Jericho 3.4 (which is used in M32 and M33). Even so, the error still occurs because Jericho still dies on the truncated tag.

    It looks like the right fix here is just to expand the size of the preview buffer. If I set it to 4k, the error goes away.

  2. Sebastian Ebert reporter

    Thanks for the explanations. Three remarks on this:

    • I only sent you a shortened file. In the original file, the head section is 135KB (!), because the CMS puts a lot of meta information and also lots of CSS in the head section. One could regard that as bad style, but if increasing the size of the preview buffer does not cause serious performance issues, I would suggest setting it to 500k.

    • Even with a larger buffer, could you use a different error message (maybe "downgrade" it to just a warning, or suppress it entirely)?

    • Some of my files don't have charset information at all. The encoding at the file level, however, is UTF-8 (according to Notepad++). What charset does Jericho default to if it doesn't find any charset information?

  3. Chase Tingley
    • changed status to open

    Holy smokes, 135k! I'm reopening this for now.

    In answer to your other questions:

    • Unfortunately, the error message comes from Jericho itself, not Okapi, so it's difficult to squelch it without also disabling a bunch of real error messages.
    • If Jericho can't detect the encoding, Okapi falls back to whatever is set on the RawDocument: the -ie parameter if you're running with tikal, or the encoding parameter set in Rainbow if you're using that (see the sketch below).
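
    To illustrate that fallback, here is a minimal sketch of driving the HTML filter from code. The file path and locale are placeholders; the point is that the encoding passed to the RawDocument is only used as a default when the document itself doesn't declare one.

        import java.io.File;

        import net.sf.okapi.common.Event;
        import net.sf.okapi.common.LocaleId;
        import net.sf.okapi.common.resource.RawDocument;
        import net.sf.okapi.filters.html.HtmlFilter;

        public class FallbackEncodingExample {
            public static void main(String[] args) {
                // "UTF-8" here plays the same role as tikal's -ie parameter or
                // Rainbow's input-encoding setting: it is only a default, used
                // when the document itself does not declare a charset.
                RawDocument rawDoc = new RawDocument(
                        new File("index.html").toURI(), "UTF-8", LocaleId.ENGLISH);

                HtmlFilter filter = new HtmlFilter();
                try {
                    filter.open(rawDoc);
                    while (filter.hasNext()) {
                        Event event = filter.next();
                        // ... handle extracted events ...
                    }
                } finally {
                    filter.close();
                }
            }
        }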

    If the headers are going to be arbitrarily large, we might as well just always scan the whole document rather than risk these mysterious errors. However, I think there are better ways to call Jericho than what we're currently doing. Right now we ask it to parse all the tags out of the content we give it, and then iterate over that list. If we do a more careful parse that only reads tags as needed and stops when it hits </head>, I think we can avoid this type of error.
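
    Roughly, I mean something like the following. This is just a sketch of the idea using Jericho's incremental tag lookup, not the actual filter code, and the class and method names are made up.

        import net.htmlparser.jericho.EndTag;
        import net.htmlparser.jericho.HTMLElementName;
        import net.htmlparser.jericho.Source;
        import net.htmlparser.jericho.StartTag;
        import net.htmlparser.jericho.Tag;

        public class HeadOnlySniffer {

            /** Walks tags one at a time and stops as soon as </head> is reached. */
            public static String sniffEncoding(String htmlContent) {
                Source source = new Source(htmlContent);
                int pos = 0;
                Tag tag;
                while ((tag = source.getNextTag(pos)) != null) {
                    if (tag instanceof EndTag
                            && HTMLElementName.HEAD.equals(tag.getName())) {
                        break; // nothing after </head> can declare the charset
                    }
                    if (tag instanceof StartTag
                            && HTMLElementName.META.equals(tag.getName())) {
                        String charset = ((StartTag) tag).getAttributeValue("charset");
                        if (charset != null) {
                            return charset;
                        }
                        // (the http-equiv/content form would be handled here as well)
                    }
                    pos = tag.getEnd();
                }
                return null;
            }
        }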
