HTML filter: encoding detection rejects StartTag meta

Issue #611 open
Sebastian Ebert created an issue

I am using Rainbow Version 6.0.33

Java Version 1.8.0_131

Windows 7

If I use the okf_html filter to process the files attached, I get error messages:

=== Start process
Input: /C:/Users/sebert/Desktop/rainbow/index.html
ERROR: StartTag meta at (r14,c1,p953) rejected because it has no closing '>' character

Error count: 1, Warning count: 0
Process duration: 0h 0m 0s 821ms
=== End process

It took me about 3 hours to find the probable cause. The original file is a UTF-8 encoded HTML5 file (it seems to be HTML5, at least). The head section does not contain any charset declarations.

If I add <meta charset="utf-8">, it still does not work.

If I add <meta http-equiv="content-type" content="text/html; charset=utf-8">, it works fine.

I suspect Rainbow is not able to handle HTML5 charset information, or the error message is completely wrong.

Please find the complete project including 3 different source files attached.

Comments (5)

  1. Chase Tingley

    Hi Sebastian,

    It looks to me like this error is non-fatal. With both tikal and Rainbow, the filter extracts the content it is supposed to, even though it logs that error.

    I did some debugging, and it turns out the error is related to encoding-declaration processing, as you suspected. However, it's a little stranger than that. Before the real parsing begins, the filter does a "first pass" in which it scans the start of the file to try to guess the encoding. It does this by taking the first 1024 characters of the file and passing them to the Jericho parser (on which the filter is built). Jericho then looks through the meta tags in the head for one that looks like a suitable encoding declaration.
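
    To make that concrete, here is a rough sketch of what such a first pass could look like. This is not the actual Okapi code; the class name, the preview-size constant, and the charset-parsing regex are illustrative assumptions.

        import java.util.List;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        import net.htmlparser.jericho.HTMLElementName;
        import net.htmlparser.jericho.Source;
        import net.htmlparser.jericho.StartTag;

        public class EncodingSniffer {

            // Hypothetical preview size; the filter currently uses 1024 characters.
            private static final int PREVIEW_SIZE = 1024;

            private static final Pattern CHARSET_PATTERN =
                    Pattern.compile("charset\\s*=\\s*([^;\\s]+)", Pattern.CASE_INSENSITIVE);

            /**
             * Scans only the first PREVIEW_SIZE characters of the document for a
             * charset declaration; returns null if none is found in that window.
             */
            public static String sniffEncoding(String htmlContent) {
                String preview = htmlContent.length() > PREVIEW_SIZE
                        ? htmlContent.substring(0, PREVIEW_SIZE) // may cut a tag in half
                        : htmlContent;
                Source source = new Source(preview);
                List<StartTag> metaTags = source.getAllStartTags(HTMLElementName.META);
                for (StartTag meta : metaTags) {
                    // HTML5 form: <meta charset="utf-8">
                    String charset = meta.getAttributeValue("charset");
                    if (charset != null) {
                        return charset;
                    }
                    // HTML4 form: <meta http-equiv="content-type" content="text/html; charset=utf-8">
                    String httpEquiv = meta.getAttributeValue("http-equiv");
                    String content = meta.getAttributeValue("content");
                    if ("content-type".equalsIgnoreCase(httpEquiv) && content != null) {
                        Matcher m = CHARSET_PATTERN.matcher(content);
                        if (m.find()) {
                            return m.group(1);
                        }
                    }
                }
                return null; // no declaration found in the preview window
            }
        }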

    In the not-working.html file in your package, there is no encoding declaration. However, the cause of the error is that the <head> section is unusually large -- 1024 characters doesn't even cover all of it. So the first pass through Jericho dies partway through a meta tag that lies on that 1024-character boundary.

    From further debugging, the <meta charset="utf-8"> syntax does get parsed properly, at least in Jericho 3.4 (which is used in M32 and M33). Even so, the error still occurs because Jericho still dies on the truncated tag.

    It looks like the right fix here is just to expand the size of the preview buffer. If I set it to 4k, the error goes away.

  2. Sebastian Ebert reporter

    Thanks for the explanations. Three remarks on this:

    • I only sent you a shortened file. In the original file, the head section is 135KB (!), because the CMS puts a lot of meta information and also lots of CSS in the head section. One could regard that as bad style, but if increasing the size of the preview buffer does not cause serious performance issues, I would suggest setting it to 500k.

    • Even with a larger buffer, could you use a different error message (maybe "downgrade" it to just a warning, or suppress it entirely)?

    • Some of my files don't have charset information at all. The encoding at the file level, however, is UTF-8 (according to Notepad++). What charset does Jericho default to if it doesn't find any charset information?

  3. Chase Tingley
    • changed status to open

    Holy smokes, 135k! I'm reopening this for now.

    In answer to your other questions:

    • Unfortunately, the error message comes from Jericho itself, not Okapi, so it's difficult to squelch it without also disabling a bunch of real error messages.
    • If Jericho can't detect the encoding, Okapi falls back to whatever is set on the RawDocument: the -ie parameter if you're running with tikal, or the encoding parameter set in Rainbow if you're using that (see the sketch below).
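
    To illustrate that fallback, here is a minimal sketch of driving the HTML filter from code. The file path and locale are placeholders; the point is that the encoding passed to the RawDocument is only used as a default when the document itself doesn't declare one.

        import java.io.File;

        import net.sf.okapi.common.Event;
        import net.sf.okapi.common.LocaleId;
        import net.sf.okapi.common.resource.RawDocument;
        import net.sf.okapi.filters.html.HtmlFilter;

        public class FallbackEncodingExample {
            public static void main(String[] args) {
                // "UTF-8" here plays the same role as tikal's -ie parameter or
                // Rainbow's input-encoding setting: it is only a default, used
                // when the document itself does not declare a charset.
                RawDocument rawDoc = new RawDocument(
                        new File("index.html").toURI(), "UTF-8", LocaleId.ENGLISH);

                HtmlFilter filter = new HtmlFilter();
                try {
                    filter.open(rawDoc);
                    while (filter.hasNext()) {
                        Event event = filter.next();
                        // ... handle extracted events ...
                    }
                } finally {
                    filter.close();
                }
            }
        }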

    If the headers are going to be arbitrarily large, we might as well just always scan the whole document rather than risk these mysterious errors. However, I think there are better ways to call Jericho than what we're currently doing. Right now we ask it to parse all the tags out of the content we give it, and then iterate over that list. If we do a more careful parse that only reads tags as needed and stops when it hits </head>, I think we can avoid this type of error.
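
    Roughly, I mean something like the following. This is just a sketch of the idea using Jericho's incremental tag lookup, not the actual filter code, and the class and method names are made up.

        import net.htmlparser.jericho.EndTag;
        import net.htmlparser.jericho.HTMLElementName;
        import net.htmlparser.jericho.Source;
        import net.htmlparser.jericho.StartTag;
        import net.htmlparser.jericho.Tag;

        public class HeadOnlySniffer {

            /** Walks tags one at a time and stops as soon as </head> is reached. */
            public static String sniffEncoding(String htmlContent) {
                Source source = new Source(htmlContent);
                int pos = 0;
                Tag tag;
                while ((tag = source.getNextTag(pos)) != null) {
                    if (tag instanceof EndTag
                            && HTMLElementName.HEAD.equals(tag.getName())) {
                        break; // nothing after </head> can declare the charset
                    }
                    if (tag instanceof StartTag
                            && HTMLElementName.META.equals(tag.getName())) {
                        String charset = ((StartTag) tag).getAttributeValue("charset");
                        if (charset != null) {
                            return charset;
                        }
                        // (the http-equiv/content form would be handled here as well)
                    }
                    pos = tag.getEnd();
                }
                return null;
            }
        }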
