UTF-8 module errors parsing non-UTF-8 text

Issue #138 new
Stephen Abrams
created an issue

I have attached a text file that DROID identifies both as x-fmt/22, which maps to http://jhove2.org/terms/format/utf-8/ascii in JHOVE2, and as x-fmt/283 (8-bit ASCII), which doesn’t map to anything in JHOVE2. (This is an “extract” from an actual file we processed here, which BSD file identifies as “Non-ISO extended-ASCII text, with CRLF line terminators”.)

When the UTF-8 module processes the file and encounters the first character, a “forward quote” (dec 147, 0x93), we get:

InvalidCharacter {UTF8Character}:
  CodePoint: -1
  Size (byte): 0
  isC0Control: false
  isC1Control: false
  InvalidByteValues:
    InvalidByteValue: [ERROR/OBJECT] Invalid byte value for byte number 0 at offset 1 : 147
  CodePointOutOfRange: [ERROR/OBJECT] Code point out of range at offset 0 : -1
  isValid: false
  Coverage: Inclusive
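For context, here is a minimal sketch of why 0x93 is rejected, using only JDK classes. The class name Utf8LeadByteCheck is hypothetical and not part of JHOVE2, and decoding the sample as windows-1252 is only an assumption about the file's real encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of JHOVE2.
final class Utf8LeadByteCheck {

    // True if the byte matches the UTF-8 continuation pattern 10xxxxxx,
    // which can never be the first byte of an encoded character.
    static boolean isContinuationByte(int b) {
        return (b & 0xC0) == 0x80;
    }

    // True if the bytes decode as UTF-8 without any malformed input.
    static boolean isWellFormedUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Bytes 147 and 148 around plain ASCII text, as in the attached file.
        byte[] sample = { (byte) 0x93, 'h', 'i', (byte) 0x94 };
        System.out.println(isContinuationByte(0x93)); // true
        System.out.println(isWellFormedUtf8(sample)); // false
        // Assumption: the file is really windows-1252, where 0x93 and 0x94
        // are the curly double quotation marks.
        System.out.println(new String(sample, Charset.forName("windows-1252")));
    }
}
```

Because 0x93 has the bit pattern 10010011, a UTF-8 parser sees a continuation byte where a lead byte is required, so no code point is ever decoded for it.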

A couple of things here. A CodePoint value of -1 and a size of 0 are confusing (they are the values these fields are initialized to when a UTF8Character object is created). We might want to consider using our mechanism for “human-readable” parallel values to explain this or, alternatively, suppress the output of CodePoint if negative and of size if not positive. It would help to use those values as well in the CodePointOutOfRange error message. Minimally, we might want to put this in the FAQs and/or in the module documentation.
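A rough sketch of the suppression option, assuming a reporting step that receives the raw field values. Utf8CharacterReport and describe are hypothetical names; the real JHOVE2 displayer API is not shown in this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reporting helper; not the actual JHOVE2 displayer.
final class Utf8CharacterReport {

    // Builds the report lines for one character, leaving out fields that
    // still hold their initial values (CodePoint -1, size 0).
    static List<String> describe(int codePoint, long size) {
        List<String> lines = new ArrayList<>();
        if (codePoint >= 0) {
            lines.add("CodePoint: " + codePoint);
        }
        if (size > 0) {
            lines.add("Size (byte): " + size);
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(describe(-1, 0)); // nothing reported for an undecoded character
        System.out.println(describe(65, 1)); // both fields reported for 'A'
    }
}
```

With this shape, an undecodable byte produces no CodePoint or size lines at all, so the reader is left with only the error messages that describe the actual problem.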

The 2nd “I wonder” is more general: we have named this a UTF-8 format module, but should we be thinking about a Text format module that reports specific encodings (UTF-8, etc.)?

NB: For the 2 invalid characters (the forward and backward quotes, 147 and 148), ch.getCodeBlock() returns null, so only the code blocks for the valid characters get added, and this file is reported as valid according to the ASCII profile. Is this a “design feature”?
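To illustrate the effect, here is a simplified model of the two possible behaviours. It assumes that an invalid character is represented by a code point of -1 and that the profile decision is based on the set of code blocks seen; AsciiProfileSketch and both method names are hypothetical, and the real JHOVE2 profile logic may differ.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the ASCII profile decision; not JHOVE2 code.
final class AsciiProfileSketch {

    // Mirrors the behaviour described above: characters without a code
    // block are skipped, so invalid bytes cannot affect the result.
    static boolean isAsciiIgnoringInvalid(int[] codePoints) {
        Set<Character.UnicodeBlock> blocks = new HashSet<>();
        for (int cp : codePoints) {
            // UnicodeBlock.of throws on negative input, so guard first.
            Character.UnicodeBlock block = cp < 0 ? null : Character.UnicodeBlock.of(cp);
            if (block != null) {
                blocks.add(block);
            }
        }
        for (Character.UnicodeBlock block : blocks) {
            if (block != Character.UnicodeBlock.BASIC_LATIN) {
                return false;
            }
        }
        return true;
    }

    // Stricter alternative: any invalid or non-ASCII character fails the profile.
    static boolean isAsciiStrict(int[] codePoints) {
        for (int cp : codePoints) {
            if (cp < 0 || cp > 0x7F) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] sample = { -1, 'h', 'i', -1 }; // -1 marks the undecodable quotes
        System.out.println(isAsciiIgnoringInvalid(sample)); // true
        System.out.println(isAsciiStrict(sample));          // false
    }
}
```

Under the first method the sample passes as ASCII even though it contains two undecodable bytes, which matches the result reported here; the second method shows one way the profile could reject it.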

Comments (2)

  1. Stephen Abrams reporter

    I think the best solution is, as you suggest, to suppress the CodePoint if -1 (and Size if 0), and define a new error message to take the place of CodePointOutOfRange that reports in terms of byte values, not code point values.

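One possible shape for the byte-oriented error message suggested in comment 1, as a sketch only. The class name InvalidByteMessage and the message wording are assumptions, not an existing JHOVE2 message.

```java
import java.util.Locale;

// Hypothetical message formatter; the wording is illustrative only.
final class InvalidByteMessage {

    // Reports the offending byte as an unsigned decimal and hex value,
    // without referring to a code point that was never decoded.
    static String format(int byteValue, long offset) {
        int unsigned = byteValue & 0xFF; // Java bytes are signed; normalize to 0..255
        return String.format(Locale.ROOT,
                "Invalid UTF-8 byte at offset %d: %d (0x%02X)",
                offset, unsigned, unsigned);
    }

    public static void main(String[] args) {
        // prints: Invalid UTF-8 byte at offset 0: 147 (0x93)
        System.out.println(format((byte) 0x93, 0));
    }
}
```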