unicode (utf-8) numeric entity wrong

Issue #3 resolved
Former user created an issue

Using Notepad++ 8.3.3

Test case right single quotation mark, U+2019 ISOnum -> ’

UTF-8 encoded text source:
a) With HTML Tag version 1.2.1 (Windows 10/11) - ’ -> &#226;&euro;&#8482;
b) With HTML Tag version 1.0.0 (Windows 10/11) - ’ -> &#8217;

ANSI encoded text source:
c) With HTML Tag version 1.2.1 (Windows 10/11) - ’ -> ’
d) With HTML Tag version 1.0.0 (Windows 10/11) - ’ -> ’

ANSI encoded (if you decode with [Ctrl] + [Shift] + [e], then encode with [Ctrl] + [e]):
e) HTML Tag both versions - ’ -> ’
f) HTML Tag both versions - ’’ -> ’

a = f

The 1.2.1 bug seems to convert the character to ANSI and then to UNICODE. Other characters are likewise affected.
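The suspected round trip can be reproduced outside the plugin. The following is a minimal Python sketch, not the plugin's code: it assumes "ANSI" means Windows-1252, and the helper names `misread_utf8_as_ansi` and `to_numeric_entities` are made up for illustration.

```python
def misread_utf8_as_ansi(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


quote = "\u2019"  # right single quotation mark

# Expected result when the document encoding is respected.
print(to_numeric_entities(quote))

# Result when the UTF-8 bytes E2 80 99 are treated as three ANSI characters.
garbled = misread_utf8_as_ansi(quote)
print(to_numeric_entities(garbled))
```

The second result has three references instead of one, because the three UTF-8 bytes were each treated as a separate character. The middle character is the euro sign (U+20AC), which an encoder that prefers named entities would write as `&euro;`.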

left single quotation mark, U+2018 ISOnum -> ‘
right single quotation mark, U+2019 ISOnum -> ’
left double quotation mark, U+201C ISOnum -> “
right double quotation mark, U+201D ISOnum -> ”

-> ’ ‘ “ ”

IMO, major.

Comments (8)

  1. Tom Nowacki

    Only to clarify, I’m using the 32-bit version of Notepad++ 8.3.3.

    Notepad++ 8.3.3 [32-bit]: HTML Tag 1.3 [32-bit] - does not encode/decode entities in ANSI documents at all.

    Notepad++ 8.3.3 [32-bit]: HTML Tag 1.2.1 [32-bit] - bug described above when encoding entities in documents that are saved in UTF-8 format.

    Notepad++ 8.3.3 [32-bit]: HTML Tag 1.0 works fine.

    Output in encoding right single quotation mark:

    &#226;&euro;&#8482;

    not

    &#8217;

  2. Tom Nowacki

    Thanks! The tests went well.

    Notepad++ 8.4.2/8.3.3 [32-bit]: HTML Tag 1.3.2 encodes/decodes ANSI/UTF-8. [fixed]
    Notepad++ 8.4.2/8.3.3 [32-bit]: HTML Tag 1.3.2 [32-bit] numeric entities. [fixed]
    Notepad++ 8.4.2 [64-bit] portable: HTML Tag 1.3.2 [64-bit]

    • encodes/decodes ANSI/UTF-8. [fixed]
    • numeric entities. [fixed]

    So, that was very fast. I can’t vouch for all versions of Notepad++; however, the HTML Tag plugin is working fine in the versions I tested.

    ’ was a strange one, ’ backwards. It went back further than 1.3.0.

    Thank you very much. I do appreciate it.

  3. rdipardo repo owner

    ’ was a strange one, ’ backwards.

    I'm still a bit fuzzy as to what you mean. Were the decoded characters in reverse order? Whatever used to happen, does it still happen now?

    It may be worth noting here that the original developer made the choice to ignore named entities in XML files. As explained in this thread:

    XML does not support named entities [...], so the plugin doesn't use them. To get named entities, use Notepad++'s Language menu to choose HTML.

    To illustrate, save your sample text as XML; only the numeric entities are translated:

    <element>&#226;&euro;&#8482;</element>
    <!-- decoded as: -->
    <element>â&euro;™</element>
    

    Set the buffer's language to HTML, or any other file type; now everything is translated:

    <h2>&#226;&euro;&#8482;</h2>
    <!-- decoded as: -->
    <h2>â€™</h2>
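
The difference between the two modes can be imitated in a few lines of Python. This is only a sketch of the behaviour described above, not the plugin's implementation: `decode_numeric_only` is a hypothetical helper that handles decimal references only, and the standard library's `html.unescape` stands in for the HTML mode.

```python
import html
import re

# Matches decimal character references such as &#8482;
NUMERIC_REF = re.compile(r"&#(\d+);")


def decode_numeric_only(text: str) -> str:
    """XML-like mode: translate decimal references, leave named entities alone."""
    return NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), text)


def decode_all(text: str) -> str:
    """HTML-like mode: translate numeric references and named entities."""
    return html.unescape(text)


sample = "&#226;&euro;&#8482;"
print(decode_numeric_only(sample))  # &euro; is left untouched
print(decode_all(sample))           # all three references are translated
```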
    

  4. Tom Nowacki

    @Robert Di Pardo

    I don’t know if this will serve to clarify it … 'by default' non-ascii character quotation marks ‘ ’ “ ” are converted to named entities, lsquo, rsquo, ldquo, rdquo. My HTMLTag-entities.ini configuration disables named entities in favor of numeric entities, &#8216; &#8217; &#8220; &#8221;. A personal preference.
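
For readers unfamiliar with the two styles, here is a small Python sketch of the choice between named and numeric entities. It does not reflect how HTMLTag-entities.ini is parsed; the `use_named` flag is a hypothetical stand-in for that preference, and the name table comes from Python's `html.entities` module.

```python
from html.entities import codepoint2name


def encode_entities(text: str, use_named: bool = True) -> str:
    """Encode non-ASCII characters as named or numeric entities."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)  # plain ASCII passes through unchanged
        elif use_named and cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # e.g. &rsquo;
        else:
            out.append(f"&#{cp};")  # e.g. &#8217;
    return "".join(out)


quotes = "\u2018\u2019\u201c\u201d"
print(encode_entities(quotes, use_named=True))
print(encode_entities(quotes, use_named=False))
```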

    What I think happened was that, in documents saved as UTF-8, the characters were initially read as though they were in ANSI format. The plugin read ’ as â€™ and encoded it to &#226;&euro;&#8482;. I decoded this back to ’ [Shift-Ctrl-e].

    The error is related to the codepage issue discussed in the linked thread, https://community.notepad-plus-plus.org/topic/22503/new-version-of-html-tag/24, because if the plugin did not check whether the document was saved as ANSI or UNICODE, it perhaps assumed ANSI. A multi-byte UTF-8 character read as raw single-byte ANSI looked like â€™. ??
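
The kind of check being guessed at here might look like the Python sketch below. It is an illustration only: the plugin's real logic is not shown in this thread, `decode_buffer` is a hypothetical helper, and treating every non-UTF-8 buffer as Windows-1252 is an assumption (the actual ANSI code page depends on the system). The value 65001 is the identifier Scintilla uses for UTF-8 (`SC_CP_UTF8`).

```python
SC_CP_UTF8 = 65001  # Scintilla's code page identifier for UTF-8


def decode_buffer(raw: bytes, code_page: int) -> str:
    """Decode editor bytes according to the document's code page."""
    if code_page == SC_CP_UTF8:
        return raw.decode("utf-8")
    # Assumption: any other code page is treated as Windows-1252 ("ANSI").
    return raw.decode("cp1252")


# The same character is stored differently in each encoding.
print(decode_buffer(b"\xe2\x80\x99", SC_CP_UTF8) == "\u2019")  # three bytes in UTF-8
print(decode_buffer(b"\x92", 0) == "\u2019")                   # one byte in Windows-1252
```

Skipping the check and always taking the second branch turns the three UTF-8 bytes into three separate characters, which matches the output reported in this issue.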

    Oddly enough, the plugin did another pass and produced &#226;&euro;&#8482;. That was correct under the circumstances, and yet only one pass was necessary to encode to &#8217;.

    It’s nice that you fixed this issue in v1.3.2. I use HTMLTag a lot! I can graduate without qualms to 64-bit Notepad++. 😀

  5. Tom Nowacki

    The latter is not strictly true. I am still searching for a replacement for the 32-bit NPPCalc.

    Your XML vs. HTML example above applies, but the user would already know what to expect from decoding.

  6. rdipardo repo owner

    This issue was resolved in v1.3.2.

    Users working with ANSI-encoded files can manually upgrade from an older version.

    Version 1.3.5 will be installable via the official plugin manager in the next public release of Notepad++.

    Edit: this post originally said that v1.3.4 would be made available for automatic installation; issue #4 explains why the version was bumped.
