unicode (utf-8) numeric entity wrong

Issue #3 resolved

Former user created an issue 2022-06-12

Using Notepad++ 8.3.3

Test case right single quotation mark, U+2019 ISOnum -> ’

UTF-8 encoded text source:
a) With HTML Tag version 1.2.1 (Windows 10/11) - ’ -> &#226;&#8364;&#8482;
b) With HTML Tag version 1.0.0 (Windows 10/11) - ’ -> &#8217;

ANSI encoded text source:
c) With HTML Tag version 1.2.1 (Windows 10/11) - ’ -> &#8217;
d) With HTML Tag version 1.0.0 (Windows 10/11) - ’ -> &#8217;

ANSI encoded (if you decode [Ctrl] + [Shift] + [e], then encode [Ctrl] + [e]):
e) HTML Tag both versions - ’ -> â€™
f) HTML Tag both versions - ’â€™ -> â€™

a = f

The 1.2.1 bug seems to convert the character to ANSI and then to UNICODE. Other characters are likewise affected.
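The suspected double conversion can be reproduced outside the plugin. The following is a minimal Python sketch, not the plugin's code: it assumes the "ANSI" code page is Windows-1252, and the helper names `misread_utf8_as_ansi` and `to_numeric_entities` are made up for illustration.

```python
def misread_utf8_as_ansi(text: str, ansi_codec: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes as if they were ANSI."""
    return text.encode("utf-8").decode(ansi_codec)


def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric entity."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)


# U+2019 is the three UTF-8 bytes E2 80 99; read as Windows-1252 they
# become U+00E2, U+20AC and U+2122 (the "â€™" sequence reported above).
garbled = misread_utf8_as_ansi("\u2019")

print(to_numeric_entities(garbled))   # &#226;&#8364;&#8482;
print(to_numeric_entities("\u2019"))  # &#8217;
```

Python's `cp1252` codec leaves a few bytes (such as 0x9D) undefined, so this sketch raises `UnicodeDecodeError` for ” (U+201D); passing `errors="replace"` to `decode` would avoid that at the cost of an inexact result.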

- left single quotation mark, U+2018 ISOnum -> ‘
- right single quotation mark, U+2019 ISOnum -> ’
- left double quotation mark, U+201C ISOnum -> “
- right double quotation mark, U+201D ISOnum -> ”

-> ’ ‘ “ ”

IMO, major.

Official response

rdipardo repo owner
This issue was resolved in v1.3.2.

Users working with ANSI-encoded files can manually upgrade from an older version.

Version 1.3.5 will be installable via the official plugin manager in the next public release of Notepad++.
- HTMLTag v1.3.5 (32-bit) – Virus Total scan
- HTMLTag v1.3.5 (64-bit) – Virus Total scan
Edit: this post originally said that v1.3.4 would be made available for automatic installation; issue #4 explains why the version was bumped.
- 2022-07-16T03:33:08+00:00

Comments (8)

Tom Nowacki
Only to clarify, I’m using the 32-bit version of Notepad++ 8.3.3.

Notepad++ 8.3.3 [32-bit]: HTML Tag 1.3 [32-bit] - does not encode/decode entities in ANSI documents at all.

Notepad++ 8.3.3 [32-bit]: HTML Tag 1.2.1 [32-bit] - bug described above when encoding entities in documents that are saved in UTF-8 format.

Notepad++ 8.3.3 [32-bit]: HTML Tag 1.0 works fine.

Output when encoding the right single quotation mark:

&#226;&#8364;&#8482;

not

&#8217;
- 2022-06-12T18:28:12+00:00
rdipardo repo owner
The cause of this issue is explained in 280d9fd.

Please confirm that version 1.3.2 works on ANSI-encoded documents.

v1.3.0 has a critical bug that could hang the application if an open tag has no match: https://community.notepad-plus-plus.org/topic/22503/new-version-of-html-tag/24

The links to the v1.3.0 downloads have been removed.
- 2022-06-12T22:36:24+00:00
Tom Nowacki
Thanks! The tests went well.

Notepad++ 8.4.2/8.3.3 [32-bit]: HTML Tag 1.3.2 encodes/decodes ANSI/UTF-8. [fixed]
Notepad++ 8.4.2/8.3.3 [32-bit]: HTML Tag 1.3.2 [32-bit] numeric entities. [fixed]
Notepad++ 8.4.2 [64-bit] portable: HTML Tag 1.3.2 [64-bit]
- encodes/decodes ANSI/UTF-8. [fixed]
- numeric entities. [fixed]
So, that was very fast. I can’t vouch for all versions of Notepad++, however the plugin HTML Tag is working fine in the versions I tested.

â€™ was a strange one, â€™ backwards. It went back further than 1.3.0.

Thank you very much. I do appreciate it.
- 2022-06-12T23:28:16+00:00
rdipardo repo owner
â€™ was a strange one, â€™ backwards.

I'm still a bit fuzzy as to what you mean. Were the decoded characters in reverse order? Whatever used to happen, does it still happen now?

It may be worth noting here that the original developer made the choice to ignore named entities in XML files. As explained in this thread:

XML does not support named entities [...], so the plugin doesn't use them. To get named entities, use Notepad++'s Language menu to choose HTML.

To illustrate, save your sample text as XML; only the numeric entities are translated:
```
<element>&#226;&euro;&#8482;</element>

<element>â&euro;™</element>
```
Set the buffer's language to HTML, or any other file type; now everything is translated:
```
<h2>&#226;&euro;&#8482;</h2>

<h2>â€™</h2>
```
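The difference between the two modes can be approximated in a few lines of Python. This is only an illustrative sketch, not the plugin's implementation: `decode_numeric_only` and `decode_all` are hypothetical names, and the standard library's `html.unescape` stands in for the HTML behaviour.

```python
import html
import re


def decode_numeric_only(text: str) -> str:
    """XML-like mode: decode decimal numeric entities, keep named ones."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)


def decode_all(text: str) -> str:
    """HTML-like mode: decode numeric and named entities alike."""
    return html.unescape(text)


sample = "&#226;&euro;&#8482;"
print(decode_numeric_only(sample))  # â&euro;™
print(decode_all(sample))           # â€™
```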
- 2022-06-13T23:06:22+00:00
Tom Nowacki
@Robert Di Pardo

I don’t know if this will serve to clarify it … By default, the non-ASCII quotation marks ‘ ’ “ ” are converted to the named entities lsquo, rsquo, ldquo, rdquo. My HTMLTag-entities.ini configuration disables named entities in favor of numeric entities, &#8216; &#8217; &#8220; &#8221;. A personal preference.
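The named-versus-numeric choice can be illustrated with a short Python sketch. It does not reproduce how HTMLTag-entities.ini works; the function `encode_entities` and its `prefer_named` flag are hypothetical, and the name table comes from Python's own `html.entities` module.

```python
from html.entities import codepoint2name


def encode_entities(text: str, prefer_named: bool = True) -> str:
    """Encode non-ASCII characters as named or decimal numeric entities."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                        # ASCII passes through
        elif prefer_named and cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # e.g. &rsquo;
        else:
            out.append(f"&#{cp};")                 # e.g. &#8217;
    return "".join(out)


quotes = "\u2018 \u2019 \u201c \u201d"
print(encode_entities(quotes))                      # &lsquo; &rsquo; &ldquo; &rdquo;
print(encode_entities(quotes, prefer_named=False))  # &#8216; &#8217; &#8220; &#8221;
```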

What I think happened was that in UTF-8-saved documents the characters were initially read as though they were in ANSI format. The plugin read ’ as â€™ and encoded it to &#226;&#8364;&#8482;. I decoded this back to â€™ [Shift-Ctrl-e].

The error is related to the codepage issue discussed in the linked thread, https://community.notepad-plus-plus.org/topic/22503/new-version-of-html-tag/24, because if the plugin did not check whether the document was saved as ANSI or Unicode, it perhaps assumed ANSI. A multi-byte ’ character (three bytes in UTF-8) read as raw single-byte ANSI looked like â€™. ??

Oddly enough, the plugin did another pass and produced &#226;&#8364;&#8482;. Right under the circumstances, and yet only one pass was necessary to encode ’ to &#8217;.
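For text that was already saved with the doubled-up entities, the round trip can in principle be reversed. Here is a minimal Python sketch under the same assumption as my guess above (UTF-8 bytes misread as Windows-1252); `repair_double_encoded` is a made-up helper, not something the plugin provides.

```python
import html


def repair_double_encoded(entities: str, ansi_codec: str = "cp1252") -> str:
    """Undo 'UTF-8 read as ANSI, then entity-encoded' for a string."""
    garbled = html.unescape(entities)            # entities -> "â€™"
    raw_bytes = garbled.encode(ansi_codec)       # back to bytes E2 80 99
    return raw_bytes.decode("utf-8")             # reinterpret as UTF-8


fixed = repair_double_encoded("&#226;&#8364;&#8482;")
print(f"U+{ord(fixed):04X}")  # U+2019
```

It only works when every garbled character exists in the ANSI code page, so it is a recovery aid for simple cases rather than a general fix.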

It’s nice that you fixed this issue in v1.3.2. I use HTMLTag a lot! I can graduate without qualms to 64-bit Notepad++.
- 2022-06-14T00:30:38+00:00
Tom Nowacki
The latter is not strictly true. I am still searching for a replacement for 32-bit NPPCalc.

Your XML vs. HTML example above applies, but the user would already know what to expect from decoding.
- 2022-06-14T01:18:13+00:00
rdipardo repo owner
- changed status to resolved
- 2022-07-16T03:34:06+00:00
Assignee: –

Type: bug

Priority: major

Status: resolved