Missing unicode normalization quick check properties

Issue #273 (resolved), created by Brian Gainor

Expected behavior: regex.compile accepts the Unicode quick-check properties NFC_QC, NFKC_QC, NFD_QC and NFKD_QC, so a pattern can match characters that are (or are not) allowed in the normalization forms NFC, NFKC, NFD and NFKD.

Actual behavior: the property is rejected as unknown.

Example:

Python 2.7.10 (default, Jul 14 2015, 19:46:27)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> regex.__version__
'2.4.136'
>>> regex.compile('\p{NFC_QC=Yes}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/regex.py", line 345, in compile
    return _compile(pattern, flags, kwargs)
  File "/Library/Python/2.7/site-packages/regex.py", line 496, in _compile
    caught_exception.pos)
_regex_core.error: unknown property at position 14

The same error occurs for NFKC_QC, NFD_QC and NFKD_QC.

Information on these properties is located at http://unicode.org/reports/tr44/#Decompositions_and_Normalization

Comments (7)

  1. Brian Gainor reporter

    The properties don't seem to be working as expected. I tested five characters which should have the following values:

    Character  NFC_QC  NFKC_QC  NFD_QC  NFKD_QC
    U+0000     Y       Y        Y       Y
    U+00A0     Y       N        Y       N
    U+00C0     Y       Y        N       N
    U+0300     M       M        Y       Y
    U+0340     N       N        Y       Y

    (EDIT: I now realize that U+0340 should be N across the board)

    Instead, all five characters matched each of '\p{NFC_QC=Yes}', '\p{NFKC_QC=Yes}', '\p{NFD_QC=No}' and '\p{NFKD_QC=No}'.

  2. Matthew Barnett repo owner

    I'm getting this:

    Character  NFC_QC  NFKC_QC  NFD_QC  NFKD_QC
    U+0000     Y       Y        N       N
    U+00A0     Y       N        N       Y
    U+00C0     Y       Y        Y       Y
    U+0300     Y       Y        N       N
    U+0340     N       N        Y       Y
    

    In Python 2, regex assumes that you're working on bytestrings unless the pattern contains (?u), and Unicode properties don't apply to bytestrings.

  3. Matthew Barnett repo owner

    Fixed in regex 2018.02.08.

    However, I'm getting this:

    Character  NFC_QC  NFKC_QC  NFD_QC  NFKD_QC
    U+0000     Y       Y        Y       Y
    U+00A0     Y       N        Y       N
    U+00C0     Y       Y        N       N
    U+0300     M       M        Y       Y
    U+0340     N       N        N       N
    

    because the DerivedNormalizationProps.txt file has these lines:

    0340..0341    ; NFD_QC; N # Mn   [2] COMBINING GRAVE TONE MARK..COMBINING ACUTE TONE MARK
    0340..0341    ; NFC_QC; N # Mn   [2] COMBINING GRAVE TONE MARK..COMBINING ACUTE TONE MARK
    0340..0341    ; NFKD_QC; N # Mn   [2] COMBINING GRAVE TONE MARK..COMBINING ACUTE TONE MARK
    0340..0341    ; NFKC_QC; N # Mn   [2] COMBINING GRAVE TONE MARK..COMBINING ACUTE TONE MARK
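
For readers who want to look up such entries themselves, the data lines quoted above have a simple `range ; property ; value # comment` layout. The following is a minimal sketch of reading them, assuming Python 3; it is not the regex module's actual build step, and `parse_qc_line` and `quick_check_values` are hypothetical helper names introduced only for illustration.

```python
def parse_qc_line(line):
    """Parse one data line such as '0340..0341    ; NFD_QC; N # Mn ...'.

    Returns (first, last, prop, value), or None for blank lines,
    comment lines, and lines that do not have exactly three fields.
    """
    data = line.split('#', 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    fields = [field.strip() for field in data.split(';')]
    if len(fields) != 3:
        return None
    code_range, prop, value = fields
    first, _, last = code_range.partition('..')
    first = int(first, 16)
    last = int(last, 16) if last else first  # single code point, no '..'
    return first, last, prop, value


def quick_check_values(lines, codepoint):
    """Collect the explicitly listed *_QC values for one code point."""
    found = {}
    for line in lines:
        parsed = parse_qc_line(line)
        if parsed is None:
            continue
        first, last, prop, value = parsed
        if prop.endswith('_QC') and first <= codepoint <= last:
            found[prop] = value
    return found


# The four lines quoted in the comment above.
LINES = [
    "0340..0341    ; NFD_QC; N # Mn   [2] COMBINING GRAVE TONE MARK..COMBINING ACUTE TONE MARK",
    "0340..0341    ; NFC_QC; N # Mn   [2] COMBINING GRAVE TONE MARK..COMBINING ACUTE TONE MARK",
    "0340..0341    ; NFKD_QC; N # Mn   [2] COMBINING GRAVE TONE MARK..COMBINING ACUTE TONE MARK",
    "0340..0341    ; NFKC_QC; N # Mn   [2] COMBINING GRAVE TONE MARK..COMBINING ACUTE TONE MARK",
]

print(quick_check_values(LINES, 0x0340))
```

A code point that is not listed for a given property yields no entry here; a fuller reader would have to apply the file's default value for that property, which this sketch does not attempt.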
    
  4. Brian Gainor reporter

    You're right, I missed those. I went through all the ranges that are NFC_QC=No, and they're all also No for NFD_QC, so there actually are no characters that are [N, N, Y, Y].

    I got the same results as you with the new version. Thanks for addressing this so quickly!
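
The observation that nothing is [N, N, Y, Y] can also be cross-checked without the data file. The sketch below assumes Python 3 and uses only the standard `unicodedata` module. It relies on the assumption that a single character which is rewritten by a normalization form has the quick-check value No for that form; it cannot tell Yes from Maybe, so it only covers the No columns. The helper names are hypothetical, and the results reflect whichever Unicode version the running Python ships with.

```python
import unicodedata


def changes_under(form, codepoint):
    """True if the single character is altered by normalizing to `form`."""
    ch = chr(codepoint)
    return unicodedata.normalize(form, ch) != ch


def nfc_no_but_nfd_yes(codepoints):
    """Code points that NFC rewrites but NFD leaves unchanged."""
    return [cp for cp in codepoints
            if changes_under('NFC', cp) and not changes_under('NFD', cp)]


# Scan every code point except the surrogate range.
ALL = [cp for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF]

# An empty list here is consistent with the claim in the comment above.
print(nfc_no_but_nfd_yes(ALL))

# Spot-check the five characters from the tables (True corresponds to No).
for cp in (0x0000, 0x00A0, 0x00C0, 0x0300, 0x0340):
    print('U+%04X' % cp,
          [changes_under(form, cp) for form in ('NFC', 'NFKC', 'NFD', 'NFKD')])
```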
