scx (Script Extensions) property currently matches incorrectly

Issue #293 resolved
Ben Yang created an issue

The scx Unicode property currently matches incorrectly for code points that have no explicit Script_Extensions entry in the Unicode data file (https://unicode.org/Public/UCD/latest/ucd/ScriptExtensions.txt).

The script that does match appears to be arbitrary, whereas the data file states:

# All code points not explicitly listed for Script_Extensions
# have as their value the corresponding Script property value
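
For reference, the fallback rule quoted above can be sketched in a few lines of Python. This is only an illustration of the expected behavior, not how the regex module is implemented: the tables SCRIPT and EXPLICIT_SCX and the helpers script_extensions and matches_scx are hypothetical names, and the data is a tiny hand-written excerpt rather than parsed UCD files.

```python
# Hypothetical, hand-written excerpt of the Script property (Scripts.txt).
SCRIPT = {
    "P": "Latin",
    "4": "Common",
    "ت": "Arabic",
    "ज": "Devanagari",
    "\u0964": "Common",  # DEVANAGARI DANDA
}

# Only code points explicitly listed in ScriptExtensions.txt belong here.
# The set for U+0964 is abbreviated for illustration; the real entry
# lists more scripts.
EXPLICIT_SCX = {
    "\u0964": {"Devanagari", "Bengali"},
}


def script_extensions(ch):
    """Return the Script_Extensions set for ch.

    Code points without an explicit entry fall back to a one-element
    set holding their Script value, as the data file header describes.
    """
    explicit = EXPLICIT_SCX.get(ch)
    if explicit is not None:
        return explicit
    # Code points missing from Scripts.txt have the Script value Unknown.
    return {SCRIPT.get(ch, "Unknown")}


def matches_scx(ch, script):
    """Return True if a pattern like \\p{scx=<script>} should match ch."""
    return script in script_extensions(ch)
```

Under this rule, matches_scx("P", "Latin") is true and matches_scx("P", "Ahom") is false, which is the opposite of what the module currently returns.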

Examples of the incorrect matches:

>>> regex.findall(r"\p{scx=Latin}", "P")
[]
>>> regex.findall(r"\p{scx=Ahom}", "P")
['P']
>>> regex.findall(r"\p{scx=Common}", "4")
[]
>>> regex.findall(r"\p{scx=Caucasian_Albanian}", "4")
['4']
>>> regex.findall(r"\p{scx=Arabic}", "ت")
[]
>>> regex.findall(r"\p{scx=Balinese}", "ت")
['ت']
>>> regex.findall(r"\p{scx=Devanagari}", "ज")
[]
>>> regex.findall(r"\p{scx=Batak}", "ज")
['ज']

Tested on Python 3.6.5, Ubuntu 18.04.1 (64-bit).

Comments (2)

  1. Ben Yang reporter

    (Note: this also applies to the four-letter script codes, e.g. Latn, Ahom, Zyyy, Arab, Deva, Bali, etc.)
