Include Script Extensions as a supported Unicode property

Issue #291 resolved
Ben Yang created an issue

Unicode Technical Standard #18 (Unicode Regular Expressions) lists a minimum set of Unicode properties that a regex engine must support to claim basic conformance (requirement RL1.2, Properties).

This module currently supports all of the listed properties except Script_Extensions (scx). Script_Extensions gives the full set of scripts in which a character is commonly used, which can be larger than the single script given by its Script (sc) property. For example, the character ৯ (U+09EF BENGALI DIGIT NINE) has the Script value Bengali (Beng), but because it is also used in the Chakma and Sylheti Nagri scripts, its Script_Extensions value is {Beng Cakm Sylo}.

This matters in practice. If a user needs to extract all Sylheti Nagri characters from a string that also contains Bengali digits, the regular expression "\p{sc=Sylo}" leaves out the digits, even though users consider Bengali digits to be valid Sylheti Nagri characters. Matching on Script_Extensions instead, for example with "\p{scx=Sylo}", would include them.
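
To make the difference concrete, here is a minimal sketch in plain Python that does not use this module at all. The two tables are a tiny hand-copied excerpt covering only the characters in the example, and the helper names (`script_extensions`, `keep_by_script`, `keep_by_script_extensions`) are hypothetical, chosen for illustration. A real implementation would load the full data from the Unicode Character Database.

```python
# Illustrative excerpt only: a real engine would read Scripts.txt and
# ScriptExtensions.txt from the Unicode Character Database.
SCRIPT = {
    "\uA800": "Sylo",  # a Sylheti Nagri letter
    "\uA807": "Sylo",  # another Sylheti Nagri letter
    "\u09EF": "Beng",  # BENGALI DIGIT NINE
}

# Only characters whose Script_Extensions differ from Script are listed.
SCRIPT_EXTENSIONS = {
    "\u09EF": {"Beng", "Cakm", "Sylo"},
}


def script_extensions(ch):
    """Return the Script_Extensions set, defaulting to the Script value."""
    if ch in SCRIPT_EXTENSIONS:
        return SCRIPT_EXTENSIONS[ch]
    # "Zzzz" (Unknown) is used here for anything missing from the toy table.
    return {SCRIPT.get(ch, "Zzzz")}


def keep_by_script(text, script):
    """Keep characters whose Script is `script`, like \\p{sc=...}."""
    return "".join(ch for ch in text if SCRIPT.get(ch, "Zzzz") == script)


def keep_by_script_extensions(text, script):
    """Keep characters whose Script_Extensions contain `script`, like \\p{scx=...}."""
    return "".join(ch for ch in text if script in script_extensions(ch))


text = "\uA800\uA807\u09EF"  # two Sylheti Nagri letters, then a Bengali digit

by_sc = keep_by_script(text, "Sylo")               # the digit is dropped
by_scx = keep_by_script_extensions(text, "Sylo")   # the digit is kept
```

With this toy data, filtering on Script keeps only the two letters, while filtering on Script_Extensions also keeps the Bengali digit, which is the behaviour requested here.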
