Bug in POSIX matching

Issue #180 resolved
animalize created an issue

I found a bug in POSIX matching.

On regex 2015.11.14:

>>> regex.search(r'(?p)a*(.*?)', 'aaabbb').groups()
('aaabbb',)  # <- wrong
>>> regex.search(r'(?p)a*(.*)', 'aaabbb').groups()
('bbb',)

On GNU sed 4.2.1:

user@linux:~$ echo "aaabbb" | sed -E "s/a*(.*?)/\\1/"
bbb
user@linux:~$ echo "aaabbb" | sed -E "s/a*(.*)/\\1/"
bbb
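
For reference, the result I expect follows from the POSIX leftmost-longest rule: the overall match must be the longest one that starts at the leftmost possible position, so even a lazy `.*?` ends up taking the rest of the string. The sketch below only emulates that rule with the standard library `re` module by testing candidate spans with `fullmatch`. The helper name `leftmost_longest` is made up for illustration, and this is not how regex implements `(?p)`. The group contents come from `re`'s own backtracking, which happens to agree with the expected answer for these patterns.

```python
import re


def leftmost_longest(pattern, text):
    """Emulate POSIX leftmost-longest overall matching with stdlib re.

    Hypothetical helper: tries each start position from the left and,
    for each one, the longest candidate end first.
    """
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        for end in range(len(text), start - 1, -1):
            # fullmatch forces the pattern to cover exactly text[start:end].
            m = compiled.fullmatch(text, start, end)
            if m is not None:
                return m
    return None


# Plain backtracking search: the lazy group is satisfied with nothing.
plain = re.search(r'a*(.*?)', 'aaabbb')
print(plain.group(0), plain.groups())    # aaa ('',)

# Leftmost-longest emulation: the overall match covers the whole string,
# so the group has to absorb 'bbb'.
lazy = leftmost_longest(r'a*(.*?)', 'aaabbb')
print(lazy.group(0), lazy.groups())      # aaabbb ('bbb',)

greedy = leftmost_longest(r'a*(.*)', 'aaabbb')
print(greedy.group(0), greedy.groups())  # aaabbb ('bbb',)
```

This is quadratic in the input length, so it is only useful for checking small cases like the ones above.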

Forgive me if it's not a bug. I guess very few people use POSIX matching, so I set the priority to trivial. Take your time.

BTW, the __repr__ of _regex.Pattern doesn't show the POSIX flag:

>>> regex.compile(r'(?p)a*(.*)', regex.POSIX)
regex.Regex('(?p)a*(.*)', flags=regex.V0) # <- no | regex.P

Comments (2)

  1. animalize reporter

    I modified the default flags in _regex_core.py to enable POSIX by default:

    # The default flags for the various versions.
    DEFAULT_FLAGS = {VERSION0: POSIX, VERSION1: FULLCASE|POSIX}
    

    Then I ran some benchmarks:

    Benchmark of .sub() with 15 patterns on 100 MB of data (procedure 1 from issue 167):

           re     regex without POSIX    regex with POSIX
    v0    20.74         16.59                 16.99
    v1     ---          16.67                 17.02
    

    Benchmark of .sub() with 1 pattern on 100 MB of data (procedure 2 from issue 167):

           re     regex without POSIX    regex with POSIX
    v0     1.41          2.28                  9.03
    v1     ---           2.30                  9.05
    

    All of these engine and flag combinations produced the same output (same MD5/SHA1/CRC32).
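
A comparison like this can be reproduced roughly along the following lines. This is a minimal sketch, not the procedure from issue 167: the pattern, replacement, and sample data are placeholders, the helper names `timed_sub` and `fingerprints` are hypothetical, and the standard library `re` stands in for the engine under test. The regex module offers the same `compile`/`sub` interface, so it could be passed in instead, together with whatever flags are being compared.

```python
import hashlib
import re
import time
import zlib


def timed_sub(engine, pattern, repl, data, flags=0):
    """Run one .sub() pass and return (elapsed seconds, result)."""
    compiled = engine.compile(pattern, flags)
    start = time.perf_counter()
    result = compiled.sub(repl, data)
    return time.perf_counter() - start, result


def fingerprints(text):
    """Return (MD5, SHA1, CRC32) of the text, for comparing outputs."""
    raw = text.encode('utf-8')
    return (
        hashlib.md5(raw).hexdigest(),
        hashlib.sha1(raw).hexdigest(),
        zlib.crc32(raw),
    )


# Placeholder input; the real benchmark used 100 MB of data.
data = 'aaabbb foo 123\n' * 1000

elapsed, out = timed_sub(re, r'\d+', '#', data)
print('re: %.4f s' % elapsed, fingerprints(out))
```

Two engines or flag settings agree when their `fingerprints` tuples are equal, which avoids keeping several 100 MB outputs around for a direct comparison.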
