Raise exception for invalid group backreference

Issue #181 wontfix
animalize created an issue


If a pattern contains an invalid group backreference, regex stays silent instead of raising an exception.

>>> print(regex.search(r'\1(a)', 'aa'))
>>> print(regex.search(r'(?r)(a)\1', 'aa'))

The re module raises an error:

>>> re.search(r'\1(a)', 'aa')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python35\lib\re.py", line 173, in search
    return _compile(pattern, flags).search(string)
  File "C:\Python35\lib\re.py", line 293, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Python35\lib\sre_compile.py", line 536, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Python35\lib\sre_parse.py", line 829, in parse
    p = _parse_sub(source, pattern, 0)
  File "C:\Python35\lib\sre_parse.py", line 437, in _parse_sub
    itemsappend(_parse(source, state))
  File "C:\Python35\lib\sre_parse.py", line 524, in _parse
    code = _escape(source, this, state)
  File "C:\Python35\lib\sre_parse.py", line 418, in _escape
    raise source.error("invalid group reference", len(escape))
sre_constants.error: invalid group reference at position 0


I also glanced over the source code; please review this part:

switch (state->charsize) {
case 1:
case 2:
case 4:

If ->charsize can only be 1, 2, or 4, then changing the last case to default, as below, gives a speedup. It's very, very tiny, but it's free.

switch (state->charsize) {
case 1:
case 2:
default:  /* is 4 */

Then let's look at this part:

    unicode = (flags & RE_FLAG_UNICODE) != 0;
    locale = (flags & RE_FLAG_LOCALE) != 0;
    ascii = (flags & RE_FLAG_ASCII) != 0;
    if (!unicode && !locale && !ascii) {
        if (PyBytes_Check(self->pattern))
            ascii = RE_FLAG_ASCII;    // should this be ascii = True
            unicode = RE_FLAG_UNICODE;     // should this be unicode = True

I realize I'm showing my limited skills to an expert here, so it's entirely up to you.

Take your time.

Comments (3)

  1. Matthew Barnett repo owner

    The group does exist; it just hasn't matched anything yet.

    Consider, for example:

    >>> print(regex.search(r'(a)|\1', 'b'))

    It tries to match a, but fails, so it then tries to match \1. That group hasn't matched anything, so it fails again.

    Your example is kind of a variation on that. The group exists, but hasn't matched anything.

    Regexes in, say, Perl do accept that regex.
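
The difference between the two cases discussed here can be checked with the standard-library re module alone. The following is a minimal sketch, not part of the original thread; the helper name try_search is hypothetical, and the third-party regex module is deliberately left out so the snippet runs without extra installs. It shows that re rejects a reference to a group that has not been defined yet at compile time, while a reference to a defined group that simply has not matched just makes that branch fail.

```python
import re


def try_search(pattern, text):
    """Return the match object (or None), or the error if re rejects the pattern.

    Hypothetical helper for illustration only.
    """
    try:
        return re.search(pattern, text)
    except re.error as exc:
        return exc


# \1 appears before group 1 is defined: re refuses to compile the pattern.
forward = try_search(r'\1(a)', 'aa')

# Group 1 is defined but never matches 'b', so \1 fails quietly: no match.
alternation = try_search(r'(a)|\1', 'b')

# Ordinary backreference: group 1 matched 'a', so \1 matches the second 'a'.
normal = try_search(r'(a)\1', 'aa')

print(type(forward).__name__)
print(alternation)
print(normal.group())
```

In this sketch the first call yields an exception object, the second yields None, and the third yields a match on 'aa'. Per the comments in this thread, regex treats the first pattern like the second, returning no match rather than raising.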

  2. animalize reporter

    But re doesn't raise an exception in this example:

    >>> print(re.search(r'(a)|\1', 'b'))

    Never mind, let it pass.

    One thought: if regex replaced re, there would be many known and unknown compatibility problems that would cause trouble.

    So, as far as the standard library is concerned, adding regex as a whole new module would be better than replacing the old one; that ensures absolute compatibility.

    If so, V0 mode is unnecessary: it's a hybrid between re and V1, and there would be no need for such a variant anymore. I would also turn off FULLCASE by default; I suppose most users are programmers, and only text processing needs such a feature.

    Just my opinion. I'm enjoying regex right now; whether or not it is in the standard library doesn't matter to me.
