Extremely slow matching using IGNORECASE flag

Issue #260 new
raphael0202
created an issue

Using the following regex:

_URL_PATTERN = (
    r"^"
    # in order to support the prefix tokenization (see prefix test cases in test_urls).
    r"(?=[\w])"
    # protocol identifier
    r"(?:(?:https?|ftp|mailto)://)?"
    # user:pass authentication
    r"(?:\S+(?::\S*)?@)?"
    r"(?:"
    # IP address exclusion
    # private & local networks
    r"(?!(?:10|127)(?:\.\d{1,3}){3})"
    r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"
    r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
    # IP address dotted notation octets
    # excludes loopback network 0.0.0.0
    # excludes reserved space >= 224.0.0.0
    # excludes network & broadcast addresses
    # (first & last IP address of each class)
    # MH: Do we really need this? Seems excessive, and seems to have caused
    # Issue #957
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
    r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
    r"|"
    # host name
    r"(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)"
    # domain name
    r"(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*"
    # TLD identifier
    r"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
    r")"
    # port number
    r"(?::\d{2,5})?"
    # resource path
    r"(?:/\S*)?"
    # query parameters
    r"\??(:?\S*)?"
    # in order to support the suffix tokenization (see suffix test cases in test_urls),
    r"(?<=[\w/])"
    r"$"
).strip()

The following code is really slow:

import regex

regex.compile(_URL_PATTERN, regex.IGNORECASE).match("snifffffffffffffffffffffffffffffffffffff")

The run time depends on the number of 'f' characters: the more we add, the slower the match.

If we don't use the IGNORECASE flag, the performance issue disappears. The problem is not present in the re module.

Python 3.5, 64-bit, regex==2017.4.5
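This kind of slowdown is characteristic of catastrophic backtracking: when a failing match can be decomposed in exponentially many ways, run time explodes with input length. A minimal illustration with a deliberately pathological pattern (a classic nested-quantifier example, not the spaCy pattern above) using only the stdlib re module:

```python
import re
import time

# Classic catastrophic-backtracking pattern: (f+)+ can split a run of
# 'f' characters into groups in exponentially many ways, and the
# trailing '!' guarantees the overall match fails, forcing the engine
# to try every split before giving up.
pathological = re.compile(r"^(f+)+$")

for n in (10, 13, 16):
    s = "f" * n + "!"
    t0 = time.perf_counter()
    assert pathological.match(s) is None
    # Time grows roughly exponentially with each extra 'f'.
    print(n, f"{time.perf_counter() - t0:.4f}s")
```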

Comments (3)

  1. Matthew Barnett repo owner

    The code you say is really slow is using the re module, and you haven't said what you mean by "really slow".

    The current version is 2017.09.23, and I can't see any speed problem with it.

    As for it becoming slower the longer the target is, well, that's not a problem if it's linear; if it's exponential, then it would be.

    In summary, can't reproduce (with the current release).

  2. raphael0202 reporter

    By "extremely slow", I mean > 10 min. Actually, I had

    import regex as re
    

    at the top of my file, so the code snippet was misleading; I've updated it. I've tried again in a fresh virtualenv with Python 3.6 and there is no performance problem. I'm going to investigate further.

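The linear-vs-exponential distinction raised in the first comment can be checked empirically by timing the same pattern against inputs of doubling length. A sketch using only the stdlib re module, with a deliberately well-behaved pattern standing in for the reported one:

```python
import re
import time

# Time a match against inputs of doubling length. For a well-behaved
# pattern, doubling the input roughly doubles the time (linear);
# a pathological pattern would blow up instead.
pattern = re.compile(r"^[a-z]+$", re.IGNORECASE)
sizes = (100_000, 200_000, 400_000)

timings = []
for n in sizes:
    s = "f" * n
    t0 = time.perf_counter()
    assert pattern.match(s) is not None
    timings.append(time.perf_counter() - t0)

for n, dt in zip(sizes, timings):
    print(f"{n:>7} chars: {dt:.6f}s")
```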