Is the \L option as efficient as it can be?

Issue #376 resolved
40a created an issue

Consider the following script:

from timeit import repeat
from regex import compile

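# five equivalent ways of matching 'ab', 'ac' or 'd'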
atomic_search = compile('(?>a(?>b|c)|d)').search
atomic_set_search = compile('(?>a[bc]|d)').search
normal_search = compile('(?:a(?:b|c)|d)').search
L_search = compile(r'\L<abcd>', abcd=['ab', 'ac', 'd']).search
reverse_set_search = compile('(?>d|a[cb])').search

string = 'a' * 1000

results = []
for search_function in (
    atomic_search, atomic_set_search, normal_search, L_search,
    reverse_set_search
):
    time = min(repeat(
        'search_function(string)', repeat=10, number=10, globals=globals()))
    results.append((
        time,
        f'pattern:\t{search_function.__self__.pattern:<15}\n'
        f'\ttime:\t{time}'))


for r in sorted(results):
    print(r[1])

I’m consistently getting results in the following order:

pattern:    (?:a(?:b|c)|d) 
    time:   0.0008544000000000052
pattern:    (?>d|a[cb])    
    time:   0.0010161999999999949
pattern:    (?>a[bc]|d)    
    time:   0.0011831999999999954
pattern:    (?>a(?>b|c)|d) 
    time:   0.001701099999999997
pattern:    \L<abcd>       
    time:   0.0023178999999999977

which shows that the \L option is not as efficient as the other patterns.

I tried to follow the source code to see the generated pattern for the \L list, but I think the StringSet result is passed directly into the C code, and my C sucks, so I couldn’t dig any deeper.

Background:

I recently learned about the \L option and wanted to test its performance against a pattern generator of my own. You can see the code at https://github.com/5j9/wikitextparser/blob/9dd5fff5ddc3f2bef67aacded21477fa42cdf163/wikitextparser/_config.py#L63 . (Of course, the code cannot handle byte patterns and may have other bugs or limitations.)
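
For context, the general idea behind such a generator is roughly the sketch below (build_group is a hypothetical helper name used only for illustration here, not the actual code from the link above):

import regex

def build_group(words):
    # Escape each literal and sort longest-first so that a longer word
    # is tried before a shorter prefix of it in the alternation.
    escaped = sorted((regex.escape(w) for w in words), key=len, reverse=True)
    return '(?:{})'.format('|'.join(escaped))

# roughly equivalent to compiling r'\L<abcd>' with abcd=['ab', 'ac', 'd']:
manual = regex.compile(build_group(['ab', 'ac', 'd']))
builtin = regex.compile(r'\L<abcd>', abcd=['ab', 'ac', 'd'])
assert manual.search('xacy').group() == builtin.search('xacy').group()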

regex.__version__ == '2.5.74'

sys.version == '3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 23:11:46) [MSC v.1916 64 bit (AMD64)]'

(Windows 10 64bit)

Comments (6)

  1. Matthew Barnett repo owner

    For a small number of strings, it probably isn't as efficient. Does it matter in practice when used as part of a larger pattern? Probably not.

  2. Makyen

    So… I need to apologize. I’ve owed you an issue report about the performance of \L<list> for much longer than should have been the case. I’m sorry I didn’t actually get it done.

    In degenerate cases, we found that using \L<list> made our large regular expressions a few to several orders of magnitude slower than concatenating the list into a (?:list|of|words) group and, prior to compiling the regex, substituting that concatenated group into the larger regular expression in place of the \L<city> we use.1

    Test case

    The Python code below produces the output:

    Using manual \L<city> took 0.00023419999999998997s
    Using regex \L<city> took 1844.0027194000002s
    

    which is a difference of about 7 orders of magnitude. The regex used below, base_regex_text, is an actual subset of the regex we currently have in use (it’s 64 out of a list of ~21,000 regexes which are grouped together for one test). The test_text is the actual post text which caused one of the stall-crashes.

    # coding=utf-8
    
    import regex
    import timeit
    
    base_regex_text = r'(?is)(?:^|\b|(?w:\b))(?:fashionstylestrend\.com|shawncostellogolf21|jquerytraining\.com|alertfreejob\.com|softknoll\.com|acharya\W*institute|antimalwareserviceexecutable\.org|oil-khratyn\.ir|learningdefinition\.com|t\.me/indianfilms|quickbookshelps\.support|high\W*quality\W*undetectable\W*counterfeit\W*banknotes|dronestream\.app|subhuman(?:\W*arab)\W*raghead|how\W?I\W?became\W?a\W?vampire|hpsoftwaredrivers\.com|kbsu\.ru|templeofanswer@hotmail\.co\.uk|drokougbespecialhome@gmail\.com|357067391386912|\L<city>\Wescorts?|union\.ic\.ac\.uk|jagonyaobatherbal|client360\.in|kingrootapkapp\.com|nhtechnologies\.co\.uk|cryptocrusher\.net|laptops?\W?repairs?$|saravanansubramanian\.com|botchief\.blogspot|uxcurate|^Gordon Leland$|pintail\.co\.in|tfa110\.com|TIARATOURS\W?INDONESIA|pitechworld\.com|driver-canon\.com|(?:trusted\W*)?powerfull?\W*rings\W*for\W*(?:(?:money)\W*)?spells|magic\W*rings?\W*will\W*help\W*you\W*win\W*(?:(?:lotto|casino|horses|powerball|betting)\W*(?:and\W*)?)+|mama\W*shuckumah|mamashuckumah@gmail\.com|yahoomailsupportnumber\.com|8613871143350|missxiaoyuan@126\.com|dhx-display\.com|tripspros\.com|mp3converter\.live|clicktrix\.com|helpforemails\.com|newscurrent\.us|muchotrap\.com|turbotax|limerosdelchess\.BlogSpot\.com|pickyassist\.com|currencycapital@yahoo\.com|luckypatchergames\.com|stateoforigini\.uk|state\W?of\W?origin\W?\d+|isbirecruitment\.com|liam\W*jermy\d*|fencestore\.co|rencore\.com|devops\W?online\W?training|customer-help-number\.com)(?:$|\b|(?w:\b))'
    test_text = "<p>shiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiit!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!</p>\n"
    
    city_list = [
        "Agra", "Ahmedabad", "Ajanta", "Almora", "Alwar", "Amritsar", "Andheri",
        "Bangalore", "Banswarabhiwadi", "Bhilwara", "Bhimtal", "Bhiwandi", "Bhopal",
        "Calcutta", "Calicut", "Chandigarh",
        "Chennai", "Chittorgarh", "Coimbatore", "Colaba",
        "Darjeeling", "Dehradun", "Dehrdun", "Delhi", "Dharamshala", "Dharamsala", "Durgapur",
        "Ernakulam", "Faridabad",
        "Ghatkopar", "Ghaziabad", "Gurgaon", "Gurugram",
        "Haldwani", "Haridwar", "Hyderabad",
        "Jaipur", "Jalandhar", "Jim Corbett",
        "Kandivali", "Kangra", "Kanhangad", "Kanhanjad", "Karnal", "Kerala",
        "Kochi", "Kolkata", "Kota",
        "Lokhandwala", "Lonavala", "Ludhiana",
        "Marine Lines", "Mumbai", "Madurai", "Malad", "Mangalore", "Mangaluru", "Mulund",
        "Nagpur", "Nainital", "Nashik", "Neemrana", "Noida",
        "Patna", "Pune",
        "Raipur", "Rajkot", "Ramnagar", "Rishikesh", "Rohini",
        "Sonipat", "Surat",
        "Telangana", "Tiruchi", "Tiruchirappalli", "Thane",
        "Trichinopoly", "Trichy", "Trivandrum", "Thiruvananthapuram",
        "Udaipur", "Uttarakhand",
        "Visakhapatnam", "Worli",
        # not in India
        "Dubai", "Lusail", "Portland",
        # yes, these aren't cities but...
        "India", "Malaysia", "Pakistan", "Qatar",
        # buyabans.com spammer uses creative variations
        "Sri Lanka", "Srilanka", "Srilankan",
    ]
    city_list_as_group = '(?:{})'.format('|'.join(city_list))
    city_list_sub_regex = regex.compile(r'\\L<city>', regex.UNICODE)
    
    
    def format_with_city_list(regex_text):
        return regex.sub(city_list_sub_regex, city_list_as_group, regex_text)
    
    regex_text_with_city_list_substituted = format_with_city_list(base_regex_text)
    
    base_regex = regex.compile(base_regex_text, regex.UNICODE, city=city_list, ignore_unused=True)
    regex_with_city_list_substituted = regex.compile(regex_text_with_city_list_substituted, regex.UNICODE)
    
    def search_base_regex():
        base_regex.search(test_text)
    
    def search_substituted_regex():
        regex_with_city_list_substituted.search(test_text)
    
    
    print('Using manual \\L<city> took {}s'.format(timeit.timeit('search_substituted_regex()', number=1, setup="from __main__ import search_substituted_regex")))
    print('Using regex \\L<city> took {}s'.format(timeit.timeit('search_base_regex()', number=1, setup="from __main__ import search_base_regex")))
    


    1. Background: as you may recall, we happily use your regex package in SmokeDetector, which is a bot that tests all questions and answers posted to Stack Exchange for spam and/or rude/abusive content.

    In at least one case, probably more than one, using the regex-based \L<city> meant it took tens of minutes to test a single post, rather than the normal run rate of a few posts per second. Overall, the issue effectively crashed SmokeDetector. The tens of minutes spent processing a single post made the instance testing that post non-responsive for long enough that the system monitoring the instances declared the instance dead and brought another instance active. The newly active instance started off testing posts from just prior to where the previous instance stalled, so it quickly queued up and started testing the same post, causing the newly active instance to become non-responsive in turn. This repeated through all available SmokeDetector instances until all instances were considered offline.

    We’ve only tracked the problem down to specific posts on a couple of occasions, but the situation is known to have happened a few to several times, probably with different posts as the source of the text being tested. The limited number of known test cases is due to needing someone responsible for an instance to be present when it happens, see the problem, and be available to track down the post at the time. The type of post which causes the issue tends to be deleted from Stack Exchange quickly, so it is not easily available for investigation after the fact.

    Obviously, we’ve changed things both at runtime and in CI testing such that similar stalling issues are detected if introduced in a commit, and are far less likely to cause as significant a problem if something does occur. However, the solution for our use of the \L<city> list was to substitute it in manually, rather than use the capability available from regex.
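
    One caveat for anyone adapting the substitution above: the replacement string is passed to regex.sub(), so any backslashes in it (for example, from escaping the list entries) would be interpreted as escape sequences. Our city names are plain words, so the test case gets away without escaping, but a plain str.replace() sidesteps the pitfall entirely. A minimal sketch (substitute_list is an illustrative name, not our production code; base_regex_text and city_list are from the test case above):

    import regex

    def substitute_list(regex_text, name, words):
        # escape each entry, join into a non-capturing group, and splice it
        # in with a plain string replacement (no regex template semantics)
        group = '(?:{})'.format('|'.join(regex.escape(w) for w in words))
        return regex_text.replace(r'\L<{}>'.format(name), group)

    safer_regex = regex.compile(
        substitute_list(base_regex_text, 'city', city_list), regex.UNICODE)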

  3. Matthew Barnett repo owner
    • changed status to open

    There does appear to be a problem with the speed of \L<...>. I've released regex 2020.6.7, which works around it, but it needs more investigation.

  4. Makyen

    Great! That’s a really substantial improvement. It’s now even better than a straight substitution. Thank you for working on it.
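
    For anyone wanting to reproduce the comparison under the new release, re-running the two timeit calls from the test case above is enough; a condensed sketch (it assumes the test-case definitions are already loaded in __main__):

    import timeit

    for fn in ('search_substituted_regex', 'search_base_regex'):
        t = timeit.timeit('{}()'.format(fn), number=1,
                          setup='from __main__ import {}'.format(fn))
        print('{}: {}s'.format(fn, t))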

    Again, I apologize I didn’t report the issue when I first encountered it. Life was … complicated … at the time and I just didn’t get back to it.
