set qualifiers - feature idea

Issue #11 resolved
Former user created an issue

Some background: I've been working with very large REs in CPython and IronPython. We generate the RE pattern from lists, like lists of cities or lists of names, somewhat like this:

namelist = open("names.txt").read().split()
pattern = re.compile("|".join(namelist))

The one I'm working with now is just a pattern for finding substrings that look like the name of a person. It's overflowing the System::Text::RegularExpressions buffers on IronPython, but works OK with CPython 2.6 on 64-bit Ubuntu.

One of the things I've been thinking is that this kind of pattern should be handled differently. Suppose there were some syntax like

pattern = re.compile("(?S<names>)", names=ImmutableSet(namelist))

where (?S indicates a named ImmutableSet, the members of that set to be drawn from the keyword argument of that name. The compiler would generate a reasonably fast pattern from that set, say a character class made of the union of all characters in all the strings in the set, with a minimum and maximum length taken from the shortest and longest elements of the set. When the engine runs, it would match that fast pattern, and if it matches, it would then check whether the matched text is a member of the named set. If so, the match would be confirmed; if not, it would fail.
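The prefilter-then-verify idea described above can be sketched roughly as follows, using only the standard `re` module. This is an illustration of the proposal, not how any regex engine actually implements it. The helper names `compile_prefilter` and `find_members` are hypothetical, and the set is assumed to contain only non-empty strings.

```python
import re


def compile_prefilter(members):
    """Build a cheap pattern: a run of characters that occur somewhere in
    the set, with a length between the shortest and longest member."""
    chars = sorted(set("".join(members)))
    char_class = "".join(re.escape(c) for c in chars)
    min_len = min(len(m) for m in members)
    max_len = max(len(m) for m in members)
    pattern = "[%s]{%d,%d}" % (char_class, min_len, max_len)
    return re.compile(pattern), min_len


def find_members(text, members):
    """Return the set members found in text, scanning left to right."""
    members = frozenset(members)  # assumption: all members are non-empty
    prefilter, min_len = compile_prefilter(members)
    found = []
    pos = 0
    while True:
        m = prefilter.search(text, pos)
        if m is None:
            break
        candidate = m.group()
        # Verify the candidate against the set, longest prefix first.
        for end in range(len(candidate), min_len - 1, -1):
            if candidate[:end] in members:
                found.append(candidate[:end])
                pos = m.start() + end
                break
        else:
            # The fast pattern matched but no prefix is a member:
            # reject and retry one character further along.
            pos = m.start() + 1
    return found
```

Calling `find_members("333one444", ["one", "two", "three"])` finds `one`, while `"333four444"` yields nothing because `four` never passes the membership check. The cost of the pattern stays small however large the set grows; the work moves into hash lookups.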

Seems like this might be a useful feature for regex to have, given the popularity of this kind of machine-generated RE.

Comments (15)

  1. Anonymous

    Thinking about this a bit more, it would be more appropriate to use something like "`\L<name>`" instead of "`(?S<name>)`".

  2. Anonymous

    Could you provide me with some test data so that I can see what's needed, how it would be used, try some experiments, and see whether it 'feels' right, whether it's the right approach?

  3. Anonymous

    Sure. Here's one I've been trying on CPython 2.6 on 64-bit Ubuntu (works), CPython 2.7 on 64-bit Windows (OverflowError), and IronPython 2.7 on 64-bit .NET (StackOverflowError).

  4. Anonymous

    I downloaded the PyPI version, built and installed it on Python 2.5.1, and tried it:

    >>> import regex
    >>> p = regex.compile(r"333\L<bar>444", bar=set(["one", "two", "three"]))
    >>> p.match("333four444")
    >>> p.match("333four444")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    SystemError: bad format char passed to Py_BuildValue

    Does that seem right to you?

    >>> p.match("333one444")
    >>>

    And that should have matched, right?

  5. Anonymous

    Ah, OK. I re-downloaded from PyPI, now it's working. But here's another issue:

    >>> p = regex.compile(r"3\L<bar>4\L<bar>+5", bar=sets.ImmutableSet(["one", "two", "three"]))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Library/Python/2.5/site-packages/regex.py", line 266, in compile
        return _compile(pattern, flags, kwargs)
      File "/Library/Python/2.5/site-packages/regex.py", line 371, in _compile
        parsed = parse_pattern(source, info)
      File "/Library/Python/2.5/site-packages/_regex_core.py", line 296, in parse_pattern
        branches = [parse_sequence(source, info)]
      File "/Library/Python/2.5/site-packages/_regex_core.py", line 313, in parse_sequence
        item = parse_item(source, info)
      File "/Library/Python/2.5/site-packages/_regex_core.py", line 323, in parse_item
        element = parse_element(source, info)
      File "/Library/Python/2.5/site-packages/_regex_core.py", line 424, in parse_element
        return parse_escape(source, info, False)
      File "/Library/Python/2.5/site-packages/_regex_core.py", line 833, in parse_escape
        return parse_string_set(source, info)
      File "/Library/Python/2.5/site-packages/_regex_core.py", line 950, in parse_string_set
        return string_set(info, name)
      File "/Library/Python/2.5/site-packages/_regex_core.py", line 289, in string_set
        return StringSet(info, name)
      File "/Library/Python/2.5/site-packages/_regex_core.py", line 2637, in __init__
        index, min_len, max_len = info.string_sets[self.set_key]
    ValueError: too many values to unpack
    >>>

  6. Anonymous

    I just tested this enhancement (cf.: http://mail.python.org/pipermail/python-list/2011-June/1274529.html ) and would like to ask about the treatment of metacharacters in the items of the options set. I somehow inferred from the overview text that they would be escaped, but they appear to be discarded completely, cf.:

    >>> regex.findall(r"^\L<options>", "solid QWERT", options=set(['good', 'brilliant', '+sol[i}d']))
    ['solid']
    >>> regex.findall(r"^\L<options>", "+solid QWERT", options=set(['good', 'brilliant', '+solid']))
    []
    >>>

    I believed the first pattern shouldn't match if the items were escaped (and would cause an error if they were taken unchanged), while the second one would match with escaping; or am I missing something?

    regards, vbr

  7. Anonymous

    You're not missing anything. They should match as you say. But I'm seeing a different result (Ubuntu 10 with Python 2.6):

    >>> regex.findall(r"^\L<options>", "solid QWERT", options=set(['good', 'brilliant', '+sol[i}d']))
    []
    >>> regex.findall(r"^\L<options>", "solid QWERT", options=['good', 'brilliant', '+sol[i}d'])
    []
    >>> regex.findall(r"^\L<options>", "+solid QWERT", options=set(['good', 'brilliant', '+solid']))
    []
    >>> regex.search(r"^\L<options>", "+solid QWERT", options=set(['good', 'brilliant', '+solid']))
    >>> regex.search(r"^\L<options>", "+solid QWERT", options=set(['good', 'brilliant', 'solid']))
    >>> regex.search(r"^\L<options>", "solid QWERT", options=['good', 'brilliant', '+sol[i}d'])
    >>>

  8. Anonymous

    This is an interesting one.

    If the pattern string has been seen before, the compiled regex is fetched from the cache of already-compiled regexes, even though the set of strings passed with it is different.

    Should it treat the set as part of the pattern and recompile, much as it does with flags?

  9. Anonymous

    Yes, I think that's the right call. The named keyword argument is local to the particular compile() or search() or findall() call. Different calls may use the same keyword name for different values.

  10. Anonymous

    Sorry for the delayed reaction (I somehow believed I would be notified of further comments after my post). I'd like to confirm the fix in regex-0.1.20110616; I agree with the current solution. thanks; vbr
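The two points settled in comments 6 to 9 (items are matched literally, and the named set is part of the cache key) can be illustrated with a small sketch built on the standard `re` module. It is not the regex module's implementation: it swaps in a plain escaped alternation for `\L<name>`, and `expand_named_lists`, `compile_with_lists` and `_cache` are hypothetical names chosen for the example.

```python
import re

_cache = {}  # hypothetical cache of compiled patterns


def expand_named_lists(pattern, named_lists):
    """Replace each \\L<name> with an alternation of escaped items.
    Simplification: escaped backslashes before \\L are not handled."""
    def replace(match):
        items = named_lists[match.group(1)]
        # Longest first, so an item is preferred over its own prefix.
        ordered = sorted(items, key=lambda s: (-len(s), s))
        return "(?:" + "|".join(re.escape(item) for item in ordered) + ")"
    return re.sub(r"\\L<(\w+)>", replace, pattern)


def compile_with_lists(pattern, flags=0, **named_lists):
    # The named lists are part of the key, much like flags, so the same
    # pattern text with a different set is compiled separately.
    key = (
        pattern,
        flags,
        frozenset((name, frozenset(items))
                  for name, items in named_lists.items()),
    )
    try:
        return _cache[key]
    except KeyError:
        compiled = re.compile(expand_named_lists(pattern, named_lists), flags)
        _cache[key] = compiled
        return compiled
```

With this key, `compile_with_lists(r"^\L<options>", options=['good', 'brilliant', '+solid'])` and the same call with `'solid'` in place of `'+solid'` produce two distinct compiled patterns, and only the first matches at the start of `"+solid QWERT"`.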
