Some background: I've been working with very large REs in CPython and IronPython. We generate the RE pattern from lists, like lists of cities or lists of names, somewhat like this:
    import re
    namelist = open("names.txt").read().split()
    # escape each name in case any contain regex metacharacters
    pattern = re.compile("|".join(re.escape(name) for name in namelist))
The one I'm working with now is a pattern for finding substrings that look like the name of a person. It overflows the System::Text::RegularExpressions buffers on IronPython, but works fine with CPython 2.6 on 64-bit Ubuntu.
One of the things I've been thinking is that this kind of pattern should be handled differently. Suppose there was some syntax like
    pattern = re.compile("(?S<names>)", names=ImmutableSet(namelist))
where (?S<name>) indicates a named ImmutableSet whose members are drawn from the keyword argument of that name. The compiler would generate a reasonably fast approximate pattern from the set: say, a character class built from the union of all characters appearing in its strings, with minimum and maximum lengths taken from its shortest and longest elements. When the engine runs, it would first match that fast pattern; on a hit, it would then check whether the matched text is actually a member of the named set. If so, the match would be confirmed; if not, it would fail.
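To make the idea concrete, here's a rough emulation of that two-stage strategy using today's re module. The helper name compile_set_matcher is hypothetical (nothing like it exists in re), and the greedy pre-filter is deliberately crude, so this is only a sketch of the confirm-against-the-set step, not a real implementation:

    import re

    def compile_set_matcher(names):
        """Build a coarse pre-filter from the set, then confirm by membership."""
        names = frozenset(names)
        # Fast pattern: union of all characters in all strings in the set...
        chars = "".join(sorted({c for name in names for c in name}))
        # ...plus length bounds from the shortest and longest elements.
        lo = min(map(len, names))
        hi = max(map(len, names))
        prefilter = re.compile("[%s]{%d,%d}" % (re.escape(chars), lo, hi))

        def finditer(text):
            for m in prefilter.finditer(text):
                # The fast pattern over-matches; confirm against the set.
                if m.group(0) in names:
                    yield m
        return finditer

    find_names = compile_set_matcher({"Alice", "Bob", "Carol"})
    hits = [m.group(0) for m in find_names("met Bob and Carol today")]
    # hits is now ["Bob", "Carol"]

Note the greedy pre-filter can swallow a real name plus adjacent set-characters and then fail the membership test (e.g. "Bob" inside "Bobca"); a real engine implementation would backtrack or anchor on word boundaries rather than give up.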
Seems like this might be a useful feature for regex to have, given the popularity of this kind of machine-generated RE.