Allow duplicate names of groups

Issue #87 resolved
Marcin Wojnarski
created an issue

Hi,

Currently, duplicate group names are not allowed; for example, this code raises an exception because the group "a" is defined twice:

>>> regex.match(r'(?<a>here)? or (?<a>here)?', "here or here")
error: duplicate group

I suspect this design is a legacy of the standard 're' module, which didn't allow multiple captures per group, so it was natural to reject duplicate group names as well. But in the 'regex' module, which can capture repeated values, it would be natural to also accept duplicate group names and merge the values extracted from all same-named groups into one list.

This enhancement would allow parsing loose formats, where a given value may appear in any of several different places in the text, so the regex must have a group in each of those places. Usually we expect only one place to match (the groups are optional, as in the regex above), but we can't say in advance which one, and for convenience we'd like to use the same name for all of them, to avoid manually merging several groups afterwards. In other use cases, more than one group may match, and we want to extract all the matched values as a single list.
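To illustrate the manual merging this proposal would avoid, here is a minimal sketch using the standard 're' module: each location gets a distinct, illustrative name ('v1' and 'v2' are not from the original), and the non-empty captures are combined by hand afterwards.

```python
import re

# Workaround with the standard 're' module: each location needs its own
# group name ('v1' and 'v2' are hypothetical), because duplicates are
# rejected at compile time.
pattern = re.compile(r'(?P<v1>here)? or (?P<v2>here)?')
m = pattern.match("here or here")

# Collect every non-None capture into a single list -- the step that
# duplicate group names would make automatic.
values = [v for v in (m.group('v1'), m.group('v2')) if v is not None]
print(values)  # ['here', 'here']
```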

I think this enhancement would fit very well with the concept of repeated captures that's already present in 'regex'.

Do any other regex implementations have something like this?

I don't know.

Comments (8)

  1. Anonymous

    Wouldn't the formats be alternatives, e.g. "(?<found>this)|(?<found>that)"?

    The possibility is already covered; the groups are mutually exclusive.

  2. Marcin Wojnarski reporter

Alternation works well for different value patterns, but not for different locations. Example: web scraping a complex page where the same value (say, the price of a product) can appear in 3 different places, depending on the type of product:

    "(?<price1>\d+)? some-stuff (?<price2>\d+)? other-stuff (?<price3>\d+)?"
    

Because these are different *locations* in the text, not different patterns, and the static parts ("some-stuff") must be present in the middle to correctly position the groups within the text, alternation can't be used here (or would be very awkward, with the static parts copy-pasted several times). Besides, we want to extract other properties too, not only the price, and want to use a single regex for all of this, without making 3 variants of the entire regex and without manually labelling the fields 'price1', 'price2', 'price3' and then merging them.
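The manual labelling and merging described above can be sketched with the standard 're' module; the group names follow the comment, while the pattern spacing and the sample text are illustrative assumptions.

```python
import re

# Three hypothetical locations for the same value, each forced to carry
# its own name ('price1'..'price3') because duplicates are rejected.
pattern = re.compile(
    r'(?P<price1>\d+)? ?some-stuff ?(?P<price2>\d+)? ?other-stuff ?(?P<price3>\d+)?'
)
m = pattern.match("99 some-stuff other-stuff")

# Merge all same-role groups into one list -- the manual step that a
# single shared group name would make unnecessary.
prices = [v for v in m.group('price1', 'price2', 'price3') if v is not None]
print(prices)  # ['99']
```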

  3. Anonymous

    The regex module tries to be compatible with the re module, whose documentation says: """Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression""".

The regex module relaxes that a little by allowing a group name to appear multiple times if the groups are mutually exclusive, but I'm not sure whether duplicates should be allowed in the version 0 ('compatible') behaviour.

    Perhaps only in version 1 ('enhanced') behaviour?

    I'll need to think about it and see whether it would have any adverse side-effects.

    For the record, Perl allows it.

  4. Marcin Wojnarski reporter

    OK, thanks, for my needs V1 would be fine.

In case you consider adding it in V0, note that although this change is not strictly compatible with 're', it does NOT break any existing code, because it only relaxes the constraints on what counts as a correct pattern: any pattern that is correct in 're' would still be correct in 'regex' and behave *exactly* the same, with no change in the result; only some additional patterns would now be considered correct.

  5. Anonymous

    It's true that it wouldn't break any existing code, so there'd be no harm in having it work in V0 too.

  6. Marcin Wojnarski reporter

There is a minor issue when the same group name is nested: the inner group overrides the value matched by the outer group, and both are present in the result (2 copies of the same inner value). For example:

    >>> match = regex.match(r'(?<x>a(?<x>b))', "ab")
    >>> match.capturesdict()
    {'x': ['b', 'b']}
    