Request: \K

Issue #151 resolved
boolbag NA created an issue

Hi Matthew, Thank you as always for the terrific engine. In my view it's one of the very best engines out there.

There are three missing features that have been "talking to me" for a while, and I thought I'd put in some requests. I'm sure you've considered them before, but I'd like to put forward a case for each of them.

In this thread I'll focus on \K.

I realize that \K was originally intended as a workaround for the lack of infinite lookbehind. Nevertheless, it is an extremely clean and expressive token.

Without \K, you either have to use a lookbehind or capturing groups. Not a problem, but within long expressions, \K gives you a clean "drop everything matched so far".
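For anyone comparing the two workarounds, here is a minimal sketch using only the standard-library re module, which does not support \K. The sample text and variable names are illustrative assumptions, not taken from any real project. It also shows why the capture-group form is sometimes the only option: re accepts only fixed-width lookbehinds.

```python
import re

# Illustrative sample text (an assumption for this sketch).
text = "boring stuff start=>interesting stuff"

# Workaround 1: a fixed-width lookbehind keeps the prefix out of the match.
m1 = re.search(r"(?<=start=>).*", text)
lookbehind_result = m1.group(0)

# Workaround 2: match the prefix too, then read the wanted part from group 1.
m2 = re.search(r"start=>(.*)", text)
group_result = m2.group(1)

# The standard-library re module rejects variable-width lookbehinds,
# so a prefix such as \w+=> needs the capture-group form there.
try:
    re.compile(r"(?<=\w+=>).*")
    variable_lookbehind_ok = True
except re.error:
    variable_lookbehind_ok = False

print(lookbehind_result, group_result, variable_lookbehind_ok)
```

Both workarounds return the same text here, but with the capture-group form the overall match (group 0) still includes the prefix, which is the part \K would drop.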

Also, I often have to translate many expressions from PCRE to Python. When the PCRE expressions are rich with \K, the absence of \K in regex is a real speed bump.
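To give a rough idea of what such a translation looks like, here is a sketch that rewrites one hypothetical PCRE pattern for the standard-library re module, turning the prefix in front of each \K into a fixed-width lookbehind. The pattern and test strings are made up for illustration. The rewrite is only an approximation: a lookbehind may inspect text that an earlier match already consumed, whereas \K requires the prefix to be matched as part of the current attempt, so results can differ when collecting multiple matches.

```python
import re

# Hypothetical PCRE original, for reference only: abc\Kde|fg\Khij
# Each branch's prefix before \K becomes a fixed-width lookbehind.
translated = re.compile(r"(?<=abc)de|(?<=fg)hij")

first = translated.search("abcde").group(0)    # prefix "abc" is present
second = translated.search("fghij").group(0)   # prefix "fg" is present
no_prefix = translated.search("xxde")          # neither prefix is present

print(first, second, no_prefix)
```

With many branches or variable-width prefixes this gets unwieldy quickly, which is where the speed bump comes from.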

Thanks in advance for considering it again.

Comments (4)

  1. Matthew Barnett repo owner

    As far as I can tell, it would shorten group 0 (the entire match), but not any capture group:

    >>> m = regex.search(r'(abc\Kde)', 'abcde')
    >>> m[0]
    'de'
    >>> m[1]
    'abcde'

    Therefore, it should also affect the span (start and end position) for group 0, but no other groups.

    Is that correct?

  2. boolbag NA reporter

    Hi Matthew,

    Yes, that's exactly right.

    Also note that it's not a magic token: it can appear multiple times. For instance, abc\Kde|fg\Khij matches de in abcde or hij in fghij.

    In PCRE, a\Kbc\Kde is legal. This has no point, but I guess the idea is that the token can be dropped anywhere.

    You can have it on a single side of an alternation, for instance ab(?:\Kde|fg) etc.

    I know you had EditPadPro at some stage because I recall seeing you on the forum. For testing purposes Jan has a good implementation in EPP and RegexBuddy, except for a minor bug that he plans to fix in the next release (one of the most recent threads on the RB forum).


  3. boolbag NA reporter

    Absolutely fantastic. Thank you so much for this time-saver.

    An example for anyone interested in seeing it at work: everything to the left of \K (including the start=> marker) is dropped.

    >>> import regex as mrab
    >>> bsk = mrab.compile(r'start=>\K.*')
    >>> print(bsk.search('boring stuff start=>interesting stuff'))
    <regex.Match object; span=(20, 37), match='interesting stuff'>