different handling of \w in unicode patterns in regex and re

Issue #3 closed
Anonymous created an issue

Hi,

I think this may be intended behaviour, but I didn't find it mentioned anywhere in the docs. Sorry if it has already been discussed somewhere I haven't looked ...

It seems that for Unicode patterns like ur"..." regex implicitly sets the Unicode flag (?u), while re doesn't seem to do that.

>>> re.findall(ur"\w", u"aáb")
[u'a', u'b']
>>> regex.findall(ur"\w", u"aáb")
[u'a', u'\xe1', u'b']
>>> re.findall(r"\w", u"aáb")
[u'a', u'b']
>>> regex.findall(r"\w", u"aáb")
[u'a', u'b']
>>> re.findall(ur"(?u)\w", u"aáb")
[u'a', u'\xe1', u'b']
>>> regex.findall(ur"(?u)\w", u"aáb")
[u'a', u'\xe1', u'b']

Python 2.7.1, Win XP SP3, 32 bit Czech; regex r902c02d44f

regards, Vlastimil Brom

Comments (3)

  1. Anonymous

    Ah, yes, if the pattern is a Unicode string then the matching defaults to Unicode, and if the pattern is a bytestring then the matching defaults to ASCII.

    You can be explicit with regex.UNICODE or "(?u)" and regex.ASCII or "(?a)".

    The justification is that if you're using Unicode strings then you probably want Unicode matching too. I'll make a note to update the docs at some point (I don't have any other changes planned).

    I would be willing to make it the same as the 're' module if the general consensus is that it should be.
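
The explicit forms mentioned above can be tried out with a short sketch. Note the assumptions: this uses Python 3 and the standard-library re module rather than the Python 2.7 and regex setup from the report, so the defaults differ from the session in the issue (in Python 3, re already treats str patterns as Unicode). Only the flag names re.ASCII, "(?a)" and "(?u)" are relied on; the variable names are made up for illustration.

```python
import re

text = "aáb"

# Python 3 str pattern: \w is Unicode-aware by default.
default_str = re.findall(r"\w", text)

# Explicit ASCII matching, either as a flag argument or inline.
ascii_flag = re.findall(r"\w", text, re.ASCII)
ascii_inline = re.findall(r"(?a)\w", text)

# Explicit Unicode matching (redundant for str patterns in Python 3).
unicode_inline = re.findall(r"(?u)\w", text)

# Bytes pattern: matching defaults to ASCII, so the two UTF-8 bytes
# that encode "á" are not treated as word characters.
bytes_default = re.findall(rb"\w", text.encode("utf-8"))

print(default_str, ascii_flag, ascii_inline, unicode_inline, bytes_default)
```

Spelling the flag out in the pattern keeps the result independent of whichever default a given module or Python version applies.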

  2. Anonymous

    Thanks for the confirmation; I was just a bit surprised to see different results in a script (using re) and in my general app (which normally uses regex), where I didn't expect a difference between these regex engines.

    I am happy with either behaviour. The (?u) can simply be added if needed and is more explicit. On the other hand, the Unicode flag is global and cannot be switched off: if one needed a Unicode string pattern with the special sequences interpreted as ASCII, [a-zA-Z0-9_] would be necessary instead of \w (if I understand correctly).

    That being said, I have no strong personal preference, now that it is documented. It would depend on the policy for inclusion in the standard library (e.g. whether to include this behaviour in the NEW flag).

    vbr
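
The workaround described in the comment above, an explicit character class in place of \w, can be sketched as follows. This is only an illustration under stated assumptions: it uses Python 3's standard-library re module (where str patterns are matched as Unicode by default), not the regex module under Python 2.7, and the sample string and variable names are hypothetical.

```python
import re

sample = "aáb_1"

# With Unicode matching in effect, \w also accepts the accented letter.
word_chars = re.findall(r"\w", sample)

# An explicit class stays ASCII-only regardless of the Unicode flag.
ascii_class = re.findall(r"[a-zA-Z0-9_]", sample)

print(word_chars)
print(ascii_class)
```

The explicit class is more verbose, but it allows ASCII-only and Unicode-aware sequences to be mixed within a single pattern.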
