Behaviors of .split()

Issue #182 wontfix
animalize created an issue

Hi, this is a trivial issue.

First, .split() with capturing parentheses:

(regex 2015.11.22, py 3.5.0 64bit, win 10)

# re
>>> re.split(r'(-)|(/)', '07/14/2007')
['07', None, '/', '14', None, '/', '2007']

# V0
>>> regex.split(r'(?V0)(-)|(/)', '07/14/2007')
['07', None, '/', '14', None, '/', '2007']

# V1
>>> regex.split(r'(?V1)(-)|(/)', '07/14/2007')
['07', None, '/', '14', None, '/', '2007']

Anyway, I think the None entries should not be there.

FYI, for the .NET Framework's behavior, see the fourth code block of https://msdn.microsoft.com/zh-cn/library/ze12yx1d.aspx, which I quote here:

// Under .NET 1.0 and 1.1, the method returns an array of
// 3 elements, as follows:
//    '07'
//    '14'
//    '2007'
//
// Under .NET 2.0 and later, the method returns an array of
// 5 elements, as follows:
//    '07'
//    '/'
//    '14'
//    '/'
//    '2007' 

Second, here is another issue: split with the .REVERSE flag.

Behavior of regex:

>>> regex.split(r'(?r)foo', '0123456789foo4567890foo         ')
['         ', '4567890', '0123456789']

The result is in reverse order. Should this be clarified in the documentation?

FYI, the expected result in the .NET Framework with the .RightToLeft flag:

(line 226 of https://github.com/dotnet/corefx/blob/master/src/System.Text.RegularExpressions/tests/RightToLeft.cs)

{"0123456789", "4567890", "         "}

Comments (3)

  1. Matthew Barnett repo owner

    These are 2 different issues, as you yourself say. It would've been better if you had reported them separately.

    First, .split with capture groups:

    The regex module is designed to be compatible with the re module, so it has the same behaviour.

    If the pattern contains capture groups, then they are included in the result, as per the documentation for re.

    The pattern in your example contains 2 capture groups, only one of which matches something, and unmatched groups are represented by None.

    The documentation for .Net (https://msdn.microsoft.com/en-us/library/ze12yx1d.aspx) says "all captured text is also added to the returned array", which I take to mean that unmatched groups are excluded. That applies to .NET Framework 2.0 and later, which, as far as I can tell, was released some time after the re module was added to Python.

    If you don't want the Nones, then it's easy enough to strip them out yourself.
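A minimal sketch of stripping the Nones, using only the standard re module. The helper name `split_without_none` is hypothetical and not part of re or regex; it only filters out the placeholders for unmatched groups and keeps everything else, including empty strings.

```python
import re

def split_without_none(pattern, string, maxsplit=0, flags=0):
    """Hypothetical helper: re.split, minus the None entries
    that stand in for capture groups which did not match."""
    pieces = re.split(pattern, string, maxsplit=maxsplit, flags=flags)
    return [piece for piece in pieces if piece is not None]

parts = split_without_none(r'(-)|(/)', '07/14/2007')
print(parts)  # ['07', '/', '14', '/', '2007']
```

For this particular pattern, a single group such as `([-/])` would likely avoid the Nones in the first place, since there is then no unmatched group to report.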

    Second, .split with the .REVERSE flag:

    At the time, it wasn't entirely clear how it should behave. I took the view that it should give the same result as using .splititer.

    The string is scanned in reverse, so the results arrive in reverse order:

    >>> regex.split(r'foo', '0123456789foo4567890foo         ') == list(regex.splititer(r'foo', '0123456789foo4567890foo         '))
    True
    >>> regex.split(r'(?r)foo', '0123456789foo4567890foo         ') == list(regex.splititer(r'(?r)foo', '0123456789foo4567890foo         '))
    True
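If left-to-right order is wanted, one option is to reverse the list afterwards. The sketch below is an assumption-laden illustration: it copies the reverse-split result quoted earlier in this thread as a literal (so it does not need the regex module installed) and compares it with a forward split from the standard re module. It only covers patterns without capture groups; with groups, a plain reversal may not match a forward split.

```python
import re

text = '0123456789foo4567890foo         '

# Result of regex.split(r'(?r)foo', text), copied from this thread.
reverse_result = ['         ', '4567890', '0123456789']

# Reversing the list restores left-to-right order.
left_to_right = reverse_result[::-1]

# Forward split with the standard library, for comparison.
forward = re.split(r'foo', text)

print(left_to_right == forward)  # True
```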
    
  2. animalize reporter

    Yes, I mean that both re and regex need to fix it. It is like cutting open a watermelon and finding some oranges inside. Even returning an empty string would be better than None.

    On Perl 5.14.2:

    #!/usr/bin/perl -w
    @fields = split(/(-)|(=)/, "07=14=2007");
    print "Field values are: @fields\n";
    

    Output: warnings are emitted, and an empty string is inserted for each unmatched group:

    Use of uninitialized value in join or string at ./t.perl line 3.
    Use of uninitialized value in join or string at ./t.perl line 3.
    Field values are: 07  = 14  = 2007
    

    However, maybe no one will be bothered by this issue, so let it pass.

  3. Matthew Barnett repo owner

    Has anyone complained about this behaviour of re in the last 15-odd years? It's been there a long time!
