RDF.__init__ SEGFAULT

Issue #32 resolved
Carl Simon Adorf created an issue

I was porting my scripts to python3.4 on collins when I encountered this bug.

The bug occurs when I try to calculate the RDF from a previously read XMLDCDTrajectory.

Boost.Python.ArgumentError: Python argument types in
    RDF.__init__(RDF, Box, float, float)
did not match C++ signature:
    __init__(_object*, float, float)
[collins:25249] *** Process received signal ***
[collins:25249] Signal: Segmentation fault (11)
[collins:25249] Signal code: Address not mapped (1)
[collins:25249] Failing at address: 0xa0
[collins:25249] [ 0] /lib64/libpthread.so.0(+0x10e50) [0x7f6d427f4e50]
[collins:25249] [ 1] /usr/lib64/libpython3.4.so.1.0(+0x9bc38) [0x7f6d43c2cc38]
[collins:25249] [ 2] /usr/lib64/libpython3.4.so.1.0(+0xa5897) [0x7f6d43c36897]
[collins:25249] [ 3] /usr/lib64/libpython3.4.so.1.0(+0xa5297) [0x7f6d43c36297]
[collins:25249] [ 4] /lib64/libc.so.6(__cxa_finalize+0x97) [0x7f6d3de4b327]
[collins:25249] [ 5] /usr/lib64/libboost_python-3.4.so.1.55.0(+0x17743) [0x7f6d42e8a743]
[collins:25249] *** End of error message ***
Segmentation fault

Comments (20)

  1. Matthew Spellings

    The API changed recently to support variable-sized boxes; try passing in just your two floats for the constructor and passing the box in when you call the compute function.
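
    A minimal sketch of the call pattern described above. It uses hypothetical stand-in classes (`Box`, `RDFStandIn`) rather than freud itself, and the argument list of `compute` is an assumption; only the two-float constructor is confirmed by the C++ signature in the error message.

```python
class Box:
    """Hypothetical stand-in for a simulation box with edge length L."""
    def __init__(self, L):
        self.L = L


class RDFStandIn:
    """Hypothetical stand-in that mimics only the call pattern, not freud."""
    def __init__(self, rmax, dr):
        # The constructor takes just the two floats, matching the C++
        # signature __init__(_object*, float, float) in the error message.
        self.rmax = rmax
        self.dr = dr
        self.last_box = None

    def compute(self, box, points):
        # The box arrives here, so it may differ from one frame to the next.
        # The real compute function's argument list is an assumption.
        self.last_box = box
        return len(points)


rdf = RDFStandIn(5.0, 0.1)      # the failing call had the form RDF(box, 5.0, 0.1)
for L in (10.0, 10.5):          # a box whose size changes between frames
    rdf.compute(Box(L), [(0.0, 0.0, 0.0)])
```

    The point is only that the box moves from construction time to compute time, which is what allows it to vary between frames.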

  2. Carl Simon Adorf reporter

    Hi Matthew, I just discovered that. The only weird thing is that it still segfaults, though not when I run it under gdb. I'll try to gather more information. It seems not to be related to freud.

  3. Joshua Anderson

    I've got a more reliable reproducer now. I get the issue with python 3.4.1 (tested 3.4.2 as well). But I do not get the issue with 2.7.

    Who all is affected by this issue? Is it just on the vis lab machines? Should I set the default python to 2.7 on these systems?

  4. Carl Simon Adorf reporter

    I am affected on all my machines, as I switched to python 3.4 as the default for all my scripts because of collins. Do you have a reproducer that we could use to file a bug report with the boost devs without releasing freud code?

  5. Joshua Anderson

    I think @newmanr mentioned a smaller test case. Perhaps he could share. It does seem to be a bug in the python garbage collection routines.

  6. Carl Simon Adorf reporter

    This is Richmond's test case.

    from freud import trajectory
    import numpy
    from freud import locality
    
    #Cubic box
    L = 10.0;
    box = trajectory.Box(L);
    lc = locality.LinkCell(box,2.0);
    
    #Ideal gas in box
    Np = 100;
    xyz = L*2*(numpy.random.rand(Np,3)-0.5);
    xyz = xyz.astype(numpy.float32);
    
    #Compute at least once (required for segfault)
    lc.computeCellList(box,xyz);
    print("End Script Test LinkCell");
    
  7. Joshua Anderson

    Here is the backtrace:

    0x00007ffff79ec308 in dict_dealloc (mp=0x7ffff5fd0788)
        at /var/tmp/portage/dev-lang/python-3.4.2/work/Python-3.4.2/Objects/dictobject.c:1379
    1379    /var/tmp/portage/dev-lang/python-3.4.2/work/Python-3.4.2/Objects/dictobject.c: No such file or directory.
    (gdb) bt
    #0  0x00007ffff79ec308 in dict_dealloc (mp=0x7ffff5fd0788)
        at /var/tmp/portage/dev-lang/python-3.4.2/work/Python-3.4.2/Objects/dictobject.c:1379
    #1  0x00007ffff79f5f67 in module_dealloc (m=0x7ffff5fbd548)
        at /var/tmp/portage/dev-lang/python-3.4.2/work/Python-3.4.2/Objects/moduleobject.c:398
    #2  0x00007ffff79f5967 in meth_dealloc (m=0x7ffff5fae108)
        at /var/tmp/portage/dev-lang/python-3.4.2/work/Python-3.4.2/Objects/methodobject.c:150
    #3  0x00007ffff73c8fb9 in __run_exit_handlers (status=0, listp=0x7ffff772f5a8 <__exit_funcs>, 
        run_list_atexit=run_list_atexit@entry=true) at exit.c:82
    #4  0x00007ffff73c9005 in __GI_exit (status=<optimized out>) at exit.c:104
    #5  0x00007ffff73b2dcc in __libc_start_main (main=0x4009a0 <main>, argc=2, argv=0x7fffffffd248, init=<optimized out>, 
        fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd238) at libc-start.c:319
    #6  0x0000000000400bb9 in _start ()
    

    The minimum ingredients needed appear to be python + boost + calling a function that takes a boost::python::numeric::array input.

  8. Richmond Newman

    My new iMac (gatesbrown, OSX yosemite, python 3.4.2) is afflicted with segfaults when using freud too. If you ensure that any used freud module is either reused, or kept in memory until the end of the script, your work will at least complete before segfaulting on exit. Not a real fix of course, just a way to live with it temporarily.

    I tried creating a boost python example test program (entirely independent of freud) and got the same errors. I cannot safely deallocate a class after a call to one of its member functions that takes a boost::python::numeric::array (even if said function is empty and does nothing with it). So I guess that's the same as Josh's findings.

    Simon: When you can't get it to crash in gdb, did you test a few times? I found that I couldn't get scripts to crash within GDB, but they would segfault (stochastically but with moderate probability) upon exiting.

    I tried creating an example outside of freud as well, but the results remained the same. Calling a function that takes in a boost::python::numeric::array, even a function that does nothing, will trigger the problem. If the module you're using is never deallocated, then

  9. Richmond Newman

    from freud import trajectory
    from freud import sphericalharmonicorderparameters as shop
    import numpy
    import time
    import gc
    import weakref
    
    def testCase():
        #Cubic box
        L = 10.0;
        box = trajectory.Box(L);
        #sl = shop.SolLiqNear(box, 2.0, 10.0, 6, 6, 12);
        #pointers = []
    
        #Ideal gas in box
        Np = 500;
        for i in range(10):
            print("Iteration {}".format(i));
    
            sl = shop.SolLiqNear(box,2.0,10.0,6,6,12);
    
            print("sl refcountafterconstruction is {}".format(len(gc.get_referrers(sl))));
    
            L=10+numpy.random.rand(); #10-11
            box = trajectory.Box(L);
            sl.setBox(box);
    
            #Ideal gas in box
            xyz = L*2*(numpy.random.rand(Np,3)-0.5);
            xyz = xyz.astype(numpy.float32);
    
            sl.computeSolLiqNoNorm(xyz);
            #print("Sl refcount after compute is    {}".format(len(gc.get_referrers(sl))));
    
            pointers.append(sl);
            #del sl;
            print("Sl refcount from pointers is {}".format(len(gc.get_referrers(pointers[-1]))));
        #print("Attempting to deallocate pointers array");
        #del pointers;
        #print("Deallocation complete");
    
    if __name__ == "__main__":
        pointers = [];
        #gc.set_debug(gc.DEBUG_SAVEALL);
        for i in range(10):
            testCase();
        gc.set_debug(gc.DEBUG_SAVEALL);
        print("End Script");
    

    I did manage, however, to write one script that I can't seem to segfault on my iMac, included above. Herein I throw all the created objects into a giant list, which normally would segfault on deallocation at the end of the script. However, if you set some debug parameters to the garbage collector, this seems to prevent that from happening. However, regardless of when this parameter is set (beginning or end), if objects are destroyed during the script, it will still segfault. Sigh.
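
    For context on the debug flag: `gc.DEBUG_SAVEALL` makes the cyclic collector append every unreachable object it finds to `gc.garbage` instead of freeing it, which may be why the script above avoids the crash. It only affects objects found by the cyclic collector, not ones freed by plain reference counting. A minimal pure-Python sketch of that behaviour (the `Node` class is made up for illustration and has nothing to do with freud):

```python
import gc


class Node:
    """Plain Python object used to build a reference cycle."""
    def __init__(self):
        self.other = None


def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a     # a <-> b becomes unreachable on return


gc.collect()                    # start from a clean slate
gc.set_debug(gc.DEBUG_SAVEALL)  # save unreachable objects instead of freeing them
make_cycle()
gc.collect()

# The two Node instances were found by the collector but kept alive.
saved_nodes = [obj for obj in gc.garbage if isinstance(obj, Node)]
print("Node objects kept in gc.garbage: {}".format(len(saved_nodes)))

gc.set_debug(0)                 # restore normal collection
gc.garbage.clear()
```

    Since everything saved this way stays allocated, this is a leak by design, not a fix.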

  10. Carl Simon Adorf reporter

    Hi, when I tested with GDB I was not aware of the stochastic nature of the problem. So I am pretty sure the problem would occur eventually.

    One quick word of warning: I proposed this pointer array to prevent deallocation as a quick test to pinpoint the source of error. We should definitely not use it as a workaround in any kind of productive environment as it presents a massive memory leak.
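
    Because the crash is stochastic, a single clean run proves little. A small sketch for running a reproducer repeatedly and counting how often it dies with a segfault; the helper names are hypothetical, and it assumes a POSIX system, where a child killed by signal N reports return code -N:

```python
import signal
import subprocess
import sys


def python_cmd(script):
    """Command line that runs `script` with the current interpreter."""
    return [sys.executable, script]


def count_segfaults(cmd, runs=20):
    """Run cmd `runs` times; return how many runs were killed by SIGSEGV."""
    crashes = 0
    for _ in range(runs):
        # On POSIX, a child terminated by a signal has a negative return code.
        if subprocess.call(cmd) == -signal.SIGSEGV:
            crashes += 1
    return crashes
```

    For example, `count_segfaults(python_cmd("reproducer.py"), runs=20)` would report how many of 20 runs crashed, where `reproducer.py` is a placeholder for whichever test script is being checked.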

  11. Richmond Newman

    Giant memory leak indeed. Most of the modules I use for production runs are reused, so (as far as I know) they aren't affected by the problem of being deallocated in the middle of the script. I'm honestly not sure what to do to fix these errors, though, since the problem seems to lie outside the scope of our code.

  12. Joshua Anderson

    OK, my minimal example is much more minimal:

    void LocalDensity::computePy(trajectory::Box& box,
                                 boost::python::numeric::array ref_points,
                                 boost::python::numeric::array points)
        {
        /* actual compute code here
        */
        }
    

    I commented out any code that would actually do anything in compute() in LocalDensity.

    Then I ran this test:

            self.box = trajectory.Box(10);
            self.pos = numpy.array(numpy.random.random(size=(10000,3)), dtype=numpy.float32)*10 - 5
            self.ld = density.LocalDensity(3, 1, 1);
            self.ld.compute(self.box, self.pos, self.pos);
    

    This segfaults on petry 3 out of every 4 runs with the backtrace I posted above. If I change boost::python::numeric::array to boost::python::object, it no longer segfaults. However, if I then instantiate a local variable with boost::python::numeric::array ref_points = extract<boost::python::numeric::array>(ref_points_in); it segfaults again.

    My conclusion is that the problem is somehow caused in garbage collecting a numpy array that has been touched by boost::python::numeric::array.

    Searching by keywords on the line of code at dictobject.c:1379 led me to a python bug in 2.7.5 where the garbage collector was crashing when trying to clean up code allocated in another module. This bug has been fixed in the python 2.7 series (not that we ever triggered it to my knowledge). I couldn't find references to any such bug in python 3.4.

    In any case, this appears to be a bug in either python or boost (or in the way they interact) and not in how we are using them in freud. I am going to roll the linux boxes back to python 3.3. That is the quickest way to work around the problem for now. I certainly don't have the time to track down subtle memory manager issues in python far enough to submit a bug.
