AngularSeparation fails unittest on rare occasions

Issue #171 resolved
Vyas Ramasubramani created an issue

The relevant CircleCI result is here. The issue has not been reproduced, so it's unclear what exactly causes it. It may be related to the system configuration when this particular test ran, or the order in which tests run; nobody has seen it yet in practice, so it's not a major concern.

Error pasted below:

======================================================================
ERROR: test_compute_neighbors (test_order_AngularSeparation.TestAngularSeparation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ci/ci/freud/tests/test_order_AngularSeparation.py", line 125, in test_compute_neighbors
    npt.assert_almost_equal(ang.getNeighborAngles()[1], 0, 6)
IndexError: index 1 is out of bounds for axis 0 with size 0

----------------------------------------------------------------------
Ran 134 tests in 16.877s

Comments (11)

  1. Vyas Ramasubramani reporter

    This issue appears to be a very subtle and nondeterministic problem (associated with the NeighborList, but I don't believe caused by it). The original unit test fails on these lines in the test_compute_neighbors test of AngularSeparation:

            npt.assert_almost_equal(ang.getNeighborAngles()[0], np.pi/3, 6)
            npt.assert_almost_equal(ang.getNeighborAngles()[1], 0, 6)
    

    In particular, if it fails (which happens with a pretty low frequency, maybe 10% of the time or so in my experience with testing) it always fails on the second line. Now, I tried replacing those lines with the following:

            print("Original length: ", len(ang.nlist))
            tmp = ang.getNeighborAngles()
            print("Length after call: ", len(ang.nlist))
            npt.assert_almost_equal(tmp[0], np.pi/3, 6)
            print("Length after test: ", len(ang.nlist))
            print(tmp[0])
            print(tmp[1])
            tmp = ang.getNeighborAngles()
            print("Length after second call: ", len(ang.nlist))
            print(tmp[0])
            print(tmp[1])
    

    The resulting output from running that 20 times is attached to this issue now (note that the errors are printed above the corresponding output, I'm guessing it has to do with how Python calls the underlying C API). A sample of the output for one of the failed runs looks like this:

    .Original length:  3
    Length after call:  3
    Length after test:  0
    1.0471976
    0.0
    Length after second call:  0
    E...
    ======================================================================
    ERROR: test_compute_neighbors (__main__.TestAngularSeparation)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "test_environment_AngularSeparation.py", line 138, in test_compute_neighbors
        print(tmp[0])
    IndexError: index 0 is out of bounds for axis 0 with size 0
    
    ----------------------------------------------------------------------
    Ran 5 tests in 0.034s
    

    Note that the failure is invariably on the second to last line, which is the first access into tmp after the 2nd call to ang.getNeighborAngles(). The issue is that when the nlist size is (or appears to be) 0 the resulting array is allocated according to that, so tmp.size == 0 in that case. For what it's worth, I also occasionally see this output:

    .Original length:  3
    Length after call:  3
    Length after test:  4483619112
    1.0471976
    0.0
    Length after second call:  4483619112
    1.0471976
    0.0
    ....
    ----------------------------------------------------------------------
    Ran 5 tests in 0.033s
    
    OK
    

    In that case, the test doesn't fail because the array just gets massively overallocated, which is not tested for. Apparently, the call to numpy.testing.assert_almost_equal somehow corrupts some memory, because removing that line causes the test to run fine (I ran it 200 times with 0 failures). I've tried various other combinations of tests, and the failures are always preceded by the first call to the numpy testing framework. This issue is also present if I replace this call with self.assertTrue(np.allclose(tmp[0], np.pi/3)) or self.assertTrue(np.allclose(tmp.copy()[0], np.pi/3)), so it doesn't appear to be directly tied to that object but rather to the the state of the memory before the NumPy API call and how it is changed afterwards.

    It's especially strange because the object getting passed in is just an element of a NumPy that should no longer be associated with the original AngularSeparation object or its NeighborList. It is not obvious why any of the memory associated with the AngularSeparation or NeighborList objects should be invalidated at any point since they are still in scope and shouldn't be modified, but the appearance of the large numbers for the NeighborList length certainly suggests that the NeighborList memory is being unexpectedly freed somewhere. I suspect that a free is deterministically happening every single time, but because there are so few operations involved in the test, the underlying memory (which corresponds to the m_num_bonds member of the NeighborList) is typically not rewritten so it still contains the original nlist length of 3 except in rare cases. That's just a guess though.

    @mspells @joaander if either of you have any thoughts on what could be causing this, I would appreciate some insight. This could be related to Issue #151, but I don't see why there would be any issue with this memory getting invalidated in this case. I could try running this under something like valgrind, but it seems like an awful lot of work for what appears to be a relatively inconsequential bug. I would like to avoid that if possible.

  2. Matthew Spellings

    I haven't looked into this in detail, but is it perhaps caused by the lack of links to the owning object (the reference counting stuff that was changed in response to the RDF "bug")? One easy fix would be to make AngularSeparation.nlist_ store the defaulted_nlist quantity that is calculated in computeNeighbor() (and update the @property for nlist to return the appropriate element of that tuple). That way we are keeping around the neighbor list and the object that holds it. Another option would be to return independent python copies in the make_default_nlist* helper functions rather than having them alongside the generator compute object, but that would have a performance cost.

  3. Vyas Ramasubramani reporter

    @mspells your intuition was spot on. I've pushed an initial fix, but since the NeighborList is now in broad use throughout freud I suspect that this issue exists in many places and this test just happens to be one that pinpoints it. I'll have to do a sweep of the code base to make sure that it's addressed everywhere. For now I created a separate member variable rather than doing what you suggested so that the nlist_ member still has the same meaning in AngularSeparation as in every other class, but I may switch it to your suggestion if that makes sense to do throughout freud.

    Alternatively, instead of our original fix that is causing this problem, should we instead switch to the alternate solution we identified of explicitly calling the garbage collector in the NeighborList? That would allow us to fix this in one place, and it would just entail calling gc.collect instead of manually removing that pointer that we removed before.

  4. Vyas Ramasubramani reporter
    • changed status to open

    We have a solution, but have not yet identified the best global solution. This issue should stay open until the proper solution is applied throughout freud.

  5. Log in to comment