#297 Merged at 18563ee
  1. Britton Smith

This adds to the overloaded new used for memory tracing to allocate arrays only with sizes that are powers of 2.

Comments (17)

  1. Tom Abel

    This could be an acceptable strategy to accept this pull request and encourage multiple people with very large simulations that suffer from memory fragmentation to stress test this. It is already isolated in the makefile so that people can choose to not use it.

  2. Brian OShea

    This passes the test suite (on my laptop; Jenkins is tragically down for the count.) I will hold off on approving + merging until @brittonsmith follows up regarding the results of his cosmology simulation test.

  3. Tom Abel

    Should we rename this and the compile time option to OVERLOAD-NEW ? It would then make it clear what it does and we explain details in the docs. could make sure every new developer will explore this.

    1. Brian OShea

      I don't think that makes sense. There are two different things that overload new - the memory trace code and log2alloc. We can't replace it with a single compile time option, since they can either be used separately or simultaneously (3 use cases). Any possible one-parameter fix doesn't really help with clarity. I think the best approach here is loudly proclaiming it in the documentation!

  4. Nathan Goldbaum

    I have a production simulation that crashes with OOM errors after running for approximately 48 hours. I've just merged with this PR and put the job in the queue with a runtime of 72 hours. Hopefully it won't spend much time in the Pleiades queue and I'll be able to report back by Monday if this PR fixes the OOM errors.

      1. Nathan Goldbaum

        It crashed during I/O. This has happened in the past and I've avoided it by restarting and running with fewer processors on the first node. I've put it back in the queue and will comment here if I'm able to run for longer than 48 hours.

  5. Brian OShea

    @brittonsmith , just following up - did the cosmology simulation you ran go much further than before?

  6. Brian Crosby

    This might be getting ahead of where things are now, but I ported this change to Enzo 3 to test it with active particles and the inline yt interface, and it ended up causes crashes. In particular there seems to be a problem with the call to PyDict_Clear() in CallPython.C trying to free a pointer that was not allocated. I suspect that the memory allocation/freeing processes that are overloaded with Log2Alloc are getting confused somewhere in the C++/cython/python combinations that occur when the grids are passed to yt. Again, I've only tested this with a pretty experimental arrangement in Enzo 3, but it would be worth being aware of this issue when dev gets merged into 3. It might be something simple that someone with a deeper knowledge then me of how the inline interface works, or perhaps we just need to put a catch in there to exit gracefully and informatively if someone tries to use both Log2Alloc and the inline yt interface.

  7. Britton Smith author

    Update: contrary to what I had originally thought, I am actually seeing worse performance memory-wise than without this. I reduced the number of resources on my test run such that I was using the minimum amount necessary for the simulation to run at all. Using the changes in this PR, the simulation ran out of memory quicker than when running without these changes. Perhaps, I messed something up in the implementation because I'm quite surprised by this. Needless to say, this PR should be put on hold until I can figure out what's going on.

  8. Nathan Goldbaum

    Instead of dying after 48 hours, the last job I submitted using these changes ran for 62 hours before crashing with OOM errors. That that for what you will.

  9. Britton Smith author

    Update: I did a simple test with valgrind and found no obvious issues. Since memory fragmentation results have been mixed, I opted to change the default to having this disabled. I think this is the best we can do for now. I also corrected the issue brought up by @jwise77 in regard to doing the right thing when the size is 1.