An acceptable strategy might be to accept this pull request and encourage several people with very large simulations that suffer from memory fragmentation to stress test it. It is already isolated in the makefile, so people can choose not to use it.
This passes the test suite (on my laptop; Jenkins is tragically down for the count). I will hold off on approving and merging until @Britton Smith follows up with the results of his cosmology simulation test.
I don't think that makes sense. There are two different things that overload new: the memory trace code and log2alloc. We can't replace them with a single compile-time option, since they can be used either separately or simultaneously (three use cases). Any possible one-parameter fix doesn't really help with clarity. I think the best approach here is to proclaim it loudly in the documentation!
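To illustrate why the two features compose rather than collapse into one option, here is a minimal sketch of a single replaceable operator new whose behavior is controlled by two independent compile-time flags. The macro names (MEMORY_TRACE, LOG2ALLOC) and the helper are illustrative stand-ins, not Enzo's actual symbols; enabling either, both, or neither gives the use cases described above.

```cpp
#include <cstdio>
#include <cstdlib>
#include <new>

// Hypothetical flags; stand-ins for the real build options.
#define MEMORY_TRACE   // memory-trace code
#define LOG2ALLOC      // power-of-two allocation rounding

// Round a request up to the next power of two (1 stays 1; 0 becomes 1,
// which also satisfies operator new's requirement of a unique pointer).
std::size_t RoundUpPow2(std::size_t n) {
  std::size_t r = 1;
  while (r < n) r <<= 1;
  return r;
}

void* operator new(std::size_t size) {
  std::size_t request = size;
#ifdef LOG2ALLOC
  request = RoundUpPow2(request);   // reduce fragmentation by bucketing sizes
#endif
  void* p = std::malloc(request);
  if (p == nullptr) throw std::bad_alloc();
#ifdef MEMORY_TRACE
  std::fprintf(stderr, "new: asked %zu, allocated %zu at %p\n",
               size, request, p);
#endif
  return p;
}

void operator delete(void* p) noexcept { std::free(p); }
```

Because the two #ifdef blocks are independent, documenting the valid combinations (trace only, log2alloc only, both) really is the only clarity fix available.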
I have a production simulation that crashes with OOM errors after running for approximately 48 hours. I've just merged with this PR and put the job in the queue with a 72-hour runtime. Hopefully it won't spend much time in the Pleiades queue, and I'll be able to report back by Monday on whether this PR fixes the OOM errors.
It crashed during I/O. This has happened in the past and I've avoided it by restarting and running with fewer processors on the first node. I've put it back in the queue and will comment here if I'm able to run for longer than 48 hours.
This might be getting ahead of where things are now, but I ported this change to Enzo 3 to test it with active particles and the inline yt interface, and it ended up causing crashes. In particular, there seems to be a problem with the call to PyDict_Clear() in CallPython.C trying to free a pointer that was not allocated. I suspect that the memory allocation/freeing routines overloaded by Log2Alloc are getting confused somewhere in the C++/Cython/Python handoff that occurs when the grids are passed to yt. Again, I've only tested this with a fairly experimental arrangement in Enzo 3, but it's worth being aware of this issue when dev gets merged into 3. It might be something simple for someone with a deeper knowledge than mine of how the inline interface works, or perhaps we just need to put a catch in there to exit gracefully and informatively if someone tries to use both Log2Alloc and the inline yt interface.
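One minimal way to add such a catch is to refuse the combination at compile time. This is only a sketch: LOG2ALLOC and USE_PYTHON are placeholders for whatever macros the build actually defines for these two features.

```cpp
// Hypothetical guard; macro names are stand-ins for the real build flags.
#if defined(LOG2ALLOC) && defined(USE_PYTHON)
#error "Log2Alloc is currently incompatible with the inline yt interface; disable one of them."
#endif
```

A compile-time refusal fails loudly at build time rather than crashing hours into a run inside PyDict_Clear().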
Update: contrary to what I had originally thought, I am actually seeing worse memory performance than without this change. I reduced the resources on my test run to the minimum necessary for the simulation to run at all. With the changes in this PR, the simulation ran out of memory more quickly than without them. Perhaps I messed something up in the implementation, because I'm quite surprised by this. Needless to say, this PR should be put on hold until I can figure out what's going on.
Update: I did a simple test with valgrind and found no obvious issues. Since the memory fragmentation results have been mixed, I opted to change the default so that this is disabled. I think this is the best we can do for now. I also corrected the issue raised by @John Wise regarding doing the right thing when the size is 1.
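For anyone wanting to repeat the valgrind check, an invocation along these lines works; the executable and parameter-file names below are placeholders, and the build should include debug symbols for useful reports.

```shell
# Names are illustrative; substitute your actual binary and parameter file.
valgrind --leak-check=full --track-origins=yes ./enzo.exe -d MySimulation.enzo
```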