exchange tags memory leak

Issue #7 resolved
Iulian Grindeanu created an issue

it seems that exchange tags has a memory leak

in an example mpitest, the memory usage keeps increasing when exchange tags is called in a loop

valgrind did not report any leaks when run on a minimal configuration, but the memory keeps increasing; for 100K iterations, it increased from 129 MB per task to 189 MB

after compiling in example folder, run it with

mpiexec -np 2 mpitest ../MeshFiles/unittest/64bricks_1khex.h5m

Notes:

1) for a smaller model, it seems that there is no leak. The mechanism of sending 2 messages, one with a fixed size and the second with the rest of the data, could be the source of the problem.

2) when we increase the size of the messages, there is not an increase in the leak.

for 100k iterations, the memory leak seems to be about 80 MB

So that is 8e7 / 1e5 = 800 bytes per iteration; maybe it is really 1024? Should we increase the size of the buffer from 1024 to something else? If there is no real leak, then maybe we just do not free the first message? Why does valgrind not complain?

3) So indeed, this seems to be the issue; I increased the buffer size: const unsigned int ParallelComm::INITIAL_BUFF_SIZE = 16384;

and there is no memory leak on the initial test. So maybe we are not handling the memory buffers correctly when there is a second message following the initial message of size 1024. Need to track down that memory somehow.

Comments (15)

  1. Danqing Wu

    Added a simplified test file to reproduce the memory leak issue, which is a pure MPI test app that has nothing to do with MOAB code. This test should be run with exactly two processors.

  2. Iulian Grindeanu reporter

    nice work Danqing! It has nothing to do with MOAB code, except that we use the same "Buffer" class. It is clear from your example that the memory leak does not come from MOAB packing and unpacking the tags. It seems that the issue is something in our use of MPI. It is either:

    • 1) our incorrect use of MPI_Isend and MPI_Irecv;

    • 2) some memory leak caused by the "Buffer" class; for example, when we "reset" the buffers, I don't think we have to "reserve" again. Also, reserve does not have to start with a "malloc" every time; it should first check whether alloc_size is already enough, and only then do a malloc or realloc (a sketch follows below);

    • 3) a bug in the MPI implementation. I think this is highly unlikely, but it cannot be discounted.
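
    For point 2, a minimal sketch of the kind of reserve()/reset() behavior I mean (illustrative only; the member names here are made up and the real "Buffer" class in ParallelComm may differ):

    #include <cstdlib>   // std::realloc, std::free
    #include <cstddef>   // std::size_t
    #include <new>       // std::bad_alloc

    struct Buffer {
        unsigned char* mem   = nullptr;
        std::size_t    alloc = 0;        // bytes currently allocated
        std::size_t    used  = 0;        // bytes currently filled

        void reserve(std::size_t n) {
            if (n <= alloc) return;      // already big enough: no malloc/realloc at all
            void* p = std::realloc(mem, n);
            if (!p) throw std::bad_alloc();
            mem   = static_cast<unsigned char*>(p);
            alloc = n;
        }
        void reset() { used = 0; }       // rewind, but keep the existing allocation
        ~Buffer() { std::free(mem); }
    };

    With something like this, calling reset() and then reserve() in a loop would not allocate anything new after the first iteration.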

    Another question: in test case number 3 (one processor leaking), how do you know whether it is the sending or the receiving processor that leaks? (you said in the test that it is the sending processor that is leaking)

    First, our problem is this: we want to use nonblocking communication to send data from proc A to proc B. Processor B does not know how much data to expect. Every processor has a local buffer to send from (localOwnedBuffer) and a local buffer to receive into (remoteOwnedBuffer). Our logic is this: processor A first sends a fixed-size message of size "INITIAL_BUFF_SIZE". The first bytes of it contain the total size of the message, so B knows whether more data is coming; A, meanwhile, waits for the ack from B that the first message was received.

    • processor B receives the first message, then it sends the ack to A.

    • After A receives the ack, it sends the rest of the message to B.

    All these sends/receives use MPI_Isend and MPI_Irecv.

    the leak is in either:

    • our buffers

    • the MPI buffers

  3. Danqing Wu

    The memory leak issue in mpitest2.cpp is now fixed in mpitest3.cpp. It seems that we should assign three different MPI request objects to the 1st part of the message, the 2nd part of the message, and the ack (instead of using only two).
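
    For reference, a self-contained sketch of that three-request handshake between two ranks (this is not the actual mpitest3.cpp; names, tags, and sizes are made up, but the request handling is the point):

    // three_requests.cpp -- run with: mpiexec -np 2 ./three_requests
    #include <mpi.h>
    #include <cstring>
    #include <vector>

    enum { TAG_FIRST = 1, TAG_ACK = 2, TAG_REST = 3 };
    const int INITIAL_BUFF_SIZE = 1024;

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int total = 5000;                        // the receiver does not know this in advance
        for (int iter = 0; iter < 100000; iter++) {
            // one request per message kind; none of them is reused while still in flight
            MPI_Request req_first, req_ack, req_rest;

            if (rank == 0) {                           // sender A
                std::vector<char> sendbuf(total, 'x');
                std::memcpy(sendbuf.data(), &total, sizeof(int));   // first bytes = full size
                int ack;
                MPI_Isend(sendbuf.data(), INITIAL_BUFF_SIZE, MPI_CHAR, 1, TAG_FIRST, MPI_COMM_WORLD, &req_first);
                MPI_Irecv(&ack, 1, MPI_INT, 1, TAG_ACK, MPI_COMM_WORLD, &req_ack);
                MPI_Wait(&req_ack, MPI_STATUS_IGNORE); // B has resized its buffer
                MPI_Isend(sendbuf.data() + INITIAL_BUFF_SIZE, total - INITIAL_BUFF_SIZE, MPI_CHAR, 1, TAG_REST, MPI_COMM_WORLD, &req_rest);
                MPI_Wait(&req_first, MPI_STATUS_IGNORE);
                MPI_Wait(&req_rest, MPI_STATUS_IGNORE);
            } else if (rank == 1) {                    // receiver B
                std::vector<char> recvbuf(INITIAL_BUFF_SIZE);
                MPI_Irecv(recvbuf.data(), INITIAL_BUFF_SIZE, MPI_CHAR, 0, TAG_FIRST, MPI_COMM_WORLD, &req_first);
                MPI_Wait(&req_first, MPI_STATUS_IGNORE);
                int full;
                std::memcpy(&full, recvbuf.data(), sizeof(int));
                recvbuf.resize(full);                  // grow before asking for the rest
                int ack = 1;
                MPI_Isend(&ack, 1, MPI_INT, 0, TAG_ACK, MPI_COMM_WORLD, &req_ack);
                MPI_Irecv(recvbuf.data() + INITIAL_BUFF_SIZE, full - INITIAL_BUFF_SIZE, MPI_CHAR, 0, TAG_REST, MPI_COMM_WORLD, &req_rest);
                MPI_Wait(&req_ack, MPI_STATUS_IGNORE);
                MPI_Wait(&req_rest, MPI_STATUS_IGNORE);
            }
        }
        MPI_Finalize();
        return 0;
    }

    The key difference from our current code is that each Isend/Irecv gets its own request and every request is eventually waited on.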

  4. Iulian Grindeanu reporter

    more details from exchanges with Tom Peterka and Ken, and from the mpich discussion list (Wesley Bland and Rajeev Thakur):

    Hi Tom, Thanks for your suggestions. It seems that Danqing found the issue in our code; it appears to be a mismatched request. We had 3 types of messages and were using 3 MPI tags (fixed-size message, ack message, second part of the message), but only 2 requests; for the second send, we were reusing the first request, and it seems that doing this in a loop caused problems. I am not sure yet why; Danqing explained it to me, and his code shows no memory leak when we use 3 requests.

    It will be worth implementing your suggestion too, in which we do not need to send the ack, because the order of the messages is guaranteed by the MPI standard, so the second message should never arrive before the first.

    Still, because we might reallocate the buffer to receive the second message, and the reallocation can happen while the first message is still being received, we might still have to use the ack message.

    Your suggestion of doing a small send first, with just the size of the message, is in a way simpler, but it implies we will always be sending 2 messages. With our current code, if the message is small enough, we need only one send/receive. If the message is bigger, we need 3 sends/receives. Probably the average is around 2 anyway, depending of course on the actual size of the messages. Maybe the size of the buffers should be a runtime option, so the user can choose a better allocation of the receive buffers (problem dependent).

    I received 2 other suggestions on the mpich-discuss list: one was to use MPI_Probe (or one of its variants) to find out the size of the message to be received, and a simpler one was to just allocate enough space in the receive buffer.

    Thanks, Iulian

    That makes sense. I talked to Ken (my office mate and an MPI developer): you should not modify the request until the operation is completed. MPI uses the request to do some reference counting, and this can get messed up if you reuse the request while it is still needed.

    I'm glad you found it, Tom

    The way you're doing things is theoretically fine, but unnecessary. In MPI, it's fine to not know the size of the message before you receive it (as long as you know the type). Instead of sending around "pre-messages" with all of the meta-data, you can use MPI_Probe (http://www.mpich.org/static/docs/latest/www3/MPI_Probe.html), which will give you information about the next message that you'll receive (there are other variants that you might prefer to use, like MPI_Iprobe, MPI_Mprobe, etc.).

    Have a look at the documentation at the link above, but generally, you'll probe MPI to find out information about the next message, such as the source and the size. Then you can create your buffers appropriately and avoid lots of extra buffer creation / destruction calls.

    Thanks, Wesley

    On Tuesday, May 27, 2014 at 11:15 AM, Grindeanu, Iulian R. wrote:

    Hello, We are trying to track down a memory leak, and we are not sure if we are using MPI_Isend and MPI_Irecv correctly.

    We would like to use non-blocking communication, between pairs of processes.

    (it happens during a computation, at every time step; every processor has a list of processors it needs to communicate with; what we do not know in advance is the size of the messages)

    Let's assume A needs to send to B. Because we do not know the size in advance, we first send a fixed-size message from A to B; this first message has info about how big the total message is, so A knows it has to send more. When B receives the first part, it sends an ack (small size, a 4-byte int) to A, and also resizes its local buffer to hold more data; then A sends the rest.

    All these sends / receives use Isend and Irecv, and we try to match the messages, using proper tags. We use different tags for fixed size, ack, and rest of the message.

    Before sending the code, I would like to know: is what we are trying feasible and doable?

    We notice that the memory use on processor A keeps increasing, when we do this in a loop, on the order of about 1000-2000 bytes per iteration.

    I think we are matching the Isends and Irecvs correctly, and that our buffers are not leaking (we may be wrong).

    So again, an iteration is like this: A sends a fixed-size message to B; B sends back an ack, and when A receives it, A sends the rest, because B now has a buffer of the proper size to receive it.

    Next iteration, the size of the message is again unknown, so we do the dance again.

    Do you have an example like this that we can use? Should we use other types of sends/receives?

    Thanks, Iulian

    And you are allowed to post a receive of a larger size than the send. So if you know that the send is not going to be larger than 1000 bytes but you don't know the exact size, you can post a receive for 1000 bytes (and allocate that much memory), and whatever was the actual send size (say 800 bytes) will be received.

    Rajeev
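
    For completeness, a small sketch of the MPI_Probe approach (illustrative only; this is not what ParallelComm currently does):

    // probe_sketch.cpp -- run with: mpiexec -np 2 ./probe_sketch
    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            std::vector<char> data(5000, 'x');          // the receiver does not know this size
            MPI_Send(data.data(), (int)data.size(), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Status st;
            MPI_Probe(0, 0, MPI_COMM_WORLD, &st);       // block until a message is pending
            int count;
            MPI_Get_count(&st, MPI_CHAR, &count);       // size of that pending message
            std::vector<char> buf(count);               // allocate exactly what is needed
            MPI_Recv(buf.data(), count, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }

    Rajeev's alternative needs no probe at all: post MPI_Recv with a buffer known to be large enough, then call MPI_Get_count on the returned status to see how many bytes actually arrived.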

  5. Danqing Wu

    This updated ParallelComm.cpp is not a fix to be committed (more work to be done). It only shows that mpitest.cpp can run without a memory leak if we handle the MPI requests correctly. Ideally, both the send and the receive requests should be different for the different message tags. The memory leak is caused by a shared send request (the 1st part of the message and the 2nd part use the same send request). When the MPI_Waitall call on all the send requests returns, it is likely that only the 1st part has been sent, while the 2nd part has not. To make sure that there is no pending send request still holding temporarily allocated memory, the 1st part and the 2nd part should use different send requests.
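
    To show the problem in isolation, a hypothetical fragment (not the actual ParallelComm code) that reuses one send request in the way described above:

    // shared_request.cpp -- run with: mpiexec -np 2 ./shared_request
    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        std::vector<char> part1(1024, 'a'), part2(4096, 'b');

        for (int iter = 0; iter < 100000; iter++) {
            if (rank == 0) {
                MPI_Request req;
                // BUG: the handle of the 1st Isend is overwritten by the 2nd Isend,
                // so the 1st send is never waited on and its request is never released
                MPI_Isend(part1.data(), (int)part1.size(), MPI_CHAR, 1, 1, MPI_COMM_WORLD, &req);
                MPI_Isend(part2.data(), (int)part2.size(), MPI_CHAR, 1, 2, MPI_COMM_WORLD, &req);
                MPI_Wait(&req, MPI_STATUS_IGNORE);      // completes only the 2nd send
                // FIX: keep one request per Isend, e.g. MPI_Request reqs[2], then
                // MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(part1.data(), (int)part1.size(), MPI_CHAR, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Recv(part2.data(), (int)part2.size(), MPI_CHAR, 0, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        }
        MPI_Finalize();
        return 0;
    }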

  6. Iulian Grindeanu reporter

    fix issue 7

    it fixes just the exchange tags method; still need to fix: reduce tags, send_entities, ghost, settle intersection points, etc.

    (All methods that use this parallel communication strategy as in recv_buffer and send_buffer)

    uses the procedure outlined by Danqing in mpitest3: there should be 3 send/receive requests for each processor that the current processor communicates with (a sketch follows below)
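
    In code terms, the bookkeeping looks roughly like this (a hypothetical sketch; the index layout and names are made up, not the actual ParallelComm data structures):

    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    // three requests per neighbor, for receives and for sends:
    // slot 3*i+0 = first (fixed-size) message, 3*i+1 = ack, 3*i+2 = remainder
    struct PendingComm {
        std::vector<MPI_Request> recv_reqs;
        std::vector<MPI_Request> send_reqs;

        explicit PendingComm(std::size_t num_neighbors)
            : recv_reqs(3 * num_neighbors, MPI_REQUEST_NULL),
              send_reqs(3 * num_neighbors, MPI_REQUEST_NULL) {}
    };

    At the end of the exchange, MPI_Waitall can be called on the whole send_reqs and recv_reqs arrays; entries that were never used stay MPI_REQUEST_NULL, which MPI_Waitall treats as already complete.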

    the memory leak as experienced by the test from Anton Kanaev is gone

    → <<cset 33d234effb99>>
