Infinite loop in mserver due to MFE's default write_cache_size

Issue #339 resolved
Ben Smith created an issue

I have a frontend where the max event size from a remote client is ~2MB. I just updated from a 2019 version of midas to the latest, and had lots of issues with the SYSTEM buffer, causing any client connected to that buffer to crash.

I tracked it down to mserver getting stuck in an infinite loop in bm_flush_cache_locked. The default write_cache_size for MFE frontends is 100000 bytes, and I happened to send an event of 69000 bytes. For some reason bm_flush_cache_locked asks for a maximum of 2*pbuf->write_cache_size/3 bytes. So my 69k event would never fit in the 66k of space it asked for, and the mserver instance handling my remote client just got stuck forever. And it got stuck while holding the semaphore, so all my other clients of SYSTEM got killed due to timeouts.

I managed to “fix” this by setting a much bigger value in Common/Write cache size for my frontend (3MB instead of 100kB). I’ll note that this parameter did not exist when I first wrote the frontend.

My questions:

  1. Is there any situation where the write cache size being smaller than the max event size is a good thing? Should MFE automatically set it so that the condition is fulfilled?
  2. Is the logic in bm_flush_cache_locked really correct? It seems that any event that is >= 2/3 of the write cache size is going to trigger this.
  3. Why did I not see this issue in the older code (before 3619ecc6ba1d29d74c16aa6571e40920018184c0) when the cache size was hard-coded to 100000 bytes!?

Comments (9)

  1. Stefan Ritt

    I see some heavy work by KO on the write cache at commit

    I’m not sure if that introduced some bad side effects, but you can try midas versions before and after that commit to see if it has an effect. At least KO can explain why there is the 2*pbuf->write_cache_size/3.

    My original design of a cache of 100kB had the following idea: If there are many small (10-100 Byte) events, locking the SYSTEM buffer causes quite some overhead, so a frontend could only send a few 1000 events (in the old days at least). Adding the write cache helped, because some 1000 events are collected before the buffer is locked and all events are sent in one go. So with 1000 locks per second, you could reach already one millino events per second this way. If you have large events (MB), a cache won’t do much good, since with 1000 locks per second you get already GB/s. So what I would do is to keep the cache size at 100kB, and if an event comes down the line wich is larger than the cache size, I would flush the current cache (if not empty), then send the large event directly to the system buffer, bypassing the cache. I guess this was implemented in my original code.

    So I don’t see the need for a variable cache size and do not understand why it was implemented. But maybe KO can shed some light on that, he certainly had a good reason for that.

  2. Stefan Ritt

    One more thought: Is this the problem KO currently has in alpha-g? He reports about locked semaphores and timeouts.

  3. dd1

    quick reply. there is no changes to stefan’s original design of the write (and read cache). only changes is to add locking for safe multithread access. K.O.

  4. Ben Smith reporter

    I confirm that my experiment works fine with a cache size of 100000 in 95aabe7, which is the commit before the 2/3 factor was introduced in 0b99ec0.

    It also works in develop if I change

    if (ask_free > 2*pbuf->write_cache_size/3)
       ask_free = ALIGN8(2*pbuf->write_cache_size/3);


    if (ask_free > ALIGN8(pbuf->write_cache_size))
       ask_free = ALIGN8(pbuf->write_cache_size);

    Why was the 2/3 factor introduced, KO? It really does create a logical flaw for events that are between 2/3 and 100% of the write cache size.

  5. dd1

    I confirm this is same bug I see in alpha-g at cern. there is two problems: (1) infinite loop, (2) incorrect size computation. both need fixing.

  6. dd1

    to stefan - consider all pathological corner cases. simplest one: SYSTEM buffer is 100 Mbytes, write cache is 200 Mbytes, bm_flush_cache() will want to flush 200 Mbytes of data into 100 Mbytes of shared memory buffer, cannot be done in one go, have to loop. so limit write cache size to no bigger than SYSTEM buffer. still no go, to flush the cache we have to wait for mlogger to completely empty the buffer. so limit to half the size of SYSTEM buffer. better, but notice that we can only use half available buffer space. nasty business.

    the other side of pathological corner cases is strange-sized events. in alpha-g a multithreaded frontend is sending a few large events and many small events, so it is correct to use a write cache. but. write cache size is 100 kbytes, large events are around 60 kbytes, the sequence is write large event into cache, attempt to write next large event into cache, no go (a large event is already in the cache, 100k-60k=40k, no space for one more large event). flush cache, write our large event into the newly empty cache. rinse, repeat. effectively a 1-event cache, no how it was meant to work.

    there is other strange stuff possible with the write cache - the two above is just what I saw during testing (and now in the alpha-g full system).


  7. dd1

    fix is to remove the infinite loop in bm_flush_cache_locked(), commit 8baca0df95fdced04f7d4ac44e88d8fac6703527. if write cache size is bigger than event buffer, flush will always fail and an error message is logged. bm_set_cache_size() does not allow setting write cache size bigger than 1/3 of the event buffer size, so this error should never happen. (there was not protection or error message about this before, and I have seen strange results if event size, write cache size and event buffer size were not all correct relative to each other). Updated code is under test at CERN ALPHA-g system. K.O.

  8. Log in to comment