deadlock when releasing sharedpointers, that share memory with PyObject*

Issue #81 resolved

M. Gronle created an issue 2019-01-15

This issue has been posted on the itom-discussion mailing list:

Hello,

when I was working with the SerialIO plugin, I noticed random timeouts that did not happen a while ago.

I simplified the getVal and setVal function extremely to trace down the reason:

ito::RetVal SerialIO::getVal(QSharedPointer<char> data, QSharedPointer<int> length, ItomSharedSemaphore *waitCond) { ItomSharedSemaphoreLocker locker(waitCond); ito::RetVal retValue; if (waitCond) { waitCond->returnValue =3D retValue; waitCond->release(); } return retValue; }

ito::RetVal SerialIO::setVal(const char data, const int datalength, ItomSharedSemaphore waitCond) { ItomSharedSemaphoreLocker locker(waitCond); ito::RetVal retval(ito::retOk);

if (waitCond)
{
    waitCond->returnValue =3D retval;
    waitCond->release();
}
return retval;

}

The timeouts are still happening with this python code in Itom:

ser =3D dataIO("SerialIO", 88, 115200,"\n",8,1,0,16,0,0) buf =3D bytearray(50)

for i in range(1,512): ser.setVal('bla') ser.getVal(buf)

After a while I always get:

            RuntimeError: Error invoking function setVal with error message: timeout while calling 'setVal'

Since I did not have this problem a while ago I found out that it started after a specific commit:

Hash: 556def47065735e0c1c7ec1b2e43c95acabdf623
Message: fix in QSharedPointer deleter method which are responsible to also decrement any PyObject elements. Sometimes, it is not verified which thread (Python, GUI...) is finally releasing a QSharedPointer, which for instance wraps a dataObject, char-buffer... whose base is another PyObject element. Once the pointer is released, a specific deleter-method is called which calls PyDECREF... of the PyObject base pointer. However if PyDECREF is called, the Python interpreter lock (GIL) must be hold by the caller, else a crash might occur. This is now ensured. The solution is only safe for Python >=3D 3.4, since other versions cannot tell if the GIL is currently hold by the caller or not, which can lead to a deadlock.
Date: 08.08.2018 18:19:37

My python version is 3.5.2 so it should work.

Comments (6)

M. Gronle reporter
- changed status to resolved
fixes issue ~~#81~~: when a QSharedPointer is build to share memory with a PyObject, like bytearray or bytes, this PyObject has to be decremented once the QSharedPointer is deleted. This is done in a deleter-method. To decref the PyObject* the GIL has to be acquired. This caused a dead-lock in some cases. This is fixed now.

→ <<cset b1f60295f386>>
- 2019-01-15T20:50:37+00:00
Dan Nessett
I think I am running into this problem, even though it is marked as resolved. I execute the following commands:

=================

plugin=dataIO("niDAQmx", "Dev2")

plugin.setParam("aiTaskParams", "20000,600000,0");

plugin.setParam("aiChParams","Dev2/ai0,1,-1,1")

d=dataObject([600000],"float64")

plugin.acquire(1)

plugin.getVal(d)

==================

The result is the following error, which appears about 10 seconds after the getVal() call):

=================

Traceback (most recent call last):

File "<string>", line 1, in <module>

RuntimeError: Error invoking function getVal with error message:

timeout while calling 'getVal'

=================

If I wait 30 seconds after calling acquire(1) and then call getVa(d), it works. I tried to debug this issue with a breakpoint in getVal(), but qtcreator behaved strangly. The error message shows up after about 10 seconds (which as indicated above is about the same time as it takes for the non-debugging session error message to appear). Then about 20 seconds later the breakpoint in getVal() fires. Weird.

Note: the 30 second value arises from the sample rate (20Ks/s) and the number of samples (600,000). I will report this also on the forum, since I don’t know if resolved bugs create notifications when comments are left.
- 2019-08-21T00:52:20+00:00
Dan Nessett
I should note that the breakpoint I set in getVal() is at the “if (waitCond)” statement.
- 2019-08-21T01:18:17+00:00
M. Gronle reporter
Hi Dan,
without having more details, I wouldn’t say that this timeout is related to this closed issue. The timeout is probably correct. The acquire method immediately returns to Python, however since the data acquisition needs around 30sec, the nidaqmx plugin is blocked for further calls (e.g. getVal) for the next 30seconds. The default timeout for calls like acquire or getVal is 10seconds (this can be changed in the itom property dialog). Therefore the call to getVal is postponed until acquire is also finished in the corresponding C++ acquire method. This will probably last for around 30seconds, hence, getVal cannot be processed earlier such that the timeout occurs.

In order to avoid this, you have to regularly call setAlive in the C++ code of the acquire method (or sub-methods) to tell itom that the plugin is still processing. Then the 10seconds timeout is moved to a later time whenever you call setAlive.

Cheers

Marc
- 2019-08-21T19:40:26+00:00
Dan Nessett
Hi Marc, I don't really like the solution of continually calling setAlive in the acquire method. First, this is polling, which is not a modern solution to synchronization problems. Second, and perhaps more importantly, it doesn't fit well with analog input continuous channels. To implement this strategy you would have to have a flag that indicates the input is still continuing and periodically call setAlive during this interval. Then, when you stop the continuous input, this flag would have to be cleared and the acquire method exited. This means you would have to throw a thread that stays in the acquire method polling with setAlive and have the original thread that entered the acquire method exit normally. This isn't really an elegant solution to the problem. What I would prefer is a variant of setAlive that takes an argument specifying how long to wait before triggering a timeout. For finite analog input the acquire method, using the sample rate and number of samples, can compute how long it will take for the input to complete normally. This time value can be given as an argument to the setAlive variant (perhaps adding a small increment to ensure the timeout won't occur during the input activity). For continuous analog input, it would be useful to have the ability to turn off timeouts for the calling thread and re-enable them after the continuous input completes. Is there any way to do this? Regards, Dan
- 2019-08-21T21:15:18+00:00
Dan Nessett
Hi Marc, I solved this by setting all timeouts to -1 in the Application properties. This is probably not the best way to handle the problem in an operational installation. I could poke around the itom code and see if it would be possible to add the necessary functionality to set timeouts on a per-thread basis rather than on a per itom basis, but I wonder if others would welcome this. Let me know. Cheers, Dan
- 2019-08-22T23:55:12+00:00
Log in to comment

Assignee: M. Gronle

Type: bug

Priority: major

Status: resolved

Component: core

Milestone: –

Version: –

Votes: 0

Watchers: 1