Deadlock

Issue #57 on hold

Markus KARG created an issue 2015-10-08

More or less frequently I suffer from deadlocks in Babbler, apparently particularly after resuming a previously hibernated PC running an open XmppSession before hiberation:

"Thread-1881": waiting to lock monitor 0x000000002a024800 (object 0x0000000721511a08, a rocks.xmpp.core.session.TcpConnection), which is held by "XMPP Scheduled Ping Thread" "XMPP Scheduled Ping Thread": waiting for ownable synchronizer 0x0000000728924b90, (a java.util.concurrent.locks.ReentrantLock$NonfairSync), which is held by "XMPP Writer Thread" "XMPP Writer Thread": waiting to lock monitor 0x0000000020bc1e00 (object 0x000000072980cde0, a sun.security.ssl.SSLSocketImpl), which is held by "XMPP Scheduled Ping Thread"

(see attached stacktrace.txt file)

Is this a bug or did I do something wrong?

Comments (25)

Christian Schudt repo owner
This looks like a bug. I will see, if I can understand the stacktrace and try to fix this. Is there any easy reproduction?
- 2015-10-08T21:05:19+00:00
Christian Schudt repo owner
I am not sure, if could be a bug in JDK 9, but you are using "1.9.0-ea-b75". Seems like the SSLSocket is blocking something.

I've checked the source code http://grepcode.com/file_/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/sun/security/ssl/SSLSocketImpl.java/?v=source But your stacktrace looks different, so they have changed something. (in JDK 8 there's no writeAlert() method, which is causing the lock).

Could you test with JDK 8 and/or give your thoughts?
- 2015-10-08T21:43:20+00:00
Markus KARG reporter
Test with JDK 8 is prepared, stay tuned for the result. :-)
- 2015-10-09T21:21:32+00:00
Markus KARG reporter
With the test I have to wait until 8u66 is finally released (planned for next days), because I need a bug fix introduced with 9b75... without that, my application never passes a particular point in the JavaFX core code, and would never get to the code line deadlocking in Babbler. I will then test and send you the result.

For now, it looks like Babbler has problems with TCP reconnects sometimes. I hoped we have this stable, but it looks like to got unstable again. :-(
- 2015-10-11T15:40:53+00:00
Markus KARG reporter
While 8u66 is not out yet, I gave it another try on JDK 9b75, and the same problem happens again. I'd say it is definitively a bug with thread sync in Babbler:

Found one Java-level deadlock:

"Thread-1622": waiting to lock monitor 0x000000002980d700 (object 0x00000007215f57b0, a rocks.xmpp.core.session.TcpConnection), which is held by "XMPP Scheduled Ping Thread" "XMPP Scheduled Ping Thread": waiting to lock monitor 0x0000000029c66780 (object 0x0000000721544ab0, a sun.security.ssl.SSLSocketImpl), which is held by "XMPP Reader Thread" "XMPP Reader Thread": waiting for ownable synchronizer 0x0000000720fcf6f0, (a java.util.concurrent.locks.ReentrantLock$NonfairSync), which is held by "XMPP Writer Thread" "XMPP Writer Thread": waiting to lock monitor 0x0000000029c66780 (object 0x0000000721544ab0, a sun.security.ssl.SSLSocketImpl), which is held by "XMPP Reader Thread"

Possibly a race condition where Ping thread tries to run before a reconned was successful, because I never noticed the problem without a broken connection. Also I don't know what Thread-1622 is: It waits for a monitor hold by XMPP Scheduled Ping Thread, but itself has not an "XMPP" name, so what is it?!

Maybe this helps... I hope...

"Thread-1622" #1656 daemon prio=1 os_prio=-2 tid=0x000000002b9a9800 nid=0x1744 waiting for monitor entry [0x000000004aabd000] java.lang.Thread.State: BLOCKED (on object monitor) at rocks.xmpp.core.session.TcpConnection.send(TcpConnection.java:262) - waiting to lock <0x00000007215f57b0> (a rocks.xmpp.core.session.TcpConnection) at rocks.xmpp.core.session.XmppSession.send(XmppSession.java:671) at rocks.xmpp.core.session.XmppClient.send(XmppClient.java:465) at rocks.xmpp.core.chat.ChatSession.sendMessage(ChatSession.java:132) at rocks.xmpp.extensions.chatstates.ChatStateManager.setChatState(ChatStateManager.java:145) at application.chat.XmppChat$StatusSenderService$1.lambda$call$82(XmppChat.java:688) at application.chat.XmppChat$StatusSenderService$1$$Lambda$571/469434518.accept(Unknown Source) at java.util.Optional.ifPresent(Optional.java:160) at application.chat.XmppChat$StatusSenderService$1.call(XmppChat.java:687) at application.chat.XmppChat$StatusSenderService$1.call(XmppChat.java:684) at javafx.concurrent.Task$TaskCallable.call(Unknown Source) at java.util.concurrent.FutureTask.run(FutureTask.java:265) at javafx.concurrent.Service.lambda$null$493(Unknown Source) at javafx.concurrent.Service$$Lambda$412/659155730.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at javafx.concurrent.Service.lambda$executeTask$494(Unknown Source) at javafx.concurrent.Service$$Lambda$410/1069862789.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:746)
- 2015-10-15T12:47:34+00:00
Markus KARG reporter
Just looked at the callstack of Thread-1622 and apparently it is one of my worker threads. I wonder why my worker thread DIRECTLY is working inside of rocks.xmpp. Is that correct? Don't you typically use your own "XMPP" named threads to lock objects? Hence: Shouldn't rocks.xmpp.core.session.TcpConnection.send(TcpConnection.java) be called by some of YOUR threads instead?
- 2015-10-15T12:51:04+00:00

Christian Schudt repo owner

No, TcpConnection.send is called in (your) application thread. It delegates to xmppStreamWriter, which then writes to the socket in a new thread (Xmpp Writer Thread).

@Override
    public final synchronized void send(StreamElement element) {
        if (xmppStreamWriter != null) {
            xmppStreamWriter.send(element);
        }
    }

2015-10-15T13:13:06+00:00

Christian Schudt repo owner
What happened here is that PingManager (Ping thread) closes the connection due to failed ping.

It locks on TcpConnection, then on SSLSocketImpl, then waits for an ReentrantLock, which is held by the Writer thread. At the same time, Writer thread waits for the SSLSocketImpl monitor, held by the ping thread (see above).

My assumption is, that inside the SSLSocketImpl's write operation, it first locks on ReentrantLock, and later on the SSLSocketImpl monitor. And in between the Socket.close first locks on SSLSocketImpl.

Your second stacktrace is similar, but here, the Reader thread waits for an ReentrantLock, which is held by the writer thread, while the writer thread waits for the SSLSocketImpl monitor, which is held by the reader thread.

In any case, these locks are created within SSLSocketImpl! It seems like somewhere in SSLSocketImpl the locking order of these two locks is messed up.

Still curious if JDK 8 has the same problem, because your JDK 9 stacktrace doesn't match to the JDK 8 sources.
- 2015-10-17T20:15:34+00:00
Markus KARG reporter
Thank you for this in-depth analysis of the cause. To me it sounds like a new bug in JDK 9. We should investigate an possible report to Oracle.

JDK 8u66 is announced for October 19. I will download an try ASAP.

Thanks
- 2015-10-17T21:45:21+00:00
Markus KARG reporter
FYI, the problem seems to exist since JRE 1.7:

https://community.oracle.com/message/10750121#
- 2015-10-18T07:10:13+00:00
Markus KARG reporter
Installed JDK 8u66 on two machines. Will try to reproduce now.
- 2015-10-27T07:13:33+00:00
Markus KARG reporter
Works like a charm for more than a week now using JRE 8u66 on two machines. This definitively proofs that the cause is a bug in JRE 9.

Next step is trying out the proposed workarkound (explicit "wrapping" lock around TcpConnection).
- 2015-11-04T22:08:18+00:00
Christian Schudt repo owner
Ok, seems like a good news, thanks for looking into it. Hopefully Oracle will fix it in JDK 9.
- 2015-11-04T22:12:44+00:00
Markus KARG reporter
BTW, found the matching JRE 9 source code needed to understand the attached stack trace. :-) click here
- 2015-11-05T22:59:30+00:00
Markus KARG reporter
I did a quick hack (synchronized all TCP access by a single lock) and indeed it improves the situation on JRE 9. The bad thing is, that the workaround is really ugly kind of a hack, and it does not fix the problem completely (had to restart the software twice in 14 days). So for JRE 8 all is well, but how to go on with JRE 9? Any ideas?
- 2015-11-21T16:41:25+00:00
Christian Schudt repo owner
Did it still deadlock with the hack or was the cause of the restart something else?

Currently I won't bother with JRE 9 too much, because it's still one year until GA. If there's a deadlock in close() and write() operation of JRE 9's SSLSocket, I can only imagine to synchronize these operations. The best thing would of course be, if Oracle fixed it until GA. You are more involved in Java development, maybe you can file a bug?

A last resort option would be to use SocketChannel and SSLEngine and reimplement the TLS handshake using NIO. But that's probably very complicated and we also would need to reimplement the java.net.Proxy support. Socket has both features built-in.
- 2015-11-22T18:28:35+00:00
Markus KARG reporter
Bad news. Today the deadlock happened with JRE 8u66, too. Seems we need to get active here, even in JRE 8! I will attach the stack trace, so you can see the even in JRE 8u66 the problem occurs between the ping thread and the writer thread:

Found one Java-level deadlock:

"Thread-1192": waiting to lock monitor 0x000000001c357218 (object 0x0000000705be9eb0, a rocks.xmpp.core.session.TcpConnection), which is held by "XMPP Scheduled Ping Thread" "XMPP Scheduled Ping Thread": waiting to lock monitor 0x000000001c35a708 (object 0x000000070b7e0d70, a java.net.Socket), which is held by "XMPP Writer Thread" "XMPP Writer Thread": waiting to lock monitor 0x00000000225e0e48 (object 0x000000070b7e0ba0, a sun.security.ssl.SSLSocketImpl), which is held by "XMPP Scheduled Ping Thread"
- 2015-11-24T17:40:49+00:00
Markus KARG reporter
- attached deadlock-8u66.txt
- 2015-11-24T17:41:50+00:00
Markus KARG reporter
BTW, I reported the bug at https://bugs.openjdk.java.net/browse/JDK-8141651 -- back then I thought it would work fine in 8u66, but it simply happens not so often as it seems.
- 2015-11-24T17:45:27+00:00
Christian Schudt repo owner
I see this in the stacktrace:
```
at rocks.xmpp.core.session.XmppStreamWriter.lambda$initialize$0(XmppStreamWriter.java:105)
- locked <0x000000070b7e0d70> (a java.net.Socket)
at rocks.xmpp.core.session.XmppStreamWriter$$Lambda$348/812784214.run(Unknown Source)
```
However, the source never locks on the socket. Do you have an idea? Did you run some modified code?

The writer thread first locks the java.net.Socket (see above), then the SSLSocket. The Ping Thread first locks the SSLSocket, then the java.net.Socket (from within SSLSocket#close()), which explains the deadlock.
- 2015-11-24T19:32:35+00:00
Markus KARG reporter
Thanks a lot for this! In fact, it seems as if there is "stale" code in my JAR with the explicit lock I attempted to do earlier... Very strange, since I actually did "mvn clean install" of both, Babbler 0.7.0-SNAPSHOT and my own application. Maybe Maven picked an outdated binary...?! Anyways, once more you identified the right cause! Thanks a lot! I will manually scratch my Maven repo and give it another "mvn clean install" afterwards... I'll keep you updated!
- 2015-11-24T22:22:25+00:00
Markus KARG reporter
Oracle claims to have fixed the deadlock finally (after I was persistent enough to force them to actually fix the bug instead of pretending there is none). I'll give it a try once JDK-9-EA has a working nightly build available (latest builds do not install on Windows). Waiting for the latter the be fixed before we can go on with this ticket. Sigh.
- 2015-12-29T12:17:30+00:00
Christian Schudt repo owner
Any news on this issue?
- 2016-05-15T19:35:11+00:00
Markus KARG reporter
Since december I switched my JRE multiple times (8u76, 8u91) and had not experienced any deadlocks anymore, but I did not find the time to test on JRE-9-EA. I till do tests on Java 9 in June and let you know the result. Until then, I'd like to suggest to just keep this issue open as a reminder, even if the problem is definitively not a bug in Babbler itself.
- 2016-05-16T12:19:27+00:00
Markus KARG reporter
- changed status to on hold
Let's keep it paused until I found the time to test using JRE 9 EA.
- 2016-05-16T12:21:00+00:00
Log in to comment

Assignee: Christian Schudt

Type: bug

Priority: critical

Status: on hold

Version: 0.6.0

Votes: 0

Watchers: 2

Comments (25)

Found one Java-level deadlock:

Found one Java-level deadlock: