Completely hangs after suspend-resume

Issue #19 closed
Markus KARG created an issue

Just noticed that sometimes it happens that a stack trace is printed after my laptop resumes from suspend mode:

Feb 12, 2015 7:01:13 PM rocks.xmpp.core.session.XmppStreamWriter closeConnectionAndNotify
WARNING: java.net.SocketException: Socket is closed
javax.xml.stream.XMLStreamException: java.net.SocketException: Socket is closed
        at com.sun.xml.internal.stream.writers.XMLStreamWriterImpl.close(Unknown Source)
        at rocks.xmpp.core.PrefixFreeCanonicalizationWriter.close(PrefixFreeCanonicalizationWriter.java:136)
        at rocks.xmpp.core.session.XmppStreamWriter.closeConnectionAndNotify(XmppStreamWriter.java:255)
        at rocks.xmpp.core.session.XmppStreamWriter.access$200(XmppStreamWriter.java:55)
        at rocks.xmpp.core.session.XmppStreamWriter$4.run(XmppStreamWriter.java:232)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketException: Socket is closed
        at sun.security.ssl.SSLSocketImpl.checkEOF(Unknown Source)
        at sun.security.ssl.SSLSocketImpl.checkWrite(Unknown Source)
        at sun.security.ssl.AppOutputStream.write(Unknown Source)
        at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
        at java.io.BufferedOutputStream.flush(Unknown Source)
        at java.io.FilterOutputStream.flush(Unknown Source)
        at rocks.xmpp.core.BranchedOutputStream.flush(BranchedOutputStream.java:60)
        at com.sun.xml.internal.stream.writers.UTF8OutputStreamWriter.flush(Unknown Source)
        ... 8 more

As it is quite normal that laptops run into socket timeouts after a resume, we should possibly deal with that situation in a smarter way. For example, we could simply restart the communication on a new socket without showing a stack trace to the end user, to not scare him.

Comments (20)

  1. Christian Schudt repo owner

    Don't log exceptions too eagerly.

    Exceptions which occur during writing to the socket cause the connection to get closed. Exceptions which in turn occur during closing the connection should be added as suppressed exceptions.

    Also make sure that non-final variables become null, so that they are not shutdown/closed twice (which will likely cause a new exception).

    Fixes Issue #19

    → <<cset 7b11c03a2641>>

  2. Christian Schudt repo owner

    Thanks for the report! The issue likely happened because of an Exception during a write operation, which caused the connection to get closed, which then tried to send the closing </stream:stream> element. This caused an exception again, which then tried to close the prefixFreeCanonicalizationWriter for a second time, which then caused your stacktrace.

    I removed the logging and added any exception from a "close()" as suppressed exception. (just like it worked with try-with-resourcement statement). I also nullified prefixFreeCanonicalizationWriter so that it won't get closed twice.

  3. Markus KARG reporter

    Panic! After pulling this changes, the reconnect after a suspend-resume cycle does not work anymore at all! What happens now is that the following is shown in the log...

    OUT: <message to="skarg@jabber.de" type="chat"><thread>f18a5dd6-ccd7-4cfd-aa40-15bdce43d8ac</thread><gone xmlns="http://jabber.org/protocol/chatstates"></gone><
    /message>
    Feb 18, 2015 11:34:16 PM rocks.xmpp.core.session.XmppSession connect
    WARNING: TCP connection to jabber.de:5222 (c70dfca0-e61e-4fe6-a4e3-4b8b8b41c250) failed to connect. Trying alternative connection.
    

    ...(the OUT is the last correctly working thing before the suspend-resume cycle, the WARNING is the first and only line shown after the suspend-resume cycle) and after that Babbler is simply dead. It does not reconnect anymore and I have to restart my application to make it work again! :-(

  4. Christian Schudt repo owner

    I tried to reproduce this with localhost (shutting it down for a while) and also with jabber.de (deactivating my network interface completely). In both cases the reconnect logic starts as soon as a disconnect is detected and will also successfully reconnect as soon as network is available again. Can you give some more detailed information maybe? Is it always reproducible? You could also try to log out the real IOException and/or debug the XmppSession#connect method. It also seems you have configured multiple connection methods?

  5. Markus KARG reporter

    Tried to reproduce it meanwhile, and noticed that apparently it happens only sporadically. Maybe it is dependend of some particular timeout, I don't know. I will try to find out more, but as it seems it needs hours of suspend before it happens, so I think I will collect information about the next days and post again on the weekend with (hopefully) more insight.

  6. Markus KARG reporter

    BTW, funny thing detected. I pulled the LAN wire and tried to send some stanzas, and here is what happens:

    OUT: <message to="skarg@jabber.de" type="chat"><thread>6bb0817e-0227-4c60-a4c4-ae95e6757d70</thread><active xmlns="http://jabber.org/protocol/chatstates"></acti ve></message> OUT: <message to="skarg@jabber.de" type="chat"><thread>6bb0817e-0227-4c60-a4c4-ae95e6757d70</thread><inactive xmlns="http://jabber.org/protocol/chatstates"></in active></message> OUT: <message to="skarg@jabber.de" type="chat"><thread>6bb0817e-0227-4c60-a4c4-ae95e6757d70</thread><active xmlns="http://jabber.org/protocol/chatstates"></acti ve></message> OUT: <message to="skarg@jabber.de" type="chat"><thread>6bb0817e-0227-4c60-a4c4-ae95e6757d70</thread><inactive xmlns="http://jabber.org/protocol/chatstates"></in active></message>

    Babbler thinks it worked. Strange, isn't it?

  7. Christian Schudt repo owner

    I guess this is because how TCP works. If I switch off my WLAN device, I can also still write to the TCP Socket. It takes a while until a write operation throws a SocketException.

  8. Markus KARG reporter

    Oh you are right! In fact, it is why TCP exits in the end... Silly me.

    I checked with MS Knowledge Base and it says the default maximum timeout is 5 * 240 = 1200 seconds (if not customized), i. e. 20 minutes. So if I want to find out a broken connection earlier, I should ping the server more often. So setting the PingManager's interval to one minute should provide a rather agile detection of broken wires, don't you think? I'll try this out!

    Anyways, my bigger problem is still the hanging reconnection. I will add more tracing in the places you mentioned to see what happens over night.

  9. Markus KARG reporter

    Added lots of trace messages to XmppSession.connect(). Hope that I can reproduce soon. Strange but true, the problem did not happen today. Let's just wait until the traces show me where it hangs actually.

    Regarding the second connection: No, I did not provide a second connection. I am using the default configuration as provided by Babbler itself.

  10. Markus KARG reporter

    Problem occured again today on two of three test PCs. Same effect as always: When resuming after several hours in suspend mode, it cannot find TCP server so it tries to use BOSH (as of the default configuration in place), and then hangs without any further log messages. The PCs use different operating systems, and are used in different sites with different internet providers and firewall solutions, so it definitively is either related to Babbler or to jabber.de or both. As a consequence I now added logging code around literally each single line in the connect() method as of [master] [d63b0fa] and deployed again on the same three test PCs. That should provide us at least SOME insight so we hopefully soon have another enlighting stack trance to work with. If the problem will not occur again, I will keep the systems unchanged for one whole week. That should be long enough to reproduce. :-)

  11. Christian Schudt repo owner

    Sigh... this is annoying. I did a test again over night and couldn't reproduce. I don't think it's jabber.de. I guess the IOException will reveal something.

  12. Markus KARG reporter

    Long-Term Test Result

    So finally I can tell you the situation when the hang occurs. :-)

    I had running the version of [master] [d63b0fa] for a full week now without changing anything in the calling application or within Babbler. The test drive included three different PCs using different OS and internet providers, but all clients were accessing jabber.de from different accounts. What I have noticed is solely that the problem occured only once, and it is visible in the below stack trace, happening after resume from suspend mode. Note that there is one minute where nothing happens in this log (between around 08:56:55 and 08:57:55). I have marked that pause with [-----ONE MINUTE PAUSE HERE -----] so you have a better understanding where the software actually is hanging, and might be able to infer why that one minute pause is happening.

    [-----END OF SUSPEND MODE-----]
    Mar 01, 2015 8:56:32 AM rocks.xmpp.core.session.XmppSession connect INFORMATION: BEFORE loginInterval
    OUT: <auth xmlns="urn:ietf:params:xml:ns:xmpp-sasl" mechanism="SCRAM-SHA-1">biwsbj1ta2FyZyxyPWZiY2NXNmhlWUgyMThRYSt2NXFuaXc9PQ==</auth>
    IN : <challenge xmlns="urn:ietf:params:xml:ns:xmpp-sasl">cj1mYmNjVzZoZVlIMjE4UWErdjVxbml3PT0xMDkxOTM3Mi02MTQxLTRlOWQtOTdlNS02OWZlZmJiMTg0Y2Iscz1ObUkwTlRBM1lUa3RNMlF4TnkwMFlqVmxMVGs1T0RBdFptSTVZamMwTlRNek1XTmksaT00MDk2</challenge>
    OUT: <response xmlns="urn:ietf:params:xml:ns:xmpp-sasl">Yz1iaXdzLHI9ZmJjY1c2aGVZSDIxOFFhK3Y1cW5pdz09MTA5MTkzNzItNjE0MS00ZTlkLTk3ZTUtNjlmZWZiYjE4NGNiLHA9SG9oYXFQ
    LzNtZUJnNktWR2VUY0VCemZaNWc4PQ==</response>
    OUT: <message to="s@jabber.de" type="chat"><body>(yawn)</body><thread>350a1633-6034-4305-9624-9221c6a6d464</thread><active xmlns="http://jabber.org/protocol/chatstates"></active></message>
    IN : <success xmlns="urn:ietf:params:xml:ns:xmpp-sasl">dj1Sa1NleW1NL1IyZDZ4dFlPUVNMT3ZzRXVJWGc9</success>
    OUT: <?xml version="1.0" encoding="UTF-8"?><stream:stream xml:lang="de" to="jabber.de" version="1.0" xmlns="jabber:client" xmlns:stream="http://etherx.jabber.org/streams">
    IN : <?xml version='1.0'?><stream:stream id='' xmlns:stream='http://etherx.jabber.org/streams' version='1.0' xmlns='jabber:client'><stream:error><invalid-namespace xmlns='urn:ietf:params:xml:ns:xmpp-streams'/></stream:error></stream:stream>
    OUT: </stream:stream>
    [-----ONE MINUTE PAUSE HERE -----]
    Mar 01, 2015 8:57:55 AM rocks.xmpp.core.session.XmppSession connect INFORMATION: BEFORE connectionIterator.next()
    Mar 01, 2015 8:57:55 AM rocks.xmpp.core.session.XmppSession connect INFORMATION: AFTER connectionIterator.next() / connection = TCP connection to jabber.de:5222 ()
    

    I have no clue of XMPP, but it looks scary to me that the server sends a stream header that includes the word "invalid-namespace".

    I hope that this is helpful, or at least you might be able to infer another place which I should include in the log.

    More Test Results

    Just was done posting my test result, then another hang happend on a different machine. The scenario seems to be slightly different:

    [-----END OF SUSPEND MODE-----]
    [...]
    IN : <presence from="mkarg@einfachjabber.de" to="mkarg@jabber.de" type="error"><error type="cancel"><remote-server-not-found xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"></remote-server-not-found><text xmlns="urn:ietf:params:xml:ns:xmpp-stanzas">Server-to-server connection failed: Connecting failed: connection refused</text></error></presence>
    [-----SEVERAL MINUTES PAUSE-----]
    OUT: <message to="skarg@jabber.de" type="chat"><thread>79be9ceb-f953-4b97-9477-d9dfee5101b0</thread><active xmlns="http://jabber.org/protocol/chatstates"></active></message>
    Mar 01, 2015 9:13:31 AM rocks.xmpp.core.session.XmppSession connect INFORMATION: BEFORE connectionIterator.next()
    Mar 01, 2015 9:21:29 AM rocks.xmpp.core.session.XmppSession connect INFORMATION: AFTER connectionIterator.next() / connection = TCP connection to jabber.de:5222 (d1692b2a-3330-4717-86b8-6293ee426ac8)
    Mar 01, 2015 9:21:29 AM rocks.xmpp.core.session.XmppSession connect INFORMATION: BEFORE connection.connect(null) / connection = TCP connection to jabber.de:5222 (d1692b2a-3330-4717-86b8-6293ee426ac8)
    OUT: <?xml version="1.0" encoding="UTF-8"?><stream:stream xml:lang="de" to="jabber.de" version="1.0" xmlns="jabber:client" xmlns:stream="http://etherx.jabber.org/streams">
    Mar 01, 2015 9:21:29 AM rocks.xmpp.core.session.XmppSession connect INFORMATION: AFTER connection.connect(null) / connection = TCP connection to jabber.de:5222 (d1692b2a-3330-4717-86b8-6293ee426ac8)
    Mar 01, 2015 9:21:29 AM rocks.xmpp.core.session.XmppSession connect INFORMATION: BEFORE awaitNegotiation
    IN : <?xml version='1.0'?><stream:stream xmlns:stream='http://etherx.jabber.org/streams' version='1.0' from='jabber.de' id='d4f67981-701d-4a2e-9833-c61a4f6b6f8d' xml:lang='en' xmlns='jabber:client'><stream:features><starttls xmlns='urn:ietf:params:xml:ns:xmpp-tls'><required/></starttls></stream:features>
    Mar 01, 2015 9:21:29 AM rocks.xmpp.core.session.XmppSession connect INFORMATION: AFTER awaitNegotiation
    Mar 01, 2015 9:21:29 AM rocks.xmpp.core.session.XmppSession connect INFORMATION: BEFORE updateStatus: previousStatus
    Mar 01, 2015 9:21:29 AM rocks.xmpp.core.session.XmppSession connect INFORMATION: AFTER updateStatus
    Mar 01, 2015 9:21:29 AM rocks.xmpp.core.session.XmppSession connect INFORMATION: BEFORE activeConnection.close
    OUT: </stream:stream>
    Mar 01, 2015 9:21:29 AM rocks.xmpp.core.session.XmppSession connect INFORMATION: AFTER activeConnection.close
    Mar 01, 2015 9:21:29 AM rocks.xmpp.core.session.XmppSession connect INFORMATION: BEFORE throwAsXmppExceptionIfNotNull
    [-----ONE MINUTE PAUSE-----]
    Mar 01, 2015 9:22:24 AM rocks.xmpp.core.session.XmppSession connect INFORMATION: BEFORE connectionIterator.next()
    

    BEFORE throwAsXmppExceptionIfNotNull is the invocation of throwAsXmppExceptionIfNotNull(exception); within XmppSession::connect(Jid).

  13. Markus KARG reporter

    Don't know if this is related, but may it is helpful: Using latest ´´´master´´´ ´´´2a3c456´´´ I started two technically identical instances of my GUI within a few seconds on the same PC using two different jabber.de accounts. The first instance started without any problem, the second did not...:

    IN : <presence from="christian.schudt@jabber.de" type="subscribe"></presence>
    IN : <iq id="lx820014" type="set"><query xmlns="jabber:iq:roster" ver="19"><item ask="subscribe" jid="mkarg@einfachjabber.de" subscription="none"><group>Buddys<
    /group></item></query></iq>
    IN : <presence from="mkarg@einfachjabber.de" to="mkarg@jabber.de" type="error"><error type="cancel"><remote-server-not-found xmlns="urn:ietf:params:xml:ns:xmpp-
    stanzas"></remote-server-not-found><text xmlns="urn:ietf:params:xml:ns:xmpp-stanzas">Server-to-server connection failed: Connecting failed: connection refused</
    text></error></presence>
    OUT: <iq id="lx820014" type="result"></iq>
    IN : <message from="skarg@jabber.de/cheekychat" to="mkarg@jabber.de" type="chat"><thread>82d1b4f6-20a6-4757-a354-3cdca3eadccc</thread><inactive xmlns="http://ja
    bber.org/protocol/chatstates"></inactive></message>
    

    I think your assumption is correct that the initial chat status update may or may not interfere with the connect. So maybe it might be a good idea if Babbler would block chat state updates until the connection is done?

  14. Christian Schudt repo owner

    This XMPP log looks quite normal for me, as it seems it's already connected. The incoming error presence shouldn't be harmful for the session state. Unless it produces an exception, but I can't imagine why. Or do you have a stacktrace? The incoming chat state message shouldn't be harmful either.

  15. Markus KARG reporter

    No it doesn't produce an exception. I just wondered why a client receives "Server-to-server connection failed", and particularly why one client does receive it and the other does not. You're the XMPP expert -- if it is OK for you, then it is OK for me, too.

  16. Christian Schudt repo owner

    After our discussion it turned out, that the cause was the DNS lookup timeout, which was zero, which appearently was treated as "indefinite" and then caused the failing reconnection.

    I'm resolving this issue because you reported back that e60305a66f68294ae1f2dc06f9cf16c0af834970 has solved it.

  17. Log in to comment