Does not attempt to reconnect

Issue #66 closed
Markus KARG created an issue

If there previsouly was a several hours long suspend-resume-cylce of the PC, sporadically Babbler runs into a strange state: It tries to reconnect, fails with that, retries again, fails with that, and then seems to give up (it does not retry anymore, even for hours).

I noticed this because I have a implemented a ping-partner-once-per-second functionality installed which starts on reconnect and stops on connection loss events, so I "see" those events in my log:

INFORMATION: PING RESULT: false Dez 27, 2015 10:37:45 PM rocks.xmpp.core.session.XmppSession tryConnect WARNUNG: TCP connection to jabber.de:5222 failed to connect. Trying alternative connection. Dez 27, 2015 10:38:19 PM rocks.xmpp.core.session.XmppSession tryConnect WARNUNG: TCP connection to jabber.de:5222 failed to connect. Trying alternative connection. Dez 28, 2015 8:43:29 AM rocks.xmpp.core.session.XmppSession tryConnect WARNUNG: TCP connection to jabber.de:5222 failed to connect. Trying alternative connection. Dez 28, 2015 2:25:19 PM rocks.xmpp.core.session.XmppSession tryConnect WARNUNG: TCP connection to jabber.de:5222 failed to connect. Trying alternative connection. Dez 28, 2015 2:51:51 PM application.chat.XmppChat lambda$new$5 INFORMATION: Chat partner is using the following software: Software Version: CheekyChat 0.2.0-SNAPSHOT, Windows 7 Dez 28, 2015 2:52:10 PM application.chat.XmppChat$PingService succeeded INFORMATION: PING RESULT: true Dez 28, 2015 2:52:19 PM application.chat.XmppChat$PingService succeeded

Using thread dump I can see that the "XMPP Reconnection Listener is in TIMED_WAITING state, hence it got parked until the next interval, but that interval apparently never ends -- even after hours!

Now here comes the funny thing: The session immediately is functional as soon as the chat partner send a message! Weird, isn't it?

This means, the session apparently does exist, but babbler does not "know" until somebody external gave it a "kick".

After receiving that external message, the application receives the reconnection event from babbler, so the ping timer wakes up and application works again like a charm.

Happens about once per month, independent of machine / OS, and does ONLY occur when the machine is after a suspend-resume cycle (never happens while the machine is on).

Any ideas?

(Version: 0.7.0-SNAPSHOT -- Sorry, BitBucket does not allow to enter that into 'version' field of ticket)

Comments (12)

  1. Christian Schudt repo owner

    How do you know, that it doesn't retry anymore? Maybe it doesn't retry, because it's successfully reconnected? How do you know, that the session is not functional? Or in other words, what do you expect from the session, so that it "knows" that it exists?

  2. Markus KARG reporter

    Good question! I assumed that no retries happen because of the log: * If a retry fails, log tells me that BOSH was tried instead. Such messages occur before the suspend, and once (!) after the resume, but not further. * If a retry succeeds, log tells me that my frequent PING did either return true or false. Hence, as neither message occur, either no more retries did happen, OR the fact that the last retry was successful did not get promoted to my listener (which, BTW, is notified of the reconnect once someone sends me a message, as I wrote above).

    If you like, I can add some logging into the retry code?

  3. Christian Schudt repo owner

    Sure, add some logging! It's hard for me to reproduce your bug, when it only happens once a month.

  4. Christian Schudt repo owner

    It's thrown by StreamFeaturesNegotiator. After connecting (sending stream header), the client waits 10s for a "<mechanism>" response from the server.

    You could log out the XMPP traffic (ConsoleDebugger) to see if there's indeed a <mechanism/> received.

    Maybe also the timeout is to short in XmppClient#connect() (it is 10s). You could try to increase it.

    -- Christian

  5. Markus KARG reporter

    Ok, traced the XMPP traffic. Seems to be a bug in Babbler, actually, but I cannot find it without your genious. In fact, in some cases it happens that the log contains information that a reconnect is attempted, and the TCP connection was good, but immediately it also says NoResponseException(No response waiting on feature: Mechanisms) and it does not record any XMPP messages for that attempt:

    "Strange" kind of reconnect attempt looks like this:

    • Connected via TCP connection to jabber.de:5222

    • Negotiating stream, waiting until SASL is ready to be negotiated.

    • Reconnection failed.

    • => NO XMPP TRAFFIC RECORDED AT ALL WITHIN THIS ATTEMPT! <=

    "Good" kind of reconnect looks like this instead:

    • Connected via TCP connection to jabber.de:5222

    • OUT: <?xml version="1.0" encoding="UTF-8"?><stream:stream xml:lang="de-DE"...

    • Negotiating stream, waiting until SASL is ready to be negotiated.

    • IN : <?xml version="1.0" encoding="UTF-8"?><stream:stream xmlns:stream=...

    • ...

    I can reproduce that pretty easily in 25% of all unplug-plug cycles!

  6. Christian Schudt repo owner

    I've looked into the code and the only suspicous code is in TcpConnection#connect():

            if (socket != null) {
                // Already connected.
                return;
            }
    

    which would result in the "strange" reconnect and also in the NoResponseException. It's reset to null in close().

    Not sure what happens here, are you able to debug/log?

  7. Christian Schudt repo owner

    To answer the last question: It thinks it's connected because "socket != null". The question is why?

    If there's an error (e.g. SocketTimeoutException in XmppStreamReader), XmppSession.notifyException is called, which first closes the TcpConnection, which nullifies the socket and then updates the status to DISCONNECTED, which then triggers the reconnection.

    Unconnected socket is weird. Maybe the solution is to nullify the socket, when an IOException during socket.connect() ?

  8. Log in to comment