Massive provider-initiated reconnection problems on all existing FreshTomato versions since the original AdvancedTomato 140 (EA6700 and up)

Issue #13 closed
TheHiman created an issue

On a German PPPoE VDSL line I am experiencing massive problems that make FreshTomato nearly unusable on all existing versions.

The problem is the provider-initiated “forced” disconnection of the PPPoE session every 24 hours, exactly after 23:59:59 hours.
The same effect can be reproduced by manually disconnecting the session and reconnecting it.

This is the blocker/major bug:

The existing firewall/conntrack state is NOT cleared and NOT correctly reloaded with the NEW data after reconnecting to the ISP.

This immediately results in:

a.) After the ISP reconnect, a new IPv6 session address from the ISP mostly arrives with a huge delay. Often NO new IPv6 address is assigned or offered at all, even after many minutes.
b.) Many existing internal LAN clients can't reconnect to their existing services, especially Fritzboxes with VoIP via NAT.
c.) Reconnections of internal VPNs (PPTP/L2TP) to regular or manually set ports no longer work with the new session.
d.) Many devices still have the old IPv6 addresses cached, and the router doesn't send the new, updated addresses via RA or a DHCPv6 offer.

The only fix is a HARD reset and this works for exactly one day - until the next forced remote disconnection is triggered by the ISP.
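
To make the stale conntrack state visible after a reconnect, a small script can help. The following is only a sketch, assuming the kernel exposes the table as text in the usual `/proc/net/nf_conntrack` layout (on older kernels `/proc/net/ip_conntrack`). The function name `find_stale_entries` and all sample addresses are made up for illustration. It lists entries that still reference the WAN address from before the reconnect:

```python
import re

# Matches key=value tokens such as "src=192.168.1.10" or "dport=5060".
_PAIR = re.compile(r"(\w+)=(\S+)")


def find_stale_entries(lines, old_wan_ip):
    """Return conntrack lines that still mention the pre-reconnect WAN IP.

    `lines` is an iterable of text lines in the /proc/net/nf_conntrack
    layout (an assumption). An entry counts as stale when any src= or
    dst= field equals `old_wan_ip`.
    """
    stale = []
    for line in lines:
        for key, value in _PAIR.findall(line):
            if key in ("src", "dst") and value == old_wan_ip:
                stale.append(line.rstrip("\n"))
                break
    return stale


if __name__ == "__main__":
    # Illustrative sample only (documentation addresses, not real data).
    sample = [
        "ipv4     2 udp      17 170 src=192.168.1.20 dst=198.51.100.7 "
        "sport=5060 dport=5060 src=198.51.100.7 dst=203.0.113.5 "
        "sport=5060 dport=5060 [ASSURED] mark=0 use=2",
        "ipv4     2 tcp      6 431990 ESTABLISHED src=192.168.1.30 "
        "dst=198.51.100.9 sport=40000 dport=443 src=198.51.100.9 "
        "dst=203.0.113.99 sport=443 dport=40000 [ASSURED] mark=0 use=2",
    ]
    for entry in find_stale_entries(sample, old_wan_ip="203.0.113.5"):
        print(entry)
```

If entries for the old address are still listed minutes after the reconnect, the table was not flushed. Where conntrack-tools is installed, `conntrack -F` flushes the table; whether that tool is present on a given build would need to be checked.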

The latest version that does NOT have this problem is AdvancedTomato 3.5-140 from 2017, which I used on an E3000 router for a long time without any problems.

I tested the latest AdvancedTomato work from here: https://bitbucket.org/AndreDVJ/advancedtomato-arm/commits/
and the problem described above is the same there, so the issue must have come in right after the fork in 2017?

At the moment I am forced to keep using this old version (again) on a Linksys EA6700 to have things working seamlessly, as they did with my E3000 over the last years.

urgent todo:

.) Check the differences between a cold start AND a manual “disconnect” and “reconnect” of a PPPoE session, and check the firewall / conntrack for IPv4 and IPv6 after each of them.

.) Check why PPPoE on a dual-stack ISP mostly doesn't get a new IPv6 address this way, or why there is a huge delay of 30 seconds to 3 minutes. An address that arrives this late is then not propagated or included in the firewall correctly.

.) Simulate an external modem disconnection by unplugging the WAN port for 1 or 2 minutes and replugging it. This is perfect for emulating a line disconnection, which is nearly the same as the daily forced disconnection by the ISP.

.) Check whether internal clients can reconnect to VoIP services via NAT (using the active NAT helpers), along with a few NATed L2TP/PPTP sessions, and check on a smartphone whether the IPs got updated after the “external” reconnect. If they were not updated, disable WLAN manually and re-enable it; the device should then have the updated IPv6 addresses. Then check with an IPv6 test website whether both IPv4 and IPv6 are working again.

.) German ISPs usually enforce a reconnect delay when the PPPoE session was NOT shut down CLEANLY! This reconnect block usually lasts 2-3 minutes. It is triggered when you, for example, disconnect power without a clean session shutdown.
When the router reboots after 1 minute, you still have to wait 1 or 2 more minutes before you can reconnect successfully.
This point is important for all tasks where you change the configuration and the PPPoE session needs to be restarted “on the fly”!
It is important to always shut down the session cleanly and send a PPP termination signal to the ISP. This way you can always reconnect immediately without waiting out the “punishing” time. Handling IPv6 here is important, too, because on dual-stack ISPs this measure is split between IPv4 and IPv6.
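
For the first todo item, one way to compare the state after a cold start with the state after a reconnect is to save a rule dump in both situations and diff them. A minimal sketch, assuming the dumps come from `iptables-save` / `ip6tables-save` and were stored as text; the helper names `normalize` and `rule_diff` and the sample rules are hypothetical:

```python
import difflib
import re

# Packet/byte counters such as "[123:4567]" differ between any two dumps
# and would only add noise to the comparison.
_COUNTERS = re.compile(r"\[\d+:\d+\]")


def normalize(dump):
    """Drop blank lines, comment lines and counters from a rule dump."""
    cleaned = []
    for line in dump.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cleaned.append(_COUNTERS.sub("", line).strip())
    return cleaned


def rule_diff(cold_start_dump, after_reconnect_dump):
    """Return ("added" | "removed", rule) tuples between two dumps."""
    diff = difflib.unified_diff(
        normalize(cold_start_dump),
        normalize(after_reconnect_dump),
        fromfile="cold_start",
        tofile="after_reconnect",
        lineterm="",
        n=0,
    )
    changes = []
    for line in list(diff)[2:]:  # skip the "---" / "+++" header lines
        if line.startswith("+"):
            changes.append(("added", line[1:]))
        elif line.startswith("-"):
            changes.append(("removed", line[1:]))
    return changes


if __name__ == "__main__":
    # Made-up example dumps; real ones would be read from saved files.
    cold = """# Generated by iptables-save
*filter
:INPUT DROP [10:800]
-A INPUT -i ppp0 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -d 192.168.1.20/32 -p udp --dport 5060 -j ACCEPT
COMMIT
"""
    after = """# Generated by iptables-save
*filter
:INPUT DROP [55:4400]
-A INPUT -i ppp0 -m state --state RELATED,ESTABLISHED -j ACCEPT
COMMIT
"""
    for change, rule in rule_diff(cold, after):
        print(change, rule)
```

An empty result means both dumps contain the same rules, which would point away from the rule set and towards conntrack or the IPv6 address handling. Rules that embed the WAN address will naturally differ after every reconnect, so those lines are expected.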

Comments (14)

  1. TheHiman reporter

    Basically, I noticed some important things:

    a.)
    The firewall/conntrack problem starts right after the migration from iptables 1.4 to 1.6 in any Tomato version. Maybe some (new) extra options are missing that clean up old stale connections, or the existing ones no longer work with iptables 1.6?

    b.)
    The massive IPv6 PPPoE/DHCP problem may be a result of the many ppp/pppoe patches over time. The old AdvancedTomato from 2017 always gets the IPv4 PPP address very quickly, and the IPv6 address including the /56 prefix via RA usually takes no longer than 1-3 seconds here, no matter whether cold started or manually disconnected/reconnected. The versions patched over time currently get the IPv6 data in time only after a complete cold start, or more or less at random, but mostly too late, and it is then often not included in the running iptables rules when it arrives minutes later.

  2. Łukasz Turoń

    I’ve recently upgraded my R7000 from 2019.1 to 2020.2 (AIO64K). It seems that I have a similar issue - my ISP seems to restart the connection once a day.

    After PPPoE goes down, the R7000 seems to become unresponsive (LAN switching not working, WiFi not accessible - could not authenticate). I believe some processes are still working, since I can see some entries in “messages” from cron jobs.

    The only way to recover is a hard restart (power off and on). The issue was not visible on 2019.1.

    Have you resolved that issue?

    Apr 17 17:12:42 mynetgear7000 daemon.info pppd[1329]: No response to 5 echo-requests
    Apr 17 17:12:42 mynetgear7000 daemon.notice pppd[1329]: Serial link appears to be disconnected.
    Apr 17 17:12:42 mynetgear7000 daemon.info pppd[1329]: Connect time 1224.9 minutes.
    Apr 17 17:12:42 mynetgear7000 daemon.info pppd[1329]: Sent 547746822 bytes, received 3992449203 bytes.
    Apr 17 17:12:50 mynetgear7000 user.info redial[1330]: Redial: wan DOWN. Reconnecting ...
    Apr 17 17:13:00 mynetgear7000 cron.info crond[1012]: USER root pid 15749 cmd /usr/sbin/watchdog alive
    Apr 17 17:13:00 mynetgear7000 cron.info crond[1012]: USER root pid 15750 cmd service vpnclient1 start
    Apr 17 17:13:35 mynetgear7000 daemon.notice openvpn[1806]: [myVpn] Inactivity timeout (--ping-restart), restarting
    Apr 17 17:13:35 mynetgear7000 daemon.notice openvpn[1806]: SIGUSR1[soft,ping-restart] received, process restarting

    Apr 18 19:36:00 mynetgear7000 cron.info crond[1006]: USER root pid 15243 cmd ddns-update 0
    Apr 18 19:43:15 mynetgear7000 daemon.info pppd[1165]: No response to 5 echo-requests
    Apr 18 19:43:15 mynetgear7000 daemon.notice pppd[1165]: Serial link appears to be disconnected.
    Apr 18 19:43:15 mynetgear7000 daemon.info pppd[1165]: Connect time 1573.6 minutes.
    Apr 18 19:43:15 mynetgear7000 daemon.info pppd[1165]: Sent 491184656 bytes, received 2196243368 bytes.
    Apr 18 19:45:00 mynetgear7000 cron.info crond[1006]: USER root pid 15305 cmd /usr/bin/btcheck check
    Apr 18 19:45:00 mynetgear7000 cron.info crond[1006]: USER root pid 15306 cmd rcheck --cron
    Apr 18 19:45:00 mynetgear7000 cron.info crond[4091]: USER root pid 15318 cmd /usr/bin/btcheck check
    Apr 18 19:45:00 mynetgear7000 cron.info crond[4091]: USER root pid 15319 cmd rcheck --cron
    Apr 18 19:45:04 mynetgear7000 user.debug rcheck[15330]: Breaking /var/lock/restrictions.lock
    Apr 18 19:47:00 mynetgear7000 cron.info crond[1006]: USER root pid 15331 cmd ddns-update 0
    Jan 1 01:00:12 mynetgear7000 user.info hotplug[905]: USB ext3 fs at /dev/sda2 mounted on /opt
    Jan 1 01:00:12 mynetgear7000 kern.info kernel: EXT4-fs (sda2): recovery complete
    Jan 1 01:00:12 mynetgear7000 kern.info kernel: EXT4-fs (sda2): mounted

    I can do some debugging with some help - I’m experienced in some other stuff 😉

  3. TheHiman reporter

    It looks like the reconnection/conntrack problems have been mostly gone for 2 days now with the current trunk.

    But it still needs further monitoring of many more internal services that hold IPv4 and IPv6 connections open, to see whether all services can really reconnect reliably when different reconnection scenarios from the ISP side happen over time.

  4. TheHiman reporter

    Currently there are still issues with PPPoE reconnections:

    a.) When an ISP-initiated reconnection happens (Deutsche Telekom, 1&1, etc.), IPv6 is NOT renewed, even after 10+ minutes, and outdated internal/cached IPv6 prefixes are not completely cleared after this event. This results in timeouts on all services that were active on IPv6 before the reconnect.

    b.) When a disconnect/reconnect is triggered manually in the web interface, IPv6 is renewed, but unreliably. Sometimes it doesn't renew IPv6 at all, sometimes only after 20-30 seconds. Internal relaying and notifying (SLAAC, DHCPv6) is not really working. Clients still rely on a lot of outdated IPv6 data minutes later.

    Anyway, there are still old entries in the conntrack table, which result in SIP timeouts, even when “hold connections active every 30 seconds” is enabled in IP-client mode on an internal Fritzbox.

    So in the end, the current trunk is still NOT production ready on lines with a forced disconnection every 24 hours 😞

    Once again, a reminder of the methods for testing this on a Telekom PPPoE access that has no forced disconnection:

    a.) Unplug the WAN port, let PPPoE run into a timeout and reconnect after 30-60 seconds. IPv4 and IPv6 must then be renewed, DHCP must be renewed and all old conntrack entries for the internal network must be cleaned up.

    b.) Use the web interface “disconnect” / “connect” buttons to simulate another kind of reconnection.

    I further noted that a random “user” timeout (unplugging the WAN port...), a remotely initiated reconnection and the manual connect/disconnect method are handled differently. The reinitialisation and workflow currently differ depending on the type of reconnection!

    The results show up very quickly when a VPN connection using L2TP, GRE or PPTP is active on the internal network and no longer works after a reconnection.
    Internal SIP clients take a very long time to realize that the external situation has changed. Currently there is a period of 30-60 minutes of downtime for SIP connections when I look at the Fritzbox log file every night.
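
To check whether clients are still holding addresses from the old prefix after one of these tests, the addresses collected from the LAN can be compared against the prefix currently delegated by the ISP. This is only a sketch using Python's standard `ipaddress` module; the function name `outdated_addresses` is hypothetical and the sample values are documentation addresses, not data from this line:

```python
import ipaddress


def outdated_addresses(client_addresses, current_prefix):
    """Return addresses that are not inside the current delegated prefix.

    Link-local (fe80::/10) addresses are skipped because they do not
    change when the ISP hands out a new prefix. ULA addresses, if the
    LAN uses them, would be reported too and need to be filtered out
    separately.
    """
    network = ipaddress.ip_network(current_prefix)
    outdated = []
    for text in client_addresses:
        address = ipaddress.ip_address(text)
        if address.is_link_local:
            continue
        if address not in network:
            outdated.append(text)
    return outdated


if __name__ == "__main__":
    # Example: the /56 changed from 2001:db8:aa00::/56 to 2001:db8:bb00::/56.
    seen_on_client = [
        "2001:db8:aa00:1::10",  # still from the old prefix
        "2001:db8:bb00:1::10",  # from the new prefix
        "fe80::1",              # link-local, ignored
    ]
    print(outdated_addresses(seen_on_client, "2001:db8:bb00::/56"))
```

Any address reported here minutes after the reconnect indicates that the old prefix was not withdrawn on the LAN side, which matches the behaviour described in a.) and b.) above.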

  5. M_ars

    Hi

    I was playing/investigating a little bit and yes, sometimes IPv6 has a problem with Telekom as ISP. IPv4 always works. VoIP for Telekom is IPv4? (and should always work). I never had a problem, even in the old days with the 24-hour forced reconnection. Maybe upgrade your contract to a newer one to get rid of the 24 h connection limit :)

    I have a few things I suspect and a little trace.

    BR

  6. TheHiman reporter

    The 2nd line is a 1&1 contract. They always live in the past and are not interested in arriving in 2020 and dropping this outdated 24-hour forced disconnection...
    The VoIP here is currently 1&1 too, so it prefers IPv6. Yes, it is easily possible to configure the IP-client VoIP box to use IPv4 only, but this doesn't solve the problems with other IPv6-related services that have timeouts lasting minutes to hours. I always compare with the old AdvancedTomato, which still does not have these IPv6/conntrack issues after provider-initiated disconnections.
    As I said: the handling is DIFFERENT when the ISP sends a “close” via the PPPoE protocol compared to the user-initiated “disconnect” and “connect” button method, making some changes in the UI, manual service restarts, etc. With those, outdated connections are cleaned up and DHCP/DHCPv6/SLAAC is updated immediately, but not when the disconnect comes at the software level and/or runs into a PPPoE timeout. Maybe we should jump to the same init routines as with the manual reconnect methods?

    The IPv6 timings you noted yourself are, by the way, likewise handled better on the old AdvancedTomato. For now I raised the “redial” interval from 10 to 15 seconds, because I expect that after an ISP-initiated disconnect the user side needs a minimum wait time to get a fresh IPv6 set quickly after a new connect?

  7. M_ars

    I have an idea for an update to the IPv6 code. Maybe it helps/works. I will hopefully have some time in the next few days to test it.

  8. Nathan Young

    I’m experiencing the same 24hr disconnect. All connected devices lose internet connectivity but seem to eventually come back after 5-10 minutes.

    Running 2020.6 K26ARM USB AIO-64K on a Netgear R8000
