Scheduler stops scheduling tasks when a node's DNS name cannot be resolved

Issue #70 new
IT Expert created an issue

Hi, Olivier.

I have found the following issue:

If one of the computation nodes is removed from DNS, godocker-scheduler tries to connect to it, fails, and then stops scheduling tasks:

2018-04-13 08:39:17,126 INFO  [godocker-scheduler][Thread-1] Registered with framework ID 26fd306b-ffd4-407f-a424-a9b78e1ca54c-0000
2018-04-13 08:39:17,127 DEBUG [godocker-scheduler][Thread-1] StreamId: a5f43bc8-5e47-48c1-b011-c5103e28c0f7
2018-04-13 08:39:17,127 INFO  [godocker-scheduler][Thread-1] Master hostname: mvnode2.company.com
2018-04-13 08:39:17,127 INFO  [godocker-scheduler][Thread-1] Master port: 5050
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe57e8e5ef0>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /metrics/snapshot
WARNING:urllib3.connectionpool:Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe57e8e5f60>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /metrics/snapshot
WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe57e8720b8>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /metrics/snapshot
2018-04-13 08:39:17,347 ERROR [godocker-scheduler][Thread-1] Failed to connect to mesos slave vnode2.company.com: HTTPConnectionPool(host='vnode2.company.com', port=5051): Max retries exceeded with url: /metrics/snapshot (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe57e872160>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connection.py", line 141, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.5/socket.py", line 733, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 357, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.5/http/client.py", line 1107, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.5/http/client.py", line 1152, in _send_request
    self.endheaders(body)
  File "/usr/lib/python3.5/http/client.py", line 1103, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python3.5/http/client.py", line 934, in _send_output
    self.send(msg)
  File "/usr/lib/python3.5/http/client.py", line 877, in send
    self.connect()
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connection.py", line 166, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connection.py", line 150, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fe57e872160>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/go-docker/plugins/mesos.py", line 1273, in usage
    r = http.urlopen('GET', 'http://' + slave['hostname'] + ':' + str(slave['port']) + '/metrics/snapshot')
  File "/usr/local/lib/python3.5/dist-packages/urllib3/poolmanager.py", line 321, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 668, in urlopen
    **response_kw)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 668, in urlopen
    **response_kw)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 668, in urlopen
    **response_kw)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/local/lib/python3.5/dist-packages/urllib3/util/retry.py", line 388, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='vnode2.company.com', port=5051): Max retries exceeded with url: /metrics/snapshot (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe57e872160>: Failed to establish a new connection: [Errno -2] Name or service not known',))
2018-04-13 08:39:17,470 INFO  [godocker-scheduler][Thread-1] Load slaves information

Comments (7)

  1. Olivier Sallou repo owner

    I don't understand why scheduling stops. Exception is caught in case of connection issue, and process continues.
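
    A minimal sketch of the "catch and continue" pattern described here, not the actual code in plugins/mesos.py. The URL layout is taken from the traceback above; `collect_usage`, the injected `fetch` callable, and the returned pair are hypothetical names chosen for illustration.

```python
import logging

logger = logging.getLogger('usage-sketch')


def collect_usage(slaves, fetch):
    """Query each slave's metrics endpoint; log and skip slaves that fail.

    `fetch` is a hypothetical callable taking a URL and returning the parsed
    response; it is injected so the loop can be exercised without a network.
    """
    usage = {}
    failed = []
    for slave in slaves:
        url = 'http://' + slave['hostname'] + ':' + str(slave['port']) + '/metrics/snapshot'
        try:
            usage[slave['hostname']] = fetch(url)
        except Exception as err:  # e.g. urllib3.exceptions.MaxRetryError
            # One unreachable slave is recorded, then the loop moves on.
            logger.error('Failed to connect to mesos slave %s: %s', slave['hostname'], err)
            failed.append(slave['hostname'])
    return usage, failed
```

    With a `fetch` that raises for an unresolvable hostname, the remaining slaves are still queried and the failing one only shows up in the `failed` list.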

    However, to schedule a task on an offer, the scheduler expects to find the hostname in the info fetched from mesos or in the slave labels. If no hostname is found, no task is placed on this offer.

    The hostname is mandatory, whether it comes from mesos or from slave attributes (it is required to connect to the slave), so this is expected behavior.

    But how/why would a compute node be removed from DNS (while the node is still up and running)?

  2. Olivier Sallou repo owner

    If the hostname is available in the offer, subsequent offers are accepted according to:

            if 'hostname' not in labels:
                if offer['agent_id']['value'] in self.slaves:
                    labels['hostname'] = self.slaves[offer['agent_id']['value']]
                    self.logger.debug('Mesos:GetSlaveIdFromMasterInfo:%s' % (labels['hostname']))
                elif 'hostname' in offer and offer['hostname']:
                    labels['hostname'] = offer['hostname']
                    self.logger.debug('Mesos:GetSlaveIdFromOffer:%s' % (labels['hostname']))
                else:
                    self.logger.error('Mesos:Error:Configuration: missing label hostname')
    

    If the hostname is not set in the offer by the slave, we expect to get it at startup from the slave info. If that also fails, subsequent offers from this slave are rejected, because we need to know the slave hostname to create a task. Offers from other slaves should still be accepted.

    However, if the hostname is set, the task will be scheduled, but monitoring and interactive information will not work, because the host will not be reachable (its DNS name is not resolvable).
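
    The lookup order in the excerpt can be tried in isolation with a sketch like the following. It is a standalone illustration, not the plugin's method: `resolve_hostname`, `split_offers`, and the `(offer, labels)` pair layout are assumptions made for this example.

```python
def resolve_hostname(offer, labels, slaves):
    """Return the slave hostname for an offer, or None if it is unknown.

    Lookup order mirrors the excerpt: slave labels, then the hostnames
    recorded from the master at startup (`slaves`), then the offer itself.
    """
    if 'hostname' in labels:
        return labels['hostname']
    agent_id = offer['agent_id']['value']
    if agent_id in slaves:
        return slaves[agent_id]
    if offer.get('hostname'):
        return offer['hostname']
    return None


def split_offers(offers, slaves):
    """Partition (offer, labels) pairs into usable and rejected offers."""
    usable, rejected = [], []
    for offer, labels in offers:
        hostname = resolve_hostname(offer, labels, slaves)
        if hostname is None:
            # No hostname from any source: this offer cannot host a task.
            rejected.append(offer)
        else:
            usable.append((offer, hostname))
    return usable, rejected
```

    Only offers whose hostname cannot be found in any of the three places end up in `rejected`; offers from other slaves are unaffected.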

  3. IT Expert reporter

    > I don't understand why scheduling stops. Exception is caught in case of connection issue, and process continues.

    Maybe it's just a coincidence...

    > However, to schedule a task on an offer, the scheduler expects to find the hostname in the info fetched from mesos or in the slave labels. If no hostname is found, no task is placed on this offer. The hostname is mandatory, whether it comes from mesos or from slave attributes (it is required to connect to the slave), so this is expected behavior.

    I agree with you; we should keep a closer watch on our DNS records.

    > But how/why would a compute node be removed from DNS (while the node is still up and running)?

    It's not static DNS: an application on the host registers the host in DNS for a limited TTL. That registration failed, but the host was still alive.

  4. Olivier Sallou repo owner

    Configuring the mesos slave to send its hostname in offers or in slave attributes should allow scheduling. Only the startup step will fail 'silently', with no additional impact.

    However, we could record that a failure occurred and try again later until it succeeds. I will see whether this can be done with low impact.
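
    One possible shape for "record the failure and try again later", given purely as a sketch and not as the planned implementation. `SlaveInfoLoader` and its `fetch` callable are hypothetical; the real scheduler may track slaves differently.

```python
class SlaveInfoLoader:
    """Remember slaves whose info could not be loaded and retry them later."""

    def __init__(self, fetch):
        self.fetch = fetch      # hypothetical callable: (hostname, port) -> info
        self.info = {}          # hostname -> info for slaves loaded successfully
        self.pending = {}       # hostname -> port for slaves still to retry

    def load(self, hostname, port):
        """Try to load one slave; record the failure instead of raising."""
        try:
            self.info[hostname] = self.fetch(hostname, port)
        except Exception:
            self.pending[hostname] = port
            return False
        self.pending.pop(hostname, None)
        return True

    def retry_pending(self):
        """Retry every recorded failure; return hostnames that still fail."""
        for hostname, port in list(self.pending.items()):
            self.load(hostname, port)
        return sorted(self.pending)
```

    Calling `retry_pending()` from the existing periodic scheduling loop would keep the extra cost to one request per still-failing slave per pass.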
