Hostname detection does not work for Mesos agents added later

Issue #57 resolved
bmerry created an issue

The docs say that it's no longer necessary to set a hostname attribute on Mesos agents. If I start up go-d-scheduler with an existing Mesos cluster with one agent, that works. However, if I then destroy the agent, create a new one (without a hostname attribute) and try to schedule a job, I get:

2017-12-19 08:33:20,555 ERROR [godocker-scheduler][Thread-1] Mesos:Error:Configuration: missing label hostname
2017-12-19 08:33:20,556 ERROR [godocker-scheduler][Thread-1] Error with task 2: 'hostname'

Looking at the code where this error happens, it looks like it is using cached information about agents that isn't updated when new agents join. It's not clear to me why it needs to do this, since every offer contains the agent hostname (see here).

Comments (9)

  1. Olivier Sallou repo owner

    Hi, the agent list is indeed loaded at startup. This is needed to get the agent IP address, which is not in offers. Agents added later on are not known, so if you add agents the scheduler should be restarted.

  2. Olivier Sallou repo owner

    You should set the hostname label to the slave IP or to a hostname that is routable from inside the container. IP is better. However, at startup the scheduler queries the master to get the slave IP addresses, which removes the need for the label. If a node is added afterwards, the scheduler will not know about it, and you need to restart the scheduler to get that info.
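
A minimal sketch of the startup lookup described above, not the actual go-docker code. It assumes the master's slaves listing is JSON with a `slaves` array whose entries carry an `id` and a `pid` of the form `slave(1)@10.0.0.5:5051`. The function name `agent_ips_from_master` is hypothetical, and fetching the payload over HTTP is left out.

```python
import json


def agent_ips_from_master(slaves_json):
    """Map agent id -> IP address parsed from each agent's 'pid' field.

    Assumes a payload shaped like {"slaves": [{"id": ..., "pid": ...}]},
    where pid looks like "slave(1)@10.0.0.5:5051".
    """
    ips = {}
    for slave in json.loads(slaves_json).get("slaves", []):
        pid = slave.get("pid", "")
        if "@" not in pid:
            # No address part; skip rather than guess.
            continue
        address = pid.split("@", 1)[1]             # "10.0.0.5:5051"
        ips[slave["id"]] = address.rsplit(":", 1)[0]  # drop the port
    return ips
```

Because a mapping like this is built only once, an agent that joins later has no entry in it, which matches the error reported in this issue.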

  3. bmerry reporter

    In that case I'm still not seeing why it can't use the hostname field from the offer. How is it different from the hostname retrieved from the master API? For what it's worth, I've written a Mesos framework that successfully uses the hostname from the offer to connect to services started on the slave.

  4. Olivier Sallou repo owner

    Because you are in a container, and usually the hostname is not known by the container and cannot be resolved. If no label is set and the IP is not known, I could however use the hostname as a default, with a log warning.

  5. Olivier Sallou repo owner

    And if I am correct, hostname was not available in offers at some point (it was added later).

  6. Olivier Sallou repo owner

    I know I faced some issues some time ago with hostname unknown with strange name values given to slave.... Anyway, I modified the code so that if the label is not defined on the slave and the IP is unknown, the hostname given in the offer is used as a last option. For the moment this is in the develop branch, with commit #5b2910346e3ab06b53dbb8ddcf7e60c5c5ab397b
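
The resolution order described in this comment could look roughly like the sketch below. This is an illustration under assumptions, not the code from the commit. It assumes the offer is a dict with `agent_id`, `hostname` and a list of `attributes`, each a text attribute with `name` and `text.value`. `resolve_agent_address` and `known_ips` are hypothetical names.

```python
import logging


def resolve_agent_address(offer, known_ips):
    """Pick the address to use for an agent, in order of preference:
    1. the 'hostname' attribute (label) set on the agent,
    2. the IP cached from the master at scheduler startup,
    3. the hostname carried in the offer, with a warning.
    """
    for attr in offer.get("attributes", []):
        if attr.get("name") == "hostname":
            return attr["text"]["value"]

    agent_id = offer["agent_id"]["value"]
    if agent_id in known_ips:
        return known_ips[agent_id]

    hostname = offer.get("hostname")
    if hostname:
        logging.warning(
            "No hostname label or known IP for agent %s, "
            "falling back to offer hostname %s", agent_id, hostname)
        return hostname

    # Nothing usable: same failure mode as the original report.
    raise KeyError("hostname")
```

With this order, an agent that joined after startup and has no label still resolves, as long as its advertised hostname is reachable from inside the container.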

  7. bmerry reporter

    I know I faced some issues some time ago with hostname unknown with strange name values given to slave....

    Was that perhaps with the Mesos slave itself running inside a container? In that case one needs to pass --hostname to the slave to set the hostname that it advertises to the outside world. Not setting that properly causes other issues as well, e.g. I think the Mesos UI uses that hostname to get information directly from agents.

    Your fix sounds sensible - everything that used to work should still work, and systems that have properly set slave hostnames will work regardless of when the slaves join. Since I can just set the attribute for now, I'm not in any rush for the fix to be released.
