Handling mesos maintenance mode

Issue #77 resolved
IT Expert created an issue

Hi, Olivier. I have a problem with Mesos maintenance mode. I put a few compute nodes in maintenance mode (http://mesos.apache.org/documentation/latest/maintenance/). I checked these nodes on the Maintenance tab of the Mesos interface, and they are listed as scheduled for maintenance, which is as expected.

According to the Mesos docs, this should make the resources on a machine unavailable. However, jobs still ended up on these nodes as usual.

Could you please check whether this is a Mesos bug, or whether GoDocker doesn't handle it? Thank you in advance.

Comments (8)

  1. Olivier Sallou repo owner

    Godocker schedules jobs on Mesos offers. If Mesos continues to send offers for those nodes, then Godocker uses them. So in your case it looks like offers are still being sent.

  2. IT Expert reporter

    Moreover, it should send inverse offers. I'm asking just to make sure GoDocker handles them properly.

  3. Olivier Sallou repo owner

    Godocker does not handle inverse offers. I will check the Mesos docs for maintenance mode.

  4. Olivier Sallou repo owner

    For jobs already scheduled, a node in maintenance may kill running jobs. Godocker will keep them as failed and won't reschedule them.

  5. IT Expert reporter

    As far as I understand from the docs, there are three modes:

    • Scheduled window (/maintenance/schedule API endpoint)

    In this mode Mesos sends inverse offers and doesn't take new jobs, but existing jobs should keep running.

    • Node down (/machine/down API endpoint)

    This mode kills all tasks (they are reported as TASK_LOST) and the node doesn't take new jobs.

    • Node up (/machine/up API endpoint)

    Normal operation.
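
The down and up modes listed above are switched through operator endpoints on the Mesos master. As a minimal sketch, assuming the master address `master.mesos:5050` and the node values that appear elsewhere in this thread, the two requests could be built in Python like this. The helper name `machine_request` is hypothetical, and the request is only constructed, not sent. Note that, per the Mesos maintenance docs, a machine should already be part of a maintenance schedule before it is brought down.

```python
import json
import urllib.request

MASTER = "http://master.mesos:5050"  # assumption: master address used in this thread


def machine_request(action, machines):
    """Build (but do not send) a POST for /machine/down or /machine/up.

    `machines` is a list of {"hostname": ..., "ip": ...} dicts; the request
    body is a JSON array of these machine IDs.
    """
    if action not in ("down", "up"):
        raise ValueError("action must be 'down' or 'up'")
    body = json.dumps(machines).encode("utf-8")
    return urllib.request.Request(
        f"{MASTER}/machine/{action}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


node = [{"hostname": "node15.local", "ip": "10.0.0.10"}]
req = machine_request("down", node)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to inspect the URL and body before touching a production master.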

  6. Olivier Sallou repo owner

    With scheduled maintenance, Mesos will stop sending offers once the start date is reached, for the defined duration. Before that date it will send inverse offers, but Godocker will ignore them, as the job may complete before the window starts.
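
The timing rule described here can also be checked on the framework side, since offers for a scheduled node carry an `unavailability` field (visible in the log further down). The following is a hedged sketch of such a check, not Godocker's actual implementation. The function name `in_maintenance_window` is hypothetical, and treating a missing duration as an open-ended window is an assumption based on the Mesos protobuf comments.

```python
import time


def in_maintenance_window(offer, now_ns=None):
    """Return True if the offer's unavailability window has started and not ended."""
    unavailability = offer.get("unavailability")
    if not unavailability:
        return False  # no maintenance scheduled for this agent
    if now_ns is None:
        now_ns = time.time_ns()
    start = unavailability["start"]["nanoseconds"]
    if now_ns < start:
        return False  # window not reached yet: a job may still finish in time
    duration = unavailability.get("duration", {}).get("nanoseconds")
    if duration is None:
        return True  # assumption: no duration means unavailable indefinitely
    return now_ns < start + duration


# Hypothetical offer with a window from t=1000 ns lasting 500 ns.
offer = {"unavailability": {"start": {"nanoseconds": 1000},
                            "duration": {"nanoseconds": 500}}}
print(in_maintenance_window(offer, now_ns=999))   # before the window
print(in_maintenance_window(offer, now_ns=1200))  # inside the window
print(in_maintenance_window(offer, now_ns=1500))  # window is over
```

A scheduler loop would call this for each offer and decline the ones that return True.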

  7. IT Expert reporter

    Looks like it is working now.

    2018-06-27 06:07:22,347 DEBUG [godocker-scheduler][Thread-1] OFFER RECEIVED: [<mesoshttp.offers.Offer object at 0x7f9c976b79b0>, <mesoshttp.offers.Offer object at 0x7f9c976b7438>]
    2018-06-27 06:07:22,348 DEBUG [godocker-scheduler][Thread-1] Mesos:Offers:Begin
    2018-06-27 06:07:22,350 DEBUG [godocker-scheduler][Thread-1] {'id': {'value': '97de61d7-475a-4755-9b4f-33b80046f622-O2478995'}, 'resources': [{'scalar': {'value': 1910366.0}, 'role': '*', 'name': 'disk', 'type': 'SCALAR'}, {'role': '*', 'ranges': {'range': [{'end': 2180, 'begin': 1025}, {'end': 3887, 'begin': 2182}, {'end': 5049, 'begin': 3889}, {'end': 8079, 'begin': 5052}, {'end': 8180, 'begin': 8082}, {'end': 34000, 'begin': 8182}]}, 'name': 'ports', 'type': 'RANGES'}, {'scalar': {'value': 3.0}, 'role': '*', 'name': 'cpus', 'type': 'SCALAR'}, {'scalar': {'value': 8671.0}, 'role': '*', 'name': 'mem', 'type': 'SCALAR'}], 'unavailability': {'start': {'nanoseconds': 1530079515000000000}, 'duration': {'nanoseconds': 518400000000000}}, 'attributes': [{'text': {'value': 'GTX1080'}, 'name': 'gputype', 'type': 'TEXT'}, {'text': {'value': 'gpu'}, 'name': 'rack', 'type': 'TEXT'}], 'framework_id': {'value': '6b6a2a5a-47aa-4773-9047-2a53d4e6600c-0002'}, 'url': {'scheme': 'http', 'address': {'ip': '10.0.0.10', 'hostname': 'node15.local', 'port': 5051}, 'path': '/slave(1)'}, 'agent_id': {'value': '6b6a2a5a-47aa-4773-9047-2a53d4e6600c-S24'}, 'hostname': 'node15.local'}
    2018-06-27 06:07:22,350 DEBUG [godocker-scheduler][Thread-1] **Node node15.local in planned maintenance, skipping...**
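
The `unavailability` values in the offer above are nanoseconds since the Unix epoch (start) and a length in nanoseconds (duration). A small Python sketch that decodes them, using only the numbers copied from the log:

```python
from datetime import datetime, timedelta, timezone

# Values copied from the 'unavailability' field of the offer logged above.
start_ns = 1530079515000000000
duration_ns = 518400000000000

start = datetime.fromtimestamp(start_ns // 10**9, tz=timezone.utc)
duration = timedelta(seconds=duration_ns // 10**9)
end = start + duration

print(start.isoformat())                # 2018-06-27T06:05:15+00:00
print(duration.total_seconds() / 3600)  # 144.0
print(end.isoformat())                  # 2018-07-03T06:05:15+00:00
```

If the scheduler logs in UTC (an assumption), the "skipping" message at 06:07:22 was written about two minutes after the window opened at 06:05:15.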
    

    Mesos also sends inverse offers:

    2018-06-27 06:04:15,145 ERROR [godocker-scheduler][Thread-1] A rescind event have been received for offer: {'offer_id': {'value': '97de61d7-475a-4755-9b4f-33b80046f622-O2478568'}}
    WARNING:mesoshttp.client:INVERSE_OFFERS event no yet implemented
    WARNING:mesoshttp.client:INVERSE_OFFERS event no yet implemented
    WARNING:mesoshttp.client:INVERSE_OFFERS event no yet implemented
    WARNING:mesoshttp.client:INVERSE_OFFERS event no yet implemented
    

    I'll leave the correct request for Mesos maintenance mode here (just for Google):

    # 144-hour maintenance window
    curl -X POST master.mesos:5050/maintenance/schedule  --data '{"windows": [{"machine_ids":[{"hostname": "node15.local", "ip":"10.0.0.10"}], "unavailability": {"start": {"nanoseconds": '$(($(date +%s) + 60))'000000000}, "duration": {"nanoseconds": '$((144 * 3600000000000))'}}}]}'
    

    Both hostname and ip are required.
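
For anyone scripting this instead of using curl, the same JSON body can be produced in Python. This is a sketch under stated assumptions: the function name `maintenance_schedule` and its parameters are hypothetical, and the check for both `hostname` and `ip` simply mirrors the observation above rather than any documented client-side rule. The 60-second delay matches the `date +%s + 60` arithmetic in the curl command.

```python
import json
import time

NS_PER_SECOND = 10**9


def maintenance_schedule(hostname, ip, hours, delay_s=60, now_s=None):
    """Build the body for POST /maintenance/schedule with a single window."""
    if not hostname or not ip:
        raise ValueError("both hostname and ip are required")
    if now_s is None:
        now_s = int(time.time())
    return {
        "windows": [{
            "machine_ids": [{"hostname": hostname, "ip": ip}],
            "unavailability": {
                # Window starts `delay_s` seconds from now, in nanoseconds.
                "start": {"nanoseconds": (now_s + delay_s) * NS_PER_SECOND},
                # Window length in nanoseconds.
                "duration": {"nanoseconds": hours * 3600 * NS_PER_SECOND},
            },
        }]
    }


# now_s is pinned here so the result matches the values seen in the log above.
payload = maintenance_schedule("node15.local", "10.0.0.10", hours=144,
                               now_s=1530079455)
print(json.dumps(payload))
```

Using integer arithmetic throughout avoids the rounding that floating-point seconds would introduce at nanosecond scale.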

    Thank you for investigating the issue!
