Job fails with node static resource reservation.

Issue #65 resolved
IT Expert created an issue

Hi. I have a node with a static mesos-slave resource reservation (for testing) for the god_r role.

--resources=cpus(*):0;cpus(god_r):4;mem(*):512;mem(god_r):4096;

So I reserved 4 CPUs and 4096 MB of RAM for the god_r role. Then I started the scheduler with the god_r role: mesos: role: "god_r"

It registered with the Mesos master, which successfully sent it offers.

Then I created a task with 1 CPU and 1 GB of RAM, and it failed.

Scheduler log:

2018-02-08 10:22:33,802 DEBUG [godocker-scheduler][Thread-1] {'framework_id': {'value': '1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018'}, 'resources': [{'type': 'SCALAR', 'allocation_info': {'role': 'god_r'}, 'role': 'god_r', 'name': 'cpus', 'scalar': {'value': 4.0}}, {'type': 'SCALAR', 'allocation_info': {'role': 'god_r'}, 'role': 'god_r', 'name': 'mem', 'scalar': {'value': 4096.0}}], 'id': {'value': '1d2d8f83-e7ff-40ff-bf38-d21248192ca6-O1204281'}, 'allocation_info': {'role': 'god_r'}, 'hostname': 'host1', 'url': {'address': {'port': 5051, 'hostname': 'host1', 'ip': '10.0.0.8'}, 'path': '/slave(1)', 'scheme': 'http'}, 'attributes': [{'type': 'TEXT', 'name': 'rack', 'text': {'value': 'hw'}}], 'agent_id': {'value': '1d2d8f83-e7ff-40ff-bf38-d21248192ca6-S124'}}
2018-02-08 10:22:33,803 DEBUG [godocker-scheduler][Thread-1] Mesos:Labels:{'rack': 'hw'}
2018-02-08 10:22:33,803 DEBUG [godocker-scheduler][Thread-1] Mesos:Received offer 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-O1204281 with cpus: 4.0 and mem: 4096.0
2018-02-08 10:22:33,804 DEBUG [godocker-scheduler][Thread-1] Try to place task 21
2018-02-08 10:22:33,804 DEBUG [godocker-scheduler][Thread-1] Task placed on host host1
2018-02-08 10:22:33,804 DEBUG [godocker-scheduler][Thread-1] Mesos:Task:Running:21
2018-02-08 10:22:33,822 DEBUG [godocker-scheduler][Thread-1] Mesos:Offers:End
...
2018-02-08 10:22:34,105 DEBUG [godocker-scheduler][MainThread] Submission duration: 3.0
2018-02-08 10:22:34,106 DEBUG [godocker-scheduler][MainThread] Tasks submitted
2018-02-08 10:22:34,106 DEBUG [godocker-scheduler][MainThread] Get tasks to reschedule
2018-02-08 10:22:35,810 DEBUG [godocker-scheduler][Thread-1] Task 21-0 is in state TASK_ERROR
2018-02-08 10:22:35,810 WARNI [godocker-scheduler][Thread-1] Task 21-0 is in state TASK_ERROR

Task log:

System crashed or failed to start the task: : Total resources cpus(*)(allocated: god_r):1; mem(*)(allocated: god_r):1000 required by task and its executor is more than available cpus(god_r)(allocated: god_r):4; mem(god_r)(allocated: god_r):4096

So I reserved more resources than the task needs, but it seems the scheduler thinks that 1 > 4 or 1000 > 4096, and the task failed.

Or is it a Mesos bug? The scheduler thinks everything is OK and places the task, but the task then fails because Mesos counts the resources wrongly?

Comments (9)

  1. Olivier Sallou repo owner

    I will need to check. The godocker scheduler looks at offers, and if a task needs less than what is in an offer, it is placed. This should behave the same for standard usage and role-assigned usage.

    We receive an offer with:

    Received offer 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-O1204281 with cpus: 4.0 and mem: 4096.0
    

    So we place the task, as you asked for 1 CPU and 1000 for RAM (1G); for go-docker this is fine.

    When I look at log:

    System crashed or failed to start the task: : Total resources cpus(*)(allocated: god_r):1; mem(*)(allocated: god_r):1000 required by task and its executor is more than available cpus(god_r)(allocated: god_r):4; mem(god_r)(allocated: god_r):4096
    

    Mesos says we ask for more than is available: (1, 1000) against (4, 4096). So the problem is that Mesos does not accept it. Maybe for role-specific resources, the offer should specify the allocation in a different way:

    we request cpus(*)(allocated: god_r):1 against cpus(god_r)(allocated: god_r):4 (we ask for cpus(*) in our request and Mesos compares it to cpus(god_r)).

    I will need to look at the Mesos docs for this. As we are connected/registered with a specific role, I do not see why we should specify resources in a different way, but who knows...
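
The mismatch can be illustrated with a minimal sketch. This is not Mesos code: `contains` is a hypothetical helper that assumes resources are matched per (name, role) pair, which is consistent with the error message in the task log.

```python
def contains(available, requested):
    """Return True if every requested (name, role) amount fits in `available`.

    Both arguments map (resource_name, role) -> amount.
    Assumption for illustration: a request made for role "*" is not
    satisfied by resources reserved for another role such as "god_r".
    """
    return all(
        available.get(key, 0.0) >= amount
        for key, amount in requested.items()
    )


# Values taken from the task log in this issue.
available = {("cpus", "god_r"): 4.0, ("mem", "god_r"): 4096.0}
requested_star = {("cpus", "*"): 1.0, ("mem", "*"): 1000.0}
requested_role = {("cpus", "god_r"): 1.0, ("mem", "god_r"): 1000.0}

print(contains(available, requested_star))  # False: no "*" resources available
print(contains(available, requested_role))  # True: 1 <= 4 and 1000 <= 4096
```

Under this model the failure is not about 1 > 4, but about asking for `*` resources when only `god_r` resources were offered.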

  2. Olivier Sallou repo owner

    It seems indeed that for roles, we need to re-specify the role of the resource when accepting the offer. I need to see how to do that in a clean way.
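
A rough sketch of what re-specifying the role could look like. The dict layout mirrors the offer printed in the scheduler log; `task_resources` is a hypothetical helper, not the go-docker implementation, and it assumes that tagging each resource with the role from the offer is enough.

```python
def task_resources(cpus, mem, role="*"):
    """Build the resource list for a task, tagging each entry with `role`.

    Hypothetical helper: the keys ("name", "type", "role", "scalar")
    follow the offer structure shown in the scheduler log.
    """
    return [
        {"name": "cpus", "type": "SCALAR", "role": role,
         "scalar": {"value": float(cpus)}},
        {"name": "mem", "type": "SCALAR", "role": role,
         "scalar": {"value": float(mem)}},
    ]


# Reuse the role seen on the offered resources instead of the default "*".
resources = task_resources(1, 1000, role="god_r")
print([(r["name"], r["role"], r["scalar"]["value"]) for r in resources])
```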

  3. Olivier Sallou repo owner

    OK, so this will need deeper testing. I will set up an environment next week and let you know.

    Do you still see cpus(*)(allocated: god_r):1; (the star after cpus in the accepted offer)?

  4. IT Expert reporter

    yes

    System crashed or failed to start the task: : Total resources cpus(*)(allocated: god_r):1; mem(*)(allocated: god_r):1000 required by task and its executor is more than available cpus(god_r)(allocated: god_r):4; mem(god_r)(allocated: god_r):4096
    
  5. Olivier Sallou repo owner

    After investigation, it is more complex than initially thought when a node has both "*" and "role" resources. The offer proposes a mix of those, so the scheduler needs to pick some resources from both "*" and "role" resources (and maybe with a preference for role-reserved resources when available). So it is no longer a matter of comparing the requested number of CPUs against the number of CPUs in the offer, but of comparing against the number of CPUs for all (*) and the number of CPUs for the role.
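
A minimal sketch of splitting an offer into per-role totals, assuming the offer dict has the layout shown in the scheduler log. `resources_by_role` is a hypothetical name, and the mixed offer below is illustrative (derived from the --resources flags in the report, not copied from a log).

```python
from collections import defaultdict


def resources_by_role(offer):
    """Sum the SCALAR resources of an offer per role, e.g. "*" and "god_r"."""
    totals = defaultdict(lambda: defaultdict(float))
    for res in offer.get("resources", []):
        if res.get("type") != "SCALAR":
            continue  # ranges and sets (ports, ...) are ignored in this sketch
        role = res.get("role", "*")
        totals[role][res["name"]] += res["scalar"]["value"]
    return {role: dict(names) for role, names in totals.items()}


# Illustrative mixed offer: some global memory plus role-reserved cpus/mem.
offer = {"resources": [
    {"type": "SCALAR", "role": "*", "name": "mem", "scalar": {"value": 512.0}},
    {"type": "SCALAR", "role": "god_r", "name": "cpus", "scalar": {"value": 4.0}},
    {"type": "SCALAR", "role": "god_r", "name": "mem", "scalar": {"value": 4096.0}},
]}
per_role = resources_by_role(offer)
print(per_role)
```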

  6. Olivier Sallou repo owner

    I think I have fixed the problem, so that resources can be shared between a role and the "global" role. Some tests were fine on my side. Available in the develop branch (docker :devbuild in progress). I tested:

    • framework with no role, node with mixed role and global resources: checked that a job can run on global resources, but not on role-reserved resources
    • framework with role, node with mixed role and global resources: checked that a job can run on global AND role-reserved resources.

    Placement gives priority to role-based resources, then, if there is not enough, takes the remainder from global resources.
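
The placement rule can be sketched as follows. This is an illustration only, not the code in the develop branch; `pick_resources` and its arguments are hypothetical names.

```python
def pick_resources(needed, role_available, global_available):
    """Split `needed` between role-reserved and global ("*") resources.

    Role-reserved resources are used first; any remainder is taken from
    the global pool. Returns (from_role, from_global), or None when the
    task does not fit.
    """
    from_role, from_global = {}, {}
    for name, amount in needed.items():
        take_role = min(amount, role_available.get(name, 0.0))
        remaining = amount - take_role
        if remaining > global_available.get(name, 0.0):
            return None  # not enough, even with both pools combined
        if take_role > 0:
            from_role[name] = take_role
        if remaining > 0:
            from_global[name] = remaining
    return from_role, from_global


# Illustrative numbers: the memory request exceeds the role reservation.
print(pick_resources(
    {"cpus": 1.0, "mem": 4500.0},
    role_available={"cpus": 4.0, "mem": 4096.0},
    global_available={"mem": 512.0},
))
```

A framework registered without a role would pass an empty `role_available`, so only global resources can be picked.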

    Will be in next release.
