GPU allocation

Issue #47 resolved
Stone sky created an issue

Mesos provides a convenient way to submit GPU tasks, e.g.

    mesos-execute \
      --master=127.0.0.1:5050 \
      --name=gpu-test \
      --docker_image=nvidia/cuda \
      --command="nvidia-smi" \
      --framework_capabilities="GPU_RESOURCES" \
      --resources="gpus:1"

At present the GPU is treated as a standard resource like cpu and mem. I suppose you could use such utilities to simplify GPU task submission.

The advantage is that there is no need to specify the GPU count or manually write the resources options on the slave node.
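
For illustration, here is a minimal sketch of what requesting a GPU as an ordinary scalar resource could look like on the framework side, assuming the protobuf-based Python bindings (mesos.interface / mesos_pb2); the helper name and task variable are hypothetical, not go-docker's actual code:

    from mesos.interface import mesos_pb2

    def add_gpu_resource(task, count=1):
        # task is a mesos_pb2.TaskInfo; "gpus" is the standard Mesos
        # resource name for GPUs, expressed as a scalar like "cpus" and "mem".
        gpus = task.resources.add()
        gpus.name = "gpus"
        gpus.type = mesos_pb2.Value.SCALAR
        gpus.scalar.value = count
        return task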

Comments (24)

  1. Olivier Sallou repo owner

    This is in the plan, but the feature was only introduced in recent releases of Mesos; it was not available when go-docker was initially set up. I also need to investigate how the GPU is mounted in the container.

  2. Stone sky reporter

    Well, I'm currently not familiar with Mesos. But I think a scheduler should not need to know how to mount a GPU into a task container; that is handled by the Mesos slave. So what we should do here is take GPUs into consideration when scheduling a task, and then give the Mesos master the correct job attributes depending on whether a GPU is required.

  3. Olivier Sallou repo owner

    I agree, I expect not to have to manage anything, but I'll need to check the current status of the feature.

  4. Konstantin Bokhan

    Dear Colleagues,

    Please note that Mesos currently doesn't support GPU isolation for the Docker containerizer, only for the Mesos containerizer.

  5. Olivier Sallou repo owner

    I had seen this, thanks. Go-docker also supports the Mesos containerizer. I have started implementing Mesos GPU support in the framework; now I need to get a server with GPUs to test. Should be OK in the coming weeks. Olivier

  6. Stone sky reporter

    Have you noticed that Mesos introduced GPU auto-discovery? It is triggered when the slave specifies the gpu/nvidia isolation; the slave's resources then include the available GPUs. https://reviews.apache.org/r/48366/diff/2#index_header

    But another problem comes with GPU resources.

    According to [this commit](https://reviews.apache.org/r/48914/), a framework that doesn't have the GPU_RESOURCES capability won't get offers from slaves equipped with GPUs. So we have to make some changes to our Mesos scheduler plugin. Unfortunately I haven't found a Python interface nor documentation for adding the capability. A C++ framework should add the following lines:

    framework.add_capabilities()->set_type(
          FrameworkInfo::Capability::GPU_RESOURCES);
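
    For what it's worth, the same thing should be expressible through the protobuf-generated Python bindings, since capabilities is just a repeated field on FrameworkInfo. A hedged sketch (assuming mesos.interface / mesos_pb2 is available; not verified against go-docker's plugin):

    from mesos.interface import mesos_pb2

    framework = mesos_pb2.FrameworkInfo()
    framework.user = ""   # let Mesos fill in the current user
    framework.name = "gpu-capable-framework"
    # Declare the GPU_RESOURCES capability so the master includes
    # GPU-equipped agents in the offers it sends to this framework.
    capability = framework.capabilities.add()
    capability.type = mesos_pb2.FrameworkInfo.Capability.GPU_RESOURCES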
    
  7. Stone sky reporter

    So maybe we could treat the GPU as an ordinary resource like cpu and mem at the scheduling stage, rather than leaving it in advanced_setting/constraints.

  8. Olivier Sallou repo owner

    I could, but it needs additional modifications, including the web interface, and it has to be managed as an optional resource because it is not supported by all schedulers (Swarm, Kubernetes). I plan first to test GPU management with the current behavior. Will see after that how to set it as a first-class resource.

  9. Stone sky reporter

    Perhaps I could help with Mesos GPU support. I checked out your feature_47 branch and made some changes (see https://bitbucket.org/yoursky/go-docker/commits/5668cafa5c80263635850f6c9fd7adb76025a13d):

    • gpus defaults to 1 to test GPU allocation (line 280)
    • some lines seem incompatible with the Mesos protobuf (lines 516-519)
    • delete the cpu_10.5 related things in the slave and don't mount manually. (Your current implementation didn't mount the user-level libraries and utilities needed by Nvidia and CUDA; Mesos gathers them into /var/run/mesos/isolators/gpu/nvidia_<version>, so maybe you can make full use of that. FYI: Mesos says in [http://events.linuxfoundation.org/sites/events/files/slides/mesoscon_asia_gpu_v1.pdf] that Docker GPU support is coming in version 1.2 or 1.3.)

    With these changes, the Mesos native GPU isolator works and nvidia-smi outputs correct results. But on the slave side these errors appear:
    W0315 11:16:39.177366  5984 containerizer.cpp:1809] Skipping resource statistic for container fff617a7-3a8a-41b3-9add-d69247131fe4 because: Unknown container 
    W0315 11:16:39.177419  5982 containerizer.cpp:1871] Skipping status for container fff617a7-3a8a-41b3-9add-d69247131fe4 because: Unknown container 
    W0315 11:16:39.177543  5982 containerizer.cpp:1871] Skipping status for container fff617a7-3a8a-41b3-9add-d69247131fe4 because: Container does not exist
    W0315 11:16:39.177489  5984 containerizer.cpp:1809] Skipping resource statistic for container fff617a7-3a8a-41b3-9add-d69247131fe4 because: Unknown container
    

    I'm going to look into containerizer.cpp's source code.

    I'm not familiar with the web UI and CLI management tools, so maybe you can first make some changes to them to provide GPU resource requests, management and quotas.

  10. Olivier Sallou repo owner

    I already have updated code in the feature branch. I validated/updated my code yesterday on a server with GPUs running Mesos 1.1.0. I am updating the web UI code to specify gpus like cpus and mem.

    Thanks anyway.

  11. Jack Yang

    Hello there,

    I'm running a Mesos cluster that has GPU resources at its disposal, but I cannot utilize the GPU resources (which are controlled by Mesos) via go-docker.

    Mesos version: 1.1.0

    go-docker and go-docker-web projects (I'm using the develop branch)

    Command to start Mesos Master:

    [root@mesos-server1 ~]# mesos-master --ip=192.168.1.118 --work_dir=/tmp/mesos
    

    Start Mesos Agent:

    [root@mesos-server1 ~]# mesos-agent --master=192.168.1.118:5050 --work_dir=/tmp/mesos56 --image_providers=docker --containerizers=mesos --isolation="docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia" --attributes="hostname:192.168.1.118"
    

    I've run the following command and it works fine.

    [root@mesos-server1 ~]# mesos-execute --master=192.168.1.118:5050 --name=gpu-test --command="nvidia-smi" --framework_capabilities="GPU_RESOURCES" --resources="gpus:1"
    

    But I got an error when running go-docker.

    The error message that I got:

    No devices were found
    

    go-d.ini: mesos-go-docker.png

    go-docker-web: go-web-ui.png

    go-log.png

    Thanks a lot

  12. Olivier Sallou repo owner

    I could test successfully on a server with a GPU device, but I will have a look. Did you check god.err as well as the Mesos job logs on the agent via the Mesos UI? The Mesos slave job logs should show whether or not the GPU was requested. Lastly, in the job details in the go-docker web UI (after submission), do you see the GPU requirement? Olivier

  13. Jack Yang

    Hi Olivier

    1. god.err log file is empty

    2. Mesos agent log

    I0406 01:41:42.965502 18873 slave.cpp:5044] Current disk usage 9.49%. Max allowed age: 5.635917774648692days
    I0406 01:42:42.966145 18879 slave.cpp:5044] Current disk usage 9.49%. Max allowed age: 5.635917774648692days
    I0406 01:43:28.266775 18860 slave.cpp:1539] Got assigned task '22' for framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000
    I0406 01:43:28.267211 18860 gc.cpp:83] Unscheduling '/tmp/mesos64/slaves/08e8f2b1-1bc0-427d-9449-1e3ce7cb19b1-S0/frameworks/4e23a62f-afc7-42f1-9a80-030194c9927f-0000' from gc
    I0406 01:43:28.267428 18863 slave.cpp:1701] Launching task '22' for framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000
    I0406 01:43:28.268244 18863 paths.cpp:536] Trying to chown '/tmp/mesos64/slaves/08e8f2b1-1bc0-427d-9449-1e3ce7cb19b1-S0/frameworks/4e23a62f-afc7-42f1-9a80-030194c9927f-0000/executors/22/runs/6a59e53b-fd20-49ad-8776-a9ee718d46bf' to user 'root'
    I0406 01:43:28.275975 18863 slave.cpp:6179] Launching executor '22' of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000 with resources cpus(*):0.1; mem(*):32 in work directory '/tmp/mesos64/slaves/08e8f2b1-1bc0-427d-9449-1e3ce7cb19b1-S0/frameworks/4e23a62f-afc7-42f1-9a80-030194c9927f-0000/executors/22/runs/6a59e53b-fd20-49ad-8776-a9ee718d46bf'
    I0406 01:43:28.276391 18879 containerizer.cpp:938] Starting container 6a59e53b-fd20-49ad-8776-a9ee718d46bf for executor '22' of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000
    I0406 01:43:28.276420 18863 slave.cpp:1987] Queued task '22' for executor '22' of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000
    I0406 01:43:28.279960 18860 provisioner.cpp:294] Provisioning image rootfs '/tmp/mesos64/provisioner/containers/6a59e53b-fd20-49ad-8776-a9ee718d46bf/backends/copy/rootfses/afa8f034-d5c5-44c3-a44f-a88165e2dfc3' for container 6a59e53b-fd20-49ad-8776-a9ee718d46bf
    W0406 01:43:31.144554 18873 containerizer.cpp:1871] Skipping status for container 6a59e53b-fd20-49ad-8776-a9ee718d46bf because: Unknown container
    W0406 01:43:31.144624 18873 containerizer.cpp:1871] Skipping status for container 6a59e53b-fd20-49ad-8776-a9ee718d46bf because: Container does not exist
    W0406 01:43:31.144994 18872 containerizer.cpp:1809] Skipping resource statistic for container 6a59e53b-fd20-49ad-8776-a9ee718d46bf because: Unknown container
    W0406 01:43:31.145081 18872 containerizer.cpp:1809] Skipping resource statistic for container 6a59e53b-fd20-49ad-8776-a9ee718d46bf because: Unknown container
    I0406 01:43:32.127301 18855 linux_launcher.cpp:421] Launching container 6a59e53b-fd20-49ad-8776-a9ee718d46bf and cloning with namespaces CLONE_NEWNS
    I0406 01:43:32.129673 18855 systemd.cpp:96] Assigned child process '19673' to 'mesos_executors.slice'
    I0406 01:43:32.341197 18879 slave.cpp:3231] Got registration for executor '22' of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000 from executor(1)@192.168.1.118:34369
    I0406 01:43:32.342381 18876 slave.cpp:2191] Sending queued task '22' to executor '22' of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000 at executor(1)@192.168.1.118:34369
    I0406 01:43:32.349393 18856 slave.cpp:3634] Handling status update TASK_RUNNING (UUID: 27518af7-454b-471b-96dd-6ad7234a944b) for task 22 of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000 from executor(1)@192.168.1.118:34369
    I0406 01:43:32.350574 18877 status_update_manager.cpp:323] Received status update TASK_RUNNING (UUID: 27518af7-454b-471b-96dd-6ad7234a944b) for task 22 of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000
    I0406 01:43:32.350903 18873 slave.cpp:4051] Forwarding the update TASK_RUNNING (UUID: 27518af7-454b-471b-96dd-6ad7234a944b) for task 22 of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000 to master@192.168.1.118:5050
    I0406 01:43:32.351126 18873 slave.cpp:3961] Sending acknowledgement for status update TASK_RUNNING (UUID: 27518af7-454b-471b-96dd-6ad7234a944b) for task 22 of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000 to executor(1)@192.168.1.118:34369
    

    3. go-docker web UI: go-resource.png

    Thank you

  14. Olivier Sallou repo owner

    OK, it seems to be an issue on the web side where the GPU field appears but is not taken into account. The job detail should show gpu=1. On the slave we see cpus(*):0.1; mem(*):32 in work directory '/tmp/'; a gpus entry would appear if it had been requested. So the issue is either in views.py or in the HTML/JS part. Could you check in your browser's developer tools, in the network panel when submitting a task, the JSON content of the job submission request: what is the gpu value in the requirements? You could try to start the web server with the environment variable PYRAMID_ENV=dev; this will load the uncompressed/unminified files. I will check on my side the status of the web server in non-dev mode. Olivier

  15. Olivier Sallou repo owner

    I could reproduce it; the gpu field is not taken into account, certainly a last modification in the code that broke it. I am going to investigate; it should be fixed quickly.

  16. Olivier Sallou repo owner

    The issue was introduced when I added a last-minute "if" condition to display or hide the gpu field according to the configuration. I just fixed the issue in develop. Be sure to refresh your cache after updating.
