GPU allocation
Mesos provides a convenient way to submit GPU tasks, e.g.
mesos-execute \
  --master=127.0.0.1:5050 \
  --name=gpu-test \
  --docker_image=nvidia/cuda \
  --command="nvidia-smi" \
  --framework_capabilities="GPU_RESOURCES" \
  --resources="gpus:1"
At present, Mesos treats GPUs as a standard resource, like CPU and mem. I suppose you could use such utilities to simplify GPU task submission.
The advantage: there is no need to specify the number of GPUs or write the resource options by hand on the slave node.
Comments (24)
-
repo owner -
reporter Well, I'm not familiar with Mesos yet. But I think a scheduler should not need to know how to mount a GPU into a task container; that is handled by the Mesos slave. So what we should do here is take GPUs into account when scheduling a task, and then give the Mesos master the correct job attributes depending on whether a GPU is required.
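As an illustration of "take GPUs into account when scheduling", here is a minimal sketch of how a scheduler might build the resource list for a task. It uses plain dicts in the JSON shape Mesos uses for scalar resources; the function name `build_resources` and its parameters are hypothetical and not taken from the godocker code.

```python
def build_resources(cpus, mem_mb, gpus=0):
    """Return Mesos-style scalar resources as plain dicts.

    Hypothetical helper for illustration: "gpus" is only added when the
    job actually requests a GPU, so CPU-only jobs are unchanged.
    """
    def scalar(name, value):
        return {"name": name, "type": "SCALAR", "scalar": {"value": value}}

    resources = [scalar("cpus", cpus), scalar("mem", mem_mb)]
    if gpus > 0:
        resources.append(scalar("gpus", gpus))
    return resources


if __name__ == "__main__":
    for resource in build_resources(cpus=1, mem_mb=512, gpus=1):
        print(resource["name"], resource["scalar"]["value"])
```

Mounting the device and driver libraries is left to the agent, as the comment above suggests; the scheduler only states how many GPUs it wants.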
-
repo owner I agree. I expect not to have to manage anything, but I need to check the current status of the feature.
-
Dear Colleagues,
Please note that Mesos currently doesn't support GPU isolation for the Docker containerizer, only for the Mesos containerizer.
-
repo owner I had seen this, thanks. Godocker also supports the Mesos containerizer. I have started implementing Mesos GPU support in the framework; now I need to get a server with GPUs to test it. It should be OK in the next few weeks. Olivier
-
reporter Have you noticed that Mesos introduced GPU auto-discovery? It is triggered when the slave specifies the gpu/nvidia isolation; the slave's resources then include the available GPUs. https://reviews.apache.org/r/48366/diff/2#index_header
But another problem comes with GPU resources.
According to this commit (https://reviews.apache.org/r/48914/), a framework that doesn't have the GPU_RESOURCES capability won't receive offers from slaves equipped with GPUs. So we have to make some changes in our Mesos scheduler plugin. Unfortunately I haven't found a Python interface or any documentation for adding a capability. A C++ framework would add the following line:
framework.add_capabilities()->set_type(FrameworkInfo::Capability::GPU_RESOURCES);
-
repo owner Already added in code ;-)
-
repo owner You can do this in Python with protobuf: call framework.capabilities.add(), then set its type to the GPU value (3); see mesos.proto.
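To make the capability step concrete, here is a minimal sketch that builds the framework description as a plain dict, in the JSON form where a capability is written as `{"type": "GPU_RESOURCES"}`. This is an assumption made so the example runs without the Mesos bindings; the godocker plugin itself works with protobuf objects, where the equivalent is the `framework.capabilities.add()` call described above. The helper name `framework_info` is hypothetical.

```python
import json

# Numeric value of the GPU capability in mesos.proto, as stated in the
# comment above. Kept here only for reference when using protobuf objects.
GPU_RESOURCES = 3


def framework_info(user, name, needs_gpu):
    """Build a framework description as a plain dict (illustrative only)."""
    info = {"user": user, "name": name, "capabilities": []}
    if needs_gpu:
        # Without this capability the master will not offer GPU agents.
        info["capabilities"].append({"type": "GPU_RESOURCES"})
    return info


if __name__ == "__main__":
    print(json.dumps(framework_info("root", "godocker", needs_gpu=True), indent=2))
```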
-
reporter So maybe we could treat GPU as an ordinary resource, like cpu and mem, at the scheduling stage, rather than leaving it in advanced_setting/constraints.
-
repo owner I could, but it needs additional modifications, including to the web interface, and it would have to be managed as an optional resource because it is not supported by all schedulers (swarm, kube). I plan first to test GPU management with the current behavior. I will see after that how to make it a first-class resource.
-
reporter Perhaps I could help with Mesos GPU support. I checked out your feature_47 branch and made some changes (see https://bitbucket.org/yoursky/go-docker/commits/5668cafa5c80263635850f6c9fd7adb76025a13d):
- gpus defaults to 1 to test GPU allocation (line 280)
- some lines seem incompatible with the Mesos protobuf (lines 516-519)
- deleted the cpu_10.5 related things in the slave and don't mount manually. (Your current implementation didn't mount the user-level libraries and utilities needed by Nvidia and CUDA. Mesos gathers them together into /var/run/mesos/isolators/gpu/nvidia_<version>, so maybe you can make full use of it. FYI: Mesos says in http://events.linuxfoundation.org/sites/events/files/slides/mesoscon_asia_gpu_v1.pdf that Docker GPU support is coming in version 1.2 or 1.3.)
With these changes, the Mesos native GPU isolator works and nvidia-smi outputs correct results. But on the slave side an error appears:
W0315 11:16:39.177366 5984 containerizer.cpp:1809] Skipping resource statistic for container fff617a7-3a8a-41b3-9add-d69247131fe4 because: Unknown container
W0315 11:16:39.177419 5982 containerizer.cpp:1871] Skipping status for container fff617a7-3a8a-41b3-9add-d69247131fe4 because: Unknown container
W0315 11:16:39.177543 5982 containerizer.cpp:1871] Skipping status for container fff617a7-3a8a-41b3-9add-d69247131fe4 because: Container does not exist
W0315 11:16:39.177489 5984 containerizer.cpp:1809] Skipping resource statistic for container fff617a7-3a8a-41b3-9add-d69247131fe4 because: Unknown container
I'm going to check this in the containerizer.cpp source code. I'm not familiar with the web UI and CLI management tools, so maybe you could first make some changes to them to support requesting, managing, and setting quotas for GPU resources.
-
repo owner I have already updated the code in the feature branch. I validated and updated my code yesterday on a server with GPUs running Mesos 1.1.0. I am now updating the web UI code so that gpus can be specified like cpus and mem.
Thanks anyway.
-
Hello there,
I'm running a Mesos cluster that has GPU resources at its disposal, but I cannot use those GPU resources (which are controlled by Mesos) via go-docker.
Mesos version: 1.1.0
go-docker and go-docker-web projects (I'm using the develop branch)
Command to start Mesos Master:
[root@mesos-server1 ~]# mesos-master --ip=192.168.1.118 --work_dir=/tmp/mesos
Start Mesos Agent:
[root@mesos-server1 ~]# mesos-agent --master=192.168.1.118:5050 --work_dir=/tmp/mesos56 --image_providers=docker --containerizers=mesos --isolation="docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia" --attributes="hostname:192.168.1.118"
I ran the following command and it works fine:
[root@mesos-server1 ~]# mesos-execute --master=192.168.1.118:5050 --name=gpu-test --command="nvidia-smi" --framework_capabilities="GPU_RESOURCES" --resources="gpus:1"
But I get an error when running the same job through go-docker.
The error message is:
No devices were found
go-d.ini:
go-docker-web:
Thanks a lot
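When debugging a report like this one, a first sanity check is whether the agent was started with the isolators that Mesos GPU support relies on. The sketch below assumes that `cgroups/devices` and `gpu/nvidia` are the two required entries in `--isolation`, which matches the agent command line posted above; the helper name `missing_gpu_isolators` is hypothetical.

```python
# Isolators assumed to be required for Nvidia GPU support on an agent.
REQUIRED_GPU_ISOLATORS = {"cgroups/devices", "gpu/nvidia"}


def missing_gpu_isolators(isolation_flag):
    """Return the required GPU isolators absent from an --isolation value."""
    enabled = {item.strip() for item in isolation_flag.split(",") if item.strip()}
    return sorted(REQUIRED_GPU_ISOLATORS - enabled)


if __name__ == "__main__":
    flag = "docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia"
    print(missing_gpu_isolators(flag))
```

For the agent flags in this report the check finds nothing missing, which points away from the agent setup and toward the job submission path.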
-
repo owner I could test successfully on a server with a GPU device, but I will have a look. Did you check god.err as well as the Mesos job logs on the agent via the Mesos UI? The Mesos slave job logs should show whether a GPU was requested. Finally, in the job details in the godocker web UI (after submission), do you see the GPU requirement? Olivier
-
Hi Olivier
1. god.err log file is empty
2. Mesos agent log
I0406 01:41:42.965502 18873 slave.cpp:5044] Current disk usage 9.49%. Max allowed age: 5.635917774648692days
I0406 01:42:42.966145 18879 slave.cpp:5044] Current disk usage 9.49%. Max allowed age: 5.635917774648692days
I0406 01:43:28.266775 18860 slave.cpp:1539] Got assigned task '22' for framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000
I0406 01:43:28.267211 18860 gc.cpp:83] Unscheduling '/tmp/mesos64/slaves/08e8f2b1-1bc0-427d-9449-1e3ce7cb19b1-S0/frameworks/4e23a62f-afc7-42f1-9a80-030194c9927f-0000' from gc
I0406 01:43:28.267428 18863 slave.cpp:1701] Launching task '22' for framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000
I0406 01:43:28.268244 18863 paths.cpp:536] Trying to chown '/tmp/mesos64/slaves/08e8f2b1-1bc0-427d-9449-1e3ce7cb19b1-S0/frameworks/4e23a62f-afc7-42f1-9a80-030194c9927f-0000/executors/22/runs/6a59e53b-fd20-49ad-8776-a9ee718d46bf' to user 'root'
I0406 01:43:28.275975 18863 slave.cpp:6179] Launching executor '22' of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000 with resources cpus(*):0.1; mem(*):32 in work directory '/tmp/mesos64/slaves/08e8f2b1-1bc0-427d-9449-1e3ce7cb19b1-S0/frameworks/4e23a62f-afc7-42f1-9a80-030194c9927f-0000/executors/22/runs/6a59e53b-fd20-49ad-8776-a9ee718d46bf'
I0406 01:43:28.276391 18879 containerizer.cpp:938] Starting container 6a59e53b-fd20-49ad-8776-a9ee718d46bf for executor '22' of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000
I0406 01:43:28.276420 18863 slave.cpp:1987] Queued task '22' for executor '22' of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000
I0406 01:43:28.279960 18860 provisioner.cpp:294] Provisioning image rootfs '/tmp/mesos64/provisioner/containers/6a59e53b-fd20-49ad-8776-a9ee718d46bf/backends/copy/rootfses/afa8f034-d5c5-44c3-a44f-a88165e2dfc3' for container 6a59e53b-fd20-49ad-8776-a9ee718d46bf
W0406 01:43:31.144554 18873 containerizer.cpp:1871] Skipping status for container 6a59e53b-fd20-49ad-8776-a9ee718d46bf because: Unknown container
W0406 01:43:31.144624 18873 containerizer.cpp:1871] Skipping status for container 6a59e53b-fd20-49ad-8776-a9ee718d46bf because: Container does not exist
W0406 01:43:31.144994 18872 containerizer.cpp:1809] Skipping resource statistic for container 6a59e53b-fd20-49ad-8776-a9ee718d46bf because: Unknown container
W0406 01:43:31.145081 18872 containerizer.cpp:1809] Skipping resource statistic for container 6a59e53b-fd20-49ad-8776-a9ee718d46bf because: Unknown container
I0406 01:43:32.127301 18855 linux_launcher.cpp:421] Launching container 6a59e53b-fd20-49ad-8776-a9ee718d46bf and cloning with namespaces CLONE_NEWNS
I0406 01:43:32.129673 18855 systemd.cpp:96] Assigned child process '19673' to 'mesos_executors.slice'
I0406 01:43:32.341197 18879 slave.cpp:3231] Got registration for executor '22' of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000 from executor(1)@192.168.1.118:34369
I0406 01:43:32.342381 18876 slave.cpp:2191] Sending queued task '22' to executor '22' of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000 at executor(1)@192.168.1.118:34369
I0406 01:43:32.349393 18856 slave.cpp:3634] Handling status update TASK_RUNNING (UUID: 27518af7-454b-471b-96dd-6ad7234a944b) for task 22 of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000 from executor(1)@192.168.1.118:34369
I0406 01:43:32.350574 18877 status_update_manager.cpp:323] Received status update TASK_RUNNING (UUID: 27518af7-454b-471b-96dd-6ad7234a944b) for task 22 of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000
I0406 01:43:32.350903 18873 slave.cpp:4051] Forwarding the update TASK_RUNNING (UUID: 27518af7-454b-471b-96dd-6ad7234a944b) for task 22 of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000 to master@192.168.1.118:5050
I0406 01:43:32.351126 18873 slave.cpp:3961] Sending acknowledgement for status update TASK_RUNNING (UUID: 27518af7-454b-471b-96dd-6ad7234a944b) for task 22 of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000 to executor(1)@192.168.1.118:34369
3. go-docker web UI
Thank you
-
repo owner OK, this seems to be an issue in the web interface, where the GPU field appears but is not taken into account. The job details should show gpu=1. In the slave log we see cpus(*):0.1; mem(*):32 in work directory '/tmp/; gpus would appear there if it had been requested. So the issue is in either views.py or the HTML/JS part. Could you check in your browser's developer tools, in the network panel when submitting the task, the JSON content of the job submission request: what is the gpu value in requirements? You could also try starting the web server with the env variable PYRAMID_ENV=dev, which loads the uncompressed/unminified files. I will check on my side how the web server behaves in non-dev mode. Olivier
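The log check described above can be automated. The following is a minimal sketch that extracts the `name(role):value` pairs from a resources string in the format seen in the agent log posted earlier in this thread, and reports whether `gpus` is among them. The regular expression is derived only from that excerpt, so treat the format as an assumption; the function name `parse_resources` is hypothetical.

```python
import re

# Matches entries such as "cpus(*):0.1" or "mem(*):32".
RESOURCE_RE = re.compile(r"(\w+)\(([^)]*)\):([0-9.]+)")


def parse_resources(text):
    """Map resource names to numeric values found in a log fragment."""
    return {name: float(value) for name, _role, value in RESOURCE_RE.findall(text)}


if __name__ == "__main__":
    line = "with resources cpus(*):0.1; mem(*):32 in work directory"
    resources = parse_resources(line)
    print(resources)
    print("GPU requested:", "gpus" in resources)
```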
-
Hi Olivier
1. go-docker web UI
2. browser developer tools
-
repo owner I could reproduce it: the gpu field is not taken into account, certainly because a recent change in the code broke it. I am going to investigate; it should be fixed quickly.
-
repo owner The issue was introduced when I added a last-minute "if" condition to show or hide the gpu field according to the configuration. I just fixed it in develop. Be sure to refresh your cache after updating.
-
I've tested it on both Chrome and Firefox, and it works perfectly! Thanks for the quick fix :)
-
repo owner Thanks for testing ;-)
-
repo owner Integrated in develop branch
integrated via feature #GOD-63 and commit d4ca6034c77cd482611223b19e37b321d83405ef
-
repo owner - changed status to resolved
-
This is in the plan, but it was only introduced in recent releases of Mesos. It was not available when Mesos was set up in godocker. I also need to investigate how the GPU is mounted in the container.