GPU allocation

Issue #47 resolved
Stone sky created an issue

Mesos provides a convenient way to submit GPU tasks, e.g.

    mesos-execute \
      --master=127.0.0.1:5050 \
      --name=gpu-test \
      --docker_image=nvidia/cuda \
      --command="nvidia-smi" \
      --framework_capabilities="GPU_RESOURCES" \
      --resources="gpus:1"

At present the GPU is treated as a standard resource like cpu and mem. I suppose you could use such utilities to simplify GPU task submission.

The advantage is that there is no need to specify the GPU count or manually write the resources options on the slave node.
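
For illustration, here is a minimal sketch of what requesting a GPU as an ordinary scalar resource could look like on the framework side, assuming the protobuf-based Python bindings (mesos.interface / mesos_pb2); the helper name and task variable are hypothetical, not go-docker's actual code:

    from mesos.interface import mesos_pb2

    def add_gpu_resource(task, count=1):
        # task is a mesos_pb2.TaskInfo; "gpus" is the standard Mesos
        # resource name for GPUs, expressed as a scalar like "cpus" and "mem".
        gpus = task.resources.add()
        gpus.name = "gpus"
        gpus.type = mesos_pb2.Value.SCALAR
        gpus.scalar.value = count
        return task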

Comments (24)

  1. Olivier Sallou repo owner

    This is in the plan, but the feature was only introduced in recent releases of Mesos; it was not available when go-docker was initially set up. I also need to investigate how the GPU is mounted in the container.

  2. Stone sky reporter

    Well, I'm currently not familiar with Mesos. But I think a scheduler should not need to know how to mount a GPU into a task container; that is handled by the Mesos slave. So what we should do here is take GPUs into consideration when scheduling a task, and then give the Mesos master the correct job attributes depending on whether a GPU is required.

  3. Olivier Sallou repo owner

    I agree, I expect not to have to manage anything, but I'll need to check the current status of the feature.

  4. Konstantin Bokhan

    Dear Colleagues,

    Please note that Mesos currently doesn't support GPU isolation for the Docker containerizer, only for the Mesos containerizer.

  5. Olivier Sallou repo owner

    I had seen this, thanks. Go-docker also supports the Mesos containerizer. I have started implementing Mesos GPU support in the framework; now I need to get a server with GPUs to test. Should be OK in the coming weeks. Olivier

  6. Stone sky reporter

    Have you noticed that Mesos introduced GPU auto-discovery? It is triggered when the slave specifies the gpu/nvidia isolation; the slave's resources then include the available GPUs. https://reviews.apache.org/r/48366/diff/2#index_header

    But another problem comes with GPU resources.

    According to [this commit](https://reviews.apache.org/r/48914/), a framework that doesn't have the GPU_RESOURCES capability won't get offers from slaves equipped with GPUs. So we have to make some changes to our Mesos scheduler plugin. Unfortunately I haven't found a Python interface nor documentation for adding the capability. A C++ framework should add the following lines:

    framework.add_capabilities()->set_type(
          FrameworkInfo::Capability::GPU_RESOURCES);
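
    For what it's worth, the same thing should be expressible through the protobuf-generated Python bindings, since capabilities is just a repeated field on FrameworkInfo. A hedged sketch (assuming mesos.interface / mesos_pb2 is available; not verified against go-docker's plugin):

    from mesos.interface import mesos_pb2

    framework = mesos_pb2.FrameworkInfo()
    framework.user = ""   # let Mesos fill in the current user
    framework.name = "gpu-capable-framework"
    # Declare the GPU_RESOURCES capability so the master includes
    # GPU-equipped agents in the offers it sends to this framework.
    capability = framework.capabilities.add()
    capability.type = mesos_pb2.FrameworkInfo.Capability.GPU_RESOURCES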
    
  7. Stone sky reporter

    So maybe we could treat the GPU as an ordinary resource like cpu and mem at the scheduling stage, rather than leaving it in advanced_setting/constraints.

  8. Olivier Sallou repo owner

    I could, but it needs additional modifications, including the web interface, and it has to be managed as an optional resource because it is not supported by all schedulers (Swarm, Kubernetes). I plan first to test GPU management with the current behavior. Will see after that how to set it as a first-class resource.

  9. Stone sky reporter

    Perhaps I could help with Mesos GPU support. I checked out your feature_47 branch and made some changes (see https://bitbucket.org/yoursky/go-docker/commits/5668cafa5c80263635850f6c9fd7adb76025a13d):

    • gpus defaults to 1 to test GPU allocation (line 280)
    • some lines seem incompatible with the Mesos protobuf (lines 516-519)
    • delete the cpu_10.5 related things in the slave and don't mount manually. (Your current implementation didn't mount the user-level libraries and utilities needed by Nvidia and CUDA; Mesos gathers them into /var/run/mesos/isolators/gpu/nvidia_<version>, so maybe you can make full use of that. FYI: Mesos says in [http://events.linuxfoundation.org/sites/events/files/slides/mesoscon_asia_gpu_v1.pdf] that Docker GPU support is coming in version 1.2 or 1.3.)

    With these changes, the Mesos native GPU isolator works and nvidia-smi outputs correct results. But on the slave side these errors appear:
    W0315 11:16:39.177366  5984 containerizer.cpp:1809] Skipping resource statistic for container fff617a7-3a8a-41b3-9add-d69247131fe4 because: Unknown container 
    W0315 11:16:39.177419  5982 containerizer.cpp:1871] Skipping status for container fff617a7-3a8a-41b3-9add-d69247131fe4 because: Unknown container 
    W0315 11:16:39.177543  5982 containerizer.cpp:1871] Skipping status for container fff617a7-3a8a-41b3-9add-d69247131fe4 because: Container does not exist
    W0315 11:16:39.177489  5984 containerizer.cpp:1809] Skipping resource statistic for container fff617a7-3a8a-41b3-9add-d69247131fe4 because: Unknown container
    

    I'm going to look into containerizer.cpp's source code.

    I'm not familiar with the web UI and CLI management tools, so maybe you can first make some changes to them to provide GPU resource requests, management and quotas.

  10. Olivier Sallou repo owner

    I already have updated code in the feature branch. I validated/updated my code yesterday on a server with GPUs running Mesos 1.1.0. I am updating the web UI code to specify gpus like cpus and mem.

    Thanks anyway.

  11. Jack Yang

    Hello there,

    I'm running a Mesos cluster that has GPU resources at its disposal, but I cannot utilize the GPU resources (which are controlled by Mesos) via go-docker.

    Mesos version: 1.1.0

    go-docker and go-docker-web projects (I'm using the develop branch)

    Command to start Mesos Master:

    [root@mesos-server1 ~]# mesos-master --ip=192.168.1.118 --work_dir=/tmp/mesos
    

    Start Mesos Agent:

    [root@mesos-server1 ~]# mesos-agent --master=192.168.1.118:5050 --work_dir=/tmp/mesos56 --image_providers=docker --containerizers=mesos --isolation="docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia" --attributes="hostname:192.168.1.118"
    

    I've run the following command and it works fine.

    [root@mesos-server1 ~]# mesos-execute --master=192.168.1.118:5050 --name=gpu-test --command="nvidia-smi" --framework_capabilities="GPU_RESOURCES" --resources="gpus:1"
    

    But I got an error when running go-docker.

    The error message that I got:

    No devices were found
    

    go-d.ini: mesos-go-docker.png

    go-docker-web: go-web-ui.png

    go-log.png

    Thanks a lot

  12. Olivier Sallou repo owner

    I could test successfully on a server with a GPU device, but I will have a look. Did you check god.err as well as the Mesos job logs on the agent via the Mesos UI? The Mesos slave job logs should show whether or not the GPU was requested. Lastly, in the job details in the go-docker web UI (after submission), do you see the GPU requirement? Olivier

  13. Jack Yang

    Hi Olivier

    1. god.err log file is empty

    2. Mesos agent log

    I0406 01:41:42.965502 18873 slave.cpp:5044] Current disk usage 9.49%. Max allowed age: 5.635917774648692days
    I0406 01:42:42.966145 18879 slave.cpp:5044] Current disk usage 9.49%. Max allowed age: 5.635917774648692days
    I0406 01:43:28.266775 18860 slave.cpp:1539] Got assigned task '22' for framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000
    I0406 01:43:28.267211 18860 gc.cpp:83] Unscheduling '/tmp/mesos64/slaves/08e8f2b1-1bc0-427d-9449-1e3ce7cb19b1-S0/frameworks/4e23a62f-afc7-42f1-9a80-030194c9927f-0000' from gc
    I0406 01:43:28.267428 18863 slave.cpp:1701] Launching task '22' for framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000
    I0406 01:43:28.268244 18863 paths.cpp:536] Trying to chown '/tmp/mesos64/slaves/08e8f2b1-1bc0-427d-9449-1e3ce7cb19b1-S0/frameworks/4e23a62f-afc7-42f1-9a80-030194c9927f-0000/executors/22/runs/6a59e53b-fd20-49ad-8776-a9ee718d46bf' to user 'root'
    I0406 01:43:28.275975 18863 slave.cpp:6179] Launching executor '22' of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000 with resources cpus(*):0.1; mem(*):32 in work directory '/tmp/mesos64/slaves/08e8f2b1-1bc0-427d-9449-1e3ce7cb19b1-S0/frameworks/4e23a62f-afc7-42f1-9a80-030194c9927f-0000/executors/22/runs/6a59e53b-fd20-49ad-8776-a9ee718d46bf'
    I0406 01:43:28.276391 18879 containerizer.cpp:938] Starting container 6a59e53b-fd20-49ad-8776-a9ee718d46bf for executor '22' of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000
    I0406 01:43:28.276420 18863 slave.cpp:1987] Queued task '22' for executor '22' of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000
    I0406 01:43:28.279960 18860 provisioner.cpp:294] Provisioning image rootfs '/tmp/mesos64/provisioner/containers/6a59e53b-fd20-49ad-8776-a9ee718d46bf/backends/copy/rootfses/afa8f034-d5c5-44c3-a44f-a88165e2dfc3' for container 6a59e53b-fd20-49ad-8776-a9ee718d46bf
    W0406 01:43:31.144554 18873 containerizer.cpp:1871] Skipping status for container 6a59e53b-fd20-49ad-8776-a9ee718d46bf because: Unknown container
    W0406 01:43:31.144624 18873 containerizer.cpp:1871] Skipping status for container 6a59e53b-fd20-49ad-8776-a9ee718d46bf because: Container does not exist
    W0406 01:43:31.144994 18872 containerizer.cpp:1809] Skipping resource statistic for container 6a59e53b-fd20-49ad-8776-a9ee718d46bf because: Unknown container
    W0406 01:43:31.145081 18872 containerizer.cpp:1809] Skipping resource statistic for container 6a59e53b-fd20-49ad-8776-a9ee718d46bf because: Unknown container
    I0406 01:43:32.127301 18855 linux_launcher.cpp:421] Launching container 6a59e53b-fd20-49ad-8776-a9ee718d46bf and cloning with namespaces CLONE_NEWNS
    I0406 01:43:32.129673 18855 systemd.cpp:96] Assigned child process '19673' to 'mesos_executors.slice'
    I0406 01:43:32.341197 18879 slave.cpp:3231] Got registration for executor '22' of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000 from executor(1)@192.168.1.118:34369
    I0406 01:43:32.342381 18876 slave.cpp:2191] Sending queued task '22' to executor '22' of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000 at executor(1)@192.168.1.118:34369
    I0406 01:43:32.349393 18856 slave.cpp:3634] Handling status update TASK_RUNNING (UUID: 27518af7-454b-471b-96dd-6ad7234a944b) for task 22 of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000 from executor(1)@192.168.1.118:34369
    I0406 01:43:32.350574 18877 status_update_manager.cpp:323] Received status update TASK_RUNNING (UUID: 27518af7-454b-471b-96dd-6ad7234a944b) for task 22 of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000
    I0406 01:43:32.350903 18873 slave.cpp:4051] Forwarding the update TASK_RUNNING (UUID: 27518af7-454b-471b-96dd-6ad7234a944b) for task 22 of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000 to master@192.168.1.118:5050
    I0406 01:43:32.351126 18873 slave.cpp:3961] Sending acknowledgement for status update TASK_RUNNING (UUID: 27518af7-454b-471b-96dd-6ad7234a944b) for task 22 of framework 4e23a62f-afc7-42f1-9a80-030194c9927f-0000 to executor(1)@192.168.1.118:34369
    

    3. go-docker web UI: go-resource.png

    Thank you

  14. Olivier Sallou repo owner

    OK, it seems to be an issue on the web side where the GPU field appears but is not taken into account. The job detail should show gpu=1. On the slave we see cpus(*):0.1; mem(*):32 in work directory '/tmp/'; a gpus entry would appear if it had been requested. So the issue is either in views.py or in the HTML/JS part. Could you check in your browser's developer tools, in the network panel when submitting a task, the JSON content of the job submission request: what is the gpu value in the requirements? You could try to start the web server with the environment variable PYRAMID_ENV=dev; this will load the uncompressed/unminified files. I will check on my side the status of the web server in non-dev mode. Olivier

  15. Olivier Sallou repo owner

    I could reproduce it; the gpu field is not taken into account, certainly a last modification in the code that broke it. I am going to investigate; it should be fixed quickly.

  16. Olivier Sallou repo owner

    The issue was introduced when I added a last-minute "if" condition to display or hide the gpu field according to the configuration. I just fixed the issue in develop. Be sure to refresh your cache after updating.
