go-docker / Home

Presentation

http://fr.slideshare.net/OlivierSallou/godocker-presentation

Screencast

https://www.youtube.com/watch?v=juw_foi-Q0c

https://www.youtube.com/watch?v=3fu2aLocTbI

Blog, wiki, tutorials, development

The official documentation for GoDocker is available here:

https://godocker.atlassian.net/wiki/display/GOD/GODOCKER

API documentation

http://go-docker.readthedocs.io/en/latest/

Features

go-docker is a tool to submit batch jobs on a multi-node/multi-user architecture. It is comparable to tools such as GridEngine or Torque. It schedules and executes each job on an available node and manages its life cycle. Jobs are executed in Docker containers.

Get more info at FeaturesDetails

Tutorial

End user

Usertutorial

Administrator

Admintutorial

Ecosystem

Logging

Logs of the web server and of the scheduler/watchers can be sent to a central log system such as Graylog or Logstash. This is configured in go-d.ini, or in production.ini for the web server, following the Python logging configuration format. Example configurations are available in go-d.ini.sample. You only need to update the host/port information and add the handler to the loggers.

Monitoring

GO-Docker can export several statistics to InfluxDB, which can be linked to Grafana to get dashboards and charts. It also provides a Prometheus (http://prometheus.io/) endpoint (http://ip_add:6543/metrics). Exported statistics include the number of running jobs, the total number of jobs and the scheduler timer. Both solutions can connect to cAdvisor and go-docker to get detailed statistics on usage.
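As a quick check that the metrics endpoint answers, the sketch below filters its output down to the sample lines. It assumes the endpoint returns the usual Prometheus text format, where lines starting with `#` are HELP/TYPE comments. The helper name `metric_samples` is hypothetical and not part of go-docker.

```shell
# Keep only sample lines from Prometheus text-format output read on stdin.
# Lines starting with '#' (HELP/TYPE comments) and blank lines are dropped.
metric_samples() {
    grep -v -e '^#' -e '^[[:space:]]*$'
}
```

Possible usage, with ip_add replaced by the address of the web server:

curl -s http://ip_add:6543/metrics | metric_samples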

Security and network considerations

Go-Docker makes use of Docker, which has implications for network and security. Running applications in containers does not mean full isolation. Learn more at SecurityNetwork.

Task life cycle

GoDocker.png

Rescheduling follows this workflow: kill -> set back to pending.

Development tips

Swarm

Start swarm with a list of nodes:

bin/swarm manage -H 127.0.0.1:2376 nodes://127.0.0.1:2375

Docker

Restart Docker so that it listens on TCP.

On Debian, use DOCKER_OPTS (/etc/default/docker)

DOCKER_OPTS="--dns a.b.c.d -H tcp://0.0.0.0:2375"

On Fedora, use OPTIONS (/etc/sysconfig/docker)

OPTIONS=" -H tcp://0.0.0.0:2375"

List running/stopped containers:

docker  -H 127.0.0.1:2376  ps -a

Delete old stopped containers:

docker  -H 127.0.0.1:2376  ps -a | awk 'NR > 1 {print $1}' | xargs docker  -H 127.0.0.1:2376 rm
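The pipeline above passes every container ID to `docker rm`, including running containers, which `docker rm` refuses to delete without `-f`. A minimal sketch of a narrower filter follows. It assumes the STATUS column of a stopped container contains "Exited (", and the helper name `exited_ids` is hypothetical.

```shell
# Read `docker ps -a` output on stdin and print the IDs of exited containers.
# NR > 1 skips the header row; stopped containers show "Exited (<code>)"
# in the STATUS column.
exited_ids() {
    awk 'NR > 1 && /Exited \(/ {print $1}'
}
```

Possible usage:

docker -H 127.0.0.1:2376 ps -a | exited_ids | xargs docker -H 127.0.0.1:2376 rm

Docker's own filter (`docker ps -a -q --filter status=exited`) selects the same containers without parsing the table.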

Database

To clean the database, connect to the MongoDB database (with the 'mongo god' command) and execute:

db.users.drop()
db.jobs.drop()
db.jobsover.drop()

To reset the Redis database, connect to it (with the 'redis-cli' command) and execute:

flushdb

Tech tips

SSH

Issue observed on the Ubuntu 14.04 image:

To install an SSH server in a Docker image, the /var/run/sshd directory must be created before running apt-get. In the Dockerfile:

RUN mkdir /var/run/sshd
RUN apt-get install ssh -y

Mesos

Increase executor timeout for image pulls:

echo '5mins' > /etc/mesos-slave/executor_registration_timeout

Typical slave config for GoDocker:

[mesos-slave]# ls
attributes  containerizers  executor_registration_timeout

attributes => storage:disk;hostname:192.168.1.37
containerizers => docker,mesos
executor_registration_timeout => 5mins
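The listing above suggests one file per option, with the file name as the option and the content as its value. A minimal sketch that writes those three files is shown below. The function name and its arguments are hypothetical, and the node address must be adapted.

```shell
# Write the three mesos-slave settings listed above into a config directory.
# $1: target directory (for example /etc/mesos-slave)
# $2: node address stored in the "hostname" attribute
write_mesos_slave_conf() {
    conf_dir=$1
    node_ip=$2
    mkdir -p "$conf_dir" || return 1
    # One file per option: file name = option name, content = value.
    printf 'storage:disk;hostname:%s\n' "$node_ip" > "$conf_dir/attributes"
    echo 'docker,mesos' > "$conf_dir/containerizers"
    echo '5mins' > "$conf_dir/executor_registration_timeout"
}
```

Possible usage, on a scratch directory first:

write_mesos_slave_conf /tmp/mesos-slave 192.168.1.37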

Track Mesos logs with Graylog (http://www.fluentd.org/guides/recipes/graylog2)

Install fluentd and the GELF plugin, then add the following to the fluentd configuration:

<source>
  type tail
  path /var/log/mesos/mesos-master.ERROR
  pos_file /tmp/mesos-master.ERROR.pos
  tag graylog2.mesos
  format /^(?<code>[A-Z])\d+\s+(?<time>[0-9:]+).*\] (?<message>.*)/
</source>
<match graylog2.**>
  type copy
  <store>
    type gelf
    host localhost
    port 12201
    flush_interval 5s
  </store>
</match>

Tasks management

Sometimes Mesos fails to kill a job. The following steps help to kill the container:

  1. On the slave, execute:

    docker stop XXXXX (container id)

  2. Wait for the container to stop and check in the web interface (after a refresh) whether the job has been killed.

  3. If the container still appears in the Mesos interface although it is stopped, kill the mesos-executor process linked to the container:

#ps -ef|grep mesos-executor
root     23110 22419  0 17:35 ?        00:00:00 /usr/libexec/mesos mesos-executor --override /bin/sh -c exit `docker wait mesos-6a7f2dba-6368-42c6-b5a4-19012c9b0834`
#kill 23110
  4. If the container has been killed in Mesos and no longer appears as a Mesos job, but still appears in the web interface (the framework did not receive the kill confirmation), connect to redis and run:

    set god:mesos:over:XXXX 7

    where XXXX is your task id.
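The Redis command in the last step above can also be issued non-interactively. The sketch below only builds the key name from a task id. The helper name `mesos_over_key` is hypothetical, and the example assumes redis-cli connects to the Redis instance used by go-docker.

```shell
# Print the Redis key that marks a task as over, for the task id in $1.
mesos_over_key() {
    printf 'god:mesos:over:%s\n' "$1"
}
```

Possible usage, with 1234 standing in for your task id:

redis-cli set "$(mesos_over_key 1234)" 7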

To kill a mesos framework:

curl -d@/tmp/post.txt -X POST http://your_mesos:5050/master/shutdown
#/tmp/post.txt is a file with the following content:
#frameworkId=23423-23423-234234-234234
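The post file described above can be generated rather than edited by hand. This is a minimal sketch; the function name `write_shutdown_post` and its arguments are hypothetical.

```shell
# Write the form body expected by the shutdown request.
# $1: Mesos framework id, $2: path of the file to create
write_shutdown_post() {
    framework_id=$1
    post_file=$2
    printf 'frameworkId=%s' "$framework_id" > "$post_file"
}
```

Possible usage:

write_shutdown_post 23423-23423-234234-234234 /tmp/post.txt
curl -d@/tmp/post.txt -X POST http://your_mesos:5050/master/shutdown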

CAdvisor

cAdvisor can be run in a container (adapt the IP address and ports to your setup):

docker  -H 127.0.0.1:2375 run   --volume=/:/rootfs:ro   --volume=/var/run:/var/run:rw   --volume=/sys:/sys:ro   --volume=/var/lib/docker/:/var/lib/docker:ro   --publish=8080:8080   --detach=true   --name=cadvisor  google/cadvisor:latest -docker="tcp://local_ip:2375"

Optional:

-storage_duration=X (in minutes)
For 10 minutes:
-storage_duration=10m0s

Logstash

Logstash needs to listen on UDP, and its host must be set in go-d.ini.

bin/logstash -e 'input { udp { port => 59590 } } filter { json { source => "message" } } output { elasticsearch { } }'

Consul

When Consul is used as the status manager, its DNS features can load-balance requests to the web servers in HA and scalable mode. More info: https://bitbucket.org/osallou/go-docker-haproxy-consul

Prometheus

To query Prometheus about a container, you need the container name (available in the job details). You can then use a query like:

rate(container_cpu_usage_seconds_total{name="mesos-05f4011f-faa9-4a3c-bbaf-128585555ce1"} [5m])
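The same query can be sent to the Prometheus HTTP API from the command line. The sketch below builds the expression for a given container name. The helper name `cpu_rate_query` and the Prometheus address (your_prometheus:9090) are assumptions to adapt.

```shell
# Print a PromQL expression for the CPU usage rate of one container.
# $1: container name (from the job details), $2: rate window (default 5m)
cpu_rate_query() {
    container_name=$1
    window=${2:-5m}
    printf 'rate(container_cpu_usage_seconds_total{name="%s"}[%s])\n' \
        "$container_name" "$window"
}
```

Possible usage:

curl -sG http://your_prometheus:9090/api/v1/query --data-urlencode "query=$(cpu_rate_query mesos-05f4011f-faa9-4a3c-bbaf-128585555ce1 5m)"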
