Jobs exit 5-10 seconds after start.
Hi. After switching to the develop branch, jobs just exit 5-10 seconds after start. Nothing wrong in the logs. I looked at the scheduler logs and the mesos slave logs - no errors at all.
2018-02-23 11:52:25,840 DEBUG [godocker-scheduler][Thread-1] Task placed on host host1
2018-02-23 11:52:25,840 DEBUG [godocker-scheduler][Thread-1] Mesos:Task:Running:84
2018-02-23 11:52:29,579 DEBUG [godocker-scheduler][Thread-1] Task 84-0 is in state TASK_RUNNING
2018-02-23 11:52:29,844 DEBUG [godocker-scheduler][Thread-1] Task 84-0 is in state TASK_FINISHED
I tried stopping god-watcher, but no luck. I tried on 2 different test environments with the same result.
Could you advise how to debug this situation, or could you try to start a job with "sleep 600" or some other long-running job as a test?
-
repo owner -
reporter I mean that if I run a job like this:
echo "Hello GoDocker"
it finishes OK. But if I run a longer task, for example:
echo "Hello GoDocker"
sleep 500
echo "After 500 seconds"
it finishes after 5 seconds, and the log contains only "Hello GoDocker" without any error.
-
reporter For example, a real test job that worked fine on the master branch now just exits after 6 seconds.
Exit code 0; Submitted 23 Feb 2018 14:36:23; Running 23 Feb 2018 14:37:12; Over 23 Feb 2018 14:37:18
job:
#!/bin/bash
apt-get update
apt-get -y install sysbench
ts=60
for i in {1..5}
do
echo " = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = "
echo "iteration $i "
dd bs=1M count=512 if=/dev/zero of=test conv=fdatasync
sleep $ts
sysbench --test=cpu --cpu-max-prime=20000 run
sleep $ts
sysbench --test=memory --memory-total-size=1G run
sleep $ts
ioping . -c 10
echo " = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = "
done
god.log:
Get:1 http://archive.ubuntu.com/ubuntu xenial InRelease [247 kB]
Get:2 http://security.ubuntu.com/ubuntu xenial-security InRelease [102 kB]
-
repo owner I made a local test with a "longer" job, and it worked fine.
Nothing in the err file?
Mesos says the job ended correctly, so if the job ends there should be some information. Did you enable some quotas?
-
repo owner did you also check the system logs on the mesos slave where the task was executed? (some docker errors appear in system logs)
-
repo owner please also look at the job sandbox, in the mesos UI, for the stdout and stderr files
-
repo owner there should also be a god_sys.err file in the job directory - any content?
-
reporter Nothing in the err file? - empty. Did you enable some quotas? - no, all defaults; no over-quota messages.
I checked the system logs:
I0223 13:52:25.849190 14318 slave.cpp:1729] Got assigned task '84-0' for framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018
I0223 13:52:25.849563 14318 gc.cpp:92] Unscheduling '/var/lib/mesos/slaves/c0703a8b-214d-4ea3-afb8-b0980d582b9d-S0/frameworks/1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018' from gc
I0223 13:52:25.849709 14317 slave.cpp:1996] Authorizing task '84-0' for framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018
I0223 13:52:25.849980 14317 slave.cpp:2164] Launching task '84-0' for framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018
I0223 13:52:25.850342 14317 paths.cpp:578] Trying to chown '/var/lib/mesos/slaves/c0703a8b-214d-4ea3-afb8-b0980d582b9d-S0/frameworks/1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018/executors/84-0/runs/68a531a2-3f06-44b6-8751-2b31b9483651' to user 'root'
I0223 13:52:25.850502 14317 slave.cpp:7249] Launching executor '84-0' of framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018 with resources [{"allocation_info":{"role":"god_r"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"god_r"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}] in work directory '/var/lib/mesos/slaves/c0703a8b-214d-4ea3-afb8-b0980d582b9d-S0/frameworks/1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018/executors/84-0/runs/68a531a2-3f06-44b6-8751-2b31b9483651'
I0223 13:52:25.850777 14317 slave.cpp:2851] Launching container 68a531a2-3f06-44b6-8751-2b31b9483651 for executor '84-0' of framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018
I0223 13:52:25.850937 14317 slave.cpp:2393] Queued task '84-0' for executor '84-0' of framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018
I0223 13:52:25.851059 14322 docker.cpp:1119] Skipping non-docker container
I0223 13:52:25.851452 14318 containerizer.cpp:1083] Starting container 68a531a2-3f06-44b6-8751-2b31b9483651
I0223 13:52:25.852651 14317 provisioner.cpp:459] Provisioning image rootfs '/var/lib/mesos/provisioner/containers/68a531a2-3f06-44b6-8751-2b31b9483651/backends/aufs/rootfses/9718ba39-c7ed-45d2-be81-f872c68b387e' for container 68a531a2-3f06-44b6-8751-2b31b9483651 using aufs backend
I0223 13:52:25.854712 14319 containerizer.cpp:2712] Transitioning the state of container 68a531a2-3f06-44b6-8751-2b31b9483651 from PROVISIONING to PREPARING
I0223 13:52:25.867985 14316 memory.cpp:479] Started listening for OOM events for container 68a531a2-3f06-44b6-8751-2b31b9483651
I0223 13:52:25.869258 14316 memory.cpp:590] Started listening on 'low' memory pressure events for container 68a531a2-3f06-44b6-8751-2b31b9483651
I0223 13:52:25.870509 14316 memory.cpp:590] Started listening on 'medium' memory pressure events for container 68a531a2-3f06-44b6-8751-2b31b9483651
I0223 13:52:25.871773 14316 memory.cpp:590] Started listening on 'critical' memory pressure events for container 68a531a2-3f06-44b6-8751-2b31b9483651
I0223 13:52:25.873411 14320 memory.cpp:199] Updated 'memory.soft_limit_in_bytes' to 1032MB for container 68a531a2-3f06-44b6-8751-2b31b9483651
I0223 13:52:25.875893 14320 memory.cpp:228] Updated 'memory.limit_in_bytes' to 1032MB for container 68a531a2-3f06-44b6-8751-2b31b9483651
I0223 13:52:25.877164 14320 cpu.cpp:101] Updated 'cpu.shares' to 1126 (cpus 1.1) for container 68a531a2-3f06-44b6-8751-2b31b9483651
I0223 13:52:25.879895 14322 linux_launcher.cpp:429] Launching container 68a531a2-3f06-44b6-8751-2b31b9483651 and cloning with namespaces CLONE_NEWNS
I0223 13:52:25.899909 14321 containerizer.cpp:2712] Transitioning the state of container 68a531a2-3f06-44b6-8751-2b31b9483651 from PREPARING to ISOLATING
I0223 13:52:25.985354 14316 containerizer.cpp:2712] Transitioning the state of container 68a531a2-3f06-44b6-8751-2b31b9483651 from ISOLATING to FETCHING
I0223 13:52:25.985821 14315 containerizer.cpp:2712] Transitioning the state of container 68a531a2-3f06-44b6-8751-2b31b9483651 from FETCHING to RUNNING
I0223 13:52:26.110772 14317 slave.cpp:3928] Got registration for executor '84-0' of framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018 from executor(1)@10.0.0.80:39492
I0223 13:52:26.111383 14316 disk.cpp:208] Updating the disk resources for container 68a531a2-3f06-44b6-8751-2b31b9483651 to cpus(allocated: god_r):1.1; mem(allocated: god_r):1032
I0223 13:52:26.113497 14320 memory.cpp:199] Updated 'memory.soft_limit_in_bytes' to 1032MB for container 68a531a2-3f06-44b6-8751-2b31b9483651
I0223 13:52:26.116195 14320 cpu.cpp:101] Updated 'cpu.shares' to 1126 (cpus 1.1) for container 68a531a2-3f06-44b6-8751-2b31b9483651
I0223 13:52:26.116508 14315 slave.cpp:2598] Sending queued task '84-0' to executor '84-0' of framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018 at executor(1)@10.0.0.80:39492
I0223 13:52:26.121114 14321 slave.cpp:4392] Handling status update TASK_RUNNING (UUID: 77b9da5a-dc0f-42b1-bc82-1f0656baa5c3) for task 84-0 of framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018 from executor(1)@10.0.0.80:39492
I0223 13:52:26.121736 14321 status_update_manager.cpp:323] Received status update TASK_RUNNING (UUID: 77b9da5a-dc0f-42b1-bc82-1f0656baa5c3) for task 84-0 of framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018
I0223 13:52:26.121985 14319 slave.cpp:4873] Forwarding the update TASK_RUNNING (UUID: 77b9da5a-dc0f-42b1-bc82-1f0656baa5c3) for task 84-0 of framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018 to master@10.0.0.11:5050
I0223 13:52:26.122126 14319 slave.cpp:4783] Sending acknowledgement for status update TASK_RUNNING (UUID: 77b9da5a-dc0f-42b1-bc82-1f0656baa5c3) for task 84-0 of framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018 to executor(1)@10.0.0.80:39492
I0223 13:52:26.307044 14315 slave.cpp:4392] Handling status update TASK_FINISHED (UUID: 9302bd7e-c47f-4102-aebe-54d8955510bb) for task 84-0 of framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018 from executor(1)@10.0.0.80:39492
I0223 13:52:26.307793 14317 disk.cpp:208] Updating the disk resources for container 68a531a2-3f06-44b6-8751-2b31b9483651 to cpus(allocated: god_r):0.1; mem(allocated: god_r):32
I0223 13:52:26.309151 14318 memory.cpp:199] Updated 'memory.soft_limit_in_bytes' to 32MB for container 68a531a2-3f06-44b6-8751-2b31b9483651
I0223 13:52:26.316941 14318 cpu.cpp:101] Updated 'cpu.shares' to 102 (cpus 0.1) for container 68a531a2-3f06-44b6-8751-2b31b9483651
I0223 13:52:26.317248 14317 status_update_manager.cpp:323] Received status update TASK_FINISHED (UUID: 9302bd7e-c47f-4102-aebe-54d8955510bb) for task 84-0 of framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018
I0223 13:52:26.317366 14317 slave.cpp:4783] Sending acknowledgement for status update TASK_FINISHED (UUID: 9302bd7e-c47f-4102-aebe-54d8955510bb) for task 84-0 of framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018 to executor(1)@10.0.0.80:39492
I0223 13:52:27.311293 14320 slave.cpp:5005] Got exited event for executor(1)@10.0.0.80:39492
I0223 13:52:27.390763 14318 containerizer.cpp:2612] Container 68a531a2-3f06-44b6-8751-2b31b9483651 has exited
I0223 13:52:27.390820 14318 containerizer.cpp:2166] Destroying container 68a531a2-3f06-44b6-8751-2b31b9483651 in RUNNING state
I0223 13:52:27.390847 14318 containerizer.cpp:2712] Transitioning the state of container 68a531a2-3f06-44b6-8751-2b31b9483651 from RUNNING to DESTROYING
I0223 13:52:27.390967 14317 linux_launcher.cpp:505] Asked to destroy container 68a531a2-3f06-44b6-8751-2b31b9483651
I0223 13:52:27.392657 14317 linux_launcher.cpp:548] Using freezer to destroy cgroup mesos/68a531a2-3f06-44b6-8751-2b31b9483651
I0223 13:52:27.395380 14322 cgroups.cpp:3058] Freezing cgroup /sys/fs/cgroup/freezer/mesos/68a531a2-3f06-44b6-8751-2b31b9483651
I0223 13:52:27.400364 14320 cgroups.cpp:1413] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/68a531a2-3f06-44b6-8751-2b31b9483651 after 4.933888ms
I0223 13:52:27.403964 14320 cgroups.cpp:3076] Thawing cgroup /sys/fs/cgroup/freezer/mesos/68a531a2-3f06-44b6-8751-2b31b9483651
I0223 13:52:27.407909 14320 cgroups.cpp:1442] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/68a531a2-3f06-44b6-8751-2b31b9483651 after 3.89376ms
I0223 13:52:27.504433 14318 provisioner.cpp:576] Destroying container rootfs at '/var/lib/mesos/provisioner/containers/68a531a2-3f06-44b6-8751-2b31b9483651/backends/aufs/rootfses/9718ba39-c7ed-45d2-be81-f872c68b387e' for container 68a531a2-3f06-44b6-8751-2b31b9483651
I0223 13:52:27.530289 14318 slave.cpp:5405] Executor '84-0' of framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018 exited with status 0
I0223 13:52:29.578313 14318 status_update_manager.cpp:395] Received status update acknowledgement (UUID: 77b9da5a-dc0f-42b1-bc82-1f0656baa5c3) for task 84-0 of framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018
I0223 13:52:29.578622 14318 slave.cpp:4873] Forwarding the update TASK_FINISHED (UUID: 9302bd7e-c47f-4102-aebe-54d8955510bb) for task 84-0 of framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018 to master@10.0.0.11:5050
I0223 13:52:29.843521 14319 status_update_manager.cpp:395] Received status update acknowledgement (UUID: 9302bd7e-c47f-4102-aebe-54d8955510bb) for task 84-0 of framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018
I0223 13:52:29.843724 14319 slave.cpp:5509] Cleaning up executor '84-0' of framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018 at executor(1)@10.0.0.80:39492
I0223 13:52:29.843852 14320 gc.cpp:58] Scheduling '/var/lib/mesos/slaves/c0703a8b-214d-4ea3-afb8-b0980d582b9d-S0/frameworks/1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018/executors/84-0/runs/68a531a2-3f06-44b6-8751-2b31b9483651' for gc 6.99999023358815days in the future
I0223 13:52:29.843945 14319 slave.cpp:5605] Cleaning up framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018
I0223 13:52:29.843938 14320 gc.cpp:58] Scheduling '/var/lib/mesos/slaves/c0703a8b-214d-4ea3-afb8-b0980d582b9d-S0/frameworks/1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018/executors/84-0' for gc 6.99999023239704days in the future
I0223 13:52:29.844027 14320 status_update_manager.cpp:285] Closing status update streams for framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018
I0223 13:52:29.844084 14319 gc.cpp:58] Scheduling '/var/lib/mesos/slaves/c0703a8b-214d-4ea3-afb8-b0980d582b9d-S0/frameworks/1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018' for gc 6.99999023094815days in the future
I0223 13:53:05.250643 14321 slave.cpp:5913] Current disk usage 39.24%. Max allowed age: 3.552929658581065days
I0223 13:54:05.251130 14321 slave.cpp:5913] Current disk usage 39.24%. Max allowed age: 3.552926724855521days
I0223 13:55:05.251878 14315 slave.cpp:5913] Current disk usage 39.24%. Max allowed age: 3.552925412399363days
I0223 13:56:05.252715 14315 slave.cpp:5913] Current disk usage 39.24%. Max allowed age: 3.552923096300255days
I0223 13:57:05.253638 14315 slave.cpp:5913] Current disk usage 39.24%. Max allowed age: 3.552921552234178days
I0223 13:58:05.254462 14315 slave.cpp:5913] Current disk usage 39.24%. Max allowed age: 3.552920471387928days
I0223 13:59:05.255206 14315 slave.cpp:5913] Current disk usage 39.24%. Max allowed age: 3.552918927321852days
I0223 14:00:05.255996 14318 slave.cpp:5913] Current disk usage 39.24%. Max allowed age: 3.552917228849167days
I0223 14:01:05.257365 14317 slave.cpp:5913] Current disk usage 39.24%. Max allowed age: 3.552915607579792days
I0223 14:02:05.258126 14315 slave.cpp:5913] Current disk usage 39.24%. Max allowed age: 3.552914140717025days
I0223 14:02:25.851563 14320 slave.cpp:5821] Framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018 seems to have exited. Ignoring registration timeout for executor '84-0'
-
repo owner mesos tags the task as finished right after the running state:
I0223 13:52:26.122126 14319 slave.cpp:4783] Sending acknowledgement for status update TASK_RUNNING (UUID: 77b9da5a-dc0f-42b1-bc82-1f0656baa5c3) for task 84-0 of framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018 to executor(1)@10.0.0.80:39492
I0223 13:52:26.307044 14315 slave.cpp:4392] Handling status update TASK_FINISHED (UUID: 9302bd7e-c47f-4102-aebe-54d8955510bb) for task 84-0 of framework 1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018 from executor(1)@10.0.0.80:39492
and nothing seems to occur in between... strange.
Are you using the docker containerizer or the unified containerizer? (mesos/unified in go-d.ini)
-
repo owner god_sys.err or mesos stdout/stderr in sandbox should give more info
-
reporter Are you using docker containerizer or unified containerizer ? (mesos/unified in go-d.ini)
unified containerizer, with the mesos executor, because of GPU.
mesos sandbox:
stderr
I0223 13:52:26.108995 25208 exec.cpp:162] Version: 1.4.1
I0223 13:52:26.111753 25216 exec.cpp:237] Executor registered on agent c0703a8b-214d-4ea3-afb8-b0980d582b9d-S0
I0223 13:52:26.114111 25211 executor.cpp:171] Received SUBSCRIBED event
I0223 13:52:26.114428 25211 executor.cpp:175] Subscribed executor on host1
I0223 13:52:26.117203 25209 executor.cpp:171] Received LAUNCH event
I0223 13:52:26.117327 25209 executor.cpp:633] Starting task 84-0
I0223 13:52:26.118549 25209 executor.cpp:477] Running '/usr/libexec/mesos/mesos-containerizer launch <POSSIBLY-SENSITIVE-DATA>'
I0223 13:52:26.119429 25209 executor.cpp:646] Forked command at 25218
I0223 13:52:26.305481 25213 executor.cpp:925] Command exited with status 0 (pid: 25218)
I0223 13:52:27.308197 25217 process.cpp:1068] Failed to accept socket: future discarded
stdout
Executing pre-exec command '{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/usr\/libexec\/mesos\/mesos-containerizer"}'
Executing pre-exec command '{"arguments":["mount","-n","--rbind","\/var\/lib\/mesos\/slaves\/c0703a8b-214d-4ea3-afb8-b0980d582b9d-S0\/frameworks\/1d2d8f83-e7ff-40ff-bf38-d21248192ca6-0018\/executors\/84-0\/runs\/68a531a2-3f06-44b6-8751-2b31b9483651","\/var\/lib\/mesos\/provisioner\/containers\/68a531a2-3f06-44b6-8751-2b31b9483651\/backends\/aufs\/rootfses\/9718ba39-c7ed-45d2-be81-f872c68b387e\/mnt\/mesos\/sandbox"],"shell":false,"value":"mount"}'
Executing pre-exec command '{"arguments":["mount","-n","--rbind","\/data\/godshared\/tasks\/pairtree_root\/84\/task","\/var\/lib\/mesos\/provisioner\/containers\/68a531a2-3f06-44b6-8751-2b31b9483651\/backends\/aufs\/rootfses\/9718ba39-c7ed-45d2-be81-f872c68b387e\/mnt\/go-docker"],"shell":false,"value":"mount"}'
Changing root to /var/lib/mesos/provisioner/containers/68a531a2-3f06-44b6-8751-2b31b9483651/backends/aufs/rootfses/9718ba39-c7ed-45d2-be81-f872c68b387e
OK
-
repo owner I made a local test with the unified containerizer (not my default), and it still worked. So something happens within the mesos container / your script.
I0223 13:52:26.118549 25209 executor.cpp:477] Running '/usr/libexec/mesos/mesos-containerizer launch <POSSIBLY-SENSITIVE-DATA>'
I0223 13:52:26.119429 25209 executor.cpp:646] Forked command at 25218
I0223 13:52:26.305481 25213 executor.cpp:925] Command exited with status 0 (pid: 25218)
your script exits with exit code 0 immediately.
Since some logs of the godocker script appear, the job is executed (otherwise it would stop at init and god.log and god.err would be empty). But why does the script stop within a second??? Once the script is started on mesos, godocker is "out" of the process; godocker just waits for a message from mesos about completion (ok or ko).
And if something went wrong in the script, it should fail with an exit code > 0...
Do you see the god_sys.err file in the godocker job dir (next to god.log and god.err)?
-
reporter wrapper.sh was changed:
/mnt/go-docker/godocker.sh -> /mnt/go-docker/godocker.sh &> /mnt/go-docker/god_sys.err
maybe when godocker.sh starts in the background, the container exits?
could the problem be there?
-
repo owner it creates a new log file to catch startup errors in god_sys.err. I use it and have no issues with it. You should see this file in your job dir next to god.log and god.err.
-
reporter YES!!!! That's it!!! I just removed it in the scheduler, and now it works.
"&" forks the process into the background - maybe you forgot the "2" in "2>"?
-
repo owner godocker.sh does not start in the background. It is executed and forwards both stdout and stderr to god_sys.err.
-
reporter look at the last line of wrapper.sh
-
repo owner in fact it seems to depend on the shell executing the script... On Ubuntu/Debian,
#!/bin/sh
echo "hello" &> titi
forks and does not redirect.
On CentOS, stdout/stderr is redirected. I made my tests on CentOS... So I need to find a fix that works on both... I am going to remove it in the meanwhile.
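For reference, the difference can be reproduced directly (a minimal sketch; the file paths are just illustrative). In bash, `&> file` redirects both stdout and stderr; in a POSIX sh such as dash (the `/bin/sh` on Debian/Ubuntu), the same text parses as `&` (run the command in the background) followed by `> file` (truncate the file), so nothing is captured:

```shell
#!/bin/bash
# Under bash, "&>" sends BOTH stdout and stderr to the file.
bash -c 'echo "hello" &> /tmp/god_redirect_test'
cat /tmp/god_redirect_test    # prints "hello"

# Under dash (Debian/Ubuntu /bin/sh), the same line is parsed as:
#   echo "hello" &             (command is backgrounded)
#   > /tmp/god_redirect_test   (file is truncated, left empty)
# so "hello" goes to the terminal and the file stays empty.
```

This is why the wrapper worked on CentOS (where /bin/sh is bash) but backgrounded godocker.sh on Debian-based images, making the container exit immediately.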
-
repo owner I have removed the god_sys log redirection in dev and pushed a commit; a new container is building. I will try to find a way to get the expected redirection on all OSes...
-
repo owner the following seems to do the job on both Debian and CentOS:
>> /mnt/go-docker/god_sys.err 2>&1
I will update the code soon; in the meanwhile, the redirection has been removed.
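As a sketch of why this form is portable (using a hypothetical /tmp log path so it runs anywhere): `>> file` appends stdout to the file, and `2>&1` then duplicates stderr onto stdout's file descriptor - both operators are plain POSIX and behave identically in dash and bash:

```shell
#!/bin/sh
# Portable stdout+stderr capture, valid in dash, bash, and any POSIX sh.
log=/tmp/god_sys_demo.err
: > "$log"    # start with an empty log file

# Append stdout to the log, then point stderr at the same descriptor.
{ echo "normal output"; echo "error output" >&2; } >> "$log" 2>&1

cat "$log"    # both lines end up in the log
```

Unlike `&>`, nothing here is bash-specific, so the command is never accidentally backgrounded.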
-
repo owner Redirection has been fixed to handle both the Debian and CentOS shells.
-
repo owner - changed status to resolved
-
we see in the logs that Mesos sent the TASK_RUNNING and TASK_FINISHED almost immediately. The ~5s delay is the time for the watcher to detect the information and do the termination/cleanup. The script executed as expected and ended immediately (at least that is what mesos says...). Task finished, for mesos, means the script was executed and ended with exit code 0.
What do you mean by "start job with sleep 600"?
In the godocker cmd script, you execute a shell script with whatever you want, so it can be a sleep if you want.