oom parcc

Issue #234 new
Rob Egan created an issue

Refactor private memory usage in parcc

Reproducer (32 KNL nodes):

UPC_SHARED_HEAP_SIZE=900 test_hipmer.sh human-benchmark

########################################################################
# Starting stage parCC-1 -l linksMeta-1 -m 51 -c merDepth_Bubbletigs_diplotigs-51 -n 3253075 -o SsrfFile-1 -B /dev/shm -P 3,10 -C Bubbletigs_diplotigs-51 -D 1 -d 1 at 09/01/19 12:37:26
########################################################################

STAGE parCC_main -l linksMeta-1 -m 51 -c merDepth_Bubbletigs_diplotigs-51 -n 3253075 -o SsrfFile-1 -B /dev/shm -P 3,10 -C Bubbletigs_diplotigs-51 -D 1 -d 1 
Filtering contigs with length < 100 and ( depth > 500.0 or (ignoring) connections > 0)
Read lib 0 (aa-1): insert size 190, stddev 0, link file LINKS_OUTPUT_1
Read lib 1 (aa-1): insert size 393, stddev 38, link file LINKS_OUTPUT_1
Built assembly objects DB in 0.02121 seconds
No hmm file provided, so no search for rDNA will be performed
Starting connected components calculations
Done with memory management logistics in 0.14745 seconds
Done with communicating the graph in 0.13240 seconds
Done with packing the local edge lists in 0.00372 seconds
ParCC time is 35.48778 seconds (DONE in 701 rounds...) 6.18814 seconds in 71 reductions
Counted CC membership in 0.01579 seconds
Reductions for CC max sizes in 0.01822 seconds
Threads 0 - 21 will hold memory for the largest CC (102098)
[Th17 INFO 2019-09-01 12:38:02 parCC.upc:738]: largest_CC_ID=17
Build reverse vertex index in 0.09088 seconds
In total 130324 CCs, maximum with size 102098
slurmstepd: error: Detected 1 oom-kill event(s) in step 24282286.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
slurmstepd: error: Detected 1 oom-kill event(s) in step 24282286.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
slurmstepd: error: Detected 1 oom-kill event(s) in step 24282286.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: nid09211: task 1427: Out Of Memory
Aborting on 'ERROR' or 'srun: error' or 'UPC runtime error' ... srun: error: nid09211: task 1427: Out Of Memory

hipmer detected a failure in process pid=109069, reading output and terminating at 2019-09-01 12:38:04.604021
Sending SIGINT at 2019-09-01 12:38:05.605349
Sending SIGINT again at 2019-09-01 12:38:06.606659
Sending SIGTERM at 2019-09-01 12:38:08.608944
Sending SIGKILL at 2019-09-01 12:38:09.610194
Got some output from the failed process at 2019-09-01 12:38:12.990746
slurmstepd: error: Detected 1 oom-kill event(s) in step 24282286.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
slurmstepd: error: Detected 1 oom-kill event(s) in step 24282286.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
slurmstepd: error: Detected 1 oom-kill event(s) in step 24282286.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
slurmstepd: error: Detected 1 oom-kill event(s) in step 24282286.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
slurmstepd: error: Detected 1 oom-kill event(s) in step 24282286.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
slurmstepd: error: Detected 1 oom-kill event(s) in step 24282286.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
slurmstepd: error: Detected 1 oom-kill event(s) in step 24282286.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
slurmstepd: error: Detected 1 oom-kill event(s) in step 24282286.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: interrupt (one more within 1 sec to abort)
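
To make it easier to see which task and node hit the limit when triaging runs like this one, the log can be summarized with a small script. This is only a sketch: the function name `summarize_oom` is hypothetical and not part of hipmer, and the patterns assume the Slurm messages look exactly like the ones in the log above.

```python
import re
from collections import Counter

# Patterns assume the message wording seen in the log above.
OOM_STEP = re.compile(
    r"slurmstepd: error: Detected (\d+) oom-kill event\(s\) in step (\S+) cgroup"
)
OOM_TASK = re.compile(r"srun: error: (\S+): task (\d+): Out Of Memory")


def summarize_oom(lines):
    """Summarize OOM reports found in an iterable of log lines.

    Returns a pair:
      - Counter mapping job step ID -> number of slurmstepd oom-kill messages
      - set of (node, task) pairs that srun reported as out of memory
    """
    steps = Counter()
    tasks = set()
    for line in lines:
        step_match = OOM_STEP.search(line)
        if step_match:
            steps[step_match.group(2)] += 1
            continue
        task_match = OOM_TASK.search(line)
        if task_match:
            # The same srun error can be echoed more than once (for example
            # in the "Aborting on ..." line), so a set removes the repeats.
            tasks.add((task_match.group(1), int(task_match.group(2))))
    return steps, tasks
```

Passing the lines of a saved log file to `summarize_oom` returns the per-step message count and the affected (node, task) pairs. A repeated slurmstepd message does not necessarily mean a separate kill, so the count is only a rough indicator.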
