"reference timer resolution is not acceptable on cori" knl
A job on 128 nodes of Cori KNL failed with this error:
upcxx-run -n 8704 -N 128 -shared-heap 10% -- /global/homes/s/shofmeyr/code/merac/mhmxx/bin/mhmxx --use-bloom=true -i 380:30 --checkpoint=false -r mock150.bbqc.fastq
*** FATAL ERROR (nid09648:75946): Reference timer resolution of 10230 ns on nid09648 is not acceptable for calibration of the TSC.
Please reconfigure with --enable-force-gettimeofday or --enable-force-posix-realtime.
WARNING: Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init
srun: error: nid09648: task 3690: Aborted
srun: First task exited 60s ago
srun: step:29365819.0 tasks 0-3689,3691-8703: running
srun: step:29365819.0 task 3690: exited abnormally
srun: Terminating job step 29365819.0
Comments (14)
reporter - marked as minor
-
reporter I’ve only seen this that one time; never before or since. I’m not setting any slurm vars and it’s not a Haswell exec running on KNL. In fact, I suspect it’s the bad node issue, because this is mhmxx, and I don’t filter out bad nodes the way MetaHipMer does. I think we should just put this on hold until (if) we see it again.
-
Thanks for the updates, Steve.
I found that this same node was used in my recent runs while working to understand the OOM problem. It appears in my logs for "Thu Mar 26 08:48:52 2020" and there was no indication of any issue. So, this may have been a transient issue.
-Paul
-
reporter I saw it again:
Job id 29451519
starting at Mon Apr 6 16:14:42 PDT 2020
Executing:
upcxx-run -n 4352 -N 64 -shared-heap 10% -- /global/homes/s/shofmeyr/code/merac/mhmxx/bin/mhmxx --use-bloom=true -i 380:30 --checkpoint=false -r mock150.bbqc.fastq
*** FATAL ERROR (nid09648:269245): Reference timer resolution of 9980 ns on nid09648 is not acceptable for calibration of the TSC.
Please reconfigure with --enable-force-gettimeofday or --enable-force-posix-realtime.
WARNING: Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init
srun: error: nid09648: task 3078: Aborted
This was built for KNL using the Intel PrgEnv. The previous error also used the Intel PrgEnv.
I’m not oversubscribed. Here’s the ident output:
cori05[16:23]:~/.../mhmxx/slurm-out (aggr-store-rpc-ff *)$ ident ../bin/mhmxx
../bin/mhmxx:
$UPCXXCompilerID: |COMPILER_FAMILY:INTEL|COMPILER_VERSION:1900.20190206|COMPILER_FAMILYID:2|GNU:8.3.0|STD:STDC,__cplusplus=201402L|misc:Intel(R) C++ g++ 8.3 mode| $
$UPCXXLibraryVersion: 20200300L $
$UPCXXNetwork: ARIES $
$UPCXXThreadMode: SEQ $
$UPCXXCodeMode: opt $
$UPCXXGASNetVersion: 2020.3.0 $
$UPCXXCUDAEnabled: 0 $
$UPCXXAssertEnabled: 0 $
$UPCXXMPSCQueue: atomic $
$UPCXXCompilerStd: 201402L $
$UPCXXBuildTimestamp: Mar 28 2020 21:02:38 $
$GASNetCoreLibraryVersion: 2.2 $
$GASNetCoreLibraryName: ARIES $
$GASNetAMMaxMedium: 4032 $
$GASNetExtendedLibraryVersion: 2.2 $
$GASNetExtendedLibraryName: ARIES $
$GASNetToolsConfig: RELEASE=2020.3.0,SPEC=1.15,PTR=64bit,nodebug,SEQ,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native $
$GASNetConfigureArgs: '--enable-cross-compile' '--host=x86_64-unknown-linux-gnu' '--build=x86_64-unknown-linux-gnu' '--target=x86_64-cnl-linux-gnu' '--disable-auto-conduit-detect' '--enable-smp' '--enable-aries' '--enable-mpi=probe' '--enable-ofi=probe' '--with-ofi-provider=gni' '--disable-aligned-segments' '--enable-pshm' '--disable-pshm-posix' '--enable-pshm-xpmem' '--enable-throttle-poll' '--with-feature-list=os_cnl,prgenv_intel' '--enable-backtrace-execinfo' '--enable-backtrace-gdb' '--enable-bug3480-workaround' '--with-pmirun-cmd=/usr/common/ftg/upcxx/libexec/upcxx_srun -n %N -- %C' 'ac_cv_func_PMI_Allgather=yes' 'ac_cv_func_PMI_Bcast=yes' 'CRAY_PMI_POST_LINK_OPTS=-L/opt/cray/pe/pmi/default/lib64' 'CRAY_UGNI_POST_LINK_OPTS=-L/opt/cray/ugni/default/lib64' 'CRAY_UDREG_POST_LINK_OPTS=-L/opt/cray/udreg/default/lib64' 'CRAY_XPMEM_POST_LINK_OPTS=-L/opt/cray/xpmem/default/lib64' '--disable-parsync' '--enable-seq' '--enable-par' '--enable-pthreads' '--disable-segment-everything' '--enable-aries' '--disable-debug' $
$GASNetCompilerID: |COMPILER_FAMILY:INTEL|COMPILER_VERSION:1900.20190206|COMPILER_FAMILYID:2|GNU:8.3.0|STD:STDC,STDC_VERSION=201112L|misc:Intel(R) C++ gcc 8.3 mode| $
$GASNetToolsThreadModel: SEQ $
$GASNetBuildTimestamp: Mar 28 2020 20:57:40 $
$GASNetBuildId: Sat Mar 28 20:46:28 PDT 2020 hargrove $
$GASNetSystemTuple: x86_64-cnl-linux-gnu $
$GASNetSystemName: cori10 $
$GASNetGitHash: gex-2020.3.0 $
$GASNetStridedVersion: 2.0 $
$GASNetStridedLoopingDims: 8 $
$GASNetStridedDirectDims: 15 $
$GASNetVISNPAM: 1 $
$GASNetVISMinPackBuffer: 8192 $
$GASNetConfig: (libgasnet.a) RELEASE=2020.3.0,SPEC=0.8,CONDUIT=ARIES(ARIES-2.2/ARIES-2.2),THREADMODEL=SEQ,SEGMENT=FAST,PTR=64bit,CACHE_LINE_BYTES=64,noalign,pshm,nodebug,notrace,nostats,nodebugmalloc,nosrclines,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native,notiopt $
$GASNetEXAPIVersion: 0.8 $
$GASNetAPIVersion: 1 $
$GASNetThreadModel: GASNET_SEQ $
$GASNetSegment: GASNET_SEGMENT_FAST $
$GASNetConduitName: ARIES $
$GASNetDefaultMaxSegsizeStr: 0.85/H $
$GASNetPMISpawner: 1 $
This error appears before my code outputs any messages, and it initially outputs just simple informational messages.
-
reporter I suspect it’s an Intel issue. I’m retrying the Intel run to see if it’s reproducible.
-
The fact it showed up a second time days later on the SAME node (where thousands of runs on other nodes have no problem) seems like strong evidence of a faulty node.
The next step is probably to allocate specifically that node and try to prune down the code to something we can report to NERSC.
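As a starting point for such a pruned-down reproducer, the following is a minimal sketch of a standalone probe that could be run on the suspect node. It is not GASNet's calibration code: it only measures the smallest nonzero step observed between consecutive readings of the POSIX monotonic clock. The function names and the 1000 ns limit are hypothetical choices for illustration; the actual threshold GASNet applies is not stated in this thread.

```python
import socket
import time


def min_clock_step_ns(samples=100_000, clock=time.CLOCK_MONOTONIC):
    """Return the smallest nonzero difference (ns) seen between consecutive
    readings of the given POSIX clock, or None if the clock never advanced."""
    best = None
    prev = time.clock_gettime_ns(clock)
    for _ in range(samples):
        now = time.clock_gettime_ns(clock)
        delta = now - prev
        if delta > 0 and (best is None or delta < best):
            best = delta
        prev = now
    return best


def is_acceptable(step_ns, limit_ns=1000):
    """Hypothetical acceptance check. limit_ns is an assumed value,
    NOT the threshold used by GASNet's TSC calibration."""
    return step_ns is not None and step_ns <= limit_ns


if __name__ == "__main__":
    step = min_clock_step_ns()
    verdict = "ok" if is_acceptable(step) else "suspicious"
    print(f"{socket.gethostname()}: min observed clock step = {step} ns ({verdict})")
```

Running this on nid09648 and on a known-good node would give a side-by-side comparison. Interpreter overhead inflates the measured step somewhat, so only a large gap, such as the roughly 10000 ns reported in the errors above, should be treated as meaningful.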
-
I'll save you a read of SLURM docs.
To diagnose the issue, pass -w nid09648 to sbatch to request this node.
Similarly, pass -x nid09648 to your production runs to exclude this node.
-w, --nodelist=<node name list> Request a specific list of hosts. The job will contain all of these hosts and possibly additional hosts as needed to satisfy resource requirements. The list may be specified as a comma-separated list of hosts, a range of hosts (host[1-5,7,...] for example), or a filename. The host list will be assumed to be a filename if it contains a "/" character. If you specify a minimum node or processor count larger than can be satisfied by the supplied host list, additional resources will be allocated on other nodes as needed. Duplicate node names in the list will be ignored. The order of the node names in the list is not important; the node names will be sorted by Slurm.
[...]
-x, --exclude=<node name list> Explicitly exclude certain nodes from the resources granted to the job.
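If more nodes turn up with the same failure, the exclusion list could be built from the job logs rather than by hand. The sketch below assumes the FATAL ERROR line keeps the wording seen in this report; `suspect_nodes` and `exclude_arg` are hypothetical helper names, not part of UPC++, GASNet, or Slurm.

```python
import re

# Matches the GASNet message quoted in this report, e.g.
# "Reference timer resolution of 10230 ns on nid09648 is not acceptable ..."
PATTERN = re.compile(
    r"Reference timer resolution of (\d+) ns on (\S+) is not acceptable"
)


def suspect_nodes(log_text):
    """Map each node name to the list of timer resolutions (ns) reported for it."""
    found = {}
    for match in PATTERN.finditer(log_text):
        resolution_ns = int(match.group(1))
        node = match.group(2)
        found.setdefault(node, []).append(resolution_ns)
    return found


def exclude_arg(nodes):
    """Build an exclusion option such as '-x nid09648' from node names.
    Returns an empty string when there is nothing to exclude."""
    if not nodes:
        return ""
    return "-x " + ",".join(sorted(nodes))


if __name__ == "__main__":
    sample = (
        "*** FATAL ERROR (nid09648:75946): Reference timer resolution of 10230 ns "
        "on nid09648 is not acceptable for calibration of the TSC.\n"
        "*** FATAL ERROR (nid09648:269245): Reference timer resolution of 9980 ns "
        "on nid09648 is not acceptable for calibration of the TSC.\n"
    )
    print(exclude_arg(suspect_nodes(sample)))
```

For the two failures quoted in this issue, the helper yields `-x nid09648`, which matches the option suggested above.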
-
reporter Thanks, I already found that and have a run lined up with that node included.
-
reporter I tried another run with that node included, and it also failed early, but with a different message:
upcxx-run -n 4352 -N 64 -shared-heap 10% -- /global/homes/s/shofmeyr/code/merac/mhmxx/bin/mhmxx --use-bloom=true -i 380:30 --checkpoint=false -r mock150.bbqc.fastq
srun: ENCODED: Wed Dec 31 16:00:00 1969
srun: DECODED: Wed Dec 31 16:00:00 1969
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
Tue Apr 7 10:10:43 2020: [PE_3672]:inet_recv:inet_recv: unexpected EOF Success
Tue Apr 7 10:10:43 2020: [PE_1292]:_pmi_network_barrier:_pmi_inet_recv from target 1 failed pmi errno -1
Tue Apr 7 10:10:43 2020: [PE_1292]:_pmi_init:network_barrier failed
Tue Apr 7 10:10:43 2020: [PE_1428]:_pmi_network_barrier:_pmi_inet_recv from target 1 failed pmi errno -1
Tue Apr 7 10:10:43 2020: [PE_1428]:_pmi_init:network_barrier failed
Tue Apr 7 10:10:43 2020: [PE_408]:inet_recv:inet_recv: unexpected EOF Success
Tue Apr 7 10:10:43 2020: [PE_204]:_pmi_network_barrier:_pmi_inet_recv from target 1 failed pmi errno -1
Tue Apr 7 10:10:43 2020: [PE_204]:_pmi_init:network_barrier failed
Tue Apr 7 10:10:43 2020: [PE_612]:_pmi_network_barrier:_pmi_inet_recv from target 1 failed pmi errno -1
Tue Apr 7 10:10:43 2020: [PE_612]:_pmi_init:network_barrier failed
Tue Apr 7 10:10:43 2020: [PE_136]:inet_recv:inet_recv: unexpected EOF Success
Tue Apr 7 10:10:43 2020: [PE_1156]:_pmi_network_barrier:_pmi_inet_recv from target 1 failed pmi errno -1
Tue Apr 7 10:10:43 2020: [PE_1156]:_pmi_init:network_barrier failed
WARNING: Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init
-
Steve,
The new message sounds similar to the PMI timeout described in a NERSC FAQ (search for "Warning bootstrap barrier failed"). Other than that observation, I am not sure what to make of this new message (from several different PEs, clearly spanning multiple nodes).
-
- changed milestone to 2021.3.0 release
Mass roll-over of open issues to next release milestone
-
@Steven Hofmeyr Are you aware of any need to keep this UPC++ issue open? I believe we determined the problems were specific to a particular node on Cori. Have you had any similar failures since then to suggest that a UPC++ or GASNet-EX problem needs to be addressed?
-
- changed status to resolved
Closing due to lack of activity
Steve,
A 10.2us timer resolution sounds impossible in normal circumstances. You may have gotten a bad node or are over-subscribed.
Since srun won't let you oversubscribe without trying hard, I suspect this might be a bad node.
Can you confirm that you've not accidentally pinned more than one UPC++ rank per CPU thread? For instance by having SLURM_* env vars set?
Can you confirm this is a KNL executable, not Haswell by mistake?
Also, please tell me which of the installs I provide this came from, or indicate that you've built your own. Since my installs contain some dispatch logic, I also need to know which PrgEnv in addition to the upcxx module used to compile.
Providing the output of
ident /global/homes/s/shofmeyr/code/merac/mhmxx/bin/mhmxx
would also be helpful.
This reminds me of INC0133336 which Rob reported in Feb, 2019, but is NOT the same failure mode. The outcome of that was to remove a bad node from service, and NERSC has actually used our timer calibration code on at least one additional occasion I am aware of to diagnose bad hardware.
-Paul