"reference timer resolution is not acceptable on cori" knl

Issue #357 resolved
Steven Hofmeyr created an issue

A job on 128 nodes of Cori KNL failed with this error:

upcxx-run -n 8704 -N 128 -shared-heap 10% -- /global/homes/s/shofmeyr/code/merac/mhmxx/bin/mhmxx --use-bloom=true -i 380:30 --checkpoint=false -r mock150.bbqc.fastq
*** FATAL ERROR (nid09648:75946): Reference timer resolution of 10230 ns on nid09648 is not acceptable for calibration of the TSC.
Please reconfigure with --enable-force-gettimeofday or --enable-force-posix-realtime.
WARNING: Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init
srun: error: nid09648: task 3690: Aborted
srun: First task exited 60s ago
srun: step:29365819.0 tasks 0-3689,3691-8703: running
srun: step:29365819.0 task 3690: exited abnormally
srun: Terminating job step 29365819.0

Comments (14)

  1. Paul Hargrove

    Steve,

    A 10.2us timer resolution sounds impossible in normal circumstances. You may have gotten a bad node or are over-subscribed.
    Since srun won't let you oversubscribe without trying hard, I suspect this might be a bad node.

    Can you confirm that you've not accidentally pinned more than one UPC++ rank per CPU thread, for instance by having SLURM_* env vars set?

    Can you confirm this is a KNL executable not Haswell by mistake?

    Also, please tell me which of the installs I provide this came from, or indicate that you've built your own. Since my installs contain some dispatch logic, I also need to know which PrgEnv in addition to the upcxx module used to compile.

    Providing the output of ident /global/homes/s/shofmeyr/code/merac/mhmxx/bin/mhmxx would also be helpful.

    This reminds me of INC0133336, which Rob reported in February 2019, but it is NOT the same failure mode. The outcome of that incident was to remove a bad node from service, and NERSC has used our timer calibration code to diagnose bad hardware on at least one additional occasion that I am aware of.

    -Paul

  2. Steven Hofmeyr reporter

    I’ve only seen this that one time; never before or since. I’m not setting any slurm vars and it’s not a Haswell exec running on KNL. In fact, I suspect it’s the bad node issue, because this is mhmxx, and I don’t filter out bad nodes like happens in MetaHipMer. I think we should just put this on hold until (if) we see it again.

  3. Paul Hargrove

    Thanks for the updates, Steve.

    I found that this same node was used in one of my recent runs investigating the OOM problem. It appears in my logs for "Thu Mar 26 08:48:52 2020" and there was no indication of any issue. So, this may have been a transient issue.

    -Paul

  4. Steven Hofmeyr reporter

    I saw it again:

    Job id 29451519
    starting at Mon Apr 6 16:14:42 PDT 2020
    Executing:
    upcxx-run -n 4352 -N 64 -shared-heap 10% -- /global/homes/s/shofmeyr/code/merac/mhmxx/bin/mhmxx --use-bloom=true -i 380:30 --checkpoint=false -r mock150.bbqc.fastq
    *** FATAL ERROR (nid09648:269245): Reference timer resolution of 9980 ns on nid09648 is not acceptable for calibration of the TSC.
    Please reconfigure with --enable-force-gettimeofday or --enable-force-posix-realtime.
    WARNING: Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init
    srun: error: nid09648: task 3078: Aborted

    This was built for KNL using the Intel PrgEnv. The previous error also used the Intel PrgEnv.

    I’m not oversubscribed. Here’s the ident output:

    cori05[16:23]:~/.../mhmxx/slurm-out (aggr-store-rpc-ff *)$ ident ../bin/mhmxx
    ../bin/mhmxx:
    $UPCXXCompilerID: |COMPILER_FAMILY:INTEL|COMPILER_VERSION:1900.20190206|COMPILER_FAMILYID:2|GNU:8.3.0|STD:STDC,__cplusplus=201402L|misc:Intel(R) C++ g++ 8.3 mode| $
    $UPCXXLibraryVersion: 20200300L $
    $UPCXXNetwork: ARIES $
    $UPCXXThreadMode: SEQ $
    $UPCXXCodeMode: opt $
    $UPCXXGASNetVersion: 2020.3.0 $
    $UPCXXCUDAEnabled: 0 $
    $UPCXXAssertEnabled: 0 $
    $UPCXXMPSCQueue: atomic $
    $UPCXXCompilerStd: 201402L $
    $UPCXXBuildTimestamp: Mar 28 2020 21:02:38 $
    $GASNetCoreLibraryVersion: 2.2 $
    $GASNetCoreLibraryName: ARIES $
    $GASNetAMMaxMedium: 4032 $
    $GASNetExtendedLibraryVersion: 2.2 $
    $GASNetExtendedLibraryName: ARIES $
    $GASNetToolsConfig: RELEASE=2020.3.0,SPEC=1.15,PTR=64bit,nodebug,SEQ,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native $
    $GASNetConfigureArgs: '--enable-cross-compile' '--host=x86_64-unknown-linux-gnu' '--build=x86_64-unknown-linux-gnu' '--target=x86_64-cnl-linux-gnu' '--disable-auto-conduit-detect' '--enable-smp' '--enable-aries' '--enable-mpi=probe' '--enable-ofi=probe' '--with-ofi-provider=gni' '--disable-aligned-segments' '--enable-pshm' '--disable-pshm-posix' '--enable-pshm-xpmem' '--enable-throttle-poll' '--with-feature-list=os_cnl,prgenv_intel' '--enable-backtrace-execinfo' '--enable-backtrace-gdb' '--enable-bug3480-workaround' '--with-pmirun-cmd=/usr/common/ftg/upcxx/libexec/upcxx_srun -n %N -- %C' 'ac_cv_func_PMI_Allgather=yes' 'ac_cv_func_PMI_Bcast=yes' 'CRAY_PMI_POST_LINK_OPTS=-L/opt/cray/pe/pmi/default/lib64' 'CRAY_UGNI_POST_LINK_OPTS=-L/opt/cray/ugni/default/lib64' 'CRAY_UDREG_POST_LINK_OPTS=-L/opt/cray/udreg/default/lib64' 'CRAY_XPMEM_POST_LINK_OPTS=-L/opt/cray/xpmem/default/lib64' '--disable-parsync' '--enable-seq' '--enable-par' '--enable-pthreads' '--disable-segment-everything' '--enable-aries' '--disable-debug' $
    $GASNetCompilerID: |COMPILER_FAMILY:INTEL|COMPILER_VERSION:1900.20190206|COMPILER_FAMILYID:2|GNU:8.3.0|STD:STDC,STDC_VERSION=201112L|misc:Intel(R) C++ gcc 8.3 mode| $
    $GASNetToolsThreadModel: SEQ $
    $GASNetBuildTimestamp: Mar 28 2020 20:57:40 $
    $GASNetBuildId: Sat Mar 28 20:46:28 PDT 2020 hargrove $
    $GASNetSystemTuple: x86_64-cnl-linux-gnu $
    $GASNetSystemName: cori10 $
    $GASNetGitHash: gex-2020.3.0 $
    $GASNetStridedVersion: 2.0 $
    $GASNetStridedLoopingDims: 8 $
    $GASNetStridedDirectDims: 15 $
    $GASNetVISNPAM: 1 $
    $GASNetVISMinPackBuffer: 8192 $
    $GASNetConfig: (libgasnet.a) RELEASE=2020.3.0,SPEC=0.8,CONDUIT=ARIES(ARIES-2.2/ARIES-2.2),THREADMODEL=SEQ,SEGMENT=FAST,PTR=64bit,CACHE_LINE_BYTES=64,noalign,pshm,nodebug,notrace,nostats,nodebugmalloc,nosrclines,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native,notiopt $
    $GASNetEXAPIVersion: 0.8 $
    $GASNetAPIVersion: 1 $
    $GASNetThreadModel: GASNET_SEQ $
    $GASNetSegment: GASNET_SEGMENT_FAST $
    $GASNetConduitName: ARIES $
    $GASNetDefaultMaxSegsizeStr: 0.85/H $
    $GASNetPMISpawner: 1 $

    This error appears before my code outputs any messages, and the first things my code outputs are just simple informational messages.

  5. Steven Hofmeyr reporter

    I suspect it’s an Intel issue. I’m retrying the Intel run to see if it’s reproducible.

  6. Dan Bonachea

    The fact it showed up a second time days later on the SAME node (where thousands of runs on other nodes have no problem) seems like strong evidence of a faulty node.

    The next step is probably to allocate specifically that node and try to prune down the code to something we can report to NERSC.
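As a starting point for such a pruned-down test, here is a minimal sketch of a standalone clock probe that could be run on the suspect node and on a known-good node for comparison. It is not GASNet's calibration code. It assumes the reference timer behaves like the POSIX realtime clock (the thread does not say which clock GASNet actually uses), and the function name `min_clock_step_ns` is made up for illustration.

```python
import socket
import time


def min_clock_step_ns(samples=200_000, clock_id=time.CLOCK_REALTIME):
    """Return the smallest nonzero step (in ns) seen between consecutive reads.

    Assumption: a coarse or misbehaving clock shows up as a large minimum
    step, because back-to-back reads return identical values until the
    clock finally ticks.
    """
    best = None
    prev = time.clock_gettime_ns(clock_id)
    for _ in range(samples):
        now = time.clock_gettime_ns(clock_id)
        delta = now - prev
        # Ignore repeated readings (delta == 0) and backward steps (delta < 0).
        if delta > 0 and (best is None or delta < best):
            best = delta
        prev = now
    return best


if __name__ == "__main__":
    host = socket.gethostname()
    step = min_clock_step_ns()
    # clock_getres reports the resolution the OS advertises, in seconds.
    advertised = time.clock_getres(time.CLOCK_REALTIME)
    print(f"{host}: smallest observed step = {step} ns")
    print(f"{host}: advertised resolution = {advertised * 1e9:.0f} ns")
```

Interpreter overhead puts a floor on the step Python can observe, so this only helps to spot gross outliers such as the roughly 10 us values in the error messages above, not to measure a healthy clock precisely.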

  7. Paul Hargrove

    I'll save you a read of SLURM docs.
    To diagnose the issue, pass -w nid09648 to sbatch to request this node.
    Similarly, pass -x nid09648 to your production runs to exclude this node.

           -w, --nodelist=<node name list>
                  Request a specific list of hosts.  The job will contain  all  of
                  these  hosts  and possibly additional hosts as needed to satisfy
                  resource  requirements.   The  list  may  be  specified   as   a
                  comma-separated list of hosts, a range of hosts (host[1-5,7,...]
                  for example), or a filename.  The host list will be  assumed  to
                  be  a filename if it contains a "/" character.  If you specify a
                  minimum node or processor count larger than can be satisfied  by
                  the  supplied  host list, additional resources will be allocated
                  on other nodes as needed.  Duplicate node names in the list will
                  be  ignored.   The  order  of  the node names in the list is not
                  important; the node names will be sorted by Slurm.
    
    [...]
    
           -x, --exclude=<node name list>
                  Explicitly exclude certain nodes from the resources  granted  to
                  the job.
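To make the two options concrete, the following sketch builds both sbatch command lines for the node named in the error messages. The batch script name `run_mhmxx.sh` and the helper `sbatch_command` are hypothetical. Only the `-w` and `-x` options come from the man page excerpt above.

```python
import shlex

SUSPECT_NODE = "nid09648"    # node named in the error messages in this thread
JOB_SCRIPT = "run_mhmxx.sh"  # hypothetical batch script name


def sbatch_command(script, include=None, exclude=None):
    """Build an sbatch argument list that requests or excludes specific nodes."""
    argv = ["sbatch"]
    if include:
        # -w takes a comma-separated list of hosts to request.
        argv += ["-w", ",".join(include)]
    if exclude:
        # -x takes a list of hosts to keep out of the allocation.
        argv += ["-x", ",".join(exclude)]
    argv.append(script)
    return argv


# Diagnostic run that must include the suspect node.
print(shlex.join(sbatch_command(JOB_SCRIPT, include=[SUSPECT_NODE])))
# Production run that avoids it.
print(shlex.join(sbatch_command(JOB_SCRIPT, exclude=[SUSPECT_NODE])))
```

The two printed lines are `sbatch -w nid09648 run_mhmxx.sh` and `sbatch -x nid09648 run_mhmxx.sh`.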
    
  8. Steven Hofmeyr reporter

    I tried another run with that node included, and it also failed early, but with a different message:

    upcxx-run -n 4352 -N 64 -shared-heap 10% -- /global/homes/s/shofmeyr/code/merac/mhmxx/bin/mhmxx --use-bloom=true -i 380:30 --checkpoint=false -r mock150.bbqc.fastq
    srun: ENCODED: Wed Dec 31 16:00:00 1969
    srun: DECODED: Wed Dec 31 16:00:00 1969
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    Tue Apr 7 10:10:43 2020: [PE_3672]:inet_recv:inet_recv: unexpected EOF Success
    Tue Apr 7 10:10:43 2020: [PE_1292]:_pmi_network_barrier:_pmi_inet_recv from target 1 failed pmi errno -1
    Tue Apr 7 10:10:43 2020: [PE_1292]:_pmi_init:network_barrier failed
    Tue Apr 7 10:10:43 2020: [PE_1428]:_pmi_network_barrier:_pmi_inet_recv from target 1 failed pmi errno -1
    Tue Apr 7 10:10:43 2020: [PE_1428]:_pmi_init:network_barrier failed
    Tue Apr 7 10:10:43 2020: [PE_408]:inet_recv:inet_recv: unexpected EOF Success
    Tue Apr 7 10:10:43 2020: [PE_204]:_pmi_network_barrier:_pmi_inet_recv from target 1 failed pmi errno -1
    Tue Apr 7 10:10:43 2020: [PE_204]:_pmi_init:network_barrier failed
    Tue Apr 7 10:10:43 2020: [PE_612]:_pmi_network_barrier:_pmi_inet_recv from target 1 failed pmi errno -1
    Tue Apr 7 10:10:43 2020: [PE_612]:_pmi_init:network_barrier failed
    Tue Apr 7 10:10:43 2020: [PE_136]:inet_recv:inet_recv: unexpected EOF Success
    Tue Apr 7 10:10:43 2020: [PE_1156]:_pmi_network_barrier:_pmi_inet_recv from target 1 failed pmi errno -1
    Tue Apr 7 10:10:43 2020: [PE_1156]:_pmi_init:network_barrier failed
    WARNING: Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init

  9. Paul Hargrove

    Steve,

    The new message sounds similar to the PMI timeout described in a NERSC FAQ (search for "Warning bootstrap barrier failed"). Other than that observation, I am not sure what to make of this new message, which comes from several different PEs clearly spanning multiple nodes.

  10. Paul Hargrove

    @Steven Hofmeyr, are you aware of any need to keep this UPC++ issue open? I believe we determined the problems were specific to a particular node on Cori. Have you had any similar failures since then to suggest that a UPC++ or GASNet-EX problem needs to be addressed?
