thornyflat - production run segmentation fault

Issue #2598 resolved
Maria created an issue

Simfactory scripts for ThornyFlat give a segmentation fault in production runs. I am attaching the cfg, ini, run and sub scripts.

In the simulations directory, the SIMFACTORY directory is missing the subdirectories exe, cfg, run and par, and NODES is empty. The only file is properties.ini, which includes:

____

Loading torque version 6.1.3 : dev/torque/6.1.3

Loading openmpi version 4.1.2_gcc112 : parallel/openmpi/4.1.2_gcc112

Loading openblas version 0.3.19_gcc112 : libs/openblas/0.3.19_gcc112

+ set -e

+ cd /scratch/mbh0012/simulations/bnsG2/output-0000-active

+ echo Checking:

+ pwd

+ hostname

+ date

+ cat

+ echo Environment:

+ export GMON_OUT_PREFIX=gmon.out

+ GMON_OUT_PREFIX=gmon.out

+ export CACTUS_NUM_PROCS=2

+ CACTUS_NUM_PROCS=2

+ export CACTUS_NUM_THREADS=20

+ CACTUS_NUM_THREADS=20

+ export OMP_NUM_THREADS=20

+ OMP_NUM_THREADS=20

+ env

+ sort

+ echo Starting:

++ date +%s

+ export CACTUS_STARTTIME=1642991790

+ CACTUS_STARTTIME=1642991790

+ mpiexec -n 2 -npernode 2 /scratch/mbh0012/simulations/bnsG2/SIMFACTORY/exe/cactus_nst -L 3 /scratch/mbh0012/simulations/bnsG2/output-0000/nsnstohmns.par

[trcis001:09698] *** Process received signal ***

[trcis001:09698] Signal: Segmentation fault (11)

[trcis001:09698] Signal code:  (128)

[trcis001:09698] Failing at address: (nil)

[trcis001:09698] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2b4fc7e60630]

[trcis001:09698] [ 1] /shared/software/parallel/openmpi/4.1.2_gcc112/lib/libopen-rte.so.40(orte_get_attribute+0x21)[0x2b4fc6f4b101]

[trcis001:09698] [ 2] /shared/software/parallel/openmpi/4.1.2_gcc112/lib/libopen-rte.so.40(orte_plm_base_setup_job+0xf0)[0x2b4fc6f83530]

[trcis001:09698] [ 3] /lib64/libevent_core-2.0.so.5(event_base_loop+0x774)[0x2b4fc7a2f3a4]

[trcis001:09698] [ 4] mpiexec[0x40133a]

[trcis001:09698] [ 5] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b4fc808f555]

[trcis001:09698] [ 6] mpiexec[0x40114e]

[trcis001:09698] *** End of error message ***

/scratch/mbh0012/simulations/bnsG2/output-0000/SIMFACTORY/RunScript: line 24:  9698 Segmentation fault      (core dumped) mpiexec -n 2 -npernode 2 /scratch/mbh0012/simulations/bnsG2/SIMFACTORY/exe/cactus_nst -L 3 /scratch/mbh0012/simulations/bnsG2/output-0000/nsnstohmns.par

Comments (41)

  1. Maria reporter

    With the command:

    ./simfactory/bin/sim create-submit tov --configuration etk --machine=thornyflat --parfile=par/static_tov.par --cores=10

    I am getting the warnings:

    Warning: Current Working directory does not match Cactus sourcetree, changing to /users/mbh0012/Cactus

    Warning: Too many threads per process specified: specified num-threads=20 (ppn-used is 40)
    Warning: Total number of threads and number of threads per process are inconsistent: procs=10, num-threads=20 (procs*num-smt must be an integer multiple of num-threads)
    Warning: Total number of threads and number of cores per node are inconsistent: procs=10, ppn-used=40 (procs must be an integer multiple of ppn-used)
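    The two divisibility conditions quoted in the last two warnings can be re-checked by hand. The following is a minimal sketch, not simfactory's own code: check_layout is a hypothetical helper, and num-smt=1 is an assumption because the warnings do not state its value.

```shell
# Sketch only: re-checks the two conditions quoted in the warnings.
# check_layout is a hypothetical helper, not part of simfactory.
check_layout() {
    # $1 = procs, $2 = num-threads, $3 = ppn-used, $4 = num-smt (assumed 1)
    local procs="$1" num_threads="$2" ppn_used="$3" num_smt="${4:-1}"
    local status=0
    if [ $(( (procs * num_smt) % num_threads )) -ne 0 ]; then
        echo "procs*num-smt=$((procs * num_smt)) is not a multiple of num-threads=$num_threads"
        status=1
    fi
    if [ $(( procs % ppn_used )) -ne 0 ]; then
        echo "procs=$procs is not a multiple of ppn-used=$ppn_used"
        status=1
    fi
    return "$status"
}

# Values from the warnings: --cores=10, num-threads=20, ppn-used=40.
check_layout 10 20 40 || echo "layout is inconsistent"

# Under the same assumptions, a request of 40 cores passes both checks.
check_layout 40 20 40 && echo "layout is consistent"
```

    Under these assumptions, a core count that is a multiple of both 20 and 40 would avoid the last two warnings. The first warning is not modelled here because its exact condition is not quoted.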

  2. Maria reporter

    Again, the simulation gave me Segmentation Fault. In essence, the problem seems to be here:

    /scratch/mbh0012/simulations/tov/output-0000/SIMFACTORY/RunScript: line 24: 9372 Segmentation fault (core dumped) mpiexec -n 1 -npernode 2 /scratch/mbh0012/simulations/tov/SIMFACTORY/exe/cactus_etk -L 3 /scratch/mbh0012/simulations/tov/output-0000/static_tov.par

    This can be traced to thornyflat.run, which is attached above. Most likely, there is an error in this line:

    mpiexec -n @NUM_PROCS@ -npernode @(@PPN_USED@ / @NUM_THREADS@)@ @EXECUTABLE@ -L 3 @PARFILE@
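    To see what that line turns into, the arithmetic can be evaluated by hand. This is a rough sketch in plain shell, not simfactory's template engine: mpiexec_line is a hypothetical helper, NUM_PROCS=2 and NUM_THREADS=20 come from the log in the issue description, and PPN_USED=40 is taken from the warning in comment 1 on the assumption that it also applies to that run.

```shell
# Sketch only: mimics the substitution in the run-script line with shell
# arithmetic. @EXECUTABLE@ and @PARFILE@ are left out for brevity.
mpiexec_line() {
    # $1 = NUM_PROCS, $2 = NUM_THREADS, $3 = PPN_USED
    local num_procs="$1" num_threads="$2" ppn_used="$3"
    # Shell division truncates; simfactory's own evaluation may differ.
    local npernode=$(( ppn_used / num_threads ))
    echo "mpiexec -n $num_procs -npernode $npernode"
}

mpiexec_line 2 20 40
```

    With these values the sketch prints the same launcher arguments that appear in the failing log (-n 2 -npernode 2), which suggests the expansion itself behaves as written.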

  3. Maria reporter

    The helpdesk thinks this is an internal problem with the Einstein Toolkit, either a bug or bad software design, which results in proper allocations not being checked. I suspect the problem is in the submit or run files.

  4. Roland Haas

    This was discussed in today’s ET call. While difficult to diagnose remotely, the suggestions were that this may be due to mismatched MPI stacks at compile time and runtime, or due to an incorrect LD_LIBRARY_PATH. Without access to the system though, this is almost impossible to correctly diagnose.
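    One way to look for such a mismatch is to compare the install prefix of the MPI library the executable resolves to with the prefix of the mpiexec found on PATH. The sketch below only illustrates the idea: install_prefix and same_mpi_stack are hypothetical helpers, and it assumes the usual prefix/lib and prefix/bin layout.

```shell
# Sketch only: helper names are hypothetical, layout prefix/{lib,bin} assumed.
install_prefix() {
    # /prefix/lib/libmpi.so.40 -> /prefix ; /prefix/bin/mpiexec -> /prefix
    dirname "$(dirname "$1")"
}

same_mpi_stack() {
    # $1 = path of the MPI library, $2 = path of the mpiexec launcher
    [ "$(install_prefix "$1")" = "$(install_prefix "$2")" ]
}

# On the cluster the two paths could be gathered like this (not run here):
#   lib=$(ldd "$exe" | awk '/libmpi/ {print $3; exit}')
#   launcher=$(command -v mpiexec)
#   same_mpi_stack "$lib" "$launcher" || echo "MPI stacks differ"
```

    A differing prefix does not prove the stacks are incompatible, but it is a cheap first check before digging further.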

  5. Anuj Kankani

    I am actually able to get ETK running on thornyflat (and spruceknob) using simfactory by slightly modifying the configuration files attached above. My only issue is that when creating a new simulation, the run and submit scripts that get copied are the generic ones, despite me specifying --machine thornyflat. By manually going in and replacing the run and submit files in the simulation directory, everything works fine. In the thornyflat machine file I specify the thornyflat configuration files, so I’m not sure why it’s still choosing the generic ones. Is there some step I am missing?

    I want to upload the configuration files I used but only see an image upload button?

  6. Roland Haas

    Oh, I see. Ok, that is easier to offer suggestions for. There are two things that come to mind.

    1. I would try and make sure that simfactory recognizes thornyflat by checking that ./simfactory/bin/sim whoami returns thornyflat and that the mdb entry simfactory uses is the one I expect using ./simfactory/bin/sim print-mdb-entry $(./simfactory/bin/sim whoami | cut -d' ' -f3)
    2. There seems to be a typo in the file name for the submit script. Namely, it is called thornyflay.sub (a y where there should be a t). This could prevent simfactory from finding it.

    Are there any warnings when you start compiling a configuration? Sometimes simfactory picks a “default” run script when it cannot find the one specified on the command line or in the machine file 😞

    Just to be sure: you tried this when compiling a fresh configuration, ideally after a rm -rf configs to make sure there are no lingering old files (a --reconfig does not overwrite an existing RunScript file in configs/sim)?

    The submitscript attached to the ticket does already use --machine @MACHINE@ which makes sure that simfactory uses the correct machine description when executing the run script.

    Note: looking at your file thornyflat.ini in the ticket, the [thornyflat] line is missing. It should be at the top of the file, and it is what actually tells simfactory the name of the machine (the ini file section).
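    The missing section header can be checked mechanically. The following is a small sketch under one assumption: the first line of the machine file that is neither blank nor a comment should be the [machine] header. first_section and has_section are hypothetical helpers, not simfactory commands.

```shell
# Sketch only: helper names are hypothetical.
first_section() {
    # Print the first line that is neither blank nor a '#' or ';' comment.
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*[#;]' "$1" | head -n 1
}

has_section() {
    # $1 = machine ini file, $2 = expected machine name
    [ "$(first_section "$1")" = "[$2]" ]
}

# Example use from the simfactory directory (not run here):
#   has_section mdb/machines/thornyflat.ini thornyflat || echo "section header missing"
```

    A similar file-existence test on the submit and run script names would also catch a misspelling such as thornyflay.sub.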

  7. Anuj Kankani

    I wasn’t doing rm -rf configs before, and after doing that everything seems to be working (I had fixed the typos and [thornyflat] before). I’ve attached the config files I used below. I noticed there were some disabled thorns in the machine file (I’m assuming something with blas vs openblas?). I went ahead and kept them disabled since I assume there was a reason for disabling them. Also, I had to specify --machine thornyflat during compilation since it does not choose it by default.

    Thanks for the help!

  8. Roland Haas

    @Maria @Anuj Kankani Ok. Thank you. I will add them to simulation factory if you would like me to.

  9. Anuj Kankani

    I can if needed, but given that I am very new to ETK and unsure of my future usage, I am probably not the best choice.

  10. Maria reporter

    Eric provided the simfactory files necessary for thornyflat, and they used to run just fine. However, an update on ThornyFlat made it unusable. I tried to update the scripts, but I was not able to carry out a simulation with simfactory. I did not have the problem described by Anuj with reverting to a generic configuration. This should not happen, and is strange. My problem was that the runs hung forever in limbo and could not start. Not even HelloWorld ran on the end nodes without simfactory. I am not receiving answers from the maintainers and don’t even have access to open a ticket. If Anuj has more success with it, please let me know. Anuj, I am happy to see that you are able to run ETK there. What configuration are you using? How many nodes/cpus are you using? Let me know if I can be of assistance.

  11. Roland Haas

    Sure. I am also still looking for one of those who created the files to volunteer to “maintain” the files, i.e. check periodically (every 6 months, for the release) that they still work. In return that person (and any others named by them as having contributed to the files) gets to be a simfactory and thus ET author.

  12. Maria reporter

    Well. I thought I could do this. I used ThornyFlat before without problems, but for almost a year now, I’m not getting through. I will call them tomorrow to make sure that I am able to access the WVU cluster and the ticket web page as an outsider (I am not a WVU faculty member).

  13. Anuj Kankani

    @Maria I have run it on 2 nodes/80 cores on thornyflat and 128 cores on spruce knob (with the necessary changes to the configuration files for spruce knob). The files I attached are working for me without issue and the generic configuration problem was due to a mistake I was making.

    Note: they have actually added ET as a module to thorny flat as well, but I was not able to get it to work and the hpc staff told me they are working on fixing it.

    @Roland Haas, if Maria is unable to access the cluster, then I can volunteer as the maintainer for thorny flat and spruce knob

  14. Maria reporter

    Sure, no problem! Meantime, I still want to be able to run ETK on ThornyFlat. Anuj, what version of mpi/gcc are you using when you were able to run on ThornyFlat?

    I was on the phone with Guillermo and Patrick for 2 hours today. Apparently I have the same rights to compute nodes as any WVU faculty on ThornyFlat community nodes supported by NSF. It’s not the memory or access, although ThornyFlat was rather slow today. The problem was tracked down to a bug that seems to appear when mpi4.1 talks with gcc11. It is not fixed yet. I am to try to use gcc3.1.6 instead and report back. They’ll work on their end to figure out if it’s fixable by them, or if it’s a bug in mpi4.1 that must be reported. As for me not being able to send tickets, this seems to be managed by a different team that I have to contact.

  15. Anuj Kankani

    The modules I used are the ones listed in the machine file I attached. I also had issues with openmpi 4.something, so I switched to openmpi 3.something and gcc93.

    Here are the modules I used:

    module load lang/gcc/9.3.0
    module load parallel/openmpi/3.1.6_gcc93
    module load libs/fftw/3.3.9_gcc93
    module load libs/hdf5/1.12.2_gcc93
    module load libs/openblas/0.3.20_gcc93
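    All the library modules in this list carry the same _gcc93 suffix. A quick way to confirm that a module list does not mix toolchains is sketched below. toolchain_suffixes is a hypothetical helper, and it assumes the site's naming convention of ending module versions in _gccNN, as seen in the names above.

```shell
# Sketch only: assumes module names end in a _gccNN toolchain suffix.
toolchain_suffixes() {
    # Reads module names on stdin, prints each distinct _gccNN suffix once.
    sed -n 's/.*\(_gcc[0-9][0-9]*\)$/\1/p' | sort -u
}

# The list from this comment; lang/gcc/9.3.0 has no suffix and is skipped.
printf '%s\n' \
    lang/gcc/9.3.0 \
    parallel/openmpi/3.1.6_gcc93 \
    libs/fftw/3.3.9_gcc93 \
    libs/hdf5/1.12.2_gcc93 \
    libs/openblas/0.3.20_gcc93 | toolchain_suffixes
```

    More than one line of output would indicate that modules built with different compilers are being combined.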

  16. Maria reporter

    Yes, you’re right, with the 93 version of the compiler and openmpi, it does work. I raised the issue because I updated it to gcc11. We should get it to work with current software.

  17. Roland Haas

    So…, who will serve as maintainer for thornyflat and which files should be included in simfactory?

  18. Anuj Kankani

    I can do it. I can also add Spruce Knob (the other WVU cluster). I will try to do it by this weekend and confirm/add the files once I run the test suite.

  19. Anuj Kankani

    Do I need to run the full testsuite right now on the current Riemann release, or before the next release?

  20. Roland Haas

    For the release the testsuite minimally needs to be run after the feature freeze, which for this release is 2022-10-13, see the timeline at https://docs.einsteintoolkit.org/et-docs/Release_Details .

    Though for a first-time cluster you are probably well advised to try testing at least once before to make sure there are no major failures.

    Test results are displayed here: http://einsteintoolkit.org/testsuite_results/index.php using data hosted in this bitbucket repository https://bitbucket.org/einsteintoolkit/testsuite_results . To be able to commit and push your results, would you mind sending me (rhaaas@illinois.edu) the email address associated with your Bitbucket account so that I can give you write permissions, please?

    The names of the files in the repo follow a pattern that is looked for by the website code, so you should adhere to the structure outlined in https://docs.einsteintoolkit.org/et-docs/Testsuite_Machines , that is, the names should be <machine>__1_<N>.log for the 1 process test and <machine>__2_<N/2>.log for the 2 process test.
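    As a concrete illustration of that naming pattern, the sketch below builds both file names from a machine name and N. testsuite_log_names is a hypothetical helper, N=40 is only an example value, and N is assumed to be even. What N denotes is defined on the linked page.

```shell
# Sketch only: builds the two log names from the pattern quoted above.
testsuite_log_names() {
    # $1 = machine name, $2 = N (assumed even)
    local machine="$1" n="$2"
    echo "${machine}__1_${n}.log"
    echo "${machine}__2_$(( n / 2 )).log"
}

testsuite_log_names thornyflat 40
```

    Note the double underscore after the machine name, which is easy to miss when renaming files by hand.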

  21. Anuj Kankani

    Thanks! If I understand you correctly, I don’t need to run anything right now, just before the next release (and since it’s a first-time cluster, before the feature freeze).

    My email is anuj.kankani@mail.wvu.edu

  22. Roland Haas

    Getting a pull request with the actual files before that would be great though, so that there can be comments on them. There are a number of settings that are not totally obvious and are not used by everyone, so it may not be obvious right away when they are missing.

  23. Anuj Kankani

    I created my own branch in order to submit a pull request. I didn't see a folder for the cluster files so I have added a folder with the files and submitted the pull request and put you as the reviewer. Hopefully this is ok.

  24. Roland Haas

    They go into the directories “mdb/machines”, “mdb/runscripts”, “mdb/submitscripts” and “mdb/optionlists”. So not one directory per machine but one directory per file type.

    The files are actually part of simfactory not the testsuite results. So the pull request would have to be for https://bitbucket.org/simfactory/simfactory2/
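    For the four files attached to this ticket (ini, run, sub and cfg), that layout could be spelled out as follows. This is a sketch: mdb_paths is a hypothetical helper, and the file names assume each file is named after the machine.

```shell
# Sketch only: one directory per file type, one file per machine.
mdb_paths() {
    # $1 = machine name
    local m="$1"
    echo "mdb/machines/${m}.ini"
    echo "mdb/runscripts/${m}.run"
    echo "mdb/submitscripts/${m}.sub"
    echo "mdb/optionlists/${m}.cfg"
}

# Report any of the four files that is missing from the current checkout.
for f in $(mdb_paths thornyflat); do
    [ -f "$f" ] || echo "missing: $f"
done
```

    Run from the top of a simfactory checkout, the loop lists whichever of the four files have not been added yet.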

  25. Anuj Kankani

    Ok that’s what I thought I was supposed to do, but I don’t have permissions for the simfactory2 repo so I can’t create a new branch/pull request.

  26. Roland Haas

    You can however create a pull request 🙂 I’ll add you with write permission after the pull request.

  27. Roland Haas

    The way to do it is:

    1. fork the simfactory repo on bitbucket
    2. create a branch in your fork with the files you want
    3. create the pull request in your fork, it will show up in the list of pull requests in the source repository
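    Step 2 of the list above might look roughly like this on the command line, run inside a local clone of the fork. commit_on_new_branch is a hypothetical helper, and the branch name and commit message are placeholders. Steps 1 and 3 happen in the Bitbucket web interface.

```shell
# Sketch only: creates a branch in the current clone and commits the given files.
commit_on_new_branch() {
    # $1 = new branch name; remaining arguments = files to commit
    local branch="$1"
    shift
    git checkout -q -b "$branch" &&
    git add -- "$@" &&
    git commit -q -m "Add machine files"
}

# Example use inside the clone (not run here):
#   commit_on_new_branch thornyflat-files mdb/machines/thornyflat.ini
# Afterwards the branch is published to the fork with:
#   git push -u origin thornyflat-files
```

    Once the branch has been pushed, the pull request can be opened from the fork's page as described in step 3.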

  28. Anuj Kankani

    Got it, thank you for all the help. Sorry for the mistakes, I don’t have much experience with git/bitbucket.

    I’ve submitted the pull request.

  29. Maria reporter

    I agree with Anuj being in charge. Anuj, please try to work with the support team for Thorny Flat to fix the problem with MPI and keep the compilers up to date.

  30. Roland Haas

    There probably should be one. Right now it is only implied.

    Though I suspect I will have to first actually grant you write access to the repository.

    Ah, I just checked. You do already have write permission.
