thornyflat - production run segmentation fault
Simfactory scripts for ThornyFlat give segmentation fault in production run. I am attaching the cfg, ini, run and sub scripts.
In the simulations directory, the SIMFACTORY directory does not contain the subdirectories exe, cfg, run, and par, and NODES is empty; the only file present is properties.ini. The job output includes:
____
Loading torque version 6.1.3 : dev/torque/6.1.3
Loading openmpi version 4.1.2_gcc112 : parallel/openmpi/4.1.2_gcc112
Loading openblas version 0.3.19_gcc112 : libs/openblas/0.3.19_gcc112
+ set -e
+ cd /scratch/mbh0012/simulations/bnsG2/output-0000-active
+ echo Checking:
+ pwd
+ hostname
+ date
+ cat
+ echo Environment:
+ export GMON_OUT_PREFIX=gmon.out
+ GMON_OUT_PREFIX=gmon.out
+ export CACTUS_NUM_PROCS=2
+ CACTUS_NUM_PROCS=2
+ export CACTUS_NUM_THREADS=20
+ CACTUS_NUM_THREADS=20
+ export OMP_NUM_THREADS=20
+ OMP_NUM_THREADS=20
+ env
+ sort
+ echo Starting:
++ date +%s
+ export CACTUS_STARTTIME=1642991790
+ CACTUS_STARTTIME=1642991790
+ mpiexec -n 2 -npernode 2 /scratch/mbh0012/simulations/bnsG2/SIMFACTORY/exe/cactus_nst -L 3 /scratch/mbh0012/simulations/bnsG2/output-0000/nsnstohmns.par
[trcis001:09698] *** Process received signal ***
[trcis001:09698] Signal: Segmentation fault (11)
[trcis001:09698] Signal code: (128)
[trcis001:09698] Failing at address: (nil)
[trcis001:09698] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2b4fc7e60630]
[trcis001:09698] [ 1] /shared/software/parallel/openmpi/4.1.2_gcc112/lib/libopen-rte.so.40(orte_get_attribute+0x21)[0x2b4fc6f4b101]
[trcis001:09698] [ 2] /shared/software/parallel/openmpi/4.1.2_gcc112/lib/libopen-rte.so.40(orte_plm_base_setup_job+0xf0)[0x2b4fc6f83530]
[trcis001:09698] [ 3] /lib64/libevent_core-2.0.so.5(event_base_loop+0x774)[0x2b4fc7a2f3a4]
[trcis001:09698] [ 4] mpiexec[0x40133a]
[trcis001:09698] [ 5] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b4fc808f555]
[trcis001:09698] [ 6] mpiexec[0x40114e]
[trcis001:09698] *** End of error message ***
/scratch/mbh0012/simulations/bnsG2/output-0000/SIMFACTORY/RunScript: line 24: 9698 Segmentation fault (core dumped) mpiexec -n 2 -npernode 2 /scratch/mbh0012/simulations/bnsG2/SIMFACTORY/exe/cactus_nst -L 3 /scratch/mbh0012/simulations/bnsG2/output-0000/nsnstohmns.par
Comments (41)
-
reporter Again, the simulation gave me a segmentation fault. In essence, the problem seems to be here:
/scratch/mbh0012/simulations/tov/output-0000/SIMFACTORY/RunScript: line 24: 9372 Segmentation fault (core dumped) mpiexec -n 1 -npernode 2 /scratch/mbh0012/simulations/tov/SIMFACTORY/exe/cactus_etk -L 3 /scratch/mbh0012/simulations/tov/output-0000/static_tov.par
This can be traced to thornyflat.run, which is attached above. Most likely, there is an error in this line:
mpiexec -n @NUM_PROCS@ -npernode @(@PPN_USED@ / @NUM_THREADS@)@ @EXECUTABLE@ -L 3 @PARFILE@
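For context, simfactory substitutes the @…@ placeholders before the run script executes. A minimal sketch of how this line expands, using the values visible in the log above (2 processes, 20 threads, 40 cores per node); the shell variables below are illustrative, not simfactory's internal names:

```shell
# Illustrative expansion of the run-script template. Values are taken from
# the log above (CACTUS_NUM_PROCS=2, OMP_NUM_THREADS=20, ppn-used=40);
# this is not simfactory's actual substitution code.
NUM_PROCS=2
PPN_USED=40
NUM_THREADS=20
NPERNODE=$(( PPN_USED / NUM_THREADS ))
echo "mpiexec -n ${NUM_PROCS} -npernode ${NPERNODE} <executable> -L 3 <parfile>"
```

With these values the template reproduces the `mpiexec -n 2 -npernode 2` invocation seen in the log, so the template arithmetic itself looks consistent.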
-
reporter The helpdesk thinks that this is an internal problem with the Einstein Toolkit, a bug or bad software design that results in allocations not being checked properly. I suspect the problem is in the submit or run files.
-
This was discussed in today's ET call. While difficult to diagnose remotely, the suggestions were that this may be due to mismatched MPI stacks between compile time and runtime, or due to an incorrect LD_LIBRARY_PATH. Without access to the system, though, this is almost impossible to diagnose correctly.
-
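One quick way to look for the suspected mismatch is to list every MPI entry visible on the library path. A minimal sketch with a made-up LD_LIBRARY_PATH value (on the cluster one would inspect the real variable, plus `which mpiexec` and `ldd` on the Cactus executable):

```shell
# LD_PATH below is a *hypothetical* example of a bad path: two different
# OpenMPI installs are visible at once. On the cluster, inspect the real
# $LD_LIBRARY_PATH instead.
LD_PATH="/shared/software/parallel/openmpi/4.1.2_gcc112/lib:/usr/lib64:/shared/software/parallel/openmpi/3.1.6_gcc93/lib"
echo "$LD_PATH" | tr ':' '\n' | grep -i openmpi
# More than one distinct OpenMPI prefix here (or a prefix that differs from
# the one `which mpiexec` reports) points at a mismatched MPI stack.
```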
I am actually able to get the ETK running on thornyflat (and spruceknob) using simfactory by slightly modifying the configuration files attached above. My only issue is that when creating a new simulation, the run and submit scripts that get copied are the generic ones, despite me specifying --machine thornyflat. By manually going in and replacing the run and submit files in the simulation directory, everything works fine. In the thornyflat machine file I specify the thornyflat configuration files, so I’m not sure why it’s still choosing the generic ones. Is there some step I am missing?
I want to upload the configuration files I used but only see an image upload button?
-
Oh, I see. Ok, that is easier to offer suggestions for. There are two things that come to mind.
- I would try to make sure that simfactory recognizes thornyflat by checking that ./simfactory/bin/sim whoami returns thornyflat, and that the mdb entry simfactory uses is the one I expect, using ./simfactory/bin/sim print-mdb-entry $(./simfactory/bin/sim whoami | cut -d' ' -f3).
- There seems to be a typo in the file name for the submit script. Namely, it is called thornyflay.sub (ay where it should be at). This could prevent simfactory from finding it.
Are there any warnings when you start compiling a configuration? Sometimes simfactory picks a “default” run script when it cannot find the one specified on the command line or in the machine file.
Just to be sure: did you try this with a freshly compiled configuration, ideally after a rm -rf configs, to make sure there are no lingering old files (a --reconfig does not overwrite an existing RunScript file in configs/sim)?
The submit script attached to the ticket does already use --machine @MACHINE@, which makes sure that simfactory uses the correct machine description when executing the run script.
Note: looking at your file thornyflat.ini in the ticket, there is a missing [thornyflat] section header that should be at the top of the file and that actually tells simfactory the name of the machine (the ini file section).
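The first check above can be scripted end to end; a small sketch (the sample line stands in for the output of `sim whoami`, which is assumed here to look like `Current machine: thornyflat`):

```shell
# The `cut` in the comment above extracts the machine name from simfactory's
# output. WHOAMI_OUTPUT is a stand-in for `./simfactory/bin/sim whoami`,
# assumed to print something like "Current machine: thornyflat".
WHOAMI_OUTPUT="Current machine: thornyflat"
MACHINE=$(echo "$WHOAMI_OUTPUT" | cut -d' ' -f3)
echo "$MACHINE"
# On the real system you would then run:
#   ./simfactory/bin/sim print-mdb-entry "$MACHINE"
```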
-
I wasn’t doing rm -rf configs before, and after doing that everything seems to be working (I had fixed the typos and [thornyflat] before). I’ve attached the config files I used below. I noticed there were some disabled thorns in the machine file (I’m assuming something with blas vs openblas?). I went ahead and kept them disabled since I assume there was a reason for disabling them. Also, I had to specify --machine thornyflat during compilation since it does not choose it by default.
Thanks for the help!
-
- attached thornyflat.sub
- attached thornyflat.run
- attached thornyflat.ini
- attached thornyflat.cfg
-
@Maria @Anuj Kankani Ok. Thank you. I will add them to simulation factory if you would like me to.
-
I think that would be useful for future users
-
Very good. Someone will need to be listed as a maintainer of the machine though, would one of you be willing to do so? Basically this would involve keeping the files up to date and ideally running the full testsuite before each release (https://docs.einsteintoolkit.org/et-docs/Testsuite_Machines).
-
I can if needed, but given that I am very new to ETK and unsure of my future usage, I am probably not the best choice.
-
@Maria ?
-
reporter Eric provided the simfactory files necessary for ThornyFlat, and they used to run just fine. However, an update on ThornyFlat made it unusable. I tried to update the scripts, but I was not able to carry out a simulation with simfactory. I did not have the problem described by Anuj with reverting to a generic configuration; that should not happen, and is strange. My problem was that the runs hung forever in limbo and could not start, not even HelloWorld on the end nodes without simfactory. I am not receiving answers from the maintainers and don’t even have access to open a ticket. If Anuj has more success with it, please let me know. Anuj, I am happy to see that you are able to run the ETK there. What configuration are you using? How many nodes/CPUs are you using? Let me know if I can be of assistance.
-
reporter Roland, let me try Anuj’s files before adding them to simfactory.
-
Sure. I am also still looking for one of those who created the files to volunteer to “maintain” them, i.e. check periodically (every 6 months, for the release) that they still work. In return, that person (and any others named by them as having contributed to the files) gets to be a simfactory and thus ET author.
-
reporter Well, I thought I could do this. I used ThornyFlat before without problems, but for almost a year now I have not been getting through. I will call them tomorrow to make sure that I am able to access the WVU cluster and the ticket web page as an outsider (I am not WVU faculty).
-
@Maria I have run it on 2 nodes/80 cores on thornyflat and 128 cores on spruce knob (with the necessary changes to the configuration files for spruce knob). The files I attached are working for me without issue and the generic configuration problem was due to a mistake I was making.
Note: they have actually added ET as a module to thorny flat as well, but I was not able to get it to work and the hpc staff told me they are working on fixing it.
@Roland Haas, if Maria is unable to access the cluster, then I can volunteer as the maintainer for thorny flat and spruce knob
-
reporter Sure, no problem! Meantime, I still want to be able to run the ETK on ThornyFlat. Anuj, what versions of MPI/gcc were you using when you were able to run on ThornyFlat?
I was on the phone with Guillermo and Patrick for 2 hours today. Apparently I have the same rights to compute nodes as any WVU faculty on the ThornyFlat community nodes supported by the NSF. It’s not memory or access, although ThornyFlat was rather slow today. The problem was tracked down to a bug that seems to appear when mpi4.1 talks with gcc11. It is not fixed yet. I am to try gcc3.1.6 instead and report back. They’ll work on their end to figure out whether it is fixable by them or is a bug in mpi4.1 that must be reported. As for me not being able to send tickets, this seems to be managed by a different team that I have to contact.
-
The modules I used are the ones listed in the machine file I attached. I also had issues with openmpi 4.something, so I switched to openmpi 3.something and gcc93.
Here are the modules I used:
module load lang/gcc/9.3.0
module load parallel/openmpi/3.1.6_gcc93
module load libs/fftw/3.3.9_gcc93
module load libs/hdf5/1.12.2_gcc93
module load libs/openblas/0.3.20_gcc93
-
reporter Yes, you’re right, with the 93 version of the compiler and openmpi, it does work. I raised the issue because I updated it to gcc11. We should get it to work with current software.
-
Issue #2499 was marked as a duplicate of this issue.
-
So, who will serve as maintainer for thornyflat, and which files should be included in simfactory?
-
I can do it. I can also add Spruce Knob (the other WVU cluster). I will try to do it by this weekend and confirm/add the files once I run the test suite.
-
Do I need to run the full testsuite right now on the current Riemann release, or before the next release?
-
For the release the testsuite minimally needs to be run after the feature freeze, which for this release is 2022-10-13, see the timeline at https://docs.einsteintoolkit.org/et-docs/Release_Details .
Though for a first-time cluster you are probably well advised to try testing at least once before to make sure there are no major failures.
Test results are displayed here: http://einsteintoolkit.org/testsuite_results/index.php using data hosted in this bitbucket repository https://bitbucket.org/einsteintoolkit/testsuite_results . To be able to commit and push your results, would you mind sending me (rhaaas@illinois.edu) the email address associated with your Bitbucket account so that I can give you write permissions, please?
The names of the files in the repo follow a pattern that is looked for by the website code, so you should adhere to the structure outlined in https://docs.einsteintoolkit.org/et-docs/Testsuite_Machines. That is, the names should be <machine>__1_<N>.log for the 1-process test and <machine>__2_<N/2>.log for the 2-process test.
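For instance, on a hypothetical 40-core machine the expected names can be generated like this (N=40 is an illustrative value, not a requirement of thornyflat):

```shell
# Hypothetical example of the required log-file names for machine
# "thornyflat": N=40 cores for the 1-process test (illustrative value only).
MACHINE=thornyflat
N=40
LOG1="${MACHINE}__1_${N}.log"
LOG2="${MACHINE}__2_$(( N / 2 )).log"
echo "$LOG1" "$LOG2"
```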
Thanks! If I understand you correctly, I don’t need to run anything right now, just before the next release (and, since it’s a first-time cluster, before the feature freeze).
My email is anuj.kankani@mail.wvu.edu
-
Getting a pull request with the actual files before that would be great though, so that there can be comments on them. There are a number of settings that are not totally obvious and are not used by everyone, so their absence may not be obvious right away.
-
I created my own branch in order to submit a pull request. I didn’t see a folder for the cluster files, so I added a folder with the files, submitted the pull request, and put you as the reviewer. Hopefully this is ok.
-
They go into the directories “mdb/machines”, “mdb/runscripts”, “mdb/submitscripts” and “mdb/optionlists”: not one directory per machine, but one directory per file type.
The files are actually part of simfactory, not the testsuite results, so the pull request would have to be against https://bitbucket.org/simfactory/simfactory2/
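Concretely, placing the four attached files could look like the following sketch (a self-contained demo using a scratch `demo/` directory and empty stand-in files; in practice you would copy the real attachments into a simfactory2 checkout):

```shell
# Demo of simfactory's layout: one directory per file type, not per machine.
# The touch'd files are empty stand-ins for the attachments on this ticket.
mkdir -p demo/mdb/machines demo/mdb/runscripts demo/mdb/submitscripts demo/mdb/optionlists
touch thornyflat.ini thornyflat.run thornyflat.sub thornyflat.cfg
cp thornyflat.ini demo/mdb/machines/
cp thornyflat.run demo/mdb/runscripts/
cp thornyflat.sub demo/mdb/submitscripts/
cp thornyflat.cfg demo/mdb/optionlists/
ls demo/mdb/machines
```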
-
Ok that’s what I thought I was supposed to do, but I don’t have permissions for the simfactory2 repo so I can’t create a new branch/pull request.
-
You can, however, create a pull request.
I’ll add you with write permission after the pull request.
-
It says I do not have permission to create a pull request?
-
The way to do it is:
- fork the simfactory repo on Bitbucket
- create a branch in your fork with the files you want
- create the pull request from your fork; it will show up in the list of pull requests in the source repository
-
Got it, thank you for all the help. Sorry for the mistakes, I don’t have much experience with git/bitbucket.
I’ve submitted the pull request.
-
reporter I agree with Anuj being in charge. Anuj, please try to work with the support team for Thorny Flat to fix the problem with MPI and keep the compilers up to date.
-
Applied as git hash c2214852 "thornyflat: correct directories" of simfactory2
-
- changed status to resolved
-
At the very end of https://docs.einsteintoolkit.org/et-docs/Testsuite_Machines after the git commit command, there is no git push command. Is this a mistake or am I missing something? Just wanted to double check before pushing my results.
-
There probably should be one. Right now it is only implied.
Though I suspect I will have to first actually grant you write access to the repository.
Ah, I just checked. You do already have write permission.
-
Thanks, I think you gave me write access back in August so I should be good.
-
With the command:
./simfactory/bin/sim create-submit tov --configuration etk --machine=thornyflat --parfile=par/static_tov.par --cores=10
I am getting the warnings:
Warning: Current Working directory does not match Cactus sourcetree, changing to /users/mbh0012/Cactus
…
Warning: Too many threads per process specified: specified num-threads=20 (ppn-used is 40)
Warning: Total number of threads and number of threads per process are inconsistent: procs=10, num-threads=20 (procs*num-smt must be an integer multiple of num-threads)
Warning: Total number of threads and number of cores per node are inconsistent: procs=10, ppn-used=40 (procs must be an integer multiple of ppn-used)
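These warnings are consistent with the machine settings quoted earlier (40 cores per node, 20 threads per process): the requested cores must divide into whole processes and, per the last warning, whole nodes. A small sketch of the arithmetic, assuming a full 40-core node is requested instead of --cores=10:

```shell
# With ppn-used=40 and num-threads=20 (values from the warnings above),
# --cores=10 cannot be split into whole 20-thread processes. Requesting a
# full node instead reproduces the 2-process, 2-per-node layout from the log.
CORES=40          # a multiple of ppn-used, unlike the 10 in the warnings
PPN_USED=40
NUM_THREADS=20
echo "MPI processes:      $(( CORES / NUM_THREADS ))"
echo "processes per node: $(( PPN_USED / NUM_THREADS ))"
```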