New machine thornyflat at WVU

Issue #2499 open
Erik Schnetter created an issue

Comments (20)

  1. Roland Haas

    I mistook the trampoline for an actual cluster. If the machine works with this setup I see no reason not to include it, provided that someone at WVU is ok to test it for the releases and provides updates.

  2. Maria

    Erik, I am picking up this ticket to let you know that it is not sorted out yet. I tried to run GW150914.par on 4 nodes (160 cores) and nsnstohmns.par on 1 node (40 cores), both within a PBS batch script and directly from the terminal. My jobs were either cancelled “for the following reason: no feasible locations found to run job”, or I cancelled them after more than a week of sitting in the queue. My priority, as an external user, must be rather low.

    Here is the output when starting from the terminal:

    1. for GW150914

    [mbh0012@trcis001 Cactus]$ simfactory/bin/sim create-submit GW150914_128 --configuration=bns --machine=thornyflat --parfile=par/GW150914.par --cores=128
    Warning: Current Working directory does not match Cactus sourcetree, changing to /users/mbh0012/Cactus
    Parameter file: /gpfs20/users/mbh0012/Cactus/par/GW150914.par
    Skeleton Created
    Job directory: "/scratch/mbh0012/simulations/GW150914_128"
    Executable: "/users/mbh0012/Cactus/exe/cactus_bns"
    Option list: "/scratch/mbh0012/simulations/GW150914_128/SIMFACTORY/cfg/OptionList"
    Submit script: "/scratch/mbh0012/simulations/GW150914_128/SIMFACTORY/run/SubmitScript"
    Run script: "/scratch/mbh0012/simulations/GW150914_128/SIMFACTORY/run/RunScript"
    Parameter file: "/scratch/mbh0012/simulations/GW150914_128/SIMFACTORY/par/GW150914.par"
    Assigned restart id: 0
    Warning: Total number of threads and number of threads per process are inconsistent: procs=128, num-threads=20 (procs*num-smt must be an integer multiple of num-threads)
    Warning: Total number of threads and number of cores per node are inconsistent: procs=128, ppn-used=40 (procs must be an integer multiple of ppn-used)
    Executing submit command: qsub /scratch/mbh0012/simulations/GW150914_128/output-0000/SIMFACTORY/SubmitScript
    Submit finished, job id is 468083

    This is how it sits in the queue:

    468083.trcis002.hpc.wv mbh0012 comm_sma GW150914_128-00 -- 4 160 -- 168:00:00 Q --

    Note that I asked for 128 cores, and what it received is `procs=128, num-threads=20`. I thought I should run with `OMP_NUM_THREADS=1`. Is there a way to hard-code this in the `/scratch/mbh0012/simulations/GW150914_128/output-0000/SIMFACTORY/SubmitScript`?

    2. for nsnstohmns

    [mbh0012@trcis001 Cactus]$ simfactory/bin/sim create-submit nsnstohmns --configuration=bns --machine=thornyflat --parfile=par/nsnstohmns.par --cores=40
    Warning: Current Working directory does not match Cactus sourcetree, changing to /users/mbh0012/Cactus
    Parameter file: /gpfs20/users/mbh0012/Cactus/par/nsnstohmns.par
    Skeleton Created
    Job directory: "/scratch/mbh0012/simulations/nsnstohmns"
    Executable: "/users/mbh0012/Cactus/exe/cactus_bns"
    Option list: "/scratch/mbh0012/simulations/nsnstohmns/SIMFACTORY/cfg/OptionList"
    Submit script: "/scratch/mbh0012/simulations/nsnstohmns/SIMFACTORY/run/SubmitScript"
    Run script: "/scratch/mbh0012/simulations/nsnstohmns/SIMFACTORY/run/RunScript"
    Parameter file: "/scratch/mbh0012/simulations/nsnstohmns/SIMFACTORY/par/nsnstohmns.par"
    Assigned restart id: 0
    Executing submit command: qsub /scratch/mbh0012/simulations/nsnstohmns/output-0000/SIMFACTORY/SubmitScript
    Submit finished, job id is 468082

    468082.trcis002.hpc.wv mbh0012 comm_sma nsnstohmns-0000 -- 1 40 -- 168:00:00 Q --

  3. Erik Schnetter reporter

    The first job you submitted requested 128 cores, which is not a multiple of 40 cores, the number of cores per node. The queuing system does not seem to support this. I recommend using 160 cores instead of 128 cores.
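
    As a rough illustration of the arithmetic above (a sketch only: the function name `round_up_cores` is made up for this example, and 40 cores per node is the figure quoted in this thread), a core request can be rounded up to a whole number of nodes like this:

```shell
#!/bin/sh
# Round a requested core count up to a whole number of nodes.
# round_up_cores is a hypothetical helper, not part of simfactory.
round_up_cores() {
    requested=$1
    cores_per_node=$2
    # Integer ceiling division gives the number of nodes needed.
    nodes=$(( (requested + cores_per_node - 1) / cores_per_node ))
    echo $(( nodes * cores_per_node ))
}

# 128 requested cores on 40-core nodes need 4 nodes.
round_up_cores 128 40   # prints 160
```

    The result, 160, is also a multiple of the 20 threads per process mentioned in the first warning, so both consistency checks would be satisfied.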

    Your second job seems to be waiting in the queue just fine. If you want it to start earlier, then you could try asking for a shorter run time. Usually, jobs asking for a longer time take longer to start. Start by asking for one hour to see whether it works, then maybe for 24 hours. You could also inquire with the system administrators whether the job’s parameters are fine.
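
    A possible way to try the one-hour request is sketched below. It assumes simfactory's `--walltime` option is used to set the requested run time; the run name `nsnstohmns_1h` and the helper `build_submit_cmd` are invented for this example. The script only prints the command, so nothing is submitted until the printed line is run by hand.

```shell
#!/bin/sh
# Dry run: print a simfactory submission command with a short walltime.
# build_submit_cmd is a hypothetical helper; it does not submit anything.
build_submit_cmd() {
    name=$1
    cores=$2
    walltime=$3
    echo "simfactory/bin/sim create-submit $name --configuration=bns --machine=thornyflat --parfile=par/nsnstohmns.par --cores=$cores --walltime=$walltime"
}

# One node (40 cores) for one hour, as a first test.
build_submit_cmd nsnstohmns_1h 40 1:00:00
```

    If the one-hour job starts promptly, the same command with `24:00:00` would be the next step.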

    -erik

  4. Roland Haas

    Any progress in fixing up / testing the machine? There is no big hurry, but if you want it to be included in the list of officially supported machines for the toolkit, the testsuite must run with the release candidate at least once (mostly successfully) and the files must be in master.

  5. Erik Schnetter reporter

    Maria

    Are these Simfactory configuration files working for you? I’ve lost access to thornyflat in the meantime. If you need me to help you debug this, you would need to apply for a new account for me.

    -erik

  6. Roland Haas

    @Zach Etienne, does thornyflat still exist? If so, is the pull request still a good simfactory entry for it?

  7. Zach Etienne

    @Roland Haas : Yes, it does. However, I don’t use simfactory so I wouldn’t know if the PR is good. If it works for Maria and doesn’t cause any harm to simfactory etc, I would vote for inclusion.

  8. Roland Haas

    @Maria , can you verify that the files in the pull request work for you and will you volunteer to keep them up to date in the future and (ideally) test them for each ET release (or designate someone who will do so)?

  9. Maria

    Roland and Erik,

    I tested the thornyflat settings with the Johnson release.

    1. To pull, I used:

        cd repos/simfactory2/
        git fetch && git checkout origin/eschnett/thornyflat

    The result was:

        ls simfactory/mdb//thornyflat
        simfactory/mdb/machines/thornyflat.ini
        simfactory/mdb/runscripts/thornyflat.run
        simfactory/mdb/optionlists/thornyflat.cfg
        simfactory/mdb/submitscripts/thornyflat.sub

    2. To set up, I used the command:

        ./simfactory/bin/sim setup --optionlist=simfactory/mdb/optionlists/thornyflat.cfg --runscript simfactory/mdb/runscripts/thornyflat.run

    The result was:

  10. Maria

    lang/gcc/11.2.0 parallel/openmpi/4.1.2_gcc112 libs/fftw/3.3.10_gcc112_ompi412 libs/hdf5/1.12.1_gcc112_ompi412 libs/openblas/0.3.19_gcc112

  11. Roland Haas

    @Maria I added you as a “developer” to the Simfactory repo, so you should be able to push your required changes yourself into the eschnett/thornyflat branch.

    You may have to do a full checkout first, i.e.:

    git clone git@bitbucket.org:simfactory/simfactory2.git
    cd simfactory2
    git checkout eschnett/thornyflat
    

    then add the modified files and tell git about them

    git add mdb/machines/thornyflat.ini
    git add mdb/runscripts/thornyflat.run
    
    git commit -m 'thornyflat: update after modules have changed'
    git push
    

    where you will need one git add per changed file (or use git add -p which will interactively show each change to add to the commit).

  12. Maria

    Dear Erik and Roland,

    Sorry, I cannot commit the Simfactory scripts for thornyflat, because there still seems to be a problem. The production run gave me a segmentation fault. I am pasting the error's gobbledygook below; please help me tease out and fix the problem:
