mpirun hangs on Fedora 23

Issue #1882 closed
Mikael Sahrling created an issue

Running sim/mpirun -np N ./cactus_sim BBH* (with N anywhere from 2 to 10) hangs on Fedora 23 after some time, between 1 min and 1 hr. I have disabled the firewall, since that caused mpirun to hang for others, but it still hangs for me. It could be related to the USB interface, since touching the keyboard or mouse during the simulation sometimes causes the system to hang.

I'm running on a single Linux box with 64 GB RAM.

Anyone seen anything like this?

Thanks,


Comments (7)

  1. Ian Hinder

    Hmm. I haven't seen this before, but I haven't tried running on Fedora. Is it possible you are running out of memory? You could check "top" while the simulation is running, and compare the Cactus process's RSS memory usage with the 64 GB you have available. Maybe you can give more details of what you are doing? For example:

    1. Version of the ET
    2. Parameter file
    3. Exact command line you are using to start the simulation
    4. Do you get any output at all, or does it hang immediately after calling mpirun?

    Thanks for reporting the problem!
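
As a rough alternative to watching "top" by hand, the comparison of a process's RSS against total RAM could be scripted. This is only a minimal sketch, assuming a Linux system where /proc/<pid>/status exposes VmRSS and /proc/meminfo exposes MemTotal (both in kB). The function names are made up for illustration and are not part of Cactus or the ET.

```python
import os


def parse_kb(text, key):
    """Return the integer kB value for `key` in /proc-style 'Key:  123 kB' text, or None."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == key:
            return int(rest.split()[0])
    return None


def rss_fraction(status_text, meminfo_text):
    """Fraction of total RAM taken by the process's resident set, or None if unknown."""
    rss = parse_kb(status_text, "VmRSS")
    total = parse_kb(meminfo_text, "MemTotal")
    if rss is None or not total:
        return None
    return rss / total


def read_rss_fraction(pid):
    """Read the two /proc files for `pid` (Linux only) and return the RSS fraction."""
    with open(f"/proc/{pid}/status") as f:
        status_text = f.read()
    with open("/proc/meminfo") as f:
        meminfo_text = f.read()
    return rss_fraction(status_text, meminfo_text)


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    # Demonstration on this script's own PID; substitute the PID of a simulation process.
    frac = read_rss_fraction(os.getpid())
    if frac is not None:
        print(f"RSS is {100 * frac:.2f}% of total RAM")
```

With several MPI processes, each one has its own RSS, so the fractions for all of them would need to be added up before comparing against the 64 GB available.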

  2. Barry Wardell

    It's possible this is happening after the initial data solver has finished (an hour is a plausible amount of time for TwoPunctures to generate initial data for a binary black hole system). Since (I think?) Cactus will only allocate memory for the grid functions once the initial data is finished, looking at "top" may make it appear that everything is fine in terms of memory usage, and then suddenly the memory consumption will shoot up. To assess if this is the case, you could post the last few lines of output you get.

  3. anonymous
    1. This is the Nov 15 2015 release.
    2. I'm using the included BBHMedRes.par.
    3. In directory Cactus/exe: sim/mpirun -np 10 ./cactus_sim BBHMedRes.par. I have a Xeon 2620 with 6 cores. I have also observed this with np=2 and np=5.
    4. It runs for between 1 min and 4 hours and then suddenly freezes; see the attached screenshot of the frozen system. The keyboard and mouse stop responding, the screen no longer updates, the network connection is off, etc. The simulation also stops: I have waited an hour or so to see whether the simulation files were still being updated, but they were not.

    I don't think it's a memory issue, since I'm using about 24 GB of my 64 GB maximum.

  4. Barry Wardell

    From your screenshot it is clearly getting past initial data into evolution, so there shouldn't be any huge memory increases (certainly not enough to go from 24GB to 64GB).

    Does it also freeze if you try to run it without MPI? Since you're running on a single machine you don't need MPI and can use multiple OpenMP threads instead.

  5. anonymous

    How do I do that? I tried ./cactus_sim BBHMedRes.par and that didn't work:

    I get: WARNING level 0 from host mars process 0

    while executing schedule bin BoundaryConditions, routine RotatingSymmetry180::Rot180_ApplyBC
    in thorn RotatingSymmetry180, file /home/mikael/Astronomy/GeneralRelativity/ET/Cactus/configs/sim/build/RotatingSymmetry180/rotatingsymmetry180.c:447:
    -> TAT/Slab can only be used if there is a single local component per MPI process
    

    I was told I need at least 2 MPI processes to run this simulation; see issue #1857.

  6. anonymous
    • changed status to resolved

    Two of my memory sticks were causing the problem. Without them installed I have no problems: the simulation runs for more than 24 hours and completes properly. With them installed the system halts or crashes within minutes to hours.

    We can close this ticket. Thanks for your input and moral support!! It's good not to be alone out there.
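
For anyone hitting similar symptoms, a crude user-space check of the kind that can hint at bad RAM might look like the sketch below. It is an illustration of the write-and-read-back idea only, not what was actually done in this thread. A script like this cannot choose which physical memory stick it touches, and small buffers may be served from CPU cache, so a dedicated tester booted outside the operating system (memtest86+ is one such tool) is far more reliable. All names here are hypothetical.

```python
def count_mismatches(buf, pattern):
    """Number of bytes in `buf` that differ from the single-byte `pattern`."""
    return len(buf) - buf.count(pattern)


def check_patterns(size_bytes, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill a buffer with each pattern in turn, read it back, and report mismatches per pattern."""
    results = {}
    for pattern in patterns:
        buf = bytearray([pattern]) * size_bytes  # write pass
        results[pattern] = count_mismatches(buf, pattern)  # read-back pass
        del buf  # release before the next pattern so only one buffer is live
    return results


if __name__ == "__main__":
    # 64 MiB is an arbitrary demonstration size; larger buffers cover more memory.
    for pattern, bad in check_patterns(64 * 1024 * 1024).items():
        print(f"pattern 0x{pattern:02X}: {bad} mismatched bytes")
```

Any non-zero count would be suspicious, but a clean result proves little, since the faulty region may simply not have been allocated to the script.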
