Walltime being overwritten to 24 hours

Issue #2725 wontfix
Matthew Cerep created an issue

Whenever I submit a job, the walltime gets set to 24 hours. Even if I hardcode TerminationTrigger's max_walltime to some value, the parfile written to the simulation directory shows the value overwritten to 24. I also hardcoded my desired walltime in the SubmitScript file, but that doesn't help either. The parfile and the log.txt file (attached) say the walltime is 24 hours, yet running qstat -u shows the desired walltime. I believe the issue is with SimFactory, but I'm not sure what's going on or how to fix it.

Comments (15)

  1. Roland Haas

    Not sure if this is a bug or just unexpected behaviour.

    Most clusters have a maximum allowed walltime (often 24 or 48 hours). Simfactory is aware of the limit (via the maxwalltime setting in the machine.ini file) and will automatically split up longer jobs into chained segments of at most maxwalltime length.
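
    For reference, the setting looks like this in a machine entry (a hypothetical cluster with a made-up 48-hour limit, not any actual mdb file):

        [somecluster]
        # jobs requesting more than this are split into chained segments of at most this length
        maxwalltime = 48:00:00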

    You seem to be on the thornyflat cluster, where maxwalltime is 168 hours (see https://bitbucket.org/simfactory/simfactory2/src/master/mdb/machines/thornyflat.ini#lines-57), so that should not actually limit the walltime to 24 hours.

    Second, simfactory contains hard-coded, historic, and unfortunate code that resets the right-hand side of the `TerminationTrigger::max_walltime` parameter to the walltime that simfactory was given in its submit command (or the default walltime, which I do not recall off the top of my head). As far as I know this cannot be turned off (though I may well be wrong).
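
    As an illustration only (a hypothetical sketch, not simfactory's actual code), the effect on the parameter file is roughly a substitution of this kind:

        # Hypothetical sketch (not simfactory's code): force the right-hand side of
        # TerminationTrigger::max_walltime to the walltime the job was submitted with.
        import re

        def force_max_walltime(parfile_text, walltime_hours):
            pattern = re.compile(r"(TerminationTrigger::max_walltime\s*=\s*)\S+")
            return pattern.sub(lambda m: m.group(1) + str(walltime_hours), parfile_text)

        # A parfile line requesting 168 hours comes back requesting the submitted walltime:
        print(force_max_walltime("TerminationTrigger::max_walltime = 168.0", 24.0))
        # prints: TerminationTrigger::max_walltime = 24.0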

    Could you provide the exact simfactory command that you entered, please?

  2. Matthew Cerep reporter

    Yes, I’m on thornyflat in a queue that should allow a maximum walltime of 168 hours, which is why I’m confused that it's being set to 24 hours. Here’s the command I entered:

    ./simfactory/bin/sim create-submit insp08_final_small_2 --parfile par/charged_binary_inspiral_final.rpar --walltime 168:00:00 --queue comm_small_week --cores 80 --ppn 40 --num-threads 1 --machine thornyflat

    The rparfile just uses the Python package jhuki to generate the parfile; I tested it with a regular parfile as well and the issue persists.

  3. Roland Haas

    Hmm. I am very puzzled. In your log file (thank you for including it) you have the lines:

    [LOG:2023-05-07 21:19:25] self.submit(submitScript)::No previous walltime available to be used, using walltime 168:00:00
    [LOG:2023-05-07 21:19:25] self.submit(submitScript)::Defined substituion properties for submission
    [LOG:2023-05-07 21:19:25] self.submit(submitScript)::{'SOURCEDIR': '/users/mtc00017/scratch/EToolKitNew/Cactus', 'SIMULATION_NAME': 'insp08_final_small_2', 'SHORT_SIMULATION_NAME': 'insp08_final_sm', 'SIMULATION_ID': 'simulation-insp08_final_small_2-thornyflat-tf.hpc.wvu.edu-mtc00017-2023.05.07-21.19.25-2587', 'RESTART_ID': 0, 'RUNDIR': '/users/mtc00017/scratch/simulations/insp08_final_small_2/output-0000', 'SCRIPTFILE': '/users/mtc00017/scratch/simulations/insp08_final_small_2/output-0000/SIMFACTORY/SubmitScript', 'EXECUTABLE': '/users/mtc00017/scratch/simulations/insp08_final_small_2/SIMFACTORY/exe/cactus_sim', 'PARFILE': '/users/mtc00017/scratch/simulations/insp08_final_small_2/output-0000/charged_binary_inspiral_final.rpar', 'HOSTNAME': 'tf.hpc.wvu.edu', 'USER': 'mtc00017', 'NODES': 2, 'PROCS_REQUESTED': 80, 'PPN': 40, 'NUM_PROCS': 80, 'NODE_PROCS': 40, 'PROCS': 80, 'NUM_THREADS': 1, 'PPN_USED': 40, 'NUM_SMT': 1, 'MEMORY': '98304', 'CPUFREQ': '2.10', 'ALLOCATION': 'NOALLOCATION', 'QUEUE': 'comm_small_week', 'EMAIL': 'mtc00017', 'WALLTIME': '24:00:00', 'WALLTIME_HH': '24', 'WALLTIME_MM': '00', 'WALLTIME_SS': '00', 'WALLTIME_SECONDS': 86400, 'WALLTIME_MINUTES': 1440.0, 'WALLTIME_HOURS': 24.0, 'SIMFACTORY': '/gpfs20/scratch/mtc00017/EToolKitNew/Cactus/repos/simfactory2/bin/sim', 'SUBMITSCRIPT': '/users/mtc00017/scratch/simulations/insp08_final_small_2/output-0000/SIMFACTORY/SubmitScript', 'CONFIGURATION': 'sim', 'FROM_RESTART_COMMAND': '', 'CHAINED_JOB_ID': ''}
    

    So the walltime was 168 hours at one point (the “No previous walltime available” message really just means that this is output-0000) but then switched to 24 hours just afterwards. Looking at the Python code in lib/simrestart.py I cannot see how it could change to 24 hours.

        # import walltime if no --walltime is specified.
        if existingProperties is not None and not simenv.OptionsManager.HasOption('walltime') and existingProperties.HasProperty('walltime'):
            Walltime = restartlib.WallTime(existingProperties.GetProperty("walltime"))
            self.SimulationLog.Write("Using walltime %s from previous restart %s" % (existingProperties.GetProperty("walltime"), self.MaxRestartID))
        else:
            self.SimulationLog.Write("No previous walltime available to be used, using walltime %s" % Walltime.Walltime)
    
    [..stuff that does not touch Walltime...]
    
            walltt = Walltime
    
            # always restrict our walltime to maxwalltime if requested walltime
            # is too large.
            if MaxWalltime.walltime_seconds < Walltime.walltime_seconds:
                walltt = MaxWalltime
    
                # okay, our walltime requested was too large
                # find out if we should use automatic job chaining.
                if chainedJobId is None:
                    UseChaining = True
                    # TODO: i don't understand the job chaining logic. a
                    # restart should be presubmitted (instead of
                    # submitted) if there is a restart currently running.
                    # yet there is no check for this.
    
            new_properties['WALLTIME'] = walltt.Walltime
    

    Worse, if I try to mimic thornyflat (I have no account) on my workstation using

    [thornyflat]
    submit = cat @SCRIPTFILE@ >/dev/tty
    basedir = /data/rhaas/simulations
    sourcebasedir = /data/@USER@
    envsetup = true
    

    and running:

    ./simfactory/bin/sim create-submit --machine thornyflat --parfile par/tov_ET.par  --cores 1 foobar
    

    I do not see any such change in walltime myself.

    Could you run, on thornyflat, this command, please:

     ./simfactory/bin/sim print-mdb-entry thornyflat | grep maxwalltime
    

    which should output what simfactory thinks the maximum allowed walltime is (this is not written to the log; the 168 that you see comes from your command-line option for walltime).

    You could also add a debug statement such as print("maxwalltime is ", MaxWalltime, Walltime) to lib/simrestart.py's submit function to print out the values, for example as sketched below. Though, as I said, I am not sure what is going on.
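
    For example (a rough sketch; the exact spot in submit() depends on your checkout), a line like this just before the capping check quoted above:

        # debug only: show what simfactory thinks the two walltimes are at this point
        print("maxwalltime is", MaxWalltime.Walltime, "requested walltime is", Walltime.Walltime)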

  4. Matthew Cerep reporter

    I just ran that command, and it output `maxwalltime = 24:00:00`, so simfactory does think the max walltime is 24 hours. I’ll take a look at lib/simrestart.py and see if I find anything odd.

  5. Matthew Cerep reporter
      <div class="preview-container wiki-content"><!-- loaded via ajax --></div>
      <div class="mask"></div>
    </div>
    

    </div> </form>

  6. Matthew Cerep reporter

    I just attached the log file for a job that’s currently running and is at 24:43 of walltime. What’s weird is that the log file still suggests simfactory thinks the walltime is 24 hours.

  7. Matthew Cerep reporter

    Sorry, I forgot to note that the only difference in this job’s parfile is that I commented out the lines having to do with TerminationTrigger.

  8. Matthew Cerep reporter

    That’s actually the wrong job (sorry, I ran a lot of test jobs yesterday!), but the correct log file also says the walltime is 24 hours.

  9. Roland Haas

    Before digging into the Python code, maybe it would be good to check both your simfactory/etc/defs.local.ini and simfactory/mdb/machines/thornyflat.ini to see whether either of them sets a rogue maxwalltime for thornyflat.
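
    For example, a quick way to check both files at once (paths as named above):

        grep -n maxwalltime simfactory/etc/defs.local.ini simfactory/mdb/machines/thornyflat.ini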

  10. Matthew Cerep reporter

    I found it! simfactory/mdb/machines/thornyflat.ini set maxwalltime = 24:00:00 for some reason. Changing that fixed it! Thank you!
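
    For reference, the corrected entry would presumably look like this, assuming the 168-hour value from the upstream mdb file linked above:

        # in simfactory/mdb/machines/thornyflat.ini
        maxwalltime = 168:00:00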

  11. Roland Haas

    Glad to have been able to help. I will close this ticket and create a new one for Steve’s suggestion.
