Walltime being overwritten to 24 hours
Whenever I submit a job, the walltime gets set to 24 hours. Even if I hardcode max_walltime for TerminationTrigger to some value, the parfile that SimFactory writes to the simulation directory shows the value overwritten to 24. I also went into the submitScript file and hardcoded my desired walltime there, but that doesn’t help. The parfile and the log.txt file (log file attached) say the walltime is 24 hours, but qstat -u reports the desired walltime. I believe the issue is with SimFactory, but I’m not sure what’s going on or how to fix it.
Comments (15)
-
-
reporter Yes, I’m on thornyflat in a queue that should allow a maximum walltime of 168 hours, which is why I’m confused that it’s being set to 24 hours. Here’s the command I entered:
```
./simfactory/bin/sim create-submit insp08_final_small_2 --parfile par/charged_binary_inspiral_final.rpar --walltime 168:00:00 --queue comm_small_week --cores 80 --ppn 40 --num-threads 1 --machine thornyflat
```
The rparfile just uses the Python package jhuki to generate the parfile; I tested it with a regular parfile as well and the issue persists.
-
-
assigned issue to
-
assigned issue to
-
Hmm. I am very puzzled. In your log file (thank you for including it) you have the lines:
```
[LOG:2023-05-07 21:19:25] self.submit(submitScript)::No previous walltime available to be used, using walltime 168:00:00
[LOG:2023-05-07 21:19:25] self.submit(submitScript)::Defined substituion properties for submission
[LOG:2023-05-07 21:19:25] self.submit(submitScript)::{'SOURCEDIR': '/users/mtc00017/scratch/EToolKitNew/Cactus',
 'SIMULATION_NAME': 'insp08_final_small_2', 'SHORT_SIMULATION_NAME': 'insp08_final_sm',
 'SIMULATION_ID': 'simulation-insp08_final_small_2-thornyflat-tf.hpc.wvu.edu-mtc00017-2023.05.07-21.19.25-2587',
 'RESTART_ID': 0, 'RUNDIR': '/users/mtc00017/scratch/simulations/insp08_final_small_2/output-0000',
 'SCRIPTFILE': '/users/mtc00017/scratch/simulations/insp08_final_small_2/output-0000/SIMFACTORY/SubmitScript',
 'EXECUTABLE': '/users/mtc00017/scratch/simulations/insp08_final_small_2/SIMFACTORY/exe/cactus_sim',
 'PARFILE': '/users/mtc00017/scratch/simulations/insp08_final_small_2/output-0000/charged_binary_inspiral_final.rpar',
 'HOSTNAME': 'tf.hpc.wvu.edu', 'USER': 'mtc00017', 'NODES': 2, 'PROCS_REQUESTED': 80, 'PPN': 40,
 'NUM_PROCS': 80, 'NODE_PROCS': 40, 'PROCS': 80, 'NUM_THREADS': 1, 'PPN_USED': 40, 'NUM_SMT': 1,
 'MEMORY': '98304', 'CPUFREQ': '2.10', 'ALLOCATION': 'NOALLOCATION', 'QUEUE': 'comm_small_week',
 'EMAIL': 'mtc00017', 'WALLTIME': '24:00:00', 'WALLTIME_HH': '24', 'WALLTIME_MM': '00',
 'WALLTIME_SS': '00', 'WALLTIME_SECONDS': 86400, 'WALLTIME_MINUTES': 1440.0, 'WALLTIME_HOURS': 24.0,
 'SIMFACTORY': '/gpfs20/scratch/mtc00017/EToolKitNew/Cactus/repos/simfactory2/bin/sim',
 'SUBMITSCRIPT': '/users/mtc00017/scratch/simulations/insp08_final_small_2/output-0000/SIMFACTORY/SubmitScript',
 'CONFIGURATION': 'sim', 'FROM_RESTART_COMMAND': '', 'CHAINED_JOB_ID': ''}
```
So the walltime was 168 hours at one point (the “No previous walltime available” message really just means that this is output-0000), but then switched to 24 hours just afterwards. Looking at the Python code in lib/simrestart.py, I cannot see how it could change to 24 hours:

```python
# import walltime if no --walltime is specified.
if existingProperties is not None and not simenv.OptionsManager.HasOption('walltime') and existingProperties.HasProperty('walltime'):
    Walltime = restartlib.WallTime(existingProperties.GetProperty("walltime"))
    self.SimulationLog.Write("Using walltime %s from previous restart %s" % (existingProperties.GetProperty("walltime"), self.MaxRestartID))
else:
    self.SimulationLog.Write("No previous walltime available to be used, using walltime %s" % Walltime.Walltime)

# [..stuff that does not touch Walltime...]

walltt = Walltime

# always restrict our walltime to maxwalltime if requested walltime
# is too large.
if MaxWalltime.walltime_seconds < Walltime.walltime_seconds:
    walltt = MaxWalltime  # okay, our walltime requested was too large

# find out if we should use automatic job chaining.
if chainedJobId is None:
    UseChaining = True
    # TODO: i don't understand the job chaining logic. a
    # restart should be presubmitted (instead of
    # submitted) if there is a restart currently running.
    # yet there is no check for this.

new_properties['WALLTIME'] = walltt.Walltime
```
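To see how a stray maxwalltime produces exactly this symptom, here is a minimal, self-contained sketch of that clamping logic. The WallTime class below is a simplified stand-in I wrote for illustration; it is not SimFactory's actual implementation, which lives in lib/restartlib.py and friends.

```python
class WallTime:
    """Parse an HH:MM:SS string into seconds for easy comparison."""
    def __init__(self, spec):
        h, m, s = (int(x) for x in spec.split(":"))
        self.walltime_seconds = h * 3600 + m * 60 + s
        self.Walltime = spec

def effective_walltime(requested, maxwalltime):
    """Return the walltime actually submitted: the requested value,
    silently capped at the machine's maxwalltime."""
    Walltime = WallTime(requested)
    MaxWalltime = WallTime(maxwalltime)
    walltt = Walltime
    # same comparison as in lib/simrestart.py: clamp if too large
    if MaxWalltime.walltime_seconds < Walltime.walltime_seconds:
        walltt = MaxWalltime
    return walltt.Walltime

# With a rogue maxwalltime of 24:00:00, a 168-hour request is clamped:
print(effective_walltime("168:00:00", "24:00:00"))  # -> 24:00:00
```

Note that the clamp happens without any log message, which is why the 168 from the command line appears in the log but the submitted job ends up with 24 hours.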
Worse, if I try to mimic thornyflat (I have no account) on my workstation using:

```ini
[thornyflat]
submit = cat @SCRIPTFILE@ >/dev/tty
basedir = /data/rhaas/simulations
sourcebasedir = /data/@USER@
envsetup = true
```
and running:
```
./simfactory/bin/sim create-submit --machine thornyflat --parfile par/tov_ET.par --cores 1 foobar
```
I do not see any such change in walltime myself.
Could you run, on thornyflat, this command, please:
```
./simfactory/bin/sim print-mdb-entry thornyflat | grep maxwalltime
```
which should output what simfactory thinks the maximum allowed walltime is (this is not written to the log; the 168 that you see comes from your command-line walltime option).
You could also add a debug statement such as

```python
print("maxwalltime is ", MaxWalltime, Walltime)
```

to the submit function in lib/simrestart.py to print out the values. Though, as said, I am not sure what is going on. -
reporter I just ran that command, and it output

```
maxwalltime = 24:00:00
```

so simfactory does think the max walltime is 24 hours. I’ll take a look at lib/simrestart.py and see if I find anything odd. -
reporter - attached log.txt
-
reporter I just attached the log file for a job that’s currently running and is at 24:43. What’s weird is that the log file still seems to suggest simfactory thinks the walltime is 24 hours.
-
reporter Sorry, I forgot to note that the only difference in this job’s parfile is that I commented out the lines having to do with TerminationTrigger.
-
reporter That’s actually the wrong job (sorry, I ran a lot of test jobs yesterday!), but the correct log file still says it thinks that walltime is 24 hours.
-
- changed status to open
-
Before digging into the Python code, maybe it would be good to check both your simfactory/etc/defs.local.ini and simfactory/mdb/machines/thornyflat.ini to see whether either of them sets a rogue maxwalltime for thornyflat. -
Maybe this shouldn’t happen silently.
-
reporter I found it! simfactory/mdb/machines/thornyflat.ini set maxwalltime = 24:00:00 for some reason. Changing that fixed it! Thank you! -
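For reference, the fix amounts to correcting the rogue line in the machine entry. A minimal illustrative fragment (the comments are mine, and 168:00:00 is the queue limit mentioned earlier in this thread):

```ini
; simfactory/mdb/machines/thornyflat.ini
[thornyflat]
; was: maxwalltime = 24:00:00  -- rogue value that silently clamped every job
maxwalltime = 168:00:00
```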
Glad to have been able to help. I will close this ticket and create a new one for Steve’s suggestion.
-
- changed status to wontfix
No bug in simfactory; somehow the ini files got changed on the system.
Not sure if this is a bug or just unexpected behaviour.
Most clusters have a maximum allowed walltime (often 24 or 48 hours). Simfactory is aware of the limit (via the maxwalltime setting in the machine’s .ini file) and will automatically split longer jobs into chained segments of at most maxwalltime length.
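As an illustration of the chaining idea described above (this is my own sketch, not SimFactory code): a requested walltime longer than maxwalltime is split into back-to-back segments of at most maxwalltime each.

```python
def chain_segments(total_hours, max_hours):
    """Split a requested walltime into chained segment lengths,
    each at most max_hours long (illustrative only)."""
    segments = []
    remaining = total_hours
    while remaining > 0:
        seg = min(remaining, max_hours)
        segments.append(seg)
        remaining -= seg
    return segments

# A 168-hour request on a 24-hour-limit machine becomes seven segments:
print(chain_segments(168, 24))  # -> [24, 24, 24, 24, 24, 24, 24]
```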
You seem to be on the thornyflat cluster, where maxwalltime is 168 hours (see https://bitbucket.org/simfactory/simfactory2/src/master/mdb/machines/thornyflat.ini#lines-57), so that should actually not limit the walltime to 24 hours.
Second, simfactory contains hard-coded, historic, unfortunate code that resets the right-hand side of the `TerminationTrigger::max_walltime` parameter to the walltime that simfactory was given in its submit command (or the default walltime, which I do not recall off the top of my head). As far as I know this cannot be turned off (though I may well be wrong).
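To make that effect concrete, here is a hedged sketch of such a parameter rewrite. The parameter name TerminationTrigger::max_walltime is real, but the helper function below is purely illustrative and is not SimFactory's actual code.

```python
import re

def override_max_walltime(parfile_text, walltime_hours):
    """Rewrite the right-hand side of TerminationTrigger::max_walltime
    to the submit-time walltime (in hours). Illustrative sketch only."""
    pattern = re.compile(
        r"(TerminationTrigger::max_walltime\s*=\s*)\S+", re.IGNORECASE)
    return pattern.sub(r"\g<1>%s" % walltime_hours, parfile_text)

par = "TerminationTrigger::max_walltime = 12.0"
# Whatever the parfile said, the submitted value wins:
print(override_max_walltime(par, 24.0))  # -> TerminationTrigger::max_walltime = 24.0
```

This would explain why a hardcoded max_walltime in the parfile never survives into the simulation directory.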
Could you provide the exact simfactory command that you entered, please?