Problem with phoSim crash when using checkpointing
Version: 3.5.2
Host: rhel6-64
I put the following line into the command/physics override file for a working phoSim config:
checkpointtotal 10
The job ran for a while -- and even produced two checkpoint files:
lsst_e_200_f2_R22_S11_E000_ckptfp_0.fits.gz
lsst_e_200_f2_R22_S11_E000_ckptdt_0.fits.gz
But then phoSim crashed with this complaint (no core dump created):
[...] Number of Sources: 87874 Photons: 7.04e+11 Flux: 4.53e-22 ergs/cm2/s
Photon Raytrace commit none
Type Sources Photons (Sat,Rem,Rej,Acc)% Time (s) Photons/s
Electron to ADC Image Converter
FITSIO status = 104: could not open the named file
failed to find or open the following file: (ffopen)
lsst_e_200_f2_R22_S11_E000.fits.gz
terminate called after throwing an instance of 'std::runtime_error'
  what(): FitsImage::FitsImage: cfitsio error
/bin/sh: line 1: 11413 Aborted /nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.5.2/bin/e2adc < e2adc_200_R22_S11_E000.pars
Process Process-1:
Traceback (most recent call last):
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/anaconda/2.3.0/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/anaconda/2.3.0/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.5.2/phosim.py", line 45, in jobChip
    runProgram("e2adc < e2adc"+fid+".pars", binDir)
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.5.2/phosim.py", line 64, in runProgram
    raise RuntimeError("Error running %s" % myCommand)
RuntimeError: Error running /nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.5.2/bin/e2adc < e2adc_200_R22_S11_E000.pars
real 6m6.563s
user 5m23.264s
sys 0m16.160s
===========================
Indeed, the target of the FITSIO complaint does not exist, but two others with similar names do exist. This behavior is reproducible.
- Tom
Comments (29)
-
reporter The command used was:
time /nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.5.2/phosim.py ./phosim_input_200.txt -c twinkles_I_physics_override.txt -s R22_S11 --sed /nfs/farm/g/lsst/u/dragon/phoSimCPtest/SEDs -w /nfs/farm/g/lsst/u/dragon/phoSimCPtest/work -o /nfs/farm/g/lsst/u/dragon/phoSimCPtest/output
This was executed from the SLAC directory /nfs/farm/g/lsst/u/dragon/phoSimCPtest, which you may browse for other inputs (instance catalog and SED data). The content of the command file, twinkles_I_physics_override.txt, follows. Note that this is the same command file used for Twinkles Run 1, with the addition of the checkpointtotal command.
# Turn on debugging file
centroidfile 1
# Does this turn off treerings too?
cleardefects
# Also turn off clouds and airglow variation
clearclouds
airglowvariation 0
# Set the nominal dark sky brightness
zenith_v 21.8
# Leave on CRs but turn off fringing. ISR will take care of fringing,
# but CRs are currently taken out in
# image characterization.
fringing 0
# Attempt to activate the non-Condor checkpointing
checkpointtotal 10
-
oh, well you have to add:
checkpointcount 0
as well, and then do a series of runs in which checkpointcount goes from 0 to 9.
-
reporter Just to clarify, if a job gets killed just after, say, checkpoint "N", does the command file need to be modified to start up? In other words, does the user need to assess the status of all produced checkpoint files and then edit the command file to explicitly tell phoSim which checkpoint file to start with?
-
yes, you will need to say which one you intend to do. so first run should have:
checkpointcount 0
checkpointtotal 10
second run should have:
checkpointcount 1
checkpointtotal 10
third run should have:
checkpointcount 2
checkpointtotal 10
if third run fails resubmit with:
checkpointcount 2
checkpointtotal 10
etc.
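The per-run command files described above could be generated with a small script rather than edited by hand. The following is only a sketch: the function name, the `stem` argument, and the output file naming are hypothetical, and the only phoSim commands it relies on are `checkpointcount` and `checkpointtotal` as given in this thread.

```python
import os
import tempfile


def write_stage_command_files(base_commands, total, out_dir, stem="stage"):
    """Write one command file per checkpoint stage (hypothetical helper).

    Each file holds the shared base commands followed by
    'checkpointcount N' and 'checkpointtotal TOTAL', with N = 0 .. TOTAL-1.
    """
    paths = []
    for count in range(total):
        path = os.path.join(out_dir, "%s_%d.txt" % (stem, count))
        with open(path, "w") as handle:
            # Shared physics-override commands come first.
            handle.write(base_commands.rstrip("\n") + "\n")
            # Then the two checkpoint commands for this stage.
            handle.write("checkpointcount %d\n" % count)
            handle.write("checkpointtotal %d\n" % total)
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Illustrative base commands taken from the command file in this thread.
    base = "cleardefects\nclearclouds\nfringing 0\n"
    out_dir = tempfile.mkdtemp()
    for path in write_stage_command_files(base, 3, out_dir):
        print(path)
```

Each run is then started with the file for its stage; if a stage fails, it is resubmitted with the same file, since the checkpointcount does not change.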
-
reporter - attached test2.log
-
reporter - attached test2-noCP.log
-
reporter John, thanks for that clarification.
And I have an update to this thread. Having inserted the necessary "checkpointcount 0" command, the test run was started from scratch. Unfortunately, it crashed exactly the way the first test crashed. To confirm this problem was associated with the checkpointing, the two checkpoint commands were removed from the command file and the job restarted again from scratch. This time the job continued to run beyond the point where its predecessor crashed.
I have attached the console logs for the two runs: test2.log and test2-noCP.log.
- Tom
-
ok, good, then closed for now.
-
- changed status to resolved
-
@johnrpeterson I'm confused. It sounds like it is still not working for @glanzman. How can this be marked resolved?
-
reporter Indeed, the internal checkpointing is not working for me.
-
oh, sorry. could you though send the catalog and command file and version used, so we can reproduce? there is something strange because i don't see any photons in the log file.
-
reporter - attached phosim_input_200.txt
-
reporter - attached twinkles_I_physics_override.txt
-
reporter Both attached:
instanceCatalog = phosim_input_200.txt
commandFile = twinkles_I_physics_override.txt
-
@johnrpeterson Will you please reopen this issue so it doesn't get lost?
-
tom, it does look like something is fragile there involving checkpoints where there are no photons, which we will fix in a patch shortly. in the meantime, it usually seems not to have problems if you do fewer checkpoints, if you want to try that out.
-
reporter Thanks John. I have restarted a new test with checkpointtotal 2. Can you suggest a safe number? I am thinking that runs of between 6 and 24 hours are pretty safe on most batch systems' hardware, which would mean up to ~40 checkpoints for some of the longest runs.
However, I do not understand your comment about no photons. My test case was the very first Twinkles visit and it should have ordered up a great many photons. What would cause a checkpoint attempt before any photons had been generated?
-
reporter - changed status to open
Test with "checkpointtotal 2" causes a crash. Console log will be attached in subsequent post.
-
reporter - attached test3.log
Log of phoSim v3.5.2 running Twinkles visit #1 with "checkpointtotal 2" in the command file. Shows crash during checkpointing operation.
-
ok should be fixed in v3.5.3.
-
- changed status to closed
-
@johnrpeterson Wouldn't it be prudent to wait until Tom has had a chance to verify that the fix works before closing the ticket?
-
reporter I have run a number of tests based on the first Twinkles visit described earlier. The tests were:
1) 10 checkpoints
2) 4 ckpts
3) 0 ckpts
Each test ran to completion and produced 20 files in the /output directory (18 'a' + 1 'e' + centroid). I used fdiff and tkdiff to check for file differences. All of the FITS files appear to be identical except for the additional header keywords indicating the presence of checkpointing. However, the centroid (.txt) file was completely different. All three files contained 87875 lines, but the content between checkpoint and non-checkpoint versions was very different. The non-checkpointing control contained many lines like this:
992887068677.000000 855 468.678363 2349.667836
while both of the checkpointing versions contained only lines with "0 -nan -nan", e.g.,
992887068677.000000 0 -nan -nan
For this difference, I am reopening this issue.
It may be that other optional data products are also not identical.
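The centroid-file difference reported above can be quantified with a short script instead of a visual diff. This is a minimal sketch, assuming only what the example lines show: four whitespace-separated fields per data line, with the last two being numeric or "-nan". The function name is hypothetical, and no meaning is assumed for the individual columns.

```python
import math


def summarize_centroid_file(path):
    """Return (data_rows, nan_rows) for a centroid text file.

    A data row is any line with exactly four whitespace-separated fields
    whose last two fields parse as floats. A NaN row is a data row in
    which either of those two values is NaN.
    """
    data_rows = 0
    nan_rows = 0
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 4:
                continue  # skip headers or malformed lines
            try:
                x = float(fields[2])  # float() accepts "nan" and "-nan"
                y = float(fields[3])
            except ValueError:
                continue  # non-numeric fields, e.g. a header row
            data_rows += 1
            if math.isnan(x) or math.isnan(y):
                nan_rows += 1
    return data_rows, nan_rows
```

Running this on the checkpointed and non-checkpointed centroid files and comparing the two counts would show whether every row, or only some, lost its centroid values.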
From an operational perspective, I was unable to retain the execution times for each checkpoint component of each test. However, there was a large variation in execution times within a single visit, ranging from 5 min to several hundred minutes. This is not a huge deal, but given that each checkpoint component represents a new batch job, that means an inefficiency in waiting for unexpectedly short jobs to dispatch when they would have been dispatched much more quickly in a faster queue.
-
reporter - changed status to open
Difference in data products between checkpoint and non-checkpoint jobs using phoSim 3.5.3
-
reporter I would also ask whether it would be reasonable to request a "checkpoint and continue" option for this mechanism?
Such an option would have a very significant and positive impact on running large-scale production jobs. It would allow phoSim to be scheduled without the need to understand the duration of the longest running checkpoint job.
I would think such an option would be extremely easy to implement?
Thanks for this consideration, - Tom
-
closing this as everyone is using external checkpointing mechanisms.
-
- edited description
- changed status to resolved
Tom, can you send the command file and what you typed on the command line so we can reproduce it?