Problem with phoSim crash when using checkpointing

Issue #11 resolved

Thomas Glanzman created an issue 2016-07-22

Version: 3.5.2

Host: rhel6-64

I put the following line into the command/physics override file for a working phoSim config:

checkpointtotal 10

The job ran for a while -- and even produced two checkpoint files:

lsst_e_200_f2_R22_S11_E000_ckptfp_0.fits.gz
lsst_e_200_f2_R22_S11_E000_ckptdt_0.fits.gz

But then phoSim crashed with this complaint (no core dump created):

[...] Number of Sources: 87874 Photons: 7.04e+11 Flux: 4.53e-22 ergs/cm2/s

Photon Raytrace commit none

Type Sources Photons (Sat,Rem,Rej,Acc)% Time (s) Photons/s

Electron to ADC Image Converter

FITSIO status = 104: could not open the named file
failed to find or open the following file: (ffopen)
lsst_e_200_f2_R22_S11_E000.fits.gz
terminate called after throwing an instance of 'std::runtime_error'
  what(): FitsImage::FitsImage: cfitsio error
/bin/sh: line 1: 11413 Aborted /nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.5.2/bin/e2adc < e2adc_200_R22_S11_E000.pars
Process Process-1:
Traceback (most recent call last):
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/anaconda/2.3.0/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/anaconda/2.3.0/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.5.2/phosim.py", line 45, in jobChip
    runProgram("e2adc < e2adc"+fid+".pars", binDir)
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.5.2/phosim.py", line 64, in runProgram
    raise RuntimeError("Error running %s" % myCommand)
RuntimeError: Error running /nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.5.2/bin/e2adc < e2adc_200_R22_S11_E000.pars

real 6m6.563s
user 5m23.264s
sys 0m16.160s

===========================

Indeed, the target of the FITSIO complaint does not exist, but two others with similar names do exist. This behavior is reproducible.

Comments (29)

John Peterson
Tom, can you send the command file and what you typed on the command line so we can reproduce it?
- 2016-07-26T15:36:03+00:00
Thomas Glanzman reporter
The command used was:

time /nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.5.2/phosim.py ./phosim_input_200.txt -c twinkles_I_physics_override.txt -s R22_S11 --sed /nfs/farm/g/lsst/u/dragon/phoSimCPtest/SEDs -w /nfs/farm/g/lsst/u/dragon/phoSimCPtest/work -o /nfs/farm/g/lsst/u/dragon/phoSimCPtest/output

This was executed from the SLAC directory /nfs/farm/g/lsst/u/dragon/phoSimCPtest, an area which you may browse for other inputs (instance catalog and SED data). The content of the command file, twinkles_I_physics_override.txt, follows. Note that this is the same command file used for Twinkles Run 1, with the addition of the checkpointtotal command.

# Turn on debugging file

centroidfile 1

# Does this turn off treerings too?

cleardefects

# Also turn off clouds and airglow variation

clearclouds

airglowvariation 0

# Set the nominal dark sky brightness

zenith_v 21.8

# Leave on CRs but turn off fringing. ISR will take care of fringing,

# but CRs are currently taken out in

# image characterization.

fringing 0

# Attempt to activate the non-Condor checkpointing

checkpointtotal 10
- 2016-07-29T17:20:24+00:00
John Peterson
Oh, well, you have to add:

checkpointcount 0

as well, and then do a series of runs in which it goes from 0 to 9.
- 2016-08-01T14:48:55+00:00
Thomas Glanzman reporter
Just to clarify, if a job gets killed just after, say, checkpoint "N", does the command file need to be modified to start up? In other words, does the user need to assess the status of all produced checkpoint files and then edit the command file to explicitly tell phoSim which checkpoint file to start with?
- 2016-08-01T16:59:22+00:00
John Peterson
Yes, you will need to say which one you intend to run. So the first run should have:

checkpointcount 0
checkpointtotal 10

The second run should have:

checkpointcount 1
checkpointtotal 10

The third run should have:

checkpointcount 2
checkpointtotal 10

If the third run fails, resubmit with:

checkpointcount 2
checkpointtotal 10

etc.
- 2016-08-01T17:35:37+00:00
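The per-run bookkeeping described in the comment above could be scripted roughly as follows. This is only a sketch: the helper names `checkpoint_commands` and `next_count` are hypothetical and not part of phoSim, and it assumes the command file takes one command per line, as in the override file posted earlier in this thread.

```python
def checkpoint_commands(base_commands: str, count: int, total: int) -> str:
    """Return command-file text for one checkpoint segment (hypothetical helper)."""
    # The thread states that checkpointcount runs from 0 to checkpointtotal - 1.
    if not 0 <= count < total:
        raise ValueError("checkpointcount must satisfy 0 <= count < total")
    # Drop any checkpoint lines already present so they are not duplicated.
    kept = [
        line
        for line in base_commands.splitlines()
        if not line.strip().startswith(("checkpointcount", "checkpointtotal"))
    ]
    kept.append(f"checkpointcount {count}")
    kept.append(f"checkpointtotal {total}")
    return "\n".join(kept) + "\n"


def next_count(count: int, succeeded: bool) -> int:
    """Advance after a successful segment; repeat the same segment after a failure."""
    return count + 1 if succeeded else count


# Example: build the command text for the first of ten segments.
base = "centroidfile 1\nfringing 0\ncheckpointtotal 10\n"
first = checkpoint_commands(base, 0, 10)
```

A driver script would write each returned string to a command file, run phoSim with it, and call `next_count` with the outcome to decide which segment to submit next.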
Thomas Glanzman reporter
- attached test2.log
- 2016-08-01T18:12:38+00:00
Thomas Glanzman reporter
- attached test2-noCP.log
- 2016-08-01T18:12:54+00:00
Thomas Glanzman reporter
John, thanks for that clarification.

And I have an update to this thread. Having inserted the necessary "checkpointcount 0" command, the test run was started from scratch. Unfortunately, it crashed exactly the way the first test crashed. To confirm this problem was associated with the checkpointing, the two checkpoint commands were removed from the command file and the job restarted again from scratch. This time the job continued to run beyond the point where its predecessor crashed.

I have attached the console logs for the two runs: test2.log and test2-noCP.log.
- Tom
- 2016-08-01T18:13:27+00:00
John Peterson
ok, good, then closed for now.
- 2016-08-02T18:02:13+00:00
John Peterson
- changed status to resolved
- 2016-08-02T18:02:17+00:00
karl krughoff
@johnrpeterson I'm confused. It sounds like it is still not working for @glanzman. How can this be marked resolved?
- 2016-08-02T18:17:13+00:00
Thomas Glanzman reporter
Indeed, the internal checkpointing is not working for me.
- 2016-08-02T19:18:21+00:00
John Peterson
Oh, sorry. Could you send the catalog, the command file, and the version used, so we can reproduce it? Something is strange, because I don't see any photons in the log file.
- 2016-08-02T20:05:02+00:00
Thomas Glanzman reporter
- attached phosim_input_200.txt
- 2016-08-02T20:10:30+00:00
Thomas Glanzman reporter
- attached twinkles_I_physics_override.txt
- 2016-08-02T20:10:52+00:00
Thomas Glanzman reporter
Both attached:

instanceCatalog = phosim_input_200.txt

commandFile = twinkles_I_physics_override.txt
- 2016-08-02T20:11:31+00:00
karl krughoff
@johnrpeterson Will you please reopen this issue so it doesn't get lost?
- 2016-08-02T20:31:28+00:00
John Peterson
Tom, it does look like something is fragile there involving checkpoints where there are no photons, which we will fix in a patch shortly. In the meantime, it usually seems to have no problems if you use fewer checkpoints, if you want to try that out.
- 2016-08-04T19:30:31+00:00
Thomas Glanzman reporter
Thanks John. I have restarted a new test with checkpointtotal 2. Can you suggest a safe number? I am thinking that runs of between 6 and 24 hours are pretty safe on most batch systems' hardware, which would mean up to ~40 checkpoints for some of the longest runs.

However, I do not understand your comment about no photons. My test case was the very first Twinkles visit and it should have ordered up a great many photons. What would cause a checkpoint attempt before any photons had been generated?
- 2016-08-04T23:09:44+00:00
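The estimate in the comment above (a cap of 6 to 24 hours per batch job, leading to roughly 40 checkpoints for the longest runs) is a simple ceiling division. A minimal sketch follows, assuming all segments take about the same time, which phoSim may not deliver in practice; the function name and the 240-hour figure are hypothetical illustrations, not measured values.

```python
import math


def segments_needed(estimated_hours: float, max_segment_hours: float) -> int:
    """Smallest checkpointtotal that keeps each segment within the time cap,
    assuming equal-length segments (an assumption, not phoSim behavior)."""
    if estimated_hours <= 0 or max_segment_hours <= 0:
        raise ValueError("durations must be positive")
    return math.ceil(estimated_hours / max_segment_hours)


# Hypothetical example: a 240-hour run with a 6-hour cap per batch job.
total = segments_needed(240, 6)
```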
Thomas Glanzman reporter
- changed status to open
Test with "checkpointtotal 2" causes a crash. Console log will be attached in subsequent post.
- 2016-08-05T16:25:06+00:00
Thomas Glanzman reporter
- attached test3.log
Log of phoSim v3.5.2 running twinkles visit #1 with "checkpointtotal 2" in the command file. Shows crash during checkpointing operation.
- 2016-08-05T16:27:17+00:00
John Peterson
OK, this should be fixed in v3.5.3.
- 2016-08-12T14:03:18+00:00
John Peterson
- changed status to closed
- 2016-08-12T14:03:58+00:00
karl krughoff
@johnrpeterson Wouldn't it be prudent to wait until Tom has had a chance to verify that the fix works before closing the ticket?
- 2016-08-12T16:40:38+00:00
Thomas Glanzman reporter
I have run a number of tests based on the first Twinkles visit described earlier. The tests were:

1) 10 checkpoints

2) 4 checkpoints

3) 0 checkpoints

Each test ran to completion and produced 20 files in the /output directory (18 'a' + 1 'e' + centroid). I used fdiff and tkdiff to check for file differences. All of the FITS files appear to be identical except for the additional header keywords indicating the presence of checkpointing. However, the centroid (.txt) file was completely different. All three files contained 87875 lines, but the content between checkpoint and non-checkpoint versions was very different. The non-checkpointing control contained many lines like this:

992887068677.000000 855 468.678363 2349.667836

while both of the checkpointing versions contained only lines with "0 -nan -nan", e.g.,

992887068677.000000 0 -nan -nan

For this difference, I am reopening this issue.

It may be that other optional data products are also not identical.

From an operational perspective, I was unable to retain the execution times for each checkpoint component of each test. However, there was a large variation in execution times within a single visit, ranging from 5 min to several hundred minutes. This is not a huge deal, but because each checkpoint component is a new batch job, it introduces an inefficiency: unexpectedly short jobs wait to dispatch in a long queue when a faster queue would have dispatched them much more quickly.
- 2016-08-22T20:47:04+00:00
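The centroid-file discrepancy reported above can be quantified without a visual diff tool. The sketch below counts the rows whose fields contain NaN. It assumes whitespace-separated numeric columns as in the two sample lines quoted in the comment, and it skips any line that does not parse as numbers (for example a header, if the file has one). The function name is hypothetical.

```python
import math


def summarize_centroids(lines):
    """Return (numeric_rows, rows_containing_nan) for centroid-file lines."""
    rows = 0
    nan_rows = 0
    for line in lines:
        fields = line.split()
        try:
            # float() accepts "nan" and "-nan" as well as ordinary numbers.
            values = [float(field) for field in fields]
        except ValueError:
            continue  # skip a non-numeric line such as a header
        if not values:
            continue  # skip blank lines
        rows += 1
        if any(math.isnan(value) for value in values):
            nan_rows += 1
    return rows, nan_rows


# The two sample lines quoted in the comment above.
control = ["992887068677.000000 855 468.678363 2349.667836"]
checkpointed = ["992887068677.000000 0 -nan -nan"]
control_summary = summarize_centroids(control)
checkpointed_summary = summarize_centroids(checkpointed)
```

For a real file, the same function can be applied to an open file handle, for example `summarize_centroids(open(path))`, where `path` is the centroid file in the output directory.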
Thomas Glanzman reporter
- changed status to open
Difference in data products between checkpoint and non-checkpoint jobs using phoSim 3.5.3
- 2016-08-22T20:47:50+00:00
Thomas Glanzman reporter
I would also ask whether it would be reasonable to request a "checkpoint and continue" option for this mechanism?

Such an option would have a very significant and positive impact on running large-scale production jobs. It would allow phoSim to be scheduled without the need to understand the duration of the longest running checkpoint job.

I would think such an option would be extremely easy to implement?

Thanks for this consideration, - Tom
- 2016-08-22T21:04:43+00:00
John Peterson
Closing this, as everyone is using external checkpointing mechanisms.
- 2017-07-12T13:47:03+00:00
John Peterson
- edited description
- changed status to resolved
- 2017-07-12T13:47:19+00:00
Assignee: –

Type: bug

Priority: major

Status: resolved

Votes: 0

Watchers: 1