Problem with phoSim crash when using checkpointing

Issue #11 resolved
Thomas Glanzman created an issue

Version: 3.5.2

Host: rhel6-64

I put the following line into the command/physics override file for a working phoSim config:

checkpointtotal 10

The job ran for a while -- and even produced two checkpoint files:

lsst_e_200_f2_R22_S11_E000_ckptfp_0.fits.gz lsst_e_200_f2_R22_S11_E000_ckptdt_0.fits.gz

But then phoSim crashed with this complaint (no core dump created):

[...] Number of Sources: 87874 Photons: 7.04e+11 Flux: 4.53e-22 ergs/cm2/s


Photon Raytrace commit none



Type Sources Photons (Sat,Rem,Rej,Acc)% Time (s) Photons/s


Electron to ADC Image Converter

FITSIO status = 104: could not open the named file
failed to find or open the following file: (ffopen)
lsst_e_200_f2_R22_S11_E000.fits.gz
terminate called after throwing an instance of 'std::runtime_error'
  what(): FitsImage::FitsImage: cfitsio error
/bin/sh: line 1: 11413 Aborted /nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.5.2/bin/e2adc < e2adc_200_R22_S11_E000.pars
Process Process-1:
Traceback (most recent call last):
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/anaconda/2.3.0/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/anaconda/2.3.0/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.5.2/phosim.py", line 45, in jobChip
    runProgram("e2adc < e2adc"+fid+".pars", binDir)
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.5.2/phosim.py", line 64, in runProgram
    raise RuntimeError("Error running %s" % myCommand)
RuntimeError: Error running /nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.5.2/bin/e2adc < e2adc_200_R22_S11_E000.pars

real 6m6.563s user 5m23.264s sys 0m16.160s

===========================

Indeed, the target of the FITSIO complaint does not exist, but two others with similar names do exist. This behavior is reproducible.
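
A quick check of which files exist can be sketched in Python as below. This is only an illustration: the function name split_outputs is hypothetical, and the pattern is inferred from the two checkpoint file names quoted above, not from phoSim documentation.

```python
import re

# Pattern inferred from the two file names in this report ("fp" and "dt"
# are the only suffix kinds seen here; others may exist).
CKPT_RE = re.compile(r"^(?P<base>.+)_ckpt(?P<kind>fp|dt)_(?P<index>\d+)\.fits\.gz$")

def split_outputs(filenames, expected):
    """Return (expected_present, checkpoints) for a list of file names.

    checkpoints maps each checkpoint index to the sorted suffix kinds found
    for it, restricted to files sharing the expected output's base name.
    Assumes `expected` ends in ".fits.gz".
    """
    base = expected[: -len(".fits.gz")]
    checkpoints = {}
    for name in filenames:
        m = CKPT_RE.match(name)
        if m and m.group("base") == base:
            checkpoints.setdefault(int(m.group("index")), []).append(m.group("kind"))
    return expected in filenames, {i: sorted(k) for i, k in checkpoints.items()}

names = [
    "lsst_e_200_f2_R22_S11_E000_ckptfp_0.fits.gz",
    "lsst_e_200_f2_R22_S11_E000_ckptdt_0.fits.gz",
]
present, ckpts = split_outputs(names, "lsst_e_200_f2_R22_S11_E000.fits.gz")
print(present, ckpts)
```

For the listing in this report, the expected file is reported as absent while checkpoint index 0 has both a "dt" and an "fp" file.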

  • Tom

Comments (29)

  1. John Peterson

    Tom, can you send the command file and what you typed on the command line so we can reproduce it?

  2. Thomas Glanzman reporter

    The command used was:

    time /nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.5.2/phosim.py ./phosim_input_200.txt -c twinkles_I_physics_override.txt -s R22_S11 --sed /nfs/farm/g/lsst/u/dragon/phoSimCPtest/SEDs -w /nfs/farm/g/lsst/u/dragon/phoSimCPtest/work -o /nfs/farm/g/lsst/u/dragon/phoSimCPtest/output

    It was executed from the SLAC directory /nfs/farm/g/lsst/u/dragon/phoSimCPtest, an area you may browse for other inputs (instance catalog and SED data). The content of the command file, twinkles_I_physics_override.txt, follows. Note that this is the same command file used for Twinkles Run 1, with the addition of the checkpointtotal command.

    # Turn on debugging file

    centroidfile 1

    # Does this turn off treerings too?

    cleardefects

    # Also turn off clouds and airglow variation

    clearclouds

    airglowvariation 0

    # Set the nominal dark sky brightness

    zenith_v 21.8

    # Leave on CRs but turn off fringing. ISR will take care of fringing,

    # but CRs are currently taken out in

    # image characterization.

    fringing 0

    # Attempt to activate the non-Condor checkpointing

    checkpointtotal 10

  3. John Peterson

    Oh, well, you have to add:

    checkpointcount 0

    as well, and then do a series of runs in which checkpointcount goes from 0 to 9.

  4. Thomas Glanzman reporter

    Just to clarify, if a job gets killed just after, say, checkpoint "N", does the command file need to be modified to start up? In other words, does the user need to assess the status of all produced checkpoint files and then edit the command file to explicitly tell phoSim which checkpoint file to start with?

  5. John Peterson

    Yes, you will need to say which one you intend to do. So the first run should have:

    checkpointcount 0
    checkpointtotal 10

    The second run should have:

    checkpointcount 1
    checkpointtotal 10

    The third run should have:

    checkpointcount 2
    checkpointtotal 10

    If the third run fails, resubmit with:

    checkpointcount 2
    checkpointtotal 10

    etc.
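
The run sequence described in this comment could be scripted along the following lines. This is a minimal Python sketch, not part of phoSim: the helper name with_checkpoint is hypothetical, and it assumes only what the thread states, namely that each run's command file carries checkpointcount N (from 0 to total-1) together with checkpointtotal.

```python
def with_checkpoint(base_text, count, total):
    """Return command-file text with the two checkpoint commands set.

    Any checkpointcount/checkpointtotal lines already present are dropped,
    then the requested pair is appended at the end.
    """
    if not 0 <= count < total:
        raise ValueError("checkpointcount must be in the range 0..total-1")
    kept = [
        line for line in base_text.splitlines()
        if line.split()[:1] not in (["checkpointcount"], ["checkpointtotal"])
    ]
    kept += ["checkpointcount %d" % count, "checkpointtotal %d" % total]
    return "\n".join(kept) + "\n"

# Shortened stand-in for the real command file quoted earlier in the thread.
base = "centroidfile 1\nfringing 0\ncheckpointtotal 10\n"
for step in range(3):
    print(with_checkpoint(base, step, 10))
```

Resubmitting a failed step then amounts to regenerating the file with the same count, as in the "third run fails" case above.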

  6. Thomas Glanzman reporter

    John, thanks for that clarification.

    And I have an update to this thread. Having inserted the necessary "checkpointcount 0" command, the test run was started from scratch. Unfortunately, it crashed exactly the way the first test crashed. To confirm this problem was associated with the checkpointing, the two checkpoint commands were removed from the command file and the job restarted again from scratch. This time the job continued to run beyond the point where its predecessor crashed.

    I have attached the console logs for the two runs: test2.log and test2-noCP.log.

    • Tom
  7. karl krughoff

    @johnrpeterson I'm confused. It sounds like it is still not working for @glanzman. How can this be marked resolved?

  8. John Peterson

    Oh, sorry. Could you send the catalog, the command file, and the version used, so we can reproduce it? Something is strange, because I don't see any photons in the log file.

  9. Thomas Glanzman reporter

    Both attached:

    instanceCatalog = phosim_input_200.txt

    commandFile = twinkles_I_physics_override.txt

  10. John Peterson

    Tom, it does look like something is fragile there involving checkpoints where there are no photons, which we will fix in a patch shortly. In the meantime, it usually seems not to have problems if you use fewer checkpoints, if you want to try that out.

  11. Thomas Glanzman reporter

    Thanks John. I have restarted a new test with checkpointtotal 2. Can you suggest a safe number? I am thinking that runs of between 6 and 24 hours are pretty safe on most batch systems' hardware, which would mean up to ~40 checkpoints for some of the longest runs.

    However, I do not understand your comment about no photons. My test case was the very first Twinkles visit and it should have ordered up a great many photons. What would cause a checkpoint attempt before any photons had been generated?

  12. Thomas Glanzman reporter
    • changed status to open

    Test with "checkpointtotal 2" causes a crash. Console log will be attached in subsequent post.

  13. Thomas Glanzman reporter

    Log of phoSim v3.5.2 running twinkles visit #1 with "checkpointtotal 2" in the command file. Shows crash during checkpointing operation.

  14. karl krughoff

    @johnrpeterson Wouldn't it be prudent to wait until Tom has had a chance to verify that the fix works before closing the ticket?

  15. Thomas Glanzman reporter

    I have run a number of tests based on the first Twinkles visit described earlier. The tests were:

    1) 10 checkpoints

    2) 4 checkpoints

    3) 0 checkpoints

    Each test ran to completion and produced 20 files in the /output directory (18 'a' + 1 'e' + centroid). I used fdiff and tkdiff to check for file differences. All of the FITS files appear to be identical except for the additional header keywords indicating the presence of checkpointing. However, the centroid (.txt) file was completely different. All three files contained 87875 lines, but the content between checkpoint and non-checkpoint versions was very different. The non-checkpointing control contained many lines like this:

    992887068677.000000 855 468.678363 2349.667836

    while both of the checkpointing versions contained only lines with "0 -nan -nan", e.g.,

    992887068677.000000 0 -nan -nan

    For this difference, I am reopening this issue.

    It may be that other optional data products are also not identical.
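
The centroid-file difference described above can be quantified with a short Python sketch like the one below. It is an illustration only: the column names (source_id, count, x, y) are guesses, since the thread does not define the columns, and lines that do not parse as four such fields (for example a header line, if the file has one) are counted as skipped.

```python
import math

def parse_centroid_line(line):
    # Field names are guesses; the thread does not define the columns.
    source_id, count, x, y = line.split()
    return source_id, int(count), float(x), float(y)

def summarize(lines):
    """Count parsed rows, rows with a NaN coordinate, and unparsable lines."""
    rows, skipped = [], 0
    for line in lines:
        if not line.strip():
            continue
        try:
            rows.append(parse_centroid_line(line))
        except ValueError:
            skipped += 1
    nan_rows = sum(1 for _, _, x, y in rows if math.isnan(x) or math.isnan(y))
    return {"rows": len(rows), "nan_rows": nan_rows, "skipped": skipped}

# The two example lines quoted in this comment.
control = ["992887068677.000000 855 468.678363 2349.667836"]
checkpointed = ["992887068677.000000 0 -nan -nan"]
print(summarize(control))
print(summarize(checkpointed))
```

Running summarize over each full centroid file would show what fraction of rows lost their centroid values in the checkpointed runs.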


    From an operational perspective, I was unable to retain the execution times for each checkpoint component of each test. However, there was a large variation in execution times within a single visit, ranging from 5 min to several hundred minutes. This is not a huge deal, but because each checkpoint component is a separate batch job, unexpectedly short jobs end up waiting to dispatch when a faster queue would have dispatched them much more quickly.

  16. Thomas Glanzman reporter
    • changed status to open

    Difference in data products between checkpoint and non-checkpoint jobs using phoSim 3.5.3

  17. Thomas Glanzman reporter

    I would also ask whether it would be reasonable to request a "checkpoint and continue" option for this mechanism?

    Such an option would have a very significant and positive impact on running large-scale production jobs. It would allow phoSim to be scheduled without the need to understand the duration of the longest running checkpoint job.

    I would think such an option would be extremely easy to implement?

    Thanks for this consideration, - Tom
