phoSim memory consumption

Issue #13 closed
Thomas Glanzman created an issue

During a recent ramp-up for a large phoSim production at SLAC and NERSC, I have run into a problem concerning memory usage. If I run phoSim interactively and use "top" to monitor CPU and memory usage, I see a fairly steady resident memory (RES) of ~2.3 GB consumed by raytrace and another 5-6 GB consumed by python. Because "top" only samples periodically, one is likely to miss a transient high-water mark in memory usage. The SLAC batch system, LSF, also monitors memory consumption and confirms what I observed with "top". These numbers, reported independently for each checkpoint, are tabulated for a handful of examples in this spreadsheet:

https://docs.google.com/spreadsheets/d/16Qpt-qg2HwF_zdaK1bxSil3RI7m-1u4mIPzxyBsHOZo/edit?usp=sharing

Note the column "Max mem". The maximum memory seems fairly stable between 7.5 GB and 7.9 GB, although there are a few exceptional outliers (10 GB and 19 GB). This is a concern because large compute farms are typically populated with hosts providing something in the range of 1-3 GB of memory per core. Even the "Avg mem" values are too high for NERSC, and very limiting at SLAC.

The other concern is the column labeled "Max Swap", which is consistently about 14.5 GB. My understanding of the NERSC system is that swap per se does not exist at all within the edison or cori batch environments. Clearly, even a brief excursion beyond the physical memory allowed per process -- with no swap available -- will not work.

This may be a show-stopper for running at NERSC. Even at SLAC, it means requesting additional resources, which significantly reduces the pool of available hosts upon which to run and thereby greatly increases the time required to complete a production run.
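As an aside on measurement: since "top" only samples, a simple way to capture the true high-water mark of a job is to wrap the command and read getrusage after it exits. The sketch below is illustrative only, not part of our production tooling; note that on Linux ru_maxrss is reported in kilobytes.

    # peak_rss.py -- run a command and report its peak resident set size.
    # Illustrative sketch only; on Linux, ru_maxrss is in kilobytes.
    import resource
    import subprocess
    import sys

    subprocess.call(sys.argv[1:])   # e.g. python peak_rss.py <command> [args...]
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    print("peak RSS: %.2f GB" % (peak_kb / 1024.0 / 1024.0))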

Comments (11)

  1. James Chiang

    Here are a couple of plots of the memory usage statistics for two flavors of phoSim runs on the SLAC batch queues. The first is for a set of runs for Twinkles Run1.1 (attached: Twinkles-phoSim-352_lsf_mem_usage.png). These are histograms of the memory usage reported by LSF and extracted from the run log files. The Twinkles runs are single sensor (R22_S11), and the input instance catalogs were created via catsim assuming a 0.3 deg acceptance cone on the sky centered on the pointing location for our selected DDF. They have about 320k objects and an on-disk footprint of ~64 MB uncompressed.

    The memory usage for the PhoSim-Deep feasibility studies is typically significantly larger (attached: PhoSim-deep-pre3_lsf_mem_usage.png). For these runs, the instance catalogs are selected to cover the full focal plane and so have ~22M objects and an on-disk size of 4.4 GB uncompressed. (Note that the logs for the PhoSim-Deep runs were created while running with internal checkpointing, so the memory sizes reported by LSF would differ somewhat from what a non-checkpointed job would produce.)

    Given these numbers, I suspect that much of the excess memory usage in the PhoSim-Deep runs relative to the Twinkles runs arises from reading the entire input instance catalog into memory in phosim.py here.
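    To illustrate the difference (a sketch only, not the actual phosim.py code): holding the whole catalog in a list keeps the ~4.4 GB resident for the life of the object, whereas streaming it keeps only one line in memory at a time.

        # Illustrative contrast only -- not the actual phosim.py code.
        def load_catalog_all(path):
            with open(path) as f:
                return f.readlines()   # entire ~4.4 GB catalog stays allocated

        def stream_catalog(path):
            with open(path) as f:
                for line in f:         # one line resident at a time
                    yield line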

  2. John Peterson

    Tom & Jim-

    I think there is something wrong with how you've been running it at SLAC. So yes, it is true that phosim typically uses about 2 Gbytes per chip. That memory is carefully allocated and studied to have all the accuracy needed to represent the physics. So that is unavoidable, but it should generally be fine for any HPC system we've ever run on.

    However, the 7 Gbytes for the python script should be irrelevant. When we use condor scripts, or if you use the phosim.py implementation of an alternative scripting environment, the python script runs as a single job but spawns 378 raytrace jobs and 42 trim jobs for a fully populated LSST focal plane. However, those 378 raytrace and 42 trim jobs should not be picking up the 7 Gbytes of RAM at all, and that 7 Gbyte python job is basically a negligible, few-second job. So this seems like an implementation problem for your specific pipeline. Could you email me your SLAC scripts or write out how you have been running it (it doesn't have to go in this ticket)? I'll be able to help you decouple these jobs and show how to run it as it is normally run.

    john

  3. James Chiang

    John,

    Did you have a look at this line? That line reads the entire 4.4 GB instance catalog for the full focal plane runs into memory, and since self.userCatalog is an attribute of the PhosimFocalplane class until the end of the phosim.py script, that memory remains allocated. I don't see anywhere in phosim.py where that memory might be recovered before it goes out of scope at the end of the execution.
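    A minimal sketch of the kind of fix I have in mind (class and method names here are illustrative, not the actual phosim.py code): drop the reference once the catalog has been consumed, so Python can reclaim the memory before the script exits.

        # Dropping the reference lets the allocation be freed before exit.
        class FocalplaneSketch(object):
            def __init__(self, catalog_path):
                with open(catalog_path) as f:
                    self.userCatalog = f.readlines()   # large allocation

            def done_with_catalog(self):
                # ... last use of self.userCatalog ...
                self.userCatalog = None   # allow the ~4.4 GB to be reclaimed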

  4. Thomas Glanzman reporter

    And just to clarify how phoSim is being used, one phoSim instance explicitly simulates a single sensor, e.g., -s R01_S00, in order to work efficiently with LSF (SLAC) and SLURM (NERSC). This means running phosim.py separately for each job, which is why the total memory footprint is important.
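    Schematically, the submission loop looks something like this (the sensor list and bsub invocation here are illustrative, not our actual production scripts):

        # Submit one single-sensor phosim.py job per sensor via LSF.
        import subprocess

        sensors = ["R01_S00", "R01_S01", "R22_S11"]   # hypothetical subset
        for sensor in sensors:
            subprocess.check_call(
                ["bsub", "phosim.py", "instance_catalog.txt", "-s", sensor])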

  5. John Peterson

    yes, that is what i mean, tom. and yes, jim, that is why the memory is an issue in this situation. but the -s option is not really supposed to be used for large scale computing (it is just an option for saying i only want this one chip), and it results in this memory problem, non-parallelization of the exposures, and other problems. we can handle this discussion offline, as i'd like to know why you are using -s at all.

  6. karl krughoff

    As an interested bystander, I'd much prefer that the conversation be held here. I am interested both in the resolution of this issue and the process for resolving it.

  7. Thomas Glanzman reporter

    Answering a question from John:

    "do the phosim instance catalogs you have been using use the 'includeobj' line in them? or is the whole catalog contained in one file." -- john

    John, the instance catalog does not contain any "includeobj" directives. However, having only recently learned of this option, I see it as being of future interest for reducing the total disk footprint associated with large productions. How does the use of this directive affect memory consumption?

    Regarding the "-s" option, I was not aware its use was discouraged for large-scale production. Given that simulating even a single sensor can run out the clock on the longest batch queues, are there other ways to break up the processing into manageable chunks (without requiring the use of condor)?

    Thanks, - Tom

  8. John Peterson

    so first, if you were using the includeobj commands (which would be standard for large catalogs), then there probably would be no memory problem.
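    for reference, a catalog split with includeobj looks roughly like this (the header keywords and file names here are just illustrative):

        rightascension 53.0091
        declination    -27.4389
        includeobj star_objects.txt.gz
        includeobj galaxy_objects.txt.gz

    the header stays tiny, and the object lists live in separate (possibly gzipped) files that can be trimmed per chip rather than held in memory all at once.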

    but yes, instead of the -s option you should be either doing:

    1) best idea: simply run phosim.py with the -g condor option. then write some code to convert the condor submit files into the batch submission commands that you need (e.g. slurm) and submit them. this should be very easy: the condor scripts are small and tell you what executable to run, what input files are needed, etc. this will be minimal work; you just have to translate between submission languages (see the sketch after this list).

    2) second best idea: use the -g cluster option with phosim.py and then write the python modules for "script_writer" and "submitter" appropriate for your batch submission commands. this isn't as good as #1 because it will only parallelize the raytrace jobs and not the trim jobs, which can take a lot of CPU time themselves. it will also mean that the submit node needs to do some non-trivial computation (several minutes per focal plane), so this will be a problem with large data challenges. so option #1 is better.
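    a rough sketch of the translation in #1 (the field names follow generic condor submit syntax, and the sbatch options are placeholders; the phosim-generated files may carry more fields):

        # Sketch: parse a condor submit file and emit a SLURM batch script.
        def condor_to_sbatch(submit_path, sbatch_path):
            fields = {}
            with open(submit_path) as f:
                for line in f:
                    if "=" in line:
                        key, _, value = line.partition("=")
                        fields[key.strip().lower()] = value.strip()
            with open(sbatch_path, "w") as out:
                out.write("#!/bin/bash\n")
                out.write("#SBATCH --time=24:00:00\n")   # placeholder limit
                out.write("%s %s\n" % (fields.get("executable", ""),
                                       fields.get("arguments", "")))

    each generated script would then be handed to sbatch.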

    consequently, i'd really recommend #1 and it would leverage our ~10 years of testing on condor-based grids as well.

    the problems i can think of with a -s looping option are basically:

    1) the run times are doubled because the two exposures are not parallelized

    2) the memory issue discussed in this thread

    3) the entire catalog (~4 Gbytes) is passed around the cluster constantly, which will cause I/O problems. if you do option #1 or #2, the catalog will be trimmed down to only 20 Mbytes or so

    4) probably others...

    so -s is really just for when you give phosim a giant catalog and want to tell phosim to do only this particular chip. we never intended it to be used as a "looping option".

    john

  9. John Peterson

    glenn's converter script makes it possible to loop phosim on large-scale computing systems other than condor, so this is no longer an issue.

  10. Thomas Glanzman reporter

    John, yes, the ability to trim the instance catalog is a big win, and the technology in Glenn's script has shown us what needs to be done to break the processing into manageable pieces. The remaining issue with "phosim.py -g" is that, without condor present, the script crashes. It is not prudent to accept the crash blindly, as it may or may not be due to the inability to find the condor submit mechanism. Would it be reasonable to add a "--nosubmit" option to phosim.py so that, once the condor submit files are created, there is a normal exit?
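    Something along these lines is what I have in mind (a sketch only, assuming an argparse-style interface; phosim.py's actual option handling may differ):

        import argparse

        parser = argparse.ArgumentParser()
        parser.add_argument("--nosubmit", action="store_true",
                            help="write the condor submit files, then exit "
                                 "without invoking condor_submit")
        args, extras = parser.parse_known_args()

        # ... after the submit files have been written ...
        if args.nosubmit:
            raise SystemExit(0)   # clean exit instead of crashing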

    Thanks, - Tom

  11. John Peterson

    ok, closing now. and yes, we will add the --nosubmit option when we integrate glenn's script into the phosim package.
