groupBaseline - Profile memory usage

Issue #83 new
ssnn created an issue

@jqz and I have run into memory issues with groupBaseline. On my computer, groupBaseline runs out of memory with nproc>1, while with nproc=1 it finishes the analysis (and doesn't take too long). @jqz has reported similar issues on Farnam.

Comments (25)

  1. Jason Vander Heiden

    calcBaseline probably doesn't need to retain the entire input in the db slot. Just the relevant columns it uses.

  2. Jason Vander Heiden

    We should also see if we can get by with less data in each PDF. It's currently 4,000 points, I think?

  3. ssnn reporter

    Removing unneeded columns helped in findNovelAlleles. So yes, we should do this, at least for the db that is used inside the parallelization. I see calcBaseline is expected to return the full db in the db slot.

  4. Julian Zhou

    This makes sense, except that then the @db field in the returned baseline object would not contain a groupBy column, which is required by groupBaseline() and which it looks for in the input baseline object's @db field. Unless, of course, calcBaseline() takes an added groupBy argument, which makes little sense because calcBaseline() performs no grouping itself.

  5. Jason Vander Heiden

    Ah, good point. Maybe we can drop some known non-grouping fields, like SEQUENCE, GERMLINE, START, and LENGTH?

  6. Jason Vander Heiden

    Or we could require the groupBy argument to groupBaseline to be a vector of values. That seems contrary to the overall dplyr-ness of the tools though.

  7. ssnn reporter

    I would change the part that exports db to the cluster so that only the needed columns are exported. I think this would avoid copying the full db many times when requesting >1 core:

        if (nproc > 1) {        
            cluster <- parallel::makeCluster(nproc, type="PSOCK")
            parallel::clusterExport(cluster, list('db_subset',
    

    And then merge the new columns into the old db, remove db_subset, and force garbage collection.

  8. ssnn reporter

    Maybe move the cluster export part down after

        cols_observed <- grep( paste0("MU_COUNT_"),  colnames(db) ) 
        cols_expected <- grep( paste0("MU_EXPECTED_"),  colnames(db) ) 
    

    then export db[, c(cols_observed, cols_expected)]
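
    A rough sketch of that flow (the toy db, nproc, and the elided per-sequence work are all illustrative, not the real calcBaseline internals):

        # Toy stand-ins for db and nproc, just to make the sketch runnable
        db <- data.frame(SEQUENCE_ID=c("s1", "s2"),
                         MU_COUNT_SEQ_R=c(3, 7),
                         MU_EXPECTED_SEQ_R=c(2.4, 6.1))
        nproc <- 2

        cols_observed <- grep("MU_COUNT_", colnames(db))
        cols_expected <- grep("MU_EXPECTED_", colnames(db))
        db_subset <- db[, c(cols_observed, cols_expected), drop=FALSE]

        if (nproc > 1) {
            cluster <- parallel::makeCluster(nproc, type="PSOCK")
            # Export only the reduced data.frame instead of the full db
            parallel::clusterExport(cluster, "db_subset", envir=environment())
            # ... run the per-sequence work on db_subset here ...
            parallel::stopCluster(cluster)
        }

        # Merge any newly computed columns back into db, drop the copy,
        # and ask R to release the memory.
        rm(db_subset)
        invisible(gc())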

  9. Julian Zhou

    A thought: or we could advise users to supply calcBaseline with only a subset of the data.frame, including just the necessary columns.

  10. Julian Zhou

    Not that this would solve the issue, but: try stripping down your db to the bare minimum (instead of passing every single column, since the majority never gets used).

    I recently used bare-minimum datasets to do some benchmarking in terms of runtime. In doing so, my experience was that you need to allocate at least 40GB of memory for ~150,000 sequences. I also found that going a bit higher, to about ~190,000 sequences, even 60GB of memory won't do.

  11. Julian Zhou

    Usually SEQUENCE_IMGT, GERMLINE_IMGT_D_MASK, and the OBSV_MUT and EXP_MUT columns representing mutation counts (if those are not present, calcBaseline() will automatically call observedMutations() to get them), plus the groupBy and ID columns you may need later for groupBaseline() and plotting. Do double-check the names; I'm just recalling them off the top of my head.
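
    For example, a minimal sketch of trimming the input before calcBaseline; the column names here follow Change-O defaults as I recall them, and the grouping column (SAMPLE) is just an example, so check them against your own data:

        # Toy db standing in for a Change-O data.frame; JUNK_COLUMN
        # represents the many unused columns that get dropped.
        db <- data.frame(SEQUENCE_ID=c("s1", "s2"),
                         SEQUENCE_IMGT=c("NNNN...", "NNNN..."),
                         GERMLINE_IMGT_D_MASK=c("NNNN...", "NNNN..."),
                         SAMPLE=c("A", "B"),
                         JUNK_COLUMN=c(1, 2),
                         stringsAsFactors=FALSE)

        keep <- c("SEQUENCE_ID", "SEQUENCE_IMGT", "GERMLINE_IMGT_D_MASK",
                  "SAMPLE",   # whatever you will group by later
                  grep("^MU_(COUNT|EXPECTED)_", colnames(db), value=TRUE))
        db_small <- db[, intersect(keep, colnames(db)), drop=FALSE]

        # baseline <- shazam::calcBaseline(db_small, nproc=1)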

  12. ssnn reporter

    Similar to the issue of calcBaseline() exporting unneeded db columns to the cluster, groupBaseline() exports the whole baseline object. Maybe exporting only the slots that are needed would help. Also, removing temporary objects and adding garbage collection.
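
    A minimal sketch of that pattern (baseline is faked as a plain list here so the snippet runs on its own; the real object is an S4 Baseline with @db and @pdfs slots, and exactly what the workers need is a guess):

        # Stand-in for the object returned by calcBaseline
        baseline <- list(db   = data.frame(SAMPLE=c("A", "A", "B")),
                         pdfs = list(matrix(runif(3 * 4001), nrow=3)))
        nproc <- 2

        # Pull out only what the workers need, rather than the whole object
        pdfs_needed <- baseline$pdfs
        group_ids   <- baseline$db$SAMPLE

        cluster <- parallel::makeCluster(nproc, type="PSOCK")
        parallel::clusterExport(cluster, c("pdfs_needed", "group_ids"),
                                envir=environment())
        # ... combine the per-sequence pdfs within each group on the workers ...
        parallel::stopCluster(cluster)

        # Drop the temporary copies and collect
        rm(pdfs_needed, group_ids)
        invisible(gc())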

  13. Roy Jiang

    The pdf attribute is causing 99% of the memory problems with baseline. Stripping columns does very little. Using <4000 points for the calculation would probably help.

    We can either scale the number of points being calculated or recommend subsampling. And implement a memory limit of 4GB. You can use these stats as a guide.

        > dim(baseline@db)
        [1] 275766      3

        > object.size(baseline)
        17921093080 bytes

        > object.size(baseline@db)
        188201824 bytes

        > object.size(baseline@pdfs)
        17653436976 bytes
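
    A back-of-envelope check of that pdfs number (assuming double precision and 4001 points per pdf; the factor of two for storing both a CDR and an FWR pdf per sequence is my guess):

        n_seq    <- 275766
        n_points <- 4001
        n_seq * n_points * 8        # ~8.8e9 bytes for one pdf per sequence
        n_seq * n_points * 8 * 2    # ~1.765e10 bytes, essentially the pdfs size above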
    
  14. Julian Zhou

    More likely than not, that's going to be difficult. The 4001 thing, and more generally the numerical convolution part of the code, is hard-coded. I'm not even sure we have the original code that generated some of the hard-coded values saved as .RData objects. In other words, changing the pdf calculation would likely require a complete rewrite, yet at present none of us fully understands how the current code in, for example, groupBaseline() works.

  15. Jason Vander Heiden

    Paging @guryaari, is there a way to either:

    1. Save the PDF as parameters instead of values? (I think this is not possible, IIRC).
    2. Reduce the number of data points per PDF? I.e., would 1001 work as well as 4001?
  16. gur yaari

    1. If I understand you correctly, you are asking whether there is some analytical/numerical form of the posterior pdf that could be compressed by storing only the parameters describing it. If that was indeed your intention, you recall correctly: there is no easy parametrization of the posterior pdfs. A pdf that results from combining many individual pdfs will eventually look like a Gaussian, at which point the parametrization is quite trivial and combining further such pdfs is also easy. The issue is understanding how many individual sequences are enough to approximate the resulting pdf with a Gaussian; this needs to be tested for a given desired accuracy level.

    2. Changing the number of points will affect the resolution and accuracy of the selection estimates. 4001 points means the pdfs are sampled every (20+20)/4000 = 0.01; changing to 1001 points would give a resolution of 40/1000 = 0.04. That would make the pdfs more spiky and less accurate. Of course one can still do so, but at the cost of accuracy.
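
    In code, the grid implied by those numbers (assuming the pdfs are tabulated on an evenly spaced sigma grid over [-20, 20], as the arithmetic above suggests):

        sigma_4001 <- seq(-20, 20, length.out=4001)
        sigma_1001 <- seq(-20, 20, length.out=1001)
        diff(sigma_4001)[1]   # step of 0.01
        diff(sigma_1001)[1]   # step of 0.04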

  17. Jason Vander Heiden

    Thanks, @guryaari. What do you think about cutting the sigma range down to +/- 10 (from +/- 20)? Is that workable?

  18. gur yaari

    I think this is less optimal, since individual sequences with a small number of mutations tend to have non-zero pdf values out to around +/-10. In such cases, trimming the pdfs would result in a systematic bias when combining sequences.

  19. Jason Vander Heiden

    Hrm. Okay, so it sounds like our options are not fantastic. We could make the PDF resolution (length and max sigma) a user-exposed parameter, with some sensible recommendations/defaults. It looks like those are already parameters to every relevant function, so we'd just have to pass them through.

    The problem is calcBaselineBinomialPdf, which has parameters:

        x=3
        n=10
        p=0.33
    

    And uses the constants:

        CONST_I
        BAYESIAN_FITTED
    

    Not sure what any of this is or if it can/needs to be adjusted. @guryaari?

  20. Jason Vander Heiden

    Talked to @guryaari and we can probably do a normal approximation (store only sd and mean) for larger data sets. We need to test it to be sure, but we should only need the full PDFs for low numbers of sequences.

    We shouldn't have to adjust CONST_I or BAYESIAN_FITTED (he has the code to generate them, though).
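
    A rough sketch of the storage idea (the toy pdf below is arbitrary, not BASELINe output, and when the approximation is accurate enough is exactly what still needs testing):

        # Tabulated pdf on the usual sigma grid
        sigma <- seq(-20, 20, length.out=4001)
        step  <- sigma[2] - sigma[1]
        dens  <- 0.7 * dnorm(sigma, mean=1.2, sd=0.8) +
                 0.3 * dnorm(sigma, mean=1.6, sd=0.5)
        dens  <- dens / sum(dens * step)

        # Store two numbers instead of 4001 values
        m <- sum(sigma * dens * step)
        s <- sqrt(sum((sigma - m)^2 * dens * step))

        # Rebuild on demand and check how far off the approximation is
        dens_approx <- dnorm(sigma, mean=m, sd=s)
        max(abs(dens_approx - dens))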
