groupBaseline - Profile memory usage
@jqz and I have run into memory issues with groupBaseline. On my computer, with nproc > 1, groupBaseline ran out of memory, while with nproc=1 it finished the analysis (and it didn't take too long). @jqz has reported similar issues on Farnam.
Comments (25)
-
We should also see if we can get by with less data in each PDF. It's currently 4,000 points, I think?
-
reporter Removing unneeded columns helped in findNovelAlleles. So yes, we should do this, at least for the db that is used inside the parallelization. I see calcBaseline is expected to return the full db in the db slot.
-
This makes sense, except that then the @db field in the returned baseline object would not contain a groupBy column, which is required by groupBaseline(): it looks for that column in the input baseline object's @db field. Unless, of course, calcBaseline() takes an added groupBy argument, which makes little sense because calcBaseline() performs no grouping itself.
-
Ah, good point. Maybe we can drop some known non-grouping fields like the SEQUENCE, GERMLINE, START, and LENGTH fields?
-
Or we could require the groupBy argument to groupBaseline to be a vector of values. That seems contrary to the overall dplyr-ness of the tools, though.
-
reporter I would change the part that exports db to the cluster to include only the needed columns. I think this would avoid copying the db many times when requesting >1 core
if (nproc > 1) {
    cluster <- parallel::makeCluster(nproc, type="PSOCK")
    parallel::clusterExport(cluster, list('db_subset',
And then merge the new columns into the old db, remove db_subset, and force garbage collection
-
reporter Maybe move the cluster export part down after

    cols_observed <- grep(paste0("MU_COUNT_"), colnames(db))
    cols_expected <- grep(paste0("MU_EXPECTED_"), colnames(db))

then export db[, c(cols_observed, cols_expected)]
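Putting those two suggestions together, a rough sketch of the export/merge/cleanup flow (a toy data.frame stands in for the real db; the per-row computation and the MU_RATIO_CDR column are purely illustrative, not shazam internals):

```r
library(parallel)

# Toy stand-in for the real db; only the MU_COUNT_* / MU_EXPECTED_* columns
# are needed by the parallel step, so only those get exported.
db <- data.frame(SEQUENCE_ID = c("s1", "s2", "s3"),
                 SEQUENCE_IMGT = c("ACGT", "ACGA", "ACGG"),
                 MU_COUNT_CDR = c(2, 0, 1),
                 MU_EXPECTED_CDR = c(2.0, 0.8, 1.1))

nproc <- 2
cols_observed <- grep("MU_COUNT_", colnames(db))
cols_expected <- grep("MU_EXPECTED_", colnames(db))
db_subset <- db[, c(cols_observed, cols_expected), drop = FALSE]

cluster <- makeCluster(nproc, type = "PSOCK")
clusterExport(cluster, "db_subset", envir = environment())
# Illustrative per-row work that touches only the small subset:
ratio <- parSapply(cluster, seq_len(nrow(db_subset)), function(i) {
    db_subset$MU_COUNT_CDR[i] / db_subset$MU_EXPECTED_CDR[i]
})
stopCluster(cluster)

# Merge the new column back into the old db, drop the copy, and force
# garbage collection, as suggested above.
db$MU_RATIO_CDR <- ratio
rm(db_subset)
invisible(gc())
```

This keeps the per-worker copy down to the mutation-count columns instead of the whole db.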
-
A thought: or we could advise users to supply to calcBaseline only a subset of the data.frame, including only the necessary columns.
-
Also having the same problems with memory.
-
Not that this would solve the issue: try stripping down your db to the bare minimum, instead of passing every single column, since the majority never gets used.
I recently used bare-minimum datasets to do some benchmarking in terms of runtime. In doing so, my experience was that you need to allocate at least 40GB of memory for ~150,000 sequences. I also found that going a bit higher, with about ~190,000 sequences, even 60GB of memory won't do.
-
Which columns constitute a bare-minimum db?
I support implementing that idea.
-
Usually the SEQUENCE_IMGT and GERMLINE_IMGT_D_MASK columns, plus the OBSV_MUT and EXP_MUT columns representing mutation counts (if not present, calcBaseline() will automatically call observedMutations to get these). Also keep the $groupBy and $ID columns you may need later for groupBaseline() and plotting. Do double-check the names; I'm just recalling off the top of my head.
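A concrete sketch of trimming to that bare minimum before calling calcBaseline() (column names follow the list recalled above plus a generic groupBy column; double-check them against the current shazam docs):

```r
# Columns suggested as the bare minimum, plus groupBy/ID columns needed
# later by groupBaseline() and plotting. All names are illustrative.
needed <- c("SEQUENCE_ID", "SEQUENCE_IMGT", "GERMLINE_IMGT_D_MASK",
            "OBSV_MUT", "EXP_MUT", "SAMPLE")

# Toy db with extra baggage columns standing in for a full repertoire table:
db <- data.frame(SEQUENCE_ID = "s1", SEQUENCE_IMGT = "ACGT",
                 GERMLINE_IMGT_D_MASK = "ACGT", SAMPLE = "A",
                 JUNCTION = "TGTGC", EXTRA_ANNOTATION = "x",
                 stringsAsFactors = FALSE)

# Keep only the needed columns that are actually present.
db_small <- db[, intersect(needed, colnames(db)), drop = FALSE]
# baseline <- shazam::calcBaseline(db_small, ...)   # actual call, not run here
```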
-
reporter Similar to the issue of calcBaseline() exporting unneeded db columns to the cluster, groupBaseline() exports the whole baseline object. Maybe exporting only the slots that are needed would help. Also, removing objects and adding garbage collection.
-
reporter I made some changes in groupBaseline(). Let me know if these help.
-
The pdf attribute is causing 99% of the memory problems with baseline. Stripping columns does very little. Using <4000 points for the calculation would probably help.
We can either scale down the number of points being calculated or recommend subsampling, and perhaps implement a memory limit of 4GB. You can use these stats as a guide.
> dim(baseline@db)
[1] 275766 3
> object.size(baseline)
17921093080 bytes
> object.size(baseline@db)
188201824 bytes
> object.size(baseline@pdfs)
17653436976 bytes
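A generic way to find the dominant slot of an S4 object like this (toy class mimicking the numbers above, where @pdfs dwarfs @db; not the real Baseline class):

```r
library(methods)

# Toy S4 class: a small db slot and a large pdfs slot.
setClass("ToyBaseline", representation(db = "data.frame", pdfs = "matrix"))
obj <- new("ToyBaseline",
           db = data.frame(x = 1:100),
           pdfs = matrix(0, nrow = 1000, ncol = 4001))

# Size of each slot in bytes, largest first.
slot_sizes <- sapply(slotNames(obj),
                     function(s) as.numeric(object.size(slot(obj, s))))
biggest <- names(sort(slot_sizes, decreasing = TRUE))[1]
```

Running this kind of per-slot accounting on a real baseline object is how you confirm the pdfs slot is the one worth shrinking.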
-
More likely than not, that's going to be difficult. The 4001 thing, and more generally the numerical convolution part of the code, is hard-coded. I'm not even sure we have the original code that generated some of the hard-coded values saved as .RData objects. In other words, changing the pdf calculation would likely require a complete rewrite, yet at present none of us fully understands how the current code in, for example, groupBaseline() works.
-
Paging @guryaari, is there a way to either:
- Save the PDF as parameters instead of values? (I think this is not possible, IIRC.)
- Reduce the number of data points per PDF? I.e., would 1001 work as well as 4001?
-
If I understand you right, you ask if there is some analytical/numerical form of the posterior pdf that can be compressed by storing only the parameters describing it. If indeed that was your intention, you recall correctly: there is no easy parametrization of the posterior pdfs. A pdf that results from combining many individual pdfs will eventually look like a Gaussian, and then the parametrization is quite trivial; combining further such pdfs is also easy. The issue is to understand how many individual sequences are enough to approximate the resulting pdf with a Gaussian. This needs to be tested for a given desired accuracy level.
-
Changing the number of points will affect the resolution and accuracy of the selection estimates. 4001 points means the pdfs are sampled every (20+20)/4000 = 0.01. Changing it to 1001 will result in a resolution of 40/1000 = 0.04. It will make the pdfs more spiky and less accurate. Of course one can still do so, but at the cost of accuracy.
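The resolution arithmetic above, spelled out (range of ±20 and point counts as stated):

```r
sigma_max  <- 20                             # pdfs span [-20, 20]
res_fine   <- (2 * sigma_max) / (4001 - 1)   # 0.01 with 4001 points
res_coarse <- (2 * sigma_max) / (1001 - 1)   # 0.04 with 1001 points
grid <- seq(-sigma_max, sigma_max, by = res_fine)
length(grid)                                 # 4001
```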
-
Thanks, @guryaari. What do you think about cutting the sigma range down to +/- 10 (from +/- 20)? Is that workable?
-
I think this is less optimal, since individual sequences with a small number of mutations tend to have non-zero pdf values out to ±10. In such cases, trimming the pdfs will result in a systematic bias in the combined sequences.
-
Hrm. Okay, so it sounds like our options are not fantastic. We could make the PDF resolution (length and max sigma) a user-exposed parameter, with some sensible recommendations/defaults. Looks like those are already parameters to every relevant function, so we'd just have to pass them through.
The problem is calcBaselineBinomialPdf, which has parameters: x=3, n=10, p=0.33
And uses the constants: CONST_I and BAYESIAN_FITTED
Not sure what any of this is or if it can/needs to be adjusted. @guryaari?
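For what it's worth, those parameters read like a plain binomial density evaluation; just to anchor the numbers (this is only an illustration of x, n, and p, not a claim about what the function does internally):

```r
# Probability of x = 3 successes in n = 10 trials at p = 0.33,
# i.e. choose(10, 3) * 0.33^3 * 0.67^7:
dbinom(x = 3, size = 10, prob = 0.33)
```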
-
Talked to @guryaari and we can probably do a normal approximation (store only sd and mean) for larger data sets. We need to test it to be sure, but we should only need the full PDFs for low numbers of sequences.
We shouldn't have to adjust CONST_I or BAYESIAN_FITTED (he has the code that generates them, though).
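A toy check of that normal-approximation idea: for a posterior that is already Gaussian-shaped on the usual grid, storing only (mean, sd) reconstructs it almost exactly, i.e. 2 numbers instead of 4001 grid values (all values illustrative):

```r
sigma <- seq(-20, 20, by = 0.01)          # the 4001-point grid
pdf_vals <- dnorm(sigma, mean = 1.2, sd = 0.4)
pdf_vals <- pdf_vals / sum(pdf_vals)      # normalize on the grid

# Compress: keep only the first two moments.
mu  <- sum(sigma * pdf_vals)
sdv <- sqrt(sum((sigma - mu)^2 * pdf_vals))

# Reconstruct from (mu, sdv) alone and measure the discrepancy.
approx_vals <- dnorm(sigma, mu, sdv)
approx_vals <- approx_vals / sum(approx_vals)
max_err <- max(abs(approx_vals - pdf_vals))
```

The real test, as noted above, is how many sequences must be combined before a posterior is this close to Gaussian at the desired accuracy.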
-
calcBaseline probably doesn't need to retain the entire input in the db slot, just the relevant columns it uses.