groupBaseline - Profile memory usage

Issue #83 new
ssnn created an issue

@jqz and I have run into memory issues with groupBaseline. On my computer, groupBaseline runs out of memory with nproc>1, while with nproc=1 it finishes the analysis (and doesn't take too long). @jqz has reported similar issues on Farnam.

Comments (25)

  1. Jason Vander Heiden

    calcBaseline probably doesn't need to retain the entire input in the db slot. Just the relevant columns it uses.

  2. Jason Vander Heiden

    We should also see if we can get by with less data in each PDF. It's currently 4,000 points, I think?

  3. ssnn reporter

    Removing unneeded columns helped in findNovelAlleles. So yes, we should do this, at least for the db that is used inside the parallelization. I see calcBaseline is expected to return the full db in the db slot.

  4. Julian Zhou

    This makes sense, except that then the @db field in the returned baseline object would not contain a groupBy column, which is required by groupBaseline() and which it looks for in the input baseline object's @db field. Unless, of course, calcBaseline() takes an added groupBy argument, which makes little sense because calcBaseline() performs no grouping itself.

  5. Jason Vander Heiden

    Ah, good point. Maybe we can drop some known non-grouping fields, like SEQUENCE, GERMLINE, START, and LENGTH?

  6. Jason Vander Heiden

    Or we could require the groupBy argument to groupBaseline to be a vector of values. That seems contrary to the overall dplyr-ness of the tools though.

  7. ssnn reporter

    I would change the part that exports db to the cluster so that only the needed columns are exported. I think this would avoid copying the full db many times when requesting >1 core:

        if (nproc > 1) {        
            cluster <- parallel::makeCluster(nproc, type="PSOCK")
            parallel::clusterExport(cluster, list('db_subset',
    

    And then merge the new columns into the old db, remove db_subset, and force garbage collection.

  8. ssnn reporter

    Maybe move the cluster export part down after

        cols_observed <- grep( paste0("MU_COUNT_"),  colnames(db) ) 
        cols_expected <- grep( paste0("MU_EXPECTED_"),  colnames(db) ) 
    

    then export db[, c(cols_observed, cols_expected)]
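
    A rough sketch of that flow (the toy db, nproc, and the elided per-sequence work are all illustrative, not the real calcBaseline internals):

        # Toy stand-ins for db and nproc, just to make the sketch runnable
        db <- data.frame(SEQUENCE_ID=c("s1", "s2"),
                         MU_COUNT_SEQ_R=c(3, 7),
                         MU_EXPECTED_SEQ_R=c(2.4, 6.1))
        nproc <- 2

        cols_observed <- grep("MU_COUNT_", colnames(db))
        cols_expected <- grep("MU_EXPECTED_", colnames(db))
        db_subset <- db[, c(cols_observed, cols_expected), drop=FALSE]

        if (nproc > 1) {
            cluster <- parallel::makeCluster(nproc, type="PSOCK")
            # Export only the reduced data.frame instead of the full db
            parallel::clusterExport(cluster, "db_subset", envir=environment())
            # ... run the per-sequence work on db_subset here ...
            parallel::stopCluster(cluster)
        }

        # Merge any newly computed columns back into db, drop the copy,
        # and ask R to release the memory.
        rm(db_subset)
        invisible(gc())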

  9. Julian Zhou

    A thought: or we could advise users to supply calcBaseline with only a subset of the data.frame, including just the necessary columns.

  10. Julian Zhou

    Not that this would solve the issue, but: try stripping down your db to the bare minimum (instead of passing every single column, since the majority never gets used).

    I recently used bare-minimum datasets to do some benchmarking in terms of runtime. In doing so, my experience was that you need to allocate at least 40GB of memory for ~150,000 sequences. I also found that going a bit higher, to about ~190,000 sequences, even 60GB of memory won't do.

  11. Julian Zhou

    Usually SEQUENCE_IMGT, GERMLINE_IMGT_D_MASK, and the OBSV_MUT and EXP_MUT columns representing mutation counts (if those are not present, calcBaseline() will automatically call observedMutations() to get them), plus the groupBy and ID columns you may need later for groupBaseline() and plotting. Do double-check the names; I'm just recalling them off the top of my head.
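
    For example, a minimal sketch of trimming the input before calcBaseline; the column names here follow Change-O defaults as I recall them, and the grouping column (SAMPLE) is just an example, so check them against your own data:

        # Toy db standing in for a Change-O data.frame; JUNK_COLUMN
        # represents the many unused columns that get dropped.
        db <- data.frame(SEQUENCE_ID=c("s1", "s2"),
                         SEQUENCE_IMGT=c("NNNN...", "NNNN..."),
                         GERMLINE_IMGT_D_MASK=c("NNNN...", "NNNN..."),
                         SAMPLE=c("A", "B"),
                         JUNK_COLUMN=c(1, 2),
                         stringsAsFactors=FALSE)

        keep <- c("SEQUENCE_ID", "SEQUENCE_IMGT", "GERMLINE_IMGT_D_MASK",
                  "SAMPLE",   # whatever you will group by later
                  grep("^MU_(COUNT|EXPECTED)_", colnames(db), value=TRUE))
        db_small <- db[, intersect(keep, colnames(db)), drop=FALSE]

        # baseline <- shazam::calcBaseline(db_small, nproc=1)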

  12. ssnn reporter

    Similar to the issue of calcBaseline() exporting unneeded db columns to the cluster, groupBaseline() exports the whole baseline object. Maybe exporting only the slots that are needed would help. Also, removing temporary objects and adding garbage collection.
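
    A minimal sketch of that pattern (baseline is faked as a plain list here so the snippet runs on its own; the real object is an S4 Baseline with @db and @pdfs slots, and exactly what the workers need is a guess):

        # Stand-in for the object returned by calcBaseline
        baseline <- list(db   = data.frame(SAMPLE=c("A", "A", "B")),
                         pdfs = list(matrix(runif(3 * 4001), nrow=3)))
        nproc <- 2

        # Pull out only what the workers need, rather than the whole object
        pdfs_needed <- baseline$pdfs
        group_ids   <- baseline$db$SAMPLE

        cluster <- parallel::makeCluster(nproc, type="PSOCK")
        parallel::clusterExport(cluster, c("pdfs_needed", "group_ids"),
                                envir=environment())
        # ... combine the per-sequence pdfs within each group on the workers ...
        parallel::stopCluster(cluster)

        # Drop the temporary copies and collect
        rm(pdfs_needed, group_ids)
        invisible(gc())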

  13. Roy Jiang

    The pdf attribute is causing 99% of the memory problems with baseline. Stripping columns does very little. Using <4000 points for the calculation would probably help.

    We can either scale the number of points being calculated or recommend subsampling. And implement a memory limit of 4GB. You can use these stats as a guide.

        > dim(baseline@db)
        [1] 275766      3

        > object.size(baseline)
        17921093080 bytes

        > object.size(baseline@db)
        188201824 bytes

        > object.size(baseline@pdfs)
        17653436976 bytes
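
    A back-of-envelope check of that pdfs number (assuming double precision and 4001 points per pdf; the factor of two for storing both a CDR and an FWR pdf per sequence is my guess):

        n_seq    <- 275766
        n_points <- 4001
        n_seq * n_points * 8        # ~8.8e9 bytes for one pdf per sequence
        n_seq * n_points * 8 * 2    # ~1.765e10 bytes, essentially the pdfs size above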
    
  14. Julian Zhou

    More likely than not, that's going to be difficult. The 4001 thing, and more generally the numerical convolution part of the code, is hard-coded. I'm not even sure we have the original code that generated some of the hard-coded values saved as .RData objects. In other words, changing the pdf calculation would likely require a complete rewrite, yet at present none of us fully understands how the current code in, for example, groupBaseline() works.

  15. Jason Vander Heiden

    Paging @guryaari, is there a way to either:

    1. Save the PDF as parameters instead of values? (I think this is not possible, IIRC).
    2. Reduce the number of data points per PDF? I.e., would 1001 work as well as 4001?
  16. gur yaari

    1. If I understand you correctly, you are asking whether there is some analytical/numerical form of the posterior pdf that could be compressed by storing only the parameters describing it. If that was indeed your intention, you recall correctly: there is no easy parametrization of the posterior pdfs. A pdf that results from combining many individual pdfs will eventually look like a Gaussian, at which point the parametrization is quite trivial and combining further such pdfs is also easy. The issue is understanding how many individual sequences are enough to approximate the resulting pdf with a Gaussian; this needs to be tested for a given desired accuracy level.

    2. Changing the number of points will affect the resolution and accuracy of the selection estimates. 4001 points means the pdfs are sampled every (20+20)/4000 = 0.01; changing to 1001 points would give a resolution of 40/1000 = 0.04. That would make the pdfs more spiky and less accurate. Of course one can still do so, but at the cost of accuracy.
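
    In code, the grid implied by those numbers (assuming the pdfs are tabulated on an evenly spaced sigma grid over [-20, 20], as the arithmetic above suggests):

        sigma_4001 <- seq(-20, 20, length.out=4001)
        sigma_1001 <- seq(-20, 20, length.out=1001)
        diff(sigma_4001)[1]   # step of 0.01
        diff(sigma_1001)[1]   # step of 0.04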

  17. Jason Vander Heiden

    Thanks, @guryaari. What do you think about cutting the sigma range down to +/- 10 (from +/- 20)? Is that workable?

  18. gur yaari

    I think this is less optimal, since individual sequences with a small number of mutations tend to have non-zero pdf values out to around +/-10. In such cases, trimming the pdfs would result in a systematic bias when combining sequences.

  19. Jason Vander Heiden

    Hrm. Okay, so it sounds like our options are not fantastic. We could make the PDF resolution (length and max sigma) a user-exposed parameter, with some sensible recommendations/defaults. It looks like those are already parameters to every relevant function, so we'd just have to pass them through.

    The problem is calcBaselineBinomialPdf, which has parameters:

        x=3
        n=10
        p=0.33
    

    And uses the constants:

        CONST_I
        BAYESIAN_FITTED
    

    Not sure what any of this is or if it can/needs to be adjusted. @guryaari?

  20. Jason Vander Heiden

    Talked to @guryaari and we can probably do a normal approximation (store only sd and mean) for larger data sets. We need to test it to be sure, but we should only need the full PDFs for low numbers of sequences.

    We shouldn't have to adjust CONST_I or BAYESIAN_FITTED (he has the code to generate them, though).
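
    A rough sketch of the storage idea (the toy pdf below is arbitrary, not BASELINe output, and when the approximation is accurate enough is exactly what still needs testing):

        # Tabulated pdf on the usual sigma grid
        sigma <- seq(-20, 20, length.out=4001)
        step  <- sigma[2] - sigma[1]
        dens  <- 0.7 * dnorm(sigma, mean=1.2, sd=0.8) +
                 0.3 * dnorm(sigma, mean=1.6, sd=0.5)
        dens  <- dens / sum(dens * step)

        # Store two numbers instead of 4001 values
        m <- sum(sigma * dens * step)
        s <- sqrt(sum((sigma - m)^2 * dens * step))

        # Rebuild on demand and check how far off the approximation is
        dens_approx <- dnorm(sigma, mean=m, sd=s)
        max(abs(dens_approx - dens))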
