Error: cannot allocate vector of size ...

Issue #20 resolved
Marco Filipuzzi created an issue

Hello,

I am trying to use CHiCAGO to analyse C-HiC data, but I have run into a problem that I cannot overcome on my own: when processing some .chinput files of size > 8GB it runs out of memory before runChicago.R has finished.

As a sketch of my pipeline I can refer to "Issue #12: correct workflow for weight recalibration by Elisabetta Manduchi": I have 2 biological replicates and would like to use them to recalibrate the weights. I use the runChicago.R wrapper, and my workflow is the following:

  1. Run runChicago.R separately on each of the 2 replicates, i.e. two runs: in one run the input is only rep1.chinput and in the other run the input is only rep2.chinput
  2. Take the 2 separate rds objects, rep1.rds and rep2.rds, and run fitDistCurve.R with --input rep1.rds,rep2.rds
  3. Update the settings file with the new weights from (2)
  4. Use the updated settings file to do one more run of runChicago.R, this time providing the comma-separated list rep1.chinput,rep2.chinput
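For illustration, the four steps could be assembled into command lines like these (a sketch in Python; the script names and the --input / --settings-file flags are taken from this thread, while the design-dir and other required runChicago.R options are omitted as placeholders):

```python
# Sketch of the four-step weight-recalibration workflow as subprocess-style
# command lines. Flags mirror the ones mentioned in this thread; real runs
# will need additional options (design directory, output directory, etc.).

def build_commands(reps, settings_file, out_prefix):
    """Return the Rscript invocations for steps 1, 2 and 4."""
    cmds = []
    # Step 1: one runChicago.R run per replicate
    for rep in reps:
        cmds.append(["Rscript", "runChicago.R", f"{rep}.chinput", rep])
    # Step 2: fit the distance-function curve jointly on the .rds files
    cmds.append(["Rscript", "fitDistCurve.R", out_prefix,
                 "--input", ",".join(f"{rep}.rds" for rep in reps)])
    # Step 4: final run on all replicates with the updated settings file
    cmds.append(["Rscript", "runChicago.R",
                 "--settings-file", settings_file,
                 ",".join(f"{rep}.chinput" for rep in reps), out_prefix])
    return cmds

for cmd in build_commands(["rep1", "rep2"], "updated_settings.txt", "merged"):
    print(" ".join(cmd))
```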

I have no problem in most cases, but when the total size of the .chinput files in step 4 is > 8GB my job fails. I have attached some logs with the typical error message. Note that I have up to 120GB of RAM (24 CPUs) available when I launch runChicago.R (see the attached plot of RAM usage). I have also attached the script with the exact command line I use to run CHiCAGO.

I have tried to provide all the info you need, but don't hesitate to contact me if something is missing or unclear. It would be great if we could use CHiCAGO for C-HiC data at our institute.

Thank you in advance.

Best,
Marco

Comments (5)

  1. Mikhail Spivakov

    It does look like R has unfortunately run out of memory, even with this huge amount of RAM available. One way to circumvent this would be to randomly split the data into several subsets (by bait) and run the analysis separately on each of them.
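    A minimal sketch of the suggested split, assuming a tab-separated .baitmap with the bait fragment ID in the fourth column (the file layout is an assumption; check it against your own design files):

```python
import random

def split_baitmap(lines, n_subsets, seed=42):
    """Randomly partition baitmap lines into n_subsets groups by bait ID.

    All rows sharing a bait fragment ID (assumed to be column 4 of a
    tab-separated .baitmap) stay together in the same subset.
    """
    rows = [ln.rstrip("\n").split("\t") for ln in lines if ln.strip()]
    bait_ids = sorted({r[3] for r in rows})
    random.Random(seed).shuffle(bait_ids)
    assignment = {b: i % n_subsets for i, b in enumerate(bait_ids)}
    subsets = [[] for _ in range(n_subsets)]
    for r in rows:
        subsets[assignment[r[3]]].append("\t".join(r))
    return subsets

# Toy 4-bait map split into 2 subsets; each subset would then be written
# out as its own .baitmap and passed to a separate runChicago.R run.
toy = ["chr1\t100\t200\tB1\tgeneA", "chr1\t300\t400\tB2\tgeneB",
       "chr2\t100\t200\tB3\tgeneC", "chr2\t300\t400\tB4\tgeneD"]
for i, s in enumerate(split_baitmap(toy, 2)):
    print(f"subset {i}: {len(s)} baits")
```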

    Note - Dispersion estimates will most likely vary between the runs, but I expect them to do so only very slightly. If this isn't the case, you may increase the number of draws used for estimating dispersion and/or the number of baits used in each draw (brownianNoise.samples and brownianNoise.subset, respectively). I'd be surprised if this doesn't solve the issue in your case.
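    For reference, bumping these parameters in the settings file could look like the fragment below (a sketch only; the values are placeholders, not recommendations, and the exact key-value format should be checked against your existing settings file):

```
# settings file excerpt - placeholder values, tune to your data
brownianNoise.samples	10
brownianNoise.subset	2000
```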

  2. Marco Filipuzzi reporter

    Hi Mikhail,

    Thank you very much for such a fast reply! Let me see if I got it right: are you suggesting to randomly split the bait.baitmap into several subsets, run the analysis on each subset, and then "merge" the outputs? If so, could you expand on how to subset the data by bait and on how the outputs can be "merged" easily?

    In my opinion (which can of course be completely wrong ;P ), the problem is not strictly the size of the input files, but the fact that everything is kept in memory until the end (from a quick look at chicago.R I can only see garbage collection at `if(printMemory){ print(gc(reset=TRUE)) }`). It is strange that 8GB of input needs 120GB+ of RAM for all the processing. I am aware that this does not have a naive solution on your side, but I can see others hitting this limit soon.

    Thank you again - I am happy even with a workaround, since I like your nice tool! Best, Marco

  3. Mikhail Spivakov

    Yes, this is what I'd try doing. Bait-to-bait interactions will come up in multiple subsets, but you can then use some heuristic to combine them (for example, taking the highest score, as we currently do to combine the results from both viewpoints of a bait-to-bait interaction within a single sample). If this doesn't resolve the issue, you may need to pool other ends into larger bins (by amending the rmap file accordingly), which will unfortunately reduce the resolution of the analysis.
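    The "highest score" heuristic could be sketched like this (assuming, for illustration, that each subset's output has been reduced to a mapping from an interaction's (bait, otherEnd) pair to its CHiCAGO score - that reduction itself is not shown):

```python
def merge_by_max_score(subset_results):
    """Combine per-subset interaction scores, keeping the highest score
    for any interaction seen in more than one subset.

    subset_results: list of dicts mapping (bait_id, other_end_id) -> score.
    """
    merged = {}
    for result in subset_results:
        for interaction, score in result.items():
            # Keep the best score seen so far for this interaction
            if score > merged.get(interaction, float("-inf")):
                merged[interaction] = score
    return merged

# A bait-to-bait interaction ("B1", "B2") scored in two subsets:
merged = merge_by_max_score([{("B1", "B2"): 5.2, ("B1", "F9"): 3.1},
                             {("B1", "B2"): 6.0, ("B3", "F7"): 4.4}])
print(merged[("B1", "B2")])  # → 6.0
```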

    We are aware of the memory bottleneck, and are considering a range of solutions - but unfortunately we aren't expecting them to be forthcoming soon.

  4. Marco Filipuzzi reporter

    Hi Mikhail,

    Thank you very much for your quick support; I will try subsetting the bait list, then.

    best, Marco
