runChicago.R error

Issue #22
Rola Dali created an issue

Hello,

I am using Chicago to analyze 5 capture Hi-C samples and have gotten the following error across all of them:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

use 'source("https://bioconductor.org/biocLite.R")' or 'source("http://bioconductor.org/biocLite.R")' to update 'BiocInstaller' after library("utils")

***runChicago.R

Loading required package: methods
Warning message:
package 'argparser' was built under R version 3.3.0

Loading the Chicago package and dependencies...

Loading required package: data.table

Welcome to CHiCAGO - version 1.1.8
If you are new to CHiCAGO, please consider reading the vignette through the command: vignette("Chicago").
NOTE: Default values of tlb.minProxOEPerBin and tlb.minProxB2BPerBin changed as of Version 1.1.5. No action is required unless you specified non-default values, or wish to re-run the pipeline on old chicagoData objects. See news(package="Chicago")
Warning message:
package 'Chicago' was built under R version 3.3.0
Warning: neither --en-feat-files nor --en-feat-list provided. Feature enrichments will not be computed

Setting the experiment...

Locating <baitmapfile>.baitmap in input_files... Found capture_targets_DpnII.baitmap
Locating <rmapfile>.rmap in input_files... Found DpnII.sorted.rmap
Locating <nperbinfile>.npb in input_files... Found capture_targets_DpnII.npb
Locating <nbaitsperbinfile>.nbpb in input_files... Found capture_targets_DpnII.nbpb
Locating <proxOEfile>.poe in input_files... Found capture_targets_DpnII.poe
Checking the design files...
Read 7227576 rows and 4 (of 4) columns from 0.214 GB file in 00:00:03

Reading input_files/Hi-C_capture.chinput
Processing input...
minFragLen = 150 maxFragLen = 40000
Filtered out 212087 interactions involving other ends < minFragLen or > maxFragLen.
minNPerBait = 250
Filtered out 183586 baits with < minNPerBait reads.

Removed interactions with fragments adjacent to baits. Filtered out 0 baits without proximal non-Bait2bait interactions

Warning: directory chicago/Hi-C_capture exists and will be reused.

Starting chicagoPipeline...

*** Running normaliseBaits...

Normalising baits... Reading NPerBin file... Computing binwise means...

*** Running normaliseOtherEnds...

Preprocessing input...
Computing trans-counts...
Filtering out 1 other ends with top 0.01% number of trans-interactions
Binning...
Computing total bait counts...
Reading NBaitsPerBin file...
Read 7227576 rows and 76 (of 76) columns from 1.116 GB file in 00:00:23
Computing scaling factors...
Computing binwise means...
Computing normalised counts...
Post-processing...

*** Running estimateTechnicalNoise...

Estimating technical noise based on trans-counts...
Binning baits based on observed trans-counts...
Defining interaction pools and gathering the observed numbers of trans-counts per pool...
Computing the total number of possible interactions per pool...
Preparing the data.....
Processing fragment pools..
Plotting...
Post-processing the results...

*** Running estimateDistFun...

*** Running estimateBrownianComponent...

s_i factors found - estimating Brownian component...
Reading ProxOE file...
Read 968660192 rows and 3 (of 3) columns from 20.670 GB file in 00:03:12
Sampling the dispersion...
Getting consensus dispersion estimate...

*** Running getPvals...

Calculating p-values...

*** Running getScores...

Read 7227576 rows and 4 (of 4) columns from 0.214 GB file in 00:00:03
Calculating p-value weights...
Calculating scores...

Saving the Chicago object...

Plotting examples...

Error in sample.int(length(x), size, replace, prob) :
  cannot take a sample larger than the population when 'replace = FALSE'
Calls: plotBaits -> sample -> sample.int
In addition: Warning messages:
1: In min(diff(x.unique)) : no non-missing arguments to min; returning Inf
2: In min(diff(x.unique)) : no non-missing arguments to min; returning Inf
3: In estimateBrownianComponent(cd) : subset > number of baits in data, so used the full dataset.
4: In estimateBrownianComponent(cd) : We're using the whole data set to calculate dispersion. There's no reason to sample repeatedly in this case, so overriding brownianNoise.samples to 1.
Execution halted
MUGQICexitStatus:1

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

I have some questions:

1- I read in the other issues that this might be because there are fewer than 16 baits, but my capture file contains 245931 captured regions. What exactly is the problem, and how can I fix it?

2- It says "minFragLen = 150 maxFragLen = 40000": is that the inter-tag distance?

If so, why not extend maxFragLen to the full length of chromosome 1 and increase minFragLen to over 1000 bp to avoid contiguous regions?

3- "Filtered out 212087 interactions involving other ends < minFragLen or > maxFragLen." Is that not a lot to filter out? What proportion of interactions do we expect to filter out?

4- It says: "Filtered out 183586 baits with < minNPerBait reads." I assume this means my minNPerBait = 250 is too high for the coverage of my experiments? Even with all that filtering, I should still have more than 16 baits left, so why am I getting that error?

Thank you!

Comments (6)

  1. Mikhail Spivakov

    Hi!

    The good news is that the Chicago interaction calling has completed; you're only getting an error when trying to plot examples. It may simply be that it's trying to choose more random baits to plot than you have available.
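
    If that is what's happening, one workaround is to re-run just the plotting step on the saved Chicago object and request fewer baits. A minimal sketch, assuming the pipeline saved the chicagoData object as an .Rds file (the path below is hypothetical) and that plotBaits() accepts an n argument for the number of randomly chosen baits:

    ```r
    library(Chicago)

    # Hypothetical path - point this at the .Rds saved by runChicago.R
    cd <- readRDS("chicago/Hi-C_capture/Hi-C_capture.Rds")

    # Request fewer randomly chosen baits than the default of 16
    plotBaits(cd, n = 4)
    ```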

    I'd be intrigued to see how Chicago fares with so few baits - we've never tried this.

    Re your specific questions:

    1) I'm not sure what you mean by the difference between baits and captured regions. In Chicago's terminology, baits and captured fragments are synonyms.

    2) minFragLen and maxFragLen refer to restriction fragment size, not inter-fragment distance. See ?defaultSettings and Table S1 in Cairns et al. (https://static-content.springer.com/esm/art%3A10.1186%2Fs13059-016-0992-2/MediaObjects/13059_2016_992_MOESM2_ESM.pdf) for details.
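
    For completeness: the full list of settings (including minFragLen and maxFragLen) can be inspected and, if needed, overridden when the experiment is set up. A rough sketch, assuming your design files live in input_files/ and that setExperiment() accepts a settings list (the values below are placeholders, not recommendations):

    ```r
    library(Chicago)

    # Inspect the defaults, including minFragLen and maxFragLen
    str(defaultSettings())

    # Override selected settings when creating the chicagoData object
    cd <- setExperiment(designDir = "input_files",
                        settings = list(minFragLen = 100, maxFragLen = 1500))
    ```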

    3, 4) This is rather a large number of interactions and baits to filter out, and I suspect it may have to do with the .baitmap and/or .rmap files being misspecified. Perhaps you could share those files with me so I can have a look?
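
    In the meantime, a quick sanity check on your side would be to confirm that every bait in the .baitmap matches an entry in the .rmap by fragment ID and coordinates. A rough sketch, assuming the standard tab-separated formats with no header - 4 columns (chr, start, end, fragID) for the .rmap and 5 columns (chr, start, end, fragID, annotation) for the .baitmap:

    ```r
    library(data.table)

    rmap    <- fread("input_files/DpnII.sorted.rmap", header = FALSE,
                     col.names = c("chr", "start", "end", "fragID"))
    baitmap <- fread("input_files/capture_targets_DpnII.baitmap", header = FALSE,
                     col.names = c("chr", "start", "end", "fragID", "annotation"))

    # Bait fragment IDs missing from the rmap (should be 0)
    sum(!baitmap$fragID %in% rmap$fragID)

    # Baits whose coordinates disagree with the rmap entry (should also be 0)
    m <- merge(baitmap, rmap, by = "fragID", suffixes = c(".bait", ".rmap"))
    m[, sum(chr.bait != chr.rmap | start.bait != start.rmap | end.bait != end.rmap)]
    ```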

  2. Mikhail Spivakov

    Both links seem to be identical and point to the .baitmap file. Could you please re-post the link to the rmap?

  3. Mikhail Spivakov

    The files look good.

    However, I've now had a closer look at the Chicago output that you've provided in your original post.

    *minFragLen = 150 maxFragLen = 40000 Filtered out 212087 interactions involving other ends < minFragLen or > maxFragLen.

    First, a minFragLen of 150 is too high for DpnII: roughly 30% of fragments have a length below 150. Likewise, a maxFragLen of 40000 is probably way too high. A reasonable setting could be estimated by plotting the distribution of the total number of reads per other end as a function of fragment length (e.g., as boxplots across bins of length). There may be some slight dependence across the board, and it should be captured by the s_i's, but you may observe (as we did with the data we trained on) that very short and very long fragments have clearly shifted distributions compared with the rest. Such fragments are best filtered out - but I would aim to select the parameters such that the vast majority of fragments (~80-90%) are retained. Since you've got a bigger problem below, you may want to just set these parameters to the 10th and 90th percentiles (40 and 930, respectively) and leave fine-tuning until later.
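
    To make that concrete - a sketch of both the percentile calculation and the boxplot, assuming the standard 4-column .rmap and that the .chinput file is a plain tab-separated table with baitID, otherEndID, N, otherEndLen and distSign columns (adjust names/paths if yours differ):

    ```r
    library(data.table)

    # 10th and 90th percentiles of DpnII fragment length, straight from the rmap
    rmap <- fread("input_files/DpnII.sorted.rmap", header = FALSE,
                  col.names = c("chr", "start", "end", "fragID"))
    quantile(rmap$end - rmap$start, probs = c(0.1, 0.9))

    # Total reads per other end as a function of fragment length
    chinput <- fread("input_files/Hi-C_capture.chinput")
    oe <- chinput[, .(totalN = sum(N), len = otherEndLen[1]), by = otherEndID]
    oe[, lenBin := cut(len, breaks = unique(quantile(len, probs = seq(0, 1, 0.1))),
                       include.lowest = TRUE)]
    boxplot(totalN ~ lenBin, data = oe, las = 2,
            xlab = "other-end fragment length (deciles)",
            ylab = "total reads per other end")
    ```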

    *Filtered out 212087 interactions involving other ends < minFragLen or > maxFragLen. minNPerBait = 250 Filtered out 183586 baits with < minNPerBait reads.

    I've checked, and you've only got 189426 baits, so at this step you've lost nearly all of them because they have fewer than minNPerBait reads mapping to them (in this case, 250). In general, we'd expect baits to attract at least that many reads in a good experiment, so that's a bit of a red flag. Nonetheless, I'd consider lowering this parameter considerably (say, two- to threefold), such that a reasonable proportion of baits are retained, and then checking whether the interactions detected for them make sense. If they don't, the best solution would be to add more data - either by simply sequencing more or, if the profiles look suspicious, by repeating the whole experiment.
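
    To pick a sensible threshold, it may also help to look at the distribution of total reads per bait directly in the .chinput (same assumed column names as above):

    ```r
    library(data.table)

    chinput <- fread("input_files/Hi-C_capture.chinput")
    perBait <- chinput[, .(totalN = sum(N)), by = baitID]

    summary(perBait$totalN)

    # How many baits would survive a few candidate minNPerBait thresholds?
    sapply(c(250, 125, 80), function(thr) sum(perBait$totalN >= thr))
    ```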

  4. Rola Dali reporter

    Thank you Mikhail. I will look into the DpnII fragment length distribution and then adjust minFragLen and maxFragLen. I will also look at my experimental coverage and adjust minNPerBait as well. I will let you know how things go when I get my results.
