Hello,
In two different cases I have received the same error when running chicagoPipeline. The first is when I provide four different .chinput files and use readAndMerge to call interactions on the combined dataset (it works fine with two files). The second is when I merge two .bam files and generate a single .chinput file from this and attempt to call interactions. The error seems to occur during the binning step of normaliseOtherEnds.
Many thanks for your help.
Lindsey
Code:
cd <- setExperiment(designDir = "./")
files <- c("1.chinput", "2.chinput", "3.chinput", "4.chinput")
cd <- readAndMerge(files = files, cd = cd)
cd <- chicagoPipeline(cd)

*** Running normaliseBaits...
Normalising baits...
Reading NPerBin file...
Computing binwise means...
*** Running normaliseOtherEnds...
Preprocessing input...
Computing trans-counts...
Filtering out 365 other ends with top 0.01% number of trans-interactions
Binning...
Error in if (cj == upper) next : missing value where TRUE/FALSE needed
In addition: Warning message:
In (1:g) * nnm : NAs produced by integer overflow
-
Hi Lindsey,
Thanks for the report.
Can you please share with us a minimal set of data that reproduces the problem? We should then be able to look into this.
Best wishes, Mikhail
-
reporter - attached merge.chinput
Hi Mikhail,
Here are the first 0.5 million lines of the merged .chinput file that is giving the same error.
Please let me know if this is not sufficient to understand the problem.
Thank you,
Lindsey
-
Hi Lindsey,
Thanks! What organism, restriction enzyme and capture design did you use? If it's one of the designs we released with the Genome Biol paper, just let us know; otherwise please share the full set of design files as well.
Best wishes, Mikhail
-
reporter Hi Mikhail,
The data are human; we used MboI and a self-made capture design with ~90,000 baits targeting all RefSeq promoters. You can find the design files here: http://mnlab.uchicago.edu/users/lem/.
Just to clarify, I used samtools to merge two .bam files of replicate experiments and then ran bam2chicago.sh on this file to generate the .chinput file I sent earlier. Even with only the first 0.5 million lines of that file, I get the error.
Thank you very much,
Lindsey
-
Thanks Lindsey, we'll look into it within the next week or so and get back to you. Best wishes, Mikhail
-
Hi Lindsey, things get rather interesting, as the code runs fine with your design files and chinput file on our end! Could you please:
(1) let us know the output of sessionInfo()
(2) run debug(normaliseOtherEnds) and try to catch the exact command within normaliseOtherEnds() that causes the error, and what the data passed to it look like (see the sketch below)? We can then compare with the same on our end.
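A minimal sketch of this debugging workflow, assuming cd is the chicagoData object from the failing run:

library(Chicago)

## Flag the function so that the next call to it drops into the interactive browser.
debug(normaliseOtherEnds)
cd <- normaliseOtherEnds(cd)
## In the browser, "n" executes one statement at a time; step until you reach the
## statement that throws the error, and inspect the local variables (ls(), print())
## just before executing it.
undebug(normaliseOtherEnds)  # remove the debugging flag afterwards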
Best wishes, Mikhail
-
PS. Could you please also try running Chicago via the runChicago.R wrapper available as part of chicagoTools?
$ Rscript ~/CHiCAGOv2/chicagoTools/runChicago.R --design-dir <your-design-folder> <your-chinput-file>.chinput <output-folder-name>
-
reporter Hi Mikhail,
Sorry for the delayed response. I put the full .chinput file that gives the error at http://mnlab.uchicago.edu/users/lem/ to ensure you get the same error as on our end.
The output of sessionInfo() is:
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Chicago_1.0.4    data.table_1.9.6

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7         matrixStats_0.51.0  lattice_0.20-34
 [4] Delaporte_3.0.0     Hmisc_3.17-4        MASS_7.3-45
 [7] chron_2.3-47        grid_3.3.2          plyr_1.8.4
[10] acepack_1.3-3.3     gtable_0.2.0        scales_0.4.0
[13] ggplot2_2.1.0       latticeExtra_0.6-28 rpart_4.1-10
[16] Matrix_1.2-7.1      Formula_1.2-1       splines_3.3.2
[19] RColorBrewer_1.1-2  foreign_0.8-67      munsell_0.4.3
[22] survival_2.39-4     colorspace_1.2-7    cluster_2.0.5
[25] nnet_7.3-12         gridExtra_2.2.1
I ran debug(normaliseOtherEnds) and I believe the error is occurring with the cut2 command, part of the .addTLB function. I am still working on determining what the input data look like that are giving that error.
I ran Chicago with the wrapper and received the same error:
*** Running normaliseOtherEnds...
Preprocessing input...
Computing trans-counts...
Filtering out 273 other ends with top 0.01% number of trans-interactions
Binning...
Error in if (cj == upper) next : missing value where TRUE/FALSE needed
Calls: chicagoPipeline -> normaliseOtherEnds -> .addTLB -> cut2
In addition: Warning message:
In (1:g) * nnm : NAs produced by integer overflow
Execution halted
-
Hi Lindsey,
Thanks. I notice the potentially telling message from cut2():
In (1:g) * nnm : NAs produced by integer overflow
If NAs are produced this way, then the missing-value error can certainly be provoked.
I've found a relevant stackoverflow thread on this matter: http://stackoverflow.com/questions/39500312/using-cut2-in-hmisc-package-for-large-n
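For reference, the overflow is easy to reproduce in isolation. A minimal sketch (the values below are illustrative, not taken from this run):

## (1:g) * nnm is evaluated before the division by g, and since both operands
## are 32-bit integers, any product above .Machine$integer.max (2147483647)
## becomes NA with exactly this warning.
nnm <- 1604111L    # illustrative: number of observations passed to cut2()
g   <- 1604L       # illustrative: number of quantile groups requested
x <- (1:g) * nnm   # Warning: NAs produced by integer overflow
tail(x)            # the top of the vector is now NA

## Promoting one operand to double avoids the overflow:
tail((1:g) * as.numeric(nnm))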
A couple more things:
1) Can you please upgrade Hmisc to the latest version and try again?
2) Just in case, what's the value of .Machine$integer.max on your machine?
In the meanwhile, we'll try to reproduce the issue on our end with the full dataset.
-
reporter Thanks Mikhail. I will upgrade and try again. The value of .Machine$integer.max is 2147483647
-
Thanks, the max value is as expected.
One more thing: can you please also post the full output of runChicago, not just the lines triggering the error?
-
I can also confirm that full-merged.chinput runs through normaliseOtherEnds() without a problem on our end:

*** Running normaliseOtherEnds...
Preprocessing input...
Computing trans-counts...
Filtering out 273 other ends with top 0.01% number of trans-interactions
Binning...
Computing total bait counts...
Reading NBaitsPerBin file...
Read 7125003 rows and 76 (of 76) columns from 1.048 GB file in 00:00:22
Computing scaling factors...
Computing binwise means...
Computing normalised counts...
Post-processing...
-
reporter That is interesting. Here is the full output of runChicago:
***runChicago.R
Loading required package: methods
Loading the Chicago package and dependencies...
Loading required package: data.table
Welcome to CHiCAGO - version 1.0.4
If you are new to CHiCAGO, please consider reading the vignette through the command: vignette("Chicago").
Warning: neither --en-feat-files nor --en-feat-list provided. Feature enrichments will not be computed
Setting the experiment...
Locating <baitmapfile>.baitmap in /home/lem/projects/hic/test_chicago/...
Found probes-MboI_fragments-fixed.baitmap
Locating <rmapfile>.rmap in /home/lem/projects/hic/test_chicago/...
Found MboI.rmap
Locating <nperbinfile>.npb in /home/lem/projects/hic/test_chicago/...
Found MboI.npb
Locating <nbaitsperbinfile>.nbpb in /home/lem/projects/hic/test_chicago/...
Found MboI.nbpb
Locating <proxOEfile>.poe in /home/lem/projects/hic/test_chicago/...
Found MboI.poe
Checking the design files...
Read 7125003 rows and 4 (of 4) columns from 0.211 GB file in 00:00:03
Reading /home/lem/projects/hic/test_chicago/chicago_19101/iPSC-19101-merge.chinput
Read 14644857 rows and 5 (of 5) columns from 0.381 GB file in 00:00:04
Processing input...
minFragLen = 150 maxFragLen = 40000
Filtered out 943189 interactions involving other ends < minFragLen or > maxFragLen.
minNPerBait = 250
Filtered out 35400 baits with < minNPerBait reads.
Removed interactions with fragments adjacent to baits.
Filtered out 0 baits without proximal non-Bait2bait interactions
Starting chicagoPipeline...
*** Running normaliseBaits...
Normalising baits...
Reading NPerBin file...
Computing binwise means...
*** Running normaliseOtherEnds...
Preprocessing input...
Computing trans-counts...
Filtering out 273 other ends with top 0.01% number of trans-interactions
Binning...
Error in if (cj == upper) next : missing value where TRUE/FALSE needed
Calls: chicagoPipeline -> normaliseOtherEnds -> .addTLB -> cut2
In addition: Warning message:
In (1:g) * nnm : NAs produced by integer overflow
Execution halted
-
If we look into the cut2() code (which is quite straightforward), we'll see that the line that throws an overflow on your end is most likely this one:

cuts <- approx(cum, xx, xout = (1:g) * nnm/g, method = "constant", rule = 2, f = 1)$y

If you do debug(Hmisc:::cut2) and check the values of the key variables within this function immediately after they are assigned, here's what I get:

> nnm
[1] 2134104
...
> g
[1] 42
...
> (1:g) * nnm/g  # note this is the expression that most likely fails in the approx() call
 [1]   50812  101624  152436  203248  254060  304872  355684  406496  457308
[10]  508120  558932  609744  660556  711368  762180  812992  863804  914616
[19]  965428 1016240 1067052 1117864 1168676 1219488 1270300 1321112 1371924
[28] 1422736 1473548 1524360 1575172 1625984 1676796 1727608 1778420 1829232
[37] 1880044 1930856 1981668 2032480 2083292 2134104
...
> cuts
 [1]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1
[26]  1  1  1  1  1  2  2  2  2  2  2  2  3  3  3  4 13
Can you please try on your end?
-
PS. The full output matches what I see on my end until the point of error.
-
reporter Hi Mikhail,
I updated Hmisc to Version 4.0-0 and installed the latest version of Chicago (I had been running 1.0.4) and now it is working!
I got different values for nnm and g and many NAs in the (1:g)*nnm/g calculation. I am happy to send the output if you are interested.
Thank you so much for your help.
Lindsey
-
Great to hear it's worked! However, please do post the output, as I'm not sure those NAs won't backfire...
-
Thanks. I'm still a bit concerned that the values don't match and that you end up with such a huge number of bins.
Can you please do debug(Chicago:::.addTLB) and re-run Chicago with full-merged.chinput.
Once you get into .addTLB, and before the debugger executes this line:

cuts = cut2(transLen[abs(distSign) <= s$maxLBrownEst]$transLength, m = minProxOEPerBin, onlycuts = TRUE)

can you please check the following:
1) nrow(transLen) # 3380883
2) nrow(transLenB2B) # 49602
3) length(transLen[abs(distSign) <= s$maxLBrownEst]$transLength) # 2134104
4) s$maxLBrownEst # 1500000
5) minProxOEPerBin # 50000
And then, after that line is executed, can you please check the value of cuts, which on my side is 0 1 2 3 4 5 13.
-
reporter Updating to v1.1.6, the output is the same as yours (I had confirmed that runChicago.R worked, but hadn't yet tried running Chicago through chicagoPipeline):
> nrow(transLen)
[1] 3380883
> nrow(transLenB2B)
[1] 49602
> length(transLen[abs(distSign) <= s$maxLBrownEst]$transLength)
[1] 2134104
> s$maxLBrownEst
[1] 1500000
> minProxOEPerBin
[1] 50000
> cuts
[1]  0  1  2  3  4  5 13
However, when using v1.0.4 (the version that has been giving the original error message), I get these values (the line doesn't run, so I have no values for cuts):

debug: cuts = cut2(transLen[abs(distSign) <= s$maxLBrownEst]$transLength, m = minProxOEPerBin, onlycuts = TRUE)
Browse[2]> nrow(transLen)
[1] 3380883
Browse[2]> nrow(transLenB2B)
[1] 49602
Browse[2]> length(transLen[abs(distSign) <= s$maxLBrownEst]$transLength)
[1] 1604111
Browse[2]> s$maxLBrownEst
[1] 1500000
Browse[2]> minProxOEPerBin
[1] 1000
Browse[2]> n
Error in if (cj == upper) next : missing value where TRUE/FALSE needed
In addition: Warning message:
In (1:g) * nnm : NAs produced by integer overflow
-
Great! There was a binning issue in earlier versions (including v1.0.4) that we have since fixed. Although it never triggered the error you've reported on our end, your transcripts show that v1.0.4 used minProxOEPerBin = 1000 rather than 50000, which would give cut2() many more bins (a much larger g) and could plausibly explain the overflow, so the two issues may well be related. I'm marking this as resolved. Please don't hesitate to get in touch should you encounter any other issues.
-
- changed status to resolved
-
reporter Hi Mikhail,
Thanks for your help with this issue. Now that I am able to run Chicago on the merged .chinput file, I do want to mention an interesting finding. This merged .chinput file was generated by first merging two .bam files representing biological replicates and then running bam2chicago.sh on that merged bam file. Chicago calls 133,000 interactions on this merged .chinput file (the source of all the errors we have been discussing). However, when I simply give Chicago the two .chinput files representing each replicate (and using readAndMerge to combine them), it only finds 12,000 interactions. I am curious if you have any insight into why this is the case.
Thank you again,
Lindsey
-
Hi Lindsey,
This isn't necessarily a problem, as Chicago weights replicates based on effective library size, which may differ quite a bit from the total library size. Out of interest, what are the s_k's for each replicate (params(cd)$s_k)?
Another thing to check is whether paired-end BAM merging has in fact been done correctly. Can you please check if the pooled BAM file still keeps mate pairs next to each other (correct), or is sorted by coordinate (incorrect)?
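One quick way to check this from R — a sketch assuming the Rsamtools Bioconductor package and a placeholder file name ("pooled.bam"):

library(Rsamtools)

## samtools records the sort order in the @HD header line:
## "SO:queryname" after a name sort (mate pairs adjacent, which is what
## bam2chicago.sh expects) versus "SO:coordinate" after a coordinate sort.
hdr <- scanBamHeader("pooled.bam")[[1]]
hdr$text[["@HD"]]  # e.g. c("VN:1.5", "SO:queryname")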
Cheers, Mikhail
-
reporter Hi Mikhail,
The bam files for each replicate were first sorted by read name and then merged, so I don't think that is an issue.
The value of $s_k for rep1 is 1.007282 and for rep2 is 0.992771.
Does it make sense that the difference in "effective library size" should lead to such a stark difference in the number of interactions called (12k vs. 130k)? Do you think that first combining reads from biological replicates and then running through the pipeline produces less meaningful interactions?
Thanks, Lindsey
-
I find this really strange, given such similar s_k's for the replicates. Can you perhaps go into cd@x and plot the N's from the pre-merged sample versus the one merged by Chicago for a random subset of data (not just for peaks)? Do they differ only in magnitude or otherwise? Also, what is the median abs(distSign) for the peaks detected in either case?
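A rough sketch of that comparison, assuming cdPre and cdMerged are the chicagoData objects from the pre-merged and Chicago-merged runs, and that the @x slot carries the usual baitID, otherEndID and N columns:

library(data.table)

xPre    <- cdPre@x[, .(baitID, otherEndID, Npre = N)]
xMerged <- cdMerged@x[, .(baitID, otherEndID, Nmerged = N)]
both    <- merge(xPre, xMerged, by = c("baitID", "otherEndID"))

## Plot a random subset of fragment pairs (not just peaks) against each other.
idx <- sample(nrow(both), min(1e4, nrow(both)))
plot(both$Npre[idx], both$Nmerged[idx], pch = ".",
     xlab = "N (pre-merged)", ylab = "N (Chicago-merged)")
abline(0, 1, col = "red")  # identity line for reference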
-
reporter Before I go into that, could it be due to the fact that each replicate only has ~55M sequenced reads? Perhaps they are so sparsely sequenced that combining the "background" of each makes it much harder to call interactions? Indeed, 7,000 of the 12,000 interactions called on the rep1+rep2 set are present in the rep1 dataset, and these include all of the highest-scoring interactions from rep1. So it seems that combining replicates makes it harder to call interactions, and only the most significant ones are found.
Working on answering your previous question now...
Lindsey
-
Thanks, Jonathan and I will also investigate this further. Perhaps you could share per-replicate chinput files with us for the same samples as full-merged?
-
reporter Thank you - the files are at http://mnlab.uchicago.edu/users/lem/
-
Hi Lindsey,
I've had a look at your files. It looks to me like the main difference between the "pre-merged" and "Chicago-merged" strategies is in the way baits are filtered.
By default, Chicago filters out baits with under 250 reads, in order to avoid overfitting and imprecise parameter estimates. In your data set, a lot of baits have around 125-250 reads in each replicate. These baits get filtered out in the "Chicago-merged" strategy. But in the "pre-merged" strategy, these baits have >250 reads after merging, so they don't get filtered out. This effect can be seen in the number of baits left after filtering: 9681 when Chicago-merging, as opposed to 16629 when pre-merging.
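Incidentally, this effect is easy to check directly from the chinput files. A sketch, assuming standard chinput files with a baitID/otherEndID/N/otherEndLen/distSign header line ("rep1.chinput" and "rep2.chinput" are placeholder names):

library(data.table)

perBaitTotal <- function(f) {
  x <- fread(f, skip = "baitID")  # fast-forward to the header line
  x[, .(total = sum(N)), by = baitID]
}
r1 <- perBaitTotal("rep1.chinput")
r2 <- perBaitTotal("rep2.chinput")

## Baits in the 125-250 band: below minNPerBait in each individual replicate,
## but above it once the replicates are pooled.
r1[total >= 125 & total < 250, .N]
r2[total >= 125 & total < 250, .N]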
To fix the Chicago-merged strategy, you could try making the filtering step less aggressive by lowering the "minNPerBait" setting from its default of 250. For example, I tried it with minNPerBait = 125:
cd.merge <- setExperiment(designDir = designDir, settings = list(minNPerBait = 125))
cd.merge <- readAndMerge(cd = cd.merge, files = files.merge)
and got 17373 baits to take forward into the analysis.
Thanks,
Jonathan
-
reporter Hi Jonathan,
Thank you for looking into this; that makes sense given what we are seeing with the pre-merged and Chicago-merged samples. As I understand it, with Chicago-merging, essentially only baits that already have high read coverage in each replicate make it through to the combined dataset, so this method won't enable detection of interactions that are below threshold in a single replicate but above threshold in the combined dataset, as combining invariably increases the read depth of each bait.
However, I would like to identify these "weaker" interactions (at least for now, with multiple low-coverage replicates), so I would need to "pre-merge" - would you argue that this is a flawed approach?
Additionally, do you think that lowering minNPerBait to 125 is risky, given your observations about imprecise parameter estimates?
Thanks, Lindsey
-
Hi Lindsey,
Sure, setting minNPerBait to 125 is potentially risky, but the pre-merging strategy has similar risks because it accepts a similar set of baits. I would not expect the two strategies to give you hugely different results - the only philosophical difference is that the "pre-merging" strategy treats your replicates as two technical replicates (e.g. a single CHi-C library sequenced over multiple lanes) whereas the other strategy treats them as biological replicates.
In this situation, I would consider using plotBaits() to check that Chicago has made sensible calls in the low-coverage baits.
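A minimal example of that check, assuming cd is the chicagoData object after chicagoPipeline() has finished:

## plotBaits() draws read counts against distance for a sample of baits,
## marking the significant calls; writing to a PDF keeps the panels tidy.
pdf("baitProfiles.pdf")
plotBaits(cd, n = 16)  # 16 baits sampled at random; use baits = c(...) for specific IDs
dev.off()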
-
reporter Hi Jonathan,
Thank you for the feedback!
Lindsey
-