Error in normaliseOtherEnds when combining multiple datasets

Issue #8 resolved
Lindsey Montefiori created an issue

Hello,

In two different cases I have received the same error when running chicagoPipeline. The first is when I provide four different .chinput files and use readAndMerge to call interactions on the combined dataset (it works fine with two files). The second is when I merge two .bam files, generate a single .chinput file from the merged file, and attempt to call interactions. The error seems to occur during the binning step of normaliseOtherEnds.

Many thanks for your help.

Lindsey

Code:

cd <- setExperiment(designDir = "./")

files <- c("1.chinput", "2.chinput", "3.chinput", "4.chinput")

cd <- readAndMerge(files=files, cd=cd)

cd <- chicagoPipeline(cd)

*** Running normaliseBaits...

Normalising baits...
Reading NPerBin file...
Computing binwise means...

*** Running normaliseOtherEnds...

Preprocessing input...
Computing trans-counts...
Filtering out 365 other ends with top 0.01% number of trans-interactions
Binning...
Error in if (cj == upper) next : missing value where TRUE/FALSE needed
In addition: Warning message:
In (1:g) * nnm : NAs produced by integer overflow

Comments (32)

  1. Mikhail Spivakov

    Hi Lindsey,

    Thanks for the report.

    Can you please share with us a minimal set of data that reproduces the problem? We should then be able to look into this.

    Best wishes, Mikhail

  2. Lindsey Montefiori reporter

    Hi Mikhail,

    Here are the first 0.5 million lines of the merged .chinput file that is giving the same error.

    Please let me know if this is not sufficient to understand the problem.

    Thank you,

    Lindsey

  3. Mikhail Spivakov

    Hi Lindsey,

    Thanks! What organism, restriction enzyme and capture design did you use? If it's one of those we released with the Genome Biology paper, just let us know; otherwise please share the full set of design files as well.

    Best wishes, Mikhail

  4. Lindsey Montefiori reporter

    Hi Mikhail,

    The data are human; we used MboI and a self-made capture design with ~90,000 baits targeting all RefSeq promoters. You can find the design files here: http://mnlab.uchicago.edu/users/lem/.

    Just to clarify, I used samtools to merge two .bam files of replicate experiments and then ran bam2chicago.sh on this file to generate the .chinput file I sent earlier. Even with only the first 0.5 million lines of that file, I get the error.

    Thank you very much,

    Lindsey

  5. Mikhail Spivakov

    Thanks Lindsey, we'll look into it within the next week or so and get back to you. Best wishes, Mikhail

  6. Mikhail Spivakov

    Hi Lindsey, things get rather interesting, as the code runs fine with your design files and chinput file on our end! Could you please:

    (1) let us know the output of sessionInfo()

    (2) run debug(normaliseOtherEnds) and try to catch the exact command within normaliseOtherEnds() that causes the error, and see what the data submitted to it look like? We can then compare with the same on our end.
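
    For reference, here is a minimal sketch of what such a debugging session might look like (the variables you will see inside the function depend on the Chicago version):

    debug(normaliseOtherEnds)
    cd <- chicagoPipeline(cd)  # execution pauses on entry to normaliseOtherEnds
    ## step through with `n`, print any variable by typing its name,
    ## quit the browser with `Q`, and run undebug(normaliseOtherEnds) when done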

    Best wishes, Mikhail

  7. Mikhail Spivakov

    PS. Could you please also try running Chicago via the runChicago.R wrapper available as part of chicagoTools?

    $ Rscript ~/CHiCAGOv2/chicagoTools/runChicago.R --design-dir <your-design-folder> <your-chinput-file>.chinput <output-folder-name>

  8. Lindsey Montefiori reporter

    Hi Mikhail,

    Sorry for the delayed response. I put the full .chinput file that gives the error at http://mnlab.uchicago.edu/users/lem/ to ensure you get the same error as on our end.

    The output of sessionInfo() is

    R version 3.3.2 (2016-10-31)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 14.04.5 LTS
    
    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base
    
    other attached packages:
    [1] Chicago_1.0.4    data.table_1.9.6
    
    loaded via a namespace (and not attached):
     [1] Rcpp_0.12.7         matrixStats_0.51.0  lattice_0.20-34
     [4] Delaporte_3.0.0     Hmisc_3.17-4        MASS_7.3-45
     [7] chron_2.3-47        grid_3.3.2          plyr_1.8.4
    [10] acepack_1.3-3.3     gtable_0.2.0        scales_0.4.0
    [13] ggplot2_2.1.0       latticeExtra_0.6-28 rpart_4.1-10
    [16] Matrix_1.2-7.1      Formula_1.2-1       splines_3.3.2
    [19] RColorBrewer_1.1-2  foreign_0.8-67      munsell_0.4.3
    [22] survival_2.39-4     colorspace_1.2-7    cluster_2.0.5
    [25] nnet_7.3-12         gridExtra_2.2.1
    

    I ran debug(normaliseOtherEnds) and I believe the error is occurring with the cut2 command, part of the .addTLB function. I am still working on determining what the input data look like that are giving that error.

    I ran Chicago with the wrapper and received the same error:

    *** Running normaliseOtherEnds...
    
    Preprocessing input...
    Computing trans-counts...
    Filtering out 273 other ends with top 0.01% number of trans-interactions
    Binning...
    Error in if (cj == upper) next : missing value where TRUE/FALSE needed
    Calls: chicagoPipeline -> normaliseOtherEnds -> .addTLB -> cut2
    In addition: Warning message:
    In (1:g) * nnm : NAs produced by integer overflow
    Execution halted
    
  9. Mikhail Spivakov

    Hi Lindsey,

    Thanks. I notice the potentially telling message from cut2():

    In (1:g) * nnm : NAs produced by integer overflow
    

    If NAs are produced this way, then the missing value error can certainly be provoked.
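
    As a sketch of the mechanism (nnm mirrors the v1.0.4 debug output later in this thread; the exact value of g is an assumption):

    ## In R, integer * integer stays integer, and `*` and `/` bind equally,
    ## so (1:g) * nnm is evaluated before the division by g.
    nnm <- 1604111L
    g <- 1604L
    v <- (1:g) * nnm / g  # Warning: In (1:g) * nnm : NAs produced by integer overflow
    head(v)               # early multiples are still finite...
    tail(v)               # ...but NAs appear once k * nnm exceeds 2^31 - 1;
                          # an NA then reaches cut2's `if (cj == upper)` check,
                          # giving "missing value where TRUE/FALSE needed"

    ## Promoting to double before multiplying avoids the overflow:
    v.ok <- (1:g) * as.numeric(nnm) / g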

    I've found a relevant Stack Overflow thread on this (http://stackoverflow.com/questions/39500312/using-cut2-in-hmisc-package-for-large-n).

    A couple more things:

    1) Can you please upgrade Hmisc to the latest version and try again?

    2) Just in case, what's the value of .Machine$integer.max on your machine?

    In the meanwhile, we'll try to reproduce the issue on our end with the full dataset.

  10. Mikhail Spivakov

    Thanks, the max value is as expected.

    One more thing: can you please also post the full output of runChicago, not just the lines triggering the error?

  11. Mikhail Spivakov

    I can also confirm that full-merged.chinput runs through normaliseOtherEnds() without a problem on our end:

    *** Running normaliseOtherEnds...
    
    Preprocessing input...
    Computing trans-counts...
    Filtering out 273 other ends with top 0.01% number of trans-interactions
    Binning...
    Computing total bait counts...
    Reading NBaitsPerBin file...
    Read 7125003 rows and 76 (of 76) columns from 1.048 GB file in 00:00:22
    Computing scaling factors...
    Computing binwise means...
    Computing normalised counts...
    Post-processing...
    
  12. Lindsey Montefiori reporter

    That is interesting. Here is the full output of runChicago:

    ***runChicago.R
    
    
    Loading required package: methods
    
    
    Loading the Chicago package and dependencies...
    
    Loading required package: data.table
    Welcome to CHiCAGO - version 1.0.4
    If you are new to CHiCAGO, please consider reading the vignette through the command: vignette("Chicago").
    Warning: neither --en-feat-files nor --en-feat-list provided. Feature enrichments will not be computed
    
    Setting the experiment...
    
    Locating <baitmapfile>.baitmap in /home/lem/projects/hic/test_chicago/...
    Found probes-MboI_fragments-fixed.baitmap
    Locating <rmapfile>.rmap in /home/lem/projects/hic/test_chicago/...
    Found MboI.rmap
    Locating <nperbinfile>.npb in /home/lem/projects/hic/test_chicago/...
    Found MboI.npb
    Locating <nbaitsperbinfile>.nbpb in /home/lem/projects/hic/test_chicago/...
    Found MboI.nbpb
    Locating <proxOEfile>.poe in /home/lem/projects/hic/test_chicago/...
    Found MboI.poe
    Checking the design files...
    Read 7125003 rows and 4 (of 4) columns from 0.211 GB file in 00:00:03
    
    
    Reading /home/lem/projects/hic/test_chicago/chicago_19101/iPSC-19101-merge.chinput
    Read 14644857 rows and 5 (of 5) columns from 0.381 GB file in 00:00:04
    Processing input...
    minFragLen = 150 maxFragLen = 40000
    Filtered out 943189 interactions involving other ends < minFragLen or > maxFragLen.
    minNPerBait = 250
    Filtered out 35400 baits with < minNPerBait reads.
    
    Removed interactions with fragments adjacent to baits.
    Filtered out 0 baits without proximal non-Bait2bait interactions
    
    
    
    Starting chicagoPipeline...
    
    
    *** Running normaliseBaits...
    
    Normalising baits...
    Reading NPerBin file...
    Computing binwise means...
    
    *** Running normaliseOtherEnds...
    
    Preprocessing input...
    Computing trans-counts...
    Filtering out 273 other ends with top 0.01% number of trans-interactions
    Binning...
    Error in if (cj == upper) next : missing value where TRUE/FALSE needed
    Calls: chicagoPipeline -> normaliseOtherEnds -> .addTLB -> cut2
    In addition: Warning message:
    In (1:g) * nnm : NAs produced by integer overflow
    Execution halted
    
  13. Mikhail Spivakov

    If we look into the cut2() code (which is quite straightforward), we'll see that the line that throws an overflow on your end is most likely this one:

            cuts <- approx(cum, xx, xout = (1:g) * nnm/g, method = "constant", 
                rule = 2, f = 1)$y
    

    If you do debug(Hmisc:::cut2) and check the values of the key variables within this function immediately after they are assigned, here's what I get:

    > nnm
    [1] 2134104
    ...
    >  g
    [1] 42
    ...
    > (1:g) * nnm/g # note this is the expression that most likely fails in the approx() call
     [1]   50812  101624  152436  203248  254060  304872  355684  406496  457308
    [10]  508120  558932  609744  660556  711368  762180  812992  863804  914616
    [19]  965428 1016240 1067052 1117864 1168676 1219488 1270300 1321112 1371924
    [28] 1422736 1473548 1524360 1575172 1625984 1676796 1727608 1778420 1829232
    [37] 1880044 1930856 1981668 2032480 2083292 2134104
    ...
    > cuts
     [1]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1
    [26]  1  1  1  1  1  2  2  2  2  2  2  2  3  3  3  4 13
    

    Can you please try on your end?

  14. Lindsey Montefiori reporter

    Hi Mikhail,

    I updated Hmisc to version 4.0-0 and installed the latest version of Chicago (I had been running 1.0.4), and now it is working!

    I got different values for nnm and g and many NAs in the (1:g)*nnm/g calculation. I am happy to send the output if you are interested.

    Thank you so much for your help.

    Lindsey

  15. Mikhail Spivakov

    Great to hear it's worked. However, please do post the output, as I'm not sure those NAs won't backfire...

  16. Mikhail Spivakov

    Thanks. I'm still a bit concerned that the values don't match and that you end up with such a huge number of bins.

    Can you please do debug(Chicago:::.addTLB) and re-run Chicago with full-merged.chinput.

    Once you get into .addTLB, and before the debugger executes this line:

    cuts = cut2(transLen[abs(distSign) <= s$maxLBrownEst]$transLength, 
        m = minProxOEPerBin, onlycuts = TRUE)
    

    can you please check the following:

    1) nrow(transLen) # 3380883

    2) nrow(transLenB2B) # 49602

    3) length(transLen[abs(distSign) <= s$maxLBrownEst]$transLength) # 2134104

    4) s$maxLBrownEst # 1500000

    5) minProxOEPerBin # 50000

    And then, after that line is executed, can you please check the value of cuts, which on my side is 0 1 2 3 4 5 13.

  17. Lindsey Montefiori reporter

    After updating to v1.1.6, the output is the same as yours (I had been confirming that runChicago.R worked, but hadn't yet tried running Chicago through chicagoPipeline):

    > nrow(transLen)
    [1] 3380883
    > nrow(transLenB2B)
    [1] 49602
    > length(transLen[abs(distSign) <= s$maxLBrownEst]$transLength)
    [1] 2134104
    > s$maxLBrownEst
    [1] 1500000
    > minProxOEPerBin
    [1] 50000
    > cuts
    [1]  0  1  2  3  4  5 13
    

    However, when using v1.0.4 (the version that has been giving the original error message), I get these values (the cut2 line fails, so there are no values for cuts):

    debug: cuts = cut2(transLen[abs(distSign) <= s$maxLBrownEst]$transLength,
        m = minProxOEPerBin, onlycuts = TRUE)
    Browse[2]> nrow(transLen)
    [1] 3380883
    Browse[2]> nrow(transLenB2B)
    [1] 49602
    Browse[2]> length(transLen[abs(distSign) <= s$maxLBrownEst]$transLength)
    [1] 1604111
    Browse[2]> s$maxLBrownEst
    [1] 1500000
    Browse[2]> minProxOEPerBin
    [1] 1000
    Browse[2]> n
    Error in if (cj == upper) next : missing value where TRUE/FALSE needed
    In addition: Warning message:
    In (1:g) * nnm : NAs produced by integer overflow
    
  18. Mikhail Spivakov

    Great! There was a binning issue in earlier versions (including v1.0.4) that we have since fixed. Although it never triggered this particular error on our end, it may well be that the two issues are related. I'm marking this as resolved. Please don't hesitate to get in touch should you encounter any other issues.

  19. Lindsey Montefiori reporter

    Hi Mikhail,

    Thanks for your help with this issue. Now that I am able to run Chicago on the merged .chinput file, I do want to mention an interesting finding. This merged .chinput file was generated by first merging two .bam files representing biological replicates and then running bam2chicago.sh on that merged bam file. Chicago calls 133,000 interactions on this merged .chinput file (the source of all the errors we have been discussing). However, when I simply give Chicago the two .chinput files representing each replicate (and using readAndMerge to combine them), it only finds 12,000 interactions. I am curious if you have any insight into why this is the case.
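
    For clarity, here are the two strategies side by side (a sketch; the file names are placeholders):

    ## (a) "Chicago-merged": pass each replicate's chinput to readAndMerge
    cd.merged <- setExperiment(designDir = "./")
    cd.merged <- readAndMerge(files = c("rep1.chinput", "rep2.chinput"),
                              cd = cd.merged)

    ## (b) "pre-merged": merge the name-sorted BAMs first, e.g.
    ##     samtools merge -n merged.bam rep1.bam rep2.bam
    ##     and run bam2chicago.sh on the result to get a single chinput
    cd.pre <- setExperiment(designDir = "./")
    cd.pre <- readAndMerge(files = "merged.chinput", cd = cd.pre)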

    Thank you again,

    Lindsey

  20. Mikhail Spivakov

    Hi Lindsey,

    This isn't necessarily a problem, as Chicago weights replicates based on effective library size, which may differ quite a bit from the total library size. Out of interest, what are the s_k's for each replicate (params(cd)$s_k)?

    Another thing to check is whether paired-end BAM merging has in fact been done correctly. Can you please check if the pooled BAM file still keeps mate pairs next to each other (correct), or is sorted by coordinate (incorrect)?
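
    One quick way to check this from R (a sketch assuming the Rsamtools package; a name-sorted file typically carries SO:queryname in its @HD header line, whereas coordinate sorting shows SO:coordinate):

    library(Rsamtools)
    hdr <- scanBamHeader("merged.bam")[[1]]$text  # file name is a placeholder
    hdr[["@HD"]]  # e.g. "VN:1.6" "SO:queryname"  -> mates kept together
                  #      "VN:1.6" "SO:coordinate" -> re-sort by name first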

    Cheers, Mikhail

  21. Lindsey Montefiori reporter

    Hi Mikhail,

    The bam files for each replicate were first sorted by read name and then merged, so I don't think that is an issue.

    The value of $s_k for rep1 is 1.007282 and for rep2 is 0.992771.

    Does it make sense that the difference in "effective library size" should lead to such a stark difference in the number of interactions called (12k vs. 130k)? Do you think that first combining reads from biological replicates and then running through the pipeline produces less meaningful interactions?

    Thanks, Lindsey

  22. Mikhail Spivakov

    I find this really strange, given such similar s_k's for the replicates. Can you perhaps go into cd@x and plot the N's from the pre-merged sample versus the one merged by Chicago for a random subset of data (not just for peaks)? Do they differ only in magnitude or otherwise? Also, what is the median abs(distSign) for the peaks detected in either case?
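
    A rough sketch of that comparison (object names are placeholders for the two post-pipeline chicagoData objects; intData() is the accessor for cd@x, and score >= 5 is the usual CHiCAGO significance threshold):

    library(data.table)
    x.pre <- intData(cd.pre)       # equivalently cd.pre@x
    x.merged <- intData(cd.merged)

    ## match interactions on bait/other-end IDs and compare raw counts
    setkey(x.pre, baitID, otherEndID)
    setkey(x.merged, baitID, otherEndID)
    both <- x.merged[x.pre, nomatch = 0]  # i.N comes from x.pre, N from x.merged
    idx <- sample(nrow(both), min(1e4, nrow(both)))
    plot(both$i.N[idx], both$N[idx], log = "xy",
         xlab = "N (pre-merged)", ylab = "N (Chicago-merged)")

    ## median interaction distance of the called peaks in each set
    median(abs(x.pre[score >= 5]$distSign), na.rm = TRUE)
    median(abs(x.merged[score >= 5]$distSign), na.rm = TRUE)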

  23. Lindsey Montefiori reporter

    Before I go into that, could it be due to the fact that each replicate only has ~55M sequenced reads? Perhaps they are so sparsely sequenced that combining the "background" of each makes it much harder to call interactions? Indeed, 7,000 of the 12,000 interactions called on the rep1+rep2 set are present in the rep1 dataset, and these interactions include all of the highest-scoring interactions from rep1. So it seems that combining replicates makes it harder to call interactions, and thus only the most significant ones are found.

    Working on answering your previous question now...

    Lindsey

  24. Mikhail Spivakov

    Thanks, Jonathan and I will also investigate this further. Perhaps you could share per-replicate chinput files with us for the same samples as full-merged?

  25. Jonathan Cairns

    Hi Lindsey,

    I've had a look at your files. It looks to me like the main difference between the "pre-merged" and "Chicago-merged" strategies is in the way baits are filtered.

    By default, Chicago filters out baits with under 250 reads, in order to avoid overfitting and imprecise parameter estimates. In your data set, a lot of baits have around 125-250 reads in each replicate. These baits get filtered out in the "Chicago-merged" strategy. But in the "pre-merged" strategy, these baits have >250 reads after merging, so they don't get filtered out. This effect can be seen in the number of baits left after filtering: 9681 when Chicago-merging, as opposed to 16629 when pre-merging.
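
    A hedged way to see this distribution directly from a chinput file (assuming the standard bam2chicago.sh column layout of baitID, otherEndID, N, otherEndLen, distSign; adjust fread's skip= if your file carries extra comment lines):

    library(data.table)
    x <- fread("rep1.chinput")  # file name is a placeholder
    baitN <- x[, .(total = sum(N)), by = baitID]
    hist(baitN$total, breaks = 200, xlim = c(0, 1000),
         main = "Reads per bait", xlab = "total reads per bait")
    baitN[, sum(total >= 125 & total < 250)]  # baits rescued by pre-merging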

    To fix the Chicago-merged strategy, you could try making the filtering step less aggressive by lowering the "minNPerBait" setting from its default of 250. For example, I tried it with minNPerBait = 125:

    cd.merge <- setExperiment(designDir = designDir, settings = list(minNPerBait = 125))
    cd.merge <- readAndMerge(cd = cd.merge, files = files.merge)

    and got 17373 baits to take forward into the analysis.

    Thanks,

    Jonathan

  26. Lindsey Montefiori reporter

    Hi Jonathan,

    Thank you for looking into this; that makes sense based on what we are seeing with the pre-merged and Chicago-merged samples. As I understand it, with Chicago-merging, essentially only baits that already have high read coverage in each replicate make it through to the combined dataset. This method therefore won't enable detection of interactions that are below-threshold in a single replicate but above-threshold in the combined dataset, since combining invariably increases the read depth of each bait.

    However, I would like to identify these "weaker" interactions (at least for now, with the multiple low-coverage replicates), so I would need to "pre-merge" - would you argue that this is a flawed approach?

    Additionally, do you think that lowering minNPerBait to 125 is risky, given your point about overfitting and imprecise parameter estimates?

    Thanks, Lindsey

  27. Jonathan Cairns

    Hi Lindsey,

    Sure, setting minNPerBait to 125 is potentially risky, but the pre-merging strategy has similar risks because it accepts a similar set of baits. I would not expect the two strategies to give you hugely different results - the only philosophical difference is that the "pre-merging" strategy treats your replicates as two technical replicates (e.g. a single CHi-C library sequenced over multiple lanes) whereas the other strategy treats them as biological replicates.

    In this situation, I would consider using plotBaits() to check that Chicago has made sensible calls in the low-coverage baits.
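
    For example (a sketch; plotBaits() draws a random sample of baits by default, with n controlling how many and baits= accepting specific baitIDs):

    pdf("bait_profiles.pdf")
    plotBaits(cd, n = 16)  # read counts vs distance, with significant calls highlighted
    dev.off()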

  28. Log in to comment