Some contigs are assigned to both "unbinned" and a bin

Issue #87 resolved
Bryan Merrill created an issue

Hi Rob,

I discovered that when using MetaBAT 2.15 (bioconda) on some of my metagenomes, several contigs end up both in the unbinned file and in a bin. However, the bug seems inconsistent… out of ~500 metagenomes I just ran MetaBAT 2.15 on, only 8 contained bins where contigs could be found both in a bin and in unbinned. In the cases where it occurs, a given metagenome has between 1 and 4 contigs assigned to the same bin (bin.8, for example), and those same contigs also appear in unbinned. Do you know why this might be the case, or what the proper assignment for these contigs is?

I’ve attached the log file in case it is useful. I can also share the assembly and depth file (created from one BAM file where one sample’s reads were mapped onto its own assembly). Here is the command I used:

metabat2 -i assembly.fasta -a depth.txt \
-m 1500 -o bins/bin -t 8 --unbinned --seed 42 -v
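
In case it is useful, here is roughly how the overlaps can be spotted (a sketch only: it assumes the run above wrote the numbered bins as bins/bin.<N>.fa and the unbinned contigs as bins/bin.unbinned.fa, and it compares header lines only):

grep -h '^>' bins/bin.[0-9]*.fa | sort > binned_headers.txt
grep '^>' bins/bin.unbinned.fa | sort > unbinned_headers.txt
comm -12 binned_headers.txt unbinned_headers.txt   # headers present in both a bin and unbinned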

I see this behavior in MetaBAT 2.14 as well (both compiled from the release here and installed from bioconda).

Best,
Bryan

Comments (7)

  1. Rob Egan

    Hi Bryan,

    My first thought is that assembly.fasta has duplicate entries. Looking at the code, I do not immediately see how a given sequence could end up in both a bin and the unbinned file, or, more surprisingly, in multiple bins or repeated within a bin, unless the assembly file itself has duplicate entries.
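
    For example, a quick sanity check along these lines should show whether any header line is repeated (note that this only compares the header text, not the sequences themselves):

        grep '^>' assembly.fasta | sort | uniq -d   # prints duplicated headers; empty output means none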

    If you are seeing non-deterministic behavior even with the --seed option, then my first guess would be a race between threads, but again I don’t see that in the code.

    I do, however, see a potential problem where contigs would not be reported in the unbinned file if they were clustered into a bin whose total size did not meet the minClusterSize threshold… but that isn’t the problem you are reporting.

    If you verify that assembly.fa contains only unique entries, then having data that reproduces the problem would help me track down the issue you are observing.

    Best,

    Rob

  2. Bryan Merrill reporter

    Hi Rob,

    I checked the assembly.fa files for these samples, and they do not have duplicate entries. I have not yet tested whether the duplicated-contig behavior is deterministic given the same input (with --seed set to the same value both times). What’s the best way to pass along the files? Thanks so much for looking into this!

    Best,
    Bryan

  3. Rob Egan

    I think I found the problem: a race in the output code and inadvertent re-use of the same buffer between threads. It has the potential to produce incorrect bins, but in practice I only saw it cause a few incorrect entries or omissions in the unbinned file when running in the default multi-threaded mode. I’m testing a fix that should ensure that all unbinned contigs get into that file and that no duplicates do.

  4. Bryan Merrill reporter

    That’s great news! Thanks for looking into this. Is the correct destination of the duplicated contigs the unbinned file or the numbered bin? Or is it different each time?

  5. Rob Egan

    If a contig is in a bin file, then it should be in that bin file. The bug has a very small chance of corrupting a bin file but a significant chance of corrupting the unbinned file… it depends on how the threads land and on the buffers and page sizes, I believe. The tests of the fix look good so far and always produce identical bins (with the same --seed) regardless of whether it uses 1 thread or many.

  6. Bryan Merrill reporter

    Sounds good. Thank you! For metagenomes where no duplicate contigs were produced, does that mean this bug was not at play? (i.e., do I only need to rerun MetaBAT on metagenomes where duplicate contigs were produced?)

  7. Rob Egan

    The prudent thing to do is to rerun with the same seed, but I expect you will see identical bins; the fact that all of your bins have unique contigs leads me to expect that they are all fine.

    However, there is a very slight chance that a few of the previous run’s bins were incorrect (i.e., corrupted) because of the way the output files were being written. I would expect that this chance increases if the disks are slow, the number of threads is high, and at least some of the bins are large. In my 30 tests on your data, whenever there was a problem, it was always bin.8 that shared contigs with the unbinned file. bin.8 is by far the largest bin, followed by the unbinned file (so they took the longest to write and definitely exceeded the system’s write buffers). Additionally, in all my tests, it was the unbinned file that was incorrect, and never the numbered bin files.
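
    If you do rerun, one way to compare the two runs (just a sketch, assuming each run’s bins are written as bin.<N>.fa under bins/ and, say, a hypothetical bins_rerun/ directory) is to diff the sorted contig headers of each bin:

        for f in bins/bin.*.fa; do
          b=$(basename "$f")
          if diff -q <(grep '^>' "bins/$b" | sort) <(grep '^>' "bins_rerun/$b" | sort) > /dev/null; then
            echo "$b: identical"
          else
            echo "$b: DIFFERS"
          fi
        done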
