Understanding value of additional column vectors for binning
Hi MetaBAT community,
I am working with deeply sequenced human gut metagenomes (~100M 2x150bp reads per sample) that have been collected from the same individual across time. Given the pitfalls of co-assembly, the distance between timepoints (several months in some cases), and the time to assemble 100M reads, I am assembling these samples separately and using MetaBAT to generate genome bins.
In some tests, I’ve found that when I assemble individual1_sample1, map reads from both individual1_sample1 and individual1_sample2 onto the assembly for individual1_sample1, and then use both BAM files for MetaBAT binning, I recover far more medium-quality (>50% complete, <10% contaminated) and high-quality (>90% complete, <5% contaminated) bins, as estimated with CheckM. The increase in the number of medium/high-quality bins recovered when mapping additional samples is positively correlated with the number of medium/high-quality bins I recover when binning with self-mapped reads alone. I assume that perturbations in community composition ratios, without a wholesale turnover in the strains present, provide additional differential coverage signal that MetaBAT can then use either to group smaller bins into more complete ones or to tease apart “superbins” (e.g. ~100% complete, <200% contaminated).
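To make the differential-coverage intuition concrete, here is a toy sketch with hypothetical depth values (the `abundance_distance` function and the sample names are illustrative only, not MetaBAT’s actual distance metric):

```python
# Toy illustration of the differential-coverage intuition (hypothetical
# depth values; this is NOT MetaBAT's actual distance computation).
# Contigs from two different genomes happen to have the same mean depth
# in sample 1, so one coverage column cannot separate them; a second
# sample whose community ratios shifted makes the difference visible.

def abundance_distance(a, b):
    """Euclidean distance between per-sample mean-depth profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

contig_A = {"s1": 30.0, "s2": 45.0}  # genome A grew between timepoints
contig_B = {"s1": 30.0, "s2": 5.0}   # genome B declined

# With sample 1 alone the contigs look identical:
print(abundance_distance([contig_A["s1"]], [contig_B["s1"]]))  # 0.0

# Adding sample 2 separates them:
print(abundance_distance(
    [contig_A["s1"], contig_A["s2"]],
    [contig_B["s1"], contig_B["s2"]],
))  # 40.0
```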
I have three questions about this. First, are there options I could tweak in MetaBAT, given the very high sequencing depth, that would better allow it to disentangle these “superbins” or increase performance without mapping additional samples? Second, how does MetaBAT change its methods as you add additional BAM files? For example, the documentation mentions that “lost” or “short” contigs can be binned when 3 or more samples are provided, but are there any other benefits or drawbacks to providing additional BAM files? Finally, when adding an additional BAM file, is it only helpful if many of the depth values for contigs are non-zero, or are very sparse columns (e.g. mapping samples with only some community overlap) acceptable to MetaBAT?
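On the third question, one way to quantify how sparse an added coverage column is, before handing it to MetaBAT, is to inspect the depth table produced by jgi_summarize_bam_contig_depths. A minimal sketch, assuming the standard column layout (contigName, contigLen, totalAvgDepth, then per-BAM depth/variance column pairs); the sample names and toy table are hypothetical:

```python
import csv
import io

def depth_column_sparsity(depth_tsv):
    """Fraction of contigs with non-zero mean depth, per BAM column.

    Assumes the usual jgi_summarize_bam_contig_depths layout:
    contigName, contigLen, totalAvgDepth, then pairs of <sample>
    depth and <sample>-var columns.
    """
    reader = csv.reader(io.StringIO(depth_tsv), delimiter="\t")
    header = next(reader)
    depth_cols = list(range(3, len(header), 2))  # skip the -var columns
    nonzero = {header[i]: 0 for i in depth_cols}
    total = 0
    for row in reader:
        total += 1
        for i in depth_cols:
            if float(row[i]) > 0:
                nonzero[header[i]] += 1
    return {name: n / total for name, n in nonzero.items()}

# Tiny synthetic example: s2.bam covers only half the contigs.
table = "\n".join([
    "contigName\tcontigLen\ttotalAvgDepth\ts1.bam\ts1.bam-var\ts2.bam\ts2.bam-var",
    "c1\t3000\t20.0\t20.0\t4.0\t0.0\t0.0",
    "c2\t2600\t35.0\t30.0\t5.0\t5.0\t1.0",
])
print(depth_column_sparsity(table))  # {'s1.bam': 1.0, 's2.bam': 0.5}
```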
Thanks in advance for your help.
Best,
Bryan
Comments (4)
Hi Rob,
Thank you very much for the detailed response. This all makes sense. Thanks for the co-assembly recommendations, too.
Follow-up on 1: Do you have any examples where a read % identity higher or lower than the default 97% is warranted?
Follow-up on 4: These “short” contigs are identified as contigs that are below the minimum length -m but were included in the assembly provided to MetaBAT, is that correct? If I set -m 2500 but my assembly contains contigs >=1500 bp, I provide >=3 BAM files, and MetaBAT tells me that there were too many small contigs to bin, does that justify lowering -m so that there are fewer small contigs? Or would lowering -m be detrimental by spreading small contigs across the bins?
Many thanks,
Bryan
For #1: the %ID needs to be modified from 97% if the error rate is high (i.e. long reads from PacBio or Nanopore), but long-read usage of MetaBAT is not well tested. The 97% threshold was determined to have some discriminating power to differentiate strains in our evaluation sets, where the metagenome had evolved over time. For example, we believe it proved useful in this paper: https://www.nature.com/articles/ismej2015241 (I believe they used 95% ID).
For the question on #4: YMMV if you lower the minimum contig length, and you should expect more contamination. We know that contigs shorter than 2500 bases are, in general, less reliably placed in bins, because mapping reads to them does not yield reliable coverage metrics and their TNF similarity is also unreliable. MetaBAT is generally conservative and will only try to place contigs in a bin if it is likely they are from the same genome, so these short contigs simply cannot be placed anywhere with confidence… unless there is already an established bin with a strong and similar signature to a given short contig.
Working towards a more contiguous and comprehensive input assembly would likely yield better results than lowering the minimum length threshold.
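Before deciding whether to lower -m, it can help to see how much of the assembly actually sits below a candidate threshold. A minimal sketch (the function name and the contig lengths are made up for illustration; 2500 mirrors MetaBAT’s default minimum):

```python
def contigs_below_threshold(lengths, min_len=2500):
    """Count contigs below a candidate -m value and the fraction of
    total assembly bases they represent. min_len=2500 mirrors the
    MetaBAT default; the input lengths here are illustrative only."""
    short = [length for length in lengths if length < min_len]
    total_bp = sum(lengths)
    return len(short), (sum(short) / total_bp if total_bp else 0.0)

# Hypothetical contig lengths from an assembly:
lengths = [5000, 3200, 2600, 2400, 1800, 1500]
n_short, frac_bp = contigs_below_threshold(lengths)
print(n_short, round(frac_bp, 3))  # 3 0.345
```

If only a small fraction of total bases lies below the threshold, lowering -m buys little and mostly adds unreliably placed contigs, which is consistent with the advice above to improve the assembly instead.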
- changed status to resolved
Hi Bryan,
This is a complicated question, but I’m going to stick to some simple answers, since the algorithm is highly data dependent and the effect of a perturbation in the inputs cannot necessarily be predicted from first principles.