Understanding value of additional column vectors for binning

Issue #81 resolved
Bryan Merrill created an issue

Hi MetaBAT community,

I am working with deeply sequenced human gut metagenomes (~100M 2x150bp reads per sample) that have been collected from the same individual across time. Given the pitfalls of co-assembly, the distance between timepoints (several months in some cases), and the time to assemble 100M reads, I am assembling these samples separately and using MetaBAT to generate genome bins.

In some tests, I’ve found that when I assemble individual1_sample1, map reads from both individual1_sample1 and individual1_sample2 onto the assembly for individual1_sample1, and then use both BAM files for MetaBAT binning, I get far more medium-quality (>50% complete, <10% contaminated) and high-quality (>90% complete, <5% contaminated, estimated using CheckM) bins. The increase in the number of medium/high-quality bins recovered when mapping additional samples is positively correlated with the number of medium/high-quality bins I recover when binning using only self-mapped reads. I assume that perturbations in community composition ratios, without a huge overhaul in the strains present, are providing additional differential coverage signal that MetaBAT is then able to use to either group smaller bins into more complete ones or tease apart “superbins” (e.g. ~100% complete, <200% contaminated).
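For reference, the workflow I’m describing looks roughly like this (sample names, file paths, and thread counts are placeholders; I’m using bwa/samtools for mapping and the jgi_summarize_bam_contig_depths script that ships with MetaBAT to build the depth table):

    # map reads from both timepoints onto the sample1 assembly, keeping one BAM per sample
    bwa index individual1_sample1.assembly.fa
    bwa mem -t 16 individual1_sample1.assembly.fa sample1_R1.fastq.gz sample1_R2.fastq.gz \
        | samtools sort -@ 16 -o sample1_vs_asm1.bam -
    bwa mem -t 16 individual1_sample1.assembly.fa sample2_R1.fastq.gz sample2_R2.fastq.gz \
        | samtools sort -@ 16 -o sample2_vs_asm1.bam -

    # one depth table with a coverage column per BAM, then bin with MetaBAT
    jgi_summarize_bam_contig_depths --outputDepth asm1_depth.txt sample1_vs_asm1.bam sample2_vs_asm1.bam
    metabat2 -i individual1_sample1.assembly.fa -a asm1_depth.txt -o individual1_sample1_bins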

I have three questions about this. First, are there options I could tweak in MetaBAT, given the very high sequencing depth, that would better allow it to disentangle these “superbins” or increase performance without mapping additional samples? Second, how does MetaBAT change its methods as you add additional BAM files? For example, the documentation mentions that “lost” or “short” contigs can be binned when 3 or more samples are provided, but are there any other benefits or drawbacks to providing additional BAM files besides this? Finally, when adding an additional BAM file, is it only helpful if many of the depth values for contigs are non-zero, or are very sparse columns (e.g. from mapping samples with only partial community overlap) acceptable to MetaBAT?

Thanks in advance for your help.

Best,

Bryan

Comments (3)

  1. Rob Egan

    Hi Bryan,

    This is a complicated question, but I’m going to stick to some simple answers, since the algorithm is highly data dependent and what a perturbation in the inputs will produce cannot necessarily be predicted from first principles.

    1. MetaBAT works best if you give it more information. So it is perfectly valid (and encouraged) to map all the samples to any assembly which you want to bin, as long as you keep them as separate BAM files. This works even if the reads from a sample were not used in the creation of said assembly, as MetaBAT has a %ID filter to exclude any mappings from different species/strains that are in a read set but not present in the assembly. The variable abundances across samples will help, though zero- and near-zero-coverage contigs get no and little benefit respectively -- i.e. the more a contig’s coverage varies across samples, the more information MetaBAT has to work with. If a contig has zero coverage across all the samples that were used to generate the assembly, that is a good indication that the contig should not be used at all.
    2. Since MetaBAT works on only a single assembly, and you want to see the evolution across samples, I would recommend using a co-assembly, not individual assemblies. MetaHipMer is an option to co-assemble large data sets. Alternatively you can do an iterative assembly -- randomly downsample from each of your read sets to get an assembly of the most abundant genomes, then extract all the mapped reads that match the assembly so far (with a high mapping %ID!), use the remaining reads to assemble the remaining lower-abundance genomes, and concatenate this second assembly with the first, repeating until you either do not trust the next assembly or no more reads can be assembled. (A rough sketch of this loop is given after this list.)
    3. The weighting of the similarity between contigs in the assembly, which drives the binning decisions, depends on how many samples are provided. The more samples one provides, the more the similarity score skews towards the abundance correlation metric and away from the TNF.
    4. The rescuing of short contigs is a separate operation after the large contigs have been binned, and it too depends on both the TNF and abundance similarity scores. Because the sequences are shorter, those numbers are less reliable than they are for the longer contigs, which is why we treat short contigs in a secondary stage.
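    To make the iterative assembly in #2 concrete, a rough sketch of the loop is below. The tool choices (seqtk, metaSPAdes, bwa, samtools), the subsampling fraction, and the file names are just placeholders, and for brevity this version splits reads on mapped vs. unmapped pairs rather than applying an explicit high-%ID cutoff, which you would want to add:

        # round 1: randomly downsample and assemble the most abundant genomes
        seqtk sample -s42 reads_R1.fastq.gz 0.1 > sub_R1.fastq
        seqtk sample -s42 reads_R2.fastq.gz 0.1 > sub_R2.fastq
        spades.py --meta -1 sub_R1.fastq -2 sub_R2.fastq -o asm_round1

        # map ALL reads back to the round-1 assembly and keep only the pairs it does not explain
        bwa index asm_round1/contigs.fasta
        bwa mem -t 16 asm_round1/contigs.fasta reads_R1.fastq.gz reads_R2.fastq.gz \
            | samtools fastq -f 12 -1 rest_R1.fastq -2 rest_R2.fastq -

        # round 2: assemble the leftovers and concatenate; repeat until the next assembly
        # is no longer trustworthy or no more reads can be assembled
        spades.py --meta -1 rest_R1.fastq -2 rest_R2.fastq -o asm_round2
        cat asm_round1/contigs.fasta asm_round2/contigs.fasta > combined_assembly.fasta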

  2. Bryan Merrill reporter

    Hi Rob,

    Thank you very much for the detailed response. This all makes sense. Thanks for the co-assembly recommendations, too.

    Follow-up on 1: Do you have any examples where a read % identity higher or lower than the default 97% is warranted?

    Follow-up on 4: These “short” contigs are identified as contigs that are below the minimum length -m but were included in the assembly provided to MetaBAT, is that correct? If I set -m 2500 but my assembly contains contigs >=1500 bp, provide >=3 BAM files, and MetaBAT tells me that there were too many small contigs to bin, does that justify lowering -m so that fewer contigs fall below the threshold? Or would lowering -m be detrimental by spreading small contigs around the bins?

    Many thanks,

    Bryan

  3. Rob Egan

    For #1: the %ID needs to be modified from 97% if the error rate is high (e.g. long reads from PacBio or Nanopore), but long-read usage of MetaBAT is not well tested. The 97% threshold was determined to have some discriminating power to differentiate strains in our evaluation sets where the metagenome had evolved over time. For example, we believe it proved useful in this paper: https://www.nature.com/articles/ismej2015241 . I believe they used 95% ID.
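    For reference, the read %ID filter is applied when the depth table is built, so adjusting it is just a flag on jgi_summarize_bam_contig_depths (file names here are placeholders), e.g. to drop from the default 97% to 95%:

        jgi_summarize_bam_contig_depths --percentIdentity 95 --outputDepth depth.txt sample1.bam sample2.bam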

    For the question on #4: YMMV if you lower the minimum contig length, and you should expect more contamination. We know that contigs less than 2500 bases in length are, in general, less reliably placed in bins, because the mapping of the reads to them does not yield reliable coverage metrics and the TNF similarity is also not reliable. MetaBAT is generally conservative and will only try to place contigs in a bin if it is likely they are from the same genome, so these short contigs just cannot be placed with confidence anywhere… unless there is already an established bin with a strong and similar signature to a given short contig.
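    If you do want to experiment with the cutoff, it is cheap to run both settings on the same depth table and compare the CheckM results for the two bin sets (file names below are placeholders):

        metabat2 -i assembly.fa -a depth.txt -m 2500 -o bins_m2500   # default minimum contig length
        metabat2 -i assembly.fa -a depth.txt -m 1500 -o bins_m1500   # 1500 is the lowest value MetaBAT accepts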

    Working towards a more contiguous and comprehensive input assembly would likely yield better results than lowering the minimum length threshold.
