Metagenome binning pipeline for large scale metagenomic data

Here we show every steps for users in order to follow to reconstruct high quality genome bins. In summary, we will co-assemble all reads from every samples to a big assembly and then will apply MetaBAT to generate genome bins. Lastly, we will show how the quality of each genome bin will be further improved by a simple post processing.

Check out a new tutorial Best Binning Practices for better usage of MetaBAT.

Dataset

We used 1704 human gut metagenomic samples studied by Nielsen et al. 2014. The dataset is publicly available to download.

Generating Metagenome Co-Assembly

Run Ray with kmer=31 and bloom filter defaults
run BBmap to Ray scaffolds >= 2500 bp, idfilter=90
run megahit on unmapped reads and mates
concatenate Ray >=2500 and megahit assemblies

The assembly is available to download.

Mapping reads to the assembly using BBmap

#!bash

for i in *.fastq; do runBBmap.sh assembly.fa $i; done > runBBmap.log 2>&1 &

The bam files are available to download (1.1T).

Running MetaBAT (using v0.25.3)

Calculating abundance

#!bash
jgi_summarize_bam_contig_depths --outputDepth depth.txt --pairedContigs paired.txt --minContigLength 1000 --minContigDepth 2 *.bam

Generating bins

#!bash
metabat -i assembly.fa -a depth.txt -o bin -v --saveTNF saved.tnf --saveDistance saved.dist

It took less than 2 hours for MetaBAT to produce 1634 bins using 32 threads and 17G memory log.
Canopy took 36 hours using 16 threads and 36G memory log.
CONCOCT, GroopM, and MaxBin all failed to generate any bins due to the large data size

The bins from MetaBAT are available to download.

Running MetaBAT with ensemble binning (using v0.30.3)

#!bash
metabat -i assembly.fa -a depth.txt -o binE -v --saveTNF saved.tnf --saveDistance saved.dist -B 20

Running CheckM

#!bash
checkm lineage_wf -f SCG_MetaBAT.txt -t 16 -x fa . ./SCG

Comparing MetaBAT vs. Canopy vs. CAG bins

MetaBAT and Canopy results are generated using our assembly described above.
CAG bins are from Nielsen et al. 2014, downloaded from here.
CheckM results are here.

source('http://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/benchmark.R')
res <- list(MetaBAT=calcPerfBySCG("SCG_MetaBAT.txt"), Canopy=calcPerfBySCG("SCG_Canopy.txt"), CAG=calcPerfBySCG("SCG_CAG.txt"))
printPerf(res)

$`MetaBAT w/ Ensemble Binning`
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  946 832 716 601 464 304 120   36
     0.8  937 823 707 593 457 299 117   35
     0.9  882 769 656 544 414 265  96   26
     0.95 701 595 492 389 283 169  63   18
     0.99 187 118  75  41  19  11   8    4

$MetaBAT
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  936 786 668 566 429 270 108   29
     0.8  930 780 662 561 424 266 105   28
     0.9  872 724 610 512 379 237  91   22
     0.95 610 477 375 296 192 108  45   11
     0.99 129  67  46  25  10   5   4    2

$Canopy
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  759 682 589 494 384 246  86   22
     0.8  753 677 584 489 380 244  85   22
     0.9  716 642 550 456 349 221  72   16
     0.95 494 424 340 257 169  95  37   11
     0.99  95  59  31  14   5   3   3    3

$CAG
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  623 543 486 412 348 289 169   62
     0.8  619 540 483 410 346 287 168   62
     0.9  587 508 451 381 325 268 158   60
     0.95 502 431 382 327 284 234 141   57
     0.99 253 219 187 154 133 106  70   35

* Here the precision is strain level calculation. Due to the initial co-assembly procedure combining all samples, many contigs are indeed the mixture of many strains and it affects both MetaBAT and Canopy. Interestingly, published CAG bins seem to not have the issue suggesting that their assembly quality looks better (Compare >99% precision between Canopy and CAG results) * If the 'strain heterogeneity' score in CheckM table is considered to adjust strain contamination, meaning to calculate species level contamination, the result would be like below:

res <- list(MetaBAT=calcPerfBySCG("SCG_MetaBAT.txt", removeStrain=T), Canopy=calcPerfBySCG("SCG_Canopy.txt", removeStrain=T), CAG=calcPerfBySCG("SCG_CAG.txt", removeStrain=T))
printPerf(res)

$MetaBAT
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  938 788 670 568 430 271 109   29
     0.8  934 784 666 564 427 269 107   28
     0.9  908 760 646 548 411 260 103   28
     0.95 874 731 621 526 392 248  98   27
     0.99 528 422 349 291 213 149  60   16

$Canopy
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  759 682 589 494 384 246  86   22
     0.8  754 677 584 489 380 244  85   22
     0.9  750 675 583 489 380 244  85   22
     0.95 740 669 579 485 376 241  85   22
     0.99 464 406 339 286 225 160  62   19

$CAG
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  624 544 487 413 349 290 170   62
     0.8  623 544 487 413 349 290 170   62
     0.9  620 541 484 411 348 289 169   62
     0.95 592 513 456 387 334 278 166   62
     0.99 352 310 274 234 209 170 104   47

Post-processing of bins

Assembly from pooled samples in the above experiment raises the possibility that similar genomes (e.g., different strains) present in different samples are assembled into chimeric contigs, and these contigs subsequently contaminated the genome bins. Based on the assumption that strains of the same species may not always present in the same samples, we implemented a simple post-binning process to reduce the strain-level contamination. In short, we recruited reads from a single sample from which the most coverage comes using the contigs in a given bin, and then assembled these reads into a new set of contigs. The original contigs from this bin are replaced with the new contigs if the new contigs produce better completeness and precision using CheckM.

Merging bams if there were replicates of a sample MergedBams (1.1T).
For each bin, identify the sample that has the most coverage.
Extract reads from the chosen sample that mapped to contigs in each bin.
Run Velvet to the extracted reads.
Run CheckM whether new assembly has better quality than the bin and keep better ones.

The improved bins from MetaBAT + Velvet are available to download.

Comparing MetaBAT vs. MetaBAT + Velvet

res <- list(MetaBAT=calcPerfBySCG("SCG_MetaBAT.txt"), MetaBAT.P=calcPerfBySCG("SCG_MetaBAT_Velvet.txt", skip=0))
printPerf(res)

$MetaBAT
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  936 786 668 566 429 270 108   29
     0.8  930 780 662 561 424 266 105   28
     0.9  872 724 610 512 379 237  91   22
     0.95 610 477 375 296 192 108  45   11
     0.99 129  67  46  25  10   5   4    2

$MetaBAT.P
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  966 829 711 599 483 344 193   80
     0.8  961 824 706 594 478 340 190   79
     0.9  924 791 678 572 458 327 186   78
     0.95 790 674 571 481 384 280 169   73
     0.99 291 224 186 159 130  96  72   39

* Recall & precision of some bins improved significantly, especially highly precise bins. For instance, >95% precision & >50% recall bins increased from 375 to 571 and >99% precision & >50% recall bins increased from 46 to 186.

Comparing MetaBAT + Velvet vs. MGS draft genomes

MGS draft genomes are from Nielsen et al. 2014, downloaded from here.
One to one correspondence between MetaBAT bins and MGS draft genomes was decided by reciprocal mapping. The mapping result is here: mapPrecision.tab.
All relevant files are here.

qc0 <- read.table("mapPrecision.tab", as.is=T)
qc0$X <- qc0[,3] / rowSums(qc0[,c(3,4)])
qc0$Y <- qc0[,3] / rowSums(qc0[,c(3,5)])
qc0$S <- (qc0$X + qc0$Y) / 2
qc <- ddply(qc0, .(V2), function(x) x[which.max(x$S),]) #one to one

MB2 <- MB[-match(qc$V1,MB$V1),]
MB2 <- MB2[MB2$V13>30 & MB2$V14<.1,] #73
MB2.qc <- qc0[match(MB2$V1,qc0$V1),]
MB2.qc <- MB2.qc[-which(MB2.qc$X>.1),] #55
unique.MetaBAT <- nrow(MB2.qc)

MGS <- read.table("SCG_MGS.txt", comment.char='-', as.is=T, header=F, skip=2)
MB <- read.table("SCG_MetaBAT_Velvet.txt", as.is=T, header=F)

qc <- cbind(qc, MB[match(sub("\\.fa","",qc$V1), MB$V1),13:15], MGS[match(qc$V2, MGS$V1),13:15])
colnames(qc)[9:14] <- c("MB.C","MB.P","MB.SC","MGS.C","MGS.P","MGS.SC") 

qc <- qc[qc$MB.C>30 & qc$MB.P<10,] #342

shared.both <- nrow(qc)

library(grid)
library(VennDiagram)

pdf("Venn.pdf")
grid.draw(draw.pairwise.venn(unique.MetaBAT+shared.both, nrow(MGS), shared.both, c("MetaBAT \n+ Reassembly", "MGS \nDraft Genomes"),
                fill = c("dodgerblue", "goldenrod1", "darkorange1", "seagreen3", "orchid3")[1:2],
                cex = rep(2,3),
                cat.col = c("dodgerblue", "goldenrod1", "darkorange1", "seagreen3", "orchid3")[1:2],
                cat.cex = 2,
                cat.pos = c(-10,10),
                cat.dist = rep(0.1, 2),
                margin=.2,
                scaled=FALSE))
dev.off()
qc.out <- cbind(qc[,c('X','Y')],Size=rowSums(qc[,c(3,4)])/1e+6)

library(ggplot2)

pdf("ScatterPlot.pdf")
p <- ggplot(qc.out, aes(X, Y, size=Size)) + theme_bw()
p <- p + geom_point(shape=21, alpha=0.6, aes(size=Size), fill=gg_color_hue(2)[2]) + scale_size_continuous(range = c(2,10))
p <- p + theme(legend.position = c(.2,.7)); p$labels$size <- "Bin Size (mb)"
p <- p + xlab("Shared Size / MetaBAT Bin Size") + ylab("Shared Size / MGS Genome Size")
p <- p + ylim(c(0,1)) + xlim(c(0,1))
print(p)
dev.off()

Figure 5.png

Wiki

MetaBAT / Example_Large_Data