Wiki
Clone wikiMetaBAT / Example_Large_Data
Metagenome binning pipeline for large scale metagenomic data
Here we show every steps for users in order to follow to reconstruct high quality genome bins. In summary, we will co-assemble all reads from every samples to a big assembly and then will apply MetaBAT to generate genome bins. Lastly, we will show how the quality of each genome bin will be further improved by a simple post processing.
Check out a new tutorial Best Binning Practices for better usage of MetaBAT.
Dataset
We used 1704 human gut metagenomic samples studied by Nielsen et al. 2014. The dataset is publicly available to download.
Generating Metagenome Co-Assembly
- Run Ray with kmer=31 and bloom filter defaults
- run BBmap to Ray scaffolds >= 2500 bp, idfilter=90
- run megahit on unmapped reads and mates
- concatenate Ray >=2500 and megahit assemblies
The assembly is available to download.
Mapping reads to the assembly using BBmap
#!bash for i in *.fastq; do runBBmap.sh assembly.fa $i; done > runBBmap.log 2>&1 &
The bam files are available to download (1.1T).
Running MetaBAT (using v0.25.3)
Calculating abundance
#!bash jgi_summarize_bam_contig_depths --outputDepth depth.txt --pairedContigs paired.txt --minContigLength 1000 --minContigDepth 2 *.bam
Generating bins
#!bash
metabat -i assembly.fa -a depth.txt -o bin -v --saveTNF saved.tnf --saveDistance saved.dist
- It took less than 2 hours for MetaBAT to produce 1634 bins using 32 threads and 17G memory log.
- Canopy took 36 hours using 16 threads and 36G memory log.
- CONCOCT, GroopM, and MaxBin all failed to generate any bins due to the large data size
The bins from MetaBAT are available to download.
Running MetaBAT with ensemble binning (using v0.30.3)
#!bash metabat -i assembly.fa -a depth.txt -o binE -v --saveTNF saved.tnf --saveDistance saved.dist -B 20
Running CheckM
#!bash checkm lineage_wf -f SCG_MetaBAT.txt -t 16 -x fa . ./SCG
Comparing MetaBAT vs. Canopy vs. CAG bins
- MetaBAT and Canopy results are generated using our assembly described above.
- CAG bins are from Nielsen et al. 2014, downloaded from here.
- CheckM results are here.
source('http://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/benchmark.R') res <- list(MetaBAT=calcPerfBySCG("SCG_MetaBAT.txt"), Canopy=calcPerfBySCG("SCG_Canopy.txt"), CAG=calcPerfBySCG("SCG_CAG.txt")) printPerf(res) $`MetaBAT w/ Ensemble Binning` Recall Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95 0.7 946 832 716 601 464 304 120 36 0.8 937 823 707 593 457 299 117 35 0.9 882 769 656 544 414 265 96 26 0.95 701 595 492 389 283 169 63 18 0.99 187 118 75 41 19 11 8 4 $MetaBAT Recall Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95 0.7 936 786 668 566 429 270 108 29 0.8 930 780 662 561 424 266 105 28 0.9 872 724 610 512 379 237 91 22 0.95 610 477 375 296 192 108 45 11 0.99 129 67 46 25 10 5 4 2 $Canopy Recall Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95 0.7 759 682 589 494 384 246 86 22 0.8 753 677 584 489 380 244 85 22 0.9 716 642 550 456 349 221 72 16 0.95 494 424 340 257 169 95 37 11 0.99 95 59 31 14 5 3 3 3 $CAG Recall Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95 0.7 623 543 486 412 348 289 169 62 0.8 619 540 483 410 346 287 168 62 0.9 587 508 451 381 325 268 158 60 0.95 502 431 382 327 284 234 141 57 0.99 253 219 187 154 133 106 70 35
res <- list(MetaBAT=calcPerfBySCG("SCG_MetaBAT.txt", removeStrain=T), Canopy=calcPerfBySCG("SCG_Canopy.txt", removeStrain=T), CAG=calcPerfBySCG("SCG_CAG.txt", removeStrain=T)) printPerf(res) $MetaBAT Recall Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95 0.7 938 788 670 568 430 271 109 29 0.8 934 784 666 564 427 269 107 28 0.9 908 760 646 548 411 260 103 28 0.95 874 731 621 526 392 248 98 27 0.99 528 422 349 291 213 149 60 16 $Canopy Recall Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95 0.7 759 682 589 494 384 246 86 22 0.8 754 677 584 489 380 244 85 22 0.9 750 675 583 489 380 244 85 22 0.95 740 669 579 485 376 241 85 22 0.99 464 406 339 286 225 160 62 19 $CAG Recall Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95 0.7 624 544 487 413 349 290 170 62 0.8 623 544 487 413 349 290 170 62 0.9 620 541 484 411 348 289 169 62 0.95 592 513 456 387 334 278 166 62 0.99 352 310 274 234 209 170 104 47
Post-processing of bins
Assembly from pooled samples in the above experiment raises the possibility that similar genomes (e.g., different strains) present in different samples are assembled into chimeric contigs, and these contigs subsequently contaminated the genome bins. Based on the assumption that strains of the same species may not always present in the same samples, we implemented a simple post-binning process to reduce the strain-level contamination. In short, we recruited reads from a single sample from which the most coverage comes using the contigs in a given bin, and then assembled these reads into a new set of contigs. The original contigs from this bin are replaced with the new contigs if the new contigs produce better completeness and precision using CheckM.
- Merging bams if there were replicates of a sample MergedBams (1.1T).
- For each bin, identify the sample that has the most coverage.
- Extract reads from the chosen sample that mapped to contigs in each bin.
- Run Velvet to the extracted reads.
- Run CheckM whether new assembly has better quality than the bin and keep better ones.
The improved bins from MetaBAT + Velvet are available to download.
Comparing MetaBAT vs. MetaBAT + Velvet
res <- list(MetaBAT=calcPerfBySCG("SCG_MetaBAT.txt"), MetaBAT.P=calcPerfBySCG("SCG_MetaBAT_Velvet.txt", skip=0)) printPerf(res) $MetaBAT Recall Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95 0.7 936 786 668 566 429 270 108 29 0.8 930 780 662 561 424 266 105 28 0.9 872 724 610 512 379 237 91 22 0.95 610 477 375 296 192 108 45 11 0.99 129 67 46 25 10 5 4 2 $MetaBAT.P Recall Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95 0.7 966 829 711 599 483 344 193 80 0.8 961 824 706 594 478 340 190 79 0.9 924 791 678 572 458 327 186 78 0.95 790 674 571 481 384 280 169 73 0.99 291 224 186 159 130 96 72 39
Comparing MetaBAT + Velvet vs. MGS draft genomes
- MGS draft genomes are from Nielsen et al. 2014, downloaded from here.
- One to one correspondence between MetaBAT bins and MGS draft genomes was decided by reciprocal mapping. The mapping result is here: mapPrecision.tab.
- All relevant files are here.
qc0 <- read.table("mapPrecision.tab", as.is=T) qc0$X <- qc0[,3] / rowSums(qc0[,c(3,4)]) qc0$Y <- qc0[,3] / rowSums(qc0[,c(3,5)]) qc0$S <- (qc0$X + qc0$Y) / 2 qc <- ddply(qc0, .(V2), function(x) x[which.max(x$S),]) #one to one MB2 <- MB[-match(qc$V1,MB$V1),] MB2 <- MB2[MB2$V13>30 & MB2$V14<.1,] #73 MB2.qc <- qc0[match(MB2$V1,qc0$V1),] MB2.qc <- MB2.qc[-which(MB2.qc$X>.1),] #55 unique.MetaBAT <- nrow(MB2.qc) MGS <- read.table("SCG_MGS.txt", comment.char='-', as.is=T, header=F, skip=2) MB <- read.table("SCG_MetaBAT_Velvet.txt", as.is=T, header=F) qc <- cbind(qc, MB[match(sub("\\.fa","",qc$V1), MB$V1),13:15], MGS[match(qc$V2, MGS$V1),13:15]) colnames(qc)[9:14] <- c("MB.C","MB.P","MB.SC","MGS.C","MGS.P","MGS.SC") qc <- qc[qc$MB.C>30 & qc$MB.P<10,] #342 shared.both <- nrow(qc) library(grid) library(VennDiagram) pdf("Venn.pdf") grid.draw(draw.pairwise.venn(unique.MetaBAT+shared.both, nrow(MGS), shared.both, c("MetaBAT \n+ Reassembly", "MGS \nDraft Genomes"), fill = c("dodgerblue", "goldenrod1", "darkorange1", "seagreen3", "orchid3")[1:2], cex = rep(2,3), cat.col = c("dodgerblue", "goldenrod1", "darkorange1", "seagreen3", "orchid3")[1:2], cat.cex = 2, cat.pos = c(-10,10), cat.dist = rep(0.1, 2), margin=.2, scaled=FALSE)) dev.off() qc.out <- cbind(qc[,c('X','Y')],Size=rowSums(qc[,c(3,4)])/1e+6) library(ggplot2) pdf("ScatterPlot.pdf") p <- ggplot(qc.out, aes(X, Y, size=Size)) + theme_bw() p <- p + geom_point(shape=21, alpha=0.6, aes(size=Size), fill=gg_color_hue(2)[2]) + scale_size_continuous(range = c(2,10)) p <- p + theme(legend.position = c(.2,.7)); p$labels$size <- "Bin Size (mb)" p <- p + xlab("Shared Size / MetaBAT Bin Size") + ylab("Shared Size / MGS Genome Size") p <- p + ylim(c(0,1)) + xlim(c(0,1)) print(p) dev.off()
Updated