Benchmark of Automated Metagenome Binning Software Using Real Assembly
Check out the new tutorial Best Binning Practices for better usage of MetaBAT.
Check out the new CAMI Challenge benchmark here.
As an alternative benchmark of metagenome binning software, we used the same MetaHIT human gut metagenome data (Accession #: ERP000108) as in the controlled benchmark; a large-scale example is also worth checking. We compared MetaBAT against Canopy, CONCOCT, GroopM, and MaxBin. We adopted single copy core genes as the gold standard for measuring recall (completeness) and precision (the inverse of contamination); for this purpose we used CheckM.
Some of the pitfalls of this benchmark are as follows:

- Single copy core genes are one way to measure the completeness of a bin and its degree of contamination, but they are not perfect; they often over- or underestimate the truth for many reasons, including genome diversity and poor assembly quality.
- Some methods in this benchmark already incorporate single copy gene information, which may bias an evaluation based on the same information.
Using a 1.5 kb contig size cutoff (118,025 contigs)
- All results are available to download.
| | MetaBAT | Canopy** | CONCOCT | MaxBin | GroopM^^ |
|---|---|---|---|---|---|
| Number of Bins Identified (>200 kb) | 234 | 223 | 260 | 168 | 335 |
| Number of Genomes Detected (Precision > .9 & Recall > .3) | 130 | 96 | 64 | 39 | 28 |
| Wall Time (hh:mm:ss; 16 cores, 32 hyper-threads) | 00:03:36 | 00:02:31 | 82:19:53 | 06:49:39 | 12:19:12 |
| Peak Memory Usage (binning step) | 3.0 GB | 1.6 GB | 7 GB | 5.8 GB | 6.3 GB |
**Canopy only uses the abundance table as input; had it read and written sequence data like the other tools, it would have taken more time and memory.
^^GroopM without the recruiting step (chimeric bins removed).
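The wall time and peak memory figures in the table were measured externally; the benchmark text does not state the tool used. As one hedged example, GNU time can capture both numbers for any of the binning commands below:

```
#!bash
# Illustrative only (not from the original workflow): GNU time reports
# elapsed wall-clock time and maximum resident set size (peak memory).
/usr/bin/time -v metabat -i assembly.fa -a depth.txt -o bin 2> metabat.time.log
grep -E "Elapsed \(wall clock\)|Maximum resident" metabat.time.log
```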
Preprocessing
- We generated a de novo assembly using Ray Meta.
- BAM files for each library were produced using BBMap (see the sketch after this list).
- All files are available to download here.
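The exact BBMap command line is not given above, so the following is only a minimal sketch of how the per-library BAM files could have been produced, assuming paired-end FASTQ input per library; the library names and thread counts are placeholders.

```
#!bash
# Hypothetical sketch: map each library against the assembly with BBMap and
# produce a sorted, indexed BAM. File names below are placeholders, not the
# original data set.
for lib in lib1 lib2 lib3; do
    bbmap.sh ref=assembly.fa in=${lib}_R1.fq.gz in2=${lib}_R2.fq.gz out=${lib}.sam threads=32
    samtools sort -@ 8 -o ${lib}.bam ${lib}.sam
    samtools index ${lib}.bam
    rm ${lib}.sam
done
```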
Generating depth files
- It took 10 minutes using 32 hyper-threads, with a peak memory consumption of 8 GB. Here is the log.
- Files are available here.
```
#!bash
# depth file for MetaBAT (columns: contigName, contigLen, totalAvgDepth,
# then a mean-depth column and a depth-variance column per BAM file)
jgi_summarize_bam_contig_depths --outputDepth depth.txt --pairedContigs paired.txt *.bam
# depth file for CONCOCT: keep the contig name (column 1) and the per-sample
# mean-depth columns (every even column from 4 on), dropping the variance columns
awk 'NR > 1 {for(x=1;x<=NF;x++) if(x == 1 || (x >= 4 && x % 2 == 0)) printf "%s", $x (x == NF || x == (NF-1) ? "\n":"\t")}' depth.txt > depth_concoct.txt
# depth file for MaxBin: contig name and total average depth, without the header
cut -f1,3 depth.txt | tail -n+2 > depth_maxbin.txt
```
Running MetaBAT (version 0.25.3)

```
#!bash
# Prepare proper folder structure
mkdir -p ./1.5kb/MetaBAT
# Run MetaBAT
metabat -i assembly.fa -a depth.txt -o ./1.5kb/MetaBAT/bin --minContigByCorr 1500 --saveTNF saved.tnf --saveDistance saved.dist --verbose
# Run CheckM
checkm lineage_wf -f ./1.5kb/MetaBAT/SCG.txt -t 8 -x fa ./1.5kb/MetaBAT ./1.5kb/MetaBAT/SCG
```
- Recall and precision here correspond to completeness and 1 − contamination in the CheckM table, respectively.
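For illustration only, this conversion can be scripted against a tab-separated CheckM table (written with `--tab_table`, which the commands here do not use); the column positions below are an assumption and should be checked against the header of your CheckM version.

```
#!bash
# Hedged sketch: per-bin recall and precision from a tab-separated CheckM
# table. Assumes Completeness is column 12 and Contamination column 13;
# verify these positions for your CheckM version.
awk -F'\t' 'NR > 1 {printf "%s\trecall=%.3f\tprecision=%.3f\n", $1, $12/100, 1-$13/100}' SCG.txt
```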
Running MetaBAT with ensemble binning (version 0.30.3)

```
#!bash
metabat -i assembly.fa -a depth.txt -o binE --saveTNF saved.tnf --saveDistance saved.dist --verbose -B 20
```
Running CONCOCT (version 0.4.0)

```
#!bash
mkdir -p ./1.5kb/CONCOCT
cd 1.5kb/CONCOCT
concoct --composition_file assembly.fa --coverage_file depth_concoct.txt --length_threshold 1500
```
system("mkdir -p ./1.5kb/CONCOCT/bins/ ./1.5kb/CONCOCT/small_bins") cls <- read.csv("./1.5kb/CONCOCT/clustering_gt2500.csv", header=F, as.is=T) invisible(foreach(i=unique(cls$V2)) %do% { write.table(cls$V1[cls$V2==i], file=sprintf("./1.5kb/CONCOCT/bins/%d.lst", i), col.names=F, row.names=F, quote=F) system(sprintf("./screen_list.pl ./1.5kb/CONCOCT/bins/%d.lst assembly.fa keep > ./1.5kb/CONCOCT/bins/%d.fa", i,i)) bin.size <- as.numeric(system(sprintf("./sizefasta.pl ./1.5kb/CONCOCT/bins/%d.fa", i), intern=T)) if(bin.size < 200000) system(sprintf("mv ./1.5kb/CONCOCT/bins/%d.fa ./1.5kb/CONCOCT/small_bins/", i)) })
```
#!bash
checkm lineage_wf -f ./1.5kb/CONCOCT/SCG.txt -t 8 -x fa ./1.5kb/CONCOCT/bins ./1.5kb/CONCOCT/bins/SCG
```
Running GroopM (version 0.3.0)

- The parse step with 32 threads consumed more than 120 GB of memory, so the number of threads was reduced to 16.
- We tried two modes, with and without the recruiting step.
- Overall, GroopM performed poorly in terms of precision (large bins had a significant amount of contamination).
```
#!bash
# calculate depth file for GroopM
groopm parse -t 16 database.gm assembly.fa *.bam
# core binning
groopm core -b 200000 -c 1500 database.gm
# skipped refining stage since it is not automated
#groopm refine database.gm
# output core bins
groopm extract -t 32 --prefix ./1.5kb/GroopM/core_only/bin_groopm ./1.5kb/GroopM/database.gm assembly.fa
# recruiting unbinned contigs
groopm recruit database.gm
# output
groopm extract -t 32 --prefix ./1.5kb/GroopM/recruited/bin_groopm ./1.5kb/GroopM/database.gm assembly.fa
checkm lineage_wf -f ./1.5kb/GroopM/core_only/SCG.txt -t 32 -x fna ./1.5kb/GroopM/core_only ./1.5kb/GroopM/core_only/SCG
checkm lineage_wf -f ./1.5kb/GroopM/recruited/SCG.txt -t 32 -x fna ./1.5kb/GroopM/recruited ./1.5kb/GroopM/recruited/SCG
```
Running MaxBin (version 1.4.1)

```
#!bash
run_MaxBin.pl -contig assembly.fa -out ./1.5kb/MaxBin/MaxBin.out -abund depth_maxbin.txt -thread 32 -min_contig_length 1500
checkm lineage_wf -f ./1.5kb/MaxBin/SCG.txt -t 8 -x fasta ./1.5kb/MaxBin ./1.5kb/MaxBin/SCG
```
Comparing all methods together
```
source('http://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/benchmark.R')
res <- list(MetaBAT=calcPerfBySCG("./1.5kb/MetaBAT/Sensitive/SCG.txt"),
            Canopy=calcPerfBySCG("./1.5kb/Canopy/SCG.txt"),
            CONCOCT=calcPerfBySCG("./1.5kb/CONCOCT/SCG.txt"),
            MaxBin=calcPerfBySCG("./1.5kb/MaxBin/SCG.txt"),
            GroopM=calcPerfBySCG("./1.5kb/GroopM/core_only/SCG.txt"))
printPerf(res)
```

Each entry below counts the bins that meet each recall (columns) and precision (rows) threshold; the output also includes the ensemble binning result from the v0.30.3 run above.

```
$`MetaBAT with ensemble binning`
         Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7   147  139  124  108   86   57   31   14
     0.8   142  134  119  103   81   52   26   11
     0.9   111  103   88   72   51   32   13    4
     0.95   69   61   48   35   20   10    7    3
     0.99   10    8    2    1    1    0    0    0

$MetaBAT
         Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7   146  127  101   76   57   37   15    3
     0.8   145  126  100   75   56   36   14    2
     0.9   130  111   86   61   44   28   12    0
     0.95   91   73   49   26   14    8    7    0
     0.99   17    8    4    1    0    0    0    0

$Canopy
         Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7   129  111   95   77   67   36   18    7
     0.8   124  106   90   72   62   31   13    6
     0.9    96   78   62   44   34   12    3    1
     0.95   64   47   31   17   11    2    1    1
     0.99   12    2    1    0    0    0    0    0

$CONCOCT
         Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7   107   95   78   65   53   34   23   10
     0.8    94   82   66   54   44   26   15    6
     0.9    64   52   39   28   19    7    2    0
     0.95   43   33   23   14    9    2    1    0
     0.99    8    3    1    0    0    0    0    0

$MaxBin
         Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7    91   74   54   31   20    4    3    1
     0.8    69   54   37   21   14    4    3    1
     0.9    39   26   18   12    8    2    2    0
     0.95   17   11    8    7    6    2    2    0
     0.99    1    0    0    0    0    0    0    0

$GroopM
         Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7    47   37   31   23   19   10    4    2
     0.8    39   29   23   17   14    8    3    1
     0.9    28   19   16   12    9    4    2    0
     0.95   18   13   10    7    6    3    2    0
     0.99    2    1    1    0    0    0    0    0
```
pdf("Performance_By_SCG_All_1.5kb_2.pdf", width=12, height=4) plotPerf3(res) dev.off()
```
# First download http://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/MetaHIT/results/Results_1.5kb.tar.gz
# and extract it before running the code below.
ctgList <- getCtgList(res)
ctgSizes <- read.table("assembly.fa.sizes", as.is=T)
pdf("Performance_By_SCG_All_1.5kb_3.pdf")
plotPerfVennBySCG(res, ctgList, ctgSizes, minRec=.3, minPrec=.9)
dev.off()
pdf("Performance_By_SCG_All_1.5kb_4.pdf")
plotPerfVennBySCG(res, ctgList, ctgSizes, minRec=.3, minPrec=.95)
dev.off()
```