# Benchmark of Automated Metagenome Binning Software in Complex Metagenomes
This page is for MetaBAT 1. Check out the new tutorial for MetaBAT 2 in Best Binning Practices.
Check out the new CAMI Challenge benchmark here.
For a realistic benchmark of metagenome binning software, we constructed a dataset from 264 MetaHIT human gut metagenome samples (Accession #: ERP000108). We chose Canopy, CONCOCT, GroopM, and MaxBin as the alternative software to compare with MetaBAT. These tools are easy to use and identify genome bins automatically. Canopy is a scalable clustering algorithm used in Nielsen et al., 2014; here we used it as a contig binning tool. GroopM has a manual refinement step, which may improve binning significantly. MaxBin was selected as a representative of non-co-abundance binning tools, which use a single abundance profile; a comparison with MaxBin would be fairer if only one sample were used, but we assumed the availability of many samples.
This benchmark is a controlled one, in which it is assumed that the selected genomes (>5X mean coverage) are true members of the community and that there are no other members. The advantage of this setting is that we can measure binning performance accurately, as long as the assumption is not violated significantly. However, it might overestimate performance, especially if the community a researcher is dealing with is much more complex than this dataset. To get a glimpse of that case, we also organized another benchmark and a large-scale example, where single copy core genes were used as the gold standard for the evaluation.
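To make this evaluation scheme concrete, the sketch below shows one way per-bin precision and recall could be computed against the known source genomes. It is a minimal illustration only, not the code used for this benchmark (the actual evaluation is done by calcPerf/printPerf in benchmark.R, introduced later); the data frame `bins` and its column names are hypothetical.

```
# A minimal sketch, NOT the benchmark's own evaluation code.
# Assumes a hypothetical data frame `bins` with one row per binned contig and columns:
#   bin    - assigned bin ID
#   genome - true source genome of the contig (known in this controlled benchmark)
#   size   - contig length in bases
evalBins <- function(bins) {
  do.call(rbind, lapply(split(bins, bins$bin), function(b) {
    basesByGenome <- tapply(b$size, as.character(b$genome), sum)  # bases per source genome in the bin
    top <- which.max(basesByGenome)                               # the genome dominating the bin
    data.frame(
      bin = b$bin[1],
      genome = names(basesByGenome)[top],
      # precision: fraction of the bin's bases coming from the dominant genome
      precision = basesByGenome[top] / sum(b$size),
      # recall: fraction of the dominant genome's bases recovered in this bin
      recall = basesByGenome[top] / sum(bins$size[bins$genome == names(basesByGenome)[top]])
    )
  }))
}
```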
We welcome any ideas or opinions to improve this benchmark, and we appreciate any attempt to optimize the parameters of each tool to maximize its performance. For now, we stuck to the default settings of each tool.
## Summary of the benchmark results
| | MetaBAT^ | MetaBAT | Canopy | CONCOCT | GroopM | MaxBin |
|---|---|---|---|---|---|---|
| Number of Bins Identified (>200kb) | 246 | 340 | 230 | 235 | 445 | 260 |
| Number of Genomes Detected (Precision > .9 & Recall > .3) | 111 | 111 | 90 | 56 | 6 | 39 |
| Wall Time (16 cores; 32 hyper-threads) | 00:21:21 | 00:13:55 | 00:21:01 | 104:58:01 | 45:29:39 | 20:51:19 |
| Peak Memory Usage (for binning step) | 11.5G | 3.9G | 1.82G | 9.55G | 38G | 7.7G |

^ MetaBAT with ensemble binning (using the option `-B 20`)
## Procedures for the dataset construction
- After mapping all reads to bacterial genomes (finished + draft) from NCBI, we selected 290 genomes (out of 291 from the initial screening) having mean coverage > 5X (see genomes_table).
- Shredded the scaffolds of the selected genomes using a truncated exponential distribution with a minimum contig size of 2.5kb and 31 overlapping bases (see contig_histogram of contigs > 2.5kb; a sketch of this step follows this list).
- Mapped the reads from each library to the contigs to calculate library-specific contig coverage using BBMap.
- The dataset is available to download here.
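The shredding step above can be sketched as follows. This is an illustration only, not the script actually used to build the dataset: only the 2.5kb minimum and the 31-base overlap are stated above, so `meanLen` is a placeholder, and the scaffold sequence is assumed to be a plain character string.

```
# A minimal sketch of shredding one scaffold into overlapping contigs whose lengths
# follow a (left-)truncated exponential distribution (minimum 2.5kb, 31 bases of overlap).
# NOT the script used to build the benchmark dataset; `meanLen` is a placeholder value.
shredScaffold <- function(seq, minLen = 2500, meanLen = 10000, overlap = 31) {
  pieces <- character(0)
  start <- 1
  while (nchar(seq) - start + 1 >= minLen) {
    # draw a shred length from an exponential shifted so that lengths are >= minLen
    len <- minLen + round(rexp(1, rate = 1 / (meanLen - minLen)))
    end <- min(start + len - 1, nchar(seq))
    pieces <- c(pieces, substr(seq, start, end))
    if (end == nchar(seq)) break
    start <- end - overlap + 1   # consecutive shreds share `overlap` bases
  }
  # note: any trailing fragment shorter than minLen is simply dropped in this sketch
  pieces
}

# Example: shred a random 20kb "scaffold" and inspect the shred lengths
scaffold <- paste(sample(c("A","C","G","T"), 2e4, replace = TRUE), collapse = "")
summary(nchar(shredScaffold(scaffold)))
```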
In case you want to reproduce the depth file
- It is not necessary to regenerate the depth file to reproduce the results, but the BAM files are available to download here.
```
#!bash
#depth file for MetaBAT
jgi_summarize_bam_contig_depths --outputDepth depth.txt --pairedContigs paired.txt *.bam
#depth file for CONCOCT and Canopy
awk 'NR > 1 {for(x=1;x<=NF;x++) if(x == 1 || (x >= 4 && x % 2 == 0)) printf "%s", $x (x == NF || x == (NF-1) ? "\n":"\t")}' depth.txt > depth_concoct.txt
#depth file for MaxBin
cut -f1,3 depth.txt | tail -n+2 > depth_maxbin.txt
```
## Running MetaBAT (using version 0.24.1)
```
#!bash
#First, try sensitive mode for better sensitivity
metabat -i assembly.fa -a depth.txt -o bin1 --sensitive -l -v --saveTNF saved.tnf --saveDistance saved.gprob
#Try specific mode to improve specificity further; this time the binning will be much faster since it reuses saved calculations
metabat -i assembly.fa -a depth.txt -o bin2 --specific -l -v --saveTNF saved.tnf --saveDistance saved.gprob
#Try specific mode with paired data to improve sensitivity while minimizing the loss of specificity
metabat -i assembly.fa -a depth.txt -p paired.txt -o bin3 --specific -l -v --saveTNF saved.tnf --saveDistance saved.gprob
#All bins are available to download at http://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/results/
```
## Running MetaBAT with ensemble binning (using version 0.30.3)
```
#!bash
#recommended to use B >= 20
#use --seed for reproducibility
metabat -i assembly.fa -a depth.txt -o bin4 --sensitive -l -v --saveTNF saved.tnf --saveDistance saved.gprob -B 20
```
## Evaluating MetaBAT performance
### Statistics for running

- It took 0 hours 13 minutes 55 seconds using 16 cores (32 hyper-threads) to process 156,897 out of 195,601 contigs (>=2.5kb length & >=2 average coverage) in sensitive mode. Subsequent binning with either the specific or the specific-with-pair mode took less than 2 minutes in both cases, since MetaBAT reuses the saved distance file.
- 72.35% of the bases (834,931,147 out of 1,154,063,908) were binned in sensitive mode; 68.17% and 69.00% were binned in specific and specific-with-pair mode, respectively.
- Number of clusters (>=200kb) formed in sensitive mode: 339. The specific and specific-with-pair modes both formed 335 bins.
- Required memory was about 3.3GB; the specific-with-pair mode used at most 3.6GB.
- The detailed verbose log for the sensitive-mode run is here.
### Summary statistics and figures showing the binning performance
```
#The following are R commands (tested on Linux)
#Download the R file from the data directory (see above). It will try to download and install the required libraries.
source('http://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/benchmark.R')
res <- list(Sensitive=calcPerf("MetaBAT","bin1"), Specific=calcPerf("MetaBAT","bin2"), 'Specific & w/ Pair'=calcPerf("MetaBAT","bin3"))
```
Print out summary statistics
- The output table shows cumulative numbers. For instance, the 'Sensitive' mode generates 110 bins (>.7 precision & >.5 recall) or 22 bins (>.99 precision & >.8 recall); a sketch of how such a table is tallied follows this list.
- We use the result from the 'Sensitive' mode (the default mode) to compare against the other methods.
- In this scenario, recall can be interpreted as genome completeness; however, some contigs had very low coverage depth due to poor assembly quality and were ignored in binning.
- The recall therefore represents worst-case recovery performance, factoring in assembly quality. Pure binning performance could be measured by removing poor-quality contigs; however, they were kept for a fair comparison.
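As a rough illustration of how the cumulative table below is built (the actual counting is done by printPerf in benchmark.R and may differ), the sketch below tallies, for each precision/recall threshold pair, the number of bins exceeding both thresholds. The data frame `perf` is hypothetical, e.g. the output of the evalBins sketch above.

```
# A minimal sketch of tallying the cumulative counts shown below; NOT printPerf itself.
# Assumes a hypothetical data frame `perf` with columns `precision` and `recall`, one row per bin.
cumulativeCounts <- function(perf,
                             precCuts = c(.7, .8, .9, .95, .99),
                             recCuts  = c(.3, .4, .5, .6, .7, .8, .9, .95)) {
  counts <- sapply(recCuts, function(r)
              sapply(precCuts, function(p) sum(perf$precision > p & perf$recall > r)))
  dimnames(counts) <- list(Precision = precCuts, Recall = recCuts)
  counts  # each cell: number of bins exceeding both thresholds
}
```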
```
printPerf(res)
$Sensitive
          Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7   131  117  110  101   74   39    4    1
     0.8   123  110  105   97   72   38    4    1
     0.9   111  101   96   89   66   36    4    1
     0.95   92   84   80   74   55   31    4    1
     0.99   58   51   49   45   36   22    2    0

$Specific
          Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7   136  119  113  102   73   31    2    0
     0.8   125  109  104   95   69   30    2    0
     0.9   114  103   98   91   65   29    2    0
     0.95   99   90   86   80   57   27    2    0
     0.99   66   59   57   53   41   22    2    0

$`Specific & w/ Pair`
          Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7   136  121  113  104   76   35    5    0
     0.8   125  111  104   97   72   34    5    0
     0.9   112  103   96   92   67   33    5    0
     0.95   98   90   85   81   60   31    5    0
     0.99   66   60   57   54   43   25    5    0

$`Sensitive w/ Ensemble Binning`
          Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7   127  125  120  111   96   50   12    4
     0.8   122  120  116  108   94   50   12    4
     0.9   111  109  105   98   86   48   12    4
     0.95   96   96   92   87   76   43   11    3
     0.99   54   54   53   52   45   29    4    2
```
Plot figures
- Again, the number of genomes in the dataset is 290.
- F1 weights recall and precision equally, whereas F0.5 puts more weight on precision, which is a reasonable choice for metagenome binning (see reference); the F-beta formula is sketched below.
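For reference, the standard F-beta formula behind this choice is sketched below; how benchmark.R itself combines the scores is defined in that file, and the example numbers here are made up.

```
# Standard F-beta score: beta = 1 weights precision and recall equally,
# beta = 0.5 weights precision more heavily. Example values are illustrative only.
fbeta <- function(precision, recall, beta = 1) {
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}
fbeta(0.95, 0.60, beta = 1)    # F1   ~ 0.735
fbeta(0.95, 0.60, beta = 0.5)  # F0.5 ~ 0.851 (rewards the higher precision)
```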
pdf("Performance_By_Bin.pdf", width=8, height=8) plotPerf(res) dev.off()
## Running Canopy
```
#!bash
cc.bin -n 32 -i depth_concoct.txt -o bin_canopy -c prof_canopy --max_canopy_dist 0.1 --max_close_dist 0.4 --max_merge_dist 0.1 --min_step_dist 0.005 --max_num_canopy_walks 5 --stop_fraction 1 --canopy_size_stats_file stat
```
```
res <- list(Canopy=calcPerf("Canopy","bin_canopy","prof_canopy"))
printPerf(res)
$Canopy
          Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7   104   98   90   83   61   25    2    1
     0.8   101   95   88   81   59   24    2    1
     0.9    91   88   81   75   55   23    1    0
     0.95   88   85   78   72   52   22    1    0
     0.99   77   74   68   64   45   20    1    0
```
## Running CONCOCT (using version 0.4.0)
```
#!bash
concoct --composition_file assembly.fa --coverage_file depth_concoct.txt
```
```
res <- list(CONCOCT=calcPerf("CONCOCT"))
printPerf(res)
$CONCOCT
          Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7    62   62   62   62   60   43   17    6
     0.8    60   60   60   60   58   41   16    6
     0.9    56   56   56   56   54   38   15    6
     0.95   50   50   50   50   48   34   15    6
     0.99   36   36   36   36   34   23   11    4
```
## Running GroopM

```
#!bash
#calculate depth file for GroopM
groopm parse -t 32 database.gm assembly-filtered.fa *.bam
```
```
#!bash
#binning
groopm core -b 200000 database.gm
#skipped refining stage since it is not automated
#groopm refine database.gm
#recruiting unbinned contigs
groopm recruit database.gm
#output
groopm extract --prefix bin_groopm database.gm assembly-filtered.fa
```
```
res <- list(GroopM=calcPerf("GroopM","bin_groopm"))
printPerf(res)
$GroopM
          Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7    20   19   15   10    5    4    0    0
     0.8    12   12   10    7    3    3    0    0
     0.9     6    6    4    3    2    2    0    0
     0.95    4    4    2    2    1    1    0    0
     0.99    1    1    1    1    1    1    0    0

#In case of skipping the recruiting step
$GroopM
          Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7    21   20   16   11    4    4    0    0
     0.8    13   13   11    8    3    3    0    0
     0.9     6    6    4    3    2    2    0    0
     0.95    4    4    2    2    1    1    0    0
     0.99    1    1    1    1    1    1    0    0
```
## Running MaxBin (using version 1.4.1)
```
#!bash
run_MaxBin.pl -contig assembly.fa -out MaxBin.out -abund depth_maxbin.txt -thread 32
```
```
res <- list(MaxBin=calcPerf("MaxBin","MaxBin.out"))
printPerf(res)
$MaxBin
          Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7    75   69   58   44   24    7    2    0
     0.8    57   55   48   39   23    7    2    0
     0.9    39   37   34   29   19    7    2    0
     0.95   26   26   26   22   16    5    0    0
     0.99    6    6    6    5    5    2    0    0
```
```
res <- list(MetaBAT=calcPerf("MetaBAT","bin1"), Canopy=calcPerf("Canopy","bin_canopy","prof_canopy"), CONCOCT=calcPerf("CONCOCT"), MaxBin=calcPerf("MaxBin","MaxBin.out"), GroopM=calcPerf("GroopM","bin_groopm"))
printPerf(res)
$MetaBAT
          Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7   131  117  110  101   74   39    4    1
     0.8   123  110  105   97   72   38    4    1
     0.9   111  101   96   89   66   36    4    1
     0.95   92   84   80   74   55   31    4    1
     0.99   58   51   49   45   36   22    2    0

$Canopy
          Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7   104   98   90   83   61   25    2    1
     0.8   101   95   88   81   59   24    2    1
     0.9    91   88   81   75   55   23    1    0
     0.95   88   85   78   72   52   22    1    0
     0.99   77   74   68   64   45   20    1    0

$CONCOCT
          Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7    62   62   62   62   60   43   17    6
     0.8    60   60   60   60   58   41   16    6
     0.9    56   56   56   56   54   38   15    6
     0.95   50   50   50   50   48   34   15    6
     0.99   36   36   36   36   34   23   11    4

$MaxBin
          Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7    75   69   58   44   24    7    2    0
     0.8    57   55   48   39   23    7    2    0
     0.9    39   37   34   29   19    7    2    0
     0.95   26   26   26   22   16    5    0    0
     0.99    6    6    6    5    5    2    0    0

$GroopM
          Recall
Precision  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
     0.7    20   19   15   10    5    4    0    0
     0.8    12   12   10    7    3    3    0    0
     0.9     6    6    4    3    2    2    0    0
     0.95    4    4    2    2    1    1    0    0
     0.99    1    1    1    1    1    1    0    0
```
pdf("Performance_By_Bin_All_1.pdf", width=6, height=4) plotPerf3(res, prec=.9) dev.off() pdf("Performance_By_Bin_All_2.pdf") plotPerfVenn(res, sel=names(res)[1:4]) dev.off()

pdf("Performance_By_Bin_All.pdf", width=8, height=8) plotPerf(res, xlim=300) dev.off()

## Conclusions
- Note that this benchmark is intended to test performance on complex metagenomes, so some of the compared methods may not have been designed for them. MetaBAT is specifically designed for complex metagenomes.
- The quality of real-world metagenome assemblies is often quite poor, so the reported performance metrics might be overestimates. A more realistic measure would be obtained by including novel contigs.
- MetaBAT outperforms Canopy, GroopM, and MaxBin in terms of both recall and precision.
- It is interesting that MaxBin outperforms GroopM in spite of the fact that MaxBin does not use co-abundance information.
- CONCOCT showed the best recall for some bins (it had more bins with recall > .8); however, its precision dropped quickly for the remaining bins, so its overall performance fell behind MetaBAT.
- In contrast, MetaBAT achieved better precision for most bins at the cost of somewhat lower recall.
- The computing times of MetaBAT and Canopy were significantly shorter (100~500X) than those of the other tools.
- Canopy consumed the least memory; MetaBAT consumed 3~10X less memory than the remaining tools.
- MetaBAT identified the most genomes (>90% precision & >30% completeness).
- All binning methods complement each other; however, MetaBAT identified 111 of the 133 genomes detected by any of the 5 tools.