
Benchmark of Automated Metagenome Binning Software in Complex Metagenomes

This page covers MetaBAT 1. Check out the new tutorial for MetaBAT 2 in Best Binning Practices.

Also see the new CAMI Challenge benchmark here.

For a realistic benchmark of metagenome binning software, we constructed a dataset from 264 MetaHIT human gut metagenome samples (Accession #: ERP000108). We chose Canopy, CONCOCT, GroopM, and MaxBin as the alternative software to compare with MetaBAT; these tools are easy to use and identify genome bins automatically. Canopy is a scalable clustering algorithm used in Nielsen et al., 2014; here we used it as a contig binning tool. GroopM has a manual refinement step, which may improve binning significantly. MaxBin was selected as a representative of non-co-abundance binning tools, which utilize a single abundance profile; a comparison with MaxBin might only be fair when a single sample is used, but here we assumed the availability of many samples.

This benchmark is a controlled one: it assumes that the selected genomes (>5X mean coverage) are true members of the community and that the community contains no other members. The advantage of this setting is that we can measure binning performance accurately, provided the assumption is not violated significantly. However, it might overestimate performance, especially when the community a researcher is dealing with is much more complex than this dataset. To get a glimpse of that case, we also organized another benchmark and a large-scale example, where single-copy core genes were used as the gold standard for evaluation.

We welcome any ideas or opinions to improve this benchmark and appreciate any attempt to optimize the parameters of each software to maximize its performance. For now, we stuck to the default settings of each software.

Summary of the benchmark results

                                                           MetaBAT^  MetaBAT   Canopy  CONCOCT     GroopM    MaxBin
Number of Bins Identified (>200kb)                              246      340      230      235        445       260
Number of Genomes Detected (Precision > .9 & Recall > .3)       111      111       90       56          6        39
Wall Time (16 cores; 32 hyper-threads)                     00:21:21 00:13:55 00:21:01 104:58:01  45:29:39  20:51:19
Peak Memory Usage (for binning step)                          11.5G     3.9G    1.82G    9.55G        38G      7.7G

^ MetaBAT with ensemble binning (using the option -B 20)

Procedures for the dataset construction

  1. After mapping all reads to the bacterial genomes (finished + draft) from NCBI, we selected 290 genomes (out of 291 from the initial screening) having mean coverage > 5X (See genomes_table).
  2. Shredded the scaffolds of the selected genomes using a truncated exponential distribution with a minimum contig size of 2.5kb and 31 overlapping bases between consecutive fragments (See contig_histogram of contigs > 2.5kb).
  3. Mapped the reads from each library to calculate library-specific contig coverage using BBMap.
  4. The dataset is available to download here.
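The shredding scheme in step 2 can be illustrated with a minimal shell sketch. The fragment length here is fixed and the sequence is a toy one (the real procedure samples lengths from a truncated exponential distribution with a 2.5kb minimum); the point being demonstrated is only the 31-base overlap between consecutive fragments:

```shell
#!/usr/bin/env bash
# Hypothetical sketch (not the actual shredding script): a fixed
# fragment length of 40 on a toy 100-base sequence keeps the
# 31-base overlap between consecutive fragments easy to see.
s=$(yes ACGT | head -n 25 | tr -d '\n')   # toy 100-base "scaffold"
len=40; overlap=31
step=$((len - overlap))                   # consecutive starts differ by 9 bases

frag1=${s:0:len}                          # bases 1-40
frag2=${s:step:len}                       # bases 10-49
echo "tail of frag1: ${frag1:step}"       # last 31 bases of frag1
echo "head of frag2: ${frag2:0:overlap}"  # first 31 bases of frag2 (identical)
```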

In case you want to reproduce the depth file

  • It is not necessary to regenerate the depth file to reproduce the results, but the BAM files are available to download here.
#depth file for MetaBAT
jgi_summarize_bam_contig_depths --outputDepth depth.txt --pairedContigs paired.txt *.bam

#depth file for CONCOCT and Canopy
awk 'NR > 1 {for(x=1;x<=NF;x++) if(x == 1 || (x >= 4 && x % 2 == 0)) printf "%s", $x (x == NF || x == (NF-1) ? "\n":"\t")}' depth.txt > depth_concoct.txt

#depth file for MaxBin
cut -f1,3 depth.txt | tail -n+2 > depth_maxbin.txt
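The two commands above only select columns from depth.txt. As a sanity check, here is a mock run on a two-library depth file; the assumed column layout (contigName, contigLen, totalAvgDepth, then one depth/variance column pair per BAM file) is what jgi_summarize_bam_contig_depths emits by default, to the best of our knowledge:

```shell
#!/usr/bin/env bash
# Mock depth file with two BAM libraries; columns are:
# contigName contigLen totalAvgDepth s1-depth s1-var s2-depth s2-var
printf 'contigName\tcontigLen\ttotalAvgDepth\ts1.bam\ts1.bam-var\ts2.bam\ts2.bam-var\n' > depth.txt
printf 'c1\t3000\t12.5\t10.0\t1.1\t15.0\t2.0\n' >> depth.txt
printf 'c2\t5000\t8.0\t6.0\t0.5\t10.0\t1.5\n'   >> depth.txt

# MaxBin wants contig name + total average depth, header dropped
cut -f1,3 depth.txt | tail -n+2 > depth_maxbin.txt

# CONCOCT/Canopy want contig name + the per-library depth columns
# (the even-numbered columns from 4 on), skipping the variance columns
awk 'NR > 1 {for(x=1;x<=NF;x++) if(x == 1 || (x >= 4 && x % 2 == 0)) printf "%s", $x (x == NF || x == (NF-1) ? "\n":"\t")}' depth.txt > depth_concoct.txt

cat depth_maxbin.txt    # each contig with its totalAvgDepth
cat depth_concoct.txt   # each contig with the two per-library depths
```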

Running MetaBAT (using version 0.24.1)

#First, try sensitive mode for better sensitivity
metabat -i assembly.fa -a depth.txt -o bin1 --sensitive -l -v --saveTNF saved.tnf --saveDistance saved.gprob

#Then try specific mode to improve specificity further; this time binning will be much faster since it reuses the saved calculations
metabat -i assembly.fa -a depth.txt -o bin2 --specific -l -v --saveTNF saved.tnf --saveDistance saved.gprob

#Try specific mode with paired data to improve sensitivity while minimizing the loss of specificity
metabat -i assembly.fa -a depth.txt -p paired.txt -o bin3 --specific -l -v --saveTNF saved.tnf --saveDistance saved.gprob

#All bins are available to download
http://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/results/

Running MetaBAT with ensemble binning (using version 0.30.3)

#recommended to use B >= 20
#use --seed for reproducibility
metabat -i assembly.fa -a depth.txt -o bin4 --sensitive -l -v --saveTNF saved.tnf --saveDistance saved.gprob -B 20

Evaluating MetaBAT performance

Statistics for running

  • It took 0 hours 13 minutes 55 seconds using 16 cores (32 hyper-threads) to process 156,897 out of 195,601 contigs (>=2.5kb length & >=2 average coverage) in sensitive mode. Subsequent binning in either specific or specific-with-pair mode took less than 2 minutes in both cases, since MetaBAT reuses the saved distance file.
  • 72.35% of bases (834,931,147 out of 1,154,063,908) were binned. 68.17% and 69.00% were binned in specific and specific-with-pair modes, respectively.
  • Number of clusters (>= 200kb) formed in sensitive mode: 339. Specific and specific-with-pair modes both formed 335 bins.
  • Required memory was about 3.3GB; specific-with-pair mode used at most 3.6GB.
  • Detailed verbose running messages for sensitive mode are here.

Summary statistics and figures showing the binning performance

#The following are R commands (tested on Linux)
#Download the R file from the data directory (see above). It will try to download and install the required libraries.
source('http://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/benchmark.R')

res <- list(Sensitive=calcPerf("MetaBAT","bin1"), Specific=calcPerf("MetaBAT","bin2"), 'Specific & w/ Pair'=calcPerf("MetaBAT","bin3"))

Print out summary statistics

  • The output table shows cumulative numbers. For instance, the 'sensitive' mode generates 110 bins (>.7 precision & >.5 recall) or 22 bins (>.99 precision & >.8 recall).
  • We use the result from 'Sensitive' mode (the default) to compare against the other methods.
  • In this scenario, recall can be considered genome completeness; however, some contigs had very low coverage depths due to poor assembly quality and were ignored in binning.
  • So the recall represents worst-case recovery performance, factoring in assembly quality. Pure binning performance could be measured by removing poor-quality contigs; however, they were included for fair comparison.
printPerf(res)

$Sensitive
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  131 117 110 101  74  39   4    1
     0.8  123 110 105  97  72  38   4    1
     0.9  111 101  96  89  66  36   4    1
     0.95  92  84  80  74  55  31   4    1
     0.99  58  51  49  45  36  22   2    0

$Specific
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  136 119 113 102  73  31   2    0
     0.8  125 109 104  95  69  30   2    0
     0.9  114 103  98  91  65  29   2    0
     0.95  99  90  86  80  57  27   2    0
     0.99  66  59  57  53  41  22   2    0

$`Specific & w/ Pair`
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  136 121 113 104  76  35   5    0
     0.8  125 111 104  97  72  34   5    0
     0.9  112 103  96  92  67  33   5    0
     0.95  98  90  85  81  60  31   5    0
     0.99  66  60  57  54  43  25   5    0

$`Sensitive w/ Ensemble Binning`
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  127 125 120 111  96  50  12    4
     0.8  122 120 116 108  94  50  12    4
     0.9  111 109 105  98  86  48  12    4
     0.95  96  96  92  87  76  43  11    3
     0.99  54  54  53  52  45  29   4    2
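The tables above are cumulative: a bin counted at a strict threshold pair is also counted at every looser pair. The counting behind a single cell can be sketched with awk on a hypothetical list of per-bin precision/recall scores:

```shell
#!/usr/bin/env bash
# Hypothetical per-bin scores: precision, then recall, one bin per line
cat > bins.txt <<'EOF'
0.99 0.85
0.92 0.40
0.95 0.31
0.60 0.90
0.91 0.25
EOF

# Count bins clearing precision > .9 AND recall > .3 (one table cell);
# any bin counted here is also counted in every looser cell.
count=$(awk '$1 > 0.9 && $2 > 0.3' bins.txt | wc -l)
echo "$count"   # the first three bins qualify
```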

Plot figures

  • Again, the number of genomes in the dataset is 290.
  • F1 weighs recall and precision equally, while F0.5 puts more weight on precision, which is a reasonable choice in metagenome binning (See reference).
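As a quick numeric illustration of the F1 vs. F0.5 distinction (with made-up precision/recall values of our choosing), a generic F-beta score can be computed in shell via awk; beta = 1 gives F1 and beta = 0.5 gives F0.5:

```shell
#!/usr/bin/env bash
# F-beta score: Fb = (1 + b^2) * P * R / (b^2 * P + R).
# b = 1 weighs precision and recall equally; b = 0.5 weighs precision more.
fbeta() {  # usage: fbeta beta precision recall
  awk -v b="$1" -v p="$2" -v r="$3" \
      'BEGIN { printf "%.3f\n", (1 + b*b) * p * r / (b*b * p + r) }'
}

# Hypothetical bin with high precision but modest recall:
fbeta 1   0.95 0.50   # F1
fbeta 0.5 0.95 0.50   # F0.5 rewards the high precision more
```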
pdf("Performance_By_Bin.pdf", width=8, height=8)
plotPerf(res)
dev.off()

Performance_By_Bin.png

Running Canopy

cc.bin -n 32 -i depth_concoct.txt -o bin_canopy -c prof_canopy --max_canopy_dist 0.1 --max_close_dist 0.4 --max_merge_dist 0.1 --min_step_dist 0.005 --max_num_canopy_walks 5 --stop_fraction 1 --canopy_size_stats_file stat

Information for running Canopy

  • Canopy is very fast and memory-efficient but not as user-friendly as the others.
  • It requires additional steps, including filtering, to obtain the final genome bins.
  • The filter assumes the availability of at least 3 samples.
res <- list(Canopy=calcPerf("Canopy","bin_canopy","prof_canopy"))
printPerf(res)

$Canopy
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  104  98  90  83  61  25   2    1
     0.8  101  95  88  81  59  24   2    1
     0.9   91  88  81  75  55  23   1    0
     0.95  88  85  78  72  52  22   1    0
     0.99  77  74  68  64  45  20   1    0

Running CONCOCT (using version 0.4.0)

concoct --composition_file assembly.fa --coverage_file depth_concoct.txt

Information for running CONCOCT

  • It seems that CONCOCT uses only 10 threads, regardless of how many are available on the machine.
  • Binning result is available here.
res <- list(CONCOCT=calcPerf("CONCOCT"))
printPerf(res)

$CONCOCT
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   62  62  62  62  60  43  17    6
     0.8   60  60  60  60  58  41  16    6
     0.9   56  56  56  56  54  38  15    6
     0.95  50  50  50  50  48  34  15    6
     0.99  36  36  36  36  34  23  11    4

Running GroopM (using version 0.3.0)

If you want to reproduce the depth file generation

#calculate depth file for GroopM
groopm parse -t 32 database.gm assembly-filtered.fa *.bam

The following is just the binning procedure.

  • The database.gm file can be downloaded here.
#binning
groopm core -b 200000 database.gm

#skipped refining stage since it is not automated
#groopm refine database.gm

#recruiting unbinned contigs
groopm recruit database.gm

#output
groopm extract --prefix bin_groopm database.gm assembly-filtered.fa

Information for running GroopM

  • Stretched N's had to be removed before the 'parse' command would run successfully (the filtered assembly is saved as assembly-filtered.fa).
  • Depth calculation can be multi-threaded; it took 34 minutes using 32 threads. See log.
  • We considered bins greater than 200kb, the cut-off used for the other tools.
  • The core binning procedure seems single-threaded; it took 32 hours and 37 minutes with 37GB peak memory consumption. See log.
  • Note that the refining stage was skipped; we are not aware of its potential effect.
  • The recruiting step (which took 12 hours and 42 minutes) may be better skipped, since 98.25% of contigs are already binned after the core step. The results also showed that this step worsened precision a little.
  • The extract step took about 10 minutes and generated 433 bins (12 chimeric bins were removed).
  • The identified bins are available here.
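On the stretched-N issue mentioned in the first bullet: we do not know the exact filter used to produce assembly-filtered.fa, but one minimal sketch, assuming single-line FASTA records and a hypothetical threshold of 10 consecutive N's, is:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: delete runs of 10+ N's from sequence lines of a
# single-line-per-record FASTA. The actual preprocessing behind
# assembly-filtered.fa may differ (e.g., splitting contigs at N runs).
printf '>c1\nACGTNNNNNNNNNNNNACGT\n>c2\nACGTNNACGT\n' > toy.fa
sed '/^>/!s/N\{10,\}//g' toy.fa > toy-filtered.fa   # skip header lines
cat toy-filtered.fa   # c1 loses its 12-N run; c2's short NN run survives
```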
res <- list(GroopM=calcPerf("GroopM","bin_groopm"))
printPerf(res)

$GroopM
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   20  19  15  10   5   4   0    0
     0.8   12  12  10   7   3   3   0    0
     0.9    6   6   4   3   2   2   0    0
     0.95   4   4   2   2   1   1   0    0
     0.99   1   1   1   1   1   1   0    0

#In case of skipping recruiting step
$GroopM
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   21  20  16  11   4   4   0    0
     0.8   13  13  11   8   3   3   0    0
     0.9    6   6   4   3   2   2   0    0
     0.95   4   4   2   2   1   1   0    0
     0.99   1   1   1   1   1   1   0    0

Running MaxBin (using version 1.4.1)

run_MaxBin.pl -contig assembly.fa -out MaxBin.out -abund depth_maxbin.txt -thread 32

Information for running MaxBin

  • MaxBin uses a single abundance profile, so depth_maxbin.txt represents the combined coverage from all libraries.
  • Here is the log.
  • Identified bins are available here.
res <- list(MaxBin=calcPerf("MaxBin","MaxBin.out"))
printPerf(res)

$MaxBin
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   75  69  58  44  24   7   2    0
     0.8   57  55  48  39  23   7   2    0
     0.9   39  37  34  29  19   7   2    0
     0.95  26  26  26  22  16   5   0    0
     0.99   6   6   6   5   5   2   0    0

Comparing all methods together

res <- list(MetaBAT=calcPerf("MetaBAT","bin1"), Canopy=calcPerf("Canopy","bin_canopy","prof_canopy"), CONCOCT=calcPerf("CONCOCT"), MaxBin=calcPerf("MaxBin","MaxBin.out"), GroopM=calcPerf("GroopM","bin_groopm"))
printPerf(res)

$MetaBAT
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  131 117 110 101  74  39   4    1
     0.8  123 110 105  97  72  38   4    1
     0.9  111 101  96  89  66  36   4    1
     0.95  92  84  80  74  55  31   4    1
     0.99  58  51  49  45  36  22   2    0

$Canopy
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  104  98  90  83  61  25   2    1
     0.8  101  95  88  81  59  24   2    1
     0.9   91  88  81  75  55  23   1    0
     0.95  88  85  78  72  52  22   1    0
     0.99  77  74  68  64  45  20   1    0

$CONCOCT
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   62  62  62  62  60  43  17    6
     0.8   60  60  60  60  58  41  16    6
     0.9   56  56  56  56  54  38  15    6
     0.95  50  50  50  50  48  34  15    6
     0.99  36  36  36  36  34  23  11    4

$MaxBin
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   75  69  58  44  24   7   2    0
     0.8   57  55  48  39  23   7   2    0
     0.9   39  37  34  29  19   7   2    0
     0.95  26  26  26  22  16   5   0    0
     0.99   6   6   6   5   5   2   0    0

$GroopM
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   20  19  15  10   5   4   0    0
     0.8   12  12  10   7   3   3   0    0
     0.9    6   6   4   3   2   2   0    0
     0.95   4   4   2   2   1   1   0    0
     0.99   1   1   1   1   1   1   0    0
pdf("Performance_By_Bin_All_1.pdf", width=6, height=4)
plotPerf3(res, prec=.9)
dev.off()

pdf("Performance_By_Bin_All_2.pdf")
plotPerfVenn(res, sel=names(res)[1:4])
dev.off()

Figure 3.png

pdf("Performance_By_Bin_All.pdf", width=8, height=8)
plotPerf(res, xlim=300)
dev.off()

Performance_By_Bin_All.png

Conclusions

  • Note that this benchmark is intended to test performance on complex metagenomes, so some of the compared methods might not be designed for them. MetaBAT is specifically designed for complex metagenomes.
  • The quality of real-world metagenome assemblies is often quite poor, so the reported performance metrics might be overestimated. A more realistic measure would be obtained by including novel contigs.
  • MetaBAT outperforms Canopy, GroopM, and MaxBin in terms of both recall and precision.
  • It is interesting that MaxBin outperforms GroopM despite the fact that MaxBin does not use co-abundance information.
  • CONCOCT showed the best recall on some bins (more bins having recall >.8); however, its precision dropped quickly for the remaining bins, so its overall performance was surpassed by MetaBAT.
  • In contrast, MetaBAT achieved better precision on most bins at the cost of some recall.
  • The computing times of MetaBAT and Canopy were significantly shorter (100~500X) than the rest.
  • Canopy consumed the least memory; MetaBAT consumed 3~10X less memory than the rest.
  • MetaBAT identified the most genomes (>90% precision & >30% completeness).
  • All binning methods complement each other; however, MetaBAT identified 111 of the 133 genomes detected by any of the 5 tools.
