Benchmark of Automated Metagenome Binning Software in Complex Metagenomes II

For an alternative benchmark of metagenome binning software, we used the same MetaHIT human gut metagenome data (Accession #: ERP000108) as in the controlled benchmark. We took CONCOCT, GroopM, and MaxBin as the alternative software to compare with MetaBAT. We adopted single copy core genes as the gold standard to test recall (completeness) and precision (the inverse of contamination); for this purpose, we used CheckM.

Some pitfalls of this benchmark are as follows:

  • Single copy core genes are one way to measure the completeness of a bin and its degree of contamination, but they are not perfect. They often over- or underestimate the truth for many reasons, including genome diversity and poor assembly quality.

  • Some methods in this benchmark already incorporate single copy gene information, which may bias an evaluation based on that same information.

Summary of the benchmark results

Using 2.5kb contig size cutoff (60,619 contigs)

                                                       MetaBAT*   CONCOCT  GroopM^^    MaxBin
Number of Bins Identified (>200kb)                          172       195       257       122
Number of Quality Bins (Precision > .9 & Recall > .5)        58        36        20        10
Wall Time (16 cores; 32 hyper-threads)                 01:04:21  30:15:49   4:33:20  03:23:46
Peak Memory Usage (for binning step)                       2.8G      5.8G      3.4G      4.8G

Using 1.5kb contig size cutoff (118,025 contigs) **

                                                       MetaBAT*   CONCOCT  GroopM^^    MaxBin
Number of Bins Identified (>200kb)                          190       260       335       168
Number of Quality Bins (Precision > .9 & Recall > .5)        72        39        16        18
Wall Time (16 cores; 32 hyper-threads)                 03:31:38  82:19:53  12:19:12  06:49:39
Peak Memory Usage (for binning step)                       3.0G        7G      6.3G      5.8G

*Sensitive mode

^^GroopM without recruiting (chimeric bins removed).

**Details of 1.5kb results can be found here.
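The "quality bin" rows above count bins passing both thresholds at once (precision > 0.9 and recall > 0.5). A minimal sketch of that filter on a hypothetical per-bin table (bin name, precision, recall; file name and values invented for illustration):

```shell
#!/bin/sh
# Hypothetical per-bin performance table: bin name, precision, recall.
printf 'bin.1\t0.95\t0.70\nbin.2\t0.85\t0.90\nbin.3\t0.99\t0.40\n' > perf_demo.tsv

# Keep only rows passing BOTH thresholds, then count them.
n=$(awk '$2 > 0.9 && $3 > 0.5' perf_demo.tsv | wc -l | tr -d ' ')
echo "$n quality bins"
```

Only bin.1 passes both cuts here: bin.2 fails on precision and bin.3 on recall.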

Preprocessing

  1. We generated a de novo assembly using Ray Meta.
  2. Using BBMap, we produced BAM files for each library.
  3. All files are available to download here.

Generating depth files

  • It took 10 minutes using 32 hyper-threads, with a peak memory consumption of 8GB. Here is the log.
  • Files are available here.
    #!bash
    #depth file for MetaBAT
    jgi_summarize_bam_contig_depths --outputDepth depth.txt --pairedContigs paired.txt *.bam
    
    #depth file for CONCOCT
    awk 'NR > 1 {for(x=1;x<=NF;x++) if(x == 1 || (x >= 4 && x % 2 == 0)) printf "%s", $x (x == NF || x == (NF-1) ? "\n":"\t")}' depth.txt > depth_concoct.txt
    
    #depth file for MaxBin
    cut -f1,3 depth.txt | tail -n+2 > depth_maxbin.txt
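The two conversions above assume the depth.txt layout that jgi_summarize_bam_contig_depths emits (as implied by the awk and cut commands): contig name, contig length, total average depth, then alternating per-library depth and variance columns. A minimal sketch on a mock two-library file (file names and values invented) shows which columns each tool receives:

```shell
#!/bin/sh
# Mock depth.txt: name, length, total average depth, then
# (depth, variance) per library -- the layout assumed above.
printf 'contigName\tcontigLen\ttotalAvgDepth\ts1.bam\ts1.bam-var\ts2.bam\ts2.bam-var\n' > depth_demo.txt
printf 'c1\t2500\t11.0\t4.5\t0.1\t6.5\t0.2\n' >> depth_demo.txt

# CONCOCT: contig name plus per-library depths (drop length, total, variances)
awk 'NR > 1 {for(x=1;x<=NF;x++) if(x == 1 || (x >= 4 && x % 2 == 0)) printf "%s", $x (x == NF || x == (NF-1) ? "\n":"\t")}' depth_demo.txt > depth_concoct_demo.txt

# MaxBin: contig name plus the single total-depth column, header removed
cut -f1,3 depth_demo.txt | tail -n+2 > depth_maxbin_demo.txt
```

Here depth_concoct_demo.txt contains the name plus the two per-library depths (4.5 and 6.5), while depth_maxbin_demo.txt contains the name plus the total depth (11.0).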
    

Running MetaBAT (using version >= 0.22.1)

#!bash

#Prepare proper folder structure
mkdir -p ./2.5kb/MetaBAT/Sensitive ./2.5kb/MetaBAT/Specific ./2.5kb/MetaBAT/SpecificPair

#First, try sensitive mode for better sensitivity
metabat -i assembly.fa -a depth.txt -o ./2.5kb/MetaBAT/Sensitive/bin --sensitive -v --saveTNF saved_2.5kb.tnf --saveDistance saved_2.5kb.gprob

#Try specific mode to improve specificity further; this time the binning will be much faster since it reuses saved calculations
metabat -i assembly.fa -a depth.txt -o ./2.5kb/MetaBAT/Specific/bin --specific -v --saveTNF saved_2.5kb.tnf --saveDistance saved_2.5kb.gprob

#Try specific mode with paired data to improve sensitivity while minimizing the loss of specificity
metabat -i assembly.fa -a depth.txt -p paired.txt -o ./2.5kb/MetaBAT/SpecificPair/bin --specific -v --saveTNF saved_2.5kb.tnf --saveDistance saved_2.5kb.gprob

Evaluation of MetaBAT using CheckM

#!bash
checkm lineage_wf -f ./2.5kb/MetaBAT/Sensitive/SCG.txt -t 32 -x fa ./2.5kb/MetaBAT/Sensitive ./2.5kb/MetaBAT/Sensitive/SCG
checkm lineage_wf -f ./2.5kb/MetaBAT/Specific/SCG.txt -t 32 -x fa ./2.5kb/MetaBAT/Specific ./2.5kb/MetaBAT/Specific/SCG
checkm lineage_wf -f ./2.5kb/MetaBAT/SpecificPair/SCG.txt -t 32 -x fa ./2.5kb/MetaBAT/SpecificPair ./2.5kb/MetaBAT/SpecificPair/SCG
  • The results are available to download here.

Print out the results

  • To reduce the bias that exaggerates precision when recall is very low (i.e., when the bin is very small), only bins with recall > 0.2 were considered in the calculation.

  • Overall, the results looked very similar; interestingly, sensitive mode performed better than expected (its precision was not worse than the other modes', unlike in the first benchmark). Sensitive mode was therefore selected for the comparison with the other methods.

  • Indeed, MetaBAT's extremely fast binning affords the luxury of choosing the best parameters for a given data set. One can even optimize the parameters further to get the best results in terms of single copy genes. MetaBAT reuses the pre-calculated data as long as the probability parameters (p1, p2, p3) are greater than or equal to the minimum used for the saved file (80 in this example).

  • Recall and precision here correspond to completeness and 1 - contamination in CheckM table, respectively.
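As a concrete example of that correspondence (illustrative numbers, not taken from the CheckM tables below): a bin reported at 85.3% completeness and 4.2% contamination maps to recall 0.853 and precision 0.958.

```shell
#!/bin/sh
# CheckM reports completeness and contamination as percentages.
# recall = completeness/100; precision = 1 - contamination/100 (per the text above).
pr=$(awk 'BEGIN { comp=85.3; cont=4.2; printf "recall=%.3f precision=%.3f", comp/100, 1-cont/100 }')
echo "$pr"
```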

#The following are R commands (tested on Linux)
#Download the R file from the data directory (see above); it will try to download and install the required libraries.
source('http://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/benchmark.R')

res <- list(Sensitive=calcPerfBySCG("./2.5kb/MetaBAT/Sensitive/SCG.txt"), Specific=calcPerfBySCG("./2.5kb/MetaBAT/Specific/SCG.txt"), SpecificPair=calcPerfBySCG("./2.5kb/MetaBAT/SpecificPair/SCG.txt"))
printPerf(res)

$Sensitive
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  106  89  75  60  43  28  11    2
     0.8  105  88  74  59  42  27  10    1
     0.9   89  72  58  44  28  17   7    0
     0.95  65  48  35  24  13   8   3    0
     0.99  17   9   4   2   2   2   0    0

$Specific
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  101  85  69  56  39  23   5    1
     0.8  101  85  69  56  39  23   5    1
     0.9   94  78  62  49  33  19   4    0
     0.95  68  52  38  27  16   8   2    0
     0.99  26  17   8   6   5   2   0    0

$SpecificPair
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  108  93  74  61  44  27  11    1
     0.8  107  92  73  60  43  26  10    1
     0.9   88  73  55  43  30  15   5    0
     0.95  57  42  29  20  13   8   3    0
     0.99  14   9   4   3   3   2   1    0

Plot the results

#!R
pdf("Performance_By_SCG.pdf", width=8, height=8)
plotPerf(res, xlim=max(sapply(res, nrow)))
dev.off()
(Figure: Performance_By_SCG2.png)

Running CONCOCT (using version 0.4.0)

#!bash
concoct --composition_file assembly.fa --coverage_file depth_concoct.txt --length_threshold 2500

CONCOCT bins should be extracted first to calculate recall and precision.

#!R
library(foreach)  # provides foreach() and %do% used below

system("mkdir -p ./2.5kb/CONCOCT/bins/ ./2.5kb/CONCOCT/small_bins")
cls <- read.csv("./2.5kb/CONCOCT/clustering_gt2500.csv", header=F, as.is=T)
invisible(foreach(i=unique(cls$V2)) %do% {
    write.table(cls$V1[cls$V2==i], file=sprintf("./2.5kb/CONCOCT/bins/%d.lst", i), col.names=F, row.names=F, quote=F)
    system(sprintf("./screen_list.pl ./2.5kb/CONCOCT/bins/%d.lst assembly.fa keep > ./2.5kb/CONCOCT/bins/%d.fa", i, i))
    bin.size <- as.numeric(system(sprintf("./sizefasta.pl ./2.5kb/CONCOCT/bins/%d.fa", i), intern=T))
    if(bin.size < 200000)  # move bins smaller than 200kb aside
        system(sprintf("mv ./2.5kb/CONCOCT/bins/%d.fa ./2.5kb/CONCOCT/small_bins/", i))
})
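screen_list.pl and sizefasta.pl above are local helper scripts, not part of MetaBAT or CONCOCT. For readers without them, the bin-size check can be approximated in plain shell (demo.fa is an invented example):

```shell
#!/bin/sh
# Total sequence length of a FASTA: strip header lines, strip newlines, count bytes.
printf '>c1\nACGT\n>c2\nACGTACGT\n' > demo.fa
size=$(grep -v '^>' demo.fa | tr -d '\n' | wc -c | tr -d ' ')
echo "total bases: $size"
```

A bin whose total falls below 200000 would be moved to small_bins, as in the loop above.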
Run CheckM
#!bash
checkm lineage_wf -f ./2.5kb/CONCOCT/SCG.txt -t 32 -x fa ./2.5kb/CONCOCT/bins ./2.5kb/CONCOCT/bins/SCG
The results are available to download here.

res <- list(CONCOCT=calcPerfBySCG("./2.5kb/CONCOCT/SCG.txt"))
printPerf(res)
$CONCOCT
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   91  78  67  57  41  30  20    6
     0.8   83  70  59  50  34  23  14    5
     0.9   57  44  36  27  13   7   3    1
     0.95  35  23  16  11   1   1   1    0
     0.99   7   1   0   0   0   0   0    0

Running GroopM (using version 0.3.0)

  • The parse step with 32 threads consumed > 120GB of memory, so the number of threads was reduced to 16.
  • We tried two modes, with and without the recruiting step.
  • Overall, GroopM performed poorly in terms of precision (large bins had a significant amount of contamination).
    #!bash
    #parse the assembly and BAM files into the GroopM database (its analog of a depth file)
    groopm parse -t 16 database.gm assembly.fa *.bam
    
    #core binning
    groopm core -b 200000 -c 2500 database.gm
    
    #skipped refining stage since it is not automated
    #groopm refine database.gm
    
    #output core bins
    groopm extract -t 32 --prefix ./2.5kb/GroopM/core_only/bin_groopm ./2.5kb/GroopM/database.gm assembly.fa
    
    #recruiting unbinned contigs
    groopm recruit database.gm
    
    #output
    groopm extract -t 32 --prefix ./2.5kb/GroopM/recruited/bin_groopm ./2.5kb/GroopM/database.gm assembly.fa
    
    checkm lineage_wf -f ./2.5kb/GroopM/core_only/SCG.txt -t 32 -x fna ./2.5kb/GroopM/core_only ./2.5kb/GroopM/core_only/SCG
    checkm lineage_wf -f ./2.5kb/GroopM/recruited/SCG.txt -t 32 -x fna ./2.5kb/GroopM/recruited ./2.5kb/GroopM/recruited/SCG
    
    The results are available to download here.
res <- list(Core=calcPerfBySCG("./2.5kb/GroopM/core_only/SCG.txt"), Recruited=calcPerfBySCG("./2.5kb/GroopM/recruited/SCG.txt"))
printPerf(res)
$Core
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   54  42  31  24  11   9   4    0
     0.8   48  36  25  18   6   5   2    0
     0.9   41  29  20  14   3   2   0    0
     0.95  30  20  13  10   3   2   0    0
     0.99   8   5   2   1   0   0   0    0

$Recruited
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   80  63  43  34  21  10   5    3
     0.8   66  49  30  24  13   7   3    1
     0.9   40  27  17  12   5   3   1    1
     0.95  15  12   8   6   1   1   0    0
     0.99   1   1   1   1   0   0   0    0

Running MaxBin (using version 1.4.1)

#!bash
run_MaxBin.pl -contig assembly.fa -out ./2.5kb/MaxBin/MaxBin.out -abund depth_maxbin.txt -thread 32 -min_contig_length 2500
checkm lineage_wf -f ./2.5kb/MaxBin/SCG.txt -t 32 -x fasta ./2.5kb/MaxBin ./2.5kb/MaxBin/SCG
The results are available to download here.

res <- list(MaxBin=calcPerfBySCG("./2.5kb/MaxBin/SCG.txt"))
printPerf(res)
$MaxBin
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   72  52  40  23   9   3   1    0
     0.8   55  36  26  18   7   2   1    0
     0.9   32  18  10   9   2   1   0    0
     0.95  19  12   7   7   1   1   0    0
     0.99   5   2   0   0   0   0   0    0

Comparing all methods together

res <- list(MetaBAT=calcPerfBySCG("./2.5kb/MetaBAT/Sensitive/SCG.txt"), CONCOCT=calcPerfBySCG("./2.5kb/CONCOCT/SCG.txt"), GroopM=calcPerfBySCG("./2.5kb/GroopM/core_only/SCG.txt"), MaxBin=calcPerfBySCG("./2.5kb/MaxBin/SCG.txt"))
printPerf(res)

$MetaBAT
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  106  89  75  60  43  28  11    2
     0.8  105  88  74  59  42  27  10    1
     0.9   89  72  58  44  28  17   7    0
     0.95  65  48  35  24  13   8   3    0
     0.99  17   9   4   2   2   2   0    0

$CONCOCT
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   91  78  67  57  41  30  20    6
     0.8   83  70  59  50  34  23  14    5
     0.9   57  44  36  27  13   7   3    1
     0.95  35  23  16  11   1   1   1    0
     0.99   7   1   0   0   0   0   0    0

$GroopM
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   54  42  31  24  11   9   4    0
     0.8   48  36  25  18   6   5   2    0
     0.9   41  29  20  14   3   2   0    0
     0.95  30  20  13  10   3   2   0    0
     0.99   8   5   2   1   0   0   0    0

$MaxBin
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   72  52  40  23   9   3   1    0
     0.8   55  36  26  18   7   2   1    0
     0.9   32  18  10   9   2   1   0    0
     0.95  19  12   7   7   1   1   0    0
     0.99   5   2   0   0   0   0   0    0

pdf("Performance_By_SCG_All_2.5kb.pdf", width=8, height=8)
plotPerf(res, xlim=max(sapply(res, nrow)))
dev.off()
(Figure: Performance_By_SCG_All_2.5kb_2.png)

Conclusions

  • MetaBAT outperformed GroopM and MaxBin in all metrics (a similar outcome to the controlled benchmark).

  • CONCOCT had better recall (completeness) at the cost of reduced precision. MetaBAT surpassed CONCOCT in both combined metrics, F1 and F0.5.

  • GroopM seemed too liberal in selecting members for each bin, so it suffered significantly in precision. The greater completeness of its bins was driven by excessive inclusion of contigs, which is what caused the poor precision.

  • MaxBin performed reasonably well without using co-abundance information.

  • The conclusion is very similar to that of the previous benchmark: MetaBAT is the fastest metagenome binning software, producing bins with very little contamination and reasonable completeness, which are suitable characteristics for complex metagenome analyses.
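The combined metrics mentioned in the conclusions follow the standard F-beta formula, F_b = (1+b^2)·P·R / (b^2·P + R): F1 weighs precision and recall equally, while F0.5 weighs precision more heavily. A sketch with illustrative precision/recall values (not taken from the benchmark tables):

```shell
#!/bin/sh
# F-beta score; b < 1 emphasizes precision, b > 1 emphasizes recall.
fbeta() { awk -v P="$1" -v R="$2" -v b="$3" 'BEGIN { printf "%.3f", (1+b*b)*P*R/(b*b*P+R) }'; }
f1=$(fbeta 0.9 0.6 1)     # F1  = 2PR/(P+R)
f05=$(fbeta 0.9 0.6 0.5)  # F0.5, precision-weighted
echo "F1=$f1 F0.5=$f05"
```

With P=0.9 and R=0.6, F0.5 (0.818) rewards the high precision more than F1 (0.720) does, which is why a low-contamination binner scores relatively better on F0.5.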
