Clone wiki

MetaBAT / Benchmark_MetaHIT

Benchmark of Automated Metagenome Binning Software Using Real Assembly

Check out a new tutorial Best Binning Practices for better usage of MetaBAT.

Check out new CAMI Challenge benchmark here.

For an alternative benchmark of metagenome binning software, we used the same MetaHIT human gut metagenome data (Accession #: ERP000108) as the controlled benchmark. Also it is advised to check a large scale example. We took Canopy, CONCOCT, GroopM, and MaxBin as the alternative software to compare with MetaBAT. We adopted single copy core genes as the gold standard to test recall (completeness) and precision (inverse of contamination); for the goal, we used CheckM.

Some of pitfalls of this benchmark are as follows:

  • Single copy core genes are one of the ways to measure the completeness of a bin and the degree of contamination, but they are not perfect. It often over- or under- estimates the truth due to many reasons including genome diversity and poor assembly quality.

  • Some methods in this benchmark already incorporated single copy gene information, which may bias the evaluation using the same information.

Using 1.5kb contig size cutoff (118,025 contigs)

  • All results are available to download.
MetaBAT Canopy** CONCOCT MaxBin GroopM^^
Number of Bins Identified (>200kb) 234 223 260 168 335
Number of Genomes Detected (Precision > .9 & Recall > .3) 130 96 64 39 28
Wall Time (16 cores; 32 hyper-threads) 00:03:36 00:02:31 82:19:53 06:49:39 12:19:12
Peak Memory Usage (for binning step) 3.0G 1.6G 7G 5.8G 6.3G

**Canopy only use abundance table as input, so it should have taken more time and memory to read and write sequence data like the others

^^GroopM without recruiting (chimeric bins removed).

Preprocessing

  1. We generated de-novo assembly using Ray Meta.
  2. Using BBMap, bam files for each library were produced.
  3. All files are available to download here.

Generating depth files

  • It took 10 minutes using 32 hyper-threads with peak memory consumption of 8GB. Here is the log.
  • Files are available here.
#depth file for MetaBAT
jgi_summarize_bam_contig_depths --outputDepth depth.txt --pairedContigs paired.txt *.bam

#depth file for CONCOCT
awk 'NR > 1 {for(x=1;x<=NF;x++) if(x == 1 || (x >= 4 && x % 2 == 0)) printf "%s", $x (x == NF || x == (NF-1) ? "\n":"\t")}' depth.txt > depth_concoct.txt

#depth file for MaxBin
cut -f1,3 depth.txt | tail -n+2 > depth_maxbin.txt

Running MetaBAT (using version v0.25.3)

#Prepare proper folder structure
mkdir -p ./1.5kb/MetaBAT

#Run MetaBAT
metabat -i assembly.fa -a depth.txt -o ./1.5kb/MetaBAT/bin --minContigByCorr 1500 --saveTNF saved.tnf --saveDistance saved.dist --verbose

#Run CheckM
checkm lineage_wf -f ./1.5kb/MetaBAT/SCG.txt -t 8 -x fa ./1.5kb/MetaBAT ./1.5kb/MetaBAT/SCG
  • Indeed MetaBAT has the luxury of choosing the best parameter for a given data set due to its extremely fast binning performance. One can even optimize further the parameters to get the best results in terms of single copy genes. MetaBAT reuses pre-calculated data using --saveTNF and --saveDistance options.

  • Recall and precision here correspond to completeness and 1 - contamination in CheckM table, respectively.

Running MetaBAT with ensemble binning (using version v0.30.3)

metabat -i assembly.fa -a depth.txt -o binE --saveTNF saved.tnf --saveDistance saved.dist --verbose -B 20

Running CONCOCT (using version 0.4.0)

mkdir -p ./1.5kb/CONCOCT
cd 1.5kb/CONCOCT

concoct --composition_file assembly.fa --coverage_file depth_concoct.txt --length_threshold 1500

CONCOCT bins should be extracted first to calculate recall and precision.

system("mkdir -p ./1.5kb/CONCOCT/bins/ ./1.5kb/CONCOCT/small_bins")
cls <- read.csv("./1.5kb/CONCOCT/clustering_gt2500.csv", header=F, as.is=T)
invisible(foreach(i=unique(cls$V2)) %do% {
    write.table(cls$V1[cls$V2==i], file=sprintf("./1.5kb/CONCOCT/bins/%d.lst", i), col.names=F, row.names=F, quote=F)   
    system(sprintf("./screen_list.pl ./1.5kb/CONCOCT/bins/%d.lst assembly.fa keep > ./1.5kb/CONCOCT/bins/%d.fa", i,i))
    bin.size <- as.numeric(system(sprintf("./sizefasta.pl ./1.5kb/CONCOCT/bins/%d.fa", i), intern=T))
    if(bin.size < 200000)
        system(sprintf("mv ./1.5kb/CONCOCT/bins/%d.fa ./1.5kb/CONCOCT/small_bins/", i))
})

Run CheckM

checkm lineage_wf -f ./1.5kb/CONCOCT/SCG.txt -t 8 -x fa ./1.5kb/CONCOCT/bins ./1.5kb/CONCOCT/bins/SCG

Running GroopM (using version 0.3.0)

  • Parse step with 32 threads spent > 120GB memory, so the number of threads reduced to 16.
  • We tried two modes with or without recruiting step.
  • Overall GroopM performed poorly in terms of precision (large bins had significant amount of contaminations).
#calculate depth file for GroopM
groopm parse -t 16 database.gm assembly.fa *.bam

#core binning
groopm core -b 200000 -c 1500 database.gm

#skipped refining stage since it is not automated
#groopm refine database.gm

#output core bins
groopm extract -t 32 --prefix ./1.5kb/GroopM/core_only/bin_groopm ./1.5kb/GroopM/database.gm assembly.fa

#recruiting unbinned contigs
groopm recruit database.gm

#output
groopm extract -t 32 --prefix ./1.5kb/GroopM/recruited/bin_groopm ./1.5kb/GroopM/database.gm assembly.fa

checkm lineage_wf -f ./1.5kb/GroopM/core_only/SCG.txt -t 32 -x fna ./1.5kb/GroopM/core_only ./1.5kb/GroopM/core_only/SCG
checkm lineage_wf -f ./1.5kb/GroopM/recruited/SCG.txt -t 32 -x fna ./1.5kb/GroopM/recruited ./1.5kb/GroopM/recruited/SCG

Running MaxBin (using version 1.4.1)

run_MaxBin.pl -contig assembly.fa -out ./1.5kb/MaxBin/MaxBin.out -abund depth_maxbin.txt -thread 32 -min_contig_length 1500
checkm lineage_wf -f ./1.5kb/MaxBin/SCG.txt -t 8 -x fasta ./1.5kb/MaxBin ./1.5kb/MaxBin/SCG

Comparing all methods together

source('http://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/benchmark.R')
res <- list(MetaBAT=calcPerfBySCG("./1.5kb/MetaBAT/Sensitive/SCG.txt"), Canopy=calcPerfBySCG("./1.5kb/Canopy/SCG.txt"), CONCOCT=calcPerfBySCG("./1.5kb/CONCOCT/SCG.txt"), MaxBin=calcPerfBySCG("./1.5kb/MaxBin/SCG.txt"), GroopM=calcPerfBySCG("./1.5kb/GroopM/core_only/SCG.txt"))
printPerf(res)

$`MetaBAT with ensemble binning`
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  147 139 124 108  86  57  31   14
     0.8  142 134 119 103  81  52  26   11
     0.9  111 103  88  72  51  32  13    4
     0.95  69  61  48  35  20  10   7    3
     0.99  10   8   2   1   1   0   0    0

$MetaBAT
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  146 127 101  76  57  37  15    3
     0.8  145 126 100  75  56  36  14    2
     0.9  130 111  86  61  44  28  12    0
     0.95  91  73  49  26  14   8   7    0
     0.99  17   8   4   1   0   0   0    0

$Canopy
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  129 111  95  77  67  36  18    7
     0.8  124 106  90  72  62  31  13    6
     0.9   96  78  62  44  34  12   3    1
     0.95  64  47  31  17  11   2   1    1
     0.99  12   2   1   0   0   0   0    0

$CONCOCT
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  107  95  78  65  53  34  23   10
     0.8   94  82  66  54  44  26  15    6
     0.9   64  52  39  28  19   7   2    0
     0.95  43  33  23  14   9   2   1    0
     0.99   8   3   1   0   0   0   0    0

$MaxBin
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   91  74  54  31  20   4   3    1
     0.8   69  54  37  21  14   4   3    1
     0.9   39  26  18  12   8   2   2    0
     0.95  17  11   8   7   6   2   2    0
     0.99   1   0   0   0   0   0   0    0

$GroopM
         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   47  37  31  23  19  10   4    2
     0.8   39  29  23  17  14   8   3    1
     0.9   28  19  16  12   9   4   2    0
     0.95  18  13  10   7   6   3   2    0
     0.99   2   1   1   0   0   0   0    0
pdf("Performance_By_SCG_All_1.5kb_2.pdf", width=12, height=4)
plotPerf3(res)
dev.off()

Performance_By_SCG_All_1.5kb_2.png

#First download http://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/MetaHIT/results/Results_1.5kb.tar.gz and extract it before running below codes.
ctgList <- getCtgList(res)
ctgSizes <- read.table("assembly.fa.sizes", as.is=T)
pdf("Performance_By_SCG_All_1.5kb_3.pdf")
plotPerfVennBySCG(res, ctgList, ctgSizes, minRec=.3, minPrec=.9)
dev.off()
pdf("Performance_By_SCG_All_1.5kb_4.pdf")
plotPerfVennBySCG(res, ctgList, ctgSizes, minRec=.3, minPrec=.95)
dev.off()

Performance_By_SCG_All_1.5kb_3_4.png

Updated