Revisiting CAMI Challenge Data Set

MetaBAT 2 outperforms the previous MetaBAT release and the other binning tools compared here in both accuracy and computational efficiency. All results are based on default parameters.

Prerequisites

The dataset is available here. Refer to the paper for details. All results are available here.

Low Complexity Data Set

#!bash
$ metabat2 -i CAMI_low_RL_S001__insert_270_GoldStandardAssembly.fasta.gz -a depth-low.txt -o MetaBATLow/bin -v
[00:00:00] MetaBAT 2 (v2.10.2) using minContig 2500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200.
[00:00:01] Finished reading 19499 contigs and 1 coverages from depth.txt
[00:00:01] Number of target contigs: 4367 of large (>= 2500) and 3823 of small ones (>=1000 & <2500).
[00:00:05] Finished TNF calculation.
[00:00:06] Finished Preparing TNF Graph Building [pTNF = 92.0; 2392 / 2500 (P = 94.96%)]
[00:00:06] Finished Building TNF Graph (4146 vertices and 239946 edges) [12.4Gb / 251.8Gb]
[00:00:06] Building SCR Graph and Binning (4046 vertices and 30893 edges) [P = 95.00%; 12.4Gb / 251.8Gb]
[00:00:07] 97.43% (137794377 bases) of large (>=2500) and 0.00% (0 bases) of small (<2500) contigs were binned.
35 bins (137794377 bases in total) formed.
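
The last line of the log can be cross-checked against the files on disk. The following is a minimal sketch, not part of MetaBAT: `summarize_bins` is a hypothetical helper, and it assumes that the bins are written as uncompressed FASTA files named `<prefix>.<N>.fa` (for example `MetaBATLow/bin.1.fa`). Adjust the glob if your installation names them differently.

```shell
# Hypothetical helper (not part of MetaBAT): count the bin files under a
# prefix and sum their sequence lengths.
# Assumption: bins are uncompressed FASTA files named <prefix>.<N>.fa.
summarize_bins() {
    prefix="$1"
    # Expand the glob into the function's positional parameters.
    set -- "$prefix".*.fa
    # If nothing matched, the pattern is left as a literal, non-existent name.
    [ -e "$1" ] || { echo "0 bins (0 bases in total)"; return 0; }
    # Skip FASTA header lines and add up the length of every sequence line.
    # ARGC - 1 is the number of files passed to awk.
    awk '!/^>/ { total += length($0) }
         END { printf "%d bins (%.0f bases in total)\n", ARGC - 1, total }' "$@"
}
```

Calling `summarize_bins MetaBATLow/bin` after the run above should print a line in the same shape as the final log line, provided the naming assumption holds.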

Check the result using R

> source('http://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/benchmark.R')
> printPerf(list(MetaBAT2=calcPerfCAMI("MetaBAT","MetaBATLow/bin",complexity='low'), MaxBin2=calcPerfCAMI("MaxBin","MaxBinLow/bin",complexity='low'), CONCOCT=calcPerfCAMI("CONCOCT","CONCOCT/low/clustering_gt1000.csv",complexity='low'), MyCC=calcPerfCAMI("MaxBin","MyCC/low/Cluster",complexity='low'), BinSanity=calcPerfCAMI("BinSanity","BinSanity/low/",complexity='low'), COCACOLA=calcPerfCAMI("CONCOCT","COCACOLA/low/result.csv",complexity='low')))

$MetaBAT2                                       $BinSanity
         Recall                                         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95      Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   25  23  23  21  18  18  14   11           0.7   20  18  16  13  12  12  12   11
     0.8   24  22  22  21  18  18  14   11           0.8   16  14  12   9   9   9   9    8
     0.9   23  22  22  21  18  18  14   11           0.9   16  14  12   9   9   9   9    8
     0.95  22  21  21  20  17  17  14   11           0.95   9   9   7   5   5   5   5    4
     0.99  22  21  21  20  17  17  14   11           0.99   6   6   5   4   4   4   4    3

$MaxBin2                                        $COCACOLA                               
         Recall                                         Recall                  
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95      Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   33  30  28  24  24  21  17   16           0.7    8   8   5   5   5   5   5    3
     0.8   29  27  25  23  23  20  17   16           0.8    7   7   4   4   4   4   4    2
     0.9   22  20  18  16  16  15  13   12           0.9    5   5   3   3   3   3   3    2
     0.95  17  16  15  14  14  13  12   11           0.95   2   2   1   1   1   1   1    1
     0.99  11  10   9   8   8   8   8    8           0.99   0   0   0   0   0   0   0    0  

$CONCOCT                                        $MyCC                                   
         Recall                                         Recall                  
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95      Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   18  18  18  17  17  16  15   14           0.7   10  10  10  10  10  10  10   10
     0.8   17  17  17  17  17  16  15   14           0.8   10  10  10  10  10  10  10   10
     0.9   17  17  17  17  17  16  15   14           0.9   10  10  10  10  10  10  10   10
     0.95  17  17  17  17  17  16  15   14           0.95  10  10  10  10  10  10  10   10
     0.99  14  14  14  14  14  13  12   11           0.99   6   6   6   6   6   6   6    6

> plotPerf3(res, rec=seq(.5,.9,.1), legend.position=c(.95,.7))

The two panels show precision 0.95 and 0.90, respectively.
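
The precision/recall tables printed by `printPerf` appear to be cumulative: each cell seems to count the bins that reach at least the row's precision and at least the column's recall, which is why the counts never increase from left to right or from top to bottom. The sketch below illustrates that reading only. It is not taken from `benchmark.R`; `count_bins` and the three-column, tab-separated input (bin, precision, recall, with a header row) are assumptions made for illustration, and the original script may use strict rather than inclusive comparisons.

```shell
# Hypothetical helper (not from benchmark.R): count the bins that meet both
# a precision and a recall threshold.
# Assumption: FILE is tab-separated with a header row and the columns
#   bin <TAB> precision <TAB> recall
# Usage: count_bins FILE MIN_PRECISION MIN_RECALL
count_bins() {
    awk -F '\t' -v p="$2" -v r="$3" '
        # Skip the header; "+ 0" forces a numeric comparison.
        NR > 1 && $2 + 0 >= p + 0 && $3 + 0 >= r + 0 { n++ }
        END { print n + 0 }
    ' "$1"
}
```

Under this reading, one table cell corresponds to one call, for example `count_bins per_bin.tsv 0.9 0.5` for the row "0.9" and the column "0.5", where `per_bin.tsv` is a placeholder file name.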

Medium Complexity Data Set

#!bash
$ metabat2 -i CAMI_medium_GoldStandardAssembly.fasta.gz -a depth-medium.txt -o MetaBATMed/bin -v
[00:00:00] MetaBAT 2 (v2.10.2) using minContig 2500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200. 
[00:00:05] Finished reading 63447 contigs and 4 coverages from depth.txt
[00:00:05] Number of target contigs: 13229 of large (>= 2500) and 10460 of small ones (>=1000 & <2500). 
[00:00:15] Finished TNF calculation.                                  
[00:00:18] Finished Preparing TNF Graph Building [pTNF = 88.0; 2386 / 2500 (P = 94.92%)]                       
[00:00:21] Finished Building TNF Graph (12565 vertices and 790316 edges) [12.8Gb / 251.8Gb]                                          
[00:00:22] Building SCR Graph and Binning (11908 vertices and 93567 edges) [P = 95.00%; 12.8Gb / 251.8Gb]                           
[00:00:22] 0.09% (450157 bases) of large (>=2500) contigs were re-binned out of small bins (<200000).
[00:00:25] 96.82% (488490337 bases) of large (>=2500) and 6.49% (1079936 bases) of small (<2500) contigs were binned.
171 bins (489570273 bases in total) formed.

> printPerf(list(MetaBAT2=calcPerfCAMI("MetaBAT","MetaBATMed/bin",complexity='medium'), MaxBin2=calcPerfCAMI("MaxBin","MaxBinMed/bin",complexity='medium'), CONCOCT=calcPerfCAMI("CONCOCT","CONCOCT/medium/clustering_gt1000.csv",complexity='medium'), MyCC=calcPerfCAMI("MaxBin","MyCC/medium/Cluster",complexity='medium'), BinSanity=calcPerfCAMI("BinSanity","BinSanity/medium/",complexity='medium'), COCACOLA=calcPerfCAMI("CONCOCT","COCACOLA/medium/result.csv",complexity='medium')))

$MetaBAT2                                       $MyCC
         Recall                                         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95      Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  115 106  95  86  80  75  64   54           0.7   24  23  23  23  23  23  22   21
     0.8  109 100  90  82  76  73  63   53           0.8   23  22  22  22  22  22  21   20
     0.9  105  96  88  80  75  72  63   53           0.9   18  18  18  18  18  18  17   16
     0.95 102  93  85  78  73  71  62   52           0.95  17  17  17  17  17  17  17   16
     0.99  91  82  74  68  63  61  54   44           0.99   8   8   8   8   8   8   8    7

$MaxBin2                                        $BinSanity
         Recall                                         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95      Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  107 103  93  87  76  71  68   61           0.7   68  59  54  49  46  44  38   37
     0.8   93  89  80  75  68  66  63   56           0.8   60  51  46  43  41  39  34   33
     0.9   76  73  66  63  59  59  56   49           0.9   49  41  36  34  32  31  28   27
     0.95  61  59  54  53  51  51  49   42           0.95  43  38  34  32  30  29  26   25
     0.99  33  32  31  31  31  31  31   27           0.99  21  19  17  17  16  16  15   14

$CONCOCT                                        $COCACOLA
         Recall                                         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95      Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   23  21  20  20  19  19  14   12           0.7   16  13  12  11  10   9   9    4
     0.8   23  21  20  20  19  19  14   12           0.8   12  10  10   9   9   8   8    4
     0.9   18  17  17  17  16  16  14   12           0.9    8   6   6   5   5   5   5    2
     0.95  15  15  15  15  14  14  12   10           0.95   4   3   3   2   2   2   2    1
     0.99   7   7   7   7   6   6   6    5           0.99   0   0   0   0   0   0   0    0

> plotPerf3(res, rec=seq(.5,.9,.1), legend.position=c(.95,.7))

High Complexity Data Set

#!bash
$ metabat2 -i CAMI_high_GoldStandardAssembly.fasta.gz -a depth-high.txt -o MetaBATHigh/bin -v
[00:00:00] MetaBAT 2 (v2.10.2) using minContig 2500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200.
[00:00:24] Finished reading 42038 contigs and 5 coverages from depth.txt
[00:00:24] Number of target contigs: 28615 of large (>= 2500) and 5547 of small ones (>=1000 & <2500).
[00:01:08] Finished TNF calculation.
[00:01:13] Finished Preparing TNF Graph Building [pTNF = 93.0; 2378 / 2500 (P = 95.12%)]
[00:01:23] Finished Building TNF Graph (27182 vertices and 1567728 edges) [14.8Gb / 251.8Gb]
[00:01:30] Building SCR Graph and Binning (26714 vertices and 433208 edges) [P = 95.00%; 14.9Gb / 251.8Gb]
[00:01:30] 0.04% (960710 bases) of large (>=2500) contigs were re-binned out of small bins (<200000).
[00:01:54] 98.10% (2517748867 bases) of large (>=2500) and 3.18% (281503 bases) of small (<2500) contigs were binned.
728 bins (2518030370 bases in total) formed.

> printPerf(list(MetaBAT2=calcPerfCAMI("MetaBAT","MetaBATHigh/bin",complexity='high'), MaxBin2=calcPerfCAMI("MaxBin","MaxBinHigh/bin",complexity='high'), CONCOCT=calcPerfCAMI("CONCOCT","CONCOCT/high/clustering_gt1000.csv",complexity='high'), MyCC=calcPerfCAMI("MaxBin","MyCC/high/Cluster",complexity='high'), BinSanity=calcPerfCAMI("BinSanity","BinSanity/high/",complexity='high'), COCACOLA=calcPerfCAMI("CONCOCT","COCACOLA/high/result.csv",complexity='high')))

$MetaBAT2                                       $MyCC
         Recall                                         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95      Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  495 473 452 433 410 392 357  290           0.7    2   2   2   2   2   2   2    0
     0.8  482 461 440 424 403 387 352  287           0.8    2   2   2   2   2   2   2    0
     0.9  469 449 428 414 393 378 346  282           0.9    2   2   2   2   2   2   2    0
     0.95 446 428 407 395 376 362 333  270           0.95   2   2   2   2   2   2   2    0
     0.99 397 379 362 353 337 325 300  241           0.99   2   2   2   2   2   2   2    0

$MaxBin2                                        $BinSanity
         Recall                                         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95      Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7  279 276 270 268 260 249 224  179           0.7  221 211 204 201 196 192 185  172
     0.8  267 264 259 258 251 244 220  176           0.8  196 188 181 178 174 172 166  154
     0.9  250 248 244 243 238 231 211  169           0.9  156 150 144 142 138 137 133  125
     0.95 224 223 220 220 217 212 195  155           0.95 124 118 112 111 107 106 103   98
     0.99 156 156 155 155 155 152 144  114           0.99  69  67  65  65  65  65  63   62

$CONCOCT                                        $COCACOLA
         Recall                                         Recall
Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95      Precision 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.7   37  37  36  36  36  36  35   25           0.7  105 105 105 104 103 101  98   80
     0.8   37  37  36  36  36  36  35   25           0.8  101 101 101 100  99  98  95   77
     0.9   36  36  35  35  35  35  35   25           0.9   90  90  90  89  88  87  85   70
     0.95  32  32  32  32  32  32  32   24           0.95  72  72  72  72  71  71  69   55
     0.99  25  25  25  25  25  25  25   22           0.99  32  32  32  32  32  32  31   21

> plotPerf3(res, rec=seq(.5,.9,.1), legend.position=c(.95,.7))
