
Best Binning Practices

Note: This is for MetaBAT 2. See here for MetaBAT 1.

Check out the new CAMI Challenge benchmark here.

Prerequisites

We will use 3 datasets in this guide; they are available here. For this guide, we use MetaBAT v2.10.2 and CheckM v1.0.6.
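MetaBAT reads per-contig coverage from a depth file, which is typically generated from sorted BAM alignments with the `jgi_summarize_bam_contig_depths` utility that ships with MetaBAT. A minimal sketch (the BAM file names are placeholders; align each sample's reads to the assembly first):

```shell
# Summarize per-contig mean coverage and variance from sorted BAMs
# into the depth.txt file used by the metabat2 runs below.
jgi_summarize_bam_contig_depths --outputDepth depth.txt sample1.bam sample2.bam sample3.bam
```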

Summary of Workflow

We will start with the default settings in MetaBAT 2 and explore advanced settings when necessary. Unlike with MetaBAT 1, we won't perform an extensive parameter search, but we can still change some parameters to control the amount of data used for binning. For instance, if the initial results look good, using more data (e.g. --minContig 1500) might improve the completeness of genome bins at the cost of some contamination. To evaluate bins, we will use the completeness and contamination estimates from CheckM. Although this is a guideline for more advanced usage of MetaBAT 2, it is safe to use the default mode in most cases without much loss of information, as shown below, where the differences between runs are not significant.

CASE 1: When assembly is good quality and from relatively simple community.

Run MetaBAT 2 and CheckM

#!bash
$ metabat2 -i assembly.fa.gz -a depth.txt -o resA1/bin -v
[00:00:00] MetaBAT 2 (v2.10.2) using minContig 2500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200. 
[00:00:05] Finished reading 79862 contigs and 42 coverages from depth.txt
[00:00:05] Number of target contigs: 26603 of large (>= 2500) and 52481 of small ones (>=1000 & <2500). 
[00:00:09] Finished TNF calculation.                                  
[00:00:19] Finished Preparing TNF Graph Building [pTNF = 72.0; 2380 / 2500 (P = 94.92%)]                       
[00:00:28] Finished Building TNF Graph (25392 vertices and 1285596 edges) [7.7Gb / 251.8Gb]                                          
[00:00:32] Building SCR Graph and Binning (23453 vertices and 130190 edges) [P = 95.00%; 7.7Gb / 251.8Gb]                           
[00:00:34] 5.75% (6928951 bases) of large (>=2500) contigs were re-binned out of small bins (<200000).
[00:00:34] 71.49% (127336848 bases) of large (>=2500) and 6.13% (4878858 bases) of small (<2500) contigs were binned.
104 bins (132215706 bases in total) formed.
$ checkm lineage_wf -f resA1/CheckM.txt -t 8 -x fa resA1/ resA1/SCG
  • The default minimum contig size (--minContig) is 2500. Contigs smaller than minContig but larger than 1000 bases are called 'small' contigs and are considered for binning at a later stage if 3 or more samples are available. In this example, 26603 large contigs were used for binning, and 6.13% of the small contigs were binned additionally afterward.
  • Contigs that were not placed in large bins (bins >= minClsSize, 200000 bases by default) are given another chance to be binned with the large bins. In this example, 5.75% of large contigs were re-binned out of small bins.
  • Each of these additional binning steps is limited to at most 10%, and both can be disabled by supplying the --noAdd option.
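If the two extra passes (small-contig binning and re-binning out of small bins) are not wanted, they can be turned off; a sketch, with an illustrative output directory name:

```shell
# Same run as above, but the additional binning passes are disabled,
# so only large (>= minContig) contigs are binned in a single pass.
metabat2 -i assembly.fa.gz -a depth.txt -o resA1_noadd/bin -v --noAdd
```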

Check the result using R

> source('http://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/benchmark.R')
> printPerf(list(calcPerfBySCG("resA1/CheckM.txt", removeStrain=F)), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6   64  64  58  55  49  44  39  31  16    9
     0.7   64  64  58  55  49  44  39  31  16    9
     0.8   64  64  58  55  49  44  39  31  16    9
     0.9   60  60  54  51  45  40  36  28  14    8
     0.95  52  52  46  43  37  32  29  21  10    5
     0.99  16  16  11   8   4   3   3   3   1    1
> printPerf(list(calcPerfBySCG("resA1/CheckM.txt", removeStrain=T)), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6   64  64  58  55  49  44  39  31  16    9
     0.7   64  64  58  55  49  44  39  31  16    9
     0.8   64  64  58  55  49  44  39  31  16    9
     0.9   64  64  58  55  49  44  39  31  16    9
     0.95  63  63  57  54  48  43  38  30  15    8
     0.99  47  47  41  38  32  28  25  20   7    2
  • The printPerf function shows the cumulative number of bins fulfilling given recall (completeness or sensitivity) and precision (purity or specificity) cutoffs. For instance, in the first table there are 45 bins having >= 0.9 precision and >= 0.5 recall, and the number decreases to 40 if a 0.6 recall cutoff were used instead.
  • The difference between the two tables is whether strain-level contamination is removed from the CheckM results. If removeStrain is set to TRUE, contamination from different strains is ignored in the precision calculation, so the number of bins fulfilling the same precision cutoff increases.
  • Comparing the two tables, it is clear that most of the contamination comes from mixtures of different strains (e.g. see the bottom row, where precision is 0.99). The assembly quality appears good and the bins contain virtually no contamination, so it may be worthwhile to try adding more data. The most obvious way is to decrease the minContig cutoff to 2 kb or 1.5 kb.

#!bash
$ metabat2 -i assembly.fa.gz -a depth.txt -o resA2/bin -v -m 2000
[00:00:00] MetaBAT 2 (v2.10.2) using minContig 2000, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200. 
[00:00:05] Finished reading 79862 contigs and 42 coverages from depth.txt
[00:00:05] Number of target contigs: 35247 of large (>= 2000) and 43837 of small ones (>=1000 & <2000). 
[00:00:09] Finished TNF calculation.                                  
[00:00:24] Finished Preparing TNF Graph Building [pTNF = 72.0; 2380 / 2500 (P = 95.20%)]                       
[00:00:42] Finished Building TNF Graph (33426 vertices and 1620734 edges) [7.7Gb / 251.8Gb]                                          
[00:00:51] Building SCR Graph and Binning (30476 vertices and 159182 edges) [P = 95.00%; 7.7Gb / 251.8Gb]                           
[00:00:52] 4.67% (6146882 bases) of large (>=2000) contigs were re-binned out of small bins (<200000).
[00:00:53] 69.84% (137881024 bases) of large (>=2000) and 7.04% (4247436 bases) of small (<2000) contigs were binned.
115 bins (142128460 bases in total) formed.
$ checkm lineage_wf -f resA2/CheckM.txt -t 8 -x fa resA2/ resA2/SCG
  • Note the increase from 104 bins (132215706 bases) to 115 bins (142128460 bases).

> diffPerf(calcPerfBySCG("resA2/CheckM.txt", removeStrain=F), calcPerfBySCG("resA1/CheckM.txt", removeStrain=F), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6    4   4   6   2   5   4   3   6   5    1
     0.7    4   4   6   2   5   4   3   6   5    1
     0.8    2   2   4   0   3   2   1   5   4    1
     0.9    2   2   4   0   3   2   0   4   3    0
     0.95   2   2   4   0   3   2   0   4   3    2
     0.99  -1  -1   0  -3  -1  -1  -1  -1   0    0
  • The diffPerf function shows the difference between two independent CheckM results. Positive numbers mean the first result has more bins fulfilling the cutoffs. Generally speaking, a better binning has more positive numbers toward the bottom right, i.e. more bins that are both complete and precise.
  • The overall completeness of the bins improved without much added contamination, though with a slight decrease in precision at the 0.99 cutoff (see the bottom row).
  • Let's try adding even more data by using -m 1500.

#!bash
$ metabat2 -i assembly.fa.gz -a depth.txt -o resA3/bin -v -m 1500
[00:00:00] MetaBAT 2 (v2.10.2) using minContig 1500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200. 
[00:00:05] Finished reading 79862 contigs and 42 coverages from depth.txt
[00:00:05] Number of target contigs: 49253 of large (>= 1500) and 29831 of small ones (>=1000 & <1500). 
[00:00:10] Finished TNF calculation.                                  
[00:00:33] Finished Preparing TNF Graph Building [pTNF = 70.0; 2296 / 2500 (P = 91.84%)]                       
[00:01:07] Finished Building TNF Graph (45451 vertices and 2020298 edges) [7.7Gb / 251.8Gb]                                          
[00:01:17] Building SCR Graph and Binning (40097 vertices and 199891 edges) [P = 85.50%; 7.7Gb / 251.8Gb]                           
[00:01:18] 5.42% (7806480 bases) of large (>=1500) contigs were re-binned out of small bins (<200000).
[00:01:19] 68.53% (151873573 bases) of large (>=1500) and 8.19% (2957972 bases) of small (<1500) contigs were binned.
126 bins (154831545 bases in total) formed.
$ checkm lineage_wf -f resA3/CheckM.txt -t 8 -x fa resA3/ resA3/SCG
> diffPerf(calcPerfBySCG("resA3/CheckM.txt", removeStrain=F), calcPerfBySCG("resA2/CheckM.txt", removeStrain=F), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6    1   1   1   4   0   3   4   1   1    3
     0.7    0   0   0   3  -1   2   3   0   0    3
     0.8    0   0   0   3  -1   2   3   0   0    3
     0.9    2   2   2   5   1   4   5   2   1    4
     0.95  -4  -4  -4  -1  -5  -1   1  -1  -1    1
     0.99   2   2   2   4   1   1   0   0   0    0
> diffPerf(calcPerfBySCG("resA3/CheckM.txt", removeStrain=T), calcPerfBySCG("resA2/CheckM.txt", removeStrain=T), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6    2   2   2   5   1   4   5   2   2    4
     0.7    2   2   2   5   1   4   5   2   2    4
     0.8    1   1   1   4   0   3   4   1   1    3
     0.9    1   1   1   4   0   3   4   1   1    3
     0.95   3   3   3   6   2   5   6   3   3    4
     0.99   0   0   0   3   0   1   2   1   0    2
  • The first table shows some improvement; however, a non-negligible number of bins gained contamination. The second table reveals the nature of that contamination: it is mostly due to multiple strains.

CASE 2: When assembly is moderate quality or from relatively complex community.

Run MetaBAT 2 and CheckM

#!bash
$ metabat2 -i assembly.fa.gz -a depth.txt -o resB1/bin -v
[00:00:00] MetaBAT 2 (v2.10.2) using minContig 2500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200.
[00:00:48] Finished reading 6154352 contigs and 3 coverages from depth.txt
[00:00:48] Number of target contigs: 201095 of large (>= 2500) and 727195 of small ones (>=1000 & <2500).
[00:01:12] Finished TNF calculation.
[00:04:40] Finished Preparing TNF Graph Building [pTNF = 70.0; 4627 / 5000 (P = 92.54%)]
[00:15:35] Finished Building TNF Graph (185825 vertices and 13839749 edges) [10.8Gb / 251.8Gb]                                          
[00:21:30] Building SCR Graph and Binning (162532 vertices and 1859133 edges) [P = 85.50%; 10.6Gb / 251.8Gb]                           
[00:21:35] 65.56% (697167749 bases) of large (>=2500) and 0.00% (0 bases) of small (<2500) contigs were binned.
483 bins (697167749 bases in total) formed.
$ checkm lineage_wf -f resB1/CheckM.txt -t 8 -x fa resB1/ resB1/SCG

Check the result using R

> printPerf(list(calcPerfBySCG("resB1/CheckM.txt", removeStrain=F)), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6  190 190 153 137 112  92  77  57  29   17
     0.7  190 190 153 137 112  92  77  57  29   17
     0.8  182 182 146 130 105  87  72  54  26   17
     0.9  169 169 133 117  96  79  64  47  24   17
     0.95 148 148 112  97  78  65  51  38  23   16
     0.99  70  70  42  32  22  16  11   7   4    1
> printPerf(list(calcPerfBySCG("resB1/CheckM.txt", removeStrain=T)), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6  190 190 153 137 112  92  77  57  29   17
     0.7  190 190 153 137 112  92  77  57  29   17
     0.8  186 186 150 134 109  90  75  57  29   17
     0.9  182 182 146 130 106  89  74  56  29   17
     0.95 169 169 133 117  97  82  68  50  28   17
     0.99 100 100  69  56  43  31  23  15   6    2
  • Compared to the previous example, this one has a greater number of large contigs but only 3 samples. It appears that there is some contamination at both the strain level and above.
  • You may stop here, or try to add more data anyway using a lower contig size cutoff (-m 2000).

#!bash
$ metabat2 -i assembly.fa.gz -a depth.txt -o resB2/bin -v -m 2000
[00:00:00] MetaBAT 2 (v2.10.2) using minContig 2000, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200.
[00:00:45] Finished reading 6154352 contigs and 3 coverages from depth.txt
[00:00:46] Number of target contigs: 288498 of large (>= 2000) and 639792 of small ones (>=1000 & <2000).
[00:01:14] Finished TNF calculation.
[00:06:39] Finished Preparing TNF Graph Building [pTNF = 70.0; 4561 / 5000 (P = 91.22%)]
[00:29:00] Finished Building TNF Graph (264407 vertices and 18768911 edges) [10.9Gb / 251.8Gb]
[00:40:11] Building SCR Graph and Binning (227267 vertices and 2558138 edges) [P = 85.50%; 10.7Gb / 251.8Gb]                           
[00:40:16] 65.14% (819215630 bases) of large (>=2000) and 0.00% (0 bases) of small (<2000) contigs were binned.
492 bins (819215630 bases in total) formed.
$ checkm lineage_wf -f resB2/CheckM.txt -t 8 -x fa resB2/ resB2/SCG
> diffPerf(calcPerfBySCG("resB2/CheckM.txt", removeStrain=F), calcPerfBySCG("resB1/CheckM.txt", removeStrain=F), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6   24  24  15   5   9   8   4   6   2    2
     0.7   20  20  12   2   6   5   2   4   0    1
     0.8   22  22  13   3   8   6   3   4   1    1
     0.9   17  17   8  -1   0   1  -1   2   0    0
     0.95   6   6  -2 -10  -6  -6  -8  -4  -5   -5
     0.99   1   1   0  -6  -5  -4  -2   1   1    2
> diffPerf(calcPerfBySCG("resB2/CheckM.txt", removeStrain=T), calcPerfBySCG("resB1/CheckM.txt", removeStrain=T), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6   25  25  16   6  10   9   5   7   2    2
     0.7   22  22  13   3   7   6   3   5   1    1
     0.8   23  23  14   4   8   7   4   5   1    1
     0.9   21  21  12   3   7   6   3   5   1    1
     0.95  17  17   9   1   3   3   0   4   1    1
     0.99   6   6   4  -4  -3   2   0   3   4    3
  • 9 more bins were recovered, with about 120 Mb of additional bases binned.
  • There appears to be about 5-10% additional contamination, but most of it is at the strain level.
  • For reference, we tried a 1.5 kb cutoff next.

#!bash
$ metabat2 -i assembly.fa.gz -a depth.txt -o resB3/bin -v -m 1500
[00:00:00] MetaBAT 2 (v2.10.2) using minContig 1500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200.
[00:00:46] Finished reading 6154352 contigs and 3 coverages from depth.txt
[00:00:47] Number of target contigs: 469696 of large (>= 1500) and 458594 of small ones (>=1000 & <1500).
[00:01:19] Finished TNF calculation.
[00:10:34] Finished Preparing TNF Graph Building [pTNF = 70.0; 4390 / 5000 (P = 87.80%)]
[01:06:28] Finished Building TNF Graph (413179 vertices and 26227815 edges) [11.1Gb / 251.8Gb]
[01:28:59] Building SCR Graph and Binning (337690 vertices and 3546110 edges) [P = 76.00%; 11.4Gb / 251.8Gb]                           
[01:29:06] 61.91% (971087914 bases) of large (>=1500) and 0.00% (0 bases) of small (<1500) contigs were binned.
495 bins (971087914 bases in total) formed.
$ checkm lineage_wf -f resB3/CheckM.txt -t 8 -x fa resB3/ resB3/SCG
> diffPerf(calcPerfBySCG("resB3/CheckM.txt", removeStrain=F), calcPerfBySCG("resB2/CheckM.txt", removeStrain=F), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6    7   7  20  11  12  11   6   1   7   -1
     0.7    4   4  16   7   8   7   4   1   7   -1
     0.8    3   3  15   6   6   5   2   1   8   -1
     0.9   -8  -8   5  -4  -3  -5  -6  -6   3   -2
     0.95  -7  -7   6  -2  -3  -5  -3  -3   5    0
     0.99 -14 -14  -5  -1  -1  -1  -3  -3   0   -1
> diffPerf(calcPerfBySCG("resB3/CheckM.txt", removeStrain=T), calcPerfBySCG("resB2/CheckM.txt", removeStrain=T), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6    9   9  21  12  13  12   7   1   8   -1
     0.7    7   7  20  11  12  11   7   3   9    0
     0.8    7   7  19  10  11  10   7   3   9    0
     0.9    2   2  15   6   7   6   3   0   7    0
     0.95   3   3  16   8   8   4   2  -2   5   -1
     0.99 -11 -11  -3  -2  -2  -5  -2  -5   1   -1
  • About 150 Mb of additional bases were binned, but contamination also increased significantly.

CASE 3: When assembly is poor quality or from highly complex community.

Run MetaBAT 2 and CheckM

#!bash
$ metabat2 -i assembly.fa.gz -a depth.txt -o resC1/bin -v
[00:00:00] MetaBAT 2 (v2.10.2) using minContig 2500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200.
[00:01:15] Finished reading 5334784 contigs and 9 coverages from depth.txt
[00:01:16] Number of target contigs: 705500 of large (>= 2500) and 1244422 of small ones (>=1000 & <2500).
[00:02:43] Finished TNF calculation.
[00:09:03] Finished Preparing TNF Graph Building [pTNF = 89.0; 4786 / 5000 (P = 94.90%)]
[02:21:31] Finished Building TNF Graph (668596 vertices and 50400146 edges) [15.8Gb / 251.8Gb]
[03:44:10] Building SCR Graph and Binning (551394 vertices and 3408104 edges) [P = 85.50%; 16.1Gb / 251.8Gb]                           
[03:44:49] 3.25% (101083859 bases) of large (>=2500) contigs were re-binned out of small bins (<200000).
[03:45:04] 77.00% (3209083259 bases) of large (>=2500) and 9.33% (169227642 bases) of small (<2500) contigs were binned.
1281 bins (3378310901 bases in total) formed.
$ checkm lineage_wf -f resC1/CheckM.txt -t 8 -x fa resC1/ resC1/SCG

Check the result using R

> printPerf(list(calcPerfBySCG("resC1/CheckM.txt", removeStrain=F)), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6  869 869 798 721 635 544 445 348 203   78
     0.7  842 842 771 695 609 520 425 329 192   75
     0.8  776 776 705 630 545 459 370 280 159   55
     0.9  637 637 566 494 412 332 250 185 101   35
     0.95 479 479 414 349 280 212 148 108  66   24
     0.99 200 200 157 112  79  50  28  20  17    8
> printPerf(list(calcPerfBySCG("resC1/CheckM.txt", removeStrain=T)), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6  901 901 830 753 667 576 477 378 226   88
     0.7  892 892 821 744 658 569 472 374 225   88
     0.8  885 885 814 737 651 565 470 372 225   88
     0.9  859 859 788 715 632 548 455 361 219   84
     0.95 803 803 734 667 584 501 412 331 199   77
     0.99 489 489 433 376 310 252 198 148  97   40
  • Compared to the previous examples, this one has the greatest number of large contigs.
  • There appears to be significant contamination at both the strain level and above, so using a lower minContig cutoff is not advised.
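When precision matters more than completeness for such a complex community, one could go the other way and raise the cutoff instead. A hedged sketch (the 3 kb value and output path are illustrative, not tested recommendations):

```shell
# Use a larger minimum contig size to trade completeness for purity
# on a noisy, highly complex assembly.
metabat2 -i assembly.fa.gz -a depth.txt -o resC2/bin -v -m 3000
```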

Other parameters for handling exceptional cases

  • Contigs are considered for binning when they have coverage >= minCV (called effective coverage) in at least one sample and the sum of effective coverages is >= minCVSum. Currently both minCV and minCVSum default to 1.
  • maxP sets the upper limit for admitting contigs into binning based on the quality of their TNF (Tetra Nucleotide Frequency) score. So --maxP 95 assumes that at least 5% of contigs are noise. In reality the noise might be much higher, but it is safe to set this high since an internal lower limit (pTNF = 70) prevents unnecessary build-up of the TNF graph.
  • maxEdges is another way to control the complexity of the TNF graph: it limits the number of edges per node by their strength (e.g. --maxEdges 200 keeps at most the top 200 edges above the threshold, which is decided automatically by maxP).
  • Lastly, minS is the probability cutoff used when building the combined score (SCR) graph.
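Putting these parameters together, a run that tightens all of the knobs above might look like the following; the specific values and the output path are illustrative assumptions, not recommendations:

```shell
# Require >= 0.5x coverage in some sample and >= 2x total (minCV/minCVSum),
# treat 10% of contigs as TNF noise (maxP 90), keep at most 500 edges per
# node (maxEdges), and demand a stricter combined score (minS 75).
metabat2 -i assembly.fa.gz -a depth.txt -o resD/bin -v \
    --minCV 0.5 --minCVSum 2 --maxP 90 --maxEdges 500 --minS 75
```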
