Best Binning Practices
Note: This is for MetaBAT 2. See here for MetaBAT 1.
Check out the new CAMI Challenge benchmark here.
Prerequisites
We will use 3 datasets in this guide; they are available here. Throughout, we use MetaBAT v2.10.2 and CheckM v1.0.6.
Summary of Workflow
We will start with the default settings in MetaBAT 2 and explore advanced settings only where necessary. Unlike with MetaBAT 1, we won't do an extensive parameter search, but we can still change a few parameters to control the amount of data used for binning. For instance, if the initial results look good, using more data (e.g. --minContig 1500) may improve the completeness of genome bins at the cost of some contamination. To evaluate bins, we will use the completeness and contamination estimates from CheckM. Although this is a guide to more advanced usage of MetaBAT 2, the default mode is safe to use in most cases without much loss of information, as shown below, where the difference between runs is not significant.
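The loop described above (run MetaBAT 2 at several --minContig values, then score each run with CheckM) can be sketched as a small shell script. This is a sketch only: it assumes `assembly.fa.gz` and `depth.txt` exist in the working directory, and when `metabat2` or `checkm` is not on the PATH it just prints the commands it would run instead of executing them.

```shell
#!/usr/bin/env bash
# Sweep the minimum contig length (-m) and evaluate each run with CheckM.
# Sketch: if metabat2 or checkm is not installed, print commands instead.
set -euo pipefail

ASSEMBLY=assembly.fa.gz
DEPTH=depth.txt

for m in 2500 2000 1500; do
  out="res_m${m}"
  mkdir -p "$out"
  if command -v metabat2 >/dev/null && command -v checkm >/dev/null; then
    metabat2 -i "$ASSEMBLY" -a "$DEPTH" -o "$out/bin" -v -m "$m"
    checkm lineage_wf -f "$out/CheckM.txt" -t 8 -x fa "$out/" "$out/SCG"
  else
    echo "would run: metabat2 -i $ASSEMBLY -a $DEPTH -o $out/bin -v -m $m"
  fi
done
```

Each run lands in its own `res_m*` directory, so the CheckM tables can be compared side by side afterwards.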
CASE 1: When the assembly is of good quality and from a relatively simple community.
Run MetaBAT 2 and CheckM
```
#!bash
$ metabat2 -i assembly.fa.gz -a depth.txt -o resA1/bin -v
[00:00:00] MetaBAT 2 (v2.10.2) using minContig 2500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200.
[00:00:05] Finished reading 79862 contigs and 42 coverages from depth.txt
[00:00:05] Number of target contigs: 26603 of large (>= 2500) and 52481 of small ones (>=1000 & <2500).
[00:00:09] Finished TNF calculation.
[00:00:19] Finished Preparing TNF Graph Building [pTNF = 72.0; 2380 / 2500 (P = 94.92%)]
[00:00:28] Finished Building TNF Graph (25392 vertices and 1285596 edges) [7.7Gb / 251.8Gb]
[00:00:32] Building SCR Graph and Binning (23453 vertices and 130190 edges) [P = 95.00%; 7.7Gb / 251.8Gb]
[00:00:34] 5.75% (6928951 bases) of large (>=2500) contigs were re-binned out of small bins (<200000).
[00:00:34] 71.49% (127336848 bases) of large (>=2500) and 6.13% (4878858 bases) of small (<2500) contigs were binned. 104 bins (132215706 bases in total) formed.
$ checkm lineage_wf -f resA1/CheckM.txt -t 8 -x fa resA1/ resA1/SCG
```
Check the result using R
```
#!r
> source('http://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/benchmark.R')
> printPerf(list(calcPerfBySCG("resA1/CheckM.txt", removeStrain=F)), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6   64  64  58  55  49  44  39  31  16    9
     0.7   64  64  58  55  49  44  39  31  16    9
     0.8   64  64  58  55  49  44  39  31  16    9
     0.9   60  60  54  51  45  40  36  28  14    8
     0.95  52  52  46  43  37  32  29  21  10    5
     0.99  16  16  11   8   4   3   3   3   1    1
> printPerf(list(calcPerfBySCG("resA1/CheckM.txt", removeStrain=T)), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6   64  64  58  55  49  44  39  31  16    9
     0.7   64  64  58  55  49  44  39  31  16    9
     0.8   64  64  58  55  49  44  39  31  16    9
     0.9   64  64  58  55  49  44  39  31  16    9
     0.95  63  63  57  54  48  43  38  30  15    8
     0.99  47  47  41  38  32  28  25  20   7    2
```
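Conceptually, printPerf/calcPerfBySCG count the bins whose CheckM completeness (recall) and purity (precision) clear each pair of cutoffs. A hypothetical Python re-implementation of that counting idea is below; the exact formula in benchmark.R is not shown in this guide, so the precision definition comp/(comp+cont) used here is an assumption for illustration.

```python
# Hypothetical sketch of the bin-counting idea behind printPerf/calcPerfBySCG.
# Assumption: recall = completeness, precision = comp / (comp + cont);
# the real benchmark.R script may define these differently.
def count_good_bins(bins, recall_cut, precision_cut):
    """bins: list of (completeness, contamination) pairs, in percent."""
    n = 0
    for comp, cont in bins:
        recall = comp / 100.0
        precision = comp / (comp + cont) if comp + cont > 0 else 0.0
        if recall >= recall_cut and precision >= precision_cut:
            n += 1
    return n

# Toy data: three bins from a made-up CheckM table.
bins = [(95.0, 1.0), (70.0, 15.0), (40.0, 2.0)]
print(count_good_bins(bins, recall_cut=0.9, precision_cut=0.95))  # → 1
print(count_good_bins(bins, recall_cut=0.1, precision_cut=0.6))   # → 3
```

Each cell of the printPerf matrices above is one such count at a (precision, recall) cutoff pair.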
```
#!bash
$ metabat2 -i assembly.fa.gz -a depth.txt -o resA2/bin -v -m 2000
[00:00:00] MetaBAT 2 (v2.10.2) using minContig 2000, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200.
[00:00:05] Finished reading 79862 contigs and 42 coverages from depth.txt
[00:00:05] Number of target contigs: 35247 of large (>= 2000) and 43837 of small ones (>=1000 & <2000).
[00:00:09] Finished TNF calculation.
[00:00:24] Finished Preparing TNF Graph Building [pTNF = 72.0; 2380 / 2500 (P = 95.20%)]
[00:00:42] Finished Building TNF Graph (33426 vertices and 1620734 edges) [7.7Gb / 251.8Gb]
[00:00:51] Building SCR Graph and Binning (30476 vertices and 159182 edges) [P = 95.00%; 7.7Gb / 251.8Gb]
[00:00:52] 4.67% (6146882 bases) of large (>=2000) contigs were re-binned out of small bins (<200000).
[00:00:53] 69.84% (137881024 bases) of large (>=2000) and 7.04% (4247436 bases) of small (<2000) contigs were binned. 115 bins (142128460 bases in total) formed.
$ checkm lineage_wf -f resA2/CheckM.txt -t 8 -x fa resA2/ resA2/SCG
```
```
#!r
> diffPerf(calcPerfBySCG("resA2/CheckM.txt", removeStrain=F), calcPerfBySCG("resA1/CheckM.txt", removeStrain=F), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6    4   4   6   2   5   4   3   6   5    1
     0.7    4   4   6   2   5   4   3   6   5    1
     0.8    2   2   4   0   3   2   1   5   4    1
     0.9    2   2   4   0   3   2   0   4   3    0
     0.95   2   2   4   0   3   2   0   4   3    2
     0.99  -1  -1   0  -3  -1  -1  -1  -1   0    0
```
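diffPerf simply reports the element-wise difference between two such count matrices (here, the -m 2000 run minus the default run), so positive cells mean more bins passed that cutoff pair. A minimal Python sketch of the same idea, using toy numbers rather than the real resA matrices:

```python
# Sketch of the diffPerf idea: element-wise difference of two
# precision-by-recall count matrices (new run minus old run).
def diff_perf(counts_new, counts_old):
    return [[n - o for n, o in zip(row_new, row_old)]
            for row_new, row_old in zip(counts_new, counts_old)]

# Toy 2x2 matrices: two precision rows, two recall columns.
run_old = [[64, 58], [60, 54]]
run_new = [[68, 64], [62, 58]]
print(diff_perf(run_new, run_old))  # → [[4, 6], [2, 4]]
```

A matrix dominated by small positive numbers, as above, suggests the looser minContig helped more than it hurt.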
```
#!bash
$ metabat2 -i assembly.fa.gz -a depth.txt -o resA3/bin -v -m 1500
[00:00:00] MetaBAT 2 (v2.10.2) using minContig 1500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200.
[00:00:05] Finished reading 79862 contigs and 42 coverages from depth.txt
[00:00:05] Number of target contigs: 49253 of large (>= 1500) and 29831 of small ones (>=1000 & <1500).
[00:00:10] Finished TNF calculation.
[00:00:33] Finished Preparing TNF Graph Building [pTNF = 70.0; 2296 / 2500 (P = 91.84%)]
[00:01:07] Finished Building TNF Graph (45451 vertices and 2020298 edges) [7.7Gb / 251.8Gb]
[00:01:17] Building SCR Graph and Binning (40097 vertices and 199891 edges) [P = 85.50%; 7.7Gb / 251.8Gb]
[00:01:18] 5.42% (7806480 bases) of large (>=1500) contigs were re-binned out of small bins (<200000).
[00:01:19] 68.53% (151873573 bases) of large (>=1500) and 8.19% (2957972 bases) of small (<1500) contigs were binned. 126 bins (154831545 bases in total) formed.
$ checkm lineage_wf -f resA3/CheckM.txt -t 8 -x fa resA3/ resA3/SCG
```
```
#!r
> diffPerf(calcPerfBySCG("resA3/CheckM.txt", removeStrain=F), calcPerfBySCG("resA2/CheckM.txt", removeStrain=F), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6    1   1   1   4   0   3   4   1   1    3
     0.7    0   0   0   3  -1   2   3   0   0    3
     0.8    0   0   0   3  -1   2   3   0   0    3
     0.9    2   2   2   5   1   4   5   2   1    4
     0.95  -4  -4  -4  -1  -5  -1   1  -1  -1    1
     0.99   2   2   2   4   1   1   0   0   0    0
> diffPerf(calcPerfBySCG("resA3/CheckM.txt", removeStrain=T), calcPerfBySCG("resA2/CheckM.txt", removeStrain=T), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6    2   2   2   5   1   4   5   2   2    4
     0.7    2   2   2   5   1   4   5   2   2    4
     0.8    1   1   1   4   0   3   4   1   1    3
     0.9    1   1   1   4   0   3   4   1   1    3
     0.95   3   3   3   6   2   5   6   3   3    4
     0.99   0   0   0   3   0   1   2   1   0    2
```
CASE 2: When the assembly is of moderate quality or from a relatively complex community.
Run MetaBAT 2 and CheckM
```
#!bash
$ metabat2 -i assembly.fa.gz -a depth.txt -o resB1/bin -v
[00:00:00] MetaBAT 2 (v2.10.2) using minContig 2500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200.
[00:00:48] Finished reading 6154352 contigs and 3 coverages from depth.txt
[00:00:48] Number of target contigs: 201095 of large (>= 2500) and 727195 of small ones (>=1000 & <2500).
[00:01:12] Finished TNF calculation.
[00:04:40] Finished Preparing TNF Graph Building [pTNF = 70.0; 4627 / 5000 (P = 92.54%)]
[00:15:35] Finished Building TNF Graph (185825 vertices and 13839749 edges) [10.8Gb / 251.8Gb]
[00:21:30] Building SCR Graph and Binning (162532 vertices and 1859133 edges) [P = 85.50%; 10.6Gb / 251.8Gb]
[00:21:35] 65.56% (697167749 bases) of large (>=2500) and 0.00% (0 bases) of small (<2500) contigs were binned. 483 bins (697167749 bases in total) formed.
$ checkm lineage_wf -f resB1/CheckM.txt -t 8 -x fa resB1/ resB1/SCG
```
Check the result using R
```
#!r
> printPerf(list(calcPerfBySCG("resB1/CheckM.txt", removeStrain=F)), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6  190 190 153 137 112  92  77  57  29   17
     0.7  190 190 153 137 112  92  77  57  29   17
     0.8  182 182 146 130 105  87  72  54  26   17
     0.9  169 169 133 117  96  79  64  47  24   17
     0.95 148 148 112  97  78  65  51  38  23   16
     0.99  70  70  42  32  22  16  11   7   4    1
> printPerf(list(calcPerfBySCG("resB1/CheckM.txt", removeStrain=T)), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6  190 190 153 137 112  92  77  57  29   17
     0.7  190 190 153 137 112  92  77  57  29   17
     0.8  186 186 150 134 109  90  75  57  29   17
     0.9  182 182 146 130 106  89  74  56  29   17
     0.95 169 169 133 117  97  82  68  50  28   17
     0.99 100 100  69  56  43  31  23  15   6    2
```
```
#!bash
$ metabat2 -i assembly.fa.gz -a depth.txt -o resB2/bin -v -m 2000
[00:00:00] MetaBAT 2 (v2.10.2) using minContig 2000, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200.
[00:00:45] Finished reading 6154352 contigs and 3 coverages from depth.txt
[00:00:46] Number of target contigs: 288498 of large (>= 2000) and 639792 of small ones (>=1000 & <2000).
[00:01:14] Finished TNF calculation.
[00:06:39] Finished Preparing TNF Graph Building [pTNF = 70.0; 4561 / 5000 (P = 91.22%)]
[00:29:00] Finished Building TNF Graph (264407 vertices and 18768911 edges) [10.9Gb / 251.8Gb]
[00:40:11] Building SCR Graph and Binning (227267 vertices and 2558138 edges) [P = 85.50%; 10.7Gb / 251.8Gb]
[00:40:16] 65.14% (819215630 bases) of large (>=2000) and 0.00% (0 bases) of small (<2000) contigs were binned. 492 bins (819215630 bases in total) formed.
$ checkm lineage_wf -f resB2/CheckM.txt -t 8 -x fa resB2/ resB2/SCG
```
```
#!r
> diffPerf(calcPerfBySCG("resB2/CheckM.txt", removeStrain=F), calcPerfBySCG("resB1/CheckM.txt", removeStrain=F), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6   24  24  15   5   9   8   4   6   2    2
     0.7   20  20  12   2   6   5   2   4   0    1
     0.8   22  22  13   3   8   6   3   4   1    1
     0.9   17  17   8  -1   0   1  -1   2   0    0
     0.95   6   6  -2 -10  -6  -6  -8  -4  -5   -5
     0.99   1   1   0  -6  -5  -4  -2   1   1    2
> diffPerf(calcPerfBySCG("resB2/CheckM.txt", removeStrain=T), calcPerfBySCG("resB1/CheckM.txt", removeStrain=T), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6   25  25  16   6  10   9   5   7   2    2
     0.7   22  22  13   3   7   6   3   5   1    1
     0.8   23  23  14   4   8   7   4   5   1    1
     0.9   21  21  12   3   7   6   3   5   1    1
     0.95  17  17   9   1   3   3   0   4   1    1
     0.99   6   6   4  -4  -3   2   0   3   4    3
```
```
#!bash
$ metabat2 -i assembly.fa.gz -a depth.txt -o resB3/bin -v -m 1500
[00:00:00] MetaBAT 2 (v2.10.2) using minContig 1500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200.
[00:00:46] Finished reading 6154352 contigs and 3 coverages from depth.txt
[00:00:47] Number of target contigs: 469696 of large (>= 1500) and 458594 of small ones (>=1000 & <1500).
[00:01:19] Finished TNF calculation.
[00:10:34] Finished Preparing TNF Graph Building [pTNF = 70.0; 4390 / 5000 (P = 87.80%)]
[01:06:28] Finished Building TNF Graph (413179 vertices and 26227815 edges) [11.1Gb / 251.8Gb]
[01:28:59] Building SCR Graph and Binning (337690 vertices and 3546110 edges) [P = 76.00%; 11.4Gb / 251.8Gb]
[01:29:06] 61.91% (971087914 bases) of large (>=1500) and 0.00% (0 bases) of small (<1500) contigs were binned. 495 bins (971087914 bases in total) formed.
$ checkm lineage_wf -f resB3/CheckM.txt -t 8 -x fa resB3/ resB3/SCG
```
```
#!r
> diffPerf(calcPerfBySCG("resB3/CheckM.txt", removeStrain=F), calcPerfBySCG("resB2/CheckM.txt", removeStrain=F), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6    7   7  20  11  12  11   6   1   7   -1
     0.7    4   4  16   7   8   7   4   1   7   -1
     0.8    3   3  15   6   6   5   2   1   8   -1
     0.9   -8  -8   5  -4  -3  -5  -6  -6   3   -2
     0.95  -7  -7   6  -2  -3  -5  -3  -3   5    0
     0.99 -14 -14  -5  -1  -1  -1  -3  -3   0   -1
> diffPerf(calcPerfBySCG("resB3/CheckM.txt", removeStrain=T), calcPerfBySCG("resB2/CheckM.txt", removeStrain=T), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6    9   9  21  12  13  12   7   1   8   -1
     0.7    7   7  20  11  12  11   7   3   9    0
     0.8    7   7  19  10  11  10   7   3   9    0
     0.9    2   2  15   6   7   6   3   0   7    0
     0.95   3   3  16   8   8   4   2  -2   5   -1
     0.99 -11 -11  -3  -2  -2  -5  -2  -5   1   -1
```
CASE 3: When the assembly is of poor quality or from a highly complex community.
Run MetaBAT 2 and CheckM
```
#!bash
$ metabat2 -i assembly.fa.gz -a depth.txt -o resC1/bin -v
[00:00:00] MetaBAT 2 (v2.10.2) using minContig 2500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200.
[00:01:15] Finished reading 5334784 contigs and 9 coverages from depth.txt
[00:01:16] Number of target contigs: 705500 of large (>= 2500) and 1244422 of small ones (>=1000 & <2500).
[00:02:43] Finished TNF calculation.
[00:09:03] Finished Preparing TNF Graph Building [pTNF = 89.0; 4786 / 5000 (P = 94.90%)]
[02:21:31] Finished Building TNF Graph (668596 vertices and 50400146 edges) [15.8Gb / 251.8Gb]
[03:44:10] Building SCR Graph and Binning (551394 vertices and 3408104 edges) [P = 85.50%; 16.1Gb / 251.8Gb]
[03:44:49] 3.25% (101083859 bases) of large (>=2500) contigs were re-binned out of small bins (<200000).
[03:45:04] 77.00% (3209083259 bases) of large (>=2500) and 9.33% (169227642 bases) of small (<2500) contigs were binned. 1281 bins (3378310901 bases in total) formed.
$ checkm lineage_wf -f resC1/CheckM.txt -t 8 -x fa resC1/ resC1/SCG
```
Check the result using R
```
#!r
> printPerf(list(calcPerfBySCG("resC1/CheckM.txt", removeStrain=F)), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6  869 869 798 721 635 544 445 348 203   78
     0.7  842 842 771 695 609 520 425 329 192   75
     0.8  776 776 705 630 545 459 370 280 159   55
     0.9  637 637 566 494 412 332 250 185 101   35
     0.95 479 479 414 349 280 212 148 108  66   24
     0.99 200 200 157 112  79  50  28  20  17    8
> printPerf(list(calcPerfBySCG("resC1/CheckM.txt", removeStrain=T)), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
         Recall
Precision 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95
     0.6  901 901 830 753 667 576 477 378 226   88
     0.7  892 892 821 744 658 569 472 374 225   88
     0.8  885 885 814 737 651 565 470 372 225   88
     0.9  859 859 788 715 632 548 455 361 219   84
     0.95 803 803 734 667 584 501 412 331 199   77
     0.99 489 489 433 376 310 252 198 148  97   40
```
Other parameters we may need for handling exceptional cases
- A contig is considered for binning when its coverage is >= minCV (called effective coverage) in at least one sample and the sum of its effective coverages is >= minCVSum. Currently both minCV and minCVSum default to 1.
- maxP sets the upper limit for selecting contigs for binning by the quality of their TNF (Tetra Nucleotide Frequency) score. So --maxP 95 assumes that at least 5% of the data is noise. In reality the noise might be much higher, but it is safe to set this high, since an internal lower limit (pTNF = 70) prevents unnecessary build-up of the TNF graph.
- maxEdges is another way to control the complexity of the TNF graph by limiting the maximum number of edges per node, ranked by strength (e.g. --maxEdges 200 keeps at most the top 200 edges above the threshold, which is decided automatically by maxP).
- Lastly, minS is the probability cutoff used when building the combined score (SCR) graph.
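The minCV/minCVSum rule above can be illustrated with a short Python sketch (an assumption-laden illustration of the rule as described, not MetaBAT's actual implementation):

```python
# Sketch of the minCV / minCVSum coverage filter described above:
# a contig is kept when at least one sample coverage is >= minCV, and
# the sum of those "effective" coverages is >= minCVSum.
def passes_coverage_filter(coverages, min_cv=1.0, min_cv_sum=1.0):
    """coverages: per-sample coverage values for one contig."""
    effective = [c for c in coverages if c >= min_cv]
    return len(effective) > 0 and sum(effective) >= min_cv_sum

print(passes_coverage_filter([0.2, 0.5, 0.1]))  # → False (no sample reaches minCV)
print(passes_coverage_filter([0.2, 1.3, 0.1]))  # → True (one effective sample, sum 1.3)
```

Raising minCV or minCVSum is therefore a way to exclude low-coverage contigs, which tend to have noisier abundance profiles.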