Best Binning Practices
Note: This is for MetaBAT 1. See here for MetaBAT 2.
On this page, we demonstrate best binning practices using MetaBAT and CheckM. One of the challenging parts of metagenomic binning is that there is no one-size-fits-all solution for every dataset, so it is necessary to explore the parameter space to find the best binning results. An automated solution would be ideal, but so far it is not easily achievable in practice. MetaBAT is an efficient binning method that scales to large numbers of samples from complex communities, and it can save intermediate files (the --saveTNF and --saveDistance options used throughout this page) for faster parameter exploration. We hope that this example workflow will guide future research and make it easier to find the best bins.
Prerequisites
We will use two datasets in this guide. The first is the MetaHIT dataset, as explained on a benchmark page, and the second dataset is here. For this guide, we use MetaBAT v0.32.4 and CheckM v1.0.6.
Summary of Workflow
We will start with the default setting in MetaBAT and then explore other advanced settings. To guide the direction of the search, we will use the completeness and contamination estimates from CheckM. The main characteristic of the default setting (--sensitive) is that it balances sensitivity and specificity, so once the bins are formed and evaluated by CheckM, it is important to decide whether one wants better sensitivity or better specificity. One of the main design goals of MetaBAT was to achieve as much specificity as possible, so a good direction is to tune parameters to improve sensitivity without much loss of specificity.
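As a rough complement to the R tables used below, the completeness and contamination estimates can also be tallied directly in the shell. The following is a minimal sketch, not part of MetaBAT or CheckM: `count_bins` is a hypothetical helper, and it assumes the CheckM results are available as a tab-separated table whose header row contains columns named exactly `Completeness` and `Contamination`. The commands on this page write CheckM's default layout, so the table would first have to be exported in tab-separated form.

```shell
# count_bins FILE MIN_COMPLETENESS MAX_CONTAMINATION
# Hypothetical helper: counts rows of a tab-separated table whose
# completeness is >= MIN_COMPLETENESS and contamination <= MAX_CONTAMINATION.
# Assumption: the header row has columns named "Completeness" and
# "Contamination"; the columns are located by name, not by position.
count_bins() {
    awk -F'\t' -v minc="$2" -v maxc="$3" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "Completeness")  comp = i
                if ($i == "Contamination") cont = i
            }
            # Stop with an error status if either column is missing.
            if (!comp || !cont) { bad = 1; exit 1 }
            next
        }
        ($comp + 0) >= minc && ($cont + 0) <= maxc { n++ }
        END { if (bad) exit 1; print n + 0 }
    ' "$1"
}
```

For example, `count_bins checkm.tsv 90 5` would print the number of bins with at least 90% completeness and at most 5% contamination (`checkm.tsv` is a placeholder file name). These counts use CheckM's raw percentages and are not guaranteed to match the recall and precision cells printed by benchmark.R.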
CASE 1: When It Needs More Sensitivity
The first case applies when there are many samples (in this case 264) or the community structure is at most moderately complex.
Run MetaBAT and CheckM
    #!bash
    metabat -i assembly.fa -a depth.txt -o bin1/Bin --saveTNF saved.tnf --saveDistance saved.dist -v
    checkm lineage_wf -f bin1/CheckM.txt -t 8 -x fa bin1/ bin1/SCG
Check the result using R
    > source('http://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/benchmark.R')
    > printPerf(list(calcPerfBySCG("bin1/CheckM.txt", removeStrain=F)), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
          0.6  175  175  156  138  112   90   63   42   15    4
          0.7  175  175  156  138  112   90   63   42   15    4
          0.8  174  174  155  137  111   89   62   41   14    3
          0.9  157  157  138  120   95   73   49   33   12    2
         0.95  117  117   98   81   57   38   20   12    8    2
         0.99   31   31   18   13    3    2    1    0    0    0
    > printPerf(list(calcPerfBySCG("bin1/CheckM.txt", removeStrain=T)), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
          0.6  175  175  156  138  112   90   63   42   15    4
          0.7  175  175  156  138  112   90   63   42   15    4
          0.8  175  175  156  138  112   90   63   42   15    4
          0.9  174  174  155  137  111   89   63   42   15    4
         0.95  172  172  153  135  109   87   61   40   15    4
         0.99  142  142  123  107   82   64   44   27    8    2
Trying '--verysensitive' mode
    #!bash
    metabat -i assembly.fa -a depth.txt -o bin2/Bin --saveTNF saved.tnf --saveDistance saved.dist -v --verysensitive
    checkm lineage_wf -f bin2/CheckM.txt -t 8 -x fa bin2/ bin2/SCG

    > printPerf(list(calcPerfBySCG("bin2/CheckM.txt", removeStrain=F)), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
          0.6  184  184  162  132  115   96   72   47   20    6
          0.7  184  184  162  132  115   96   72   47   20    6
          0.8  182  182  160  130  113   94   70   45   18    5
          0.9  154  154  132  102   85   67   45   26    9    2
         0.95  115  115   93   64   48   32   19   10    5    2
         0.99   31   31   20    6    2    1    1    0    0    0
    > printPerf(list(calcPerfBySCG("bin2/CheckM.txt", removeStrain=T)), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
          0.6  184  184  162  132  115   96   72   47   20    6
          0.7  184  184  162  132  115   96   72   47   20    6
          0.8  184  184  162  132  115   96   72   47   20    6
          0.9  183  183  161  131  114   95   72   47   20    6
         0.95  179  179  157  127  110   91   69   44   19    6
         0.99  137  137  116   89   75   58   44   25    9    5
    > diffPerf(calcPerfBySCG("bin2/CheckM.txt", removeStrain=F), calcPerfBySCG("bin1/CheckM.txt", removeStrain=F))
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
          0.6    9    9    6   -6    3    6    9    5    5    2
          0.7    9    9    6   -6    3    6    9    5    5    2
          0.8    8    8    5   -7    2    5    8    4    4    2
          0.9   -3   -3   -6  -18  -10   -6   -4   -7   -3    0
         0.95   -2   -2   -5  -17   -9   -6   -1   -2   -3    0
         0.99    0    0    2   -7   -1   -1    0    0    0    0
Trying '--minContig 1500'
This change lowers the minimum contig length for initial binning, so more contigs tend to be binned at the cost of increased contamination. The default is 2500, and the value should be no less than 1500.
    #!bash
    metabat -i assembly.fa -a depth.txt -o bin1-1/Bin --minContig 1500 --saveTNF saved_1500.tnf --saveDistance saved_1500.dist -v
    checkm lineage_wf -f bin1-1/CheckM.txt -t 8 -x fa bin1-1/ bin1-1/SCG

    > printPerf(list(calcPerfBySCG("bin1-1/CheckM.txt", removeStrain=F)), rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
          0.6  191  191  174  144  113   93   67   45   18    6
          0.7  191  191  174  144  113   93   67   45   18    6
          0.8  190  190  173  143  112   92   66   44   17    5
          0.9  173  173  156  126   96   76   53   35   14    4
         0.95  129  129  112   82   53   37   21   13    9    3
         0.99   34   34   27   15    2    1    0    0    0    0
    > diffPerf(calcPerfBySCG("bin1-1/CheckM.txt", removeStrain=F), calcPerfBySCG("bin1/CheckM.txt", removeStrain=F))
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
          0.6   16   16   18    6    1    3    4    3    3    2
          0.7   16   16   18    6    1    3    4    3    3    2
          0.8   16   16   18    6    1    3    4    3    3    2
          0.9   16   16   18    6    1    3    4    2    2    2
         0.95   12   12   14    1   -4   -1    1    1    1    1
         0.99    3    3    9    2   -1   -1   -1    0    0    0
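Before changing --minContig, it can help to see how many contigs in the assembly actually fall between the two cutoffs. The sketch below is an illustration only and is independent of MetaBAT: `contig_fraction` is a hypothetical helper that reads an uncompressed FASTA file and reports how many sequences are at least a given length.

```shell
# contig_fraction FASTA CUTOFF
# Hypothetical helper: prints "<kept> <total>", where <kept> is the number
# of sequences in FASTA that are at least CUTOFF bases long.
# Assumption: FASTA is uncompressed; sequences may span several lines.
contig_fraction() {
    awk -v cutoff="$2" '
        # Close out the previous record, if there was one.
        function tally() { if (seen) { total++; if (len >= cutoff) kept++ } }
        /^>/ { tally(); seen = 1; len = 0; next }
        { len += length($0) }
        END { tally(); print kept + 0, total + 0 }
    ' "$1"
}
```

Running `contig_fraction assembly.fa 2500` and `contig_fraction assembly.fa 1500` and comparing the first number of each result gives the count of additional contigs that become eligible for initial binning at the lower cutoff.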
Trying Ensemble Binning
Ensemble binning has been available in MetaBAT since v0.30.0. The main motivation for adding it is to provide a mechanism for combining bins that achieves better sensitivity without sacrificing specificity. MetaBAT is known to produce highly specific bins that often lack sensitivity, because bins tend to be split into highly similar smaller pieces. Ensemble binning compensates for this problem and increases sensitivity.
    #!bash
    metabat -i assembly.fa -a depth.txt -o bin3-1/Bin --minContig 1500 --saveTNF saved_1500.tnf --saveDistance saved_1500.dist -v -B 20 --keep
    checkm lineage_wf -f bin3-1/CheckM.txt -t 8 -x fa bin3-1/ bin3-1/SCG

    > diffPerf(calcPerfBySCG("bin3-1/CheckM.txt", removeStrain=F), calcPerfBySCG("bin1-1/CheckM.txt", removeStrain=F))
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
            0  -24  -24  -24  -20   -6    0   11   18   17    9
          0.1  -25  -25  -25  -21   -7   -1   10   17   16    8
          0.2  -26  -26  -26  -22   -8   -2    9   16   15    7
          0.3  -29  -29  -29  -25  -11   -5    6   13   12    5
          0.4  -31  -31  -31  -27  -13   -7    4   11   10    3
          0.5  -38  -38  -38  -34  -20  -14   -3    4    8    3
          0.6  -43  -43  -43  -39  -25  -19   -8   -1    4    1
          0.7  -43  -43  -43  -39  -25  -19   -8   -1    4    1
          0.8  -46  -46  -46  -42  -28  -22  -11   -4    1   -1
          0.9  -49  -49  -49  -45  -32  -25  -16   -7   -2   -1
         0.95  -36  -36  -36  -32  -20  -12   -8   -6   -4   -1
         0.99   -8   -8  -10   -8   -2   -1    0    0    0    0
    #!bash
    metabat -i assembly.fa -a depth.txt -o bin4/Bin --minContig 1500 --saveTNF saved_1500.tnf --saveDistance saved_1500.dist -v -B 20 --keep --specific
    checkm lineage_wf -f bin4/CheckM.txt -t 8 -x fa bin4/ bin4/SCG

    > diffPerf(calcPerfBySCG("bin4/CheckM.txt", removeStrain=F), calcPerfBySCG("bin1-1/CheckM.txt", removeStrain=F))
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
            0  -35  -35  -34  -24   -9    0   12   10   13    7
          0.1  -39  -39  -38  -28  -13   -4    8    6    9    3
          0.2  -40  -40  -39  -29  -14   -5    7    5    8    2
          0.3  -43  -43  -42  -32  -17   -8    4    2    5    2
          0.4  -47  -47  -46  -36  -21  -12    0   -2    1   -1
          0.5  -48  -48  -47  -37  -22  -13   -1   -3    0   -2
          0.6  -51  -51  -50  -40  -25  -16   -4   -5    0   -2
          0.7  -52  -52  -51  -41  -26  -17   -5   -6   -1   -2
          0.8  -54  -54  -53  -43  -28  -19   -7   -8   -3   -2
          0.9  -55  -55  -54  -44  -30  -21  -10   -8   -5   -2
         0.95  -46  -46  -45  -35  -20  -11   -5   -4   -6   -1
         0.99   -8   -8  -13  -10   -2   -1    0    0    0    0
    > diffPerf(calcPerfBySCG("bin4/CheckM.txt", removeStrain=F), calcPerfBySCG("bin3-1/CheckM.txt", removeStrain=F))
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
            0  -11  -11  -10   -4   -3    0    1   -8   -4   -2
          0.1  -14  -14  -13   -7   -6   -3   -2  -11   -7   -5
          0.2  -14  -14  -13   -7   -6   -3   -2  -11   -7   -5
          0.3  -14  -14  -13   -7   -6   -3   -2  -11   -7   -3
          0.4  -16  -16  -15   -9   -8   -5   -4  -13   -9   -4
          0.5  -10  -10   -9   -3   -2    1    2   -7   -8   -5
          0.6   -8   -8   -7   -1    0    3    4   -4   -4   -3
          0.7   -9   -9   -8   -2   -1    2    3   -5   -5   -3
          0.8   -8   -8   -7   -1    0    3    4   -4   -4   -1
          0.9   -6   -6   -5    1    2    4    6   -1   -3   -1
         0.95  -10  -10   -9   -3    0    1    3    2   -2    0
         0.99    0    0   -3   -2    0    0    0    0    0    0
    #!bash
    metabat -i assembly.fa -a depth.txt -o bin4-1/Bin --minContig 1500 --saveTNF saved_1500.tnf --saveDistance saved_1500.dist -v -B 20 --keep --specific --pB 20
    checkm lineage_wf -f bin4-1/CheckM.txt -t 8 -x fa bin4-1/ bin4-1/SCG

    > diffPerf(calcPerfBySCG("bin4-1/CheckM.txt", removeStrain=F), calcPerfBySCG("bin4/CheckM.txt", removeStrain=F))
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
            0   20   20   17   20   15   11    5    1    0   -2
          0.1   24   24   21   24   19   15    9    5    4    2
          0.2   25   25   22   25   20   16   10    6    5    3
          0.3   28   28   25   28   23   19   13    9    8    3
          0.4   32   32   29   32   27   23   17   13   12    6
          0.5   33   33   30   33   28   24   18   14   13    7
          0.6   36   36   33   36   31   27   21   16   13    7
          0.7   37   37   34   37   32   28   22   17   14    7
          0.8   34   34   31   34   29   25   19   14   11    5
          0.9   29   29   26   29   24   20   13    9    8    3
         0.95   23   23   20   23   18   13    8    5    5    2
         0.99    3    3    0    2    0    0    0    0    0    0
    #!bash
    metabat -i assembly.fa -a depth.txt -o bin4-2/Bin --minContig 1500 --saveTNF saved_1500.tnf --saveDistance saved_1500.dist -v -B 20 --keep --specific --pB 5
    checkm lineage_wf -f bin4-2/CheckM.txt -t 8 -x fa bin4-2/ bin4-2/SCG

    > diffPerf(calcPerfBySCG("bin4-2/CheckM.txt", removeStrain=F), calcPerfBySCG("bin4-1/CheckM.txt", removeStrain=F))
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
          0.6    0    0    0    0    1    2    2    3    0    3
          0.7    0    0    0    0    1    2    2    3    0    3
          0.8   -1   -1   -1   -1    0    1    1    2   -1    2
          0.9   -5   -5   -5   -5   -4   -3   -3   -2   -4    0
         0.95   -2   -2   -2   -2   -1    0    0    1    0    0
         0.99    0    0    0    0    0    0    0    0    0    0
    > diffPerf(calcPerfBySCG("bin4-2/CheckM.txt", removeStrain=T), calcPerfBySCG("bin4-1/CheckM.txt", removeStrain=T))
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
          0.6    0    0    0    0    1    2    2    3    0    3
          0.7    0    0    0    0    1    2    2    3    0    3
          0.8    0    0    0    0    1    2    2    3    0    3
          0.9    0    0    0    0    1    2    2    3    0    3
         0.95    0    0    0    0    1    2    2    3    0    3
         0.99   -2   -2   -2   -2   -1    1    0    2   -1    2

    > printPerf(list(calcPerfBySCG("bin4-1/CheckM.txt", removeStrain=T)),rec=c(seq(.1,.9,.1),.95), prec=c(seq(.6,.9,.1),.95,.99))
    [[1]]
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
          0.6  176  176  157  140  119  104   84   56   31   11
          0.7  176  176  157  140  119  104   84   56   31   11
          0.8  176  176  157  140  119  104   84   56   31   11
          0.9  176  176  157  140  119  104   84   56   31   11
         0.95  173  173  154  137  116  102   82   54   31   11
         0.99  136  136  117  101   83   71   56   32   17    6
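The ensemble runs in Case 1 differ only in the output directory and a few trailing flags, so they can be generated from one template. The sketch below is a dry run that only prints the commands. `print_run` and `print_case1_sweep` are hypothetical helper names introduced here for illustration; every flag and file name is copied from the commands on this page.

```shell
# print_run OUTDIR [EXTRA_FLAGS...]
# Hypothetical helper: echoes (does not execute) the MetaBAT and CheckM
# commands for one parameter setting, using the --minContig 1500 template.
print_run() {
    pr_outdir=$1
    shift
    pr_extra="$*"
    # Append the extra flags only when some were given.
    echo "metabat -i assembly.fa -a depth.txt -o $pr_outdir/Bin --minContig 1500 --saveTNF saved_1500.tnf --saveDistance saved_1500.dist -v${pr_extra:+ $pr_extra}"
    echo "checkm lineage_wf -f $pr_outdir/CheckM.txt -t 8 -x fa $pr_outdir/ $pr_outdir/SCG"
}

# Prints the four ensemble-binning settings tried in Case 1.
print_case1_sweep() {
    print_run bin3-1 -B 20 --keep
    print_run bin4   -B 20 --keep --specific
    print_run bin4-1 -B 20 --keep --specific --pB 20
    print_run bin4-2 -B 20 --keep --specific --pB 5
}
```

Calling `print_case1_sweep` lists the eight commands for review; piping that output to a shell would execute them in order, assuming `metabat` and `checkm` are installed and on the PATH.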
CASE 2: When It Needs More Specificity
The second case applies when there are only a few samples (in this case 9) or the community structure is highly complex.
    #!bash
    metabat -i assembly.fa.gz -a depth.txt -o bin1/Bin --saveTNF saved_2500.tnf --saveDistance saved_2500.dist -v
    checkm lineage_wf -f bin1/CheckM.txt -t 8 -x fa bin1/ bin1/SCG

    > printPerf(list(calcPerfBySCG("bin1/CheckM.txt", removeStrain=F)),rec=c(seq(.1,.9,.1),.95), prec=c(seq(0,.9,.1),.95,.99))
    [[1]]
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
            0  908  908  799  703  620  530  435  337  210  102
          0.1  870  870  761  665  582  492  398  300  175   74
          0.2  866  866  757  661  578  488  394  296  171   72
          0.3  857  857  748  652  569  479  385  288  165   67
          0.4  850  850  741  645  562  472  378  282  161   66
          0.5  832  832  723  627  544  454  360  267  153   61
          0.6  813  813  704  608  525  435  342  253  144   57
          0.7  781  781  672  576  493  403  316  233  134   52
          0.8  738  738  630  534  451  366  287  215  125   49
          0.9  639  639  531  438  361  283  215  161   90   34
         0.95  470  470  373  292  227  170  119   88   47   18
         0.99  223  223  150   94   64   34   16   13    9    3
    #!bash
    metabat -i assembly.fa.gz -a depth.txt -o bin2/Bin --saveTNF saved_2500.tnf --saveDistance saved_2500.dist -v --superspecific
    checkm lineage_wf -f bin2/CheckM.txt -t 8 -x fa bin2/ bin2/SCG

    > diffPerf(calcPerfBySCG("bin2/CheckM.txt", removeStrain=F), calcPerfBySCG("bin1/CheckM.txt", removeStrain=F))
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
            0  149  149  117   67   42   17  -17  -25  -57  -29
          0.1  177  177  145   95   70   45   10    2  -30   -8
          0.2  180  180  148   98   73   48   13    5  -27   -7
          0.3  178  178  146   96   71   46   12    3  -27   -7
          0.4  181  181  149   99   74   49   16    6  -25   -7
          0.5  193  193  161  111   86   62   29   17  -19   -3
          0.6  197  197  165  115   90   66   33   20  -17   -4
          0.7  207  207  175  125  100   76   41   27  -14   -4
          0.8  207  207  174  124  100   73   39   17  -14   -4
          0.9  194  194  163  116   96   73   49   30    1    4
         0.95  176  176  142   95   83   63   41   28   10    6
         0.99   69   69   41   24   22   18   14    9    2   -1
    #!bash
    metabat -i assembly.fa.gz -a depth.txt -o bin3/Bin --saveTNF saved_2500.tnf --saveDistance saved_2500.dist -v --superspecific -B 20 --keep
    checkm lineage_wf -f bin3/CheckM.txt -t 8 -x fa bin3/ bin3/SCG

    > diffPerf(calcPerfBySCG("bin3/CheckM.txt", removeStrain=F), calcPerfBySCG("bin2/CheckM.txt", removeStrain=F))
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
            0  -75  -75  -21   31   36   53   77   77   63   26
          0.1  -80  -80  -26   26   31   49   73   73   58   20
          0.2  -82  -82  -28   24   29   47   71   71   56   19
          0.3  -77  -77  -23   29   34   52   75   75   57   22
          0.4  -80  -80  -26   26   31   49   71   72   55   20
          0.5  -84  -84  -30   22   27   44   66   69   52   18
          0.6  -94  -94  -40   12   17   34   55   58   42   14
          0.7  -98  -98  -44    8   13   31   50   52   38   14
          0.8  -92  -92  -38   14   19   37   50   56   36    9
          0.9  -90  -90  -37   15   16   33   35   37   23    7
         0.95 -101 -101  -47    0   -5   11   16   18   18    6
         0.99  -66  -66  -24    0  -14   -6   -4   -8   -2    0
    #!bash
    metabat -i assembly.fa.gz -a depth.txt -o bin3-1/Bin --saveTNF saved_2500.tnf --saveDistance saved_2500.dist -v --superspecific -B 20 --keep --pB 20
    checkm lineage_wf -f bin3-1/CheckM.txt -t 8 -x fa bin3-1/ bin3-1/SCG

    > diffPerf(calcPerfBySCG("bin3-1/CheckM.txt", removeStrain=F), calcPerfBySCG("bin3/CheckM.txt", removeStrain=F))
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
            0  -53  -53  -51  -33   -9   -5    3    8   38   25
          0.1  -62  -62  -60  -42  -18  -15   -6   -1   30   20
          0.2  -65  -65  -63  -45  -21  -18   -9   -4   27   19
          0.3  -69  -69  -67  -49  -25  -22  -13   -8   24   16
          0.4  -71  -71  -69  -51  -27  -24  -15  -10   22   14
          0.5  -74  -74  -72  -54  -30  -27  -18  -14   17   11
          0.6  -70  -70  -68  -50  -26  -23  -13  -10   18   13
          0.7  -69  -69  -67  -49  -25  -23  -13   -6   18   11
          0.8  -83  -83  -81  -63  -39  -35  -24  -15    7    8
          0.9 -105 -105 -101  -84  -61  -55  -42  -32   -3   -2
         0.95  -96  -96  -95  -78  -59  -54  -37  -28  -12   -3
         0.99  -39  -39  -38  -30  -16  -12   -9   -4   -2    0
    > diffPerf(calcPerfBySCG("bin3-1/CheckM.txt", removeStrain=T), calcPerfBySCG("bin3/CheckM.txt", removeStrain=T))
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
            0  -53  -53  -51  -33   -9   -5    3    8   38   25
          0.1  -59  -59  -57  -39  -15  -12   -3    2   33   21
          0.2  -59  -59  -57  -39  -15  -12   -3    2   33   23
          0.3  -59  -59  -57  -39  -15  -12   -3    2   34   22
          0.4  -61  -61  -59  -41  -17  -14   -5    0   31   20
          0.5  -70  -70  -68  -50  -26  -23  -14   -8   24   15
          0.6  -63  -63  -61  -43  -19  -16   -6   -3   28   18
          0.7  -66  -66  -64  -46  -22  -19  -10   -7   22   18
          0.8  -68  -68  -66  -48  -24  -22  -10   -4   21   17
          0.9  -83  -83  -80  -63  -40  -34  -24  -12   13   13
         0.95  -88  -88  -85  -65  -42  -39  -28  -17    6    8
         0.99  -53  -53  -54  -41  -26  -22  -12   -6    9    6

    > printPerf(list(calcPerfBySCG("bin3/CheckM.txt", removeStrain=F)),rec=c(seq(.1,.9,.1),.95), prec=c(seq(0,.9,.1),.95,.99))
    [[1]]
               Recall
    Precision  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 0.95
            0  982  982  895  801  698  600  495  389  216   99
          0.1  967  967  880  786  683  586  481  375  203   86
          0.2  964  964  877  783  680  583  478  372  200   84
          0.3  958  958  871  777  674  577  472  366  195   82
          0.4  951  951  864  770  667  570  465  360  191   79
          0.5  941  941  854  760  657  560  455  353  186   76
          0.6  916  916  829  735  632  535  430  331  169   67
          0.7  890  890  803  709  606  510  407  312  158   62
          0.8  853  853  766  672  570  476  376  288  147   54
          0.9  743  743  657  569  473  389  299  228  114   45
         0.95  545  545  468  387  305  244  176  134   75   30
         0.99  226  226  167  118   72   46   26   14    9    2