Details of assembly methods and in silico genotyping for Yersinia
For a general description of the in silico typing method used to apply these schemes on NGS data, please click here.
All MLST-like typing methods in EnteroBase are derived from a genome assembly of sequenced reads. For an explanation of this method please click here.
| MLST – Classic | Ribosomal MLST (Jolley, 2012) | Core Genome MLST | Whole Genome MLST |
|---|---|---|---|
| 7 Loci | 53 Loci | 1,553 Loci | 19,531 Loci |
| Conserved housekeeping genes | Ribosomal proteins | Core genes | Any coding sequence |
| Highly conserved; low resolution | Highly conserved; medium resolution | Variable; high resolution | Highly variable; extreme resolution |
| Different scheme for each species/genus | Single scheme across tree of life | Different scheme for each species/genus | Different scheme for each species/genus |
7 Gene MLST
Achtman 7 gene MLST scheme
The Achtman 7 gene MLST scheme is described in Laukkanen-Ninios et al (2011) Environ. Microbiol. 13(12), 3114-3127.
Genes included in Achtman 7 gene MLST (together with the length of sequence used for MLST taken from table 3 in the above cited paper):
McNally 7 gene MLST scheme
The McNally 7 gene MLST scheme is described in Hall et al (2015) J. Clin. Microbiol. 53(1), 35-42.
Genes included in McNally 7 gene MLST (together with the length of sequence used for MLST taken from table 2 in the above cited paper):
Ribosomal MLST (rMLST)
- rMLST is Copyright © 2010-2016, University of Oxford. rMLST is described in Jolley et al. 2012 Microbiology 158:1005-15.
Whole genome and Core genome MLST (cgMLST)
Whole genome MLST (wgMLST) and core genome MLST (cgMLST) schemes have been defined in EnteroBase as standard typing methods for Yersinia, providing finer discrimination between genotypes than the 7 gene MLST and rMLST schemes. Construction of these schemes consisted of three stages. Firstly, coding sequences were compiled from 242 Yersinia genomes: 79 complete genomes from NCBI, 8 NCTC genomes from PacBio sequencing, and 155 representative genomes from EnteroBase, one for each rMLST sequence type not already represented among the complete genomes. This collection encompassed the genomic diversity of the genus Yersinia and comprised a total of 934,959 CDSs, which were grouped into 102,516 gene clusters using Uclust. To identify homologous regions within each genome, the centroid sequence of each cluster was aligned to all 242 genomes using nucleotide BLAST; a gene was considered present if a match had greater than 70% nucleotide identity over 50% of the length of the centroid sequence.
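The presence criterion above can be sketched as a small filter over BLAST hits. This is a minimal illustration, not the EnteroBase implementation: the `BlastHit` record, its field names and the `gene_present` function are hypothetical, and it assumes that "over 50% of the length" means the aligned span covers at least half of the centroid.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    """Hypothetical record for one centroid-versus-genome BLAST match."""
    identity_pct: float  # percent nucleotide identity of the alignment
    align_len: int       # number of centroid bases covered by the alignment
    centroid_len: int    # full length of the centroid sequence


def gene_present(hit: BlastHit,
                 min_identity: float = 70.0,
                 min_coverage: float = 0.5) -> bool:
    """Apply the identity and coverage thresholds described in the text.

    Assumption: identity must be strictly greater than 70%, and the hit
    must cover at least 50% of the centroid length.
    """
    coverage = hit.align_len / hit.centroid_len
    return hit.identity_pct > min_identity and coverage >= min_coverage


# Example: one hit that passes and two that fail a single threshold each.
good_hit = BlastHit(identity_pct=85.0, align_len=600, centroid_len=1000)
low_identity = BlastHit(identity_pct=65.0, align_len=900, centroid_len=1000)
short_hit = BlastHit(identity_pct=90.0, align_len=400, centroid_len=1000)
```

Keeping the thresholds as parameters makes it easy to see how sensitive the set of detected genes is to the chosen cut-offs.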
To identify paralogs, sets of homologous regions were flagged as containing potential paralogs if they had duplicate matches within any single genome. These sets were iteratively sub-clustered based on phylogenetic topology. Firstly, the regions in each set were aligned together. The resulting alignment was used to generate a maximum likelihood tree using FastTree. The ETE3 package was used to bipartition the tree so as to maximise the nucleotide diversity (at least 5%) between the subtrees. Each of the resulting subtrees was evaluated iteratively until no two regions in the same subtree came from the same genome, or the maximum inter-subtree diversity was less than 5%. The original set of homologous regions was then replaced with all of its subtrees.
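The iterative splitting described above can be sketched in plain Python. This is a simplified illustration under several assumptions, and it does not use ETE3 or FastTree: the tree is a nested tuple of region names, "nucleotide diversity between the subtrees" is approximated by the mean pairwise distance between the two sides of a bipartition, and all function names (`split_paralogs`, `between_diversity`, and so on) are hypothetical.

```python
from itertools import combinations


def leaves(tree):
    """Return the leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def clades(tree):
    """Yield the leaf set below every non-root node (one bipartition each)."""
    if isinstance(tree, str):
        return
    for child in tree:
        yield frozenset(leaves(child))
        yield from clades(child)


def prune(tree, keep):
    """Restrict the tree to leaves in `keep`, collapsing single-child nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [k for k in (prune(c, keep) for c in tree) if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def between_diversity(side_a, side_b, dist):
    """Mean pairwise distance between two leaf sets (assumed diversity measure)."""
    pairs = [(a, b) for a in side_a for b in side_b]
    return sum(dist[frozenset(p)] for p in pairs) / len(pairs)


def split_paralogs(tree, genome_of, dist, min_div=0.05):
    """Recursively bipartition until no genome repeats or diversity < min_div."""
    names = leaves(tree)
    if len({genome_of[n] for n in names}) == len(names):
        return [sorted(names)]  # every region comes from a different genome
    everything = frozenset(names)
    candidates = [(between_diversity(c, everything - c, dist), c)
                  for c in clades(tree) if c != everything]
    best_div, best_side = max(candidates, key=lambda item: item[0])
    if best_div < min_div:
        return [sorted(names)]  # remaining duplicates are too similar to split
    rest = everything - best_side
    return (split_paralogs(prune(tree, best_side), genome_of, dist, min_div)
            + split_paralogs(prune(tree, rest), genome_of, dist, min_div))


# Toy example: two genomes (g1, g2), each carrying two gene copies (a, b).
names = ["g1_a", "g2_a", "g1_b", "g2_b"]
genome_of = {n: n.split("_")[0] for n in names}
dist = {frozenset(p): (0.01 if p[0][-1] == p[1][-1] else 0.20)
        for p in combinations(names, 2)}
tree = (("g1_a", "g2_a"), ("g1_b", "g2_b"))
subsets = split_paralogs(tree, genome_of, dist)
```

In this toy input the two copies differ by 20% while orthologs differ by 1%, so the set is divided into one subset per copy; if all distances were below 5% the set would be left intact.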
After the division process, all the homolog sets were scored and ranked according to the summarised alignment scores of their homologous regions. Homolog sets were discarded if any of their regions overlapped with regions belonging to other, higher-scoring sets.
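One way to picture this ranking and filtering step is a greedy pass over the sets in descending score order. The sketch below is an assumption about how such a filter could work, not the actual EnteroBase code: the `Region` type, the input layout and `filter_homolog_sets` are hypothetical, and a set is compared only against higher-scoring sets that were themselves kept.

```python
from typing import Dict, List, NamedTuple, Tuple


class Region(NamedTuple):
    """Hypothetical genomic interval; coordinates are inclusive."""
    genome: str
    start: int
    end: int


def overlaps(a: Region, b: Region) -> bool:
    """True if two regions lie on the same genome and share at least one base."""
    return a.genome == b.genome and a.start <= b.end and b.start <= a.end


def filter_homolog_sets(sets: Dict[str, Tuple[float, List[Region]]]) -> List[str]:
    """Keep sets in descending score order, dropping any that overlap a kept set."""
    kept_names: List[str] = []
    kept_regions: List[Region] = []
    ranked = sorted(sets.items(), key=lambda kv: kv[1][0], reverse=True)
    for name, (_score, regions) in ranked:
        if any(overlaps(r, k) for r in regions for k in kept_regions):
            continue  # collides with a higher-scoring set that was kept
        kept_names.append(name)
        kept_regions.extend(regions)
    return kept_names


# Toy example: B overlaps the higher-scoring A and is discarded.
example_sets = {
    "A": (100.0, [Region("g1", 1, 100)]),
    "B": (80.0, [Region("g1", 50, 150)]),
    "C": (60.0, [Region("g1", 120, 200)]),
    "D": (50.0, [Region("g2", 1, 50)]),
}
kept = filter_homolog_sets(example_sets)
```

Note the design choice this exposes: set C overlaps the discarded set B but not the kept set A, so it survives here. A stricter reading of the text, in which overlap with any higher-scoring set is disqualifying, would discard C as well.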
Finally, a complete set of 29,533 pan genes was identified for these 242 genomes. This set was further refined to 19,531 clusters by merging similar gene clusters whose genes shared over 70% amino acid similarity. From each cluster, the single representative with the greatest alignment score was chosen to create a wgMLST scheme for Yersinia. This removed potential non-specific matches to paralogs in the downstream typing procedure. 217 Yersinia genomes, representing all rMLST STs in EnteroBase (up to May 2016), were typed using this novel scheme. To generate the cgMLST scheme, a subset of wgMLST loci was selected based on three criteria: 1) the locus was present in over 98% (1,804) of the genomes; 2) the coding frame of the locus was intact in over 94% (1,835) of the genomes; 3) the number of alleles at the locus was in line with that of the majority of loci. This process yielded a total of 1,553 loci, forming the cgMLST scheme for Yersinia.
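The first two selection criteria can be illustrated with a short filter over per-genome locus calls. This is a sketch under stated assumptions: the input layout, the status labels (`"intact"`, `"disrupted"`, `"absent"`) and the function `select_core_loci` are hypothetical, and the third criterion (allele counts) is left out because the text does not define its threshold.

```python
from typing import Dict, List


def select_core_loci(calls: Dict[str, List[str]],
                     min_present: float = 0.98,
                     min_intact: float = 0.94) -> List[str]:
    """Return loci passing the presence and intact-frame thresholds.

    `calls` maps each locus to one status per genome:
    "intact", "disrupted" (present but frame broken) or "absent".
    Thresholds are applied as strict "over X%" comparisons.
    """
    core = []
    for locus, statuses in calls.items():
        n_genomes = len(statuses)
        present = sum(s != "absent" for s in statuses) / n_genomes
        intact = sum(s == "intact" for s in statuses) / n_genomes
        if present > min_present and intact > min_intact:
            core.append(locus)
    return sorted(core)


# Toy example with 100 genomes per locus (locus names are made up).
example_calls = {
    "locusA": ["intact"] * 100,
    "locusB": ["intact"] * 95 + ["disrupted"] * 4 + ["absent"] * 1,
    "locusC": ["intact"] * 97 + ["absent"] * 3,       # fails presence
    "locusD": ["intact"] * 90 + ["disrupted"] * 10,   # fails intact frame
}
core_loci = select_core_loci(example_calls)
```

Here `locusA` and `locusB` pass both thresholds, while `locusC` is present in too few genomes and `locusD` has a disrupted coding frame too often.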