Yersinia Statistics

Details of assembly methods and in silico genotyping for Yersinia

For a general description of the in silico typing method used to apply these schemes on NGS data, please click here.

All MLST-like typing methods in EnteroBase are derived from a genome assembly of sequenced reads. For an explanation of this method please click here.

| MLST – Classic | Ribosomal MLST (Jolley, 2012) | Core Genome MLST | Whole Genome MLST |
|---|---|---|---|
| 7 Loci | 53 Loci | 1,553 Loci | 19,531 Loci |
| Conserved housekeeping genes | Ribosomal proteins | Core genes | Any coding sequence |
| Highly conserved; low resolution | Highly conserved; medium resolution | Variable; high resolution | Highly variable; extreme resolution |
| Different scheme for each species/genus | Single scheme across tree of life | Different scheme for each species/genus | Different scheme for each species/genus |

7 Gene MLST

Achtman 7 gene MLST scheme

The Achtman 7 gene MLST scheme is described in Laukkanen-Ninios et al. (2011) Environ. Microbiol. 13(12), 3114-3127.

Genes included in the Achtman 7 gene MLST scheme, together with the length of sequence used for MLST (taken from Table 3 of the paper cited above):

Gene Length
adk 389
argA 361
aroA 357
glnA 338
thrA 342
tmk 375
trpE 465

McNally 7 gene MLST scheme

The McNally 7 gene MLST scheme is described in Hall et al. (2015) J. Clin. Microbiol. 53(1), 35-42.

Genes included in the McNally 7 gene MLST scheme, together with the length of sequence used for MLST (taken from Table 2 of the paper cited above):

Gene Length
aarF 500
dfp 455
galR 500
glnS 442
hemA 490
speA 452
rfaE 429

Ribosomal MLST (rMLST)

Whole genome and Core genome MLST (cgMLST)

Whole genome MLST (wgMLST) and core genome MLST (cgMLST) schemes have been defined in EnteroBase as standard typing methods for Yersinia, providing finer discrimination between genotypes than the 7 gene MLST and rMLST schemes. Construction of these schemes consisted of three stages. First, coding sequences (CDS) were compiled from 242 Yersinia genomes: 79 complete genomes from NCBI, 8 NCTC genomes from PacBio sequencing, and 155 genomes chosen as one representative per rMLST sequence type within EnteroBase (for sequence types not represented among the complete genomes). This collection encompassed the genomic diversity within the Yersinia genus and comprised a total of 934,959 CDS, which were grouped into 102,516 gene clusters using Uclust. To identify homologous regions within each genome, the centroid sequence of each cluster was aligned onto all 242 genomes using nucleotide BLAST; a gene was considered present if a match had greater than 70% nucleotide identity over 50% of the length of the centroid sequence.
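The presence rule described above can be illustrated with a minimal sketch. The `BlastHit` record, its field names and the function names are hypothetical and are not taken from the EnteroBase code; the treatment of values lying exactly on a threshold is also an assumption, since the text does not specify it.

```python
from dataclasses import dataclass

# Thresholds quoted in the text; how boundary values are treated is an assumption.
MIN_IDENTITY = 70.0   # percent nucleotide identity (strictly greater than)
MIN_COVERAGE = 0.50   # fraction of the centroid length covered (at least)


@dataclass
class BlastHit:
    """Hypothetical record for one BLAST match of a centroid to a genome."""
    genome: str
    centroid: str
    identity: float       # percent identity reported for the match
    aligned_length: int   # number of centroid bases covered by the match
    centroid_length: int  # full length of the centroid sequence


def is_present(hit: BlastHit) -> bool:
    """Apply the identity and coverage thresholds to a single match."""
    coverage = hit.aligned_length / hit.centroid_length
    return hit.identity > MIN_IDENTITY and coverage >= MIN_COVERAGE


def genomes_with_gene(hits):
    """Map each centroid to the set of genomes in which it counts as present."""
    present = {}
    for hit in hits:
        if is_present(hit):
            present.setdefault(hit.centroid, set()).add(hit.genome)
    return present
```

In practice the hit records would be parsed from tabular BLAST output; that step is omitted here because the exact output format used is not stated.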

To identify paralogs, sets of homologous regions were flagged as containing potential paralogs if they had duplicate matches within any single genome. These sets were iteratively sub-clustered based on phylogenetic topology. First, the regions in each set were aligned together. The resulting alignment was used to generate a maximum likelihood tree using FastTree. The ETE3 package was used to bipartition the tree so as to maximise the nucleotide diversity (at least 5%) between the subtrees. Each of the resulting subtrees was evaluated iteratively until no two regions in the same subtree came from the same genome, or the maximum inter-subtree diversity was less than 5%. The original set of homologous regions was then replaced with all of its subtrees.
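The control flow of this iterative division can be sketched as a short recursion. This is only an illustration of the stopping rules: the alignment, FastTree and ETE3 steps are replaced by a caller-supplied `bipartition` function, which is a hypothetical stand-in that returns two groups of regions and the diversity between them. All names here are assumptions.

```python
from collections import Counter

MIN_DIVERSITY = 0.05  # 5% inter-subtree nucleotide diversity, as quoted in the text


def has_potential_paralogs(regions):
    """True if any genome contributes more than one region to the set.

    `regions` is a list of (genome_id, region_id) pairs (hypothetical layout).
    """
    counts = Counter(genome for genome, _ in regions)
    return any(n > 1 for n in counts.values())


def split_paralogs(regions, bipartition, min_diversity=MIN_DIVERSITY):
    """Recursively divide a homolog set until a stopping rule applies.

    `bipartition(regions)` must return (left, right, diversity); it stands in
    for building a tree and choosing the split that maximises diversity.
    """
    if not has_potential_paralogs(regions):
        return [regions]               # no two regions share a genome
    left, right, diversity = bipartition(regions)
    if diversity < min_diversity or not left or not right:
        return [regions]               # subtrees too similar to separate
    return (split_paralogs(left, bipartition, min_diversity)
            + split_paralogs(right, bipartition, min_diversity))
```

The returned list of subsets corresponds to the subtrees that replace the original homolog set.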

After the division process, all the homolog sets were scored and ranked according to the summed alignment scores of their homologous regions. A homolog set was discarded if any of its regions overlapped with regions in another set that had a greater score.
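A literal reading of this overlap filter can be sketched as follows. The `Region` tuple, the inclusive coordinates and the rule that any shared base counts as an overlap are assumptions made for illustration; the actual implementation may rank the sets and resolve overlaps greedily, which can give different results when three or more sets overlap in a chain.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Hypothetical homologous region with inclusive coordinates."""
    genome: str
    start: int
    end: int


def overlaps(a: Region, b: Region) -> bool:
    """Two regions overlap if they share at least one base on the same genome."""
    return a.genome == b.genome and a.start <= b.end and b.start <= a.end


def filter_homolog_sets(sets):
    """Keep the names of sets that do not overlap any higher-scoring set.

    `sets` is a list of (name, score, regions) tuples.
    """
    kept = []
    for name, score, regions in sets:
        beaten = any(
            other_score > score
            and any(overlaps(r, o) for r in regions for o in other_regions)
            for other_name, other_score, other_regions in sets
            if other_name != name
        )
        if not beaten:
            kept.append(name)
    return kept
```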

Finally, a complete set of 29,533 pan genes was identified for these 242 genomes. This set was further refined to 19,531 clusters by merging gene clusters whose genes shared over 70% amino acid similarity. From each cluster, the single representative with the greatest alignment score was chosen to create a wgMLST scheme for Yersinia. This removed potential non-specific matches to paralogs in the downstream typing procedure. 217 Yersinia genomes, representing all rMLST STs in EnteroBase (up to May 2016), were typed using this novel scheme. To generate the cgMLST scheme, a subset of wgMLST loci was selected based on three criteria: 1) the locus was present in over 98% (1,804) of the genomes; 2) the coding frame of the locus was intact in over 94% (1,835) of the genomes; 3) the number of alleles of the locus fell within the range seen for the majority of loci. This process yielded a total of 1,553 loci, forming the cgMLST scheme for Yersinia.
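The first two selection criteria can be expressed as a simple filter over per-locus counts. The sketch below assumes a hypothetical dictionary mapping each locus to the number of genomes in which it was found and the number in which its coding frame was intact; the locus names are made up. The third criterion (allele number) is left out because the text does not define it precisely enough to implement.

```python
def select_core_loci(locus_stats, n_genomes, min_present=0.98, min_intact=0.94):
    """Return loci passing the presence and intact-frame thresholds.

    `locus_stats` maps locus name -> (n_present, n_intact), both counted
    over the same `n_genomes` typed genomes (hypothetical data layout).
    """
    core = []
    for locus, (n_present, n_intact) in locus_stats.items():
        present_fraction = n_present / n_genomes
        intact_fraction = n_intact / n_genomes
        # Both thresholds are applied as "over X%", i.e. strictly greater than.
        if present_fraction > min_present and intact_fraction > min_intact:
            core.append(locus)
    return sorted(core)
```

For example, with 217 typed genomes a locus would need to be present in at least 213 genomes and intact in at least 204 to pass these two filters.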
