enterobase-web / Salmonella Statistics

Details of assembly methods and in silico genotyping for Salmonella

For a general description of the in silico typing method used to apply these schemes on NGS data, please click here.

All MLST-like typing methods in EnteroBase are derived from a genome assembly of sequenced reads. For an explanation of this method please click here.

MLST – Classic | Ribosomal MLST (Jolley, 2012) | Core Genome MLST | Whole Genome MLST
7 Loci | 53 Loci | 3,002 Loci | 21,065 Loci
Conserved Housekeeping genes | Ribosomal proteins | Core genes | Any coding sequence
Highly conserved; Low resolution | Highly conserved; Medium resolution | Variable; High resolution | Highly variable; Extreme resolution
Different scheme for each species/genus | Single scheme across tree of life | Different scheme for each species/genus | Different scheme for each species/genus

7 Gene MLST

The classic MLST scheme is described in Kidgell et al. (2002) Infection, Genetics and Evolution 2(1): 39-45.

Genes included in the 7 gene MLST scheme, together with the length in base pairs of the sequence used for MLST (taken from Table 1 of the paper cited above):

Gene Name | Locus Tag | Length (bp)
thrA | STY0002 | 501
purE | STY0582 | 399
sucA | STY0779 | 501
hisD | STY2281 | 501
aroC | STY2616 | 501
hemD | STY3622 | 432
dnaN | STY3941 | 501
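As a quick sanity check on the table above, the fragment lengths can be kept in a small lookup and used to validate candidate allele sequences. This is only a minimal sketch: the names `MLST_FRAGMENT_LENGTHS` and `has_expected_length` are illustrative and are not part of EnteroBase, and alleles carrying insertions or deletions would fail this simple length test.

```python
# Fragment lengths (bp) copied from the 7 gene MLST table.
MLST_FRAGMENT_LENGTHS = {
    "thrA": 501, "purE": 399, "sucA": 501, "hisD": 501,
    "aroC": 501, "hemD": 432, "dnaN": 501,
}

def has_expected_length(gene: str, sequence: str) -> bool:
    """Return True if the sequence is as long as the scheme's fragment for this gene."""
    return len(sequence) == MLST_FRAGMENT_LENGTHS[gene]

# Total length of the seven fragments when concatenated.
TOTAL_LENGTH = sum(MLST_FRAGMENT_LENGTHS.values())
```

Summing the lengths gives the size of the concatenated seven-gene fragment that an isolate contributes to the scheme.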

sal_mst.png

Minimal spanning tree (MSTree) of MLST data on 4257 isolates of S. enterica subspecies enterica. From Achtman et al. (2012) PLoS Pathog 8(6): e1002776.

Ribosomal MLST (rMLST)

Whole genome MLST (wgMLST) and Core genome MLST (cgMLST)

Whole genome MLST (wgMLST) and core genome MLST (cgMLST) schemes have been defined in EnteroBase as standard typing methods for Salmonella, offering finer discrimination between genotypes than the 7 gene MLST and rMLST schemes. The schemes were constructed in three stages. First, coding sequences (CDS) were compiled from 537 Salmonella genomes: 167 complete genomes from NCBI, 82 NCTC genomes from PacBio sequencing, and 288 representative genomes from EnteroBase, one per rMLST-based eBURST group. This collection encompassed the genomic diversity of the Salmonella genus and comprised 2,406,798 CDS, which were grouped into 75,864 gene clusters using Uclust. To identify homologous regions within each genome, the centroid sequence of each cluster was aligned against all 537 genomes using nucleotide BLAST. A gene was considered present if a match had greater than 70% nucleotide identity over 50% of the length of the centroid sequence.
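The presence criterion described above can be written as a small predicate. This is a hedged sketch, not EnteroBase code: the function name and parameters are hypothetical, and it assumes that "over 50% of the length" means the aligned length is at least half of the centroid length.

```python
def gene_present(identity_pct: float, aligned_len: int, centroid_len: int,
                 min_identity: float = 70.0, min_coverage: float = 0.5) -> bool:
    """Decide whether a BLAST match counts as the gene being present.

    identity_pct : percent nucleotide identity of the match
    aligned_len  : number of centroid bases covered by the match
    centroid_len : full length of the cluster's centroid sequence
    """
    coverage = aligned_len / centroid_len
    # Assumption: identity is a strict threshold, coverage is inclusive.
    return identity_pct > min_identity and coverage >= min_coverage
```

For example, a match with 85% identity that covers 600 bases of a 1,000 base centroid passes, while a 95% identity match covering only 400 bases does not.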

To identify paralogs, sets of homologous regions were flagged as containing potential paralogs if they included duplicate matches within any single genome. These sets were iteratively sub-clustered based on phylogenetic topology. First, the regions in each set were aligned together, and the resulting alignment was used to generate a maximum likelihood tree using FastTree. The ETE3 package was then used to bipartition the tree so as to maximise the nucleotide diversity (at least 5%) between the subtrees. Each resulting subtree was evaluated iteratively until no two regions in the same subtree came from the same genome, or the maximum inter-subtree diversity was less than 5%. The original set of homologous regions was then replaced by its subtrees.
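The stopping rules of this iterative division can be illustrated with a toy recursion. The sketch below is a simplification under stated assumptions and does not use FastTree or ETE3: the tree is a nested tuple of `"genome:region"` strings, the split is always taken at the root of each subtree rather than at the branch that maximises diversity, and `diversity` is a caller-supplied function standing in for the real nucleotide diversity calculation.

```python
def leaves(tree):
    """Flatten a nested-tuple binary tree into a list of 'genome:region' labels."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)

def has_paralogs(tree):
    """True if two or more regions in the tree come from the same genome."""
    genomes = [label.split(":")[0] for label in leaves(tree)]
    return len(genomes) != len(set(genomes))

def split_paralogs(tree, diversity, min_div=0.05):
    """Recursively split a homolog set until no genome is duplicated
    or the two halves differ by less than min_div."""
    if isinstance(tree, str) or not has_paralogs(tree):
        return [leaves(tree)]
    left, right = tree
    # Simplification: split at the root instead of searching all branches.
    if diversity(left, right) < min_div:
        return [leaves(tree)]
    return (split_paralogs(left, diversity, min_div)
            + split_paralogs(right, diversity, min_div))

# Toy usage: two genomes, each with two copies of a region.
toy_tree = (("g1:a", "g2:a"), ("g1:b", "g2:b"))
split_sets = split_paralogs(toy_tree, lambda l, r: 0.10)
```

With a diversity of 10% between the two halves, the toy set is divided into two homolog sets with one region per genome; with a diversity below 5% it would be left as a single set.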

After the division process, all homolog sets were scored and ranked according to the summed alignment scores of their homologous regions. A homolog set was discarded if any of its regions overlapped with regions in another set that had a greater score.
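One way to picture this filtering step is a greedy pass over the sets in descending score order. The sketch below rests on assumptions not stated in the text: regions are represented as hypothetical `(genome, start, end)` tuples with inclusive coordinates, each set already carries its score, and a set is compared only against higher-scoring sets that were themselves kept.

```python
def overlaps(a, b):
    """True if two (genome, start, end) regions share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def filter_homolog_sets(homolog_sets):
    """Keep sets in descending score order, dropping any set that
    overlaps a region of an already-kept, higher-scoring set."""
    kept = []
    ranked = sorted(homolog_sets, key=lambda s: s["score"], reverse=True)
    for candidate in ranked:
        clash = any(
            overlaps(region, kept_region)
            for region in candidate["regions"]
            for kept_set in kept
            for kept_region in kept_set["regions"]
        )
        if not clash:
            kept.append(candidate)
    return kept

# Toy usage: set B overlaps the higher-scoring set A on genome g1.
toy_sets = [
    {"name": "A", "score": 100, "regions": [("g1", 1, 100)]},
    {"name": "B", "score": 80, "regions": [("g1", 50, 150)]},
    {"name": "C", "score": 60, "regions": [("g1", 200, 300), ("g2", 1, 50)]},
]
kept_names = [s["name"] for s in filter_homolog_sets(toy_sets)]
```

In this toy input, set B is dropped because it overlaps set A, and sets A and C are retained.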

Finally, a complete set of 28,883 pan genes was identified for these 537 genomes. This set was further refined to 21,065 clusters by merging similar gene clusters whose genes shared over 70% amino acid similarity. From each cluster, the single representative with the greatest alignment score was chosen to create the wgMLST scheme for Salmonella. This removed potential non-specific matches to paralogs in the downstream typing procedure. 3,144 Salmonella genomes, representing all rMLST STs in EnteroBase (up to May 2016), were typed using this new scheme. To generate the cgMLST scheme, a subset of wgMLST loci was selected based on three criteria: 1) the locus was present in over 98% (3,193) of the genomes; 2) the coding frame of the locus was intact in over 94% (3,063) of the genomes; 3) the number of alleles of the locus fell within the range seen for the majority of loci. This process yielded a total of 3,002 loci, which form the cgMLST scheme for Salmonella.
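The three selection criteria can be sketched as a filter over per-locus summary counts. This is an illustration on toy data, not the EnteroBase implementation: the function name, the input layout and the `allele_range` bounds are hypothetical, and the third criterion is only described qualitatively in the text, so a simple closed interval is assumed here.

```python
def select_core_loci(stats, n_genomes, min_present=0.98, min_intact=0.94,
                     allele_range=(2, 100)):
    """Return loci that pass the presence, intactness and allele-count filters.

    stats maps locus name -> {"present": genomes carrying the locus,
                              "intact": genomes with an intact coding frame,
                              "alleles": number of distinct alleles}
    """
    low, high = allele_range  # assumed bounds; the text gives no numbers
    core = []
    for locus, s in stats.items():
        present_ok = s["present"] / n_genomes > min_present
        intact_ok = s["intact"] / n_genomes > min_intact
        alleles_ok = low <= s["alleles"] <= high
        if present_ok and intact_ok and alleles_ok:
            core.append(locus)
    return core

# Toy usage with 100 genomes and four made-up loci.
toy_stats = {
    "locusA": {"present": 100, "intact": 99, "alleles": 20},
    "locusB": {"present": 97, "intact": 97, "alleles": 15},
    "locusC": {"present": 99, "intact": 90, "alleles": 12},
    "locusD": {"present": 100, "intact": 100, "alleles": 500},
}
core_loci = select_core_loci(toy_stats, n_genomes=100)
```

Only `locusA` passes all three filters in the toy data; the other loci fail on presence, intactness and allele count respectively.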
