enterobase-web / Escherichia Statistics

Details of assembly methods and in silico genotyping for Escherichia

For a general description of the in silico typing method used to apply these schemes on NGS data, please click here.

All MLST-like typing methods in EnteroBase are derived from a genome assembly of sequenced reads. For an explanation of this method please click here.

MLST – Classic | Ribosomal MLST (Jolley, 2012) | Core Genome MLST | Whole Genome MLST
7 Loci | 53 Loci | 2,513 Loci | 25,002 Loci
Conserved Housekeeping genes | Ribosomal proteins | Core genes | Any coding sequence
Highly conserved; Low resolution | Highly conserved; Medium resolution | Variable; High resolution | Highly variable; Extreme resolution
Different scheme for each species/genus | Single scheme across tree of life | Different scheme for each species/genus | Different scheme for each species/genus

7 Gene MLST

The classic MLST scheme is described in Wirth et al. (2006) Mol. Microbiol. 60(5), 1136-1151.

Genes included in 7 gene MLST, together with the length of sequence used for MLST (taken from Figure 1 of the paper cited above):

Gene Length
adk 536
fumC 469
gyrB 460
icd 518
mdh 452
recA 510
purA 478

Ribosomal MLST (rMLST)

Whole genome and Core genome MLST (cgMLST)

Whole genome MLST (wgMLST) and core genome MLST (cgMLST) schemes have been defined in EnteroBase as a standard typing method for Escherichia, providing finer discrimination between genotypes than the 7 gene MLST and rMLST schemes. Construction of these schemes consisted of three stages. Firstly, coding sequences were compiled from 533 Escherichia genomes: 283 complete genomes in NCBI, 234 NCTC genomes from PacBio sequencing and an additional 16 genomes representing cryptic environmental lineages. This collection encompassed the genomic diversity within the Escherichia genus and comprised a total of 979,077 CDS, which were grouped into 109,529 gene clusters using Uclust. To identify homologous regions within each genome, the centroid sequence of each cluster was aligned onto all 533 genomes using nucleotide BLAST, and a gene was considered present if a match showed greater than 70% nucleotide identity across 50% of the length of the centroid sequence.
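The presence criterion can be illustrated with a minimal sketch. It is not the EnteroBase code: the `Hit` record, its field names and the helper functions are hypothetical, and it assumes that BLAST hits have already been parsed into such records. Treating the 50% length threshold as inclusive is also an assumption, since the text does not state it.

```python
from dataclasses import dataclass

MIN_IDENTITY = 70.0  # percent nucleotide identity, threshold taken from the text
MIN_COVERAGE = 0.5   # fraction of the centroid length, threshold taken from the text


@dataclass
class Hit:
    """Hypothetical record for one parsed BLAST match of a centroid on a genome."""
    centroid: str
    genome: str
    identity: float       # percent identity of the match
    align_start: int      # 1-based, inclusive, in centroid coordinates
    align_end: int        # 1-based, inclusive, in centroid coordinates
    centroid_length: int


def is_present(hit):
    """Apply the identity and coverage thresholds to a single match."""
    covered = hit.align_end - hit.align_start + 1
    coverage = covered / hit.centroid_length
    # Assumption: identity is strictly greater than 70%, coverage is at least 50%.
    return hit.identity > MIN_IDENTITY and coverage >= MIN_COVERAGE


def presence_table(hits):
    """Map each centroid to the set of genomes in which it is considered present."""
    present = {}
    for hit in hits:
        if is_present(hit):
            present.setdefault(hit.centroid, set()).add(hit.genome)
    return present
```

A genome that appears more than once for the same centroid in the list of passing hits is what the next stage treats as a potential paralog.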

To identify paralogs, sets of homologous regions were flagged as containing potential paralogs if there were duplicate matches within any single genome. These sets were iteratively sub-clustered based on phylogenetic topology. Firstly, the regions in each set were aligned together. The resulting alignment was used to generate a maximum likelihood tree using FastTree. The ETE3 package was used to bipartition the tree so as to maximise the nucleotide diversity (at least 5%) between the subtrees. Each of the resulting subtrees was evaluated iteratively until no two regions in the same subtree came from the same genome, or the maximum inter-subtree diversity was less than 5%. The original set of homologous regions was then replaced with all of its subtrees.
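The control flow of this iterative division might look roughly like the sketch below. The alignment, FastTree and ETE3 steps are deliberately not reproduced: they are represented by a hypothetical `bipartition` callable, supplied by the caller, that returns the two halves of the best split and the nucleotide diversity between them. Region records are assumed to be `(genome_id, region_id)` tuples.

```python
from collections import Counter

MIN_DIVERSITY = 0.05  # 5% inter-subtree nucleotide diversity, from the text


def has_potential_paralogs(regions):
    """True if any genome contributes more than one region to the set."""
    counts = Counter(genome for genome, _ in regions)
    return any(n > 1 for n in counts.values())


def split_paralogs(regions, bipartition):
    """Iteratively divide a set of homologous regions.

    `bipartition` is a hypothetical callable standing in for the tree-based
    step: it takes a list of regions and returns (left, right, diversity).
    """
    resolved = []
    pending = [regions]
    while pending:
        current = pending.pop()
        # Stop condition 1: no two regions come from the same genome.
        if not has_potential_paralogs(current):
            resolved.append(current)
            continue
        left, right, diversity = bipartition(current)
        # Stop condition 2: the best split is below the diversity threshold.
        if not left or not right or diversity < MIN_DIVERSITY:
            resolved.append(current)
            continue
        pending.extend([left, right])
    return resolved
```

The returned list is what would replace the original set: one entry per subtree that either contains no duplicated genome or could not be divided further.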

After the division process, all the homolog sets were scored and ranked according to the summarised alignment scores of their homolog regions. Homolog sets were discarded if they had regions which overlapped with the regions within other sets that had greater scores.
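One way to picture this ranking and filtering step is a greedy pass over the sets in descending score order. The sketch below is an assumption about how such a filter could work, not the procedure actually used: the data layout (a dict of sets, each with a `score` and a list of `(genome, start, end)` regions) is hypothetical, any overlap of one base or more counts, and a set is only compared against higher-scoring sets that were themselves kept.

```python
def overlaps(a, b):
    """True if two (genome, start, end) regions share at least one position.

    Coordinates are assumed to be inclusive and on the same strand-agnostic scale.
    """
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def select_non_overlapping(homolog_sets):
    """Keep homolog sets in descending score order, dropping overlapping ones.

    `homolog_sets` maps a set name to {"score": float, "regions": [...]}.
    Returns the names of the retained sets, best score first.
    """
    ranked = sorted(
        homolog_sets,
        key=lambda name: homolog_sets[name]["score"],
        reverse=True,
    )
    kept = []
    kept_regions = []
    for name in ranked:
        regions = homolog_sets[name]["regions"]
        # Discard the set if any of its regions overlaps a better-scoring kept set.
        if any(overlaps(r, k) for r in regions for k in kept_regions):
            continue
        kept.append(name)
        kept_regions.extend(regions)
    return kept
```

With this greedy choice, a set that only overlaps an already discarded set is retained. Whether the original procedure behaves the same way is not stated in the text.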

Finally, a complete set of 34,044 pan genes was identified for these 533 genomes. This set was further refined to 25,002 clusters by merging similar gene clusters whose genes shared over 70% amino acid similarity. From each cluster, a single representative with the greatest alignment score was chosen to create a wgMLST scheme for Escherichia. This removed potential non-specific matches to paralogs in the downstream typing procedure. A total of 3,457 Escherichia genomes, representing all rMLST STs in EnteroBase (up to May 2016), were typed using this novel scheme. To generate the cgMLST scheme, a subset of wgMLST loci was selected based on three criteria: 1) the locus was present in over 98% (2,605) of the genomes; 2) the coding frame of the locus was intact in over 94% (3,184) of the genomes; 3) the number of alleles fell within the range observed for the majority of all loci. This process yielded a total of 2,513 loci, forming the cgMLST scheme for Escherichia.
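The first two selection criteria can be sketched as a simple filter over per-locus counts. This is an illustration only: the `locus_stats` layout and the function name are hypothetical, both fractions are assumed to be computed over all typed genomes, and the third criterion (allele number) is left out because the text does not give explicit bounds for it.

```python
MIN_PRESENT = 0.98  # fraction of genomes carrying the locus, from the text
MIN_INTACT = 0.94   # fraction of genomes with an intact coding frame, from the text


def select_core_loci(locus_stats, n_genomes):
    """Return the loci that pass the presence and intactness thresholds.

    `locus_stats` maps a locus name to (n_present, n_intact), the number of
    genomes in which the locus was found and in which its frame was intact.
    """
    core = []
    for locus, (n_present, n_intact) in locus_stats.items():
        present_fraction = n_present / n_genomes
        intact_fraction = n_intact / n_genomes
        # Both thresholds are applied as strict "over" comparisons.
        if present_fraction > MIN_PRESENT and intact_fraction > MIN_INTACT:
            core.append(locus)
    return sorted(core)
```

Applied to the typed genome set, a filter of this kind would be followed by the allele-number check before arriving at the final cgMLST loci.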