
IMPORTANT: Use the Chrome web browser to view this page; some other web browsers were unable to navigate through the web links.

MarkerMiner v1.0 User Manual

Targeted sequencing using next-generation sequencing (NGS) platforms offers enormous potential for plant systematics by enabling economical acquisition of multilocus data sets that can resolve difficult phylogenetic problems. However, because discovery of single-copy nuclear (SCN) loci from NGS data requires both bioinformatics skills and access to high-performance computing resources, the application of NGS data has been limited.

MarkerMiner is an easy-to-use, fully automated, open-access bioinformatic workflow and application for effective discovery of SCN loci in angiosperms (flowering plants) from user-provided angiosperm transcriptome assemblies (e.g. OneKP transcriptome assemblies; http://onekp.com). It can be run locally or via the web, and its tabular and alignment outputs facilitate efficient downstream assessments of phylogenetic utility, locus selection, intron-exon boundary prediction, and primer or probe development.

Single-copy gene identification method of De Smet et al. (2013)

MarkerMiner compares user-provided transcriptomic data input against reference databases of known single-copy nuclear genes that were identified as part of a systematic survey of duplication-resistant genes in 17 angiosperm genomes by De Smet et al. (2013). The reference databases are composed of Orthologous Groups (OGs) that were constructed using data from the PLAZA 2.5 database (Van Bel et al. 2011) and the OrthoMCL method (Li et al. 2003).

De Smet et al. (2013) classified OGs as single-copy if they were present across all 17 genomes. However, missing copies in up to two species or duplicates in up to three species were tolerated to accommodate possible variations in the reference genome annotations and the presence of recent duplicates or pseudogenes, respectively. Single-copy genes were classified as "Strictly" single-copy if OGs were truly single-copy for all species or "Mostly" single-copy if the OGs were duplicated in at least one and up to three of the surveyed species.
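The classification rule described above can be sketched in a few lines of Python. This is only an illustration of the thresholds stated in the text, not the code used by De Smet et al. (2013) or by MarkerMiner. The function name and the input format (a mapping from species name to copy number) are hypothetical, and the sketch assumes that the missing-copy and duplicate tolerances are applied together and that an OG with tolerated missing copies but no duplicates still counts as "Strictly".

```python
from typing import Dict, Optional

MAX_MISSING = 2      # species allowed to lack the gene (threshold from the text)
MAX_DUPLICATED = 3   # species allowed to carry more than one copy


def classify_og(copies_per_species: Dict[str, int]) -> Optional[str]:
    """Return 'Strictly', 'Mostly', or None for one orthologous group.

    copies_per_species maps a species name to its gene copy number.
    Assumption: both tolerances are checked together (see lead-in).
    """
    missing = sum(1 for n in copies_per_species.values() if n == 0)
    duplicated = sum(1 for n in copies_per_species.values() if n > 1)

    if missing > MAX_MISSING or duplicated > MAX_DUPLICATED:
        return None  # not treated as single-copy
    return "Strictly" if duplicated == 0 else "Mostly"
```

For example, an OG with exactly one copy in each of the 17 genomes would be reported as "Strictly", and the same OG with two copies in one species would be reported as "Mostly".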

How MarkerMiner works:

MarkerMiner identifies clusters of single-copy gene transcripts present in each user-provided transcriptome assembly by aligning and filtering transcripts against a user-selected reference proteome database. MarkerMiner then generates a detailed tabular report of results.

Next, MarkerMiner runs each of the single-copy gene clusters through a multiple sequence alignment (MSA) step using MAFFT (Katoh and Standley 2013) and it outputs MSA files that users can use to assess phylogenetic utility (e.g. sequence variation) or, if appropriate, to conduct preliminary phylogenetic analyses.
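As an illustration of how sequence variation in an MSA file might be assessed, the sketch below counts variable alignment columns. It is not part of MarkerMiner; the function names are hypothetical, and it assumes the alignment is available in (or has been converted to) FASTA format, with gaps written as "-" and ambiguous bases as "N".

```python
from typing import Dict, List


def read_fasta(lines: List[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {sequence name: sequence}."""
    seqs: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the first word of the header
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line.upper()
    return seqs


def count_variable_sites(alignment: Dict[str, str]) -> int:
    """Count columns that contain more than one distinct nucleotide.

    Gaps ('-') and ambiguous bases ('N') are ignored when comparing.
    """
    lengths = {len(s) for s in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("sequences differ in length; not an alignment")
    variable = 0
    for column in zip(*alignment.values()):
        bases = {b for b in column if b not in "-N"}
        if len(bases) > 1:
            variable += 1
    return variable


example = [">DAT1", "ATGC-A", ">DAT2", "ATGTNA", ">DAT3", "ATGCAA"]
print(count_variable_sites(read_fasta(example)))
```

Loci with more variable sites are generally more promising candidates for resolving relationships, although the appropriate threshold depends on the study.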

Lastly, each single-copy gene MSA is re-aligned in a profile alignment step with MAFFT (using the ‘--add’ functionality; Katoh and Frith 2012) against a user-selected coding reference sequence in which intronic regions are represented as Ns. Users can use MarkerMiner’s profile alignment output to identify putative splice junctions in the transcripts and to design primers or probes for targeted sequencing.
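Because intronic regions in the reference appear as runs of Ns, putative intron positions can be read directly from the aligned reference row of a profile alignment. The sketch below is a hypothetical helper, not MarkerMiner code: it takes the aligned reference sequence as a string and reports the alignment columns covered by each N run. The min_n parameter is an assumption added here so that isolated ambiguous bases can be excluded.

```python
import re
from typing import List, Tuple


def masked_intron_columns(aligned_reference: str, min_n: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) alignment columns (1-based, inclusive) of N runs.

    Gap characters inside a run are tolerated, so a masked intron that the
    aligner interrupted with '-' is still reported as a single interval.
    Runs with fewer than min_n N characters are skipped.
    """
    intervals = []
    for match in re.finditer(r"N[N-]*N|N", aligned_reference.upper()):
        if match.group().count("N") >= min_n:
            intervals.append((match.start() + 1, match.end()))
    return intervals


reference_row = "ATGNNNN--NNCCGT---NAAG"
print(masked_intron_columns(reference_row))
print(masked_intron_columns(reference_row, min_n=2))
```

The columns immediately flanking each reported interval are the putative intron-exon boundaries in the aligned transcripts.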

See also "MarkerMiner Output" below.

For more details and to cite MarkerMiner, please use the following manuscript:

Chamala, S., García, N., Godden, G. T., Krishnakumar, V., Jordon-Thaden, I. E., De Smet, R., Barbazuk, W. B., Soltis, D. E., and Soltis, P. S. 2015. MarkerMiner 1.0: A new application for phylogenetic marker development using angiosperm transcriptomes. Applications in Plant Sciences 3(4): 1400115.


Please contact Srikar Chamala - srikarchamala[@]gmail[.]com

SECTION 1: Input Data Format and Naming Requirements

MarkerMiner accepts a path to a directory (folder) containing assembled transcriptome data files in FASTA format. All transcriptome data files to be analysed must be placed in this folder. Users can process a single FASTA file or multiple FASTA files. However, all file names must use the following naming convention: file names must start with a four-letter species code followed by a hyphen (e.g. "DAT1-", "DAT2-", "DAT3-", etc.). File names must also end in ".fa", ".fasta", or ".fsa".
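Before starting a run, it can be useful to check an input folder against the naming convention above. The following sketch is not part of MarkerMiner; the function names are hypothetical. It assumes that the species code may contain digits as well as letters (as in the "DAT1-" examples) and that extensions are lowercase.

```python
import re
from pathlib import Path
from typing import List

# Four-character species code, a hyphen, any remaining name, then an accepted
# extension. Digits are allowed because the manual's examples ("DAT1-") use them.
NAME_PATTERN = re.compile(r"[A-Za-z0-9]{4}-.*\.(fa|fasta|fsa)")


def is_valid_name(file_name: str) -> bool:
    """Return True if file_name follows the assumed naming convention."""
    return NAME_PATTERN.fullmatch(file_name) is not None


def invalid_files(input_dir: str) -> List[str]:
    """List files in input_dir whose names break the naming convention."""
    return sorted(
        p.name for p in Path(input_dir).iterdir()
        if p.is_file() and not is_valid_name(p.name)
    )
```

Calling invalid_files on the input folder returns an empty list when every file name is acceptable; any names it does return should be renamed before the folder is passed to MarkerMiner.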


SECTION 2: Downloading sample input and output datasets

A test data set is provided to help users familiarize themselves with the MarkerMiner web application.

Click the Download sample dataset link to retrieve a copy of sample input FASTA files and precomputed output files.

SECTION 3: Installing and Running MarkerMiner

MarkerMiner can be run from the command line or through a graphical user interface (GUI). The links below give instructions for installing and running MarkerMiner.

  • MarkerMiner via iPlant Atmosphere - Click here.

  • MarkerMiner via Command line (Linux/Unix) - Click here.

  • MarkerMiner via Docker Command line (Linux/Unix/Mac/Windows) - Click here.

SECTION 4: MarkerMiner Results and Output

The MarkerMiner output directory (Figure 1) includes the following:

1. Tab-delimited results:

  • single_copy_genes.txt (Figure 2)

  • single_copy_genes.secondaryTranscripts.txt – additional set of transcripts passing the BLAST filtering criteria and aligning uniquely to the same reference single-copy protein.

2. markerminer_run_logfile.txt - MarkerMiner run log file; this file also contains the version of MarkerMiner you ran.

3. input_transcriptomes.txt - Absolute file paths of the transcriptome assemblies used in the MarkerMiner run.


5. Sequence alignments (MAFFT_NUC_ALIGN_FASTA; Figure 3)

6. Profile alignments with reference CDS (MAFFT_ADD_REF_ALIGN_FASTA; Figure 4)

Figure 1 Unzipped directory of MarkerMiner output.

The tab-delimited results file (single_copy_genes.txt) includes the following details for each SCN locus detected by MarkerMiner: a reference gene ID, a single-copy classification according to De Smet et al. (e.g. "strictly" or "mostly"), a gene functional description, the number of orthologues detected across all assemblies, and a scaffold ID for each of the assemblies included in the analysis (Figure 2). Note: "NA" indicates the absence of data for individual gene loci, and the headers "DAT1", "DAT2", "DAT3", and "DAT4" correspond to species codes.
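The results table can be summarized with a short script, for example to see how many loci were recovered in each assembly. The sketch below is an illustration only: the function name is hypothetical, and it assumes that the file has a header row whose species columns are named by species code (such as "DAT1") and that missing data are written exactly as "NA".

```python
import csv
from typing import Dict, Iterable, TextIO


def loci_per_species(handle: TextIO, species_codes: Iterable[str]) -> Dict[str, int]:
    """Count loci with a scaffold ID (any value other than 'NA') per species."""
    codes = list(species_codes)
    counts = {code: 0 for code in codes}
    for row in csv.DictReader(handle, delimiter="\t"):
        for code in codes:
            value = (row.get(code) or "").strip()
            if value and value != "NA":
                counts[code] += 1
    return counts
```

A typical call would open single_copy_genes.txt with open(path, newline="") and pass the file handle together with the species codes used in the input file names.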

Figure 2 Tab-delimited MarkerMiner output.

Note: Gene functional descriptions have been extracted from ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_functional_descriptions_20130831.txt

Multiple sequence alignment (Figure 3) and profile alignment (Figure 4) files are provided in PHYLIP and FASTA format, respectively.

Figure 3 Multiple sequence alignment (MAFFT) output from MarkerMiner visualized using Geneious (http://www.geneious.com/).

Figure 4 Example of alignment with reference CDS with masked intronic regions showing putative intron-exon boundaries and intron sizes, visualized using Geneious (http://www.geneious.com/).

SECTION 5: Access to the MarkerMiner code repositories

Both the pipeline and supporting web application code, released under the MIT license, are available for access at Bitbucket.org. Below are the links to the repositories:

Please refer to instructions provided within the repositories to INSTALL and run the pipeline (and optionally, the web application) locally.

SECTION 6: MarkerMiner Pipeline Benchmarking

Table 1. MarkerMiner runtimes and memory usage for three example datasets (see Appendix 1 of the MarkerMiner manuscript) using four CPUs. Variables associated with the individual transcriptomes in a dataset, such as transcript lengths and the number of transcripts, may result in longer runtimes and higher memory usage.

Dataset          Number of Transcriptomes   Memory (MB)   Time (h:mm)
Amaryllidaceae   7                          1979          4:48
Draba            6                          971           2:13
Solanum          6                          998           6:00

Literature Cited

Camacho, C., T. Madden, N. Ma, T. Tao, R. Agarwala, and A. Morgulis. 2013. BLAST Command Line Applications User Manual. BLAST® Help [Internet]. Bethesda (MD): National Center for Biotechnology Information (US).

De Smet, R., K.L. Adams, K. Vandepoele, M.C.E. Van Montagu, S. Maere, and Y. Van de Peer. 2013. Convergent gene loss following gene and genome duplications creates single-copy families in flowering plants. Proceedings of the National Academy of Sciences 110: 2898–2903.

Godden, G.T., I.E. Jordon-Thaden, S. Chamala, A.A. Crowl, N. García, C.C. Germain-Aubrey, J.M. Heaney, et al. 2012. Making next-generation sequencing work for you: approaches and practical considerations for marker development and phylogenetics. Plant Ecology & Diversity 5: 427–450.

Katoh, K., and M.C. Frith. 2012. Adding unaligned sequences into an existing alignment using MAFFT and LAST. Bioinformatics 28: 3144–3146.

Katoh, K., and D.M. Standley. 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution 30: 772–780.

Li, L., C.J. Stoeckert, and D.S. Roos. 2003. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Research 13: 2178–2189.

Van Bel, M., S. Proost, E. Wischnitzki, S. Movahedi, C. Scheerlinck, Y. Van de Peer, and K. Vandepoele. 2011. Dissecting plant genomes with the PLAZA comparative genomics platform. Plant Physiology, pp–111.