repex_tarean / RepeatExplorer2 and Tarean Manual

UNDER CONSTRUCTION

Introduction

RepeatExplorer is a computational pipeline for discovery and characterization of repetitive sequences in eukaryotic genomes. The pipeline uses high-throughput genome sequencing data as an input and performs a graph-based clustering analysis of sequence read similarities to identify repetitive elements within analyzed samples. The analysis principles were described in Novak et al. (2010) and examples of its application can be found in a number of published papers (see Appendix). It should be noted that although the repeat identification algorithm generally works for any genome, some parts of the pipeline (e.g. protein domain-based classification of mobile elements) were primarily developed for application in plant genomics. However, a custom repeat database can be supplied to improve the sensitivity of repeat classification in the genomes of non-plant species.

RepeatExplorer can be used through a Galaxy-based web interface on our public server. See http://www.repeatexplorer.org for a list of available Galaxy servers with RepeatExplorer. The main public Galaxy server with RepeatExplorer can be accessed at https://galaxy-elixir.cerit-sc.cz. This server is provided in test mode within the Elixir CZ project and is maintained by CESNET and CERIT-SC, which are participants in this project.

Users requiring more computational resources can set up their own instance of RepeatExplorer using its freely available source code. Consult installation instructions provided in Appendix.

An interface to RepeatExplorer was implemented within the Galaxy platform (http://galaxy.psu.edu/) and takes advantage of various tools provided in this environment. Only the tools directly needed to upload and process sequences for RepeatExplorer are covered in this manual. In other cases, please refer to the Galaxy wiki and help pages. Attention should be paid to the principles of data sharing and the use of workflows, as these features are used to provide data samples and analysis templates related to the examples given below (Chapter 3). An overview of the RepeatExplorer tools and links between them is schematically represented in the Appendix.

Please include the following citations in your publications when presenting results obtained using RepeatExplorer:

Principle of clustering analysis: Novak, P., Neumann, P., Macas, J. (2010) - Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data. BMC Bioinformatics 11: 378.

RepeatExplorer: Novak, P., Neumann, P., Pech, J., Steinhaisl, J., Macas, J. (2013) - RepeatExplorer: a Galaxy-based web server for genome-wide characterization of eukaryotic repetitive elements from next generation sequence reads. Bioinformatics

To provide feedback or report a problem please send email to server administrator: admin@repeatexplorer.org.

Basic steps

Getting your data to/from the server

Direct upload/Download

This option is suitable for small files (<2 GB) only. In the left panel (Tools) select: Get Data --> Upload File

Datasets can be downloaded from the dataset menu using the diskette icon. If you encounter connection problems, please use the FTP download method described below.

Using FTP

Large datasets and/or multiple files should be uploaded via FTP using the FTP over implicit TLS/SSL protocol. We recommend using the FileZilla FTP client with host name set to repeatexplorer-elixir.cerit-sc.cz and server type set to FTPES. To log on, use your RepeatExplorer account username and password. Alternatively, the command line tool curl can be used:

curl -T [ my_data_filename ] -k -v -u [ username ] ftps://repeatexplorer-elixir.cerit-sc.cz/

replacing [ my_data_filename ] and [ username ] with the name of your file and your username, respectively. After executing the command, you will be prompted to enter your password to complete the transfer of data.

Following the transfer, the files will appear in the Files uploaded via FTP list within Tools --> Get Data --> Upload File. Select the files you wish to import and click on the "Execute" button. Once imported, the files will be removed from the list.

Please note that FTP can also be used to transfer output data from your analysis to your local computer. To do so, use the Tools --> Repeat Explorer --> EXPERIMENTAL TOOLS --> Transfer data to ftp server utility, which will copy the selected file to your FTP directory on the server. This tool also reports some information about the file, such as its size and md5sum. Upon completion, log in to your RepeatExplorer account using an FTP client and download the file to your computer. This option is highly recommended for downloading all large output files, because their download via web browser can take a long time and cannot be resumed. Please note that the tool is currently suitable for downloading single files only (e.g. compressed archives of clustering results). Alternatively, you can download a file from the FTP server using the curl command, which allows downloads to be resumed. To ensure that the file was transferred correctly, check the md5sum. Here is an example of using curl to perform an FTP download with the ability to resume:

curl -C - -o [ local_output_filename ] -k -v -u [ username ] \
ftps://repeatexplorer-elixir.cerit-sc.cz/[ server_output_filename ]

where you replace [ local_output_filename ], [ username ] and [ server_output_filename ] with the desired output filename for your download on your computer, your username and the filename that you chose when transferring your data to the ftp server in Galaxy, respectively. Again, you will be prompted for your password after executing the command.
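The md5sum check mentioned above can also be done locally once the download has finished. The following is a minimal sketch in Python using only the standard library; the function names md5sum and matches_reported_md5 are hypothetical helpers, not part of RepeatExplorer. It assumes you have copied the md5sum value reported by the transfer tool in Galaxy.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks
    so that large archives do not have to fit into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_reported_md5(path, reported):
    """Compare the local digest with the value reported on the server.
    Surrounding whitespace and letter case in 'reported' are ignored."""
    return md5sum(path) == reported.strip().lower()
```

If matches_reported_md5 returns False, the file is most likely incomplete and the download should be resumed or repeated.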

Downloading sequences from EBI SRA

Publicly available datasets can be downloaded directly from the EBI Short Read Archive using the Get Data --> EBI SRA tool. Enter the ENA accession number in the search window, locate the corresponding dataset and select the download link in the "Galaxy" column.

Pre-processing of sequence reads

The clustering analysis requires a single file containing read sequences in FASTA format as an input. If such a file can be uploaded by the user, no pre-processing is required. However, data obtained from sequencing facilities or downloaded from public archives are usually in FASTQ format, which combines nucleotide sequence information with sequencing quality scores. There are a number of programs for analyzing and pre-processing raw sequence reads in Tools --> NGS: QC and manipulation. Some additional tools are provided in Tools --> Repeat Explorer --> Utilities. Tools recommended for pre-processing FASTQ data are listed below (help on using these tools is provided below their input forms):

RepeatExplorer Utilities (Tools --> Repeat Explorer --> Utilities)

  • Preprocessing of fastq paired-reads: This tool performs preprocessing of paired-end reads in fastq format including trimming, quality filtering, adapter filtering (cutadapt) and interlacing. Note: Broken pairs (i.e. one of the reads in a pair is removed due to low quality) are removed.

  • Preprocessing of fastq reads: Preprocessing of single-end reads in fastq format including trimming, quality filtering, adapter filtering (cutadapt) and sampling.

  • FASTA read name affixer: Append prefixes and/or suffixes to sequences names in a FASTA file.

  • Sequence sampling: Perform random sampling of sequences from larger dataset.

  • Read name affixer: Manipulate read names by adding prefix and/or suffix codes and remove spaces. The file must be in FASTQ format.

  • Rename Sequences: Replace read names in FASTA files with numbers. It is possible to keep the first characters of the original name by specifying a prefix length, e.g. when the read names contain species codes for a comparative analysis.

  • FASTA interlacer: Join paired reads from different files into a single interlaced file, i.e. reads from the same pair are next to each other in the file. Note: each read in the first file must have its corresponding mate in the second and in the same position as the first file.

  • Scan paired reads: Check paired-end reads for sequence overlap, which may occur due to short fragment length.

  • RepeatMasker custom search: Check previous clustering results using RepeatMasker against custom database of repeats.

  • Chip-Seq Mapper: Map ChIP-Seq and Input reads to contigs obtained from RepeatExplorer clustering.
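To illustrate what interlacing means, here is a minimal Python sketch of the idea behind the FASTA interlacer. It is not the tool's actual implementation. The function names and the choice to rename reads to a running number plus an "f" or "r" suffix are assumptions made for illustration. It relies on the condition stated above: both files list mates in the same order.

```python
def read_fasta(lines):
    """Yield (name, sequence) tuples from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def interlace(forward_lines, reverse_lines, fwd_suffix="f", rev_suffix="r"):
    """Return FASTA lines in which each forward read is directly followed
    by its mate. Mates are matched by position only (assumption), and
    reads are renamed to <number><suffix>."""
    forward = list(read_fasta(forward_lines))
    reverse = list(read_fasta(reverse_lines))
    if len(forward) != len(reverse):
        raise ValueError("files contain different numbers of reads")
    out = []
    for number, ((_, seq_f), (_, seq_r)) in enumerate(zip(forward, reverse), 1):
        out.extend([f">{number}{fwd_suffix}", seq_f,
                    f">{number}{rev_suffix}", seq_r])
    return out
```

This sketch holds both files in memory, so it is only suitable for small test files; for real datasets the Galaxy tool should be used.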

Other Commonly-Used Tools (Tools --> NGS: QC and manipulation)

  • (ILLUMINA FASTQ) FASTQ Groomer: Groomer has to be run first in order to use any other tool for FASTQ manipulation. Take care to select the correct FASTQ quality score type.

  • (FASTQC: FASTQ/SAM/BAM) FASTQC: This tool performs some simple checks to assess the quality of your high-throughput sequencing data, such as the distribution of quality scores across sequence reads, read length distribution, and number of indeterminate bases in your sequences.

  • (FASTX-TOOLKIT FOR FASTQ DATA) Filter by quality: This filter can be optionally used to discard low-quality reads. Use Compute quality statistics, Draw quality score boxplot and Draw nucleotides distribution chart from the same toolbox to assess the quality of your data.

  • (GENERIC FASTQ MANIPULATION) FASTQ to FASTA converter: As a final step it converts reads to FASTA format.
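For orientation, the final conversion step amounts to dropping the quality lines. The Python sketch below shows this under the assumption that every FASTQ record occupies exactly four lines (no wrapped sequences). The function name is hypothetical, and cutting the read name at the first space is an illustrative choice, not documented behaviour of the Galaxy converter.

```python
def fastq_to_fasta(lines):
    """Convert four-line FASTQ records into a list of FASTA lines.
    Only the first whitespace-delimited token of each read name is kept."""
    stripped = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("number of FASTQ lines is not a multiple of four")
    fasta = []
    for start in range(0, len(stripped), 4):
        header, sequence, separator, quality = stripped[start:start + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {start + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        name_tokens = header[1:].split()
        if not name_tokens:
            raise ValueError(f"empty read name at line {start + 1}")
        fasta.append(">" + name_tokens[0])
        fasta.append(sequence)
    return fasta
```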

Examples of input formats

  • simple clustering - any plain fasta format is suitable:

         >1
         acgacagctgactaatgc
         >2
         cttcgaggctacacgagct
         >3
         actatcgacactgccggcgcg
         ...
    
  • comparative analysis of AB and XY genomes, sequence identifier must code genome type (prefix length = 2):

        >AB1
        acgacagctgactaatgc
        >AB2
        cttcgaggctacacgagct
        >AB3
        actatcgacactgccggcgcg
        ...
        >XY1
        gccccgtcgccgtccgtgtcg
        >XY2
        tgtgtgcccgtctgcgcgccccc
        >XY3
        atatgctatgcgcgc
        ...
    
  • paired-end reads - last character codes pair:

        >1f
        acgacagctgactaatgc
        >1r
        cttcgaggctacacgagct
        >2f
        actatcgacactgccggcgcg
        >2r
        gccccgtcgccgtccgtgtcg
        >3f
        tgtgtgcccgtctgcgcgccccc
        >3r
        atatgctatgcgcgc
        ...
    
  • comparative analysis with paired-end reads:

        >AB1f
        acgacagctgactaatgc
        >AB1r
        cttcgaggctacacgagct
        >AB2f
        actatcgacactgccggcgcg
        >AB2r
        gccccgtcgccgtccgtgtcg
        >XY3f
        tgtgtgcccgtctgcgcgccccc
        >XY3r
        atatcgtcgtgctatgcgcgc
        >XY4f
        tggggcctgtgcccgtctgcgcgccccc
        >XY4r
        atatgctatgcgcgc
        ...
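Before uploading, it can be useful to verify that read names follow the conventions shown in these examples. The following Python sketch uses hypothetical helper names (not part of RepeatExplorer) to count reads per genome prefix and to check that mates alternate and differ only in their last character.

```python
from collections import Counter


def read_names(lines):
    """Extract read names (without '>') from FASTA lines."""
    return [line.strip()[1:] for line in lines if line.startswith(">")]


def count_by_prefix(names, prefix_length):
    """Count reads per genome code, i.e. the first prefix_length characters."""
    return Counter(name[:prefix_length] for name in names)


def pairs_are_complete(names):
    """Return True if names come in consecutive pairs that are identical
    except for the last character, which must differ between mates."""
    if len(names) % 2 != 0:
        return False
    for first, second in zip(names[0::2], names[1::2]):
        if not first or not second:
            return False
        if first[:-1] != second[:-1] or first[-1] == second[-1]:
            return False
    return True
```

For the last example above, count_by_prefix with a prefix length of 2 reports the AB and XY groups, and pairs_are_complete confirms that every read is followed by its mate.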
    

Clustering analysis

There are two tools for performing a clustering analysis on your high-throughput sequencing data. Both of these can be run from Tools --> Repeatexplorer2 --> Clustering. To perform a full clustering analysis, i.e. to characterize all repeat types in the genome, use RepeatExplorer2 clustering. To analyze only tandem repeats, use the TAndem REpeat ANalyzer (TAREAN) pipeline. These pipelines differ in the parameter options available and also in the requirements for the input data. TAREAN requires paired-end sequence data, while the regular pipeline can also use single-end reads. However, it is recommended to use paired-end data, as this will facilitate the annotation of clusters that may otherwise remain unknown.

It should be noted that due to its computational complexity, the clustering procedure can take several days to finish, depending on the number of reads and repeat composition of analyzed samples. In extreme cases of genomes rich in certain types of repeats (e.g., satellite DNA), the running time of the clustering pipelines can be up to two weeks, whereas repeat-poor and small datasets may be analyzed in several hours. To avoid exhausting available memory, repeat complexity of analyzed data is estimated before performing full-scale analysis using a small, randomly sampled subset of reads. If necessary, the number of reads in the dataset is then automatically reduced by random sampling (see analysis log file or the html summary for information about the total number of reads used in the analysis). However, it is still recommended to perform a test run with a small subset (e.g. 100,000) of reads before running any large-scale analysis.
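A test subset can be drawn on the server with the Sequence sampling tool. For small local experiments, the same idea can be sketched in Python as shown below. The function name and the (name, sequence) record layout are assumptions for illustration. The sketch samples whole pairs from an interlaced dataset so that no mate is orphaned; a subset of 100,000 reads corresponds to n_pairs = 50000.

```python
import random


def sample_read_pairs(records, n_pairs, seed=None):
    """Randomly select n_pairs read pairs from an interlaced list of
    (name, sequence) records, keeping mates together and preserving
    the original order of the selected pairs."""
    if len(records) % 2 != 0:
        raise ValueError("interlaced input must contain an even number of reads")
    total_pairs = len(records) // 2
    if n_pairs > total_pairs:
        raise ValueError("requested more pairs than are available")
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    chosen = sorted(rng.sample(range(total_pairs), n_pairs))
    sampled = []
    for index in chosen:
        sampled.extend(records[2 * index:2 * index + 2])
    return sampled
```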

Parameters

Repeat identification using graph-based read clustering is a multi-step procedure that starts with an all-to-all sequence comparison in order to find pairs of reads with similarity that satisfy a specified threshold. This threshold is explicitly set to 90% sequence similarity spanning at least 55% of the read length (in the case of reads differing in length it applies to the longer one). In the current version of the pipeline, these values cannot be changed. There are a number of other adjustable parameters to be set based on your input data and analysis type. The following parameters are for the full clustering analysis (RepeatExplorer2 clustering).
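The fixed similarity threshold can be written as a simple predicate. The Python sketch below is only an illustration of the rule as stated above, not code taken from the pipeline. The function name is hypothetical, and it assumes that the overlap length and the percent identity of a hit are already known from the sequence comparison.

```python
MIN_IDENTITY_PERCENT = 90   # minimum sequence similarity of the hit
MIN_COVERAGE_PERCENT = 55   # minimum share of the longer read covered


def is_similarity_hit(identity_percent, overlap_length, length_a, length_b):
    """Return True if a read pair satisfies the threshold: at least 90%
    similarity over at least 55% of the length of the longer read."""
    longer = max(length_a, length_b)
    if longer <= 0:
        raise ValueError("read lengths must be positive")
    # Integer arithmetic avoids floating-point rounding at the boundary.
    covers_enough = overlap_length * 100 >= MIN_COVERAGE_PERCENT * longer
    return identity_percent >= MIN_IDENTITY_PERCENT and covers_enough
```

For two 100 bp reads, an overlap of 55 bp at 90% similarity passes, while the same overlap between a 100 bp read and a 200 bp read does not, because it covers only 27.5% of the longer read.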

  • NGS Reads: A file with sequence reads in FASTA format. It is usually generated from raw sequence reads using Pre-processing tools.

  • paired-end reads: Change to yes if you are using paired-end or mate-pair reads. It is crucial that the input file contains only complete read pairs and that both sequences from a pair are listed in succession. Use RepeatExplorer --> Utilities --> FASTA interlacer to achieve this arrangement. Please avoid using the FASTQ interlacer located in NGS: QC and manipulation. That tool has high memory requirements and is suitable only when your paired sequences in separate files are not in the same order.

  • Sample size: The RepeatExplorer2 pipeline estimates the total number of reads that can be used in analysis by default. So, you can input as many reads as you wish and it will not cause any problems. However, should you want to run the pipeline on a smaller dataset, you can enter a number for this parameter. The total number of reads analyzed should be at least 1000.

  • Advanced options: Change to yes if you wish to set up a more complex analysis, such as a comparative analysis of multiple species, or use a custom database. You can also change some additional parameters here.

  • Perform comparative analysis: Change to yes to perform an analysis with multiple samples, and then set the Group code length parameter according to your data, i.e. the length of the prefixes in your sequence read names that distinguish reads from different species. You can append sample codes to read names using one of the pre-processing tools called Read name affixer.

  • Use custom repeat database: This option can be used to aid repeat classification within clusters and is recommended especially for species which are under-represented in the RepeatMasker databases. The database should be a single file containing DNA sequences in FASTA format. There should be information about repeat type/family encoded within the FASTA header line of each sequence, in the same format as used for RepeatMasker libraries (e.g., >sequence_id#Copia/Angela). The custom library should be uploaded to the server using the Get Data --> Upload File tool.

  • Cluster size threshold for detailed analysis: Directories gathering various types of data and outputs from additional analyses are generated for a certain number of the largest clusters (see Description of the output files). The minimum size of clusters to be selected is defined as a proportion of the number of all analyzed reads (e.g., employing a default value of 0.01% with a dataset of 1,000,000 reads, all clusters containing at least 100 reads will be included). Setting this parameter below 0.01% is not recommended as it would lead to analyzing large numbers (>300) of clusters which is time consuming.

  • Perform automatic filtering of abundant satellite repeats: Set to yes if you wish to filter large satellite repeats from your data to allow more reads overall to be analyzed during your analysis.

  • Keep original sequence names: Sequences are renamed by default. If you want to keep the original sequence names (not recommended), uncheck this option. However, in the case of using original names of paired-end reads it is required that the left and right mates are distinguished by the last character of the read name. It is also necessary that there are only complete pairs and left mates alternate with their right mates. If performing a comparative analysis, this will not affect the Group code length parameter.
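Two of the parameters above can be checked with a few lines of Python before a job is submitted. The sketch below uses hypothetical helper names. The first function reproduces the arithmetic behind the cluster size threshold (rounding up to whole reads is an assumption), and the second checks that a custom database header follows the >sequence_id#Copia/Angela pattern.

```python
from decimal import Decimal, ROUND_CEILING


def min_cluster_size(total_reads, threshold_percent="0.01"):
    """Smallest cluster (in reads) selected for detailed analysis, given
    the threshold as a percentage of all analyzed reads.
    Decimal arithmetic keeps small percentages exact."""
    exact = Decimal(str(threshold_percent)) * total_reads / Decimal(100)
    return int(exact.to_integral_value(rounding=ROUND_CEILING))


def parse_repeat_header(header):
    """Split a RepeatMasker-style FASTA header '>id#class/family' into
    (sequence_id, classification). Raises ValueError if no
    classification is encoded in the header."""
    tokens = header.lstrip(">").split()
    if not tokens:
        raise ValueError("empty FASTA header")
    sequence_id, separator, classification = tokens[0].partition("#")
    if not sequence_id or not separator or not classification:
        raise ValueError(f"no repeat classification in header: {header!r}")
    return sequence_id, classification
```

With the default of 0.01% and 1,000,000 analyzed reads, min_cluster_size gives 100 reads, which matches the example in the parameter description.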

In TAREAN, it is not possible to run a comparative analysis and the analysis must be run with paired-end reads. Other than that difference, there is one additional parameter that is different in a TAREAN analysis, namely:

  • Perform cluster merging: This option merges clusters that are strongly connected through paired-end reads, i.e. there are several instances where pairs are split between two clusters (see section on cluster connectivity).

Description of the output files

Execution of the clustering analysis results in the generation of four new entries in the History panel. Two of them, Log file and Contigs consist of single plain text files, whereas HTML summary and Archive with clustering results contain multiple folders and files that can be downloaded as zip archives. The content of the HTML summary output can also be directly viewed using "Display data in browser" option (an eye symbol). Below is a description of the most important files within output data.

Log file

This file lists analysis parameters and gathers various messages generated during the pipeline run, including the pipeline and database versions, which can be included in a bug report should you encounter a problem. The log file is constantly updated during the run and can therefore be used to monitor the progress of the pipeline.

HTML Report

The HTML report of the clustering pipeline is available for download in an archived format, i.e. as a zip file. Please do not try to download the unarchived HTML summary, because there are many files and this will likely cause problems on your computer. However, it is possible to browse through the HTML summary (format: HTML) in your browser on the Galaxy server.

This archive contains an overview of clustering results. It can be inspected either directly from the Galaxy menu, or after downloading and unpacking the archive by opening the file HTML_summary_of_graph_based_clustering...html (within the HTML_summary... directory). There is a histogram showing sizes and cumulative proportions of the clusters, total proportions of clustered reads and singlets. Below, there is a table that lists various information for the largest clusters. Further details can be viewed for each cluster by following the link CLnumber.

Archive with clustering results

You can download the clustering archive directly in your web browser window or transfer it to the FTP server (recommended for large archives). The file download name will look something like this:

Galaxy4-[RepeatExplorer2_-_Archive_with_HTML_report_from_data_1].zip

and you can unpack it as you would any zip file on your operating system. In the directory where you have unzipped the file, you will see a number of files and folders:

.
├── Galaxy4-[RepeatExplorer2_-_Archive_with_HTML_report_from_data_1].zip
└── UnzippedGalaxyArchive
   *├── seqclust
   *├── libdir
    ├── logfile.txt <-- 
    ├── style1.css
    ├── summary_histogram.png
    ├── CLUSTER_TABLE.csv
    ├── SUPERCLUSTER_TABLE.csv
    ├── COMPARATIVE_ANALYSIS_COUNTS.csv <-- from comparative analysis
    ├── index.html <-- Master HTML file
    ├── cluster_report.html
    ├── summarized_annotation.html
    ├── supercluster_report.html
    ├── tarean_output_help.html
    ├── tarean_report.html
    ├── HOW_TO_CITE.html
    ├── contigs.fasta <--
    ├── TR_consensus_rank_1_.fasta <--
    ├── TR_consensus_rank_2_.fasta <--
    ├── TR_consensus_rank_3_.fasta <--
    └── TR_consensus_rank_4_.fasta <--

* Folders
  • logfile.txt RepeatExplorer2 Log: you can open this file in a text editor to see how your analysis proceeded. This file may also be helpful in reporting bugs as you can see how far the analysis went and if any error messages were printed out.

  • index.html Master HTML File: this file can be opened in your web browser and provides a summary of the clustering run as well as a gateway to all the other HTML files in this directory. The other HTML files provide more detailed information about the clustering run and allow you to investigate the results for a single cluster.

  • COMPARATIVE_ANALYSIS_COUNTS.csv Comparative Analysis Count Table: here you will find the number of reads in each cluster for each sample included in your comparative analysis. Note: you must run the comparative analysis pipeline to obtain such a file.

  • TR_consensus_rank_#_.fasta Tandem Repeat Monomers: the reconstructed monomers of tandem repeats discovered in your data are shown in these fasta files.

  • contigs.fasta Contigs from clusters: In this file, there are contigs from the assembly of each cluster. These can be used in post-processing analyses such as looking for TE protein domains.

  • /seqClust/sequences/: directory storing sequence reads which were used as input for the clustering analysis

    • seqClust: multi-fasta file with all sequence reads (if the user-provided set of reads was sampled, only the reads actually used for analysis are included here)
    • index.tab: if the reads were renamed, their original and new ids are stored in this file
    • seqClust.nhr, seqClust.nin, seqClust.nsq: blast database files
    • seqClust.cidx: index file used by cdbyank program (part of the TGICL package)
  • /seqClust/clustering/: main directory for storing clustering results

    • hitsort_PID90_LCOV55.cls: assignment of reads into clusters; for each cluster, there is a fasta-like header line with cluster number and size (number of reads), followed by a line containing ids of all reads assigned to the cluster. For example:

      >CL1 5
      id_1 id_2 id_3 id_4 id_5
      >CL2 3
      id_6 id_7 id_8
      etc....
      
    • hitsort_PID90_LCOV55: pairs of reads with significant similarity (lists all pairs with similarity >=90% covering >=55% of the length of the longer read and blast bit score of the hit)

    • graph_layouts.pdf: graph layouts and statistics for the largest clusters
  • /seqClust/clustering/blastx/: results of blastx similarity search of reads from individual clusters against the database of plant transposable element protein domains

  • /seqClust/clustering/clusters/dir_CLnumber/: directories storing detailed information for the largest clusters (minimal size of clusters to be listed here is defined by the Cluster size threshold for detailed analysis option)

    • reads.ids, reads.fas: ids and fasta sequences, respectively, of the reads assigned to the cluster
    • contigs.CLnumber: all contigs assembled for the cluster
    • contigs.CLnumber.minRD5: contigs with average read depth >= 5 sorted by the read depth (_sort-GR sorted according to genome representation; _sort-length sorted according to contig length)
    • contigs.CLnumber.prof.pdf: read depth profiles of contigs
    • ACE_CLnumber.ace: cap3 assembly file (can be viewed e.g. using clview program)
    • CLnumber.GL: graph layout (to be viewed using the SeqGrapheR program available from http://cran.r-project.org/web/packages/SeqGrapheR/index.html)
    • CLnumber_blastx.csv: blastx hits of reads to database of plant transposable element protein domains
    • CLnumber_domains.csv: summary table of blastx hits listed in CLnumber_blastx.csv
  • /seqClust/assembly/: output files from the assembly of reads within the clusters

    • contigs: all contigs in fasta format (contig names are derived from their cluster of origin)
    • contigs.info: all contigs with additional information about their length, average read depth and genome representation (read depth x length) encoded in the fasta header line:

      >CLxContigY (length[bp]-read_depth-genome_representation)
      
    • contigs.info.minRD5: contigs with average read depth >= 5 sorted according to read depth (_sort-GR sorted according to genome representation; _sort-length sorted according to contig length)
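The cluster assignment file hitsort_PID90_LCOV55.cls described above is easy to read in a script. The following Python sketch is a hypothetical helper, not part of the pipeline, and assumes exactly the layout shown in the example: a header line with cluster name and size, followed by whitespace-separated read ids.

```python
def parse_cls(lines):
    """Parse .cls content into a dict mapping cluster name to a list of
    read ids, and verify each list against the size given in its header."""
    clusters = {}
    declared_sizes = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if len(fields) < 2:
                raise ValueError(f"malformed cluster header: {line!r}")
            current = fields[0]
            declared_sizes[current] = int(fields[1])
            clusters[current] = []
        elif current is None:
            raise ValueError("read ids found before the first cluster header")
        else:
            clusters[current].extend(line.split())
    for name, read_ids in clusters.items():
        if len(read_ids) != declared_sizes[name]:
            raise ValueError(f"{name}: header size does not match id count")
    return clusters
```

The resulting dictionary can be used, for example, to look up which cluster a given read belongs to or to extract the reads of one cluster from the input FASTA file.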

Re-clustering, Cluster Merging

This tool has been removed from the RepeatExplorer2 pipeline, as the super-clustering of paired-end data has rendered it obsolete.

Identification and analysis of LTR-retroelement protein domains

This analysis is aimed at extraction and phylogenetic analysis of conserved regions of LTR-retroelement protein domains from a set of input nucleotide sequences. It has been designed for analyzing contig sequences obtained from the clustering analysis; however, it can be applied to any multi-fasta file of DNA sequences provided they do not contain multiple domains of the same type. The analysis consists of three consecutive steps:

  • Tools --> Protein Domains Tools --> Protein Domains Finder: This tool uses the external alignment program LAST (http://last.cbrc.jp/) and the RepeatExplorer database of Viridiplantae TE protein domains (! Classification of data from non-Viridiplantae species might not be reliable !)

  • Tools --> Protein Domains Tools --> Protein Domains Filter: This tool runs filtering on either the primary GFF3 file of all domains, i.e. output of Protein Domains Finder tool or an already filtered GFF3 file. Domains can be filtered based on: sequence similarity, length, or number of frameshifts or stop codons per 100 amino acids.

  • Tools --> RepeatMasker Search --> Custom RepeatMasker Search: This tool allows for a post-clustering analysis of repeat domains present in your data from a custom repeat database. This tool can be run directly on the archive output of your clustering run.

Examples of analysis workflows

The following examples were designed to illustrate the most frequent applications of RepeatExplorer and to practically demonstrate its various tools and data types. Although the examples use real sequence data as an input, these datasets were reduced in size for the sake of analysis speed, and therefore provide lower sensitivity in repeat detection compared to analyzing larger volumes of sequence data. In addition, some aspects of downstream analyses are covered only briefly and should be treated more thoroughly when performing a real analysis.

The examples are available via Galaxy menu Shared Data --> Published Histories, or directly using the links provided below. Each example history provides a record of finished analysis, including input data, output of individual analysis steps and parameters used to run the tools. Please read the annotations of individual steps in histories as they provide an explanation for the workflow. The workflows extracted from the example histories are also available (to import workflow to your account go to "Shared data -> Published workflows" in the Galaxy menu, select workflow from a list and then "Import workflow"). After importing, select "Edit" workflow in order to view its structure and eventually modify some parameters to suit your data. Alternatively, histories can also be imported to user accounts and used to extract workflows (History --> Extract Workflow) for repeated use with different input data. Input data used for all examples are provided as a separate history ("Input data for example histories"). Original raw sequencing data used for the examples are from whole genome shotgun sequencing of rye (Secale cereale) plants containing or lacking supernumerary B chromosomes (EBI SRA study ERP001061; Martis et al. 2012), and from pea (Pisum sativum) genome (SRA study ERP001104; Neumann et al. 2012).

Example history #1: Clustering analysis of a small sample dataset of 454 reads followed by identification and phylogenetic analysis of retrotransposon RT domains in assembled contigs

A simple example that includes random sampling of 200,000 sequences from a FASTA-formatted set of 454 reads and a subsequent clustering analysis. The dataset was prepared by sequencing rye plants containing B chromosomes.

Link: http://www.repeatexplorer.org/u/jirka/h/example-history-1-1

Figure: Workflow representing Example history #1

Example history #2: Comparative analysis of repeats between two genomes

The example demonstrates the processing of raw 454 sequence data downloaded in FASTQ format from a public repository, random sampling of reads from several sequencing runs in order to obtain a more representative dataset and various read manipulations (quality filtering, trimming to the same length). Two samples representing genome variants of rye (Secale cereale) differing in the presence (4B) or absence (0B) of supernumerary B chromosomes are processed in parallel and subsequently used for comparative analysis of their repeat composition.

Link: http://www.repeatexplorer.org/u/jirka/h/example-history-2

Figure: Workflow representing Example history #2

Example history #3: Clustering analysis using paired-end Illumina reads

The history shows utilization of paired-end reads for repeat characterization in the genome of garden pea (Pisum sativum). Datasets containing forward and reverse reads are processed separately, then combined and used for the clustering analysis.

Link: http://www.repeatexplorer.org/u/jirka/h/example-history-3

Figure: Workflow representing Example history #3

Command line version

Clustering can also be performed without the Galaxy platform using the command line version of the pipeline. Installation of the command line version is described in the Appendix. RepeatExplorer is also available on the Czech National Grid Infrastructure (see www.metacentrum.cz ). To use the RepeatExplorer command line version in MetaCentrum, type:

module add repeatexplorer
seqclust_cmd.py -h

When you use seqclust_cmd.py on the MetaCentrum PBS cluster, be careful about resource requirements. Reserve at least 8 CPUs with 16 GB of RAM and select the 'long' queue, as a job usually needs several days to finish ( qsub -l:nodes=1:ppn=8:mem=16gb -q long). The actual RAM usage is likely to be higher than specified, because the real memory requirements are hard to predict. In MetaCentrum, jobs which use more resources than were requested upon submission can be automatically terminated. To avoid termination of running jobs, it is a good idea to reserve 32 GB in the qsub command but specify only 16 GB in seqclust_cmd.py.

~~~~~~~~~

Usage: seqclust_cmd.py [options]

Options:
  -h, --help
        show this help message and exit
  -s SEQS, --sequences=SEQS
        input sequences in fasta format
  -m MINCL, --mincl=MINCL
        minimal size of cluster for detailed analysis [% of total reads]
  -o MINOVL, --minovl=MINOVL
        minimal overlap for assembly
  -d REPEATMASKER, --repeatmasker=REPEATMASKER
        repeatmasker database, possible options are All, Viridiplantae, Metazoa, Mammalia, Fungi, None
  -v OUTPUT_DIR, --output_dir=OUTPUT_DIR
        Output directory
  -p, --paired
        pair reads
  -a, --sq_rename
        do not rename sequences
  -l OVERLAP, --overlap=OVERLAP
        minimal overlap(default 55, 30-500)
  -k CUSTOM_DATABASE, --custom_database=CUSTOM_DATABASE
        file with custom repeat masker database
  -e RPS_BLAST, --rps_blast=RPS_BLAST
        if you want to run rpsblast against CDD specify e value (1e-2 - 1e-10)
  -f PREFIX, --prefix=PREFIX
        prefix length - for comparative analysis
  -z SEQCLUST_DIR, --seqclust_dir=SEQCLUST_DIR
        directory which contain previous clustering results with seqclust directory, this directory must be different from output directory
  -b MERGE, --merge=MERGE
        file with lists of clusters for merging
  -r MAX_MEM, --max_mem=MAX_MEM
        Maximal amount of available RAM in kB if not set, clustering tries to use whole available RAM
  -c CPU, --cpu=CPU
        number of cpu to use, by default all available processors are used

EXAMPLES:

clustering with default:
    seqclust_cmd.py -s sequences.fas -v output_directory

clustering with comparative analysis when species are coded by the first 4 characters in sequence names:
    seqclust_cmd.py -s sequences.fas -f 4 -v output_directory

clustering with paired Illumina reads:
    seqclust_cmd.py -s sequences.fas -p -v output_directory

merging of clusters from previous clustering:
    seqclust_cmd.py -z output_directory -b merge.txt -v output_directory2

~~~~~~~~~~~~~~~~~~
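For the comparative analysis shown in the examples, every read name has to start with a species code of fixed length. A minimal sketch of how such names could be prepared with awk is given here. The function name add_prefix, the species codes, and the file names are hypothetical, and the reads are assumed to be in plain FASTA format.

```shell
# Hypothetical helper: prepend a species code to every FASTA header.
# $1 = species code (e.g. 4 characters), $2 = FASTA file
add_prefix() {
    awk -v code="$1" '
        /^>/ { print ">" code substr($0, 2); next }  # header line
             { print }                                # sequence line
    ' "$2"
}

# Usage (file names and codes are placeholders):
#   add_prefix SPA1 species_a.fas >  sequences.fas
#   add_prefix SPB1 species_b.fas >> sequences.fas
#   seqclust_cmd.py -s sequences.fas -f 4 -v output_directory
```

All codes must have the same length, because the -f option only specifies how many leading characters of each sequence name identify the species.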

Appendices

Galaxy Wiki: http://wiki.g2.bx.psu.edu/

FileZilla FTP client: http://filezilla-project.org/

List of papers using graph-based read clustering for repeat identification

(sorted chronologically)

Novak, P., Neumann, P., Macas, J. (2010) - Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data. BMC Bioinformatics 11: 378.

Macas, J., Kejnovsky, E., Neumann, P., Novak, P., Koblizkova, A., Vyskot, B. (2011) - Next generation sequencing-based analysis of repetitive DNA in the model dioecious plant Silene latifolia. PLoS ONE 6: e27335.

Renny-Byfield, S., Chester, M., Kovarik, A., Le Comber, S.C., Grandbastien, M.A., Deloger, M., Nichols, R., Macas, J., Novak, P., Chase, M.W., Leitch, A.R. (2011) - Next generation sequencing reveals genome downsizing in allopolyploid Nicotiana tabacum, predominantly through the elimination of paternally derived repetitive DNAs. Mol. Biol. Evol. 28: 2843-2854.

Torres, G.A., Gong, Z., Iovene, M., Hirsch, C.D., Buell, C.R., Bryan, G.J., Novak, P., Macas, J., Jiang, J. (2011) - Organization and evolution of subtelomeric satellite repeats in the potato genome. G3: Genes, Genomes, Genetics 1: 85-92.

Pagan, H.J.T., Macas, J., Novak, P., McCulloch, E.S., Stevens, R.D., Ray, D.A. (2012) - Survey sequencing reveals elevated DNA transposon activity, novel elements, and variation in repetitive landscapes among bats. Genome Biol. Evol. 4: 575-585.

Renny-Byfield, S., Kovarik, A., Chester, M., Nichols, R.A., Macas, J., Novak, P., Leitch, A.R. (2012) - Independent, rapid and targeted loss of highly repetitive DNA in natural and synthetic allopolyploids of Nicotiana tabacum. PLoS ONE 7: e36963.

Neumann, P., Navratilova, A., Schroeder-Reiter, E., Koblizkova, A., Steinbauerova, V., Chocholova, E., Novak, P., Wanner, G., Macas, J. (2012) - Stretching the rules: monocentric chromosomes with multiple centromere domains. PLoS Genetics 8: e1002777.

Piednoel, M., Aberer, A.J., Schneeweiss, G.M., Macas, J., Novak, P., Gundlach, H., Temsch, E.M., Renner, S.S. (2012) - Next-generation sequencing reveals the impact of repetitive DNA across phylogenetically closely related genomes of Orobanchaceae. Mol. Biol. Evol. 29: 3601-3611.

Martis, M.M., Klemme, S., Moghaddam, A.M.B., Blattner, F.R., Macas, J., Schmutzer, T., Scholz, U., Gundlach, H., Wicker, T., Simkova, H., Novak, P., Neumann, P., Kubalakova, M., Bauer, E., Haseneyer, G., Fuchs, J., Dolezel, J., Stein, N., Mayer, K.F.X., Houben, A. (2012) - Selfish supernumerary chromosome reveals its origin as a mosaic of host genome and organellar sequences. Proc. Natl. Acad. Sci. USA 109: 13343-13346.

Gong, Z., Wu, Y., Koblizkova, A., Torres, G.A., Wang, K., Iovene, M., Neumann, P., Zhang, W., Novak, P., Buell, R., Macas, J., Jiang, J. (2012) - Repeatless and repeat-based centromeres in potato: implications for centromere evolution. Plant Cell 24: 3559-3574.

Novak, P., Neumann, P., Pech, J., Steinhaisl, J., Macas, J. (2013) - RepeatExplorer: a Galaxy-based web server for genome-wide characterization of eukaryotic repetitive elements from next generation sequence reads. Bioinformatics 29: 792-793.

Heckmann, S., Macas, J., Kumke, K., Fuchs, J., Schubert, V., Ma, L., Novak, P., Neumann, P., Taudien, S., Platzer, M., Houben, A. (2013) - The holocentric species Luzula elegans shows interplay between centromere and large-scale genome organization. Plant J. 73: 555-565.

Renny-Byfield, S., Kovarik, A., Kelly, L., Macas, J., Novak, P., Chase, M., Nichols, R.A., Pancholi, M., Grandbastien, M.A., Leitch, A. (2013) - Diploidisation and genome size change in allopolyploids is associated with differential dynamics of low and high copy sequences. Plant J. 74: 829-839.

Klemme, S., Banaei-Moghaddam, A.M., Macas, J., Wicker, T., Novak, P., Houben, A. (2013) - High-copy sequences reveal a distinct evolution of the rye B chromosome. New Phytol. 199: 550-558.

Steflova, P., Tokan, V., Vogel, I., Lexa, M., Macas, J., Novak, P., Hobza, R., Vyskot, B., Kejnovsky, E. (2013) - Contrasting patterns of transposable element and satellite distribution on sex chromosomes (XY1Y2) in the dioecious plant Rumex acetosa. Genome Biol. Evol. 5: 769-782.

Installation

Dependencies

There are a number of additional dependencies that are not provided by the RepeatExplorer authors. Additional programs include:

Adding RepeatExplorer to your local Galaxy installation

  • To obtain a copy of RepeatExplorer from the repository, run the following Mercurial commands:

    hg clone https://bitbucket.org/repeatexplorer/repeatexplorer
    cd repeatexplorer
    hg update -r stable
    

    Mercurial is a revision control tool for software development. If you do not have Mercurial installed, RepeatExplorer can be downloaded as a zip archive from https://bitbucket.org/repeatexplorer/repeatexplorer/get/stable.zip.

  • From the repeatexplorer directory, copy the directory umbr_programs to $GALAXY_DIR/tools/

  • Modify the file $GALAXY_DIR/tool_conf.xml by adding the content of the file repeatexplorer/tools.xml at an appropriate location. This will add the RepeatExplorer tools to the Galaxy tool menu. To understand the syntax of tool_conf.xml, consult the Galaxy wiki (http://wiki.g2.bx.psu.edu/).

  • Add the content of the repeatexplorer/tool-data directory to the $GALAXY_DIR/tool-data directory

The above steps can also be performed using the script install2galaxy.sh, executed from the repeatexplorer directory:

./install2galaxy.sh -d $GALAXY_DIR

If you use the install2galaxy.sh script, we recommend making a backup copy of tool_conf.xml first. Note that the install2galaxy.sh script will place the RepeatExplorer menu as the last item of the installed Galaxy tools.
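The manual installation steps listed above can be sketched as a shell function. This is only an illustration of the copy and backup steps and is not the install2galaxy.sh script itself. The function name install_repeatexplorer_files is hypothetical, and editing tool_conf.xml is deliberately left as a manual step.

```shell
# Hypothetical helper mirroring the manual installation steps.
# $1 = path to the repeatexplorer checkout, $2 = Galaxy directory
install_repeatexplorer_files() {
    src=$1
    galaxy=$2
    # Keep a backup of the tool menu definition before it is edited.
    cp "$galaxy/tool_conf.xml" "$galaxy/tool_conf.xml.bak" || return 1
    # Copy the tool directory and the accompanying tool data.
    mkdir -p "$galaxy/tools" "$galaxy/tool-data" || return 1
    cp -r "$src/umbr_programs" "$galaxy/tools/" || return 1
    cp -r "$src/tool-data/." "$galaxy/tool-data/" || return 1
    echo "Now add the content of $src/tools.xml to $galaxy/tool_conf.xml by hand."
}

# Usage, run from the directory that contains the checkout:
#   install_repeatexplorer_files ./repeatexplorer "$GALAXY_DIR"
```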

Setting up correct paths

The file seqclust.config, located in the $GALAXY_DIR/tools/umbr_programs/seqclust/programs/ directory, defines some environment variables necessary for RepeatExplorer functionality. It is possible either to set the variables according to your local installation or to adjust the locations of your programs and databases to correspond to the default configuration. The second option will ease future RepeatExplorer updates. The configuration file defines the following variables:

  • $TGICL location of the TGICL program directory. Essential executable files, including mgblast and cap3, are located in $TGICL/bin
  • $PROG_COMMUNITY location of the Louvain clustering program directory (do not forget to compile the executables!)
  • $REPEAT_MASKER RepeatMasker installation directory. This directory contains both the executable and the RepeatMasker database. RepeatMasker uses the cross_match search engine. Note that the path to the cross_match executable is hard coded in the file $REPEAT_MASKER/RepeatMaskerConfig.pm. To set the correct path to cross_match, modify the CROSSMATCH_DIR and CROSSMATCH_PRGM variables in the RepeatMaskerConfig.pm script or use the configuration script which is provided with RepeatMasker.
  • $RPSBLAST_DATBASE and $RPSBLAST_DATBASE_ANNOTATION location of CDD database files
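Before starting an analysis, it can be useful to verify that the configured directories actually contain the expected programs. A minimal sketch for the TGICL directory is given here. The function name check_tgicl is hypothetical, and the directory is passed as an argument rather than read from seqclust.config, whose exact syntax is not assumed.

```shell
# Hypothetical check: are mgblast and cap3 executable in <dir>/bin?
# $1 = TGICL program directory (the value assigned to $TGICL)
check_tgicl() {
    for prog in mgblast cap3; do
        if [ ! -x "$1/bin/$prog" ]; then
            echo "missing or not executable: $1/bin/$prog" >&2
            return 1
        fi
    done
    echo "TGICL executables found in $1/bin"
}

# Usage (the path is a placeholder):
#   check_tgicl /path/to/tgicl
```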

Additional variables in seqclust.config:

  • $MAXEDGES limits the maximal size of the data set which can be processed. Normally, this limit is set based on the available computer RAM. If gathering information about the memory size fails, the $MAXEDGES variable is used instead. By default, $MAXEDGES is set to 350000000, which is suitable for a computer with 16 GB of RAM.
  • The variables $MAXEDGES_FOR_LAYOUT and $MAXNODES_FOR_LAYOUT limit the maximal size of the graph for which the layout is calculated. If the number of sequences or similarity hits in a cluster exceeds $MAXNODES_FOR_LAYOUT or $MAXEDGES_FOR_LAYOUT, respectively, a sample of the cluster is created and used for the layout calculation. Increasing these parameters can significantly affect computation time.
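As a rough guide for choosing $MAXEDGES on a machine with a different amount of memory, the documented default (350000000 edges for 16 GB of RAM) can be scaled with the RAM size. The sketch assumes that the limit scales linearly with memory, which is an assumption based on memory usage being proportional to the number of similarity hits and not a tested recommendation. The function name maxedges_for_ram is hypothetical.

```shell
# Hypothetical helper: scale the documented default (350000000 edges for
# 16 GB of RAM) to another RAM size. Linear scaling is an assumption.
# $1 = RAM size in GB (integer)
maxedges_for_ram() {
    # 350000000 / 16 = 21875000 edges per GB
    echo $(( $1 * 21875000 ))
}

maxedges_for_ram 16   # prints 350000000
maxedges_for_ram 32   # prints 700000000
```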

Updates

If RepeatExplorer was obtained using Mercurial, running the following commands from the repeatexplorer folder will update the installation:

hg pull
hg update
./install2galaxy.sh -d $GALAXY_DIR

Alternatively, download the files manually from the repository, unpack them and install them with the ./install2galaxy.sh -d $GALAXY_DIR command.

Command line version

A command line version of clustering and merging is provided. See README.txt for installation instructions.

RepeatExplorer performance

Currently, the clustering step uses the Louvain method. While this method outperforms the previously used method in terms of computational time, it still requires that the whole graph is loaded into memory. Memory usage is directly proportional to the total number of similarity hits. The number of similarity hits E can be calculated from:

E = N(N-1)k

where N is the total number of reads and k is a coefficient which depends on the repetitiveness of the genome. Fewer reads can be used for highly repetitive genomes and, conversely, less repetitive genomes allow more sequencing data to be used. Based on previously analyzed data from P. sativum, it is possible to cluster up to 4 million 100 nt long reads on a computer with 16 GB of RAM. At this setting, the whole clustering and subsequent analysis needs approximately 8 days to finish. With 500 thousand sequence reads, which is still sufficient for a repeat survey, the calculation finishes in about 6 hours. Also note that a considerable amount of data is generated. For example, clustering of 4 million P. sativum reads yields 50 GiB of uncompressed files. To prevent exhausting the available memory, each clustering run is preceded by a test that estimates the limit for the number of reads. If the total number of sequences exceeds the limit, only a fraction of the reads is used for clustering. The limit is set either based on the available memory or from the $MAXEDGES parameter, as described above.
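The relation E = N(N-1)k can be turned into a quick estimate in both directions: the number of similarity hits for a given number of reads, and the approximate number of reads that fits under a given hit limit. The sketch uses awk for floating point arithmetic. The function names are hypothetical, and the value of k has to be supplied by the user because it depends on the genome. The values in the usage comments are placeholders, not measured coefficients.

```shell
# E = N(N-1)k: number of similarity hits for n reads and coefficient k
# $1 = number of reads N, $2 = coefficient k
estimate_hits() {
    awk -v n="$1" -v k="$2" 'BEGIN { printf "%.0f\n", n * (n - 1) * k }'
}

# Approximate largest N with N(N-1)k <= E, taken from the positive root
# of N^2 - N - E/k = 0 and truncated to an integer.
# $1 = hit limit E (e.g. the $MAXEDGES value), $2 = coefficient k
max_reads() {
    awk -v e="$1" -v k="$2" 'BEGIN { printf "%d\n", (1 + sqrt(1 + 4 * e / k)) / 2 }'
}

# Usage (k = 0.00001 is an arbitrary placeholder):
#   estimate_hits 1000000 0.00001
#   max_reads 350000000 0.00001
```

Because E grows with the square of N, doubling the number of reads roughly quadruples the number of hits and therefore the memory needed.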

To cut down computation time, some parts of RepeatExplorer were parallelized to take advantage of multicore processors, namely the all-to-all sequence comparison with mgblast, the protein domain search with rpsblast and blastx, and the graph layout calculation. This parallelization does not require any special setting except the installation of GNU parallel and the R packages foreach, multicore and doMC.
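Whether the prerequisites for parallel execution are present can be checked with a few shell commands. The sketch is only a convenience check under the assumption that GNU parallel is installed as parallel and that R is available through Rscript. The function names have_cmd and check_parallel_deps are hypothetical.

```shell
# Return success if a command is available on the PATH.
have_cmd() {
    command -v "$1" > /dev/null 2>&1
}

# Report missing prerequisites: GNU parallel and the R packages named above.
check_parallel_deps() {
    have_cmd parallel || echo "GNU parallel not found"
    if have_cmd Rscript; then
        for pkg in foreach multicore doMC; do
            # requireNamespace() returns FALSE if the package is not installed.
            Rscript -e "if (!requireNamespace('$pkg', quietly = TRUE)) quit(status = 1)" \
                > /dev/null 2>&1 || echo "R package not installed: $pkg"
        done
    else
        echo "Rscript not found, R packages were not checked"
    fi
}

# Usage:
#   check_parallel_deps
```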

License

Copyright (c) 2012 Petr Novak (petr@umbr.cas.cz), Jiri Macas, Pavel Neumann

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Schematic representation of the RepeatExplorer pipeline

Figure: Scheme of the clustering pipeline

Authors

Petr Novak (petr@umbr.cas.cz), Pavel Neumann, Jiri Macas, Jamie McCann

Updated