
REPEAT ANNOTATION TOOLS FOR GENOME ASSEMBLIES

1. PROFREP

- Sequences PROFiles of REPetitive elements -

ProfRep is primarily designed to localize repetitive elements and, optionally, the protein domains they encode on DNA sequences at the scale of whole genomes. It thus enables the annotation of repeats in assemblies and, moreover, quantifies their amounts (repetitive profiles). The identification of the repeats themselves is accomplished by the RepeatExplorer pipeline. For some species the repeats are already identified and classified - these are in the ProfRep internal database. For other species you need to run RepeatExplorer first.

ProfRep comprises a set of tools: the main ProfRep tool and associated tools that prepare the input data and allow some further processing of the outputs.

TOOLS STRUCTURE:

Profrep Data Preparation
    * Extract Data For ProfRep
    * ProfRep DB Reducing
ProfRep Main
    * ProfRep 
ProfRep Supplementary Tools
    * ProfRep Refiner
    * ProfRep Masker
    * GFF Region Selector

Profrep Data Preparation

These tools are used for data preparation if you work with a species that does not have a prepared annotation dataset (see the list of species in the INPUTS section below).

Extract Data For ProfRep

All the required input files can be easily extracted from the RepeatExplorer HTML archive using the tool Extract Data For ProfRep (especially for GALAXY usage). Alternatively, you can provide your custom data, e.g. a classification table - in this case please carefully follow the format requirements below.

ProfRep DB Reducing

ProfRep DB Reducing lowers the number of reads when working with large genomes. Running this step before the ProfRep tool itself can shorten the computing time significantly. It might have a slight impact on the resulting profiles; however, when the parameters are set reasonably, the resulting quantitative representations will still be accurate enough. The degree of reduction is adjustable through the size of the clusters whose reads take part in the process: all clusters that contain at least the given number of reads undergo the reduction.

As an input, choose a file of all read sequences and the list of all clusters from the RE output archive (hitsort.cls).

! NOTE For the REDUCED prepared datasets in ProfRep, the reduction was run on the reads of the most represented clusters (in RE archive: seqclust -> clustering -> clusters)

HOW DB REDUCING WORKS

This tool reduces the database of all reads based on the similarities between them. Basically, it creates groups of similar reads, and the reduced database is then composed of one representative read per group, together with the number of reads it represents. As a new reads database is produced, the CLS file that assigns reads to clusters has to be modified as well. The actual level of reduction depends on the number of clusters involved and on how big they are. The default cluster size for reduction is 1000, which means that all clusters containing 1000 or more reads are reduced.
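
The size criterion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `clusters_to_reduce` and the input dictionary are hypothetical, and the similarity-based selection of representative reads performed by the tool is not shown.

```python
# Hypothetical helper (not part of ProfRep): pick the clusters that would
# undergo read reduction. Only the size criterion is modelled here.

def clusters_to_reduce(cluster_sizes, min_reads=1000):
    """Return the names of clusters holding at least `min_reads` reads."""
    return sorted(name for name, size in cluster_sizes.items() if size >= min_reads)


# Illustrative cluster sizes (cluster name -> number of reads).
sizes = {"CL1": 25000, "CL2": 1000, "CL3": 999}
print(clusters_to_reduce(sizes))  # ['CL1', 'CL2']
```

With the default of 1000, a cluster of exactly 1000 reads is reduced while one with 999 reads is left untouched.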

ProfRep Main

The ProfRep main tool uses the outputs of RepeatExplorer to annotate repeats in DNA sequences (typically assemblies, but not necessarily). Moreover, it provides repetitive profiles of the sequence, showing the quantitative representation of individual repeats along the sequence as well as the overall repetitiveness.

INPUTS

  • DNA sequence(s) [multiFASTA]

  • Species specific dataset consisting of:

    • List of all reads sequences [multiFASTA]
      • In RE archive: seqclust -> sequences -> sequences.fasta
    • CLS file [multiFASTA]
      • in RE archive: seqclust -> clustering -> hitsort.cls
    • Classification table [TSV, CSV]
      • in RE archive: CLUSTER_TABLE.csv (automatic classification)

    There are already prepared annotation datasets for the following species:

    • Cuscuta europaea (2018)
    • Pisum sativum Cameor (2017)
    • Beta vulgaris (Kowar et al. 2016)
    • Pisum sativum Terno (Macas et al. 2015)
    • Genlisea nigrocaulis (Vu et al. 2015)
    • Rhynchospora pubera (Marques et al. 2015)

    They are available from the GALAXY drop-down menu Choose existing annotation dataset. The menu also contains datasets with reduced numbers of reads - these are marked as REDUCED. For other species you can use the relevant data from the RE output archive. In the Galaxy ProfRep Main tool choose Use custom annotation data -> Yes. You are then asked to upload the three files mentioned above. To obtain them from a RepeatExplorer archive, go to Profrep Data Preparation -> Extract Data For ProfRep. To upload your own or adjusted files, please follow the REQUIREMENTS FOR CUSTOM DATA below.


REQUIREMENTS FOR CUSTOM DATA

Reads sequences - List of all sequencing reads in multiFASTA format

Example:

    >1f
    ACAAAATAAGTAAAAATATAAATTGTACCTTATGTTGATGTAAAATGAACCCATACACCTTATGTTAAATGTTTTTGCAAGTCATCAAGTAATAACTTTC
    >1r
    AATGTAAGATATGTTTGGTGGGTTTGTTTCTTTGCTTCAAAGTATAGATCCATATTAACCAATTTTGTTTCAATTTAGACTCTCACATTTAGAATATTTCA
    >2f
    GGAATTAATCAAGAAGACTCTTCAAAGTCGAAAGATTGAAAAGTATGTATAAATCCCAGGAGTACGTTTCTCGACGAGCGCGAAGCGTTTGGGAGTACAAG

CLS file - Clustering output of RepeatExplorer (hitsort.cls) containing the list of all clusters and the reads belonging to them, in a FASTA-like format:

  • >number_of_cluster TAB number_of_reads_in_cluster
  • line with TAB-separated reads that belong to the cluster

Example:

    >CL78448        2
    1624460f        63975r
    >CL78449        2
    542765f     938471f
    >CL78450        1
    882044r
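
A minimal sketch of reading this format in Python is given below. The function name `parse_cls` is hypothetical and not part of ProfRep; the sketch assumes that each header line is followed by one line of whitespace-separated read names, as in the example above.

```python
# Hypothetical reader for the CLS format described above (illustration only).

def parse_cls(lines):
    """Parse CLS-formatted lines into {cluster_name: [read_names]}."""
    clusters = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line, e.g. ">CL78448<TAB>2": keep the cluster name only.
            current = line[1:].split()[0]
            clusters[current] = []
        elif current is not None:
            # Read names belonging to the most recent cluster header.
            clusters[current].extend(line.split())
    return clusters


example = [
    ">CL78448\t2",
    "1624460f\t63975r",
    ">CL78450\t1",
    "882044r",
]
print(parse_cls(example))
```

The declared read count in the header is ignored here; a stricter reader could compare it with the number of read names actually found.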

Classification table - TAB-separated list of cluster numbers and their repeat classification. The list does not necessarily have to contain all the clusters. The clusters can be classified either automatically (RE output) or manually; the classification may be an arbitrary custom string. However, it is highly desirable to follow the consensus classification, and ProfRep Refiner requires it when joining fragmented segments based on protein domains. This means:

  • individual classification levels are separated by a pipe character "|"
  • the first classification level is derived from the origin of the repetitive sequence, i.e. repeat or organelle
  • the classification of mobile elements should follow the protein domains classification
  • for the rest of the repeats (e.g. satellites, MITEs) an arbitrary custom classification with any number of levels is allowed

Example:

    42      repeat|mobile_element|Class_I|LTR|Ty1/copia|SIRE
    43      repeat|mobile_element|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Ogre/Tat|TatIV/Ogre
    45      repeat|mobile_element|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila
    48      repeat|satellite|PisTR/B
    134     organelle|plastid
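
The table layout can be illustrated with a short Python sketch that splits each classification into its levels. The function name `parse_classification_table` is hypothetical, and the sketch assumes strictly TAB-separated lines with numeric cluster identifiers, as in the example above.

```python
# Hypothetical reader for the classification table (illustration only).

def parse_classification_table(lines):
    """Map cluster number -> list of classification levels."""
    table = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Split on the first TAB only, so custom strings may contain spaces.
        cluster, classification = line.split("\t", 1)
        table[int(cluster)] = classification.split("|")
    return table


rows = [
    "42\trepeat|mobile_element|Class_I|LTR|Ty1/copia|SIRE",
    "48\trepeat|satellite|PisTR/B",
    "134\torganelle|plastid",
]
table = parse_classification_table(rows)
print(table[42][0])   # repeat
print(table[134])     # ['organelle', 'plastid']
```

The first element of each list is the origin level (repeat or organelle) and the last element is the most specific classification level.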

OUTPUTS

  • HTML summary report,JBrowse Data Directory - shows basic information and repetitive profile graphs as well as protein domains (optional) for individual sequences (up to 50). This output also serves as a data directory for the JBrowse genome browser. You can create a standalone JBrowse instance for further detailed visualization of the output tracks using the Galaxy-integrated tool. This output can also be downloaded as an archive containing all relevant data for visualization via a locally installed JBrowse server (see more about visualization in OUTPUTS VISUALIZATION below)
  • Ns GFF - reports regions of unspecified (N) bases in the sequence
  • Repeats GFF - reports repetitive regions of at least a certain length (80 by default) that are above the hits/copy numbers threshold (5 by default)
  • Domains GFF - reports protein domains, the classification of each domain, strand orientation and alignment sequences
  • Log file
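
How the length and threshold defaults of the Repeats GFF interact can be sketched as follows. This is not the ProfRep implementation: the function name `find_repeat_regions` is hypothetical, the profile is assumed to be a plain list of per-position hit counts, and treating a value equal to the threshold as passing is an assumption.

```python
# Hypothetical sketch: report runs of positions whose profile value reaches
# the threshold and whose length is at least `min_length`.

def find_repeat_regions(profile, min_hits=5, min_length=80):
    """Return 1-based inclusive (start, end) regions of the profile."""
    regions = []
    run_start = None
    for pos, hits in enumerate(profile, start=1):
        if hits >= min_hits:  # assumption: the threshold itself passes
            if run_start is None:
                run_start = pos
        elif run_start is not None:
            # The run ended at pos - 1, so its length is pos - run_start.
            if pos - run_start >= min_length:
                regions.append((run_start, pos - 1))
            run_start = None
    # Close a run that extends to the end of the sequence.
    if run_start is not None and len(profile) - run_start + 1 >= min_length:
        regions.append((run_start, len(profile)))
    return regions


# 100 positions with 6 hits are reported; 20 positions with 9 hits are too short.
profile = [0] * 10 + [6] * 100 + [0] * 5 + [9] * 20
print(find_repeat_regions(profile))  # [(11, 110)]
```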

OUTPUTS VISUALIZATION

  • Use JBrowse - Data Directory to Standalone tool to create standalone instance which can be browsed from Galaxy or downloaded locally

    JBrowse -> JBrowse - Data Directory to Standalone -> choose HTML report,JBrowse Data Directory from ProfRep output

    The tracks that are displayed in JBrowse are:

    • Reference DNA sequence in all reading frames
    • Ns GFF
    • Repeats GFF
    • Domains GFF
    • BigWig track - quantitative XY graphs showing the amounts of individual repeats per position in hits or copy numbers (depending on the selection). This includes the ALL track showing the overall repetitiveness, i.e. the sum of all classified repeats plus all the other hits mapped on the sequence

    ! NOTE If you want to visualize BigWig tracks, JBrowse has to run under an HTTP server (e.g. Apache, Nginx)

  • Choose individual output tracks and add them to an already existing JBrowse instance or visualize them individually:

    JBrowse -> JBrowse genome browser -> choose original reference seq -> choose Update existing JBrowse instance - choose the standalone instance created by Data Directory to Standalone tool OR New JBrowse instance -> Insert Track Group -> Insert Annotation Track (GFF, BigWig, BAM etc.)

  • Alternatively, you can visualize the output data in locally installed JBrowse via web server. For this, you have to download HTML report,JBrowse Data Directory archive containing all relevant files in the required data structure.

HOW PROFREP WORKS

The main ProfRep tool runs a BLAST+ similarity search of the given DNA against the database of all reads (low-coverage sequencing). The reported hits are filtered to keep only those with significant similarity and appropriate length. These and other search parameters are all adjustable (Advanced options in the Galaxy form). The similarity search runs in parallel, which shortens the computing time significantly, especially for large input data; by default it uses all available computing resources. For the parallelization the sequence is processed in sliding windows of 5 kb with a 150 b overlap; both parameters are adjustable. When changing them, make sure that the overlap is at least as long as the reads so that hits on window borders are covered. The hits for every position are recorded and classified according to the cluster to which the corresponding read belongs. The summed profile ALL is created from all individual profiles plus the profiles of all mapped (but unclustered or unclassified) reads, keeping track of the overall representation of repeats in the sequence. The protein domains search is performed by the DANTE tool (see below), which by default runs as a ProfRep module (it can be switched off). The protein domain outputs are already filtered with default parameters optimized for Viridiplantae species.
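
The window and overlap parameters can be made concrete with a small Python sketch. It only illustrates how overlapping windows tile a sequence; the function name `sliding_windows` and the 0-based half-open coordinates are assumptions, not the tool's internal representation.

```python
# Hypothetical sketch of splitting a sequence into overlapping windows.

def sliding_windows(seq_length, window=5000, overlap=150):
    """Yield 0-based half-open (start, end) windows covering the sequence."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than the window size")
    if seq_length <= 0:
        return
    step = window - overlap
    start = 0
    while True:
        end = min(start + window, seq_length)
        yield start, end
        if end >= seq_length:
            break
        start += step


# A 12 kb sequence is covered by three windows sharing 150 b at each border.
print(list(sliding_windows(12000)))
# [(0, 5000), (4850, 9850), (9700, 12000)]
```

Each window starts 150 b before the previous one ends, so a read-sized hit that spans a border lies completely inside at least one window as long as the overlap is not shorter than the reads.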

ProfRep Supplementary Tools

These additional tools can be used for further work with the ProfRep outputs

ProfRep Refiner

A tool to interconnect the fragmented parts of repetitive regions. Regions are also "cleaned": some ambiguous situations (when a part of the DNA is covered by multiple regions with different classifications) are evaluated, and regions that do not meet the criteria are removed (see below). These ambiguous regions do not necessarily indicate a low stringency of the similarity search. Nested regions of different classes can sometimes be worth reporting - overlapping repeats can e.g. belong to different clusters or represent chimeric elements.

Use the repeats GFF file as input; a new GFF with the connected regions will be created.

HOW REFINING WORKS

The refining of repeat regions runs in two consecutive steps:

  1. Prior to interconnecting the regions, it is necessary to filter out some nested regions of different classification which can disrupt the process. To differentiate those from the ones we want to preserve because they may be of some importance, the following rules apply. First, clusters of all overlapping repeat regions are created. Within a cluster, regions are checked one by one in order of descending PID. All other regions lying inside the current one, with some border tolerance on each side (10 bp by default), are removed if their PID is more than 5% lower than that of the current region. Otherwise they are preserved.

!NOTE Average PID (percentage of identity) is computed as the mean of the PIDs of hits with uniform classification at each position and is then averaged over the whole region reported in the repeats GFF.

  2. The actual region joining then follows. It searches for consecutive repeats to create segments of the same classification whose parts are no further from each other than a gap threshold (250 by default). These segments must not be interrupted by repeats of a different classification. By default, the confidence of such a region is supported by domain information: a configurable minimum number of protein domains of equal orientation must be present inside the new region, and their classification must correspond to the classification of the repeats. The classification must also be unambiguous in the sense that the individual parts are classified down to the very last classification level (checked against the domains classification table). Repetitive elements that do not encode protein domains, such as satellites or MITEs, are not checked for the additional domain information, and their regions are connected based on the gap criterion only.
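
The gap criterion of the joining step can be sketched as below. This is a simplified illustration, not the Refiner code: the function name `join_regions` and the tuple layout are hypothetical, the gap is assumed to be the number of bases between two regions, and the protein domain support check is deliberately left out.

```python
# Hypothetical sketch of gap-based joining of same-class regions.

def join_regions(regions, gap_threshold=250):
    """Join consecutive same-class regions separated by at most gap_threshold bp.

    `regions` is a list of (start, end, classification), 1-based inclusive.
    """
    joined = []
    for start, end, cls in sorted(regions):
        if joined:
            prev_start, prev_end, prev_cls = joined[-1]
            # Only the directly preceding region is considered, so a region
            # of another class in between prevents the joining.
            if cls == prev_cls and start - prev_end - 1 <= gap_threshold:
                joined[-1] = (prev_start, max(prev_end, end), cls)
                continue
        joined.append((start, end, cls))
    return joined


copia = "repeat|mobile_element|Class_I|LTR|Ty1/copia|SIRE"
regions = [
    (100, 300, copia),
    (400, 900, copia),                      # 99 bp after the previous one
    (1000, 1200, "repeat|satellite|PisTR/B"),
    (1300, 1500, copia),                    # separated by the satellite
]
print(join_regions(regions))
```

The first two regions are merged into 100-900, while the last region stays separate because a satellite of a different classification lies in between.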

ProfRep Masker

Masks the original sequence based on the repetitive regions reported by ProfRep. It allows either lowercase or "N" mode of masking.
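
The two masking modes can be illustrated with a short Python sketch. The function name `mask_sequence` is hypothetical, and the regions are assumed to use 1-based inclusive coordinates as in GFF files.

```python
# Hypothetical sketch of lowercase and "N" masking (illustration only).

def mask_sequence(seq, regions, mode="lowercase"):
    """Mask 1-based inclusive (start, end) regions in lowercase or with N."""
    if mode not in ("lowercase", "N"):
        raise ValueError("mode must be 'lowercase' or 'N'")
    chars = list(seq)
    for start, end in regions:
        # Convert to 0-based indices and clip to the sequence boundaries.
        for i in range(max(start, 1) - 1, min(end, len(chars))):
            chars[i] = "N" if mode == "N" else chars[i].lower()
    return "".join(chars)


print(mask_sequence("ACGTACGTAC", [(3, 5)]))            # ACgtaCGTAC
print(mask_sequence("ACGTACGTAC", [(3, 5)], mode="N"))  # ACNNNCGTAC
```

Lowercase masking keeps the sequence information, whereas "N" masking discards it.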

GFF Region Selector

This tool extracts a region of interest from an input GFF. It makes it easier, e.g., to compare features from several GFF files for only a fraction of the whole data. Use an arbitrary GFF as input. Type in the selected region and the corresponding seq ID in the following form:

    original_seq_name:start-end (e.g. chr1:1000-2000)

The coordinates in the modified output GFF will be recalculated with respect to this range. Then choose a new sequence name for the modified region (this is important especially for adding the modified GFF to JBrowse so that it can be matched with the cut reference sequence). When not specified, the new seq ID reported in the output GFF will have the form original_seq_name_cut_start:end
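
The region syntax and the coordinate recalculation can be sketched in Python as follows. The function names are hypothetical, and two details are assumptions rather than documented behaviour: the region start is mapped to position 1, and only features lying completely inside the region are kept.

```python
import re

# Hypothetical sketch of the region selection (illustration only).

def parse_region(text):
    """Split 'seq_name:start-end' into (seq_name, start, end)."""
    match = re.fullmatch(r"(.+):(\d+)-(\d+)", text.strip())
    if match is None:
        raise ValueError("expected the form seq_name:start-end")
    seq_id, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("start must not exceed end")
    return seq_id, start, end


def select_region(gff_lines, region, new_seq_id=None):
    """Return GFF lines inside `region` with recalculated coordinates."""
    seq_id, start, end = parse_region(region)
    if new_seq_id is None:
        new_seq_id = f"{seq_id}_cut_{start}:{end}"
    selected = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        f_start, f_end = int(cols[3]), int(cols[4])
        # Assumption: keep only features fully contained in the region.
        if cols[0] != seq_id or f_start < start or f_end > end:
            continue
        cols[0] = new_seq_id
        cols[3] = str(f_start - start + 1)  # region start becomes position 1
        cols[4] = str(f_end - start + 1)
        selected.append("\t".join(cols))
    return selected


gff = [
    "##gff-version 3",
    "chr1\tprofrep\trepeat\t1200\t1500\t.\t+\t.\tName=inside",
    "chr1\tprofrep\trepeat\t2500\t2600\t.\t+\t.\tName=outside",
]
for line in select_region(gff, "chr1:1000-2000"):
    print(line)
```

For the region chr1:1000-2000, a feature at 1200-1500 ends up at 201-501 on the sequence chr1_cut_1000:2000.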


WORKFLOW EXAMPLES

TASK 1:

You want to run ProfRep on a fraction of large data (say, a segment of an individual chromosome) and visualize the outputs together with some additional annotation tracks (GFF files from assemblies etc.) to compare them

SOLUTION:

  1. Run ProfRep main tool on DNA segment you wish to analyze

  2. Use GFF Region Selector to create shrunk additional GFF file to compare it with ProfRep output data

    In the tool you enter the interval range and the original sequence name which you want to cut in the following form:

        original_seq_name:start-end (e.g. chr1:1000-2000)
    

    If you want to visualize the cut track with JBrowse, make sure that the new sequence name corresponds to the name of the cut DNA sequence that was used for the ProfRep analysis - it will appear in the modified GFF

  3. Use JBrowse -> Data Directory to Standalone and choose the HTML output of ProfRep to create a JBrowse standalone instance

  4. Use JBrowse -> JBrowse genomic browser to visualize all the tracks

    • 4.1. Use a genome from history -> Select the cut DNA seq
    • 4.2. Produce Standalone instance -> Yes
    • 4.3. Select Update existing JBrowse instance - choose the standalone instance that was created from the ProfRep HTML report in the previous step
    • 4.4. Insert Track Group -> Insert Annotation Track -> Track Type -> GFF

    Once you run this, a new JBrowse instance will be created, so in the future you can add further tracks to it in the same manner


2. DANTE

- Domain based ANnotation of Transposable Elements -

Protein Domains Tools are designed to identify and localize protein domains of transposable elements in an arbitrary DNA sequence, including whole genomes. As the domains are subsequently classified to as detailed a level of repeat classification as possible, the overall transposon composition of the DNA can be inferred. Accordingly, the tools make it possible to explore how individual types of repeats are distributed along the sequence and at what density. This can be useful for genome annotation, or it can serve as a supplementary tool for the RepeatExplorer pipeline - to refine the protein domains annotation and classification after clustering. Moreover, the output can be used more widely, for example to derive phylogenetic relations of the repeats.

Protein Domains Finder

This tool provides preliminary output of all domains types which are not filtered for quality.

INPUTS

  • DNA sequence [multiFasta]

OUTPUTS

  • All protein domains GFF3 - individual domains are reported one per line as regions (start-end) on the original DNA sequence, including the seq ID and strand orientation. The last "Attributes" column contains several comma-separated items of information related to the domain annotation, the alignment and its quality. This file can undergo further filtering using the Protein Domains Filter tool.

Protein Domains Filter

Filters the GFF3 output of the previous step to obtain a certain kind of domain and/or to adjust the quality filtering

INPUTS

  • All protein domains file [GFF3]
  • Filtering parameters:
    • Minimum identity (default 0.35)
    • Minimum similarity (default 0.45)
    • Minimum alignment length (default 0.8)
    • Interruptions (frameshifts + stop codons) per 100 AA (default 3)
    • Protein domain type (default All)
    • Custom repeat type

OUTPUTS

  • Filtered GFF3 - also contains basic statistics of domains types for individual sequences and repeat classifications before and after filtering
  • Translated protein sequences of the filtered domain regions of original DNA in fasta format

HOW DANTE WORKS

This tool uses the external alignment program LAST and the RepeatExplorer database of Viridiplantae TE protein domains.

Lastal runs a similarity search to find hits between the query DNA sequence and our database of protein domains from all Viridiplantae repetitive elements. Hits with overlapping positions in the sequence (even when connected only through other hits) form a cluster which represents one potential protein domain. Strand orientation is taken into consideration when forming the clusters, which means that each cluster is built exclusively from forward-stranded or exclusively from reverse-stranded hits. The clusters are subsequently processed separately; within one cluster the positions are scanned base by base, and classification strings are assigned to each of them based on the database sequences that were mapped to that place. These assigned classification strings consist of a domain type as well as the class and lineage of the repetitive element the database protein comes from. Different classification levels are separated by the "|" character. Every hit is scored according to the scoring matrix used for DNA-protein alignment (BLOSUM80). For a single position only the hits reaching a certain percentage (80% by default) of the overall best score within the whole cluster are reported. One cluster of overlapping hits represents one domain region and is recorded as one line in the resulting GFF3 file. Regarding the classification strings assigned to one region (cluster), three situations can occur:

  • There is a single classification string assigned to each position as well as classifications along all the positions in the region are mutually uniform, in this case domain's final classification is equivalent to this unique classification
  • There are multiple classification strings assigned to one cluster, i.e. one domain, which leads to classification to the common (less specific) level of all the strings
  • There is a conflict at the domain type level, domains are reported with slash (e.g. RT/INT) and the classification is in this case ambiguous
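
The second situation, falling back to the common (less specific) level, can be sketched as a longest common prefix over the classification levels. This is only an illustration of the idea: the function name `common_classification` is hypothetical, and the example strings are borrowed from the classification table example in the ProfRep part of this page rather than from actual DANTE output.

```python
# Hypothetical sketch of deriving the common classification level.

def common_classification(strings):
    """Return the deepest classification shared by all '|'-separated strings."""
    split = [s.split("|") for s in strings]
    common = []
    # zip stops at the shortest string, so no level beyond it is compared.
    for levels in zip(*split):
        if len(set(levels)) != 1:
            break
        common.append(levels[0])
    return "|".join(common)


hits = [
    "repeat|mobile_element|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila",
    "repeat|mobile_element|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Ogre/Tat|TatIV/Ogre",
]
print(common_classification(hits))
# repeat|mobile_element|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA
```

When all strings are identical, the function returns that string unchanged, which corresponds to the first situation.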

All records with an ambiguous domain type (e.g. RH/INT) are filtered out automatically. They do not appear in the filtered GFF file, and no protein sequence is derived from these potentially chimeric domains. Optimal results (for general usage) should be reached with the default quality filtering parameters, which are appropriate for finding all types of protein domains. Keep in mind that the results should always be critically assessed with respect to your input data.

!NOTE: If you are working with non-Viridiplantae eukaryotic organisms, the results might not be reliable, as the domains database is mostly based on Viridiplantae organisms. In this case the filtering parameters should be adjusted and the results inspected more carefully. Based on our testing we recommend the following parameters for non-Viridiplantae eukaryotes:

  • Minimum identity: 0.3
  • Minimum similarity: 0.4
  • Minimum alignment length: 0.8
  • Interruptions: 3
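
The effect of the two parameter sets can be compared with a small Python sketch. It is not the DANTE filter itself: the dictionary keys and the function name `passes_filter` are hypothetical, and treating the alignment length as a fraction is an assumption based on the value 0.8.

```python
# Hypothetical sketch of the quality filter with both parameter sets.

DEFAULT_PARAMS = {
    "min_identity": 0.35,
    "min_similarity": 0.45,
    "min_alignment_length": 0.8,  # assumed to be a fraction
    "max_interruptions": 3,       # frameshifts + stop codons per 100 AA
}

# Relaxed values recommended above for non-Viridiplantae eukaryotes.
NON_VIRIDIPLANTAE_PARAMS = {**DEFAULT_PARAMS, "min_identity": 0.3, "min_similarity": 0.4}


def passes_filter(domain, params=DEFAULT_PARAMS):
    """Return True if a domain record meets all quality thresholds."""
    return (
        domain["identity"] >= params["min_identity"]
        and domain["similarity"] >= params["min_similarity"]
        and domain["relative_length"] >= params["min_alignment_length"]
        and domain["interruptions_per_100aa"] <= params["max_interruptions"]
    )


# Illustrative record with made-up values, not real DANTE output.
domain = {"identity": 0.32, "similarity": 0.42,
          "relative_length": 0.85, "interruptions_per_100aa": 1}
print(passes_filter(domain))                            # False
print(passes_filter(domain, NON_VIRIDIPLANTAE_PARAMS))  # True
```

A domain that narrowly misses the default identity and similarity thresholds is retained under the relaxed settings, which is why such results should be inspected more carefully.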
