
RepeatExplorer2 with TAREAN (Tandem Repeat Analyzer)


New version of RepeatExplorer with TAndem REpeat ANalyzer

Authors

Petr Novak, Jiri Macas, Pavel Neumann, Biology Centre CAS, Czech Republic

Change log

link

Dependencies

  • python 3.4 or higher with installed modules:
  • R with packages:
    • igraph version 1.0.0 or higher
    • parallel
    • scales
    • stringr
    • hwriter
    • R2HTML
    • optparse
    • Rserve
    • plyr
    • png
    • plotrix
    • DT
    • data.tree
    • tools
    • DBI, RSQLite
    • Biostrings
    (R dependencies can be checked with the check_R.R script)
  • mafft - multiple alignment program
  • ImageMagick
  • NCBI blast (a specific version is required):
    • v. 2.2.28+ tested, works OK
    • does not work with v. 2.2.25 at all
    • v. 2.2.31+ and 2.3.0+ work but use more than 10-fold more memory; the reason is not clear
  • DIAMOND - accelerated BLAST-compatible local sequence aligner (see https://github.com/bbuchfink/diamond). DIAMOND is required only if you do not want to use blastx for the protein domain search
  • NCBI legacy blast - the command blastall is required
  • LAST - similarity search program (see http://last.cbrc.jp), version 956 or higher. LAST is required only if you want to use the "OXFORD_NANOPORE" option.
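
The BLAST version notes above can be checked before starting a run. The following is a minimal sketch and not part of RepeatExplorer: the helper names parse_blast_version and blast_status are hypothetical, and it assumes the version is reported in a form such as "blastn: 2.2.28+" (as printed by blastn -version). Only the versions named in the list are classified; everything else is reported as untested.

```python
import re

def parse_blast_version(text):
    """Extract (major, minor, patch) from text such as 'blastn: 2.2.28+'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in: %r" % text)
    return tuple(int(part) for part in match.groups())

def blast_status(version):
    """Map a version tuple onto the notes from the dependency list."""
    if version == (2, 2, 28):
        return "tested"
    if version == (2, 2, 25):
        return "does not work"
    if version in ((2, 2, 31), (2, 3, 0)):
        return "works, high memory use"
    # Versions not mentioned in the README are not judged here.
    return "untested"

if __name__ == "__main__":
    print(blast_status(parse_blast_version("blastn: 2.2.28+")))
```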

Installation

To use RepeatExplorer without installation, we recommend our freely available Galaxy server at https://repeatexplorer-elixir.cerit-sc.cz. This server is provided within the framework of the ELIXIR-CZ project. The Galaxy server also includes additional tools useful for data preprocessing, quality control and genome annotation.

For standalone installation, follow the instructions below.

Installation requirements

Python 3.4 or higher is required. Install pyRserve using pip:

pip install pyRserve

To download source:

git clone https://petrnovak@bitbucket.org/petrnovak/repex_tarean.git
cd repex_tarean

Compile source using:

make

Check your R installation and dependencies using script:

./check_R.R

If you get an error, install the required R packages.

Add support for 32-bit executables: if you are using Ubuntu, add 32-bit support by running:

sudo apt-get install libc6:i386 libncurses5:i386 libstdc++6:i386
sudo apt-get update
sudo apt-get install libc6:i386 libncurses5:i386 libstdc++6:i386

Protein databases

RepeatExplorer2 utilizes the REXdb database of protein domains for repeat annotation and classification. The structure of the database is described at http://repeatexplorer.org/. The current version of the database for RepeatExplorer is fetched from the bitbucket repository https://bitbucket.org/petrnovak/re_databases during compilation with the make command.

Custom protein database

Alternatively, you can compile your own custom protein database. The path to the protein databases is stored in the configuration file config.py. Change the values for the fasta files in the variable PROTEIN_DATABASE_OPTIONS accordingly and place your fasta file with the database in the databases directory. Additionally, you will have to use the blast command makeblastdb to format the blast database. If you want to use the diamond program, create the corresponding database as well.
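
The formatting step can be sketched as follows. This is only an illustration: the helper name database_format_commands and the file name databases/custom.fasta are hypothetical, and the database names that the pipeline actually expects (in particular for diamond) are an assumption that should be checked against config.py. The sketch only builds the command lines; it does not run them.

```python
def database_format_commands(fasta_path, use_diamond=False):
    """Return the commands that index a protein fasta file for searching."""
    # makeblastdb builds a protein BLAST database next to the fasta file.
    commands = [["makeblastdb", "-in", fasta_path, "-dbtype", "prot"]]
    if use_diamond:
        # Assumption: the diamond database is named after the fasta file.
        commands.append(
            ["diamond", "makedb", "--in", fasta_path, "-d", fasta_path]
        )
    return commands

if __name__ == "__main__":
    for command in database_format_commands("databases/custom.fasta", True):
        print(" ".join(command))
```

To execute the commands, each list can be passed to subprocess.check_call.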

A custom protein database is provided as a fasta file and must conform to the following syntax:

>unique_sequence-id#classification:Protein-domain-name
aasequence

Example of correct sequence names in a fasta file:

>REXdb_ID4978#Class_I/LTR/Ty1_copia/Tork:Ty1-RH
HKRFGHYNLKSIQFAQKQELVKDLPNIQTFSEVCEGCQLGKQHRLPFPSSATWR
ASEKLELVHSDVCGPMNTSSLNGSKYFILFIDDFTRMTWVYFLKQKSEVFSVFK
HEQACGGHFSAKKTATKVLQCGFYWP
>REXdb_ID6520#Class_I/LTR/Ty3_gypsy/non-chromovirus/OTA/Athila:Ty3-RT
ENPGRILSGFNGSSTTSLGDIVLPVQAGPVTLNVQSSVAQELSPFNVILGKFKI
FVENQSGCLLKKLRTDNGKEYTSTEFNKFCDDLGVERQLTVSYSPQQNGVSERK
NRSVLEMARCMIFEKKLPKSFWAEAINTAVYLQN
>REXdb_ID14309#Class_I/LINE:LINE-ENDO
ENPGRILSGFNGSSTTSLGDIVLPVQAGPVTLNVQSSVAQELSPFNVILGENPG
RILSGFNGSSTTSLGDIVLPVQAGPVTLNVQSSVAQELSPFNVILGXMEEAERA
LQDLKHHLQSPPILTAPLPGEDLLLYIVATTHVASSAT
>REXdb_ID7018#Class_I/LTR/Ty3_gypsy/non-chromovirus/OTA/Ogre_Tat/TatIV_Ogre:Ty3-INT
ENPGRILSGFNGSSTTSLGDIVLPVQAGPVTLNVQSSVAQELSPFNVILGXMEE
AERALQDLKHHLQSPPILTAPLPGEDLLLYIVATTHVASSAT
>REXdb_ID3666#Class_I/LTR/Ty1_copia/Ivana:Ty1-RT
XMEEAERALQDLKHHLQSPPILTAPLPGEDLLLYIVATTHVASSATENPGRILS
GFNGSSTTSLGDIVLPVQAGPVTLNVQSSVAQELSPFNVILG
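
Before formatting a custom database, it may help to verify that every header follows this syntax. The following is a minimal sketch, not code from RepeatExplorer; the function name parse_header is hypothetical. It splits a header into the sequence id, the classification levels and the protein domain name, and raises an error if a part is missing.

```python
def parse_header(line):
    """Split '>id#classification:domain' into (id, [levels], domain)."""
    if not line.startswith(">"):
        raise ValueError("fasta header must start with '>': %r" % line)
    body = line[1:].strip()
    seq_id, hash_sep, rest = body.partition("#")
    # The domain name follows the last ':'; it may itself contain '/'.
    classification, colon_sep, domain = rest.rpartition(":")
    if not (hash_sep and colon_sep and seq_id and classification and domain):
        raise ValueError("header does not match id#classification:domain")
    return seq_id, classification.split("/"), domain

if __name__ == "__main__":
    print(parse_header(">REXdb_ID4978#Class_I/LTR/Ty1_copia/Tork:Ty1-RH"))
```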

List of possible values of protein-domain-name for the viridiplantae database:

CACTA-TPase
DIRS-RH
DIRS-RT
DIRS-YR
Harbinger-TPase
hAT-TPase
Helitron-HEL1
Helitron-HEL2
Kolobok-TPase
LINE-ENDO
LINE-RH
LINE-RT
Mariner-TPase
Merlin-TPase
MuDR-TPase
Novosib-TPase
PARA-RH
PARA-RT
Penelope-RT
PiggyBac-TPase
P-TPase
Sola1-TPase
Sola2-TPase
Ty1-GAG
Ty1-INT
Ty1-PROT
Ty1-RH
Ty1-RT
Ty3-aRH
Ty3-CHDCR
Ty3-CHDII
Ty3-GAG
Ty3-RH
Ty3-INT
Ty3-PROT
Ty3-RT
PARA-PROT

List of possible values for classification for viridiplantae custom database:

Class_I/DIRS
Class_I/LINE
Class_I/LTR/Ty1_copia/Ale
Class_I/LTR/Ty1_copia/Alesia
Class_I/LTR/Ty1_copia/Angela
Class_I/LTR/Ty1_copia/Bianca
Class_I/LTR/Ty1_copia/Bryco
Class_I/LTR/Ty1_copia/Gymco-I
Class_I/LTR/Ty1_copia/Gymco-II
Class_I/LTR/Ty1_copia/Ikeros
Class_I/LTR/Ty1_copia/Ivana
Class_I/LTR/Ty1_copia/Osser
Class_I/LTR/Ty1_copia/SIRE
Class_I/LTR/Ty1_copia/TAR
Class_I/LTR/Ty1_copia/Tork
Class_I/LTR/Ty1_copia/Ty1-outgroup
Class_I/LTR/Ty3_gypsy/chromovirus/Chlamyvir
Class_I/LTR/Ty3_gypsy/chromovirus/chromo-outgroup
Class_I/LTR/Ty3_gypsy/chromovirus/chromo-unclass
Class_I/LTR/Ty3_gypsy/chromovirus/CRM
Class_I/LTR/Ty3_gypsy/chromovirus/Galadriel
Class_I/LTR/Ty3_gypsy/chromovirus/Reina
Class_I/LTR/Ty3_gypsy/chromovirus/Tcn1
Class_I/LTR/Ty3_gypsy/chromovirus/Tekay
Class_I/LTR/Ty3_gypsy/non-chromovirus/nonchromo-outgroup
Class_I/LTR/Ty3_gypsy/non-chromovirus/OTA/Athila
Class_I/LTR/Ty3_gypsy/non-chromovirus/OTA/Ogre_Tat/TatI
Class_I/LTR/Ty3_gypsy/non-chromovirus/OTA/Ogre_Tat/TatII
Class_I/LTR/Ty3_gypsy/non-chromovirus/OTA/Ogre_Tat/TatIII
Class_I/LTR/Ty3_gypsy/non-chromovirus/OTA/Ogre_Tat/TatIV_Ogre
Class_I/LTR/Ty3_gypsy/non-chromovirus/OTA/Ogre_Tat/TatV
Class_I/LTR/Ty3_gypsy/non-chromovirus/Phygy
Class_I/LTR/Ty3_gypsy/non-chromovirus/Selgy
Class_I/pararetrovirus
Class_I/Penelope
Class_II/Subclass_1/TIR/EnSpm_CACTA
Class_II/Subclass_1/TIR/hAT
Class_II/Subclass_1/TIR/Kolobok
Class_II/Subclass_1/TIR/Merlin
Class_II/Subclass_1/TIR/MuDR_Mutator
Class_II/Subclass_1/TIR/Novosib
Class_II/Subclass_1/TIR/P
Class_II/Subclass_1/TIR/PIF_Harbinger
Class_II/Subclass_1/TIR/PiggyBac
Class_II/Subclass_1/TIR/Sola1
Class_II/Subclass_1/TIR/Sola2
Class_II/Subclass_1/TIR/Tc1_Mariner
Class_II/Subclass_2/Helitron

List of possible values of protein-domain-name for the metazoa database:

Academ-TPase
BEL-GAG
BEL-INT
BEL-PROT
BEL-RH
BEL-RT
CACTA-TPase
DIRS-RH
DIRS-RT
DIRS-YR
Ginger-TPase
Helitron-HEL1
Helitron-HEL2
LINE-ENDO
LINE-RH
LINE-RT
Merlin-TPase
Penelope-ENDO
Penelope-RT
PIF/Harbinger-TPase
PiggyBac-TPase
P-TPase
Retrovirus-INT
Retrovirus-PROT
Retrovirus-RH
Retrovirus-RT
Sola1-TPase
Sola2-TPase
Sola3-TPase
Ty1-GAG
Ty1-INT
Ty1-PROT
Ty1-RH
Ty1-RT
Ty3-GAG
Ty3-INT
Ty3-PROT
Ty3-RH
Ty3-RT
Zator-TPase

List of possible values for classification for metazoa custom database:

Class_I/DIRS
Class_II/Subclass_1/TIR/Academ
Class_II/Subclass_1/TIR/EnSpm_CACTA
Class_II/Subclass_1/TIR/Ginger
Class_II/Subclass_1/TIR/Merlin
Class_II/Subclass_1/TIR/P
Class_II/Subclass_1/TIR/PIF_Harbinger
Class_II/Subclass_1/TIR/PiggyBac
Class_II/Subclass_1/TIR/Sola1
Class_II/Subclass_1/TIR/Sola2
Class_II/Subclass_1/TIR/Sola3
Class_II/Subclass_1/TIR/Zator
Class_II/Subclass_2/Helitron
Class_I/LINE
Class_I/LTR/Bel-Pao
Class_I/LTR/Retrovirus
Class_I/LTR/Ty1_copia
Class_I/LTR/Ty3_gypsy
Class_I/Penelope

If you do not want to or cannot use the predefined metazoa or viridiplantae classification scheme, you can define your own classification scheme, save it as an R data.tree object using the saveRDS function, and set the correct path to this classification file in PROTEIN_DATABASE_OPTIONS.

Example of classification tree which is stored in databases/classification_tree_metazoa_xx.rds for metazoa:

1  All
2   ¦--contamination
3   ¦--organelle
4   ¦   ¦--plastid
5   ¦   °--mitochondria
6   °--repeat
7       ¦--rDNA
8       ¦   ¦--45S_rDNA
9       ¦   ¦   ¦--18S_rDNA
10      ¦   ¦   ¦--25S_rDNA
11      ¦   ¦   °--5.8S_rDNA
12      ¦   °--5S_rDNA
13      ¦--satellite
14      °--mobile_element
15          ¦--Class_I
16          ¦   ¦--SINE
17          ¦   ¦--LTR
18          ¦   ¦   ¦--Bel-Pao
19          ¦   ¦   ¦--Ty1_copia
20          ¦   ¦   ¦--Ty3_gypsy
21          ¦   ¦   °--Retrovirus
22          ¦   ¦--DIRS
23          ¦   ¦--LINE
24          ¦   °--Penelope
25          °--Class_II
26              ¦--Subclass_1
27              ¦   °--TIR
28              ¦       ¦--MITE
29              ¦       ¦--Academ
30              ¦       ¦--EnSpm_CAC
31              ¦       ¦--Ginger
32              ¦       ¦--Merlin
33              ¦       ¦--P
34              ¦       ¦--PIF_Harbinger
35              ¦       ¦--PiggyBac 
36              ¦       ¦--Sola1
37              ¦       ¦--Sola2
38              ¦       ¦--Sola3
39              ¦       °--Zator
40              °--Subclass_2
41                  °--Helitron
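
The slash-separated classification strings used in the fasta headers correspond to paths in this tree (below the mobile_element node). The sketch below only illustrates that mapping in Python; it does not produce the R data.tree object that the pipeline reads, and the function names build_tree and render are hypothetical.

```python
def build_tree(paths):
    """Turn 'Class_I/LTR/Ty1_copia'-style strings into nested dicts."""
    tree = {}
    for path in paths:
        node = tree
        for name in path.split("/"):
            # Create the child on first use, then descend into it.
            node = node.setdefault(name, {})
    return tree

def render(tree, indent=0):
    """Return the tree as indented lines, children sorted by name."""
    lines = []
    for name in sorted(tree):
        lines.append(" " * indent + name)
        lines.extend(render(tree[name], indent + 2))
    return lines

if __name__ == "__main__":
    paths = ["Class_I/LINE", "Class_I/LTR/Ty1_copia", "Class_I/LTR/Ty3_gypsy"]
    print("\n".join(render(build_tree(paths))))
```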

Running RepeatExplorer:

Clustering can be set up either in the Galaxy interface using repex_tarean.xml and repex_full_clustering.xml or run from the command line:

usage: seqclust [-h] [-p] [-A] [-t] [-l LOGFILE] [-m {float range 0.0..100.0}]
                [-M {0,float range 0.1..1}] [-o {float range 30.0..80.0}]
                [-c CPU] [-s SAMPLE] [-P PREFIX_LENGTH] [-v OUTPUT_DIR]
                [-r MAX_MEMORY] [-d DATABASE DATABASE] [-C] [-k]
                [-a {2,3,4,5}]
                [-tax {VIRIDIPLANTAE2.2,METAZOA3.0,METAZOA2.0,VIRIDIPLANTAE3.0}]
                [-opt {ILLUMINA,ILLUMINA_SHORT,OXFORD_NANOPORE}]
                [-D {BLASTX_W2,BLASTX_W3,DIAMOND}]
                sequences

RepeatExplorer:
    Repetitive sequence discovery and clasification from NGS data



positional arguments:
  sequences

optional arguments:
  -h, --help            show this help message and exit
  -p, --paired
  -A, --automatic_filtering
  -t, --tarean_mode     analyze only tandem reapeats without additional classification
  -l LOGFILE, --logfile LOGFILE
                        log file, logging goes to stdout if not defines
  -m {float range 0.0..100.0}, --mincl {float range 0.0..100.0}
  -M {0,float range 0.1..1}, --merge_threshold {0,float range 0.1..1}
                        threshold for mate-pair based cluster merging, default 0 - no merging
  -o {float range 30.0..80.0}, --min_lcov {float range 30.0..80.0}
                        minimal overlap coverage - relative to longer sequence length, default 55
  -c CPU, --cpu CPU     number of cpu to use, if 0 use max available
  -s SAMPLE, --sample SAMPLE
                        use only sample of input data[by default max reads is used
  -P PREFIX_LENGTH, --prefix_length PREFIX_LENGTH
                        If you wish to keep part of the sequences name,
                         enter the number of characters which should be 
                        kept (1-10) instead of zero. Use this setting if
                         you are doing comparative analysis
  -v OUTPUT_DIR, --output_dir OUTPUT_DIR
  -r MAX_MEMORY, --max_memory MAX_MEMORY
                        Maximal amount of available RAM in kB if not set
                        clustering tries to use whole available RAM
  -d DATABASE DATABASE, --database DATABASE DATABASE
                        fasta file with database for annotation and name of database
  -C, --cleanup         remove unncessary large files from working directory
  -k, --keep_names      keep sequence names, by default sequences are renamed
  -a {2,3,4,5}, --assembly_min {2,3,4,5}
                        Assembly is performed on individual clusters, by default 
                        clusters with size less then 5 are not assembled. If you 
                        want need assembly of smaller cluster set *assmbly_min* 
                        accordingly
  -tax {VIRIDIPLANTAE2.2,METAZOA3.0,METAZOA2.0,VIRIDIPLANTAE3.0}, --taxon {VIRIDIPLANTAE2.2,METAZOA3.0,METAZOA2.0,VIRIDIPLANTAE3.0}
                        Select taxon and protein database version
  -opt {ILLUMINA,ILLUMINA_SHORT,OXFORD_NANOPORE}, --options {ILLUMINA,ILLUMINA_SHORT,OXFORD_NANOPORE}
  -D {BLASTX_W2,BLASTX_W3,DIAMOND}, --domain_search {BLASTX_W2,BLASTX_W3,DIAMOND}
                        Detection of protein domains can be performed by either blastx or
                         diamond" program. options are:
                          BLASTX_W2 - blastx with word size 2 (slowest, the most sesitive)
                          BLASTX_W3 - blastx with word size 3 (default)
                          DIAMOND   - diamond program (significantly faster, less sensitive)
                        To use this option diamond program must be installed in your PATH
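
A typical invocation can be assembled from the options listed above. The following sketch is an illustration only: the helper name seqclust_command, the input file reads.fasta and the output directory re_output are hypothetical, and it assumes that the seqclust executable is located in the current directory. Only flags that appear in the help text are used.

```python
def seqclust_command(reads, output_dir, paired=True, cpu=0,
                     taxon="VIRIDIPLANTAE3.0", executable="./seqclust"):
    """Build an argument list for a seqclust run."""
    command = [executable]
    if paired:
        command.append("-p")  # reads are paired
    # -c 0 lets the pipeline use all available CPUs (see help above).
    command += ["-c", str(cpu), "-v", output_dir, "-tax", taxon, reads]
    return command

if __name__ == "__main__":
    print(" ".join(seqclust_command("reads.fasta", "re_output", cpu=8)))
```

The resulting list can be passed to subprocess.check_call to start the analysis.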

Reproducibility

To make clustering reproducible between runs with the same data, environment variable PYTHONHASHSEED must be set:

export PYTHONHASHSEED=0
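
The effect of this variable can be seen in isolation. Python randomizes string hashes for each interpreter process unless PYTHONHASHSEED is fixed, so anything that depends on hash order can differ between runs. The sketch below is not part of the pipeline and the helper name string_hash_in_child is hypothetical; it starts a fresh interpreter with the given seed and returns the hash of a fixed string.

```python
import os
import subprocess
import sys

def string_hash_in_child(seed):
    """Return hash('repeat') as computed by a new Python process."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    output = subprocess.check_output(
        [sys.executable, "-c", "print(hash('repeat'))"],
        env=env,
        universal_newlines=True,
    )
    return int(output)

if __name__ == "__main__":
    # With the seed fixed to 0, both processes report the same value.
    print(string_hash_in_child(0) == string_hash_in_child(0))
```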

Disk space requirements

A large sqlite database for temporary data is created in the OS-specific temp directory, usually /tmp/. To use an alternative location, set the TEMP environment variable.
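
When the pipeline is started from a script, the temp location can be set in the environment of the child process, and the free space can be checked first. This is a minimal sketch with hypothetical helper names (temp_env, free_gib) and a hypothetical directory /data/tmp; how much space a run needs depends on the data and is not stated here.

```python
import os
import shutil

def temp_env(temp_dir, base_env=None):
    """Return a copy of the environment with TEMP pointing to temp_dir."""
    env = dict(os.environ if base_env is None else base_env)
    env["TEMP"] = temp_dir
    return env

def free_gib(path):
    """Free disk space at path in GiB."""
    return shutil.disk_usage(path).free / 1024 ** 3

if __name__ == "__main__":
    print("free space in current directory: %.1f GiB" % free_gib("."))
    print(temp_env("/data/tmp", {})["TEMP"])
```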

CPU and RAM requirements

Resource requirements can be set either with the command line arguments --max_memory and --cpu or with the environment variables TAREAN_MAX_MEM and TAREAN_CPU. If they are not set, the pipeline uses all available resources.
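
Because --max_memory is given in kB (see the help text above), a limit expressed in GiB has to be converted. The following sketch uses the hypothetical helper resource_args to build the corresponding command line arguments. The units expected by TAREAN_MAX_MEM are not stated in this document, so the sketch covers the command line arguments only.

```python
def resource_args(ram_gib, cpu):
    """Return seqclust arguments limiting RAM (given in GiB) and CPUs."""
    max_memory_kb = int(ram_gib * 1024 * 1024)  # GiB -> kB
    return ["-r", str(max_memory_kb), "-c", str(cpu)]

if __name__ == "__main__":
    print(" ".join(resource_args(16, 8)))
```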

How to cite

If you use RepeatExplorer for general repeat characterization in your work please cite:

or

If you use TAREAN for satellite detection and characterization please cite: