
RepeatExplorer2 with TAREAN (Tandem Repeat Analyzer)


New version of RepeatExplorer with TAndem REpeat ANalyzer

Authors

Petr Novak, Jiri Macas, Pavel Neumann (Biology Centre CAS, Czech Republic)

Change log

link

Dependencies

  • python 3.4 or higher with installed modules:
    • pyRserve
  • R with packages:
    • igraph version 1.0.0 or higher
    • parallel
    • scales
    • stringr
    • hwriter
    • R2HTML
    • optparse
    • Rserve
    • plyr
    • png
    • plotrix
    • DT
    • data.tree
    • tools
    • DBI, RSQLite
    • Biostrings
    R dependencies can be checked with the check_R.R script
  • mafft - multiple alignment program
  • ImageMagick
  • NCBI blast (specific version required):
    • v. 2.2.28+ tested, works OK
    • v. 2.2.25 does not work at all
    • v. 2.2.31+ and 2.3.0+ work but use more memory (more than 10-fold); the reason is not clear
  • DIAMOND - accelerated BLAST-compatible local sequence aligner (see https://github.com/bbuchfink/diamond). DIAMOND is required only if you do not want to use blastx for protein domain search
  • NCBI legacy blast - command blastall is required
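The external programs in the list above can be probed before installation with a short shell sketch. The executable names are assumptions, not taken from this README: ImageMagick is probed through `convert`, NCBI blast through `blastn`, and R through `Rscript`. The helper `check_tools` is a hypothetical name used only for illustration.

```shell
#!/bin/sh
# Report which external tools are reachable on PATH.
# Executable names below are assumptions; adjust them to your installation.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # Succeed only if every requested tool was found
    [ "$missing" -eq 0 ]
}

check_tools python3 Rscript mafft convert blastn blastall diamond \
    || echo "some tools are missing (diamond is needed only for DIAMOND domain search)"

# BLAST+ programs print their version with -version; compare the result
# with the version notes in the dependency list.
if command -v blastn >/dev/null 2>&1; then
    blastn -version
fi
```

A tool reported as missing may still be installed under a different executable name, so treat the output as a first hint rather than a definitive check.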

Installation

Currently we provide only the source code, not the protein database. The protein database is, however, necessary if you want to run the full repeat analysis. For the full repeat analysis we recommend using our freely available Galaxy server at https://repeatexplorer-elixir.cerit-sc.cz. The protein database is not necessary if you run clustering in TAREAN mode (tandem repeat analysis only).

If you really need a full RepeatExplorer2 installation with the protein database, contact Pavel Neumann for more information (neumann at umbr.cas.cz).

Installation requirements

Python 3.4 or higher is required. Install pyRserve using pip:

pip install pyRserve

To download source:

git clone https://petrnovak@bitbucket.org/petrnovak/repex_tarean.git
cd repex_tarean

Compile source using:

make

Check your R installation and dependencies using script:

./check_R.R

If you get an error, install the required R packages.

Add support for 32-bit executables: if you are using Ubuntu, add 32-bit support by running:

sudo apt-get install libc6:i386 libncurses5:i386 libstdc++6:i386
sudo apt-get update
sudo apt-get install libc6:i386 libncurses5:i386 libstdc++6:i386

Installing protein databases

Run the script:

fetch_databases.sh

This script downloads the necessary files to the databases directory. Currently, the databases can be downloaded only with a valid password. If you need access, send a request to neumann at umbr.cas.cz.

Running RepeatExplorer

Clustering can be set up either in the Galaxy interface using repex_tarean.xml or run from the command line:

usage: seqclust [-h] [-p] [-A] [-t] [-l LOGFILE] [-m {float range 0.0..100.0}]
                [-M {0,float range 0.1..1}] [-o {float range 30.0..80.0}]
                [-c CPU] [-s SAMPLE] [-P PREFIX_LENGTH] [-v OUTPUT_DIR]
                [-r MAX_MEMORY] [-d DATABASE DATABASE] [-C] [-k]
                [-opt {ILLUMINA,ILLUMINA_SHORT,OXFORD_NANOPORE}]
                [-D {BLASTX_W2,BLASTX_W3,DIAMOND}]
                sequences

RepeatExplorer:
    Repetitive sequence discovery and classification from NGS data



positional arguments:
  sequences

optional arguments:
  -h, --help            show this help message and exit
  -p, --paired
  -A, --automatic_filtering
  -t, --tarean_mode     analyze only tandem repeats without additional classification
  -l LOGFILE, --logfile LOGFILE
                        log file, logging goes to stdout if not defined
  -m {float range 0.0..100.0}, --mincl {float range 0.0..100.0}
  -M {0,float range 0.1..1}, --merge_threshold {0,float range 0.1..1}
                        threshold for mate-pair based cluster merging, default 0 - no merging
  -o {float range 30.0..80.0}, --min_lcov {float range 30.0..80.0}
                        minimal overlap coverage - relative to longer sequence length, default 55
  -c CPU, --cpu CPU     number of cpu to use, if 0 use max available
  -s SAMPLE, --sample SAMPLE
                        use only sample of input data [by default max reads is used]
  -P PREFIX_LENGTH, --prefix_length PREFIX_LENGTH
                        If you wish to keep part of the sequences name,
                         enter the number of characters which should be 
                        kept (1-10) instead of zero. Use this setting if
                         you are doing comparative analysis
  -v OUTPUT_DIR, --output_dir OUTPUT_DIR
  -r MAX_MEMORY, --max_memory MAX_MEMORY
                        Maximal amount of available RAM in kB if not set
                        clustering tries to use whole available RAM
  -d DATABASE DATABASE, --database DATABASE DATABASE
                        fasta file with database for annotation and name of database
  -C, --cleanup         remove unnecessary large files from working directory
  -k, --keep_names      keep sequence names, by default sequences are renamed
  -opt {ILLUMINA,ILLUMINA_SHORT,OXFORD_NANOPORE}, --options {ILLUMINA,ILLUMINA_SHORT,OXFORD_NANOPORE}
                        this option is experimental, not fully implemented
  -D {BLASTX_W2,BLASTX_W3,DIAMOND}, --domain_search {BLASTX_W2,BLASTX_W3,DIAMOND}
                        Detection of protein domains can be performed by either blastx or
                         diamond program. Options are:
                          BLASTX_W2 - blastx with word size 2 (slowest, the most sensitive)
                          BLASTX_W3 - blastx with word size 3 (default)
                          DIAMOND   - diamond program (significantly faster, less sensitive)
                        To use this option diamond program must be installed in your PATH
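As an illustration of how the options above combine, the following sketch assembles two command lines and prints them for review. The file name `reads.fasta`, the output directory `tarean_output`, the CPU count, and the `./seqclust` path (assuming you run from the cloned repex_tarean directory) are all hypothetical; only the flags come from the help text.

```shell
#!/bin/sh
# Hypothetical input and output names; replace them with your own.
READS=reads.fasta
OUT=tarean_output

# TAREAN-only run: paired reads (-p), tandem repeat mode (-t),
# 8 CPUs (-c), log written to a file (-l), results in $OUT (-v).
TAREAN_CMD="./seqclust -p -t -c 8 -l tarean.log -v $OUT $READS"

# Full clustering with DIAMOND for protein domain search (-D)
# and removal of large intermediate files (-C).
FULL_CMD="./seqclust -p -c 8 -D DIAMOND -C -v ${OUT}_full $READS"

# Print the commands for review instead of running them.
echo "$TAREAN_CMD"
echo "$FULL_CMD"
```

Once the printed command looks right, it can be run with `eval "$TAREAN_CMD"`. The full clustering variant needs the protein database described under Installation.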

Reproducibility

To make clustering reproducible between runs with the same data, environment variable PYTHONHASHSEED must be set:

export PYTHONHASHSEED=0
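Python randomizes string hashes per interpreter process by default, which can change the iteration order of sets between runs. The sketch below only demonstrates the effect of fixing the seed on a toy string; that this is the mechanism affecting the clustering is an assumption, and `python3` is assumed to be on PATH.

```shell
#!/bin/sh
# With a fixed seed, two separate interpreter runs produce the same string hash.
h1=$(PYTHONHASHSEED=0 python3 -c 'print(hash("ACGT"))')
h2=$(PYTHONHASHSEED=0 python3 -c 'print(hash("ACGT"))')

if [ "$h1" = "$h2" ]; then
    echo "hashes match with PYTHONHASHSEED=0"
else
    echo "hashes differ"
fi
```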

Disk space requirements

A large SQLite database for temporary data is created in the OS-specific temp directory, usually /tmp/. To use an alternative location, set the TEMP environment variable.
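A minimal sketch of redirecting temporary files, assuming the pipeline resolves its temp directory through Python's `tempfile` module (not confirmed by this README). That module checks TMPDIR before TEMP, so the sketch clears TMPDIR. The helper name `use_scratch_dir` and the example path are hypothetical.

```shell
#!/bin/sh
# Point temporary-file creation at the given directory (created if missing).
use_scratch_dir() {
    mkdir -p "$1" || return 1
    export TEMP="$1"
    # Python's tempfile module consults TMPDIR before TEMP, so clear it
    # (assumption: the pipeline picks its temp directory this way).
    unset TMPDIR
    # Show free space on the file system that holds the directory
    df -h "$1"
}

# Example call with a hypothetical path:
#   use_scratch_dir /data/scratch/repex_tmp
#   python3 -c 'import tempfile; print(tempfile.gettempdir())'
```

The commented `python3` line prints the directory Python would use, which is a quick way to confirm the setting before starting a long run.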

CPU and RAM requirements

Resource limits can be set either with the command line arguments --max_memory and --cpu or with the environment variables TAREAN_MAX_MEM and TAREAN_CPU. If they are not set, the pipeline uses all available resources.
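For example, the environment variables could be set as in the sketch below. The help text gives the command line memory limit in kB; that TAREAN_MAX_MEM uses the same unit is an assumption, and the 16 GB and 8 CPU values are arbitrary.

```shell
#!/bin/sh
# Limit the pipeline to 16 GB of RAM and 8 CPUs (illustrative values).
MEM_GB=16
# Convert GB to kB (assumption: TAREAN_MAX_MEM is in kB, like the command line option)
TAREAN_MAX_MEM=$((MEM_GB * 1024 * 1024))
TAREAN_CPU=8
export TAREAN_MAX_MEM TAREAN_CPU

echo "TAREAN_MAX_MEM=${TAREAN_MAX_MEM} kB, TAREAN_CPU=${TAREAN_CPU}"
```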

How to cite

If you use RepeatExplorer for general repeat characterization in your work please cite:

or

If you use TAREAN for satellite detection and characterization please cite: