== RepeatExplorer ==

RepeatExplorer is a web-based computational pipeline for the discovery and characterization
of repetitive sequences in eukaryotic genomes. The pipeline uses shotgun high-throughput
genome sequencing data and does not require an assembled genome. RepeatExplorer is
implemented within the Galaxy environment. To see RepeatExplorer in action, visit our Galaxy server at
http://repeatexplorer.umbr.cas.cz. The RepeatExplorer manual, including installation instructions, can be
found at http://repeatexplorer.umbr.cas.cz/static/html/help/manual.html



=== Licence ===

Copyright (c) 2012 Petr Novak (petr@umbr.cas.cz), Jiri Macas and Pavel Neumann,
Laboratory of Molecular Cytogenetics (http://w3lamc.umbr.cas.cz/lamc/),
Institute of Plant Molecular Biology, Biology Centre AS CR, Ceske Budejovice, Czech Republic

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.



=== How to install ===

To install RepeatExplorer on a Galaxy server, consult the help pages at http://repeatexplorer.umbr.cas.cz.
To use the command-line version of clustering and reclustering, all dependencies must be installed and the file config.sh must be set correctly to specify the paths to the directories containing the executables.
To run clustering, use seqclust_cmd.py. Run seqclust_cmd.py -h for help.
To test the installation, run the scripts in the tests/ directory.

=== Dependencies ===

The executables of these dependencies are assumed to be in your PATH:

R (v >= 2.14, r-project.org) including the packages:
foreach, igraph, getopt, R2HTML, lattice, doMC, multicore, ape and Biostrings (available from www.bioconductor.org)
Perl and BioPerl (core)
Python v. >= 2.6
ImageMagick
NCBI Basic Local Alignment Search Tool version 2.2.xx
Muscle (not necessary for clustering)
fasty36 (not necessary for clustering)
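
Before running the pipeline it can help to confirm that the tools listed above are actually reachable. The following is a minimal sketch, not part of RepeatExplorer: the function name check_deps is hypothetical, and the executable names (convert for ImageMagick, blastall for legacy BLAST, muscle for Muscle) are assumptions that may differ on your system.

```shell
#!/bin/sh
# Report which of the given executables cannot be found on PATH.
# Prints one "missing: NAME" line per absent tool; the return status
# is the number of missing tools (0 means everything was found).
check_deps() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Executable names are assumptions; adjust them to match your installation.
check_deps R perl python convert blastall muscle fasty36 \
    || echo "install the tools listed above or add their directories to PATH"
```

This only checks that the executables exist; it does not verify versions or the R and Perl packages.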


Included dependencies, which require no configuration for the command-line version:

GNU parallel (included)
Louvain clustering - now provided with RepeatExplorer; must be compiled from source, see the 'louvain' directory
TGICL - a copy of tgicl was obtained from http://sourceforge.net/projects/tgicl/files/tgicl/tgicl_linux/tgicl_linux.tar.gz/download (newer versions do not work with RepeatExplorer!)

Paths to the dependencies below have to be specified in config.sh:

RepeatMasker (only if it is not in your PATH; the RepeatMasker directory must then be specified explicitly in config.sh)
Conserved domain database (only necessary when rpsblast search is included)


=== How to use RepeatExplorer on www.metacentrum.cz ===
Currently RepeatExplorer is available as a module. To use it, type:
  module add repeatexplorer
  seqclust_cmd.py -h

If you wish to use your own installation:
- get a copy of RepeatExplorer from the Bitbucket repository (https://bitbucket.org/petrnovak/repeatexplorer/get/tip.zip) and unpack it;
in repeatexplorer/louvain type 'make' to compile the clustering executables

- download legacy BLAST (file blast-2.2.26-x64-linux.tar.gz) from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/LATEST/ and unpack it into the repeatexplorer directory

- for configuration use config_metacentrum.sh; in the repeatexplorer directory type:
	mv config_metacentrum.sh config.sh

- before clustering run:
	module add R-2.14.0 python-2.6.2 bioperl-1.6.1 repeatmasker
  Note: to be able to use RepeatMasker, you have to confirm the RepeatMasker licence agreement ( http://metavo.metacentrum.cz/cs/myaccount/licence.html ). The repeatmasker executable should be in your PATH.

- to test your configuration, run the scripts in the tests/ directory:
	./test1.sh  runs clustering and reclustering without RepeatMasker search
	./test2.sh  includes the Viridiplantae RepeatMasker database
	./test3.sh  uses only one processor
	./test4.sh  takes a couple of hours, includes comparative analysis

  outputs from the test scripts are located in test_data/test_dir/runx; also check the log files in the same directory

- if the tests finish without errors, you can run clustering using the seqclust_cmd.py script;
  for usage type:
    ./seqclust_cmd.py -h
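
The test step above can be scripted so that each test script is run in turn and its exit status is summarized. This is only a sketch under assumptions: the function name run_tests is hypothetical, the scripts are assumed to be executable and to signal failure through a non-zero exit status, and the current directory is assumed to be the RepeatExplorer directory.

```shell
#!/bin/sh
# Run every test*.sh script in the given directory, one after another,
# and print a PASS/FAIL line per script based on its exit status.
# The return status is the number of failed scripts.
run_tests() {
    dir=$1
    failed=0
    for script in "$dir"/test*.sh; do
        [ -f "$script" ] || continue      # the glob matched nothing
        name=$(basename "$script")
        # Run from inside the directory, as with a manual ./test1.sh call.
        if (cd "$dir" && "./$name"); then
            echo "PASS $name"
        else
            echo "FAIL $name"
            failed=$((failed + 1))
        fi
    done
    return "$failed"
}

# Only runs when a tests/ directory exists in the current directory.
if [ -d tests ]; then
    run_tests tests || echo "some tests failed; check the log files"
fi
```

Since test4.sh takes a couple of hours, you may prefer to call the scripts individually while setting up.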

Resource requirements:
reserve at least 8 CPUs with 16 GB of RAM and select the 'long' queue, since a job needs several days to finish ( qsub -l:nodes=1:ppn=8:mem=16gb -q long). The actual RAM requirement will probably be higher, however; it depends on the genome, so it may be a good idea to reserve 32 GB but specify only 16 GB in seqclust_cmd.py.
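
The reservation described above could also be written as a job script instead of qsub command-line options. The sketch below is an assumption-laden illustration: make_job is a hypothetical helper, the #PBS directives follow common Torque/PBS conventions and may not match the exact syntax accepted on MetaCentrum, and the clustering command is left as the help call because the actual options depend on your data.

```shell
#!/bin/sh
# Print a PBS-style job script requesting the given resources.
# Usage: make_job CPUS MEM QUEUE
make_job() {
    cat <<EOF
#!/bin/sh
#PBS -l nodes=1:ppn=$1
#PBS -l mem=$2
#PBS -q $3
module add repeatexplorer
# Replace the next line with your clustering command; options are listed by -h.
seqclust_cmd.py -h
EOF
}

# Reserve 32 GB for the job, following the advice to reserve more RAM
# than the amount you specify to seqclust_cmd.py.
make_job 8 32gb long
```

Redirect the output to a file and submit that file with qsub once the directives have been checked against the scheduler documentation.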

If you want to use the Conserved domain database search, download the database and set the location of the database files in config.sh.