ATLAS-Pipeline / Rhea

Local InDel-Realignment

Locally realign your samples alongside known InDels and a dataset from your population of interest. This step is highly recommended to reduce false-positive variants in your downstream results.

For low-depth data, local realignment is typically performed on the whole dataset at once. This means the whole local realignment step has to be re-run every time a sample is added to the project, which can be computationally very demanding for large datasets.
Instead, Rhea asks the user to provide two sets of samples:
a) a target set: a subset of samples used to identify potential target intervals with GATK's RealignerTargetCreator, as an initial step of the Rhea pipeline
b) a guidance set: a subset of samples that is realigned alongside every sample in the Local Realignment step of the Rhea pipeline

For each sample of the study, private target positions are identified with RealignerTargetCreator. IndelRealigner is then run with the known variants (optional) and a union of the target set, the guidance set and the private positions.
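
To illustrate what taking a union of target intervals involves, here is a minimal Python sketch. It is not Rhea's own implementation. It assumes that intervals are written as strings of the form contig:start-end (or contig:pos for a single position), and the function names and the example intervals are hypothetical.

```python
from collections import defaultdict


def parse_interval(text):
    """Parse 'contig:start-end' or 'contig:pos' into (contig, start, end)."""
    contig, _, span = text.strip().rpartition(":")
    start, _, end = span.partition("-")
    return contig, int(start), int(end or start)


def format_interval(contig, start, end):
    """Write a single position as 'contig:pos', a range as 'contig:start-end'."""
    return f"{contig}:{start}" if start == end else f"{contig}:{start}-{end}"


def union_intervals(*interval_sets):
    """Combine several lists of interval strings into one merged list."""
    by_contig = defaultdict(list)
    for intervals in interval_sets:
        for text in intervals:
            contig, start, end = parse_interval(text)
            by_contig[contig].append((start, end))

    merged = []
    for contig in sorted(by_contig):  # plain string sort, not karyotype order
        spans = sorted(by_contig[contig])
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:  # overlapping or directly adjacent
                cur_end = max(cur_end, end)
            else:
                merged.append(format_interval(contig, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append(format_interval(contig, cur_start, cur_end))
    return merged


# Hypothetical intervals: one list from the target set, one private to a sample.
target_set = ["1:100-200", "2:50"]
private = ["1:150-250", "1:400-410", "2:51-60"]
print(union_intervals(target_set, private))
```

Because the target-set intervals are computed once, only the private intervals of a newly added sample have to be created before such a union can be formed.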

Before running the pipeline:

For this step of the pipeline you need a valid version of GATK 3.8.1 on your system. If it is not globally installed, you need to register it while the conda environment is active with
gatk3-register path/to/GenomeAnalysisTK.jar
or
gatk3-register path/to/GenomeAnalysisTK.tar.bz2

You need to create the following three input files:

  1. samples file
    A tab separated file containing all samples you want to realign in two columns: Sample and Path
    Sample must contain the filename without the .bam suffix
    Path can be the absolute or relative path to the file. Remember to put a final /
    An example can be found in example_files/example.samples_Rhea.tsv

  2. target file
    A tab separated file containing two columns: Target and Path
    Target must contain the filename without the .bam suffix
    Path can be the absolute or relative path to the file. Remember to put a final /
    An example can be found in example_files/example.targets.tsv
    Select a set of samples as the target set from among all the samples you intend to analyze. If you have prior knowledge of your data, choose a set with high diversity. In our tests on human genomes we chose 6 modern and 6 ancient samples, at 30X and 10X respectively. Adding more samples did not improve the results.

  3. guidance file
    A file containing the absolute or relative path to each of the guidance samples. If depths differ strongly within your dataset, it is recommended to downsample the selected BAM files to equalize them. We recommend doing this with ATLAS downsample. As the path and prefix will probably differ between your original and your downsampled files, you always need to give this information in a second column after the BAM path in the guidance list. This prefix must be the same as in your samples file.
    An example can be found in example_files/example.guidance.tsv
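
The samples and target files share the same two-column layout, so they can be generated from a list of BAM paths. The following Python sketch shows one way to do this. The function name and the example paths are hypothetical, and the header row (Sample or Target, then Path) is an assumption that should be checked against the files in example_files/.

```python
import csv
from pathlib import Path


def write_bam_table(bam_paths, out_file, name_column="Sample"):
    """Write a tab-separated table with the columns <name_column> and Path.

    The first column holds the BAM filename without the .bam suffix,
    the second the directory of the file with a final '/'.
    """
    with open(out_file, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow([name_column, "Path"])  # header row is an assumption
        for bam in map(Path, bam_paths):
            if bam.suffix != ".bam":
                raise ValueError(f"not a .bam file: {bam}")
            directory = bam.parent.as_posix() + "/"  # keep the final /
            writer.writerow([bam.stem, directory])


if __name__ == "__main__":
    # Hypothetical BAM paths; replace them with your own files.
    bams = ["data/bams/ind1.bam", "data/bams/ind2.bam"]
    write_bam_table(bams, "samples_Rhea.tsv", name_column="Sample")
    write_bam_table(bams[:1], "targets.tsv", name_column="Target")
```

Passing name_column="Target" produces the layout described for the target file; the guidance file has a different layout and is not covered by this sketch.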

How to run Rhea

Rhea needs to be run in two steps: an initial TargetCreator step and the actual LocalRealignment step, each with its own config file:

Initial step: If you add samples to your dataset later, you don't need to re-run this step.


#1. -- Choose paths to use

runScript: Rhea-targetCreator


##1. List of targets to use.
targets: supporting_files/targets.tsv

#2.  The location of the reference file which was used for alignment
ref: ../supporting_files/hs37d5.fa

#3.  Known indel sites
##   In case there are known InDel sites available for your species (like e.g. the 1000 Genomes gold standard for humans) you can provide up to two sets.
##   Both keywords must be present, but either can be left empty (" ").
known1: "-known ../supporting_files/Mills_and_1000G_gold_standard.indels.b37.vcf"
known2: " "

#4.  If you work on a system where GATK needs to be called as a specific module (like vital-it), you can add this here. Otherwise leave an empty string (" ").
GATK: "module add UHTS/Analysis/GenomeAnalysisTK/3.7"
#GATK: " "

#5.  You can parallelize the local realignment per contig or chromosome. This is recommended if your genome is very large. If you have a high number of smaller contigs it might be better not to parallelize.
##   Important: If you have parallelized during the Target Creator step, you must parallelize during the local realignment step as well!
parallelizeChrom: T

Actual Local Realignment step.


#1. -- Choose paths to use

runScript: Rhea-localReal

#2. sample file
sample_file: supporting_files/samples_Rhea.tsv

#3. guidance list
bamsGuidance: ../supporting_files/bams_guidance.list

#3.  If you have performed the Rhea-targetCreator step within this folder, select "contigs:From_target_creator"
##   If you want to use a target set from another project, or from another researcher (e.g. to recreate results), select "external" and provide the complete path to the contigs file with 'contigList'.

contigs: From_target_creator
#contigs: external
#contigList: supporting_files/contigs_list

#2.  The location of the reference file which was used for alignment
ref: ../supporting_files/hs37d5.fa

#3.  Known indel sites
##   In case there are known InDel sites available for your species (like e.g. the 1000 Genomes gold standard for humans) you can provide up to two sets. 
known1: "-known ../supporting_files/Mills_and_1000G_gold_standard.indels.b37.vcf"
known2: " "

#4.  If you work on a system where GATK needs to be called as a specific module (like vital-it), you can add this here. Otherwise leave an empty string (" ").
GATK: "module add UHTS/Analysis/GenomeAnalysisTK/3.7"
#GATK: " "


#5.  You can parallelize the local realignment per contig or chromosome. This is recommended if your genome is very large. If you have a high number of smaller contigs it might be better not to parallelize.
##   Important: If you have parallelized during the Target Creator step, you must parallelize during the local realignment step as well!
parallelizeChrom: T
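
Since the parallelization setting has to agree between the two config files, a small check before starting the second run can catch a mismatch early. The Python sketch below is one possible way to do this. It assumes flat "key: value" lines as in the listings above; the function names and the config file names in the usage note are hypothetical, and it is not part of Rhea.

```python
def read_flat_config(text):
    """Parse flat 'key: value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'key: value' line
        # Drop surrounding whitespace and double quotes from the value.
        settings[key.strip()] = value.strip().strip('"').strip()
    return settings


def mismatched_keys(target_creator_text, local_real_text, keys=("parallelizeChrom",)):
    """Return the keys whose values differ between the two config texts."""
    first = read_flat_config(target_creator_text)
    second = read_flat_config(local_real_text)
    return [key for key in keys if first.get(key) != second.get(key)]


target_creator_cfg = """
runScript: Rhea-targetCreator
ref: ../supporting_files/hs37d5.fa
parallelizeChrom: T
"""
local_real_cfg = """
runScript: Rhea-localReal
ref: ../supporting_files/hs37d5.fa
parallelizeChrom: F
"""
print(mismatched_keys(target_creator_cfg, local_real_cfg))
```

With real files, the two texts could be read with pathlib.Path("config_targetCreator.yaml").read_text() and pathlib.Path("config_localReal.yaml").read_text(), and further keys such as ref could be added to the keys argument.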

Results

The contigs list from the Target Creator step can be found under either Results/2.LOCREAL/TargetIntervals/all.guidance.interval_list or Results/2.LOCREAL/TargetIntervals/Contigs.interval_list, depending on whether you parallelized.
The final realigned BAM files can be found in Results/2.LOCREAL/final/*bam
