ATLAS-Pipeline / Gaia

Genome Wide Alignment Including Adapter-trimming

This part of the workflow handles the raw data analysis, from unaligned FASTQ files to aligned BAM files.
It includes initial quality checks, adapter trimming, alignment, duplicate marking and some preliminary filtering.

Before running the pipeline:

  1. Create a config file. An example can be found at example_files/example_config_Gaia.yaml
  2. Create a samples file

Configfile

Provide an individual config file in YAML format for each project. This file can be shared with other researchers so that they can perform the exact same analysis independently.
This is a template for a Gaia config file:


runScript: Gaia  

# 1. samples file
sample_file: samples_Gaia.tsv

# 2. programs, references, etc.
atlas: /path/to/your/atlas/executable/atlas/atlas  
ref: /path/to/your/reference/file/reference.fa

# 3. how were your samples sequenced? -- uncomment only ONE option for your analysis
#sequence: single
sequence: paired

# 4. Thresholds
mappingqual: 30

# 5. additional inputs
CN: Test  #sequencing location for header-information

# 6. does the raw data contain adapters? Select T/F. If T, adapter trimming will be performed. If F, the fastq files will be aligned without trimming.
Adapter: T 

# 7. if adapters should be removed, TrimGalore will run with default parameters, including the removal of standard Illumina adapters.
# here you can specify different adapter sequences and/or parameters:
AdapterSequence1: default
AdapterSequence2: default
lengthFilter: 30
qualityFilter: 0

# 8. Java memory - specify the memory allocated for picard-tools MarkDuplicates
Xmx: -Xmx120G

# 9. how many threads to use where multithreading is possible/advised?
threads: 10
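
Before starting a run, it can help to sanity-check the config values. The following is a minimal sketch, not part of the pipeline: it assumes the YAML file has already been loaded into a Python dict (for example with PyYAML's yaml.safe_load), and the function name check_gaia_config, the list of required keys and the accepted value ranges are all assumptions made for illustration.

```python
import re

# Keys taken from the template above. Treating these as required is an
# assumption of this sketch, not documented pipeline behaviour.
REQUIRED_KEYS = ("runScript", "sample_file", "atlas", "ref",
                 "sequence", "mappingqual", "Adapter")


def check_gaia_config(cfg):
    """Return a list of problems found in a Gaia config dict (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in cfg]

    if "sequence" in cfg and cfg["sequence"] not in ("single", "paired"):
        problems.append("sequence must be 'single' or 'paired'")

    if "Adapter" in cfg and str(cfg["Adapter"]) not in ("T", "F"):
        problems.append("Adapter must be T or F")

    # Integer settings with the smallest value this sketch accepts (assumed).
    for key, minimum in (("mappingqual", 0), ("lengthFilter", 0),
                         ("qualityFilter", 0), ("threads", 1)):
        value = cfg.get(key)
        if value is not None and (not isinstance(value, int) or value < minimum):
            problems.append(f"{key} must be an integer >= {minimum}")

    # The template writes the memory setting in Java option form, e.g. -Xmx120G.
    xmx = cfg.get("Xmx")
    if xmx is not None and not re.fullmatch(r"-Xmx\d+[KkMmGg]?", str(xmx)):
        problems.append("Xmx must look like -Xmx120G")

    return problems


# The values from the template above, written as a dict.
example_cfg = {
    "runScript": "Gaia",
    "sample_file": "samples_Gaia.tsv",
    "atlas": "/path/to/your/atlas/executable/atlas/atlas",
    "ref": "/path/to/your/reference/file/reference.fa",
    "sequence": "paired",
    "mappingqual": 30,
    "CN": "Test",
    "Adapter": "T",
    "AdapterSequence1": "default",
    "AdapterSequence2": "default",
    "lengthFilter": 30,
    "qualityFilter": 0,
    "Xmx": "-Xmx120G",
    "threads": 10,
}

for problem in check_gaia_config(example_cfg):
    print(problem)
```

With the template values the loop prints nothing, because the returned list is empty.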

Samples file

The samples file should contain a tab separated table with the following columns:

  • Sample - The prefix you want your sample to have in the final output.
  • Lib - Duplicates are marked among all files of the same sample that share the same Lib identifier.
  • File - The prefix of each of your input files. There is no restriction on characters or signs.
    The suffix must follow the Illumina standard.
    For paired-end data: enter only one line per R1/R2 pair and specify the sequencing mode in your config file. R1 and R2 files must have the same prefix.
  • Path - Either the complete or a relative path to the folder containing the file. No specific folder structure is needed.

Example:

Input files:

/path/to/sample1/file1_R1_001.fastq.gz
/path/to/sample1/file2_R1_001.fastq.gz
../relative/path/sample2/file1_R1_001.fastq.gz
../relative/path/sample2/file2_R1_001.fastq.gz
/additional/path/sample2/file3_R1_001.fastq.gz

Corresponding samples file (columns are tab-separated in the actual file):

Sample   Lib   File   Path
Sample1  LibA  file1  /path/to/sample1/
Sample1  LibB  file2  /path/to/sample1/
Sample2  LibA  file1  ../relative/path/sample2/
Sample2  LibA  file2  ../relative/path/sample2/
Sample2  LibB  file3  /additional/path/sample2/
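
To illustrate how the four columns relate to the input files, here is a small sketch that reads a samples table, rebuilds the expected FASTQ names and groups files by Sample and Lib (the unit within which duplicates are marked). It is not the pipeline's own code: the helper names are hypothetical, the R1 suffix is copied from the example paths above, and the R2 suffix is assumed by analogy.

```python
import csv
import io
import os
from collections import defaultdict

R1_SUFFIX = "_R1_001.fastq.gz"  # as in the example paths above
R2_SUFFIX = "_R2_001.fastq.gz"  # assumed by analogy for paired-end data


def read_samples(handle):
    """Read a tab-separated samples table with a header line into dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def expected_fastqs(row, paired):
    """Build the FASTQ path(s) that one samples-file row refers to."""
    base = os.path.join(row["Path"], row["File"])
    files = [base + R1_SUFFIX]
    if paired:
        files.append(base + R2_SUFFIX)
    return files


def group_by_library(rows):
    """Collect file prefixes per (Sample, Lib) pair."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["Sample"], row["Lib"])].append(row["File"])
    return dict(groups)


# The example table from above, joined with real tab characters.
EXAMPLE_ROWS = [
    ("Sample", "Lib", "File", "Path"),
    ("Sample1", "LibA", "file1", "/path/to/sample1/"),
    ("Sample1", "LibB", "file2", "/path/to/sample1/"),
    ("Sample2", "LibA", "file1", "../relative/path/sample2/"),
    ("Sample2", "LibA", "file2", "../relative/path/sample2/"),
    ("Sample2", "LibB", "file3", "/additional/path/sample2/"),
]
EXAMPLE_TSV = "\n".join("\t".join(fields) for fields in EXAMPLE_ROWS) + "\n"

rows = read_samples(io.StringIO(EXAMPLE_TSV))
for row in rows:
    print(row["Sample"], row["Lib"], expected_fastqs(row, paired=True))
print(group_by_library(rows))
```

In this example, file1 and file2 of Sample2 end up in the same (Sample2, LibA) group, while file3 forms its own (Sample2, LibB) group.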

Results:

The final aligned and filtered BAM files can be found in Results/1.FASTQ/10.MkDup_per_sample/

Also have a look at your FastQC results in Results/1.FASTQ/02.fastqc/. You can open the *.html files in any web browser. Check for adapter contamination or any other potential quality problems. For details, refer to the FastQC manual.
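
As a quick way to see what a run produced, the sketch below lists the files in the two result folders named above. The function name list_outputs is hypothetical, and the assumption that the final files end in .bam is mine; only the folder names come from this page.

```python
from pathlib import Path


def list_outputs(results_dir="Results"):
    """Return (BAM files, FastQC HTML reports) found under the results folder."""
    base = Path(results_dir) / "1.FASTQ"
    # The .bam extension is an assumption; the folder names are from this page.
    bams = sorted((base / "10.MkDup_per_sample").glob("*.bam"))
    reports = sorted((base / "02.fastqc").glob("*.html"))
    return bams, reports


bams, reports = list_outputs("Results")
print(f"{len(bams)} BAM file(s), {len(reports)} FastQC report(s)")
for report in reports:
    print(report)
```

If a folder does not exist yet, the corresponding list is simply empty.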
