QUADtrim

A quality and adapter trimming program for DNA reads (FastQ format), written in C. It works on both paired-end and
single-end sequences. For paired FastQ files it expects the two-file variant and keeps the two files in step: when one
read of a pair passes and the other fails, the passing read is moved to a singleton file and the failing read to a
low quality file.

Licence

QUADtrim is released under the BSD 3-clause License (BSD New). See the COPYING
file for full licence details.

Dependencies

Required:

  • gcc (or another C compiler, which requires a change to the Makefile)
  • make (GNU version)

Optional (only required for the added features described under Install):

  • libargtable2 (alternative argument parser)
  • libz (ZLIB, for gzip file I/O)

Install

  • Download: https://bitbucket.org/arobinson/quadtrim/downloads/?tab=tags
    • Note: click the "Tags" tab to find downloads (not the Downloads tab)
  • Build: make
    • add ARGTABLE=Y to use libargtable2 instead of getopt (requires libargtable2)
    • add GZIP=Y to enable gzip file I/O (requires libz (ZLIB v1.2.6+))
      • add GZIP=OLD to enable gzip file I/O with older ZLIB library (slightly slower algorithm)
  • Install: copy the bin/quadtrim file to a directory on your PATH

Usage

 quadtrim [-CfghstZ] [-a <int>] [-A <int>] [-c <int>] [-d <str>] [-D <int>]
          [-l <int>] [-m <int>] [-M <int>] [-N <int>] [-o <int>] [-p <int>]
          [-q <int>] [-S <str>] [-T <int>] [-w <int>] [-z [0-2]] <file1> [<file2>]

 Performs various trimming and filtering on single or paired-end reads.
 For paired-end reads, if a read passes but its pair fails it is separated into
 a 'singleton' FastQ file.  Output files are given same name as input but with
 added postfix's for each output:
  -trimmed   the successful trimmed and filtered reads (x2)
  -singleton the successful reads whose pair failed*
  -discard   the sequences which failed*
  -adapter   the adapter that was trimmed*
   * filenames changed.  e.g. for read1.fq and read2.fq => readX-discard.fq


[[Common options]]
 -m <int>    Mode: a bit mask selecting required trimming modes. i.e. add numbers
             for operations you want to perform.   [default: 6]
              1 = adapter-trim
              2 = quality-trim
              4 = chastity-filter
              8 = N-base-filter
 -C          Print citation information
 -h          Print this help message
 -O <str>    Output file directory.                [default: .]
              '<SRC>' = with input files
 -t          Display only statistics in tab format
 -z <int>    Output compression                    [default:  0]
              0 = match input,
              1 = gzip,
              2 = plain
 -Z          Report zlib usage (return code: 0=yes, 10=no)
 <file1>     Read1 (or single) FastQ file
 <file2>     Optional Read2 FastQ file


[[Adapter trim options]]
 Adapter trimming is performed by global aligning paired reads to each other and
 checking for cases where the alignment over-hangs the start of the reads and
 trims appropriately

 -A <int>    Minimum score for an alignment        [default: 20]
 -c <int>    Score for a correctly aligned base    [default: 1]
 -D <int>    Write adapters found to file          [default: 0 (off)]
              0  = off, 
              1+ = minimum length of adapter to write
 -f          Fast mode. Stop at first acceptable match (rather than longest)
 -F          Maximum mismatch rate in adapter filtering [default: 0.1]
              0.1 = 1 in 10 bases wrong
 -l <int>    Minimum read length                   [default: 50]
 -M <int>    Maximum mismatches in alignment       [default: 1]
 -S <str>    Provide an adapter sequence to filter with.  Can repeat
             more than once to provide more adapters.
 -T <int>    Adapter discovery mode (pre-process)  [default: 2]
              0 = off, 
              1 = discovery, 
              2 = discovery with cache.
 -w <int>    Score for alignment mismatch          [default: -2]


[[Quality trim options]]
 Trims paired (or single) reads based on the quality score of the reads.  The
 default mode is to trim only the 3' end but the 5' end can be performed too.

 -a <int>    Minimum read average quality          [default: 20]
 -g          Remove G bases from tail of reads
 -l <int>    Minimum read length                   [default: 50]
 -o <int>    Phred score + <int> == ASCII code     [default: 33]
 -p <int>    Maximum poor quality bases included   [default:  3]
 -q <int>    Minimum base cutoff quality           [default: 15]
 -s          Trim start of sequences too


[[Chastity filter options]]
 Removes reads that contain the illuminaTM chastity filter flag


[[N base filter options]]
 Removes reads that contain too many N's (unknown bases) in the sequence

 -N <int>    Maximum N bases included              [default: 2]


[[Preset defaults]]
Alternative default settings.  The following options are available:

 -d bulls    => -q 20 -a 20 -l 50 -p 3
 -d sheep    => -q 20 -a 20 -l 50 -p 3
 -d v1       => -m 10 -q 20 -a 20 -l 50 -p 3 -O <SRC>


 ZLIB (gzip): Supported
 Arg parser:  getopt
 Version:     2.0.1
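
The -m value described under Common options is a bit mask, so the numbers for the wanted operations are added
together. A minimal sketch of how such a mask can be tested in C follows; the enum names and the mode_enabled helper
are hypothetical and are not taken from the QUADtrim source.

```c
/* Mode bits as listed for the -m option in the help text. */
enum {
    MODE_ADAPTER_TRIM    = 1,
    MODE_QUALITY_TRIM    = 2,
    MODE_CHASTITY_FILTER = 4,
    MODE_N_BASE_FILTER   = 8
};

/* Returns 1 when the operation `flag` is selected in the mask `mode`. */
static int mode_enabled(int mode, int flag)
{
    return (mode & flag) != 0;
}
```

Read this way, the default of 6 selects quality-trim and chastity-filter (2 + 4), and the -m 10 used by the v1 preset
selects quality-trim and N-base-filter (2 + 8).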

Algorithm

The algorithm used by quality-trim depends on whether or not you use the start-of-sequence trimming option (-s).

Start and end trimming

  • Checks Chastity (if required)
    • Scans seqname line for a ' '{space} then a ':'{colon}
    • Checks if next character is 'Y' (i.e. fails chastity)
  • Scans for first good base
  • Scans for X bad bases and stores the hit if it passes the checks for:
    • Average quality
    • Minimum length
  • Keeps the best (longest) hit, or writes the read to the low quality file if no hit passed
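
The chastity check in the first step above can be sketched as follows. This is only an illustration of the described
scan (space, then colon, then 'Y'); the function name fails_chastity is hypothetical and the code is not taken from the
QUADtrim source.

```c
#include <string.h>

/*
 * Returns 1 when the sequence name line carries the 'Y' flag (fails
 * chastity), 0 otherwise. Lines without a space or without a colon after
 * the space are treated as passing.
 */
static int fails_chastity(const char *seqname)
{
    const char *space = strchr(seqname, ' ');
    if (space == NULL)
        return 0;

    const char *colon = strchr(space, ':');
    if (colon == NULL)
        return 0;

    /* colon[1] is at worst the terminating NUL, so this read is safe. */
    return colon[1] == 'Y';
}
```

For an illustrative name line such as "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG" the scan stops at the
"1:Y" field and reports a failure; the same line with "1:N" passes.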

End only trimming

  • Checks Chastity (if required)
  • Scans the sequence for X bad bases and chops the read there
  • Checks average quality passes
  • Checks length remaining passes
  • If all ok then writes to pass file (otherwise to low quality file)
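
The end-only steps above can be sketched as follows. This is an illustration under stated assumptions, not the QUADtrim
implementation: X is taken to be the -p value, a base counts as bad when its Phred score is below -q, the read is
chopped just before the bad base that would exceed -p, and the struct and function names are hypothetical. The chastity
check is left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical parameter bundle mirroring the -q, -a, -l, -p and -o options. */
struct trim_params {
    int    min_base_quality;  /* -q: bases below this score count as bad */
    int    min_avg_quality;   /* -a: minimum average score of the kept bases */
    size_t min_length;        /* -l: minimum kept length */
    int    max_poor_bases;    /* -p: bad bases allowed in the kept part */
    int    phred_offset;      /* -o: ASCII code = Phred score + offset */
};

/*
 * Returns the number of leading bases to keep, or 0 when the read should go
 * to the low quality file. `qual` is the FastQ quality line without newline.
 */
static size_t trim_end_only(const char *qual, const struct trim_params *p)
{
    size_t len = strlen(qual);
    size_t keep = 0;
    long   sum = 0;
    int    poor = 0;

    for (size_t i = 0; i < len; i++) {
        int score = qual[i] - p->phred_offset;
        if (score < p->min_base_quality) {
            poor++;
            if (poor > p->max_poor_bases)
                break;  /* assumed rule: chop before the bad base over the limit */
        }
        sum += score;
        keep = i + 1;
    }

    if (keep == 0 || keep < p->min_length)
        return 0;                                   /* too short */
    if (sum < (long)p->min_avg_quality * (long)keep)
        return 0;                                   /* average quality too low */
    return keep;
}
```

For example, with -q 15 -a 20 -l 5 -p 1 -o 33 the quality string IIIII##III is cut to 6 bases under these assumptions:
the first # is tolerated and the second triggers the chop.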

Limitations

QUADtrim is currently limited to 510 bases/quality scores per read and 126 characters on the annotation line (and + line).
These limits are defined near the top of the source file (NAME_BUFFER_SIZE and SEQ_BUFFER_SIZE) and can be increased
if required. If you go beyond ~32000 on either of these you may get strange behaviour, because values wrap past the
positive range of a signed short, which is only guaranteed to be 16 bits wide (one of them being the sign bit).
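
The ~32000 ceiling follows from the C standard, which only guarantees that a signed short can hold values up to 32767.
A small guard such as the one below could be used when raising the buffer sizes; the macro and helper names are
hypothetical and not part of QUADtrim.

```c
/* Largest value a signed short is guaranteed to hold on any conforming compiler. */
#define PORTABLE_SHORT_MAX 32767L

/* Returns 1 when `size` is positive and fits in a signed short everywhere. */
static int buffer_size_is_safe(long size)
{
    return size > 0 && size <= PORTABLE_SHORT_MAX;
}
```

On a given platform the actual limit is available as SHRT_MAX from <limits.h>, which may be larger than 32767.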

Testing

QUADtrim comes with built-in unit tests. The tests require Python (2.7) and g(un)zip.

  • Execute: make test

Known issues

  • With a random sprinkle of bad bases biased towards the front, QUADtrim may reject a sequence for length even though
    a longer stretch later in the read would pass, i.e. it doesn't perform overlapping matches.