QualityTrim

A quality trimming program for DNA reads (FASTQ format), written in C. It works on both paired-end and single-end sequences. For paired-end data it expects the two-file variant and keeps the two files aligned: when one read of a pair passes and the other fails, the passing read is moved to a singleton file and the failing read to a low quality file.

Licence

quality-trim is released under the GNU Lesser General Public License v3+ (LGPLv3+). See the COPYING and COPYING.LESSER files for full licence details.

Dependencies

Required:

  • gcc (or another C compiler, but this requires a change to the Makefile)
  • make (GNU version)

Optional: extra dependencies are only required for added features (see the build options under Install)

Install

  • Download: https://bitbucket.org/arobinson/qualitytrim/downloads#tag-downloads
    • Note: click the "Tags" tab to find downloads (not the Downloads tab)
  • Build: make
    • add ARGTABLE=Y to use libargtable2 instead of getopt (requires libargtable2)
    • add GZIP=Y to enable gzip file I/O (requires libz (ZLIB v1.2.6+))
      • add GZIP=OLD to enable gzip file I/O with older ZLIB library (slightly slower algorithm)
  • Install: copy the bin/quality-trim file to a directory on your PATH

Usage

quality-trim [-hnst] [-a <int>] [-l <int>] [-N <int>] [-o <int>] [-p <int>]
[-q <int>] [-z [0-2]] <file1> [<file2>]

-a <int>    Minimum read average quality          [default: 20]
-l <int>    Minimum read length                   [default: 50]
-N <int>    Maximum N bases included              [default: -1 (Any)]
-o <int>    Phred score + <int> == ASCII code     [default: 33]
-p <int>    Maximum poor quality bases included   [default:  3]
-q <int>    Minimum quality base                  [default: 15]
-z <int>    0 = match input, 1 = gzip, 2 = plain  [default:  0]
-h          Print this help message
-n          Don't attempt Illumina Chastity check
-s          Trim start of sequences too
-t          Display only statistics in tab format
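
The -o, -q and -a options all work on Phred scores decoded from the quality line. The following is a minimal sketch of that decoding, not code taken from quality-trim; the function names phred_score and mean_quality are hypothetical and only illustrate the "Phred score + offset == ASCII code" relation stated for -o.

```c
#include <assert.h>
#include <stddef.h>

/* Phred score of one quality character: ASCII code minus the offset (-o). */
static int phred_score(char c, int offset)
{
    return (int)(unsigned char)c - offset;
}

/* Mean Phred score over the first len characters of qual (0.0 if len is 0).
 * A read would be compared against the -a threshold using a value like this. */
static double mean_quality(const char *qual, size_t len, int offset)
{
    assert(qual != NULL);
    long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += phred_score(qual[i], offset);
    return len ? (double)sum / (double)len : 0.0;
}
```

With the default offset of 33, the character 'I' (ASCII 73) decodes to a score of 40 and '5' (ASCII 53) to a score of 20, so '5' sits exactly on the default -a threshold and above the default -q threshold of 15.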

Algorithm

The algorithm used by quality-trim depends on whether or not you use the start-of-sequence trimming option (-s).

Start and end trimming

  • Checks Chastity (if required)
    • Scans seqname line for a ' '{space} then a ':'{colon}
    • Checks if next character is 'Y' (i.e. fails chastity)
  • Scans for first good base
  • Scans forward until X bad bases are found and stores the hit if it passes:
    • Average quality
    • Minimum length
  • Keeps the best (longest) hit, or writes the read to the low quality file if no hit passes
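
The scan-and-keep-longest steps above can be sketched roughly as follows. This is an illustration under assumptions, not the actual quality-trim source: the names segment and best_segment are hypothetical, X is assumed to be the -p value, and a candidate is assumed to end at the base that would exceed that count. Scanning resumes after the end of each candidate, so candidates never overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: len == 0 means no candidate passed. */
typedef struct {
    size_t start;
    size_t len;
} segment;

/* Longest non-overlapping candidate that passes the length and average checks.
 * min_q ~ -q, max_poor ~ -p, min_avg ~ -a, min_len ~ -l (assumed mapping). */
static segment best_segment(const char *qual, size_t n, int offset,
                            int min_q, int max_poor,
                            double min_avg, size_t min_len)
{
    assert(qual != NULL);
    segment best = {0, 0};
    size_t i = 0;

    while (i < n) {
        /* Skip ahead to the first good base. */
        while (i < n && (int)(unsigned char)qual[i] - offset < min_q)
            i++;
        if (i >= n)
            break;

        size_t start = i;
        int poor = 0;
        long sum = 0;

        /* Extend until one more poor base would exceed max_poor. */
        while (i < n) {
            int q = (int)(unsigned char)qual[i] - offset;
            if (q < min_q && ++poor > max_poor)
                break;
            sum += q;
            i++;
        }

        size_t len = i - start;
        if (len >= min_len && (double)sum / (double)len >= min_avg &&
            len > best.len) {
            best.start = start;
            best.len = len;
        }
        /* qual[i], if any, is poor; the skip loop above steps over it. */
    }
    return best;
}
```

A caller would write the read trimmed to [start, start + len) to the pass file when len is non-zero, and the untrimmed read to the low quality file otherwise.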

End only trimming

  • Checks Chastity (if required)
  • Scans the sequence for X bad bases and chops it there
  • Checks that the average quality passes
  • Checks that the remaining length passes
  • If all checks pass, writes the read to the pass file (otherwise to the low quality file)
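
As a rough sketch of the end only steps, together with the chastity scan described under Start and end trimming: the functions fails_chastity and trim_end below are hypothetical names, not the program's own, and the sketch assumes that X is the -p value and that the read is chopped at the base that would exceed it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Chastity scan: find a space, then a colon, then test for 'Y' (filtered). */
static bool fails_chastity(const char *name)
{
    assert(name != NULL);
    const char *space = strchr(name, ' ');
    if (space == NULL)
        return false;               /* no filter field on this name line */
    const char *colon = strchr(space, ':');
    return colon != NULL && colon[1] == 'Y';
}

/* End only trimming: returns the number of bases to keep, or 0 if the read
 * fails the length (-l) or average quality (-a) check after chopping. */
static size_t trim_end(const char *qual, size_t n, int offset,
                       int min_q, int max_poor,
                       double min_avg, size_t min_len)
{
    assert(qual != NULL);
    int poor = 0;
    long sum = 0;
    size_t keep = 0;

    /* Chop where one more poor base would exceed max_poor. */
    while (keep < n) {
        int q = (int)(unsigned char)qual[keep] - offset;
        if (q < min_q && ++poor > max_poor)
            break;
        sum += q;
        keep++;
    }

    if (keep == 0 || keep < min_len)
        return 0;
    if ((double)sum / (double)keep < min_avg)
        return 0;
    return keep;
}
```

In this sketch a read goes to the pass file only when fails_chastity returns false (or the check is disabled with -n) and trim_end returns a non-zero length.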

Limitations

quality-trim is currently limited to 510 bases/quality scores per read and 126 characters on the annotation line (and the + line). These limits are defined near the top of the source file (NAME_BUFFER_SIZE and SEQ_BUFFER_SIZE) and can be increased if required. If you go beyond ~32000 for either of them you may get strange behaviour, because values wrap past the positive range of a signed short, which is only guaranteed to be 16 bits wide (one of which is the sign bit).

Testing

quality-trim comes with built-in unit tests. The tests require Python (2.7), Biopython and g(un)zip.

  • Execute: make test
    • Note: if you build quality-trim without GZIP support, the 3 tests that cover it will fail.

Known issues

  • When bad bases are sprinkled randomly and biased towards the front of a read, quality-trim may reject the read on length even though a longer piece later in the read would pass the length check, i.e. it doesn't perform overlapping matches.