Source

galaxy-central (ngs) / tools / fastq / fastq_manipulation.xml

Full commit
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
<tool id="fastq_manipulation" name="Manipulate FASTQ" version="1.0.1">
  <options sanitize="False" /> <!-- This tool uses a file to rely all parameter information (actually a dynamically generated python module), we can safely not sanitize any parameters -->
  <description>reads on various attributes</description>
  <command interpreter="python">fastq_manipulation.py $input_file $fastq_manipulation_file $output_file $output_file.files_path '${input_file.extension[len( 'fastq' ):]}'</command>
  <inputs>
    <!-- This tool is purposely over-engineered (e.g. Single option conditionals) to allow easy enhancement with workflow/rerun compatibility -->
    <page>
      <param name="input_file" type="data" format="fastqsanger,fastqcssanger" label="FASTQ File" help="Requires groomed data: if your data does not appear here try using the FASTQ groomer."/>
      <!-- Match Reads -->
      <repeat name="match_blocks" title="Match Reads">
        <conditional name="match_type">
          <param name="match_type_selector" type="select" label="Match Reads by">
            <option value="identifier">Name/Identifier</option>
            <option value="sequence">Sequence Content</option>
            <option value="quality">Quality Score Content</option>
          </param>
          <when value="identifier">
            <conditional name="match">
              <param name="match_selector" type="select" label="Identifier Match Type">
                <option value="regex">Regular Expression</option>
              </param>
              <when value="regex">
                <param type="text" name="match_by" label="Match by" value=".*" />
              </when>
            </conditional>
          </when>
          <when value="sequence">
            <conditional name="match">
              <param name="match_selector" type="select" label="Sequence Match Type">
                <option value="regex">Regular Expression</option>
              </param>
              <when value="regex">
                <param type="text" name="match_by" label="Match by" value=".*" />
              </when>
            </conditional>
          </when>
          <when value="quality">
            <conditional name="match">
              <param name="match_selector" type="select" label="Quality Match Type">
                <option value="regex">Regular Expression</option>
              </param>
              <when value="regex">
                <param type="text" name="match_by" label="Match by" value=".*" />
              </when>
            </conditional>
          </when>
        </conditional>
      </repeat>
      <!-- Manipulate Matched Reads -->
      <repeat name="manipulate_blocks" title="Manipulate Reads">
        <conditional name="manipulation_type">
          <param name="manipulation_type_selector" type="select" label="Manipulate Reads on">
            <option value="identifier">Name/Identifier</option>
            <option value="sequence">Sequence Content</option>
            <option value="quality">Quality Score Content</option>
            <option value="miscellaneous">Miscellaneous Actions</option>
          </param>
          <when value="identifier">
            <conditional name="manipulation">
              <param name="manipulation_selector" type="select" label="Identifier Manipulation Type">
                <option value="translate">String Translate</option>
              </param>
              <when value="translate">
                <param name="from" type="text" label="From" value="" />
                <param name="to" type="text" label="To" value="" />
              </when>
            </conditional>
          </when>
          <when value="sequence">
            <conditional name="manipulation">
              <param name="manipulation_selector" type="select" label="Sequence Manipulation Type">
                <option value="rev_comp">Reverse Complement</option>
                <option value="rev_no_comp">Reverse, No Complement</option>
                <option value="no_rev_comp">Complement, No Reverse</option>
                <option value="trim">Trim</option>
                <option value="dna_to_rna">DNA to RNA</option>
                <option value="rna_to_dna">RNA to DNA</option>
                <option value="translate">String Translate</option>
                <option value="change_adapter">Change Adapter Base</option>
              </param>
              <when value="rev_comp">
                <!-- no extra settings -->
              </when>
              <when value="rev_no_comp">
                <!-- no extra settings -->
              </when>
              <when value="no_rev_comp">
                <!-- no extra settings -->
              </when>
              <when value="trim">
                <conditional name="offset_type">
                  <param name="base_offset_type" type="select" label="Define Base Offsets as" help="Use Absolute for fixed length reads (Illumina, SOLiD)&lt;br&gt;Use Percentage for variable length reads (Roche/454)">
                    <option value="offsets_absolute" selected="true">Absolute Values</option>
                    <option value="offsets_percent">Percentage of Read Length</option>
                  </param>
                  <when value="offsets_absolute">
                    <param name="left_column_offset" label="Offset from 5' end" value="0" type="integer" help="Values start at 0, increasing from the left">
                      <validator type="in_range" message="Base Offsets must be positive" min="0" max="inf"/>
                      <validator type="expression" message="An integer is required.">int( float( value ) ) == float( value )</validator>
                    </param>
                    <param name="right_column_offset" label="Offset from 3' end" value="0" type="integer" help="Values start at 0, increasing from the right">
                      <validator type="in_range" message="Base Offsets must be positive" min="0" max="inf"/>
                      <validator type="expression" message="An integer is required.">int( float( value ) ) == float( value )</validator>
                    </param>
                  </when>
                  <when value="offsets_percent">
                    <param name="left_column_offset" label="Offset from 5' end" value="0" type="float">
                      <validator type="in_range" message="Base Offsets must be between 0 and 100" min="0" max="100"/>
                    </param>
                    <param name="right_column_offset" label="Offset from 3' end" value="0" type="float">
                      <validator type="in_range" message="Base Offsets must be between 0 and 100" min="0" max="100"/>
                    </param>
                  </when>
                </conditional>
                <param name="keep_zero_length" label="Keep reads with zero length" type="boolean" truevalue="keep_zero_length" falsevalue="exclude_zero_length" selected="False"/>
              </when>
              <when value="dna_to_rna">
                <!-- no extra settings -->
              </when>
              <when value="rna_to_dna">
                <!-- no extra settings -->
              </when>
              <when value="translate">
                <param name="from" type="text" label="From" value="" />
                <param name="to" type="text" label="To" value="" />
              </when>
              <when value="change_adapter">
                <param name="new_adapter" label="New Adapter" type="text" value="G" help="An empty string will remove the adapter base" />
              </when>
            </conditional>
          </when>
          <when value="quality">
            <conditional name="manipulation">
              <param name="manipulation_selector" type="select" label="Quality Manipulation Type">
                <option value="translate">String Translate</option>
                <!-- <option value="modify_each_score">Apply Transformation to each Score</option> Not enabled yet-->
              </param>
              <when value="translate">
                <param name="from" type="text" label="From" value="" />
                <param name="to" type="text" label="To" value="" />
              </when>
              <when value="modify_each_score">
                <param name="map_score" type="text" label="Modify Score by" value="$score + 1" />
              </when>
            </conditional>
          </when>
          <when value="miscellaneous">
            <conditional name="manipulation">
              <param name="manipulation_selector" type="select" label="Miscellaneous Manipulation Type">
                <option value="remove">Remove Read</option>
              </param>
              <when value="remove">
                <!-- no extra settings -->
              </when>
            </conditional>
          </when>
        </conditional>
      </repeat>
    </page>
  </inputs>
  <configfiles>
    <configfile name="fastq_manipulation_file">##create an importable module
#import binascii
import re
import binascii
from string import maketrans
##does read match
def match_read( fastq_read ):
    #for $match_block in $match_blocks:
        #if $match_block['match_type']['match_type_selector'] == 'identifier':
    search_target = fastq_read.identifier[1:] ##don't include @
        #elif $match_block['match_type']['match_type_selector'] == 'sequence':
    search_target = fastq_read.sequence
        #elif $match_block['match_type']['match_type_selector'] == 'quality':
    search_target = fastq_read.quality
        #else:
        #continue
        #end if
    if not re.search( binascii.unhexlify( "${ binascii.hexlify( str( match_block['match_type']['match']['match_by'] ) ) }" ), search_target  ):
        return False
    #end for
    return True
##modify matched reads
def manipulate_read( fastq_read ):
    new_read = fastq_read.clone()
    #for $manipulate_block in $manipulate_blocks:
        #if $manipulate_block['manipulation_type']['manipulation_type_selector'] == 'identifier':
            #if $manipulate_block['manipulation_type']['manipulation']['manipulation_selector'] == 'translate':
    new_read.identifier = "@%s" % new_read.identifier[1:].translate( maketrans( binascii.unhexlify( "${ binascii.hexlify( str( manipulate_block['manipulation_type']['manipulation']['from'] ) ) }" ), binascii.unhexlify( "${ binascii.hexlify( str( manipulate_block['manipulation_type']['manipulation']['to'] ) ) }" ) ) )
            #end if
        #elif $manipulate_block['manipulation_type']['manipulation_type_selector'] == 'sequence':
            #if $manipulate_block['manipulation_type']['manipulation']['manipulation_selector'] == 'translate':
    new_read.sequence = new_read.sequence.translate( maketrans( binascii.unhexlify( "${ binascii.hexlify( str( manipulate_block['manipulation_type']['manipulation']['from'] ) ) }" ), binascii.unhexlify( "${ binascii.hexlify( str( manipulate_block['manipulation_type']['manipulation']['to'] ) ) }" ) ) )
            #elif $manipulate_block['manipulation_type']['manipulation']['manipulation_selector'] == 'rev_comp':
    new_read = new_read.reverse_complement()
            #elif $manipulate_block['manipulation_type']['manipulation']['manipulation_selector'] == 'rev_no_comp':
    new_read = new_read.reverse()
            #elif $manipulate_block['manipulation_type']['manipulation']['manipulation_selector'] == 'no_rev_comp':
    new_read = new_read.complement()
            #elif $manipulate_block['manipulation_type']['manipulation']['manipulation_selector'] == 'trim':
                #if $manipulate_block['manipulation_type']['manipulation']['offset_type']['base_offset_type'] == 'offsets_percent':
    left_column_offset = int( round( float( ${ manipulate_block['manipulation_type']['manipulation']['offset_type']['left_column_offset'] } ) / 100.0 * float( len( new_read ) ) ) )
    right_column_offset = int( round( float( ${ manipulate_block['manipulation_type']['manipulation']['offset_type']['right_column_offset'] } ) / 100.0 * float( len( new_read ) ) ) )
                #else
    left_column_offset = ${ manipulate_block['manipulation_type']['manipulation']['offset_type']['left_column_offset'] }
    right_column_offset = ${ manipulate_block['manipulation_type']['manipulation']['offset_type']['right_column_offset'] }
                #end if
    if right_column_offset > 0:
        right_column_offset = -right_column_offset
    else:
        right_column_offset = None
    new_read = new_read.slice( left_column_offset, right_column_offset )
    if not ( ${str( manipulate_block['manipulation_type']['manipulation']['keep_zero_length'] ) == 'keep_zero_length'} or len( new_read ) ):
        return None
            #elif $manipulate_block['manipulation_type']['manipulation']['manipulation_selector'] == 'dna_to_rna':
    new_read = new_read.sequence_as_DNA()
            #elif $manipulate_block['manipulation_type']['manipulation']['manipulation_selector'] == 'rna_to_dna':
    new_read = new_read.sequence_as_RNA()
            #elif $manipulate_block['manipulation_type']['manipulation']['manipulation_selector'] == 'change_adapter':
    if new_read.sequence_space == 'color':
        new_read = new_read.change_adapter( binascii.unhexlify( "${ binascii.hexlify( str( manipulate_block['manipulation_type']['manipulation']['new_adapter'] ) ) }" ) )
            #end if
        #elif $manipulate_block['manipulation_type']['manipulation_type_selector'] == 'quality':
            #if $manipulate_block['manipulation_type']['manipulation']['manipulation_selector'] == 'translate':
    new_read.quality = new_read.quality.translate( maketrans( binascii.unhexlify( "${ binascii.hexlify( str( manipulate_block['manipulation_type']['manipulation']['from'] ) ) }" ), binascii.unhexlify( "${ binascii.hexlify( str( manipulate_block['manipulation_type']['manipulation']['to'] ) ) }" ) ) )
            #elif $manipulate_block['manipulation_type']['manipulation']['manipulation_selector'] == 'map_score':
    def score_method( score ):
        raise Exception, "Unimplemented" ##This option is not yet available, need to abstract out e.g. column adding tool action: preventing users from using 'harmful' actions
        new_read.quality_map( score_method )
            #end if
        #elif $manipulate_block['manipulation_type']['manipulation_type_selector'] == 'miscellaneous':
            #if $manipulate_block['manipulation_type']['manipulation']['manipulation_selector'] == 'remove':
    return None
            #end if
        #else:
        #continue
        #end if
    #end for
    if new_read.description != "+":
        new_read.description = "+%s" % new_read.identifier[1:] ##ensure description is still valid
    return new_read
def match_and_manipulate_read( fastq_read ):
    new_read = fastq_read
    if match_read( fastq_read ):
        new_read = manipulate_read( fastq_read )
    return new_read
</configfile>
  </configfiles>
  <outputs>
    <data format="input" name="output_file" />
  </outputs>
  <tests>
    <!-- match all and do nothing -->
    <test>
      <param name="input_file" value="sanger_full_range_original_sanger.fastqsanger" ftype="fastqsanger" />
      <param name="match_type_selector" value="identifier" />
      <param name="match_selector" value="regex" />
      <param name="match_by" value=".*" />
      <param name="manipulation_type_selector" value="identifier" />
      <param name="manipulation_selector" value="translate" />
      <param name="from" value="" />
      <param name="to" value="" />
      <output name="output_file" file="sanger_full_range_original_sanger.fastqsanger" />
    </test>
    <!-- match None and do nothing -->
    <test>
      <param name="input_file" value="sanger_full_range_original_sanger.fastqsanger" ftype="fastqsanger" />
      <param name="match_type_selector" value="identifier" />
      <param name="match_selector" value="regex" />
      <param name="match_by" value="STRINGDOESNOTEXIST" />
      <param name="manipulation_type_selector" value="identifier" />
      <param name="manipulation_selector" value="translate" />
      <param name="from" value="" />
      <param name="to" value="" />
      <output name="output_file" file="sanger_full_range_original_sanger.fastqsanger" />
    </test>
    <!-- match all and remove -->
    <test>
      <param name="input_file" value="sanger_full_range_original_sanger.fastqsanger" ftype="fastqsanger" />
      <param name="match_type_selector" value="identifier" />
      <param name="match_selector" value="regex" />
      <param name="match_by" value=".*" />
      <param name="manipulation_type_selector" value="miscellaneous" />
      <param name="manipulation_selector" value="remove" />
      <output name="output_file" file="empty_file.dat" />
    </test>
    <!-- match None and remove -->
    <test>
      <param name="input_file" value="sanger_full_range_original_sanger.fastqsanger" ftype="fastqsanger" />
      <param name="match_type_selector" value="identifier" />
      <param name="match_selector" value="regex" />
      <param name="match_by" value="STRINGDOESNOTEXIST" />
      <param name="manipulation_type_selector" value="miscellaneous" />
      <param name="manipulation_selector" value="remove" />
      <output name="output_file" file="sanger_full_range_original_sanger.fastqsanger" />
    </test>
    <!-- match all and trim to 4 inner-most bases -->
    <test>
      <param name="input_file" value="sanger_full_range_original_sanger.fastqsanger" ftype="fastqsanger" />
      <param name="match_type_selector" value="identifier" />
      <param name="match_selector" value="regex" />
      <param name="match_by" value=".*" />
      <param name="manipulation_type_selector" value="sequence" />
      <param name="manipulation_selector" value="trim" />
      <param name="base_offset_type" value="offsets_absolute"/>
      <param name="left_column_offset" value="45"/>
      <param name="right_column_offset" value="45"/>
      <param name="keep_zero_length" value="true" />
      <output name="output_file" file="fastq_trimmer_out1.fastqsanger" />
    </test>
    <test>
      <param name="input_file" value="sanger_full_range_original_sanger.fastqsanger" ftype="fastqsanger" />
      <param name="match_type_selector" value="identifier" />
      <param name="match_selector" value="regex" />
      <param name="match_by" value=".*" />
      <param name="manipulation_type_selector" value="sequence" />
      <param name="manipulation_selector" value="trim" />
      <param name="base_offset_type" value="offsets_percent"/>
      <param name="left_column_offset" value="47.87"/>
      <param name="right_column_offset" value="47.87"/>
      <param name="keep_zero_length" value="true" />
      <output name="output_file" file="fastq_trimmer_out1.fastqsanger" />
    </test>
    <!-- match all and rev comp -->
    <test>
      <param name="input_file" value="sanger_full_range_original_sanger.fastqsanger" ftype="fastqsanger" />
      <param name="match_type_selector" value="identifier" />
      <param name="match_selector" value="regex" />
      <param name="match_by" value=".*" />
      <param name="manipulation_type_selector" value="sequence" />
      <param name="manipulation_selector" value="rev_comp" />
      <output name="output_file" file="sanger_full_range_rev_comp.fastqsanger" />
    </test>
    <!-- match all and rev comp, with ambiguous DNA -->
    <test>
      <param name="input_file" value="misc_dna_original_sanger.fastqsanger" ftype="fastqsanger" />
      <param name="match_type_selector" value="identifier" />
      <param name="match_selector" value="regex" />
      <param name="match_by" value=".*" />
      <param name="manipulation_type_selector" value="sequence" />
      <param name="manipulation_selector" value="rev_comp" />
      <output name="output_file" file="misc_dna_as_sanger_rev_comp_1.fastqsanger" />
    </test>
    <!-- match all and rev comp, with ambiguous RNA -->
    <test>
      <param name="input_file" value="misc_rna_original_sanger.fastqsanger" ftype="fastqsanger" />
      <param name="match_type_selector" value="identifier" />
      <param name="match_selector" value="regex" />
      <param name="match_by" value=".*" />
      <param name="manipulation_type_selector" value="sequence" />
      <param name="manipulation_selector" value="rev_comp" />
      <output name="output_file" file="misc_rna_as_sanger_rev_comp_1.fastqsanger" />
    </test>
    <!-- match first seq and rev comp -->
    <test>
      <param name="input_file" value="sanger_full_range_original_sanger.fastqsanger" ftype="fastqsanger" />
      <param name="match_type_selector" value="identifier" />
      <param name="match_selector" value="regex" />
      <param name="match_by" value="FAKE0001" />
      <param name="manipulation_type_selector" value="sequence" />
      <param name="manipulation_selector" value="rev_comp" />
      <output name="output_file" file="sanger_full_range_rev_comp_1_seq.fastqsanger" />
    </test>
    <!-- match first seq and rev comp: i.e. undo above -->
    <test>
      <param name="input_file" value="sanger_full_range_rev_comp_1_seq.fastqsanger" ftype="fastqsanger" />
      <param name="match_type_selector" value="identifier" />
      <param name="match_selector" value="regex" />
      <param name="match_by" value="FAKE0001" />
      <param name="manipulation_type_selector" value="sequence" />
      <param name="manipulation_selector" value="rev_comp" />
      <output name="output_file" file="sanger_full_range_original_sanger.fastqsanger" />
    </test>
    <!-- match all and DNA to RNA -->
    <test>
      <param name="input_file" value="sanger_full_range_original_sanger.fastqsanger" ftype="fastqsanger" />
      <param name="match_type_selector" value="identifier" />
      <param name="match_selector" value="regex" />
      <param name="match_by" value=".*" />
      <param name="manipulation_type_selector" value="sequence" />
      <param name="manipulation_selector" value="dna_to_rna" />
      <output name="output_file" file="sanger_full_range_as_rna.fastqsanger" />
    </test>
    <!-- match all and RNA to DNA -->
    <test>
      <param name="input_file" value="sanger_full_range_as_rna.fastqsanger" ftype="fastqsanger" />
      <param name="match_type_selector" value="identifier" />
      <param name="match_selector" value="regex" />
      <param name="match_by" value=".*" />
      <param name="manipulation_type_selector" value="sequence" />
      <param name="manipulation_selector" value="rna_to_dna" />
      <output name="output_file" file="sanger_full_range_original_sanger.fastqsanger" />
    </test>
  </tests>
<help>
This tool allows you to build complex manipulations to be applied to each matching read in a FASTQ file. A read must match all matching directives in order for it to be manipulated; if a read does not match, it is output in a non-modified manner. All reads matching will have each of the specified manipulations performed upon them, in the order specified.

Regular Expression Matches are made using re.search, see http://docs.python.org/library/re.html for more information.
  All matching is performed on a single line string, regardless if e.g. the sequence or quality score spans multiple lines in the original file.

String translations are performed using string.translate, see http://docs.python.org/library/string.html#string.translate and http://docs.python.org/library/string.html#string.maketrans for more information.

.. class:: warningmark

Only color space reads can have adapter bases substituted.


-----

**Example**

Suppose you have a color space sanger formatted sequence (fastqcssanger) and you want to double-encode the color space into psuedo-nucleotide space (this is different from converting) to allow these reads to be used in tools which do not natively support it (using specially designed indexes). This tool can handle this manipulation, however, this is generally not recommended as results tend to be poorer than those produced from tools which are specially designed to handle color space data.

Steps:

1. Click **Add new Match Reads** and leave the matching options set to the default (Matching by sequence name/identifier using the regular expression "\*."; thereby matching all reads). 
2. Click **Add new Manipulate Reads**, change **Manipulate Reads on** to "Sequence Content", set **Sequence Manipulation Type** to "Change Adapter Base" and set **New Adapter** to "" (an empty text field). 
3. Click **Add new Manipulate Reads**, change **Manipulate Reads on** to "Sequence Content", set **Sequence Manipulation Type** to "String Translate" and set **From** to "0123." and **To** to "ACGTN".
4. Click Execute. The new history item will contained double-encoded psuedo-nucleotide space reads.

------

**Citation**

If you use this tool, please cite `Blankenberg D, Gordon A, Von Kuster G, Coraor N, Taylor J, Nekrutenko A; Galaxy Team. Manipulation of FASTQ data with Galaxy. Bioinformatics. 2010 Jul 15;26(14):1783-5. &lt;http://www.ncbi.nlm.nih.gov/pubmed/20562416&gt;`_


</help>
</tool>