JGI_iTagger / scripts / itagger.pl
#!/usr/bin/env perl

=pod

=head1 NAME

itagger.pl

=head1 PURPOSE

Pipeline for amplicon analysis (e.g. 16S rRNA, fungal ITS)

=head1 SYNOPSIS

    itagger.pl --config <path> --libs <path> --dir <path>

=head1 OPTIONS

=over 5

=item --config C<path>

Config file in Ini format; see below.

=item --libs C<path>

Tab-delimited table of library name and location of its Fastq file.  Each Fastq file should contain all and only the reads associated with one library (i.e. one barcode of one lane of one run).  See the example after this list.

=item --dir C<path>

Base folder for output.

=item --group C<string>

Upon successful completion, set group for all folders, files; optional.

=item --tarball C<path>

Compress output; does not delete uncompressed output dir.  Optional.

=item --log C<path>

Log config file.  Optional.

=back
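
For example, a small C<--libs> file might look like the one below.  The library names and Fastq paths are hypothetical, and E<lt>TABE<gt> denotes a literal tab character separating the two columns:

    soil_A<TAB>/path/to/soil_A.fastq.gz
    soil_B<TAB>/path/to/soil_B.fastq.gz
    water_A<TAB>/path/to/water_A.fastq.gz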

=head1 CONFIGURATION FILE

All paths and parameters are given in the configuration file, in Ini format.  Its sections are described below and are formatted as follows:

    [AMPLICON]
    NAME=16S_V4
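
A fuller skeleton is shown below.  All values are illustrative placeholders only (not recommendations), and the section and key names follow the documentation below; the script itself requires at least the AMPLICON, AMPQC, USEARCH, and RDP_CLASSIFIER sections to be present.

    [AMPLICON]
    NAME=16S_V4
    LEN_MEAN=253
    LEN_STDEV=10

    [DUK]
    CONTAM_DB=/path/to/contaminants.fasta

    [CUTADAPT]
    PRIMER1=FORWARD_PRIMER_SEQUENCE
    PRIMER2=REVERSE_PRIMER_SEQUENCE

    [FLASH]
    MIN_OVERLAP=10
    MAX_MISMATCH=0.3

    [AMPQC]
    EXP_ERR=1.0

    [USEARCH]
    MIN_LIB_SIZE=1000
    MAX_RADIUS=3
    REF_DB=/path/to/reference_amplicons.fasta

    [OTU_FILTER]
    MIN_NORM_SIZE=10
    MIN_NUM_LIBS=2

    [RDP_CLASSIFIER]
    TRAINING_FILE=/path/to/rdp_training.properties
    LEVEL=class
    CUTOFF=0.5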

=head2 AMPLICON

Information about the amplicon being sequenced.

=over 5

=item NAME = C<string>

Name of the amplicon; examples: 16S_V4, ITS.  Required.

=item LEN_MEAN = C<int>

Average length of the amplicon sequence, as determined from the reference database (i.e. base pairs between the primers).  Required.

=item LEN_STDEV = C<int>

Standard deviation in amplicon length.  Required.

=item TRIM5 = C<int>

The number of bp from the 5' end of the amplicon to exclude from clustering and classification.  That is, bases immediately following the forward primer.  Allows exclusion of moderately conserved sequences which are problematic for classification.  For example, used for ITS but not 16S analysis.  Optional.

=item TRIM3 = C<int>

The number of bp from the 3' end of the amplicon to trim (i.e. bases following reverse primer).  Optional.

=back

=head2 DUK

We currently use C<duk> to remove contaminant reads; namely, sequences associated with sequencing library construction (control and adapter sequences).

=over 5

=item CONTAM_DB = C<path>

Multiple reference files in Fasta format are supported. Typically these would be QC sequences (e.g. PhiX) and adapter sequences.  Providing these as separate files will result in their numbers being reported separately. Optional.

=item KMER = C<int>

Kmer size for the duk search.  A higher number increases specificity at the expense of sensitivity.  Optional; default=21.

=item STEP = C<int>

Step size for the search.  A lower number increases sensitivity but is slower.  Optional; default=1.

=item CUTOFF = C<float>

Cutoff score for detection.  Optional; default=1.

=back

=head2 CUTADAPT

C<cutadapt> is used to remove primer sequences.  For paired reads, if the primer is not found in both forward and reverse reads, the read-pair is filtered.

=over 5

=item PRIMER1 = C<string>

Forward (read1) primer; may contain ambiguous nucleotide characters.  Must be in same strand as read1.  Required.

=item PRIMER2 = C<string>

Reverse (read2) primer; may contain ambiguous nucleotide characters.  Must be in same strand as read2.  Required.

=item ERROR_RATE = C<float>

Maximum fraction of errors allowed in detected primer sequence.  Increasing this number will allow more matches, but may increase false positives.  As we expect and require finding primer matches, a high error rate is recommended.  Optional; default=0.4.

=item MIN_OVERLAP = C<int>

The minimum length of primer that must be found in the read to be considered a good hit.  Optional; default=10.

=back

=head2 Merging Paired Reads

It is assumed the amplicon primers were designed such that the read-pairs will overlap, to provide complete coverage of the amplicon sequence.  Paired-end reads are merged (assembled) into a single sequence using either C<Flash> or C<Pandaseq>.  Pairs which are not combined are discarded.

=head3 FLASH

=over 5

=item MIN_OVERLAP = C<int>

Minimum bp overlap required to merge paired reads.  Optional; default=10.

=item MAX_MISMATCH = C<float>

Maximum fraction of errors allowed in the overlap region.  Optional; default=0.3.

=back

=head3 PANDASEQ

=over 5

=item THRESHOLD = C<float>

Matching threshold (0-1); optional (default=0.5).

=item MIN_OVERLAP = C<int>

Minimum overlap in base pairs; optional (default=10).

=back

=head2 AMPQC

Quality filtering of amplicons prior to clustering and classification.

=over 5

=item EXP_ERR = C<float>

Filter amplicons whose total expected number of errors exceeds this threshold; see the sketch following this list.  Optional; default=1.0.

=back
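
For reference, the expected-error statistic used by this kind of filter is the sum of per-base error probabilities derived from the Phred quality scores.  Below is a minimal sketch in Perl, assuming Phred+33-encoded qualities; the pipeline's own implementation may differ (e.g. in how it normalizes per 100 bp, as described in the METHODS text):

    # Expected number of errors in one read, from its FASTQ quality string.
    sub expected_errors {
        my ($qual) = @_;
        my $ee = 0;
        for my $char ( split //, $qual ) {
            my $q = ord($char) - 33;      # Phred+33 quality value
            $ee += 10 ** ( -$q / 10 );    # Phred score -> error probability
        }
        return $ee;
    }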

=head2 USEARCH

All libraries' sequences are dereplicated and sorted by decreasing total normalized size before being clustered using C<USEARCH>.  We cluster iteratively, starting from a low centroid radius.  The clustering step is partially parallelized for speed.

=over 5

=item MIN_LIB_SIZE = C<int>

Discard libraries with fewer than this number of sequences.  Optional; default=1000.

=item MAX_RADIUS = C<int>

Sequences less than this distance (percent sequence mismatch) from the centroid are added to the cluster.  Optional; default=3.

=item REF_DB = C<path>

Reference database of known amplicon sequences, used for chimera checking. The database should be of high quality and not include chimeric sequences.  It is recommended that the reference sequences are trimmed to the amplicon sequence (i.e. electronic PCR) and dereplicated.  They may also be clustered to reduce the database size.  Required.

=back

=head2 OTU_FILTER

=over 5

=item MIN_NORM_SIZE = C<int>

An OTU must have C<MIN_NUM_LIBS> different libraries with at least C<MIN_NORM_SIZE> total normalized reads, otherwise the OTU is filtered.  Optional; default=10.

=item MIN_NUM_LIBS = C<int>

See C<MIN_NORM_SIZE>.  Optional; default=2.

=back

=head2 RDP_CLASSIFIER

Parameters for RDP Classifier and for filtering clusters from further analysis.  For example, to discard chloroplast sequences from 16S analysis.

=over 5

=item TRAINING_FILE = C<path>

Amplicon-specific RDP Classifier training file.  Required.

=item LEVEL = C<string>

An OTU must be classified at this level with a confidence above the cutoff (below) in order to pass.  Optional; default="class".

=item CUTOFF = C<float>

Filter OTUs which do not meet the minimum confidence at the specified taxonomic level (above).  Optional; default=0.5.

=item FILTER = C<string>

Filter out OTUs whose classification does not match the search string.  The string follows the pattern X__NAME, where X indicates the taxonomic level (k=kingdom, p=phylum, c=class, o=order, f=family, g=genus).  Multiple search strings should be separated by pipe ("|") characters with no spaces.  Use quotation marks if your search string contains any spaces.  Examples: k__bacteria|k__archaea, k__fungi.  Case insensitive.  Optional.  An illustrative matching sketch follows this list.

=back
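
As an illustration only, a filter string of this form could be applied to a classification lineage roughly as in the hedged sketch below (variable names and the lineage string are hypothetical; the pipeline's own matching logic may differ):

    my $filter  = 'k__bacteria|k__archaea';
    my $lineage = 'k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria';
    my @terms   = split /\|/, $filter;
    # true if any term occurs (case-insensitively) in the lineage string
    my $keep    = grep { index( lc $lineage, lc $_ ) >= 0 } @terms;
    print $keep ? "keep\n" : "discard\n";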


=head2 QIIME

Analysis of OTU tables using C<QIIME>.

=over 5

=item RAREFY_MIN = C<float>

Cutoff for single-rarefaction.  Optional; by default, 10% of the normalized mean is used.

=back

=head2 DIVERSITY I<(beta-testing at this time)>

Diversity plots and estimates using C<R> scripts.  Run only if C<FACTORS> is defined.

=over 5

=item FACTORS = C<path>

Table of libraries X factors.  Library IDs must exactly match those used in the OTU table (i.e. the C<DATA> section above).  A hypothetical example follows this list.

=back
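
A hypothetical C<FACTORS> table might look like the following (tab-separated, with library IDs in the first column; the exact layout expected by the R scripts is an assumption here, and the names are placeholders):

    Library    Treatment     Site
    soil_A     control       forest
    soil_B     fertilized    forest
    water_A    control       lake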

=head1 AUTHORS

Julien Tremblay (julien.tremblay@mail.mcgill.ca),
Edward Kirton (ESKirton@LBL.gov)

=head1 COPYRIGHT

Copyright (c) 2013 United States Department of Energy Joint Genome Institute.  Use freely under the same license as Perl itself.  Refer to wrapped tools for their own licenses and copyright information.

=cut

use strict;
use warnings;
use Env qw/TMPDIR RDP_JAR_PATH ITAGGER_LOG_CONFIG/;
use Getopt::Long;
use Pod::Usage;
use File::Spec::Functions qw(rel2abs);
use Config::Tiny;
use FindBin qw($Bin);
use Parallel::ForkManager;
use Log::Log4perl;
use iTagger::ReadQC qw(libQC);
use iTagger::ReadQC::Meta qw(summary);
use iTagger::ClusterOTUs qw(clusterOTUs);
use iTagger::QIIME qw(qiimeReport);

our $VERSION = "1.2";

my ($help, $man, $configFile, $libsFile, $dir, $group, $tarball);
my $threads = `nproc`;
chomp $threads;
my $logConfigFile = $ITAGGER_LOG_CONFIG;
GetOptions(
    'config=s' => \$configFile,
    'libs=s' => \$libsFile,
    'dir=s' => \$dir,
    'group=s' => \$group,
    'threads=i' => \$threads,
    'tarball=s' => \$tarball,
    'log=s' => \$logConfigFile,
    'help' => \$help,
    'man' => \$man ) or pod2usage(2);
pod2usage(1) if $help;
pod2usage( -exitstatus => 0, -verbose => 2 ) if $man;
die("--threads required\n") unless $threads;

# LOGGER
die("--log config file and \$ITAGGER_LOG_CONFIG not defined\n") unless $logConfigFile;
Log::Log4perl->init($logConfigFile);
my $logger = Log::Log4perl->get_logger('iTagger');
$logger->info("Begin iTagger version $VERSION with $threads threads");

# ENV
$logger->logdie("\$TMPDIR not defined") unless defined($TMPDIR);
$logger->logdie("\$RDP_JAR_PATH not defined") unless defined($RDP_JAR_PATH);

# OUTPUT DIR
$logger->logdie("--dir path required") unless $dir;
$dir = rel2abs($dir);
-d $dir or mkdir($dir) or $logger->logdie("Unable to mkdir $dir: $!");

# CONFIG FILE
$logger->logdie("--config file required") unless $configFile;
$logger->logdie("--config file not found: $configFile") unless -f $configFile;
$configFile = rel2abs($configFile);
my $config = Config::Tiny->read($configFile);
foreach (qw/AMPLICON AMPQC USEARCH RDP_CLASSIFIER/)
{
    $logger->logdie("Config file missing section: $_") unless exists($config->{$_});
}
$config->{_}->{VERSION} = $VERSION;
$config->write("$dir/config.ini");

# LIBS
$logger->logdie("--libs file required") unless $libsFile;
$logger->logdie("--libs file not found: $libsFile") unless -f $libsFile;
$libsFile = rel2abs($libsFile);
my %libs;
open( my $in, '<', $libsFile) or $logger->logdie($!);
while (<$in>)
{
    chomp;
    my @row = split(/\t/);
    $logger->logdie("Invalid libs file") unless @row == 2;
    my ($name, $file) = @row;
    $name =~ s/\W/_/g;
    $logger->logdie("Duplicate lib, $name") if exists($libs{$name});
    $libs{$name} = $file;
}
close($in);

# READ QC
my @libs = sort keys %libs;
my $pm = Parallel::ForkManager->new($threads);
-d "$dir/data" or mkdir("$dir/data") or $logger->logdie($!);
my $numLibs = scalar(@libs);
my $qc = 0;
for ( my $i=0; $i<=$#libs; $i++ )
{
    my $name = $libs[$i];
    next if -e "$dir/data/$name/readQC.log";
    my $file = $libs{$name};
    unless ( -e $file )
    {
        $logger->warn("Lib $name file not found: $file");
        next;
    }
    ++$qc;
    $pm->start and next;
    my $qcObj = iTagger::ReadQC->new($logConfigFile, $configFile, "$dir/data", $file, $name);
    $qcObj->libQC();
    $pm->finish;
}
$pm->wait_all_children;
my $report = iTagger::ReadQC::Meta->new($logConfigFile, $configFile, "$dir/data");
$report->summary();

# CLUSTERING
my $otuFile = "$dir/otu/envcom.pass.tsv";
if ( $qc or ! -e $otuFile )
{
    clusterOTUs($logConfigFile, $configFile, $dir, $threads);
}

# QIIME
qiimeReport($logConfigFile, $configFile, $dir);

# README
open(my $out, '>', "$dir/README.txt") or die($!);
while (<DATA>) { print $out $_ }
close(DATA);
print $out 'duk: ', `duk | grep Version | tail -1 | cut -f 1 -d ':'`;
print $out 'cutadapt: ', `cutadapt --version`;
print $out `flash --version | head -1`;
print $out `pandaseq -v 2>&1`;
print $out `usearch --version`;
print $out "RDP Classifier: $RDP_JAR_PATH\n";
print $out 'QIIME: ', `which alpha_diversity.py`;
close($out);

# FIX PERMS
if ( $group )
{
    system("chgrp -R $group $dir");
    system("chmod -R a+rX $dir");
    system("chmod -R ug+rwX $dir");
}

# TARBALL
if ( $tarball )
{
    system("tar -cvzf $tarball $dir");
}
exit;
__DATA__
iTagger v1.1 METHODS

The iTagger amplicon analysis pipeline uses several publicly available tools to analyze amplicon libraries, such as 16S rRNA or fungal ITS variable regions for phylogenetic analysis.  All libraries to be compared should be identically constructed, sequenced, and analyzed.

I. INPUT:

(1) Configuration file in INI format with parameters and paths of reference databases.
    OUTPUT: The config file is copied to <OUTDIR>/config.ini
(2) Libraries tabular file, giving each library/condition name and the path to its Fastq file.

II. READ QC:

Each library's Fastq file is processed as described below and the results are saved in the data/ folder, which also includes a readQC.log file reporting the number of read-pairs remaining at each step and the percentage of pairs which pass each step (i.e. percentages are per-step, not of total input).
    OUTPUT: Saved in <OUTDIR>/data/<LIB_NAME>/
            Summarized in readQC.log

(1) CONTAM FILTER: Filter one or more contaminants using Duk (e.g. PhiX control, sequencing library adapter dimers, human contaminants, etc.).  For paired reads, the entire pair is filtered if one end-read has a high-scoring hit.
    OUTPUT: duk.log
(2) PRIMER TRIM: PCR primers (of the conserved region) are trimmed away.  For paired-end reads, both forward and reverse primers must be found, otherwise the entire pair is filtered.
(3) HARD TRIM: For particular libraries (e.g. fungal ITS), it is useful to trim a predefined number of bases from the 5' and 3' ends of the sequence.  For fungal ITS, we observed better RDP Classifier results after trimming conserved regions.  We do not recommend hard trimming for 16S sequences.
(4) ITERATIVE PAIR MERGING: Reads are trimmed as a pair, removing the last base from whichever end has the highest expected error in a window 5bp wide.  Reads are trimmed from mean + 3 standard deviations to mean - 3 standard deviations in 0.5 standard deviation steps.  After each trimming step, pairs are merged into single sequences with either Flash or Pandaseq.  Pairs which are not merged continue to the next round of trimming.  Paired reads which are not combined are discarded.
    OUTPUT: Not-combined reads, nc.fastq.bz2
(5) EXPECTED-ERROR FILTER: Merged reads are filtered if they have an expected number of errors which exceeds the threshold.  The config file indicates the maximum number of expected errors per 100bp. Note that Flash and Pandaseq produce different quality scores in the overlap-assembled regions.
    OUTPUT: Filtered extended reads, ex.fastq.bz2
            Quality report, qualStat.pdf, qualStat.tsv
(6) DEREPLICATE: Count the number of times each sequence is observed and output in tabular (seq-obs) format, ordered by sequence.
    OUTPUT: seqobs.tsv.bz2

III. CLUSTERING:

USEARCH is used for clustering, although there is a provision for iterative clustering which can (a) provide faster processing and (b) allow processing of larger files than can normally be handled (particularly with the 32-bit version).  RDP Classifier is used for taxonomic classification of the resultant cluster centroid sequences, and its accuracy depends upon providing a well-curated RDP reference database.
    OUTPUT: Saved in <OUTDIR>/otu
            Summarized in cluster.log

(1) MERGE LIBRARIES: The seq-obs files for all libraries are merged, dereplicated, and sorted by decreasing abundance.  Low-abundance sequences are separated and excluded from clustering, step 2 (although they will be mapped and counted in step 3).
(2) ITERATIVE CLUSTER OTUS: Refer to the USEARCH documentation for the algorithm description.  Our use of USEARCH differs slightly from that described in the USEARCH documentation in that we iterate between single-threaded clustering and multi-threaded searching in order to reduce run-time. We also use .obs files for tracking cluster members, so a final mapping and counting step (as described in the USEARCH docs) is unnecessary.  Clustering is done iteratively starting at 99% identity, and decreasing by 1% identity until the level described in the config file is reached (e.g. 97% for 16S, 95% for Fungal ITS).
(3) MAP LOW-ABUNDANCE SEQUENCES: Rare sequences, which cannot form their own clusters, are mapped to the cluster centroid sequences and counted.
(4) REFERENCE DB CHIMERA FILTER: Centroid sequences are compared to the reference database and likely chimeric sequences are discarded using UCHIME.
    OUTPUT: Final cluster centroids, otu.fasta.bz2
(5) CLASSIFICATION: Assign taxonomic classification to each cluster using RDP Classifier.  The config file indicates a taxonomic level (e.g. family) and confidence level (e.g. 0.5) which is used to decide which classifications are useful.  Clusters which can be acceptably classified are output to otu.tax.tsv, while the others are written to otu.unk.tsv.
    OUTPUT: RDP output, rdp.tsv
            Classified OTUs, otu.tax.tsv
            Unclassified OTUs, otu.unk.tsv
(6) TAX FILTER: Clusters with classifications which do not match those indicated in the config file are discarded and the desired clusters are written to the otu.tax.filtered.tsv file.
    OUTPUT: Final OTU table, otu.tax.filtered.tsv

IV. TAXONOMIC ANALYSIS:

QIIME is used to manipulate the final OTUs file.  A few graphs are generated plus some rarefied tables which may be useful for subsequent analysis.

(1) GENERATE BIOM: The BIOM JSON file is generated from the OTU tabular file.
    OUTPUT: otu.biom
(2) ABUNDANCE THRESHOLD: Filter OTUs at assorted abundance levels and calculate alpha diversities.
    OUTPUT: Several files in <OUTDIR>/abundance_thresholds/
(3) SINGLE RAREFACTION: This is done at 1000 reads and at a level calculated from the trimmed mean and standard deviation of the library sizes (10% trimmed; the level is mean - i * stdev, where i is the largest value from 2 down to 0.5, in steps of 0.5, for which the result is above zero).  For example, with a trimmed mean of 5000 and standard deviation of 2200, i=2 gives a rarefaction level of 600.
    OUTPUT: <OUTDIR>/otu/rarefied.1000.biom, rarefied.1000.filtered.biom
            <OUTDIR>/otu/rarefied.<X>.biom, rarefied.<X>.filtered.biom
(4) SUMMARIZE TAXONOMY: Summarize taxonomy with both relative and absolute abundances.
    OUTPUT: <OUTDIR>/tax_mapping/relative/
            <OUTDIR>/tax_mapping/absolute/
(5) PLOT RANK-ABUNDANCE: Generate PDF rank-abundance graph of all samples.
    OUTPUT: <OUTDIR>/otu/log_rank_abundance.pdf
(6) PLOT TAXA SUMMARY: Make a taxa plot of absolute abundance.
    OUTPUT: <OUTDIR>/tax_mapping/plots/
(7) PHYLUM BARPLOT: Generate a phylum-level barplot using absolute abundance for a quick overview of the data.
    OUTPUT: <OUTDIR>/tax_mapping/taxonomy_phylum_L2.tab

* * *

iTagger was written by Julien Tremblay (julien.tremblay@mail.mcgill.ca) and Edward Kirton (ESKirton@LBL.gov) and is Copyright (c) 2013 by the US DOE Joint Genome Institute but is freely available for use without any warranty under the same license as Perl itself.  v1.1 was released 12/12/2013.  Refer to wrapped tools for their author and license information.

* * *

External executable versions: