Pushed to berkeleylab/jgi_itagger
9e0b863 fix low-qual concatenated reads output file name
iTagger

iTagger is the production pipeline used for processing Illumina amplicon libraries at the US DOE Joint Genome Institute. This pipeline uses usearch and qiime to analyze amplicon libraries, such as 16S rRNA or fungal ITS variable regions, for phylogenetic analysis. All samples to be compared should be identically constructed, sequenced, and analyzed. This software wraps popular tools to facilitate analysis of large numbers of samples and provides some minor enhancements for managing many samples and their counts and for incremental clustering. Authors should credit the creators of usearch, mafft, and qiime.

READ QC:

Overlapping read pairs are merged into unpaired consensus sequences (by usearch's merge_pairs and the READ_QC/MERGE_MAX_DIFF_PCT parameter); unmerged reads are discarded. The PCR primers must be found in the correct orientation and within the expected spacing (by usearch's search_oligodb and the parameters AMPLICON/LEN_MEAN, AMPLICON/LEN_STDEV, READ_QC/PRIMERS, READ_QC/PRIMER_TRIM_MAX_DIFFS, and READ_QC/LEN_FILTER_MAX_DIFFS); otherwise the read is discarded. The read quality scores are evaluated and reads with too many expected errors are discarded (READ_QC/MAX_EXP_ERR_RATE). Lastly, identical sequences are dereplicated, counted, sorted alphabetically, and written to a .seqobs file for subsequent merging with other such files. Additionally, a log file is produced which records the ID of each filtered read and the reason it was excluded.

CLUSTERING:

Samples containing fewer than CLUSTERING/SAMPLE_MIN_SIZE sequences are discarded. All remaining samples' .seqobs files are combined (i.e. identical sequences are dereplicated), and the sequences are sorted by decreasing abundance. Sequences are then separated depending on whether they have at least CLUSTERING/CENTROID_MIN_SIZE copies; the low-abundance sequences are set aside and not used during clustering.
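The sample filtering, pooling, and abundance split described above can be illustrated with a minimal Python sketch. This is not the pipeline's actual Perl implementation. The function name pool_and_split and the in-memory dict layout (sample ID -> sequence -> count) are hypothetical stand-ins for the .seqobs files, whose on-disk format is not described here. The sketch also assumes that a sample's size is its total read count, that both thresholds are inclusive minimums, and that abundance ties are broken alphabetically.

```python
from collections import Counter


def pool_and_split(samples, sample_min_size, centroid_min_size):
    """Pool per-sample sequence counts and split them by abundance.

    samples: dict of sample ID -> dict of sequence -> count
             (hypothetical stand-in for the per-sample .seqobs files).
    Returns (kept_sample_ids, clusterable, low_abundance), where the last
    two are lists of (sequence, total_count) sorted by decreasing abundance.
    """
    # Discard samples with too few sequences (assumed: total read count).
    kept = {
        sample_id: obs
        for sample_id, obs in samples.items()
        if sum(obs.values()) >= sample_min_size
    }

    # Combine the remaining samples; identical sequences are dereplicated
    # by summing their counts across samples.
    totals = Counter()
    for obs in kept.values():
        totals.update(obs)

    # Sort by decreasing abundance (ties broken alphabetically: assumption).
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))

    # Separate clusterable sequences from low-abundance ones.
    clusterable = [(seq, n) for seq, n in ranked if n >= centroid_min_size]
    low_abundance = [(seq, n) for seq, n in ranked if n < centroid_min_size]
    return sorted(kept), clusterable, low_abundance


example = {
    "A": {"ACGT": 5, "TTTT": 1},
    "B": {"ACGT": 3, "GGGG": 2},
    "C": {"CCCC": 1},
}
kept_ids, clusterable, low_abundance = pool_and_split(
    example, sample_min_size=3, centroid_min_size=2
)
```

In this toy input, sample C falls below the minimum sample size and is dropped before pooling, so its sequence never reaches either output list.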
The sequences meeting this threshold are saved in a .fasta file and a matching .obs table which records the sequences per sample; the low-abundance sequences are saved in a .fasta file with the sequences per sample recorded in the sequence headers. The clusterable sequences are incrementally clustered by usearch's cluster_otus, starting at 99% identity and increasing the radius by 1% each iteration (i.e. lowering the identity threshold by 1%) until reaching CLUSTERING/OTU_CLUSTERING_PCT_IDENTITY. The sequences are re-sorted by decreasing abundance between each step. After clustering, the low-abundance sequences are mapped to the cluster centroids (by usearch's usearch_global) and are added to the OTUs' counts if they are within the prescribed percent-identity threshold; otherwise they are discarded. This step creates no new clusters. Refer to the USEARCH documentation (http://drive5.com) for a description of the usearch clustering algorithm.

CLASSIFICATION:

Cluster centroid sequences are evaluated with usearch's utax and the specified reference database (which may have been filtered). The resultant taxonomic predictions are filtered; if an OTU does not have any taxonomic classifications at the CLASSIFICATION/CUTOFF threshold, it is written to the otu.unknown.fasta and .obs files. Additionally, if the optional CLASSIFICATION/CONTAM regex is provided, any OTU with a matching taxonomic classification is filtered to the otu.contam.fasta and .obs files; this is generally used for removing chloroplast sequences from rhizosphere samples. The accepted OTUs are found in the otu.fasta and .obs files.

MULTIPLE SEQUENCE ALIGNMENT AND PHYLOGENETIC TREE:

OTUs are aligned using MAFFT (with parameters --maxiterate 1000 --globalpair) and a tree is constructed using QIIME's make_phylogeny.py, which produces a Newick file. It is left to the end user to generate graphical representations from this file.

DIVERSITY ANALYSES:

An OTU table, BIOM file, and QIIME-format mapping file are generated, which may be used to analyze the results using QIIME tools.
QIIME's core_diversity_analyses.py pipeline is run, using the DIVERSITY/SAMPLING_DEPTH parameter, which generates several files under the core_diversity_analyses folder. Refer to the QIIME documentation for a description (http://qiime.org).

AUTHORS:

iTagger was originally written by Julien Tremblay (firstname.lastname@example.org) and later developed by Edward Kirton (ESKirton@LBL.gov), to whom correspondence should be addressed. The work conducted by the U.S. Department of Energy Joint Genome Institute is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

AVAILABILITY:

The code is freely available at Bitbucket (https://bitbucket.org/berkeleylab/jgi_itagger).

SUMMARY OF OUTPUT:

FILES GENERATED BY itaggerReadQc.pl:

    ROOT/reads/*.seqobs   = sorted sequence and observations (counts)
    ROOT/reads/*.filtered = log of filtered reads

FILES GENERATED BY itaggerClusterOtus.pl:

    ROOT/otu/align.txt                = UTAX classification alignments
    ROOT/otu/tax.tsv                  = UTAX classification table
    ROOT/otu/otu.fasta                = final OTU centroid sequences
    ROOT/otu/otu.fasta.obs            = final OTU centroid observations per sample table
    ROOT/otu/contam.otu.fasta         = filtered contaminant centroid sequences
    ROOT/otu/contam.otu.fasta.obs     = filtered contaminant centroid obs table
    ROOT/otu/unk.otu.fasta            = unclassified OTU centroid sequences
    ROOT/otu/unk.otu.fasta.obs        = unclassified OTU obs table
    ROOT/otu/log.txt                  = summary report
    ROOT/otu/mapping.txt              = QIIME mapping file
    ROOT/otu/otu.tsv                  = OTU table of abundances, including taxonomy
    ROOT/otu/otu.biom                 = OTU abundance+tax table in BIOM format
    ROOT/otu/msa.fasta                = multiple sequence alignment of final centroids
    ROOT/otu/otu.tre                  = phylogenetic tree of final centroids
    ROOT/otu/core_diversity_analyses/ = folder containing QIIME core diversity analyses output

INSTALLATION

To install this module, run the following commands:

    perl Makefile.PL
    make
    make test
    make install

SUPPORT AND DOCUMENTATION

After installing, you can find documentation for this module with the perldoc command.

    perldoc iTagger

You can also look for information at:

    https://bitbucket.org/berkeleylab/jgi_itagger

LICENSE AND COPYRIGHT

This software is Copyright (c) 2013 by the US DOE Joint Genome Institute but is freely available for use without any warranty under the same license as Perl itself. Refer to wrapped tools for their credits and license information.

This license does not grant you the right to use any trademark, service mark, tradename, or logo of the Copyright Holder.

This license includes the non-exclusive, worldwide, free-of-charge patent license to make, have made, use, offer to sell, sell, import and otherwise transfer the Package with respect to any patent claims licensable by the Copyright Holder that are necessarily infringed by the Package. If you institute patent litigation (including a cross-claim or counterclaim) against any party alleging that the Package constitutes direct or contributory patent infringement, then this Artistic License to you shall terminate on the date that such litigation is filed.

Disclaimer of Warranty: THE PACKAGE IS PROVIDED BY THE COPYRIGHT HOLDER AND CONTRIBUTORS "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES. THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT ARE DISCLAIMED TO THE EXTENT PERMITTED BY YOUR LOCAL LAW. UNLESS REQUIRED BY LAW, NO COPYRIGHT HOLDER OR CONTRIBUTOR WILL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING IN ANY WAY OUT OF THE USE OF THE PACKAGE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.