Source

galaxy-central / tools / fastx_toolkit / fasta_clipping_histogram.xml

<tool id="cshl_fasta_clipping_histogram" name="Length Distribution">
	<description>chart</description>
	<requirements><requirement type="package">fastx_toolkit</requirement></requirements>
	<command>fasta_clipping_histogram.pl $input $outfile</command>
	
	<inputs>
		<param format="fasta" name="input" type="data" label="Library to analyze" />
	</inputs>

	<outputs>
		<data format="png" name="outfile" metadata_source="input" />
	</outputs>
<help>

**What it does**

This tool creates a histogram image of sequence lengths distribution in a given fasta dataset file.

**TIP:** Use this tool after clipping your library (with **FASTX Clipper tool**), to visualize the clipping results.

-----

**Output Examples**

In the following library, most sequences are 24-mers to 27-mers. 
This could indicate an abundance of endo-siRNAs (depending of course of what you've tried to sequence in the first place).

.. image:: ./static/fastx_icons/fasta_clipping_histogram_1.png


In the following library, most sequences are 19,22 or 23-mers. 
This could indicate an abundance of miRNAs (depending of course of what you've tried to sequence in the first place).

.. image:: ./static/fastx_icons/fasta_clipping_histogram_2.png


-----


**Input Formats**

This tool accepts short-reads FASTA files. The reads don't have to be short, but they do have to be on a single line, like so::

   >sequence1
   AGTAGTAGGTGATGTAGAGAGAGAGAGAGTAG
   >sequence2
   GTGTGTGTGGGAAGTTGACACAGTA
   >sequence3
   CCTTGAGATTAACGCTAATCAAGTAAAC


If the sequences span over multiple lines::

   >sequence1
   CAGCATCTACATAATATGATCGCTATTAAACTTAAATCTCCTTGACGGAG
   TCTTCGGTCATAACACAAACCCAGACCTACGTATATGACAAAGCTAATAG
   aactggtctttacctTTAAGTTG

Use the **FASTA Width Formatter** tool to re-format the FASTA into a single-lined sequences::

   >sequence1
   CAGCATCTACATAATATGATCGCTATTAAACTTAAATCTCCTTGACGGAGTCTTCGGTCATAACACAAACCCAGACCTACGTATATGACAAAGCTAATAGaactggtctttacctTTAAGTTG


-----



**Multiplicity counts (a.k.a reads-count)**

If the sequence identifier (the text after the '>') contains a dash and a number, it is treated as a multiplicity count value (i.e. how many times that individual sequence repeated in the original FASTA file, before collapsing).

Example 1 - The following FASTA file *does not* have multiplicity counts::

    >seq1
    GGATCC
    >seq2
    GGTCATGGGTTTAAA
    >seq3
    GGGATATATCCCCACACACACACAC

Each sequence is counts as one, to produce the following chart:

.. image:: ./static/fastx_icons/fasta_clipping_histogram_3.png


Example 2 - The following FASTA file have multiplicity counts::

    >seq1-2
    GGATCC
    >seq2-10
    GGTCATGGGTTTAAA
    >seq3-3
    GGGATATATCCCCACACACACACAC

The first sequence counts as 2, the second as 10, the third as 3, to produce the following chart:

.. image:: ./static/fastx_icons/fasta_clipping_histogram_4.png

Use the **FASTA Collapser** tool to create FASTA files with multiplicity counts.

------

This tool is based on `FASTX-toolkit`__ by Assaf Gordon.

 .. __: http://hannonlab.cshl.edu/fastx_toolkit/
 
</help>
</tool>
<!-- FASTA-Clipping-Histogram is part of the FASTX-toolkit, by A.Gordon (gordon@cshl.edu) -->