
piRNA pipeline

Purpose

These scripts were written for semi-automated analyses of small RNA-Seq data, with a focus on piRNA analyses. Installing them requires some minimal familiarity with command-line tools on a UNIX-like system such as GNU/Linux; using them requires a bit less. The scripts have many limitations, but if you know how to edit bash and/or python scripts, you may want to modify them so that the pipeline better suits your needs.

Mapping

The main pipeline maps the reads on a genome (that of D. melanogaster as of January 2015) complemented by canonical transposable element sequences, using bowtie2 [1].

Annotation

The reads are then assigned to categories, mainly based on information obtained from FlyBase, plus some information from Brennecke et al. 2007 (for piRNA clusters) and Czech et al. 2008 (for ovary siRNA clusters), and the list of canonical transposable elements curated by Casey Bergman.

Some reads are identified as possible piRNAs, based on having a size of 23 to 30 nucleotides and mapping to particular regions of the genome (annotated piRNA clusters, ovary siRNA clusters, transposable elements and 3'UTRs of coding genes).

Further processing

These piRNA candidates are then mapped stringently and exhaustively on the genome (not complemented by canonical transposable element sequences) in order to identify and count the unique mappers of the main piRNA clusters (clusters 1 (42AB), 2 (20A), 6 (80E-F), 8 (20A-B) and 17 by default). This can be used for normalization purposes when comparing results from several libraries.

The piRNA candidates are also mapped more loosely and less exhaustively on canonical transposable elements, piRNA clusters and coding gene 3'UTRs. This results in sequence files and data tables that can be used for further analyses (size histograms, pileups, scatter plots, ping-pong analysis).

The main pipeline results are summarized in a table giving read counts by annotation category and the paths to files containing the classified reads (when such a file was written to disk).

Installing

Note that you will need a working internet connection and a fair amount of disk space, as genomic data will be downloaded from the internet.

Directories

The installation is made locally in the user directory. Things will be installed in the following somewhat standard directories:

~/bin
~/include
~/lib
~/lib64
~/share
~/src

Other directories will be created and populated:

~/Genomes
~/ncbi-outdir
~/.local

You need to create the ~/src directory (if it doesn't exist yet), where you will clone the repository of the piRNA pipeline:

mkdir -p ~/src

Getting the piRNA pipeline code and data

Then go into it and clone the repository (git must already be installed, see below):

cd ~/src
git clone https://blaiseli@bitbucket.org/blaiseli/pirna-pipeline.git

This will create a ~/src/pirna-pipeline directory containing the installation and analysis scripts, as well as some data used to build the annotated Drosophila melanogaster genome and a modified copy of the code necessary to build homerTools.

If git is missing, it can be installed on a Debian-based distribution by issuing the command:

sudo apt-get install git

Configuring the $PATH

You should add the newly cloned directory to your $PATH variable, so as to be able to execute the scripts. The installation scripts will later install some programs "locally", that is, in your user home directory. For these programs to be accessible to the piRNA pipeline, some other directories also have to be added to the $PATH variable: ~/bin and ~/.local/bin.

The $PATH setting is typically done by editing some configuration file and sourcing it. If you are using the bash shell (default in most GNU/Linux distributions), having the following lines in your ~/.bashrc file should work:

PATH=${HOME}/src/pirna-pipeline:"$PATH"

# set PATH so it includes user's private bin if it exists
if [ -d ~/bin ] ; then
    PATH=~/bin:"${PATH}"
fi
if [ -d ~/.local/bin ] ; then
    PATH=~/.local/bin:"${PATH}"
fi

export GENOMES="${HOME}/Genomes"

The last line defines an environment variable [2] telling the scripts where to install and/or find genomic data for mapping and annotation. You may edit this if you want to install that data somewhere else, for instance for disk space management purposes on a multi-partitioned system. This should work, provided you have the necessary permissions to create and modify the content of the directory defined by the $GENOMES environment variable.
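For instance, to store the genomic data on a larger partition, you would replace the last line with something like the following (the path /data/Genomes is purely illustrative):

export GENOMES="/data/Genomes"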

To have these $PATH and $GENOMES changes take effect, either log out and log in again, or issue the following command:

source ~/.bashrc

From now on, in the current terminal and any new terminal, the commands necessary to install the piRNA pipeline should be available on the command line.

Installing dependencies

First, install the programs on which the piRNA pipeline depends by issuing the command:

get_dependencies.sh

This will install some programs but not necessarily all of their dependencies. Some other programs may need to be installed before launching the above installation script.

In particular, you will need gcc and make, since some programs will be compiled from source (in an attempt to obtain faster executables than if these programs were installed via the standard packaging system).

You will also need the python and ncurses development libraries, as well as the setuptools, biopython, scipy, matplotlib and networkx python modules. On Debian-based distributions, these will likely install with:

sudo apt-get install make
sudo apt-get install gcc
sudo apt-get install libncurses5-dev libncursesw5-dev
sudo apt-get install python-setuptools
sudo apt-get install python-dev
sudo apt-get install python-biopython
sudo apt-get install python-scipy
sudo apt-get install python-matplotlib
sudo apt-get install python-networkx

Note that python itself (version 2.7 [3]) is also needed, but it is likely already installed (and will likely be pulled in automatically as a dependency of some of the above python packages), as well as perl and some other rather standard utilities.

Diagnostics about raw data are generated using fastqc. On Debian-based distributions, this will likely install with:

sudo apt-get install fastqc

Optional dependencies

Some of the scripts that generate scatter plots comparing read abundances between pairs of libraries use the pgf and preview LaTeX packages, as well as other useful things contained in the texlive-latex-extra set of LaTeX packages. On Debian-based distributions, these will likely install with:

sudo apt-get install pgf
sudo apt-get install preview-latex-style
sudo apt-get install texlive-latex-extra

Note that a LaTeX distribution is also necessary, but will be pulled in as a dependency of the above packages.

Some of the scripts that produce graphical output use R. On Debian-based distributions, this will likely install with:

sudo apt-get install r-base

If you are working on a computing server, it is useful to have a way to start your analyses and disconnect without stopping them. This can be done using programs such as screen, tmux or byobu (which actually uses either one of the former two programs). Therefore, although not a dependency of the piRNA pipeline, I recommend also having one of these programs installed. On Debian-based distributions, byobu will likely install with:

sudo apt-get install byobu

Installing the data

To install the genome and annotation files, run the following script:

get_genome.sh

This currently only installs the data for Drosophila melanogaster version 5.

Some modifications to the script will be needed if you want version 6, and the ovary siRNA cluster and piRNA cluster information in ~/src/pirna-pipeline/custom_annotations will have to be updated too, due to changes in genomic coordinates.

Adaptation for other genomes will require heavier modifications, including in the mapping scripts (it may be that these scripts will also have to be adapted for D. melanogaster version 6).

Hopefully the piRNA pipeline is now installed. Next, you will have to prepare and organise your small RNA-Seq data.

Preparing RNA-Seq data

Note that the commented script examples.sh contains example commands for data preparation and pipeline usage that will perform analyses in a piRNA_analyses sub-directory of the source directory cloned from git.

Raw data organization

As the analyses will generate some directories (and fill them with files), I recommend that you start with an empty directory, to avoid any possible interference with already existing directories of the same name. Several analyses can take place from within the same directory as long as the names used for the libraries are different. If your system's capacity allows it, you can even run several analyses at the same time.

Say you name that directory piRNA_analyses (you may choose whatever name you want, as long as it follows the general rules for naming a directory on a UNIX-like system like GNU/Linux). Go into that directory with the appropriate cd command. Further example commands will assume you are typing them from your piRNA_analyses directory.

You need to create a sub-directory piRNA_analyses/raw_data where your libraries will reside. For each library, create a sub-directory. For instance:

mkdir -p raw_data/my_control
mkdir raw_data/my_mutant

The -p option for the first library is there to ensure that the raw_data directory will be created if it doesn't exist yet.

Avoid spaces and special characters for your library names and try to be consistent in your naming schemes.

Each library sub-directory should contain the fastq file with your small RNA-Seq reads. This file should be named my_control.fastq for the data in raw_data/my_control, my_mutant.fastq for the data in raw_data/my_mutant, and so on. Note that a symbolic link should work, if you want to avoid duplicating raw data stored elsewhere on your system and/or under another file name. As of this writing (13/01/2014), the pipeline doesn't yet accept compressed raw data files.
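For instance, a symbolic link can be created as follows (the source path is purely illustrative):

ln -s /path/to/original/data.fastq raw_data/my_control/my_control.fastq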

Raw data installation script

For your convenience, a script is provided to help you install raw data: install_raw_data.sh. This script should be executed from the main directory in which your analyses will take place (piRNA_analyses in our example). It takes as first argument the name you want to use for your library and, as second argument, the full path to the raw data file in fastq or compressed fastq.gz or fastq.bz2 format, or an already downloaded SRA archive, or an SRR identifier. The script takes care of creating the corresponding sub-directory inside raw_data. Different behaviours can occur, depending on the format guessed for the original data:

  1. If the original data file seems not to be compressed (.fq or .fastq extension), a symbolic link to the original data is created in the sub-directory. In this case, be warned that the beginning of the analyses may be slow or even fail if the original data resides on a network-mounted disk.

  2. If the original data file seems to be compressed (.gz or .bz2 extension), an uncompressed version is created in the sub-directory.

  3. If the original data seems to be an SRA archive (.sra extension), the sequences of length at least 10 will be extracted in fastq format using fastq-dump.

  4. If an SRR identifier is provided (starting with "SRR"), the script will use fastq-dump to try to download the data from the Sequence Read Archive, and will extract the sequences of length at least 10, as above.

Say you have extracted a .fastq file from an SRA archive into the raw_data directory, named SRR298536.fastq. To install it under the name my_control, you would run the data-installation script as follows:

install_raw_data.sh my_control raw_data/SRR298536.fastq
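Alternatively, you could let the script download and extract the same data directly from the Sequence Read Archive by providing the SRR identifier instead of a file path (this assumes a working internet connection and that fastq-dump is available):

install_raw_data.sh my_control SRR298536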

Two example files are provided in the piRNA_analyses sub-directory of the archive downloaded using git.

They consist of the first ten thousand reads of two libraries obtained from the NCBI GEO repository:

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM744629
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM744630

Further data and results organization

The analyses will create a directory named after each of your libraries to store most of the results. Continuing with our above examples, these would be my_control and my_mutant. Sub-directories will be created within these directories. You will usually find tables containing various sorts of read counts in sub-directories whose names start with histos, and files containing mapped reads in sub-directories whose names start with reads.

Log files containing the output of the programs run during the pipeline execution will be placed in a logs directory, and summary tables will go into a summaries directory. These two directories will be created if necessary, directly within your analyses base directory (e.g. piRNA_analyses), not in the libraries' own directories.

Main pipeline

Before starting the main pipeline on a library, you should know which 3' adaptor sequence has to be removed from the raw data and how many barcoding nucleotides need to be trimmed. That many nucleotides will be trimmed at both ends of the reads after adaptor trimming.

To start the main pipeline, type the following command (from your main analyses directory, e.g. piRNA_analyses):

do_pipeline_small.sh <library_name> <adapter_sequence> <barcode_length>

Substitute the parameters with the relevant information (without the angle brackets). If the sequences are already trimmed, enter '""' for the adapter sequence and '0' for the barcode length.

You may prefix the command with time in order to obtain the running time at the end. The analysis of a large dataset (millions of reads) can take several hours or days, depending on the characteristics of your computer.
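For example, with the my_control library, a hypothetical 3' adapter sequence and 4 barcoding nucleotides (both trimming values are purely illustrative; use the ones matching your sequencing protocol):

time do_pipeline_small.sh my_control TGGAATTCTCGGGTGCCAAGG 4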

Modifying analysis parameters

Some analysis parameters can be modified by setting environment variables [2].

If you want to use a different maximum read size than the default value (30), you can set the MAX_LEN environment variable. This can be done on the command line, before the script and its arguments:

MAX_LEN="40" do_pipeline_small.sh <library_name> <adapter_sequence> <barcode_length>

Warning: This is an experimental feature. Among the possible risks, it may happen that the longer reads do not map as well as the shorter ones due to inappropriate settings for bowtie2.

Warning: If your goal is to do a comparison with previous results, do not forget to rename the directory containing them or move it somewhere else, because the existing files will likely interfere with the new analyses. For a similar reason, you cannot run the same command at the same time with different values for the environment variables.

Library diagnostics

The main pipeline will use the fastqc program to generate a report containing various diagnostics about the library. The report will be located in a fastqc_report sub-directory within the directory containing the raw data. In our examples, the report files would be:

raw_data/my_control/fastqc_report/my_control_fastqc.html
raw_data/my_mutant/fastqc_report/my_mutant_fastqc.html

The report can be opened with an HTML viewer, such as a web browser. See the fastqc documentation for explanations of the diagnostics.

General mapping results

You can find a summary of the annotation results in the summaries directory. In our examples, the files would be:

summaries/my_control_on_D_melanogaster_results.xls
summaries/my_mutant_on_D_melanogaster_results.xls

These files are tab-separated tables that spreadsheet programs such as Gnumeric or Excel should be able to open. The first column is the annotation category, the second the number of reads assigned to that category, the third the path to the file containing those reads (if it exists), and the fourth a comment clarifying what the category corresponds to.

You can easily extract read counts from the table at the command line using awk:

awk '$1=="piRNA_candidates"{print $2}' summaries/my_control_on_D_melanogaster_results.xls
awk '$1=="piRNA_candidates"{print $2}' summaries/my_mutant_on_D_melanogaster_results.xls

The above commands will give you the second column of the lines whose first column is piRNA_candidates in the two summary files respectively, that is, the number of piRNA candidates identified in the two libraries.

piRNA candidates further mapping results

The main pipeline maps the piRNA candidates on canonical transposable element sequences in order to evaluate the contribution of each transposable element to the piRNA population.

Apart from the summaries directory, the main pipeline therefore also creates a directory for the analysis of each library, containing the following sub-directories:

histos_piRNA_candidate_on_TE
reads_piRNA_candidate_on_TE
mapped_piRNA_candidate_on_TE

  1. The first one contains tables that can be opened with a spreadsheet program. There is one file for each transposable element, containing counts of piRNA candidates having mapped on the transposable element, with detail by read size, mapping direction (forward or reverse) and number of mismatches. There is also a file named <library_name>_piRNA_candidate_on_TE_ref_counts.txt (substitute your library name to get the actual file name). This file contains a global summary, with one line per transposable element, without distinguishing reads by size.

  2. The second directory contains files in which the reads have been written. There is one file in fastq format (possibly compressed) for each transposable element.

  3. The last directory contains a compressed version of the raw mapping results obtained by bowtie2.

Similar directories exist for the mapping of piRNA candidates on 3'UTRs or piRNA clusters.

The three mapping types (on transposable elements, piRNA clusters and 3'UTRs) are also done with other read categories, and the corresponding sub-directories are present within the directory for your library. The read categories are the following:

  • pi-si-TE-3UTR: Small RNAs mapping on piRNA clusters, ovary siRNA clusters, transposable elements or 3'UTR of coding genes (and not in higher priority categories such as rRNAs or miRNAs).

  • piRNA_candidate: Subset of the above, of size between 23 and 30 nucleotides (included).

  • piRNA_unique: Reads of size between 23 and 30 nucleotides (included) having perfectly matched at one site only. Note that they may have mapped on the same references as the previous category.

  • siRNA_candidate: Another subset of pi-si-TE-3UTR, of size 21. Note that these do not necessarily map on ovary siRNA clusters.

Some scripts are available to produce graphical representations of these mapping results.

Graphical representations

Some scripts are provided to generate graphical representations of the results of your analyses. These scripts are not automatically run as part of the pipeline. They often use the results for several libraries.

To use these scripts, you need to call them on the command line, with their arguments. The arguments and their order are detailed below for each script.

Apart from the command-line arguments given to a script, the aspect of the graphics can be further modified using environment variables [2]. You can set these variables on the command line, separated by spaces, before the script and its arguments.

Here is a dummy example of script call with environment variables and arguments:

VAR1="No" VAR2="1000" script.sh arg1 arg2 arg3

Size histograms

The script do_plot_histos.sh can be used to generate read size histograms for various read categories having mapped on various references.

This script takes the following arguments (in that order):

  1. Read category (see earlier), to be chosen among:

    • pi-si-TE-3UTR
    • piRNA_candidate
    • piRNA_unique
    • siRNA_candidate
  2. Reference category, to indicate the type of reference on which the reads have been mapped. To be chosen among:

    • TE: Canonical transposable elements.
    • pi_clusters: piRNA clusters from Brennecke et al. 2007.
    • 3UTR: 3'UTRs of coding genes.
  3. Normalization type, to determine how the read counts will be normalized. To be chosen among:

    • mappers: Small RNAs having mapped on the genome or on canonical transposable elements.
    • mi: Small RNAs having mapped on miRNAs.
    • mt: Reads having mapped on mitochondrial genome.
    • endosi: Small RNAs having mapped on ovary siRNA clusters from Czech et al. 2008.
    • n: Unique mappers of piRNA cluster number n from Brennecke et al. 2007.
  4. Type of mismatches representation, to determine whether or not to use colour intensities to visualize the quantities of reads from different mismatch categories. To be chosen among:

    • mm: Paler colours will be used for lower numbers of mismatches.
    • nomm: The same colours will be used whatever the number of mismatches of the reads.
  5. Reference choice, given in one of the following ways:

    • The name of one of the references contained in the category chosen with argument 2.
    • A text file containing a list of such references (one per line).
    • tot: To make a histogram of reads mapping on all references of the category chosen with argument 2.
  6. Name of a library.

  7. Optionally, other library names. If more than one library is chosen, the histograms for the same reference will share the same vertical scale between libraries.

The script has to be run from the same directory as the one from which you started the analyses (piRNA_analyses in our example).
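For instance, a call plotting mapper-normalized size histograms of piRNA candidates over all canonical transposable elements, with mismatch colouring, for our two example libraries, could look as follows:

do_plot_histos.sh piRNA_candidate TE mappers mm tot my_control my_mutant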

The figures are produced in pdf format and saved in a sub-directory whose name is based on the first three arguments. For instance, when plotting size histograms for piRNA candidates mapping on transposable elements and normalizing by the number of mappers, the directory will be named histos_piRNA_candidate_on_TE_norm_mappers.

Within that directory, for each reference chosen using the fifth argument, there is a pdf file for each library.

The histogram bars are actually stacks of bars representing (normalized) counts of reads with increasing numbers of mismatches as the bars are further away from the horizontal axis (up for forward mapping reads and down for reverse mapping reads). The colour intensity of the bar grows with the number of mismatches of the reads it represents.

Environment variables for do_plot_histos.sh, with example settings

  • Normalize by a certain number of normalizers instead of using the raw number of normalizers:
BY="1000000"
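For example, to express the counts per million mappers in the histograms of the previous example:

BY="1000000" do_plot_histos.sh piRNA_candidate TE mappers mm tot my_control my_mutant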

Pileups

The script pileup_on_refs.sh can be used to generate pileup figures for various read categories having mapped on various references.

This script is used in a way rather similar to do_plot_histos.sh. It takes the following arguments (in that order):

  1. Read category (see earlier), to be chosen among:

    • pi-si-TE-3UTR
    • piRNA_candidate
    • piRNA_unique
    • siRNA_candidate
  2. Reference category, to indicate the type of reference on which the reads have been mapped. To be chosen among:

    • TE: Canonical transposable elements.
    • pi_clusters: piRNA clusters from Brennecke et al. 2007.
    • 3UTR: 3'UTRs of coding genes.
  3. Normalization type, to determine how the read counts will be normalized. To be chosen among:

    • mappers: Small RNAs having mapped on the genome or on canonical transposable elements.
    • mi: Small RNAs having mapped on miRNAs.
    • mt: Reads having mapped on mitochondrial genome.
    • endosi: Small RNAs having mapped on ovary siRNA clusters from Czech et al. 2008.
    • n: Unique mappers of piRNA cluster number n from Brennecke et al. 2007.
  4. Type of mismatches representation, to determine whether or not to use colour intensities to visualize the quantities of reads from different mismatch categories. To be chosen among:

    • mm: Paler colours will be used for lower numbers of mismatches.
    • nomm: The same colours will be used whatever the number of mismatches of the reads.
  5. Reference choice, given in one of the following ways (no tot option here):

    • The name of one of the references contained in the category chosen with argument 2.
    • A text file containing a list of such references (one per line).
  6. Name of a library.

  7. Optionally, other library names. If more than one library is chosen, the pileups for the same reference will share the same vertical scale between libraries.

The script has to be run from the same directory as the one from which you started the analyses (piRNA_analyses in our example).
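For instance, a call plotting mapper-normalized pileups of piRNA candidates on the mdg1 transposable element (an element used as an example later in this documentation) for our two example libraries could look as follows:

pileup_on_refs.sh piRNA_candidate TE mappers nomm mdg1 my_control my_mutant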

The figures are produced in pdf format and saved in a sub-directory whose name is based on the first three arguments. For instance, when plotting pileups for piRNA candidates mapping on transposable elements and normalizing by the number of mappers, the directory will be named pileups_piRNA_candidate_on_TE_norm_mappers.

Within that directory, for each reference chosen using the fifth argument, there is a pdf file for each library.

Environment variables for pileup_on_refs.sh, with example settings

  • Use another output directory than the default:
OUT_DIR="pileups_test"

Note: Do not use the "~" symbol in environment variables. Use ${HOME} instead.

  • Normalize by a certain number of normalizers instead of using the raw number of normalizers:
BY="1000000"
  • Ignore forward mapping reads:
NOF="Yes"
  • Ignore reverse mapping reads:
NOR="Yes"

(Actually, for both of these environment variables, the reads will be ignored whatever value you use for NOF or NOR, so do not set NOF="No" if you do not want forward mapping reads to be ignored.)

Scatter plots

The script do_scatterplot.sh can be used to generate a scatter plot comparing read counts between two libraries.

It takes the following arguments (in that order):

  1. X-axis library name

  2. Y-axis library name

  3. Read category (see earlier), to be chosen among:

    • pi-si-TE-3UTR
    • piRNA_candidate
    • piRNA_unique
    • siRNA_candidate
  4. Reference category, to indicate the type of reference on which the reads have been mapped. To be chosen among:

    • TE: Canonical transposable elements.
    • pi_clusters: piRNA clusters from Brennecke et al. 2007.
    • 3UTR: 3'UTRs of coding genes.
  5. Normalization type, to determine how the read counts will be normalized. To be chosen among:

    • mappers: Small RNAs having mapped on the genome or on canonical transposable elements.
    • mi: Small RNAs having mapped on miRNAs.
    • mt: Reads having mapped on mitochondrial genome.
    • endosi: Small RNAs having mapped on ovary siRNA clusters from Czech et al. 2008.
    • n: Unique mappers of piRNA cluster number n from Brennecke et al. 2007.
  6. Scale or transformation type. To be chosen among:

    • log2: Base 2 logarithmic scale.
    • log10: Base 10 logarithmic scale.
    • k: A linear scaling factor. Not every value will work well; some may even result in a failure to generate the figure. Check the output of the script and do some trial and error. The ideal value will depend on the normalization and on the data. Note that it is not possible to represent references with no reads if a logarithmic scale is chosen. For that reason, by default, references with fewer than 1000 reads in either library will be ignored. You can use the MIN_COUNTS environment variable to modify this threshold (see below).
  7. A comma-separated list of files listing the references. If you provide several lists, this allows you to categorize references and plot them with different colours (see next option). Give one file per reference category, and don't use spaces to separate the file names, only commas. References not present in the lists will not be used in the plot.

  8. A comma-separated list of colours (using LaTeX colour keywords) to use for the references categorized using the previous option. There should be as many colours as there are listing files. You can repeat colours. Don't separate the colours with spaces, only commas.

  9. Optionally, a number indicating the category of reads to use for the counts. To be chosen among:

    • 2: All reads mapping on the reference (default)
    • 3: Forward mapping reads.
    • 4: Reverse mapping reads.
    • 5: Forward perfectly mapping reads (0 mismatches).
    • 6: Reverse perfectly mapping reads (0 mismatches).
    • 7: Forward not perfectly mapping reads (1 or more mismatches).
    • 8: Reverse not perfectly mapping reads (1 or more mismatches).

The number actually corresponds to the column in the tables used to read the data, whose headers look as follows:

#ref    total   total_F total_R F_0mm   R_0mm   F_mm    R_mm    F_1_mm  R_1_mm  F_2_mm  R_2_mm  F_3_mm  R_3_mm  F_4_mm  R_4_mm

The number of columns in the tables will depend on the maximum number of mismatches recorded.
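As an illustration, the following call would compare mapper-normalized piRNA candidate counts on transposable elements between the two example libraries, on a base 2 logarithmic scale, colouring two groups of elements differently (the listing file names LTR_TEs.txt and nonLTR_TEs.txt are purely illustrative):

do_scatterplot.sh my_control my_mutant piRNA_candidate TE mappers log2 LTR_TEs.txt,nonLTR_TEs.txt red,blue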

The pdf figures will be generated in a scatterplots directory within your main analyses directory.

One version of the scatterplots will be generated from LaTeX files containing TikZ code (files ending in "_with_names.tex", also located in the scatterplots directory), which you can modify and recompile if you know how to do that. In these versions of the plots, axis labelling is rather imperfect, but the name of a plotted element can be displayed in some pdf viewers (actually, "Acrobat Reader" is the only one I know of) when pointing the mouse at a point. This can help you interpret the better-looking versions of the plots (generated with the pyplot python module), which do not have such a feature.

Along with the scatterplots, fold heatmaps are generated (pdf files containing "fold" in their names). In these figures, an element is associated with a rectangle whose colour represents the base 2 logarithm of the (X-axis / Y-axis) ratio of normalized numbers of reads. A file containing the elements represented in the heatmap will also be written, with one element per line, in the order in which they appear in the plot. This file can later be used to force the order of appearance in other heatmaps, using the ORDER environment variable (see below).

It is also possible to convert element names (using the NAME_CONV environment variable) as they are read from the counts file. This can be useful to apply an order obtained from the analysis of data using different names for the transposable elements.

Environment variables for do_scatterplot.sh, with example settings

  • Use another output directory than the default:
OUT_DIR="scatterplots_new"

Note: Do not use the "~" symbol in environment variables. Use ${HOME} instead.

  • Minimum number of reads for a reference to be included in the scatter plot:
MIN_COUNTS="1000"
  • Normalize by a certain number of normalizers instead of using the raw number of normalizers:
BY="1000000"
  • Choose LaTeX font sizes for the axis labels and tick labels:
AXIS_FONT="footnotesize"
TICK_FONT="tiny"
  • Choose a size for the plot markers (default size is 4):
MARKER_SIZE="10"
  • Label font size for the fold heatmaps:
HM_LABEL="9"
  • Choose a colour map for the heatmap:
CMAP="GnBu"

You can refer to this page to help you choose a colour map: https://jakevdp.github.io/blog/2014/10/16/how-bad-is-your-colormap/

(If you add "_r" to the colour map name, you will obtain the colour series in the reverse order.)

  • Center the heatmap colour map around zero:
CENTER_CMAP="Yes"

(Actually, recentering will happen whatever value you use for CENTER_CMAP, so do not set CENTER_CMAP="No" if you do not want to recenter the colour map.)

  • Force the heatmap colour scale to use a certain range of log folds:
MAX_LOGFOLD="3.0"

This only has an effect if CENTER_CMAP has been set.

  • Plot the x=y line:
PLOT_DIAGONAL="Yes"

(As for the previous variable, the line will be plotted whatever value you use for PLOT_DIAGONAL.)

  • Sort the elements in the heatmap according to a pre-defined order:
ORDER="elements_order.txt"

This environment variable should be set to the path of a file containing a list of elements (one per line), defining which elements should be used for the heatmap, and in which order. If ORDER is not defined, the elements will be ordered according to their normalized read count folds.

This can be used to have the same order in various plots.

Note that the elements not having enough counts (see MIN_COUNTS) will not be represented, even if present in the file.

  • Convert the names of the elements:
NAME_CONV="name2converted.txt"

This environment variable should be set to the path of a file containing tab-separated pairs (element name, converted name), defining how to rename elements. The first name in the pair should correspond to the name as it appears in the counts file. The second name should correspond to the name as it appears in the files defined by the ORDER environment variable (if present) and the 7th argument (groups of elements used to associate colours).

Warning: If an element name read from the counts file does not appear in the file given by the NAME_CONV option, it will be kept as-is. In this regard, be careful with spelling and the lowercase vs. uppercase distinction.

  • Additionally, if you set a NO_LINE variable to "Yes" (or anything not empty, actually), the regression line will not be displayed on the graph.

Similarly, setting these variables to anything not empty will disable some outputs:

  • Do not output the fold heatmap:
NOHEAT="True"
  • Do not output the pyplot scatterplot:
NOSCATTER="True"
  • Do not output the LaTeX scatterplot:
NOTEX="True"
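These variables can be combined with the other environment variables and the script arguments; for instance, to generate only the pyplot scatterplot of the illustration given earlier (listing file names again purely illustrative):

NOHEAT="True" NOTEX="True" do_scatterplot.sh my_control my_mutant piRNA_candidate TE mappers log2 LTR_TEs.txt,nonLTR_TEs.txt red,blue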

Ping-pong pipeline

Slice size impact

A read is more likely to have ping-pong partners in a large set of reads than in a small one. Therefore, ping-pong analysis cannot be done on the whole set of identified piRNA candidates if one wants to be able to compare results from several libraries (unless the libraries all happen to have the same number of piRNA candidates).

The solution implemented in the piRNA pipeline is to perform the analyses on slices of piRNA candidates of equal size [4]. In order to get an idea of the variability of the estimate of the proportion of reads with ping-pong partners, it is possible to repeat the analyses on several successive slices of reads. The reads are written out to files at the annotation phase, in the order in which they come in the output of bowtie2. This order is itself the order of the reads in the raw data fastq file. Provided this order can be considered random, the slices used for the ping-pong analyses can be considered random selections of piRNA reads.

Global ping-pong

The ping-pong analyses can be done on the whole set of piRNA candidates, in which case the number and size of the slices should be determined based on the number of piRNA candidates in the library where this number is the smallest. You will have to decide on a slice size and a number of slices, and then use them for the ping-pong analyses of all the libraries that you want to compare. The slice size times the number of slices should be no higher than the smallest number of piRNA candidates among the libraries.

See the results of the commands given earlier to make your choice of slice number and size. Say you found that the smallest library has 309 piRNA candidates. You may decide to analyse 3 slices of 103 reads. You would then type the following commands:

do_ping_pong.sh my_control 3 103 all
do_ping_pong.sh my_mutant 3 103 all

For each library, in each slice, piRNA reads are mapped on the genome complemented with canonical transposable element sequences, using bowtie2, with settings such that some mismatches will be accepted (the exact number of accepted mismatches cannot be set in bowtie2). The reason for this choice is that ping-pong partners are expected to come from different loci in the genome, typically one from an active transposable element copy, and the other from a degenerate copy present in a piRNA cluster. It is therefore likely that at least one read of a pair of ping-pong partners will not map perfectly on the reference [5].

The mapping results are then examined. References (chromosomes and canonical transposable elements) are scanned for the presence of pairs of forward-reverse mapping reads with a 10 nucleotide overlap at their 5' ends. Every read in a slice that is part of such a pair is counted as a ping-ponger. The other reads are counted as non-ping-pongers.

Per element ping-pong

The ping-pong analyses can also be done for a given transposable element separately. In this case, a sub-category of piRNA candidates will be used, namely those that mapped on the given transposable element during the second phase of the main pipeline execution. The number of piRNA candidates that mapped on a given transposable element can be found in the file <library_name>/histos_piRNA_candidate_on_TE/<library_name>_piRNA_candidate_on_TE_ref_counts.txt. In that file, which can be opened with a spreadsheet program, each line corresponds to one of the transposable elements.

You can extract the number of piRNA candidates having mapped on a given element (say, mdg1) using awk, as follows (still carrying on with our example libraries):

awk '$1=="mdg1"{print $2}' my_control/histos_piRNA_candidate_on_TE/my_control_piRNA_candidate_on_TE_ref_counts.txt
awk '$1=="mdg1"{print $2}' my_mutant/histos_piRNA_candidate_on_TE/my_mutant_piRNA_candidate_on_TE_ref_counts.txt

You will then determine the slice size according to the same considerations as above. With the ridiculously small number of mdg1 piRNA reads present in the example libraries (36 and 3), you cannot make meaningful ping-pong analyses. But for the sake of following our example, let's say you decide to use one slice of 3 reads. You would then type the following commands:

do_ping_pong.sh my_control 1 3 mdg1
do_ping_pong.sh my_mutant 1 3 mdg1

For the ping-pong analyses of the reads from another transposable element, replace the last argument with the corresponding element name, and adjust the slice number and slice size according to your choices.

Partners overlap probability

Whenever do_ping_pong.sh is called, a "ping-pong signature" is also computed, in the form of an overlap probability, using the algorithm provided in Antoniewski 2014.

Automated ping-pong analyses

The pipeline contains a script that, given a list of transposable elements (one per line in a simple text file), a slice number and library names, will automatically determine the maximum slice size for each element and perform the ping-pong analyses by calling the do_ping_pong.sh script. For instance, if you put a file named TE_list.txt with the transposable element names in your main analyses directory, and you want to repeat the ping-pong analyses on 4 slices, you would type the following command:

do_do_ping_pong.sh TE_list.txt 4 my_control my_mutant

The results of the ping-pong analyses will be written in the ping_pong_results directory, in a sub-directory named after the names of the libraries, the base name of the file containing the list of transposable elements and the number of slices.

In the above example, the directory would be ping_pong_results/my_control_my_mutant_TE_list_4_slices.

The percentages of ping-pong reads can be found in the table:

ping_pong_results/my_control_my_mutant_TE_list_4_slices/ping_pong_my_control_my_mutant.xls

The script also generates graphical output for partner overlap probability in a signature_plots subdirectory.

If the file listing the transposable elements contains a "pseudo transposable element" named global, the "global" ping-pong analysis will be performed, along with the analyses for any actual transposable elements contained in the list.


Blaise Li (blaise dot li at igh dot cnrs dot fr)


  1. The installation scripts should install a copy of that program (and others); you may try to edit the scripts if you want to use already installed versions of the programs.

  2. An environment variable is a piece of information that can be used by several programs. $HOME and $PATH are typical environment variables. Here, the $GENOMES environment variable is used to configure the location of the genome data. $PATH is used by the system to define where executable programs are expected to be found. $HOME indicates your user directory. Other temporary environment variables can be set on the command line to alter the behaviour of a program.

  3. The pipeline may work with earlier versions, but this was not tested; it is likely not compatible with python version 3.

  4. Another choice could have been to define slices upstream in the annotation pipeline, for instance at the level of the cleaned small RNA reads, before isolating piRNA candidates. The expected effect of such a choice is that the proportion of reads having a ping-pong partner would be under-estimated (in comparison with the choice implemented here) in libraries where piRNAs represent a smaller proportion of the small RNAs.

  5. Mismatches are also expected between transposable element reads and canonical sequences. Not allowing mismatches when mapping reads would lead to underestimated piRNA counts for RNA-Seq libraries made from strains whose genomes differ from the reference genome.