HTTPS SSH

Ribosome Profiling Analysis

Ribosome profiling analysis workflow

The workflow is written using Snakemake.

Dependencies are installed using Bioconda where possible.

Setup environment and run workflow

  1. Clone workflow into working directory

    git clone https://lance_parsons@bitbucket.org/lance_parsons/hsinli_ribosomal_profiling.git
    cd hsinli_ribosomal_profiling
    
  2. Copy data into genome and data subdirectories

    Required inputs:

    • genome/eschericia_coli_k12_nc_000913_3.gtf
    • genome/escherichia_coli_k12_nc_000913_3_gene_annotations.gtf
    • genome/escherichia_coli_k12_nc_000913_3.bed
    • data/mapping/SAMPLE.bam

    Needed for older versions:

    • data/riboprofile/SAMPLE_p_offsets.txt
  3. Edit config as needed

    nano ecoli_config.yaml
    

    Alternatively, create a new config file and change the config file listed in counts.snakefile

  4. Install dependencies into isolated environment and activate

    conda env create -n hsinli_ribosomal_profiling --file environment.yml
    source activate hsinli_ribosomal_profiling
    pip install --no-deps --upgrade  --install-option='--recythonize' --install-option='--single-version-externally-managed' git+git://github.com/joshuagryphon/plastid
    
  5. Execute workflow

    snakemake -s counts.snakefile -n
    

Running workflow using SLURM on gen-comp1 (cetus)

export DRMAA_LIBRARY_PATH=/usr/local/lib/libdrmaa.so
snakemake -s counts.snakefile -rp -w 60 -j 1000 --drmaa ' --mem=10000'

Woolstenhulme Workflow

Based on method described in High-Precision Analysis of Translational Pausing by Ribosome Profiling in Bacteria Lacking EFP

For each start codon
    Get reads with 3' end mapping to window (-30,+30bp)
    Get counts for each bp and sum

Plastid Workflow

The analysis was put into two Snakemake workflows.

  1. offsets.snakefile - This workflow calculates the fiveprime offsets using the stop codons. See discussion below for some steps involved.

    1. Simulate 5' and 3' UTRs since they are not annotated for e. coli:

      ./simulate_utr.py \
          --upstream_utr_length 50 \
          --downstream_utr_length 50 \
          genome/escherichia_coli_k12_nc_000913_3.gtf \
          > genome/escherichia_coli_k12_nc_000913_3_50bp_utrs.gtf
      
    2. Generate ROI regions:

      metagene generate genome/ecoli_cds_stop \
          --landmark cds_stop \
          --annotation_files genome/escherichia_coli_k12_nc_000913_3_50bp_utrs.gtf  \
          --add_three
      
    3. Calculate offsets:

      snakemake -s offsets.snakefile
      
    4. Review the output files: data/riboprofile/{sample}/{sample}_stopcodon_p_offsets.txt and create final offsets as data/riboprofile/{sample}/{sample}_p_offsets.txt

  2. counts.snakefile - This workflow uses the offsets and generates counts.

    snakemake -s counts.snakefile
    

Initial exploration

Below are the steps taken to figure out the correct workflow.

p-site offset determination (trial 1)

Python dependencies: requirements.txt

  1. Adapter Trimming: cutadapt on Galaxy (hsili)

  2. Mapping: BWA on Galaxy (hsili)

  3. Determine P-site offsets

    1. Simulate 5' UTRs since they are not annotated for e. coli:

      simulate_utr.py genome/escherichia_coli_k12_nc_000913_3.gtf \
          > genome/escherichia_coli_k12_nc_000913_3_50bp_utr.gtf`
      
    2. Generate ROI regions:

      metagene generate genome/ecoli_cds_start \
          --landmark cds_start \
          --annotation_files genome/escherichia_coli_k12_nc_000913_3_50bp_utr.gtf \
          --upstream 50 \
          --downstream 50 \
          --add_three
      
    3. Generate p-site offsets:

      psite genome/ecoli_cds_start_rois.txt data/Nup_riboprofile_min200 \
          --min_length 29 \
          --max_length 35 \
          --require_upstream \
          --count_files data/mapping/Nup.bam \
          --min_counts 200
      

      Varied min_counts to improve estimation of initiation peaks:

      • --min_count 10 did not result in any estimated initiation peaks.

      • --min_count 50 enabled three (30, 31, and 32) to be estimated.

      • --min_count 100 enabled four (29, 32, 33, and 35) to be estimated.

      • --min_count 200 enabled five (30, 31, 32, 33, 34, and 35) to be estimated.

      • --min_count 300 was poor (very few reads upstream).

      Perhaps this is due to noisy or poor coverage of some transcripts. Looking at the data visually, it's not clear to me that there are the required initiation peak or stop codon peaks.

A-site offset determination (trial 2)

Since it does not seem that the start of a codon is often a stall site for e. coli, we will attempt to determine the offsets using to other methods.

  1. The known A-Site in secM (496-498 bp, Pro166 / chr 108201-108203)

    1. Manually generate A-Site ROI file as genome/secM_asite_roi.txt

    2. Generate a-site offsets:

      psite genome/secM_asite_roi.txt data/riboprofile/Nup_riboprofile_secM_min10 \
          --min_length 29 \
          --max_length 35 \
          --count_files data/mapping/Nup.bam \
          --min_counts 10
      
  2. Stop codons

    1. Simulate 5' and 3' UTRs since they are not annotated for e. coli:

      ./simulate_utr.py \
          --upstream_utr_length 50 \
          --downstream_utr_length 50 \
          genome/escherichia_coli_k12_nc_000913_3.gtf \
          > genome/escherichia_coli_k12_nc_000913_3_50bp_utrs.gtf
      
    2. Generate ROI regions:

      metagene generate genome/ecoli_cds_stop \
          --landmark cds_stop \
          --annotation_files genome/escherichia_coli_k12_nc_000913_3_50bp_utrs.gtf  \
          --add_three
      
    3. Generate p-site offsets:

      psite genome/ecoli_cds_stop_rois.txt \
          data/riboprofile/Nup_riboprofile_stopcodons_min10 \
          --min_length 29 \
          --max_length 35 \
          --count_files data/mapping/Nup.bam \
          --min_counts 10
          ## A-site offset determination (trial 2)
      
      Since it does not seem that the start of a codon is often a stall site for e.
      coli, we will attempt to determine the offsets using to other methods.
      
      1. The known A-Site in secM (496-498 bp, Pro166 / chr 108201-108203)

        1. Manually generate A-Site ROI file as genome/secM_asite_roi.txt

        2. Generate a-site offsets:

            psite genome/secM_asite_roi.txt data/riboprofile/Nup_riboprofile_secM_min10 \
                --min_length 29 \
                --max_length 35 \
                --count_files data/mapping/Nup.bam \
                --min_counts 10
          
      2. Stop codons

        1. Simulate 5' and 3' UTRs since they are not annotated for e. coli:

            ./simulate_utr.py \
                --upstream_utr_length 50 \
                --downstream_utr_length 50 \
                genome/escherichia_coli_k12_nc_000913_3.gtf \
                > genome/escherichia_coli_k12_nc_000913_3_50bp_utrs.gtf
          
        2. Generate ROI regions:

            metagene generate genome/ecoli_cds_stop \
                --landmark cds_stop \
                --annotation_files genome/escherichia_coli_k12_nc_000913_3_50bp_utrs.gtf  \
                --add_three
          
        3. Generate p-site offsets:

            psite genome/ecoli_cds_stop_rois.txt \
                data/riboprofile/Nup_riboprofile_stopcodons_min10 \
                --min_length 29 \
                --max_length 35 \
                --count_files data/mapping/Nup.bam \
                --min_counts 10
          

Ribosomal occupancy counts

Use the p-site offsets calculated above to generate ribosomal occupancy counts.

  1. Manually tweak the p-site offset files if needed.

    data/riboprofile/Nup_riboprofile_final_p_offsets.txt
    
  2. Generate count vector for each gene:

    get_count_vectors --annotation_files genome/escherichia_coli_k12_nc_000913_3.gtf \
        --count_files data/mapping/Nup.bam \
        --fiveprime_variable \
        --offset data/riboprofile/Nup_riboprofile_final_p_offsets.txt \
        data/riboprofile/count_data/
    
  3. Generate wiggle file:

    make_wiggle -o data/riboprofile/nup_riboprofile \
        --count_files data/mapping/Nup.bam \
        --min_length 29 \
        --max_length 35 \
        --fiveprime_variable \
        --offset data/riboprofile/Nup_riboprofile_final_p_offsets.txt