Execution time for DeNovo module and questions

Issue #75 resolved
Sebastien created an issue

Hi, I'm trying to use miARma-seq to analyse smallRNA-seq data. I would like to compare the quantification of miARma-seq and miRDeep2.

I ran the Known miRNA pipeline with my fastq files. I compare the quantification of miARma-seq and miRDeep2 but they are so many differences between the two quantification. (I attach the Venn Diagramm of the quantification of miRNA with the two software). Have you maybe an explanation for those differences?

Concerning the DeNovo module, do you have an idea of the execution time of this module with 8 fastq files with 20 000 000 reads in each files (I'm currently using 30 threads to execute the program)? Do you know if my config file is correct?

Config file:

;General parameters [General] ; type of analysis (miRNA, mRNA or circRNA) type=miRNA ;0 for no verbose, otherwise to print "almost" everything verbose=0 ; Folder for miRNA reads read_dir=/data/STAGE_snin/Projet_Varrault/ ; Number of process to run at the same time threads=30 ; label for the analsysis label=Projet_Varrault ; Folder where miARma has been instaled miARmaPath=/data/software/miARma/cbbio-miarma-ba7e51dd6d2e/ ; Folder to store results output_dir=/data/STAGE_snin/Results/miARma_DeNovo/ ; organism used organism=mouse stats_file=/data/STAGE_snin/Results/miARma_DeNovo//miARma_stat.45422.log logfile=/data/STAGE_snin/Results/miARma_DeNovo//miARma_logfile.45422.log

;Stats file where stats data will be saved [Quality] ;Character string to put in the name of the results directory prefix=Both

[DeNovo] ;Indexed genome to align your reads in format .ebwt (Mandatory for analysis with miRDeep) bowtie1index=/data/Genomes/mm9/bowtie1/mm9 ;Adapter to trimm at read 3' adapter=TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC ;a fasta file with all mature sequence from your organism mature_miRNA_file=/data/Genomes/mirbase/mature_mmu.fa ;a fasta file with all known pre-miRNa sequence precursor_miRNA_file=/data/Genomes/mirbase/hairpin_mmu.fa ;fasta file for the cmplete genome of our organism genome=/data/Genomes/mm9/multifasta/mm9.fa

[DEAnalysis] ; Complete path of the target file. targetfile=/data/STAGE_snin/Results/input_miARma/Varrault_targets.txt ; Path of the contrast file. contrastfile=/data/STAGE_snin/Results/input_miARma/Varrault_contrasts.txt #This value refers to filter processing in the reads (Should be "yes" or "no"). filter=yes ;Specific software to perform the Differential Expression Analysis (Allowed values: edger, noiseq or edger-noiseq) desoft=EdgeR-Noiseq ; providing replicates replicates=yes ;Provide a file with normalized reads cpm=yes ;Normalization method for EdgeR edger_normethod=RLE

[TargetPrediction] ; #Optional argument to select statistically significant results. 0.8 as default noiseq_cutoff=0.8 ; #Optional argument to select statistically significant results. 0.05 as default edger_cutoff=0.05

Thank you

Sebastien

Comments (14)

  1. Eduardo Andres Leon

    Hi, I don't understand what you mean when you say: "I would like to compare the quantification of miARma-seq and miRDeep2.". miARma performs a de novo prediction using miRDeep2 and a differential expression analysis using a "standard" pipeline, but both are run under miARma-Seq. In fact the parameters and software are different among them, so I guess that obtaining different results is expected.

    Regarding the execution time. Most of processes by miRDeep are no threaded/paralelized so it will take a lot. Nevertheless, the "standard" pipeline are threaded, so it will take a little bit shorter. But again, both are intended to perform different analysis

    Finally, if you are using the filter parameter in the [DEAnalysis], you should include the parameter repthreshold to specify the number of samples to filter from. That is, if you have 3 control samples and 3 testing samples, this value should be 3

    Edu

  2. Sebastien reporter

    Hi, I wanted to compare the two software because I got too many differences in the quantification using miRDeep2 or miARma-seq Known miRNA module. As you can see on the VennDiagram, the two software gave me very different number of found miRNA. Some of the miRNA found by miRDeep2 were counted more than 100 000 times and 0 time by miARma-seq. To obtain those results, I ran miRDepp2 alone and took only the quantification of known miRNA. For miARma-seq, I ran the Known_miRNAs_pipeline.ini config and replace the path to different folder, I had some parameters to the counting module "-O and -M" to count multi-mapping reads. Do you have an explanation about the differences of the count of miRNA found by miRDeep2 and miARma-seq?

    I'm still waiting for the results of the DeNovo module. I'll come back to post the results of this module when I get it.

    Sebastien

  3. Eduardo Andres Leon

    Wow this is really interesting. I've read in various papers that miRDeep2 is really good in order to predict de novo miRNAs and "not very good" quantifying known miRNAs. That's the reason we have included a more accepted pipeline for those known miRNAs. That's why we have never took into account the quantification of known miRNAs by miRDeep2. Could you tell me a couple of miRNAs detected by miRDeep (those found 100.000 times) which are not counted in the standard pipeline ?

  4. Sebastien reporter

    Here is the result I got using the command:

    awk -F "\t" '{if($2>=100000) print $0}' my_file

    name miRDeep2 miARma-seq

    mmu-miR-351-5p 122771 0

    mmu-miR-146b-5p 592205 0

    mmu-miR-30a-5p 353825 0

    mmu-miR-191-5p 212262 0

    mmu-miR-26a-5p 197246 0

    mmu-miR-182-5p 670728 0

    mmu-miR-181a-5p 114541 0

    mmu-miR-143-5p 1340476 0

    mmu-miR-199b-5p 189001 0

    mmu-miR-22-5p 780844 0

    mmu-miR-30d-5p 167779 0

    mmu-miR-148a-5p 814204 0

    mmu-miR-103-1-5p 190220 0

    mmu-miR-103-2-5p 191009 0

    mmu-miR-378a-5p 650860 0

    mmu-miR-199a-5p 189796 0

    mmu-let-7f-5p 237222 0

    mmu-miR-21a-5p 434758 0

    mmu-miR-103-3p 190220 0

    mmu-miR-10b-5p 4824920 0

    I used a custom python script to get the quantification of miRDeep2 from the result.csv file of miRDeep2

  5. Sebastien reporter

    Do you think the DeNovo module can take more than 4 days runnig for 8 fastq files with 20 000 000 reads?

  6. Eduardo Andres Leon

    that sounds too much. You can check log files (miARma_log and miARma_stats) to see the time needed to analyse each file

  7. Sebastien reporter

    well it seems that the aligner bugged. The analysis started for the first file only there is no more information for other files.

    Here is the log after fastQC:

    miRDeep :: [Tue May 9 15:34:53 2017] Executing mkdir -p /data/STAGE_snin/Results/miARma_DeNovo//miRDeep_results/ ;export PERL5LIB=/data/software/miARma/cbbio-miarma-ba7e51dd6d2e//lib/Perl/; mapper.pl /data/STAGE_snin/Projet_Varrault//mirCE1.PF.R1.fastq -e -h -i -j -n -m -k TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -o 30 -p /data/Genomes/mm9/bowtie1/mm9 -s /data/STAGE_snin/Results/miARma_DeNovo//miRDeep_results/mirCE1.PF.R1.fa -t /data/STAGE_snin/Results/miARma_DeNovo//miRDeep_results/mirCE1.PF.R1_vs_genome.arf >> /data/STAGE_snin/Results/miARma_DeNovo//miARma_stat.45422.log 2>&1 miRDeep :: [Tue May 9 15:34:53 2017] Executing mkdir -p /data/STAGE_snin/Results/miARma_DeNovo//miRDeep_results/ ; export PERL5LIB=/data/software/miARma/cbbio-miarma-ba7e51dd6d2e//lib/Perl/;miRDeep2.pl /data/STAGE_snin/Results/miARma_DeNovo//miRDeep_results/mirCE1.PF.R1.fa /data/Genomes/mm9/multifasta/mm9.fa /data/STAGE_snin/Results/miARma_DeNovo//miRDeep_results/mirCE1.PF.R1_vs_genome.arf /data/Genomes/mirbase/mature_mmu.fa none /data/Genomes/mirbase/hairpin_mmu.fa -r mirCE1.PF.R1 -P -d -c -v >> /data/STAGE_snin/Results/miARma_DeNovo//miARma_stat.45422.log 2>&1

  8. Sebastien reporter

    I got this message when I stopped the execution:

    miRDeep ERROR :: system args failed: 2 (mkdir -p /data/STAGE_snin/Results/miARma_DeNovo//miRDeep_results/ ;export PERL5LIB=/data/software/miARma/cbbio-miarma-ba7e51dd6d2e//lib/Perl/; mapper.pl /data/STAGE_snin/Projet_Varrault//mirCE1.PF.R1.fastq -e -h -i -j -n -m -k TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -o 30 -p /data/Genomes/mm9/bowtie1/mm9 -s /data/STAGE_snin/Results/miARma_DeNovo//miRDeep_results/mirCE1.PF.R1.fa -t /data/STAGE_snin/Results/miARma_DeNovo//miRDeep_results/mirCE1.PF.R1_vs_genome.arf >> /data/STAGE_snin/Results/miARma_DeNovo//miARma_stat.45422.log 2>&1) at /data/software/miARma/cbbio-miarma-ba7e51dd6d2e//lib//CbBio/RNASeq/Aligner.pm line 2210. [snin@thetys miARma_config_files]$ mapper.pl /data/STAGE_snin/Projet_Varrault//mirCE1.PF.R1.fastq -e -h -i -j -n -m -k TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -o 30 -p /data/Genomes/mm9/bowtie1/mm9 -s /data/STAGE_snin/Results/miARma_DeNovo//miRDeep_results/mirCE1.PF.R1.fa -t /data/STAGE_snin/Results/miARma_DeNovo//miRDeep_results/mirCE1.PF.R1_vs_genome.arf >> /data/STAGE_snin/Results/miARma_DeNovo//miARma_stat.45422.log 2>&1)

  9. Sebastien reporter

    There is no -o <int> arg in the miRDeep2 documentation for the mapper.pl module. Maybe it can introduce some error?

  10. Eduardo Andres Leon

    This -o is for mapper.pl :

    This script takes as input a file with deep sequencing reads (these can be in different formats, see the options below). The script then processes the reads and/or maps them to the reference genome, as designated by the options given. Options:

    Read input file: -a input file is seq.txt format -b input file is qseq.txt format -c input file is fasta format -e input file is fastq format -d input file is a config file (see miRDeep2 documentation). options -a, -b, -c or -e must be given with option -d.

    Preprocessing/mapping: -g three-letter prefix for reads (by default 'seq') -h parse to fasta format -i convert rna to dna alphabet (to map against genome) -j remove all entries that have a sequence that contains letters other than a,c,g,t,u,n,A,C,G,T,U,N -k seq clip 3' adapter sequence -l int discard reads shorter than int nts, default = 18 -m collapse reads

    -p genome map to genome (must be indexed by bowtie-build). The 'genome' string must be the prefix of the bowtie index. For instance, if the first indexed file is called 'h_sapiens_37_asm.1.ebwt' then the prefix is 'h_sapiens_37_asm'. -q map with one mismatch in the seed (mapping takes longer)

    -r int a read is allowed to map up to this number of positions in the genome default is 5

    Output files: -s file print processed reads to this file -t file print read mappings to this file

    Other: -u do not remove directory with temporary files -v outputs progress report

    -n overwrite existing files

    -o number of threads to use for bowtie

  11. Sebastien reporter

    Hi, I come back to you to report that I made a mistake during my use of miARma and miRDeep2. I used the gff file provided by mirbase (v21) and the reference genome I used was mm9. The gff file is from mm10. It was a stupid mistake I did. I saw that after investigation with my internship tutor.

    By the way thank you for your help. You did a great job with your tool.

    Best regards. Sebastien

  12. Eduardo Andres Leon

    Don't worry. These kind of mistakes are very frequent ;)

    Glad to hear that you like the tool

  13. Log in to comment