discrepancies in read number summaries

Issue #5 resolved
Fabian Buske
created an issue

I found the following disagreeing read reports (version: tip from the 19/04/2013):

Using the SRR027956 library, in the first mapping step 6443641 reads are reported for both read pair libraries:

INFO:hiclib.mapping:Mapping command: /share/ClusterShare/software/contrib/fabbus/bowtie2/2.1.0/bowtie2 -x /share/ClusterShare/biodata/contrib/genomeIndices_garvan/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome -q - -5 0 -3 51 -p 32 --very-sensitive
INFO:hiclib.mapping:Output formatting command: samtools view -bS -
[samopen] SAM header is present: 25 sequences.
**6443641 reads; of these:**
  6443641 (100.00%) were unpaired; of these:
    508325 (7.89%) aligned 0 times
    4116914 (63.89%) aligned exactly 1 time
    1818402 (28.22%) aligned >1 times

However, in the final summary the number of original reads is smaller (6005556):

     loading data from file /share/ClusterShare/software/contrib/Cancer-Epigenetics/tmp/HiC_compare/GM1_HindIII/hiclib/HindIII_SRR027956-mapped_reads.hdf5 (assuming h5dict)
     Original reads: 6005556  -> No same fragment: 5349615 -> remove unused chrom: 5349615 -> ...
     ... -> No unmapped reads: 5349615 -> no extra DEs (--> (<500) <--): 5059661
          Number of reads changed  6005556 ---> 5059661           Fragments number changed -   0 --->  793717

I wonder where the remaining 438085 read went. I cannot find this number popping up anywhere in the log file. Also, I observed similar read discrepancy in all of my other runs.

Comments (5)

  1. mimakaev

    Hi Fabian,

    The first number you see is total number of reads. Our library only consideres reads which were maps for at least one side in the genome, therefore the total number of SS + DS reads is less than original read number.


  2. Fabian Buske reporter

    Thanks for the clarification Max. I have changed the bug report into an enhancement. I would appreciate any clarification on the missing (unmapped) number of reads in the final summary. It could be that this is already the case with "No unmapped reads: 5349615", which could be a sum of all unmapped reads between the two libraries. However, it is not quite clear at the moment and I am speculating, which I would rather avoid doing.

    Regards, Fabian

  3. Fabian Buske reporter

    A little remark: "No same fragment" and "no extra DEs" is a somewhat confusing description in the log. Does "No" and "no" correspond to "the number of" or "not a" or either of it depending on the capitalisation. Some additional clarification would make the log summary much more readable.

    Anyhow, the hiclib is a terrific piece of work. Thanks a lot!

  4. mimakaev

    Fabian, I was thinking for a long time about making some comprehensive reporting there...

    If you want to improve reporting in this part, and do it for yourself, please commit it to the main library. I can give you write access...

  5. Log in to comment