Add additional evidence columns to findNovelAlleles output

Issue #13 resolved
Jason Vander Heiden created an issue

We could use a few more evidence fields for the germlines working group:

  • Short mutation specification in a standard format, e.g. 120A>C,220G>T.
  • p-value from y-intercept confidence interval.
  • The proportion of records in the sequence dataset matching this unmutated sequence.
  • The percentage at which this allele was observed in the sequence dataset, compared to other alleles.
  • Number of unique J sequences found associated with the inferred V sequence.
  • Number of unique CDR3s found associated with the inferred V sequence.
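
For the first item, here is a minimal Python sketch of how such a specification could be derived. It is not the package's code. It assumes the germline and novel sequences are already aligned to the same length (for example, both IMGT-gapped), and it skips gap and N positions. The function name `mutation_spec` and the toy sequences are hypothetical.

```python
def mutation_spec(germline: str, novel: str) -> str:
    """Return substitutions as '<pos><germline nt>><novel nt>', comma-separated.

    Positions are 1-based and refer to the aligned strings (assumption).
    """
    subs = []
    for pos, (g, n) in enumerate(zip(germline.upper(), novel.upper()), start=1):
        # Skip gaps and ambiguous bases so only real substitutions are reported.
        if g != n and g not in ".-N" and n not in ".-N":
            subs.append(f"{pos}{g}>{n}")
    return ",".join(subs)


# Toy sequences, made up for illustration only.
germline = "CAGTCTGCC"
novel = "CAATCTGCT"
print(mutation_spec(germline, novel))  # 3G>A,9C>T
```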

Comments (6)

  1. Jason Vander Heiden reporter

    We can skip the p-value, as it might mislead people into thinking it's a p-value for the allele call instead of a filter during one of the steps.

  2. ssnn
    1. Short mutation specification in a standard format, e.g. 120A>C,220G>T. --> MU_SPEC
    2. p-value from y-intercept confidence interval.
    3. The proportion of records in the sequence dataset matching this unmutated sequence. --> UNMUTATED_COUNT
    4. The percentage at which this allele was observed in the sequence dataset, compared to other alleles. --> NOVEL_IMGT_COUNT/GERMLINE_CALL_COUNT (?)
    5. Number of unique J sequences found associated with the inferred V sequence. --> UNMUTATED_SNP_J_GENE_LENGTH_COUNT, NOVEL_IMGT_NUM_J (?)
    6. Number of unique CDR3s found associated with the inferred V sequence. --> UNMUTATED_SNP_J_GENE_LENGTH_COUNT, NOVEL_IMGT_NUM_CDR3 (?)

    I think some of this information would be easier to gather after inferGenotype and reassignAlleles (in particular the items marked with ?).
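
As a rough illustration of items 5 and 6, the sketch below counts unique J calls and unique CDR3s per genotyped V call once alleles have been reassigned. It is a Python sketch under stated assumptions, not the package's implementation: the column names `J_CALL` and `JUNCTION` are assumed, the function name `unique_counts` is hypothetical, and the records are toy data.

```python
from collections import defaultdict


def unique_counts(records, v_col="V_CALL_GENOTYPED", j_col="J_CALL", cdr3_col="JUNCTION"):
    """Count distinct J calls and distinct CDR3s seen with each V call."""
    j_seen = defaultdict(set)
    cdr3_seen = defaultdict(set)
    for rec in records:
        v = rec[v_col]
        j_seen[v].add(rec[j_col])
        cdr3_seen[v].add(rec[cdr3_col])
    return {
        v: {"unique_j": len(j_seen[v]), "unique_cdr3": len(cdr3_seen[v])}
        for v in j_seen
    }


# Toy records; the J calls and junctions are made up for illustration.
records = [
    {"V_CALL_GENOTYPED": "IGLV2-14*01_G132A_G168T", "J_CALL": "IGLJ1*01", "JUNCTION": "TGCAAA"},
    {"V_CALL_GENOTYPED": "IGLV2-14*01_G132A_G168T", "J_CALL": "IGLJ2*01", "JUNCTION": "TGCAAA"},
    {"V_CALL_GENOTYPED": "IGLV2-14*01_G132A_G168T", "J_CALL": "IGLJ2*01", "JUNCTION": "TGCCCC"},
]
print(unique_counts(records))
```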

  3. ssnn

    In #86154ad and #d8df3eb I changed how perfect matches of the NOVEL_IMGT sequence are counted in the input data. Previously I used the whole sequence as the search pattern. Now I use only the substring defined by pos_range (default is 1:312). This gives a count much closer to PERFECT_MATCH_COUNT: NOVEL_IMGT_COUNT is 763 now, versus around 150 before these commits, while PERFECT_MATCH_COUNT is 836. The remaining difference must be due to N characters.

      GERMLINE_CALL                NOTE       POLYMORPHISM_CALL NT_SUBSTITUTIONS
    1   IGLV2-14*01 Novel allele found! IGLV2-14*01_G132A_G168T    132G>A,168G>T
    2   IGLV2-14*01 Novel allele found! IGLV2-14*01_G132A_G168T    132G>A,168G>T
      NOVEL_IMGT_COUNT NOVEL_IMGT_UNIQUE_J NOVEL_IMGT_UNIQUE_CDR3 PERFECT_MATCH_COUNT
    1              763                   3                    619                 836
    2              763                   3                    619                 836
      PERFECT_MATCH_FREQ GERMLINE_CALL_COUNT GERMLINE_CALL_PERC MUT_MIN MUT_MAX
    1           0.148227                5640                100       1      10
    2           0.148227                5640                100       2      11
      MUT_PASS_COUNT GERMLINE_IMGT_COUNT POS_MIN POS_MAX Y_INTERCEPT Y_INTERCEPT_PASS
    1           4123                   0       1     312       0.125                2
    2           3582                   0       1     312       0.125                2
      SNP_PASS UNMUTATED_COUNT UNMUTATED_FREQ UNMUTATED_SNP_J_GENE_LENGTH_COUNT
    1     4006            1577      0.2796099                                41
    2     3466             836      0.1482270                                19
      SNP_MIN_SEQS_J_MAX_PASS ALPHA MIN_SEQS J_MAX MIN_FRAC
    1                       1  0.05       50  0.15     0.75
    2                       1  0.05       50  0.15     0.75
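
The change in counting described above can be sketched as follows. This is a Python illustration, not the code from the commits. It assumes the match is a plain substring search and that pos_range is 1-based and inclusive, like R's 1:312. The function name `count_matches` and the toy sequences are hypothetical.

```python
def count_matches(sequences, novel, pos_range=(1, 312)):
    """Return (whole-sequence matches, matches restricted to pos_range)."""
    start, end = pos_range  # 1-based, inclusive (assumption)
    pattern = novel[start - 1:end]
    whole = sum(novel in s for s in sequences)       # old method: full sequence as pattern
    windowed = sum(pattern in s for s in sequences)  # new method: pos_range substring only
    return whole, windowed


# Toy data: the reads differ from the novel sequence only after position 8.
novel = "ACGTACGTAA"
reads = ["ACGTACGTAA", "ACGTACGTNN", "ACGTACGTCC", "TTTTACGTAA"]
print(count_matches(reads, novel, pos_range=(1, 8)))  # (1, 3)
```

Restricting the pattern to pos_range keeps reads that differ only outside the range, which is consistent with the count going up after the change.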
    

    The output of inferGenotype:

          GENE        ALLELES COUNTS TOTAL NOTE
    1 IGLV2-14 01_G132A_G168T    152   152     
    

    The output of genotypeFasta:

                                                                                                                                        IGLV2-14*01_G132A_G168T 
    "CAGTCTGCCCTGACTCAGCCTGCCTCC---GTGTCTGGGTCTCCTGGACAGTCGATCACCATCTCCTGCACTGGAACCAGCAGTGACGTTGGT---------GGTTATAACTATGTCTCCTGGTACCAACAACACCCAGGCAAAGCCCCCAAACTCATGATTTATGATGTC---------------------AGTAATCGGCCCTCAGGGGTTTCT---AATCGCTTCTCTGGCTCCAAG------TCTGGCAACACGGCCTCCCTGACCATCTCTGGGCTCCAGGCTGAGGACGAGGCTGATTATTACTGCAGCTCATATACAAGCAGCAGCACTCTC" 
    

    The output of reassignAlleles:

    > table(tmp$V_CALL_GENOTYPED)
    
    IGLV2-14*01_G132A_G168T 
                       5640 
    