Add additional evidence columns to findNovelAlleles output

Jason Vander Heiden reporter

assigned issue to

ssnn

2018-03-15T17:45:01+00:00

ssnn

Added some in 5c434d6. TODO: p-value. I still need to double-check that I have not mixed up counts of positions with counts of sequences.

2018-04-04T12:31:13+00:00

Jason Vander Heiden reporter

We can skip the p-value, as it might mislead people into thinking it's a p-value for the allele call instead of a filter during one of the steps.

2018-04-11T20:38:18+00:00

ssnn

Short mutation specification in a standard format, e.g. 120A>C,220G>T. --> MU_SPEC
~~p-value from y-intercept confidence interval.~~
The proportion of records in the sequence dataset matching this unmutated sequence. --> UNMUTATED_COUNT
The percentage at which this allele was observed in the sequence dataset, compared to other alleles. --> NOVEL_IMGT_COUNT/GERMLINE_CALL_COUNT (?)
Number of unique J sequences found associated with the inferred V sequence. --> UNMUTATED_SNP_J_GENE_LENGTH_COUNT, NOVEL_IMGT_NUM_J (?)
Number of unique CDR3s found associated with the inferred V sequence. --> UNMUTATED_SNP_J_GENE_LENGTH_COUNT, NOVEL_IMGT_NUM_CDR3 (?)
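
To make the mutation specification format concrete, here is a minimal sketch of how such a string could be derived by comparing a germline sequence with the inferred novel sequence. It is written in Python for illustration only and is not the package's implementation. The function names are hypothetical, and the position numbering (1-based, counting every column of the aligned germline, gaps included) is an assumption.

```python
def nt_substitutions(germline: str, novel: str) -> str:
    """Return substitutions as '<pos><germline_nt>><novel_nt>', comma-separated.

    Assumption: both sequences are aligned to the same length and positions
    are 1-based, counting every alignment column (gap characters included).
    """
    if len(germline) != len(novel):
        raise ValueError("sequences must be aligned to the same length")
    changes = []
    for pos, (ref, alt) in enumerate(zip(germline, novel), start=1):
        if ref != alt:
            changes.append(f"{pos}{ref}>{alt}")
    return ",".join(changes)


def polymorphism_call(germline_call: str, germline: str, novel: str) -> str:
    """Build an allele name such as '<germline_call>_G132A_G168T'."""
    if len(germline) != len(novel):
        raise ValueError("sequences must be aligned to the same length")
    parts = [
        f"{ref}{pos}{alt}"
        for pos, (ref, alt) in enumerate(zip(germline, novel), start=1)
        if ref != alt
    ]
    return "_".join([germline_call] + parts)
```

With toy inputs, nt_substitutions("ACGT.ACGT", "ACTT.ACGA") gives "3G>T,9T>A". The sketch reports any differing column as is, so a real implementation would need to decide how to treat N and gap characters.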

I think some of this information would be easier to gather after inferGenotype and reassignAlleles (in particular the items marked with ?).
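
As a rough illustration of gathering these values after reassignment, the sketch below summarises the records assigned to one allele. It is Python for illustration only, not the package's code. The function name and the J_CALL and JUNCTION field names are assumptions (JUNCTION stands in for the CDR3), and computing the percentage relative to other alleles of the same gene is only one possible reading of the item above.

```python
def allele_evidence(records, allele):
    """Summarise records whose V_CALL_GENOTYPED equals `allele`.

    `records` is a list of dicts with keys V_CALL_GENOTYPED, J_CALL and
    JUNCTION (assumed field names).
    """
    gene = allele.split("*")[0]
    # All records assigned to any allele of the same gene.
    same_gene = [r for r in records if r["V_CALL_GENOTYPED"].split("*")[0] == gene]
    # Records assigned to this specific allele.
    assigned = [r for r in same_gene if r["V_CALL_GENOTYPED"] == allele]
    return {
        "count": len(assigned),
        "percent_of_gene": 100 * len(assigned) / len(same_gene) if same_gene else 0.0,
        "unique_j": len({r["J_CALL"] for r in assigned}),
        "unique_cdr3": len({r["JUNCTION"] for r in assigned}),
    }


# Toy data, made up for the example.
toy_records = [
    {"V_CALL_GENOTYPED": "IGLV2-14*01_G132A_G168T", "J_CALL": "IGLJ1*01", "JUNCTION": "TGTAAA"},
    {"V_CALL_GENOTYPED": "IGLV2-14*01_G132A_G168T", "J_CALL": "IGLJ2*01", "JUNCTION": "TGTAAA"},
    {"V_CALL_GENOTYPED": "IGLV2-14*01_G132A_G168T", "J_CALL": "IGLJ1*01", "JUNCTION": "TGTCCC"},
    {"V_CALL_GENOTYPED": "IGLV2-14*01", "J_CALL": "IGLJ3*01", "JUNCTION": "TGTGGG"},
]
toy_summary = allele_evidence(toy_records, "IGLV2-14*01_G132A_G168T")
```

Working from the reassigned calls means the denominator already reflects the genotype, which is why the items marked with ? seem easier to compute at that stage.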

2018-04-12T19:14:10+00:00

ssnn

In #86154ad and #d8df3eb I changed how perfect matches of the NOVEL_IMGT sequence are counted in the input data. Before, I was using the whole sequence as the search pattern. Now I use only the substring defined by pos_range (default is 1:312). This gives a count much closer to PERFECT_MATCH_COUNT: NOVEL_IMGT_COUNT is now 763, compared with around 150 before this change, while PERFECT_MATCH_COUNT is 836. The remaining difference must be due to N characters.

  GERMLINE_CALL                NOTE       POLYMORPHISM_CALL NT_SUBSTITUTIONS
1   IGLV2-14*01 Novel allele found! IGLV2-14*01_G132A_G168T    132G>A,168G>T
2   IGLV2-14*01 Novel allele found! IGLV2-14*01_G132A_G168T    132G>A,168G>T
  NOVEL_IMGT_COUNT NOVEL_IMGT_UNIQUE_J NOVEL_IMGT_UNIQUE_CDR3 PERFECT_MATCH_COUNT
1              763                   3                    619                 836
2              763                   3                    619                 836
  PERFECT_MATCH_FREQ GERMLINE_CALL_COUNT GERMLINE_CALL_PERC MUT_MIN MUT_MAX
1           0.148227                5640                100       1      10
2           0.148227                5640                100       2      11
  MUT_PASS_COUNT GERMLINE_IMGT_COUNT POS_MIN POS_MAX Y_INTERCEPT Y_INTERCEPT_PASS
1           4123                   0       1     312       0.125                2
2           3582                   0       1     312       0.125                2
  SNP_PASS UNMUTATED_COUNT UNMUTATED_FREQ UNMUTATED_SNP_J_GENE_LENGTH_COUNT
1     4006            1577      0.2796099                                41
2     3466             836      0.1482270                                19
  SNP_MIN_SEQS_J_MAX_PASS ALPHA MIN_SEQS J_MAX MIN_FRAC
1                       1  0.05       50  0.15     0.75
2                       1  0.05       50  0.15     0.75
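
The change in counting described above can be sketched as follows. This is a Python illustration under assumptions, not the code from the commits: the function name is hypothetical, matching is treated as a fixed-string search, and pos_range is taken as 1-based and inclusive.

```python
def count_matches(sequences, novel_imgt, pos_range=None):
    """Count input sequences that contain the search pattern.

    With pos_range=None the whole novel sequence is the pattern (old
    behaviour, as described). With pos_range=(start, end) only that
    1-based, inclusive substring is used (new behaviour).
    """
    if pos_range is None:
        pattern = novel_imgt
    else:
        start, end = pos_range
        pattern = novel_imgt[start - 1:end]
    return sum(pattern in seq for seq in sequences)


# Toy data: a 10 nt "novel allele" and a range covering positions 1 to 6.
toy_novel = "ACGTACGTAA"
toy_sequences = [
    "ACGTACGTAA",  # identical over the full length
    "ACGTACGTCC",  # differs only after position 6
    "ACGTAC",      # truncated after position 6
    "ACGAACGTAA",  # mismatch inside the range
]
whole_count = count_matches(toy_sequences, toy_novel)
range_count = count_matches(toy_sequences, toy_novel, pos_range=(1, 6))
```

In the toy data the whole-sequence pattern matches only the first record, while the range-restricted pattern also matches the two records that differ or stop after the range. For the frequencies in the table above, the values are consistent with dividing by GERMLINE_CALL_COUNT: 836 / 5640 ≈ 0.148227 and 1577 / 5640 ≈ 0.2796099.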

The output of inferGenotype:

      GENE        ALLELES COUNTS TOTAL NOTE
1 IGLV2-14 01_G132A_G168T    152   152

The output of genotypeFasta:

                                                                                                                                    IGLV2-14*01_G132A_G168T 
"CAGTCTGCCCTGACTCAGCCTGCCTCC---GTGTCTGGGTCTCCTGGACAGTCGATCACCATCTCCTGCACTGGAACCAGCAGTGACGTTGGT---------GGTTATAACTATGTCTCCTGGTACCAACAACACCCAGGCAAAGCCCCCAAACTCATGATTTATGATGTC---------------------AGTAATCGGCCCTCAGGGGTTTCT---AATCGCTTCTCTGGCTCCAAG------TCTGGCAACACGGCCTCCCTGACCATCTCTGGGCTCCAGGCTGAGGACGAGGCTGATTATTACTGCAGCTCATATACAAGCAGCAGCACTCTC"

The output of reassignAlleles:

> table(tmp$V_CALL_GENOTYPED)

IGLV2-14*01_G132A_G168T 
                   5640

2018-04-16T16:47:37+00:00

ssnn

changed status to resolved

2018-08-09T17:12:52+00:00
