Add additional evidence columns to findNovelAlleles output
We could use a few more evidence fields for the germlines working group:
- Short mutation specification in a standard format, e.g. 120A>C,220G>T.
- p-value from y-intercept confidence interval.
- The proportion of records in the sequence dataset matching this unmutated sequence.
- The percentage at which this allele was observed in the sequence dataset, compared to other alleles.
- Number of unique J sequences found associated with the inferred V sequence.
- Number of unique CDR3s found associated with the inferred V sequence.
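For the first item, a minimal sketch of how such a mutation string could be built from a germline sequence and an inferred sequence of equal length. The helper name `mutation_spec` and the toy sequences are hypothetical, and this is not necessarily how findNovelAlleles derives the field; positions are simply 1-based indices into the aligned strings.

```r
# Hypothetical helper (not part of the package): compare two aligned
# sequences position by position and report substitutions as
# "<position><germline base>><novel base>", comma separated.
mutation_spec <- function(germline, novel) {
  stopifnot(nchar(germline) == nchar(novel))
  g <- strsplit(germline, "")[[1]]
  n <- strsplit(novel, "")[[1]]
  pos <- which(g != n)
  # Return an empty string when the sequences are identical
  if (length(pos) == 0) {
    return("")
  }
  paste0(pos, g[pos], ">", n[pos], collapse = ",")
}

# Toy sequences, made up for illustration only
germline <- "ACGTACGT"
novel    <- "ACATACGA"
mutation_spec(germline, novel)
```

With the toy input above the differences are at positions 3 and 8, so the call returns `"3G>A,8T>A"`.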
Comments (6)
-
reporter -
Added some of these in 5c434d6. TODO: p-value. I still need to double-check that I have not mixed up counts of positions and counts of sequences.
-
reporter We can skip the p-value, as it might mislead people into thinking it's a p-value for the allele call instead of a filter during one of the steps.
-
- Short mutation specification in a standard format. Eg, 120A>C,220G>T. --> MU_SPEC
- p-value from y-intercept confidence interval.
- The proportion of records in the sequence dataset matching this unmutated sequence. --> UNMUTATED_COUNT
- The percentage at which this allele was observed in the sequence dataset, compared to other alleles. --> NOVEL_IMGT_COUNT/GERMLINE_CALL_COUNT (?)
- Number of unique J sequences found associated with the inferred V sequence. --> UNMUTATED_SNP_J_GENE_LENGTH_COUNT, NOVEL_IMGT_NUM_J (?)
- Number of unique CDR3s found associated with the inferred V sequence. --> UNMUTATED_SNP_J_GENE_LENGTH_COUNT, NOVEL_IMGT_NUM_CDR3 (?)
I think some of this information would be easier to gather after inferGenotype and reassignAlleles (in particular the fields marked with ?).
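As a rough sketch of what gathering the J and CDR3 counts after reassignAlleles could look like: count the distinct values of a column among the records assigned to the novel allele. `V_CALL_GENOTYPED` is the column used later in this thread; `J_CALL` and `JUNCTION` are assumed Change-O style column names, `count_unique` is a hypothetical helper, and the data frame is made-up toy data.

```r
# Toy data for illustration only; J_CALL and JUNCTION are assumed column names
db <- data.frame(
  V_CALL_GENOTYPED = c("IGLV2-14*01_G132A_G168T", "IGLV2-14*01_G132A_G168T",
                       "IGLV2-14*01_G132A_G168T", "IGLV2-23*01"),
  J_CALL   = c("IGLJ2*01", "IGLJ3*02", "IGLJ2*01", "IGLJ1*01"),
  JUNCTION = c("TGCAGCTCA", "TGCAGCTCA", "TGCAGTTCA", "TGCTGCTCA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct values in `column` among the
# records whose genotyped V call equals `allele`
count_unique <- function(db, allele, column) {
  hits <- db[db$V_CALL_GENOTYPED == allele, ]
  length(unique(hits[[column]]))
}

num_j    <- count_unique(db, "IGLV2-14*01_G132A_G168T", "J_CALL")
num_cdr3 <- count_unique(db, "IGLV2-14*01_G132A_G168T", "JUNCTION")
```

In the toy data both counts come out as 2: the three records for the novel allele use two distinct J calls and two distinct junction sequences.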
-
In #86154ad and #d8df3eb I changed how perfect matches of the NOVEL_IMGT sequence are counted in the input data. Previously I used the whole sequence as the search pattern; now I use only the substring defined by pos_range (default 1:312). This gives a count much closer to PERFECT_MATCH_COUNT (763 now, versus around 150 before this commit; PERFECT_MATCH_COUNT is 836). The remaining difference must be due to N characters.
  GERMLINE_CALL                NOTE       POLYMORPHISM_CALL NT_SUBSTITUTIONS
1   IGLV2-14*01 Novel allele found! IGLV2-14*01_G132A_G168T    132G>A,168G>T
2   IGLV2-14*01 Novel allele found! IGLV2-14*01_G132A_G168T    132G>A,168G>T
  NOVEL_IMGT_COUNT NOVEL_IMGT_UNIQUE_J NOVEL_IMGT_UNIQUE_CDR3 PERFECT_MATCH_COUNT
1              763                   3                    619                 836
2              763                   3                    619                 836
  PERFECT_MATCH_FREQ GERMLINE_CALL_COUNT GERMLINE_CALL_PERC MUT_MIN MUT_MAX
1           0.148227                5640                100       1      10
2           0.148227                5640                100       2      11
  MUT_PASS_COUNT GERMLINE_IMGT_COUNT POS_MIN POS_MAX Y_INTERCEPT Y_INTERCEPT_PASS
1           4123                   0       1     312       0.125                2
2           3582                   0       1     312       0.125                2
  SNP_PASS UNMUTATED_COUNT UNMUTATED_FREQ UNMUTATED_SNP_J_GENE_LENGTH_COUNT
1     4006            1577      0.2796099                                41
2     3466             836      0.1482270                                19
  SNP_MIN_SEQS_J_MAX_PASS ALPHA MIN_SEQS J_MAX MIN_FRAC
1                       1  0.05       50  0.15     0.75
2                       1  0.05       50  0.15     0.75
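To illustrate the counting change described in this comment, here is a minimal sketch that restricts the comparison to the window given by pos_range. It is not the code from the commits: `count_matches` is a hypothetical helper, it assumes all sequences are aligned to the same coordinates so the same window can be cut from each read, and the sequences are short toy strings.

```r
# Hypothetical helper: count input sequences that match the novel sequence
# exactly within the window defined by pos_range
count_matches <- function(seqs, novel, pos_range = 1:312) {
  first <- min(pos_range)
  last  <- max(pos_range)
  pattern <- substr(novel, first, last)
  sum(substr(seqs, first, last) == pattern)
}

# Toy sequences for illustration only
novel <- "ACGTACGTAC"
seqs  <- c("ACGTACGTAC",  # identical over the full length
           "ACGTACTTTT",  # differs only after position 6
           "ACGTANGTAC",  # contains an N at position 6
           "TTGTACGTAC")  # differs at positions 1 and 2

full_count   <- count_matches(seqs, novel, pos_range = 1:10)
window_count <- count_matches(seqs, novel, pos_range = 1:6)
```

On the toy data the full-length comparison finds 1 match and the 1:6 window finds 2. The read with an N inside the window is still not counted, which is consistent with N characters explaining a residual difference.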
The output of inferGenotype:
      GENE        ALLELES COUNTS TOTAL NOTE
1 IGLV2-14 01_G132A_G168T    152   152
The output of genotypeFasta:
IGLV2-14*01_G132A_G168T
"CAGTCTGCCCTGACTCAGCCTGCCTCC---GTGTCTGGGTCTCCTGGACAGTCGATCACCATCTCCTGCACTGGAACCAGCAGTGACGTTGGT---------GGTTATAACTATGTCTCCTGGTACCAACAACACCCAGGCAAAGCCCCCAAACTCATGATTTATGATGTC---------------------AGTAATCGGCCCTCAGGGGTTTCT---AATCGCTTCTCTGGCTCCAAG------TCTGGCAACACGGCCTCCCTGACCATCTCTGGGCTCCAGGCTGAGGACGAGGCTGATTATTACTGCAGCTCATATACAAGCAGCAGCACTCTC"
The output of reassignAlleles:
> table(tmp$V_CALL_GENOTYPED)
IGLV2-14*01_G132A_G168T
                   5640
-
- changed status to resolved