Weird observedMutations results
Dear Shazam developers, I am getting unexpectedly high mu_freq and mu_count values in my dataset, but when I run the same sequences through IgBLAST they show 100% identity to the germline. Here is the sequence with the highest mu_freq in my data as an example.
tmp$sequence_alignment
[1] "CAGCTGGTGGAGTCTGGGGGAGGCTTAGTGAAGCCTGGAGGGTCCCTGAAACTCTCCTGTGCAGCCTCTGGATTCACTTTCAGTGACTATGGAATGCACTGGGTTCGTCAGGCTCCAGAGAAGGGGCTGGAGTGGGTTGCATACATTAGTAGTGGCAGTAGTACCATCTACTATGCAGACACAGTGAAGGGCCGATTCACCATCTCCAGAGACAATGCCAAGAACACCCTGTTCCTGCAAATGACCAGTCTGAGGTCTGAGGACACGGCCATGTATTACTGTGCAAGGCCTCATTACTACGGTAGTAGAGGGTACTTCGATGTCTGGGGCACAGGGACCACGGTAACCGTCTCCTCAG"
tmp$germline_alignment
[1] "CAGCTGGTGGAGTCTGGGGGAGGCTTAGTGAAGCCTGGAGGGTCCCTGAAACTCTCCTGTGCAGCCTCTGGATTCACTTTCAGTGACTATGGAATGCACTGGGTTCGTCAGGCTCCAGAGAAGGGGCTGGAGTGGGTTGCATACATTAGTAGTGGCAGTAGTACCATCTACTATGCAGACACAGTGAAGGGCCGATTCACCATCTCCAGAGACAATGCCAAGAACACCCTGTTCCTGCAAATGACCAGTCTGAGGTCTGAGGACACGGCCATGTATTACTGTGCAAGGNNNNATTACTACGGTAGTAGNNGGTACTTCGATGTCTGGGGCACAGGGACCACGGTCACCGTCTCCTCAG"
tmp$germline_alignment_d_mask
[1] "CAGCTGGTGGAGTCTGGGGGA...GGCTTAGTGAAGCCTGGAGGGTCCCTGAAACTCTCCTGTGCAGCCTCTGGATTCACTTTC............AGTGACTATGGAATGCACTGGGTTCGTCAGGCTCCAGAGAAGGGGCTGGAGTGGGTTGCATACATTAGTAGTGGC......AGTAGTACCATCTACTATGCAGACACAGTGAAG...GGCCGATTCACCATCTCCAGAGACAATGCCAAGAACACCCTGTTCCTGCAAATGACCAGTCTGAGGTCTGAGGACNNNNNNNNNNNNNNNNNNNNNNGGTACTTCGATGTCTGGGGCACAGGGACCACGGTCACCGTCTCCTCAG"
tmp$mu_freq
[1] 0.607717
tmp$mu_count
[1] 189
And here’s the IgBLAST result for this exact sequence:
And here is the code that produced my result. I used AssignGenes.py to align and CreateGermlines.py to add the germline information:
library(shazam)
data_h_freq_comb <- observedMutations(h_bcr, sequenceColumn = "sequence_alignment",
                                      germlineColumn = "germline_alignment_d_mask",
                                      regionDefinition = NULL,
                                      frequency = TRUE,
                                      combine = TRUE)
Any suggestions and help would be appreciated!!! Thanks!
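One thing that can be checked directly from the output above is whether both columns use the same coordinate system: germline_alignment_d_mask contains IMGT gap characters ("."), while sequence_alignment does not. The sketch below is only an illustration in base R, not shazam's implementation; count_mismatches is a hypothetical helper and the short sequences are made up. It shows how removing the gaps from one of two otherwise identical sequences shifts the positions and produces spurious mismatches.

```r
# Hypothetical helper, NOT part of shazam: naive position-by-position comparison.
count_mismatches <- function(query, germ) {
  q <- strsplit(toupper(query), "")[[1]]
  g <- strsplit(toupper(germ), "")[[1]]
  n <- min(length(q), length(g))   # compare only the overlapping positions
  q <- q[seq_len(n)]
  g <- g[seq_len(n)]
  # Skip positions where either sequence has an N or a gap character.
  skip <- c("N", ".", "-")
  informative <- !(q %in% skip) & !(g %in% skip)
  sum(q[informative] != g[informative])
}

# Made-up toy sequences: identical apart from the IMGT gaps.
germ_gapped  <- "CAGGGA...GGCTTAGTG"
seq_gapped   <- "CAGGGA...GGCTTAGTG"
seq_ungapped <- gsub(".", "", seq_gapped, fixed = TRUE)

# Quick sanity checks on the inputs.
has_gaps    <- grepl(".", c(seq_ungapped, germ_gapped), fixed = TRUE)
same_length <- nchar(seq_ungapped) == nchar(germ_gapped)

mm_gapped   <- count_mismatches(seq_gapped, germ_gapped)    # same coordinates
mm_ungapped <- count_mismatches(seq_ungapped, germ_gapped)  # shifted coordinates
print(c(gapped = mm_gapped, ungapped = mm_ungapped))
```

If the two columns differ in length, or only one of them contains ".", the inputs are probably not in the same IMGT-gapped coordinates and the mutation counts should not be trusted.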
Comments (4)
- changed status to resolved
Reopen if needed
-
reporter Hi, sorry for the late response. I forgot to mention that this sequence is from mouse, not human. But thanks for the workflow, I will try it.
I have another issue that is not quite related to this question, but if you don’t mind, I will just post it here.
I noticed that in the IMGT database from the docker container, the TRBC1 constant sequence is missing. Is there a reason TRBC1 is skipped?
less /usr/local/share/germlines/imgt/mouse/imgt_mouse_TRBC.fasta
The file contains only 2 sequences:
>M26057+M26058+M26059+M26060|TRBC2*01|Mus_musculus_B10.A|F|EX1+EX2+EX3+EX4|M26057:63..437;M26058:48..65;M26059:29..135;M26060:78..95|519 nt|1|+1| | | |519+0=519| | |
naggatctgagaaatgtgactccacccaaggtctccttgtttgagccatcaaaagcagag
attgcaaacaaacaaaaggctaccctcgtgtgcttggccaggggcttcttccctgaccac
gtggagctgagctggtgggtgaatggcaaggaggtccacagtggggtcagcacggaccct
caggcctacaaggagagcaattatagctactgcctgagcagccgcctgagggtctctgct
accttctggcacaatcctcgaaaccacttccgctgccaagtgcagttccatgggctttca
gaggaggacaagtggccagagggctcacccaaacctgtcacacagaacatcagtgcagag
gcctggggccgagcagactgtggaatcacttcagcatcctatcatcagggggttctgtct
gcaaccatcctctatgagatcctactggggaaggccaccctatatgctgtgctggtcagt
ggcctagtgctgatggccatggtcaagaaaaaaaattcc
>AE000665|TRBC2*03|Mus_musculus_129|F|EX1+EX2+EX3+EX4|166812..167186+167692..167709+167854..167960+168243..168260|519 nt|1|+1| | | |519+0=519| | |
naggatctgagaaatgtgactccacccaaggtctccttgtttgagccatcaaaagcagag
attgcaaacaaacaaaaggctaccctcgtgtgcttggccaggggcttcttccctgaccac
gtggagctgagctggtgggtgaatggcaaggaggtccacagtggggtcagcacggaccct
caggcctacaaggagagcaattatagctactgcctgagcagccgcctgagggtctctgct
accttctggcacaatcctcgaaaccacttccgctgccaagtgcagttccatgggctttca
gaggaggacaagtggccagagggctcacccaaacctgtcacacagaacatcagtgcagag
gcctggggccgagcagactgtggaatcacttcagcatcctatcatcagggggttctgtct
gcaaccatcctctatgagatcctactggggaaggccaccctatatgctgtgctggtcagt
ggcctggtgctgatggccatggtcaagaaaaaaaattcc
Thanks!
Shaowen
-
We use this query to get the reference germlines from IMGT: https://www.imgt.org/genedb/GENElect?query=14.1+TRBC&species=Mus. It only returns these two sequences, no TRBC1. I can see on other pages of IMGT that TRBC1 exists, but I don’t know why it is not part of the results of the query. I will get back to you.
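To double-check which alleles a local reference file contains, something like the following could be used. This is a minimal sketch in base R, assuming IMGT-style headers in which the allele name is the second "|"-delimited field; list_alleles is a hypothetical helper, and the records are shortened copies of the two entries quoted above.

```r
# Hypothetical helper: return the allele names (second "|"-delimited field)
# from the header lines of an IMGT-style FASTA file.
list_alleles <- function(lines) {
  headers <- lines[startsWith(lines, ">")]
  fields  <- strsplit(headers, "|", fixed = TRUE)
  unname(vapply(fields, function(f) f[2], character(1)))
}

# Inline example with shortened headers and sequences.
# For a real file, use: lines <- readLines("path/to/imgt_mouse_TRBC.fasta")
fasta <- c(
  ">M26057+M26058+M26059+M26060|TRBC2*01|Mus_musculus_B10.A|F|EX1+EX2+EX3+EX4",
  "naggatctgagaaatgtgac",
  ">AE000665|TRBC2*03|Mus_musculus_129|F|EX1+EX2+EX3+EX4",
  "naggatctgagaaatgtgac"
)

alleles   <- list_alleles(fasta)
has_trbc1 <- any(grepl("^TRBC1\\*", alleles))  # is any TRBC1 allele present?
print(alleles)
print(has_trbc1)
```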
Hi Shaowen. I noticed that the sequences in sequence_alignment and germline_alignment don’t have IMGT gaps. Immcantation uses IMGT-aligned sequences in the *_alignment fields. So I created a FASTA file with your tmp$sequence_alignment sequence and repeated your analysis: AssignGenes, CreateGermlines, observedMutations. The mutation frequency I get is 0.1083591. The commands I used to run AssignGenes and CreateGermlines:
In R: