MakeDb.py - duplicated germline solutions
When I ran MakeDb.py on rhesus monkey data processed with IgBLAST, I sometimes got duplicated germline solutions for the same sequence. An example in D_CALL is attached.
Comments (6)
-
-
reporter Hello. The problem is that the candidate germlines are identical: there are two identical options. Why do we need both of them?
-
Because the genes are only a perfect match in the IgBLAST alignment of the D, which is only 7 nucleotides in length for the above example. The actual germline genes differ, as shown below.
>IGHD3-16*01
GTATTATGATTACGTTTGGGGGAGTTATGCTTATACC
>IGHD3-3*01
GTATTACGATTTTTGGAGTGGTTATTATACC
>IGHD3-3*02
GTATTAGCATTTTTGGAGTGGTTATTATACC
Presumably the alignment is only 7 nucleotides in length due to exonuclease activity during V(D)J recombination.
IgBLAST reports them all as the top hit, and we retain them all after parsing, because any of them could be the true germline gene. The alignment alone doesn't contain enough information to distinguish between them.
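To illustrate how distinct germline genes can still tie on a short alignment, here is a minimal Python sketch that looks for k-mers shared by the three germline sequences quoted above. The helper names (kmers, shared_kmers) are made up for this example, and a shared 7-mer is only a stand-in for the kind of short stretch that can match every candidate equally well; it is not necessarily the stretch IgBLAST aligned in the record discussed here.

```python
# Germline sequences copied from the comment above.
GERMLINES = {
    "IGHD3-16*01": "GTATTATGATTACGTTTGGGGGAGTTATGCTTATACC",
    "IGHD3-3*01": "GTATTACGATTTTTGGAGTGGTTATTATACC",
    "IGHD3-3*02": "GTATTAGCATTTTTGGAGTGGTTATTATACC",
}


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(seqs, k):
    """Return the k-mers present in every sequence in seqs."""
    sets = [kmers(s, k) for s in seqs]
    return set.intersection(*sets)


# Hypothetical helper usage: 7-mers found in all three germline genes.
shared_7 = shared_kmers(GERMLINES.values(), 7)
print(sorted(shared_7))
```

Any read whose D segment has been trimmed down to one of these shared stretches would match all three genes with 100% identity, even though the full-length genes differ.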
This is something that can be resolved later using some sort of counting approach, like what is done in TIgGER or IgDiscover. For example, you might find that IGHD3-3*02 is very rare in a pool of sequences, and might therefore conclude that the individual does not possess IGHD3-3*02, so you could remove those ambiguous calls.

For most analyses, it's usually safe to use something like alakazam::getAllele(db$D_CALL, first=TRUE) to extract only the first call, use those for analysis, and just throw out very rare genes. But if your research question concerns genotyping and alleles, then you'll need some extra steps to resolve the ambiguous calls.

I'm still not certain this is exactly the problem you are describing though. Maybe?
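As a rough illustration of the counting idea, here is a simplified Python sketch. It is not the algorithm used by TIgGER or IgDiscover. It only estimates allele frequencies from unambiguous calls and drops poorly supported alleles from ambiguous ones. The function names, the toy call list, and the 5% threshold are all assumptions made for the example.

```python
from collections import Counter


def first_call(call):
    """Return the first entry of a comma-separated call string."""
    return call.split(",")[0].strip()


def allele_frequencies(calls):
    """Estimate allele frequencies from unambiguous (single) calls only."""
    single = [c.strip() for c in calls if c and "," not in c]
    counts = Counter(single)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def drop_rare(call, freqs, min_freq=0.05):
    """Remove alleles below min_freq from an ambiguous call.

    Falls back to the original call if nothing would survive.
    """
    kept = [c.strip() for c in call.split(",")
            if freqs.get(c.strip(), 0.0) >= min_freq]
    return ",".join(kept) if kept else call


# Toy data (made up): mostly unambiguous calls plus one three-way tie.
ambiguous = "IGHD3-16*02,IGHD3-3*01,IGHD3-3*02"
calls = ["IGHD3-3*01"] * 8 + ["IGHD3-16*02"] * 2 + [ambiguous]

freqs = allele_frequencies(calls)
resolved = drop_rare(ambiguous, freqs)
print(resolved)              # IGHD3-3*02 has no unambiguous support
print(first_call(resolved))
```

In this toy pool IGHD3-3*02 never appears as an unambiguous call, so it is dropped from the tie. Real genotyping would need more care than this, for example handling sequencing depth and novel alleles.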
-
reporter Sorry, now I see that my example was wrong. For some sequences, the D_CALL column (after MakeDb) contains output like
IGHD2-1*01,IGHD2-1*01,IGHD2-2*01
As you can see, there are two identical germline options, and this was my question.
-
Ah, that's different. That shouldn't be the case.
What does the # V-(D)-J rearrangement summary section of the IgBLAST output look like for that record?

I suspect there might somehow be duplicate records in the IgBLAST database.
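If you want to test the duplicate-record hypothesis yourself, a small Python sketch along these lines might help. It assumes the germline FASTA used to build the IgBLAST database has the allele name as the first whitespace-delimited token of each header line. The function names and the toy FASTA lines are hypothetical. The second helper is only a post-hoc workaround for the D_CALL string and does not fix the underlying cause.

```python
from collections import Counter


def duplicate_fasta_ids(lines):
    """Return sequence IDs that occur more than once in FASTA header lines."""
    ids = []
    for line in lines:
        header = line[1:].strip() if line.startswith(">") else ""
        if header:
            # Assumption: the ID is the first whitespace-delimited token.
            ids.append(header.split()[0])
    counts = Counter(ids)
    return [seq_id for seq_id, n in counts.items() if n > 1]


def dedupe_calls(call):
    """Drop repeated entries from a comma-separated call, keeping order."""
    parts = [c.strip() for c in call.split(",") if c.strip()]
    return ",".join(dict.fromkeys(parts))


# Toy FASTA lines (made up) with one repeated record.
toy_fasta = [">A*01", "ACGT", ">B*01", "ACGA", ">A*01", "ACGT"]
print(duplicate_fasta_ids(toy_fasta))

# The D_CALL value reported in this thread.
print(dedupe_calls("IGHD2-1*01,IGHD2-1*01,IGHD2-2*01"))
```

To check a real file, pass the open file handle, e.g. duplicate_fasta_ids(open(path)), where path points at your germline FASTA.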
-
- changed status to resolved
Not sure about the cause or resolution here, but no news is good news?
Greetings @hodihalev,
Could you please clarify the problem you're seeing? I looked at the example you attached and I don't see anything unusual about the D_CALL column.

IgBLAST will sometimes assign multiple gene calls as a top hit. For example:
In the IgBLAST output above, it assigned the D to IGHD3-16*02,IGHD3-3*01,IGHD3-3*02, because all three genes matched with 100% identity.

Is this what you're seeing?