MakeDb.py - duplicated germline solutions

Issue #112 resolved
hodi halev created an issue

when I used MakeDb.py on data of rhesus monkey with IgBlast, I sometimes got duplicated germline solutions for the same sequence. An example in D_CALL is attached.

Comments (6)

  1. Jason Vander Heiden

    Greetings @hodihalev,

    Could you please clarify the problem you're seeing? I looked at the example you attached and I don't see anything usual about the D_CALL column.

    IgBLAST will sometimes assign multiple gene calls as a top hit. For example:

    # V-(D)-J rearrangement summary for query sequence (Top V gene match, Top D gene match, Top J gene match, Chain type, stop codon, V-J frame, Productive, Strand).  Multiple equivalent top matches, if present, are separated by a comma.
    IGHV5-51*01 IGHD3-16*02,IGHD3-3*01,IGHD3-3*02   IGHJ6*02,IGHJ6*04   VH  No  In-frame    Yes +
    
    # Hit table (the first field indicates the chain type of the hit)
    # Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, gaps, q. start, q. end, s. start, s. end, evalue, bit score, query seq, subject seq, BTOP
    # 9 hits found
    V   SRR765688.35420 IGHV5-51*01 89.062  256 28  0   0   21  276 41  296 2.01e-87    313 
    V   SRR765688.35420 IGHV5-51*03 89.243  251 27  0   0   23  273 43  293 5.14e-86    308 
    V   SRR765688.35420 IGHV5-51*02 88.281  256 30  0   0   21  276 41  296 1.51e-85    307 
    D   SRR765688.35420 IGHD3-16*02 100.000 7   0   0   0   306 312 30  36  11      14.4
    D   SRR765688.35420 IGHD3-3*01  100.000 7   0   0   0   280 286 24  30  11      14.4
    D   SRR765688.35420 IGHD3-3*02  100.000 7   0   0   0   280 286 24  30  11      14.4
    J   SRR765688.35420 IGHJ6*02    96.429  28  1   0   0   319 346 8   35  1.30e-09    48.1
    J   SRR765688.35420 IGHJ6*04    96.429  28  1   0   0   319 346 8   35  1.30e-09    48.1
    J   SRR765688.35420 IGHJ6*01    96.296  27  1   0   0   319 345 8   34  5.13e-09    46.1
    

    In the IgBLAST output above, it assigned the D to IGHD3-16*02,IGHD3-3*01,IGHD3-3*02, because all three genes matched with 100% identity.

    Is this what you're seeing?

  2. hodi halev reporter

    hello, The problem is the identity between the optional germlines, two identical options. Why do we need both of them?

  3. Jason Vander Heiden

    Because the genes are only a perfect match in the IgBLAST alignment of the D, which is only 7 nucleotides in length for the above example. The actual germline genes differ, as shown below.

    >IGHD3-16*01
    GTATTATGATTACGTTTGGGGGAGTTATGCTTATACC
    >IGHD3-3*01
    GTATTACGATTTTTGGAGTGGTTATTATACC
    >IGHD3-3*02
    GTATTAGCATTTTTGGAGTGGTTATTATACC
    

    Presumably the alignment is only 7 nucleotides in length due to exonuclease activity during V(D)J recombination.

    The reason why IgBLAST reports them all as the top hit, and why we retain them all after parsing, is because they are all potentially the truth. There's no way to distinguish which is the true germline gene from an alignment, because there's not enough information.

    This is something that can be done later using some sort of counting approach like what is done in Tigger or IgDiscover. For example, you might find that IGHD3-3*02 is very rare in a pool of sequences, and might therefore conclude that that individual does not possess IGHD3-3*02, so you could remove those ambiguous calls.

    For most analysis, it's usually safe to use something like alakazam::getAllelle(db$D_CALL, first=TRUE) to extract only the first gene call, use those for analysis, and just throw out very rare genes. But if your research question concerns genotyping and alleles, then you'll need some extra steps to resolve the ambiguous calls.

    I'm still not certain this is exactly the problem you are describing though. Maybe?

  4. hodi halev reporter

    Sorry, now I see that my example was wrong. In some sequences I got in D_CALL column (after MakeDb) output like

    IGHD2-1*01,IGHD2-1*01,IGHD2-2*01
    

    As you can see, there are two identical germline options, and this was my question.

  5. Jason Vander Heiden

    Ah, that's different. That shouldn't be the case.

    What does the # V-(D)-J rearrangement summary section of the IgBLAST output look like for that record?

    I suspect there might somehow be duplicate records in the IgBLAST database.

  6. Log in to comment