junction_aa contains AA beyond conserved F residue

Issue #110 resolved
Scott Christley created an issue

I noticed this with the old AIRR formats branch, so I moved to the latest code on default and still see the error. The junction_aa field (AIRR format) contains amino acids beyond the trailing conserved F residue. This input sequence

GAGTCGGCTGCTCCCTCCCAAACATCTGTGTACTTCTGTGCCAGCAGCCCGGGGCAAGCCTACAATCAGCCCCAGCATTTTGGTGAT

will produce CASSPGQAYNQPQHFGD when it should be CASSPGQAYNQPQHF. This is not specific to AIRR format. In the normal ChangeO output, it gives this as the junction sequence:

GAGTCGGCTGCTCCCTCCCAAACATCTGTGTACTTCTGTGCCAGCAGCCCGGGGCAAGCCTACAATCAGCCCCAGCATTTTGGTGAT

and TTT is the conserved F, so it goes too far.

Comments (20)

  1. Jason Vander Heiden

    Can you email me or post the full input sequence? It's probably a D/J start/end problem.

    Also, what germlines where used? Species?

    This is IgBLAST output, yes?

  2. Scott Christley reporter

    Sorry, cut/paste the wrong sequence for the junction. The full sequence is correct:

    GAGTCGGCTGCTCCCTCCCAAACATCTGTGTACTTCTGTGCCAGCAGCCCGGGGCAAGCCTACAATCAGCCCCAGCATTTTGGTGAT

    but the junction sequence is:

    TGTGCCAGCAGCCCGGGGCAAGCCTACAATCAGCCCCAGCATTTTGGTGAT

    This is human TCR with IMGT germline and IgBlast. Note that the issue isn't just this one sequence, but happens with all of the sequences. I've also tried with a 454 test dataset and see the same, so I think it's a general problem. I can send additional data if you need.

  3. Jason Vander Heiden

    We might need to add another motif for the TCR J genes or something. I can take a look.

    So the full sequence is only 88 nucleotides long? Can you send me any more examples? Preferably full length sequences if they exist.

  4. Jason Vander Heiden

    Sorry for the slow. I've been out sick. Looks like it's caused by short sequences without the full (F|W)GXG motif present in the J, so the junction just runs to the end of the sequence because it can't find the motif.

    Doesn't look too hard to fix. I just need change it so that the motif is searched for in the full germline sequence, and the indexing altered accordingly. I'll try to sort that out tomorrow.

  5. Scott Christley reporter

    Interesting, I didn't know such a motif existed. Our own tool uses a different technique which is why I've not noticed this earlier, it's only with the recent AIRR work that I'm starting to use ChangeO annotations for analysis. No rush, but sounds like our first AIRR test case! Hope you get well soon!

  6. Jason Vander Heiden

    I just pushed a fix and it seems to work with both the data you sent and my internal test data.

    Let me know if it's still buggy. If so, we might want to consider just dropping this bit of code in favor of requiring IgBLAST 1.5.0+. 1.5.0 has been out long enough, so I don't know that we need to do our own CDR3 extraction anymore.

    I haven't check the quality of IgBLAST's CDR3 calls, but that shouldn't be a huge deal to do.

  7. Scott Christley reporter

    We just got a new dataset from Adaptive, and I'm running it through with your recent fix. It gets some of the CDR3's right but there are still many that are incorrect. Here is an example sequence:

    GGGTTGGAGTCGGCTGCTCCCTCCCAAACATCTGTGTACTTCTGTGCCAGCAGAGAAGGGACAGGGTATGAAGCTTTCTTTGGACAA

    Change-O reports CASREGTGYEAFFGQ while Adaptive and our internal tool report CASREGTGYEAFF

  8. Jason Vander Heiden

    Ugh. Adaptive. I'll take a look at it soon. In the middle of system rebuild today.

    Though, I talked to Jian at the AIRR meeting and I really think we can just drop support for old IgBLAST versions and use the CDR3 calls from IgBLAST directly nowadays.

  9. Scott Christley reporter

    It's on my todo list to upgrade IgBlast, if it doesn't break the parser code then it should be easy.

  10. Jason Vander Heiden

    Hey, I can't reproduce the error with the current (tip) of changeo and IgBLAST 1.7. I get the following results from changeo and raw IMGT/V-QUEST and IgBLAST:

     CHANGEO_JUNCTION> TGTGCCAGCAGAGAAGGGACAGGGTATGAAGCTTTCTTT
         CHANGEO_CDR3>    GCCAGCAGAGAAGGGACAGGGTATGAAGCTTTC
      IGBLAST1.7_CDR3>    GCCAGCAGAGAAGGGACAGGGTATGAAGCTTTC
    IMGT/V-QUEST_CDR3>    GCCAGCAGAGAAGGGACAGGGTATGAAGCTTTC
    

    I should still deprecate our CDR3 search tool in favor of passing it through IgBLAST, but it might be nice to know if the code was correct...

  11. Jason Vander Heiden

    Oh, and I added this sequence to the test data set in the repo under tests/data: igblast1.7_tr.fmt7 and reads_tr.fasta.

  12. Scott Christley reporter

    Think that worked, that sequence and others look good. I'm still on IgBlast 1.4.

    I'm noticing something else, some junction_aa in the AIRR file are ending with X, think this might be a translation thing as it seems to occur when junction length isn't a multiple of 3. The sequences are tagged as functional=T though. Here is a few examples:

    seq: TGTGAGCACCTTGGAGCTGGGGGACTCGGCCCTTTATCTTTGCGCCAGCAGCTTGGGCGGGGAGACCGGGGAGCTGTTTTTGGAGAA
    
    junction_nt: TGCGCCAGCAGCTTGGGCGGGGAGACCGGGGAGCTGTTTTT
    junction_aa: CASSLGGETGELFX
    
    seq: CACGTTGGCGTCTGCTGTACCCTCTCAGACATCTGTGTACTTCTGTGCCAGCAGTGACTCTGAGGCCGGGGAGCTGTTTTTGGAGAA
    
    junction_nt: TGTGCCAGCAGTGACTCTGAGGCCGGGGAGCTGTTTTT
    junction_aa: CASSDSEAGELFX
    

    I'm not sure if that F before the X is considered the conserved residue. According to our tool, it thinks these sequences are missing the conserved TRP/PHE. IMGT seems to say the same thing. IMGT adds in the 'G' so the X becomes a L. Here's an example for a non-functional sequence

    seq: GACGATCCAGCGCACACAGCAGGAGGACTCGGCCGTGTATCTCTGTGCCGCACCTCCTCCCACGGAGATACGCAGTATTTTGGCCCA
    
    junction_nt: TGTGCCGCACCTCCTCCCACGGAGATACGCAGTATTTT
    junction_aa: CAAPPPTEIRSIX
    
  13. Jason Vander Heiden

    Weird. I didn't change anything in the code... Just tested it.

    Yes, the X is a frame problem. We pad the end with gaps to complete the translation in that case, which makes it an X.

    We could truncate it and remove the extra bases instead. Not sure which would be more "correct". Either way sounds good to me. Preference?

  14. Scott Christley reporter

    From a biological perspective, the translation hasn't stopped at that nucleotide, there are more nucleotides afterward and so an actually amino acid will get translated. That seems to be IMGT's perspective as they fill the gap with the actual nucleotides in the sequence, thus giving a valid amino acid translation.

    In some of our analysis code, we might through away CDR3's that have X's in them, presuming that those are caused by N's in the sequence, so my preference would be to not have the X. Whether you translate forward, or trim the extra bases, not sure I have a preference.

  15. Jason Vander Heiden

    Hrm. Okay. I think I'm leading towards truncate?

    Just because I'd rather not add amino acid to the junction that aren't within its boundaries, even if that means losing a nucleotide or two. (Missing data is better than bad data?)

  16. Scott Christley reporter

    Sounds good to me. Some of these are situations were the TRP/PHE is missing, thus the CDR3 end is ambiguous anyways, so as long as consistent in the logic, I think its good.

    So do you actually trim the nucleotides from junction_nt, or just trim for the translation? I'm thinking you only trim for the translation...

  17. Jason Vander Heiden

    Only for the translation, because Biopython's translate function chokes if the sequence isn't a multiple of 3.

  18. Jason Vander Heiden

    Okay, I changed it to trim instead of pad when the junction length isn't a multiple of 3.

  19. Log in to comment