unexpected junctions/ junction lengths from "MakeDb.py igblast "

Issue #49 resolved
Former user created an issue

I had a couple very worrying observations about junctions in igblast output parsed by MakeDb.

  • junction lengths (even those of functional, in-frame sequences) are commonly not a multiple of three
  • the modulus of junction lengths (of functional, in-frame sequences) is fairly evenly distributed
  • these observations are also true of the junction sequence itself , not just the "JUNCTION_LENGTH" field
  • when i compared a few MakeDb junctions by eye to CDR3s from our old igblast parser (or from VQuest), it seemed that the MakeDb junction is missing several nucleotides on each side (my expectation is that it would typcially be longer by a codon on each end ?)
MakeDb.py igblast -s  test.fasta -i  test.vs.igblast_IG.fmt7
import pandas as pd 
df = pd.read_csv('test.vs.igblast_IG_db-pass.tab', index=None)
(df['JUNCTION_LENGTH'] % 3).value_counts()

I haven't looked at TCR sequences at all.

We haven't tested the bleeding edge of changeo (last results tested were commit #234), but i've looked back at parsed output from commits spanning the last 4-6 weeks. We're using igblast 1.3.0 .

I guess the first thing to do is test with the latest changeo commit and latest igblast .

Obviously this is a show stopper for us, so let me know what we can do to test / work on.

If it is seems this will take a while to fix, we'll probably migrate back to our old parser for a while.

Comments (27)

  1. Namita Gupta

    The junction we are pulling out currently is that which is defined by IgBlast and it doesn't follow IMGT definitions at all. According to their definition, the junction returned is simply the rearrangement region + 5 nucleotides into the V and 5 nucleotides into the J. This is a known issue with IgBlast and we have not yet implemented anything to recreate the IMGT-defined CDR3/junction region. I'm not sure if your old parser was pulling out the CDR3 in some other way from the IgBlast results?

  2. Former user Account Deleted

    I remember this igblast issue, but i thought you had implemented something to look for the conserved CYS/TRP (that's what we did, but we didn't try all that hard to look for the CORRECT CYS/TRP). ooookay.

    still, aren't you surprised that junction length is not = (multiple of 3) + 5 +5 ?

    or does is the definition of rearrangement region very sensitive to the alignment accuracy - to the point where you can't even see an enrichment of 3-multiple lengths ?

  3. Namita Gupta

    Yeah, we hadn't gotten to implementing anything like that yet.

    I am guessing that in most cases, the rearrangement itself doesn't have to be in multiples of three, but rather where in the V/J the conserved amino acids are determine its multiple of three-ness. I think there is too much missing information without the lengths of V and J that are supposed to be included to assume that the IgBlast junctions should have any biases in length.

  4. Former user Account Deleted

    ok thanks for explaining.

    I guess you can close the issue , since it's sort of expected behavior on your end .

  5. Namita Gupta

    MakeDb needs a way of recreating IMGT CDR3/junction from IgBlast results. But as it stands, behavior is as expected.

  6. Former user Account Deleted

    do you have any sense for whether this places the reported Junction in a consistent frame ?

  7. Namita Gupta

    Not really, I think you could get an idea by looking at the distribution of how many V nucleotides tend to be in the junction region from IMGT results. If there is a pattern to that, then it should be reflected in the IgBlast results.

  8. Namita Gupta

    Not really, I think you could get an idea by looking at the distribution of how many V nucleotides tend to be in the junction region from IMGT results. If there is a pattern to that, then it should be reflected in the IgBlast results.

  9. Former user Account Deleted

    no (if i understand what you're suggesting) , the absence of the pattern is how i noticed this behavior in the first place ...

    (df['JUNCTION_LENGTH'] % 3).value_counts() was very flat.

    i guess i was hoping that the noise was at the end rather than the beginning, and there was a change of knowing the correct frame.

    did you guys ever look at higblast (shugay-associated wrapper+parser)

  10. Namita Gupta

    I actually meant if you look at IMGT results (which have correct junctions) and look at the distribution of the length of the V-region that is in the junction (and same for J), that can give you a hint about whether the noise is at the beginning or end of the junction.

    I remember vaguely looking at it, but from what Jason recalls, it only looked for the beginning conserved residue, and not the residue on the J-end of the junction.

  11. Jason Vander Heiden

    Just talked to Gur about Shugay wrapper... He didn't have a lot of luck with it. Sounds like it doesn't really work.

    He's also going to send his IgBLAST CDR3 correction code our way to look at. If you have something as well, we could see if we can add it to MakeDb?

  12. David Koppstein

    Hi Jason, could you clarify what Gur said about the Shugay wrapper?

    The reason I ask is because we'd like to have the CDR1, 2, and 3 nucleotide/amino acid sequences for our antibody synthesis team (they need to know if certain aa's, e.g. lysine, are in the CDRs because of the chemistry they use). Currently, the Shugay wrapper is the only igblast tool I could find that explicitly outputs these. We can get it to run, but if it's giving the wrong answer, we'd like to know.

    Of course, we'd also like to have the coding VDJ sequence for antibody synthesis in addition to -- I think SEQUENCE_VDJ isn't quite the same thing as the coding sequence (Sonia mod3'd it too). Currently, our old in-house igblast wrapper gives us that information, but we'd like to migrate to using the Kleinstein stack as much as possible =)

    I can also take a look at how these regions are parsed out and maybe will take a stab at implementing it in MakeDb.py.

  13. Jason Vander Heiden

    Gur will check with his students, as he recalls, the CDR3 definition is not correct... However, if you find that it does work correctly, maybe the easiest thing is to just add a parser for the higblast output to MakeDb instead of re-implementing what they've done.

  14. David Koppstein

    I just finished a wrapper for higblast (wrapping the wrapper?) that could easily be merged with MakeDb.py output. I'll check the CDR3 calls with our old in-house CDR3-calling script and follow up. Do you guys call the CDR regions on a regular basis, and if so, how?

  15. Jason Vander Heiden

    Okay, let me know. And yes, we do analyze the CDRs a lot, but we do our alignments with IMGT and extract the regions we want from the IMGT-gapped sequences.

  16. David Koppstein

    It looks like the CDR3s from higblast usually agree, although they are also prepended with a cysteine and appended with a tryptophan or phenylalanine. Occasionally they disagree. Higblast also appears to have some special encoding for ambiguous characters that is not well-documented, e.g. CARDEGIA??gYLKPFDYW.

    Higblast appears to do some kind of additional filtering that isn't clear to me from reading the docs. Even when I ran it at the lowest quality threshold, it throws out about 80% of the reads. I think this is because it implicitly tries to also group reads into clonotypes, so I might be comparing apples to oranges.

  17. Former user Account Deleted

    these statements apply to BCR reads only, yes ?

    It still seems unclear which 80% of reads it's throwing out ?

    I wonder if it's throwing out high-N sequences ...OR maybe what it perceives to be shallowly sequenced mRNAs (i forget how it's using the migec number information or how you hacked that )

    the {prepending,appending} of {CYS,[TRP|PHE]} is what i associate as the difference between CDR3 and JUNCTION , though don't know if that's an official definition

  18. David Koppstein

    Yes, I haven't tried it with TCR yet.

    Unfortunately, higblast "forgets" the fasta header so I (warning: ugly hack) encoded the fasta header as a unique read count which higblast does report and then go back and match that with a fasta header.

    Having a canonical way to parse out all the CDR regions and the coding sequence from igblast would quite beneficial for us. Besides the IMGT licensing issues, we're running so many samples now that submitting each to the server by hand is not so feasible. I'd be interested in helping to merge Gur's CDR3-calling code and/or our in-house parser (written before my time at AbVitro so I'm not so familiar with it, located here) into MakeDb.py, as well as adding CDR1/2 and coding sequence functionality (I think Sonia has found that SEQUENCE_VDJ != coding sequence).

  19. Former user Account Deleted

    regarding parseIgblastN.py, which i think DK just linked to you guys.

    Raj Chaari (Church lab) mainly wrote it.

    We compared output vs IMGT and were happy that CDR3 and V/J calls were consistent for IG/TR However, there was a bit of hocus-pocus involved in finding the right frame and dealing with indels, that i don't clearly recall.

    Known issues: - sometimes the V-start and J end are 1-off, and so can't quite be taken literally for the sake of parsing out the VDJ nt - sometimes there is an X in the Vnt sequence - DK just noticed this. it was a temporary placeholder for dealing with indels, but i haven't investigated by it gets propogated. - only tested with 1.3.0 outfmt7, needs outfmt3, - doesn't really look for conserved TRP/PHE, just takes the first in the J-alignment or something like that .

  20. Jason Vander Heiden

    Regarding the prepended/appended conversed residues, that is indeed the official (read: IMGT) distinction between junction and CDR3, so it's actually a bit better for us that it reports the junction instead of the CDR3.

    Heard back from Gur's student, and the problem wasn't the CDR3 definition, but the loss of the sequence identifier, which it seems like you have a solution for.

    I'm guessing the SEQUENCE_VDJ sequence not being the coding sequence is just a frame issue. Wherever IgBLAST says the V starts and J ends is where the sequence starts/ends. Might just have to pad the front or end, depending upon whether V_START % 3 == 0.

    The CDR1/2 and SEQUENCE_VDJ frame issues (assuming that's the problem) can be solved by adding the IMGT gaps to the V region. Gur has some code for this as well, so maybe we can integrate that easily.

    Have y'all taken a look at MiXCR? It's from the same group, but newer. I foresee some issues with it (like being tied to the NCBI germlines), but might be worth a look see.

  21. David Koppstein

    If you're interested in playing with higblast on presto output, here is my presto-compatible higblast wrapper. Caveats: it expect certain strings like IgA or IgK in the PRCONS field because higblast only runs on one locus at a time, so we need to split up the fasta file by locus. Furthermore, it currently outputs only the 'ID' part of the SEQUENCE_ID (and calls it DB_MB...), although that should be easy to change.

    Again, my solution to the loss of sequence identifier was to give each read a unique UMI and read count, and then matching the read count to the sequence identifier afterwards. But it might be doing something under the hood that makes this not a viable solution. So take it cum grano salis.

    Haven't looked at MiXCR, but might try playing with it. What's the issue with the NCBI germlines?

    I remember some of you guys are planning on coming up to Boston in August (I think Sonia and I would also be happy to go down to New Haven). Maybe this might be worth a Kleinstein/AbVitro sprint? We'd like to migrate over to using the Kleinstein stack as much as possible!

  22. Jason Vander Heiden

    Cool, thanks. I'll check it out. I think the NCBI germlines aren't as exhaustive as the IMGT or UNSW databases. Also, you can't genotype, then rerun the alignment against a custom repertoire.

    If we are going to organize a sprint, I think we should look at the non-IgBLAST options as well. But IgBLAST has the advantage of not having to justify its use in a paper.

  23. Former user Account Deleted

    if we should look at the non-IgBLAST options as well, there's https://github.com/churchlab/vdj/

    as of a year ago, Uri was still a proponent of this old tool suite (which is not being maintained, but he just got a faculty position at Mt Sinai and is planning to do vdj-seq as part of his work)

  24. Namita Gupta

    I have updated the IgBLAST parser to make an IMGT gapped sequence and infer the IMGT junction from this as well. This requires passing in the directory with germline sequences to the '-r' repo flag.

  25. Log in to comment