unexpected junctions/ junction lengths from "MakeDb.py igblast "

Issue #49 resolved

Former user created an issue 2015-08-03

I had a couple very worrying observations about junctions in igblast output parsed by MakeDb.

junction lengths (even those of functional, in-frame sequences) are commonly not a multiple of three
the modulus of junction lengths (of functional, in-frame sequences) is fairly evenly distributed
these observations are also true of the junction sequence itself , not just the "JUNCTION_LENGTH" field
when i compared a few MakeDb junctions by eye to CDR3s from our old igblast parser (or from VQuest), it seemed that the MakeDb junction is missing several nucleotides on each side (my expectation is that it would typcially be longer by a codon on each end ?)

MakeDb.py igblast -s  test.fasta -i  test.vs.igblast_IG.fmt7

import pandas as pd 
df = pd.read_csv('test.vs.igblast_IG_db-pass.tab', index=None)
(df['JUNCTION_LENGTH'] % 3).value_counts()

I haven't looked at TCR sequences at all.

We haven't tested the bleeding edge of changeo (last results tested were commit #234), but i've looked back at parsed output from commits spanning the last 4-6 weeks. We're using igblast 1.3.0 .

I guess the first thing to do is test with the latest changeo commit and latest igblast .

Obviously this is a show stopper for us, so let me know what we can do to test / work on.

If it is seems this will take a while to fix, we'll probably migrate back to our old parser for a while.

Comments (27)

Namita Gupta
The junction we are pulling out currently is that which is defined by IgBlast and it doesn't follow IMGT definitions at all. According to their definition, the junction returned is simply the rearrangement region + 5 nucleotides into the V and 5 nucleotides into the J. This is a known issue with IgBlast and we have not yet implemented anything to recreate the IMGT-defined CDR3/junction region. I'm not sure if your old parser was pulling out the CDR3 in some other way from the IgBlast results?
- 2015-08-03T14:28:47+00:00
Former user Account Deleted
I remember this igblast issue, but i thought you had implemented something to look for the conserved CYS/TRP (that's what we did, but we didn't try all that hard to look for the CORRECT CYS/TRP). ooookay.

still, aren't you surprised that junction length is not = (multiple of 3) + 5 +5 ?

or does is the definition of rearrangement region very sensitive to the alignment accuracy - to the point where you can't even see an enrichment of 3-multiple lengths ?
- 2015-08-03T14:50:39+00:00
Namita Gupta
Yeah, we hadn't gotten to implementing anything like that yet.

I am guessing that in most cases, the rearrangement itself doesn't have to be in multiples of three, but rather where in the V/J the conserved amino acids are determine its multiple of three-ness. I think there is too much missing information without the lengths of V and J that are supposed to be included to assume that the IgBlast junctions should have any biases in length.
- 2015-08-03T14:57:59+00:00
Former user Account Deleted
ok thanks for explaining.

I guess you can close the issue , since it's sort of expected behavior on your end .
- 2015-08-03T16:30:42+00:00
Namita Gupta
- changed status to resolved
MakeDb needs a way of recreating IMGT CDR3/junction from IgBlast results. But as it stands, behavior is as expected.
- 2015-08-03T16:32:00+00:00
Former user Account Deleted
do you have any sense for whether this places the reported Junction in a consistent frame ?
- 2015-08-03T18:21:12+00:00
Namita Gupta
Not really, I think you could get an idea by looking at the distribution of how many V nucleotides tend to be in the junction region from IMGT results. If there is a pattern to that, then it should be reflected in the IgBlast results.
- 2015-08-03T18:24:18+00:00
Namita Gupta
Not really, I think you could get an idea by looking at the distribution of how many V nucleotides tend to be in the junction region from IMGT results. If there is a pattern to that, then it should be reflected in the IgBlast results.
- 2015-08-03T18:25:56+00:00
Former user Account Deleted
no (if i understand what you're suggesting) , the absence of the pattern is how i noticed this behavior in the first place ...

(df['JUNCTION_LENGTH'] % 3).value_counts() was very flat.

i guess i was hoping that the noise was at the end rather than the beginning, and there was a change of knowing the correct frame.

did you guys ever look at higblast (shugay-associated wrapper+parser)
- 2015-08-03T18:50:38+00:00
Namita Gupta
I actually meant if you look at IMGT results (which have correct junctions) and look at the distribution of the length of the V-region that is in the junction (and same for J), that can give you a hint about whether the noise is at the beginning or end of the junction.

I remember vaguely looking at it, but from what Jason recalls, it only looked for the beginning conserved residue, and not the residue on the J-end of the junction.
- 2015-08-03T20:09:24+00:00
Jason Vander Heiden
Just talked to Gur about Shugay wrapper... He didn't have a lot of luck with it. Sounds like it doesn't really work.

He's also going to send his IgBLAST CDR3 correction code our way to look at. If you have something as well, we could see if we can add it to MakeDb?
- 2015-08-04T20:02:39+00:00
David Koppstein
Hi Jason, could you clarify what Gur said about the Shugay wrapper?

The reason I ask is because we'd like to have the CDR1, 2, and 3 nucleotide/amino acid sequences for our antibody synthesis team (they need to know if certain aa's, e.g. lysine, are in the CDRs because of the chemistry they use). Currently, the Shugay wrapper is the only igblast tool I could find that explicitly outputs these. We can get it to run, but if it's giving the wrong answer, we'd like to know.

Of course, we'd also like to have the coding VDJ sequence for antibody synthesis in addition to -- I think SEQUENCE_VDJ isn't quite the same thing as the coding sequence (Sonia mod3'd it too). Currently, our old in-house igblast wrapper gives us that information, but we'd like to migrate to using the Kleinstein stack as much as possible =)

I can also take a look at how these regions are parsed out and maybe will take a stab at implementing it in MakeDb.py.
- 2015-08-04T20:37:09+00:00
Jason Vander Heiden
Gur will check with his students, as he recalls, the CDR3 definition is not correct... However, if you find that it does work correctly, maybe the easiest thing is to just add a parser for the higblast output to MakeDb instead of re-implementing what they've done.
- 2015-08-04T20:45:30+00:00
David Koppstein
I just finished a wrapper for higblast (wrapping the wrapper?) that could easily be merged with MakeDb.py output. I'll check the CDR3 calls with our old in-house CDR3-calling script and follow up. Do you guys call the CDR regions on a regular basis, and if so, how?
- 2015-08-04T20:59:58+00:00
Jason Vander Heiden
Okay, let me know. And yes, we do analyze the CDRs a lot, but we do our alignments with IMGT and extract the regions we want from the IMGT-gapped sequences.
- 2015-08-04T21:02:50+00:00
David Koppstein
It looks like the CDR3s from higblast usually agree, although they are also prepended with a cysteine and appended with a tryptophan or phenylalanine. Occasionally they disagree. Higblast also appears to have some special encoding for ambiguous characters that is not well-documented, e.g. CARDEGIA??gYLKPFDYW.

Higblast appears to do some kind of additional filtering that isn't clear to me from reading the docs. Even when I ran it at the lowest quality threshold, it throws out about 80% of the reads. I think this is because it implicitly tries to also group reads into clonotypes, so I might be comparing apples to oranges.
- 2015-08-05T15:17:09+00:00
Former user Account Deleted
these statements apply to BCR reads only, yes ?

It still seems unclear which 80% of reads it's throwing out ?

I wonder if it's throwing out high-N sequences ...OR maybe what it perceives to be shallowly sequenced mRNAs (i forget how it's using the migec number information or how you hacked that )

the {prepending,appending} of {CYS,[TRP|PHE]} is what i associate as the difference between CDR3 and JUNCTION , though don't know if that's an official definition
- 2015-08-05T16:00:28+00:00
David Koppstein
Yes, I haven't tried it with TCR yet.

Unfortunately, higblast "forgets" the fasta header so I (warning: ugly hack) encoded the fasta header as a unique read count which higblast does report and then go back and match that with a fasta header.

Having a canonical way to parse out all the CDR regions and the coding sequence from igblast would quite beneficial for us. Besides the IMGT licensing issues, we're running so many samples now that submitting each to the server by hand is not so feasible. I'd be interested in helping to merge Gur's CDR3-calling code and/or our in-house parser (written before my time at AbVitro so I'm not so familiar with it, located here) into MakeDb.py, as well as adding CDR1/2 and coding sequence functionality (I think Sonia has found that SEQUENCE_VDJ != coding sequence).
- 2015-08-05T16:13:10+00:00
Former user Account Deleted
regarding parseIgblastN.py, which i think DK just linked to you guys.

Raj Chaari (Church lab) mainly wrote it.

We compared output vs IMGT and were happy that CDR3 and V/J calls were consistent for IG/TR However, there was a bit of hocus-pocus involved in finding the right frame and dealing with indels, that i don't clearly recall.

Known issues: - sometimes the V-start and J end are 1-off, and so can't quite be taken literally for the sake of parsing out the VDJ nt - sometimes there is an X in the Vnt sequence - DK just noticed this. it was a temporary placeholder for dealing with indels, but i haven't investigated by it gets propogated. - only tested with 1.3.0 outfmt7, needs outfmt3, - doesn't really look for conserved TRP/PHE, just takes the first in the J-alignment or something like that .
- 2015-08-05T16:50:44+00:00
Jason Vander Heiden
Regarding the prepended/appended conversed residues, that is indeed the official (read: IMGT) distinction between junction and CDR3, so it's actually a bit better for us that it reports the junction instead of the CDR3.

Heard back from Gur's student, and the problem wasn't the CDR3 definition, but the loss of the sequence identifier, which it seems like you have a solution for.

I'm guessing the SEQUENCE_VDJ sequence not being the coding sequence is just a frame issue. Wherever IgBLAST says the V starts and J ends is where the sequence starts/ends. Might just have to pad the front or end, depending upon whether V_START % 3 == 0.

The CDR1/2 and SEQUENCE_VDJ frame issues (assuming that's the problem) can be solved by adding the IMGT gaps to the V region. Gur has some code for this as well, so maybe we can integrate that easily.

Have y'all taken a look at MiXCR? It's from the same group, but newer. I foresee some issues with it (like being tied to the NCBI germlines), but might be worth a look see.
- 2015-08-05T18:06:24+00:00
David Koppstein
If you're interested in playing with higblast on presto output, here is my presto-compatible higblast wrapper. Caveats: it expect certain strings like IgA or IgK in the PRCONS field because higblast only runs on one locus at a time, so we need to split up the fasta file by locus. Furthermore, it currently outputs only the 'ID' part of the SEQUENCE_ID (and calls it DB_MB...), although that should be easy to change.

Again, my solution to the loss of sequence identifier was to give each read a unique UMI and read count, and then matching the read count to the sequence identifier afterwards. But it might be doing something under the hood that makes this not a viable solution. So take it cum grano salis.

Haven't looked at MiXCR, but might try playing with it. What's the issue with the NCBI germlines?

I remember some of you guys are planning on coming up to Boston in August (I think Sonia and I would also be happy to go down to New Haven). Maybe this might be worth a Kleinstein/AbVitro sprint? We'd like to migrate over to using the Kleinstein stack as much as possible!
- 2015-08-05T18:38:14+00:00
Jason Vander Heiden
Cool, thanks. I'll check it out. I think the NCBI germlines aren't as exhaustive as the IMGT or UNSW databases. Also, you can't genotype, then rerun the alignment against a custom repertoire.

If we are going to organize a sprint, I think we should look at the non-IgBLAST options as well. But IgBLAST has the advantage of not having to justify its use in a paper.
- 2015-08-05T21:45:41+00:00
Former user Account Deleted
if we should look at the non-IgBLAST options as well, there's https://github.com/churchlab/vdj/

as of a year ago, Uri was still a proponent of this old tool suite (which is not being maintained, but he just got a faculty position at Mt Sinai and is planning to do vdj-seq as part of his work)
- 2015-08-05T21:50:43+00:00
Namita Gupta
I have updated the IgBLAST parser to make an IMGT gapped sequence and infer the IMGT junction from this as well. This requires passing in the directory with germline sequences to the '-r' repo flag.
- 2015-09-22T19:51:10+00:00
Former user Account Deleted
wow, that's awesome!
- 2015-09-22T20:02:36+00:00
Jason Vander Heiden
Yep! Any chance we could get y'all to give it a test drive?
- 2015-09-22T20:07:06+00:00
Former user Account Deleted
probably couldn't stop DK from doing so even if i tried !
- 2015-09-22T20:11:54+00:00
Log in to comment

Assignee: –

Type: bug

Priority: major

Status: resolved

Milestone: –

Votes: 0

Watchers: 2