- edited description
- changed title to Potential issue in MakeDB.py: Junction region not accurate for IgL in rheus macaques
Potential issue in MakeDB.py: Junction region not accurate for IgL in rheus macaques
After running IgBLAST v1.6.1, I used MakeDb.py 0.4.1-2018.07.16 for an IgL library in rhesus macaques. It seems that the junction is not being determined correctly for IgL. There are additional ~3 amino acids included before C even though the CDR3_IGBLAST and CDR3_IGBLAST_AA look right. The junction is predicted correctly for IgG and IgK. Could this be due to the difference in numbering for IgG/IgK and IgL?
IGH & IGK
FR1-IMGT CDR1-IMGT FR2-IMGT CDR2-IMGT FR3-IMGT
(1-26) (27-38) (39-55) (56-65) (66-104)
IGL
FR1-IMGT CDR1-IMGT FR2-IMGT CDR2-IMGT FR3-IMGT
(1-27) (28-39) (40-58) (59-68) (69-107)
Comments (15)
-
reporter -
Greetings, sorry for not getting back to you sooner on this.
There are some occasional problems with junction assignment using IgBLAST <= v1.7.0. MakeDb will check for versions older that and try to infer the correct junction, but this doesn’t work if the CDR3 doesn’t start at nucleotide 312. This can happen even in a few IGH sequences if it has “extra” positions, such as amino acid 9 and 9A. Which seems to be the case here. As of changeo v0.4.1 (release notes), which you seem to be using, if the output is from IgBLAST >= 1.7.0 it will use what IgBLAST provides as the CDR3 to determine the junction. This is because IgBLAST added the start/end positions of its CDR3 assignment in v1.7.0 (release notes). This is originally why the
CDR3_IGBLAST
andCDR3_IGBLAST_AA
fields were added; they should provide the same information as theJUNCTION
andCDR3_IMGT
fields for new versions of IgBLAST.Would you be able to email an example of your input files to immcantation@googlegroups.com (igblast output, reference database, and fasta file)? We could take a look at it. I think you’re correct about what the cause is, but we could double check.
-
reporter I think I have figured out the issue. It has to do with the IMGT numbering for rhesus macaques.
I had created a custom database for rhesus macaque Ig sequences (https://github.com/BosingerLab/slow_delivery_immunization/tree/master/databases).
To be able to use it with immcantation pipeline, I tried to insert gaps using the IMGT DomainGapAlign tool (http://www.imgt.org/3Dstructure-DB/cgi/DomainGapAlign.cgi).
It seems that IMGT has more gaps for rhesus macaque sequences compared to human sequences which shifts C104 to C107. I tried getting gapped sequence for IGLV1-10*01 by aligning either to human or rhesus macaques. Following are the results:
Align to Homo sapiens:
QSVLTQPPS.ASGAPGQSVTISCSGSSSNI....GSNYVYWYQQLSGKAPKLLIYNN.......NQRPSGVP.DRFSGSK..SGTSASLAISGLQSEDEADYYCA...........AWDSSLSG....
Align to Macaca mulatta:
QSVLTQPPS.ASGAPGQSVT.ISCSGSSSNI....GSNYVYWYQQLSGKAP..KLLIYNN.......NQRPSGVP.DRFSGSK..SGTSASLAISGLQSEDEADYYC........AAWDSSLSG....
The nucleotide sequence in the IMGT reference sequences (http://www.imgt.org/vquest/refseqh.html) corresponds to the above gapped sequence for M. mulatta.
Is there any way to address an alternative region definition in MakeDb.py?
I can send you some example files if that helps.
-
The first thing I'd try is running the sequences through IgBLAST v1.14 and then through MakeDb with the
--partial
argument (it should then ignore the requirement that the junction start at 312).Yeah, if you want to send us some example files, we’ll look at it. I don’t think we’ve tested with macaque IGL data. It’d be good to look at.
-
reporter I ran IgBLAST v1.14 and used the partial flag with MakeDb.py: 0.4.1-2018.07.16. It fixes the JUNCTION prediction. However, the CDR3_IMGT is still not right. Here are some examples:
JUNCTION CDR3_IMGT CDR3_IGBLAST CDR3_IGBLAST_AA TGCTGCTCATATGCAGGCATCTACACATGGGTATTC TATTACTGCTGCTCATATGCAGGCATCTAC TGCTCATATGCAGGCATCTACACATGGGTA CSYAGIYTWV TGTCAGGTGTGGGACAGTAGTAGTGATCATCCGTTATTC TATTACTGTCAGGTGTGGGACAGTAGTAGTGAT CAGGTGTGGGACAGTAGTAGTGATCATCCGTTA QVWDSSSDHPL TGTTACTCAACAGATAGCAATGGTATTTATGGCTTATTC TACTACTGTTACTCAACAGATAGCAATGGTATT TACTCAACAGATAGCAATGGTATTTATGGCTTA YSTDSNGIYGL TGCGAAACATGGGATAACAGTCTGAATGGTCCGTTACTC TATTACTGCGAAACATGGGATAACAGTCTGAAT GAAACATGGGATAACAGTCTGAATGGTCCGTTA ETWDNSLNGPL TGTCAGTCTTTTGACAGCAGCAGTTTTGTGTTC TATTACTGTCAGTCTTTTGACAGCAGC CAGTCTTTTGACAGCAGCAGTTTTGTG QSFDSSSFV TGCGGAGCATGGGATAGCAGCCTGAGTGTTTACATCTTC TATTACTGCGGAGCATGGGATAGCAGCCTGAGT GGAGCATGGGATAGCAGCCTGAGTGTTTACATC GAWDSSLSVYI TGTCAGTCTTTTGACAGCAGCGGTTATTGGGTATTC TATTACTGTCAGTCTTTTGACAGCAGCGGT CAGTCTTTTGACAGCAGCGGTTATTGGGTA QSFDSSGYWV TGTGCCATATGGCACAACTACGCTTTATTC TATTACTGTGCCATATGGCACAAC GCCATATGGCACAACTACGCTTTA AIWHNYAL TGCGCAGCATGGGATGATAGCCTGTTCGGTGTCTTATTC TATTACTGCGCAGCATGGGATGATAGCCTGTTC GCAGCATGGGATGATAGCCTGTTCGGTGTCTTA AAWDDSLFGVL
Will this affect the accuracy of estimating mutations for the for IMGT_V region using shazam? How can I fix this?
Thanks a lot for your help.
-
Ah, yeah, the
*_IMGT
fields are based on the human numbering, so those will still be off.For shazam, there is a mechanism to deal with this. You can create a custom RegionDefinition object that specifies the CDR and FWR boundaries of your data. The built-in IMGT RegionDefinitions all use the human numbering, but should serve as a good example of what to pass into createRegionDefinition.
All of shazam’s mutation and selection functions will have a
regionDefinition
argument you can pass your custom boundary object to. So, you should be okay there. Please let us know if that doesn’t work properly.To fix this in MakeDb, I think the best solution would be to implement something similar to what is in shazam. For example, by adding an argument to take in a yaml file with a set of custom boundaries for each locus, so you could work with any custom gapping criteria. Let me make a separate issue for this and put it on the feature list. I’m not sure when it’ll get addressed, because we’re in the middle of a major update across the tools to make the AIRR Rearrangement format the default. But, it should be an relatively straightforward feature to add.
-
reporter I am having an issue with ConvertDB.py for getting genbank format with rhesus IgL. I used newer IgBLAST version with partial flag for MakeDB. I see that the JUNCTION prediction is correct. However, it seems that ConvertDb is predicting junction based on the human numbering. Is there a way around this? Thanks.
-
Would you be able to email a small example data set to immcantation@googlegroups.com? Just a few sequences is fine. The fix might be as simple as adding a column named
JUNCTION_START
to the input file with the correct start position (w.r.t.SEQUENCE_INPUT
). Either that or writing theJUNCTION_START
from IgBLAST into the MakeDb output. But I’ll need to test that. -
@Jason Vander Heiden should we make some changes to the code or the docs re this issue?
-
This would be both code and docs fixes. I’ve added part of what’s needed in the `igblastp` branch (
Alignment.RegionDefinition
), but there's a few more things to do.I’m going to try to wrap that branch up today, then take a look at this after.
-
I added a
--regions
argument to specify eitherdefault
orrhesus-igl
region boundaries in 5d6c5dbI spot checked against the example Rhesus IGL data Amit sent (
sequence_id=M04545:37:000000000-BHHGB:1:1113:10074:4956
). Seems like it works. The junction, fwr1-4, cdr1-3 all match what I get out of IMGT/V-QUEST. The example data was an old version of IgBLAST, so it didn’t have the cdr3 start/end in the subregion section, so I had to update the old method (changeo.Alignment.inferJunction), which is imperfect and deprecated.I decided not to allow any region boundaries through a yaml file in favor of the two predefined ones. Though, we could very easy change that if we wanted, because that’s how it works internally. Ie, region boundaries are loaded from
changeo/data/regions.yaml
.Still needs testing. I had to abstract more things than I thought I would need to to get it to work.
@Kenneth Hoehn , I replaced
Alignment.getRegions
with thegetRegions
method ofAlignment.RegionDefinition
so you might want to rework the bit that uses the old function in BuildTrees. See L191 in MakeDb.py (_imgt_check
) and L1489 in IO.py (IgBLASTReader.parseSections
) for examples.
-
@Jason Vander Heiden I think I’ve fixed this in BuildTrees - are the changes in d72c25f what you had in mind?
-
Yep. You also want to remove the import of
Alignment.getRegions
, so it doesn’t break if we delete that function. -
Cool - done: bc1e32b
-
- changed status to resolved
Done.
- Log in to comment
After running IgBLAST v1.6.1, I used MakeDb.py 0.4.1-2018.07.16 for an IgL library. It seems that the junction is not being determined correctly for IgL. There are additional ~3 amino acids included before C even though the CDR3_IGBLAST and CDR3_IGBLAST_AA look right. The junction is predicted correctly for IgG and IgK. Could this be due to the difference in numbering for IgG/IgK and IgL?
IGH & IGK
FR1-IMGT CDR1-IMGT FR2-IMGT CDR2-IMGT FR3-IMGT
(1-26) (27-38) (39-55) (56-65) (66-104)
IGL
FR1-IMGT CDR1-IMGT FR2-IMGT CDR2-IMGT FR3-IMGT
(1-27) (28-39) (40-58) (59-68) (69-107)