Potential issue in MakeDB.py: Junction region not accurate for IgL in rheus macaques

Issue #150 resolved
amit upadhyay created an issue

After running IgBLAST v1.6.1, I used MakeDb.py 0.4.1-2018.07.16 for an IgL library in rhesus macaques. It seems that the junction is not being determined correctly for IgL. There are additional ~3 amino acids included before C even though the CDR3_IGBLAST and CDR3_IGBLAST_AA look right. The junction is predicted correctly for IgG and IgK. Could this be due to the difference in numbering for IgG/IgK and IgL?

IGH & IGK

FR1-IMGT CDR1-IMGT FR2-IMGT CDR2-IMGT FR3-IMGT

(1-26) (27-38) (39-55) (56-65) (66-104)

IGL

FR1-IMGT CDR1-IMGT FR2-IMGT CDR2-IMGT FR3-IMGT

(1-27) (28-39) (40-58) (59-68) (69-107)

Comments (15)

  1. amit upadhyay reporter

    After running IgBLAST v1.6.1, I used MakeDb.py 0.4.1-2018.07.16 for an IgL library. It seems that the junction is not being determined correctly for IgL. There are additional ~3 amino acids included before C even though the CDR3_IGBLAST and CDR3_IGBLAST_AA look right. The junction is predicted correctly for IgG and IgK. Could this be due to the difference in numbering for IgG/IgK and IgL?

    IGH & IGK

    FR1-IMGT CDR1-IMGT FR2-IMGT CDR2-IMGT FR3-IMGT

    (1-26) (27-38) (39-55) (56-65) (66-104)

    IGL

    FR1-IMGT CDR1-IMGT FR2-IMGT CDR2-IMGT FR3-IMGT

    (1-27) (28-39) (40-58) (59-68) (69-107)

  2. Jason Vander Heiden

    Greetings, sorry for not getting back to you sooner on this.

    There are some occasional problems with junction assignment using IgBLAST <= v1.7.0. MakeDb will check for versions older that and try to infer the correct junction, but this doesn’t work if the CDR3 doesn’t start at nucleotide 312. This can happen even in a few IGH sequences if it has “extra” positions, such as amino acid 9 and 9A. Which seems to be the case here. As of changeo v0.4.1 (release notes), which you seem to be using, if the output is from IgBLAST >= 1.7.0 it will use what IgBLAST provides as the CDR3 to determine the junction. This is because IgBLAST added the start/end positions of its CDR3 assignment in v1.7.0 (release notes). This is originally why the CDR3_IGBLAST and CDR3_IGBLAST_AA fields were added; they should provide the same information as the JUNCTION and CDR3_IMGT fields for new versions of IgBLAST.

    Would you be able to email an example of your input files to immcantation@googlegroups.com (igblast output, reference database, and fasta file)? We could take a look at it. I think you’re correct about what the cause is, but we could double check.

  3. amit upadhyay reporter

    I think I have figured out the issue. It has to do with the IMGT numbering for rhesus macaques.

    I had created a custom database for rhesus macaque Ig sequences (https://github.com/BosingerLab/slow_delivery_immunization/tree/master/databases).

    To be able to use it with immcantation pipeline, I tried to insert gaps using the IMGT DomainGapAlign tool (http://www.imgt.org/3Dstructure-DB/cgi/DomainGapAlign.cgi).

    It seems that IMGT has more gaps for rhesus macaque sequences compared to human sequences which shifts C104 to C107. I tried getting gapped sequence for IGLV1-10*01 by aligning either to human or rhesus macaques. Following are the results:

    Align to Homo sapiens:

    QSVLTQPPS.ASGAPGQSVTISCSGSSSNI....GSNYVYWYQQLSGKAPKLLIYNN.......NQRPSGVP.DRFSGSK..SGTSASLAISGLQSEDEADYYCA...........AWDSSLSG....
    

    Align to Macaca mulatta:

    QSVLTQPPS.ASGAPGQSVT.ISCSGSSSNI....GSNYVYWYQQLSGKAP..KLLIYNN.......NQRPSGVP.DRFSGSK..SGTSASLAISGLQSEDEADYYC........AAWDSSLSG....
    

    The nucleotide sequence in the IMGT reference sequences (http://www.imgt.org/vquest/refseqh.html) corresponds to the above gapped sequence for M. mulatta.

    Is there any way to address an alternative region definition in MakeDb.py?

    I can send you some example files if that helps.

  4. Jason Vander Heiden

    The first thing I'd try is running the sequences through IgBLAST v1.14 and then through MakeDb with the --partial argument (it should then ignore the requirement that the junction start at 312).

    Yeah, if you want to send us some example files, we’ll look at it. I don’t think we’ve tested with macaque IGL data. It’d be good to look at.

  5. amit upadhyay reporter

    I ran IgBLAST v1.14 and used the partial flag with MakeDb.py: 0.4.1-2018.07.16. It fixes the JUNCTION prediction. However, the CDR3_IMGT is still not right. Here are some examples:

    JUNCTION                                CDR3_IMGT                           CDR3_IGBLAST                        CDR3_IGBLAST_AA
    TGCTGCTCATATGCAGGCATCTACACATGGGTATTC    TATTACTGCTGCTCATATGCAGGCATCTAC      TGCTCATATGCAGGCATCTACACATGGGTA      CSYAGIYTWV
    TGTCAGGTGTGGGACAGTAGTAGTGATCATCCGTTATTC TATTACTGTCAGGTGTGGGACAGTAGTAGTGAT   CAGGTGTGGGACAGTAGTAGTGATCATCCGTTA   QVWDSSSDHPL
    TGTTACTCAACAGATAGCAATGGTATTTATGGCTTATTC TACTACTGTTACTCAACAGATAGCAATGGTATT   TACTCAACAGATAGCAATGGTATTTATGGCTTA   YSTDSNGIYGL
    TGCGAAACATGGGATAACAGTCTGAATGGTCCGTTACTC TATTACTGCGAAACATGGGATAACAGTCTGAAT   GAAACATGGGATAACAGTCTGAATGGTCCGTTA   ETWDNSLNGPL
    TGTCAGTCTTTTGACAGCAGCAGTTTTGTGTTC       TATTACTGTCAGTCTTTTGACAGCAGC         CAGTCTTTTGACAGCAGCAGTTTTGTG         QSFDSSSFV
    TGCGGAGCATGGGATAGCAGCCTGAGTGTTTACATCTTC TATTACTGCGGAGCATGGGATAGCAGCCTGAGT   GGAGCATGGGATAGCAGCCTGAGTGTTTACATC   GAWDSSLSVYI
    TGTCAGTCTTTTGACAGCAGCGGTTATTGGGTATTC    TATTACTGTCAGTCTTTTGACAGCAGCGGT      CAGTCTTTTGACAGCAGCGGTTATTGGGTA      QSFDSSGYWV
    TGTGCCATATGGCACAACTACGCTTTATTC          TATTACTGTGCCATATGGCACAAC            GCCATATGGCACAACTACGCTTTA            AIWHNYAL
    TGCGCAGCATGGGATGATAGCCTGTTCGGTGTCTTATTC TATTACTGCGCAGCATGGGATGATAGCCTGTTC   GCAGCATGGGATGATAGCCTGTTCGGTGTCTTA   AAWDDSLFGVL
    

    Will this affect the accuracy of estimating mutations for the for IMGT_V region using shazam? How can I fix this?

    Thanks a lot for your help.

  6. Jason Vander Heiden

    Ah, yeah, the *_IMGT fields are based on the human numbering, so those will still be off.

    For shazam, there is a mechanism to deal with this. You can create a custom RegionDefinition object that specifies the CDR and FWR boundaries of your data. The built-in IMGT RegionDefinitions all use the human numbering, but should serve as a good example of what to pass into createRegionDefinition.

    All of shazam’s mutation and selection functions will have a regionDefinition argument you can pass your custom boundary object to. So, you should be okay there. Please let us know if that doesn’t work properly.

    To fix this in MakeDb, I think the best solution would be to implement something similar to what is in shazam. For example, by adding an argument to take in a yaml file with a set of custom boundaries for each locus, so you could work with any custom gapping criteria. Let me make a separate issue for this and put it on the feature list. I’m not sure when it’ll get addressed, because we’re in the middle of a major update across the tools to make the AIRR Rearrangement format the default. But, it should be an relatively straightforward feature to add.

  7. amit upadhyay reporter

    I am having an issue with ConvertDB.py for getting genbank format with rhesus IgL. I used newer IgBLAST version with partial flag for MakeDB. I see that the JUNCTION prediction is correct. However, it seems that ConvertDb is predicting junction based on the human numbering. Is there a way around this? Thanks.

  8. Jason Vander Heiden

    Would you be able to email a small example data set to immcantation@googlegroups.com? Just a few sequences is fine. The fix might be as simple as adding a column named JUNCTION_START to the input file with the correct start position (w.r.t. SEQUENCE_INPUT). Either that or writing the JUNCTION_START from IgBLAST into the MakeDb output. But I’ll need to test that.

  9. Jason Vander Heiden

    This would be both code and docs fixes. I’ve added part of what’s needed in the `igblastp` branch (Alignment.RegionDefinition), but there's a few more things to do.

    I’m going to try to wrap that branch up today, then take a look at this after.

  10. Jason Vander Heiden

    I added a --regions argument to specify either default or rhesus-igl region boundaries in 5d6c5db

    I spot checked against the example Rhesus IGL data Amit sent (sequence_id=M04545:37:000000000-BHHGB:1:1113:10074:4956). Seems like it works. The junction, fwr1-4, cdr1-3 all match what I get out of IMGT/V-QUEST. The example data was an old version of IgBLAST, so it didn’t have the cdr3 start/end in the subregion section, so I had to update the old method (changeo.Alignment.inferJunction), which is imperfect and deprecated.

    I decided not to allow any region boundaries through a yaml file in favor of the two predefined ones. Though, we could very easy change that if we wanted, because that’s how it works internally. Ie, region boundaries are loaded from changeo/data/regions.yaml.

    Still needs testing. I had to abstract more things than I thought I would need to to get it to work.

    @Kenneth Hoehn , I replaced Alignment.getRegions with the getRegions method of Alignment.RegionDefinition so you might want to rework the bit that uses the old function in BuildTrees. See L191 in MakeDb.py (_imgt_check) and L1489 in IO.py (IgBLASTReader.parseSections) for examples.

  11. Jason Vander Heiden

    Yep. You also want to remove the import of Alignment.getRegions, so it doesn’t break if we delete that function.

  12. Log in to comment