MakeDb igblast issue with VDJ

Issue #50 resolved
Former user created an issue

In the process of analyzing our data we noticed a potential bug in MakeDb: in the majority of our sequences, the Sequence_VDJ differs from the corresponding region of the Sequence_Input by around 1-4 nucleotides (based on the sequences I have compared from a few different samples). Some of the samples I saw had insertions or deletions and others had substitutions.

I took the same sample and ran MakeDb on the IMGT output and then on the IgBlast output. For each result I counted how many of the sequences had the Sequence_VDJ contained in the Sequence_Input:

For IMGT: 23378/25116
For IgBlast: 51/21587
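The containment check described above could be reproduced with something like the sketch below. It assumes the MakeDb output is a tab-delimited table with the columns SEQUENCE_INPUT and SEQUENCE_VDJ (the names used later in this thread), and that any `.` or `-` characters in the VDJ sequence are alignment gaps that should be dropped before comparing. The function name and the tiny example table are made up for illustration.

```python
import csv
import io


def count_vdj_in_input(handle, input_col="SEQUENCE_INPUT", vdj_col="SEQUENCE_VDJ"):
    """Return (matches, total) for rows whose VDJ sequence occurs in the input."""
    matches = total = 0
    for row in csv.DictReader(handle, delimiter="\t"):
        total += 1
        # Assumption: '.' and '-' are gap characters, not nucleotides.
        vdj = row[vdj_col].replace(".", "").replace("-", "").upper()
        if vdj and vdj in row[input_col].upper():
            matches += 1
    return matches, total


# Made-up two-row table standing in for a MakeDb output file.
example = io.StringIO(
    "SEQUENCE_ID\tSEQUENCE_INPUT\tSEQUENCE_VDJ\n"
    "seq1\tAACAGGTGCAGCTGTT\tCAGGTGCAGCTG\n"
    "seq2\tAACAGGTGCAGCTGTT\tCAGGTCAGCTG\n"
)
matches, total = count_vdj_in_input(example)
print(f"{matches}/{total}")
```

For a real file, pass an open file handle instead of the `io.StringIO` object.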

I extracted 11 of the sequences with the same Seq_ID and Seq_Input in which MakeDb imgt Seq_VDJ is in Seq_Input but MakeDb igblast Seq_VDJ is not in Seq_Input to show you as an example. Some of these sequences have different D calls; however, they all have the same V and J call. In the example below, the V, D, and J calls are all the same in IgBlast and IMGT, and the D and J seq starts are the same as well (the differences are in the junction region):

IMGT Seq_VDJ: "CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGTCTTCGGTGAAGGTCTCCTGCAAGGCTTCTGGAGNCACCTTCAACACCCATTCTGTCAACTGGGTACGACNGGCCCCTGGACGAGGGCTTGAGTGGATGGGAGGGACCATCCCTANCTTTAATNCCATGAAGTACTCACAGCAGTTCCAGGGCAGGCTCACCATTACCGCGGACGAGTCCACGAGCACGGGCCACATGGAACTGAGCAGCCTGAGATCTGAGGACACGGCCGTATATTACTGTGCGAGAGCGATCTCGCGGGTTCGGGGAACTGTTATAATGGGTGACTTTGACAACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG"

IgBlast Seq_VDJ: "CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGTCTTCGGTGAAGGTCTCCTGCAAGGCTTCTGGAGNCACCTTCAACACCCATTCTGTCAACTGGGTACGACNGGCCCCTGGACGAGGGCTTGAGTGGATGGGAGGGACCATCCCTANCTTTAATNCCATGAAGTACTCACAGCAGTTCCAGGGCAGGCTCACCATTACCGCGGACGAGTCCACGAGCACGGGCCACATGGAACTGAGCAGCCTGAGATCTGAGGACACGGCCGTATATTACTGTGCGAGAGGATCTCGCGGGGTTCGGGGACTGTTATAATGGGTGAACTTTGACAACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG"

Seq_INPUT: "GATCACATAACAACCACATTCCTCCTCTAAAGAAGCCCCTGGGAGCACAGCTCATCACCATGGACTGGACCTAGAGGTTCCTCTTTGTGGTGGCAGCAGCTACAGGTGTCCAGTCCCAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGTCTTCGGTGAAGGTCTCCTGCAAGGCTTCTGGAGNCACCTTCAACACCCATTCTGTCAACTGGGTACGACNGGCCCCTGGACGAGGGCTTGAGTGGATGGGAGGGACCATCCCTANCTTTAATNCCATGAAGTACTCACAGCAGTTCCAGGGCAGGCTCACCATTACCGCGGACGAGTCCACGAGCACGGGCCACATGGAACTGAGCAGCCTGAGATCTGAGGACACGGCCGTATATTACTGTGCGAGAGCGATCTCGCGGGTTCGGGGAACTGTTATAATGGGTGACTTTGACAACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAGCCTCCACAAAGGGCA"
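For sequences this long, the differing positions are hard to spot by eye. One way to list them, sketched here with Python's standard `difflib` module, is to report every non-matching block between two strings. The helper name is hypothetical and the short strings in the example are stand-ins, not the full sequences above.

```python
from difflib import SequenceMatcher


def list_differences(reference, query):
    """Return (tag, reference_slice, query_slice) for each non-matching block."""
    # autojunk=False keeps difflib from treating frequent bases as junk.
    matcher = SequenceMatcher(None, reference, query, autojunk=False)
    return [
        (tag, reference[i1:i2], query[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Short stand-in strings; paste the two Seq_VDJ values here to compare them.
for tag, ref_part, query_part in list_differences("GCGAGAGCGATC", "GCGAGAGGATC"):
    print(tag, repr(ref_part), repr(query_part))
```

The tags are difflib's own (`replace`, `delete`, `insert`), so a `delete` means a base present in the first string is missing from the second.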

I have attached the fasta sequences and both MakeDb files for those 11 sequences

Please let us know if you have any idea as to what is causing this

(and if this wasn't clear enough let me know and I'll try to explain it better)

Comments (14)

  1. Jason Vander Heiden

    Hrm. I suspect this is due to how we correct for when IgBLAST assigns parts of the input sequence to multiple regions. It might just be an off-by-one indexing error, as it looks like the IgBLAST sequence has a deleted position at both ends of the junction. I'll take a look on Friday and see if I can figure it out. Thanks for pointing this out.

  2. Namita Gupta

    Thanks for pointing this out! Would you mind sending me the IgBlast output file for the 11 sequences you pulled out so I can get a better sense of where exactly our parser is going wrong?

  3. Namita Gupta

    Jason was right, the indexing was off by a character. It should be fixed now, please let me know if you still find issues.
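To illustrate the kind of error being described, here is a minimal sketch of how an off-by-one mistake can drop a base when two regions overlap. This is not the MakeDb code: the functions, the `buggy` flag, and the coordinates are all hypothetical, and the only assumption carried over is that the aligner reports 1-based, inclusive positions.

```python
def segment(seq, start, end):
    """Slice seq using 1-based, inclusive coordinates."""
    return seq[start - 1:end]


def stitch(seq, v_coords, d_coords, buggy=False):
    """Join V and D regions, keeping bases assigned to both only once."""
    v_start, v_end = v_coords
    d_start, d_end = d_coords
    v_part = segment(seq, v_start, v_end)
    if d_start <= v_end:
        # Overlap: D should resume at the first base after V ends.
        d_from = v_end + 1
        if buggy:
            d_from += 1  # hypothetical off-by-one: skips one real base
        between = ""
    else:
        # No overlap: keep any bases lying between the two regions.
        between = segment(seq, v_end + 1, d_start - 1)
        d_from = d_start
    return v_part + between + segment(seq, d_from, d_end)


seq = "AAACCCGGGTTT"
v_coords, d_coords = (1, 7), (6, 12)  # positions 6-7 are assigned to both
print(stitch(seq, v_coords, d_coords))              # matches the input
print(stitch(seq, v_coords, d_coords, buggy=True))  # one base short
```

With the `buggy` flag set, the stitched sequence is one nucleotide shorter and no longer occurs in the input, which is the same symptom as in the original report.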

  4. Former user Account Deleted

    Hi Namita, I just downloaded the new repository and reran the code and came across the exact same issue. Is the correct code for MakeDb in the new version that was uploaded yesterday?

    For example:

    SEQUENCE INPUT end: TTGCAAATGAACAGCCTGAGAGGCGAGGACACGGCCGTATATTACTGTGCGAAAGACCTCCCGACTTATACCGATGGCTGGATTGACTATTACGGAATGCAGGTCTGGGGCCAAGGGAGCACGGTCACCGTCTCCTCA

    SEQUENCE VDJ end: TTGCAAATGAACAGCCTGAGAGGCGAGGACACGGCCGTATATTACTGTGCGAAAGACTCCCGACCTTATACCATGGCTGGATTGAACTATTACGGAATGCAGGTCTGGGGCCAAGGGAGCACGGTCACCGTCTCCTCA

  5. Former user Account Deleted

    I just compared the previous VDJ seq to the "new" one and they're identical, so I am not sure what you changed. Could you upload the new MakeDb code here? Thanks!

  6. Namita Gupta

    I cannot exactly recreate the second error you reported above, but can you try the repository now? For the initial example you gave me, the IMGT and IgBLAST still do not agree because IMGT has very different D calls (they don't seem correct at all). However, at this point, the SEQUENCE_INPUT and SEQUENCE_VDJ for IgBLAST results do match up.

  7. Jason Vander Heiden

    Rebecca, depending upon whether you are installing the Python 2 or Python 3 version, you may also have to delete the *.pyc files from the folder (Py2) or clean the build, dist and changeo.egg-info folders (Py3).
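The cleanup described above could be scripted roughly as follows. This is a sketch that assumes it is pointed at the top-level folder of the checkout; the function name is made up, and it does both the Py2 and Py3 steps in one pass because removing either kind of stale artifact is harmless.

```python
import shutil
from pathlib import Path


def clean_stale_artifacts(root):
    """Delete *.pyc files and the build, dist and changeo.egg-info folders."""
    root = Path(root)
    removed = []
    # Collect first, then delete, so the tree is not modified while walking it.
    for pyc in list(root.rglob("*.pyc")):
        pyc.unlink()
        removed.append(pyc)
    for name in ("build", "dist", "changeo.egg-info"):
        target = root / name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(target)
    return removed
```

Calling `clean_stale_artifacts(".")` from the repository folder and then reinstalling should ensure the updated MakeDb code is the one that actually runs.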

  8. Former user Account Deleted

    I retested the MakeDb on the same sample and now the numbers are much more similar to those seen in IMGT! Thanks so much
