Legitimate insertions are removed when parsing IgBLAST results

Issue #132 resolved
Derek Matthew Croote created an issue

I believe the SEQUENCE_VDJ and SEQUENCE_IMGT columns of the MakeDB.py db-pass.tab output are missing legitimate nucleotide insertions relative to the reference.

Example case:

>GATAATACCATATATAA|CONSCOUNT=2|PRCONS=IGL
GNCANCNNNNCCNNCACTGGNAGTACCTCCANCCTCGGGTCNGGTTGTNATGTACACTGGTACCAGCACTTTCCAGGANCAGCCCCCAAACTGCTCATCTTTAGTAACAACAATCGGCCCTCAGGGGTCCCTGACCGATTCTCTGGCTCCAAGTCTGGCACCTCAGCCTCCCTGGCCATCACTGGGCTCCAGGCTGAAGATGAGGCTGATTATTACTGCCAGTCGTATGACGACAGACTGAGTGCTTATGTCTTCGGGACTGGGACCAAGGTCACCGTCCTGGGTCAGCCCAAGGCCAACCCCACT
>AATAATACCATATATAA|CONSCOUNT=2|PRCONS=IGL
GNCANCNNNNCCNNCACTGGNAGTACCTCCANCCTCGGGTCNGGTTGTNATGTACACTGGTACCAGCACTTTCCAGGANCAGCCCCCTTTAAACTGCTCATCTTTAGTAACAACAATCGGCCCTCAGGGGTCCCTGACCGATTCTCTGGCTCCAAGTCTGGCACCTCAGCCTCCCTGGCCATCACTGGGCTCCAGGCTGAAGATGAGGCTGATTATTACTGCCAGTCGTATGACGACAGACTGAGTGCTTATGTCTTCGGGACTGGGACCAAGGTCACCGTCCTGGGTCAGCCCAAGGCCAACCCCACT

The second sequence is exactly the same as the first except I've inserted TTT after CCCCC (arbitrary choice).

Running this fasta through the latest immcantation docker devel image (changeo build 69da217d0c1b) SEQUENCE_VDJ and SEQUENCE_IMGT are identical whereas I would have expected them to reflect the insertion.

dbpass = pd.read_csv('/home/derek/scratch/immcantation/project/sample/sample_db-pass.tab',
                     sep='\t')
for col in ['SEQUENCE_VDJ', 'SEQUENCE_IMGT', 'SEQUENCE_INPUT']:
    print('Unique {} seqs:\t{}'.format(col, len(dbpass[col].unique())))
Unique SEQUENCE_VDJ seqs:   1
Unique SEQUENCE_IMGT seqs:  1
Unique SEQUENCE_INPUT seqs: 2

Comments (7)

  1. Jason Vander Heiden

    Thanks, @dcroote. I've been travelling, but I'm back in the office now. I'll take a look at this today or tomorrow.

  2. Jason Vander Heiden

    Hey @dcroote, I checked this and I do see the same behavior, but looking through the code this appears to be intentional so that the ungapped references are aligned with the SEQUENCE_VDJ. Thus, mimicking the output of IMGT/HighV-QUEST for IgBLAST output.

    I don't know if it needs to work like this for some reason other than synchronizing the output of the two aligner programs. Let me bring it up with the rest of the team and see what the rationale is.

    If you need the V(D)J region with indels immediately, then slicing SEQUENCE_INPUT from V_SEQ_START - 1 to J_SEQ_START + J_SEQ_LENGTH should get you that (assuming I don't have an off-by-one error there).

    I check on this and get back to you.

  3. Derek Matthew Croote reporter

    Thanks for taking a look @javh . I was able to get the VDJ sequences with the indels, but I would still be in favor of changing the current behavior. My opinion is that SEQUENCE_VDJ should reflect the actual VDJ sequence and I would guess most users expect the same. Furthermore, the current behavior is a bit inconsistent because SEQUENCE_VDJ includes deletions (via gap characters) and (obviously) SNVs. If an ungapped reference version of the sequence is desired perhaps it could be in an additional column like SEQUENCE_VDJ_REF ?

  4. Jason Vander Heiden

    The deletions are kept to preserve the alignment against an unmodified reference database. Because this is usable for mutational analysis as-is:

    REF: ATGC
    SEQ: AT-C
    

    And this is not:

    REF: ATGC
    SEQ: ATCCGC
    

    So that's the reasoning.

    But, SEQUENCE_VDJ is mostly a historical artifact from before we had the IMGT-gapping working for the IgBLAST parser, because some sort of proper alignment was required for all the downstream analysis.

    I actually don't think anything needs it anymore, so it's probably safe to change the content of SEQUENCE_VDJ to include an unmodified alignment for the reasons you specify. It'd be a good time to change it too, because there's a lot of other major changes in v0.4.1.

    (I still have to check with everyone to make sure it won't break anything.)

  5. Derek Matthew Croote reporter

    Thanks Jason- appreciate the explanation and looking forward to the next release!

  6. Jason Vander Heiden

    I went ahead and changed the behavior SEQUENCE_VDJ to keep insertions in MakeDb-igblast in commit 7c4a811. So this will be in the next release. The behavior in unchanged in MakeDb-imgt, because that's what IMGT returns.

  7. Log in to comment