odd mutation results

Issue #152 resolved
Scott Christley created an issue

I’m porting VDJServer’s workflows to use immcantation/suite:4.1.0. For mutations, I’m getting large counts, almost like 10x higher than I’d expect. Here’s how I call with AIRR TSV file input.

db.om <- observedMutations(db, sequenceColumn='sequence_alignment', germlineColumn='germline_alignment_d_mask', regionDefinition=NULL, frequency=FALSE, nproc=1)
db.om2 <- observedMutations(db.om, sequenceColumn='sequence_alignment', germlineColumn='germline_alignment_d_mask', regionDefinition=IMGT_V_BY_CODONS, frequency=FALSE, nproc=1)

I’ve attached the AIRR TSV with the input and results. You can see columns mu_count_seq_r (CW) and mu_count_seq_s (CX) are large at 90 and 40. If I run the sequence through the igblast website and look at the alignment, I see 11 aa that have changed, but with mu_count_seq_r at 90, that should be at least 30 aa that changed. Am I reading the values incorrectly? Or maybe I’m running the function incorrectly?

This particular data is non-cloned. I reconstructed the germline with:

CreateGermlines.py -d ${file} -r $vdj_db -g dmask --failed

Comments (9)

  1. ssnn

    How did you process your data? db$sequence_alignment and db$germline_alignment don’t have gaps. If I run db$sequence through IgBlast and MakeDb in immcantation/suite:4.1.0, I get gapped sequence_alignment and germline_alignment

    > db$sequence_alignment
    [1] "CTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCGCCGGGAAGGGACTGGAGTGGATTGCGCGGATCTCTGGCAGTGGGACGACCAATAACAACGCCTCCCTCAAGAGTCGACTCACCGTGTCGGTAGACTCGTCCAAGAACCAGTTGTCCCTGAAGCTGAGGTCTGTGACCGCCGCGGACACTGCCGTATATTACTGTGCGAGAGAGGATGTAGTACCGGACAGTTACAACTACGGTATGGACGTCTGGGGCCAAGGGACCA"
    > ssnn_db$sequence_alignment
    [1] "......................................................................CTGTCTCTGGTGGCTCCATC............AGTAGTTACTACTGGAGCTGGATCCGGCAGCCCGCCGGGAAGGGACTGGAGTGGATTGCGCGGATCTCTGGCAGT.........GGGACGACCAATAACAACGCCTCCCTCAAG...AGTCGACTCACCGTGTCGGTAGACTCGTCCAAGAACCAGTTGTCCCTGAAGCTGAGGTCTGTGACCGCCGCGGACACTGCCGTATATTACTGTGCGAGAGAGGATGTAGTACCGGACAGTTACAACTACGGTATGGACGTCTGGGGCCAAGGGACCA"
    > db$germline_alignment
    [1] "CTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCGCCGGGAAGGGACTGGAGTGGATTGGGCGTATCTATACCAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATGTCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTACTGTGCGAGAGANNNNGTAGTACCNNNNNNNTACTACTACGGTATGGACGTCTGGGGCCAAGGGACCA"
    > ssnn_db$germline_alignment
    [1] "CAGGTGCAGCTGCAGGAGTCGGGCCCA...GGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATC............AGTAGTTACTACTGGAGCTGGATCCGGCAGCCCGCCGGGAAGGGACTGGAGTGGATTGGGCGTATCTATACCAGT.........GGGAGCACCAACTACAACCCCTCCCTCAAG...AGTCGAGTCACCATGTCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTACTGTGCGAGAGANNNNGTAGTACCNNNNNNNTACTACTACGGTATGGACGTCTGGGGCCAAGGGACCA"
    

    I also get different v_germline_start

    > db$v_call
    [1] "IGHV4-4*07"
    > ssnn_db$v_call
    [1] "IGHV4-4*07"
    > db$v_sequence_start
    [1] 1
    > ssnn_db$v_sequence_start
    [1] 1
    > db$v_germline_start
    [1] 68
    > ssnn_db$v_germline_start
    [1] 1
    

    And I get less mutations:

    > db$mu_count_seq_s
    [1] 40
    > ssnn_db.om$mu_count_seq_s
    [1] 5
    > db$mu_count_seq_r
    [1] 90
    > ssnn_db.om$mu_count_seq_r
    [1] 14
    

  2. Scott Christley reporter

    I’m running the sequences through IgBlast 1.16 and have it directly produce an AIRR TSV, so I’m not running MakeDb (maybe that is a step I’m missing?). So the gaps are missing from the IgBlast output. The other thought I had was the germline database. When you run IgBlast, are you using germline sequences that have the IMGT gapping? Maybe that’s my issue, it looks like I run IgBlast with non-gapped sequences then run CreateGermlines where I pass the gapped sequences.

  3. Jason Vander Heiden

    IgBLAST has to be run with a non-gapped reference database (they are removed during the db build process), but CreateGermlines needs to be run using a reference database with the sample type of numbering that is present in the input file (in both sequence_alignment and the _start and _end fields). Ie, if you are using the native AIRR output from IgBLAST, then you'll need to run CreateGermlines with non-gapped references or everything will be off. In theory anyway - I don't think we've tested CreateGermlines with non-gapped input sequences. It should work most of the time, but I suspect it'll break if there are inserts in the reference, as those won't appear as gaps in the germline sequences.

    The best fix would be for me to finish the generic AIRR input IMGT gapping tool, but I haven’t yet:

    https://bitbucket.org/kleinstein/changeo/issues/131/imgt-seqence-germline-build-in-convertdb

    Alternatively, just use the germline_alignment column output from IgBLAST.

  4. Scott Christley reporter

    Ok, I think I understand, so it worked for @ssnn because IgBlast output was parsed by MakeDB instead of using the AIRR TSV directly? So that sounds like one workaround. If I just use germline_alignment and skip CreateGermlines which I think you are suggesting, does that mean we lose the IMGT numbering, that is the positions across multiple sequences do not line up? That might still be the case if running with CreateGermlines with non-gapped references, I’m not sure what the purpose of running CreateGermlines would be?

  5. Jason Vander Heiden

    Yeah, you won’t get correct IMGT numbering out of CreateGermlines without v_germline_start /v_germline_end fields that match the gapped germline sequence length, which you won’t get out of igblast directly. So, yeah, skip CreateGermlines or run the data through MakeDb.

  6. ssnn

    My vote goes for run data through MakeDb and have gapped alignment fields, which is what most tools expect. In your particular example (mutational analysis) having not gapped sequences can be problematic if you use region definitions (like IMGT_V_BY_CODONS), because you will be counting mutations in positions outside of the V region.

  7. Scott Christley reporter

    An fyi followup, I tried CreateGermlines with the non-gapped references and it almost worked. Many sequences had the correct mutation count, but there was some that still had odd, large counts. I didn’t look into detail though, as I assume those sequences had something about them that the code couldn’t handle. Unfortunately, it’s challenging for me to go back and re-run IgBlast from scratch, so I need to stick with the old, stable code for the moment. I look forward to #131

  8. Log in to comment