- attached test.airr.tsv
odd mutation results
I’m porting VDJServer’s workflows to use immcantation/suite:4.1.0. For mutations, I’m getting large counts, almost like 10x higher than I’d expect. Here’s how I call with AIRR TSV file input.
db.om <- observedMutations(db, sequenceColumn='sequence_alignment', germlineColumn='germline_alignment_d_mask', regionDefinition=NULL, frequency=FALSE, nproc=1)
db.om2 <- observedMutations(db.om, sequenceColumn='sequence_alignment', germlineColumn='germline_alignment_d_mask', regionDefinition=IMGT_V_BY_CODONS, frequency=FALSE, nproc=1)
I’ve attached the AIRR TSV with the input and results. You can see columns mu_count_seq_r (CW)
and mu_count_seq_s (CX)
are large at 90 and 40. If I run the sequence through the igblast website and look at the alignment, I see 11 aa that have changed, but with mu_count_seq_r at 90, that should be at least 30 aa that changed. Am I reading the values incorrectly? Or maybe I’m running the function incorrectly?
This particular data is non-cloned. I reconstructed the germline with:
CreateGermlines.py -d ${file} -r $vdj_db -g dmask --failed
Comments (9)
-
reporter -
How did you process your data?
db$sequence_alignment
anddb$germline_alignment
don’t have gaps. If I rundb$sequence
through IgBlast and MakeDb in immcantation/suite:4.1.0, I get gappedsequence_alignment
andgermline_alignment
> db$sequence_alignment [1] "CTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCGCCGGGAAGGGACTGGAGTGGATTGCGCGGATCTCTGGCAGTGGGACGACCAATAACAACGCCTCCCTCAAGAGTCGACTCACCGTGTCGGTAGACTCGTCCAAGAACCAGTTGTCCCTGAAGCTGAGGTCTGTGACCGCCGCGGACACTGCCGTATATTACTGTGCGAGAGAGGATGTAGTACCGGACAGTTACAACTACGGTATGGACGTCTGGGGCCAAGGGACCA" > ssnn_db$sequence_alignment [1] "......................................................................CTGTCTCTGGTGGCTCCATC............AGTAGTTACTACTGGAGCTGGATCCGGCAGCCCGCCGGGAAGGGACTGGAGTGGATTGCGCGGATCTCTGGCAGT.........GGGACGACCAATAACAACGCCTCCCTCAAG...AGTCGACTCACCGTGTCGGTAGACTCGTCCAAGAACCAGTTGTCCCTGAAGCTGAGGTCTGTGACCGCCGCGGACACTGCCGTATATTACTGTGCGAGAGAGGATGTAGTACCGGACAGTTACAACTACGGTATGGACGTCTGGGGCCAAGGGACCA" > db$germline_alignment [1] "CTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCGCCGGGAAGGGACTGGAGTGGATTGGGCGTATCTATACCAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATGTCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTACTGTGCGAGAGANNNNGTAGTACCNNNNNNNTACTACTACGGTATGGACGTCTGGGGCCAAGGGACCA" > ssnn_db$germline_alignment [1] "CAGGTGCAGCTGCAGGAGTCGGGCCCA...GGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATC............AGTAGTTACTACTGGAGCTGGATCCGGCAGCCCGCCGGGAAGGGACTGGAGTGGATTGGGCGTATCTATACCAGT.........GGGAGCACCAACTACAACCCCTCCCTCAAG...AGTCGAGTCACCATGTCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTACTGTGCGAGAGANNNNGTAGTACCNNNNNNNTACTACTACGGTATGGACGTCTGGGGCCAAGGGACCA"
I also get different
v_germline_start
> db$v_call [1] "IGHV4-4*07" > ssnn_db$v_call [1] "IGHV4-4*07" > db$v_sequence_start [1] 1 > ssnn_db$v_sequence_start [1] 1 > db$v_germline_start [1] 68 > ssnn_db$v_germline_start [1] 1
And I get less mutations:
> db$mu_count_seq_s [1] 40 > ssnn_db.om$mu_count_seq_s [1] 5 > db$mu_count_seq_r [1] 90 > ssnn_db.om$mu_count_seq_r [1] 14
-
reporter I’m running the sequences through IgBlast 1.16 and have it directly produce an AIRR TSV, so I’m not running MakeDb (maybe that is a step I’m missing?). So the gaps are missing from the IgBlast output. The other thought I had was the germline database. When you run IgBlast, are you using germline sequences that have the IMGT gapping? Maybe that’s my issue, it looks like I run IgBlast with non-gapped sequences then run
CreateGermlines
where I pass the gapped sequences. -
IgBLAST has to be run with a non-gapped reference database (they are removed during the db build process), but CreateGermlines needs to be run using a reference database with the sample type of numbering that is present in the input file (in both
sequence_alignment
and the_start
and_end
fields). Ie, if you are using the native AIRR output from IgBLAST, then you'll need to run CreateGermlines with non-gapped references or everything will be off. In theory anyway - I don't think we've tested CreateGermlines with non-gapped input sequences. It should work most of the time, but I suspect it'll break if there are inserts in the reference, as those won't appear as gaps in the germline sequences.The best fix would be for me to finish the generic AIRR input IMGT gapping tool, but I haven’t yet:
https://bitbucket.org/kleinstein/changeo/issues/131/imgt-seqence-germline-build-in-convertdb
Alternatively, just use the
germline_alignment
column output from IgBLAST. -
reporter Ok, I think I understand, so it worked for @ssnn because IgBlast output was parsed by MakeDB instead of using the AIRR TSV directly? So that sounds like one workaround. If I just use
germline_alignment
and skipCreateGermlines
which I think you are suggesting, does that mean we lose the IMGT numbering, that is the positions across multiple sequences do not line up? That might still be the case if running withCreateGermlines
with non-gapped references, I’m not sure what the purpose of runningCreateGermlines
would be? -
Yeah, you won’t get correct IMGT numbering out of
CreateGermlines
withoutv_germline_start
/v_germline_end
fields that match the gapped germline sequence length, which you won’t get out of igblast directly. So, yeah, skipCreateGermlines
or run the data through MakeDb. -
My vote goes for run data through MakeDb and have gapped alignment fields, which is what most tools expect. In your particular example (mutational analysis) having not gapped sequences can be problematic if you use region definitions (like
IMGT_V_BY_CODONS
), because you will be counting mutations in positions outside of the V region. -
- changed status to resolved
This thread is inactive. I will close the issue as resolved. Please, reopen if needed.
-
reporter An fyi followup, I tried CreateGermlines with the non-gapped references and it almost worked. Many sequences had the correct mutation count, but there was some that still had odd, large counts. I didn’t look into detail though, as I assume those sequences had something about them that the code couldn’t handle. Unfortunately, it’s challenging for me to go back and re-run IgBlast from scratch, so I need to stick with the old, stable code for the moment. I look forward to
#131 - Log in to comment
test data