CreateGermlines gives warning about IMGT-numbering spacers

Issue #138 resolved
Giulia Moro created an issue

Hello! I am trying to run the Immcantation pipeline on some data I have. Everything goes smoothly up to CreateGermlines, which gives the warning:

WARNING> Germline reference sequences do not appear to contain IMGT-numbering spacers. Results may be incorrect.

and no sequences pass this step.

The germline files I am using are the ones downloaded via the fetchimgt.sh script in the webinar folder and the V sequences generated by TIgGER with findNovelAlleles, inferGenotype, genotypeFasta and writeFasta (which seem to have all the IMGT-numbering spacers one could hope for).

Am I doing something silly I have not realized?

Command I am using and output:

CreateGermlines.py -d WTCHG_460561_701501_igh_genotyped_clone-pass.tab -r ../data/imgt/human/vdj/*IGH[DJ].fasta genotype/WTCHG_460561_701501_v_genotype.fasta -g full --cloned --vf V_CALL_GENOTYPED --failed --log WTCHG_460561_701501_CG.log --cloned

START> CreateGermlines
FILE> WTCHG_460561_701501_igh_genotyped_clone-pass.tab
GERM_TYPES> full
SEQ_FIELD> SEQUENCE_IMGT
V_FIELD> V_CALL_GENOTYPED
D_FIELD> D_CALL
J_FIELD> J_CALL
CLONED> True
CLONE_FIELD> CLONE

PROGRESS> 11:34:53 |Sorting by clone | 0.0 min
PROGRESS> 11:34:56 |Done | 0.0 min

PROGRESS> 11:34:58 |####################| 100% (14,457) 0.0 min

OUTPUT> None
RECORDS> 14457
PASS> 0
FAIL> 14457
END> CreateGermlines

Head of the fasta file with the V sequences:

>IGHV1-2*02
CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAG
GTCTCCTGCAAGGCTTCTGGATACACCTTC............ACCGGCTACTATATGCAC
TGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAAC...
...AGTGGTGGCACAAACTATGCACAGAAGTTTCAG...GGCAGGGTCACCATGACCAGG
GACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCC
GTGTATTACTGTGCGAGAGA
>IGHV1-3*01
CAGGTCCAGCTTGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAG
GTTTCCTGCAAGGCTTCTGGATACACCTTC............ACTAGCTATGCTATGCAT
TGGGTGCGCCAGGCCCCCGGACAAAGGCTTGAGTGGATGGGATGGATCAACGCTGGC...
...AATGGTAACACAAAATATTCACAGAAGTTCCAG...GGCAGAGTCACCATTACCAGG
GACACATCCGCGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAAGACACGGCT
GTGTATTACTGTGCGAGAGA

Comments (8)

  1. Julian Zhou

    Can you check the FASTA file outputted by TIgGER? That is, genotype/WTCHG_460561_701501_v_genotype.fasta?

    I think this might be because the germline FASTA output by TIgGER has IMGT gaps written as "---" instead of "...". So if in that file you see things like ATGCATGCC---TTTATG, try changing all instances of "---" to "...".

    I remember running into this issue the first time I tried including inferred novel germline alleles from TIgGER, and I've always made a mental note since to switch the ---s to ...s before passing it to Change-O.

    If this doesn't work, then we'll have to wait for @javh
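
A minimal sketch of the substitution described above, in Python. It is an illustration under stated assumptions, not part of TIgGER or Change-O: the function name `dashes_to_dots` is hypothetical, and the sketch assumes that `-` only ever marks a gap inside sequence lines. Header lines are left alone because allele names such as IGHV4-59 contain a hyphen themselves, so a blanket find-and-replace over the whole file would corrupt them.

```python
def dashes_to_dots(fasta_text: str) -> str:
    """Replace '-' with '.' in sequence lines only; headers are kept as-is.

    Hypothetical helper for illustration; not a TIgGER or Change-O function.
    """
    fixed = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Header line: allele names (e.g. IGHV4-59) legitimately contain '-'.
            fixed.append(line)
        else:
            # Sequence line: assume every '-' is a gap character.
            fixed.append(line.replace("-", "."))
    return "\n".join(fixed) + "\n"


example = (
    ">IGHV4-59*01_T288C\n"
    "CAGGTGCAGCTGCAGGAGTCGGGCCCA---GGACTGGTGAAGCCTTCGGAGACCCTGTCC\n"
)
print(dashes_to_dots(example), end="")
```

To apply it to a real file, read the genotype FASTA into a string, pass it through the function, and write the result to a new file rather than overwriting the original.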

  2. Giulia Moro reporter

    This is what the novel V sequences files look like:

    >IGHV1-2*02
    CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAG
    GTCTCCTGCAAGGCTTCTGGATACACCTTC............ACCGGCTACTATATGCAC
    TGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAAC...
    ...AGTGGTGGCACAAACTATGCACAGAAGTTTCAG...GGCAGGGTCACCATGACCAGG
    GACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCC
    GTGTATTACTGTGCGAGAGA
    >IGHV1-3*01
    CAGGTCCAGCTTGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAG
    GTTTCCTGCAAGGCTTCTGGATACACCTTC............ACTAGCTATGCTATGCAT
    TGGGTGCGCCAGGCCCCCGGACAAAGGCTTGAGTGGATGGGATGGATCAACGCTGGC...
    ...AATGGTAACACAAAATATTCACAGAAGTTCCAG...GGCAGAGTCACCATTACCAGG
    GACACATCCGCGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAAGACACGGCT
    GTGTATTACTGTGCGAGAGA
    

    I think they are fine...

  3. Julian Zhou

    You were looking at the non-novel ones in the file. Look for allele names containing a suffix like "_T288C"; those are the ones added by TIgGER. For example, scrolling through one of the FASTA files I got from TIgGER, I found things like

    >IGHV4-39*07
    CAGCTGCAGCTGCAGGAGTCGGGCCCA...GGACTGGTGAAGCCTTCGGAGACCCTGTCC
    CTCACCTGCACTGTCTCTGGTGGCTCCATCAGC......AGTAGTAGTTACTACTGGGGC
    TGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGAGTATCTATTATAGT...
    ......GGGAGCACCTACTACAACCCGTCCCTCAAG...AGTCGAGTCACCATATCAGTA
    GACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCC
    GTGTATTACTGTGCGAGAGA
    >IGHV4-59*01_T288C
    CAGGTGCAGCTGCAGGAGTCGGGCCCA---GGACTGGTGAAGCCTTCGGAGACCCTGTCC
    CTCACCTGCACTGTCTCTGGTGGCTCCATC------------AGTAGTTACTACTGGAGC
    TGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTATATCTATTACAGT---
    ------GGGAGCACCAACTACAACCCCTCCCTCAAG---AGTCGAGTCACCATATCAGTA
    GACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCC
    GTGTATTACTGTGCGAGAGA
    

    Notice how the non-novel allele IGHV4-39*07 has all gaps as ..., whereas the novel allele IGHV4-59*01_T288C has all gaps as ---.
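
Because the affected records can sit far down the file, a quick scan is more reliable than reading the first few entries. The Python sketch below lists every allele whose sequence contains a `-` character. The function name `alleles_with_dash_gaps` is hypothetical, and the example input is just the first sequence line of each of the two records shown above.

```python
from typing import List


def alleles_with_dash_gaps(fasta_text: str) -> List[str]:
    """Return names of FASTA records whose sequence lines contain '-'.

    Hypothetical helper for illustration; not a TIgGER or Change-O function.
    """
    names: List[str] = []
    current = None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # New record: remember its name (text after '>').
            current = line[1:].strip()
        elif current is not None and "-" in line and current not in names:
            # A dash in a sequence line flags this record once.
            names.append(current)
    return names


example = (
    ">IGHV4-39*07\n"
    "CAGCTGCAGCTGCAGGAGTCGGGCCCA...GGACTGGTGAAGCCTTCGGAGACCCTGTCC\n"
    ">IGHV4-59*01_T288C\n"
    "CAGGTGCAGCTGCAGGAGTCGGGCCCA---GGACTGGTGAAGCCTTCGGAGACCCTGTCC\n"
)
print(alleles_with_dash_gaps(example))  # ['IGHV4-59*01_T288C']
```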

  4. Giulia Moro reporter

    Yes, I was a fool and trusted the first few sequences. I found the novel sequences and I have exactly the situation you described. I'll change that and try again, but I am pretty confident it is just that.

    Thank you so much!

  5. Jason Vander Heiden

    Hrm. Even if the novel alleles are missing the . characters, which will cause the germline reconstruction for those alleles to fail, that warning about the missing "IMGT-numbering spacers" shouldn't occur. It really only checks to make sure some sequences have them, not all.

    We're trying to get a tigger release together, so I'll make a note to fix the output there, but something else might be going on.

    If the fix @jqz suggested doesn't work, could you email the input files (germlines and tab file) to immcantation@googlegroups.com? We can take a look. Your command looks fine, so I don't have any suggestion that wouldn't require some debugging.
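
To illustrate why dash-gapped novel alleles alone should not trigger the warning, here is a rough Python sketch of an "at least one sequence has spacers" check. This is an assumption about the general shape of such a check based on the description above, not Change-O's actual code; the function name `any_sequence_has_spacers` is hypothetical, and the sequences are prefixes of the two records quoted earlier in the thread.

```python
from typing import Dict


def any_sequence_has_spacers(germlines: Dict[str, str]) -> bool:
    """True if at least one germline sequence contains a '.' spacer.

    Illustrative only; not the check as implemented in Change-O.
    """
    return any("." in seq for seq in germlines.values())


germlines = {
    "IGHV4-39*07": "CAGCTGCAGCTGCAGGAGTCGGGCCCA...GGACTGGTG",
    "IGHV4-59*01_T288C": "CAGGTGCAGCTGCAGGAGTCGGGCCCA---GGACTGGTG",
}
# One dot-gapped allele is enough for the check to pass.
print(any_sequence_has_spacers(germlines))  # True
```

Under this kind of check, a reference set would fail only if no sequence at all contained a `.`, which is consistent with the suggestion that something other than the dash gaps is involved.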

  6. Giulia Moro reporter

    The problem persists after changing the spacers.

    I sent you an email, thank you for your help!
