DefineClones doesn't work when db was created with MakeDb igblast --asis-calls

Issue #143 new
ssnn created an issue

User working with non IMGT nomenclature reports he is able to use MakeDb along with --asis-calls but then DefineClones fails. Renaming the sequences to use the IMGT format, DefineClones works. So, MakeDb accepts non IMGT nomenclature, but downstream analysis expect it, which is inconsistent. DefineClones uses these patterns to identify genes/alleles/families.

# Ig and TCR Regular expressions
allele_regex = re.compile(r'((IG[HLK]|TR[ABGD])([VDJ][A-Z0-9]+[-/\w]*[-\*][\.\w]+))')
gene_regex = re.compile(r'((IG[HLK]|TR[ABGD])([VDJ][A-Z0-9]+[-/\w]*))')
family_regex = re.compile(r'((IG[HLK]|TR[ABGD])([VDJ][A-Z0-9]+))')

v_allele_regex = re.compile(r'((IG[HLK]|TR[ABGD])V[A-Z0-9]+[-/\w]*[-\*][\.\w]+)')
d_allele_regex = re.compile(r'((IG[HLK]|TR[ABGD])D[A-Z0-9]+[-/\w]*[-\*][\.\w]+)')
j_allele_regex = re.compile(r'((IG[HLK]|TR[ABGD])J[A-Z0-9]+[-/\w]*[-\*][\.\w]+)')

allele_number_regex = re.compile(r'(?<=\*)([\.\w]+)')
c_gene_regex = re.compile(r'((IG[HLK]|TR[ABGD])([DMAGEC][P0-9]?[A-Z]?))')

We should find a way to let users define the patterns, or specify delimiters as in ConvertDb genbank, which takes --allele-delim.

This issue is for DefineClones, but maybe others will be affected.

Thoughts? @javh @jqz @ruoyijiangyale

Comments (2)

  1. Jason Vander Heiden

    The IMGT nomenclature is pretty deeply embedded in changeo, alakazam, shazam, tigger, rdi, and scoper. And not all of it is using the regex and extraction functions defined in changeo and alakazam, so I don't think it would be as simple as passing in alternative regex as it would in those two (alakazam::getSegment already takes alternative regex). MakeDb-igblast having the option to pass unmodified calls through to a tsv is one thing, because it doesn't interact with the gene names at all and it (prior to the AIRR format) at least let people get the IgBLAST output into a tsv format. I think adding the equivalent of --asis-calls throughout the pipeline would be a huge task. And you can't get away with just using raw strings, because the nomenclature adherence is needed for things like:

    1. Harmonizing mismatched data. Eg, knowing that Homosap IGHV1-69*01 F in the V_CALL column corresponds to >A12345|IGHV1-69*01|Homo sapiens||ABC|F||| in the fasta header of the reference database.
    2. Extracting the allele, gene and family levels (IGHV1-69*01 vs IGHV1-69 vs IGHV1).

    I think if I was going to add support for this, I'd probably add a new subcommand to ConvertDb to convert the V_CALL, D_CALL and J_CALL fields into the IMGT nomenclature. Using either some rules or taking a mapping file as input. The latter probably being easier and less error prone. You should be able to already do this with ParseDb-update, but the command would be really ugly, so it'd just be variation on ParseDb-update. Not sure exactly how to structure the mapping file, but it should probably enforce family/gene/allele in some way. Eg:

    INPUT             CHAIN   GENE    ALLELE
    SomeRandomName    IGHV    1-69    01
    SomethingElse     IGHD    2       01
    

    Maybe? I think some trial and error will be involved with the user to nail down the need and implementation. Not sure this is the best approach, but it seems a lot simpler to me than trying to disentangle the IMGT nomenclature from the whole suite.

  2. Log in to comment