- edited description
DefineClones doesn't work when db was created with MakeDb igblast --asis-calls
User working with non IMGT nomenclature reports he is able to use MakeDb
along with --asis-calls
but then DefineClones
fails. Renaming the sequences to use the IMGT format, DefineClones
works.
So, MakeDb
accepts non IMGT nomenclature, but downstream analysis expect it, which is inconsistent. DefineClones
uses these patterns to identify genes/alleles/families.
# Ig and TCR Regular expressions
allele_regex = re.compile(r'((IG[HLK]|TR[ABGD])([VDJ][A-Z0-9]+[-/\w]*[-\*][\.\w]+))')
gene_regex = re.compile(r'((IG[HLK]|TR[ABGD])([VDJ][A-Z0-9]+[-/\w]*))')
family_regex = re.compile(r'((IG[HLK]|TR[ABGD])([VDJ][A-Z0-9]+))')
v_allele_regex = re.compile(r'((IG[HLK]|TR[ABGD])V[A-Z0-9]+[-/\w]*[-\*][\.\w]+)')
d_allele_regex = re.compile(r'((IG[HLK]|TR[ABGD])D[A-Z0-9]+[-/\w]*[-\*][\.\w]+)')
j_allele_regex = re.compile(r'((IG[HLK]|TR[ABGD])J[A-Z0-9]+[-/\w]*[-\*][\.\w]+)')
allele_number_regex = re.compile(r'(?<=\*)([\.\w]+)')
c_gene_regex = re.compile(r'((IG[HLK]|TR[ABGD])([DMAGEC][P0-9]?[A-Z]?))')
We should find a way to let users define the patterns, or specify delimiters as in ConvertDb genbank, which takes --allele-delim
.
This issue is for DefineClones, but maybe others will be affected.
Thoughts? @javh @jqz @ruoyijiangyale
Comments (2)
-
reporter -
The IMGT nomenclature is pretty deeply embedded in changeo, alakazam, shazam, tigger, rdi, and scoper. And not all of it is using the regex and extraction functions defined in changeo and alakazam, so I don't think it would be as simple as passing in alternative regex as it would in those two (
alakazam::getSegment
already takes alternative regex). MakeDb-igblast having the option to pass unmodified calls through to a tsv is one thing, because it doesn't interact with the gene names at all and it (prior to the AIRR format) at least let people get the IgBLAST output into a tsv format. I think adding the equivalent of--asis-calls
throughout the pipeline would be a huge task. And you can't get away with just using raw strings, because the nomenclature adherence is needed for things like:- Harmonizing mismatched data. Eg, knowing that
Homosap IGHV1-69*01 F
in theV_CALL
column corresponds to>A12345|IGHV1-69*01|Homo sapiens||ABC|F|||
in the fasta header of the reference database. - Extracting the allele, gene and family levels (
IGHV1-69*01
vsIGHV1-69
vsIGHV1
).
I think if I was going to add support for this, I'd probably add a new subcommand to ConvertDb to convert the
V_CALL
,D_CALL
andJ_CALL
fields into the IMGT nomenclature. Using either some rules or taking a mapping file as input. The latter probably being easier and less error prone. You should be able to already do this with ParseDb-update, but the command would be really ugly, so it'd just be variation on ParseDb-update. Not sure exactly how to structure the mapping file, but it should probably enforce family/gene/allele in some way. Eg:INPUT CHAIN GENE ALLELE SomeRandomName IGHV 1-69 01 SomethingElse IGHD 2 01
Maybe? I think some trial and error will be involved with the user to nail down the need and implementation. Not sure this is the best approach, but it seems a lot simpler to me than trying to disentangle the IMGT nomenclature from the whole suite.
- Harmonizing mismatched data. Eg, knowing that
- Log in to comment