DefineClones doesn't work when db was created with MakeDb igblast --asis-calls

Jason Vander Heiden

The IMGT nomenclature is pretty deeply embedded in changeo, alakazam, shazam, tigger, rdi, and scoper. Not all of that code uses the regex and extraction functions defined in changeo and alakazam, so I don't think it would be as simple as passing in an alternative regex, as it would be in those two (alakazam::getSegment already takes an alternative regex). MakeDb-igblast having the option to pass unmodified calls through to a tsv is one thing, because it doesn't interact with the gene names at all and, prior to the AIRR format, it at least let people get the IgBLAST output into a tsv format. Adding the equivalent of --asis-calls throughout the pipeline would be a huge task. And you can't get away with just using raw strings, because the nomenclature adherence is needed for things like:

1. Harmonizing mismatched data. E.g., knowing that Homosap IGHV1-69*01 F in the V_CALL column corresponds to >A12345|IGHV1-69*01|Homo sapiens||ABC|F||| in the fasta header of the reference database.
2. Extracting the allele, gene and family levels (IGHV1-69*01 vs IGHV1-69 vs IGHV1).
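
To make those two points concrete, here is a minimal Python sketch of what nomenclature-aware extraction looks like. The regex and the function names (`get_allele`, `get_gene`, `get_family`) are made up for illustration and are not the patterns changeo or alakazam actually use; the pattern only covers simple IG names like the ones in this thread.

```python
import re

# Illustrative pattern only (an assumption, not changeo's or alakazam's regex).
# Matches simple IMGT-style IG allele names such as IGHV1-69*01 or IGHJ4*02.
ALLELE_RE = re.compile(r"(IG[HKL][VDJ][A-Z0-9]+(?:-[A-Z0-9]+)*\*[0-9]+)")


def get_allele(call):
    """Return the first IMGT-style allele name found in a string, or None."""
    match = ALLELE_RE.search(call)
    return match.group(1) if match else None


def get_gene(call):
    """Return the gene level (allele name without the *NN suffix), or None."""
    allele = get_allele(call)
    return allele.split("*")[0] if allele else None


def get_family(call):
    """Return the family level (gene name up to the first hyphen), or None."""
    gene = get_gene(call)
    return gene.split("-")[0] if gene else None


# The same allele can be recovered from an IMGT-style V_CALL value and from
# a reference fasta header, which is what lets the two be harmonized.
v_call = "Homosap IGHV1-69*01 F"
header = ">A12345|IGHV1-69*01|Homo sapiens||ABC|F|||"
same_allele = get_allele(v_call) == get_allele(header)
```

A call that doesn't follow the nomenclature (say, `SomeRandomName`) gives `None` at every level, which is roughly the situation downstream tools are in with --asis-calls output.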

I think if I were going to add support for this, I'd probably add a new subcommand to ConvertDb to convert the V_CALL, D_CALL and J_CALL fields into the IMGT nomenclature, using either some rules or a mapping file taken as input. The latter is probably easier and less error prone. You should already be able to do this with ParseDb-update, but the command would be really ugly, so the new subcommand would just be a variation on ParseDb-update. Not sure exactly how to structure the mapping file, but it should probably enforce family/gene/allele in some way. E.g.:

INPUT             CHAIN   GENE    ALLELE
SomeRandomName    IGHV    1-69    01
SomethingElse     IGHD    2       01

Maybe? I think it will take some trial and error with users to nail down the need and the implementation. Not sure this is the best approach, but it seems a lot simpler to me than trying to disentangle the IMGT nomenclature from the whole suite.
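
A rough Python sketch of how the mapping-file idea could work, assuming the whitespace-delimited layout from the example above. Nothing here exists in ConvertDb today; `load_mapping` and `convert_calls` are hypothetical names, and treating a call field as a comma-separated list of names is also an assumption.

```python
import io

# Example mapping file with the layout proposed above (hypothetical format).
MAPPING_TEXT = """\
INPUT             CHAIN   GENE    ALLELE
SomeRandomName    IGHV    1-69    01
SomethingElse     IGHD    2       01
"""


def load_mapping(handle):
    """Read a whitespace-delimited mapping file into {input name: IMGT-style name}."""
    header = handle.readline().split()
    mapping = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(header):
            raise ValueError("Malformed mapping line: %r" % line)
        row = dict(zip(header, fields))
        # Requiring separate CHAIN, GENE and ALLELE columns is what enforces
        # the family/gene/allele structure in the converted name.
        mapping[row["INPUT"]] = "%s%s*%s" % (row["CHAIN"], row["GENE"], row["ALLELE"])
    return mapping


def convert_calls(record, mapping, fields=("V_CALL", "D_CALL", "J_CALL")):
    """Return a copy of a record with its call fields renamed via the mapping.

    Raises KeyError for a name missing from the mapping, so unmapped calls
    are not passed through silently.
    """
    converted = dict(record)
    for field in fields:
        value = record.get(field)
        if not value:
            continue  # leave empty or absent calls untouched
        names = [name.strip() for name in value.split(",")]
        converted[field] = ",".join(mapping[name] for name in names)
    return converted


mapping = load_mapping(io.StringIO(MAPPING_TEXT))
record = {"SEQUENCE_ID": "seq1", "V_CALL": "SomeRandomName",
          "D_CALL": "SomethingElse", "J_CALL": ""}
converted = convert_calls(record, mapping)
```

With the example mapping, `SomeRandomName` becomes `IGHV1-69*01` and `SomethingElse` becomes `IGHD2*01`. Failing on unmapped names is one possible design choice; passing them through unchanged would be the other obvious option.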

2019-01-23T17:19:41+00:00

Comments (2)

ssnn reporter
- edited description
- 2019-01-23T15:37:05+00:00