- edited description
disagreement with higblast on igblast output
Hi, I just wanted to let you know that I ran MakeDb.py igblast with --regions -- it works great, thanks Namita for implementing this! I just wanted to report that I noticed some discrepancies with the region calls when compared to higblast. Igblast is still a bit of a black box for me so I'm not sure what to believe or whether this is just a systematic, but just thought I'd let you know.
Here is some sample output:
In [10]: df3[['CDR3', 'cdr3nt']].head()
Out[10]:
CDR3 \
0 ATAAGTGACTACGGTGACAATGGGCTTGACCTC
1 NaN
2 NaN
3 GC
4 NaN
cdr3nt
0 TGTGCGAGACATCGCGCCCCCGGTATAAGTGACTACGGTGACAATG...
1 TGTGCCAGCAGGCACACAGATACGCAGTATTTTGGCCCAGGCACCC...
2 TGTCAACAATATTATAGTTTTCCTCGTACCTTC
3 TGTGCCAGCAGCCAAGAAAACTCTGG
4 TGTCAGCAATATTATGGTATTCCTGACACTTTT
where 'CDR3' is the MakeDb.py column, and cdr3nt is the higblast column.
It looks like in the instances where neither are NaN, MakeDb.py columns are in most cases a subset of the higblast columns:
In [12]: len(df3[['CDR3', 'cdr3nt']].dropna().apply(lambda x: x['CDR3'] in x['cdr3nt'], axis=1))
Out[12]: 1445
In [13]: sum(df3[['CDR3', 'cdr3nt']].dropna().apply(lambda x: x['CDR3'] in x['cdr3nt'], axis=1))
Out[13]: 1388
Comments (18)
-
reporter -
Hey, just a double-check, I do assume that the reference database you use when running igblast is just the ungapped version of the IMGT reference repo that you then submit to MakeDb. I think if these two repositories are not consistent, the behavior probably won't be predictable.
If those repositories are consistent, could you send me a small example to play with? IgBlast output, input to IgBlast, your repo files, and a file with the data.frame snippet above so I know what higblast calls for the set of sequences?
-
reporter Hi Namita,
Indeed, I used the same reference database for running igblast and MakeDb.
I'll send you a link to two tsv files from MakeDb and higblast as well as the fasta input and igblast fmt7 output via e-mail.
If you merge on columns SEQUENCE_ID for the MakeDb table and DB_MB for the higblast table you can compare the outputs directly.
The igblast reference database that I used is in bitbucket.org/abvitro/abpair_pipeline/db/human -- I've granted you access.
Also, here's a link to the higblast binary that works properly when using
--details all
(there is a bug in the latest release): https://www.dropbox.com/s/m3pedu6i5kl1shl/higblast-0.9.6-SNAPSHOT.jarwhich you can put into the directory from here: https://github.com/mikessh/higblast/releases/download/0.9.5/higblast-0.9.5.zip
if you want to run it yourself on your own datasets.
Best, David
-
Account Deleted just a reminder for people with abpair_pipeline access: Kleinstein lab members are covered under the MTA, but be careful not to post any abvitro amplicon sequences, pipelines, or primer-combinations publicly (for example on third-party git sites)
thanks :)
-
Hey David,
I only see the ungapped repertoires at that location. Do you have the gapped ones somewhere as well? (MakeDb needs the gapped ones.)
-
reporter Hi Jason, thanks for the heads up about that -- we had been using the ungapped version for our in-house parser, and I think both Sonia and I didn't realize that MakeDb requires the gapped version. Is it mentioned in the documentation somewhere?
Does using the gapped version require any modifications to the command used to invoke igblast? Currently, we're using:
params: seqtype=lambda wildcards: 'Ig' if wildcards.IGorTR == 'IG' else 'TCR', outfmt=lambda wildcards: "'7 std qseq'" if wildcards.outfmt == '7' else '3' # required for makedb to work, see https://bitbucket.org/kleinstein/changeo/issue/47/does-creategermlines-work-with-igblast run: command = ( "igblastn -query {input.fasta} " "-germline_db_V db/{config[organism]}/{wildcards.IGorTR}.V.{config[organism]}.F+ORF+infrP.ungapped.fasta " "-num_alignments_V {config[igblastn][v_alignments]} " "-germline_db_D db/{config[organism]}/{wildcards.IGorTR}.D.{config[organism]}.F+ORF+infrP.ungapped.fasta " "-num_alignments_D {config[igblastn][d_alignments]} " "-germline_db_J db/{config[organism]}/{wildcards.IGorTR}.J.{config[organism]}.F+ORF+infrP.ungapped.fasta " "-num_alignments_J {config[igblastn][j_alignments]} " "-evalue {config[igblastn][evalue]} " "-domain_system imgt " "-num_threads {threads} " "-ig_seqtype {params.seqtype} " "-strand plus " "-organism {config[organism]} " "-auxiliary_data {input.optional_file}/{config[ref_auxiliary]} " "-outfmt {params.outfmt} " "-out {output.fmtfile}")
-
Hey David, the way we implemented the new IMGT-numbering and junction identification in MakeDb is to take the gapped repertoire and re-insert those gaps into the sequence returned by IgBLAST, since BLAST cannot take a gapped reference database. I don't think your call to IgBLAST needs to be changed, just the repo that is passed to MakeDb needs to have the gapped V germline sequences and not the ungapped. I will add a note to the documentation to make that more clear. Can you let me know if that change fixes your region issues?
-
reporter I see, thanks. So, just to be clear, you typically run IgBLAST on <ungapped reference> (i.e. the same one that I've been running on), and then run MakeDb.py -r <gapped database>, where <gapped database> is e.g. here?
-
Correct. However, it's important for numbering consistency that the ungapped and gapped references are the same except for the IMGT-gaps. When we run IgBLAST, the reference we use is made from the gapped database you linked to by using the method described in the IgBLAST README ( edit_imgt_file.pl on the fasta file followed by execution of makeblastdb).
-
reporter Sorry, I'm still a bit confused -- I thought you said that you run IgBLAST on the ungapped reference earlier? Could you perhaps send the commands you use to invoke IgBLAST and MakeDb so it's clearer to me?
-
Account Deleted i think she means: 1- download the gapped database she linked to. 2- use the method described in the IgBLAST README (edit_imgt_file.pl ) to remove gaps and otherwise process 3- use that ungapped database to run igblast 4- use the original gapped database when running makeDB
-
reporter OK, that's what I thought as well, was just confused by phrasing. Got it, thanks! We're under a bit of a deadline crunch at the moment so I probably won't be able to run this until next week, but will let you know then.
-
Because it might simplify your life, I granted to you access to https://bitbucket.org/kleinstein/germlines.
There are two scripts in the scripts folder:
- getIMGT.sh - Downloads gapped germlines from IMGT automagically.
- IMGT2IgBLAST.sh - Builds IgBLAST db (may not be current with the output of getIMGT.sh, as we don't IgBLAST often)
-
reporter Thanks, I appreciate that. Just to be clear, the fasta files in that repository appear to be ungapped; do you also download and process the gapped germlines elsewhere?
EDIT: I think if you s/query=7.14/query=7.1/g then it'll get the gapped germline sequences.
-
The ones in the repo should be gapped. Only the V-regions are gapped. The D/J are the same as the ungapped. Also, as it might be relevant, you should now be able to feed the
-r
flag of MakeDb-igblast (and CreateGermlines) a list of files and/or folders, so you don't have to keep an organized set of folders if you'd rather do a glob for the correct files. -
reporter Ah, it's only the V-regions. Got it. Thanks for all the clarifications =)
-
-
assigned issue to
-
assigned issue to
-
- changed status to resolved
The output of our IgBLAST parser and IMGT seem to consistently agree, except when the aligners disagree, which I think is a better measure than agreement with higblast. Going to close for now. Feel free to reopen if any discrepancies pop up.
- Log in to comment