disagreement with higblast on igblast output

Issue #54 resolved

David Koppstein created an issue 2015-09-23

Hi, I just wanted to let you know that I ran MakeDb.py igblast with --regions -- it works great, thanks Namita for implementing this! I just wanted to report that I noticed some discrepancies with the region calls when compared to higblast. Igblast is still a bit of a black box for me so I'm not sure what to believe or whether this is just a systematic, but just thought I'd let you know.

Here is some sample output:

In [10]: df3[['CDR3', 'cdr3nt']].head()
Out[10]:
                                CDR3  \
0  ATAAGTGACTACGGTGACAATGGGCTTGACCTC
1                                NaN
2                                NaN
3                                 GC
4                                NaN

                                              cdr3nt
0  TGTGCGAGACATCGCGCCCCCGGTATAAGTGACTACGGTGACAATG...
1  TGTGCCAGCAGGCACACAGATACGCAGTATTTTGGCCCAGGCACCC...
2                  TGTCAACAATATTATAGTTTTCCTCGTACCTTC
3                         TGTGCCAGCAGCCAAGAAAACTCTGG
4                  TGTCAGCAATATTATGGTATTCCTGACACTTTT

where 'CDR3' is the MakeDb.py column, and cdr3nt is the higblast column.

It looks like in the instances where neither are NaN, MakeDb.py columns are in most cases a subset of the higblast columns:

In [12]: len(df3[['CDR3', 'cdr3nt']].dropna().apply(lambda x: x['CDR3'] in x['cdr3nt'], axis=1))
Out[12]: 1445

In [13]: sum(df3[['CDR3', 'cdr3nt']].dropna().apply(lambda x: x['CDR3'] in x['cdr3nt'], axis=1))
Out[13]: 1388

Comments (18)

David Koppstein reporter
- edited description
- 2015-09-23T21:05:45+00:00
Namita Gupta
Hey, just a double-check, I do assume that the reference database you use when running igblast is just the ungapped version of the IMGT reference repo that you then submit to MakeDb. I think if these two repositories are not consistent, the behavior probably won't be predictable.

If those repositories are consistent, could you send me a small example to play with? IgBlast output, input to IgBlast, your repo files, and a file with the data.frame snippet above so I know what higblast calls for the set of sequences?
- 2015-09-23T21:11:30+00:00
David Koppstein reporter
Hi Namita,

Indeed, I used the same reference database for running igblast and MakeDb.

I'll send you a link to two tsv files from MakeDb and higblast as well as the fasta input and igblast fmt7 output via e-mail.

If you merge on columns SEQUENCE_ID for the MakeDb table and DB_MB for the higblast table you can compare the outputs directly.

The igblast reference database that I used is in bitbucket.org/abvitro/abpair_pipeline/db/human -- I've granted you access.

Also, here's a link to the higblast binary that works properly when using --details all (there is a bug in the latest release): https://www.dropbox.com/s/m3pedu6i5kl1shl/higblast-0.9.6-SNAPSHOT.jar

which you can put into the directory from here: https://github.com/mikessh/higblast/releases/download/0.9.5/higblast-0.9.5.zip

if you want to run it yourself on your own datasets.

Best, David
- 2015-09-23T21:25:18+00:00
Former user Account Deleted
just a reminder for people with abpair_pipeline access: Kleinstein lab members are covered under the MTA, but be careful not to post any abvitro amplicon sequences, pipelines, or primer-combinations publicly (for example on third-party git sites)

thanks :)
- 2015-09-23T22:00:59+00:00
Jason Vander Heiden
Hey David,

I only see the ungapped repertoires at that location. Do you have the gapped ones somewhere as well? (MakeDb needs the gapped ones.)
- 2015-09-23T22:15:43+00:00

David Koppstein reporter

Hi Jason, thanks for the heads up about that -- we had been using the ungapped version for our in-house parser, and I think both Sonia and I didn't realize that MakeDb requires the gapped version. Is it mentioned in the documentation somewhere?

Does using the gapped version require any modifications to the command used to invoke igblast? Currently, we're using:

    params:
        seqtype=lambda wildcards: 'Ig' if wildcards.IGorTR == 'IG' else 'TCR',
        outfmt=lambda wildcards: "'7 std qseq'" if wildcards.outfmt == '7' else '3' # required for makedb to work, see https://bitbucket.org/kleinstein/changeo/issue/47/does-creategermlines-work-with-igblast

    run:
        command = (
            "igblastn -query {input.fasta} "
            "-germline_db_V db/{config[organism]}/{wildcards.IGorTR}.V.{config[organism]}.F+ORF+infrP.ungapped.fasta "
            "-num_alignments_V {config[igblastn][v_alignments]} "
            "-germline_db_D db/{config[organism]}/{wildcards.IGorTR}.D.{config[organism]}.F+ORF+infrP.ungapped.fasta "
            "-num_alignments_D {config[igblastn][d_alignments]} "
            "-germline_db_J db/{config[organism]}/{wildcards.IGorTR}.J.{config[organism]}.F+ORF+infrP.ungapped.fasta "
            "-num_alignments_J {config[igblastn][j_alignments]} "
            "-evalue {config[igblastn][evalue]} "
            "-domain_system imgt "
            "-num_threads {threads} "
            "-ig_seqtype {params.seqtype} "
            "-strand plus "
            "-organism {config[organism]} "
            "-auxiliary_data {input.optional_file}/{config[ref_auxiliary]} "
            "-outfmt {params.outfmt} "
            "-out {output.fmtfile}")

2015-09-24T14:52:02+00:00

Namita Gupta
Hey David, the way we implemented the new IMGT-numbering and junction identification in MakeDb is to take the gapped repertoire and re-insert those gaps into the sequence returned by IgBLAST, since BLAST cannot take a gapped reference database. I don't think your call to IgBLAST needs to be changed, just the repo that is passed to MakeDb needs to have the gapped V germline sequences and not the ungapped. I will add a note to the documentation to make that more clear. Can you let me know if that change fixes your region issues?
- 2015-09-24T15:01:19+00:00
David Koppstein reporter
I see, thanks. So, just to be clear, you typically run IgBLAST on <ungapped reference> (i.e. the same one that I've been running on), and then run MakeDb.py -r <gapped database>, where <gapped database> is e.g. here?
- 2015-09-24T15:13:26+00:00
Namita Gupta
Correct. However, it's important for numbering consistency that the ungapped and gapped references are the same except for the IMGT-gaps. When we run IgBLAST, the reference we use is made from the gapped database you linked to by using the method described in the IgBLAST README ( edit_imgt_file.pl on the fasta file followed by execution of makeblastdb).
- 2015-09-24T15:21:06+00:00
David Koppstein reporter
Sorry, I'm still a bit confused -- I thought you said that you run IgBLAST on the ungapped reference earlier? Could you perhaps send the commands you use to invoke IgBLAST and MakeDb so it's clearer to me?
- 2015-09-24T15:51:59+00:00
Former user Account Deleted
i think she means: 1- download the gapped database she linked to. 2- use the method described in the IgBLAST README (edit_imgt_file.pl ) to remove gaps and otherwise process 3- use that ungapped database to run igblast 4- use the original gapped database when running makeDB
- 2015-09-24T15:56:11+00:00
David Koppstein reporter
OK, that's what I thought as well, was just confused by phrasing. Got it, thanks! We're under a bit of a deadline crunch at the moment so I probably won't be able to run this until next week, but will let you know then.
- 2015-09-24T16:05:43+00:00
Jason Vander Heiden
Because it might simplify your life, I granted to you access to https://bitbucket.org/kleinstein/germlines.

There are two scripts in the scripts folder:
- getIMGT.sh - Downloads gapped germlines from IMGT automagically.
- IMGT2IgBLAST.sh - Builds IgBLAST db (may not be current with the output of getIMGT.sh, as we don't IgBLAST often)
- 2015-09-24T16:07:12+00:00
David Koppstein reporter
Thanks, I appreciate that. Just to be clear, the fasta files in that repository appear to be ungapped; do you also download and process the gapped germlines elsewhere?

EDIT: I think if you s/query=7.14/query=7.1/g then it'll get the gapped germline sequences.
- 2015-09-24T16:15:11+00:00
Jason Vander Heiden
The ones in the repo should be gapped. Only the V-regions are gapped. The D/J are the same as the ungapped. Also, as it might be relevant, you should now be able to feed the -r flag of MakeDb-igblast (and CreateGermlines) a list of files and/or folders, so you don't have to keep an organized set of folders if you'd rather do a glob for the correct files.
- 2015-09-24T16:28:16+00:00
David Koppstein reporter
Ah, it's only the V-regions. Got it. Thanks for all the clarifications =)
- 2015-09-24T16:34:51+00:00
Jason Vander Heiden
- assigned issue to
  
  Namita Gupta
- 2015-12-02T15:36:51+00:00
Jason Vander Heiden
- changed status to resolved
The output of our IgBLAST parser and IMGT seem to consistently agree, except when the aligners disagree, which I think is a better measure than agreement with higblast. Going to close for now. Feel free to reopen if any discrepancies pop up.
- 2016-02-03T17:19:44+00:00
Log in to comment

Assignee: Namita Gupta

Type: bug

Priority: major

Status: resolved

Milestone: –

Votes: 0

Watchers: 2