MakeDb igblast partial alignments filter not working correctly

Issue #133 resolved
jockbanan created an issue

Hello! I recently tried to use changeO + igblast on an example dataset and some 99.5 % of my sequences were filtered out by makeDb and were only retained when the --partial option was present. Looking deeper into it, I think the tool actually filters out perfectly complete alignments as partial, while it retains some crappy partial alignments as complete. I have two examples here:

seq1 TGCGGGGCTCTGGGAGAGGAGCCCAGCACTGGAAGTCGCCGGTGTTTCCATTCGGTGATCAGCACTGAACACAGAGGACTCACCATGGAGTTTGGGCTGAGCTGGGTTTTCCTCGTTGCTCTTTTAAGAGGTGTCCAGTGTCAGGTACATCTAGTGGAGTCTGGGGGAGACGTGGTCCAGCCTGGGCGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCCCCGTCAGTAACTATCCTATGCACTGGGTCCGCCAGGCTCCAGGCAAGGGCCTAGAGTGGGTGGGACTTATATCCAATGATGGAGAAGGTAGAGCCTATGGAGACTCCGTGAGGGGCCGAGTCACCATTTCCAGAGACCCTTCCAAGAACACAGTGTATCTGCAAATGAACAGCCTGAGACCTGAGGACACAGCTGTGTATTTCTGTGCGAAAGAGGAGTATAGCAGTGGGCGGGCGGGATTGTTTGTCTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAGCATCCCCGACCAGCCCCAAGGTCTTCCCGCTG

This is a nice sequence containing everything from 5' UTR to a portion of the constant region, yet it gets filtered out as partial, though igblast aligns everything from complete FR1 to FR4. Looking at the code, I think it gets filtered because of line 137 in MakeDb.py:

try: imgt_check = (rec.junction == rec.sequence_imgt[309:(309 + rec.junction_length)])

In the results table, the junction does not start at position 309 of sequence_imgt.

The second example is the opposite:

seq2 ATATTAAAAAATAAAAATATATAGTTCGTAAAAAAAAGAAAAGAATGGATCAAGAAGTGAGAGTGATGAGTGCATGAAATATTAATATGAAATGAAGATGCTTGGTGGGTTTTAGTATTAATTAATGAATTTGCTTAGATTTGTTTTATGTTTTTTTTTTTTTTTTTTTTTCAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTAAGTATACACATTGGGCAGCCCTGATTAGGGGGAAAAGGGTTGGGGCGGATGCACTCCCTGGAGAGACAGTGACCGTTGGCTGCCTCGCACAGGACTTCCTTCCCGACTCCATCACCCCCAAGATCTGACAACACCTTACTCTGCGTTGATACCACTAACAAAACCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAATATATAAACAAATATCCTAACACTACTCTACACACCCACACCCTCCTTATATCTTCCCCCCCTCTACCCATCCCCACTCTATACTAATACTCTAATCATTCTCCCCCCCACCACACCATTCTCCTATATCACACAATTTTATAATAAATTTATAAT

in this sequence, igblast only identifies 18 nucleotides of a V-gene and 7 nucleotides of a J-gene. It though passes the "strict" filter and appears in the results even when --partial parameter is not present.

I think this is not the expected behaviour, or do I understand the definition of a partial alignment incorrectly?

Thank you! Adam

Comments (8)

  1. Jason Vander Heiden

    Hi @jockbanan, thanks for reporting this. I'll take a closer look at this today/tomorrow and get back to you.

    Briefly, partial is intended to let anything through with at least one valid segment call, regardless of how good the alignment is. Non-partial is anything with a complete set of annotations, again, regardless of quality. (Data from a lot of protocols are truncate in some way.) You can post-filter on the score/e-value to remove the poor alignments.

    The 309 junction start test is checking to (a) make sure the IMGT-gapped sequence was correctly built from the IgBLAST output and (b) verify the junction is extracted correctly in versions of IgBLAST old than 1.6 (it used a different definition back then). So that's more of an "invalid" alignment than a "partial" one. Even though this will drop some biologically valid sequences. If you add the --failed argument you should get these records in the fail file though.

    Most of our downstream tools require a valid IMGT-gapped sequence, so I worry about passing these ungappable sequences through silently, but maybe there's a better solution than just failing them. Let me look at the code and get back to you. We might need some better documentation, another filtering flag, or to change the behavior of --partial.

  2. jockbanan reporter

    Hi Jason, thank you for you reply. I understand now that the second sequence makes it through the filter, because igblast finds some calls, even though they are very short and of low quality.

    For the first sequence, the reason is then apparently that the tool failed to build the IMGT-gapped sequence correctly, right? This seems to happen to nearly all of my sequences.

    I forgot to mention, I tried to use igblast 1.7 (igblstn 2.6.1+) and igblast 1.9 (igblastn 2.8.0+), both with pretty much the same result. The sequences are human IGH and I'm using the IMGT reference.

  3. jockbanan reporter

    OK, I think I know where the problem is. I built IMGT database using the scripts according to the documentation. Then I passed the fasta files generated by imgt2igblast.sh (igblast/fasta folder) as -r parameter to MakeDb.py. This results in most of the sequences in any of my samples not passing the 309 junction start test. But when I use the original imgt fasta files downloaded by the fetch_imgtdb.sh script (germlines/imgt/human/vdj folder) as -r for MakeDb.py, the result is OK - most of my sequences pass and the contents of the SEQUENCE_IMGT field in the results file seems to be gapped correctly.

    I don't know whether this is exactly a bug, but if not, I think it should be documented that one should not use the fasta files generated by imgt2igblast.sh, it's a bit confusing I think.

  4. Jason Vander Heiden

    Ah, okay, that would do it. That's not a bug. The IMGT-gapped reference set is mandatory. But, this is a common mistake. We have it documented in a couple places already (MakeDb-igblast command line help, warning box in the example), but we obviously need to make it more clear somehow.

    Maybe we should add some sort of validation check on the references passed to MakeDb's -r argument? Or restructure the docs so that the warning is near the top of the page? It's at the bottom right now, so it's easy to miss.

  5. jockbanan reporter

    I read the warning before, it is visible enough. It's just that I didn't realize the output of the database prep scripts would not be IMGT-gapped. It's kind of logical, but most of the times in any tools what so ever, one uses the files generated by the tool's prep programs, so it feels like the most natural thing to do. I think a single sentence would do, but a validation check would be nice as well (so that one knows instantly what happened).

  6. Jason Vander Heiden

    Okay, will do. There's a few other issues with error/warning clarity too (see #134). We'll try to get all this into the next release. It might take a couple weeks for the release though.

  7. Jason Vander Heiden

    Added a naive check for IMGT-gapped reference sequences to MakeDb and CreateGermlines. Clarified the commandline help for --partial and -r. Reworked the IgBLAST documentation to (hopefully) be more concise and explicit about the requirements.

  8. Log in to comment