Steps to finish AIRR formats support

Issue #98 resolved
Jason Vander Heiden created an issue

Just a list to keep track of everything still do to. In no particular order:

  1. Revert standard column to JUNCTION from CDR3, as this seems to be the consensus.
  2. Add AIRR formats reader class.
  3. Generalize input and output for CreateGermlines, DefineClones and AlignRecords.
  4. Test that CIGAR alignment provides adequate information for CreateGermlines.
  5. Figure out whether ParseDb can be generalized.
  6. Figure out how to restrict output of MakeDb to only changeo/airr standard columns, while including additional columns from the input files, and without introduce empty columns with aligner specific fields.
  7. Determine whether we can generate CIGAR strings from IMGT output and do so, if possible.

@schristley, any additions? I might try to get through all this next week sometime.

Comments (33)

  1. Scott Christley

    Looks good, I'm stuck on other projects this week but hopefully can get back to AIRR next week.

  2. Jason Vander Heiden reporter

    Finished (1) and (6) for changeo format in 562b183. Still needs some changes to the airr-formats library on the airr side.

  3. Jason Vander Heiden reporter

    The answer to (7) appears to be "no" as indels are only reported for the V segment. Meaning, we'll probably have to do a proxy CIGAR for just the start/end positions or keep those columns, at least as optional fields required by CreateGermlines.

  4. Scott Christley

    How about the pRESTO annotations? I've got the new code integrated into the VDJServer app, and it is producing an AIRR TSV but DUPCOUNT isn't showing up. AIRR is using "duplicate_count" so I guess a mapping needs to be added, and other standard pRESTO annotations we should consider?

  5. Jason Vander Heiden reporter

    Hrm. That's tricky, because those annotations can really be anything. For the changeo format, we just pass them through with flag to decide if they should or should not be included (--noparse).

    Though, DUPCOUNT (duplicate_count) and CONSCOUNT (read_count) are common enough that we should probably translate them.

  6. Jason Vander Heiden reporter

    Oh, and we need to make a couple changes to the library and merge it into master. Specifically the follow, IIRC:

    1. cdr3_na to junction_nt
    2. cdr3_aa to junction_aa

    I feel like there was something else, but I'm drawing a blank.

    Oh, and I need to add metadata read/write support to MakeDb. Once that's done, I think it'd be finished enough for now. I can probably draw a line at functioning AIRR output from MakeDb for release. Then sort out AIRR input to other tools later.

  7. Jason Vander Heiden reporter

    Okay, I added the cdr3 to junction changes and DUPCOUNT/CONSCOUNT into the latest commit of changeo (81c2fc0) and the airr-formats library (changeo-integration branch).

    As for meta data... Do we have a decision on how that's to be formatted? Yaml, yes?

    This is basically all we get from IgBLAST:

    # IGBLASTN 2.6.1+
    # Database: /home/jason/share/igblast/database/imgt_human_ig_v /home/jason/share/igblast/database/imgt_human_ig_d /home/jason/share/igblast/database/imgt_human_ig_j
    # Domain classification requested: imgt
    

    We can get more from IMGT:

    Date    Thu Mar 10 23:01:19 CET 2016
    IMGT/V-QUEST programme version  3.3.6
    IMGT/V-QUEST reference directory release    201606-4
    Species Homo sapiens
    Receptor type or locus  IGH
    IMGT/V-QUEST reference directory set     F+ORF+in-frame P
    Search for insertions and deletions  yes
    Nb of nucleotides to add (or exclude) in 3' of the V-REGION for the evaluation of the alignment score   0
    Nb of nucleotides to exclude in 5' of the V-REGION for the evaluation of the nb of mutations     0
    
  8. Scott Christley

    Great, I will test them out too.

    For the metadata, we are using the PROV data model. The PROV python library supports a few serialization formats but I've been using JSON. Not sure if it supports YAML. My intent was for the AIRR reference library to provide some simple methods for recording this info, instead of tools using the PROV library directly, though they can certainly do that.

    Technically because Change-O is not calling IgBlast/IMGT itself (though maybe you have a pipeline that does?), it doesn't have to record the VDJ assignment metadata data. IgBlast/IMGT themselves, or a tool that calls them, would record this info. Change-O should just record the 1 activity of parsing to generate the rearrangement file. My thoughts were a little different before, as I created 3 methods:

    addRearrangementActivity()

    addRearrangementActivityWithParser()

    addAnnotationActivity

    So addRearrangementActivityWithParser() was my initial thought for Change-O where it could record both the VDJ assignment and the parsing, but I'm thinking maybe add a 4th method that just records the parsing activity addParserActivity()? The methods should be enhanced a little too as PROV allows any number of user-defined attributes to be added, which is how we will add version numbers, program names, parameters, etc.

  9. Jason Vander Heiden reporter

    Okay, sounds good.

    Hrm. My inclination would be to go with addParserActivity() instead of addRearrangementActivityWithParser(), so that if you wanted to add both alignment metadata and parsing metadata, you'd just call both in series (instead of one function with lots of arguments).

    Also, I guess we need to add a flag to control whether the metadata goes into the header or a separate file? https://github.com/airr-community/airr-formats/issues/30

  10. Jason Vander Heiden reporter

    I'm going to remove the provenance steps for now, so I can merge this into default.

    I'll open a separate branch for meta data stuff once we makes some final decisions.

  11. Scott Christley

    That sounds good. With the design we are considering, Change-O will need to support the user passing in an existing metadata file, so airr-formats can append to it.

  12. Scott Christley

    I just noticed that the AIRR output is not providing values for the "end" fields (v_end, d_end, j_end), and also missing the "start/end" fields for framework and cdr regions. The command line looks like this:

    MakeDb.py igblast -s $1 -i $2 -r $VDJ_DB_ROOT/${organism}/ReferenceDirectorySet/${seqType}_VDJ.fna --regions --scores --partial --format airr
    
  13. Jason Vander Heiden reporter

    Okay, I'll take a look. Haven't had a chance to work on it since we did the merge and field finalization.

  14. Jason Vander Heiden reporter

    Also, we need to output 0-based half-open intervals for the positions. Changeo uses 1-based closed intervals internally.

    I'll change this eventually, but it'll require a lot of testing when I change this. For now, we should just modify the positions upon output for the AIRR format.

  15. Jason Vander Heiden reporter

    VDJ end positions are now output.

    The FWR/CDR start/end positions are not a trivial fix, because these aren't calculated. The IMGT positions are used for these.

    I think we should change the library so that it doesn't print optional columns with no data. That'll fix the CDR/FWR positions for now and be cleaner in general.

    Also, looks like I need to convert empty integer fields from 0 valued to None valued.

  16. Scott Christley

    I tested out your commits on VDJServer and they are producing VDJ end positions. They appear to be correct based upon the igblast output, but yeah it's 1-based. Does the 0-based intervals require changes to the CIGAR string? One thought is to let the AIRR library translate between 1-based and 0-based. Though this would only work for fields that the library knows are positions and can translate.

    One thing I noticed is it appears that rev_comp is always false.

    In general, I'm not sure how to know ahead of time if there is no data. If changeo knows based upon its parameters or such, that it won't have data for certain columns, then that makes sense.

  17. Jason Vander Heiden reporter

    Naw, the CIGAR should be fine either way. CIGAR is lengths (base counts), not start/end positions.

    I'll double check rev_comp. With the data I've been testing on it's expected to always be false because the sequences that go into IgBLAST are always 5' to 3'.

    Hrm. The easiest way to handle missing data is probably to have arguments for optional fields and custom fields passed to __init__ for the writer. Probably best accomplished by splitting the RearrangementsFile into separate reader and writer classes in the spirit of the csv library.

    Maybe with arguments like so:

    • optional = False - no optional fields written
    • optional = True - all optional fields written
    • optional = list - list defining a subset of optional fields to write.
    • custom = list - list defining a custom fields to write.

    I don't like the idea of having an argument that accepts multiple types (bool and list), so something like this would be cleaner, but less flexible:

    • optional = None - no optional fields written
    • optional = list - list defining a subset of optional fields to write.
  18. Jason Vander Heiden reporter

    I pushed a conversion to 0-based start positions.

    v_germ_end,d_germ_end and j_germ_end are missing from the rearrangements.yaml file.

    Also, the ordering of the column is wonky. Eg, fwr4_end, v_end, d_start.

  19. Jason Vander Heiden reporter

    All right, I fixed the above (missing columns and ordering).

    Just need to decide what to do with the optional/custom fields and class structure.

  20. Scott Christley

    There is a mechanism to tell the AIRR library about the columns, might have disabled it when switched moving to csv package, so will get it working again.

  21. Scott Christley

    From looking at the code, I can see that IMGTReader and igBLASTReader are using ChangeoSchema to extend the field list based upon command options (parse_score, parse_region, etc). Now this is almost the set of fields to tell AIRR about, after translating them with AIRRSchema, but there are two problems:

    1. The list is missing the derived fields.
    2. The list contains fields we don't want to output.

    An example is j_seq_length which should be excluded but is needed to construct the derived field j_seq_end.

    We don't know the actual set of AIRR fields until changeo.Receptor toDict() is called which adds annotations and the derived fields.

    So to handle this I did a few things:

    • Added parameter to AIRRSchema.asAIRR() which indicates whether to do a strict mapping or not, strict=true meaning that non-AIRR fields are not mapped.
    • When writing a row, check if its the first row, and tell AIRR about all of the fields.

    This doesn't completely work because it is putting in some fields that are specific to IMGT, like p3v_length, etc. This is because it is in the changeo.Receptor fields, but for IgBlast these fields are empty. I don't see a way to eliminate them. They aren't in the set of fields when the IgBlast parser is read, but they are added by Receptor.toDict().

    I think long-term, IMGTReader and IgBlastReader should use changeo.Receptor to get the appropriate field list based upon the command options, and then that list is either translated to ChangeO or AIRR.

    I also fixed a bug with d_germ_end and with rev_comp. I will submit a pull request shortly.

  22. Jason Vander Heiden reporter

    Thanks. I'll merge the pull request as soon as I can. Something came up today, so it might be a few days before I get to it.

    Briefly, how changeo works is that changeo.Receptor is what's actually used by the tools. It doesn't define what fields are input/output, that's supposed to be handled by the various reader/writer classes. I do currently have arguments to the __init__ of IgBLASTReader and IMGTReader defining what to parse, but I'm probably going to cut those arguments. They aren't necessary and it'd be cleaner to just always parse everything, even if it's a bit slower.

    What fields are written, based on the commandline arguments, is determined by the ChangeoWriter class (see MakeDb.writeDb). The changeo.Receptor class contains more fields than are written (eg, the derived fields aren't written for changeo formatted data). Ideally, the AIRRWriter works the same, so that changeo.Receptor isn't involved in I/O at all. Then data can be read or written in changeo, airr or whatever format freely without any changes to the functions that do any of the analysis. It still needs some cleanup to do this properly.

    The goal is: input (airr, changeo, igblast, imgt) -> changeo.Receptor -> output (airr, changeo).

    So, getting the field list from changeo.Receptor.toDict() won't work with this architecture, because changeo.Receptor shouldn't care or define what you plan to write. It should have everything for all supported formats. That's what I was thinking with adding the @property set for derived fields. Changeo doesn't use _end it uses _length, so I figured having both in changeo.Receptor was the easiest way to support both formats.

    The ChangeoSchema and AIRRSchema class are just kind of placeholders until we get everything finalized. Ideally they'd just be yaml files.

  23. Jason Vander Heiden reporter

    I removed the output of the optional fields with missing data. It's a bit kludgey and I don't like it, but it seems to be working (I need to test more).

    I think it could be a lot cleaner with a bit of redesign, but we should discuss if we want to bother with that right now.

  24. Jason Vander Heiden reporter

    I did a little testing. Seems good. If there are no probems on your end, then I think we are done-ish. Yes?

    I can merge into default and we can worry about code cleanup at some later point.

  25. Scott Christley

    Ran into some cluster problems yesterday that prevented me from testing, but will work on it today, will let you know by later this afternoon.

  26. Jason Vander Heiden reporter

    Cool. No rush. I probably won't do the merge today anyway, because it'll likely be a bit of a chore (it's months out of sync).

  27. Scott Christley

    Ran some tests and everything looks good, just a few ideas:

    • The presto annotations (duplicate_count, consensus_count) always come through whether they have data or not.
    • The score, cigar, evalue, etc., fields are marked as mandatory, so maybe make parse_scores=true by default for format=airr?
    • It might be nice to optionally add the extra changeo fields for format=airr.

    Let's make these enhancement requests for future, unless you feel it's a quick fix, but I would really like to lock-down the code as I've got to re-run a bunch of analysis for the Dec. meeting.

    I tried to merge master but got conflicts so no easy luck there...

  28. Jason Vander Heiden reporter

    I can fix the count annotations and the score annotations.

    The extra fields can be fixed pretty easily. The only tricky part with that is the count columns, because passing the presto annotations through is easy, but then you get both a duplicate_count and dupcount column. I'll figure it out. Probably just need to pass the presto columns through the AIRRSchema before adding them.

    Yeah, there are a lot of conflicts. The AIRR Formats branch has a fairly large refactor of the core library. I'll take care of it.

  29. Jason Vander Heiden reporter

    I've fixed the count annotations, marked the score fields as always True and added pass through and conversion of fields from the fasta headers.

    I did the merge. Give me a day to test. I'm sure it broke something... I just need to figure out what. :)

  30. Jason Vander Heiden reporter

    All right, I think all post-merge issues are fixed.

    Let's see where we are now? Make sure everything works on your ends and meets your needs?

    Still a lot of code cleanup to do, but I'll just plink at that very slowly. It needs unit tests anyway, so I should probably combine the refactor with test authoring.

  31. Jason Vander Heiden reporter

    Now that export of the AIRR format from MakeDb is reasonably finished, I'm going to close this mega-issue in favor of smaller issues with the ancillary tasks. Starting with #108 and #109.

  32. Log in to comment