Pull out additional columns with MakeDB

Chris Bolen reporter

marked as minor

2015-06-22T23:00:32+00:00

Jason Vander Heiden

Need to add the same functionality for the IgBLAST parser.

2015-06-29T16:47:15+00:00

Former user Account Deleted

Is Vscore the Evalue/ bit score?

Have you guys run into the thing where standalone igblast just completely ignores the Evalue threshold ? only tested in 1.3.0 . i totally forgot about this "feature"

2015-06-30T12:30:35+00:00

Jason Vander Heiden

This would be the bit score, but perhaps we should also add % identity and E-value. Though, there is no E-value for IMGT. Sounds like you need an E-value column though?

I noticed the issue on the abpair repo, but haven't experienced IgBLAST ignoring the E-value threshold. Though, this is entirely due to my avoidance of IgBLAST. It's far too "feature" rich.

2015-06-30T16:57:31+00:00

Former user Account Deleted

bit score would be fine instead of E-value, thanks though!

2015-06-30T17:05:41+00:00

Jason Vander Heiden

Regarding pulling out additional columns in a generalized fashion (aside from the scores)... I don't see anything worth pulling out other than the region (CDR, FWR) positions and/or sequences. We had planned to do this anyway, so I made a separate issue (issue ~~#46~~) for this.

Anything else you can think of? There are a lot of columns across multiple files in the IMGT output, so if we do decide we need a general approach then doing them in defined "chunks" of reserved columns names might be the cleanest way, as it wouldn't involve having to maintain a list of column name translations and locations.

2015-06-30T18:16:52+00:00

Former user Account Deleted

The only thing I can think of are the V alignment statistics, mainly percent identity.

( I still don't quite understand why those were taken out (you said it was redundant with something, but maybe that's not the case for igblast?) . I know the prescribed method for getting this number is via CreateGermlines (+ something ?) but last I heard wasn't working for igblast . )

2015-06-30T19:15:42+00:00

Jason Vander Heiden

We could add both the % identity and E-value. The main concern with % identity is that people might use it as mutation count, which we'd like to avoid because of the N/gap issues and inconsistency across aligners. But if people want to do that, then who are we to say "no"?

2015-06-30T19:25:48+00:00

Jason Vander Heiden

assigned issue to

Jason Vander Heiden

2015-06-30T19:34:42+00:00

Former user Account Deleted

what?!? who are these people ?? ;)

but how would you resolve the inconsistency across aligners ?

2015-06-30T19:35:20+00:00

Jason Vander Heiden

Heh. There out there somewhere...

I don't think we can fix the inconsistency problem. I guess people know which aligner they used though, so we might just need to make it clear in the docs that the alignment stats aren't raw output from the aligner and comparable between IMGT and IgBLAST.

2015-06-30T19:38:34+00:00

Former user Account Deleted

i think david just started digging into changeo to pull out the mutation column ... i think we could really use it for a deadline next week.

2015-06-30T19:55:25+00:00

Jason Vander Heiden

Mutations from IgBLAST? The '% identity' or 'mismatches' columns?

2015-06-30T19:58:38+00:00

Former user Account Deleted

oops you caught me.

hmmm, thinking about which would be less bad, less redundant, or more complementary with the IMGT colnames

is V_SEQ_LENGTH the length of the input sequence aligning to the V germ ?

would MISMATCHES / V_SEQ_LENGTH be roughly 1- % identity ? Then we could subtract out the Ns in the V aligned region for as a quickie correction

2015-06-30T20:10:15+00:00

Namita Gupta

V_SEQ_LENGTH is intended to be the length of the input sequence aligning to the V germline. This is not always the case for IMGT, as they have inconsistent numbering within the input sequence, but is definitely the case for IgBLAST. What you are suggesting (MISMATCHES / V_SEQ_LENGTH) would be a quick hack for getting mutations if CreateGermlines doesn't work. Once it does, it would be better to use one of the functions in the R packages (either getSeqDistance in alakazam or calcObservedMutations in shm) and add the column that way.

2015-06-30T20:22:01+00:00

Jason Vander Heiden

Also, I'm not sure what the relationship is between all these:

% identity
alignment length
mismatches
gap opens
gaps
evalue
bit score

in the IgBLAST output.

We're happy to add whatever y'all need from the output, but I think getting CreateGermlines working is probably the best approach. That might not be a quick solution though. '% identity' would be the most consistent with the IMGT output; it'd be calculated differently, but IMGT at least reports a value for it. I'm sure you could do the same correction from '% identity' if you knew how the percentage was calculated.

I think it's '% identity' = 1 - 'mismatches' / 'alignment length', where 'V_SEQ_LENGTH' = 'q.end' - 'q.start' + 1 and 'alignment length' = 'V_SEQ_LENGTH' + 'gaps'... which might be a bug on our end. Not sure.

2015-06-30T20:37:52+00:00

Former user Account Deleted

in case it's useful, I DID try CreateGermlines with MakeDb-parsed igblast, using this command:

CreateGermlines.py -d IGHV_igblast_db-pass.tab --sf SEQUENCE_VDJ --failed --log CG_try3.log --outdir try3 -g vonly -r ./germlines

all sequences fail with this error (regardless of -g option)

ERROR> Germline sequence is 131 nucleotides longer than input sequence

Where the number of nucleotides (131 in this case) seems to be V_SEQ_START - 1

2015-06-30T21:15:15+00:00

Jason Vander Heiden

Thanks. I dumped this into issue ~~#47~~.

Right now, I'm thinking I'll just bundle several of the alignment score columns into an optional flag and add them to the output - '% identity', 'bit score' and 'evalue' at the very least. Might do it when I get back from the gym tonight or maybe tomorrow. It should be trivial, and if it's properly segregated from the main data standard we don't have to agonize so much about it being consistent across aligners.

2015-06-30T21:24:25+00:00

David Koppstein

Hi, I made some quick changes to the ChangeO parser as a stopgap for our deadline. Briefly, I added these columns:

V_IDENTITY, V_ALIGNMENT_LENGTH, V_MISMATCHES, V_GAP_OPENS, V_GAPS, V_EVALUE, and V_BIT_SCORE

Here is the commit. I defer to you guys on how best to move forward in the long run in a way that can incorporate any/all of these variables while still keeping the format consistent between the IgBlast and IMGT output tables (if that is the goal).

P.S. I just noticed that you just made a few new commits, so this is a little behind/diverges from your ChangeO. Also, we both added a _float static method to IgRecord. Hah.

P.P.S. is the pandas parser not sufficient for inferring type?

2015-06-30T21:25:15+00:00

Jason Vander Heiden

Heh. That _float method is important. I don't think we've checked the data types pandas infers. We should probably look at that. I'm sure there is no default inference of Bio.Seq.Seq data types, but that might be doable with the converters argument to read_csv(). Not sure. We might ditch pandas anyway. The indexing can be kinda slow if you retrieve by value (compared to using a dictionary).

I did some funny things with IgRecord when I originally wrote it. By 'funny' I mean overly complicated. I need to think of a cleaner way to handle the conversion of the data types.

Can you give me access to your changeo fork? I can't get to the commit...

2015-06-30T21:42:13+00:00

David Koppstein

Whoops, sorry -- you should have access now.

2015-06-30T21:56:13+00:00

Former user Account Deleted

I suppose *_START columns in changeo 1-based, in keeping with blast conventions .

2015-07-02T19:05:23+00:00

Jason Vander Heiden

Yes, I think numbering is 1-based.

2015-07-02T22:28:09+00:00

Jason Vander Heiden

I added a --scores flag to both the IgBLAST and IMGT parsers, which adds the columns:

V_SCORE
V_IDENTITY
V_EVALUE
J_SCORE
J_IDENTITY
J_EVALUE

In the case of IMGT, the E-value columns with be empty.

The internals are going to need some tinkering to do this in a more general way, such that IMGT and IgBLAST don't need to have the same columns. Sounds like @dkoppstein has a fix for your immediate IgBLAST problem (though that version will probably break for IMGT output).

I'm going to leave the issue open, but set the task down for a bit until we can restructure some stuff in MakeDb. We need to add support for another couple aligners, so some thinking is required...

2015-07-02T22:35:32+00:00

Former user Account Deleted

yeah we have a great fix thanks to DK.

I was going to ask, what's your main objection to using 1-V_IDENTITY as an estimate for V mutation ? is it:

1 - Ns counted as mismatches and aligned columns (seems easy to fix) 2 - aligner inconsistencies ( seems somewhat inevitable, so best practice is to just compare apples to apples) 3 - lack of allelotyping/genotyping ( i don't have a sense for how to quantitate this but i assume it's a minor contribution to me) or maybe... 4 - that part of the aligned V region (terminii?) might have been altered from germline during recombination rather than maturation (not even sure if this is a thing... only occurred to me today)

right now, i might guess that the relative contributions of these are: 1 > 4 > 2 > 3

2015-07-03T01:35:14+00:00

Former user Account Deleted

ps what other aligners ? :D

2015-07-03T01:40:15+00:00

Jason Vander Heiden

Well, I think the fact that there is a list of concerns is probably the most important concern, but I would say my concerns (in order) are:

Not knowing exactly what V_IDENTITY means (how it's calculated). This is fixable with a bit of work, but I don't know that it's the best use of time.
Mutation counts not being correct. This isn't just Ns, but also gaps. Maybe an in-frame gap 3 bp long should be counted as a mutation, but otherwise it should probably be considered a sequencing error. This is also fixable if you solve (1), but now you have two problems to solve...
I think inconsistency across aligners is either extremely important or not important at all. If you are going to use two aligners in the same study (eg, benchmarking different aligners), then I think it's essential that mutations be counted the same way. If you won't be comparing then I doubt it matters. Though, I think there are long-term benefits to consistency, because it lets you make comparisons to your previous work.
You can't really genotype without working with blood data. So it's a good thing to do, but not always applicable.
Not sure about your (4), but having your analysis restricted to the V or J segment separately is a potential problem as well.

So I think if you address (1) and (2), then you're okay. Other than that, I have philosophical concerns...

Alignment error isn't the same thing as mutation, and the identity score is meant to capture the former. I find that anytime you try to repurpose something that was intended to solve a different problem you end up wasting a lot of time in the long run for a short-term solution. You get these series of hacks that are a headache to maintain, because inevitably some edge case pops up that you didn't account for and you have to add another layer of duct tape. You also end up generating code that isn't reusable, which means it's not extensible at all. If you want to count mutations as unique mutations per clonal group, or count replacement vs silent mutations, or count mutations in the full V(D)J sequence, then you need to implement a different solution... and that solution would also have solved your first problem, making the initial hack a waste of time.

P.S.

We have some ImmunoSEQ (gasp!) data to analyze and I want to try out partis.

2015-07-03T15:54:01+00:00

Jason Vander Heiden

changed status to resolved

I think this has what's needed for now.

2015-07-31T20:09:16+00:00

Comments (28)