Update required fields in CreateGermlines

Jason Vander Heiden

If this is for the REGIONS_PN log entry, we won't be able to make the N/P fields required by CreateGermlines, as we can't get them from IgBLAST (so requiring them would break CreateGermlines). I'm guessing the cleanest approach would be to generate the REGIONS log differently if these additional fields are found in the input.

2016-05-17T21:02:16+00:00

ssnn reporter

What about creating a --regions subcommand to add a REGIONS field to the final db file? NP1_LENGTH and NP2_LENGTH would be required fields. If N1_LENGTH, P3V_LENGTH, ... and the others are found (IMGT style), code REGIONS as VPNPDPNPJ; otherwise (IgBLAST style), use the VNDNJ schema.
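
A minimal sketch of the proposed encoding, not the CreateGermlines implementation. It assumes each db row is available as a dict of column name to value. Only NP1_LENGTH, NP2_LENGTH, N1_LENGTH and P3V_LENGTH are named in this comment; V_SEQ_LENGTH and the remaining IMGT-style columns (P5D_LENGTH, P3D_LENGTH, N2_LENGTH, P5J_LENGTH) are assumptions made for illustration, as is the function name `region_string`.

```python
# Hypothetical helper: expand segment lengths into one region code per nucleotide.
# Column names other than NP1_LENGTH, NP2_LENGTH, N1_LENGTH and P3V_LENGTH are assumed.
IMGT_FIELDS = ("P3V_LENGTH", "N1_LENGTH", "P5D_LENGTH",
               "P3D_LENGTH", "N2_LENGTH", "P5J_LENGTH")


def region_string(rec):
    """Return a string such as 'VVVNNDDNJJ' with one region code per position."""
    v = int(rec["V_SEQ_LENGTH"])
    d = int(rec["D_SEQ_LENGTH"])
    j = int(rec["J_SEQ_LENGTH"])

    # IMGT style: every separate N and P length is present and non-empty.
    if all(rec.get(k) not in (None, "") for k in IMGT_FIELDS):
        p3v, n1, p5d, p3d, n2, p5j = (int(rec[k]) for k in IMGT_FIELDS)
        parts = [("V", v), ("P", p3v), ("N", n1), ("P", p5d), ("D", d),
                 ("P", p3d), ("N", n2), ("P", p5j), ("J", j)]
    else:
        # IgBLAST style: only the combined N/P lengths are available.
        parts = [("V", v), ("N", int(rec["NP1_LENGTH"])), ("D", d),
                 ("N", int(rec["NP2_LENGTH"])), ("J", j)]

    return "".join(code * length for code, length in parts)
```

Falling back to the five-segment layout whenever any IMGT-style length is missing keeps IgBLAST input working, since only NP1_LENGTH and NP2_LENGTH are required in that branch.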

2016-05-18T02:00:20+00:00

Jason Vander Heiden

I'm not seeing the benefit. Presumably, the purpose would be so you can parse the 'VNDNJ' string to get the start/length of each region, but you need that info to create the string in the first place. Seems cleaner to just use the start/length fields directly in whatever application needs them. But maybe I'm missing something? Is there another use?

2016-05-18T02:10:29+00:00

Namita Gupta

Is there a use to having the start and end other than to make the string? I feel like I want the string to use for stuff more so than the positions. Or if anything, have both.

2016-05-18T02:15:06+00:00

ssnn reporter

I think the use is just to have something that can help visualize the different regions in the context of the sequence and the germline.

2016-05-18T02:20:08+00:00

Jason Vander Heiden

I thought this began from a desire to analyze the N/P sequences separately (length, amino acids, etc), so they needed the positions/lengths to pull out the nucleotides from the input sequence. It would also be useful for doing VH replacement footprint searches.

But if it's just for the sake of visualization, then I think having it in the log alone is sufficient. However, I don't see a downside to adding a regions option to the -g flag to make a GERMLINE_REGIONS field (or whatever we want to call it). I mean, it's another thing we'd need to maintain which seems to have limited use, but if we are already putting it in the log it's not really any more effort.

2016-05-18T02:29:26+00:00

Namita Gupta

I agree it may make more sense to add the regions flag to CreateGermlines. Right now I really do want to know what region each nt of the gapped sequence is in.

2016-05-18T02:35:06+00:00

ssnn reporter

Ok. The flag, then. If someone needs the info, use the flag, otherwise, don't clutter the db file.

2016-05-18T02:40:07+00:00

Jason Vander Heiden

@namita1025 I think something like:

v <- cumsum(c(312, df$NP1_LENGTH, df$D_SEQ_LENGTH, df$NP2_LENGTH, df$J_SEQ_LENGTH))
cut(s2c(df$SEQUENCE_IMGT[1]), breaks=v[1])

Would do that. (Syntax totally made up - I'm sure it needs fixing.)

2016-05-18T02:46:49+00:00

Namita Gupta

No, I think my whole issue is that I want to know which nucleotides in the CDR3 belong to the V, so hard-coding 312 doesn't solve my problem.

2016-05-18T14:10:05+00:00

Jason Vander Heiden

312 is the start of the CDR3, so you probably need 312 to (V_GERM_LENGTH - 312). I missed some bits in the syntax above, but all the info is already in the db files.
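
One way to read this, sketched in Python rather than R: label every position of the gapped sequence by cumulative segment lengths, then take the V part of the CDR3 as the slice from position 312 up to the V length. This assumes the V length is expressed in the same IMGT-gapped coordinates as the sequence and that 312 is a 0-based offset. The function names are hypothetical and this is not code from the toolkit.

```python
from bisect import bisect_right
from itertools import accumulate

CDR3_START = 312  # start of the CDR3 as stated above; assumed to be a 0-based offset
LABELS = ("V", "NP1", "D", "NP2", "J")


def label_positions(seq, v_len, np1_len, d_len, np2_len, j_len):
    """Return one region label per position of seq (None past the end of J)."""
    # Cumulative end coordinate of each segment, e.g. [318, 320, 323, 324, 330].
    ends = list(accumulate((v_len, np1_len, d_len, np2_len, j_len)))
    labels = []
    for pos in range(len(seq)):
        # Index of the first segment whose end lies beyond pos;
        # zero-length segments are skipped automatically.
        i = bisect_right(ends, pos)
        labels.append(LABELS[i] if i < len(LABELS) else None)
    return labels


def cdr3_v_nucleotides(seq, v_len):
    """Nucleotides of the CDR3 that are still part of the V segment."""
    return seq[CDR3_START:v_len]
```

With a V length of 318, for example, `cdr3_v_nucleotides` returns the six nucleotides at positions 312 through 317, which is the "V length minus 312" count.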

2016-05-18T15:54:39+00:00

Namita Gupta

I have checked these results by eye, and they seem to match up. Using the --cloned flag has interesting results: whichever sequence is selected to represent the clone, the N/P counts for that sequence are used to make the germline for the entire clone. This sequence does not always have the consensus N/P count for the clone, but is selected (if I recall correctly) for having the longest V and J sequences and the consensus V/J gene calls. I think this looks good enough to close the issue.

2016-06-15T18:22:51+00:00

Namita Gupta

changed status to resolved

Donezo

2016-07-08T20:46:55+00:00
