Add options to clonal consensus generation to create non-randomized consensus sequences

Jason Vander Heiden reporter

edited description

2016-02-04T22:08:39+00:00

Jason Vander Heiden reporter

marked as minor

2016-02-04T22:09:57+00:00

Jason Vander Heiden reporter

assigned issue to Julian Zhou

edited description

2016-07-30T21:00:53+00:00

Jason Vander Heiden reporter

edited description

2016-08-02T15:19:32+00:00

Jason Vander Heiden reporter

marked as major

2016-08-02T15:19:40+00:00

Jason Vander Heiden reporter

Notes about method 1:

I would make this entirely deterministic, and resolve ties via some simple documented rule, such as taking the first character. Try to avoid ties entirely by having a minimumFrequency argument that defines the minimum fraction required to assign a position in the consensus sequence. If no base exceeds the threshold, then assign an N. Set the default threshold at 0.6. See BuildConsensus in pRESTO.
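A minimal sketch of the thresholded, deterministic consensus described above, written in Python for illustration only; it is not the package's implementation. The function name `thresholded_consensus` and the `min_freq` parameter are hypothetical. It assumes equal-length sequences, treats the threshold as inclusive (frequency >= `min_freq`), and resolves ties by taking the alphabetically first character, which is just one possible "simple documented rule".

```python
from collections import Counter


def thresholded_consensus(seqs, min_freq=0.6):
    """Build a deterministic per-position consensus (hypothetical helper).

    Assumes all sequences have the same length. A position is assigned
    its most frequent character only if that character's frequency is
    at least min_freq; otherwise the position becomes "N".
    """
    consensus = []
    for column in zip(*seqs):
        counts = Counter(column)
        top = max(counts.values())
        # Documented tie rule (an assumption): alphabetically first character.
        best = min(ch for ch, n in counts.items() if n == top)
        consensus.append(best if top / len(column) >= min_freq else "N")
    return "".join(consensus)


clone = ["ACGT", "ACGA", "ACTA"]
print(thresholded_consensus(clone))                # default threshold 0.6
print(thresholded_consensus(clone, min_freq=1.0))  # only invariant positions survive
```

With a threshold above 0.5 a tie for the top character can never pass the threshold, so the tie rule only matters for lower thresholds.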

Notes about method 2:

By "unique mutations" I mean the union of all mutations in all sequences within a clone. I think the easiest way to do this is to generate a consensus using ambiguous characters. E.g., if a clone has sequences with A and T at position 10, then assign W to position 10. The downside is that this will require a custom method for counting mutations from these sequences in the TargetingModel and MutationProfiling functions. The usual method of counting mutations is to count W as either A or T, but instead we'll want to count W as both A and T. This means we'll probably need to add flags to several functions specifying how to handle ambiguous characters.
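A rough sketch of the ambiguous-character idea, again in Python and purely illustrative. The helpers `catch_all_consensus` and `union_mutations` are hypothetical names, not functions from the package. The sketch assumes equal-length sequences, ignores any non-ACGT characters in the input, and treats an N in the consensus as uninformative. It expands each IUPAC code so that W counts as both A and T when compared against the germline.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases they represent.
_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}
TO_CODE = {frozenset(bases): code for bases, code in _CODES.items()}
TO_BASES = {code: frozenset(bases) for bases, code in _CODES.items()}


def catch_all_consensus(seqs):
    """Encode every base observed at a position as one IUPAC character."""
    out = []
    for column in zip(*seqs):
        observed = frozenset(b for b in column if b in "ACGT")
        # Positions with no unambiguous base fall back to "N".
        out.append(TO_CODE.get(observed, "N"))
    return "".join(out)


def union_mutations(consensus, germline):
    """List (position, germline_base, observed_base) for every distinct
    mutation, counting an ambiguous code as all of its bases.
    Positions are 1-based."""
    mutations = []
    for pos, (c, g) in enumerate(zip(consensus, germline), start=1):
        if g not in "ACGT" or c == "N":
            continue  # assumption: skip uninformative positions
        for base in sorted(TO_BASES.get(c, frozenset()) - {g}):
            mutations.append((pos, g, base))
    return mutations


cons = catch_all_consensus(["ACGA", "ACGT"])
print(cons)
print(union_mutations(cons, "ACGC"))
```

Against a germline C at the last position, the W contributes two mutations (C to A and C to T), which is the "count W as both A and T" behavior rather than the usual "either" behavior.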

Notes about method 3:

This would be the same as setting minimumFrequency=1.0 in method 1, so we can skip this method entirely and just add something to the vignette and man page about it. Or we could add another method, but just use the same code as method 1. I'm in favor of skipping it. (Assuming I haven't missed some way this is fundamentally different.)

2016-08-04T16:05:17+00:00

Julian Zhou

Method 3, as it is currently phrased - "[a]n approximate trunk sequence containing only mutations shared by all sequences", is indeed different from using Method 1 and setting minimumFrequency=1.0.

In Method 1, the frequency in minimumFrequency is the frequency of nucleotides at each position across sequences. It has nothing to do with mutations. Setting minimumFrequency to 1 is equivalent to obtaining a consensus sequence in which any non-N position is 100% representative of that position across all input sequences (otherwise it would be an N).

Method 3 seems ill-defined. There are at least 2 problems.

(1) Unless the user performed clonal assignment, used --cloned during CreateGermlines.py, and created a single germline of consensus length for each clone, the germline sequences supplied could be different, which would render the concept of "shared mutation" meaningless.

(2) Even if the user supplied a single consensus germline, what should the output clonal consensus contain at positions where no input sequence carries a mutation? Ns? If so, many clonal consensus sequences are likely to have a lot of Ns, since not all clones are highly mutated, and few positions are likely to carry the same mutation across all sequences in the same clone.

2017-06-01T03:31:58+00:00

Jason Vander Heiden reporter

Well, I think approximate is the more important word than mutations, but you're right. Method 3 is ill-defined.

(1) This shouldn't matter because there would be no consideration of the germline sequence. It'd just be positions that are common to all sequences, regardless of whether they are the same or different from germline.

(2) Yeah, there would be a lot of Ns. You'd just be subsetting the positions to those that are invariant.
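To illustrate points (1) and (2), here is a small germline-free sketch in Python. The helper name `invariant_consensus` is hypothetical and the sketch assumes equal-length sequences. It keeps only positions that are identical across every sequence in the clone and masks the rest with N.

```python
def invariant_consensus(seqs):
    """Keep a character only where all sequences agree; otherwise emit "N".

    No germline is consulted, so a kept position may or may not be mutated.
    """
    return "".join(
        column[0] if len(set(column)) == 1 else "N"
        for column in zip(*seqs)
    )


clone = ["ACGT", "ACGA", "ACTA"]
masked = invariant_consensus(clone)
print(masked)
print(masked.count("N") / len(masked))  # fraction of masked positions
```

Even in this tiny example half of the positions are masked, which shows how quickly the Ns accumulate.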

Honestly, I'm not seeing a good use case for Method 3. We should encourage people to go through the lineage and MRCA extraction workflow if they want such an analysis - that would be more meaningful.

2017-06-01T15:09:16+00:00

Julian Zhou

changed status to resolved

Resolved via 3d3996f.

Methods 1 and 2 were implemented as method="thresholdedFreq" and method="catchAll", respectively, in collapseClones(), calcClonalConsensus(), and calcClonalConsensusHelper().

Per discussion with @javh, Method 3 was not implemented. Users are encouraged to "go through the lineage and MRCA extraction workflow if they want such an analysis".

2017-06-02T06:16:52+00:00
