Add options to clonal consensus generation to create non-randomized consensus sequences

Issue #27 resolved
Jason Vander Heiden created an issue

Could use options for collapseClones() that generate:

  1. Non-randomized consensus sequence based on frequency at positions.
  2. Representative sequence containing all unique mutations.
  3. An approximate trunk sequence containing only mutations shared by all sequences.

Could be either an argument to collapseClones(), probably with non-randomized consensus as the default, or separate functions. Each type of clonal consensus sequence will probably need its own arguments (e.g., a frequency threshold for assigning an N, or a rule for picking among multiple unique mutations at the same position), so how unwieldy that gets will inform what the best approach is.

Comments (9)

  1. Jason Vander Heiden reporter

    Notes about method 1:

    I would make this entirely deterministic, and resolve ties via some simple documented rule, such as taking the first character. Try to avoid ties entirely by having a minimumFrequency argument that defines the minimum fraction required to assign a position in the consensus sequence. If no base exceeds the threshold, then assign an N. Set the default threshold at 0.6. See BuildConsensus in pRESTO.
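The rule described for method 1 can be sketched as follows. This is a minimal Python illustration, not the collapseClones() implementation: the function name `thresholded_consensus` and the parameter `min_freq` are hypothetical, the threshold is treated as inclusive (>=) so that a value of 1.0 keeps fully conserved positions, and "taking the first character" is read as taking the alphabetically first of the tied characters.

```python
from collections import Counter


def thresholded_consensus(seqs, min_freq=0.6):
    """Deterministic per-position consensus for equal-length sequences.

    Hypothetical helper: emits the most frequent character at each position
    if its fraction is at least min_freq, otherwise "N".
    """
    if not seqs:
        raise ValueError("need at least one sequence")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must all have the same length")

    consensus = []
    for column in zip(*seqs):
        counts = Counter(column)
        top = max(counts.values())
        # Assumed tie rule: among the most frequent characters,
        # take the one that sorts first alphabetically.
        best = min(ch for ch, n in counts.items() if n == top)
        # Assumption: the threshold is inclusive (>=).
        consensus.append(best if top / len(column) >= min_freq else "N")
    return "".join(consensus)


clone = ["ACGT", "ACGA", "ACTA", "AGTC"]
print(thresholded_consensus(clone))                # ACNN
print(thresholded_consensus(clone, min_freq=0.5))  # ACGA
print(thresholded_consensus(clone, min_freq=1.0))  # ANNN
```

With any threshold above 0.5, two characters can never both pass, so the tie rule only matters for lower thresholds such as the 0.5 case above.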

    Notes about method 2:

    By "unique mutations" I mean the union of all mutations in all sequences within a clone. I think the easiest way to do this is to generate a consensus using ambiguous characters. E.g., if a clone has sequences with A and T at position 10, then assign W to position 10. The downside is that this will require a custom method for counting mutations from these sequences in the TargetingModel and MutationProfiling functions. The usual method of counting mutations is to count W as either A or T, but here we'll want to count W as both A and T. That means we'll probably need to add flags specifying how to handle ambiguous characters to several functions.
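The ambiguous-character idea and the two counting conventions can be sketched as follows. This is a rough Python illustration under stated assumptions, not the package's code: `catch_all_consensus`, `count_mutations`, and the `ambiguous` flag are hypothetical names, input sequences are assumed to contain only A, C, G and T, and the germline is assumed to be unambiguous. The code table is the standard IUPAC nucleotide alphabet.

```python
# Standard IUPAC nucleotide codes, keyed by the sorted set of bases they denote.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}
EXPAND = {code: bases for bases, code in IUPAC.items()}


def catch_all_consensus(seqs):
    """Encode the union of bases seen at each position as one IUPAC character.

    Assumes equal-length sequences containing only A, C, G, T.
    """
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must all have the same length")
    return "".join(IUPAC["".join(sorted(set(col)))] for col in zip(*seqs))


def count_mutations(consensus, germline, ambiguous="both"):
    """Count mutations relative to an unambiguous germline.

    ambiguous="both":   every non-germline base encoded at a position counts.
    ambiguous="either": a position counts once, and only if the germline base
                        is not among the bases the character encodes.
    """
    if ambiguous not in ("both", "either"):
        raise ValueError("ambiguous must be 'both' or 'either'")
    total = 0
    for obs, ref in zip(consensus, germline):
        bases = set(EXPAND[obs])
        if ambiguous == "both":
            total += len(bases - {ref})
        else:
            total += 0 if ref in bases else 1
    return total


clone = ["ACGA", "ACGT", "ACCT"]
germline = "ACGA"
consensus = catch_all_consensus(clone)
print(consensus)                                       # ACSW
print(count_mutations(consensus, germline, "both"))    # 2
print(count_mutations(consensus, germline, "either"))  # 0
```

In this example the "either" convention sees no mutations at all, because S and W each include the germline base, while the "both" convention recovers the G>C and A>T mutations present in the clone.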

    Notes about method 3:

    This would be the same as setting minimumFrequency=1.0 in method 1, so we can skip this method entirely and just add something to the vignette and man page about it. Or we could add another method, but just use the same code as method 1. I'm in favor of skipping it. (Assuming I haven't missed some way this is fundamentally different.)

  2. Julian Zhou

    Method 3, as it is currently phrased - "[a]n approximate trunk sequence containing only mutations shared by all sequences", is indeed different from using Method 1 and setting minimumFrequency=1.0.

    In Method 1, the frequency in minimumFrequency is the frequency of nucleotides at each position across sequences. It has nothing to do with mutations. Setting minimumFrequency to 1 is equivalent to obtaining a consensus sequence in which any non-N position is 100% representative of that position across all input sequences (otherwise it would be an N).
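The distinction can be made concrete with a small Python sketch. Both function names are hypothetical, positions are 0-based, and the second function assumes a single germline sequence for the whole clone. The first function mimics a consensus at a threshold of 1.0; the second lists only the positions where every sequence carries the same non-germline base.

```python
def invariant_consensus(seqs):
    """Consensus at threshold 1.0: keep a base only where all sequences agree."""
    return "".join(
        col[0] if len(set(col)) == 1 else "N" for col in zip(*seqs)
    )


def shared_mutations(seqs, germline):
    """Return (position, base) pairs where all sequences agree on a base
    that differs from the germline. Assumes one germline for the clone."""
    return [
        (i, col[0])
        for i, (col, ref) in enumerate(zip(zip(*seqs), germline))
        if len(set(col)) == 1 and col[0] != ref
    ]


germline = "ACGT"
clone = ["ATGA", "ATGC", "ATGT"]
print(invariant_consensus(clone))         # ATGN
print(shared_mutations(clone, germline))  # [(1, 'T')]
```

Here the threshold-1.0 consensus retains positions 0, 1 and 2 because they are invariant, but only position 1 is a mutation shared by all sequences; positions 0 and 2 simply match the germline.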

    Method 3 seems ill-defined. There are at least 2 problems.

    (1) Unless the user performed clonal assignment, used --cloned during CreateGermlines.py, and created a single germline of consensus length for each clone, the germline sequences supplied could be different, which would render the concept of "shared mutation" meaningless.

    (2) Even if the user supplied a single consensus germline, what should the output clonal consensus contain at positions where no input sequence has a mutation? Ns? If so, many clonal consensus sequences are likely to have a lot of Ns, since not all clones will be highly mutated, and few positions may carry the same mutation across all sequences in the same clone.

  3. Jason Vander Heiden reporter

    Well, I think approximate is the more important word than mutations, but you're right. Method 3 is ill-defined.

    (1) This shouldn't matter because there would be no consideration of the germline sequence. It'd just be positions that are common to all sequences, regardless of whether they are the same or different from germline.

    (2) Yeah, there would be a lot of Ns. You'd just be subsetting the positions to those that are invariant.

    Honestly, I'm not seeing a good use case for Method 3. We should encourage people to go through the lineage and MRCA extraction workflow if they want such an analysis - that would be more meaningful.

  4. Julian Zhou

    Resolved via 3d3996f.

    Methods 1 and 2 were implemented as method="thresholdedFreq" and method="catchAll", respectively, in collapseClones(), calcClonalConsensus(), and calcClonalConsensusHelper().

    Per discussion with @javh, Method 3 was not implemented. Users are encouraged to "go through the lineage and MRCA extraction workflow if they want such an analysis".
