diversity calculation quality-aware ?

Issue #18 resolved
Former user created an issue

Proposing the idea of making the diversity calc quality-aware, in the sense that you can mask some minqual, or region specific (diversity before position X?)

I think at some point @javh implemented this for stitching and found it unhelpful, but in a multiple alignment I can imagine more utility.

Use case is: R2 quality is bad (particularly at terminus), but maskqual 20 would throw away bases that are 98.9% chance of correct, which in turn in bad for stitching and generally losing information. Relaxing diversity threshold allows sequences to pass BuildConsensus but sort of obviates the whole point of BuildConsensus.

Alternatively, a somewhat complementary approach is a diversity calculation flag allowing one to skip over the poorly aligned terminal region of the MSA.

So far these are not too common use cases because we tend to underload our sequencing runs, so this is just an idea....

Comments (15)

  1. Jason Vander Heiden

    Hrm. I've been meaning to swap the canonical diversity calculation for the (much faster) error calculation in EstimateError (mismatches from consensus). This might get you want you want in combination with the -q flag.

    It might also be worth running the sequences through FilterSeq-trimqual before running AssemblePairs.

    Let me ponder this for a bit.

  2. Jason Vander Heiden

    BuildConsensus will now ignore deleted positions (from --maxgap) when calculating the error. For example, cases like:

       INSEQ1> ----ATG
       INSEQ2> --GCATG
    CONSENSUS>     ATG
    

    Would be ERROR=0 (DIVERSITY=0) as it would only count mismatches in the final three positions (ATG).

    Also, positions that are Ns in the consensus are now ignored as well (as there is no mismatch from the consensus in these cases), for example:

       INSEQ1> AAAAATG
       INSEQ2> CCGCATG
    CONSENSUS> NNNNATG
    

    Would also be ERROR=0.

    I know this isn't exactly the request, but do either of these behaviors solve the problem? I'm thinking this addresses problems with both gap and low quality positions (with -q flag) without adding any parameters. If not, the changes I just made to do this should make a "quality masking" approach a lot easier to implement... So that's still a possibility.

  3. Former user Account Deleted

    Ok, so i understand you're moving from a mean pairwise diversity approach to a <distance-to-consensus> (ie "ERROR") approach , for the sake of speed and maybe with some side benefits. (But I don't think of BuildConsensus being that slow right now...)

    That first case makes sense, but the second case.... ("positions that are Ns in the consensus are now ignored as well") ... that behavior is less intuitive. If my goal is to filter things that disagree, I couldn't do it this way. So i'm thinking about how i would recode things. I guess you might suggests a max-N filter after consensus building? But the Ns mean different things. For example, i'm willing to accept Ns that are Ns in all sequences (maybe because of a gapped reference alignment) but if the consensus sequences has Ns due to high diversity , I DO want to filter those sequences... Question: would the diversity calculation in AlignSets remain the same , or remain as an option ?

  4. Jason Vander Heiden

    Yeah, so there is now some potential conflict between the -q, --freq and --maxerr parameters, or they all now act in concert (depending upon your perspective). It's sorta similar to the issue I had with the quality masking in AssemblePairs - by filtering out low confidence positions you get an artificially lower error rate. At some point, it gets hard to "tune by judgement" with increasing parameters...

    You could certainly over assign Ns (high -q and --freq), and end up with a consensus that is all Ns and passes the error filter, but then filter these out with FilterSeq-missing. Or, you could have stringent N assignment (low -q and --freq), and rely more on the error calculation to remove bad consensus sequences. Kinda depends on whether you want partially informative or fully informative consensus sequences.

    My preference would be to leave the diversity calculation as is in AlignSets, and perhaps recode it in C at some point to speed it up, as it's not really that offensive for a multiple alignment to be slow. You'd get two metrics then, one from AlignSets and one from BuildConsensus.

    Though, I'm also not opposed to putting the diversity calculation back into BuildConsensus if the error rate seems like a bad idea. I fixed a few internals in the code, so it should be easy to put diversity back in if we want.

    Thoughts?

  5. Jason Vander Heiden

    Had another thought last night... Why don't I just put the diversity calculation back in and make --maxdiv and --maxerr mutually exclusive arguments? Provides backwards compatibility too, which is good.

    Whatcha think?

  6. Former user Account Deleted

    i think that's pretty awesome.

    i haven't fully digested the consequences of switching to maxerr (and that example you gave kind of scared me !) so ... yeah.

    thanks for asking :)

  7. Jason Vander Heiden

    Huh. That's kinda cool. Looks like it's species diversity though, not nucleotide diversity, so it won't solve my problem in this case...

  8. Jason Vander Heiden

    Okay, --maxdiv is back in. What say you on this issue? Close, leave open, think some more?

  9. Former user Account Deleted

    hmmmmmmmm does BuildConsensus use quality info (where available) to generate the consensus ?

  10. Jason Vander Heiden

    Yeah, it does. So as it stands now:

    1. Quality scores are considered during consensus generation.
    2. --maxerr calculation considers the UID read group error after accounting for consensus quality, -q, --freq and -maxgap.
    3. --maxdiv calculation considers UID read group diversity before any factors effecting consensus generation.
  11. Log in to comment