diversity calculation quality-aware ?
Proposing the idea of making the diversity calc quality-aware, in the sense that you can mask some minqual, or region specific (diversity before position X?)
I think at some point @javh implemented this for stitching and found it unhelpful, but in a multiple alignment I can imagine more utility.
Use case is: R2 quality is bad (particularly at terminus), but maskqual 20 would throw away bases that are 98.9% chance of correct, which in turn in bad for stitching and generally losing information. Relaxing diversity threshold allows sequences to pass BuildConsensus but sort of obviates the whole point of BuildConsensus.
Alternatively, a somewhat complementary approach is a diversity calculation flag allowing one to skip over the poorly aligned terminal region of the MSA.
So far these are not too common use cases because we tend to underload our sequencing runs, so this is just an idea....
Comments (15)
-
-
BuildConsensus will now ignore deleted positions (from
--maxgap
) when calculating the error. For example, cases like:INSEQ1> ----ATG INSEQ2> --GCATG CONSENSUS> ATG
Would be ERROR=0 (DIVERSITY=0) as it would only count mismatches in the final three positions (ATG).
Also, positions that are Ns in the consensus are now ignored as well (as there is no mismatch from the consensus in these cases), for example:
INSEQ1> AAAAATG INSEQ2> CCGCATG CONSENSUS> NNNNATG
Would also be ERROR=0.
I know this isn't exactly the request, but do either of these behaviors solve the problem? I'm thinking this addresses problems with both gap and low quality positions (with -q flag) without adding any parameters. If not, the changes I just made to do this should make a "quality masking" approach a lot easier to implement... So that's still a possibility.
-
-
assigned issue to
-
assigned issue to
-
Account Deleted Ok, so i understand you're moving from a mean pairwise diversity approach to a <distance-to-consensus> (ie "ERROR") approach , for the sake of speed and maybe with some side benefits. (But I don't think of BuildConsensus being that slow right now...)
That first case makes sense, but the second case.... ("positions that are Ns in the consensus are now ignored as well") ... that behavior is less intuitive. If my goal is to filter things that disagree, I couldn't do it this way. So i'm thinking about how i would recode things. I guess you might suggests a max-N filter after consensus building? But the Ns mean different things. For example, i'm willing to accept Ns that are Ns in all sequences (maybe because of a gapped reference alignment) but if the consensus sequences has Ns due to high diversity , I DO want to filter those sequences... Question: would the diversity calculation in AlignSets remain the same , or remain as an option ?
-
Yeah, so there is now some potential conflict between the
-q
,--freq
and--maxerr
parameters, or they all now act in concert (depending upon your perspective). It's sorta similar to the issue I had with the quality masking in AssemblePairs - by filtering out low confidence positions you get an artificially lower error rate. At some point, it gets hard to "tune by judgement" with increasing parameters...You could certainly over assign Ns (high
-q
and--freq
), and end up with a consensus that is all Ns and passes the error filter, but then filter these out with FilterSeq-missing. Or, you could have stringent N assignment (low-q
and--freq
), and rely more on the error calculation to remove bad consensus sequences. Kinda depends on whether you want partially informative or fully informative consensus sequences.My preference would be to leave the diversity calculation as is in AlignSets, and perhaps recode it in C at some point to speed it up, as it's not really that offensive for a multiple alignment to be slow. You'd get two metrics then, one from AlignSets and one from BuildConsensus.
Though, I'm also not opposed to putting the diversity calculation back into BuildConsensus if the error rate seems like a bad idea. I fixed a few internals in the code, so it should be easy to put diversity back in if we want.
Thoughts?
-
Had another thought last night... Why don't I just put the diversity calculation back in and make
--maxdiv
and--maxerr
mutually exclusive arguments? Provides backwards compatibility too, which is good.Whatcha think?
-
Account Deleted i think that's pretty awesome.
i haven't fully digested the consequences of switching to maxerr (and that example you gave kind of scared me !) so ... yeah.
thanks for asking :)
-
Account Deleted i don't want to be an enabler of your unhealthy addiction to speed, BUT, this might be relevant:
http://www.drive5.com/usearch/manual/cmd_fasta_diversity.html
-
Huh. That's kinda cool. Looks like it's species diversity though, not nucleotide diversity, so it won't solve my problem in this case...
-
Account Deleted well that's embarassing. i should have had less caffeine today.
-
But how will you science without science fuel?
-
Okay,
--maxdiv
is back in. What say you on this issue? Close, leave open, think some more? -
Account Deleted hmmmmmmmm does BuildConsensus use quality info (where available) to generate the consensus ?
-
Yeah, it does. So as it stands now:
- Quality scores are considered during consensus generation.
--maxerr
calculation considers the UID read group error after accounting for consensus quality,-q
,--freq
and-maxgap
.--maxdiv
calculation considers UID read group diversity before any factors effecting consensus generation.
-
- changed status to resolved
Assuming no news is good news. Let me know if we need to revisit this and reopen the request.
- Log in to comment
Hrm. I've been meaning to swap the canonical diversity calculation for the (much faster) error calculation in EstimateError (mismatches from consensus). This might get you want you want in combination with the -q flag.
It might also be worth running the sequences through FilterSeq-trimqual before running AssemblePairs.
Let me ponder this for a bit.