diversity calculation quality-aware ?

Issue #18 resolved

Former user created an issue 2015-03-31

Proposing the idea of making the diversity calc quality-aware, in the sense that you can mask some minqual, or region specific (diversity before position X?)

I think at some point @javh implemented this for stitching and found it unhelpful, but in a multiple alignment I can imagine more utility.

Use case is: R2 quality is bad (particularly at terminus), but maskqual 20 would throw away bases that are 98.9% chance of correct, which in turn in bad for stitching and generally losing information. Relaxing diversity threshold allows sequences to pass BuildConsensus but sort of obviates the whole point of BuildConsensus.

Alternatively, a somewhat complementary approach is a diversity calculation flag allowing one to skip over the poorly aligned terminal region of the MSA.

So far these are not too common use cases because we tend to underload our sequencing runs, so this is just an idea....

Comments (15)

Jason Vander Heiden
Hrm. I've been meaning to swap the canonical diversity calculation for the (much faster) error calculation in EstimateError (mismatches from consensus). This might get you want you want in combination with the -q flag.

It might also be worth running the sequences through FilterSeq-trimqual before running AssemblePairs.

Let me ponder this for a bit.
- 2015-03-31T16:24:01+00:00
Jason Vander Heiden
BuildConsensus will now ignore deleted positions (from --maxgap) when calculating the error. For example, cases like:
```
   INSEQ1> ----ATG
   INSEQ2> --GCATG
CONSENSUS>     ATG
```
Would be ERROR=0 (DIVERSITY=0) as it would only count mismatches in the final three positions (ATG).

Also, positions that are Ns in the consensus are now ignored as well (as there is no mismatch from the consensus in these cases), for example:
```
   INSEQ1> AAAAATG
   INSEQ2> CCGCATG
CONSENSUS> NNNNATG
```
Would also be ERROR=0.

I know this isn't exactly the request, but do either of these behaviors solve the problem? I'm thinking this addresses problems with both gap and low quality positions (with -q flag) without adding any parameters. If not, the changes I just made to do this should make a "quality masking" approach a lot easier to implement... So that's still a possibility.
- 2015-04-28T00:26:18+00:00
Jason Vander Heiden
- assigned issue to
  
  Jason Vander Heiden
- 2015-04-28T19:41:54+00:00
Former user Account Deleted
Ok, so i understand you're moving from a mean pairwise diversity approach to a <distance-to-consensus> (ie "ERROR") approach , for the sake of speed and maybe with some side benefits. (But I don't think of BuildConsensus being that slow right now...)

That first case makes sense, but the second case.... ("positions that are Ns in the consensus are now ignored as well") ... that behavior is less intuitive. If my goal is to filter things that disagree, I couldn't do it this way. So i'm thinking about how i would recode things. I guess you might suggests a max-N filter after consensus building? But the Ns mean different things. For example, i'm willing to accept Ns that are Ns in all sequences (maybe because of a gapped reference alignment) but if the consensus sequences has Ns due to high diversity , I DO want to filter those sequences... Question: would the diversity calculation in AlignSets remain the same , or remain as an option ?
- 2015-04-28T20:45:06+00:00
Jason Vander Heiden
Yeah, so there is now some potential conflict between the -q, --freq and --maxerr parameters, or they all now act in concert (depending upon your perspective). It's sorta similar to the issue I had with the quality masking in AssemblePairs - by filtering out low confidence positions you get an artificially lower error rate. At some point, it gets hard to "tune by judgement" with increasing parameters...

You could certainly over assign Ns (high -q and --freq), and end up with a consensus that is all Ns and passes the error filter, but then filter these out with FilterSeq-missing. Or, you could have stringent N assignment (low -q and --freq), and rely more on the error calculation to remove bad consensus sequences. Kinda depends on whether you want partially informative or fully informative consensus sequences.

My preference would be to leave the diversity calculation as is in AlignSets, and perhaps recode it in C at some point to speed it up, as it's not really that offensive for a multiple alignment to be slow. You'd get two metrics then, one from AlignSets and one from BuildConsensus.

Though, I'm also not opposed to putting the diversity calculation back into BuildConsensus if the error rate seems like a bad idea. I fixed a few internals in the code, so it should be easy to put diversity back in if we want.

Thoughts?
- 2015-04-28T21:27:02+00:00
Jason Vander Heiden
Had another thought last night... Why don't I just put the diversity calculation back in and make --maxdiv and --maxerr mutually exclusive arguments? Provides backwards compatibility too, which is good.

Whatcha think?
- 2015-04-29T18:13:37+00:00
Former user Account Deleted
i think that's pretty awesome.

i haven't fully digested the consequences of switching to maxerr (and that example you gave kind of scared me !) so ... yeah.

thanks for asking :)
- 2015-04-29T18:25:07+00:00
Former user Account Deleted
i don't want to be an enabler of your unhealthy addiction to speed, BUT, this might be relevant:

http://www.drive5.com/usearch/manual/cmd_fasta_diversity.html
- 2015-04-29T20:00:08+00:00
Jason Vander Heiden
Huh. That's kinda cool. Looks like it's species diversity though, not nucleotide diversity, so it won't solve my problem in this case...
- 2015-04-29T20:09:15+00:00
Former user Account Deleted
well that's embarassing. i should have had less caffeine today.
- 2015-04-29T20:15:26+00:00
Jason Vander Heiden
But how will you science without science fuel?
- 2015-04-29T20:31:03+00:00
Jason Vander Heiden
Okay, --maxdiv is back in. What say you on this issue? Close, leave open, think some more?
- 2015-04-30T01:08:39+00:00
Former user Account Deleted
hmmmmmmmm does BuildConsensus use quality info (where available) to generate the consensus ?
- 2015-05-01T15:10:07+00:00
Jason Vander Heiden
Yeah, it does. So as it stands now:
1. Quality scores are considered during consensus generation.
2. --maxerr calculation considers the UID read group error after accounting for consensus quality, -q, --freq and -maxgap.
3. --maxdiv calculation considers UID read group diversity before any factors effecting consensus generation.
- 2015-05-01T16:50:21+00:00
Jason Vander Heiden
- changed status to resolved
Assuming no news is good news. Let me know if we need to revisit this and reopen the request.
- 2015-07-31T17:33:14+00:00
Log in to comment

Assignee: Jason Vander Heiden

Type: enhancement

Priority: trivial

Status: resolved

Votes: 0

Watchers: 1