create shazam::indexByUnion

Issue #101 resolved
ssnn created an issue

@javh @nimanouri I think distToNearest first=TRUE groups the same way as DefineClones --act first, but distToNearest first=FALSE does not behave like DefineClones --act set. It would be useful to have a shazam::indexByUnion function that we could use wherever gene calls are used to group. This may be relevant in the case of distToNearest, which is used to find the threshold for DefineClones.
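To make the difference concrete, here is a minimal Python sketch of the two grouping strategies. It is not the changeo or shazam implementation: the function names `group_by_first` and `group_by_union` are hypothetical, and each record is reduced to a list of ambiguous V gene calls (any additional grouping keys such as J gene or junction length are left out). "First" grouping keys on the first call only, while "union" grouping merges every record that shares at least one call with another, transitively.

```python
from collections import defaultdict


def group_by_first(records):
    """Group record indices by the first gene call only."""
    groups = defaultdict(list)
    for i, calls in enumerate(records):
        groups[calls[0]].append(i)
    return sorted(groups.values())


def group_by_union(records):
    """Group record indices so that records sharing any gene call
    (directly or through a chain of other records) end up together."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Link all gene calls that co-occur within a single record.
    for calls in records:
        find(calls[0])
        for gene in calls[1:]:
            union(calls[0], gene)

    # Collect records by the representative of their linked gene set.
    groups = defaultdict(list)
    for i, calls in enumerate(records):
        groups[find(calls[0])].append(i)
    return sorted(groups.values())


# Hypothetical example data: each entry is the list of V calls for one sequence.
records = [
    ["IGHV1-2", "IGHV1-3"],  # ambiguous call
    ["IGHV1-3"],
    ["IGHV1-2"],
    ["IGHV4-34"],
]

first_groups = group_by_first(records)
union_groups = group_by_union(records)
```

With this toy input, the first-call grouping separates record 1 from records 0 and 2, whereas the union grouping places records 0, 1, and 2 in a single group because the ambiguous call in record 0 links the two genes.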

Comments (14)

  1. Jason Vander Heiden

    This would be another case of replicating python code in R. Alternatively, we could move distToNearest into changeo and reuse the existing code from DefineClones.

    Maybe we should see how much this matters? Would something analogous to the --act set actually change the distance-to-nearest distribution? I dunno. Not sure how we could test that without implementing the indexByUnion function anyway.

  2. nima nouri

    The grouping methods were always suspicious... I agree we should bring everything into one code base: either R or Python. What we have now is an apples-and-oranges situation.

  3. ssnn reporter

    I agree we should avoid replication, and my vote goes to moving distToNearest to changeo. Actually, it would be great if this step were done automagically inside DefineClones. But in the short term, the easiest option seems to be an R version of indexByUnion.

  4. Jason Vander Heiden

    It'd be a big task to move distToNearest into changeo, but @ruoyijiangyale has already done a decent chunk of the work.

    It would still be a lot of work, but probably worth it in the long run.

  5. Roy Jiang

    I have no biases either way. In fact I am actually slightly biased towards having all of changeo into R (except maybe MakeDb) rather than the other way around. python is not great for databases in general and groupby is as speedy as a python dict.

  6. ssnn reporter

    I am very biased toward R. Say no more. Everything to R! Ok, yes, I understand this is unrealistic. For me, right now, the only annoying thing is having to use distToNearest in R to find a threshold for DefineClones in python. Would be nice to have a --dist auto for DefineClones that calls internally a python distToNearest and uses @nimanouri 's method to find the threshold.

  7. Jason Vander Heiden

    Yeah, it should really all be possible in one step. The thing that annoys me is having to maintain the exact same algorithm and models across two code bases. It's really easy to make a small error and end up with a mutation model that's different between shazam and changeo.

    The problem with R is memory. Everything has to be loaded into memory. And it's finicky about wrapping external applications.

  8. ssnn reporter

    Confessions time. As I always end up having to load db's to R, I just make intensive use of system2 to run changeo from my beautiful markdown (now I am using bookdown) files. I will ask Santa Claus to bring me an R package that wraps changeo.

  9. Jason Vander Heiden

    Heh. That ain't happenin'. If you really want to go that route, you could probably use something like reticulate to call changeo functions. Everything in changeo goes through a "main" function, so you don't actually need to use the commandline. You can just import and call the main function.

  10. Jason Vander Heiden

    Seeing as I'm spring cleaning issues, what's the consensus here?

    I'm inclined to skip it.

  11. nima nouri

    I have not worked on this yet. I will investigate it as soon as my deadlines are over. We have new cloning methods in the pipeline that we need to discuss, and we need to decide what we are going to do with them.
