Adding DBSCAN to Freud

Issue #132 wontfix
Eric Harper created an issue

@amayank has a PR which is a wrapper to sklearn's DBSCAN. This is a very useful feature to have around, but as written I don't think it belongs in freud.

I would like to add DBSCAN to freud at the C++ level as it's a very useful method for clustering (and we already have a clustering section). It would be nice to be able to have the current clustering and DBSCAN available for a way to generate lists of particles to run analysis on, so I think it's a worthwhile goal. It'd also be nice to be able to have PBCs baked in instead of having to do buffer calls like in sklearn, which should be pretty easy given the wrap commands we already have

Comments (10)

  1. Carl Simon Adorf

    I never really understood why we have our own DBSCAN implementation when sklearn has that and so many other clustering routines perfectly well implemented. It's much easier to use the sklearn implementation since it has a uniform clustering API and different methods are more easily exchangeable.

  2. Eric Harper reporter
    1. We don't have our own DBSCAN implementation, we just have a disjoint set clustering (which, having briefly looked at it, we now also have from the ./extern/Eigen library, so I should have thought that through...)
    2. sklearn is great for processing on its own, but could be useful for processing within freud, for example to group points to perform another analysis on w/i freud...for example, create a list of clusters, then iterate through only those points to calculate rdf or something. I guess there's no reason you couldn't just pipe your points through sklearn's dbscan, take the result and pipe through individual instances of rdf, but that would/could be painful and not allow for lower-level C++ integration. I suppose demonstrating a use-case would be important to actually spending the time to add
    3. Mayank solves a problem that I had when I started playing with DBSCAN: PBCs. This would be even more efficient (potentially) in C++ as you might be able to avoid the buffer (again, not sure)
    4. We already have most of the pieces that it would take to write our own implementation, esp. since we're either adding HOOMD's kd-tree or have access to something similar from eigen (I would need to check, but it looks like maybe that's availble?) that Erin is already using for some things, maybe.
  3. Eric Harper reporter

    On the other hand, without clear-cut needs to add, and the fact that's one more thing to support, using sklearn might be the best way to go about this. I'm still reluctant to add this kind of feature into Freud directly, especially since: 1. I haven't checked how sklearn's license might affect Freud's; BSD is probably fine, but I don't know how it will impact what we plan with freud 2. Nothing else in freud is written like this; this makes sense as more of a stand-alone utility. The Voronoi module is kind of like this, but that was accepted as a temporary thing until we fix the qhull voronoi implementation.

    Overall, I would say this should just be spun off as its own repo so that others can use it, but it doesn't really feel like part of freud (since it's a sklearn wrapper and doesn't really use anything in freud).

    @csadorf @mspells @amayank @joaander

  4. Carl Simon Adorf

    I thought we had both DBSCAN and OPTICS in freud (maybe not master, but in a branch), that's where my confusion came from.

    Having a minimal example of a workflow where such an integration would lead to substantial benefit would probably be important for this discussion.

  5. Eric Harper reporter

    @csadorf I don't see OPTICS anywhere in any branch (again, I could be wrong, but nothing came up in searching via git or in the source), and AFAIK mayank's DBSCAN branch is the only thing with DBSCAN (cluster just uses a cutoff, which isn't DBSCAN).

    @amayank For now I'm just suggesting we create a repo (maybe even a meta-repo where people can store their nice utils and helper functions) for this kind of thing, since it's very useful but doesn't really fit in the design of Freud. I'll make sure to update the design document so that it reflects the nature of the code that should go into Freud.

  6. Mayank Agrawal

    I initially thought that freud is collection of useful set of analysis code where components may not be necessarily independent. But I see the point. I agree that this looks more like a utility function and doesn't really use anything in freud. A design document will surely be helpful.

    A utils directory in freud is also a good idea similar to what plato has, which can store all the orphan routines.

  7. Joshua Anderson

    If this is just a thin wrapper around an existing implementation, then I agree that it should be maintained outside freud. Freud is a loosely connected set of useful analysis code, yes - but sharing a common goal of fast C++ implementations of operations on particles in periodic boxes.

  8. Bradley Dice

    From the above discussion, it seems there is some consensus that this feature does not belong in freud. The topic can be revisited in the future if needed.

  9. Log in to comment