density method of findThreshold might need a bandwidth adjustment

Issue #107 resolved
Jason Vander Heiden created an issue

findThreshold with method="density" on the webinar example (HD13M) doesn't look right. Bandwidth problem?

Comments (12)

  1. nima nouri

    I have played with the parameters I was allowed to tune. As we discussed before, the main issue is the 'bandwidth' argument of the KernSmooth::bkde function, which is calculated automatically/internally by the kedd::h.ucv function. Tuning the arguments really doesn't change it. And if we change them, the "density" method will lose its generality.

  2. nima nouri

    True... the bandwidth calculated by kedd::h.ucv is smaller (0.01) than it should be (e.g. 0.025)

  3. Jason Vander Heiden reporter

    Did you tune the kedd::h.ucv parameters and/or try alternative implementations like stats::bandwidth?

  4. nima nouri

    Yes, I tuned the parameters... It doesn't change. Most of the parameters are calculated internally and are therefore fixed. I am not sure what stats::bandwidth does, but the bandwidth needs to be calculated with the 4th derivative of the kernel density, which kedd::h.ucv does. A skim over the stats::bandwidth function doesn't say anything about it.

  5. Jason Vander Heiden reporter

    stats::bandwidth was just an example. I kind of suspect that one has already been tried in the past. Sounds like we need an alternative approach to bandwidth selection.

  6. Jason Vander Heiden reporter

    Okay @nimanouri, it looks like part of the problem is that the least-squares cross-validation approach to bandwidth detection isn't intended to work on data with ties (duplicate values). More details: http://www.ism.ac.jp/editsec/aism/pdf/060_1_0021.pdf
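
    A rough illustration of why ties push least-squares cross-validation toward bandwidths that are too small. This is only a sketch under simplifying assumptions: it scores an ordinary Gaussian density estimate (derivative order 0) rather than the 4th-derivative criterion used by kedd::h.ucv, and both the helper lscv_score and the distance values are made up for the example. Lower scores are preferred by the selector.

```r
# Gaussian-kernel LSCV score for an ordinary density estimate (order 0).
# Simplified stand-in; NOT the criterion that kedd::h.ucv evaluates.
lscv_score <- function(x, h) {
  n <- length(x)
  d <- outer(x, x, "-")  # all pairwise differences, including i == j
  # Integral of the squared estimate: Gaussian with sd = sqrt(2) * h over all pairs
  integral_sq <- sum(dnorm(d, sd = sqrt(2) * h)) / n^2
  # Leave-one-out term: remove the i == j diagonal before averaging
  off_diag <- sum(dnorm(d, sd = h)) - n * dnorm(0, sd = h)
  integral_sq - 2 * off_diag / (n * (n - 1))
}

# Hypothetical distances on a coarse grid, so many values are tied
grid <- (1:10) * 0.05
distances <- rep(grid, times = c(2, 5, 9, 12, 9, 5, 3, 6, 8, 4))

h_values <- c(0.01, 0.001)
score_ties <- sapply(h_values, function(h) lscv_score(distances, h))
score_unique <- sapply(h_values, function(h) lscv_score(unique(distances), h))

print(rbind(h = h_values, ties = score_ties, unique = score_unique))
```

    With the tied data the score keeps dropping as h shrinks, so a minimizer is drawn toward zero; on the unique values the score rises instead, which is the behavior the selector expects.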

    I initially swapped the package used for bandwidth detection from kedd::h.ucv to ks::hucv because the ks package has a lot more parameters and alternative methods for bandwidth selection, but I just went back to the existing packages for now with some minor tweaks:

    Right now, the only changes I made are from this:

    bandwidth <- kedd::h.ucv(distances, 4)$h
    dens <- KernSmooth::bkde(distances, bandwidth=bandwidth)
    

    To this:

    bandwidth <- kedd::h.ucv(unique(distances), 4)$h
    dens <- KernSmooth::bkde(distances, bandwidth=bandwidth, canonical=TRUE)
    

    As an added benefit, this is also much faster.

    We can probably swap entirely over to the ks package because there are a lot more ways we could tune this with ks, but let's test the changes with kedd and KernSmooth for now. If we do swap to ks, the following should be the same as the current changes:

    bandwidth <- ks::hucv(unique(distances), deriv.order=4)
    gaussian_scaling <- (1/(4 * pi))^(1/10)
    dens <- ks::kde(distances, h=bandwidth*gaussian_scaling, binned=TRUE)
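
    For reference, a small check of where the (1/(4 * pi))^(1/10) constant comes from. The sketch assumes that canonical=TRUE in KernSmooth::bkde corresponds to the usual canonical-kernel rescaling factor, delta_0 = (R(K) / mu_2(K)^2)^(1/5); that correspondence is an assumption here, not something verified against the package source. The variable names are illustrative.

```r
# Canonical-kernel rescaling factor delta_0 = (R(K) / mu_2(K)^2)^(1/5),
# evaluated numerically for the standard Gaussian kernel K = dnorm.
roughness <- integrate(function(u) dnorm(u)^2, -Inf, Inf)$value  # R(K): integral of K^2
second_moment <- integrate(function(u) u^2 * dnorm(u), -Inf, Inf)$value  # mu_2(K)
delta0 <- (roughness / second_moment^2)^(1/5)

# Closed form for the Gaussian kernel
gaussian_scaling <- (1 / (4 * pi))^(1/10)

print(c(numeric = delta0, closed_form = gaussian_scaling))
```

    Both values should agree to within the tolerance of the numerical integration, at roughly 0.776.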
    

    I'm passing it back to you to test the old density method against the changes to the density approach and the gmm method. I'll email the plots separately.

  7. Jason Vander Heiden reporter

    Oh, I did experiment with some of the other methods in ks for bandwidth selection, but they yielded similar (or the same) values, so I didn't end up swapping. I.e., ks::hscv and ks::hpi. I didn't adjust parameters other than deriv.order=4, though.
