Euclidean clustering not working on a specific file

Issue #462 resolved
Anastasia Baryshnikova created an issue

USE CASE: WHAT DO YOU WANT TO DO?

Cluster a dataset using Euclidean distance.

STEPS TO REPRODUCE AN ISSUE (OR TRIGGER A NEW FEATURE)

  1. Load the data file (attached).
  2. Select Cluster > Hierarchical.
  3. Select "Euclidean" as the similarity metric for rows (leave columns unchanged).
  4. Click "Cluster".

CURRENT BEHAVIOR

Clustering progress is shown but nothing happens afterwards (no CDT file generated and no clustergram visualized).

EXPECTED BEHAVIOR

The window with clustering options closes and the clustergram loads.

DEVELOPERS ONLY SECTION

SUGGESTED CHANGE (Pseudocode optional)

e.g. Add a color selection class

FILES AFFECTED (where the changes will be implemented) - developers only

e.g. selectColor.java & settingsPanel.java

LEVEL OF EFFORT - developers only

trivial/minor/medium/major/overhaul (choose one)

COMMENTS

Comments (13)

  1. Christopher Keil repo owner

    A log statement of the failure:

    Initializing DistMatrixCalculator (1)
    java.lang.IndexOutOfBoundsException: Index: 7, Size: 7
     - java.util.concurrent.FutureTask.report(FutureTask.java:122)
     - java.util.concurrent.FutureTask.get(FutureTask.java:192)
     - javax.swing.SwingWorker.get(SwingWorker.java:602)
     - Cluster.ClusterProcessor.clusterAxis(ClusterProcessor.java:133)
     - Controllers.ClusterDialogController$ClusterTask.calculateAxis(ClusterDialogController.java:585)
     - Controllers.ClusterDialogController$ClusterTask.doInBackground(ClusterDialogController.java:229)
     - Controllers.ClusterDialogController$ClusterTask.doInBackground(ClusterDialogController.java:1)
     - javax.swing.SwingWorker$1.call(SwingWorker.java:295)
     - java.util.concurrent.FutureTask.run(FutureTask.java:266)
     - javax.swing.SwingWorker.run(SwingWorker.java:334)
     - java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     - java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
     - java.lang.Thread.run(Thread.java:745)
    DistTask is done: success.
    Done. Closing writer for dataset_bug_average_1.gtr
    Data reordering is done: success.
    Clustering was interrupted due to: java.lang.IndexOutOfBoundsException: Index: 7, Size: 7
    Got reordered 0 labels for row
    Generated ATR file (empty): dataset_bug_average_1.atr
    Initializing DistMatrixCalculator (2)
    DistTask is done: success.
    Done. Closing writer for dataset_bug_average_1.atr
    Got reordered 520 labels for column
    Reordering complete.
    Post-cluster reordering valid? false
    Reordering of data is not valid. Cancelling.
    Done. Closing writer for dataset_bug_average_1.atr
    Data reordering is done: success.
    Post-cluster reordering valid? false
    Something occurred during reordering.
    ClusterTask is done: cancelled.
    Determined dir: \test_data\treeview_test_datasets_150408\dataset_bug\average
    Attempting delete of cluster files in \test_data\treeview_test_datasets_150408\dataset_bug\average
    Attempted delete of dataset_bug_average_1.atr
    dataset_bug_average_1.atr was successfully deleted.
    Attempted delete of dataset_bug_average_1.gtr
    dataset_bug_average_1.gtr was successfully deleted.
    Directory average still has 3 files.
    average could not be deleted.
    
  2. Robert Leach

    Looks like it's specifically when rows are clustered using euclidean distance. If you do column only, it succeeds. I'll edit the description of the issue.

  3. Robert Leach

    This appears to be a difficult problem. I have traced the issue down to the findCurrentMin(final double oldMin) method in DistanceMatrix.java. What's happening is:

    • This method searches through the distance matrix to see if it can find a new minimum distance value that is less than oldMin
    • If it finds a new min value, it records where it found the value with 2 indexes: min_row_index and min_col_index
    • HOWEVER, on iteration 17 of a while loop in ClusterProcessor.java (which ends up reaching this code), oldMin is less than everything in the distance matrix, thus min_row_index and min_col_index do not receive new indexes
    • The problem is that the old min_row_index (7) is larger than the size of the current distance matrix (max index of 6)
    • Perhaps notable (not sure): the last inner distance matrix array's first value is "Infinity" - perhaps that stopped the distance matrix from reaching its final size?

    I do not know yet how to fix this problem. I found the problem by putting a breakpoint inside the while loop for iteration 17:

    if(loopNum == 17) {
        LogBuffer.println("Pause here");
    }
    

    and then I stepped through line by line. I knew it was iteration 17 because of debug prints.

    The distance matrix is a 2D matrix where each inner array grows in size. Perhaps the last one where the bug is shrunk?

    @TreeView3Dev - Any thoughts? You're more familiar with this code than me.

  4. Robert Leach

    OK, I see the problem. There are negative distances in the euclidean distance matrix for this data and the min value is initialized with a prositive double value: min = Double.MIN_VALUE. Apparently, euclidean distances are not expected to be negative. Going to investigate further.

  5. Robert Leach

    OK, I got it. The calculation of euclidean distance was dividing the sum of the squared differences (between pairs of values in every column) by the number of rows instead of the number of columns... subtracting that number of rows by the number of columns with missing data, resulting in negative values when there were more columns with missing values than there were rows.

    There are 2 things of note here as well:

    1. Euclidean distance is supposed to be the square root of the sum of the squared differences. Presumably this is an attempt to normalize by the number of values compared.
    2. There are other methods for getting the "expected euclidean distance" which could yield more accurate distances given missing data.

    But for now, I will just fix the code as it was intended to work.

  6. Robert Leach

    Resolved issue #462. Clustering with euclidean distance now works correctly.

    Euclidean distance was being calculated by summing the squared differences between values in a pair of vectors. But it was being normalized by dividing by the number of rows in the data instead of the number of values in the rows. That value was getting the number of skipped columns subtracted from it, which can turn out to be larger than the number of rows, creating negative distance values and there-by, an exception. I just changed the number of rows to the number of columns and now works.

    → <<cset 7d8a72e769fd>>

  7. Log in to comment