- edited description
Euclidean clustering not working on a specific file
USE CASE: WHAT DO YOU WANT TO DO?
Cluster a dataset using Euclidean distance.
STEPS TO REPRODUCE AN ISSUE (OR TRIGGER A NEW FEATURE)
- Load the data file (attached).
- Select Cluster > Hierarchical.
- Select "Euclidean" as the similarity metric for rows (leave columns unchanged).
- Click "Cluster".
CURRENT BEHAVIOR
Clustering progress is shown but nothing happens afterwards (no CDT file generated and no clustergram visualized).
EXPECTED BEHAVIOR
The window with clustering options closes and the clustergram loads.
DEVELOPERS ONLY SECTION
SUGGESTED CHANGE (Pseudocode optional)
e.g. Add a color selection class
FILES AFFECTED (where the changes will be implemented) - developers only
e.g. selectColor.java & settingsPanel.java
LEVEL OF EFFORT - developers only
trivial/minor/medium/major/overhaul (choose one)
COMMENTS
Comments (13)
-
-
reporter - changed milestone to C - 03
-
repo owner -
assigned issue to
-
assigned issue to
-
repo owner - changed status to open
-
repo owner A log statement of the failure:
Initializing DistMatrixCalculator (1) java.lang.IndexOutOfBoundsException: Index: 7, Size: 7 - java.util.concurrent.FutureTask.report(FutureTask.java:122) - java.util.concurrent.FutureTask.get(FutureTask.java:192) - javax.swing.SwingWorker.get(SwingWorker.java:602) - Cluster.ClusterProcessor.clusterAxis(ClusterProcessor.java:133) - Controllers.ClusterDialogController$ClusterTask.calculateAxis(ClusterDialogController.java:585) - Controllers.ClusterDialogController$ClusterTask.doInBackground(ClusterDialogController.java:229) - Controllers.ClusterDialogController$ClusterTask.doInBackground(ClusterDialogController.java:1) - javax.swing.SwingWorker$1.call(SwingWorker.java:295) - java.util.concurrent.FutureTask.run(FutureTask.java:266) - javax.swing.SwingWorker.run(SwingWorker.java:334) - java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) - java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) - java.lang.Thread.run(Thread.java:745) DistTask is done: success. Done. Closing writer for dataset_bug_average_1.gtr Data reordering is done: success. Clustering was interrupted due to: java.lang.IndexOutOfBoundsException: Index: 7, Size: 7 Got reordered 0 labels for row Generated ATR file (empty): dataset_bug_average_1.atr Initializing DistMatrixCalculator (2) DistTask is done: success. Done. Closing writer for dataset_bug_average_1.atr Got reordered 520 labels for column Reordering complete. Post-cluster reordering valid? false Reordering of data is not valid. Cancelling. Done. Closing writer for dataset_bug_average_1.atr Data reordering is done: success. Post-cluster reordering valid? false Something occurred during reordering. ClusterTask is done: cancelled. Determined dir: \test_data\treeview_test_datasets_150408\dataset_bug\average Attempting delete of cluster files in \test_data\treeview_test_datasets_150408\dataset_bug\average Attempted delete of dataset_bug_average_1.atr dataset_bug_average_1.atr was successfully deleted. Attempted delete of dataset_bug_average_1.gtr dataset_bug_average_1.gtr was successfully deleted. Directory average still has 3 files. average could not be deleted.
-
- marked as critical
- edited description
-
Looks like it's specifically when rows are clustered using euclidean distance. If you do column only, it succeeds. I'll edit the description of the issue.
-
- edited description
-
This appears to be a difficult problem. I have traced the issue down to the
findCurrentMin(final double oldMin)
method inDistanceMatrix.java
. What's happening is:- This method searches through the distance matrix to see if it can find a new minimum distance value that is less than oldMin
- If it finds a new min value, it records where it found the value with 2 indexes:
min_row_index
andmin_col_index
- HOWEVER, on iteration 17 of a while loop in
ClusterProcessor.java
(which ends up reaching this code),oldMin
is less than everything in the distance matrix, thusmin_row_index
andmin_col_index
do not receive new indexes - The problem is that the old
min_row_index
(7) is larger than the size of the current distance matrix (max index of 6) - Perhaps notable (not sure): the last inner distance matrix array's first value is "
Infinity
" - perhaps that stopped the distance matrix from reaching its final size?
I do not know yet how to fix this problem. I found the problem by putting a breakpoint inside the while loop for iteration 17:
if(loopNum == 17) { LogBuffer.println("Pause here"); }
and then I stepped through line by line. I knew it was iteration 17 because of debug prints.
The distance matrix is a 2D matrix where each inner array grows in size. Perhaps the last one where the bug is shrunk?
@TreeView3Dev - Any thoughts? You're more familiar with this code than me.
-
OK, I see the problem. There are negative distances in the euclidean distance matrix for this data and the min value is initialized with a prositive double value:
min = Double.MIN_VALUE
. Apparently, euclidean distances are not expected to be negative. Going to investigate further. -
OK, I got it. The calculation of euclidean distance was dividing the sum of the squared differences (between pairs of values in every column) by the number of rows instead of the number of columns... subtracting that number of rows by the number of columns with missing data, resulting in negative values when there were more columns with missing values than there were rows.
There are 2 things of note here as well:
- Euclidean distance is supposed to be the square root of the sum of the squared differences. Presumably this is an attempt to normalize by the number of values compared.
- There are other methods for getting the "expected euclidean distance" which could yield more accurate distances given missing data.
But for now, I will just fix the code as it was intended to work.
-
- changed status to resolved
Resolved issue
#462. Clustering with euclidean distance now works correctly.Euclidean distance was being calculated by summing the squared differences between values in a pair of vectors. But it was being normalized by dividing by the number of rows in the data instead of the number of values in the rows. That value was getting the number of skipped columns subtracted from it, which can turn out to be larger than the number of rows, creating negative distance values and there-by, an exception. I just changed the number of rows to the number of columns and now works.
→ <<cset 7d8a72e769fd>>
-
- changed version to beta2
- Log in to comment