Clustering fails when the delimiter is not tab

Issue #489 on hold
Robert Leach created an issue

USE CASE: WHAT DO YOU WANT TO DO?

Cluster a file whose delimiter is not tab.

STEPS TO REPRODUCE AN ISSUE (OR TRIGGER A NEW FEATURE)

  1. Open small_133x133.txt in a text editor and replace all tab characters with semicolons, OR download https://bitbucket.org/TreeView3Dev/treeview3/downloads/test_14000x7.csv
  2. Open the semicolon-delimited small_133x133.txt file, select semicolon as the delimiter, and deselect tab, OR open the test_14000x7.csv file, select comma as the delimiter, and deselect tab
  3. Cluster the file

CURRENT BEHAVIOR

Note: a clustered file is created in the file system, and it is tab-delimited.

Immediately, this error appears:

clusterissue-jennasdata.png

Clicking OK results in the matrix disappearing (replaced with the welcome screen) and an open dialog appearing:

clusterissue-jennasdata2-after.png

Here is the log:

Checking if preferences exist for the new file.
Loading with info from existing node.
Found "" from row label types.
Found "ID" from row label types.
Resetting model.
Adding data to model...
Calculating mean.
Calculating median.
Truncating sorted data array.
Setting base values.
Done parsing for CDT-format.
No ATR file found for this CDT file.
No GTR file found for this CDT file.
Resetting MapContainers and DendroView components.
Returning default ColorSet at 0
Importing labels...
Importing color settings...
Found Existing node in MRU list for /Users/rleach/Downloads/Web/test_14000x7.csv
Creating subNode File1477081409055
Creating new fileset /Users/rleach/Downloads/Web/test_14000x7.csv
Restoring components states.
Successfully loaded: /Users/rleach/Downloads/Web/test_14000x7.csv
Setting pBar max: 14
Initializing DistMatrixCalculator.
Done./Users/rleach/Downloads/Web/test_14000x7/average/test_14000x7_average_3.atr
DistTask is done: success.
Done./Users/rleach/Downloads/Web/test_14000x7/average/test_14000x7_average_3.atr
ProcessorClusterTask is done: success.
ClusterTask is done: success.
Done./Users/rleach/Downloads/Web/test_14000x7/average/test_14000x7_average_3.cdt
The rows have not been clustered.
Success! The column tree file was found.
Getting preferences for transfer to clustered file.
Loading with info from existing node.
Resetting model.
SaveTask is done: success.
Alert: No numeric data could be found in the input file.
The input file must contain tab-delimited numeric values.

EXPECTED BEHAVIOR

I expect clustering to work. It worked for Anastasia, though she was probably running alpha03 rather than the current version of master.

DEVELOPERS ONLY SECTION

SUGGESTED CHANGE (Pseudocode optional)

I suspect the clustered file is being imported with the delimiters chosen for the original file (commas or semicolons), but the CDT file is written with tab as the delimiter, hence the error.

Either the import of a clustered file generated by treeview should always expect tabs, or, since we allow the user to select multiple delimiters, the clustered file should be written using (arbitrarily) the first selected delimiter and then parsed using all of the delimiters that were selected when the original file was imported.
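
A minimal Java sketch of the second option, in case it helps the discussion. The class and method names (DelimiterSketch, outputDelimiter, split, join) are hypothetical and do not come from the TreeView3 code base. The sketch assumes the selected delimiters are available as a list of strings, and it ignores quoted fields entirely.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of TreeView3.
final class DelimiterSketch {

    // Fall back to tab when the user selected nothing (an assumption).
    static List<String> effective(List<String> selected) {
        return selected.isEmpty() ? Collections.singletonList("\t") : selected;
    }

    // Delimiter used when writing the clustered file:
    // (arbitrarily) the first selected delimiter.
    static String outputDelimiter(List<String> selected) {
        return effective(selected).get(0);
    }

    // Split a line on ANY of the selected delimiters.
    static String[] split(String line, List<String> selected) {
        StringBuilder regex = new StringBuilder();
        for (String d : effective(selected)) {
            if (regex.length() > 0) {
                regex.append('|');
            }
            regex.append(Pattern.quote(d)); // treat the delimiter literally
        }
        // Limit -1 keeps trailing empty fields (missing values).
        return line.split(regex.toString(), -1);
    }

    // Join fields for writing with the output delimiter.
    static String join(String[] fields, List<String> selected) {
        return String.join(outputDelimiter(selected), fields);
    }

    public static void main(String[] args) {
        List<String> selected = Arrays.asList(";", ",");
        String written = join(new String[] {"ID", "1.0", "0.34"}, selected);
        System.out.println(written);
        System.out.println(split(written, selected).length);
    }
}
```

Because the file is written with one of the selected delimiters and read back with all of them, the written line always parses into the same number of fields, as long as no field itself contains a selected delimiter.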

FILES AFFECTED (where the changes will be implemented) - developers only

unknown

LEVEL OF EFFORT - developers only

minor

COMMENTS

Comments (8)

  1. Christopher Keil repo owner

    Okay, so this is a nasty one. Clustering itself works fine, although I did add better error logging and enabled a pop-up that reports what happened for any Exception thrown during distance matrix calculation.

    The problem is, once again: data loading.

    Consider this row:

    "1,3-beta-gluten...", 1.0, 0.0e, 0.34, 7.31e-01 ...

    We currently do not handle quoted strings correctly, which means that the row is split into:

    1 | 3-beta-gluten... | 1.0 | 0.0e | 0.34 | 7.31e-01 ...

    instead of:

    1,3-beta-gluten... | 1.0 | 0.0e | 0.34 | 7.31e-01 ...

    The fix for this issue is to implement recognition of quoted strings, which is related to #490.
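
For reference, quote-aware splitting can be sketched roughly as follows. This is not the TreeView3 loader: the class name QuotedSplitSketch is hypothetical, a single delimiter character is assumed, and doubled quotes inside a quoted field are treated as an escaped quote (a common CSV convention, assumed here). Fields are trimmed, which also removes leading and trailing whitespace inside quotes; a real loader may want to preserve it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of TreeView3.
final class QuotedSplitSketch {

    // Split a line on the delimiter, ignoring delimiters inside double quotes.
    static List<String> split(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote inside a quoted field
                    i++;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote, not kept
                }
            } else if (c == delimiter && !inQuotes) {
                fields.add(current.toString().trim()); // end of field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString().trim()); // last field
        return fields;
    }

    public static void main(String[] args) {
        // Label taken from the example row above; the quoted comma must not split it.
        String row = "\"1,3-beta-gluten...\", 1.0, 0.34";
        System.out.println(split(row, ','));
    }
}
```

With this approach the quoted label stays in one field, so the numeric columns line up with their headers again.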
