Ensure CDT file format consistency
USE CASE: WHAT DO YOU WANT TO DO?
Make sure that we maintain the correct data format and compatibility with other software (as much as possible).
STEPS TO REPRODUCE AN ISSUE (OR TRIGGER A NEW FEATURE)
Following our discussion about CDTImporter and its inability to import CDT files from Treeview 3.0, I ran a few tests.
I created 2 test files (see image below) and clustered them with both Cluster 3.0 and Treeview 3.0 (alpha 3). Test1 contains only data; Test2 contains data and also specifies EWEIGHT and GWEIGHT.
CURRENT BEHAVIOR
Unlike Treeview 3.0 (alpha 3), Cluster 3.0:
- needs the 1st cell to be non-empty
- adds GWEIGHT and EWEIGHT info in both cases (good)
- adds a column "NAME" by duplicating the original row names (bad - it is not necessary)
- adds 6 decimal places to the data and the GWEIGHT/EWEIGHT (bad but not critical)
Unlike Cluster 3.0, Treeview 3.0 (alpha 3):
- is ok with the 1st cell being empty
- does not add GWEIGHT and EWEIGHT unless they were present in the input file (bad)
- does not add "NAME" (good)
- adds 1 decimal place to the data but leaves GWEIGHT/EWEIGHT intact (bad but not critical)
Both programs:
- add the row dendrogram (GID) before the row labels, but the column dendrogram (AID) after the column labels (bad because inconsistent).
EXPECTED BEHAVIOR
Treeview 3.0 should:
- if the 1st cell is empty, add an X to it
- add GWEIGHT and EWEIGHT to clustered files, even if they did not have them to begin with
- if possible, do not change the number format (if it was an integer, do not add decimal places)
- change GID and AID to "NAME"
- keep the row/column order consistent (see attached image)
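To make the first and third points concrete, here is a minimal sketch in Python (not Treeview code). The helper names `fill_first_cell` and `format_like` are hypothetical, and the sketch assumes the table is held as a list of rows of strings and that values use plain decimal notation (no scientific notation).

```python
def fill_first_cell(rows, placeholder="X"):
    """Return a copy of the table whose top-left cell is never empty."""
    fixed = [list(row) for row in rows]
    if fixed and fixed[0] and not fixed[0][0].strip():
        fixed[0][0] = placeholder
    return fixed


def format_like(original, value):
    """Format `value` with as many decimal places as the input text had.

    Assumption: `original` is plain decimal text such as "3" or "2.50".
    """
    _, _, fraction = original.strip().partition(".")
    return f"{value:.{len(fraction)}f}"


table = [["", "A", "B"], ["gene1", "3", "2.50"]]
table = fill_first_cell(table)
print(table[0][0])                # X
print(format_like("3", 3.0))      # 3
print(format_like("2.50", 2.5))   # 2.50
```

For comparison, a fixed `%.6f`-style format would turn the integer 3 into 3.000000, which is the Cluster 3.0 behavior described above.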
Note:
Here's the original CDT file specification (from http://tldrify.com/kku). The proposed changes are consistent with this definition.
A generalized CDT file is a tab-delimited text file with the following specifications. The leftmost column and topmost row are reserved for headers. The file must contain at least two columns followed by a column with the header GWEIGHT, and at least one row followed by a row with the header EWEIGHT. Any rows and columns before the EWEIGHT and GWEIGHT are treated as annotation, and any after are treated as data.
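The definition above can be read as a small parsing rule. The following Python sketch is one possible interpretation, not code from Cluster or Treeview; the function name `split_cdt` and the sample table are made up for illustration.

```python
def split_cdt(text):
    """Locate GWEIGHT/EWEIGHT in a generalized CDT table and return the data.

    Returns (gweight_col, eweight_row, data), where data holds the cells
    after the GWEIGHT column and below the EWEIGHT row, as strings.
    """
    rows = [line.split("\t") for line in text.splitlines() if line]
    if not rows:
        raise ValueError("empty file")
    header = rows[0]
    first_column = [row[0] for row in rows]
    if "GWEIGHT" not in header:
        raise ValueError("no GWEIGHT column in the header row")
    if "EWEIGHT" not in first_column:
        raise ValueError("no EWEIGHT row in the leftmost column")
    gweight_col = header.index("GWEIGHT")
    eweight_row = first_column.index("EWEIGHT")
    # "at least two columns followed by ... GWEIGHT"
    if gweight_col < 2:
        raise ValueError("fewer than two columns before GWEIGHT")
    # "at least one row followed by ... EWEIGHT"
    if eweight_row < 1:
        raise ValueError("no row before EWEIGHT")
    data = [row[gweight_col + 1:] for row in rows[eweight_row + 1:]]
    return gweight_col, eweight_row, data


# Hypothetical sample table, invented for this example.
sample = (
    "GID\tORF\tNAME\tGWEIGHT\tA\tB\n"
    "EWEIGHT\t\t\t\t1\t1\n"
    "GENE1X\tg1\tg1\t1\t0.5\t-1.2\n"
)
print(split_cdt(sample))  # (3, 1, [['0.5', '-1.2']])
```

Everything left of GWEIGHT and above EWEIGHT is treated as annotation here, so adding or removing annotation columns such as NAME does not change what the function returns as data.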
DEVELOPERS ONLY SECTION
SUGGESTED CHANGE (Pseudocode optional)
e.g. Add a color selection class
FILES AFFECTED (where the changes will be implemented) - developers only
e.g. selectColor.java & settingsPanel.java
LEVEL OF EFFORT - developers only
trivial/minor/medium/major/overhaul (choose one)
COMMENTS
-
reporter -
repo owner - The GID/AID provides the original order by numbering each element before reordering, so one can infer the original order (e.g. COL12X). Is that sufficient?
And before questions arise: the double GWEIGHT in your test2_average.cdt example has already been taken care of in a recent PR.
- reporter - changed milestone to F/S - 02
- Screenshot of stacktrace for reference.
- changed version to beta2
In terms of CDTImporter (a Cytoscape app that transforms CDT files into networks):
- it needs the data to be floats (it does not recognize integers as edge weights)
- it needs the original order of header rows
- it needs the NAME column
- it needs the GID/AID row/columns but doesn't care what's in them
In my personal opinion, it's too hard-coded. You can't change much about the CDT file (as produced by Cluster 3.0) without breaking the import in CDTImporter.
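The requirements listed above could be turned into a quick pre-flight check before handing a file to CDTImporter. This is only a sketch based on the observations in this comment, not on CDTImporter's source; the function name `cdtimporter_problems` is hypothetical, the table is assumed to be a list of rows of strings, and the header-row order requirement is not checked.

```python
def cdtimporter_problems(rows):
    """List likely import problems, based on the observations above."""
    problems = []
    header = rows[0] if rows else []
    first_column = [row[0] if row else "" for row in rows]
    if "NAME" not in header:
        problems.append("missing NAME column")
    if "GID" not in header:
        problems.append("missing GID column")
    if "AID" not in first_column:
        problems.append("missing AID row")
    if "GWEIGHT" in header and "EWEIGHT" in first_column:
        first_data_col = header.index("GWEIGHT") + 1
        first_data_row = first_column.index("EWEIGHT") + 1
        for row in rows[first_data_row:]:
            cells = row[first_data_col:]
            # Integer-looking text such as "3" or "-2" has no decimal point.
            if any(c.strip().lstrip("+-").isdigit() for c in cells):
                problems.append("data contains integer values")
                break
    return problems


# Hypothetical tables, invented for this example.
cluster_style = [
    ["GID", "ORF", "NAME", "GWEIGHT", "A", "B"],
    ["AID", "", "", "", "ARRY0X", "ARRY1X"],
    ["EWEIGHT", "", "", "", "1.000000", "1.000000"],
    ["GENE0X", "g1", "g1", "1.000000", "0.500000", "-1.200000"],
]
minimal = [["ORF", "GWEIGHT", "A"], ["EWEIGHT", "", "1"], ["g1", "1", "3"]]
print(cdtimporter_problems(cluster_style))  # []
print(cdtimporter_problems(minimal))
```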