Ensure CDT file format consistency

Issue #491 new
Anastasia Baryshnikova created an issue

USE CASE: WHAT DO YOU WANT TO DO?

Make sure that we maintain the correct data format and compatibility with other software (as much as possible).

STEPS TO REPRODUCE AN ISSUE (OR TRIGGER A NEW FEATURE)

Following our discussion about CDTImporter and its inability to import CDT files from Treeview 3.0, I ran a few tests.

I created two test files (see image below) and clustered each of them with both Cluster 3.0 and Treeview 3.0 (alpha 3). Test1 contains only data; Test2 contains data and also specifies EWEIGHT and GWEIGHT.

all_tests.png

CURRENT BEHAVIOR

Unlike Treeview 3.0 (alpha 3), Cluster 3.0:

  • needs the 1st cell to be non-empty

  • adds GWEIGHT and EWEIGHT info in both cases (good)

  • adds a column "NAME" by duplicating the original row names (bad - it is not necessary).

  • adds 6 decimal places to the data and the GWEIGHT/EWEIGHT (bad but not critical)

Unlike Cluster 3.0, Treeview 3.0 (alpha 3):

  • is ok with the 1st cell being empty

  • does not add GWEIGHT and EWEIGHT unless they were present in the input file (bad)

  • does not add "NAME" (good)

  • adds 1 decimal place to the data but leaves GWEIGHT/EWEIGHT intact (bad but not critical)

Both programs:

  • add the row dendrogram (GID) before the row labels, but the column dendrogram (AID) after the column labels (bad because inconsistent).

EXPECTED BEHAVIOR

Treeview 3.0 should:

  • if the 1st cell is empty, add an X to it

  • add GWEIGHT and EWEIGHT to clustered files, even if they did not have them to begin with.

  • if possible, do not change the number format (if it was an integer, do not add decimal places).

  • change GID and AID to "NAME".

  • keep the row/column order consistent (see attached image).

right_row_col_order.png
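A minimal sketch of the first three expected behaviors, written in Python only for illustration (it is not Treeview code). The function name `normalize_table`, the default weight of `1`, and the choice to place GWEIGHT directly after the label column are all assumptions of this sketch. Cells are kept as strings, so integer values are written back exactly as they were read. The GID/AID and row/column ordering points are not covered here.

```python
def normalize_table(rows, default_weight="1"):
    """Return a copy of `rows` (a list of lists of strings) with the
    first cell filled and GWEIGHT/EWEIGHT present.

    Hypothetical helper: the default weight and the GWEIGHT position
    are assumptions, not behavior taken from Treeview or Cluster.
    """
    rows = [list(r) for r in rows]  # do not modify the caller's table

    # 1. An empty first cell becomes "X".
    if not rows[0][0].strip():
        rows[0][0] = "X"

    # 2. Add a GWEIGHT column if the header lacks one.
    if "GWEIGHT" not in rows[0]:
        rows[0].insert(1, "GWEIGHT")
        for r in rows[1:]:
            # An existing EWEIGHT row has no gene weight of its own.
            r.insert(1, "" if r[0] == "EWEIGHT" else default_weight)

    # 3. Add an EWEIGHT row if none exists.
    if not any(r[0] == "EWEIGHT" for r in rows):
        g = rows[0].index("GWEIGHT")
        n_data = len(rows[0]) - g - 1
        rows.insert(1, ["EWEIGHT"] + [""] * g + [default_weight] * n_data)

    # Data cells were never converted to numbers, so "5" stays "5".
    return rows
```

Writing the result back with tab separators would then give a file in which integer inputs keep their integer form.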

Note:

Here is the original CDT file specification (from http://tldrify.com/kku). The proposed changes are consistent with this definition.

A generalized CDT file is a tab-delimited text file with the following specifications. The leftmost column and topmost row are reserved for headers. The file must contain at least two columns followed by a column with the header GWEIGHT, and at least one row followed by a row with the header EWEIGHT. Any rows and columns before the EWEIGHT and GWEIGHT are treated as annotation, and any after are treated as data.
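To make the definition concrete, here is a rough Python sketch that splits a tab-delimited table into annotation and data by locating the GWEIGHT column and the EWEIGHT row. The function name `split_cdt` and the shape of the returned dictionary are assumptions made for illustration; cells are left as strings so the number format is untouched.

```python
import csv
import io


def split_cdt(text):
    """Split generalized CDT text into annotation and data parts.

    Hypothetical helper based only on the specification quoted above.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header = rows[0]

    # Columns before GWEIGHT are annotation, columns after it are data.
    if "GWEIGHT" not in header:
        raise ValueError("no GWEIGHT column header found")
    gweight_col = header.index("GWEIGHT")
    if gweight_col < 2:
        raise ValueError("expected at least two columns before GWEIGHT")

    # Rows before EWEIGHT are annotation, rows after it are data.
    eweight_row = None
    for i, r in enumerate(rows):
        if r and r[0] == "EWEIGHT":
            eweight_row = i
            break
    if eweight_row is None or eweight_row < 1:
        raise ValueError("expected at least one row before EWEIGHT")

    data_rows = rows[eweight_row + 1:]
    return {
        "column_labels": header[gweight_col + 1:],
        "row_labels": [r[0] for r in data_rows],
        "row_annotation": [r[1:gweight_col] for r in data_rows],
        "data": [r[gweight_col + 1:] for r in data_rows],
    }
```

A reader built this way does not depend on a NAME column or on the number of decimal places, only on the positions of GWEIGHT and EWEIGHT.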

DEVELOPERS ONLY SECTION

SUGGESTED CHANGE (Pseudocode optional)

e.g. Add a color selection class

FILES AFFECTED (where the changes will be implemented) - developers only

e.g. selectColor.java & settingsPanel.java

LEVEL OF EFFORT - developers only

trivial/minor/medium/major/overhaul (choose one)

COMMENTS

Comments (5)

  1. Anastasia Baryshnikova reporter

    In terms of CDTImporter (a Cytoscape app that transforms CDT files into networks):

    • it needs the data to be floats (does not recognize integers as edge weights)

    • it needs the original order of header rows

    • it needs the NAME column

    • it needs the GID/AID row/columns but doesn't care what's in them

    In my personal opinion, it's too hard-coded... You can't change much about the CDT file (as produced by Cluster 3.0) without breaking the import in CDTImporter.
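The constraints listed in this comment could be checked before export with something like the Python sketch below. It is not CDTImporter code: `importer_problems` is a hypothetical helper, and treating "float" as "text that `int()` cannot parse" is an assumption about what CDTImporter accepts. The header-row order requirement is not checked, because the required order is not spelled out here.

```python
def _looks_like_int(cell):
    """True if the cell text parses as an integer (e.g. "5", not "5.0")."""
    try:
        int(cell)
        return True
    except ValueError:
        return False


def importer_problems(rows):
    """List the ways a table (list of lists of strings) departs from
    the constraints described above. Hypothetical check only."""
    problems = []
    header = rows[0]

    if "NAME" not in header:
        problems.append("missing NAME column")
    if "GID" not in header:
        problems.append("missing GID column")
    if not any(r and r[0] == "AID" for r in rows):
        problems.append("missing AID row")

    # Data cells sit to the right of GWEIGHT and below EWEIGHT.
    eweight_rows = [i for i, r in enumerate(rows) if r and r[0] == "EWEIGHT"]
    if "GWEIGHT" in header and eweight_rows:
        first_data_col = header.index("GWEIGHT") + 1
        for r in rows[eweight_rows[0] + 1:]:
            if any(_looks_like_int(c) for c in r[first_data_col:]):
                problems.append("integer-formatted data values")
                break

    return problems
```

An empty list means none of the checked constraints is violated; it does not guarantee that the import will succeed.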

  2. Christopher Keil repo owner

    The GID/AID values preserve the original order: each element is numbered before reordering, so the original order can be inferred from its ID (e.g. COL12X). Is that sufficient?

    And before questions arise: the double GWEIGHT in your test2_average.cdt example has already been taken care of in a recent PR.
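As an illustration of recovering the original order from the GID/AID values, here is a small Python sketch. The ID pattern (letters, then a number, then a trailing `X`) is generalized from the single example `COL12X` and is an assumption, as are the function names.

```python
import re

# Assumed ID shape: alphabetic prefix, original index, trailing "X".
_ID_PATTERN = re.compile(r"^([A-Za-z]+)(\d+)X$")


def original_index(element_id):
    """Extract the number embedded in an ID such as "COL12X"."""
    match = _ID_PATTERN.match(element_id)
    if match is None:
        raise ValueError("unrecognized ID format: %r" % element_id)
    return int(match.group(2))


def original_order(ids):
    """Return the IDs sorted back into their pre-clustering order."""
    return sorted(ids, key=original_index)
```

For example, `original_order(["COL2X", "COL0X", "COL1X"])` sorts the IDs by their embedded numbers, which restores the pre-clustering order as long as every ID follows the assumed pattern.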
