Store all cluster params with output file

USE CASE: WHAT DO YOU WANT TO DO?

I want to know what the differences are between different clustered files - otherwise, storing numbered copies of the various clustering attempts is pointless.

Storing the cluster options in the file name is convenient for telling which file holds which data, but it seems incomplete when there are 5 options (type, linkage method, ignore zeroes, row linkage method, and column linkage method), and a file name is easily changed by the user.

STEPS TO REPRODUCE AN ISSUE (OR TRIGGER A NEW FEATURE)

Cluster a file
Cluster the file again using the same linkage method but different values for the other options

CURRENT BEHAVIOR

The cluster files are named the same aside from the number at the end, so unless the user remembers the order in which they created them, the only way for the user to be sure they are looking at data clustered in a particular way is to repeat the clustering.

EXPECTED BEHAVIOR

All the clustering parameters should be associated with the file in some way, whether in the file name or in a header in the file itself, so that the user knows how a clustered dataset was derived and how to reproduce the same clustering on a new dataset.

DEVELOPERS ONLY SECTION

SUGGESTED CHANGE (Pseudocode optional)

The easiest thing would be to include all options in the file name. Alternatively, creating a commented header and altering the file reading routines to skip commented lines would work as well.
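
A rough sketch of the header alternative, in Python for illustration only. It assumes a tab-delimited output file and "#" as the comment marker. The function names, parameter keys, and example values are hypothetical and not taken from the existing code.

```python
import io

COMMENT_PREFIX = "#"  # assumption: '#' marks a header/comment line


def write_clustered(out, params, rows):
    """Write clustering parameters as commented header lines, then the data rows."""
    for key, value in params.items():
        out.write(f"{COMMENT_PREFIX} {key}={value}\n")
    for row in rows:
        out.write("\t".join(row) + "\n")


def read_clustered(src):
    """Return (params, rows); header comments are parsed and never treated as data."""
    params, rows = {}, []
    for line in src:
        line = line.rstrip("\n")
        if line.startswith(COMMENT_PREFIX):
            # Strip the marker, then split "key=value" on the first '='.
            key, sep, value = line[len(COMMENT_PREFIX):].strip().partition("=")
            if sep:
                params[key] = value
            continue  # a comment line without '=' is simply skipped
        if line:
            rows.append(line.split("\t"))
    return params, rows


# Hypothetical option names and values, mirroring the options listed above.
example_params = {
    "type": "hierarchical",
    "row_linkage_method": "average",
    "column_linkage_method": "complete",
    "ignore_zeroes": "true",
}
example_rows = [["GID", "NAME"], ["GENE1X", "geneA"]]

buf = io.StringIO()
write_clustered(buf, example_params, example_rows)
buf.seek(0)
params, rows = read_clustered(buf)
```

With this approach the parameters travel inside the file, so renaming the file does not lose them. The cost is that every routine that reads the file has to skip the commented lines.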

FILES AFFECTED (where the changes will be implemented) - developers only

unknown

LEVEL OF EFFORT - developers only

medium
