Warn when file deemed too big for clustering/memory

Issue #62 new
Robert Leach created an issue

USE CASE: WHAT DO YOU WANT TO DO?

Get a warning when I try to do something on data that is too big for the function to work well.

STEPS TO REPRODUCE AN ISSUE (OR TRIGGER A NEW FEATURE)

  1. Open a file that has more than 6000 rows or columns
  2. Cluster the large dimension

CURRENT BEHAVIOR

From Anastasia: "a user reported that, when one side of the clustering is too big, Treeview (alpha3) stops the process without warning and without any output. I couldn't reproduce the behavior on my computer, probably because I have more memory. But the user's file was ~20000 rows x 7 columns. Clustering only columns works almost instantaneously."

EXPECTED BEHAVIOR

If a user tries to do something on data that will likely cause an error, open a warning dialog that will present them with a warning and the options to either cancel or proceed anyway.

DEVELOPERS ONLY SECTION

SUGGESTED CHANGE (Pseudocode optional)

When the user clicks continue in the cluster interface, it should trigger a check before running clustering which does the following:

1. Use the linear equation derived below in the memory analyses in the comments to
    predict memory requirements:

    f(x) = m*x + b

    where:

    b=1.897118644067797
    m=2.1838331160365045e-8
    x = matrix width * matrix height (whether clustering rows or columns or both)

2. Determine max allowed memory using `Runtime.getRuntime().maxMemory();`

3. Use the linear equation derived below in the running time analyses in the comments to
    predict the running time for both the smaller dimension (rows or columns only) and the
    estimated running time of doing both dimensions:

    f(x) = (m*x + b) / getProcSpeedWeight()

    where:

    b=108.15254237287856
    m=0.000020638293909480444
    x = matrix width or height squared
    s = processor speed in GHz (note: the equation was derived using a proc speed of
       3.5GHz, so we're using s to weight the result)

    We'd have to create the getProcSpeedWeight method, which would look something like
    this (since the linear equation was derived using a proc speed of 3.5GHz):

    double speed = getProcGhz();
    if (Double.isNaN(speed))
        return 1.0;
    return speed / 3.5;

    And getProcGhz would get the processor speed (and make sure it's reported in GHz)
    using these tips:

     Mac:

        #Command line call:
        >sysctl -n machdep.cpu.brand_string
        Intel(R) Core(TM) i7-4771 CPU @ 3.50GHz

     Windows:

        >wmic cpu get name
        Name
        Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz

     Linux:

        >cat /proc/cpuinfo | grep 'model name'
        model name : Intel(R) Atom(TM) CPU N270   @ 1.60GHz

4. If the user is clustering both dimensions, and the ratio E = (estimated running time
    for the smaller dimension) / (estimated running time for both dimensions) is less
    than Y, and the estimated running time for both dimensions is greater than Z minutes

4.1. Open a warning dialog stating how long clustering is estimated to take and that
       clustering the smaller dimension is estimated to take a fraction E of the time,
       with the options: cancel, cluster smaller dimension, and cluster both dimensions

5. If (step 4 was false or (step 4 was true and the user did not cancel)) AND the estimated
    required memory plus X is greater than max allowed memory

5.1. If the user is clustering both dimensions and one dimension is smaller than the other

5.1.1. Use the linear equation derived below in the memory analyses in the comments to
          predict memory requirements for clustering the smaller dimension:

          f(x) = m*x + b

          where:

          b=1.897118644067797
          m=2.1838331160365045e-8
          x = matrix width * matrix height (whether clustering rows or columns or both)

5.1.2. If the required memory for the smaller dimension plus X is greater than max
          allowed memory

5.1.2.1. Open a warning dialog stating that the clustering data is too big and that
             clustering the smaller dimension is estimated to still be possible and give them
             these 4 options: cancel, cluster both dimensions anyway (i.e. risk it), cluster
             smaller dimension, and restart with more memory (which makes a system call to
             `java -Xmx<necessary_memory> -jar treeview3.jar <current_input_file>` and
             quits).

5.1.3. Else open a warning dialog stating that the clustering data is too big and give them
          these 3 options: cancel, cluster anyway (i.e. risk it), and restart with more memory
          (which makes a system call to `java -Xmx<necessary_memory> -jar treeview3.jar
          <current_input_file>` and quits).

5.2. Else open a warning dialog stating that the clustering data is too big and give them
       these 3 options: cancel, cluster anyway (i.e. risk it), and restart with more memory
       (which makes a system call to `java -Xmx<necessary_memory> -jar treeview3.jar
       <current_input_file>` and quits).

Of course, keep track of whether the clustering parameters should change and change them if necessary - or whether clustering should be cancelled.
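
The check described above could be sketched roughly as follows. The fit coefficients are the values given in the steps; the method names, the GB-to-bytes conversion, and the way the X pad is applied are illustrative assumptions, not actual TreeView3 code:

```java
public class ClusterPreCheck {

    // Memory fit: f(x) = m*x + b, where x = width * height, result in GB
    static final double MEM_B = 1.897118644067797;
    static final double MEM_M = 2.1838331160365045e-8;

    // Time fit: f(x) = m*x + b, where x = axis length squared, result in seconds
    static final double TIME_B = 108.15254237287856;
    static final double TIME_M = 0.000020638293909480444;

    static final long X_PAD_BYTES = 500L * 1024 * 1024; // X = 500MB (see NOTES)

    /** Predicted peak memory in bytes for clustering a rows x cols matrix. */
    static double predictMemoryBytes(long rows, long cols) {
        double gb = MEM_M * ((double) rows * cols) + MEM_B;
        return gb * 1024.0 * 1024.0 * 1024.0;
    }

    /** Predicted running time in seconds for clustering one axis of length n. */
    static double predictSeconds(long n, double procSpeedWeight) {
        return (TIME_M * ((double) n * n) + TIME_B) / procSpeedWeight;
    }

    /** Step 5's memory test: predicted need plus the X pad vs. the heap limit. */
    static boolean exceedsMemory(long rows, long cols) {
        return predictMemoryBytes(rows, cols) + X_PAD_BYTES
                > Runtime.getRuntime().maxMemory();
    }
}
```

For a 6kx6k matrix this predicts roughly 2.68GB peak, which lines up with the 2.76GB measured in the profiling comments below.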

Furthermore, in ClusterProcessor.java on lines 99 and 133, we should open a warning dialog stating that clustering ran out of memory and give them 2 options: OK (which pretty much does nothing) and restart with more memory (which makes a system call to `java -Xmx<necessary_memory> -jar treeview3.jar <current_input_file>` and quits).

NOTES - suggested default values:

  • X = 500MB
  • Y = 0.25
  • Z = 3 minutes

I have not yet tested oblong matrices. The results may affect the linear equations above.

FILES AFFECTED (where the changes will be implemented) - developers only

unknown

LEVEL OF EFFORT - developers only

minor

COMMENTS

Old issue description:

The warning about data size should occur upon clustering and/or upon opening. The size that will generate the warning might be different for the different actions. E.g. Clustering rows only or columns only depends on that dimension size.

If it's known ahead of time that the data is too big to cluster, disable clustering for the affected dimension(s).

If clustering runs out of memory, present a meaningful error dialog message stating that fact and suggest they try clustering only 1 dimension.

Comments (46)

  1. Anastasia Baryshnikova

    Related to this: a user reported that, when one side of the clustering is too big, Treeview (alpha3) stops the process without warning and without any output. I couldn't reproduce the behavior on my computer, probably because I have more memory. But the user's file was ~20000 rows x 7 columns. Clustering only columns works almost instantaneously.

  2. Robert Leach reporter
    • edited description

    Added change request form. Anyone think that the threshold for the warning should be more intelligent with regard to memory?

  3. Robert Leach reporter

    I've been trying to profile the memory usage given different file sizes. @TreeView3Dev - do you have an idea, in terms of like array sizes how the sizes of the variables scale with rows and columns for clustering? Like if there are 6000x6000 rows/cols, what's the worst case/peak number of array indexes in the largest arrays involved in clustering? Or can you point me to the variables that grow so I can check it out myself?

    Incidentally, here's what I've determined using JProfiler:

    6kx6k
        haven't accurately measured time yet...
        2.76 Gigs allocated at the peak memory usage
    
    5kx5k
        ~ 9:41
        2.38G
    
    4kx4k
        ~ 6:47
        2.19G
    
    3kx3k
        ~ 4:56
        2.11G
    
    2kx2k
        ~ 3:30
        1.95G
    
    1kx1k
        ~ 2:21
        1.99G
    

    The memory values are peak memory allocated, not used.

    It looks like int arrays are the largest consumer of memory during clustering:

    varstreeview.png

  4. Robert Leach reporter

    Without investigating in depth, the precise memory requirements of clustering, I thought that plotting the requirements and fitting an exponential curve might be good enough. Here's what I've got. Note, this applies to square matrices. I should also investigate either oblong matrices or just clustering 1 side. While I was at it, I took a look at running time too. The warning dialog could estimate a running time too with the option to change parameters to cut down running time by clustering only 1 dimension:

    tv3timeperf.png

    tv3memperf.png

  5. Robert Leach reporter

    I just did a quick and dirty profile of Cluster.app using the same parameters on the 6kx6k file. "quick" - it took an hour and 11 minutes (much, much slower than treeview), but at its peak it used 640.3MB. Though maybe it wasn't a good test, because the output file from cluster is messed up.

    I noted that before clustering, just to load the 6k matrix, treeview has 2.24G memory allocated. (It's only a couple hundred megs before opening the file.) So clustering is only adding about 510Mb. Seems reasonable.

    Lance had been concerned about memory efficiency, so I thought it would be good to check these things out.

  6. Robert Leach reporter

    Lance pointed out that the running time and memory should scale linearly against width*height, so:

    linear-treeview-time.png

    linear-treeview-memory.png

  7. Robert Leach reporter
    • changed status to open

    Forgot to open this earlier this week. Been trying to figure out the best way to predict when a problem will occur.

  8. Robert Leach reporter

    @abarysh - Do you happen to have a copy of the ~20000 rows x 7 columns file that the user experienced clustering problems with? I'm a little dumbfounded because my testing has shown that the amount of memory is not any different when clustering only rows versus both - probably because I'm working on a square matrix, I'm guessing. I can mock up my own test file, but there's nothing like real data to test with.

    I also might be able to simulate the amount of memory she had by running the jar with low memory via -Xmx or -mx on the command line. Do you happen to know how much memory she had on her computer?

  9. Robert Leach reporter

    I discovered how you can determine your system's default maximum heap size. It would be nice to know what the user's value was:

    java -XX:+PrintFlagsFinal -version |& grep MaxHeapSize

    Mine is 4.3G:

    uintx MaxHeapSize                              := 4294967296      {product}
    

    I'll try java -Xmx1g -jar treeview3.jar and see how the crash occurs when I try to cluster a large file. We might be able to catch an exception somewhere too.

  10. Anastasia Baryshnikova

    This dataset is from Jenna Gaska; you may want to contact her directly and see if she can reproduce the issue and what the memory numbers on her computer are.

  11. Robert Leach reporter

    OK, so I did the test with low memory. Surprisingly, the clustering seems to finish successfully but craps out at the end. It's unclear from the log how one might catch the error and present the user with a meaningful/useful message about increasing memory.

    Initializing DistMatrixCalculator.
    DistTask is done: success.
    java.lang.OutOfMemoryError: Java heap space
     - java.util.concurrent.FutureTask.report(FutureTask.java:122)
     - java.util.concurrent.FutureTask.get(FutureTask.java:192)
     - javax.swing.SwingWorker.get(SwingWorker.java:602)
     - Cluster.ClusterProcessor.clusterAxis(ClusterProcessor.java:130)
     - Controllers.ClusterDialogController$ClusterTask.calculateAxis(ClusterDialogController.java:584)
     - Controllers.ClusterDialogController$ClusterTask.doInBackground(ClusterDialogController.java:239)
     - Controllers.ClusterDialogController$ClusterTask.doInBackground(ClusterDialogController.java:164)
     - javax.swing.SwingWorker$1.call(SwingWorker.java:295)
     - java.util.concurrent.FutureTask.run(FutureTask.java:266)
     - javax.swing.SwingWorker.run(SwingWorker.java:334)
     - java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     - java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
     - java.lang.Thread.run(Thread.java:745)
    Done./Users/rleach/PROJECT/TREEVIEW/TestData/large_6kx6k/average/large_6kx6k_average_4.gtr
    ProcessorClusterTask is done: success.
    Initializing DistMatrixCalculator.
    java.lang.OutOfMemoryError: GC overhead limit exceeded
     - java.util.concurrent.FutureTask.report(FutureTask.java:122)
     - java.util.concurrent.FutureTask.get(FutureTask.java:192)
     - javax.swing.SwingWorker.get(SwingWorker.java:602)
     - Cluster.ClusterProcessor.calcDistance(ClusterProcessor.java:96)
     - Controllers.ClusterDialogController$ClusterTask.calculateAxis(ClusterDialogController.java:576)
     - Controllers.ClusterDialogController$ClusterTask.doInBackground(ClusterDialogController.java:251)
     - Controllers.ClusterDialogController$ClusterTask.doInBackground(ClusterDialogController.java:164)
     - javax.swing.SwingWorker$1.call(SwingWorker.java:295)
     - java.util.concurrent.FutureTask.run(FutureTask.java:266)
     - javax.swing.SwingWorker.run(SwingWorker.java:334)
     - java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     - java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
     - java.lang.Thread.run(Thread.java:745)
    java.lang.OutOfMemoryError: GC overhead limit exceeded
    Done./Users/rleach/PROJECT/TREEVIEW/TestData/large_6kx6k/average/large_6kx6k_average_4.atr
    DistTask is done: success.
    Done./Users/rleach/PROJECT/TREEVIEW/TestData/large_6kx6k/average/large_6kx6k_average_4.atr
    ProcessorClusterTask is done: success.
    Something occurred during reordering.
    ClusterTask is done: cancelled.
    
  12. Robert Leach reporter

    OK, it looks like ClusterProcessor.java, lines 99 and 133, are where the 2 exceptions are caught. That's where we should pop up the warning dialog that suggests they restart with more memory (if we don't pre-empt the situation by warning them ahead of time that they could encounter a memory issue).
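
A rough sketch of what that dialog could look like at those catch sites. The class, method names, and dialog wording are hypothetical; only the command shape comes from the suggested change above:

```java
import javax.swing.JOptionPane;

public class OomHandler {

    /** Builds a copy-paste ready restart command (jar name assumed). */
    static String buildRestartCommand(long neededBytes, String inputFile) {
        long mb = neededBytes / (1024 * 1024);
        return "java -Xmx" + mb + "m -jar treeview3.jar " + inputFile;
    }

    /** Could be called where the OutOfMemoryErrors are currently caught. */
    static void onOutOfMemory(long neededBytes, String inputFile) {
        Object[] options = { "OK", "Restart with more memory" };
        int choice = JOptionPane.showOptionDialog(null,
                "Clustering ran out of memory.\nSuggested restart:\n"
                        + buildRestartCommand(neededBytes, inputFile),
                "Out of memory", JOptionPane.DEFAULT_OPTION,
                JOptionPane.ERROR_MESSAGE, null, options, options[0]);
        if (choice == 1) {
            // restart path would go here (not implemented in this sketch)
        }
    }
}
```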

  13. Christopher Keil repo owner

    @hepcat72 Very cool analysis, thanks for that. I am dying to get back into the clustering code to improve it and make it unit testable. I am sure that my first implementation, which was one of the very first things I ever did for this project, is a bit crap design-wise. It has been refactored once or twice since (for example, making distance matrix calculation its own class). The profiling you did is really cool and my goal is to make the code simpler and cleaner, and most of all: testable.

    I am very interested in taking this up but would love for you to help me out in terms of the profiling, which I think is excellent.

    The Java Runtime class provides us with all the infos we need in terms of memory at runtime: https://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html
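
For example, the heap figures relevant here can be read like this (helper name is illustrative):

```java
public class MemoryInfo {

    /** Returns {max, total, free, used} heap figures in bytes. */
    static long[] snapshot() {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();     // the -Xmx ceiling
        long total = rt.totalMemory(); // currently reserved by the JVM
        long free = rt.freeMemory();   // unused portion of 'total'
        return new long[] { max, total, free, total - free };
    }
}
```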

  14. Robert Leach reporter
    • edited description

    I put a design (really shabby pseudocode) for handling this issue in the SUGGESTED CHANGE section of the issue description.

  15. Robert Leach reporter

    So @TreeView3Dev - do you intend to take this issue? I was going to do it, but I neglected to assign it to myself before opening it. But if you want to take it, that's fine. I'm not enjoying working on this issue. Lance had suggested doing a more accurate memory calculation based on the algorithm and I wasn't looking forward to that. But you're more familiar with the code so if you decide to do that, you'd be more prepared to do that. I put a design for the solution in the SUGGESTED CHANGE section that roughly represents what I was planning on doing.

    I'm going to run some tests now on oblong matrices and I'll post my results.

  16. Robert Leach reporter

    Jenna told me that her laptop has 4G memory and her JVM starts with 1G by default.

    My estimate is that about 7G is necessary for her data, but it's difficult to test because clustering it takes so long - an estimated hour and 10 minutes (if my linear equations are accurate).

  17. Christopher Keil repo owner

    The issue about deploying to all systems will allow us to set new JVM heap defaults. But depending on OS and architecture there are still limits.

  18. Robert Leach reporter

    Oh yeah, and I asked Jenna how she would have preferred to be presented (or whether she should be presented) with an option to cluster only 1 axis and she said she would have preferred a dialog pop up when she clicks continue in the cluster interface which gives her 3 options: cancel, do it anyway, or do it with the smaller axis.

    BTW, I forgot to take the GHz of the processor(s) into account in the estimated time. My Mac's cores are 3.5GHz (or 3.5 billion cycles per second).

    The accuracy wouldn't have to be exact. Using a linear fit derived from actual performance data on one computer is a very rough estimate to begin with. Using the proc speed wouldn't affect the ratio we would use when comparing running times of 1 axis versus both. The place where it would have an effect is when determining (when there's a difference between dimension sizes), if some arbitrary minimum running time is enough to warrant a suggested change of parameters.

    If we wanted to take processor speed into account to weight the time result (to yield a very rough estimate), we'd have to have a way to get the processor speed on mac, windows, and linux. There does not appear to be a standard package that gets this info independent of system. Here are my notes on getting it for the various systems.

    Useful: http://wpguru.co.uk/2015/02/how-to-find-your-cpu-details-from-the-command-line/

    Mac:

    #Command line call:
    >sysctl -n machdep.cpu.brand_string
    Intel(R) Core(TM) i7-4771 CPU @ 3.50GHz
    

    Windows:

    >wmic cpu get name
    Name
    Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
    

    Linux:

    >cat /proc/cpuinfo | grep 'model name'
    model name : Intel(R) Atom(TM) CPU N270   @ 1.60GHz
    

    I will update the suggested change to include an adjustment for proc speed.
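
Parsing the GHz figure out of those brand strings could look like this (regex-based; helper names are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProcSpeed {

    // Matches e.g. "@ 3.50GHz" or "@ 1.60GHz" in a CPU brand string.
    private static final Pattern GHZ =
            Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*GHz", Pattern.CASE_INSENSITIVE);

    /** Extracts the clock speed in GHz from a brand string, or NaN if absent. */
    static double parseGhz(String brand) {
        Matcher m = GHZ.matcher(brand);
        return m.find() ? Double.parseDouble(m.group(1)) : Double.NaN;
    }

    /** Weight relative to the 3.5GHz machine the linear fit was derived on. */
    static double procSpeedWeight(String brand) {
        double ghz = parseGhz(brand);
        return Double.isNaN(ghz) ? 1.0 : ghz / 3.5;
    }
}
```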

  19. Robert Leach reporter
    • changed status to new

    I'm throwing this back in as new because I'm done editing the issue and the suggested pseudocode. @TreeView3Dev had indicated a willingness/desire to take this issue and I'm happy to move on to my next thing.

  20. Christopher Keil repo owner

    @hepcat72 In case there is not enough memory (5.1.2, 5.1.3, 5.2) you want to make a system call to quit and restart the JVM with -Xmx[req mem]. This is not possible from within the JVM unless you use Runtime.exec(). This would let you run cmd.exe on Windows (analogous on other OSs) and pipe the java command. I really do think we shouldn't do that. Executing anything on the command line for a user is in my opinion not appropriate.

    We can suggest to do that however and provide a copy-paste ready command and let the user decide if they want to run it on their command line. I am very much against doing this automatically.

  21. Christopher Keil repo owner

    I think this issue needs to be approached from a much simpler angle. There are several issues I see with the proposed solution.

    1) We cannot get CPU information from inside a running Java app without doing something hacky - Java doesn't allow it. Only option is before application startup and passing the result to the app (e.g. as a file).

    2) It would not reliably tell us how long clustering will actually take on a particular user machine. So if we guesstimate anyway, why not keep it simple and derive the estimate from heuristics instead of an overengineered prediction algorithm?

    if 5k lines in axis to cluster -> approximate runtime is 30-45 min
    

    3) Lots of implementation effort for little user gain

    Clustering above 5k lines per axis reliably leads to performance issues even on modern machines. We can just generate a warning dialog with the option to continue or cancel depending on the axis size. The severity of the warning can be adjusted in multiple steps according to axis size and estimates can be changed or improved in the future by editing very few lines.

    If memory errors occur, we catch them as exception and display the already existing Out-of-memory dialog, suggesting to restart the app with more RAM.
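
That simpler heuristic could be as small as this. The 5k cutoff is the figure mentioned above; the 10k cutoff and the severity tiers are made-up placeholders that would be tuned by editing two lines:

```java
public class AxisSizeWarning {

    enum Severity { NONE, MILD, STRONG }

    /** Severity of the pre-cluster warning, keyed only on axis size. */
    static Severity warnFor(int axisSize) {
        if (axisSize > 10000) return Severity.STRONG; // placeholder cutoff
        if (axisSize > 5000)  return Severity.MILD;   // ~30-45 min expected
        return Severity.NONE;
    }
}
```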

    Opinions @abarysh and @lance_parsons?

  22. Robert Leach reporter

    Well, you could get the proc speed using the system calls, but I hear ya. It's a fancy guess, really, and might prove to be way off. And you're right, we do catch the memory exceptions now, so the situation has improved, but it's still a pretty big loss for a user to wait a half hour and have it fail. Though I think that a static threshold has drawbacks too and has the potential to be disruptive to the user unnecessarily. So how about this?

    What if we keep running estimates on run time and memory as clustering progresses based on what's been done so far? We can obtain the percentage of allotted memory being used, I believe. And since we know how much time has passed since the start of clustering and the percentage done, we could extrapolate how much longer it will take based on how long it's taken for the percentage completed. We can display a running estimate of time remaining and/or percent memory used along with the progress bar. We could also extrapolate how much memory will be consumed based on how much more has been consumed since the start. We could insert a (red) warning in the progress window about approaching the memory limit if our extrapolation looks like it might exceed the allotted memory.
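
The extrapolation idea could be sketched like this (a naive linear model, so at best a rough upper bound, since later clustering cycles have less to process; method names are hypothetical):

```java
public class ProgressEstimator {

    /** Linearly extrapolates remaining seconds from elapsed time and fraction done. */
    static double secondsRemaining(double elapsedSeconds, double fractionDone) {
        if (fractionDone <= 0.0) return Double.POSITIVE_INFINITY;
        return elapsedSeconds * (1.0 - fractionDone) / fractionDone;
    }

    /** Fraction of the max heap currently in use, for a progress-bar warning. */
    static double heapFractionUsed() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return (double) used / rt.maxMemory();
    }
}
```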

    But anyway, I think that the suggested change in the issue description isn't as much effort as it might seem, either. It's pretty straightforward and the logic has been thought out.

    Alternatively, is there a way to reduce the memory requirements or runtime for the clustering algorithm? How do these compare to cluster3.0?

  23. Christopher Keil repo owner

    > but it's still a pretty big loss for a user to wait a half hour and have it fail.

    Yes, but I think a reasonable warning about the potential for long runtimes is completely sufficient here. No need to provide pseudo-specific runtimes.

    > has the potential to be disruptive to the user unnecessarily

    The disruption of the warning can be reduced a lot by not displaying a dialog at all, but simply an orange warning text in the cluster dialog, for example. This would go in hand with your suggestions to display a memory warning around the progress bar.

    > we could extrapolate how much longer it will take based on how long it's taken for the percentage completed.

    This is not as easy because clustering accelerates the more it progresses (each cycle has less to process).

    > we can display a running estimate of time remaining and/or percent memory used along with the progress bar

    I think this is a good approach, yes. We have free memory and max memory available from the Runtime class.

    We may be able to reduce some of the memory needs, depending on the way hierarchical clustering is implemented now. I can look into it and possibly refactor how some data structures are used. However, I think that the underlying algorithm simply has a certain need for memory depending on data size and clustering method used.

    cluster3.0 may be better because it comes without a lot of the Java class overhead. Not sure how much modern JVM optimizes here though.

    I am going to put out a PR as an example implementation for #62 and we can discuss if and how well it works.

  24. Robert Leach reporter

    Yeah, I think that all sounds good to me. The points I disagree on are minor. I say go ahead based on your last comment. BTW, since I’ve got you here, and I don’t mean to derail this topic, but do you know offhand how easy it would be to reintroduce the ability to have multiple windows with different data open in them? You may have seen the one request we had to bring that feature back via email?

  25. Christopher Keil repo owner

    Yes, I have seen it! I think it should not be too hard, but at one point we sort of went the way of doing the "one window thing". I don't mind either way, but it might still be at least a little bit hairy to reintroduce multiple windows.

  26. Robert Leach reporter

    Btw, Lance and I were just talking. Memory usage could be improved if we used float instead of double. I don’t know which it is we’re using, but if it’s double, it may not be necessary. Also math library computations may be a bit faster with float compared to double... Just a thought.
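
A back-of-the-envelope illustration of the float-vs-double point (raw array cell storage only; JVM object overhead not counted):

```java
public class FloatVsDouble {

    /** Raw cell storage for a rows x cols matrix at the given bytes per cell. */
    static long matrixBytes(long rows, long cols, int bytesPerCell) {
        return rows * cols * bytesPerCell;
    }
}
```

For a 6kx6k matrix, doubles (8 bytes) need 288MB of raw cell storage per copy of the matrix, while floats (4 bytes) need half that.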
