Warn when file deemed too big for clustering/memory

Issue #62 new
Robert Leach created an issue

USE CASE: WHAT DO YOU WANT TO DO?

Get a warning when I try to do something on data that is too big for the function to work well.

STEPS TO REPRODUCE AN ISSUE (OR TRIGGER A NEW FEATURE)

  1. Open a file that has more than 6000 rows or columns
  2. Cluster the large dimension

CURRENT BEHAVIOR

From Anastasia: "a user reported that, when one side of the clustering is too big, Treeview (alpha3) stops the process without warning and without any output. I couldn't reproduce the behavior on my computer, probably because I have more memory. But the user's file was ~20000 rows x 7 columns. Clustering only columns works almost instantaneously."

EXPECTED BEHAVIOR

If a user tries to do something on data that will likely cause an error, open a warning dialog that will present them with a warning and the options to either cancel or proceed anyway.

DEVELOPERS ONLY SECTION

SUGGESTED CHANGE (Pseudocode optional)

When the user clicks continue in the cluster interface, it should trigger a check before running clustering which does the following:

1. Use the linear equation derived below in the memory analyses in the comments to
    predict memory requirements:

    f(x) = m*x + b

    where:

    b=1.897118644067797
    m=2.1838331160365045e-8
    x = matrix width * matrix height (whether clustering rows or columns or both)

2. Determine max allowed memory using `Runtime.getRuntime().maxMemory();`

3. Use the linear equation derived below in the running time analyses in the comments to
    predict the running time for both the smaller dimension (rows or columns only) and the
    estimated running time of doing both dimensions:

    f(x) = (m*x + b) / getProcSpeedWeight()

    where:

    b=108.15254237287856
    m=0.000020638293909480444
    x = matrix width or height squared
    s = processor speed in GHz (note: the equation was derived using a proc speed of
       3.5GHz, so we're using s to weight the result)

    We'd have to create the getProcSpeedWeight method, which would look something like
    this (since the linear equation was derived using a proc speed of 3.5GHz):

    double speed = getProcGhz();
    if (Double.isNaN(speed))
        return 1.0;
    return speed / 3.5;

    And getProcGhz would get the processor speed (and make sure it's reported in GHz)
    using these tips:

     Mac:

        #Command line call:
        >sysctl -n machdep.cpu.brand_string
        Intel(R) Core(TM) i7-4771 CPU @ 3.50GHz

     Windows:

        >wmic cpu get name
        Name
        Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz

     Linux:

        >cat /proc/cpuinfo | grep 'model name'
        model name : Intel(R) Atom(TM) CPU N270   @ 1.60GHz

4. If the user is clustering both dimensions, and the ratio E = (estimated running time
    for the smaller dimension) / (estimated running time for both dimensions) is less
    than Y, and the estimated running time for both dimensions is greater than Z minutes

4.1. Open a warning dialog stating how long clustering is estimated to take and that
       clustering the smaller dimension is estimated to take a fraction E of the time,
       with the options: cancel, cluster smaller dimension, and cluster both dimensions

5. If (step 4 was false or (step 4 was true and the user did not cancel)) AND the estimated
    required memory plus X is greater than max allowed memory

5.1. If the user is clustering both dimensions and one dimension is smaller than the other

5.1.1. Use the linear equation derived below in the memory analyses in the comments to
          predict memory requirements for clustering the smaller dimension:

          f(x) = m*x + b

          where:

          b=1.897118644067797
          m=2.1838331160365045e-8
          x = matrix width * matrix height (whether clustering rows or columns or both)

5.1.2. If the required memory for the smaller dimension plus X is greater than max
          allowed memory

5.1.2.1. Open a warning dialog stating that the clustering data is too big and that
             clustering the smaller dimension is estimated to still be possible and give them
             these 4 options: cancel, cluster both dimensions anyway (i.e. risk it), cluster
             smaller dimension, and restart with more memory (which makes a system call to
             `java -Xmx<necessary_memory> -jar treeview3.jar <current_input_file>` and
             quits).

5.1.3. Else open a warning dialog stating that the clustering data is too big and give them
          these 3 options: cancel, cluster anyway (i.e. risk it), and restart with more memory
          (which makes a system call to `java -Xmx<necessary_memory> -jar treeview3.jar
          <current_input_file>` and quits).

5.2. Else open a warning dialog stating that the clustering data is too big and give them
       these 3 options: cancel, cluster anyway (i.e. risk it), and restart with more memory
       (which makes a system call to `java -Xmx<necessary_memory> -jar treeview3.jar
       <current_input_file>` and quits).

Of course, keep track of whether the clustering parameters should change and change them if necessary - or whether clustering should be cancelled.
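
The check described above could be sketched roughly as follows. The fit coefficients are the values given in the steps; the method names, the GB-to-bytes conversion, and the way the X pad is applied are illustrative assumptions, not actual TreeView3 code:

```java
public class ClusterPreCheck {

    // Memory fit: f(x) = m*x + b, where x = width * height, result in GB
    static final double MEM_B = 1.897118644067797;
    static final double MEM_M = 2.1838331160365045e-8;

    // Time fit: f(x) = m*x + b, where x = axis length squared, result in seconds
    static final double TIME_B = 108.15254237287856;
    static final double TIME_M = 0.000020638293909480444;

    static final long X_PAD_BYTES = 500L * 1024 * 1024; // X = 500MB (see NOTES)

    /** Predicted peak memory in bytes for clustering a rows x cols matrix. */
    static double predictMemoryBytes(long rows, long cols) {
        double gb = MEM_M * ((double) rows * cols) + MEM_B;
        return gb * 1024.0 * 1024.0 * 1024.0;
    }

    /** Predicted running time in seconds for clustering one axis of length n. */
    static double predictSeconds(long n, double procSpeedWeight) {
        return (TIME_M * ((double) n * n) + TIME_B) / procSpeedWeight;
    }

    /** Step 5's memory test: predicted need plus the X pad vs. the heap limit. */
    static boolean exceedsMemory(long rows, long cols) {
        return predictMemoryBytes(rows, cols) + X_PAD_BYTES
                > Runtime.getRuntime().maxMemory();
    }
}
```

For a 6kx6k matrix this predicts roughly 2.68GB peak, which lines up with the 2.76GB measured in the profiling comments below.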

Furthermore, in ClusterProcessor.java on lines 99 and 133, we should open a warning dialog stating that clustering ran out of memory and give them 2 options: OK (which pretty much does nothing) and restart with more memory (which makes a system call to `java -Xmx<necessary_memory> -jar treeview3.jar <current_input_file>` and quits).

NOTES - suggested default values:

  • X = 500MB
  • Y = 0.25
  • Z = 3 minutes

I have not yet tested oblong matrices. The results may affect the linear equations above.

FILES AFFECTED (where the changes will be implemented) - developers only

unknown

LEVEL OF EFFORT - developers only

minor

COMMENTS

Old issue description:

The warning about data size should occur upon clustering and/or upon opening. The size that will generate the warning might be different for the different actions. E.g. Clustering rows only or columns only depends on that dimension size.

If it's known ahead of time that the data is too big to cluster, disable clustering for the affected dimension(s).

If clustering runs out of memory, present a meaningful error dialog message stating that fact and suggest they try clustering only 1 dimension.

Comments (46)

  1. Anastasia Baryshnikova

    Related to this: a user reported that, when one side of the clustering is too big, Treeview (alpha3) stops the process without warning and without any output. I couldn't reproduce the behavior on my computer, probably because I have more memory. But the user's file was ~20000 rows x 7 columns. Clustering only columns works almost instantaneously.

  2. Robert Leach reporter
    • edited description

    Added change request form. Anyone think that the threshold for the warning should be more intelligent with regard to memory?

  3. Robert Leach reporter

    I've been trying to profile the memory usage given different file sizes. @TreeView3Dev - do you have an idea, in terms of like array sizes how the sizes of the variables scale with rows and columns for clustering? Like if there are 6000x6000 rows/cols, what's the worst case/peak number of array indexes in the largest arrays involved in clustering? Or can you point me to the variables that grow so I can check it out myself?

    Incidentally, here's what I've determined using JProfiler:

    6kx6k
        haven't accurately measured time yet...
        2.76 Gigs allocated at the peak memory usage
    
    5kx5k
        ~ 9:41
        2.38G
    
    4kx4k
        ~ 6:47
        2.19G
    
    3kx3k
        ~ 4:56
        2.11G
    
    2kx2k
        ~ 3:30
        1.95G
    
    1kx1k
        ~ 2:21
        1.99G
    

    The memory values are peak memory allocated, not used.

    It looks like int arrays are the largest consumer of memory during clustering:

    varstreeview.png

  4. Robert Leach reporter

    Without investigating in depth, the precise memory requirements of clustering, I thought that plotting the requirements and fitting an exponential curve might be good enough. Here's what I've got. Note, this applies to square matrices. I should also investigate either oblong matrices or just clustering 1 side. While I was at it, I took a look at running time too. The warning dialog could estimate a running time too with the option to change parameters to cut down running time by clustering only 1 dimension:

    tv3timeperf.png

    tv3memperf.png

  5. Robert Leach reporter

    I just did a quick and dirty profile of Cluster.app using the same parameters on the 6kx6k file. "quick" - it took an hour and 11 minutes (much, much slower than treeview), but at its peak it used 640.3MB. Though maybe it wasn't a good test, because the output file from cluster is messed up.

    I noted that before clustering, just to load the 6k matrix, treeview has 2.24G memory allocated. (It's only a couple hundred megs before opening the file.) So clustering is only adding about 510Mb. Seems reasonable.

    Lance had been concerned about memory efficiency, so I thought it would be good to check these things out.

  6. Robert Leach reporter

    Lance pointed out that the running time and memory should scale linearly against width*height, so:

    linear-treeview-time.png

    linear-treeview-memory.png

  7. Robert Leach reporter
    • changed status to open

    Forgot to open this earlier this week. Been trying to figure out the best way to predict when a problem will occur.

  8. Robert Leach reporter

    @abarysh - Do you happen to have a copy of the ~20000 rows x 7 columns file that the user experienced clustering problems with? I'm a little dumbfounded because my testing has shown that the amount of memory is not any different when clustering only rows versus both - probably because I'm working on a square matrix, I'm guessing. I can mock up my own test file, but there's nothing like real data to test with.

    I also might be able to simulate the amount of memory she had by running the jar with low memory via -Xmx or -mx on the command line. Do you happen to know how much memory she had on her computer?

  9. Robert Leach reporter

    I discovered how you can determine your system's default maximum heap size. It would be nice to know what the user's value was:

    java -XX:+PrintFlagsFinal -version |& grep MaxHeapSize

    Mine is 4.3G:

    uintx MaxHeapSize                              := 4294967296      {product}
    

    I'll try java -Xmx1g -jar treeview3.jar and see how the crash occurs when I try to cluster a large file. We might be able to catch an exception somewhere too.

  10. Anastasia Baryshnikova

    This dataset is from Jenna Gaska; you may want to contact her directly and see if she can reproduce the issue and what the memory numbers on her computer are.

  11. Robert Leach reporter

    OK, so I did the test with low memory. Surprisingly, the clustering seems to finish successfully but craps out at the end. It's unclear from the log how one might catch the error and present the user with a meaningful/useful message about increasing memory.

    Initializing DistMatrixCalculator.
    DistTask is done: success.
    java.lang.OutOfMemoryError: Java heap space
     - java.util.concurrent.FutureTask.report(FutureTask.java:122)
     - java.util.concurrent.FutureTask.get(FutureTask.java:192)
     - javax.swing.SwingWorker.get(SwingWorker.java:602)
     - Cluster.ClusterProcessor.clusterAxis(ClusterProcessor.java:130)
     - Controllers.ClusterDialogController$ClusterTask.calculateAxis(ClusterDialogController.java:584)
     - Controllers.ClusterDialogController$ClusterTask.doInBackground(ClusterDialogController.java:239)
     - Controllers.ClusterDialogController$ClusterTask.doInBackground(ClusterDialogController.java:164)
     - javax.swing.SwingWorker$1.call(SwingWorker.java:295)
     - java.util.concurrent.FutureTask.run(FutureTask.java:266)
     - javax.swing.SwingWorker.run(SwingWorker.java:334)
     - java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     - java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
     - java.lang.Thread.run(Thread.java:745)
    Done./Users/rleach/PROJECT/TREEVIEW/TestData/large_6kx6k/average/large_6kx6k_average_4.gtr
    ProcessorClusterTask is done: success.
    Initializing DistMatrixCalculator.
    java.lang.OutOfMemoryError: GC overhead limit exceeded
     - java.util.concurrent.FutureTask.report(FutureTask.java:122)
     - java.util.concurrent.FutureTask.get(FutureTask.java:192)
     - javax.swing.SwingWorker.get(SwingWorker.java:602)
     - Cluster.ClusterProcessor.calcDistance(ClusterProcessor.java:96)
     - Controllers.ClusterDialogController$ClusterTask.calculateAxis(ClusterDialogController.java:576)
     - Controllers.ClusterDialogController$ClusterTask.doInBackground(ClusterDialogController.java:251)
     - Controllers.ClusterDialogController$ClusterTask.doInBackground(ClusterDialogController.java:164)
     - javax.swing.SwingWorker$1.call(SwingWorker.java:295)
     - java.util.concurrent.FutureTask.run(FutureTask.java:266)
     - javax.swing.SwingWorker.run(SwingWorker.java:334)
     - java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     - java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
     - java.lang.Thread.run(Thread.java:745)
    java.lang.OutOfMemoryError: GC overhead limit exceeded
    Done./Users/rleach/PROJECT/TREEVIEW/TestData/large_6kx6k/average/large_6kx6k_average_4.atr
    DistTask is done: success.
    Done./Users/rleach/PROJECT/TREEVIEW/TestData/large_6kx6k/average/large_6kx6k_average_4.atr
    ProcessorClusterTask is done: success.
    Something occurred during reordering.
    ClusterTask is done: cancelled.
    
  12. Robert Leach reporter

    OK, it looks like ClusterProcessor.java, lines 99 and 133, are where the 2 exceptions are caught. That's where we should pop up the warning dialog that suggests they restart with more memory (if we don't pre-empt the situation by warning them ahead of time that they could encounter a memory issue).
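
A rough sketch of what that dialog could look like at those catch sites. The class, method names, and dialog wording are hypothetical; only the command shape comes from the suggested change above:

```java
import javax.swing.JOptionPane;

public class OomHandler {

    /** Builds a copy-paste ready restart command (jar name assumed). */
    static String buildRestartCommand(long neededBytes, String inputFile) {
        long mb = neededBytes / (1024 * 1024);
        return "java -Xmx" + mb + "m -jar treeview3.jar " + inputFile;
    }

    /** Could be called where the OutOfMemoryErrors are currently caught. */
    static void onOutOfMemory(long neededBytes, String inputFile) {
        Object[] options = { "OK", "Restart with more memory" };
        int choice = JOptionPane.showOptionDialog(null,
                "Clustering ran out of memory.\nSuggested restart:\n"
                        + buildRestartCommand(neededBytes, inputFile),
                "Out of memory", JOptionPane.DEFAULT_OPTION,
                JOptionPane.ERROR_MESSAGE, null, options, options[0]);
        if (choice == 1) {
            // restart path would go here (not implemented in this sketch)
        }
    }
}
```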

  13. Christopher Keil repo owner

    @hepcat72 Very cool analysis, thanks for that. I am dying to get back into the clustering code to improve it and make it unit testable. I am sure that my first implementation, which was one of the very first things I ever did for this project, is a bit crap design-wise. It has been refactored once or twice since (for example, making distance matrix calculation its own class). The profiling you did is really cool and my goal is to make the code simpler and cleaner, and most of all: testable.

    I am very interested in taking this up but would love for you to help me out in terms of the profiling, which I think is excellent.

    The Java Runtime class provides us with all the infos we need in terms of memory at runtime: https://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html
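
For example, the heap figures relevant here can be read like this (helper name is illustrative):

```java
public class MemoryInfo {

    /** Returns {max, total, free, used} heap figures in bytes. */
    static long[] snapshot() {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();     // the -Xmx ceiling
        long total = rt.totalMemory(); // currently reserved by the JVM
        long free = rt.freeMemory();   // unused portion of 'total'
        return new long[] { max, total, free, total - free };
    }
}
```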

  14. Robert Leach reporter
    • edited description

    I put a design (really shabby pseudocode) for handling this issue in the SUGGESTED CHANGE section of the issue description.

  15. Robert Leach reporter

    So @TreeView3Dev - do you intend to take this issue? I was going to do it, but I neglected to assign it to myself before opening it. But if you want to take it, that's fine. I'm not enjoying working on this issue. Lance had suggested doing a more accurate memory calculation based on the algorithm and I wasn't looking forward to that. But you're more familiar with the code so if you decide to do that, you'd be more prepared to do that. I put a design for the solution in the SUGGESTED CHANGE section that roughly represents what I was planning on doing.

    I'm going to run some tests now on oblong matrices and I'll post my results.

  16. Robert Leach reporter

    Jenna told me that her laptop has 4G memory and her JVM starts with 1G by default.

    My estimate is that about 7G is necessary for her data, but it's difficult to test because clustering it takes so long - an estimated hour and 10 minutes (if my linear equations are accurate).

  17. Christopher Keil repo owner

    The issue about deploying to all systems will allow us to set new JVM heap defaults. But depending on OS and architecture there are still limits.

  18. Robert Leach reporter

    Oh yeah, and I asked Jenna how she would have preferred to be presented (or whether she should be presented) with an option to cluster only 1 axis and she said she would have preferred a dialog pop up when she clicks continue in the cluster interface which gives her 3 options: cancel, do it anyway, or do it with the smaller axis.

    BTW, I forgot to take the GHz of the processor(s) into account in the estimated time. My Mac's cores are 3.5GHz (or 3.5 billion cycles per second).

    The accuracy wouldn't have to be exact. Using a linear fit derived from actual performance data on one computer is a very rough estimate to begin with. Using the proc speed wouldn't affect the ratio we would use when comparing running times of 1 axis versus both. The place where it would have an effect is when determining (when there's a difference between dimension sizes), if some arbitrary minimum running time is enough to warrant a suggested change of parameters.

    If we wanted to take processor speed into account to weight the time result (to yield a very rough estimate), we'd have to have a way to get the processor speed on mac, windows, and linux. There does not appear to be a standard package that gets this info independent of system. Here are my notes on getting it for the various systems.

    Useful: http://wpguru.co.uk/2015/02/how-to-find-your-cpu-details-from-the-command-line/

    Mac:

    #Command line call:
    >sysctl -n machdep.cpu.brand_string
    Intel(R) Core(TM) i7-4771 CPU @ 3.50GHz
    

    Windows:

    >wmic cpu get name
    Name
    Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
    

    Linux:

    >cat /proc/cpuinfo | grep 'model name'
    model name : Intel(R) Atom(TM) CPU N270   @ 1.60GHz
    

    I will update the suggested change to include an adjustment for proc speed.
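
Parsing the GHz figure out of those brand strings could look like this (regex-based; helper names are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProcSpeed {

    // Matches e.g. "@ 3.50GHz" or "@ 1.60GHz" in a CPU brand string.
    private static final Pattern GHZ =
            Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*GHz", Pattern.CASE_INSENSITIVE);

    /** Extracts the clock speed in GHz from a brand string, or NaN if absent. */
    static double parseGhz(String brand) {
        Matcher m = GHZ.matcher(brand);
        return m.find() ? Double.parseDouble(m.group(1)) : Double.NaN;
    }

    /** Weight relative to the 3.5GHz machine the linear fit was derived on. */
    static double procSpeedWeight(String brand) {
        double ghz = parseGhz(brand);
        return Double.isNaN(ghz) ? 1.0 : ghz / 3.5;
    }
}
```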

  19. Robert Leach reporter
    • changed status to new

    I'm throwing this back in as new because I'm done editing the issue and the suggested pseudocode. @TreeView3Dev had indicated a willingness/desire to take this issue and I'm happy to move on to my next thing.

  20. Christopher Keil repo owner

    @hepcat72 In case there is not enough memory (5.1.2, 5.1.3, 5.2) you want to make a system call to quit and restart the JVM with -Xmx[req mem]. This is not possible from within the JVM unless you use Runtime.exec(). This would let you run cmd.exe on Windows (analogous on other OSs) and pipe the java command. I really do think we shouldn't do that. Executing anything on the command line for a user is in my opinion not appropriate.

    We can suggest to do that however and provide a copy-paste ready command and let the user decide if they want to run it on their command line. I am very much against doing this automatically.

  21. Christopher Keil repo owner

    I think this issue needs to be approached from a much simpler angle. There are several issues I see with the proposed solution.

    1) We cannot get CPU information from inside a running Java app without doing something hacky - Java doesn't allow it. Only option is before application startup and passing the result to the app (e.g. as a file).

    2) It would not reliably tell us how long clustering will actually take on a particular user machine. So if we guesstimate anyway, why not keep it simple and derive the estimate from heuristics instead of an overengineered prediction algorithm?

    if 5k lines in axis to cluster -> approximate runtime is 30-45 min
    

    3) Lots of implementation effort for little user gain

    Clustering above 5k lines per axis reliably leads to performance issues even on modern machines. We can just generate a warning dialog with the option to continue or cancel depending on the axis size. The severity of the warning can be adjusted in multiple steps according to axis size and estimates can be changed or improved in the future by editing very few lines.

    If memory errors occur, we catch them as exception and display the already existing Out-of-memory dialog, suggesting to restart the app with more RAM.
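
That simpler heuristic could be as small as this. The 5k cutoff is the figure mentioned above; the 10k cutoff and the severity tiers are made-up placeholders that would be tuned by editing two lines:

```java
public class AxisSizeWarning {

    enum Severity { NONE, MILD, STRONG }

    /** Severity of the pre-cluster warning, keyed only on axis size. */
    static Severity warnFor(int axisSize) {
        if (axisSize > 10000) return Severity.STRONG; // placeholder cutoff
        if (axisSize > 5000)  return Severity.MILD;   // ~30-45 min expected
        return Severity.NONE;
    }
}
```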

    Opinions @abarysh and @lance_parsons?

  22. Robert Leach reporter

    Well, you could get the proc speed using the system calls, but I hear ya. It's a fancy guess, really, and might prove to be way off. And you're right, we do catch the memory exceptions now, so the situation has improved, but it's still a pretty big loss for a user to wait a half hour and have it fail. Though I think that a static threshold has drawbacks too and has the potential to be disruptive to the user unnecessarily. So how about this?

    What if we keep running estimates on run time and memory as clustering progresses based on what's been done so far? We can obtain the percentage of allotted memory being used, I believe. And since we know how much time has passed since the start of clustering and the percentage done, we could extrapolate how much longer it will take based on how long it's taken for the percentage completed. We can display a running estimate of time remaining and/or percent memory used along with the progress bar. We could also extrapolate how much memory will be consumed based on how much more has been consumed since the start. We could insert a (red) warning in the progress window about approaching the memory limit if our extrapolation looks like it might exceed the allotted memory.
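
The extrapolation idea could be sketched like this (a naive linear model, so at best a rough upper bound, since later clustering cycles have less to process; method names are hypothetical):

```java
public class ProgressEstimator {

    /** Linearly extrapolates remaining seconds from elapsed time and fraction done. */
    static double secondsRemaining(double elapsedSeconds, double fractionDone) {
        if (fractionDone <= 0.0) return Double.POSITIVE_INFINITY;
        return elapsedSeconds * (1.0 - fractionDone) / fractionDone;
    }

    /** Fraction of the max heap currently in use, for a progress-bar warning. */
    static double heapFractionUsed() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return (double) used / rt.maxMemory();
    }
}
```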

    But anyway, I think that the suggested change in the issue description isn't as much effort as it might seem, either. It's pretty straightforward and the logic has been thought out.

    Alternatively, is there a way to reduce the memory requirements or runtime for the clustering algorithm? How do these compare to cluster3.0?

  23. Christopher Keil repo owner

    > but it's still a pretty big loss for a user to wait a half hour and have it fail.

    Yes, but I think a reasonable warning about the potential for long runtimes is completely sufficient here. No need to provide pseudo-specific runtimes.

    > has the potential to be disruptive to the user unnecessarily

    The disruption of the warning can be reduced a lot by not displaying a dialog at all, but simply an orange warning text in the cluster dialog, for example. This would go in hand with your suggestions to display a memory warning around the progress bar.

    > we could extrapolate how much longer it will take based on how long it's taken for the percentage completed.

    This is not as easy because clustering accelerates the more it progresses (each cycle has less to process).

    > we can display a running estimate of time remaining and/or percent memory used along with the progress bar

    I think this is a good approach, yes. We have free memory and max memory available from the Runtime class.

    We may be able to reduce some of the memory needs, depending on the way hierarchical clustering is implemented now. I can look into it and possibly refactor how some data structures are used. However, I think that the underlying algorithm simply has a certain need for memory depending on data size and clustering method used.

    cluster3.0 may be better because it comes without a lot of the Java class overhead. Not sure how much modern JVM optimizes here though.

    I am going to put out a PR as an example implementation for #62 and we can discuss if and how well it works.

  24. Robert Leach reporter

    Yeah, I think that all sounds good to me. The points I disagree on are minor. I say go ahead based on your last comment. BTW, since I’ve got you here, and I don’t mean to derail this topic, but do you know offhand how easy it would be to reintroduce the ability to have multiple windows with different data open in them? You may have seen the one request we had to bring that feature back via email?

  25. Christopher Keil repo owner

    Yes, I have seen it! I think it should not be too hard, but at one point we sort of went the way of doing the "one window thing". I don't mind either way, but it might still be at least a little bit hairy to reintroduce multiple windows.

  26. Robert Leach reporter

    Btw, Lance and I were just talking. Memory usage could be improved if we used float instead of double. I don’t know which it is we’re using, but if it’s double, it may not be necessary. Also math library computations may be a bit faster with float compared to double... Just a thought.
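
A back-of-the-envelope illustration of the float-vs-double point (raw array cell storage only; JVM object overhead not counted):

```java
public class FloatVsDouble {

    /** Raw cell storage for a rows x cols matrix at the given bytes per cell. */
    static long matrixBytes(long rows, long cols, int bytesPerCell) {
        return rows * cols * bytesPerCell;
    }
}
```

For a 6kx6k matrix, doubles (8 bytes) need 288MB of raw cell storage per copy of the matrix, while floats (4 bytes) need half that.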
