Perf. Issues During 1.0.0.0 Validation

Issue #44 closed
Daniel Marsh-Patrick repo owner created an issue

Report attached, but the gist is that on Edge and Safari, with large datasets you get a 'script taking too long' error.

I personally have not experienced this in Edge with a 100K row dataset (the visual filters to the top 30K), but performance in Edge is significantly worse than in Chrome.

Having done some performance analysis in Edge, there's a lot of garbage collection being done prior to rendering, but nothing in particular in the code takes a long time to run.

Initial thoughts are that this relates to the memory leak issues in 1.x, so I'll profile a large dataset with no visual processing and see what happens. Upgrading to 2.x might be the short-term fix to get the 1.0.0.0 code to validate successfully, but it's likely that some additional code changes will be required to compensate. We'll see how it goes.

Official response

  • Daniel Marsh-Patrick reporter

    All changes made and ready to go. The final approach is to cap the categories at 100 (hard-set in settings; this may be made configurable later on, once the Data Limit menu gets added back in).

    The API has been upgraded to 2.2 and a number of other optimisations have been made. There might be more we can do in future, but as the KDE is the main culprit overall, this will not be trivial to work on. We'll see how things go.

Comments (11)

  1. Daniel Marsh-Patrick reporter

    Did some profiling on a 30K row load without any settings parsing or view model mapping (i.e. just getting the data view mapped and into the sandbox). This takes approx. 8 seconds on Edge, excluding the queryData URL (which takes approximately 1.1 seconds to get all the data).

    I have updated the visuals API to 2.2, which reduces the noise in the profiling quite significantly. I have also managed to remove any migration issues between 1.13 and 2.2, so it's a sensible choice to upgrade to the latest now anyway (especially as we aren't using fetchMoreData() right now).

    We'll still need to do some more testing, but this is a good start.

  2. Daniel Marsh-Patrick reporter

    To reduce unnecessary operations, we can detect whether the visual is getting more data or not. As such, we don't need to continually re-map the whole view model on each update, just the bits that change. For instance, calculating statistics on all categories is expensive and doesn't need to happen again if the data is unchanged, so we can save a fair amount of effort there.

    I'll break out the visualHelpers functions, particularly those for view model mapping, into a class with functions to handle each operation. I'm not planning to do the optimisations within the axis and dimension calculation at this time, unless I absolutely have to.
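
    A rough sketch of the update-type gating described above (not the visual's actual code; mapViewModel and calculateDimensions are hypothetical helpers):

    ```typescript
    // Gate the expensive view model re-map on the update type, so that
    // resize-only updates reuse the cached model and just recalculate layout.
    import powerbi from "powerbi-visuals-api";
    import VisualUpdateType = powerbi.VisualUpdateType;
    import VisualUpdateOptions = powerbi.extensibility.visual.VisualUpdateOptions;

    function dataHasChanged(options: VisualUpdateOptions): boolean {
        // VisualUpdateType is a flags enum, so test membership with a bitwise AND
        return (options.type & VisualUpdateType.Data) === VisualUpdateType.Data;
    }

    // Inside update():
    //   if (dataHasChanged(options)) {
    //       this.viewModel = mapViewModel(options.dataViews[0], this.settings);
    //   }
    //   this.viewModel.layout = calculateDimensions(options.viewport, this.viewModel);
    ```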

  3. Daniel Marsh-Patrick reporter

    I've considered drawing only the area plot, rather than the area and line plot, as this does give a saving (roughly 15-20% of the rendering cost). However, if I apply the stroke width to the area plot, it will draw the stroke line down the centre of the violin, so we still need both a line and an area plot to avoid this scenario.
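
    For illustration, the area/line pairing looks roughly like this (simplified to a single violin half; the scales and data shape here are placeholders, not the visual's code):

    ```typescript
    import * as d3 from "d3";

    interface KdePoint { x: number; y: number; }   // x = value, y = density

    // Placeholder scales; the real visual derives these from the data and layout
    const yScale = d3.scaleLinear().domain([0, 1]).range([200, 0]);
    const xScale = d3.scaleLinear().domain([0, 1]).range([0, 50]);

    // Filled half of the violin: fill only, no stroke applied
    const violinArea = d3.area<KdePoint>()
        .y((d) => yScale(d.x))
        .x0(0)
        .x1((d) => xScale(d.y));

    // Outline: traces just the outer edge, so the stroke never runs down the centre
    const violinLine = d3.line<KdePoint>()
        .y((d) => yScale(d.x))
        .x((d) => xScale(d.y));

    // selection.append("path").attr("d", violinArea(points)).style("stroke", "none");
    // selection.append("path").attr("d", violinLine(points)).style("fill", "none");
    ```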

  4. Daniel Marsh-Patrick reporter

    Note: when gating on the update type, if the chart collapses, it won't re-render when re-expanded using this approach, so we need to do these responsiveness checks on resize and re-draw correctly.
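
    A minimal sketch of what I mean (names are hypothetical): track when a draw was skipped because the viewport was collapsed, so a later resize forces a full redraw even though the update type alone wouldn't.

    ```typescript
    import powerbi from "powerbi-visuals-api";
    import VisualUpdateOptions = powerbi.extensibility.visual.VisualUpdateOptions;

    class RenderGate {
        private skippedLastUpdate = false;

        /** True if this update needs a full redraw, even if only the viewport changed. */
        public needsFullRedraw(options: VisualUpdateOptions): boolean {
            const { width, height } = options.viewport;
            if (width < 1 || height < 1) {
                // Collapsed: we'll skip drawing, but remember that we did
                this.skippedLastUpdate = true;
                return false;
            }
            // Just re-expanded after a skipped draw: force a redraw even though
            // the update type alone (e.g. Resize) wouldn't trigger a re-map
            const forceRedraw = this.skippedLastUpdate;
            this.skippedLastUpdate = false;
            return forceRedraw;
        }
    }
    ```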

  5. Daniel Marsh-Patrick reporter

    I've moved all the view model-related methods and regression tested the visual. I have also done some basic profiling and added this to the view model for ease of reporting.

    I'll now look for some small improvements within the current code base. For instance, Power BI will give us the category min/max, so we don't need to run d3 over the array for each category to get these values. We can do something similar for the entire dataset and the y-domain. Not expecting huge gains, but we'll see (a rough sketch follows the timings below).

    Currently the calculateStatistics method takes an average of:

    • Tooth Growth, categories: 27.4ms
    • Tooth Growth, no categories: 18.3ms
    • Census, categories: 139.9ms
    • Census, no categories: 92.3ms
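
    The min/max change would look roughly like this, assuming the host populates the optional minLocal/maxLocal aggregates on the measure column (falling back to d3 when it doesn't):

    ```typescript
    import * as d3 from "d3";
    import powerbi from "powerbi-visuals-api";

    // Prefer the aggregates Power BI already supplies on the value column,
    // and only scan the array with d3 when they're absent.
    function getValueExtent(column: powerbi.DataViewValueColumn) {
        const values = <number[]>column.values;
        const min = column.minLocal !== undefined ? <number>column.minLocal : d3.min(values);
        const max = column.maxLocal !== undefined ? <number>column.maxLocal : d3.max(values);
        return { min, max };
    }
    ```
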
  6. Daniel Marsh-Patrick reporter

    Attaching analysis of function consumption (by processing time).

    For larger datasets it's clearly the KDE calculations; for smaller ones, it's interchangeable between the KDE and the element generation.

    While the cost of the KDE probably can't be mitigated, we might be able to look into how we manage it.
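
    For context, a bare-bones Gaussian KDE (illustrative only, not the visual's implementation) shows why it dominates: the cost is samples × evaluation points, so the levers available are the sample count per category and the resolution we evaluate the density at.

    ```typescript
    // Standard Gaussian kernel
    const gaussian = (x: number): number =>
        Math.exp(-0.5 * x * x) / Math.sqrt(2 * Math.PI);

    // Outer loop runs once per evaluation point, inner loop once per sample
    function kernelDensity(
        samples: number[],
        bandwidth: number,
        resolution: number,
        domain: [number, number]
    ): Array<{ x: number; y: number }> {
        const [min, max] = domain;
        const step = (max - min) / (resolution - 1);
        const points: Array<{ x: number; y: number }> = [];
        for (let i = 0; i < resolution; i++) {
            const x = min + i * step;
            let sum = 0;
            for (const s of samples) {           // the expensive part
                sum += gaussian((x - s) / bandwidth);
            }
            points.push({ x, y: sum / (samples.length * bandwidth) });
        }
        return points;
    }
    ```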

  7. Daniel Marsh-Patrick reporter

    Heard back from MS re: test cases with validation, and the issue is actually simpler than expected, and off the happy path. Essentially, swapping the category and sampling fields (i.e. putting 30K rows in the category field) is enough to break the visual.

    If I re-jig the data view mapping to use grouping, I'd get the benefit of being able to provide a dataReductionAlgorithm to both fields, but this reduces the number of usable data points overall, as everything gets chunked up and we get a huge number of null values for most categories. This is the reason for the current data view mapping. Further research required.

    Edit: this scenario also breaks other visuals, such as the box & whisker chart, so there's no standard way of managing this use case in the other visuals I've reviewed.
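
    For reference, the "re-jigged" grouped mapping would look roughly like this (written as a TypeScript literal mirroring the capabilities.json shape; role names and counts are illustrative only):

    ```typescript
    // Grouped categorical mapping: grouping lets us reduce both fields, but
    // rows get chunked per group and most categories end up padded with nulls.
    const groupedMappingSketch = {
        categorical: {
            categories: {
                for: { in: "category" },
                dataReductionAlgorithm: { top: { count: 30000 } }
            },
            values: {
                group: {
                    by: "sampling",
                    select: [{ for: { in: "measure" } }],
                    dataReductionAlgorithm: { top: { count: 30000 } }
                }
            }
        }
    };
    ```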

  8. Daniel Marsh-Patrick reporter

    All changes made and ready to go. The final approach is to cap the categories at 100 (hard-set in settings; this may be made configurable later on, once the Data Limit menu gets added back in).

    The API has been upgraded to 2.2 and a number of other optimisations have been made. There might be more we can do in future, but as the KDE is the main culprit overall, this will not be trivial to work on. We'll see how things go.
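
    A sketch of the cap as applied when mapping categories into the view model (the constant mirrors the hard-set value above; helper names are hypothetical):

    ```typescript
    const CATEGORY_LIMIT = 100;

    // Truncate the category list to the cap and report whether we did,
    // so the visual can surface a "data capped" notice if needed.
    function capCategories<T>(categories: T[]): { categories: T[]; capped: boolean } {
        const capped = categories.length > CATEGORY_LIMIT;
        return {
            categories: capped ? categories.slice(0, CATEGORY_LIMIT) : categories,
            capped
        };
    }
    ```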
