Export creates enormous PDFs

Issue #409 resolved
Christopher Keil repo owner created an issue

USE CASE: WHAT DO YOU WANT TO DO?

Export a clustered matrix to PDF.

STEPS TO REPRODUCE AN ISSUE (OR TRIGGER A NEW FEATURE)

  1. Open file "test_dataset1.txt" from the uploaded test datasets.
  2. Cluster both axes with default settings.
  3. Export the whole matrix as a PDF. The export will probably take a very long time, and Adobe Reader may temporarily freeze when opening the resulting PDF.
  4. Inspect file properties and check file size.

CURRENT BEHAVIOR

test_dataset1.txt has 4175 rows and 3748 columns, i.e. more than 15.6 million data points. When a PDF is exported, the export code runs for a long time, and opening the resulting PDF temporarily freezes the PDF viewer. The file size ends up being 3.40GB (see screenshot).

export_filesize.PNG
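
For scale, the reported numbers can be turned into a rough per-cell cost. This is only a back-of-the-envelope sketch: it assumes that "3.40GB" means 3.40 × 10^9 bytes, and the class and method names are made up for illustration.

```java
// Rough size estimate from the figures reported in this issue.
// Assumption: 1 GB = 10^9 bytes (the file properties dialog may use 2^30 instead).
final class ExportSizeEstimate {

    /** Number of matrix cells; computed in long to avoid int overflow on big matrices. */
    static long cellCount(int rows, int cols) {
        return (long) rows * cols;
    }

    /** Average number of file bytes spent per matrix cell. */
    static double bytesPerCell(double fileSizeBytes, long cells) {
        return fileSizeBytes / cells;
    }

    public static void main(String[] args) {
        long cells = cellCount(4175, 3748);   // test_dataset1.txt
        double pdfBytes = 3.40e9;             // reported PDF size, decimal GB assumed
        System.out.println("cells: " + cells);
        System.out.println("PDF bytes per cell: " + Math.round(bytesPerCell(pdfBytes, cells)));
    }
}
```

Under that assumption the PDF spends a little over 200 bytes on every matrix cell, which gives a concrete number to compare against other software.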

EXPECTED BEHAVIOR

The file size is reasonable (we should probably discuss what "reasonable" means here and take other software as a point of reference).

DEVELOPERS ONLY SECTION

SUGGESTED CHANGE (Pseudocode optional)

...

FILES AFFECTED (where the changes will be implemented) - developers only

ExportHandler.java

LEVEL OF EFFORT - developers only

medium

COMMENTS

Comments (17)

  1. Robert Leach

    I've been aware that large matrices produce very large PDFs. How big is the PNG version of the export? Perhaps a solution could be to embed a PNG inside a PDF?

  2. mohammed faizaan

    I exported large_6kx6k.tx to several of the formats. Here are the file sizes:

    PDF - 7GB

    SVG - 31GB

    PS - takes a lot of time

    PNG - 7MB

    PPM - 920MB

  3. mohammed faizaan

    Just a heads up on this issue.

    My idea for this issue is to create a high-resolution PNG first and then convert it into a PDF. I am using itextpdf for the conversion. It has an AGPL licence, so we can use it as long as our code is open source.

    I can see significant improvements with small files, and I would like to try it with the largest 6x6 file we have (I just don't have enough RAM on my local machine). Right now, PDF creation takes twice as long as PNG export (because of the extra converting and loading).

    EDIT - I have a test jar file if anyone wants to test with large files. You might need to add -Xmx4G when starting TreeView if you are trying to export large_6x6.

    Test jar - https://bitbucket.org/smd_faizan/treeview3/downloads/treeview3-all-a83adc2.jar
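
The first half of this idea (rasterizing the matrix to a PNG) can be sketched with the JDK alone. This is a minimal illustration, not the code in the test jar: it assumes one pixel per matrix cell and values in [-1, 1], the red/blue color mapping and all names are hypothetical, and the itextpdf conversion step is left out.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

final class MatrixRasterSketch {

    /** Hypothetical color scheme: positive values in red, negative in blue, clamped to [-1, 1]. */
    static int colorFor(double value) {
        double clamped = Math.max(-1.0, Math.min(1.0, value));
        int intensity = (int) Math.round(Math.abs(clamped) * 255);
        return clamped >= 0 ? (intensity << 16) : intensity;
    }

    /** Draws the matrix into an image with exactly one pixel per cell. */
    static BufferedImage rasterize(double[][] matrix) {
        int rows = matrix.length;
        int cols = matrix[0].length;
        BufferedImage img = new BufferedImage(cols, rows, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < rows; y++) {
            for (int x = 0; x < cols; x++) {
                img.setRGB(x, y, colorFor(matrix[y][x]));
            }
        }
        return img;
    }

    /** Encodes the image as PNG in memory. */
    static byte[] toPng(BufferedImage img) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            if (!ImageIO.write(img, "png", out)) {
                throw new IOException("no PNG writer available");
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Synthetic demo data standing in for a clustered matrix.
        double[][] demo = new double[200][300];
        for (int r = 0; r < demo.length; r++) {
            for (int c = 0; c < demo[r].length; c++) {
                demo[r][c] = Math.sin(r * 0.05) * Math.cos(c * 0.05);
            }
        }
        byte[] png = toPng(rasterize(demo));
        System.out.println("PNG size in bytes: " + png.length);
    }
}
```

Because the image needs only one pixel per cell and PNG compresses runs of similar colors, the raster stays small compared with writing every cell as a separate drawing operation.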

  4. Anastasia Baryshnikova

    @smd_faizan I just wanted to emphasize that, while the matrix itself can be in raster format, the tree and the labels (once implemented) should be in vector and should be editable from the PDF. Would that be the case with this PNG -> PDF conversion?

  5. mohammed faizaan

    Oh wow, I didn't know we were editing the PDF using vector graphics. Let me look into the options; thanks for pointing that out. The PNG -> PDF conversion doesn't support this as of now.

  6. mohammed faizaan

    Hey @abarysh, if we use a raster format for the matrix and vector graphics for the trees, it is way faster.

    test_dataset1.cdt.txt -> takes ~30 seconds to export & PDF/PS/SVG file size is ~40MB (previously, PDF was 3.4GB)
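
The raster-matrix-plus-vector-tree split can be illustrated for the SVG case using only the JDK. This is a sketch under stated assumptions, not the actual export code: values are assumed to lie in [0, 1] and are drawn in grayscale, tree branches are passed in as precomputed {x1, y1, x2, y2} segments, and all class, method, and parameter names are hypothetical.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import java.util.Locale;
import javax.imageio.ImageIO;

final class HybridExportSketch {

    /** Encodes the matrix as a grayscale PNG, one pixel per cell (values assumed in [0, 1]). */
    static byte[] matrixPng(double[][] m) {
        BufferedImage img = new BufferedImage(m[0].length, m.length, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < m.length; y++) {
            for (int x = 0; x < m[0].length; x++) {
                int g = (int) Math.round(Math.max(0.0, Math.min(1.0, m[y][x])) * 255);
                img.setRGB(x, y, (g << 16) | (g << 8) | g);
            }
        }
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ImageIO.write(img, "png", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds an SVG with the matrix as a single embedded raster and each branch as a vector line. */
    static String toSvg(double[][] m, double[][] branches, int cellPx, int treePx) {
        int matrixW = m[0].length * cellPx;
        int matrixH = m.length * cellPx;
        String png = Base64.getEncoder().encodeToString(matrixPng(m));
        StringBuilder svg = new StringBuilder();
        svg.append(String.format(Locale.ROOT,
            "<svg xmlns=\"http://www.w3.org/2000/svg\" "
            + "xmlns:xlink=\"http://www.w3.org/1999/xlink\" width=\"%d\" height=\"%d\">%n",
            treePx + matrixW, matrixH));
        // The raster is stretched to cellPx per cell; support for "pixelated" varies by viewer.
        svg.append(String.format(Locale.ROOT,
            "<image x=\"%d\" y=\"0\" width=\"%d\" height=\"%d\" preserveAspectRatio=\"none\" "
            + "style=\"image-rendering:pixelated\" xlink:href=\"data:image/png;base64,%s\"/>%n",
            treePx, matrixW, matrixH, png));
        // Tree branches stay as editable vector elements.
        for (double[] b : branches) {
            svg.append(String.format(Locale.ROOT,
                "<line x1=\"%.1f\" y1=\"%.1f\" x2=\"%.1f\" y2=\"%.1f\" stroke=\"black\"/>%n",
                b[0], b[1], b[2], b[3]));
        }
        return svg.append("</svg>").toString();
    }

    public static void main(String[] args) {
        double[][] matrix = {{0.0, 0.5}, {1.0, 0.25}};
        // A tiny two-leaf tree drawn to the left of the matrix.
        double[][] branches = {{0, 5, 20, 5}, {0, 15, 20, 15}, {0, 5, 0, 15}};
        System.out.println(toSvg(matrix, branches, 10, 20));
    }
}
```

With this layout the output grows with the number of tree branches plus one compressed image, rather than with one vector element per matrix cell, while the tree lines remain editable.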
