reduce peak memory in jgi_summarize_bam_contig_depths with very large assemblies

Issue #161 resolved
Rob Egan created an issue

Currently there needs to be enough RAM to hold n (contigs) * s (samples) * 16 bytes before the depth table is created. On very large datasets this can be huge: with 500 samples and 100 million contigs it is over 800 GB.
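
For concreteness, the back-of-the-envelope arithmetic behind that figure (the 16 bytes per contig per sample is the figure stated above, not something I re-measured):

    100,000,000 contigs x 500 samples x 16 bytes = 8 x 10^11 bytes ≈ 800 GB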

At issue is that the 500 BAM files are parsed individually, but each one represents a column in a depth file with 100 million rows, so the depth table is calculated in a transposed orientation with respect to how it is output.
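
A minimal sketch of that layout mismatch; the names DepthCell, fill_column, and write_rows are hypothetical and not the actual identifiers in the code:

    #include <cstddef>
    #include <vector>

    struct DepthCell { double mean; double variance; };   // ~16 bytes per cell

    // Each BAM is parsed on its own and fills one *column* of the table:
    // table[contig][sample] for every contig seen in that sample.
    void fill_column(std::vector<std::vector<DepthCell>>& table, std::size_t sample) {
        for (std::size_t contig = 0; contig < table.size(); ++contig) {
            table[contig][sample] = DepthCell{0.0, 0.0};   // placeholder values
        }
    }

    // ...but the depth file is emitted one *row* (contig) at a time, so the
    // entire n-by-s table has to be resident before the first row is written.
    void write_rows(const std::vector<std::vector<DepthCell>>& table) {
        (void)table;   // row-major text output would go here
    }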

Looking quickly at the code, we should be able to cut this in half (i.e. 8 bytes per contig per sample), since the variance already contains the calculated mean. But maybe there is a way to avoid this peak memory altogether by logging the depths in binary (transposed) and then rewriting them at the end in the correct format.
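
A minimal sketch of that second idea, assuming one binary scratch file per sample and an 8-byte (mean, variance) cell; all names here (DepthCell, spill_column, write_rows_from_spill, the depth_col_*.bin paths) are hypothetical and not from the actual jgi_summarize_bam_contig_depths code:

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <string>
    #include <vector>

    struct DepthCell { float mean; float variance; };   // 8 bytes instead of 16

    // After parsing sample s, append its whole column (one cell per contig)
    // to a per-sample binary scratch file and drop it from memory.
    void spill_column(const std::vector<DepthCell>& column, int sample) {
        std::string path = "depth_col_" + std::to_string(sample) + ".bin";
        FILE* f = std::fopen(path.c_str(), "wb");
        std::fwrite(column.data(), sizeof(DepthCell), column.size(), f);
        std::fclose(f);
    }

    // At the end, stream a block of contigs from every scratch file and emit
    // the corresponding output rows, so peak RAM during the rewrite is roughly
    // block * samples * 8 bytes rather than contigs * samples * 16 bytes.
    void write_rows_from_spill(std::size_t n_contigs, int n_samples,
                               std::size_t block, FILE* out) {
        std::vector<FILE*> cols(n_samples);
        for (int s = 0; s < n_samples; ++s) {
            std::string path = "depth_col_" + std::to_string(s) + ".bin";
            cols[s] = std::fopen(path.c_str(), "rb");
        }
        std::vector<std::vector<DepthCell>> slab(n_samples);
        for (std::size_t start = 0; start < n_contigs; start += block) {
            std::size_t n = std::min(block, n_contigs - start);
            for (int s = 0; s < n_samples; ++s) {
                slab[s].resize(n);
                std::fread(slab[s].data(), sizeof(DepthCell), n, cols[s]);
            }
            for (std::size_t i = 0; i < n; ++i) {    // one output row per contig
                for (int s = 0; s < n_samples; ++s)
                    std::fprintf(out, "\t%.4f\t%.4f", slab[s][i].mean, slab[s][i].variance);
            std::fprintf(out, "\n");
            }
        }
        for (FILE* f : cols) std::fclose(f);
    }

With the figures above and a block of 1 million contigs, the rewrite pass would hold roughly 500 samples x 1M contigs x 8 bytes ≈ 4 GB at a time, plus one full column while a BAM is being parsed, at the cost of extra disk I/O for the scratch files.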
