reduce peak memory in jgi_summarize_bam_contig_depths with very large assemblies
Issue #161
resolved
Currently there needs to be enough RAM to hold n (contigs) * s (samples) * 16 bytes before the depth table is created. On very large datasets this can be huge: with 500 samples and 100 million contigs, it is over 800 GB.
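The 800 GB figure above works out as follows (a quick back-of-the-envelope check; the 8-byte mean and 8-byte variance accumulators per cell are an assumption based on the 16-byte figure in the report):

```python
# Back-of-the-envelope check of the peak memory estimate.
n_contigs = 100_000_000   # 100 million contigs (rows in the depth table)
n_samples = 500           # one BAM file per sample (columns)
bytes_per_cell = 16       # assumed: running mean + variance, 8 bytes each

peak_bytes = n_contigs * n_samples * bytes_per_cell
peak_gb = peak_bytes / 1e9
print(f"{peak_gb:.0f} GB")  # 800 GB
```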
At issue is that the 500 BAM files are parsed individually, but each represents a column in the depth file with 100 million rows, so the depth table is calculated in a transposed manner with respect to its output.
Looking quickly at the code, we should be able to cut this in half (i.e., 8 bytes per contig per sample), since the variance calculation already incorporates the mean. But perhaps the peak memory can be avoided altogether by logging the depths to a binary (transposed) file per sample and then rewriting it at the end in the correct format.
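The spill-and-rewrite idea above can be sketched as follows. This is a minimal illustration, not the actual jgi_summarize_bam_contig_depths code: each sample's column of per-contig depths is written to its own raw binary file as it is computed, and the final table is produced by re-reading one 8-byte value per sample per row, so only one output row is ever resident in memory. The function names and file layout are hypothetical.

```python
import array, os, tempfile

def spill_column(dirname, sample_idx, depths):
    """Write one sample's per-contig depths (a transposed column) as raw doubles."""
    path = os.path.join(dirname, f"sample_{sample_idx}.bin")
    with open(path, "wb") as f:
        array.array("d", depths).tofile(f)
    return path

def merge_columns(paths, n_contigs):
    """Re-read the spilled columns row by row, yielding one output row per contig.
    Peak memory is one 8-byte value per sample, not n_contigs * n_samples cells."""
    files = [open(p, "rb") for p in paths]
    try:
        for i in range(n_contigs):
            row = []
            for f in files:
                f.seek(i * 8)  # 8 bytes per double
                (v,) = array.array("d", f.read(8))
                row.append(v)
            yield row
    finally:
        for f in files:
            f.close()

# Toy run: 2 samples x 3 contigs (made-up depths).
with tempfile.TemporaryDirectory() as d:
    paths = [spill_column(d, s, col)
             for s, col in enumerate([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])]
    rows = list(merge_columns(paths, 3))
    print(rows)  # [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
```

Reading row-wise across the column files does one seek per sample per row; batching reads in chunks of rows would amortize that cost in a real implementation.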
Comments (1)
reporter
Fixed in 47b0ff1bd399d62338b09cd3
Significantly improved both memory consumption and speed on small files, which should also translate to the huge ones.