Computational requirement indexing

Issue #52 resolved
alvanuffelen created an issue

Hi

I would like to index a multi-FASTA file (RefSeq Genomes) of around 140 GB.
I used the following command on a server with 1 TB of RAM:

 kma index -i sequences_20210211.fna -o kma_db

The output is:

# Total time used for DB indexing: 211063.00 s.
#
# Compressing templates
# Calculating relative indexes.
# Compressing indexes.
# Compression overflow.
# Bypassing overflow.
# Overflow bypassed.
# Finalizing indexes.
Killed
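
For reference, the peak memory of a run like this can be measured with GNU time, assuming /usr/bin/time on the server is the GNU version:

 /usr/bin/time -v kma index -i sequences_20210211.fna -o kma_db

The "Maximum resident set size" line in its report is the peak RAM used by the process.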

I have two questions:

  1. I assume the process was killed because of too much memory usage.
    Is there an estimate of the memory usage depending on the size of the provided FASTA?
    Does splitting the FASTA file into multiple smaller files and updating the database with -t_db use less memory than creating the database from one big FASTA file? (See the sketch below.)
  2. Is “Overflow bypassed” a problem?
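
For question 1, the split-and-update approach would look roughly like this; the part file names are made up, and I am assuming -t_db points at the existing database while -o names the updated one:

 kma index -i part_1.fna -o kma_db
 kma index -i part_2.fna -t_db kma_db -o kma_db_updated

and so on for the remaining parts.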

Comments (3)

  1. ptlcc

    Hi Alexander

    The strange thing is that the peak memory has already been reached at that point, without giving a “Cannot allocate memory” error.
    Usually the system kills a process if it uses too many resources, so it is not unlikely that it was killed because it was the most memory-heavy job running while the memory was nearly full.
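
    On Linux, an OOM kill usually leaves a trace in the kernel log, so you can check with something like (may require root):

     dmesg -T | grep -i 'killed process'

    If a line there mentions kma, the kernel killed it to free memory.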

    1. The memory usage depends on the redundancy in the FASTA file as well as its size. Splitting it up will not decrease the memory usage.
    2. The “Overflow bypassed” message is not a problem. It means that some variables started to overflow, which caused kma to allocate them in a larger type.

    Best,
    Philip
