Indexing computational requirements

Issue #36 resolved
Mihkel Vaher created an issue

Hi!

I’m marking this as a proposal because it would be nice to have the answer added to the README.md.

I’m currently planning to build an extended NCBI nt index on a computing cluster, which raised the question of rough estimates for the requirements: CPUs, memory and time.

Is the indexing running on a single thread? The manual isn’t listing “-t” under indexing.

Also, I’m assuming “Sparse” reduces sensitivity? How is the subset of k-mers chosen? What’s the difference between “-Sparse -” and “-Sparse TG”? “TG” should mean that only k-mers with the prefix “TG” are retained, but is “-” just a subset of all k-mers?

Thanks,
Mihkel

Comments (13)

  1. ptlcc

    Hi Mihkel

    Currently the indexing is not set up for threading, meaning that only a single thread is utilised. It can, however, take input from stdin, so you can unzip the input through another process if that helps.

    The time and memory are trickier to estimate. How large is the FASTA file you want indexed? Then I can check whether we have indexed something of similar size recently.

    Using a prefix of “TG” will reduce the sensitivity somewhat, but not drastically as you will still have half-overlapping k-mers with a k-mer size of 16 (both strands are indexed). If you use “-Sparse -” it will use all k-mers.

    Best,
    Philip
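To make the idea of prefix-based sparsification concrete, here is a toy sketch in Python. It is not KMA's implementation: it simply assumes, as described in this thread, that a k-mer is kept when it starts with the chosen prefix, and that both strands are scanned. The names `reverse_complement` and `prefixed_kmers` are hypothetical helpers. In uniformly random sequence a given 2-mer prefix starts about 1 in 16 positions per strand, so scanning both strands samples roughly 1 in 8 positions, which fits the "half-overlapping" remark for k = 16.

```python
from typing import Iterator

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def prefixed_kmers(seq: str, k: int = 16, prefix: str = "") -> Iterator[str]:
    """Yield k-mers from both strands that start with `prefix`.

    An empty prefix keeps every k-mer. This is an illustration of the
    concept only, not how KMA stores or selects k-mers internally.
    """
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - k + 1):
            kmer = strand[i:i + k]
            if kmer.startswith(prefix):
                yield kmer


seq = "ATGCCTGAATGACGTTGCAAGTCATGGA"
all_kmers = list(prefixed_kmers(seq, k=8))
tg_kmers = list(prefixed_kmers(seq, k=8, prefix="TG"))
print(len(all_kmers), len(tg_kmers))
```

The smaller `tg_kmers` list shows why a prefix shrinks the index, while sampling both strands keeps most of the sequence covered.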

  2. Mihkel Vaher reporter

    Thank you for the quick response!

    The multifasta is 511GB so I guess ncbi nt is the closest.

    Quote: “If you use ‘-Sparse -’ it will use all k-mers.”

    I’m a bit lost now. If “-” uses all k-mers, which k-mers are used when not using “-Sparse”? Shouldn’t an index with “Sparse” contain fewer k-mers than an index where the flag was not used?

    Is “TG” a special argument, not a 2-mer? That is, can I not use “AC” as an argument?

    Thanks,
    Mihkel

  3. ptlcc

    Hi Mihkel

    If -Sparse is left out it will only index the forward k-mers. We usually use “TG” for 2-mer prefixes as it is present in most start codons (all except “ATT”), but “AC” is valid as well. It shouldn’t really matter much, although we have seen decreased performance when using homopolymers as the prefix.

    Best,
    Philip
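The reasoning about prefix choice can be checked with a few lines of Python. This is only a sketch: the codon list is an illustrative assumption based on the codons implied in this thread, and `contains_prefix` and `is_homopolymer` are hypothetical helper names, not part of KMA.

```python
# Illustrative set of start codons (an assumption for this example).
START_CODONS = ["ATG", "GTG", "TTG", "ATT"]


def contains_prefix(codon: str, prefix: str) -> bool:
    """True if the prefix occurs anywhere inside the codon."""
    return prefix in codon


def is_homopolymer(prefix: str) -> bool:
    """True if the prefix consists of one repeated base, e.g. 'AA'."""
    return len(prefix) > 0 and len(set(prefix)) == 1


covered = [c for c in START_CODONS if contains_prefix(c, "TG")]
missed = [c for c in START_CODONS if not contains_prefix(c, "TG")]
print("contain TG:", covered)
print("do not contain TG:", missed)
print("AC homopolymer?", is_homopolymer("AC"), "| AA homopolymer?", is_homopolymer("AA"))
```

A check like `is_homopolymer` could be used to warn before indexing with a prefix such as “AA”, given the reduced performance reported above.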

  4. ptlcc

    Hi Mihkel

    It seems that the largest database we have indexed so far is about half the size of the most recent NCBI nt. A rough guess would be 1-1.5 TB of memory and a few days for indexing.

    Best,
    Philip

  5. Mihkel Vaher reporter

    Hi!

    I keep hitting “Error: 12 (Cannot allocate memory)”.

    The available memory doesn’t seem to be the problem because not even half of the available memory is used up.

    I suspect it has something to do with the concatenated sequences I’m using, though it shouldn’t be due to length (max length 2,120,519,931 bp).

    I’ve also made the headers quite short, so it’s hard to tell which allocation is failing.

    Best,
    Mihkel

  6. ptlcc

    Hi Mihkel

    Where in the process does it give the message, and how much memory do you have available?
    With a few long sequences it should be doable with the arguments “-Sparse TG -ME”. “-ME” will preallocate for the maximum number of distinct k-mers, which you will hit anyway with a FASTA file of that size.

    Best,
    Philip

  7. Mihkel Vaher reporter

    I’m currently giving the job 960GB of memory. Slurm’s “sacct” reports MaxRSS to have been 173840424K (~173 GB).

    Would many long sequences be a problem?
    “seqkit stat” of one of the input multifastas:
    format  type  num_seqs          sum_len  min_len      avg_len        max_len
    FASTA   DNA     68,363  270,223,440,245      220  3,952,773.3  2,120,519,931

    I’m currently using “-Sparse -” for increased sensitivity. I’ll add “-ME” on the next run if there are no more ideas.

    //Edit: the error occurred while adding the sequences. Fewer than half had been added.
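For comparing the reported MaxRSS with the memory request, a small conversion helper is handy. The sketch below assumes that the “K” suffix printed by “sacct” is binary (1024 bytes) and that the 960 GB request is also binary; both are assumptions, so check them against your cluster's Slurm configuration. `parse_size` is a hypothetical helper name.

```python
def parse_size(value: str) -> int:
    """Convert a size string such as '173840424K' to bytes.

    Assumes binary suffixes (K = 1024 bytes, M = 1024**2, ...).
    A bare number is interpreted as bytes.
    """
    units = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
    suffix = value[-1].upper()
    if suffix in units:
        return int(float(value[:-1]) * units[suffix])
    return int(value)


max_rss = parse_size("173840424K")
requested = parse_size("960G")  # assumption: the request is in binary units

print(f"MaxRSS: {max_rss / 1e9:.1f} GB ({max_rss / 2**30:.1f} GiB)")
print(f"Fraction of request used: {max_rss / requested:.0%}")
```

Under these assumptions the peak comes to roughly 178 GB (about 166 GiB), still far below the request, which supports the observation that total memory was not the limiting factor.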

  8. ptlcc

    It sounds a bit strange that it dies at that stage, when it’s that far from the maximum memory.

    Many long sequences should not be a problem, as long as no single sequence exceeds 2 Gbp.

    “-Sparse -” will take a long time and a lot of memory to index. I would still consider “-Sparse TG” or similar.

    Best,
    Philip

  9. Mihkel Vaher reporter

    “-Sparse TG” with “-ME” gives the same result.

    The issue still seems to be the sequence length.
    Extracting the sequence just after the last “Added” message and trying to index only that gives the same memory error. The extracted version is on 2 lines (header + sequence).

    “wc”: 2 12 1914719048
    “seqkit stat” max_len: 1,914,718,945

    Using KMA-1.3.22

    I’m currently trying to find the limit. Might be worth adding a check while reading the sequences?

  10. Mihkel Vaher reporter

    The limit appears to be 1 073 741 808 bp which is awfully close to 2**30 = 1 073 741 824.

    Every nucleotide is stored with 2 bits + some additional data so the longest sequence possible is actually 1 073 741 808?
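Until such a check exists in the indexer, the input can be screened beforehand. The following is a minimal sketch in Python, assuming the empirically observed limit of 2**30 - 16 bp for KMA 1.3.22 from this thread (whether that value is exact or version-dependent is not confirmed here). `fasta_lengths` and `too_long` are hypothetical helper names.

```python
import io
from typing import Iterator, List, TextIO, Tuple

# Limit observed in this thread for KMA 1.3.22; treat it as an assumption.
OBSERVED_LIMIT = 2**30 - 16  # 1,073,741,808 bp


def fasta_lengths(handle: TextIO) -> Iterator[Tuple[str, int]]:
    """Yield (header, sequence length) for each record in a FASTA stream."""
    name, length = None, 0
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                yield name, length
            name, length = line[1:].strip(), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def too_long(handle: TextIO, limit: int = OBSERVED_LIMIT) -> List[Tuple[str, int]]:
    """Return the records whose sequence is longer than `limit`."""
    return [(name, n) for name, n in fasta_lengths(handle) if n > limit]


# Tiny demonstration with an artificially small limit.
demo = io.StringIO(">short\nACGT\nAC\n>long\nACGTACGTAC\n")
print(too_long(demo, limit=8))
```

For real input, pass an open file handle instead of the `StringIO` object. Note that a sequence stored on a single line is read into memory in one piece, so a 2 Gbp record needs about 2 GB of RAM for this check.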
