Faster way to index a database?

Issue #70 new
Gustavo Tamasco created an issue

I am trying to index an 11 GB FASTA file.

The command has been running for 9 days and the indexing has still not finished.

Is there a way to speed up this process? I am using the default invocation: kma index -i db_fasta -o database

Is there a safe way to apply multiprocessing to it?

Best,
Tamasco

Comments (4)

  1. ptlcc

    Dear Tamasco

    It is not possible to use multiprocessing with kma index. But you could subsample the k-mers using the -Sparse option, or index the minimizers (-m).

    Best,
    Philip
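
To make the two suggestions above concrete, here is a minimal Python sketch of what "subsampling k-mers by prefix" and "indexing minimizers" mean in general. It is an illustration only, not KMA's code: the function names, the choice to keep k-mers that start with the prefix, and the lexicographic ordering used for minimizers are all assumptions made for this example, and KMA's actual -Sparse and -m behaviour may differ in detail.

```python
import random


def prefix_sampled_kmers(seq, k, prefix):
    """Keep only the k-mers that start with `prefix` (hypothetical sparse scheme)."""
    return [
        (i, seq[i:i + k])
        for i in range(len(seq) - k + 1)
        if seq.startswith(prefix, i)
    ]


def minimizers(seq, k, w):
    """Keep the lexicographically smallest k-mer in every window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        # Ties are broken by position so the choice is deterministic.
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return picked


def sampled_fraction(n, k, prefix, seed=0):
    """Fraction of k-mers kept by prefix sampling on a random sequence of length n."""
    rng = random.Random(seed)
    seq = "".join(rng.choices("ACGT", k=n))
    total = len(seq) - k + 1
    return len(prefix_sampled_kmers(seq, k, prefix)) / total


if __name__ == "__main__":
    print(prefix_sampled_kmers("TGACGTTGAC", 4, "TG"))
    print(sorted(minimizers("TGACGTTGAC", 4, 3)))
    # On random sequence, a prefix of length p keeps roughly 4**-p of the k-mers.
    print(sampled_fraction(100_000, 16, "TG"))
```

Under these assumptions, a prefix of length two keeps about one k-mer in sixteen on random sequence, which is why both approaches shrink the number of entries the index has to store.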

  2. Gustavo Tamasco reporter

    Hey Philip, thanks for the advice.

    One question: with the -Sparse flag, fewer k-mers will be used. Can this reduce the resolution of my mappings down the road?

    Just out of curiosity: why do no indexing tools use multiprocessing? Is there a reason for that?

    Best,
    Tamasco

  3. ptlcc

    Hi Tamasco

    The resolution can be lowered, but we have not seen any notable effect for prefixes of length two or less, since you will then have half-overlapping k-mers on average.

    Some mapping and alignment methods do offer multithreading for indexing, but this usually comes at a relatively high memory cost. This is because it is hard to parallelize updates to the same data structure, as you need to ensure that two processes are not writing to or editing the same piece of memory at once.
    Mapping and alignment are easier to parallelize, as the data structure is constant and you can analyse the input reads more or less independently. That is, you just need to make sure that only one thread is reading and writing at a time, together with some collection steps such as the ConClave algorithm.

    Best,
    Philip
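
The contrast described above can be sketched in a few lines of Python. This is a toy illustration, not KMA's implementation and not the ConClave algorithm: `build_index`, `map_reads`, the dict-of-sets index, and the simple k-mer voting are all hypothetical names and choices made for this example. The point is only that building the index needs a lock around every update to the shared structure, while mapping reads against a finished index needs no lock at all.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def build_index(sequences, k, workers=4):
    """Index k-mers from several sequences in parallel (shared, mutable dict)."""
    index = {}
    lock = threading.Lock()

    def add(item):
        name, seq = item
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Every writer must take the lock, so updates are serialized.
            with lock:
                index.setdefault(kmer, set()).add(name)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(add, sequences.items()))
    return index


def map_reads(index, reads, k, workers=4):
    """Map reads in parallel; the index is only read, so no lock is needed."""

    def best_hit(read):
        votes = {}
        for i in range(len(read) - k + 1):
            for name in index.get(read[i:i + k], ()):
                votes[name] = votes.get(name, 0) + 1
        # Ties go to the alphabetically first template name.
        return max(sorted(votes), key=votes.get) if votes else None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(best_hit, reads))


if __name__ == "__main__":
    templates = {"a": "ACGTACGT", "b": "TTTTGGGG"}
    idx = build_index(templates, 4)
    print(map_reads(idx, ["ACGTA", "TTGGG", "CCCCC"], 4))
```

In this sketch the lock makes the indexing threads take turns, so they gain little from running in parallel. The usual way around that is to give each thread its own private structure and merge them at the end, which is where the extra memory cost comes from.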

  4. Gustavo Tamasco reporter

    Good to know! I will run some tests using the -Sparse flag.

    Oh, I see. Thanks for the explanation and for the advice!

    Best,
    Tamasco
