Modifying previously built index

Issue #37 resolved
Mihkel Vaher created an issue

Hi!

Is there a way to modify (add/remove) an existing index?

First, in the usual case, the database simply needs to be updated.

Second, I’m working on a way to identify some redundant (sub)sequences from the database (remove a human gene if the whole human chromosome exists). This is for more accurate coverage/depth/taxonomical abundance calculation with CCMetagen where the template length matters (chromosome+gene(s) is longer than just chromosome => lower abundance?). I’m wondering if there’s a way to remove (mask, disallow mapping to) some of the selected sequences in the database.

Also, the specification says

kma, which can be subdivided into four main programs: kma, index, shm, seq2fasta and update, each with their own section

There isn’t an “update” section?

Thanks,
Mihkel

Comments (4)

  1. ptlcc

    Hi Mihkel

    Currently you can add to an existing index by specifying the index to update with “-t_db” in kma index. You can also rename an entry by editing its name in the *.name index file.
    Unfortunately, there is no option yet to remove entries, nor one to disallow mappings to certain templates. It would make a valuable addition to the code, though, so I will keep it in mind and check whether it can be implemented in a smart way, both for updating indexes and while mapping.

    You can, however, use the Hobohm-1 algorithm while indexing, which should remove the single genes if the FASTA files have been sorted by entry length prior to indexing. You can set this with “-ht”, “-hq” and “-and” in the index part; “-Sparse” should be used here as well.

    I have updated the KMAspecification.pdf to include the missing parts of the program.

    Best,
    Philip
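
A minimal sketch of the two suggestions above, not part of KMA itself. It sorts FASTA entries by length and assembles `kma index` command lines without running them. Sorting longest first is an assumption (the comment only says "sorted on entry length"). The `-i` and `-o` options and the argument passed to `-Sparse` are not mentioned in this thread, so check them against `kma index -h`. All file names and threshold values are left to the caller.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sort_by_length(records: Iterable[Record]) -> List[Record]:
    """Longest entries first (assumed order); ties keep their input order."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)


def kma_index_args(fasta: str, out_db: str,
                   existing_db: Optional[str] = None,
                   hobohm: Optional[Tuple[float, float, str]] = None) -> List[str]:
    """Build (but do not run) a `kma index` command line.

    existing_db: index to extend via -t_db.
    hobohm: (ht, hq, sparse_prefix); the values are the caller's choice,
            this sketch does not know suitable thresholds.
    """
    args = ["kma", "index", "-i", fasta, "-o", out_db]  # -i/-o: verify with -h
    if existing_db is not None:
        args += ["-t_db", existing_db]
    if hobohm is not None:
        ht, hq, sparse_prefix = hobohm
        args += ["-ht", str(ht), "-hq", str(hq), "-and", "-Sparse", sparse_prefix]
    return args
```

Note that `sort_by_length` holds every record in memory, so it only suits inputs that fit in RAM; a database the size of NCBI nt would need an external sort.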

  2. Mihkel Vaher reporter

    Hi Philip!

    Does the usage of “-t_db” essentially mean that the indexing could be parallelized by splitting the multifasta into smaller chunks, indexing them separately and then merging them together?

    /// Just realized that it’s about FASTA to index, not index to index, so never mind.

    Would using the “-t_db” flag and still splitting the input into multiple parts reduce memory consumption if I’d wait for the indexing to finish each time before adding the next part?

    Regarding the redundant sequences: I tried sorting the whole NCBI nt, but it was too slow and resource-heavy. Instead, using some divide and conquer and HLL, I ended up removing about 77% of the sequences (48% by size). The runtime was also quite fast when parallelized on a cluster, the slowest part being the taxid addition (modified script from CCMetagen). Since the results look a bit too good, I’m in the middle of running a quick validation to check that only redundant sequences were removed.
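
To illustrate what "redundant" means here (a gene whose content is already covered by a longer entry), the following is a toy sketch using exact k-mer sets. It is not the divide-and-conquer/HLL pipeline described in the comment, whose details are not given in this thread. The function names, the value of `k`, and the containment threshold are all hypothetical, and reverse complements are ignored.

```python
from typing import Iterable, List, Set, Tuple

Record = Tuple[str, str]  # (header, sequence)


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct k-mers of seq (empty set if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def drop_contained(records: Iterable[Record], k: int = 4,
                   min_containment: float = 1.0) -> List[Record]:
    """Greedily keep long entries; drop shorter ones they already cover.

    An entry is dropped when at least `min_containment` of its distinct
    k-mers occur in a single, previously kept (longer or equal) entry.
    """
    kept: List[Record] = []
    kept_kmers: List[Set[str]] = []
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        ks = kmers(seq.upper(), k)
        covered = ks and any(
            len(ks & other) / len(ks) >= min_containment for other in kept_kmers
        )
        if covered:
            continue  # redundant: its k-mers are found in a kept entry
        kept.append((header, seq))
        kept_kmers.append(ks)
    return kept
```

Exact k-mer sets need memory proportional to the total sequence content, so this only works on small inputs. Sharing all k-mers also does not prove that one sequence is a true substring of another, which is one reason a validation step like the one mentioned above is worthwhile.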

  3. Mihkel Vaher reporter

    To answer my own question:

    Would using the “-t_db” flag and still splitting the input into multiple parts reduce memory consumption if I’d wait for the indexing to finish each time before adding the next part?

    No. There was a slight decrease in “Maximum resident set size (kbytes)” as reported by /usr/bin/time -v, but the difference was less than 1%.
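
For repeating this kind of comparison from a script, here is a minimal sketch that reads the same peak-RSS counter through `os.wait4` instead of `/usr/bin/time -v`. The helper names are hypothetical. It runs on Unix only; `ru_maxrss` is reported in kilobytes on Linux and in bytes on macOS. The demo command is just the Python interpreter, standing in for whatever indexing command is being measured.

```python
import os
import subprocess
import sys


def peak_rss(cmd):
    """Run cmd to completion and return that child's peak RSS (ru_maxrss)."""
    proc = subprocess.Popen(cmd)
    # wait4 returns the resource usage of this specific child process.
    _, status, usage = os.wait4(proc.pid, 0)
    if os.WIFEXITED(status):
        proc.returncode = os.WEXITSTATUS(status)
    else:
        proc.returncode = -os.WTERMSIG(status)
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
    return usage.ru_maxrss


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) * 100 / before


if __name__ == "__main__":
    # Stand-in command; replace with the indexing runs to compare.
    rss = peak_rss([sys.executable, "-c", "pass"])
    print("peak RSS:", rss)
```

Comparing `peak_rss` for a single indexing run against the largest value over the split runs, via `percent_change`, gives the kind of figure quoted above.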
