Modifying previously built index

ptlcc

Hi Mihkel

Currently you can add to an existing index by specifying the index to update with “-t_db” in kma index. You can also rename an entry by editing its name in the *.name index file.
Unfortunately there is no option to remove entries yet, nor to disallow mappings to certain templates. It would make a valuable addition to the code, though, both for updating indexes and while mapping, so I will keep it in mind and check whether it can be implemented in a smart way.
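
For the renaming part, a minimal sketch of editing the *.name file from a script. It assumes the file is plain text with one entry name per line; `rename_entry` and the file name in the usage note are hypothetical and not part of KMA, so try it on a copy of the index first.

```python
from pathlib import Path


def rename_entry(name_file: str, old_name: str, new_name: str) -> int:
    """Replace every line equal to old_name with new_name.

    Assumes one entry name per line (an assumption about the *.name layout).
    Returns the number of lines that were changed.
    """
    path = Path(name_file)
    lines = path.read_text().splitlines()
    changed = 0
    for i, line in enumerate(lines):
        if line == old_name:
            lines[i] = new_name
            changed += 1
    # Write the names back in the original order, one per line.
    path.write_text("\n".join(lines) + "\n")
    return changed
```

Something like `rename_entry("mydb.name", "old entry", "new entry")` would then rewrite the matching line and leave the order of all other entries untouched.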

You can, however, utilise the hobohm-1 algorithm while indexing, which should remove the single genes if the fasta files have been sorted on entry length prior to indexing. You can set this with “-ht”, “-hq” and “-and” in the index part; “-Sparse” should be used as well here.
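
The sorting step could be sketched as below. This is only an illustration: it assumes longest-first order (so that whole genomes come before the single genes they contain), and it holds every record in memory, so it suits small inputs rather than something the size of NCBI nt. `read_fasta` and `sort_by_length` are hypothetical helper names.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header line including '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sort_by_length(records: Iterable[Record], longest_first: bool = True) -> List[Record]:
    """Sort records by sequence length; ties keep their input order."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=longest_first)
```

The sorted records can then be written back out as FASTA and passed to the indexing step.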

I have updated the KMAspecification.pdf to include the missing parts of the program.

Best,
Philip

2021-06-02T08:09:42+00:00

Mihkel Vaher reporter

Hi Philip!

Does the usage of “-t_db” essentially mean that the indexing could be parallelized by splitting the multifasta into smaller chunks, indexing them separately and then merging them together?

/// Just realized that it’s about fasta to index, not index to index, so never mind.

Would using the “-t_db” flag and still splitting the input into multiple parts reduce memory consumption if I’d wait for the indexing to finish each time before adding the next part?
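
To make the idea concrete, a rough sketch of the splitting step, assuming the records are already available as (header, sequence) pairs. `chunk_records` and `format_fasta` are hypothetical helpers; each chunk would be written to its own fasta file and added to the index in turn.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header line including '>', sequence)


def chunk_records(records: Iterable[Record], chunk_size: int) -> Iterator[List[Record]]:
    """Yield consecutive lists of at most chunk_size records."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def format_fasta(chunk: Iterable[Record]) -> str:
    """Render a chunk as FASTA text, one sequence line per record."""
    return "".join(f"{header}\n{seq}\n" for header, seq in chunk)
```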

Regarding the redundant sequences: I tried sorting the whole NCBI nt, but it was too slow and resource-heavy. Instead, using some divide and conquer and HLL, I ended up removing about 77% of the sequences (48% by size). The runtime was also quite fast when parallelized on a cluster, the slowest part being the taxid addition (modified script from CCMetagen). Since the results are a bit too good, I’m in the middle of running a quick validation to see if only redundant sequences were removed.

2021-06-02T11:38:38+00:00

Mihkel Vaher reporter

To answer my own question:

Would using the “-t_db” flag and still splitting the input into multiple parts reduce memory consumption if I’d wait for the indexing to finish each time before adding the next part?

No. There was a slight decrease in “Maximum resident set size (kbytes)” as reported by /usr/bin/time -v, but the difference was less than 1%.
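
For anyone who wants to repeat this kind of comparison from a script, a minimal sketch using Python's standard `resource` module (Unix only). `peak_rss_of` is a hypothetical helper. The value is in kilobytes on Linux and in bytes on macOS, and it is the maximum over all child processes waited for so far, so each run should be measured in a fresh interpreter.

```python
import resource
import subprocess
import sys
from typing import Sequence


def peak_rss_of(command: Sequence[str]) -> int:
    """Run command to completion and return the children's peak resident set size.

    RUSAGE_CHILDREN covers every child this process has waited for, so the
    result is only specific to `command` if nothing larger ran before it.
    """
    subprocess.run(command, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


if __name__ == "__main__":
    # Placeholder command; substitute the indexing command being measured.
    print(peak_rss_of([sys.executable, "-c", "pass"]))
```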


2021-06-10T07:40:37+00:00

Mihkel Vaher reporter

changed status to resolved

2021-06-10T07:40:52+00:00
