Different results when using DBs with identical sequences

Hi all

I've been using KMA for a few weeks now and I am extremely happy with its speed and performance. However, there is a small inconsistency that is bothering me.

Problem:
I am mapping the FASTQ reads of a metagenome (name: N11) against a virulence database called VFDB (http://www.mgc.ac.cn/VFs/main.htm). The sequences whose presence I would like to check are the “stx” genes from Escherichia pathogens. Stx sequences exist both in VFDB and in a custom database that I built by downloading Escherichia genomes from NCBI that carry stx genes.

When I map the N11 sample against these two databases, I get inconsistent results, even though at least three of the stx genes are 100% identical between them: only the VFDB run positively identifies stx sequences in the N11 sample.
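To confirm that the stx entries really are identical at the character level in both FASTA files (and not only identical by alignment), a quick check along these lines could be used. This is a minimal sketch and not part of KMA; the function names `parse_fasta`, `seq_digest` and `shared_sequences` are made up for illustration, and the comparison assumes that case and line wrapping should be ignored.

```python
import hashlib


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def seq_digest(seq):
    """Hash a sequence after uppercasing, so case differences are ignored."""
    return hashlib.md5(seq.upper().encode("utf-8")).hexdigest()


def shared_sequences(text_a, text_b):
    """Return (header_a, header_b) pairs whose sequences are identical."""
    by_digest = {}
    for header, seq in parse_fasta(text_b):
        by_digest.setdefault(seq_digest(seq), []).append(header)
    pairs = []
    for header, seq in parse_fasta(text_a):
        for other in by_digest.get(seq_digest(seq), []):
            pairs.append((header, other))
    return pairs
```

Reading both FASTA files with `open(path).read()` and passing the two strings to `shared_sequences` lists the entries that match exactly. Sequences that only match as reverse complements, or that differ by a single trailing base, would not be reported by this sketch.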

When I then extracted the stx sequences from both databases and built a new database containing only those sequences, KMA gave 0 hits.

Therefore my question is: why can KMA positively identify the stx sequences from VFDB, but not from my db or from the merged custom db that contains only the few stx genes? Could this be a text format problem? Are there characters that are not allowed?
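One way to narrow down the format question before indexing is to scan each FASTA file for things that commonly differ between sources: Windows line endings, non-ASCII bytes, bases outside a plain alphabet, and headers with no sequence. The sketch below is only an illustration and not part of KMA. The function name `scan_fasta` and the "expected" alphabet `ACGTN` are assumptions; I do not know which characters KMA itself accepts or rejects, so a finding here is a lead to follow up, not proof of the cause.

```python
# Assumed "plain" alphabet; IUPAC ambiguity codes will be flagged on purpose.
ALLOWED = set("ACGTN")


def scan_fasta(raw):
    """Scan raw FASTA bytes and return a dict of possible format issues."""
    report = {
        "crlf": b"\r" in raw,    # carriage returns (Windows/old Mac endings)
        "non_ascii": False,      # bytes outside 7-bit ASCII
        "odd_bases": {},         # header -> set of unexpected characters
        "empty_records": [],     # headers with no sequence lines
    }
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError:
        report["non_ascii"] = True
        text = raw.decode("utf-8", errors="replace")

    header, length = None, 0
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None and length == 0:
                report["empty_records"].append(header)
            header, length = line[1:], 0
        elif line:
            length += len(line)
            odd = set(line.upper()) - ALLOWED
            if odd:
                key = header if header is not None else "<no header>"
                report["odd_bases"].setdefault(key, set()).update(odd)
    if header is not None and length == 0:
        report["empty_records"].append(header)
    return report
```

Running it as `scan_fasta(open(path, "rb").read())` on the VFDB file, the custom file and the merged stx-only file, then comparing the three reports, would show whether the files differ in any of these respects.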

For the record, I do get some hits with my custom db and no errors, so it does not seem to be a general problem with the db. Also, all databases were created using the same command: $kma_index -in *.fas -db .db (where “.fas“ was each time the corresponding FASTA file).

I would be happy to send a link to the files if somebody wants to check for themselves (I just don't want to share them publicly because these are unpublished data).

Thanks,

P

Comments (6)