different results when using DBs with identical sequences

Issue #29 resolved
Panos Sapou created an issue

Hi all

I have been using KMA for a few weeks now and I am extremely happy with its speed and performance. However, there is a small inconsistency that is bothering me.

Problem:
I am mapping the fastq reads of a metagenome (name: N11) against a virulence database called VFDB (http://www.mgc.ac.cn/VFs/main.htm). The sequences whose presence I would like to check are the “stx” genes from Escherichia pathogens. Stx sequences exist both in VFDB and in a custom database that I built by downloading Escherichia genomes available on NCBI that carry stx genes.

When mapping the N11 sample to these 2 databases, I am getting inconsistent results even though at least 3 of the stx genes are identical (100%) in both: only VFDB positively identifies stx sequences in the N11 sample.

When I then pulled out the stx sequences from both databases and created a new database containing only them, KMA gave 0 hits.

Therefore my question is: why can KMA positively identify the stx sequences from VFDB but not from my db or from the merged custom db that only has the few stx genes? Could this be a text format problem? Are there characters that are not allowed?
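One way to rule out a formatting problem before sharing the files is to compare the FASTA entries directly. Below is a minimal sketch in Python, assuming plain nucleotide FASTA input. The function names (read_fasta, unexpected_chars, shared_sequences) and the example records are hypothetical and not part of KMA, and treating the IUPAC nucleotide codes as the "expected" alphabet is an assumption of this sketch, not a statement about what KMA accepts.

```python
import hashlib

# Assumption: nucleotide FASTA; IUPAC codes count as "expected" characters.
IUPAC = set("ACGTURYSWKMBDHVN")


def read_fasta(text):
    """Parse FASTA text into {header: sequence}.

    Sequences are uppercased and stripped of surrounding whitespace, so
    Windows line endings and lowercase bases do not affect the comparison.
    """
    records = {}
    header = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {h: "".join(parts) for h, parts in records.items()}


def unexpected_chars(records):
    """Return {header: sorted unexpected characters} for affected records."""
    report = {}
    for header, seq in records.items():
        bad = sorted(set(seq) - IUPAC)
        if bad:
            report[header] = bad
    return report


def shared_sequences(records_a, records_b):
    """List (header_a, header_b) pairs whose normalized sequences are identical."""
    def digest(seq):
        return hashlib.sha256(seq.encode()).hexdigest()

    index = {}
    for header, seq in records_a.items():
        index.setdefault(digest(seq), []).append(header)
    return [
        (a, b)
        for b, seq in records_b.items()
        for a in index.get(digest(seq), [])
    ]


# Hypothetical toy records standing in for the VFDB and custom FASTA files.
vfdb = read_fasta(">stx1A_vfdb\nATGAAA\nTTGC\n")
custom = read_fasta(">stx1A_custom\r\natgaaattgc\r\n>odd\r\nATG-AA*\r\n")

print(unexpected_chars(custom))        # {'odd': ['*', '-']}
print(shared_sequences(vfdb, custom))  # [('stx1A_vfdb', 'stx1A_custom')]
```

On real data, the file contents would be read with open(path).read() and passed to read_fasta. If the supposedly identical stx entries do not show up as pairs, the two files differ at the sequence level; if they do, the difference in results more likely lies elsewhere.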

For the record, I do get some hits with my custom db and no errors, so it does not seem to be a general problem with the db. Also, all databases were created using the same command: $kma_index -in *.fas -db .db (where “.fas” was each time the corresponding fasta file).

I would be happy to send a link to the files if somebody wants to check for themselves (I just don't want to do it publicly because these are unpublished data).

Thanks,

P

Comments (6)

  1. ptlcc

    Hi Panos

    I am glad to hear that you are generally happy with KMA.

    Before you transfer the data, I have three questions:

    1. Have the six genes been isolated from E. coli, or did you build the custom database with the entire genomes?
    2. What parameters were used with KMA when mapping and aligning?
    3. Which version was used?

    Best,
    Philip

  2. Panos Sapou reporter

    Hey Philip

    Thanks for the quick reply

    1. All databases are genes - I never use entire genomes
    2. Here is the exact command that I use (against any db): $kma -ipe *fq.gz -o /output/sth_sq -t_db ~/VFDB -mem_mode -ef -1t1 -cge -nf -t 8
    3. KMA-1.3.11

    Again, thanks for looking into this

    P
