Separator for concatenating genome fragments into a single sequence

Issue #39 resolved
Mihkel Vaher created an issue

Hi!

Background:

My goal is to find the taxonomical abundances of species in a given sample (not read abundances that are affected by the genome size). CCMetagen’s/kma’s default “depth” (aligned nucleotide count divided by template length) is perfect for this if there’s a single template for a single species. If, on the other hand, the genome is fragmented (into chromosomes or contigs), “depth” is reported for every template. Later, when results are aggregated, the depths are summed. This is fine, if there are, for example, 2 different E. coli strains. If, however, the two belong to the same organism (2 human chromosomes) then summing the depths gives an incorrect result. This means that the more a genome is fragmented, the more overestimated the taxonomical abundance is.

The only solution I can currently see is to concatenate all of the chromosomes/contigs under a single fasta header. Another option would be to aggregate and recalculate kma results before feeding them into CCMetagen but this would require a separate database to know which sequences are from the same and which from separate genomes.

Questions:

  1. Which separator would be best when concatenating two or more sequences? “N”? “N”*k?
  2. How would it affect the performance of indexing and aligning? Faster because there are fewer sequences? Slower because the sequences will get very long?

Thanks,
Mihkel

Comments (3)

  1. ptlcc

    Hi Mihkel

    It sounds like a good idea.

    1. Separating the sequences with a single ‘N' will be enough to discriminate two adjacent templates, as k-mers with ambiguous bases are not included in the index.
      There might be some discrepancies when aligning over the join, which can be mitigated by adding a patch of N’s approximately equal to the reads lengths used. We have had good results with both in the past.
    2. Performance would increase during mapping to the templates and running the ConClave part of kma. While the difference in alignment speed should be minimal.
      The indexing itself will require fewer resources as well.

    This is under the assumption that each individual genome is concatenated with itself. Ie. one draft assembly of an E. coli will be concatenated to one sequence, and not every E. coli in the database. Likewise for the chromosomes in eukaryotic organisms.
    If not alignment performance will drop significantly, and crash if a sequence reaches a length of 2,147,483,647 bp.

    Best,
    Philip

  2. Mihkel Vaher reporter

    Hi Philip

    The 2G limit certainly adds to the overall complexity (eukaryotic genomes). Removing long batches of N’s and dustmasking might help get the size down. If not, then there might be a need to trim beforehand or to merge after aligning…

    Thanks

  3. Log in to comment