Separator for concatenating genome fragments into a single sequence

Hi!

Background:

My goal is to find the taxonomic abundances of species in a given sample (not read abundances, which are affected by genome size). CCMetagen’s/kma’s default “depth” (aligned nucleotide count divided by template length) is perfect for this if there is a single template per species. If, on the other hand, the genome is fragmented (into chromosomes or contigs), “depth” is reported for every template, and when the results are later aggregated, the depths are summed. This is fine if the templates are, for example, 2 different E. coli strains. If, however, they belong to the same organism (e.g. 2 human chromosomes), then summing the depths gives an incorrect result: the more fragmented a genome is, the more its taxonomic abundance is overestimated.
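A small worked example of the overestimation, using made-up numbers (the fragment lengths and aligned nucleotide counts below are hypothetical, and `depth` only mirrors the definition given above, not any actual kma or CCMetagen code):

```python
def depth(aligned_nucleotides, template_length):
    """Depth as defined above: aligned nucleotide count / template length."""
    return aligned_nucleotides / template_length

# Hypothetical genome split into two templates: (aligned nucleotides, length).
fragments = [(200, 100), (600, 300)]

# What aggregation does today: sum the per-template depths.
summed_depth = sum(depth(aligned, length) for aligned, length in fragments)

# What a single concatenated template would give: total aligned / total length.
total_aligned = sum(aligned for aligned, _ in fragments)
total_length = sum(length for _, length in fragments)
genome_depth = depth(total_aligned, total_length)

print(summed_depth)  # 4.0
print(genome_depth)  # 2.0
```

Both fragments are covered at depth 2, yet the summed value is 4; with n evenly covered fragments the sum would be roughly n times the true depth.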

The simplest solution I can currently see is to concatenate all of the chromosomes/contigs under a single FASTA header. Another option would be to aggregate and recalculate the kma results before feeding them into CCMetagen, but this would require a separate database to know which sequences come from the same genome and which from separate genomes.
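A minimal sketch of the concatenation idea in plain Python. The function names, the `spacer_len` default of 10, and the header `genome_1` are all hypothetical; in particular, the right spacer character and length are exactly what the questions below ask about, so the values here are placeholders rather than a recommendation.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def concatenate(records, new_header, spacer_len=10, width=60):
    """Join all sequences under one header, separated by runs of 'N'.

    spacer_len is an arbitrary placeholder, not a tested value.
    """
    spacer = "N" * spacer_len
    sequence = spacer.join(seq for _, seq in records)
    # Wrap the joined sequence to fixed-width lines.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + new_header + "\n" + "\n".join(lines) + "\n"


fasta = ">chr1\nACGT\nAC\n>chr2\nGGTT\n"
merged = concatenate(read_fasta(fasta), "genome_1", spacer_len=3)
print(merged)
```

Note that the spacer adds to the template length, so a long spacer repeated over many contigs would slightly lower the reported depth.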

Questions:

Which separator would be best when concatenating two or more sequences? “N”? “N”*k?
How would concatenation affect the performance of indexing and aligning? Faster because there are fewer sequences? Slower because the sequences become very long?

Thanks,
Mihkel
