optimal use with MAGs/other bacterial genomes?

Issue #68 new
Panos Sapou created an issue

Dear all

I have been using your excellent software to create abundance matrices using redundant gene databases like VFdb, NCBI AMR, etc. For these databases I have been using your recommendation, which is “-cge, -1t1”.

However, because of its high speed and efficiency I would also like to use KMA to create abundance matrices using a dereplicated set of MAGs (and at some point maybe also genomes of isolates).

For a previous study I used the Illumina shotgun reads to assemble and create MAGs for all samples, and then I used the dereplicated set of MAGs as a reference set of “genomes”. Then I used the same parameters, “-cge, -1t1”, to re-map the reads to the reference MAGs, and the “fragmentCountAln” column to count the reads that were properly aligned. Does that sound right?

Considering that the dereplicated MAGs are relatively short contigs (Illumina) and the database is still redundant, I continued using “-cge, -1t1”, but should I change that? Should I perhaps use “-Mt1” and drop the ConClave?

Finally, for MAG abundance matrices, I usually normalize across samples by using KMA to count how many reads map to the Silva 16S database and taking the sum for each sample. However, I recently heard that there may be better ways to normalize than using the Silva 16S. Any suggestions?
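The 16S-based normalization described above can be sketched roughly as follows. This is only an illustration: the function name, the sample and MAG names, the counts, and the per-million scaling factor are all made up, and the per-sample 16S totals are assumed to have been summed beforehand from a separate KMA run against Silva.

```python
def normalize_by_16s(mag_counts, total_16s_reads, scale=1_000_000):
    """Divide each sample's MAG read counts by that sample's 16S read total.

    mag_counts:      {sample: {mag: read_count}}
    total_16s_reads: {sample: summed reads mapped to 16S references}
    scale:           arbitrary multiplier (assumption: counts per million 16S reads)
    """
    normalized = {}
    for sample, counts in mag_counts.items():
        denom = total_16s_reads[sample]
        if denom <= 0:
            raise ValueError(f"no 16S reads for sample {sample!r}")
        normalized[sample] = {mag: scale * n / denom for mag, n in counts.items()}
    return normalized


# Made-up counts for two hypothetical samples.
mag_counts = {
    "sampleA": {"MAG_1": 500, "MAG_2": 1500},
    "sampleB": {"MAG_1": 200, "MAG_2": 100},
}
total_16s_reads = {"sampleA": 10_000, "sampleB": 2_000}

result = normalize_by_16s(mag_counts, total_16s_reads)
print(result["sampleA"]["MAG_1"])  # 50000.0
print(result["sampleB"]["MAG_1"])  # 100000.0
```

Although sampleA has more raw reads on MAG_1, sampleB ends up with the higher normalized value because it contains far fewer 16S reads.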

Thanks in advance

P

Comments (7)

  1. ptlcc

    Dear Panos

    I would add the “-mem_mode” option when mapping the reads back to the assembly, as redundant hits here will be very limited.

    Depending on the quality of the assemblies you might want to drop the “-cge” option, as the “-cge” option sets stricter alignment and mapping parameters.
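A minimal sketch of how these suggestions could be assembled into a KMA call from Python. The helper name and file paths are hypothetical, and the exact option set is an assumption to be checked against `kma -h` for the installed version; in particular, `-ef` is assumed here to be the option that writes the extended `.mapstat` output containing the read and fragment counts.

```python
def build_kma_command(reads_1, reads_2, database, output_prefix, strict_cge=False):
    """Return an argument list for mapping paired-end reads back to an assembly.

    All paths are placeholders; nothing is executed here.
    """
    cmd = [
        "kma",
        "-ipe", reads_1, reads_2,   # paired-end input files
        "-t_db", database,          # database prefix created with `kma index`
        "-o", output_prefix,        # prefix for the output files
        "-mem_mode",                # suggested above for mapping back to the assembly
        "-1t1",                     # one read to one template, as in the question
        "-ef",                      # assumption: extended features (.mapstat)
    ]
    if strict_cge:
        # Stricter alignment and mapping parameters; optional for draft assemblies.
        cmd.append("-cge")
    return cmd


cmd = build_kma_command("sample_R1.fq.gz", "sample_R2.fq.gz", "mags_db", "sample_out")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.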

    Personally, I like to use read counts over fragment counts, as fragment counts can sum to over 100% when a read-pair is split over two templates.
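The effect can be illustrated with a toy model. This is not KMA's internal counting logic, just an assumed simplification in which a fragment is credited once to every distinct template that its two mates hit; the contig names are invented.

```python
from collections import Counter


def count_reads_and_fragments(pairs):
    """Tally per-template read and fragment counts for a list of read pairs.

    Each pair is (template_of_mate_1, template_of_mate_2).
    """
    read_counts = Counter()
    fragment_counts = Counter()
    for mate1_template, mate2_template in pairs:
        # Every mate is counted exactly once.
        read_counts[mate1_template] += 1
        read_counts[mate2_template] += 1
        # A fragment is credited once to each distinct template its mates hit.
        for template in {mate1_template, mate2_template}:
            fragment_counts[template] += 1
    return read_counts, fragment_counts


pairs = [
    ("contig_A", "contig_A"),
    ("contig_B", "contig_B"),
    ("contig_A", "contig_B"),  # pair split over two templates
]
read_counts, fragment_counts = count_reads_and_fragments(pairs)

print(sum(read_counts.values()), "reads counted out of", 2 * len(pairs))          # 6 out of 6
print(sum(fragment_counts.values()), "fragments counted out of", len(pairs))      # 4 out of 3
```

In this toy case the read counts sum to exactly the number of reads, while the fragment counts sum to 4 for only 3 fragments.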

    Best,
    Philip

  2. Christian Brinch

    Instead of 16S, you can map to whole bacterial genomes for a more robust estimate of the bacterial content.

  3. Panos Sapou reporter

    Dear both, thanks a lot!

    Philip, wouldn't you prefer the -Mt1 option for bacterial genomes with longer contigs (or complete genomes)?

    Again, Thanks!

  4. ptlcc

    Dear Panos

    The -Mt1 option only considers one template, so if there are several contigs in the assembly, the remaining ones will be skipped.

    Best,
    Philip

  5. Patrick Munk

    Dear Panos,

    For normalization, I think there are many fine options depending on your research questions. If you are just interested in how your recovered genomes change relative to one another: calculate the CLR (centered log-ratio).
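A minimal sketch of the centered log-ratio (CLR) transform for one sample, using only the standard library. The pseudocount of 1 used to handle zero counts is an assumption (other zero-replacement strategies exist), and the counts are invented.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centered log-ratio: log of each value minus the mean of the logs.

    A pseudocount is added first so that zero counts do not hit log(0).
    """
    shifted = [c + pseudocount for c in counts]
    logs = [math.log(x) for x in shifted]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]


# Made-up read counts for four MAGs in a single sample.
sample_counts = [120, 30, 0, 850]
clr_values = clr(sample_counts)

for count, value in zip(sample_counts, clr_values):
    print(f"{count:>5} -> {value:+.3f}")
```

CLR values within a sample always sum to zero, so they describe each genome relative to the geometric mean of all genomes in that sample rather than in absolute terms.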

    Want to see them relative to all bacteria? Classify all reads and look at the bacterial fraction, or use bacterial 16S rRNA (but also let non-bacterial rRNA compete in the alignment).
    Comparing very different microbiomes with very different bacteria? Consider using GTDB's collection of bacterial single-copy genes (bacSCGs).
    You can also check out the concept of “average genome size (AGS)” and its MicrobeCensus implementation, which also relies on bacSCGs:
    https://github.com/snayfach/MicrobeCensus

    Cheers,
    Patrick

  6. Panos Sapou reporter

    Hey Patrick,

    Thanks! I didn't know about the GTDBs! That does look like a great solution, I'll surely check it out!

    Cheers

    P
