normalization for KMA results

Comments (2)

Patrick Munk

Dear Panos,

Here are a few thoughts on the matter I think might be useful!

You should consider what you are interested in when you say “compare abundances”. You are not interested in specific genes in relation to all genes or all DNA? You are perhaps interested in how many specific genes there are in relation to bacterial cells? This should influence the chosen method. I am not sure if a general approach with KMA has emerged. I have tried both using genome-assigned and 16S rRNA gene-assigned reads.
Yes 16S is not perfect for the exact reason you mention. But using all the bacterial reads as a denominator is also not without issue, because species composition might also affect what proportion of reads you are able to assign. 16/18S is at least easy to identify. Perhaps a large collection of alleles of bacterial single copy genes (bSCG) is the best compromise?
If you 100% decide to go for a reference-based approach and don’t use your own contigs, there is a good chance you will assign way fewer reads. This of course depends on how well your microbiomes are represented by established databases.
You could consider finding a selection of genomes on e.g. RefSeq or JGI IMG. KMA works well with redundant sequences because of conclave, so it will mostly be compute / memory reasons to sub-sample. You could use e.g. MASH to remove extremely similar genomes and gather representatives, until you have something more manageable / you can index with KMA.
Your calculation does not look correct. You should not multiply with the lengths of your genes of interest. You should divide. A longer gene with the same number of hits as a shorter gene, should give you a smaller relative abundance.
If you are mostly concerned with “differential abundance analysis”, you could consider using CLR (centered log ratio) with Aldex2 to see if specific genes differ in the composition.

Best regards,
Patrick

2021-03-15T12:54:29+00:00

Panos Sapou reporter

Dear Patrick

‌

Again, thanks for your help

i think I have now wrapped my head around on what i have to do to “correct“ (not normalize ) and then what to use (log transformed ratios to do comparisons)

i have one last question that should conclude it:

I hear that the suggested strategy is to use “aligned fragments” for genes and just “fragments” when mapping to whole genomes - may I ask why? If I understand correctly the difference between them (which I assume are the “fragmentCount” and “fragmentCountAln” values in the mapstat file ) is that the “fragmentCountAln“ accounts for the position in the template? or not?

I also notice in some of my results these two numbers differ a lot!

‌

Thanks in advance,

2021-03-17T11:20:22+00:00

Patrick Munk
Dear Panos,

Here are a few thoughts on the matter I think might be useful!
1. You should consider what you are interested in when you say “compare abundances”. You are not interested in specific genes in relation to all genes or all DNA? You are perhaps interested in how many specific genes there are in relation to bacterial cells? This should influence the chosen method. I am not sure if a general approach with KMA has emerged. I have tried both using genome-assigned and 16S rRNA gene-assigned reads.
2. Yes 16S is not perfect for the exact reason you mention. But using all the bacterial reads as a denominator is also not without issue, because species composition might also affect what proportion of reads you are able to assign. 16/18S is at least easy to identify. Perhaps a large collection of alleles of bacterial single copy genes (bSCG) is the best compromise?
3. If you 100% decide to go for a reference-based approach and don’t use your own contigs, there is a good chance you will assign way fewer reads. This of course depends on how well your microbiomes are represented by established databases.
4. You could consider finding a selection of genomes on e.g. RefSeq or JGI IMG. KMA works well with redundant sequences because of conclave, so it will mostly be compute / memory reasons to sub-sample. You could use e.g. MASH to remove extremely similar genomes and gather representatives, until you have something more manageable / you can index with KMA.
5. Your calculation does not look correct. You should not multiply with the lengths of your genes of interest. You should divide. A longer gene with the same number of hits as a shorter gene, should give you a smaller relative abundance.
6. If you are mostly concerned with “differential abundance analysis”, you could consider using CLR (centered log ratio) with Aldex2 to see if specific genes differ in the composition.
Best regards,
Patrick
- 2021-03-15T12:54:29+00:00
Panos Sapou reporter
Dear Patrick

‌

Again, thanks for your help

i think I have now wrapped my head around on what i have to do to “correct“ (not normalize ) and then what to use (log transformed ratios to do comparisons)

i have one last question that should conclude it:

I hear that the suggested strategy is to use “aligned fragments” for genes and just “fragments” when mapping to whole genomes - may I ask why? If I understand correctly the difference between them (which I assume are the “fragmentCount” and “fragmentCountAln” values in the mapstat file ) is that the “fragmentCountAln“ accounts for the position in the template? or not?

I also notice in some of my results these two numbers differ a lot!

‌

Thanks in advance,

P
- 2021-03-17T11:20:22+00:00
Log in to comment