normalization for KMA results

Issue #31 new
Panos Sapou created an issue

Dear all

I would like to ask your advice regarding on how to normalize samples when using KMA.

My goal is to compare a few genes of interest among samples to compare abundances. My understanding is that FPKM is a very popular approach and I have seen in literature that people often use a formula that looks like (gene-of-interest counts)*length/(16S counts)*length.

However, I have some issues using the 16S as the variability across genomes (e.g. different number of copies or multiple variants) can affect its counts depending on the species composition.

I also have seen a lot of criticism towards FPKM (which I do not understand) and a lot of comments suggesting that it is not suitable for comparing different samples. I was wondering why is that.

Back to my project, For my samples I have already used KMA to count frequencies of the genes that interest me and I have already assembled the reads, compared them to NR and extracted the bacterial contigs. Therefore I was thinking that I could follow the follow strategy:

  1. map the original fastq reads to my bacterial contigs and therefore calculate (number of total bacterial read counts)
  2. use the following formula (gene-of-interest counts)*length/(total bacterial counts)

Does that look correct? Is it a problem that I do not use length to the denominator of my formula? (like people do when using 16S) - what normalization strategy do people use when using KMA in general?

Also, for these particular samples i happened to have them assembled and have the bacterial contigs (which i can use as reference for each sample) - but I do not want to do this procedure every time i want to normalize. Is there a small bacterial DB that I could use or some other strategy?

Thanks in advance for the advice

Comments (2)

  1. Patrick Munk

    Dear Panos,

    Here are a few thoughts on the matter I think might be useful!

    1. You should consider what you are interested in when you say “compare abundances”. You are not interested in specific genes in relation to all genes or all DNA? You are perhaps interested in how many specific genes there are in relation to bacterial cells? This should influence the chosen method. I am not sure if a general approach with KMA has emerged. I have tried both using genome-assigned and 16S rRNA gene-assigned reads.

    2. Yes 16S is not perfect for the exact reason you mention. But using all the bacterial reads as a denominator is also not without issue, because species composition might also affect what proportion of reads you are able to assign. 16/18S is at least easy to identify. Perhaps a large collection of alleles of bacterial single copy genes (bSCG) is the best compromise?

    3. If you 100% decide to go for a reference-based approach and don’t use your own contigs, there is a good chance you will assign way fewer reads. This of course depends on how well your microbiomes are represented by established databases.

    4. You could consider finding a selection of genomes on e.g. RefSeq or JGI IMG. KMA works well with redundant sequences because of conclave, so it will mostly be compute / memory reasons to sub-sample. You could use e.g. MASH to remove extremely similar genomes and gather representatives, until you have something more manageable / you can index with KMA.

    5. Your calculation does not look correct. You should not multiply with the lengths of your genes of interest. You should divide. A longer gene with the same number of hits as a shorter gene, should give you a smaller relative abundance.

    6. If you are mostly concerned with “differential abundance analysis”, you could consider using CLR (centered log ratio) with Aldex2 to see if specific genes differ in the composition.

    Best regards,
    Patrick

  2. Panos Sapou reporter

    Dear Patrick

    Again, thanks for your help

    i think I have now wrapped my head around on what i have to do to “correct“ (not normalize 🙂 ) and then what to use (log transformed ratios to do comparisons)

    i have one last question that should conclude it:

    I hear that the suggested strategy is to use “aligned fragments” for genes and just “fragments” when mapping to whole genomes - may I ask why? If I understand correctly the difference between them (which I assume are the “fragmentCount” and “fragmentCountAln” values in the mapstat file ) is that the “fragmentCountAln“ accounts for the position in the template? or not?

    I also notice in some of my results these two numbers differ a lot!

    Thanks in advance,

    P

  3. Log in to comment