Issue #23 resolved

Supermatrix alignment length is not a multiple of 3!

Sergey Naumenko
created an issue

Hi! I've run the agalma phylogenetics pipeline exactly as described in the tutorial on 4 real transcriptomes.

The length of the resulting nucleotide supermatrix alignment is not a multiple of 3! Looking in the gblocks-nuc folder, which contains the per-gene alignments, I see 1722 alignments, and 1132 of these 1722 have lengths that are not a multiple of 3.

What should I do to check these alignments? If an alignment length is not a multiple of 3, there may be a reading-frame problem somewhere, so I can't trust the aa-alignment.

Is this a problem with gblocks or with macse?
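
The check described above (counting per-gene alignments whose length is not a multiple of 3) could be scripted roughly as follows. This is a minimal sketch, not part of agalma: the function names and the `*.fa` file pattern are assumptions for illustration, and it assumes the per-gene alignments are plain FASTA files.

```python
from pathlib import Path


def read_fasta(path):
    """Parse a FASTA file into a dict of {name: sequence}."""
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].strip()
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def alignment_length(records):
    """Return the common length of all aligned sequences."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("sequences are missing or differ in length")
    return lengths.pop()


def out_of_frame_alignments(folder, pattern="*.fa"):
    """List (file name, length) for alignments whose length is not 3x.

    The file pattern is a hypothetical default; adjust it to the
    extension actually used in the alignment folder.
    """
    bad = []
    for path in sorted(Path(folder).glob(pattern)):
        length = alignment_length(read_fasta(path))
        if length % 3 != 0:
            bad.append((path.name, length))
    return bad
```

Running `out_of_frame_alignments` on both the pre-gblocks and the post-gblocks folders would show at which step the 3x lengths are lost.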



Comments (7)

  1. Sergey Naumenko reporter

    And when I selected only the genes with 3x lengths from the gblocks-nuc folder, translated and concatenated them, and built a tree with RAxML, the tree is very different in both topology and branch lengths.

  2. Felipe Zapata

    Hi Sergey, once the nucleotide alignments are sent to gblocks, they may no longer be in frame because sites are removed. Macse aligns in all frames and takes frameshifts and stop codons into account, so the first pass in macse (before gblocks) should return 3x alignments. Note that there may be frameshifts and premature stop codons that macse is inferring, so you may need to manipulate these alignments as well.

  3. Sergey Naumenko reporter

    Hi Felipe, yes, you are right. I've double-checked. My first check was not correct.

    It is gblocks, not macse, that introduces lots of stops and non-3x lengths.

    There are still internal stops after macse, in 64 of 1640 alignments, but with this level of errors I can proceed with the next steps.

    I think the problem might go deeper. The BBH approach (all vs all blast hits) results in frameshifts in a multiple alignment. I've faced this issue before while writing my own scripts.

    One more point: macse was developed with viruses in mind, so it tolerates internal stops. On most other data, an internal stop in an alignment is evidence of a data-processing error in the transcriptomes. It is a usual check: internal stops = something is wrong.
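
    The internal-stop check mentioned above can be sketched as below. This is only an illustration under stated assumptions: the standard genetic code, sequences already in frame 1, and hypothetical function names. It is not what agalma or macse run internally.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def internal_stop_positions(aligned_seq):
    """Return codon indices of stop codons that precede the last non-gap codon.

    The sequence is read in frame 1; all-gap codons are skipped, and a stop
    at the last non-gap codon is treated as the regular terminal stop.
    """
    seq = aligned_seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    non_gap = [i for i, codon in enumerate(codons) if codon.strip("-")]
    if not non_gap:
        return []
    last = non_gap[-1]
    return [i for i in non_gap if i < last and codons[i] in STOP_CODONS]


def alignments_with_internal_stops(alignments):
    """Map alignment name -> {sequence name: [codon indices]} for flagged ones.

    `alignments` is a dict of {alignment name: {sequence name: aligned seq}}.
    """
    flagged = {}
    for aln_name, records in alignments.items():
        hits = {}
        for seq_name, seq in records.items():
            positions = internal_stop_positions(seq)
            if positions:
                hits[seq_name] = positions
        if hits:
            flagged[aln_name] = hits
    return flagged
```

    Counting the keys of the returned dict gives the number of alignments with at least one internal stop.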


  4. Vincent Ranwez


    MACSE accepts the presence of some stop codons in the alignment, since they can appear, for instance, because of sequencing errors (when dealing with NGS contigs) or because the sequence is now a pseudo-gene. Stop codons, like frameshifts, are allowed but highly penalized. However, you can adjust those penalties according to your datasets. We proposed some possible penalties for certain applications (e.g. dealing with newly assembled NGS contigs, pseudo-genes etc.), but you can easily adjust the stop and frameshift (fs) costs to your needs using the -stop and -fs options of the latest version of MACSE available here.

    As the "!" used to pinpoint frameshifts is not supported by most software (except seaview), we also now provide an export functionality that allows removing codons containing frameshifts.

    For filtering the resulting alignment I usually rely on trimAl (which seems to preserve the codon structure) or BMGE, which has an explicit function to filter the alignment at the codon level.
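
    To illustrate what filtering at the codon level means, here is a minimal sketch that keeps or drops whole codon columns, so the output length stays a multiple of 3. It is not the trimAl or BMGE algorithm: the gap-fraction criterion, the `max_gap_fraction` parameter and the function name are assumptions chosen for the example.

```python
def filter_codon_columns(records, max_gap_fraction=0.5):
    """Drop codon columns where too many sequences contain a gap.

    `records` maps sequence name -> aligned nucleotide sequence. Columns are
    removed three sites at a time, so the reading frame is preserved.
    """
    if not records:
        return {}
    names = list(records)
    seqs = [records[name] for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("sequences differ in length")
    if length % 3 != 0:
        raise ValueError("alignment length is not a multiple of 3")

    kept_starts = []
    for start in range(0, length, 3):
        codons = [seq[start:start + 3] for seq in seqs]
        # A codon counts as gappy if any of its three sites is a gap.
        gappy = sum(1 for codon in codons if "-" in codon)
        if gappy / len(codons) <= max_gap_fraction:
            kept_starts.append(start)

    return {
        name: "".join(seq[i:i + 3] for i in kept_starts)
        for name, seq in zip(names, seqs)
    }
```

    Because sites are only ever removed in groups of three, a 3x input alignment always yields a 3x output, unlike site-by-site trimming.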

    I hope this will help.
