Site indexing and reference sequence for NEB and BEB

Issue #2 resolved
Nathanael Walker-Hale created an issue

Hi Iakov,

Thanks very much for godon, it’s a really nice package. I just have a couple of questions about the sequence position referencing for the NEB and BEB testing.

Firstly, what is the reference sequence from which the codon and AA at the position is taken? Is in the first sequence in my input alignment, as in PAML, or does some reordering go on?

Secondly, are some sites excluded from the NEB and BEB calculations, and does the position referencing take this into account? I ask this because, in my output I have an amino acid listed that does not occur in what would be the reference sequence in the alignment (first sequence in input FASTA), but does occur in that same sequence one position adjacent. So I was wondering if perhaps the position indexing had some kind of bug?

Thanks in advance for your help,

Nat

Comments (19)

  1. Iakov Davydov repo owner

    Hi Nat,

    I quickly checked, the intention is to print the first non-gap codon (from the top).

    As far as I remember, all the codon sites are used in the computation. If you want to see the full raw data there is a possibility to use JSON export, which will give you all the values.

    Could you please share with me the input files and the command-line you used? I never observed this shifted indexing problem, and I would like to find out what’s causing this.

    Cheers,

    Iakov

  2. Nathanael Walker-Hale reporter

    Hi Iakov,

    Thanks for the speedy response. I’ve attached the requested files. Note that due to an analysis versioning error one of the tree labels may have shifted by one node, but this should not affect the issue. The JSON included is from the previous version.

    I’m specifically interested in site 182. I believe, based on the alignment, that it should be 183.

    Cheers,

    Nat

  3. Nathanael Walker-Hale reporter

    Yep - although unfortunately due to the aforementioned versioning issue it is the output from the latter run. However there is still a problem with it: 181, 182 and 183 are all found with high posterior probability in the BEB output, and the reference AA is given as P, N and N, respectively. But based on the alignment, it should be P, S and N (considering the positions are 1-indexed, which it appears that they are).

  4. Iakov Davydov repo owner

    Ok, so I was wrong about the reference sequence. In this case, it is Chenopodiaceae_Chenopodium_quinoa@AUR62006956-RA (which is the first one in the phylogenetic tree). This sequence has a gap in positions 181, 182, and 183. So godon prints this site from the next sequence (I suspect again the next in the tree).

    Obviously very confusing. So far I think the most logical thing would be to:

    • Report the reference sequence explicitly
    • Do not try to be smart and print real codons/AAs instead of finding the next present sequence.

    What do you think?

  5. Nathanael Walker-Hale reporter

    Oh great! I should have spotted that. Sorry for the hassle.

    In terms of the fix, I think reporting the reference sequence would be really helpful. If I’m not mistaken, this is what also is reported by PAML.

    Thanks so much again for your help,

    Nat

  6. Iakov Davydov repo owner

    I will try add reference sequence to the output during the weekend.

    In the meantime, I wanted to comment about the dataset. If you see something which looks like frame shifts this of course violates the model assumptions. All the models are wrong, but for this particular problem alignment quality is crucial. So it’s generally a good idea is to mask regions with low-quality of alignment. One tool to do this is guidance2, but I think @smoretti had a good experience with another one. @smoretti , what was the name?

    I also had good experience with the --m0-tree option, I think for large complex datasets it not only reduces the optimization time but also improves the statistical properties of the model.

  7. Nathanael Walker-Hale reporter

    Thank you very much, both, for your help! I agree that alignment accuracy is paramount for these analyses. However I am curious, what particularly makes you think frameshifts are at play, rather than just an indel process? I have actually compared this dataset with quite a few different aligners (including MACSE; this is PRANK), but haven’t yet tried any consensus approach. Thank you for the advice!

  8. Iakov Davydov repo owner

    The problem is that vast majority of codon models assume single nucleotide substitution process. The rates of non-synonymous and synonymous substitutions are compared to determine selection regime.

    So if there is a frame shift happening (either via indel or via incorrect alignment), it looks like multiple non-synonymous substitutions; while in reality maybe it was only one. In such cacses the model can mistakenly assume that this is a positive selection event.

  9. Nathanael Walker-Hale reporter

    Right, totally understand all that - I was just curious what prompted you to think that there was specifically a frameshift in this alignment, as opposed to an inframe indel?

  10. Iakov Davydov repo owner

    Ah, sorry I misinterpreted your initial question. I thought that you are talking about nucleotide position index shift, not codon index position.

    With inframe indel there should be no problems (if sequences are aligned properly). If they look like they are 1 position shifted (be it one codon or one nucleotide), this might be problematic.

    I hope that makes sense.

  11. Log in to comment