treeinform error

Issue #50 resolved
Cat Munro
created an issue

I ran into the following error after running agalma homologize, multalign, genetree and then treeinform:

STAGE 1 / treeinform.identify_candidate_variants / 0.145s / 131.5MB
Identify candidates variants
Traceback (most recent call last):
  File "/gpfs/runtime/opt/agalma/1.0.0/lib/python2.7/site-packages/agalma/", line 113, in <module>
  File "/gpfs/runtime/opt/agalma/1.0.0/lib/python2.7/site-packages/biolite-1.0.0-py2.7.egg/biolite/", line 469, in run
    if self.run_stage(s):
  File "/gpfs/runtime/opt/agalma/1.0.0/lib/python2.7/site-packages/biolite-1.0.0-py2.7.egg/biolite/", line 386, in run_stage
    ret = func(**argdict)
  File "/gpfs/runtime/opt/agalma/1.0.0/lib/python2.7/site-packages/agalma/", line 71, in identify_candidate_variants
    for i, model_ids in enumerate(workflows.phylogeny.identify_candidate_variants(trees, threshold)):
  File "/gpfs/runtime/opt/agalma/1.0.0/lib/python2.7/site-packages/biolite-1.0.0-py2.7.egg/biolite/workflows/", line 226, in identify_candidate_variants
    tree.set_outgroup(outgroup)
  File "/gpfs/runtime/opt/agalma/1.0.0/lib/python2.7/site-packages/ete3/coretype/", line 1234, in set_outgroup
    outgroup = _translate_nodes(self, outgroup)
  File "/gpfs/runtime/opt/agalma/1.0.0/lib/python2.7/site-packages/ete3/coretype/", line 2457, in _translate_nodes
    raise TreeError("Invalid target node: "+str(n))
ete3.coretype.tree.TreeError: 'Invalid target node: None'

Comments (20)

  1. Casey Dunn repo owner

    I tracked down the offending tree with the following...

    sqlite3 /gpfs/data/cdunn/analyses/agalma-siphonophora-20170501.sqlite
    sqlite> select * from runs where name="genetree";
    sqlite> .quit
    [cdunn@login001 ~]$ interact -t 4:00:00
    [cdunn@node506 ~]$ export AGALMA_DB=/gpfs/data/cdunn/analyses/agalma-siphonophora-20170501.sqlite
    [cdunn@node506 ~]$ module load agalma/1.0.0
    [cdunn@node506 ~]$ python
    Python 2.7.12 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:42:40) 
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    Anaconda is brought to you by Continuum Analytics.
    Please check out: and
    >>> genetree_id=92
    >>> threshold=0.05
    >>> import os
    >>> import numpy as np
    >>> from itertools import imap
    >>> from operator import itemgetter
    >>> from agalma import config
    >>> from agalma import database
    >>> from biolite import diagnostics
    >>> from biolite import report
    >>> from biolite import utils
    >>> from biolite import workflows
    >>> from biolite.pipeline import Pipeline
    >>> pipe = Pipeline("treeinform", __doc__)
    >>> variants = {}
    >>> trees = imap(itemgetter("tree"), database.select_trees(genetree_id))
    >>> trees
    <itertools.imap object at 0x7fb30f7f5050>
    >>> for i, model_ids in enumerate(workflows.phylogeny.identify_candidate_variants(trees, threshold)):
    ...     for model_id in model_ids:
    ...             assert model_id not in variants
    ...             variants[model_id] = i
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/gpfs/runtime/opt/agalma/1.0.0/lib/python2.7/site-packages/biolite-1.0.0-py2.7.egg/biolite/workflows/", line 226, in identify_candidate_variants
      File "/gpfs/runtime/opt/agalma/1.0.0/lib/python2.7/site-packages/ete3/coretype/", line 1234, in set_outgroup
        outgroup = _translate_nodes(self, outgroup)
      File "/gpfs/runtime/opt/agalma/1.0.0/lib/python2.7/site-packages/ete3/coretype/", line 2457, in _translate_nodes
        raise TreeError("Invalid target node: "+str(n))
    ete3.coretype.tree.TreeError: 'Invalid target node: None'
    >>> trees = imap(itemgetter("tree"), database.select_trees(genetree_id))
    >>> newicks = trees
    >>> hist = {}
    >>> import ete3
    >>> for newick in newicks:
    ...     tree = ete3.Tree(newick)
    ...     outgroup = tree.get_midpoint_outgroup()
    ...     if outgroup is None:
    ...             print(newick)
  2. Casey Dunn repo owner

    The problem is that the ete3 method get_midpoint_outgroup() doesn't succeed for all trees. The point of the midpoint rooting is to avoid situations where the arbitrary root happens to fall within a clade of sequences to be collapsed. It is fine, then, in these rare cases where a midpoint root can't be found to just not reroot the tree.
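
    A minimal sketch of that behaviour, not the actual biolite patch: the hypothetical helper `midpoint_root_if_possible` reroots only when `get_midpoint_outgroup()` returns a node, and otherwise leaves the tree unchanged. It assumes only that the tree object exposes the two ete3 methods seen in the traceback, so it does not import ete3 itself.

```python
def midpoint_root_if_possible(tree):
    """Reroot `tree` at its midpoint when one exists.

    Hypothetical helper, not part of biolite or ete3. `tree` is assumed to
    expose get_midpoint_outgroup() and set_outgroup(), as ete3.Tree does.
    Returns True if the tree was rerooted, False if it was left unchanged.
    """
    outgroup = tree.get_midpoint_outgroup()
    if outgroup is None:
        # No midpoint outgroup was found: keep the existing (arbitrary) root
        # rather than passing None to set_outgroup(), which raises TreeError.
        return False
    tree.set_outgroup(outgroup)
    return True
```

    With ete3 installed, this could be called as `midpoint_root_if_possible(ete3.Tree(newick))` for each newick string, in place of an unconditional `tree.set_outgroup(outgroup)`.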

  3. Josephine Reinhardt

    Hello. I realize this issue is marked as resolved, but I have obtained this exact error using the latest conda agalma release (which is installing agalma 1.0.1 and biolite 1.0.0-np111py27_0).

    Is this fix implemented in the current conda release? If not, is there a workaround I could use? If so... can we reopen this?

    Thanks for your time.

  4. Cat Munro reporter

    Hi -

    There is a fix, and work is under way on a new agalma release that uses the latest biolite version. In the meantime, I have written up this fix:

    Users running agalma/2.0.0 may run into the following error during the treeinform stage:

        raise TreeError("Invalid target node: "+str(n))
    ete3.coretype.tree.TreeError: 'Invalid target node: None'

    This is due to the fact that some trees cannot be midpoint rooted. There is a fix for this:

    As a workaround, you can uninstall biolite/1.0.0 from your agalma/2.0.0 installation:

    conda remove biolite

    Then clone the latest version of biolite from the git repository:

    git clone

    cd biolite/

    Now check out the specific commit with the fix:

    git checkout 784edc6

    Finally activate this version with the fix:

    python install

    Now you should be able to run treeinform without this error.

  5. Josephine Reinhardt

    Hello Cat, thanks for your quick responses. If users are supposed to be using v 2.0, why is it recommended in the documentation that users install via conda?

    Specifically the documentation states "On 64-bit Linux, it is also possible to install Agalma using prebuilt packages from our Anaconda channel. We recommend this for most full analyses."

    I ended up directly editing with the hotfix above and it seems to have worked?

  6. Cat Munro reporter

    Conda should be installing v 2.0 - we're looking into it. In the meantime, you can use a similar approach (conda remove agalma etc., then check out the master branch of the agalma git repo to install the latest version).

    For the above issue, directly editing would also do the trick.

  7. Josephine Reinhardt

    Great, thank you for the information.

    The "updating" section of the manual states "you may need to rerun already completed analyses if you want to generate new reports or use existing data with new versions of pipelines."

    Do you know whether installing 2.0 would require me to redo all of my analyses or if not, which steps should be redone if I do update? Obviously I don't want to end up with incorrect results downstream due to that bug, but in particular the transcriptome and expression pipelines take a substantial amount of computational time and I would really rather not have to redo them (it would drain our allocation substantially).

    Thanks again

  8. Cat Munro reporter

    The transcriptome assembly will be unaffected - this part of the pipeline has not changed since agalma 1.0.0, so you will not have to rebuild anything generated by agalma transcriptome.

    You will, however, have to re-run everything in the phylogeny pipeline from agalma homologize onwards (as outlined from line 147 through 169 in the [...]). Given that you are at the treeinform stage, this means you would have to re-run homologize, multalign, genetree and treeinform with the new version.

    For the expression part of the pipeline, the database is slightly updated in the latest version to output tpm and fpkm values in the JSON. To continue using the same database from agalma 1.0.1 you'll want to run sqlite3 to access your sqlite database and add these columns into the schema as shown below:

    sqlite3 [name of your agalma DB]

    alter table agalma_expression add column gene_length FLOAT;

    alter table agalma_expression add column fpkm FLOAT;

    alter table agalma_expression add column tpm FLOAT;

    [For internal reference, this is listed in this issue]
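
    The same schema update can be sketched with Python's standard `sqlite3` module, which also makes it safe to run twice. The table and column names come from the statements above; the helper `add_expression_columns` and the stand-in table in the demo are hypothetical and not part of agalma. Rows that already exist keep NULL in the new columns until the values are recomputed.

```python
import sqlite3

# Columns and table name are taken from the ALTER TABLE statements above.
NEW_COLUMNS = ("gene_length", "fpkm", "tpm")


def add_expression_columns(conn, table="agalma_expression"):
    """Add any missing FLOAT columns to `table`; return the names added.

    Hypothetical helper, not part of agalma.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(%s)" % table)}
    added = []
    for name in NEW_COLUMNS:
        if name not in existing:
            conn.execute("ALTER TABLE %s ADD COLUMN %s FLOAT" % (table, name))
            added.append(name)
    conn.commit()
    return added


if __name__ == "__main__":
    # Stand-in in-memory table for demonstration only; the real agalma
    # schema has different columns. Point connect() at a copy of your DB.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE agalma_expression (run_id INTEGER, count FLOAT)")
    print(add_expression_columns(conn))  # first call adds the missing columns
    print(add_expression_columns(conn))  # second call finds nothing to add
```

    Checking `PRAGMA table_info` first avoids the "duplicate column name" error that a repeated `ALTER TABLE ... ADD COLUMN` would raise.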

  9. Josephine Reinhardt

    Ok, I cannot now replicate the error either. I am sure I did do conda env remove agalma previously, but indeed now it did work and I have agalma 2.0.0 through conda.

    Anyhow, I went ahead and tried the fix above for the treeinform issue, and conda removes agalma along with biolite (see below). I can then install biolite from git, but since agalma is gone, the fix is in place while agalma itself no longer works.

    (agalma) login-1:data$ conda remove biolite
    Solving environment: done

    Package Plan

      environment location: /homes/reinharj/miniconda2/envs/agalma

      removed specs:
        - biolite

    The following packages will be REMOVED:

        agalma:  2.0.0-py27_0 dunnlab
        biolite: 1.2.0-py27_0 dunnlab

    Proceed ([y]/n)? y

    Preparing transaction: done
    Verifying transaction: done
    Executing transaction: done

    login-1:softwares$ source activate agalma
    (agalma) login-1:softwares$ export AGALMA_DB=/lustre/reinharj/data/batdna/agalma_Sp18/agalma_bat_Sp18.sqlite
    (agalma) login-1:softwares$ agalma diagnostics list
    -bash: agalma: command not found

  10. Cat Munro reporter

    Hi - if you're installing the latest version of agalma and biolite 1.2.0 then the fix is already in. No need to remove biolite.

    You will still need to update your sqlite database schema, if you haven't already.

  11. Josephine Reinhardt

    Hi, I posted a new issue, but I think it is probably quite specific to my case and may not need to be a general question for agalma.

    After updating the schema as above, my database has empty values (NoneType) for the gene_length, fpkm, and tpm columns for each gene. Will I have to rerun the expression pipeline to get everything into the database?

    I suppose I could just back-alter my database to get rid of the new columns (I can get gene length and calculate FPKM/TPM myself with only a little extra effort).
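
    For reference, a rough sketch of that calculation using the standard definitions of FPKM and TPM. The function name `fpkm_and_tpm` is hypothetical, and the numbers it produces are not guaranteed to match what agalma would store (for example, if agalma uses a different length measure). It assumes non-zero lengths and at least one non-zero count.

```python
def fpkm_and_tpm(counts, lengths):
    """Return (fpkm, tpm) lists from per-gene fragment counts and lengths in bp.

    Standard textbook definitions; not necessarily the exact values agalma
    stores. Assumes every length is > 0 and the counts do not all equal zero.
    """
    if len(counts) != len(lengths):
        raise ValueError("counts and lengths must have the same length")
    total = float(sum(counts))
    # FPKM: fragments per kilobase of gene per million mapped fragments.
    fpkm = [c * 1e9 / (l * total) for c, l in zip(counts, lengths)]
    # TPM: length-normalised rates, rescaled so they sum to one million.
    rates = [c / float(l) for c, l in zip(counts, lengths)]
    rate_sum = sum(rates)
    tpm = [r * 1e6 / rate_sum for r in rates]
    return fpkm, tpm
```

    TPM values always sum to one million within a sample, which makes them easier to compare across samples than FPKM.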

  12. Cat Munro reporter

    Yes, you will need to re-run expression, not only to populate the missing columns but also because the results of treeinform affect gene assignment. This is probably the easiest and most accurate option, as the results of the phylogeny pipeline are used downstream by the expression pipeline.

    If you don't want to re-run expression, you'd have to run a modified export expression script that does not pull tpm, fpkm and gene length out.
