Report generation with Singularity

Issue #1269 new
v created an issue

I’d like to be able to produce reports reproducibly, meaning that the pygraphviz dependency is provided in a container. I would want this to work:

$ snakemake --use-singularity --report report.html
Building DAG of jobs...
Creating report...
Traceback (most recent call last):
  File "/home/vanessa/anaconda3/lib/python3.7/site-packages/networkx/drawing/nx_agraph.py", line 283, in pygraphviz_layout
    import pygraphviz
ModuleNotFoundError: No module named 'pygraphviz'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/vanessa/anaconda3/lib/python3.7/site-packages/snakemake-5.5.4+26.g88d8c0aa.dirty-py3.7.egg/snakemake/__init__.py", line 551, in snakemake
    export_cwl=export_cwl)
  File "/home/vanessa/anaconda3/lib/python3.7/site-packages/snakemake-5.5.4+26.g88d8c0aa.dirty-py3.7.egg/snakemake/workflow.py", line 526, in execute
    auto_report(dag, report)
  File "/home/vanessa/anaconda3/lib/python3.7/site-packages/snakemake-5.5.4+26.g88d8c0aa.dirty-py3.7.egg/snakemake/report/__init__.py", line 515, in auto_report
    rulegraph, xmax, ymax = rulegraph_d3_spec(dag)
  File "/home/vanessa/anaconda3/lib/python3.7/site-packages/snakemake-5.5.4+26.g88d8c0aa.dirty-py3.7.egg/snakemake/report/__init__.py", line 407, in rulegraph_d3_spec
    pos = graphviz_layout(g, "dot", args="-Grankdir=BT")
  File "/home/vanessa/anaconda3/lib/python3.7/site-packages/networkx/drawing/nx_agraph.py", line 243, in graphviz_layout
    return pygraphviz_layout(G, prog=prog, root=root, args=args)
  File "/home/vanessa/anaconda3/lib/python3.7/site-packages/networkx/drawing/nx_agraph.py", line 286, in pygraphviz_layout
    'http://pygraphviz.github.io/')
ImportError: ('requires pygraphviz ', 'http://pygraphviz.github.io/')

instead of it relying on the Python installed on my local machine.

Comments (29)

  1. Johannes Köster

    Using the official Snakemake Docker container, this should already work fine; pygraphviz is in there. The same holds for Snakemake installed via conda. I am not sure what you are referring to here.

  2. v reporter

    Running snakemake locally, I’d want to be able to specify that report generation run in a Singularity container, akin to how I can run a workflow. If I’m running this on HPC I won’t be able to use Docker, and conda doesn’t offer a proper container. Does that make sense?

  3. v reporter

    So my thinking is: since wrappers can already handle running other Python files in a Singularity container (I haven’t figured out how this works yet), wouldn’t it be logical for the report generation command to do the same, using a wrapper stored alongside snakemake?

  4. Johannes Köster

    Not really. The --report flag just collects (meta-)data and composes an HTML file. What would be the problem with using Snakemake installed via conda? Second, you can use the Snakemake Docker container via Singularity; Singularity supports Docker containers out of the box.

  5. Johannes Köster

    Of course one could encapsulate this in a further container or so, but I don’t see the benefit. When you are able to run snakemake, everything needed for the report is there anyway.

  6. v reporter

    Report generation requires a separate dependency to generate the graph - pygraphviz (see the issue above). And in fact, when running on the host, neither --use-singularity nor --use-conda is applied when generating the report.

    It’s a bit counter-intuitive - you are suggesting running a Docker or Singularity base container to run the workflow, but the containers that hold the workflow dependencies would then need to run inside it (which would require binding the Docker socket for Docker, and isn’t possible at all with Singularity). The only way this would work is if every analysis container (with some genomic tool, for example) also had snakemake installed, which isn’t something these labs are going to do just for one workflow orchestrator.

    The ideal use case is to:

    1. Have Singularity containers with my analysis steps in them; it doesn’t matter whether it’s a Docker URI or a Singularity image - both work great
    2. Be able to execute snakemake (locally) targeting a container to run the workflow (this also works great)
    3. ALSO be able to generate a report, using software within a container (currently not possible - pygraphviz is required on the host).

    This means I should be able to do this:

    snakemake --use-singularity --report
    # or
    snakemake --use-conda --report
    

    And have the report generation done within a container that I’ve specified in the repository. If it’s too much to ask users to install pygraphviz in their containers (arguably, it could be), then the entire --report generation could essentially be a wrapper with a known container to run in.

    With steps 1-2 the workflows can be reproducible, but the report generation is not.

  7. v reporter

    Actually, let me make sure this isn’t a cache issue - I added pygraphviz to the container, but the hash doesn’t look like it’s updated. (I’m using a Docker Hub container pulled to Singularity.)

  8. v reporter

    Nope, it’s definitely running on the host - even after deleting the container and doing:

    snakemake --use-singularity --report
    

    results in the missing pygraphviz error.

    Testing on our cluster a workflow that works locally, when I use a Singularity image I’d also expect the container’s Python to be used, since that’s where everything is correctly installed. But it looks like it still leverages the host?

    Activating singularity image /scratch/users/vsochat/SOFTWARE/encode-demo-workflow/.snakemake/singularity/d740e45807c551b7303b0b0f635e86c9.simg
    Traceback (most recent call last):
      File "/share/software/user/open/py-numpy/1.14.3_py36/lib/python3.6/site-packages/numpy/core/__init__.py", line 16, in <module>
        from . import multiarray
    ImportError: cannot import name 'multiarray' from 'numpy.core' (/share/software/user/open/py-numpy/1.14.3_py36/lib/python3.6/site-packages/numpy/core/__init__.py)
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "scripts/plot_fastq_scores.py", line 12, in <module>
        import matplotlib
      File "/opt/conda/lib/python3.7/site-packages/matplotlib/__init__.py", line 138, in <module>
        from . import cbook, rcsetup
      File "/opt/conda/lib/python3.7/site-packages/matplotlib/cbook/__init__.py", line 31, in <module>
        import numpy as np
      File "/share/software/user/open/py-numpy/1.14.3_py36/lib/python3.6/site-packages/numpy/__init__.py", line 142, in <module>
        from . import add_newdocs
      File "/share/software/user/open/py-numpy/1.14.3_py36/lib/python3.6/site-packages/numpy/add_newdocs.py", line 13, in <module>
        from numpy.lib import add_newdoc
      File "/share/software/user/open/py-numpy/1.14.3_py36/lib/python3.6/site-packages/numpy/lib/__init__.py", line 8, in <module>
        from .type_check import *
      File "/share/software/user/open/py-numpy/1.14.3_py36/lib/python3.6/site-packages/numpy/lib/type_check.py", line 11, in <module>
        import numpy.core.numeric as _nx
      File "/share/software/user/open/py-numpy/1.14.3_py36/lib/python3.6/site-packages/numpy/core/__init__.py", line 26, in <module>
        raise ImportError(msg)
    ImportError: 
    Importing the multiarray numpy extension module failed.  Most
    likely you are trying to import a failed build of numpy.
    If you're working with a numpy git repo, try `git clean -xdf` (removes all
    files not under version control).  Otherwise reinstall numpy.
    
    Original error was: cannot import name 'multiarray' from 'numpy.core' (/share/software/user/open/py-numpy/1.14.3_py36/lib/python3.6/site-packages/numpy/core/__init__.py)
    
    [Sat Aug 31 08:40:37 2019]
    Error in rule plot:
        jobid: 4
        output: data/file2_untrimmed_file2_trimmed_quality_scores.png
        log: logs/plot/2.log (check log file(s) for error message)
        shell:
            python3 scripts/plot_fastq_scores.py --untrimmed "data/reads/file2.fastq.gz" --trimmed "data/trimmed/trimmed.file2.fastq.gz" --bar-color white --flier-color grey --plot-color darkgrid --output-dir data/
            (exited with non-zero exit code)
    
    Shutting down, this might take some time.
    Exiting because a job execution failed. Look above for error message
    Complete log: /scratch/users/vsochat/SOFTWARE/encode-demo-workflow/.snakemake/log/2019-08-31T083003.702334.snakemake.log
    

    What I’m learning is that it’s never really going to work to use snakemake alongside containers - the requirement seems to be to install it within the containers, alongside the analysis software, and then run everything completely separately from the host. That’s not the use case I anticipated, because it means using several containers in one workflow wouldn’t be possible. For a dummy / simple workflow I can test this out, but I don’t think I’ll be able to convince the lab I’m developing workflows for to use snakemake under these circumstances.

  9. Johannes Köster

    I still cannot follow you. Snakemake has several dependencies. Some of them are optional, because not everybody needs them. This is why snakemake --report needs an additional dependency like pygraphviz. But there is no need to install this via a container at all. For example, let’s assume you use conda to install snakemake (which is the recommended method, and does not imply that you need to use conda for your analysis steps). In case you just need the minimal snakemake without e.g. report functionality, you do conda install snakemake-minimal. In case you need things like reports, remote files etc., you do conda install snakemake. That’s it. Parts of snakemake’s functionality simply need additional dependencies, so there are two flavors of installation, the full and the minimal one. This is a different story than the dependencies of the analysis steps.

  10. v reporter

    > I still cannot follow you. Snakemake has several dependencies. Some of them are optional, because not everybody needs them. This is why snakemake --report needs an additional dependency like pygraphviz.

    It should be the case that, akin to how I can do --use-singularity to run my workflow, I can also add --use-singularity to generate the report. That’s all I’m saying.

  11. Johannes Köster

    I guess this is the main misunderstanding. With the conda installation method, there are two flavors of snakemake, the minimal one and the full one. One needs the full one to use all functionality. This does not mean that --report is not reproducible. It just means that it needs the full flavor of snakemake.

  12. v reporter

    Also, conda install snakemake doesn’t work - if the package is kept in a specific channel, the instructions need to say so:

    $ docker run --entrypoint bash -it vanessa/encode-demo-workflow 
    (base) root@0c93e15d0181:/code# conda install snakemake
    Collecting package metadata (current_repodata.json): done
    Solving environment: failed with initial frozen solve. Retrying with flexible solve.
    Collecting package metadata (repodata.json): done
    Solving environment: failed with initial frozen solve. Retrying with flexible solve.
    
    PackagesNotFoundError: The following packages are not available from current channels:
    
      - snakemake
    
    Current channels:
    
      - https://repo.anaconda.com/pkgs/main/linux-64
      - https://repo.anaconda.com/pkgs/main/noarch
      - https://repo.anaconda.com/pkgs/r/linux-64
      - https://repo.anaconda.com/pkgs/r/noarch
    
    To search for alternate channels that may provide the conda package you're
    looking for, navigate to
    
        https://anaconda.org
    
    and use the search bar at the top of the page.
    

    So I installed with pip - and with pip, the additional libraries needed for the report (including pygraphviz, jinja2, networkx, and pygments) weren’t installed, so it didn’t work without a hitch (it took many retries and debugging).
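    In hindsight, a quick stdlib check would have shown up front which of these report dependencies the pip install had left out. A minimal sketch (the module names come from the errors above and may not be exhaustive; this is not part of snakemake):

```python
import importlib.util

# Modules snakemake's report generation needs beyond the minimal install
# (hypothetical list, assembled from the import errors in this thread).
wanted = ["pygraphviz", "jinja2", "networkx", "pygments"]

# find_spec returns None when a module cannot be imported.
missing = [m for m in wanted if importlib.util.find_spec(m) is None]
print("missing:", missing)
```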

  13. v reporter

    To quickly define how I am thinking of “reproducible” - it means that I don’t need dependencies on my host other than the container technology. The current strategy I’m using is to install everything in a container (see usage here https://github.com/vsoch/encode-demo-workflow#usage), but even with a conda base, snakemake isn’t found (what channel?), so I have to manually install it (and all dependencies) with pip: https://github.com/vsoch/encode-demo-workflow/blob/master/docker/Dockerfile#L36.

  14. Johannes Köster

    Just saw your answer, sorry. Well, I can in principle understand. The general argument would be that all optional functionality should be automatically loaded via containers when adding --use-singularity. We could potentially do that for reports, but likely not for other parts, like remote file support. Hence, it would be a bit asymmetric. Moreover, it would make the code more complicated for something for which I don’t really see a good reason. Simply install the full flavor via conda and all is fine. What we could do is provide a similar snakemake-minimal plus snakemake-full package on PyPI, for those installing via pip, so that this works not only for conda.

  15. Johannes Köster

    Ah, you are answering too fast for me, I am always two questions behind.

    It does work via conda, when following the installation instructions: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda. You need bioconda and conda-forge. Not sure which instructions you are referring to (can you point me to them?).

    Regarding your second message: maybe we simply have different definitions of reproducibility. Conda is also a technology for reproducible installation. And if you just want a container engine as the only prerequisite, then it also makes no sense to install Snakemake via pip. The pure container usage of snakemake could be something like singularity exec quay.io://snakemake/snakemake snakemake --help (untested). I have never tried this though, and it can of course be that locally this leads to some problems.

  16. v reporter

    Could you show me the proper way to install with conda? Note my example above - it’s not found on the main channels.

    So my understanding is that snakemake is a workflow engine that is primarily conda-based. Are there examples out there that don’t rely on conda and primarily use containers? It can execute commands in a Singularity container, but it’s still heavily conda. Even the container that I built with Docker, with snakemake inside, has conda inside too.

    For a reproducible snakemake, I would want to remove the conda requirement entirely, have only the minimal Python required to read in the configuration and execute commands, and have each step truly run via a container. As a user, I should be able to install the minimal package into some Python, and then have everything work seamlessly given that I have the container technology installed. Because currently, step 1 for any user is to install a lot of dependencies (likely getting a wrong version, or in my case, missing the bulk of what is needed for report generation) and then not really have a reproducible workflow.

  17. v reporter

    Got the link, thanks! I’m behind too 🙂 You know GitHub updates the comments live on the screen, without any of this delay - if you ever feel like switching 😛

  18. Johannes Köster

    Yeah, Bitbucket is awful - not only this, but pull request support is also really bad, and everything is just slow and unresponsive. You know, I have actually been planning the move to GitHub for 3 years now (all my other stuff is there). I just need a free week and enough time to migrate all issues and clean up all old pull requests first. It is just never the right time. At some point, I just need to make a hard cut.

    Regarding your previous comment:

    • Actually, snakemake is not at all bound to conda. Conda integration is just one available functionality; the equally supported alternative is singularity integration. The reason why the snakemake container image contains conda AND singularity is that it needs to support both options for people that run Snakemake workflows on Kubernetes. The key idea of snakemake is that it is up to the user to use either conda or singularity for the software stack of each analysis step (or both, or a combination of them). The container has to support all modes.
    • The reason why it is not so easy to just say, get rid of conda for installation and just use Python + snakemake, is that snakemake contains a lot of functionality requiring third-party Python packages and even dependencies outside of Python: e.g. pygraphviz, imagemagick, and graphviz for the report; several Python packages for remote file support; and pandas (which itself has dependencies outside of Python) for most workflows, because they read in sample sheets; and so on. Hence the recommendation to install it via conda, because it is simply the most convenient solution for installing packages that span multiple programming language ecosystems. I am really keen to learn which installation instructions misled you to the impression that conda install does not work. But I agree that the docs should more explicitly explain what additional stuff one can install when not using conda (although, looking at the download counts, most people just use the conda way).
    • Regarding your statement: “As a user, I should be able to install the minimal package to some python, and then have everything work seamlessly given that I have the container technology installed.” Even if that were possible, it would exclude all workflows that rely on conda for their software stack (making them not reproducible). Snakemake has to support both approaches. And it currently does; we just need to modify your statement a tiny bit and it becomes true: “As a user, I should be able to install the snakemake package via conda, and then have everything work seamlessly given that I have the container technology installed.”

  19. v reporter

    Oh that’s great news!! Can I help in any way? I’m really good at terrible, long and menial tasks (I strangely enjoy them). Have you tried importing directly to GitHub? There is even an importer tool that might be able to handle a lot of the issues/PRs that you are worried about → https://help.github.com/en/articles/importing-a-repository-with-github-importer

    I understand the dependencies and using conda - it’s definitely better than installing from source using the system Python, and I use it on my host to avoid exactly that. I’m wondering if there is a way to decrease the dependency load. For example, is it worth having such a heavy library (pandas, which changes frequently and requires numpy) if you can easily read in sheets using csv and do some custom parsing to get row/header fields? Is pandas really just used for the samples.csv (and similar) sheets? What if we could remove it?
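    For the simple flat case, the stdlib csv module can indeed load a sheet into plain dicts. A minimal sketch with a hypothetical in-memory samples.tsv (not what snakemake actually does):

```python
import csv
import io

# A tiny stand-in for a samples.tsv sheet (hypothetical contents);
# a real workflow would open the file instead.
sheet = io.StringIO("sample\tcondition\nA\ttreated\nB\tuntreated\n")

# csv.DictReader maps each row to {header: value}; index by sample name.
samples = {row["sample"]: row for row in csv.DictReader(sheet, delimiter="\t")}

print(samples["A"]["condition"])  # treated
```

    The hard part to rebuild by hand would be the database-style filtering and joining over such sheets, not the parsing itself.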

    And for the plotting, are the charts so complicated that they require networkx / pygraphviz, and it couldn’t be done with a simple data structure and some front-end rendering (e.g., d3 or similar)?

    Snakemake is, by far, my favorite workflow manager - Python is at the top of the list of languages that I really love, and it’s especially relevant for researchers as a main language in the scientific programming ecosystem. So I want to help as much as I can - both to ease usability and to reduce the maintenance burden on your part. It will definitely take me a bit to get up to speed with usage and the code, but I’m up for it. 🙂 I really appreciate you taking the time to have this discussion; just in a day I’m picking up quite a bit.

  20. Johannes Köster

    No way to get rid of pandas. It is up to the workflow developer, but it allows reading in sample sheets in a one-liner (e.g. https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/blob/master/rules/common.smk#L10), and also makes it very easy to reason over them inside of the workflow, with database-like operations (e.g. https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/blob/master/rules/common.smk#L34). It would not be worth trying to rebuild that by hand with the csv module. The full snakemake conda package does not really use pandas by itself, but since production workflows usually require it, it is better to include it, such that people using a production workflow can just be told to install the snakemake package via bioconda.
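    The kind of one-liner meant here looks roughly like the following sketch (with a hypothetical in-memory sheet standing in for samples.tsv; the linked workflow’s actual code may differ):

```python
import io
import pandas as pd

# Hypothetical sample sheet; a real workflow reads a samples.tsv file.
sheet = io.StringIO("sample\tcondition\nA\ttreated\nB\tuntreated\n")

# One line to load the sheet, indexed by sample name.
samples = pd.read_csv(sheet, sep="\t").set_index("sample", drop=False)

# Database-like reasoning over the sheet inside the workflow:
treated = samples.loc[samples["condition"] == "treated", "sample"].tolist()
print(treated)  # ['A']
```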

    The DAG plotting in the report requires the Sugiyama layout (a layered approach to rendering acyclic graphs with minimal edge crossings). Without it, complicated DAGs just become unreadable. This is available neither in D3 nor in Vega. Therefore we need to precompute it outside of the report, and the only implementation I know of is in graphviz. Of course, I would be super happy to see a PR from you that removes this dependency by implementing the Sugiyama layout in plain Python :-).
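    To give a flavor of what such a PR would involve: even the first phase of Sugiyama, assigning each node to a layer, already takes a graph traversal, and crossing minimization and coordinate assignment only come after that. A minimal longest-path layering sketch in plain Python, with a hypothetical rule graph (not snakemake’s actual code):

```python
from functools import lru_cache

# Hypothetical rule graph: each rule maps to its downstream rules.
dag = {"all": [], "plot": ["all"], "trim": ["plot"], "download": ["trim"]}

def layers(dag):
    """Longest-path layering: a node's layer is the length of the longest
    chain of predecessors above it (sources end up on layer 0)."""
    @lru_cache(maxsize=None)
    def depth(node):
        preds = [n for n, succs in dag.items() if node in succs]
        return 1 + max((depth(p) for p in preds), default=-1)
    return {node: depth(node) for node in dag}

print(layers(dag))
```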

  21. Johannes Köster

    Thanks for offering help with the migration, that would be great. Maybe we can put this on the list for October (September is full already). I will check out the GitHub importer, but I guess at the least we will lose all the user associations… So ideally I would like to minimize the number of PRs and issues.

    There is indeed a task where you can help, but it is so boring that I almost don’t dare to propose it: basically, I would like to migrate all issues that are feature requests into a GitHub project board. One would need to summarize each of them in a few lines inside the “Proposals” column here: https://github.com/orgs/snakemake/projects/1, link out to the original issue, and close it with a referral to the project board.

    Then we would have only bug reports left, and many of them will be either already solved or just need an update in the docs (a lot of them are caused by misunderstandings or too short error messages). But there it would be best if I went through them myself in one week or so.

  22. v reporter

    I’d be happy to do that, even if I just covered a handful a day it would be done sooner than later!

    I think to optimize that, we’d want to import the repository first; that way the issues are included (and we can link to them directly in the project board). Another approach GitHub offers is Milestones → https://help.github.com/en/articles/about-milestones, which are slightly better integrated than Projects (in my experience, projects typically get set up but then aren’t so useful, since issues / PRs are the primary unit of operation).

    Are you saying to link the GitHub proposal to the Bitbucket issue? Wouldn’t it be faster to just import the repo and have all the issues imported automatically (and then linked from Proposals, if you wanted some different structure there)?

  23. Johannes Köster

    The reason I want to first add the proposals to the board before migrating the repo is that I wanted to do it behind the scenes, without people being confused by two repos. Then the migration could be really fast in the end. However, you also have a good point in that it would be easier from a technical perspective to do it the other way round. Not sure what is best; I will have to think about it.

    Milestones are great, but not every proposal will become a milestone. Some may never happen. However, we still need a place to store them when doing the migration. My idea is that it would be good to have a short entry in the proposal column and a closed issue that is linked. This way, the issues (be it on GitHub or on Bitbucket) are not flooded with (possibly even duplicate) ideas, but just reflect what is currently unsolved or not yet reviewed.

  24. v reporter

    Oh! What about doing the import, and just making the repository private? We can still do all the work that way - and then make it public when the time is right 🙂

  25. Johannes Köster

    Yeah, good idea. That would solve my issue. OK, so then in October I’ll look at the migration script. In case you don’t hear from me about it, please feel free to remind me :-)!

  26. v reporter

    I’m @v of course! 😃

    I’m about to go for a quick run and take out the trash, but I’ll be back in 20 minutes to continue!

  27. v reporter

    Okay, I’m back! Please feel free to add me to the GitHub org and assign me to whatever needs to be done. I’m going to add a GitHub CI recipe to deploy the container to quay.io, and then I’m going to take a crack at a more substantial ENCODE pipeline.
