Skip checking if newer versions of (remote) inputs exist

Issue #332 (new)
Joona Lehtomäki created an issue

Creating a rule that downloads a (potentially large) number of files over a remote and moves them to a directory on the local file system for further processing/analysis is useful for more centralized data management. However, even when the files have already been downloaded locally, checking for a newer version over the remote (or however Snakemake decides whether a file has changed) can take a significant amount of time. This slows down running any other rules that depend on those inputs.

To speed things up, a switch like --ignore-updates could be used to tell Snakemake to skip checking for newer versions of (remote) input files if all the files are already present locally.
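
For illustration, the kind of rule in question looks roughly like this; the HTTP provider setup, rule name, and URL below are placeholders rather than the actual workflow:

from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()

# Placeholder rule: download a remote file once and keep the local copy for later rules.
rule download_data:
    input:
        HTTP.remote("www.example.com/data/{sample}.csv", keep_local=True)
    output:
        "data/{sample}.csv"
    shell:
        "mv {input} {output}"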

Comments (9)

  1. Timothy Booth

    In a recent workflow I had a situation where a rule fetched something from a database and saved it as a file. A second rule collected the files to make an aggregate report. The exact items fetched from the database depended on the input files to be processed. I wanted to add input files into the directory and re-run the workflow, having it fetch extra database items and compile a new report. However, the report is never re-generated once it has been made the first time because Snakemake does not see any newer input files, and if the input files are simply missing they are not recreated.

    The key factor here was that I knew the files would never change once created (immutable input from the database), and they are not very large, so keeping them around is not a problem. However, I wanted any missing input file to trigger a re-run of the rule that depends on it, and there is actually no clean way to do this.

    Possible workarounds included manually deleting the report each time and forcing it to be regenerated, or using an empty file as a dummy input to the rule and remembering to touch it before re-running Snakemake. You could also work around your own problem: e.g. rather than specifying the files as remote dependencies, you could make a generator rule that downloads the file, like so:

    rule fetch_file:
        output: "{foo}.pdf"
        shell: "wget -O {output} http://my.server.net/{output}"
    

    This will avoid all scanning for newer files, but then you'll quite possibly run into the same problem as me, which (finally) brings me to my point.

    I'd like to advocate adding a static() marker for output files, in the same way as "protected()" and "temp()". The semantics would be as follows:

    • A static output is always re-created if it is missing (a missing file is seen as out-of-date)
    • A static output is never re-created if it is present (an existing file is always considered up to date)

    I think this would interact logically with remote files in order to address both our problems. On the rule that downloads the remote files, you would simply mark the output of the rule as static. Then, if the output already existed, the input would not be checked and no remote call would be made; but if the output was missing, the remote file would always be fetched and stored.

    from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
    HTTP = HTTPRemoteProvider()

    rule fetch_file:
        output:
            # static() is the marker proposed in this comment, not an existing keyword
            static("{foo}.pdf")
        input:
            HTTP.remote("www.example.com/path/to/{foo}.pdf")
        shell:
            "mv {input} {output}"
    

    Do you think that would address the problem you have, or do you need a command-line switch to toggle between using the local files and checking for updates?

    TIM

  2. Joona Lehtomäki reporter

    Tim, thanks a lot for your explanation and the idea. It sounds like your problem is related, and from what I can see, having static() would fix my issue. I also very much like the idea conceptually because, as you say, it would nicely follow the pattern of protected() and temp().

    While in my case upstream updates to the data are rare, they can happen. In such an event it would be great to force an update, e.g. using a command-line switch. An alternative would be to just remove all local files when I know for sure that the data has been updated, but it is also conceivable that only some of the remote input files have changed. There is already a CL switch --notemp; would the same approach work for static(), i.e. having --nostatic to ignore static() declarations?

  3. Johannes Köster

    I like the static approach. To handle the case of rare updates, one could further add a parameter to static, e.g. interval=14, specifying that modification dates should only be checked if the marked file is older than 14 days.

  4. Timothy Booth

    And of course, if you use interval=config.get('interval') in your Snakefile, you could set --config interval=0 on the command line, and there you have your desired CL switch for free. :-)
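
    Putting the two suggestions together, the usage would look roughly like this; note that static() and its interval parameter are the feature being proposed here, not existing Snakemake syntax, and the provider setup and URL are placeholders:

    from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
    HTTP = HTTPRemoteProvider()

    rule fetch_file:
        output:
            # static() and interval= are proposed features; config.get() falls back
            # to 14 days unless --config interval=... is given on the command line.
            static("{foo}.pdf", interval=config.get("interval", 14))
        input:
            HTTP.remote("www.example.com/path/to/{foo}.pdf")
        shell:
            "mv {input} {output}"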

  5. Elmar Pruesse

    How about making ancient() work on remote objects first? Often enough (say, reference data), things hosted at remote URIs are unchangeable (or would receive a new URI if changed), so it really is only a question of whether the file is present or not.
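
    In a Snakefile that would look roughly like this; ancient() exists today for local inputs, and whether it also suppresses the remote check when wrapped around a remote object is exactly the open question (the provider setup and URL are placeholders):

    from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
    HTTP = HTTPRemoteProvider()

    rule fetch_reference:
        input:
            # ancient() tells Snakemake to ignore modification times for this input;
            # the question is whether that also skips the remote check during DAG building.
            ancient(HTTP.remote("www.example.com/ref/{name}.fasta", keep_local=True))
        output:
            "ref/{name}.fasta"
        shell:
            "mv {input} {output}"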

  6. Cedric Laczny

    I tried ancient(HTTP.remote()) with snakemake-5.2.2, but to no avail. It still seemed to check whether the remote files were newer than the local files, and was thus hanging/taking a very long time at the "Building DAG of jobs..." step.

    Using HTTP.remote(MYURI, keep_local=True, static=True), i.e. adding the static=True flag, seemed to solve this: the DAG was quickly built and resolved, so the actual jobs could be printed in a dry run (snakemake -np).
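
    In full, the pattern that worked looked roughly like this (MYURI and the rule/file names stand in for the actual remote path and targets):

    from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
    HTTP = HTTPRemoteProvider()

    MYURI = "www.example.com/path/to/data.tsv"  # placeholder for the actual remote path

    rule fetch_data:
        input:
            # With static=True the remote file is treated as unchanging, so DAG building
            # no longer waited on remote checks here; keep_local=True keeps the download.
            HTTP.remote(MYURI, keep_local=True, static=True)
        output:
            "data/data.tsv"
        shell:
            "mv {input} {output}"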

    Hence my question: is static=True the suggested way to specify that remote files should not be checked but simply downloaded once?

    Or what have I missed in using ancient()?

    TIA!

    ADDENDUM: I should add that I found the use of the static-flag in https://github.com/snakemake-workflows/ngs-test-data/blob/master/Snakefile, but sadly no documentation at https://snakemake.readthedocs.io/en/stable/snakefiles/remote_files.html.
