Recursion bug and unclear documentation with data-dependent conditional execution (checkpoints)

Issue #1133
Casper Camiel van Mourik created an issue

Consider the following example from the docs (data-dependent conditional execution):

# a target rule to define the desired final output
rule all:
    input:
        "aggregated/a.txt",
        "aggregated/b.txt"


# the checkpoint that shall trigger re-evaluation of the DAG
checkpoint clustering:
    input:
        "samples/{sample}.txt"
    output:
        clusters=directory("clustering/{sample}")
    shell:
        "mkdir clustering/{wildcards.sample}; "
        "for i in 1 2 3; do echo $i > clustering/{wildcards.sample}/$i.txt; done"


# an intermediate rule
rule intermediate:
    input:
        "clustering/{sample}/{i}.txt"
    output:
        "post/{sample}/{i}.txt"
    shell:
        "cp {input} {output}"


def aggregate_input(wildcards):
    checkpoint_output = checkpoints.clustering.get(**wildcards).output[0]
    return expand("post/{sample}/{i}.txt",
           sample=wildcards.sample,
           i=glob_wildcards(os.path.join(checkpoint_output, "{i}.txt")).i)


# an aggregation over all produced clusters
rule aggregate:
    input:
        aggregate_input
    output:
        "aggregated/{sample}.txt"
    shell:
        "cat {input} > {output}"

If a user tries to run this example without the aggregated directory, Snakemake goes into infinite recursion when the samples folder is under the same directory. In my opinion this is quite unclear, because as specified in rule all there are only two samples, a.txt and b.txt, and you would expect the wildcards to be derived from that.

The way I came across this bug was by working through the example by hand to understand it better, and this was something I missed.

Example of when it goes wrong:

# a target rule to define the desired final output
rule all:
    input:
        "a.txt",
        "b.txt"


# the checkpoint that shall trigger re-evaluation of the DAG
checkpoint clustering:
    input:
        "samples/{sample}.txt"
    output:
        clusters=directory("clustering/{sample}")
    shell:
        "mkdir clustering/{wildcards.sample}; "
        "for i in 1 2 3; do echo $i > clustering/{wildcards.sample}/$i.txt; done"


# an intermediate rule
rule intermediate:
    input:
        "clustering/{sample}/{i}.txt"
    output:
        "post/{sample}/{i}.txt"
    shell:
        "cp {input} {output}"


def aggregate_input(wildcards):
    checkpoint_output = checkpoints.clustering.get(**wildcards).output[0]
    return expand("post/{sample}/{i}.txt",
           sample=wildcards.sample,
           i=glob_wildcards(os.path.join(checkpoint_output, "{i}.txt")).i)


# an aggregation over all produced clusters
rule aggregate:
    input:
        aggregate_input
    output:
        "{sample}.txt"
    shell:
        "cat {input} > {output}"

This throws the following error:

RecursionError: maximum recursion depth exceeded while calling a Python object
Wildcards: sample=samples/samples/samples/samples/.../samples/a
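The runaway wildcard can be reproduced outside Snakemake. A minimal sketch, assuming Snakemake's default wildcard regex `.+` (the `wildcard_constraints` directive mentioned below is a real Snakemake feature; the compiled regexes here are illustrative, not taken from Snakemake's source):

```python
import re

# Snakemake compiles each wildcard with the default regex ".+", which
# also matches "/". The output pattern "{sample}.txt" therefore becomes:
default = re.compile(r"(?P<sample>.+)\.txt")

# "samples/a.txt" (the checkpoint's *input*) matches this output pattern,
# so Snakemake treats rule aggregate as a producer of its own upstream
# input: sample="samples/a" -> needs "samples/samples/a.txt" -> matches
# again -> unbounded recursion.
print(default.fullmatch("samples/a.txt").group("sample"))  # samples/a

# With a wildcard_constraints entry such as sample="[^/]+", the compiled
# pattern can no longer match paths containing "/", breaking the loop:
constrained = re.compile(r"(?P<sample>[^/]+)\.txt")
print(constrained.fullmatch("samples/a.txt"))              # None
print(constrained.fullmatch("a.txt").group("sample"))      # a
```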

Besides this, I've found the docs hard to understand on this specific topic.

For instance, it isn't explicitly noted that the checkpoint is considered finished as soon as the output directory is created. This means that if you use the run: directive and create the directory first, you get a pretty confusing failure mode: no output files are produced and everything hangs (cat does not get any input).
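To illustrate the pitfall, here is a hypothetical run: variant of the checkpoint (not from the docs, and assuming `import os` at the top of the Snakefile). If the checkpoint is indeed considered finished as soon as the directory output exists, any failure after the makedirs call leaves behind an empty directory that aggregate_input silently expands to an empty input list:

```
# Hypothetical run: variant of the checkpoint above, illustrating the
# reported pitfall: the directory output is created up front, so (per
# the behaviour described above) the checkpoint already looks
# "finished" before any cluster files have been written.
checkpoint clustering:
    input:
        "samples/{sample}.txt"
    output:
        clusters=directory("clustering/{sample}")
    run:
        os.makedirs(output.clusters)  # output now exists on disk
        # If anything below fails, glob_wildcards in aggregate_input
        # sees an empty directory: aggregate then receives an empty
        # input list and "cat" gets no input files.
        for i in (1, 2, 3):
            with open(os.path.join(output.clusters, f"{i}.txt"), "w") as fh:
                fh.write(f"{i}\n")
```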

This might be considered an extension of issue #1075.

I encountered these bugs in Snakemake 5.4.0.

Thanks in advance for taking into account my feedback.

Comments (3)

  1. Chang Ye
    • an intermediate step is necessary, even if you do not actually need one
    • aggregation can only be used at one level. In the example above, if you want to aggregate all {sample}/{i}.txt files and output a report.txt file, some weird error will occur.

    The most important problem is that none of these errors are documented and the error messages are unclear.
