Requesting checkpoint output in rule all results in different output

Issue #1284 resolved
paul created an issue

I used the current master (5.6.0+5.g9030b9b5) to reproduce this bug.

I have a workflow with two checkpoints. At the end, data is summarized using the results of both checkpoints. If the pipeline works correctly, the output file should read “Output as requested”. This works well if the first rule only requests the final output file; all intermediate steps run as expected. If I additionally request the output files of the checkpoints in the first rule (set the variable producerror to True), the last rule does not use the files returned by its input function; instead it uses the path of the last checkpoint’s output as its input.

I hope the problem is understandable. Here is a minimal non-working example.

Reproduce by running the workflow and looking at `data/output.txt`. Change the variable producerror to False to get the desired output.

producerror = True
A = ["data/output.txt"]
B = ["data/names/table.csv",
    "data/InputFile/table.csv"]

if producerror:
    A.extend(B)


rule all:
    input:
        A



rule makeInput:
    output:
        "data/input/{name}/file.txt"
    shell:
        """
        touch {output}
        """

checkpoint InputFile:
    output:
        "data/InputFile/table.csv"
    shell:
        """
        touch {output}
        """

def aggregateInput(wc):
    # the get() call makes this rule depend on the InputFile checkpoint
    checkpoint_output = checkpoints.InputFile.get(**wc).output[0]

    n = ['manuel']
    hits = expand("data/input/{name}/file.txt", name=n)
    return hits

checkpoint defineNames:
    input:
        aggregateInput
    output:
        "data/names/table.csv"
    shell:
        """
        echo "ERROR\nTHIS SHOULD NOT BE THE OUTPUT" > {output}
        """

def requestfiles(wc):
    # the get() call makes summarize depend on the defineNames checkpoint
    checkpoints.defineNames.get().output[0]
    f = expand("data/requested/{name}.txt", name=['Output'])
    return f

rule createFiles:
    output:
        "data/requested/{name}.txt"
    shell:
        """
        echo "{wildcards.name} as requested" >> {output}
        """

rule summarize:
    input:
        requestfiles
    output:
        'data/output.txt'
    shell:
        """
        cat {input} >  {output}
        cat {output}
        """

Comments (12)

  1. Troy Comi

    That is quite the puzzle! It appears that the input to summarize changes even though the return value from requestfiles stays the same. I added a print statement right before the return of requestfiles (the ‘returning …’ line below).

    In the case of producerror == False, after defineNames runs, the log looks like:

    Updating job 1.
    returning ['data/requested/Output.txt']
    [Wed Sep 18 14:40:38 2019]
    Finished job 4.
    3 of 6 steps (50%) done
    
    [Wed Sep 18 14:40:38 2019]
    rule createFiles:
        output: data/requested/Output.txt
        jobid: 7
        wildcards: name=Output
    
    [Wed Sep 18 14:40:38 2019]
    Finished job 7.
    4 of 6 steps (67%) done
    
    [Wed Sep 18 14:40:38 2019]
    rule summarize:
        input: data/requested/Output.txt
        output: data/output.txt
        jobid: 6
    

    Note that job 1 (summarize) now has an id of 6 and the correct input. The same log region with producerror == True gives:

    Updating job 1.
    returning ['data/requested/Output.txt']
    Updating job 0.
    [Wed Sep 18 14:40:07 2019]
    Finished job 4.
    3 of 5 steps (60%) done
    
    [Wed Sep 18 14:40:07 2019]
    rule summarize:
        input: <unknown>
        output: data/output.txt
        jobid: 1
    

    Because the rule all depends on the checkpoints, their completion triggers it to update (job 0 above). Since all also depends on summarize, it appears that the updated summarize job (which would be job 6) is overwritten by the update of all. In particular, the jobid of summarize is still 1 and the input isn’t set. I’m not sure whether there needs to be a change in the order of evaluation following checkpoints or an additional check of dependencies. It’s also puzzling how the checkpoint output is passed to the summarize rule.

    You can hack around it in this case by changing the summarize rule to:

    rule summarize:
        input:
            'data/requested/Output.txt',
            requestfiles
        output:
            'data/output.txt'
        shell:
            """
            cat {input[0]} >  {output}
            cat {output}
            """
    

    That way, even if the input doesn’t update, you still get the Output.txt path for the shell command. This is not useful in cases where the checkpoint output is actually used, but it may help here.

    Can I ask what the actual usage is? Why do you need to explicitly specify outputs of checkpoints in all if they are implicitly created through dependencies?

  2. paul reporter

    I stumbled across this bug while developing a pipeline: a rule was changed into a checkpoint, and I did not remove the request for the checkpoint output from the first rule. So of course you can just remove the request, which is what I did in my pipeline; that fixed some unwanted behaviour.

    Nevertheless, I think this is unwanted behaviour, and it cost me the better part of a week to figure out. I’m glad someone could reproduce and confirm it.

    In my real workflow I of course used the output from the checkpoints, and the checkpoints also did a little more. So the workaround fixes this particular toy example but not the bug.

  3. Troy Comi

    I don’t know enough about the source code to pinpoint the issue, but here’s a smaller (clearer?) example Snakefile:

    rule all:
        input:
            'result.r1',
            'result.c1',  # targeting checkpoint output overrides input
            'result.c2'
    
    
    def rule1_input(wc):
        checkpoints.check1.get(**wc).output[0]
        return "result.r2"
    
    rule r1:
        input:
            rule1_input
        output:
            'result.r1',
        shell:
            'echo {input} > {output}'
    
    
    rule r2:
        output:
            'result.r2'
        shell:
            'echo r2 > {output}'
    
    
    def check1_input(wc):
        checkpoints.check2.get(**wc).output[0]
        return 'result.r3'
    
    checkpoint check1:
        input:
            check1_input
        output:
            'result.c1'
        shell:
            'echo c1 > {output}'
    
    
    checkpoint check2:
        output:
            'result.c2'
        shell:
            'echo c2 > {output}'
    
    
    rule r3:
        output:
            'result.r3'
        shell:
            'echo r3 > {output}'
    

    Here the content of result.r1 should be “result.r2” but is instead “result.c1”. The expected results are:

    ==> result.c1 <==
    c1
    
    ==> result.c2 <==
    c2
    
    ==> result.r1 <==
    result.r2
    
    ==> result.r2 <==
    r2
    
    ==> result.r3 <==
    r3
    

    I also noticed that targeting only one of the two checkpoints produces the correct output, so it’s an issue of a rule depending on more than one checkpoint output, rather than my initial guess. If someone can point me to the correct region of code, I’d work on a PR; it seems like either postprocess or update_checkpoint in dag.py, but it could also be where jobs are copied during updates.

  4. paul reporter

    Thanks for the better bug description. I am now digging in a bit, but it’s my first time looking at the source code of Snakemake, so it might take a while until I am successful. Are there any tools or established approaches for debugging Snakemake?

  5. paul reporter

    Ok, I am getting places.

    When the initial DAG is created, the input file for rule r1 is set to the output file of checkpoint check1 (result.c1). This is wrong, as it is the checkpoint’s output and not what the input function would return, but fair enough.

    Later on, this is usually replaced by the result of evaluating the input function. However, if a rule higher up, e.g. all, is re-evaluated, this seems to trigger the insertion of “result.c1” again, after the correct value “result.r2” was briefly stored in the job.input slot.

    After “Updating job 1.” the input is set to result.r2; after “Updating job 0.” the input reverts to result.c1, which is ultimately wrong.

    I debugged this using pdb and print statements.
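
    For reference, one way to do this (a sketch assuming the 5.x Python API; the default target, rule all, is used):

    # drive Snakemake through its Python API under pdb, so that breakpoints
    # can be set in snakemake/dag.py or snakemake/rules.py before continuing
    import pdb
    from snakemake import snakemake  # Python API entry point in the 5.x series

    pdb.run("snakemake('Snakefile')", globals())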

    I think the problem is the first placement of the wrong input file: the initial DAG formation should not insert the output of the checkpoint dependency, or it should at least set a flag indicating that this is not the result of evaluating the input function.

    edit:

    This seems to be what’s happening, but I can’t figure out where the job is modified at the moment. It seems to happen around the _apply_wildcards function. I will try to look into this again tomorrow.

    By the way, the bug is still present at commit 89bc383f.

  6. paul reporter

    I could narrow it down to this function in rules.py:

    def handle_incomplete_checkpoint(exception):
        """If checkpoint is incomplete, target it such that it is completed
        before this rule gets executed."""
        return exception.targetfile
    

    During the initial DAG creation, this function returns the output file of the checkpoint being depended on, which leads to the creation of the dependency (I think). But when rule all is updated, this behaviour overwrites the updated input values of rule r1.

    Edit:

    This is not the case; it does not overwrite it there. But this is where the initial value for the input files is created.
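
    To illustrate the mechanism (a simplified sketch: IncompleteCheckpointException is a real exception in the snakemake source, while evaluate_input is a hypothetical stand-in for the surrounding logic in rules.py):

    # simplified, hypothetical sketch of the exception-driven protocol around
    # handle_incomplete_checkpoint (quoted above)
    class IncompleteCheckpointException(Exception):
        def __init__(self, rule, targetfile):
            self.rule = rule
            self.targetfile = targetfile  # checkpoint output that must exist first

    def handle_incomplete_checkpoint(exception):
        # as in rules.py: depend on the checkpoint output itself for now
        return exception.targetfile

    def evaluate_input(input_func, wildcards):
        # checkpoints.<name>.get() raises the exception as long as the
        # checkpoint has not run; the caller substitutes the checkpoint's own
        # output as a provisional input, creating the dependency edge
        try:
            return input_func(wildcards)
        except IncompleteCheckpointException as e:
            # provisional value ("result.c1" in the example above) that must
            # later be replaced by the real result of the input function
            return [handle_incomplete_checkpoint(e)]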

  7. paul reporter

    I found a workaround/bugfix:

    removing the cache lookup in new_job in dag.py:

    def new_job(self, rule, targetfile=None, format_wildcards=None):
        """Create new job for given rule and (optional) targetfile.
        This will reuse existing jobs with the same wildcards."""
        key = (rule, targetfile)
        #if key in self.job_cache:
        #    assert targetfile is not None
        #    return self.job_cache[key]
        wildcards_dict = rule.get_wildcards(targetfile)
        job = Job(rule, self, wildcards_dict=wildcards_dict,
                  format_wildcards=format_wildcards, targetfile=targetfile)
        for f in job.products:
            self.job_cache[(rule, f)] = job
        return job
    

    results in correct execution.

    So it has something to do with caching.

    Edit:

    While this workaround fixes the issue in this example, in bigger workflows the checkpoint feature still seems to exhibit the bug.

  8. Johannes Köster

    Thanks a lot, guys! Your detective work helped a lot in finding the right place to fix this. The solution is to also remove jobs from the cache when they are removed from the DAG (which is what happens to the <TBD> jobs after a checkpoint is completed). I have fixed this in the master branch. I am really happy that Snakemake has users like you, who dedicate so much work to finding and understanding such corner cases.
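
    A sketch of what such a fix could look like (a hypothetical simplification; the actual change is in dag.py on the master branch):

    # when a job is removed from the DAG (which happens when a checkpoint
    # completes and its dependents are re-evaluated), also evict it from the
    # job cache so that new_job() cannot hand back the stale job with its
    # provisional pre-checkpoint input
    def delete_job(self, job):
        # ... existing removal of the job from the DAG's dependency maps ...
        for f in job.products:
            self.job_cache.pop((job.rule, f), None)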
