Rule with folder as input and output

Create issue
Issue #961 invalid
Joerg created an issue

Hello, I am new to Snakemake and sorry for the newby question. I want to use a tool (parsnp from HarvestSuite) which takes as input a folder holding a number of assembly and output is as well a folder with a number of different files.

For me its unclear how to produce a rule for such a shell command as of course snakemake does not know how to produce output from input. So far, I have a pipeline which creates assemblies and copies them in one folder but my rule for parsnp is not working (please ignore the kraken rule which is currently not working):

rule spades:
    input:
        fw= lambda wildcards: expand("{sample}", sample=config["samples"][wildcards.sample]["fw"]),
        rv= lambda wildcards: expand("{sample}", sample=config["samples"][wildcards.sample]["rv"])
    output:
        "{sample}/Assembly"
    threads: 32
    benchmark:
        "benchmarks/{sample}/spades.txt"
    log:
        "Log/{sample}/spades.log"
    shell:
        "spades.py -t {threads} \
            --pe1-1 {input.fw} \
            --pe1-2 {input.rv} \
            --careful \
            -o {output} &> {log}"

"""
Filtering contigs based on the coverage and length. Using in-house script.
"""
rule filtering:
    input:
        fasta="{sample}/Assembly"
    output:
        filtered="{sample}/Filtered/contigs.fasta"
    log:
        "Log/{sample}/filter_assembly.log"
    benchmark:
        "benchmarks/{sample}/filter_assembly.txt"
    shell:
        "spades_cutoff.py {input.fasta}/contigs.fasta 5 500 {output.filtered} &>  {log}"

"""
NOT WORKING CORRECTLY
Extract specific contigs from kraken results.
use taxomy from config.yaml
"""
rule extractKraken:
    input:
        kraken="{sample}/Kraken/classified.kraken",
        fasta="{sample}/Filtered/contigs.fasta",
        translate="{sample}/Kraken/all_translated.txt"
    output:
        protected("{sample}/FinalAssembly/{sample}.fasta")
    #params:
    #    taxonomy = lambda wildcards: expand("{sample}", sample=config["samples"][wildcards.sample]["taxonomy"])
    shell:
        "cp {input.kraken} {output}"

rule prepare_Assemblyfolder:
    input:
        contig="{sample}/FinalAssembly/{sample}.fasta",
    output:
        assembly="Assemblies/{sample}.fasta"
    log:
        "Log/Assembly_preperation.log"
    shell:
        "cp {input.contig}  {output.assembly} 2> {log}"

"""
Needs revision: Set -r flag to correct reference
"""
rule parsnp:
    input:
        input="Assemblies/"
    output:
        directory("Parsnp/")
    threads:
        8
    benchmark:
        "benchmarks/parsnp.txt"
    log:
        "Log/parsnp.log"
    shell:
        "parsnp -r ! -d {input} -p 32  -o {output} &> {log}"

Snakemake Version 4.3.1

Comments (13)

  1. Rasmus Ågren

    The issue tracker is meant for bug reports, feature requests and so no, and maybe not for questions. In the future, please consider asking on Stack Overflow instead. This also makes it easier for others to profit from your questions and the answers.

    Anyways, you need to have a rule that aggregates the assemblies for the different samples and prepares the directory you need for parsnp.

    rule prepare_Assemblyfolder:
        input:
            # Assuming this returns a list of your samples
            contigs=expand("{sample}/FinalAssembly/{sample}.fasta", sample=config["samples"])
        output:
            # Don't use the trailing "/" for directories in your rules
            assembly=directory("Assemblies")
        run:
            for contig in input.contigs:
                # Better to symlink than to copy to save some space
                shell("ln -s {contig} {output.assembly}")
    
  2. Rasmus Ågren

    You will also need to update Snakemake if you want to use the directory keyword since it's quite new. That in turn will lead to some error regarding how you name your log files, because that's stricter in newer versions.

  3. Joerg reporter

    Dear Rasmus, I am very very sorry. You are absolutely right: this is the wrong place. Please delete my request. I will move to Stackoverflow. Again strong apologizes for the inconvenience.

    best

  4. Rasmus Ågren

    No worries! It's no harm having it here, it's just that others with similar problems are less likely to find it. You could try the solution I posted above before reposting at Stack Overflow.

  5. Joerg reporter

    Dear Rasmus. OK if its fine for you to have it here, I am also fine. Thank you very much for your support. I updated to snakemake 5.2.2. and adapted creating of log-files. For me the rule prepare_Assemblyfolder is fine. While your is way more elegant both version (tested them) produce a folder containing all Assemblies. MY problem of understanding is the rule parsnp. The input of the rule is the folder containing Assemblies. The output is a folder as well. The rule parsnap should of course only start after all assembly-files are there (or in case one changes). I have no idea how to implement and start a rule which has a folder as input and as output.

  6. Rasmus Ågren

    It should work as it is. If any of the assemblies are changed then Snakemake will rerun the prepare_Assemblyfolder. It will then first remove the Assemblies directory, create a new one, and then symlink all your assemblies to it (if you used my example above). Snakemake then detects that the input to parsnp has changed so it will rerun that as well. It's no difference compared to if a file was used as input/output. You write input: input = ... while you refer only to {input} in the shell command. That is not correct, but maybe a typo?

    Your version won't work because the output of prepare_Assemblyfolder isn't the same as the input of parsnp. Snakemake won't understand that the fasta files are in the input directory of parsnp. You have to do it like in my example.

  7. Joerg reporter

    Somehow, I am lost now. When I use >snakemake parsnp using your version, I get an error saying:

    [Fri Sep  7 13:17:47 2018]
    rule prepare_Assemblyfolder:
        input: 09T0179/FinalAssembly/09T0179.fasta, 10T0192/FinalAssembly/10T0192.fasta, 10T0193/FinalAssembly/10T0193.fasta, 11T0309/FinalAssembly/11T0309.fasta, 12T0002/FinalAssembly/12T0002.fasta, 12T0050/FinalAssembly/12T0050.fasta, 12T0062/FinalAssembly/12T0062.fasta, 15T0012/FinalAssembly/15T0012.fasta, 15T0013/FinalAssembly/15T0013.fasta, 15T0014/FinalAssembly/15T0014.fasta, 15T0016/FinalAssembly/15T0016.fasta, 15T0031/FinalAssembly/15T0031.fasta, 15T0085/FinalAssembly/15T0085.fasta, 15T0086/FinalAssembly/15T0086.fasta
        output: Assemblies
        jobid: 1
    
    Job counts:
            count   jobs
            1       prepare_Assemblyfolder
            1
    ln: failed to create symbolic link Assemblies: File exists
        [Fri Sep  7 13:17:48 2018]
        Error in rule prepare_Assemblyfolder:
            jobid: 0
            output: Assemblies
    
    RuleException:
    CalledProcessError in line 198 of /home/joerg/Snake_Francisella/Snakefile:
    Command ' set -euo pipefail;  ln -s 10T0192/FinalAssembly/10T0192.fasta  Assemblies ' returned non-zero exit status 1.
      File "/home/joerg/Snake_Francisella/Snakefile", line 198, in __rule_prepare_Assemblyfolder
      File "/home/software/miniconda3/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    Exiting because a job execution failed. Look above for error message
    Shutting down, this might take some time.
    Exiting because a job execution failed. Look above for error message
    

    For me it looks like snakemake creates a link called Assemblies which points to the first assembly, instead of creating a folder Assemblies with all the links.

  8. Rasmus Ågren

    Oops sorry, wrong order of arguments to ln. Should have been ln -s {contig} {output.assembly} (changed in code above as well).

  9. Joerg reporter

    I am really sorry to bother you again but I still don't get it to work. I have tried different version, including copy instead of ln. Argument order of ln is correct, but still same problem. It seems like the for loop creates a file (no folder) called Assemblies and links the first contig. Then it tries to link the next contig but fails as the file already exists (see error above "ln: failed to create symbolic link ‘Assemblies’: File exists").

  10. Rasmus Ågren

    Sorry, I wrote that without testing. Your problem came from that the Assemblies directory has to be created first. This is how the whole thing should look:

    rule spades:
        output:
            touch(directory("{sample}/Assembly"))
    
    rule filtering:
        input:
            fasta="{sample}/Assembly"
        output:
            filtered=touch("{sample}/Filtered/contigs.fasta")
    
    rule extractKraken:
        input:
            fasta="{sample}/Filtered/contigs.fasta",
        output:
            touch("{sample}/FinalAssembly/{sample}.fasta")
    
    rule prepare_Assemblyfolder:
        input:
            contigs=expand("{sample}/FinalAssembly/{sample}.fasta", sample=["sampleA", "sampleB"])
        output:
            assembly=directory("Assemblies")
        run:
            os.makedirs(output.assembly)
            for contig in input.contigs:
                # Better to symlink than to copy to save some space
                # Better to make relative links, fix if you want to
                abspath = os.path.abspath(contig)
                shell("ln -s {abspath} {output.assembly}")
    
    rule parsnp:
        input:
            input="Assemblies"
        output:
            touch(directory("Parsnp"))
    
  11. Joerg reporter

    That's it Working like a charm. I've learnt a lot: snakemake handles folders as in/output just as it handles files directory command * We need to create directories before using

    Thank you very much and next time I will go to StackOverflow

  12. Log in to comment