Possible race condition with very short jobs in virtualized host attached to NAS

Issue #1157 new
Michele Vidotto created an issue

Snakemake version 5.4.2

I'm using Snakemake to process many small files (~6000) on a virtualized host running on a VMware vSphere Hypervisor, attached to an Isilon NAS where the files are stored.

Among many other rules, the following one is executed through the shell function:

shell:
    """
    mkdir -p $(dirname "{output.fastqc}") {output.tmp} \
    && fastqc \
    --outdir $(dirname $(dirname "{output.fastqc}")) \
    --casava {input.fastq_gz} \
    --dir {output.tmp} \
    --nogroup \
    --extract \
    --quiet \
    """

For one of the jobs, this expands to:

mkdir -p $(dirname "//data/aut-lfgt/fastqc/37004_ID1222_GenTree-SPET-Pnigra-2-RUPN1822-G12_S190_L001_R2_fastqc/fastqc_data.txt") //data/aut-lfgt/fastqc/37004_ID1222_GenTree-SPET-Pnigra-2-RUPN1822-G12_S190_L001_R2_tmp && fastqc --outdir $(dirname $(dirname "//data/aut-lfgt/fastqc/37004_ID1222_GenTree-SPET-Pnigra-2-RUPN1822-G12_S190_L001_R2_fastqc/fastqc_data.txt")) --casava //data/aut-lfgt/37004_ID1222_GenTree-SPET-Pnigra-2-RUPN1822-G12_S190_L001_R2_001.fastq.gz --dir //data/aut-lfgt/fastqc/37004_ID1222_GenTree-SPET-Pnigra-2-RUPN1822-G12_S190_L001_R2_tmp --nogroup --extract --quiet

Sometimes I get the following error:

Error in rule run_fastqc:
    jobid: 3773
    output: /illumina-spool/FC1113_B/default/I8N6I8_nomask-1/cnr-ibbr-firenze/fastqc/37027_ID1223_TaxusCNR1-EsTB0206-G03_S213_L004_R1_fastqc/fastqc_data.txt, /illumina-spool/FC1113_B/default/I8N6I8_nomask-1/cnr-ibbr-firenze/fastqc/37027_ID1223_TaxusCNR1-EsTB0206-G03_S213_L004_R1_tmp

RuleException:
ValueError in line 352 of /pipelines/live/illumina-cleaning_and_qc/Snakefile:
not enough values to unpack (expected 2, got 1)
  File "/pipelines/live/illumina-cleaning_and_qc/Snakefile", line 352, in __rule_run_fastqc
  File "/pipelines/miniconda3/envs/illumina-cleaning_and_qc/lib/python3.5/contextlib.py", line 59, in __enter__
  File "/pipelines/miniconda3/envs/illumina-cleaning_and_qc/lib/python3.5/site-packages/psutil/__init__.py", line 750, in io_counters
  File "/pipelines/miniconda3/envs/illumina-cleaning_and_qc/lib/python3.5/site-packages/psutil/_pslinux.py", line 1459, in wrapper
  File "/pipelines/miniconda3/envs/illumina-cleaning_and_qc/lib/python3.5/site-packages/psutil/_pslinux.py", line 1604, in io_counters
  File "/pipelines/miniconda3/envs/illumina-cleaning_and_qc/lib/python3.5/concurrent/futures/thread.py", line 55, in run

Other times the rule fails with:

Snakemake fails with: FileNotFoundError: [Errno 2] No such file or directory: /proc/xxx/stat
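This FileNotFoundError is consistent with a general race on very short-lived jobs: anything that polls /proc/<pid> for statistics (as psutil does in the tracebacks above) can find the entry already gone once the child has exited and been reaped. A minimal shell illustration of the effect (not pipeline code; any short command would do):

```shell
true &                       # start a job that finishes almost instantly
pid=$!
wait "$pid"                  # once the child is reaped, its /proc entry disappears
if ! cat "/proc/$pid/stat" >/dev/null 2>&1; then
    echo "/proc/$pid/stat is already gone"
fi
```

The same read issued a moment earlier, while the process was still running, would have succeeded, which is why only the shortest jobs trip over it.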

With another rule executed through the shell function:

shell:
    """
    {params.fastqc_flagger} \
    --fastqc {input.fastqc} \
    2>{output.fastqc_flag} \
    3>&1 \
    1>&2 \
    2>&3 \
    | tee {log} \
    """

For one of the jobs, this expands to:

/illumina-cleaning_and_qc/scripts/fastqc_flagger.py --fastqc /data/I10-I10-1/inra-cnrs-agpf/fastqc/34925_ID781_IT-PO-17-17_S134_L001_R1_fastqc/fastqc_data.txt 2>/data/I10-I10-1/inra-cnrs-agpf/fastqc/34925_ID781_IT-PO-17-17_S134_L001_R1_fastqc/fastqc_data_flags.txt 3>&1 1>&2 2>&3 | tee /igatech/illumina-spool/logs/FC1110/run_fastqc_flag.I10-I10-1.inra-cnrs-agpf.34925_ID781_IT-PO-17-17_S134_L001_R1.fastqc_flag.log
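For context, the `2>{file} 3>&1 1>&2 2>&3 | tee` idiom in this rule first points stderr at the flag file and then swaps the two streams using fd 3 as a temporary, so the script's stdout lands in the flag file while its stderr flows through `tee` into the log. A standalone sketch of the swap:

```shell
# Swap stdout and stderr using fd 3 as a temporary:
#   3>&1  save the current stdout in fd 3
#   1>&2  point stdout at wherever stderr currently goes
#   2>&3  point stderr at the saved stdout (fd 3)
swapped=$({ { echo out; echo err >&2; } 3>&1 1>&2 2>&3; } 2>/dev/null)
echo "$swapped"    # prints "err": the group's stderr was captured as stdout
```

Nothing in this construct should hang, but it does mean the Python subprocess machinery is driving a pipeline, which is where the thread/fd errors below are raised.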

Sometimes I get the following error:

RuleException:
BlockingIOError in line 391 of /pipelines/live/illumina-cleaning_and_qc/Snakefile:
[Errno 11] Resource temporarily unavailable
  File "/pipelines/live/illumina-cleaning_and_qc/Snakefile", line 391, in __rule_run_fastqc_flag
  File "/pipelines/miniconda3/envs/illumina-cleaning_and_qc/lib/python3.5/subprocess.py", line 950, in __init__
  File "/pipelines/miniconda3/envs/illumina-cleaning_and_qc/lib/python3.5/subprocess.py", line 1483, in _execute_child
  File "/pipelines/miniconda3/envs/illumina-cleaning_and_qc/lib/python3.5/concurrent/futures/thread.py", line 55, in run

Error in rule run_fastqc_flag:

RuleException:
RuntimeError in line 391 of /pipelines/live/illumina-cleaning_and_qc/Snakefile:
can't start new thread
  File "/pipelines/miniconda3/envs/illumina-cleaning_and_qc/lib/python3.5/contextlib.py", line 77, in __exit__
  File "/pipelines/live/illumina-cleaning_and_qc/Snakefile", line 391, in __rule_run_fastqc_flag
  File "/pipelines/miniconda3/envs/illumina-cleaning_and_qc/lib/python3.5/contextlib.py", line 59, in __enter__
  File "/pipelines/miniconda3/envs/illumina-cleaning_and_qc/lib/python3.5/threading.py", line 844, in start
  File "/pipelines/miniconda3/envs/illumina-cleaning_and_qc/lib/python3.5/concurrent/futures/thread.py", line 55, in run
Removing output files of failed job run_fastqc_flag since they might be corrupted:
/illumina-spool/FC1109_B/default/I8N6I8-1/upv-comav/fastqc/37567_ID1289_IL-DAS-2019-DAS40-3-4-5-F11_S753_L004_R1_fastqc/fastqc_data_flags.txt
[Thu Mar 28 16:06:37 2019]
Error in rule run_fastqc_flag:
    jobid: 9836
    output: /illumina-spool/FC1109_B/default/I8N6I8-1/aut-lfgt/fastqc/36942_ID1222_GenTree-SPET-Pnigra-2-ATPN1317-A05_S128_L002_R1_fastqc/fastqc_data_flags.txt
    log: /illumina-spool/logs/FC1109_B/run_fastqc_flag.I8N6I8-1.aut-lfgt.36942_ID1222_GenTree-SPET-Pnigra-2-ATPN1317-A05_S128_L002_R1.fastqc_flag.log
    conda-env: /illumina-spool/illumina-cleaning_and_qc/.snakemake/conda/6a606cff

And sometimes:

RuleException:
ValueError in line 352 of /igatech/pipelines/live/illumina-cleaning_and_qc/Snakefile:
not enough values to unpack (expected 2, got 1)
  File "//pipelines/live/illumina-cleaning_and_qc/Snakefile", line 352, in __rule_run_fastqc
  File "//pipelines/miniconda3/envs/illumina-cleaning_and_qc/lib/python3.5/contextlib.py", line 59, in __enter__
  File "//pipelines/miniconda3/envs/illumina-cleaning_and_qc/lib/python3.5/site-packages/psutil/__init__.py", line 750, in io_counters
  File "//pipelines/miniconda3/envs/illumina-cleaning_and_qc/lib/python3.5/site-packages/psutil/_pslinux.py", line 1459, in wrapper
  File "//pipelines/miniconda3/envs/illumina-cleaning_and_qc/lib/python3.5/site-packages/psutil/_pslinux.py", line 1604, in io_counters
  File "//pipelines/miniconda3/envs/illumina-cleaning_and_qc/lib/python3.5/concurrent/futures/thread.py", line 55, in run

In all these cases the processes generated by the aforementioned rules are very short-lived, lasting less than one second (benchmark output):

s       h:m:s    max_rss  max_vms  max_uss  max_pss  io_in  io_out  mean_load
0.0575  0:00:00  36.63    280.84   31.91    32.43    0.00   0.00    0.00

Could this be a bug that prevents Snakemake from correctly detecting the end of a job, or a problem in the code of the rules?

Comments (16)

  1. Johannes Köster

    Can you run again with --verbose? This should print the full stack trace, and hopefully give us a hint on the true cause. Thanks for reporting!

  2. Michele Vidotto reporter

    Hi Johannes,

sorry for the delayed reply. At the moment I'm not able to reproduce the exact conditions under which the problems occur. I'll do that as soon as possible.

In the meantime we are investigating our systems, because we suspect the problems are related to the configuration of our infrastructure or to the hardware.

So please hold off on this thread for the moment. I've changed the status to minor and will report updates as soon as I have them!

    Thanks again!
