threads and --cores, --configfile

Issue #412 new
Christian Arnold
created an issue

Hi,

I am not quite sure what the best way of doing this is in Snakemake, and since it might be relevant for many users, I thought I'd quickly ask here. I have two issues that both relate to maximizing flexibility in the Snakemake universe:

  • If I use --cores, the (maximum) number of cores is specified. How do I best state in my rules that this number should be used; how do I refer to it? So far, I have specified a number explicitly (e.g., 1 for non-parallelizable rules and a fixed number like 8 for parallelizable ones), but this is not ideal and in fact inflexible.

  • Maybe I did something wrong, but how do I correctly invoke the --configfile argument? If I use it to specify a valid config file, Snakemake complains about missing values in the "config" variable. I expected a different behaviour, namely that I have access to the "config" dictionary if either --configfile is specified (which should also override any "configfile:" statement) or if a config file is specified via "configfile:". Only the latter seems to work?

Maybe this can be clarified in the documentation...

Thanks, Christian

Comments (9)

  1. Johannes Köster

    Question 1:
    The idea is that a reasonable value (most tools cannot be parallelized to infinity) is put into threads. --cores defines a maximum: if a rule specifies more threads than that, it will be restricted to the number specified via --cores. Hence, you can achieve what you want by setting threads to a high value (e.g., 99). If you then specify, e.g., --cores 24, this rule will use only 24 threads. Indeed, an example would be nice in the docs, although I believe this is explained in detail in the tutorial. We have a hackathon soon, which will also improve the docs and make it easier to contribute to them via pull requests.
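
    For illustration, a minimal sketch of this pattern (the rule, tool, and file names here are made up):

        rule align:
            input: "reads.fastq"
            output: "aligned.bam"
            threads: 99  # deliberately high; capped by the --cores value at runtime
            shell: "bwa mem -t {threads} ref.fa {input} > {output}"

        # snakemake --cores 24   -> this rule then runs with 24 threads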

    Question 2:
    Hard to tell what happens in your case without seeing the actual error message. Can you post it?

  2. Christian Arnold reporter

    Hi, regarding the first question (for the second one, a separate post follows): I see what you mean when not running in cluster mode; this sounds totally fine and absolutely reasonable.

    But in cluster mode, a few things confuse me now and do not seem to work the way I thought they would... This is what I want: parameterize the number of cores, nCoresMax, that each parallelizable rule uses, while at the same time defining a maximum number of jobs, nJobsMax, that are submitted to the cluster system (e.g., the maximum number of nodes performing computation, although this is not strictly correct because the same node could also run multiple rules if it has enough capacity and is assigned multiple times by the central submission system). Thus, a maximum of nCoresMax x nJobsMax cores would be used at any time.

    1. Following the logic you just mentioned, if I set threads to a high value such as 99 for my parallelizable rules, --cores to 16, and --jobs to, say, 20, I expected that multithreaded rules would get 16 cores each and that up to 20 jobs would be submitted in parallel. However, my jobs were pending and could not be started because of insufficient resources ("99 Task(s)").

    2. In the Docs, I read "On a cluster node, Snakemake always uses as many cores as available on that node". So if I wanted to specify that each rule runs on only 8 cores in a cluster setting, this would not be possible because Snakemake is "greedy" and uses ALL cores on the node the job is assigned to?

  3. Johannes Köster

    Indeed, --cores currently does not restrict the number of threads in a cluster setting. Instead, each job is started with -j on its node, which means the job will use min(threads, available_cores_on_node). With the 99 hack, it will therefore always use all cores on the node. This does not apply to your second point, though: if you set threads to 8, the rule will use only 8 threads even if the node has more cores. If you need more flexibility, you could also provide a variable to the threads directive and have that value parsed from the config file or an environment variable.
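
    A sketch of that last suggestion (the names NTHREADS and max_threads are made up here, not anything Snakemake defines):

        import os

        # Per-rule thread count: environment variable first, then config file,
        # then a hard-coded default.
        MAX_THREADS = int(os.environ.get("NTHREADS", config.get("max_threads", 8)))

        rule sort:
            input: "data.bam"
            output: "data.sorted.bam"
            threads: MAX_THREADS
            shell: "samtools sort -@ {threads} -o {output} {input}"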

  4. Christian Arnold reporter

    Regarding the --configfile part of my question, I believe I found a bug, or at least inconsistent behaviour: if one specifies both the --configfile parameter on the command line and a config file via configfile: in the Snakefile, the config dictionary is filled with the values from both config files rather than just one. I was hoping and expecting that one could specify a default config in the Snakefile, which is then overridden when the --configfile parameter is explicitly invoked.

    Example:

    configDefault.json

        {
          "common": "bla1",
          "onlyIn1": "true"
        }

    config2.json

        {
          "common": "bla2",
          "onlyIn2": "true"
        }
    

    If I now call Snakemake with --configfile config2.json while also specifying a "default" config file via configfile: "configDefault.json" in the Snakefile, I have access to both "onlyIn1" and "onlyIn2", while the value for "common" is the one specified via configfile:, i.e., "bla1". So the command line argument does not override what is specified inside the Snakefile.
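
    Reduced to a minimal sketch, the setup and the observed outcome look like this:

        # Snakefile
        configfile: "configDefault.json"

        # invoked as: snakemake --configfile config2.json
        # observed merged config:
        #   {"common": "bla1", "onlyIn1": "true", "onlyIn2": "true"}
        # i.e. the key sets are merged, and for the overlapping key "common"
        # the configfile: value wins over the command line one.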

  5. Frank Feng

    Hi Johannes,

    Same as Christian, I was also expecting that "one could specify a default config in the Snakefile, which however is overridden when the --configfile parameter is explicitly invoked."

    Thanks!

  6. Christian Arnold reporter

    Hi again,

    I am still a bit unsure about the --cores part of this issue, and a clarifying statement would be good. What I would like is this:

    1. Run my Snakefile both locally and on a cluster.
    2. Specify the number of cores only once and not in multiple places.

    Currently, I specify the maximum number of threads per rule, nThreadsMax, in my Snakefile, and each rule is assigned this value in its threads: section. When calling Snakemake, I use --cores nCores and --jobs maxJobs, where nCores gets the same value as nThreadsMax (say 8). This works fine for both local and cluster settings, but it is a bit redundant, because I have to change the threads-per-rule parameter in two places: once in my Snakefile (nThreadsMax) and once on the command line (nCores); they essentially have to stay in sync. So my questions:

    1. How can I avoid this redundancy? Something like threads: {cores}, where cores is a placeholder for the value of the --cores command line argument, would solve it, for instance.
    2. On our cluster system, there is a policy that each user should not use more than, say, 500 CPUs at any time. How do I specify that in Snakemake? My understanding is that you can restrict only the number of jobs that Snakemake submits, but because the number of threads may vary dramatically between rules (i.e., 1 vs. many cores), this is not enough to control the total number of CPUs used at any time, or am I misunderstanding something completely? (One possible approach is sketched below.)
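
    One way such a limit could be expressed, sketched under the assumption of a Snakemake version that supports custom resources via --resources (rule, file, and resource names are illustrative):

        rule heavy_step:
            input: "in.txt"
            output: "out.txt"
            threads: 16
            resources: cpus=16  # declare how many CPUs this job occupies
            shell: "mytool --threads {threads} {input} > {output}"

        # Cap the sum of "cpus" over all concurrently running jobs:
        # snakemake --cluster qsub --jobs 100 --resources cpus=500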

    Thanks, Christian

  7. Simon Ye

    To answer question 1: this isn't something that Snakemake is designed to solve, because you may wish to have different configurations for a local pipeline vs. a cluster pipeline. This is why you have a cluster.yaml, which you can use to template args to your cluster scheduler. On the program side, it's better to have a wrapper that interrogates whether your program is being run in a cluster context and derives the appropriate resource allocations it has been given through the environment or the scheduler before executing your program. I put my approach here: https://github.com/yesimon/resource-wrapper
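
    A minimal sketch of that wrapper idea (not the linked tool itself; SLURM_CPUS_PER_TASK and NSLOTS are variables that SLURM and SGE typically export inside a job allocation):

        import multiprocessing
        import os

        def granted_cpus():
            """CPUs granted by the scheduler, or the local count outside a cluster."""
            for var in ("SLURM_CPUS_PER_TASK", "NSLOTS"):
                value = os.environ.get(var)
                if value:
                    return int(value)
            # No scheduler environment detected: assume a local run.
            return multiprocessing.cpu_count()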

    1. This is a strange system, since most clusters limit by the number of jobs. Otherwise, if you request more than the limit, your jobs simply get queued. It sounds like your scheduler rejects jobs when you go over the limit instead of queueing them. I'm sure that Johannes will accept a pull request that constrains by the number of cores.