Job Runner Enhancements + GALAXY_SLOTS Implementation

#236 Declined
Source repository
galaxy-central-fork-1
Branch
default
Destination repository
galaxy-central
Branch
default
Author
John Chilton
Reviewers
Description

This pull request implements the following job runner enhancements:

  • Centralize job script creation logic across DRMAA, condor, and PBS runners as well as the LWR.
  • Implement GALAXY_SLOTS across all of these runners.
  • Allow passing submission script parameters to the condor job runner (a destination parameter submit_xxxx becomes the submission parameter xxxx; see the sketch after this list).
  • Bring in CSIRO PBS fixes from pull request #194 (related to PBS staging) [thanks @stephen_mcmahon_].
  • Implement exit code handling in the condor job runner (this was completely missing before, so exit code detection was broken).
  • Various refactorings to make use of higher-level LWR server side utility methods as well as PEP-8 fixes.
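
As a hedged sketch of what the condor submission parameter passing might look like in job_conf.xml (the destination id, plugin wiring, and the particular submit keywords below are illustrative examples chosen for this write-up, not taken from the pull request itself):

    <destinations default="condor_default">
      <destination id="condor_default" runner="condor">
        <!-- a destination param named submit_xxxx is passed through to the
             condor submit description as the submission parameter xxxx;
             these keywords are just standard condor submit commands used
             here as examples -->
        <param id="submit_request_memory">8192</param>
        <param id="submit_requirements">OpSys == "LINUX"</param>
        <param id="submit_priority">5</param>
      </destination>
    </destinations>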

Comments (12)

  1. Peter Cock

    Nice John, I don't use Condor or PBS so can't comment on them - but I'm excited to see $GALAXY_SLOTS at last :)

  2. Björn Grüning

    Oh ... SLOTS, thanks John! That will save much time in deploying our tools.

    Thanks!

  3. John Chilton author

    The local job runner and the LWR job runner (when configured to run jobs outside a resource manager) will not populate this, so I guess for now the following should be added to Cheetah command templates to make use of it:

    \${GALAXY_SLOTS:-"1"}

    where 1 can be replaced with whatever default seems sensible for the tool. Note that the $ must be escaped to prevent its interpretation as a Cheetah variable.
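
    For instance, a hypothetical tool wrapper's command block might look like the following (the tool, its -p flag, and the default of 4 are placeholders, and the remaining arguments are omitted; the \${GALAXY_SLOTS:-...} idiom itself is the only point here):

    <command>
        tophat -p \${GALAXY_SLOTS:-4} ...
    </command>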

  4. John Chilton author

    I assume you mean I should hack up the local runner to define GALAXY_SLOTS; I am not sure this is wise, but I will do it. (Tools should probably still use the above idiom to stay backward compatible with older Galaxy versions.)

    When you say configurable per runner, I assume you actually mean per destination rather than per runner, right?

    <job_conf>
      <runners>
        <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner"/>
      </runners>
      <destinations default="local">
         <destination id="local" runner="local"/>
         <destination id="tophat_local" runner="local">
            <param id="slots">4</param>
         </destination>
      </destinations>
    </job_conf>
    

    I could put the property on the runner too, since one can define multiple local runners, but that doesn't seem to fit with the approach @natefoo has laid out. Let me know if I am misinterpreting you.
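
    To make that concrete, with a destination like tophat_local above I would expect the generated job script to end up containing something along these lines (a rough sketch, not necessarily the literal script the runner writes; the trailing arguments are omitted):

        # set by the job runner from the destination's slots param
        # (or derived from the resource manager's allocation)
        GALAXY_SLOTS="4"; export GALAXY_SLOTS

        # the tool command then picks it up, falling back to a default of 1
        tophat -p ${GALAXY_SLOTS:-1} ...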

    If we are travelling down this rabbit hole, I also wonder if the tools should be able to specify a default that could then be fed back into job destinations.

    <tool id="tophat" default_threads="4"> or <tool id="cut1" default_threads="1"> <!-- default default_threads would be 1 -->

    Then allow job destinations like this to be defined:

    <job_conf>
      ...
      <destinations>
        <destination id="default" runner="drmaa">
          <param id="nativeSpecification">-l nodes=${tool.default_threads}</param>
        </destination>
      </destinations>
    </job_conf>

    Here default_threads could default to 1, so every tool would have a default_threads value. This would give savvy deployers the ability to go in and override the default number of threads for a tool, while the distinction between multi-threaded and single-threaded tools could be handled correctly out of the box in the typical case.

  5. James Taylor

    Agreed, per destination makes more sense.

    For tools, we've always had this notion of tools asserting resource requests that the runner would try to meet. So, I imagine something like:

    <tool id="tophat">
      <resources>
        <slots min="2" max="16"/>
      </resources>
    </tool>

    And then instead of native specification, the runner would know how to format the resource request properly.
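
    As a purely illustrative sketch of what "format the resource request properly" could mean (none of this is implemented in this pull request, and the SGE parallel environment name is site-specific), an abstract <slots min="2" max="16"/> request granted 16 slots might be rendered as:

        # SGE (via the DRMAA native specification):  -pe <parallel_env> 16
        # SLURM:                                     --cpus-per-task=16
        # TORQUE/PBS:                                -l nodes=1:ppn=16
        # HTCondor submit description:               request_cpus = 16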

  6. John Chilton author

    The native specification thing is sort of all or nothing. I don't want to go in and start hacking it up if the user has already specified one. If the user needs to specify one, there should be a way to inject the tool parameters into it (as shown). If one isn't defined for a particular job destination, I guess we could add it...

    The problem here is that each underlying resource manager has its own format for these things, right? So the DRMAA job runner would need special logic for each resource manager describing how to specify that, and we would need to know about each possible (supported) DRMAA backend. In some ways I am not opposed to doing something like that; we could then provide deployers with a uniform interface for defining memory, walltime, slots, etc., going a step above the DRMAA interface. That is a cool project, but it seems out of scope for this pull request.
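
    To illustrate the kind of uniform interface I mean, a destination might one day look something like this (purely hypothetical - these abstract parameters and their names are made up for illustration and are not part of this pull request):

        <destination id="big_jobs" runner="drmaa">
          <!-- hypothetical abstract resource params that a future layer could
               translate into each backend's native specification -->
          <param id="slots">8</param>
          <param id="memory">32G</param>
          <param id="walltime">24:00:00</param>
        </destination>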

    I would also like to push back on the min/max idea. Most software has a minimum of 1, and the maximum (if applicable at all) is usually very tied to the architecture it is running on. I think a recommended number of threads makes more sense and is easier to work with (is it a small, medium, or big job?). I guess I will delay implementing the default number of threads until we have hashed this out.

    So is there anything that I should definitely add before merging this (local runner, configurable default, etc.)? I have tested the DRMAA runner with SLURM on my laptop and have partially tested the condor piece (I cannot convince condor that my laptop has more than one CPU for some reason). The PBS changes are largely untested - I do not have an easy platform for that; I did run a job and verify that the submit script looks fine, but I have not run that submit script through an actual resource manager. I have also deployed this to the Galaxy-P production CloudMan server and tested it with SGE, and it seems to work.

  7. James Taylor

    I think this can pretty much be merged as is, as long as GALAXY_SLOTS is always defined for any runner. A hard default of 1 is an acceptable starting point.

    Yes, eliminating the need for nativeSpecification (though it would still be optional) would require knowledge of the underlying runner beyond DRMAA, but I think this is a good way for us to move toward simplifying how resource requests are specified in job_conf. If the tool asserts its requests, and the runner knows how to try to meet them, we eliminate a bunch of per-tool configuration.

    I'm not sure about the argument against min/max. You can certainly default min to 1, but there are definitely tools we would not want to run until there were more slots available, so I don't think min is always 1. And for some tools, we know pretty well where they start to fail to scale up with more slots, and I'm not convinced this is always architecture specific. But still, this can be accommodated at scheduling time by trying to reconcile whatever abstract resource request the tool makes with what is actually available at a given destination, and map that to a concrete request.

  8. John Chilton author

    Oh well, I think we just disagree on some of these points. However, I will make some changes to the local runner so GALAXY_SLOTS is "always defined", and then I will merge. I will also create a Trello card for some of these longer-term direction questions.

  9. Daniel Blankenberg

    Just to confound the min/max argument a bit more, there are tools that require more complicated heuristics for determining the number of threads/cores to use, e.g. powers of 2. This is not exactly needed for this specific pull request, but I thought I would mention it since there is a bit of discussion on the topic occurring here (and someone was volunteering to make trello cards ;) ).

  10. John Chilton author

    ... and of course once the Trello card exists the battle is practically won :).

    You are of course right, and there are going to be different clusters with different resource types and different priorities for different users, etc... Any sufficiently complex Galaxy configuration should be using dynamic job destinations and carefully crafted resource allocation requests based on a number of factors. I will happily work on automating what can be automated though.