There is a new feature in Snakemake called "group", which lets you set a group id on a per-rule basis. Snakemake then looks at connected subgraphs within each group and submits each one as a single job on the cluster. This is great because we can then submit the jobs through the normal SLURM scheduler instead of manually having to run parts of the workflow on e.g. an interactive node and the indexing as a separate SLURM job.
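For reference, the group directive is set per rule; rules sharing a group id get bundled into one cluster job per connected subgraph. A minimal sketch (rule names, file patterns, and shell commands are hypothetical, not from this workflow):

```python
# Illustrative Snakefile fragment. Both rules share group "per_cell",
# so for each cell the align + count steps run as one cluster job.
rule align:
    input: "fastq/{cell}.fastq.gz"
    output: "bam/{cell}.bam"
    group: "per_cell"
    shell: "align {input} > {output}"

rule count:
    input: "bam/{cell}.bam"
    output: "counts/{cell}.txt"
    group: "per_cell"
    shell: "count {input} > {output}"
```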
I've set it up here so that you set the parameter workflow: cells_per_group_job to determine how many cells should be aggregated into one job (50 by default). The solution is a bit hackish: it calculates md5 checksums to represent unique sets of samples, and these are in turn used to touch flag files. The drawback is that if you change the sample set to run in the middle of the workflow, you end up with leftover flag files. E.g. you might start a pilot with 500 cells, finish 200 of them, and then start a run with all your 1000 cells. It's maybe not a super common use case, but it will happen. Is this a problem? Do we need a systematic way of dealing with it? I recently made a PR to Snakemake which adds the
--list-untracked flag, which can be used to identify such leftover files. Maybe enough?
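To make the checksum hack concrete, here is a sketch of the idea: hash the sorted sample names into a stable id, and name the flag file after that id. The function name and flag-file pattern are hypothetical, purely for illustration:

```python
import hashlib

def sample_set_id(samples):
    """Return a stable md5 digest identifying a set of sample names.

    Sorting first makes the digest order-independent, so the same
    sample set always maps to the same flag file, e.g.
    "aggregated_{digest}.done". (Hypothetical naming, for illustration.)
    """
    joined = ",".join(sorted(samples))
    return hashlib.md5(joined.encode()).hexdigest()

# The same set in any order yields the same id; a changed set yields a
# new id, which is why old flag files are left behind.
print(sample_set_id(["cellB", "cellA"]))
```

This also shows why leftover files appear: the 500-cell pilot and the 1000-cell run hash to different digests, so the pilot's flag files are never reused or cleaned up.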
This also bumps Scater and switches to using SingleCellExperiment, outputting an RDS with that object towards the end.