rework replicate name inference
We first got on this track by noticing that #40 (an bug occurring in lib5c.tools.helpers.resolve_parallel()
when a glob matches only one filename) can probably be fixed by tacking this conditional on after the if statement mentioned in that issue:
elif len(infiles) == 1 and '*' in args_dict[key_arg]:
if hasattr(args, 'outfile') and '%s' in args.outfile:
prefix, postfix = args_dict[key_arg].split('*')
rep_name = infiles[0][len(prefix):-len(postfix)]
args.outfile = args.outfile.replace('%s', rep_name)
setattr(args, key_arg, infiles[0])
Basically, if the glob only matches one input file, but it has a wildcard in it, the expected behavior is that
- we will drop in the single match glob-expanded filename in to replace the unexpanded wildcard filename
- if there was an
outdir
arg that had'%s'
in it, the user probably expects that this would be replaced with the replicate name, so we go ahead and do that, using an experimental new method for getting the rep name from a single rep
Historically, we use lib5c.tools.helpers.infer_replicate_names()
to infer replicate names from a list of input files (namely, selecting out the parts that are different across the filenames). This doesn't work with a single input file, so instead we add a different assumption. This assumption is that the letters that end up getting matched to the wildcard are the rep name. This is not always true, for example when the file extension is not included (if the pattern is 'kr/*'
, the rep name will include '.counts'
). However, it is the only thing I can come up with here.
We could actually use this new assumption to fix a long-standing issue (#39) where the qnorm step of the pipeline will break if the rep names contain a common prefix or postfix. This opens the discussion of whether or not we should change infer_replicate_names()
to make the strong assumption that the replicate name will always be the exact substring of the matched filename that has replaced the wildcard in the provided pattern.
If we make this change to infer_replicate_names()
, fixing this error simplifies to changing the conditional in resolve_parallel()
from
if len(infiles) > 1:
to
if len(infiles) > 1 or '*' in args_dict[key_arg]:
since the machinery of _parallelize_flags()
will kick in as expected.
To some extent, making this change everywhere is slightly undesirable because we're overriding standard expectations for how the wildcard should behave (namely, that we can include or omit characters from the pattern and as long as it still matches the same filenames then it doesn't make a difference).
Making this change everywhere probably also requires a change in lib5c.contrib.luigi.pipeline.JointInnerMixin._match_inputs()
since wildcard patterns generated by that function disobey this new assumption (if the reps are "ES_rep1" and "ES_rep2" and the input files are "input/ES_rep1.counts" and "input/ES_rep2.counts" then the pattern is "input/ES_rep*.counts", and replicate names inferred from that pattern will again be mangled to "1" and "2"). The solution would be to modify that function as well. One possibility could be:
class JointInnerMixin(object):
def _match_inputs(self):
return self.preceding_task(rep=self.all_reps[0])\
.outfile_pattern.replace('%s', '*')
...
where we are relying on the symmetry between where the files are written ('pattern/%s.counts' % rep
) and how they are read ('pattern/*.counts'
).
We will probably still need to fall back to the old behavior of infer_replicate_names()
when it is called by a tool that does not use resolve_parallel()
(e.g., when the tool accepts multiple countsfiles as an already-expanded variable-length positional argument). One way to handle this could be to add pattern=None
as a kwarg to infer_replicate_names()
, and pass it whenever it is available. For example:
def infer_replicate_names(infiles, as_dict=False, pattern=None):
if pattern:
prefix, postfix = pattern.split('*')
if as_dict:
return {infile: [infile[len(prefix):-len(postfix)]
for infile in infiles}
return [infile[len(prefix):-len(postfix)] for infile in infiles]
common_prefix = os.path.commonprefix(infiles)
common_postfix = os.path.commonprefix(
[infile[::-1] for infile in infiles])[::-1]
if len(infiles) == 1:
common_prefix = ''
common_postfix = ''
if as_dict:
return {
infile: infile[len(common_prefix):len(infile)-len(common_postfix)]
for infile in infiles
}
return [infile[len(common_prefix):len(infile)-len(common_postfix)]
for infile in infiles]
In particular, to make this work in the pipeline we need to modify the tools that accept a variable number of countsfiles as input but don't use resolve_parallel()
to check the passed infiles before expansion - if there is only one infile and it contains a wildcard, this pattern should be used to infer the replicate names from the expanded infiles. Using lib5c qnorm
as an example:
...
# expand infiles
expanded_infiles = []
for infile in args.countsfiles:
expanded_infiles.extend(glob.glob(infile))
if len(args.countsfiles) == 1 and '*' in args.countsfiles[0]:
pattern = args.countsfiles[0]
else:
pattern = None
...
replicate_names = infer_replicate_names(expanded_infiles, pattern=pattern)
This approach allows these tools to use the new pattern-based replicate name inference when a pattern is available, but still fall back to the old approach when it is not. In the context of the pipeline, patterns will always be passed, ensuring filenames will work with outfile_pattern
and the new _match_inputs()
. Tools that use resolve_parallel()
already demand that the input files be specified using a wildcard pattern, the only new requirement there is that the postfix/prefix now matter (since they define the replicate name).
Since so many tools depend on this (see below for a complete list), this might be worth writing a separate helper for this step. Namely, many tools already repeat code of the form:
# expand infiles
expanded_infiles = []
for infile in args.countsfiles:
expanded_infiles.extend(glob.glob(infile))
though later usage of the expanded_infiles
for replicate inference is diverse (some pass as_dict=True
to infer_replicate_names()
for example), so maybe we should focus just on the expansion part, and let individual tools decide exactly how to call infer_replicate_names()
. The helper then has signature resolve_expansion(infiles) -> (expanded_infiles, pattern)
where pattern can be None, and clients are expected to pass through pattern
to infer_replicate_names()
.
As a related stretch goal, the documentation should include a page on replicate name inference explaining these two approaches.
The overall TODO list is then:
- modify
infer_replicate_names()
- modify
resolve_parallel()
(with the simpler of the two fixes described here) - modify
_match_inputs()
- add new helper for expansion
- modify all tools that take variable numbers of countsfiles to use the helper
- trim
- pca
- dd_curve
- threshold
- colorscale
- distribution
- remove
- qnorm
- correlation
- add replicate name inference to docs
Comments (2)
-
reporter -
reporter - changed status to closed
- Log in to comment
instead of writing the new helper, I think we can temporarily just change calls to
infer_replicate_names()
to pass through apattern
kwarg if appropriatean example is: