rework replicate name inference

Issue #41 closed
Thomas Gilgenast created an issue

We first got on this track by noticing that #40 (a bug occurring in lib5c.tools.helpers.resolve_parallel() when a glob matches only one filename) can probably be fixed by tacking this conditional on after the if statement mentioned in that issue:

    elif len(infiles) == 1 and '*' in args_dict[key_arg]:
        if hasattr(args, 'outfile') and '%s' in args.outfile:
            # the rep name is the substring matched by the wildcard
            prefix, postfix = args_dict[key_arg].split('*')
            rep_name = infiles[0][len(prefix):len(infiles[0]) - len(postfix)]
            args.outfile = args.outfile.replace('%s', rep_name)
        setattr(args, key_arg, infiles[0])

Basically, if the glob only matches one input file, but it has a wildcard in it, the expected behavior is that

  1. we drop in the single glob-expanded filename to replace the unexpanded wildcard pattern
  2. if there is an outfile arg that contains '%s', the user probably expects it to be replaced with the replicate name, so we go ahead and do that, using an experimental new method for getting the rep name from a single rep
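The two steps above can be sketched in isolation (the filenames and the outfile template here are hypothetical):

```python
# Hypothetical single-match scenario for the conditional above.
pattern = 'kr/*.counts'          # the unexpanded wildcard argument
infiles = ['kr/ES_rep1.counts']  # the single glob match
outfile = 'qnorm/%s.counts'      # an outfile template containing '%s'

# Split the pattern around the wildcard to recover the replicate name
# (using len(infiles[0]) - len(postfix) so an empty postfix is safe).
prefix, postfix = pattern.split('*')
rep_name = infiles[0][len(prefix):len(infiles[0]) - len(postfix)]

# Substitute the replicate name into the outfile template.
outfile = outfile.replace('%s', rep_name)
```

Here rep_name comes out as 'ES_rep1' and the outfile as 'qnorm/ES_rep1.counts'.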

Historically, we use lib5c.tools.helpers.infer_replicate_names() to infer replicate names from a list of input files (namely, selecting out the parts that are different across the filenames). This doesn't work with a single input file, so instead we add a different assumption. This assumption is that the letters that end up getting matched to the wildcard are the rep name. This is not always true, for example when the file extension is not included (if the pattern is 'kr/*', the rep name will include '.counts'). However, it is the only thing I can come up with here.
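A minimal sketch of the wildcard-match assumption, including the file-extension caveat (the filenames are hypothetical, and rep_name_from_pattern() is an illustrative helper, not an existing lib5c function):

```python
def rep_name_from_pattern(infile, pattern):
    """Return the substring of infile matched by the pattern's wildcard."""
    prefix, postfix = pattern.split('*')
    return infile[len(prefix):len(infile) - len(postfix)]

infile = 'kr/ES_rep1.counts'

# With the extension in the pattern, the rep name is as expected.
good = rep_name_from_pattern(infile, 'kr/*.counts')  # 'ES_rep1'

# Without it, the extension leaks into the rep name.
bad = rep_name_from_pattern(infile, 'kr/*')  # 'ES_rep1.counts'
```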

We could actually use this new assumption to fix a long-standing issue (#39) where the qnorm step of the pipeline breaks if the rep names share a common prefix or postfix. This opens the discussion of whether we should change infer_replicate_names() to make the strong assumption that the replicate name is always the exact substring of the matched filename that replaced the wildcard in the provided pattern.
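To see why the common prefix/postfix approach mangles names like those in #39, compare the two strategies side by side (the filenames are hypothetical):

```python
import os.path

# Hypothetical replicates sharing the common prefix 'ES_rep'.
infiles = ['input/ES_rep1.counts', 'input/ES_rep2.counts']

# Old behavior: strip the prefix and postfix common to all filenames.
# Here the common prefix is 'input/ES_rep', so the shared part is lost.
prefix = os.path.commonprefix(infiles)
postfix = os.path.commonprefix([f[::-1] for f in infiles])[::-1]
old_names = [f[len(prefix):len(f) - len(postfix)] for f in infiles]

# New assumption: the rep name is whatever the wildcard matched.
p, q = 'input/*.counts'.split('*')
new_names = [f[len(p):len(f) - len(q)] for f in infiles]
```

The old approach yields ['1', '2'], while the pattern-based approach yields ['ES_rep1', 'ES_rep2'].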

If we make this change to infer_replicate_names(), fixing this error simplifies to changing the conditional in resolve_parallel() from

    if len(infiles) > 1:

to

    if len(infiles) > 1 or '*' in args_dict[key_arg]:

since the machinery of _parallelize_flags() will kick in as expected.

Making this change everywhere is slightly undesirable because it overrides standard expectations for how the wildcard should behave (namely, that we can include or omit characters from the pattern and, as long as it still matches the same filenames, it makes no difference).

Making this change everywhere probably also requires a change in lib5c.contrib.luigi.pipeline.JointInnerMixin._match_inputs() since wildcard patterns generated by that function disobey this new assumption (if the reps are "ES_rep1" and "ES_rep2" and the input files are "input/ES_rep1.counts" and "input/ES_rep2.counts" then the pattern is "input/ES_rep*.counts", and replicate names inferred from that pattern will again be mangled to "1" and "2"). The solution would be to modify that function as well. One possibility could be:

    class JointInnerMixin(object):
        def _match_inputs(self):
            return self.preceding_task(rep=self.all_reps[0])\
                .outfile_pattern.replace('%s', '*')
        ...

where we are relying on the symmetry between where the files are written ('pattern/%s.counts' % rep) and how they are read ('pattern/*.counts').
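This symmetry can be checked in miniature (the pattern and rep names are hypothetical, and the slicing stands in for pattern-based inference):

```python
# Hypothetical sketch of the write/read symmetry described above.
outfile_pattern = 'pattern/%s.counts'
reps = ['ES_rep1', 'ES_rep2']

# Files are written by filling the pattern with each rep name ...
written = [outfile_pattern % rep for rep in reps]

# ... and read back via the glob obtained by swapping '%s' for '*'.
glob_pattern = outfile_pattern.replace('%s', '*')

# Pattern-based inference then recovers the original rep names exactly.
prefix, postfix = glob_pattern.split('*')
recovered = [f[len(prefix):len(f) - len(postfix)] for f in written]
```

As long as writing and reading share the same pattern, the inferred names round-trip with no mangling.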

We will probably still need to fall back to the old behavior of infer_replicate_names() when it is called by a tool that does not use resolve_parallel() (e.g., when the tool accepts multiple countsfiles as an already-expanded variable-length positional argument). One way to handle this could be to add pattern=None as a kwarg to infer_replicate_names(), and pass it whenever it is available. For example:

    import os.path

    def infer_replicate_names(infiles, as_dict=False, pattern=None):
        if pattern:
            # new behavior: the rep name is the substring matched by the
            # wildcard (slice from the right so an empty postfix is safe)
            prefix, postfix = pattern.split('*')
            if as_dict:
                return {infile: infile[len(prefix):len(infile)-len(postfix)]
                        for infile in infiles}
            return [infile[len(prefix):len(infile)-len(postfix)]
                    for infile in infiles]
        # old behavior: strip the prefix and postfix common to all infiles
        common_prefix = os.path.commonprefix(infiles)
        common_postfix = os.path.commonprefix(
            [infile[::-1] for infile in infiles])[::-1]
        if len(infiles) == 1:
            common_prefix = ''
            common_postfix = ''
        if as_dict:
            return {
                infile: infile[len(common_prefix):len(infile)-len(common_postfix)]
                for infile in infiles
            }
        return [infile[len(common_prefix):len(infile)-len(common_postfix)]
                for infile in infiles]

In particular, to make this work in the pipeline we need to modify the tools that accept a variable number of countsfiles as input but don't use resolve_parallel() to check the passed infiles before expansion: if there is only one infile and it contains a wildcard, that wildcard pattern should be used to infer the replicate names from the expanded infiles. Using lib5c qnorm as an example:

    ...

    # expand infiles
    expanded_infiles = []
    for infile in args.countsfiles:
        expanded_infiles.extend(glob.glob(infile))
    if len(args.countsfiles) == 1 and '*' in args.countsfiles[0]:
        pattern = args.countsfiles[0]
    else:
        pattern = None

    ...

    replicate_names = infer_replicate_names(expanded_infiles, pattern=pattern)

This approach allows these tools to use the new pattern-based replicate name inference when a pattern is available, but still fall back to the old approach when it is not. In the context of the pipeline, patterns will always be passed, ensuring filenames will work with outfile_pattern and the new _match_inputs(). Tools that use resolve_parallel() already demand that the input files be specified using a wildcard pattern; the only new requirement there is that the prefix and postfix now matter (since they define the replicate name).

Since so many tools depend on this (see below for a complete list), this might be worth writing a separate helper for this step. Namely, many tools already repeat code of the form:

    # expand infiles
    expanded_infiles = []
    for infile in args.countsfiles:
        expanded_infiles.extend(glob.glob(infile))

though later usage of expanded_infiles for replicate inference is diverse (some tools pass as_dict=True to infer_replicate_names(), for example), so maybe we should focus just on the expansion part and let individual tools decide exactly how to call infer_replicate_names(). The helper would then have the signature resolve_expansion(infiles) -> (expanded_infiles, pattern), where pattern can be None, and clients are expected to pass pattern through to infer_replicate_names().
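The proposed helper could be sketched as follows (the name resolve_expansion comes from the signature above; the body just factors out the repeated expansion code):

```python
import glob

def resolve_expansion(infiles):
    """Expand glob patterns in infiles.

    If exactly one infile was passed and it contains a wildcard, also
    return it as the pattern so callers can forward it to
    infer_replicate_names(); otherwise the pattern is None.
    """
    expanded_infiles = []
    for infile in infiles:
        expanded_infiles.extend(glob.glob(infile))
    if len(infiles) == 1 and '*' in infiles[0]:
        pattern = infiles[0]
    else:
        pattern = None
    return expanded_infiles, pattern
```

A tool would then call expanded_infiles, pattern = resolve_expansion(args.countsfiles) and pass pattern straight through to infer_replicate_names().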

As a related stretch goal, the documentation should include a page on replicate name inference explaining these two approaches.

The overall TODO list is then:

  • modify infer_replicate_names()
  • modify resolve_parallel() (with the simpler of the two fixes described here)
  • modify _match_inputs()
  • add new helper for expansion
  • modify all tools that take variable numbers of countsfiles to use the helper
    • trim
    • pca
    • dd_curve
    • threshold
    • colorscale
    • distribution
    • remove
    • qnorm
    • correlation
  • add replicate name inference to docs

Comments (1)

  1. Thomas Gilgenast reporter

    instead of writing the new helper, I think we can temporarily just change calls to infer_replicate_names() to pass through a pattern kwarg if appropriate

    an example is:

    rep_names = infer_replicate_names(
        expanded_infiles, as_dict=True,
        pattern=args.countsfiles[0]
        if len(args.countsfiles) == 1 and '*' in args.countsfiles[0]
        else None)
    