Diagnostics

Diagnostics usually come in the form of plots or summary statistics. They can serve many purposes, such as:

  • diagnosing problems in sample preparation and optimizing future preparations;
  • providing feedback on the sequencing itself, e.g. on read quality;
  • implementing ‘sanity checks’ at intermediate steps of analysis;
  • finding optimal parameters by comparing previous runs;
  • recording computational and storage demands, and predicting future demands.

The diagnostics database table archives summary statistics that can be accessed across multiple stages of a pipeline, from different pipelines, and in HTML reports.

A diagnostics record looks like:

catalog_id | run_id | entity | attribute | value | timestamp

The entity field acts as a namespace to prevent attribute collisions, since the same attribute name can arise multiple times within a pipeline run.

When running a BioLite pipeline, the default entity is the pipeline name plus the stage name, so that values can be traced to the pipeline and stage during which they were entered. Entries in the diagnostics table can include paths to derivative files. These can be either summaries of intermediate files, which are used to generate reports, or intermediate data files that serve as input to other stages and pipelines.
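As a rough illustration of how the entity field keeps identical attribute names apart, the following sketch builds an in-memory SQLite table with the six columns of a diagnostics record. The column types, the pipeline and stage names (assemble, filter, trim), and the helpers make_entity and read_value are assumptions made for this example, not BioLite's actual schema or API.

```python
import sqlite3

# Hypothetical schema mirroring the record layout; the real BioLite table
# may use different types and constraints.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE diagnostics (
           catalog_id TEXT, run_id TEXT, entity TEXT,
           attribute TEXT, value TEXT, timestamp TEXT)"""
)

def make_entity(pipeline, stage):
    """Join a pipeline name and a stage name into a dot-delimited entity."""
    return "{}.{}".format(pipeline, stage)

# The same attribute name ('nreads') is logged by two different stages.
rows = [
    ("sample1", "1", make_entity("assemble", "filter"), "nreads", "1000",
     "2013-01-01T12:00:00"),
    ("sample1", "1", make_entity("assemble", "trim"), "nreads", "950",
     "2013-01-01T12:05:00"),
]
conn.executemany("INSERT INTO diagnostics VALUES (?, ?, ?, ?, ?, ?)", rows)

def read_value(run_id, entity, attribute):
    """Return the value stored for one (run_id, entity, attribute), or None."""
    row = conn.execute(
        "SELECT value FROM diagnostics "
        "WHERE run_id = ? AND entity = ? AND attribute = ?",
        (run_id, entity, attribute),
    ).fetchone()
    return row[0] if row else None

print(read_value("1", "assemble.filter", "nreads"))
print(read_value("1", "assemble.trim", "nreads"))
```

Because the entity is part of the lookup key, the two nreads values do not overwrite each other.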

Initializing

Before logging to diagnostics, your script must initialize this module by calling the init function with a BioLite catalog ID and a name for the run. This returns a new run ID from the runs table. Optionally, you can pass an existing run ID to init to continue a previous run.

Diagnostics are automatically initialized by the Pipeline and IlluminaPipeline classes in the pipeline module.

Logging a record

Use the log function described below.

Detailed system utilization statistics, including memory high-water marks and compute wall-time, are recorded automatically (by the wrapper base class) for any wrapper that your pipeline calls, and for the overall pipeline itself.

Provenance

Because every wrapper call is automatically logged, the diagnostics table holds a complete non-executable history of the analysis, which complements the original scripts that were used to run the analysis. In combination, the diagnostics table and original scripts provide provenance for all analyses.

class biolite.diagnostics.OutputPattern

Bases: tuple

OutputPattern(re, entity, attr)

attr

Alias for field number 2

entity

Alias for field number 1

re

Alias for field number 0

class biolite.diagnostics.Run

Bases: tuple

Run(done, run_id, id, name, hostname, username, timestamp, hidden)

done

Alias for field number 0

hidden

Alias for field number 7

hostname

Alias for field number 4

id

Alias for field number 2

name

Alias for field number 3

run_id

Alias for field number 1

timestamp

Alias for field number 6

username

Alias for field number 5

biolite.diagnostics.timestamp()

Returns the current time in ISO 8601 format, e.g. YYYY-MM-DDTHH:MM:SS[.mmmmmm][+HH:MM].
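A string in this format can be produced with the standard library alone. The sketch below is only an illustration of the format; the function name iso_timestamp is hypothetical, and BioLite's own implementation may differ (for example in how it handles time zones).

```python
from datetime import datetime

def iso_timestamp():
    """Return the current local time as an ISO 8601 string."""
    return datetime.now().isoformat()

stamp = iso_timestamp()
# The string round-trips through the standard parser (Python 3.7+).
parsed = datetime.fromisoformat(stamp)
print(stamp)
```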

biolite.diagnostics.str2list(data)

Converts the diagnostics string data into a list by parsing it as a typical Python list representation, [item1, item2, ... ].
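One way such a conversion could work is sketched below. The function parse_list_string is a hypothetical stand-in, and the choices it makes (splitting on commas, stripping quotes, returning every item as a string) are assumptions rather than the documented behavior of str2list.

```python
def parse_list_string(text):
    """Parse a string such as "['a', 'b']" into a list of strings."""
    text = text.strip()
    # Drop the enclosing brackets, if present.
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]
    # Split on commas, then strip whitespace and surrounding quotes.
    items = [item.strip().strip("'\"") for item in text.split(",")]
    # An empty list representation yields no items.
    return [item for item in items if item]

print(parse_list_string("['a.fastq', 'b.fastq']"))
```

This simple approach breaks on items that themselves contain commas, so it is suitable only for plain values such as file names.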

biolite.diagnostics.get_run_id()

Returns the run_id (as a string).

biolite.diagnostics.get_entity()

Returns the current entity as a dot-delimited string.

biolite.diagnostics.init(id, name, run_id=None, workdir='/Users/mhowison/code/biolite/doc')

By default, appends to a file diagnostics.txt in the current working directory, but you can override this with the workdir argument.

You must specify a catalog ID and a name for the run. If no run_id is specified, an auto-incremented run ID will be allocated by inserting a new row into the runs table.

Returns the run_id (as a string).
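The allocation of an auto-incremented run ID can be pictured with a small SQLite sketch. The runs table layout and the function allocate_run_id below are assumptions for illustration only; they are not BioLite's schema or code.

```python
import sqlite3

# Hypothetical, simplified runs table with an auto-incrementing key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs ("
    "run_id INTEGER PRIMARY KEY AUTOINCREMENT, id TEXT, name TEXT)"
)

def allocate_run_id(catalog_id, name, run_id=None):
    """Return run_id as a string, inserting a new row if none was given."""
    if run_id is not None:
        # Continue an existing run: no new row is inserted.
        return str(run_id)
    cursor = conn.execute(
        "INSERT INTO runs (id, name) VALUES (?, ?)", (catalog_id, name)
    )
    conn.commit()
    return str(cursor.lastrowid)

new_run = allocate_run_id("sample1", "assemble")
resumed = allocate_run_id("sample1", "assemble", run_id=new_run)
print(new_run, resumed)
```

Passing the returned ID back in on a later call continues the same run instead of allocating a new one.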

biolite.diagnostics.check_init()

Aborts if biolite.diagnostics.init() has not been called yet.

biolite.diagnostics.merge()

Merges the diagnostics and program caches into the SQLite database.

biolite.diagnostics.load_cache()

Similar to a merge, but loads the local diagnostics file into an in-memory cache instead of the SQLite database.

Uses the filename specified with name, or the file diagnostics.txt in the current working directory (default).

biolite.diagnostics.log(attribute, value)

Log an attribute/value pair in the diagnostics using the currently set entity. The pair is written to the local diagnostics text file and also into the local in-memory cache.

biolite.diagnostics.log_path(path, log_prefix=None)

Logs a path by writing these attributes at the current entity, with an optional prefix for this entry:

  1) the full path string;
  2) the full path string, converted to an absolute path by os.path.abspath();
  3) the size of the file/directory at the path (according to os.stat);
  4) the access time of the file/directory at the path (according to os.stat);
  5) the modify time of the file/directory at the path (according to os.stat);
  6) the permissions of the file/directory at the path (according to os.stat).
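The six attributes can all be gathered with os.path.abspath() and a single os.stat() call. The sketch below shows one way to collect them; the function describe_path and its dictionary keys are hypothetical names chosen for this example, not the attribute names that log_path actually writes.

```python
import os
import stat

def describe_path(path):
    """Collect the path attributes described above into a dictionary."""
    info = os.stat(path)
    return {
        "path": path,                              # 1) path as given
        "abspath": os.path.abspath(path),          # 2) absolute path
        "size": info.st_size,                      # 3) size in bytes
        "atime": info.st_atime,                    # 4) access time
        "mtime": info.st_mtime,                    # 5) modify time
        "mode": stat.filemode(info.st_mode),       # 6) permissions, e.g. '-rw-r--r--'
    }

for key, value in describe_path(".").items():
    print(key, value)
```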

biolite.diagnostics.log_dict(d, prefix=None, filter=False)

Log a dictionary d by calling log for each key/value pair.

biolite.diagnostics.log_program_version(name, version, path)

Enter the version string and a hash of the binary file at path into the programs table.
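Hashing the binary makes it possible to detect when a program changes without its version string changing. The sketch below shows a chunked file hash; the function hash_file is hypothetical, and the choice of SHA-256 is an assumption, since the hash algorithm BioLite uses is not stated here.

```python
import hashlib
import sys

def hash_file(path, chunk_size=65536):
    """Return a hex digest of the file at path, read in fixed-size chunks."""
    digest = hashlib.sha256()  # assumed algorithm, for illustration only
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Example: hash the Python interpreter that is running this script.
print(hash_file(sys.executable))
```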

biolite.diagnostics.log_program_output(filename, patterns=None)

Read backwards through a program’s output to find any [biolite] markers, then log their key=value pairs in the diagnostics.

A marker can specify an entity suffix with the form [biolite.suffix].

[biolite.profile] markers are handled specially, since mem= and vmem= entries need to be accumulated. These are inserted into a program’s output on Linux systems by the preloaded memusage.so library.

You can optionally include a list of additional patterns, specified as OutputPattern tuples with:

(regular expression string, entity, attribute)

and the first line of program output matching the pattern will be logged to that entity and attribute name. The value will be the subexpressions matched by the regular expression, either a single value if there is one subexpression, or a string of the tuple if there are more.
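The pattern-matching rule can be sketched as follows. The OutputPattern tuple is redefined locally with the same three fields so that the example is self-contained; the function match_patterns, the sample output lines, and the use of re.search (rather than re.match) are assumptions for illustration, not BioLite's implementation.

```python
import re
from collections import namedtuple

# Local stand-in with the same fields as biolite.diagnostics.OutputPattern.
OutputPattern = namedtuple("OutputPattern", "re entity attr")

def match_patterns(lines, patterns):
    """Return {(entity, attr): value} using the first matching line per pattern."""
    found = {}
    for pattern in patterns:
        regex = re.compile(pattern.re)
        for line in lines:
            m = regex.search(line)
            if m:
                groups = m.groups()
                # One subexpression: log it directly; otherwise log the tuple as a string.
                value = groups[0] if len(groups) == 1 else str(groups)
                found[(pattern.entity, pattern.attr)] = value
                break  # only the first matching line is used
    return found

# Hypothetical program output and patterns.
output = [
    "starting",
    "reads processed: 1200",
    "pairs: 10 of 20",
    "reads processed: 9999",
]
patterns = [
    OutputPattern(r"reads processed: (\d+)", "align", "nreads"),
    OutputPattern(r"pairs: (\d+) of (\d+)", "align", "pairs"),
]
print(match_patterns(output, patterns))
```

Note that the second "reads processed" line is ignored, because only the first match for each pattern is logged.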

biolite.diagnostics.lookup(run_id, entity)

Returns a dictionary of attribute/value pairs for the given run_id and entity in the SQLite database.

Returns an empty dictionary if no records are found.

biolite.diagnostics.local_lookup(entity)

Similar to lookup, but queries the in-memory cache instead of the SQLite database. This can provide lookups when the local diagnostics text file has not yet been merged into the SQLite database (for instance, after restarting a pipeline that never completed, and hence never reached a diagnostics merge).

Returns an empty dictionary if no records are found.
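The interplay between the local text file, the in-memory cache, and a restarted pipeline can be sketched with a small stand-in class. Everything here is hypothetical: the class LocalDiagnostics, its method bodies, and especially the tab-delimited file format are assumptions for illustration, since the actual layout of diagnostics.txt is not described in this section.

```python
import os
import tempfile
from collections import defaultdict

class LocalDiagnostics:
    """Toy model: every logged pair goes to a text file and to a cache."""

    def __init__(self, path, entity):
        self.path = path
        self.entity = entity
        self.cache = defaultdict(dict)

    def log(self, attribute, value):
        # Assumed format: entity, attribute and value separated by tabs.
        # Values containing tabs or newlines would break this simple format.
        with open(self.path, "a") as f:
            f.write("{}\t{}\t{}\n".format(self.entity, attribute, value))
        self.cache[self.entity][attribute] = str(value)

    def load_cache(self):
        # Rebuild the cache from the text file, e.g. after a restart.
        self.cache.clear()
        with open(self.path) as f:
            for line in f:
                entity, attribute, value = line.rstrip("\n").split("\t", 2)
                self.cache[entity][attribute] = value

    def local_lookup(self, entity):
        # Return a copy so callers cannot modify the cache by accident.
        return dict(self.cache.get(entity, {}))

path = os.path.join(tempfile.mkdtemp(), "diagnostics.txt")
first = LocalDiagnostics(path, "assemble.filter")
first.log("nreads", 1000)

# A restarted process starts with an empty cache and reloads it from the file.
restarted = LocalDiagnostics(path, "assemble.filter")
restarted.load_cache()
print(restarted.local_lookup("assemble.filter"))
```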

biolite.diagnostics.lookup_like(run_id, entity)

Similar to lookup, but allows for wildcards in the entity name (either the SQL ‘%’ wildcard or the more standard UNIX ‘*’ wildcard).

Returns a dictionary of dictionaries keyed on [entity][attribute].
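A wildcard lookup of this kind can be sketched by translating ‘*’ to the SQL ‘%’ wildcard and grouping the rows by entity. The table layout, the sample rows, and the function lookup_like_sketch are assumptions for illustration, not BioLite's code.

```python
import sqlite3
from collections import defaultdict

# Hypothetical, simplified diagnostics table with sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE diagnostics (run_id TEXT, entity TEXT, attribute TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO diagnostics VALUES (?, ?, ?, ?)",
    [
        ("1", "assemble.filter", "nreads", "1000"),
        ("1", "assemble.trim", "nreads", "950"),
        ("1", "report.summary", "nreads", "950"),
        ("2", "assemble.filter", "nreads", "7"),
    ],
)

def lookup_like_sketch(run_id, entity):
    """Return {entity: {attribute: value}} for entities matching a wildcard."""
    # Accept the UNIX-style '*' by translating it to the SQL '%' wildcard.
    pattern = entity.replace("*", "%")
    result = defaultdict(dict)
    query = (
        "SELECT entity, attribute, value FROM diagnostics "
        "WHERE run_id = ? AND entity LIKE ?"
    )
    for ent, attr, val in conn.execute(query, (run_id, pattern)):
        result[ent][attr] = val
    return dict(result)

print(lookup_like_sketch("1", "assemble.*"))
```

Keep in mind that in SQL LIKE patterns the underscore is also a wildcard (matching any single character), which matters if entity names contain underscores.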

biolite.diagnostics.lookup_by_id(id, entity)
biolite.diagnostics.lookup_attribute(run_id, attribute)

Returns each value for the given attribute found in all entities for the given run_id, as an iterator of (entity, value) tuples.

biolite.diagnostics.lookup_entities(run_id)
biolite.diagnostics.lookup_pipelines(run_id)
biolite.diagnostics.lookup_run(run_id)
biolite.diagnostics.lookup_runs(id=None, name=None, order='ASC', hidden=True)
biolite.diagnostics.lookup_last_run(id=None, name=None)
biolite.diagnostics.lookup_prev_run(id, previous)

If previous is an integer, tries to look up the exit diagnostics of a previous run with that run ID.

To input the results from a previous pipeline run, use the --previous (-p) argument with a ‘RUN_SPEC’, which is either a specific run ID to look up in the diagnostics, or the wildcard ‘*’, meaning the latest of any previous run found in the diagnostics for the given catalog ID.

biolite.diagnostics.dump(run_id)
biolite.diagnostics.dump_commands(run_id)
biolite.diagnostics.dump_by_id(id)
biolite.diagnostics.dump_all()
biolite.diagnostics.hide_run(*args)
biolite.diagnostics.unhide_run(*args)
biolite.diagnostics.dump_programs()
biolite.diagnostics.exit_profiler(start)

Capture script resource usage, after a script run ends or as an exit handler if the script fails.
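The general idea of an exit profiler can be sketched with the standard library. The functions profile_exit and register_profile_exit and the dictionary keys below are hypothetical, and the sketch assumes that start is a time.time() value taken when the script began; it does not reproduce what BioLite actually records.

```python
import atexit
import time

def profile_exit(start):
    """Return wall time since start plus process resource usage, if available."""
    usage = {"walltime": time.time() - start}
    try:
        import resource  # Unix only
        ru = resource.getrusage(resource.RUSAGE_SELF)
        usage["usertime"] = ru.ru_utime
        usage["systime"] = ru.ru_stime
        usage["maxrss"] = ru.ru_maxrss  # units differ between platforms
    except ImportError:
        pass
    return usage

def register_profile_exit(start):
    """Arrange for the usage summary to be printed when the interpreter exits."""
    atexit.register(lambda: print(profile_exit(start)))

start = time.time()
print(profile_exit(start))
```

Registering the function with atexit means the summary is still produced when the script stops on an unhandled exception, although not when the process is killed by a signal.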

biolite.diagnostics.register_exit_profiler(start)
