Cataloging data

The easiest way to interact with the BioLite catalog is to use the catalog script packaged with BioLite:

$ catalog -h
usage: catalog [-h] {insert,all,search,sizes} ...

Command-line tool for interacting with the agalma catalog.

agalma maintains a 'catalog' stored in an SQLite database of metadata
associated with your raw Illumina data, including:

- A unique ID that you make up to reference this data set.
- Paths to the FASTQ files containing the raw forward and reverse reads.
- The species name and NCBI ID.
- The sequencing center where the data was collected.

optional arguments:
  -h, --help            show this help message and exit

commands:
  {insert,all,search,sizes}
    insert              Add a new record to the catalog, or overwrite the
                        existing record with the same id.
    all                 List all catalog entries.
    search              Search all fields (except 'paths') for entries
                        matching the provided pattern, which can include * as
                        a wildcard.
    sizes               List all paths in the catalog, ordered by size on
                        disk.

The documentation below describes the catalog module, which lets you interact with the catalog directly from a Python script.

catalog Module

The BioLite catalog table pairs metadata with the raw NGS data files (identified by their absolute path on disk). It includes the following:

  • A unique ID for referencing the data set. If the data is paired-end Illumina HiSeq data, the ID can be automatically generated using unique information in the Illumina header.
  • Paths to the raw sequence data. For paired-end Illumina data, this is expected to be two FASTQ files (possibly compressed) containing the forward and reverse reads.
  • Notes about the species, the sample preparation and origin, IDs from the NCBI and ITIS taxonomies, and the sequencing machine and center where the data were collected.

The catalog acts as a bridge between the BioLite diagnostics and a more detailed laboratory information management system (LIMS) for tracking provenance of sample preparation and data collection upstream of and during sequencing. It contains the minimal context needed to associate diagnostics reports of downstream analyses with the raw sequence data, but without replicating or reimplementing the full functionality of a LIMS.

class biolite.catalog.CatalogRecord

Bases: tuple

A named tuple for holding records from the catalog table.

extraction_id

Alias for field number 5

id

Alias for field number 0

itis_id

Alias for field number 4

library_id

Alias for field number 6

library_type

Alias for field number 7

ncbi_id

Alias for field number 3

note

Alias for field number 11

paths

Alias for field number 1

sample_prep

Alias for field number 12

seq_center

Alias for field number 10

sequencer

Alias for field number 9

species

Alias for field number 2

timestamp

Alias for field number 13

tissue

Alias for field number 8
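Because the aliases above are listed alphabetically, the positional order of the fields is easy to misread. The following is a minimal sketch that rebuilds the same layout with a stand-in named tuple; `RecordSketch` and the sample values are hypothetical, and the real `CatalogRecord` may be defined differently.

```python
from collections import namedtuple

# Field names in positional order, taken from the alias numbers above.
FIELDS = (
    "id", "paths", "species", "ncbi_id", "itis_id", "extraction_id",
    "library_id", "library_type", "tissue", "sequencer", "seq_center",
    "note", "sample_prep", "timestamp",
)

# Stand-in for biolite.catalog.CatalogRecord (illustration only).
RecordSketch = namedtuple("RecordSketch", FIELDS)

# Start from an all-None record and fill in two hypothetical values.
blank = RecordSketch(*([None] * len(FIELDS)))
record = blank._replace(id="sample-001", species="Example species")

# Each field can be read by name or by its position.
print(record.id, record[0])
print(record.species, record[2])
```

Reading fields by name is usually safer than by position, since it does not depend on the column order staying fixed.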

biolite.catalog.split_paths(paths)

Splits a catalog path entry to return a list of paths.
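The text above does not say how multiple paths are joined inside a single catalog entry. The sketch below assumes a colon separator purely for illustration; `split_paths_sketch`, its `sep` parameter, and the file names are hypothetical and are not the library's implementation.

```python
def split_paths_sketch(paths, sep=":"):
    """Split a stored path entry into a list, dropping empty pieces.

    The ':' separator is an assumption, not something the docs confirm.
    """
    return [p for p in paths.split(sep) if p]


# Hypothetical entry holding forward and reverse read files.
entry = "/data/sample-001_1.fastq:/data/sample-001_2.fastq"
forward, reverse = split_paths_sketch(entry)
print(forward)
print(reverse)
```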

biolite.catalog.insert(**kwargs)

Insert or update a catalog entry, where keyword arguments specify the column/value pairs. If an entry for the given ID already exists, the specified column/value pairs are used to update that entry. If the ID does not exist, a new entry is created with the specified values.
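The insert-or-update behavior described here can be sketched against an in-memory SQLite database. This is an illustration of the semantics only, not BioLite's code: `insert_sketch`, the three-column table, and the sample values are all hypothetical.

```python
import sqlite3

# Hypothetical reduced schema; the real catalog has more columns.
COLUMNS = ("id", "paths", "species")


def insert_sketch(conn, **kwargs):
    """Update the row for kwargs['id'] if it exists, else create it."""
    unknown = set(kwargs) - set(COLUMNS)
    if unknown or "id" not in kwargs:
        raise ValueError("need an 'id' and only known columns")
    exists = conn.execute(
        "SELECT 1 FROM catalog WHERE id = ?", (kwargs["id"],)
    ).fetchone()
    if exists:
        # Only the supplied columns change; the others keep their values.
        updates = {k: v for k, v in kwargs.items() if k != "id"}
        if updates:
            assignments = ", ".join(f"{k} = ?" for k in updates)
            conn.execute(
                f"UPDATE catalog SET {assignments} WHERE id = ?",
                (*updates.values(), kwargs["id"]),
            )
    else:
        names = ", ".join(kwargs)
        marks = ", ".join("?" for _ in kwargs)
        conn.execute(
            f"INSERT INTO catalog ({names}) VALUES ({marks})",
            tuple(kwargs.values()),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE catalog (id TEXT PRIMARY KEY, paths TEXT, species TEXT)"
)
insert_sketch(conn, id="sample-001", paths="/data/a.fastq")  # creates the row
insert_sketch(conn, id="sample-001", species="Example species")  # updates it
rows = conn.execute("SELECT id, paths, species FROM catalog").fetchall()
print(rows)
```

Column names are checked against a fixed list before being placed in the SQL text, while values always go through `?` placeholders.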

biolite.catalog.select(id)

Returns a CatalogRecord object for the given catalog ID, or None if the ID is not found in the catalog.

biolite.catalog.select_all()

Yields a list of CatalogRecord objects for all entries in the catalog, ordered with the default ordering that SQLite provides.

biolite.catalog.search(string)

Yields a list of CatalogRecord objects for all entries in the catalog with an indexed column matching the given search string. The indexed columns are all the columns in the catalog except paths.
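To illustrate the idea of matching a pattern against every column except paths, here is a small pure-Python sketch. It uses `fnmatch.fnmatchcase` from the standard library for the `*` wildcard, which is an assumption about matching rules (case-sensitive, whole-field); `search_sketch`, the reduced `Rec` tuple, and the records are hypothetical.

```python
from collections import namedtuple
from fnmatch import fnmatchcase

# Hypothetical reduced record with only four of the catalog fields.
Rec = namedtuple("Rec", ["id", "paths", "species", "seq_center"])


def search_sketch(records, pattern):
    """Yield records with any non-'paths' field matching the pattern."""
    for rec in records:
        for name, value in rec._asdict().items():
            if name == "paths" or value is None:
                continue  # 'paths' is never searched
            if fnmatchcase(str(value), pattern):
                yield rec
                break  # report each record at most once


records = [
    Rec("sample-001", "/data/alpha.fastq", "Example species", "Center A"),
    Rec("sample-002", "/data/beta.fastq", "Other species", None),
]

hits = [r.id for r in search_sketch(records, "Example*")]
path_hits = [r.id for r in search_sketch(records, "*alpha*")]
print(hits)
print(path_hits)
```

The second query finds nothing because "alpha" appears only in a paths value. Note that `fnmatchcase` also treats `?` and `[...]` as special, which may differ from the real search.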

biolite.catalog.make_record(**kwargs)

Returns a CatalogRecord object by mapping the provided keyword arguments to field names.
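Mapping keyword arguments onto named-tuple fields might look like the sketch below. It assumes that omitted fields default to None and that unknown names are rejected; neither behavior is confirmed by the text above, and `make_record_sketch` and `RecordSketch` are hypothetical stand-ins.

```python
from collections import namedtuple

# Field names in the positional order given by the CatalogRecord aliases.
FIELDS = (
    "id", "paths", "species", "ncbi_id", "itis_id", "extraction_id",
    "library_id", "library_type", "tissue", "sequencer", "seq_center",
    "note", "sample_prep", "timestamp",
)
RecordSketch = namedtuple("RecordSketch", FIELDS)


def make_record_sketch(**kwargs):
    """Build a record from keyword arguments (assumed defaults: None)."""
    unknown = set(kwargs) - set(FIELDS)
    if unknown:
        raise TypeError(f"unknown catalog fields: {sorted(unknown)}")
    # Look up each field in order; missing ones become None.
    return RecordSketch(*(kwargs.get(name) for name in FIELDS))


# Hypothetical values for illustration only.
rec = make_record_sketch(id="sample-001", ncbi_id=12345)
print(rec.id, rec.ncbi_id, rec.species)
```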
