Automating workflows

workflows Module

Provides a collection of helper functions that coordinate multiple wrappers from the wrappers Module to accomplish a unified goal or automate a common analysis task.

Workflows are available for the following groups of tasks:

  • Assembly statistics and sweeps
  • Contig parsing
  • Blast result parsing
  • SamTools automation
  • Transcript cleaning

class biolite.workflows.BlastHit

Bases: tuple

BlastHit(query, title, definition, id, evalue, rank, orient, mask, score, bitscore, length, percent)

bitscore

Alias for field number 9

definition

Alias for field number 2

evalue

Alias for field number 4

id

Alias for field number 3

length

Alias for field number 10

mask

Alias for field number 7

orient

Alias for field number 6

percent

Alias for field number 11

query

Alias for field number 0

rank

Alias for field number 5

score

Alias for field number 8

title

Alias for field number 1
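The field aliases above can be illustrated with a small sketch. It rebuilds an equivalent named tuple locally with collections.namedtuple rather than importing it from biolite, and every value in the sample hit is hypothetical.

```python
from collections import namedtuple

# Local stand-in that mirrors the documented field order.
# It is not imported from biolite.workflows.
BlastHit = namedtuple(
    "BlastHit",
    "query title definition id evalue rank orient mask "
    "score bitscore length percent",
)

# Hypothetical values, for illustration only.
hit = BlastHit(
    query="Locus_1_Transcript_1/1_Confidence_1.000_Length_160",
    title="example subject title",
    definition="example definition",
    id="example_id",
    evalue=1e-30,
    rank=0,
    orient=1,
    mask=None,
    score=250.0,
    bitscore=100.5,
    length=150,
    percent=98.0,
)

# Each field can be read by name or by its documented position.
same_evalue = hit.evalue == hit[4]
same_bitscore = hit.bitscore == hit[9]
```

Reading fields by name keeps calling code independent of the tuple's field order.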

class biolite.workflows.ContigHeader

Bases: tuple

ContigHeader(locus, transcript, confidence, length)

confidence

Alias for field number 2

length

Alias for field number 3

locus

Alias for field number 0

transcript

Alias for field number 1

biolite.workflows.blast_annotate_seqs(hits, fasta_in, hits_out, misses_out, all_out=False, rpkms={})[source]

Iterates through the records in fasta_in and looks up each one in hits, a dict of BlastHit objects.

For each record with a hit, the RPKM (if provided), hit title, and evalue are added to the ID and the record is written to hits_out.

If there is no hit, the record is written to misses_out.

If all_out is True, then hits are also written to misses_out.
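The routing rules above can be sketched on in-memory data. This is not the BioLite implementation: the function name route_records, the (id, sequence) record layout, the dict-based hits, and the way annotations are appended to the ID are all assumptions made for illustration. Whether all_out copies the annotated or the original record is also an assumption.

```python
def route_records(records, hits, rpkms=None, all_out=False):
    """Split (seq_id, sequence) pairs into hit and miss lists.

    Hypothetical helper: the annotation format below is invented for
    illustration and may differ from what BioLite writes.
    """
    rpkms = rpkms or {}
    hits_out, misses_out = [], []
    for seq_id, seq in records:
        hit = hits.get(seq_id)
        if hit is None:
            # No hit: the record goes to the misses unchanged.
            misses_out.append((seq_id, seq))
            continue
        parts = [seq_id]
        if seq_id in rpkms:
            parts.append("rpkm=%.2f" % rpkms[seq_id])
        parts.append("title=%s" % hit["title"])
        parts.append("evalue=%g" % hit["evalue"])
        annotated = (" ".join(parts), seq)
        hits_out.append(annotated)
        if all_out:
            # Assumption: the annotated record is what gets copied.
            misses_out.append(annotated)
    return hits_out, misses_out


records = [("seq1", "ACGT"), ("seq2", "GGCC")]
hits = {"seq1": {"title": "example title", "evalue": 1e-30}}
with_hits, without_hits = route_records(records, hits, rpkms={"seq1": 12.5})
```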

biolite.workflows.blast_hits(xml_path, nlimit=None)[source]

Reads an XML formatted BLAST report, and yields one named tuple per alignment, i.e. per hit between a query and a subject. Each named tuple has the following elements:

query title definition id evalue rank orient mask score bitscore length percent

where:

  • orient is 1 if query and subject are in the same direction, 2 if they are in opposite directions, and 0 if the direction is inconsistent across the high-scoring segment pairs (HSPs)
  • evalue is the minimum evalue across HSPs
  • score, bitscore and length are the maxima across HSPs

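The per-alignment summary rules above can be sketched as follows. The function summarize_hsps and the dict layout of each HSP (including the same_direction flag) are hypothetical. The real function reads these values from the BLAST XML report.

```python
def summarize_hsps(hsps):
    """Collapse the HSPs of one alignment into summary values.

    Each HSP is a dict with the hypothetical keys 'same_direction'
    (bool), 'evalue', 'score', 'bitscore' and 'length'.
    """
    if not hsps:
        raise ValueError("an alignment needs at least one HSP")
    directions = {h["same_direction"] for h in hsps}
    if directions == {True}:
        orient = 1   # every HSP runs in the same direction as the query
    elif directions == {False}:
        orient = 2   # every HSP runs in the opposite direction
    else:
        orient = 0   # HSPs disagree
    return {
        "orient": orient,
        "evalue": min(h["evalue"] for h in hsps),
        "score": max(h["score"] for h in hsps),
        "bitscore": max(h["bitscore"] for h in hsps),
        "length": max(h["length"] for h in hsps),
    }


example = summarize_hsps([
    {"same_direction": True, "evalue": 1e-10, "score": 50,
     "bitscore": 40.0, "length": 100},
    {"same_direction": False, "evalue": 1e-20, "score": 80,
     "bitscore": 60.0, "length": 90},
])
```

Note that the summary values can come from different HSPs: in the example, the minimum evalue and the maximum length belong to different segments.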
biolite.workflows.blast_top_hits(xml_path)[source]

Similar to blast_hits, but returns an OrderedDict keyed by query name with only one hit (the top hit) per query.
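A minimal sketch of reducing a stream of hits to one per query. It assumes the top hit is the first one encountered for each query, which the description above does not confirm. The Hit tuple and the function name first_hit_per_query are hypothetical.

```python
from collections import OrderedDict, namedtuple

# Hypothetical two-field stand-in for the full BlastHit tuple.
Hit = namedtuple("Hit", "query title")


def first_hit_per_query(hits):
    """Keep the first hit seen for each query, preserving query order."""
    top = OrderedDict()
    for hit in hits:
        # setdefault leaves an existing entry untouched, so later
        # hits for the same query are ignored.
        top.setdefault(hit.query, hit)
    return top


top = first_hit_per_query([
    Hit("q1", "best for q1"),
    Hit("q1", "second for q1"),
    Hit("q2", "best for q2"),
])
```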

biolite.workflows.calculate_rpkms(coverage_table)[source]

biolite.workflows.clean_rrna(fasta_in, clean_out, rrna_out)[source]

Blastn against rRNA, transferring sequences with or without a hit to their own files. Even when rRNA reads are removed prior to assembly, some may make it through and be assembled from the full dataset (including low frequency contaminant rRNAs).

biolite.workflows.clean_swissprot(fasta_in, clean_out, annotated_out, blast_out, rpkms=None)[source]

Blastn against SwissProt, transferring sequences with or without a hit to their own files, used in comparing assemblies.

biolite.workflows.clean_univec(fasta_in, clean_out, vector_out)[source]

Blastn against univec, transferring sequences with or without a hit to their own files. This removes sequences that still have adapters, or that are contaminated with plasmids (including the protein expression plasmids used to manufacture sample prep enzymes).

biolite.workflows.contig_stats(fasta_path, hist_path)[source]

Parses the assembled contigs in fasta_path and writes a histogram of contig length to hist_path.

Writes the total contig count, mean length, and N50 length to the diagnostics.
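The three summary values can be computed from a list of contig lengths as in the sketch below. It uses the common definition of N50 (the length of the shortest contig in the smallest set of longest contigs that covers at least half of the total assembled length). The function name contig_summary is hypothetical, and BioLite may compute or report these values differently.

```python
def contig_summary(lengths):
    """Return contig count, mean length and N50 for a list of lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        return {"count": 0, "mean": 0.0, "n50": 0}
    total = sum(ordered)
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        # Stop at the first contig that brings the running sum to at
        # least half of the total assembled length.
        if 2 * running >= total:
            n50 = length
            break
    return {"count": len(ordered), "mean": total / len(ordered), "n50": n50}


summary = contig_summary([2, 2, 2, 3, 3, 4, 8, 8])
```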

biolite.workflows.dustmasker(fasta_in, clean_out, dirty_out, max_lowc=0.8, min_region=0.1, unpack_func=unpack_oases_header)[source]

biolite.workflows.extract_oases_exemplars(input_path, output_path, min_length=0)[source]

Extracts a single exemplar transcript for each locus in an Oases assembly at input_path and writes it to output_path. Only transcripts longer than min_length are considered.

The exemplar is chosen as the transcript with the highest confidence score.
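The selection rule can be sketched on in-memory records. The function pick_exemplars and the (locus, confidence, sequence) record layout are hypothetical. Measuring length as the sequence length and keeping the first transcript on a confidence tie are also assumptions.

```python
def pick_exemplars(records, min_length=0):
    """Choose one sequence per locus: the one with the highest confidence.

    records is an iterable of (locus, confidence, sequence) tuples.
    """
    best = {}
    for locus, confidence, seq in records:
        # Only transcripts strictly longer than min_length are considered.
        if len(seq) <= min_length:
            continue
        current = best.get(locus)
        # Strict comparison keeps the first transcript on a tie.
        if current is None or confidence > current[0]:
            best[locus] = (confidence, seq)
    return {locus: seq for locus, (_, seq) in best.items()}


exemplars = pick_exemplars(
    [
        (1, 0.25, "ACGTACGT"),
        (1, 0.75, "ACGTAC"),
        (2, 1.00, "AC"),      # dropped: not longer than min_length
        (3, 0.50, "GGGGCCCC"),
    ],
    min_length=4,
)
```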

biolite.workflows.filter_coverage_table(coverage_table, seq_ids, filtered_table)[source]

Filters a coverage_table so that only entries with IDs in the list seq_ids remain and writes output to the path filtered_table.

biolite.workflows.multiblast(blast, query, db, out, evalue=0.0001, cores=4, targets=20)[source]

Prepares a single query file for the multiblast by dividing the queries into nodes chunks, where nodes = threads / cores and threads is read from the BioLite configuration file.

Executes the Blast operation blast (e.g. ‘blastx’) in parallel on each node, then concatenates the XML output into a single XML file out.
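The chunking arithmetic can be sketched as follows. The function split_queries is hypothetical, and both the round-robin distribution and the integer division are assumptions. The actual splitting strategy is not described above.

```python
def split_queries(records, threads, cores=4):
    """Divide records into threads // cores chunks (at least one)."""
    nodes = max(1, threads // cores)
    chunks = [[] for _ in range(nodes)]
    for i, record in enumerate(records):
        # Round-robin assignment keeps chunk sizes within one record
        # of each other.
        chunks[i % nodes].append(record)
    return chunks


# With 8 threads and 4 cores per Blast process there are 2 chunks.
chunks = split_queries(list(range(10)), threads=8, cores=4)
```

Each chunk would then be handed to its own Blast process running with cores threads, so the total thread use stays at roughly threads.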

biolite.workflows.oases_assemblies(inputs, kmers=[61], workdir='./', min_length=None, ins_length=None)[source]

Automates Oases assemblies that sweep multiple kmers.

If inputs is a list of FASTQ files, they are automatically shuffled together. Or, provide a singleton list with the path to a pre-shuffled FASTQ file.

biolite.workflows.oases_clean(workdir='./')[source]

Cleans up a work directory that was used for an Oases assembly.

biolite.workflows.oases_concat_assembly(inputs, concat_path, kmers, workdir='./', ins_length=None)[source]

Performs Oases assemblies sweeping over the provided kmers list, and concatenates all contigs to concat_path.

If inputs is a list of FASTQ files, they are automatically shuffled together. Or, provide a singleton list with the path to a pre-shuffled FASTQ file.

biolite.workflows.oases_merge_assembly(inputs, merge_path, merge_kmer, kmers, min_length=None, workdir='./', ins_length=None)[source]

Implements the Oases-M protocol for merging several Oases assemblies, as described in:

Schulz, M. H., Zerbino, D. R., Vingron, M., & Birney, E. (2012). Oases: Robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics (Oxford, England), 1-7. doi:10.1093/bioinformatics/bts094

Performs Oases assemblies sweeping over the provided kmers list, then performs an Oases merge assembly with merge_kmer.

class biolite.workflows.rRNAhit

Bases: tuple

rRNAhit(locus, gene, confidence, orient, query)

confidence

Alias for field number 2

gene

Alias for field number 1

locus

Alias for field number 0

orient

Alias for field number 3

query

Alias for field number 4

biolite.workflows.rrna_blast_hits(xml_path, unpack_header_func)[source]

Reads an XML formatted BLAST report, and saves one top hit per locus, using the transcript with the highest confidence for the locus.

The locus name and confidence are extracted from the query name with the supplied ‘unpack_header_func’ function.

Returns both a set of all the queries in the XML report, and a dictionary keyed by locus and storing the rRNA hits:

set(queries), dict(hits)

The rRNA hits are tuples with the following fields:

locus gene confidence orient query

biolite.workflows.sort_and_index_sam(sam_path)[source]

Uses SamTools to convert a SAM file at sam_path to BAM, then sort and index the BAM.

Returns the filename of the final output, which is ‘_sorted.bam’ appended to sam_path.

biolite.workflows.trinity_assembly(out, inputs, workdir='./', min_length=None)[source]

biolite.workflows.unpack_oases_header(header)[source]

Unpacks an Oases contig header into a ContigHeader object.

Example header:

>Locus_9919_Transcript_1/1_Confidence_1.000_Length_160
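A header of this shape can be unpacked with a regular expression, as in the sketch below. The function parse_oases_header, the local ContigHeader definition, and the choice of int and float field types are assumptions. The BioLite function may parse the header differently.

```python
import re
from collections import namedtuple

# Local stand-in with the documented field order.
ContigHeader = namedtuple("ContigHeader", "locus transcript confidence length")

# Matches e.g. ">Locus_9919_Transcript_1/1_Confidence_1.000_Length_160".
# The number after the slash is matched but not captured.
_HEADER_RE = re.compile(
    r"^>?Locus_(\d+)_Transcript_(\d+)/\d+_Confidence_([0-9.]+)_Length_(\d+)$"
)


def parse_oases_header(header):
    """Parse an Oases contig header into a ContigHeader tuple."""
    match = _HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError("not an Oases contig header: %r" % header)
    locus, transcript, confidence, length = match.groups()
    return ContigHeader(int(locus), int(transcript), float(confidence), int(length))


parsed = parse_oases_header(">Locus_9919_Transcript_1/1_Confidence_1.000_Length_160")
```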
