README

What is this repository for?

This is the software repository for the Genome Canada HelpDesk. The HelpDesk was a bioinformatics support platform based in the Wishart Research Group at the University of Alberta until 2011.

This library of scripts and tools is free to use, but support is limited. If you find a serious problem, please submit an issue on BitBucket and we'll take a look.

Repository

align_learn.pl

Converts a multiple sequence alignment into a format that can be readily analyzed using common machine learning algorithms.

annotator.pl

Reads multiple sequences in FASTA format from a file and submits each to local BLAST. The complete BLAST results are written to a file, and the best match is sent as an Entrez query to NCBI.

Batch PSORT

This program sends protein sequences to a PSORT Server, parses the response, and writes the results to a text file.

batch_bind_blast.pl

This script reads multiple sequences from a file and submits each to BIND BLAST.

BLAST Hit Table Extender

This script uses the identification number to retrieve a more detailed description of the hit sequence from NCBI.

blast_client3_2.pl

This script performs BLAST searches against NCBI's nr database. It prompts the user for a BLAST search type and an input file of FASTA-formatted sequences. An optional 'limit by Entrez query' value can be supplied to restrict the search. The script then submits each sequence to BLAST and retrieves the results. For each hit, the script retrieves a detailed title by performing a separate query of NCBI's databases. Each BLAST hit and its descriptive title are written to a single tab-delimited output file.

blastn_client3_1.pl

This script reads one or more DNA sequences in FASTA format from a file and submits each to NCBI BLAST using the blastn program.

blastx_client3_1.pl

This script reads one or more DNA sequences in FASTA format from a file and submits each to NCBI BLAST using the blastx program.

Clickable Sequence Features

Clickable Sequence Features is an object-oriented program that converts GenBank, EMBL, FASTA, or RAW sequence files into an HTML figure showing the DNA sequence and translations described in the sequence record.

Codon Usage

Codon Usage accepts a DNA sequence and returns the number and frequency of each codon type.
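
The counting described above can be sketched in a few lines of Python. This is only an illustration, not the Codon Usage program itself; the function name `codon_usage`, the choice to read codons in frame from the first base, and the handling of non-ACGT characters are all assumptions.

```python
from collections import Counter

def codon_usage(dna):
    """Return {codon: (count, frequency)} for the in-frame codons of `dna`."""
    # Assumption: keep only A, C, G and T and ignore everything else.
    seq = "".join(ch for ch in dna.upper() if ch in "ACGT")
    # Non-overlapping triplets from position 0; a trailing partial codon is dropped.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: (n, n / total) for codon, n in counts.items()}
```

For example, `codon_usage("ATGGCCATGTAA")` reports ATG twice (frequency 0.5) and GCC and TAA once each.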

compare_library.pl

This script accepts two files (i and j) containing multiple DNA sequences in FASTA format. Each sequence in file i is compared using local BLAST (bl2seq) to each sequence in file j, and an HTML table is generated to display a summary of the findings.

DNA Stats

DNA Stats returns the number of occurrences of each residue in the sequence you enter.

EMBOSS - User Interface

This software package generates interfaces for the EMBOSS suite of programs.

Extract FASTA Headers

Given a file containing multiple FASTA-formatted entries, this script outputs a file containing only the FASTA headers.
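
As a rough Python illustration of the idea (not the script in this repository), header extraction amounts to keeping the lines that start with `>`. The function name `fasta_headers` is hypothetical.

```python
def fasta_headers(lines):
    """Yield the header lines (those starting with '>') from an iterable of lines."""
    for line in lines:
        if line.startswith(">"):
            yield line.rstrip("\n")
```

Because it accepts any iterable of lines, the sketch can be applied directly to an open file handle, for example `list(fasta_headers(open("input.fasta")))`, where `input.fasta` is a placeholder file name.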

evolving_peptide_search.pl

This script reads multiple protein sequences (in FASTA format) from a file and then searches each for a peptide sequence. The search is repeated using increasingly degenerate versions of the peptide until the maximum allowed number of matches is obtained. This script can be used to find peptides with a primary sequence close to a peptide of interest.
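
The description does not say how the peptide is made more degenerate, so the following Python sketch makes its own assumption: degeneracy is modelled as the number of mismatching positions allowed, raised one step at a time until at least `max_matches` hits are found. The function names and the stopping rule are illustrative and may differ from what evolving_peptide_search.pl does.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def evolving_search(proteins, peptide, max_matches):
    """Search `proteins` ({name: sequence}) for `peptide`, relaxing the match stepwise.

    Returns (allowed_mismatches, hits), where each hit is
    (protein name, 1-based start position, matched segment).
    """
    width = len(peptide)
    hits = []
    for allowed in range(width + 1):
        hits = [
            (name, i + 1, seq[i:i + width])
            for name, seq in proteins.items()
            for i in range(len(seq) - width + 1)
            if count_mismatches(seq[i:i + width], peptide) <= allowed
        ]
        # Assumed stopping rule: stop relaxing once enough matches are found.
        if len(hits) >= max_matches:
            break
    return allowed, hits
```

With `{"p1": "MKTAYIAK", "p2": "MKSAYLAK"}`, the peptide `TAYI` and `max_matches=2`, the search stops at two allowed mismatches and reports `TAYI` in p1 and `SAYL` in p2.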

feature_parse.pl

This script reads a genomic sequence in FASTA or RAW format from a file and writes out the features that are described in a feature position file. The extracted features are written in FASTA format to the specified output file.

fetch_protein_v_2.pl

This script accepts a list of Swiss-Prot IDs or Swiss-Prot names. The sequence record corresponding to each ID is retrieved from ExPASy and written to a separate file in the output directory you specify. Records can be written in FASTA format or in Swiss-Prot format.

fetch_swissprot_using_id.pl

This script accepts a list of Swiss-Prot IDs. The sequence and title corresponding to each ID are retrieved from ExPASy and written to a file in FASTA format.

Filter DNA

Filter DNA removes non-DNA characters from text. Use this program when you wish to remove digits and blank spaces from a sequence to make it suitable for other applications.
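
A minimal Python sketch of this kind of filtering follows. Which characters count as DNA is an assumption here (A, C, G, T and N in either case); the actual Filter DNA program may accept a different alphabet, and the name `filter_dna` is hypothetical.

```python
def filter_dna(text, allowed="ACGTN"):
    """Return `text` with every character outside `allowed` removed (case-insensitive)."""
    keep = set(allowed.upper() + allowed.lower())
    return "".join(ch for ch in text if ch in keep)
```

For example, `filter_dna("1 acgt 61 GGTA\n")` returns `"acgtGGTA"`, with digits, spaces and the line break stripped.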

Filter Protein

Filter Protein removes non-protein characters from text. Use this program when you wish to remove digits and blank spaces from a sequence to make it suitable for other applications.

GenBank Feature Extractor

GenBank Feature Extractor accepts a GenBank file as input and reads the sequence feature information described in the feature table, according to the rules outlined in the GenBank release notes. The program concatenates or highlights the relevant sequence segments and returns each sequence feature in FASTA format.

GenBank Trans Extractor

GenBank Trans Extractor accepts a GenBank file as input and returns each of the protein translations described in the file in FASTA format.

genbank_to_cgview.pl

genbank_to_cgview.pl converts a GenBank or EMBL sequence record into an XML document for the CGView genome visualization software.

generic_ncbi_data_fetcher.pl

This script uses NCBI's Entrez Programming Utilities to perform searches of NCBI databases. It can return either the complete database records or the IDs of the records (recommended). It is up to you to know how to handle the IDs and records. The results are written to a single output file. For additional information, see the documentation for NCBI's Entrez Programming Utilities.

genome_search.pl

Genome Search reads a genomic sequence in FASTA format from a file and searches for the patterns you specify using regular expressions.
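
A pattern search of this kind can be sketched with Python's `re` module. The sketch below is illustrative only: it scans the forward strand, reports non-overlapping matches with 1-based coordinates, and ignores case, all of which are assumptions rather than documented behaviour of genome_search.pl.

```python
import re

def search_genome(sequence, patterns):
    """Return (pattern name, start, end, matched text) for each regex match.

    `patterns` maps a label to a regular expression; positions are 1-based and inclusive.
    """
    hits = []
    for name, pattern in patterns.items():
        for match in re.finditer(pattern, sequence, flags=re.IGNORECASE):
            hits.append((name, match.start() + 1, match.end(), match.group()))
    return hits
```

For example, searching `"ttGAATTCaaGGATCC"` with `{"EcoRI": "GAATTC", "BamHI": "GGATCC"}` reports EcoRI at positions 3 to 8 and BamHI at positions 11 to 16.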

genome_search_parse_results.pl

Reads the results from genome_search.pl and generates a summary for each match.

go_fish_source.pl

This Perl script assigns Gene Ontology (GO) numbers and descriptions to BLAST results generated by annotator.pl.

Hydrophobicity Profiler

This Perl script reads a FASTA-formatted protein sequence file and returns the hydrophobicity profile of the input sequence for the window size and hydrophobicity scale specified by the user.
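
A sliding-window profile can be sketched in Python as below. The sketch uses the Kyte-Doolittle scale as one example of a hydrophobicity scale and averages the values in each window; the scales offered by the Hydrophobicity Profiler, and whether it averages or sums, are not stated here, so treat these choices and the function name as assumptions.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobicity_profile(protein, window=9, scale=KYTE_DOOLITTLE):
    """Return the mean scale value of every full window along `protein`."""
    values = [scale[aa] for aa in protein.upper()]
    # One value per window position; sequences shorter than the window give [].
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

A sequence of length n yields n - window + 1 values, and a residue missing from the scale raises a `KeyError` rather than being skipped silently.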

local_blast_client.pl

This script performs BLAST searches against a local BLAST database. It prompts the user for a BLAST search type and an input file of FASTA-formatted sequences. The script then submits each sequence to BLAST and retrieves the results. For each hit, the script retrieves a detailed title by performing a separate query of NCBI's databases. Each BLAST hit and its descriptive title are written to a single tab-delimited output file.

microarray_randomizer.pl

This script accepts a file consisting of tab-delimited microarray data. Numerical values, except for those in the first column, are replaced with pseudo-random values greater than or equal to the lower limit you specify, and less than the upper limit you specify.
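
The replacement rule can be illustrated in Python as follows. This is a sketch under assumptions, not the script itself: a value counts as numerical if `float()` can parse it, each row is already split on tabs and is non-empty, and the name `randomize_row` is hypothetical.

```python
import random

def randomize_row(fields, lower, upper, rng=random):
    """Replace numeric fields, except the first column, with values in [lower, upper)."""
    out = [fields[0]]  # the first column is always kept as-is
    for field in fields[1:]:
        try:
            float(field)
        except ValueError:
            out.append(field)  # non-numeric cells (e.g. labels) are left untouched
        else:
            # rng.random() is in [0.0, 1.0), so the result is at least `lower`
            # and, up to floating-point rounding, below `upper`.
            out.append(str(lower + (upper - lower) * rng.random()))
    return out
```

A tab-delimited line can be processed with `"\t".join(randomize_row(line.rstrip("\n").split("\t"), 0.0, 5.0))`. Passing a seeded `random.Random` instance as `rng` makes the output reproducible.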

Multiple Align Show

Multiple Align Show accepts a group of aligned sequences (in FASTA or GDE format) and formats the alignment to your specifications.

Multi Rev Trans

Multi Rev Trans accepts a protein alignment and uses a codon usage table to generate a graph that can be used to find regions of minimal degeneracy at the nucleotide level.

new_psort.pl

new_psort.pl sends sequences to a PSORT server and parses and saves the results.

ORF Finder

ORF Finder searches for open reading frames (ORFs) in the DNA sequence you enter. The program returns the range of each ORF, along with its protein translation.
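
One simple way to locate ORFs is sketched below in Python. The sketch assumes an ORF runs from an ATG codon to the next in-frame stop codon, uses the standard genetic code, and scans only the three forward reading frames; the real ORF Finder may offer other start codons, genetic codes, or strands. All names are illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def find_orfs(dna, min_codons=1):
    """Return (start, end, protein) for each ATG...stop ORF on the forward strand.

    Positions are 1-based and inclusive; `end` is the last base of the stop codon.
    """
    seq = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and CODON_TABLE.get(codon) == "*":
                # Translate from the start codon up to, but not including, the stop.
                protein = "".join(
                    CODON_TABLE.get(seq[j:j + 3], "X") for j in range(start, i, 3)
                )
                if len(protein) >= min_codons:
                    orfs.append((start + 1, i + 3, protein))
                start = None
    return orfs
```

For example, `find_orfs("CCATGGCCTAAGG")` returns `[(3, 11, "MA")]`: the ORF starts at base 3, ends at base 11, and translates to Met-Ala.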

Pearson Correlation Coefficient Parser

This Perl script takes a single Excel file listing multiple genes along with their intensities and calculates the Pearson correlation coefficient. Results with a coefficient above 0.6 or below -0.6 are written to two Excel files, Detail_Over.xls and Detail_Under.xls.
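
The calculation and the 0.6 cutoff can be sketched in Python as shown below. The sketch assumes the coefficient is computed between the intensity profiles of every pair of genes, which the description does not state explicitly; reading and writing Excel files is omitted, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def classify_pairs(profiles, cutoff=0.6):
    """Split gene pairs into those with r > cutoff and those with r < -cutoff.

    `profiles` maps a gene name to its list of intensities (assumed layout).
    """
    over, under = [], []
    for (gene1, x), (gene2, y) in combinations(profiles.items(), 2):
        r = pearson(x, y)
        if r > cutoff:
            over.append((gene1, gene2, r))
        elif r < -cutoff:
            under.append((gene1, gene2, r))
    return over, under
```

Note that `pearson` divides by zero when either profile is constant, so such rows would need to be filtered out beforehand.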

Perl BLAST Client

Reads a text file containing multiple sequences in FASTA format and submits each sequence to NCBI's BLAST server using QBLAST's URL API.

pI/MW batch analysis tool

This Perl program creates a .txt file containing the sequence name, length, predicted molecular weight, and predicted isoelectric point of the protein sequences it receives.

Programming in Perl - Part 1

This collection of simple programs is intended to introduce the Perl programming language to students with little or no programming experience (part one of two).

Programming in Perl - Part 2

This collection of simple programs is intended to introduce the Perl programming language to students with little or no programming experience (part two of two).

Protein Molecular Weight

Protein Molecular Weight accepts a protein sequence and calculates the molecular weight. You can append copies of commonly used epitopes and fusion proteins using the supplied list.

Protein Stats

Protein Stats returns the number of occurrences of each residue in the sequence you enter. Percentage totals are also given for each residue, and for certain groups of residues.

Random DNA Sequence

Random DNA Sequence generates a random sequence of the length you specify. Random sequences can be used to evaluate the significance of sequence analysis results.

Random Protein Sequence

Random Protein Sequence generates a random sequence of the length you specify. Random sequences can be used to evaluate the significance of sequence analysis results.

random_seq_sample.pl

This script accepts a file consisting of multiple FASTA formatted sequence records. It then randomly selects sequences from the file, without replacement.

range_extract.pl

Reads a genomic sequence in FASTA or RAW format from a file and writes out the range of bases between the supplied start and stop positions to a file.

Reformat PDB

A script to reformat unusual PDB files into a more standard PDB format. This script (1) re-orders the atoms within each residue into a 'standard' order, (2) renames atoms to a 'standard' format, e.g. HD23 becomes 3HD2, (3) renames certain residues, e.g. 'HSD' or 'HID' become 'HIS', (4) preserves only one location for each atom, for atoms that have alternate location codes.

remote_blast_client.pl

This script performs BLAST searches against NCBI's sequence databases. It prompts the user for a BLAST search type and an input file of FASTA-formatted sequences. An optional 'limit by Entrez query' value can be supplied to restrict the search. The script then submits each sequence to BLAST and retrieves the results. For each hit, the script retrieves a detailed title by performing a separate query of NCBI's databases. Each BLAST hit and its descriptive title are written to a single tab-delimited output file.

remove_duplicate_seqs.pl

Reads multiple sequence records in FASTA format from a file. If two or more sequences match, only the first record in the matching group is written to the output file.
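
The keep-the-first-record rule can be sketched in Python as below. The sketch assumes that two sequences match when they are identical after ignoring case, which may be stricter or looser than the test used by remove_duplicate_seqs.pl; the helper names are hypothetical.

```python
def parse_fasta(lines):
    """Yield (title, sequence) pairs from an iterable of FASTA lines."""
    title, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line and title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)

def remove_duplicate_seqs(records):
    """Keep only the first record of each group of identical sequences."""
    seen = set()
    kept = []
    for title, seq in records:
        key = seq.upper()  # assumption: comparison ignores case
        if key not in seen:
            seen.add(key)
            kept.append((title, seq))
    return kept
```

The two helpers combine as `remove_duplicate_seqs(parse_fasta(open("input.fasta")))`, where `input.fasta` is a placeholder file name.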

remove_duplicates.pl

Reads multiple sequence records in FASTA format from a file and removes duplicate records (based on sequence title).

remove_near_duplicates.pl

This script reads multiple sequence records in FASTA format from a file and if there are two or more sequences that match, only the first record in the matching group is written to the output file. The names of the removed records are written to a log file.

remove_x.pl

Reads multiple sequence records in FASTA format from a file and removes X's and x's from the sequences.

Restriction Summary

Restriction Summary accepts a DNA sequence and returns the number and positions of restriction endonuclease cut sites.

Retrieve_Entrez_Gene_Info.pl

This script uses NCBI's Entrez Programming Utilities URL API to submit batch requests to NCBI Entrez. It retrieves gene information for an organism such as Gene ID, Gene name, Gene description, Gene synonyms, Location, HGNC ID, HPRD ID, MIM ID, phenotype[MIM ID], KEGGPathways, ConserveDomains and Unigene ID information from NCBI's Entrez gene database.

retrieve_seq.pl

This script uses NCBI's Entrez Programming Utilities URL API to submit batch requests to NCBI Entrez. It can be used, for example, to download all the sequences in an NCBI database that were obtained from a particular species.

retrieve_seq_v2.pl

This script uses NCBI's Entrez Programming Utilities to perform batch requests to NCBI Entrez. It can be used, for example, to download all the sequences in an NCBI database that were obtained from a particular species. This version has been customized for retrieval of 16S rRNA sequences.

Reverse Complement

Reverse Complement converts a DNA sequence into its reverse, complement, or reverse-complement counterpart.
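
The three conversions can be sketched in Python with a translation table. This is an illustration only; it handles A, C, G and T in either case and leaves any other character unchanged, which is an assumption about how ambiguity codes should be treated.

```python
# Map each base to its complement, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse(dna):
    """Return the sequence read backwards."""
    return dna[::-1]

def complement(dna):
    """Return the complementary base at each position."""
    return dna.translate(COMPLEMENT)

def reverse_complement(dna):
    """Return the complement of the sequence, read backwards."""
    return complement(dna)[::-1]
```

For the input `ATGC`, the reverse is `CGTA`, the complement is `TACG`, and the reverse complement is `GCAT`.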

seqsee

SEQSEE is a comprehensive protein sequence analysis package.

Sequence Extractor

Sequence Extractor accepts a DNA sequence along with a set of primer sequences and returns a textual map showing the annealing positions of the primers, restriction cut sites, and protein translations.

Sequence Manipulation Suite

The Sequence Manipulation Suite is a collection of web-based programs for analyzing and formatting DNA and protein sequences (version 1).

Sequence Manipulation Suite 2

The Sequence Manipulation Suite version 2 is much faster than the previous version and contains several new programs and enhancements. It can be used to perform much of the simple sequence formatting and analysis done in molecular biology labs, and as a teaching aid when introducing students to DNA and protein sequences.

Shuffle DNA

Shuffle DNA randomly shuffles a DNA sequence. Shuffled sequences can be used to evaluate the significance of sequence analysis results, particularly when sequence composition is an important consideration.
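
A shuffle that preserves composition can be sketched in Python as follows. The function name `shuffle_sequence` and the optional `rng` argument are illustrative; the sketch works equally well for protein sequences.

```python
import random

def shuffle_sequence(sequence, rng=random):
    """Return a random permutation of `sequence` with the same residue composition."""
    residues = list(sequence)
    rng.shuffle(residues)  # in-place shuffle of the residue list
    return "".join(residues)
```

Because the residues are only reordered, each base occurs exactly as often in the result as in the input, which is what makes shuffled sequences useful as composition-matched controls. Passing a seeded `random.Random` instance as `rng` makes a run reproducible.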

Shuffle Protein

Shuffle Protein randomly shuffles a protein sequence. Shuffled sequences can be used to evaluate the significance of sequence analysis results, particularly when sequence composition is an important consideration.

split_fasta.pl

This script accepts a file consisting of multiple FASTA formatted sequence records. It splits the file into multiple new files, each consisting of a subset of the original records.
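
The description does not say how the subsets are chosen, so the following Python sketch assumes the simplest rule: consecutive groups of a fixed number of records, with a smaller final group if the total does not divide evenly. The function name and parameter are hypothetical.

```python
def split_records(records, per_file):
    """Split a list of records into consecutive groups of at most `per_file` items."""
    if per_file < 1:
        raise ValueError("per_file must be at least 1")
    return [records[i:i + per_file] for i in range(0, len(records), per_file)]
```

Each returned group would then be written to its own output file. Five records split two per file give groups of sizes 2, 2 and 1.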

summary_adder_2.pl

This script obtains summary information from NCBI and adds it to the output of earlier versions of the blast_client.pl scripts (versions 1.2 and earlier).

three_frames.pl

This script converts a FASTA-formatted DNA sequence file into a new file containing all six protein translations of each supplied DNA sequence.

Translate

Translate accepts a DNA sequence and converts it into a protein using the reading frame you specify.

XALIGN (version 5)

XALIGN is a graphical X-windows program for multiple sequence alignment based on sequence homology and secondary structure (version 5, Linux binary).

XALIGN (version 6)

XALIGN is a graphical X-windows program for multiple sequence alignment based on sequence homology and secondary structure (version 6, source code).