
Genome Constellation

Purpose

Tools for getting user data (e.g. metagenome bins) onto Genome Constellation.

Summary

This repository contains the tools required to put user genomes/bins onto the Genome Constellation browser for viewing.
MASH distances are calculated 1) between each pair of user genomes and 2) between the user genomes and the reference genomes (i.e. RefSeq genomes).
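
For orientation, the distance that Mash reports is derived from the Jaccard index j of two k-mer sketches as D = -(1/k) ln(2j / (1 + j)). The Python 3 sketch below evaluates that formula and counts how many comparisons the two steps above imply. It rests on assumptions: k = 21 is Mash's default k-mer size and this repository may sketch with other settings, each user pair is assumed to be compared once, and the function names are illustrative rather than part of the repository.

```python
import math

def mash_distance(jaccard, k=21):
    """Mash distance from a Jaccard index (k = k-mer size, assumed to be 21)."""
    if jaccard <= 0.0:
        return 1.0  # no shared k-mers: the distance is reported as 1
    distance = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return min(1.0, max(0.0, distance))

def comparison_count(n_user, n_reference):
    """Pairs among user genomes plus every user-vs-reference pair."""
    return n_user * (n_user - 1) // 2 + n_user * n_reference

print(mash_distance(0.9))
# 3 user genomes against a made-up reference set of 10 genomes
print(comparison_count(3, 10))
```

The second function shows why the reference comparisons dominate the run time: the user-vs-user term grows with the square of the number of bins, but the reference term is multiplied by the size of the reference set.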

Installation

Install dependencies

You need to have:

1) python2.7
2) conda
3) mash
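
Once the installation steps in this section are complete, a quick check that all three tools are visible on your PATH can save a failed run later. This is a minimal Python 3 sketch, not part of the repository; the executable names in REQUIRED are assumptions and may differ on your system (for example, python2.7 may be installed simply as python).

```python
import shutil

# Executable names are assumed; adjust them to match your system.
REQUIRED = ("python2.7", "conda", "mash")

def missing_tools(tools=REQUIRED):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

if __name__ == "__main__":
    missing = missing_tools()
    print("missing:", ", ".join(missing) if missing else "none")
```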

Download the Genome Constellation repository

git clone https://jfroula@bitbucket.org/berkeleylab/jgi-genomeconstellation.git

Optional (to recompile jgi_gc):

  • gcc >=4.8
  • boost development libraries with program-options
  • libz development libraries
sudo apt-get install build-essential libboost-dev libboost-program-options-dev libz-dev

Build jgi_gc executable

cd src
make && make install
Install conda & python via miniconda or anaconda
# download miniconda2 (only for Linux x86_64). Refer to the miniconda2 installation documentation for Mac OS.
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh

# run installation
/bin/bash ./Miniconda2-latest-Linux-x86_64.sh

# follow directions to complete setup. 
# make sure conda is now in your PATH.
# The installation script will ask whether it should prepend the miniconda directory to PATH for you or whether you want to export PATH manually:
  1) if you choose to add the miniconda bin to your PATH in the .bashrc file, remember to
  source the .bashrc file (i.e. "source <path_to>/.bashrc").
  2) otherwise just type "export PATH=<full_path_to>/miniconda2/bin:$PATH"

Note: Make sure you have write permission for your conda installation, because the next step installs mash into conda's directory tree. Typing "which conda" tells you which conda command you are using, e.g. <path to your installation>/miniconda2/bin/conda; packages such as mash will then be installed as miniconda2/bin/mash.
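
The permission check in the note above can be sketched in Python 3 as follows. This is an illustration under assumptions, not a repository tool: the function name is hypothetical, and it only tests the directory that holds the conda executable, which is where the note says mash ends up.

```python
import os
import shutil

def conda_bin_is_writable(conda_path=None):
    """True if the directory holding the conda executable is writable."""
    conda_path = conda_path or shutil.which("conda")
    if conda_path is None:
        return False  # conda is not on PATH at all
    bin_dir = os.path.dirname(os.path.realpath(conda_path))
    return os.access(bin_dir, os.W_OK)

if __name__ == "__main__":
    print("conda bin directory writable:", conda_bin_is_writable())
```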

Add channels (places to search for "mash" besides the default)
conda config --add channels r
conda config --add channels bioconda
Install mash

conda install --yes mash

Create mash sketches

This step calculates the distances between genomes and writes the resulting matrix as a json file named *.csv.json.

Before running anything, make sure:

1) your fasta files are in one or more directories. The directory names are used as labels in Constellation, so if you have one called arctic and another called antarctic, you will be able to select a different color for each set in the Constellation browser, i.e. the two data sets will be distinguishable.
2) the suffix of your fasta files is the same for all files, even if you have more than one directory. You can specify the suffix on the command line (fasta, fa, fna, etc.); "*.fa" is the default.
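
The two preconditions above can be checked before launching a run. The following Python 3 sketch is an assumption-laden illustration (collect_fastas is a hypothetical helper, not a repository script): it derives each label from the directory name and fails early if a directory holds no files with the chosen suffix.

```python
import glob
import os

def collect_fastas(fasta_dirs, pattern="*.fa"):
    """Map each directory's name (the label) to its matching fasta files."""
    labelled = {}
    for directory in fasta_dirs:
        # normpath drops a trailing slash so basename yields the directory name
        label = os.path.basename(os.path.normpath(directory))
        files = sorted(glob.glob(os.path.join(directory, pattern)))
        if not files:
            raise ValueError("no files matching %s in %s" % (pattern, directory))
        labelled[label] = files
    return labelled
```

For example, collect_fastas(["arcticData", "antarcticData"], "*.fasta") would return one entry per directory, or raise ValueError for the first directory whose files use a different suffix.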

Usage:
<path_to_repository>/tools/sketching/generate_mash_sketches.sh [options] <fasta_dir> [<fasta_dir>]

   Options (defaults are shown in square brackets):
    <-s suffix regex for your fasta files [*.fa]>
    <-r path to reference pre-computed sketches [<repository>/tools/sketching/REFERENCE_10K.msh]>
    <-p threads [8]>

Note that the default path to the reference sketch is constellation_mash/tools/sketching/REFERENCE_10K.msh. In theory, you can make your own reference sketch and pass it with the "-r" flag; however, we do not yet include a script to generate mash sketches for user references. Please contact Zhong Wang (zhongwang@lbl.gov) if you need such an option.
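
If you prefer to launch the script from Python, the usage above maps onto an argument list as in the sketch below. It uses only the -s, -r and -p options documented here; build_command and run_sketching are hypothetical helper names, and the script path is an example. Passing the arguments as a list means the suffix pattern reaches the script unchanged, without the shell expanding it first.

```python
import subprocess

def build_command(script, fasta_dirs, suffix="*.fa", threads=8, reference=None):
    """Assemble the argument list for generate_mash_sketches.sh."""
    cmd = [script, "-s", suffix, "-p", str(threads)]
    if reference is not None:
        cmd += ["-r", reference]  # omit to use the script's default reference
    return cmd + list(fasta_dirs)

def run_sketching(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    subprocess.check_call(cmd)

cmd = build_command(
    "tools/sketching/generate_mash_sketches.sh",  # example path, adjust as needed
    ["arcticData", "antarcticData"],
    suffix="*.fasta",
    threads=32,
)
print(" ".join(cmd))
```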

Fingerprint size:

By default the fingerprint is built with minFrac = 1024 and numBits = 131072, which gives the 16 KB condensed fingerprint. To use the full fingerprint, set minFrac = 1 and numBits = 1073741824 (or specify the desired numBits).
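
The sizes quoted above follow from dividing the bit counts by eight. A short Python 3 check, assuming numBits is the fingerprint length in bits (which is what the 16 KB figure implies):

```python
def bits_to_bytes(num_bits):
    """Convert a fingerprint length in bits to bytes."""
    return num_bits // 8

condensed = bits_to_bytes(131072)      # default, condensed fingerprint
full = bits_to_bytes(1073741824)       # full fingerprint

print(condensed // 1024, "KiB")        # prints "16 KiB"
print(full // (1024 * 1024), "MiB")    # prints "128 MiB"
```

The full fingerprint is therefore 8192 times larger per genome than the condensed one, which is worth weighing before changing the default.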

Example:

To run with the test bins, run the following from the constellation_mash directory (the pattern is quoted so that the shell passes it to the script unexpanded):

tools/sketching/generate_mash_sketches.sh -s '*.fa' misc/test_data/fastas

Running with non-default options. Note that at this time we don't supply code for generating your own reference mash fingerprints, so use the default reference. In other words, you don't need to include the "-r" flag since it is set to a default path.

constellation_mash/tools/sketching/generate_mash_sketches.sh \
  -s '*.fasta' \
  -p 32 \
  arcticData antarcticData

A file called "sketch.csv.json" will be generated. This is the file to upload to the Genome Constellation browser.
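
Before uploading, it can be worth confirming that the generated file is well-formed JSON. The Python 3 sketch below makes no assumption about the layout of the file, which is not documented here; it only parses it and reports the top-level type. summarize_json is a hypothetical helper name.

```python
import json

def summarize_json(path):
    """Parse a JSON file and describe its top-level value."""
    with open(path) as handle:
        data = json.load(handle)  # raises ValueError if the file is malformed
    if isinstance(data, dict):
        return "object with %d keys" % len(data)
    if isinstance(data, list):
        return "array with %d items" % len(data)
    return type(data).__name__
```

Calling summarize_json("sketch.csv.json") in the directory where the script ran either returns a one-line summary or raises an error that points to a truncated or corrupted file.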


Upload / use your data in Constellation

Run locally in a browser

Your data has now been calculated and installed under www/data as the files:

links_0.json nodes.json

You can simply point your browser to this file path to run Genome Constellation: file:///FULL_PATH/jgi-genomeconstellations/www/index.html
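
To avoid typing the full path by hand, the file URL can be built from the repository location, and the two data files can be checked at the same time. This is a Python 3 sketch under assumptions: both helper names are hypothetical, and repo_dir is wherever you cloned the repository.

```python
from pathlib import Path

def index_url(repo_dir):
    """Return the file:// URL of www/index.html inside the repository."""
    return (Path(repo_dir) / "www" / "index.html").resolve().as_uri()

def missing_data_files(repo_dir, names=("links_0.json", "nodes.json")):
    """Return the expected data files that are absent from www/data."""
    data_dir = Path(repo_dir) / "www" / "data"
    return [name for name in names if not (data_dir / name).is_file()]
```

Printing index_url(".") from the repository root gives a URL that can be pasted straight into the browser's address bar.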

Alternatively, you can start a python web server locally and then upload your data to the browser that is running Genome Constellation.

Start a python server on a local port

1) in your genome-constellation repository, cd into constellation_mash/www (where index.html is located)
2) with python 2 run: python -m SimpleHTTPServer 8989; with python 3 run: python -m http.server 8989
3) open a browser on the same machine where the python server is running and go to "localhost:8989"
4) the browser should now show the reference genomes in Genome Constellation. Wait until all the links have loaded (see the progress bar at the top). You can then load your user data from the "import" tab in the lower right: choose the *.csv.json file you saved.
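
The same server can also be started from a Python 3 script, which is convenient when the www directory is not your current directory. This is a minimal sketch using the standard library (Python 3.7 or later); start_server is a hypothetical helper and port 8989 simply mirrors the steps above.

```python
import functools
import http.server
import threading

def start_server(www_dir, port=8989):
    """Serve www_dir on localhost in a background thread; returns the server."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=www_dir
    )
    server = http.server.ThreadingHTTPServer(("localhost", port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Call server = start_server("constellation_mash/www") from an interactive session, open "localhost:8989" in the browser, and call server.shutdown() when finished. Because the thread is a daemon, a plain script would exit immediately unless it waits, for example on input().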

Contributors