bioBakery: Internal documentation

This page describes the process we follow to maintain bioBakery. The instructions are intended for the bioBakery team; bioBakery users do not need to perform these steps. Users are nonetheless welcome to read this internal documentation to learn how we add new tools and release new images.



1. Add a new tool

Each bioBakery tool has its own Homebrew formula. There is also a bioBakery tool suite formula that will install all of the bioBakery tool formulas.

To add a new tool, follow these steps:

  1. Write a Homebrew formula for your tool
    1. The Homebrew formula is a Ruby script named newtool.rb (if your tool is named "newtool").
    2. Refer to halla.rb as an example formula for a Python package.
    3. Refer to maaslin.rb as an example formula for an R package with a command line interface.
  2. Test your formula
    1. Run $ brew install newtool.rb to install your tool with your new formula.
    2. Run your tool to make sure it is operating as expected.
    3. Verify all dependencies of your tool are included in your formula.
  3. Add your formula to the bioBakery Homebrew repository
    1. Push your formula to the repository
      1. $ git clone https://github.com/biobakery/homebrew-biobakery.git
      2. $ cd homebrew-biobakery
      3. Copy newtool.rb into the repository folder.
      4. $ git add newtool.rb
      5. $ git commit -m 'add formula for newtool'
      6. $ git push
    2. Update the readme in the repository to include your tool.
  4. Add your tool to the tool suite
    1. If your tool is named newtool, you would add the following line to the biobakery_tool_suite formula.
      1. depends_on "biobakery/biobakery/newtool" => :recommended
    2. Test that the updated tool suite installs as expected after your change.
      1. $ brew install biobakery_tool_suite.rb
    3. Push your changes to the repository.
  5. Add a demo for the tool to biobakery demos.
    1. It is important to add a demo for your new tool, because these demos are used to test that the tool is installed and runs without errors on the bioBakery box and Google Cloud images.
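
As a quick check for step 4, the following is a minimal sketch of a helper that confirms a tool is listed in the tool suite formula. The function name suite_has_tool is made up for this example, and it assumes the suite formula is a local file in which each tool appears on its own depends_on line, as shown above.

```shell
#!/usr/bin/env bash
# Sketch only: suite_has_tool is a hypothetical helper, not part of bioBakery.
# Assumption: each tool appears in the suite formula on a line of the form
#   depends_on "biobakery/biobakery/<tool>" => :recommended

# suite_has_tool FORMULA_FILE TOOL_NAME
# Returns 0 if the formula has a depends_on line for the tool, non-zero otherwise.
suite_has_tool() {
  local formula_file="$1" tool="$2"
  grep -Eq "^[[:space:]]*depends_on[[:space:]]+\"biobakery/biobakery/${tool}\"" "$formula_file"
}
```

For example, `suite_has_tool biobakery_tool_suite.rb newtool && echo "newtool is in the suite"` could be run before pushing the change. Because the match is anchored at the start of the line, a commented-out depends_on line is not counted.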

2. Add a new public Atlas Vagrant box

The bioBakery start up scripts download and install the bioBakery box hosted by Atlas.

To build and upload a new version of the box, use the instructions that follow. If you need to update the packages included in the box, first update the box provisioning scripts. See the section on provisioning for more information.

  1. Build a new GUI bioBakery box
    1. When the box is built, the latest version of the bioBakery tool suite is installed, which includes the latest versions of all of the bioBakery tools.
  2. Package the box
    1. Find the box name with VirtualBox
      1. $ vboxmanage list vms
    2. Package the box with Vagrant
      1. $ vagrant package --base BOXNAME --output biobakery-gui.box --vagrantfile Vagrantfile.package
      2. In the above command, replace BOXNAME with the name of the box from the prior step.
  3. Test that the packaged box works as expected
    1. Add the box to your box list
      1. $ vagrant box add biobakery-gui.box --name biobakery-gui-test
    2. Start a new box from the packaged box
      1. $ vagrant init biobakery-gui-test
      2. $ vagrant up --provider=virtualbox
  4. Add the new bioBakery box version to the bioBakery Atlas box set
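
The packaging and test steps above can be sketched as one function with a dry-run switch, so the commands can be reviewed before anything is built. The names run, DRY_RUN, package_and_test_box, and the scratch folder box-test are assumptions made for this example. The scratch folder is used because `vagrant init` will not overwrite an existing Vagrantfile.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the box-test folder are hypothetical.

# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# package_and_test_box BOXNAME
# BOXNAME is the VM name reported by `vboxmanage list vms`.
package_and_test_box() {
  local boxname="$1"
  run vagrant package --base "$boxname" --output biobakery-gui.box \
      --vagrantfile Vagrantfile.package || return 1
  run vagrant box add biobakery-gui.box --name biobakery-gui-test || return 1
  # Start the test box from a scratch folder, in a subshell so the
  # caller's working directory is left unchanged.
  run mkdir -p box-test || return 1
  (
    run cd box-test || exit 1
    run vagrant init biobakery-gui-test || exit 1
    run vagrant up --provider=virtualbox
  )
}
```

Running `DRY_RUN=1; package_and_test_box BOXNAME` prints the six commands without executing them; unset DRY_RUN to run them for real.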

2.2. Provisioning scripts

Vagrant builds the bioBakery box by executing a series of Linux commands within a base Ubuntu box. See the Vagrantfile for the URL of the current base Ubuntu box. These commands are contained in "provisioning scripts," which are Bash scripts. The provisioning scripts are called from the Vagrantfile associated with each bioBakery box. All boxes use two provisioning scripts:

  • provision-biobakery-core.sh [common to all images]
  • provision-biobakery-(gui|nogui).sh [box-specific]

The first script, provision-biobakery-core.sh, handles all configuration options common across the bioBakery boxes. Specifically, this includes (1) installation and removal of packages from the base Ubuntu box and, more importantly, (2) installation of the bioBakery tool suite with the Homebrew formula. A single Homebrew formula installs the full bioBakery tool suite.

The second script is specific to the type of box you are trying to build. For example, the version of bioBakery with a graphical user interface (GUI) is additionally configured by calling provision-biobakery-gui.sh. These second scripts install additional packages, configure the graphical environment (for the GUI version), set aliases in the .bashrc file, and so forth. Notably, any "cleanup" steps common to all box builds must be present at the end of these box-specific provisioning scripts (e.g. purging the apt-get cache).
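
To illustrate the kind of step a box-specific script performs, here is a minimal sketch of adding an alias to a .bashrc file so that re-running the provisioning does not duplicate the line. This is not taken from the actual provisioning scripts; the function name add_alias_once and the example alias are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: add_alias_once is a hypothetical helper, not code from
# provision-biobakery-gui.sh or provision-biobakery-nogui.sh.

# add_alias_once BASHRC NAME COMMAND
# Appends "alias NAME='COMMAND'" to BASHRC unless that exact line is present.
add_alias_once() {
  local bashrc="$1" name="$2" cmd="$3"
  local line="alias ${name}='${cmd}'"
  # -F: fixed string, -x: whole-line match, -q: quiet.
  # A missing file makes grep fail, so the append below creates it.
  grep -Fqx "$line" "$bashrc" 2>/dev/null || printf '%s\n' "$line" >> "$bashrc"
}
```

A script might call `add_alias_once "$HOME/.bashrc" ll "ls -l"` and then finish with its cleanup steps, for example `sudo apt-get clean` to clear the package cache.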


3. Add a new public Google Cloud image

The public Google Cloud image is hosted in the hutlab biobakery bucket. Follow these instructions to add a new public image to the bucket.

  1. Build a new bioBakery Google Cloud instance
    1. SKIP the step that installs tools with licenses.
  2. Stop the instance
  3. Go to Compute Engine -> Snapshot -> Create snapshot and create a snapshot of the stopped instance
  4. Delete the original instance
  5. Go to Compute Engine -> Disks -> Create disk
    1. Create a disk from the snapshot
      1. Name the disk: disk-biobakery-image
    2. Create a temp disk
      1. Name the disk: disk-temp
      2. This disk is 50% larger than the snapshot disk.
      3. This disk is blank.
  6. Go to Compute Engine -> VM instances -> Create new instance
    1. Create a new instance with the following:
      1. Ubuntu 16.04 (10 GB disk, 1 core, 3.75 GB RAM)
      2. Identity and API Access -> Add Storage read / write
  7. AFTER the instance has been created add the disks by editing the instance properties
    1. Add the snapshot and temp disk
    2. As of June 2016, there is an issue with the boot ordering of Google Cloud instances. Adding the additional disks when the instance is created will cause errors in the remaining steps when trying to export the image. Only add the additional disks after the instance has been created.
  8. SSH to the instance to run the script to package and export the image
    1. Clone the bioBakery repository
      1. $ sudo apt-get install mercurial
      2. $ hg clone https://bitbucket.org/biobakery/biobakery
    2. Run the script to package and export the bioBakery image
      1. $ cd biobakery/google_cloud
      2. $ bash -x package_biobakery.sh $VERSION (replace $VERSION with the version number, e.g. 1.1)
      3. This script will take some time to run. It will shred files (following AWS security best practices), build the image, and then export it to the bioBakery bucket.
  9. Delete the instance
  10. Go to Storage -> Browser -> biobakery_bucket and click on the link to make the new image public
  11. Follow the basic user instructions to run bioBakery in Google Cloud to create a new instance.
  12. Test the bioBakery install.
  13. Delete the test instance.
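
Step 5 asks for a temp disk 50% larger than the snapshot disk. A small sketch of that arithmetic follows; the function name temp_disk_size_gb is made up for this example, and rounding up to a whole GB is an assumption, since the text does not say how to handle fractional sizes.

```shell
#!/usr/bin/env bash
# Sketch only: temp_disk_size_gb is a hypothetical helper.

# temp_disk_size_gb SNAPSHOT_GB
# Prints 1.5 times the snapshot size, rounded up to a whole number of GB.
temp_disk_size_gb() {
  local snapshot_gb="$1"
  # Integer arithmetic: (n * 3 + 1) / 2 is the ceiling of n * 1.5.
  echo $(( (snapshot_gb * 3 + 1) / 2 ))
}
```

For a 100 GB snapshot disk this prints 150, and for a 25 GB disk it prints 38. If the disk is created from the command line rather than the console, the result could be passed to something like `gcloud compute disks create disk-temp --size="$(temp_disk_size_gb 100)GB" --zone=ZONE`, where ZONE is a placeholder for the zone of the instance.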

4. Add a new demo

Follow these instructions to add a new demo to biobakery demos. You will only need to add input files, output files, and a bash script to add a new demo. You will not need to edit the biobakery demos software. The software will discover any new demos that are added to its sub-folders and make them available as new tool options.

  1. Make new data folders for your tool (replace NEWTOOL with tool name)
    • $ mkdir -p biobakery_demos/data/NEWTOOL/input
    • $ mkdir -p biobakery_demos/data/NEWTOOL/output
  2. Add the input files for the demo to the folder biobakery_demos/data/NEWTOOL/input
  3. Add the output files from running the demo to the folder biobakery_demos/data/NEWTOOL/output
  4. Create a bash script with demo commands for your tool (see biobakery_demos/demos/kneaddata.bash as an example)
    • This bash script should be added to the folder biobakery_demos/demos/
    • This bash script should be named NEWTOOL.bash (replace NEWTOOL with tool name)
    • Note that in the bash script, $INPUT_FOLDER and $OUTPUT_FOLDER will be replaced with the full paths to these folders.
  5. Reinstall biobakery_demos (this will add the new files to the install folder)
    • $ python setup.py install
  6. Test running your new demo (replace NEWTOOL with tool name)
    • $ biobakery_demos --tool NEWTOOL --mode test
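
Steps 1 and 4 above can be sketched as a small scaffolding helper that creates the data folders and an empty demo script. The function name scaffold_demo and the contents of the stub script are assumptions; the real demo commands should follow the kneaddata.bash example.

```shell
#!/usr/bin/env bash
# Sketch only: scaffold_demo is a hypothetical helper, not part of biobakery_demos.

# scaffold_demo DEMOS_ROOT TOOL
# Creates DEMOS_ROOT/data/TOOL/{input,output} and a stub DEMOS_ROOT/demos/TOOL.bash.
scaffold_demo() {
  local root="$1" tool="$2"
  mkdir -p "$root/data/$tool/input" "$root/data/$tool/output" "$root/demos" || return 1
  local script="$root/demos/$tool.bash"
  # Never overwrite a demo script that already exists.
  if [ ! -e "$script" ]; then
    printf '%s\n' \
      "# Demo commands for $tool go here." \
      '# $INPUT_FOLDER and $OUTPUT_FOLDER are replaced with the full folder paths.' \
      > "$script"
  fi
}
```

For example, `scaffold_demo biobakery_demos NEWTOOL` run from the folder that contains biobakery_demos creates the layout described above. The input files, output files, and demo commands still need to be added by hand before reinstalling and testing.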

5. Set up for a workshop

There are three methods that students can use to connect to bioBakery Google Cloud instances for a workshop (web browser, VNC viewer, and SSH). The following instructions describe how to set up a workshop that allows all three methods.

  1. Log in to the Google Cloud console using the hutlab.public account: http://console.cloud.google.com
    1. If you are not prompted to log in, log out of your personal Google account and try the link again. Alternatively, you can log in through an incognito window if you would like to stay logged in to your personal Google Cloud account in other windows.
  2. Go to Compute Engine -> Instance templates to create a new template that will capture the settings for all of the instances.
    1. In general, instances should have 1 core, 6.5 GB of memory, and an image that includes a desktop with vncserver installed and set to run on startup. Instances should also have the latest bioBakery tool suite installed. The bioBakery public image can be used for the instances after starting VNC to set the password and configuring VNC to run on startup. Small modifications can be made to the image for the workshop if needed, such as installing additional tools or changing the desktop configuration (e.g. adding menu items or increasing the font size).
  3. Go to Compute Engine -> Instance groups to create one or more groups of instances for the workshop. Make sure each group name contains "biobakery" so the automated configuration script will pick up these instances.
    1. If a large number of instances are required, you will need to create several groups of instances, each in a different region, so that you do not exceed the quotas. For information on quota limits and current usage, go to Compute Engine -> Quotas.
    2. Once this step is complete, students can access the instances through SSH.
  4. Go to Compute Engine -> VM instances and start the guacamole server (named instance-guacamole-server) and also start the reverse proxy server. The proxy instance has a static external IP so it will always be at the IP of the huttenhower guacamole redirect.
    1. NOTE: An automated script will run on the guacamole server to find the group of student instances and update its configuration to include their IPs. This script will run about four minutes after the machine is started. The file $HOME/update_config/config.log will contain the log of the runs for this script. The file $HOME/update_config/mysql_commands_run.txt will contain the list of the configuration commands run plus the mapping of the Google Cloud instance name to the guacamole instance name.
    2. NOTE: Depending on the number of student instances, you might want to increase the machine size for the guacamole server. In a class of 71 instances, an 8 core, 30 GB machine was used. This worked well and appeared to be more than enough computing resources. A machine of 4 cores and 15 GB could possibly be used for a workshop of 71 instances.
  5. Log in to the hutlab guacamole server as student: http://huttenhower.sph.harvard.edu/guacamole . Check that all of the expected connections are up and running. If there are any issues, refer to the log on the guacamole server and, if needed, rerun the configuration script at $HOME/update_config/run_update.bash.
  6. (Optional) Depending on the number of student connections, you might need to add or remove access to the new/existing connections for the student account. If so, log out as admin and log in as student to make sure all of the instances are visible to this account.
  7. (Optional) To allow students to connect through VNC, add a tag vnc-access to each of the instances by editing the instance properties. Once the changes to the properties have been saved, VNC access is available. It is not recommended to set this option by default in the group template if VNC access is not being used, because it allows access to the instances through their external IPs.
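
When a workshop needs more instances than one region's quota allows, the split in step 3 can be planned with a short sketch like the one below. The function name plan_groups, the group name prefix biobakery-workshop, and the per-group limit are assumptions for this example. The prefix only needs to contain "biobakery" so the configuration script picks up the instances.

```shell
#!/usr/bin/env bash
# Sketch only: plan_groups and the biobakery-workshop prefix are hypothetical.

# plan_groups TOTAL_INSTANCES MAX_PER_GROUP
# Prints one line per instance group: "<group name> <instance count>".
# MAX_PER_GROUP must be greater than zero.
plan_groups() {
  local total="$1" max="$2"
  # Ceiling division: number of groups needed to hold all instances.
  local groups=$(( (total + max - 1) / max ))
  local i=1 remaining="$total" size
  while [ "$i" -le "$groups" ]; do
    if [ "$remaining" -gt "$max" ]; then size="$max"; else size="$remaining"; fi
    echo "biobakery-workshop-$i $size"
    remaining=$(( remaining - size ))
    i=$(( i + 1 ))
  done
}
```

For example, `plan_groups 71 24` prints three groups of 24, 24, and 23 instances. The limit of 24 is illustrative only; check Compute Engine -> Quotas for the actual limits and current usage.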

NOTE: It is very important to delete the student instances when the workshop is complete. When the workshop is complete, go to Compute Engine -> Instance groups to delete all of the student instances. Then go to Compute Engine -> VM instances and stop the guacamole server.
