Scripts for Estimating Mutant Fitness By Sequencing Randomly Barcoded Transposons (RB-TnSeq) This code repository (bitbucket.org/berkeleylab/feba/) includes scripts for estimating mutant fitness by sequencing randomly barcoded transposons (RB-TnSeq). For an overview of the technology see Figure 1 of Deutschbauer et al., doi: 10.1128/mBio.00306-15 http://mbio.asm.org/content/6/3/e00306-15.full The stages in creating fitness estimates are: Processing TnSeq data with MapTnSeq.pl and DesignRandomPool.pl Processing BarSeq data with MultiCodes.pl, combineBarSeq.pl, and BarSeqR.pl or BarSeqtest.pl. (Unless otherwise specified, all executables are from the bin/ subdirectory.) For web server code to browse the results, see the cgi/ subdirectory, especially cgi/README and cgi/SETUP. This is research software that has undergone limited testing. Be careful. Morgan Price, Lawrence Berkeley Lab, October 2014 DETAILS ON USING THE SCRIPTS MapTnSeq.pl identifies the random barcode, the junction between the transposon and the genome, and maps the remainder of the read to the genome. The result is a list of mapped reads with their barcodes and insertion locations. DesignRandomPool.pl uses the output of MapTnSeq.pl to identify barcodes that consistently map to a unique location in the genome. These are the useful barcodes. It also reports various metrics about the pool of mutants. (This step is done by invoking PoolStats.R.) Ideally, a mutant library has even insertions across the genome; has insertions in most of the protein-coding genes (except the essential ones); has a similar number of reads for insertions in most genes (i.e., no crazy positive selection for loss of a few genes); has insertions on both strands of genes (insertions on only one strand may indicate that the resistance marker's promoter is too weak); has tens of thousands or hundreds of thousands of useful barcodes; and the useful barcode s account for most of the reads. MultiCodes.pl identifies the barcode in each read and makes a table of how often it saw each barcode. It can also demultiplex reads if using the older primers with inline demultiplexing and no separate index read. combineBarSeq.pl merges the table of counts from MultiCodes.pl with the pool definition from DesignRandomPool.pl to make a table of how often each strain was seen. BarSeqR.pl combines multiple lanes of barseq output with the genes table to make a single large table, and then uses the R code in FEBA.R to estimate the fitness of each gene in each experiment. It produces a mini web site with data tables and quality assessment plots. It requires metadata about the experiments (usually FEBA_BarSeq.tsv) and information about the GC content of each gene (genes.GC -- this can be produced from a normal genes table with RegionGC.pl). In FEBA_BarSeq.tsv, the Date_pool_expt_started field specifies what date the experiment was started on, and experiment(s) with Description=Time0 are the control sample(s) for that set of experiments. If you need to set up your controls in a different way, use a different Date_pool_expt_started (any character string is allowed) or use BarSeqTest.pl. SetName and Index indicate which reads correspond to that sample. If you want to use the cgi scripts to view the results, then the metadata table must include all of these fields: SetName Date_pool_expt_started Person "Mutant Library" Description Index Media "Growth Method" Group Condition_1 Units_1 Concentration_1 Condition_2 Units_2 Concentration_2. The results of BarSeqR.pl may change as data is added, as this alters which strains are considered abundant enough to include in the analysis. You can get around this by using the SaveStrainUsage.pl script after you process your data. This will record which strains were used for an analysis, and if these files are saved to the organism directory, then BarSeqR.pl will use this information. (You can turn this behavior of BarSeqR.pl off by setting the FEBA_NO_STRAIN_USAGE environment variable.) BarSeqTest.pl is a wrapper to analyze barseq reads and run BarSeqR.pl. It is intended for small test runs and does not require a metadata table. There are also helper scripts RunTnSeqLocal.pl and RunBarSeqLocal.pl for analyzing the reads in parallel. Both of these scripts use submitter.pl to issue jobs in parallel, and its behavior can be modified by setting environment variables (see top of submitter.pl). RunTnSeqLocal.pl processes a TnSeq run in parallel by running MapTnSeq.pl on each part separately and then running DesignRandomPool.pl. This script assumes that there is a g/nickname directory for your organism that includes the genome sequence (genome.fna) and a tab-delimited table of genes (genes.tab). RunBarSeqLocal.pl processes a large BarSeq run in parallel by running MultiCodes.pl on each piece separately and then running combineBarSeq.pl to combine the results. SETTING UP A GENOME The files that these scripts depend on (genome.fna, genes.tab, genes.GC, aaseq) can be set up from a genbank file or a JGI assembly with a gff file by the script SetupOrg.pl. Unlike most of the other code, Setuporg.pl depends on BioPerl; also to handle genbank files you will need to place the genbank2gff.pl script from https://github.com/ihh/gfftools/ in the bin/ subdirectory. OTHER DEPENDENCIES The read mapping (MapTnSeq.pl) depends on UCSC's blat, which is freely available for non-commercial use at http://hgdownload.soe.ucsc.edu/downloads.html#source_downloads It should be straigthforward to modify MapTnSeq.pl to use bowtie 2 or megablast or ublast instead, but some minor changes to the parsing code would be required. EXAMPLES You can see examples of using this code and of the resulting mini web sites at http://genomics.lbl.gov/supplemental/rbarseq/ CONTROLLING HOW MANY CPUS ARE USED You can use the MC_CORES environment variable to control how many CPUs the scripts try to use. (If MC_CORES is not set, then for the R code, the default is 2 threads; for RunTnSeqLocal.pl or RunBarSeqLocal.pl, the default is based on the number of CPUs according to /proc/cpuinfo: see submitter.pl.) RUNNING ON MacOS I have only tested the scripts on Linux, but I have heard that they will work if you make these two changes: 1. Various scripts depend on /usr/bin/Rscript. If it does not exist, then you need to set up the symbolic link or modify the scripts. (BarSeqR.pl or BarSeqTest.pl invoke an R script bin/RunFEBA.R which invokes lib/FEBA.R. DesignRandomPool.pl invokes an R script lib/PoolStats.R. bin/db_setup.pl invokes lib/TopCofit.R) 2. Set MC_CORES to the number of CPUs you want the scripts to use. (By default, RunTnSeqLocal.pl or RunBarSeqLocal.pl run jobs in parallel using feba/bin/submitter.pl, which uses /proc/cpuinfo to estimate how many CPUs to use, but /proc/cpuinfo does not exist on MacOS.) RUNNING ON WINDOWS I have not tested the scripts on Windows, but I have heard that they will work. If you use BarSeqR.pl or BarSeqTest.pl to compute fitness values, then you will need to install cygwin and you will need to make sure that the cygwin directory is at the beginning of the PATH (ahead of the windows directory). This is so that the Unix-like version of find will be run. For example, a command like this find -H fastq_directory -name '*.fastq.gz' will work if cygwin and PATH are set up but will not work on standard windows. --- LEGALESE --- Copyright (C) 2014 The Regents of the University of California All rights reserved. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. or visit http://www.gnu.org/copyleft/gpl.html Disclaimer NEITHER THE UNITED STATES NOR THE UNITED STATES DEPARTMENT OF ENERGY, NOR ANY OF THEIR EMPLOYEES, MAKES ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LEGAL LIABILITY OR RESPONSIBILITY FOR THE ACCURACY, COMPLETENESS, OR USEFULNESS OF ANY INFORMATION, APPARATUS, PRODUCT, OR PROCESS DISCLOSED, OR REPRESENTS THAT ITS USE WOULD NOT INFRINGE PRIVATELY OWNED RIGHTS.