RNA-Seq analysis of transgenic soy
This repository contains data analysis code and data files for a project investigating how transgene expression affects native gene expression in soybean seeds. A paper describing this work titled "A Comparison of Transgenic and Wild Type Soybean Seeds: Analysis of Transcriptome Profiles Using RNA-Seq" is in review.
This project is the Ph.D. research of Kevin Lambirth, a graduate student at UNC Charlotte working with Ken Piller and Ken Bost.
Please note that before these data are published and peer-reviewed, any and all files contained in this repository may change without warning.
If you would like to use any of these data prior to publication, please contact Kevin Lambirth (KCLambirth@uncc.edu) and Ken Piller (firstname.lastname@example.org).
Premise of the study
Soybean (Glycine max) has been bred for thousands of years to produce protein-rich seeds for human consumption. Thanks to this, soybean can serve as a low cost bioreactor for producing valuable recombinant proteins at high levels. However, the effects of expressing non-native protein at high levels on bean physiology are not well understood.
To learn more, we used RNA-Seq to survey gene expression in three soybean lines bearing transgenes expressed exclusively in seeds. These included:
- ST77, expressing human thyroglobulin protein (hTG)
- ST111, expressing human myelin basic protein (hMBP)
- 764, expressing a mutant, nontoxic form of a staphylococcal subunit vaccine protein (mSEB).
The experiment included nine libraries per genotype: one library from each of three beans harvested from each of three soybean plants grown under identical conditions in a temperature-controlled growth room. Beans from non-transgenic plants were included for comparison.
Prior work investigating these lines found:
- ST77 and 764 contained one transgene insertion
- ST111 contained several transgene insertions
- ST77 seeds contained about twice as much transgenic protein as 764 seeds
- In ST77, 1.5% of total soluble protein (TSP) was transgenic protein
- In 764, 0.75% of TSP was transgenic protein.
This study found:
- Western blots confirmed prior results that ST77 and 764 seeds contain relatively high levels of transgenic protein. However, ST111 contained much less transgenic protein, around ten times less than the other two lines.
- Expression of transgene RNA matched protein expression. The ST77 and 764 transgenes were expressed at similarly high levels. Expression of the ST111 transgene was lower.
- However, native soybean gene expression was significantly altered in 764 (mSEB) but not the others. In this line, more than 3,000 genes were up or down regulated.
- GO term enrichment analysis identified upregulation of genes involved in translation and protease inhibition. Genes annotated to the nuclear pore and the nucleus were also differentially expressed, but not all in the same direction.
- Even though expression of ST111 transgene protein and mRNA was much lower than in ST77, there were more differentially expressed genes in ST111 than ST77.
Gene expression analysis suggested that many aspects of protein synthesis and degradation were altered in the 764 line expressing mSEB. This analysis also suggested that cellular structures, mainly the nuclear pore and endomembrane systems, were altered. The large number of gene expression differences observed for the 764/mSEB line indicates that a transgene expressed at high levels in seeds can sometimes trigger major molecular changes.
Powell R, Hudson LC, Lambirth KC, Luth D, Wang K, Bost KL, Piller KJ: Recombinant expression of homodimeric 660 kDa human thyroglobulin in soybean seeds: an alternative source of human thyroglobulin. Plant cell reports 2011, 30(7):1327-1338.
Code used for analysis, processing, and figure generation is grouped into folders with names indicating the general purpose of the analysis. Each folder is designed to be run as a relatively self-contained project in RStudio. As such, each folder contains an ".Rproj" file. To run the code in RStudio, open that file and go from there.
Note that some modules depend on the output of other modules. Also, some modules depend on externally supplied data files, which are version-controlled here but may also be available from external sites.
Readers interested in re-running aspects of the analysis can do so using code and data files stored in this repository. Most of the source code and data files are available here. Larger files, especially sequence data files, are archived elsewhere.
Analysis modules and other directories include:
Reports on the number of sequences obtained per library and the number of reads aligned to the soybean genome assembly. Look here for information about yield. Read alignment summaries came from tophat.
Images and output files from Agilent Bioanalyzer analysis of cDNA libraries prepared for sequencing. Look here if you are interested in investigating how library quality may or may not predict future sequencing yields.
Makes counts, RPM, and RPKM files. Contains code for generating barcharts showing gene expression by sample. Look here for summaries of read count and expression level distributions. This module also contains plots showing transgene expression.
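For readers unfamiliar with these units, the following is a minimal Python sketch of how RPM and RPKM are conventionally derived from raw counts. It is not the code in this module, which is written in R. The gene names, counts, and lengths are hypothetical, and the sketch assumes the library size is the sum of counted reads.

```python
def rpm(count, total_mapped_reads):
    """Reads per million mapped reads."""
    return count * 1e6 / total_mapped_reads


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical library: gene name -> (read count, gene length in bp).
counts = {"geneA": (500, 2000), "geneB": (1500, 1000), "geneC": (0, 500)}

# Assumption: library size is the total number of counted reads.
library_size = sum(count for count, _ in counts.values())

for gene, (count, length) in counts.items():
    print(gene, rpm(count, library_size), rpkm(count, length, library_size))
```

RPM corrects only for sequencing depth, so it suits comparing one gene across libraries. RPKM also divides by gene length, which matters when comparing different genes within a library.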
Scripts and data files for analysis using Cufflinks, Cuffmerge, and Cuffdiff.
Note that this directory only contains some of the code and data files used for cuff* analysis. The cuffdiff analysis was done by Adam Whaley. Unfortunately, the computer he used to run the analysis had a hard drive failure and several products of the analysis are no longer available. We were able to reproduce the alignment, cufflinks, and cuffmerge steps, but not the cuffdiff steps - cuffdiff kept crashing with memory allocation errors. This may have been due to the large number of replicates in our study.
Contact Adam Whaley (email@example.com) if you have questions about cuffdiff.
This directory also contains files created by Ivory Blakley comparing Cuffdiff output to output from other differential expression analysis programs. Results were similar, but not identical.
Identifies differentially expressed genes using tools and libraries from BioConductor, mainly edgeR. Look here for spreadsheets listing results from differential expression testing.
Contains annotations and data downloaded from IGBQuickLoad.org, GeneOntology.org, JGI, and other sites.
Contains reports from FastQC. Look here for summaries of sequencing quality, sequencing depth, and more.
Contains code for generating BED-detail and gene description files from GFF3 and functional annotation files downloaded from JGI.
Look here for simple tab-delimited files mapping gene names onto gene descriptions, BED-detail files suitable for visualization in IGB, information about splice variant and gene size distributions in soybean, and more.
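To illustrate the kind of conversion involved, here is a minimal Python sketch that turns one GFF3 gene line into a simplified BED-detail row. It is not the code in this module. The input line, gene ID, and description are hypothetical, and the row holds only six standard BED fields plus the two detail fields (ID and description), whereas the files in this repository may carry the full set of BED columns.

```python
def parse_attributes(field):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}


def gff3_gene_to_bed_detail(line, descriptions):
    """Return a BED-detail style row for a GFF3 gene line, or None otherwise."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "gene":
        return None
    seqid, _source, _type, start, end, _score, strand, _phase, attrs = cols
    gene_id = parse_attributes(attrs).get("ID", ".")
    # GFF3 is 1-based and inclusive; BED is 0-based and half-open,
    # so only the start coordinate shifts.
    bed_start = int(start) - 1
    description = descriptions.get(gene_id, "NA")
    return [seqid, str(bed_start), end, gene_id, "0", strand, gene_id, description]


# Hypothetical inputs for illustration only.
gff3_line = "Chr01\texample\tgene\t1001\t2500\t.\t+\t.\tID=gene0001;Name=gene0001"
gene_descriptions = {"gene0001": "hypothetical protein kinase"}

bed_row = gff3_gene_to_bed_detail(gff3_line, gene_descriptions)
print("\t".join(bed_row))
```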
Contains code for generating Simple Annotation Format (SAF) files for running the featureCounts software from the subread library.
Look here for an SAF file reporting the location (start and end) of native soybean genes and the three transgenes. Note that to enable featureCounts to report counts for the transgenes, we created an artificial chromosome sequence with all three transgenes. Thus the SAF file contains three lines (the last ones) with the location of the transgenes within this artificial sequence.
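The SAF layout described above can be sketched in Python as follows. This is an illustration, not the code in this module. The SAF column names (GeneID, Chr, Start, End, Strand) follow the featureCounts format, but the chromosome name `transgene_chr`, the native gene ID, and all coordinates are hypothetical placeholders.

```python
import csv
import io

SAF_HEADER = ["GeneID", "Chr", "Start", "End", "Strand"]


def write_saf(rows, handle):
    """Write (gene_id, chrom, start, end, strand) tuples as tab-delimited SAF."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(SAF_HEADER)
    for gene_id, chrom, start, end, strand in rows:
        # SAF coordinates are 1-based and inclusive.
        if start < 1 or end < start:
            raise ValueError(f"bad coordinates for {gene_id}: {start}-{end}")
        if strand not in ("+", "-"):
            raise ValueError(f"bad strand for {gene_id}: {strand}")
        writer.writerow([gene_id, chrom, start, end, strand])


# Hypothetical rows: one native gene, then the three transgenes placed
# on a made-up artificial chromosome, as the last three lines of the file.
rows = [
    ("nativeGene1", "Chr01", 1001, 2500, "+"),
    ("hTG", "transgene_chr", 1, 3000, "+"),
    ("hMBP", "transgene_chr", 3101, 3700, "+"),
    ("mSEB", "transgene_chr", 3801, 4600, "+"),
]

buffer = io.StringIO()
write_saf(rows, buffer)
saf_text = buffer.getvalue()
print(saf_text)
```

For featureCounts to assign reads to the transgene rows, the reads would have to be aligned against a reference that includes the artificial chromosome under the same name used in the `Chr` column.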
Contains code investigating expression levels of transgenes. Also contains code for creating a transgene sequence used to evaluate transgene expression.
Ann started experimenting with the newly published stringtie software from Mihaela Pertea, Steven Salzberg, and colleagues. This was not finished. Our ultimate plan was to run the recommended ballgown pipeline to identify differentially expressed genes and splice variants and then compare the results to other methods. Readers interested in contributing should contact Ann Loraine (firstname.lastname@example.org) or Ivory Blakley (email@example.com).
Project-wide source code, such as color settings for sample types. Also includes scripts used to run alignment software and other tasks.
This directory contains a few scripts written by Ann Loraine for re-running programs from the cufflinks suite. These were not the same scripts used by Adam. His scripts are in CuffAnalysis/src.
Investigates how many genes were found to be differentially expressed in each line, and how those results differ between the methods used to assess differential expression.
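One simple way to make this kind of comparison is to treat each method's list of differentially expressed genes as a set and count the overlaps. The Python sketch below shows the idea. It is not the code in this module, and the method labels and gene IDs are hypothetical.

```python
from itertools import combinations


def compare_methods(de_genes_by_method):
    """Summarize pairwise overlap between sets of differentially expressed genes."""
    results = {}
    for first, second in combinations(sorted(de_genes_by_method), 2):
        set_a = de_genes_by_method[first]
        set_b = de_genes_by_method[second]
        shared = set_a & set_b
        union = set_a | set_b
        results[(first, second)] = {
            "shared": len(shared),
            "only_first": len(set_a - set_b),
            "only_second": len(set_b - set_a),
            # Jaccard index; defined here as 0.0 when both sets are empty.
            "jaccard": len(shared) / len(union) if union else 0.0,
        }
    return results


# Hypothetical gene lists for one transgenic line.
de_genes = {
    "edgeR": {"g1", "g2", "g3", "g4"},
    "cuffdiff": {"g2", "g3", "g5"},
}

summary = compare_methods(de_genes)
for pair, stats in summary.items():
    print(pair, stats)
```

Method names are sorted before pairing, so the key order is stable across runs regardless of how the input dictionary was built.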
- Ann Loraine firstname.lastname@example.org
- Kevin Lambirth KCLambirth@uncc.edu
- Ivory Blakley email@example.com
- Adam Whaley firstname.lastname@example.org
Copyright (c) University of North Carolina at Charlotte
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Also, see: http://opensource.org/licenses/MIT