
The Making of menpiGT_2015

This set of files includes all of the scripts I used to generate the model menpiGT_2015. It does not include the third-party code or databases that were used, such as BLAST, TargetP, MetaCyc, UniProt, and Gurobi. All of the software and databases used are either open-source or free for academic use. Most of the intermediate files are included in the repository, so you can jump in at any step, even if you haven't done the previous steps.

Prerequisite software:

Python 2.7

YASMEnv Please use version 0.2 for maximum compatibility with these scripts.

Pandas

Gurobi

TargetP

Fasta2 align0

Local BLAST

Cytoscape Not essential, but very helpful for visualizing Flux Balance Analysis solutions

Prerequisite databases and models:

AraCyc You must make an account and get the flatfiles.

MetaCyc You must make an account and get the flatfiles.

Arabidopsis Core Model You don't need to download this; it's included in this repository.

PRIAM

TODO: Ensure that this list is complete

I hope these scripts are useful to other people making stoichiometric models. Feel free to contact me (sean.r.johnson at gmail) if you need any help or advice with making models, or using these scripts.

Start by downloading this repository, and open a console in the root directory of the repository (the same directory as this README.md file).

1. Prepare the Arabidopsis Core model, and find homologs of its genes among the peppermint genes.

The end result of this step should be three files: arnold_prot_to_rxn.tsv (which associates TAIR accessions, EC numbers, and reaction names), peppermint_peptides_to_arnold.outfmt6.aligned (which gives the global identity between each peppermint protein and its top BLASTp hit among the TAIR accessions), and Arnold_metacyc_no_biomass.tsv (which is simply the model with the biomass reactions removed, because those are not relevant to peppermint).

cd Arabidopsis_Core

Convert the original model to a model with MetaCyc-compatible names.

python metabolites_to_metacyc.py

Get a list of the TAIR accessions from the gene-reaction associations in the model.

python get_Arnold_genes.py > geneslist.txt

Open a browser, go to TAIR bulk download (https://www.arabidopsis.org/tools/bulk/sequences/), and get the "AGI protein sequences" for each locus. Save the sequences as a fasta file called TAIR_peptides.fasta.

Create a BLAST database from the TAIR_peptides.fasta sequences, and BLASTp the extracted peppermint peptide sequences (peppermint_transcripts/peppermint_contigs.fasta.transdecoder.pep) against it. Use default parameters (BLOSUM62, evalue 10, gap-open 11, gap-extend 1) and keep only the top hit. The output should be in output format 6 and saved to a file called peppermint_peptides_to_arnold.outfmt6. (Commands not shown.)
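BLAST can report several hits per query; if you did not restrict the output at search time (for example with blastp's -max_target_seqs 1), the top hit can be kept afterwards. A minimal sketch, not part of the repository — it relies on BLAST listing the hits for each query in order of decreasing significance, so the first line per query is the best hit:

```python
def top_hits(outfmt6_lines):
    """Keep only the first (best) hit per query from BLAST outfmt 6 lines.
    Column 1 of outfmt 6 is the query sequence ID."""
    seen = set()
    kept = []
    for line in outfmt6_lines:
        query = line.split("\t")[0]
        if query not in seen:
            seen.add(query)
            kept.append(line)
    return kept
```

The same filter applies unchanged to the AraCyc and MetaCyc BLAST runs in steps 2 and 3.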

Find the global alignment between each peppermint peptide and its top BLAST hit, using align0 from Fasta2 (http://faculty.virginia.edu/wrpearson/fasta/fasta2/). As input to align0, you first need to split the multiple-sequence fasta files into files containing one sequence each. For this, use split_fasta.py.

mkdir arnold_proteins
python split_fasta.py -i TAIR_peptides.fasta --output_dir arnold_proteins
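split_fasta.py is included in the repository; purely as an illustration, the core of such a splitter might look like the sketch below (the real script's behavior may differ, for example in how filenames are derived from headers):

```python
import os

def split_fasta(in_path, out_dir):
    """Write each sequence of a multi-sequence FASTA file to its own
    file in out_dir, named after the first word of the header line."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    out = None
    with open(in_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                if out:
                    out.close()
                name = line[1:].split()[0]  # header word -> filename
                out = open(os.path.join(out_dir, name + ".fasta"), "w")
            if out:
                out.write(line)
    if out:
        out.close()
```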

Do the same for the peppermint peptides if it isn't already done. (Note: this step will only work on Linux, because Windows doesn't allow pipe characters (|) in filenames. A workaround would be to redo the analysis with the pipe characters removed from the sequence names.)

mkdir ../peppermint_transcripts/individual_peptides
python split_fasta.py -i ../peppermint_transcripts/peppermint_contigs.fasta.transdecoder.pep --output_dir ../peppermint_transcripts/individual_peptides
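As a sketch of that Windows workaround (a hypothetical helper, not part of the repository), the pipe characters could be stripped from the header lines before splitting:

```python
def strip_pipes(fasta_lines):
    """Remove '|' characters from FASTA header lines so sequence names
    are safe to use as Windows filenames; sequence lines are untouched."""
    return [line.replace("|", "") if line.startswith(">") else line
            for line in fasta_lines]
```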

Calculate sequence identity and save the results to a file called peppermint_peptides_to_arnold.outfmt6.aligned. First check that the settings at the top of mint_pep_to_TAIR_identity.py are correct, then run the script.

python mint_pep_to_TAIR_identity.py

Generate the arnold_prot_to_rxn.tsv file.

python collect_protein_reaction_associations.py

Generate the Arnold_metacyc_no_biomass.tsv file by opening Arnold_metacyc.tsv, deleting the lines for the reactions Bio_AA, Bio_CLim, Bio_NLim, and Bio_opt, and saving the result under the new filename.
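Instead of editing the file by hand, the deletion can be scripted. This sketch assumes the reaction name is the first tab-separated field of each line, which may not match the actual YASMEnv .tsv layout:

```python
BIOMASS_REACTIONS = {"Bio_AA", "Bio_CLim", "Bio_NLim", "Bio_opt"}

def drop_biomass(model_lines):
    """Drop model lines whose reaction name (assumed to be the first
    tab-separated field) is one of the biomass reactions."""
    return [line for line in model_lines
            if line.split("\t")[0] not in BIOMASS_REACTIONS]
```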

2. Prepare the AraCyc model, and find homologs of its genes among the peppermint genes.

The end result of this step should be three files: aracyc_prot_to_rxn.tsv, peppermint_peptides_to_aracyc.outfmt6.aligned, and aracyc_suba.tsv. You need to download the AraCyc .dat files from SRI BioCyc.

cd ../AraCyc

Instantiate the model. This will generate two YASMEnv model files, instantiated_good.tsv and instantiated_bad.tsv. instantiated_good.tsv contains reactions that we can be highly confident are balanced. Some reactions in instantiated_bad.tsv may also be balanced, but YASMEnv cannot verify them in an automated way.

python instantiate_AraCyc_13.py

Get the TAIR accessions from the model files and save them to a file called arabidopsis_genes.txt.

python get_tair.py > arabidopsis_genes.txt

Get protein localization from SUBA: open a browser, go to SUBA, submit the entire gene list, and download the data as a table (the SUBA website is not designed for bulk download, so you will probably have to submit genes in groups of 300). Save the table as aracyc_suba.tsv.
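A small helper (not part of the repository) can split the gene list into batches of 300 for pasting into the web form:

```python
def chunks(items, size=300):
    """Yield successive batches of at most `size` items, e.g. for
    submitting gene accessions to SUBA 300 at a time."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```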

Open a browser, go to TAIR bulk download (https://www.arabidopsis.org/tools/bulk/sequences/), and get the "AGI protein sequences" for each locus. Save the sequences as a fasta file called aracyc_tair_peptides.fasta.

Generate the aracyc_prot_to_rxn.tsv file.

python collect_protein_reaction_associations.py

Create a BLAST database from the aracyc_tair_peptides.fasta sequences, and BLASTp the extracted peppermint peptide sequences (peppermint_transcripts/peppermint_contigs.fasta.transdecoder.pep) against it. Use default parameters (BLOSUM62, evalue 10, gap-open 11, gap-extend 1) and keep only the top hit. The output should be in output format 6 and saved to a file called peppermint_peptides_to_aracyc.outfmt6. (Commands not shown.)

Find the global alignment between each peppermint peptide and its top BLAST hit, using align0 from Fasta2 (http://faculty.virginia.edu/wrpearson/fasta/fasta2/). As input to align0, you first need to split the multiple-sequence fasta files into files containing one sequence each. For this, use split_fasta.py.

mkdir aracyc_proteins
python split_fasta.py -i aracyc_tair_peptides.fasta --output_dir aracyc_proteins

Do the same for the peppermint peptides if it isn't already done. (Note: this step will only work on Linux, because Windows doesn't allow pipe characters (|) in filenames. A workaround would be to redo the analysis with the pipe characters removed from the sequence names.)

mkdir ../peppermint_transcripts/individual_peptides
python split_fasta.py -i ../peppermint_transcripts/peppermint_contigs.fasta.transdecoder.pep --output_dir ../peppermint_transcripts/individual_peptides

Calculate sequence identity and save the results to a file called peppermint_peptides_to_aracyc.outfmt6.aligned. First check that the settings at the top of mint_pep_to_TAIR_identity.py are correct, then run the script.

python mint_pep_to_TAIR_identity.py

3. Prepare the MetaCyc model, and find homologs of its genes among the peppermint genes.

The end result of this step should be four files: metacyc_prot_to_rxn.tsv, peppermint_peptides_to_metacyc.outfmt6.aligned, uniprot_subcellular_localization_table.tab, and metacyc_targetp.tsv.

cd ../MetaCyc

Instantiate the model. This will generate two YASMEnv model files, instantiated_good.tsv and instantiated_bad.tsv. instantiated_good.tsv contains reactions that we can be highly confident are balanced. Some reactions in instantiated_bad.tsv may also be balanced, but YASMEnv cannot verify them in an automated way.

python instantiate.py

Get the UniProt accessions from the model files and save them to a file called uniprot_accessions.txt.

python get_uniprot.py > uniprot_accessions.txt

Go to the UniProt bulk download page, paste in the list of accessions, and run the search. Use the default columns: Query, Entry, Entry name, Status, Protein names, Gene names, Organism, Length. Download the table as a tab-separated file called uniprot_table.tab.

Based on this file, you will be able to replace deprecated UniProt IDs in the MetaCyc file with the preferred accessions. This will produce two new files: instantiated_good_updated_uniprot.tsv and instantiated_bad_updated_uniprot.tsv.

python consolidate_genes.py
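consolidate_genes.py performs the replacement; purely as an illustration of the idea, a mapping from queried to current accessions can be read from the first two columns of uniprot_table.tab (column positions assumed from the default column list above):

```python
def accession_map(table_lines):
    """Build {queried accession: current primary accession} from a UniProt
    tab download whose first two tab-separated columns are Query and Entry
    (adjust the indices if your table uses different columns)."""
    mapping = {}
    for line in table_lines[1:]:          # skip the header row
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0] and fields[1]:
            mapping[fields[0]] = fields[1]
    return mapping
```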

Back up the original MetaCyc instantiated_bad.tsv and instantiated_good.tsv files if you want, and then rename the new ones (*_updated_uniprot.tsv) to replace the old ones. Get a list of the new UniProt IDs by rerunning the get_uniprot script.

python get_uniprot.py > uniprot_accessions.txt

Use the bulk download function to get a table that includes subcellular localization. Use the options in the UniProt web interface to generate a table with the following columns: Entry, Organism ID, Subcellular location [CC]. Download the table as a tab-separated file and save it as uniprot_subcellular_localization_table.tab.

Generate the table of protein-reaction associations.

python collect_protein_reaction_associations.py

Use the UniProt bulk download page to get a fasta file of all of the accessions in the new list. Save it as uniprot_download.fasta.

Create a BLAST database from the uniprot_download.fasta sequences, and BLASTp the extracted peppermint peptide sequences (peppermint_transcripts/peppermint_contigs.fasta.transdecoder.pep) against it. Use default parameters (BLOSUM62, evalue 10, gap-open 11, gap-extend 1) and keep only the top hit. The output should be in output format 6 and saved to a file called peppermint_peptides_to_metacyc.outfmt6. (Commands not shown.)

Find the global alignment between each peppermint peptide and its top BLAST hit, using align0 from Fasta2 (http://faculty.virginia.edu/wrpearson/fasta/fasta2/). As input to align0, you first need to split the multiple-sequence fasta files into files containing one sequence each. For this, use split_fasta.py.

mkdir uniprot_proteins
python split_fasta.py -i uniprot_download.fasta --output_dir uniprot_proteins

Do the same for the peppermint peptides if it isn't already done. (Note: this step will only work on Linux, because Windows doesn't allow pipe characters (|) in filenames. A workaround would be to redo the analysis with the pipe characters removed from the sequence names.)

mkdir ../peppermint_transcripts/individual_peptides
python split_fasta.py -i ../peppermint_transcripts/peppermint_contigs.fasta.transdecoder.pep --output_dir ../peppermint_transcripts/individual_peptides

Calculate sequence identity and save the results to a file called peppermint_peptides_to_metacyc.outfmt6.aligned. First check that the settings at the top of mint_pep_to_uniprot_identity.py are correct, then run the script.

python mint_pep_to_uniprot_identity.py

Download TargetP and configure it. Use the UniProt bulk downloader one more time to get a table with the kingdoms of the sequences: search for the list of UniProt IDs extracted from the updated model files, and create a table file called uniprot_taxonomy_table.tab with the columns Entry and Taxonomic lineage (KINGDOM). Then run TargetP on all of the UniProt accessions.

python run_targetp_metacyc.py > metacyc_targetp.tsv

4. Run PRIAM on all of the peppermint peptides to get the file seqHits.tab

Download and install the PRIAM software and the latest database from http://priam.prabi.fr/. Run PRIAM using the settings in priam_params.ppf to generate the file priam/seqHits.tab.

5. Consolidate all of the annotations generated for the various precursor models into one combined list, and then choose reactions and compartments to assign to each transcript.

The result of this step will be a file called transcript_reaction_associations.tsv, which will be used to build the first iteration of menpiGT_2015.

cd ../peppermint_transcripts

Generate the file transcripts_with_protein_annotations.csv, which consolidates annotations from all of the source databases into a single table of annotations for the peppermint transcripts.

python get_annotations.py

Select the best annotation for each transcript, and save the results as a file called transcript_reaction_associations.tsv.

python generate_transcript_reaction_associations.py

6. Create the model in six iterations

cd ../model
Iteration One

The first iteration consists of the Arabidopsis_Core model, with peppermint transcripts added as annotations. Additional reactions are drawn from AraCyc and MetaCyc if there are peppermint transcripts homologous to the protein sequences associated with the reactions in the database of origin. The necessary files for generating the first iteration are transcript_reaction_associations.tsv and the individual model files (.tsv files) from each database.

The first iteration of the model will get the name "first_model.tsv"

python generate_first_model.py
Iteration Two

The second iteration of menpiGT_2015 consists of all reactions from the first model, with equivalent reactions merged, plus additional annotations and reactions based on PRIAM results, and literature about GT products. Much of the work involved in generating this iteration of the model was manual.

A new model file, manual_reactions_to_generate_model_two.tsv, was created to track changes in this iteration of the modeling.

PRIAM results were used to annotate additional mint transcripts. The table generated in step 5, transcripts_with_protein_annotations.csv, was sorted so that transcripts with good PRIAM annotations (positive_hit_probability > 0.8) but bad annotations from other sources (best protein identity < 70%) could be identified and copied to a new workbook, Sorted_priam_greater_than_80.xlsx. Transcripts and annotations in this subset were evaluated one by one and assigned to categories according to their probable function. Transcripts in the "small_molecule_substrates" category were added to the file for manual reactions.
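Since Pandas is already a prerequisite, that sort-and-filter step can be sketched as below. The column names positive_hit_probability and best_protein_identity are assumptions based on the description above; the real table may use different headers:

```python
import pandas as pd

def priam_candidates(df):
    """Select transcripts with a confident PRIAM hit (> 0.8) but no good
    homology-based annotation from the other sources (< 70% identity)."""
    return df[(df["positive_hit_probability"] > 0.8)
              & (df["best_protein_identity"] < 70)]
```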

A realistic biomass equation was postulated based on a review of the literature, and boundary reactions were adjusted to be consistent with known properties of peppermint glandular trichomes.

Many rounds of flux-balance analysis and related techniques were performed to evaluate the capabilities of the model and identify gaps between sugars and the biomass compounds. Reactions that were to be added or modified were saved in the file manual_reactions_to_generate_model_two.tsv. Scripts that were helpful for this process were: fba.py, test_model_production.py, and test_model_production_all.py

There are many redundant metabolites in AraCyc, and particularly in MetaCyc. A list of redundant metabolites to remove was saved as the file metabolites_to_leave_out.txt

After all of that, a script was run to integrate the changes and generate the second model iteration.

python generate_second_model.py
Iteration Three

The third iteration of menpiGT_2015 consisted of the second iteration with modifications to eliminate thermodynamically impossible loops.

The process of generating the third iteration of the model was very similar to the process of generating the second iteration. The same scripts were used again, as well as the script test_energy_balance.py, to evaluate the model over many rounds of modification. This time the emphasis was on making sure the model did not contain thermodynamically infeasible loops, for example the ability to regenerate ATP or NAD without also converting sugar to CO2, or the ability to produce biomass without taking up precursor molecules.

Reactions that were to be added or modified were saved in the file manual_reactions_to_generate_model_three.tsv

Reactions to be deleted were saved in the file reactions_to_delete_for_third_model.txt

There is one more file necessary for producing the third model: original_fourth_model.tsv. This file is needed because I was not very good at tracking my changes when I first tried to correct the model for thermodynamic balance: I made dozens of changes directly to the model file. At some point I realized it would be necessary to redo some of the previous steps, but I did not want to lose the changes I had made at this step, so original_fourth_model.tsv is a sort of relic of the past. Differences between that model and the second model are automatically reconciled in the process of generating the third model.

To apply all of these changes, and generate the third model, simply run a script.

python generate_third_model.py
Iteration Four

The fourth iteration of menpiGT_2015 consisted of the third iteration with modifications to remove reactions not supported by transcriptional evidence.

At this point in the model building, most of the manual pain is in the past. A simple script removes all reactions from the model that (A) originate with the Arabidopsis Core model, (B) have an associated Arabidopsis transcript but not an associated mint transcript, and (C) are not necessary for biomass production.

These changes are made by a script.

python generate_fourth_model.py
Iteration Five

The fifth iteration of menpiGT_2015 consisted of the fourth iteration modified to eliminate reactions containing singlet metabolites. Such reactions cannot carry flux under any conditions, and are therefore removed from the model. Removing reactions containing singlet metabolites has to be repeated multiple times until there are no reactions left with singlet metabolites.

python generate_fifth_model.py
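The repeated pruning can be sketched as a fixed-point loop. This is illustrative only: the model is simplified here to a mapping from reaction names to metabolite sets, which is not the YASMEnv file format. Removing one reaction can turn other metabolites into singlets, which is why the loop must run until nothing changes:

```python
from collections import Counter

def prune_singlets(reactions):
    """Repeatedly drop reactions containing a singlet metabolite (one
    occurring in only one reaction) until none remain. `reactions` maps
    a reaction name to the set of metabolites it involves."""
    reactions = dict(reactions)
    while True:
        counts = Counter(m for mets in reactions.values() for m in mets)
        doomed = [r for r, mets in reactions.items()
                  if any(counts[m] == 1 for m in mets)]
        if not doomed:
            return reactions
        for r in doomed:
            del reactions[r]
```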
Iteration Six

After our experimental results showed activity of ethanol dehydrogenase, we added exporters for ethanol and lactate, allowing the model to use fermentation as a means of ATP regeneration and to simulate accumulation of fermentation products. The sixth iteration of menpiGT_2015 is generated from the fourth model by adding the new export reactions and then eliminating all reactions with singlet metabolites. Note that further changes and improvements to the model should be based on the fourth iteration or earlier, not on the fifth, because adding a new reaction may reconnect an orphan reaction to the network.

Changes to be made were noted in the file manual_reactions_to_generate_model_six.tsv, and a script was run to integrate them into the model.

python generate_sixth_model.py

7. Comparing the different iterations of the model

The script compare_models.py calculates descriptive statistics for each iteration of the model, to give some idea of how the model progressed, and saves them to a file called model_comparison.tsv. For example, it counts the blocked reactions (reactions that cannot carry flux), the singlet metabolites (metabolites that occur in only one reaction), and the peppermint transcripts included anywhere in the model.

python compare_models.py

8. Knockout simulations

Reaction knockout simulations were performed on the fifth iteration of menpiGT_2015 to determine which reactions are necessary for the production of menthol from raffinose. The output file is single_knockout_fba_sol.tsv, which contains two columns: the reaction name and the flux through the raffinose uptake reaction. A value of 0 indicates that removing the reaction prevents menthol biosynthesis from raffinose.

python single_knockout.py

9. BC-SPOT

BC-SPOT was run on the fifth and sixth iterations of the model. These simulations predict biomass and imports based on gene expression. The output file names include the suffixes model_five_spot_full_output and model_six_spot_full_output, respectively. There are three output files for each simulation: fba_graph_*.sif, fba_model_*.tsv, and fba_sol_*.tsv.

fba_graph_*.sif is a graph file that can be read with Cytoscape. fba_model_*.tsv is the model file with extra columns added for the flux solution calculated by SPOT. fba_sol_*.tsv has five lines: (1) the objective function value, followed by the (2) mitochondrial, (3) plastid, (4) cytosol, and (5) peroxisome net fluxes.

python model_five_spot.py
python model_six_spot.py