This repository stores code and data for the following study, which should be referenced by anyone using the contents of this repository:
Zaretzki, J., Browning, M., Hughes, T., and Swamidass, S. (2015) Extending P450 Site-of-Metabolism Models with Region-Resolution Data. Bioinformatics. In press.
Motivation: Cytochrome P450s are a family of enzymes responsible for the metabolism of approximately 90% of FDA-approved drugs. Medicinal chemists often want to know which atoms of a molecule (its metabolized sites) are oxidized by Cytochrome P450s in order to modify their metabolism. Consequently, there are several methods that use literature-derived, atom-resolution data to train models that can predict a molecule’s sites of metabolism. There is, however, much more data available at a lower resolution, where the exact site of metabolism is not known, but the region of the molecule that is oxidized is known. Until now, no site-of-metabolism models made use of region-resolution data.
Results: Here, we describe XenoSite-Region, the first reported method for training site-of-metabolism models with region-resolution data. Our approach uses the Expectation Maximization algorithm to train a site-of-metabolism model. Region-resolution metabolism data was simulated from a large site-of-metabolism dataset, containing 2,000 molecules with 3,400 metabolized and 30,000 un-metabolized sites and covering 9 Cytochrome P450 isozymes. When training on the same molecules (but with only region-level information), we find that this approach yields models almost as accurate as models trained with atom-resolution data. Moreover, we find that atom-resolution trained models are more accurate when also trained with region-resolution data from additional molecules. Our approach, therefore, opens up a way to extend the applicable domain of site-of-metabolism models into larger regions of chemical space. This meets a critical need in drug development by tapping into underutilized data commonly available in most large drug companies.
The XenoSite webserver is available at http://swami.wustl.edu/xenosite.
The code and data are available at https://bitbucket.org/swamidass/xenosite-region/.
This repository contains the data and code needed to execute the Expectation-step (E-step) of the XenoSite-Region algorithm described in Extending P450 Site-of-Metabolism Models with Region-Resolution Data published in Bioinformatics in 2015. The novel contribution of this work is the E-step, while the maximization step is a standard machine-learning practice of training a model from characterized instances having known output.
The purpose of the E-step is to compute the expected values of a set of instances given:
- A set of predictions for those instances (.pred file).
- A set of constraints on those instances (.som file).
In this work, each instance is an atom, and the constraints are that each atom belongs to a group of atoms contained in the same substrate molecule, and that the number of atoms in the group that can be predicted as positive (observed site of metabolism) is predefined. For each group, the E-step:
- Determines all permutations of binary (1 or 0) predictions for all atoms contained in the same group.
- Removes all permutations that do not have an appropriate number of atoms predicted as positive for the given group.
- Determines a score for each permutation based on the set of predictions for those atoms.
- Computes the probability-weighted average prediction for each atom over all viable prediction permutations, weighting each permutation by its likelihood.
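The steps listed above can be illustrated with a minimal Python sketch. This is not the code in `xregion.estep`: the function name `e_step` is hypothetical, the constraint is assumed to be "exactly `n_positive` atoms in the group are positive", and the score of each permutation is assumed to be the product of independent per-atom probabilities. The published paper remains the reference for the actual scoring.

```python
from itertools import product


def e_step(preds, n_positive):
    """Expected label of each atom in one group (illustrative sketch).

    preds      -- predicted probabilities, one per atom in the group
    n_positive -- number of atoms in the group assumed to be positive
    """
    viable = []  # (labels, score) pairs for permutations that fit the constraint
    for labels in product((0, 1), repeat=len(preds)):
        # Discard permutations with the wrong number of positive atoms.
        if sum(labels) != n_positive:
            continue
        # Assumed score: likelihood under independent per-atom predictions.
        score = 1.0
        for p, y in zip(preds, labels):
            score *= p if y else (1.0 - p)
        viable.append((labels, score))

    total = sum(score for _, score in viable)
    if total == 0.0:
        raise ValueError("no viable permutation has non-zero likelihood")

    # Probability-weighted average label of each atom.
    return [
        sum(score * labels[i] for labels, score in viable) / total
        for i in range(len(preds))
    ]


# One group of three atoms, exactly one of which is assumed to be metabolized.
adjusted = e_step([0.9, 0.5, 0.1], n_positive=1)
```

Because exactly one atom is positive in every viable permutation, the adjusted values in this example sum to 1. The sketch enumerates all 2^n permutations, so it is only practical for small groups.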
A more in-depth description of the algorithm can be found in our published paper. The atom-resolution SDFs and their corresponding region-resolution .som files were released as Supplementary Information, and may also be found here (FILL link).
This software requires Python and the following Python libraries:
Execute the following commands to install the software.
$ hg clone ssh://firstname.lastname@example.org/swamidass/xenosite-region
$ cd xenosite-region
$ python setup.py build
$ python setup.py install
Running the Program
Once installed, the program can be run from the command line. Help for the command-line utility is displayed with the -h option.
$ python -m xregion.estep -h
usage: estep.py [-h] [-T TYPE] [-c FILE [FILE ...]]
                [--export-conf-file [FILE]]
                som_file p_file out_file

positional arguments:
  som_file              SOM file from which to read in data.
  p_file                File of probabilities to adjust with the E-step.
  out_file              File to output E-step adjusted predictions.

optional arguments:
  -h, --help            show this help message and exit
  -T TYPE               Allowable values are region, exact_region, atom, site
                        (default=region).

configuration file options:
  -c FILE [FILE ...], --conf-file FILE [FILE ...]
                        specify config files
  --export-conf-file [FILE]
                        translate arguments into a config file
Running the Examples
The example files can be run from the xenosite-region/example directory with the command:
$ python -m xregion.estep example.som example.pred example.pred.out -T exact_region
Different constraints can be used by changing the -T argument. Use the -h argument to see the allowable values.
SOM Input Format
This file encodes the connectivity, topology, and regions of a molecule. It is a whitespace-delimited table with named columns (case-sensitive).
- MOL: A number corresponding to the relative location of the substrate containing the atom in the source SDF file for the given isozyme. For example, atoms contained in the first substrate in the source SDF will have a MOL column value of 1.
- ATOM: A number corresponding to the relative location of the atom compared to other atoms of the same substrate.
- ID: A unique identifier for the atom, composed of the atom's MOL value and its ATOM value, separated by a period.
- TOPOLOGY: A value designating the topological group to which the atom belongs. Atoms belonging to the same substrate that are topologically equivalent will have the same value for this column. Atoms belonging to the same substrate that are not topologically equivalent will always have different values for this column.
- SITE: A value designating the multi-atom group to which the atom belongs. Any halogen or oxygen bound to a single atom is grouped with the atom to which it is bound, forming a multi-atom site.
- BONDS: The comma-separated ATOM values for all atoms to which the atom is bound in the substrate.
- EXP_SOM: The atom will have a value of 1 if it undergoes CYP-mediated metabolism and 0 if it does not. The PRIMARY_SOM, SECONDARY_SOM, and TERTIARY_SOM fields in the source SDF files identify the atoms that have values of 1 for this column.
- EQUIV: An identifier, unique within the given substrate, that indicates the site of metabolism to which the atom belongs. In practice, all atoms that have either the same TOPOLOGY value or the same SITE value will have the same EQUIV value. This field is used for evaluation of model predictions. For all atoms having the same EQUIV value, if any of those atoms has a value of 1 for EXP_SOM, and any of those atoms is predicted first or second out of all atoms of the substrate, the substrate is considered to be correctly predicted. This is known as the Top-2 metric.
- REGION: All atoms in the substrate belonging to the same partitioned region will have the same value for this column.
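The Top-2 metric described for the EQUIV column can be sketched in Python as follows. The function name `top2_correct` and the input layout (one dict per atom, plus a dict of predicted scores) are assumptions made for illustration, and the sample values are made up rather than taken from the example files. Ties between equal scores are broken by input order, which is also an assumption.

```python
def top2_correct(atoms, predictions):
    """Return True if one substrate is correctly predicted under Top-2.

    atoms       -- list of dicts with ATOM, EXP_SOM and EQUIV keys (one substrate)
    predictions -- dict mapping ATOM value -> predicted score
    """
    # Rank atoms from highest to lowest predicted score.
    ranked = sorted(atoms, key=lambda a: predictions[a["ATOM"]], reverse=True)
    # EQUIV groups containing an atom that is ranked first or second.
    top_groups = {a["EQUIV"] for a in ranked[:2]}
    # Correct if any of those groups also contains an observed site.
    return any(a["EXP_SOM"] == 1 and a["EQUIV"] in top_groups for a in atoms)


# Hypothetical substrate: atoms 16 and 20 share an EQUIV value,
# and atom 16 is the observed site of metabolism.
atoms = [
    {"ATOM": 5, "EXP_SOM": 0, "EQUIV": 5},
    {"ATOM": 8, "EXP_SOM": 0, "EQUIV": 8},
    {"ATOM": 16, "EXP_SOM": 1, "EQUIV": 16},
    {"ATOM": 20, "EXP_SOM": 0, "EQUIV": 16},
]

# Atom 20 is ranked second, so its EQUIV group (which holds atom 16) counts.
hit = top2_correct(atoms, {5: 0.9, 8: 0.2, 16: 0.1, 20: 0.8})
# Neither atom 16 nor atom 20 is in the top two here.
miss = top2_correct(atoms, {5: 0.9, 8: 0.8, 16: 0.1, 20: 0.2})
```

In this example `hit` is True even though the observed atom itself is ranked last, because an equivalent atom is ranked second.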
ID MOL ATOM TOPOLOGY SITE BONDS EXP_SOM EQUIV REGION
1.1 1 1 50 20 50 0 1 7
1.3 1 3 49 20 50 0 1 7
1.4 1 4 5 1 16,20,5,11 0 4 1
1.5 1 5 13 2 8,4 0 5 2
1.8 1 8 14 3 12,5 0 8 2
1.11 1 11 1 4 24,15,4 0 11 3
1.12 1 12 12 5 8,15 1 12 2
1.15 1 15 2 6 26,12,11 0 15 3
1.16 1 16 18 7 4 0 16 1
1.20 1 20 18 8 4 0 16 1
1.24 1 24 6 9 30,11 0 24 5
1.26 1 26 15 10 15 0 26 3
1.30 1 30 7 11 24,32 0 30 5
1.32 1 32 3 12 33,37,30 0 32 4
1.33 1 33 16 13 32 0 33 4
1.37 1 37 8 14 39,32 0 37 5
1.39 1 39 11 15 37,41 0 39 6
1.41 1 41 9 16 39,43 0 41 6
1.43 1 43 4 17 44,41,48 0 43 6
1.44 1 44 17 18 43 0 44 7
1.48 1 48 10 19 43,50 0 48 7
1.50 1 50 20 20 48,1,3 0 1 7
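A table in this format can be read with the Python standard library alone. The sketch below is an illustration and not the parser used by `xregion`: the function names `read_som` and `group_by_region` are hypothetical, and it assumes one atom per line with the header on the first line. The sample rows are four rows copied from the example table.

```python
from collections import defaultdict


def read_som(text):
    """Parse SOM-formatted text into a list of per-atom dicts."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, body = lines[0], lines[1:]
    atoms = []
    for fields in body:
        atom = dict(zip(header, fields))
        atom["BONDS"] = atom["BONDS"].split(",")  # neighbouring ATOM values
        atom["EXP_SOM"] = int(atom["EXP_SOM"])    # 1 = observed site, 0 = not
        atoms.append(atom)
    return atoms


def group_by_region(atoms):
    """Map (MOL, REGION) -> list of ATOM values in that region."""
    regions = defaultdict(list)
    for atom in atoms:
        regions[(atom["MOL"], atom["REGION"])].append(atom["ATOM"])
    return dict(regions)


SOM_TEXT = """
ID MOL ATOM TOPOLOGY SITE BONDS EXP_SOM EQUIV REGION
1.5 1 5 13 2 8,4 0 5 2
1.8 1 8 14 3 12,5 0 8 2
1.11 1 11 1 4 24,15,4 0 11 3
1.12 1 12 12 5 8,15 1 12 2
"""

atoms = read_som(SOM_TEXT)
regions = group_by_region(atoms)
# regions == {("1", "2"): ["5", "8", "12"], ("1", "3"): ["11"]}
```

Values other than BONDS and EXP_SOM are left as strings, so MOL and ATOM can be compared directly with the same columns read from a .pred file.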
PRED Input and Output Format
A .pred file contains a row for each atom, with MOL and ATOM fields equivalent to those for a .som file, and a PREDICTION field which contains a value between 0 and 1. The ID column is optional.
ID MOL ATOM PREDICTION
1.1 1 1 0.7108205865063646
1.3 1 3 0.734227761353739
1.4 1 4 0.8501762953348507
1.5 1 5 0.9108665442843262
1.8 1 8 0.6548000288312935
1.11 1 11 0.9886258218403073
1.12 1 12 0.1550316840459227
1.15 1 15 0.9380128063019367
1.16 1 16 0.24762540810401046
1.20 1 20 0.2533649445190669
1.24 1 24 0.27282763702874324
1.26 1 26 0.4421490724452566
1.30 1 30 0.3953124416757402
1.32 1 32 0.36392236451394966
1.33 1 33 0.21706058432059072
1.37 1 37 0.6763334739463183
1.39 1 39 0.7868959843842305
1.41 1 41 0.23379946504645943
1.43 1 43 0.8756125229411463
1.44 1 44 0.35517910217429627
1.48 1 48 0.334020861305936
1.50 1 50 0.41832096612808256
When running the E-step algorithm, all MOL/ATOM pairs in the PRED file must occur in the SOM file.
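This requirement can be checked before running the E-step. The following Python sketch is one possible pre-flight check, not part of the released program: `parse_table` and `missing_from_som` are hypothetical helper names, and how `xregion.estep` itself reacts to a missing pair is not covered here. The `1.99` row in the sample PRED text is made up to trigger the check.

```python
def parse_table(text):
    """Parse a whitespace-delimited table with a header row into dicts."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, body = lines[0], lines[1:]
    return [dict(zip(header, fields)) for fields in body]


def missing_from_som(som_rows, pred_rows):
    """Return the (MOL, ATOM) pairs found in PRED but absent from SOM."""
    som_keys = {(row["MOL"], row["ATOM"]) for row in som_rows}
    return [
        (row["MOL"], row["ATOM"])
        for row in pred_rows
        if (row["MOL"], row["ATOM"]) not in som_keys
    ]


som_rows = parse_table("""
ID MOL ATOM TOPOLOGY SITE BONDS EXP_SOM EQUIV REGION
1.5 1 5 13 2 8,4 0 5 2
1.8 1 8 14 3 12,5 0 8 2
""")

pred_rows = parse_table("""
ID MOL ATOM PREDICTION
1.5 1 5 0.9108665442843262
1.8 1 8 0.6548000288312935
1.99 1 99 0.5
""")

missing = missing_from_som(som_rows, pred_rows)
# missing == [("1", "99")]
```

An empty result means every MOL/ATOM pair in the PRED data has a matching row in the SOM data.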