Predicting disease-associated mutation of metal binding sites of a protein using a deep learning approach
Here, we develop a multi-channel convolutional neural network (MCCNN) to predict disease-associated mutations in the metal-binding sites of metalloproteins. We integrated OMICS data from several databases, including ClinVar, CancerResource2, UniProt humsavar, and MetalPDB, to extract disease-associated and benign missense mutations that occur in metal-binding sites. We then extracted spatial and sequential features from these datasets to train the MCCNN model. This work has been published in the journal Nature Machine Intelligence.
The proposed model:
You can download the data and model from here.
Package dependencies:
- Python 3
- Keras >2.2.4
- ProPy >1.0.0
- numpy >1.13.1
- pandas >0.20.3
- scipy >1.18.1
- scikit-learn >0.21
- matplotlib >3.1.0
- MGLTools 1.5.6
- AutoGrid 4.2 (available in the source code)
OMICs data integration:
We integrated OMICS data from several databases, including ClinVar, CancerResource2, UniProt humsavar, and MetalPDB, to find missense mutations that occur in metal-binding sites. We used the disease-associated and benign mutations as the positive and negative labels, respectively, to train the model. Finally, we evaluated the model using 10-fold cross-validation and an unseen dataset.
Workflow of the proposed model:
We extracted the human metal-binding sites from MetalPDB. We then used ClinVar, UniProt humsavar, and CancerResource2 to extract missense mutations that occur in these sites. We collected mutations of residues that bind the metal directly (first coordination sphere, shown in blue) as well as mutations of residues in the second shell (second coordination sphere, shown in red). We then used AutoGrid to generate spatial features in the form of five affinity grid maps, and the ProPy Python package to generate sequential features by extracting physicochemical features from the amino acid sequence of each metal-binding site. Finally, we used these spatial and sequential features to train the MCCNN model.
Extract disease-associated mutations:
We developed a Python script to extract the amino acids in the first and second coordination spheres of the metal-binding sites. An amino acid is assigned to the first coordination sphere if its alpha-carbon is less than 5 Å from the given metal atom. Amino acids whose alpha-carbon lies between 5 Å and 10 Å from the metal atom are assigned to the second coordination sphere. MetalCoordination.py extracts the first- and second-sphere amino acids of the metal-binding sites in a PDB file.
    usage: MetalCoordination.py [-h] [--PDBFile STRPDB] [--output STROUT]

    MetalCoordination is script to extract residues which are in the first and
    second coordination of the metal binding sites

    optional arguments:
      -h, --help        show this help message and exit

    Input::
      --PDBFile STRPDB  Enter the path of the PDB.

    Output::
      --output STROUT   Enter the path of the output CSV file
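The distance rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code of MetalCoordination.py: the function names, the set of metal residue names, the treatment of a distance of exactly 10 Å, and the output tuple layout are all hypothetical. Only the 5 Å and 10 Å alpha-carbon cutoffs come from the text.

```python
import math

FIRST_SHELL_MAX = 5.0    # Å, first-sphere cutoff stated in the text
SECOND_SHELL_MAX = 10.0  # Å, second-sphere cutoff stated in the text
METALS = {"ZN", "CA", "MG"}  # assumption: residue names of the metals of interest


def parse_atoms(pdb_lines):
    """Collect metal ions and alpha-carbons from fixed-width PDB records."""
    metals, alpha_carbons = [], []
    for line in pdb_lines:
        record = line[0:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        atom_name = line[12:16].strip()
        res_name = line[17:20].strip()
        chain = line[21]
        res_seq = int(line[22:26])
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        # Calcium ions and alpha-carbons share the atom name "CA", so the
        # record type is used to tell them apart.
        if record == "HETATM" and res_name in METALS:
            metals.append((res_name, chain, res_seq, xyz))
        elif record == "ATOM" and atom_name == "CA":
            alpha_carbons.append((res_name, chain, res_seq, xyz))
    return metals, alpha_carbons


def classify_residues(pdb_lines):
    """Return (metal, residue, chain, position, shell, distance) tuples."""
    metals, alpha_carbons = parse_atoms(pdb_lines)
    rows = []
    for metal_name, _, _, metal_xyz in metals:
        for res_name, chain, res_seq, res_xyz in alpha_carbons:
            distance = math.dist(metal_xyz, res_xyz)
            if distance < FIRST_SHELL_MAX:
                shell = "first"
            elif distance <= SECOND_SHELL_MAX:
                shell = "second"
            else:
                continue  # outside both coordination spheres
            rows.append((metal_name, res_name, chain, res_seq, shell,
                         round(distance, 2)))
    return rows
```

Calling `classify_residues(open(path).readlines())` on a PDB file would give one row per metal and nearby residue, which could then be written to CSV.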
Generate spatial features:
The GenerateSpatialFeatures.py script builds the five energy-based affinity grid maps.
    GenerateSpatialFeatures is script to generate five different energy-based
    affinity grid maps for each receptor

    optional arguments:
      -h, --help            show this help message and exit

    Input::
      --PDBQTFolder STRPDBQTFOLDER
                            Enter the path of the folder that contain PDBQT files.

    Output::
      --output STROUT       Enter the path of the output directory
The script generates five energy-based grid maps, which serve as the input channels of the multi-channel CNN model. The five grid maps are:
- Aliphatic Carbon
- Aromatic Carbon
- Hydrogen-bond donor hydrogen
- Hydrogen-bond acceptor oxygen
- Electrostatic (electron probe, e)
For example, in the following figure we build the electrostatic grid map by placing an electron probe at each point of the 3D lattice and calculating the interaction energy between the electron and all pocket atoms:
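To make the idea of an electrostatic grid map concrete, the sketch below places a probe of charge -1 at every point of a cubic lattice and sums a plain Coulomb term over all pocket atoms. It is a simplified illustration and not how AutoGrid computes its maps: AutoGrid applies its own dielectric model and parameters, whereas this sketch assumes a constant dielectric. The function name, the default grid size and spacing, and the minimum-distance clamp are all assumptions.

```python
import numpy as np

COULOMB_CONSTANT = 332.06  # kcal * Å / (mol * e^2), approximate


def electrostatic_grid(atom_xyz, atom_charges, center, n_points=20,
                       spacing=1.0, probe_charge=-1.0, dielectric=1.0,
                       min_distance=0.5):
    """Coulomb energy of a point probe at each node of a cubic lattice.

    Returns an array of shape (n_points, n_points, n_points) in kcal/mol.
    """
    atom_xyz = np.asarray(atom_xyz, dtype=float)          # (n_atoms, 3)
    atom_charges = np.asarray(atom_charges, dtype=float)  # (n_atoms,)

    # Lattice coordinates centred on the pocket centre.
    offsets = (np.arange(n_points) - (n_points - 1) / 2.0) * spacing
    gx, gy, gz = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    grid_xyz = np.stack([gx, gy, gz], axis=-1) + np.asarray(center, dtype=float)

    # Distance from every lattice point to every atom: (n, n, n, n_atoms).
    diff = grid_xyz[..., None, :] - atom_xyz
    dist = np.linalg.norm(diff, axis=-1)
    # Clamp short distances so a probe on top of an atom does not diverge.
    dist = np.maximum(dist, min_distance)

    # Sum the pairwise Coulomb terms over all atoms.
    pairwise = atom_charges / (dielectric * dist)
    return COULOMB_CONSTANT * probe_charge * pairwise.sum(axis=-1)
```

Stacking five such arrays, one per probe type, along a channel axis would give a five-channel 3D tensor of the kind a multi-channel CNN consumes.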
Extract sequential features:
We extracted 1047 physicochemical features from the amino acid sequence of each metal-binding site as sequential features.
The GenerateSeqFeatures.py script extracts the sequential features of the metal pockets:
    GenerateSeqFeatures is script to generate sequential features of the metal
    binding pockets

    optional arguments:
      -h, --help            show this help message and exit

    Input::
      --PDBFolder STRPDBFOLDER
                            Enter the path of the folder that contain PDB files.

    Output::
      --output STROUT       Enter the path of the output directory
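The 1047 descriptors come from ProPy and are not reproduced here. As an illustration of what a sequence-derived physicochemical feature looks like, the sketch below computes amino acid composition, the percentage of each of the 20 standard residues in a sequence. The function name is hypothetical, and whether this exact descriptor is among the 1047 features is an assumption.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def amino_acid_composition(sequence):
    """Return the percentage of each standard amino acid in the sequence."""
    # Ignore anything that is not a standard residue (gaps, X, lowercase noise).
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    total = len(residues)
    return {aa: 100.0 * residues.count(aa) / total for aa in AMINO_ACIDS}
```

Each sequence maps to a fixed-length vector of 20 values regardless of its length, which is the property that makes such descriptors usable as model input.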
Besides the spatial and sequential features, we used five metadata features taken from the integrated database: the original amino acid type, the mutated amino acid type, the location of the amino acid in the protein, the metal type, and the type of interaction between the amino acid and the metal (direct or indirect). Because the MCCNN accepts only numerical input, we converted the categorical metadata variables to numerical form using one-hot encoding.
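A minimal sketch of one-hot encoding for such metadata follows. The field names (`wild_type`, `mutant`, `metal`, `interaction`) and the example values are hypothetical and only cover the fields that are clearly categorical. The repository may well use pandas or scikit-learn for this step; the sketch uses plain Python to keep the mechanics visible.

```python
def fit_categories(records, columns):
    """Collect the sorted set of values observed for each categorical column."""
    return {col: sorted({rec[col] for rec in records}) for col in columns}


def one_hot_encode(record, categories):
    """Concatenate one indicator vector per column, in column order."""
    vector = []
    for col, values in categories.items():
        vector.extend(1.0 if record[col] == value else 0.0 for value in values)
    return vector


# Hypothetical metadata rows for two mutations.
records = [
    {"wild_type": "HIS", "mutant": "ARG", "metal": "ZN", "interaction": "direct"},
    {"wild_type": "CYS", "mutant": "TYR", "metal": "CA", "interaction": "indirect"},
]
columns = ["wild_type", "mutant", "metal", "interaction"]
categories = fit_categories(records, columns)
encoded = [one_hot_encode(rec, categories) for rec in records]
```

The categories should be fitted once on the training data and reused for unseen data, so that every vector has the same length and column meaning.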
Run the model:
Finally, we used the spatial and sequential features to train the MCCNN on Zn, Ca, and Mg metal-binding sites.
Multi-Channel-CNN.py builds the model and evaluates it using 10-fold cross-validation. It takes as input the features generated in the previous steps.
    Multi-Channel-CNN is script to build and evaluate a multi-channel
    convolution neural network to predict disease associated mutation of
    metal-binding site. We used 10-fold cross validation to evaluate our
    approach on Zn-binding metal binding site.

    optional arguments:
      -h, --help            show this help message and exit

    Input::
      --seqFeatures STRSEQFEATURES
                            Enter the path of the sequential features. It should
                            be in CVS format as describe above.
      --spatialFeatures STRSPATIALFEATURES
                            Enter the path of the Spatial features folder.
      --filters STRFILTERS  Enter the number of filters.

    Output::
      --o STROUT            Enter the path of the output directory
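The 10-fold cross-validation used for evaluation can be sketched as follows. This is an illustration only: the function name is hypothetical, and whether Multi-Channel-CNN.py shuffles or stratifies its folds is not stated, so the stratification here (keeping the disease/benign ratio similar in every fold) is an assumption.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, n_splits=10, seed=0):
    """Yield (train_indices, test_indices) pairs with balanced class ratios."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Shuffle each class, then deal its samples round-robin across the folds.
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)

    # Each fold is the test set once; the other folds form the training set.
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k
                           for i in fold)
        yield train_idx, test_idx
```

In each iteration a fresh model would be trained on `train_idx` and scored on `test_idx`, and the ten scores would then be averaged.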
Performance of the MCCNN in predicting disease-associated mutations. 10-fold cross-validation results of the MCCNN model for Zn-binding sites (a), Ca-binding sites (b), Mg-binding sites (c), and for a true positive dataset containing Zn-, Ca-, and Mg-binding sites (d).