Predicting disease-associated mutation of metal binding sites of a protein using a deep learning approach

Here, we develop a multi-channel convolutional neural network (MCCNN) to predict disease-associated mutations in the metal-binding sites of metalloproteins. We integrated OMICS data from several databases, including ClinVar, CancerResource2, UniProt humsavar, and MetalPDB, to extract disease-associated and benign missense mutations that occur in metal-binding sites. We then extracted spatial and sequential features from these datasets to train the MCCNN model. This work has been published in Nature Machine Intelligence:

Mohamad Koohi-Moghadam, Haibo Wang, Yuchuan Wang, Xinming Yang, Hongyan Li, Junwen Wang and Hongzhe Sun, "Predicting disease-associated mutation of metal-binding sites in proteins using a deep learning approach", Nature Machine Intelligence.


The proposed model:

You can download the data and model from here.

Package dependencies:


OMICs data integration:

We integrated OMICS data from several databases, including ClinVar, CancerResource2, UniProt humsavar, and MetalPDB, to find missense mutations that occur in metal-binding sites. We used the disease-associated mutations as the positive-label dataset and the benign mutations as the negative-label dataset to train the model. Finally, we evaluated the model using 10-fold cross-validation and an unseen dataset.

fig-1.PNG


Workflow of the proposed model:

We extracted the human metal-binding sites from MetalPDB. We then used ClinVar, UniProt humsavar, and CancerResource2 to extract missense mutations that occur in these metal-binding sites. We collected mutations of residues that bind the metal directly (first coordination sphere, shown in blue) as well as those in the second shell (second coordination sphere, shown in red). Next, we used AutoGrid to generate spatial features in the form of five different affinity grid maps, and the ProPy Python package to generate sequential features by extracting physicochemical features from the amino acid sequence of the metal-binding sites. Finally, we used these spatial and sequential features to train the MCCNN model.

fig1.png


Extract disease-associated mutations:

We developed a Python script to extract the amino acids in the first and second coordination spheres of the metal-binding sites. We assigned an amino acid to the first coordination sphere if its alpha-carbon lies less than 5 Å from the given metal element. Amino acids whose alpha-carbon lies between 5 Å and 10 Å from the metal element were assigned to the second coordination sphere. MetalCoordination.py extracts the first- and second-coordination-sphere amino acids of different metal-binding sites.

#!text

usage: MetalCoordination.py [-h] [--PDBFile STRPDB] [--output STROUT]

MetalCoordination is script to extract residues which are in the first and
second coordination of the metal binding sites

optional arguments:
  -h, --help        show this help message and exit

Input::
  --PDBFile STRPDB  Enter the path of the PDB.

Output::
  --output STROUT   Enter the path of the output CSV file
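
The distance rule described above can be illustrated with a minimal sketch. This is not the code of MetalCoordination.py: the function name `classify_residues`, the input layout, the residue labels, and the treatment of the exact 5 Å and 10 Å boundaries are assumptions made for illustration only.

```python
import math

FIRST_SHELL_MAX = 5.0    # Å, first coordination sphere cutoff from the text
SECOND_SHELL_MAX = 10.0  # Å, second coordination sphere cutoff from the text


def classify_residues(metal_xyz, ca_atoms):
    """Assign residues to coordination spheres by alpha-carbon distance.

    ca_atoms: iterable of (residue_label, (x, y, z)) alpha-carbon records.
    Boundary handling (d == 5 or d == 10) is an assumption of this sketch.
    """
    shells = {"first": [], "second": []}
    for label, xyz in ca_atoms:
        d = math.dist(metal_xyz, xyz)
        if d < FIRST_SHELL_MAX:
            shells["first"].append(label)
        elif d <= SECOND_SHELL_MAX:
            shells["second"].append(label)
        # residues farther than 10 Å are ignored
    return shells


# Made-up coordinates and labels, for illustration only.
metal = (0.0, 0.0, 0.0)
ca_atoms = [
    ("RES_A", (3.0, 0.0, 0.0)),   # 3.0 Å from the metal
    ("RES_B", (0.0, 7.5, 0.0)),   # 7.5 Å from the metal
    ("RES_C", (12.0, 0.0, 0.0)),  # 12.0 Å from the metal
]
shells = classify_residues(metal, ca_atoms)
print(shells)
```

In a real run the alpha-carbon and metal coordinates would come from the PDB file passed via `--PDBFile`.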

Generate spatial features:

The GenerateSpatialFeatures.py script builds the five energy-based affinity grid maps.

#!text

GenerateSpatialFeatures is script to generate five different energy-based
affinity grid maps for each receptor

optional arguments:
  -h, --help            show this help message and exit

Input::
  --PDBQTFolder STRPDBQTFOLDER
                        Enter the path of the folder that contain PDBQT files.

Output::
  --output STROUT       Enter the path of the output directory

The script generates five energy-based grid maps, which we use as input to the multi-channel CNN model. The five grid maps are:

  • Aliphatic Carbon
  • Aromatic Carbon
  • Hydrogen-bond donor hydrogen
  • Hydrogen-bond acceptor oxygen
  • Electron (e), the electrostatic map

For example, the following figure shows how we build the electrostatic grid map: we place an electron at each probe point of the 3D lattice and calculate the interaction energy between the electron and all pocket atoms:

Capture.PNG
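
As a rough illustration of the idea, the sketch below places a unit negative probe charge at every point of a small cubic lattice and sums a plain Coulomb term over all pocket atoms. It is a simplification under stated assumptions, not AutoGrid's energy function: the constant-dielectric Coulomb form, the distance clamp `min_dist`, the function name `electrostatic_grid`, and the example atom are all hypothetical.

```python
import math

COULOMB_KCAL = 332.06  # approx. Coulomb constant in kcal*Å/(mol*e^2)


def electrostatic_grid(atoms, center, n_points, spacing,
                       probe_charge=-1.0, min_dist=0.5):
    """Return grid[i][j][k] of probe/pocket interaction energies.

    atoms: list of ((x, y, z), partial_charge) for the pocket atoms.
    min_dist clamps very short distances to avoid division by zero
    (an assumption of this sketch).
    """
    half = (n_points - 1) / 2.0
    # Lattice coordinates along x, y and z, centred on `center`.
    axes = [[c + (i - half) * spacing for i in range(n_points)]
            for c in center]
    grid = []
    for x in axes[0]:
        plane = []
        for y in axes[1]:
            row = []
            for z in axes[2]:
                energy = 0.0
                for xyz, charge in atoms:
                    r = max(math.dist((x, y, z), xyz), min_dist)
                    energy += COULOMB_KCAL * probe_charge * charge / r
                row.append(energy)
            plane.append(row)
        grid.append(plane)
    return grid


# One made-up +2 charge at the origin on a 3 x 3 x 3 lattice, 1 Å spacing.
pocket_atoms = [((0.0, 0.0, 0.0), 2.0)]
grid = electrostatic_grid(pocket_atoms, center=(0.0, 0.0, 0.0),
                          n_points=3, spacing=1.0)
print(grid[2][1][1])  # energy at the lattice point (1, 0, 0)
```

Each of the five maps is a 3D array of this kind, which is why they can be stacked as separate input channels of a CNN.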


Extract sequential features:

We extracted 1047 physicochemical features from the amino acid sequence of the metal-binding sites as sequential features.

fig2.png

Users can extract the sequential features of the metal pockets with the GenerateSeqFeatures.py script:

#!text

GenerateSeqFeatures is script to generate sequential features of the metal
binding pockets

optional arguments:
  -h, --help            show this help message and exit

Input::
  --PDBFolder STRPDBFOLDER
                        Enter the path of the folder that contain PDB files.

Output::
  --output STROUT       Enter the path of the output directory
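
To give a feel for what a sequence-derived descriptor looks like, the sketch below computes amino acid composition, one simple descriptor family. It is an illustrative assumption only: it does not call ProPy, it does not reproduce the 1047 features, and the function name `amino_acid_composition` and the example sequence are made up.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the percentage of each standard amino acid in `sequence`."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


# Made-up pocket sequence, for illustration only.
composition = amino_acid_composition("HHCCDE")
print(composition["H"], composition["C"], composition["W"])
```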

Besides the spatial and sequential features, we used five metadata features taken from the integrated database: the original amino acid type, the mutated amino acid type, the location of the amino acid in the protein, the metal type, and the type of interaction between the amino acid and the metal (direct or indirect). Because the MCCNN takes numerical variables as input, we converted the categorical metadata variables to numerical ones using one-hot encoding.
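
A minimal sketch of such an encoding is shown below. The category lists, the feature order, keeping the residue position as a plain integer, and the function names are assumptions for illustration; the actual encoding used by the scripts may differ.

```python
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
METALS = ["Zn", "Ca", "Mg"]            # assumed category list
INTERACTIONS = ["direct", "indirect"]  # first or second coordination sphere


def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


def encode_metadata(original_aa, mutated_aa, position, metal, interaction):
    """Concatenate the five metadata features into one numeric vector."""
    return (
        one_hot(original_aa, AMINO_ACIDS)
        + one_hot(mutated_aa, AMINO_ACIDS)
        + [position]  # already numeric, kept as-is in this sketch
        + one_hot(metal, METALS)
        + one_hot(interaction, INTERACTIONS)
    )


# Made-up mutation: His -> Arg at position 94 of a Zn site, direct binding.
vector = encode_metadata("H", "R", 94, "Zn", "direct")
print(len(vector))
```

With these category lists, each mutation becomes a vector of 20 + 20 + 1 + 3 + 2 = 46 numbers.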


Run the model:

Finally, we used the spatial and sequential features to train the MCCNN on Zn, Ca, and Mg metal-binding sites.

pipeline.png

Multi-Channel-CNN.py builds the model and evaluates it using 10-fold cross-validation. Users need to supply the spatial features generated in the previous step.

#!text

Multi-Channel-CNN is script to build and evaluate a multi-channel convolution
neural network to predict disease associated mutation of metal-binding site. We
used 10-fold cross validation to evaluate our approach on Zn-binding metal binding site. 

optional arguments:
  -h, --help            show this help message and exit

Input::
  --seqFeatures STRSEQFEATURES
                        Enter the path of the sequential features. It should
                        be in CVS format as describe above.
  --spatialFeatures STRSPATIALFEATURES
                        Enter the path of the Spatial features folder.
  --filters STRFILTERS  Enter the number of filters.

Output::
  --o STROUT            Enter the path of the output directory
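
The 10-fold cross-validation scheme can be sketched as follows. This is a generic illustration, not the splitting code of Multi-Channel-CNN.py: stratifying by label, the fixed random seed, the function name `stratified_kfold`, and the toy label counts are all assumptions.

```python
import random


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) for k folds.

    Samples of each class are shuffled and dealt round-robin into the
    folds so that every fold keeps roughly the overall class balance.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        yield train_idx, test_idx


# Toy labels: 30 disease-associated (1) and 70 benign (0) mutations.
labels = [1] * 30 + [0] * 70
splits = list(stratified_kfold(labels, k=10))
for train_idx, test_idx in splits:
    # A model would be trained on train_idx and scored on test_idx here.
    pass
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each sample appears in exactly one test fold, so the ten scores together cover the whole dataset once.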

Figure_3.png

Performance of MCCNN in predicting disease-associated mutations. 10-fold cross-validation results of the MCCNN model for Zn-binding sites (a), Ca-binding sites (b), Mg-binding sites (c), and a true positive dataset containing Zn-, Ca-, and Mg-binding sites (d).
