M-CAP README file

M-CAP is the first pathogenicity classifier for rare missense variants in the human genome that is tuned to the high sensitivity required in the clinic. By combining previous pathogenicity scores (including SIFT, PolyPhen-2 and CADD) with novel features and a powerful model, we attain the best classifier at all thresholds, reducing a typical exome/genome rare (<1%) missense variant (VUS) list from 300 to 120, while ensuring that 95% of known pathogenic variants are never mistaken as benign.

The main steps for building the M-CAP classifier are enumerated below.

# Building a train and test set

## Pathogenic Variants (HGMD)

Filter steps:

- HGMD DM
- ALFQ.max < 1%
- nonsynonymous
- Not seen during the training phase of any other metric

Command to clean data:

## Benign Variants (ExAC)

- ALFQ.max < %
- nonsynonymous
- Not seen during the training phase of any other metric

Command to clean data:

# How to annotate variants

- semantic effect
- allele frequency
- conservation scores across species

Command to annotate variants:

# How to train the gradient boosting tree classifier

Code to train the classifier:

    ./src/scripts/one_grad_boosting.sh results.txt <# of trees> <learning rate> <subsample> <max-depth> <max-features> <min leafs>

- Dispatch jobs across parasol nodes so that each node trains a model using different parameters.

# How to run the classifier

- Given the best classifier, you can now test it on any data set.

## How to get the ROC curve

Command to build the test curves:

    ./src/scripts/plot_test_curves.py <trained classifier model>

## How to evaluate on a new patient

Command to test the model on a new set of variants:

    ./src/scripts/evaluate_snps.py <trained classifier>

Licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Please contact Gill Bejerano (email@example.com) for any questions.
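As a rough illustration of the parameter sweep described under "How to train the gradient boosting tree classifier", the sketch below enumerates a grid of hyperparameters and prints one `one_grad_boosting.sh` command line per combination, in the argument order given above. The grid values, the `GRID` and `build_commands` names, and the use of `sqrt` for `<max-features>` are hypothetical placeholders, not the settings used for M-CAP. Whether all jobs should share one `results.txt` is also an assumption.

```python
import itertools
import shlex

# Hypothetical search grid: placeholder values, not the ones used for M-CAP.
# Key order mirrors the script's argument order:
# <# of trees> <learning rate> <subsample> <max-depth> <max-features> <min leafs>
GRID = {
    "n_trees": [100, 500],
    "learning_rate": [0.01, 0.1],
    "subsample": [0.5, 1.0],
    "max_depth": [3, 5],
    "max_features": ["sqrt"],  # assumed value format; check the script
    "min_leafs": [1, 10],
}

SCRIPT = "./src/scripts/one_grad_boosting.sh"


def build_commands(grid, results_file="results.txt"):
    """Return one shell command line per hyperparameter combination."""
    names = list(grid)
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        args = [SCRIPT, results_file] + [str(v) for v in values]
        # Quote each argument so the line is safe to paste into a shell.
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


if __name__ == "__main__":
    for command in build_commands(GRID):
        print(command)
```

Each printed line is one independent training job, so the output can be split across cluster nodes with whatever job-submission mechanism is available.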