cyclops_mysql version 1.0
Cheminformatics extensions to MySQL based on OEChem

Andrew Dalke <firstname.lastname@example.org>
Andrew Dalke Scientific AB, Gothenburg, Sweden

ADVERTISING: I make my living from consulting, custom software
development, and training computational chemists in how to be more
effective at the software side of what they do. If you are interested
in my services, want improvements to cyclops_mysql, or are otherwise
interested in giving me money, please do contact me.

This package extends MySQL with a set of user-defined functions for
doing chemical informatics tasks in the database, including
substructure searches and similarity comparisons. The implementation
uses OpenEye's OEChem for the chemistry.

This package is made freely available under the MIT license. For
details see the file "COPYING".

For installation directions see the file "INSTALL".

For benchmarking information see "README.benchmark".

This implementation is quite fast, due in large part to OEChem. The
overhead of a brute-force MySQL extension function is noticeable, so
I've spent time optimizing the common cases where one of the inputs
(SMILES or SMARTS) is a constant string. The resulting extension
handles about 10,000 SMARTS match tests per second on a modern
machine, which is well suited to small- to medium-sized data sets.

Many of the ideas for the implementation API are based on TJ
O'Donnell's book "Design and Use of Relational Databases in
Chemistry." While I have not followed the CHORD API exactly, my API is
derived from the functions and parameters he uses. If you have no
experience with SQL and want to learn how to use it in chemistry
databases, get that book.

There is much more that can be done. For details see the file IDEAS.

If you have any questions, send them to email@example.com. If you are
interested in support, bug-fixes, or new features, please note that I
am a consultant and available for hire.
This MySQL extension adds the following new SQL functions:

  * oe_version([module])

    The possible values for the optional 'module' argument are:

      cyclops - returns the version number of this MySQL extension.
          For this release it will be "1.0".
      oechem - returns a string of the form '2010-08-09 1.7.4' where
          the first term derives from OEChemGetVersion() and the rest
          of the string is from OEChemGetRelease().
      oegraphsim - returns a string of the form '2010-08-09 1.0.0'
          where the first term derives from OEGraphSimGetVersion() and
          the rest of the string is from OEGraphSimGetRelease().

  * oe_licensed([module])

    If module is not specified then return 1 if all needed modules are
    licensed, otherwise return 0. If the module name is specified then
    return 1 if the license for that module is valid, otherwise 0.
    The possible values for module are: 'cyclops', 'oechem' and
    'oegraphsim'. Note that cyclops will return 1 as there is no
    license.

  * oe_license_date(module)

    Returns a string in the format 'YYYY-MM-DD' which is the
    expiration date for the given module. An example date is
    "2009-08-22". The possible values for module are: 'cyclops',
    'oechem' and 'oegraphsim'. Note that cyclops will return
    "9999-12-31" as there is no license.

  * oe_valid_smiles(smiles)

    Returns 1 if the input string is a valid SMILES string, otherwise
    0.

  * oe_cansmiles(smiles)

    Convert the input SMILES string into a canonical (non-isomeric)
    SMILES. It is an error if the input is not a valid SMILES string.
    This uses the OpenEye aromaticity model.

  * oe_isosmiles(smiles)

    Convert the input SMILES string into a canonical isomeric SMILES.
    It is an error if the input is not a valid SMILES string. This
    uses the OpenEye aromaticity model.

  * oe_keksmiles(smiles)

    Convert the input SMILES string into a Kekule (non-canonical)
    isomeric SMILES. It is an error if the input is not a valid SMILES
    string. This uses the OpenEye aromaticity model.
  * oe_matches(smiles, smarts)

    Returns 1 if the SMARTS pattern is found at least once in the
    SMILES, else returns 0.

  * oe_count_matches(smiles, smarts)

    Returns the number of times the SMARTS pattern is found in the
    SMILES, up to 1024 matches. This returns 0 if there are no
    matches.

  * oe_count_umatches(smiles, smarts)

    Returns the number of times the SMARTS pattern is uniquely found
    in the SMILES, up to 1024 matches. This returns 0 if there are no
    matches. A match is unique if it matches a given subset of atoms
    only one time. For example, "CC" has two matches against "CCO"
    but only one unique match.

  * oe_lingosim(str1, str2)

    Compute the Lingo similarity between the two strings as a real
    value. In most cases 'str1' and 'str2' will be canonical SMILES
    strings although other names are possible.

    The Lingo similarity is based on substring similarity. See Grant
    et al., JCIM, 46(5):1912 2006 and Vidal, Thormann and Pons, JCIM,
    45(2):386 2005. The results are comparable to fingerprint
    similarities but don't require the intermediate fingerprint
    calculation and storage.

  * oe_maccs_fp(smiles)

    Return a hex string representation of the 166 bit MACCS key
    fingerprint for the SMILES. For example,
    oe_maccs_fp('Nc1ccccc1O') is
      "0000000000000000000084040000102405488283b3"

    Cyclops fingerprints are a multiple of 8 bits long. The excess two
    bits will be 0. Cyclops fingerprints have the same byte and bit
    order as OpenEye's OEFingerPrint.ToHexString() but omit the
    trailing buffer.

  * oe_path_fp(smiles, num_bits=4096, min_bonds=0, max_bonds=5,
               atom_type=191, bond_type=3)

    Return an OpenEye path fingerprint. The SMILES must be given but
    all other fields are optional, except that if min_bonds is given
    then max_bonds must also be given. Note that SQL does not have
    keyword arguments, only positional ones.

    The parameters are as defined for OEGraphSim's OEMakePathFP()
    except that I prefer a different parameter naming scheme.

    The atom_type and bond_type fields are bit-wise ORs of different
    flags.
    These are not listed in the OpenEye documentation so I list them
    here.

         AtomType                BondType
      =================      ================
        1 = AtomicNumber       1 = BondOrder
        2 = Aromaticity        2 = Chiral
        4 = Chiral             4 = InRing
        8 = FormalCharge
       16 = HvyDegree
       32 = Hybridization      3 = DefaultBond
       64 = InRing               = 1+2
      128 = EqHalogen
      256 = EqAromatic

      191 = DefaultAtom
          = 1+2+4+8+16+32+128   (omits 64)

  * fp_valid(fp)

    Returns 1 if the fingerprint is a valid hex fingerprint (contains
    only the digits 0-9 and the characters A-F and a-f), otherwise
    returns 0.

  * fp_contains(superstructure_fp, substructure_fp)

    Test if the first fingerprint contains the second fingerprint,
    that is, if every bit which is set in the second fingerprint is
    also set in the first fingerprint. The fingerprints are encoded as
    hex strings. Returns 1 if superstructure_fp contains
    substructure_fp, otherwise returns 0. It is an error if either
    fingerprint string contains a non-hex character.

  * fp_tanimoto(fp1, fp2)

    Return the Tanimoto similarity between the two hex-encoded
    fingerprints. The possible values range from 0.0 (not similar, or
    neither fingerprint contains set bits) to 1.0 (identical). It is
    an error if either string is not a hex-encoded value.

>> A note about errors.

Where possible I report errors during query initialization. These
errors should be detailed enough to figure out the cause of the
problem.

Other errors can only be checked during row evaluation time. An
example is "oe_cansmiles(smiles)" when the SMILES comes from another
column in the database. These errors are reported as a NULL result.
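
>> A sketch of the fingerprint function semantics.

The descriptions of fp_valid, fp_contains and fp_tanimoto can be
illustrated outside the database. The Python functions below are
stand-ins that follow the documented behavior on hex strings; they are
not the extension's implementation. Two details are assumptions of
this sketch and are not stated above: the empty string is treated as
invalid, and the two fingerprints must have the same length.

```python
import string

HEX_DIGITS = set(string.hexdigits)  # 0-9, a-f, A-F


def fp_valid(fp):
    """Return 1 if fp contains only hex digits, otherwise 0.

    Assumption of this sketch: the empty string is reported as invalid.
    """
    return 1 if fp and all(c in HEX_DIGITS for c in fp) else 0


def _to_bits(fp):
    # Validate first: int(..., 16) alone would also accept "0x" prefixes,
    # signs, underscores and surrounding whitespace.
    if not fp_valid(fp):
        raise ValueError("not a hex-encoded fingerprint: %r" % (fp,))
    return int(fp, 16)


def _check_same_length(fp1, fp2):
    # Assumption of this sketch: fingerprints of different lengths are
    # rejected rather than padded.
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints differ in length")


def fp_contains(superstructure_fp, substructure_fp):
    """Return 1 if every bit set in the second fp is set in the first."""
    _check_same_length(superstructure_fp, substructure_fp)
    sup = _to_bits(superstructure_fp)
    sub = _to_bits(substructure_fp)
    return 1 if (sup & sub) == sub else 0


def fp_tanimoto(fp1, fp2):
    """Return |A and B| / |A or B|, or 0.0 when no bits are set at all."""
    _check_same_length(fp1, fp2)
    a, b = _to_bits(fp1), _to_bits(fp2)
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union
```

With these stand-ins, fp_contains("0f", "03") is 1 because both bits of
"03" are also set in "0f", and fp_tanimoto("0f", "03") is 2 shared bits
out of 4 set bits, or 0.5. The result does not depend on the bit order
within a byte as long as both strings use the same encoding.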
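
>> A sketch of composing atom_type and bond_type values.

Because oe_path_fp takes atom_type and bond_type as plain integers, it
can help to build them from named flags on the client side. The sketch
below copies the flag values from the AtomType/BondType table; the
Python class and constant names, the table name "compounds" and the
column name "smiles" are hypothetical and exist only for illustration.

```python
from enum import IntFlag


class AtomType(IntFlag):
    # Values copied from the AtomType column of the table.
    ATOMIC_NUMBER = 1
    AROMATICITY = 2
    CHIRAL = 4
    FORMAL_CHARGE = 8
    HVY_DEGREE = 16
    HYBRIDIZATION = 32
    IN_RING = 64
    EQ_HALOGEN = 128
    EQ_AROMATIC = 256


class BondType(IntFlag):
    # Values copied from the BondType column of the table.
    BOND_ORDER = 1
    CHIRAL = 2
    IN_RING = 4


# The documented defaults: 191 for atoms, 3 for bonds.
DEFAULT_ATOM = (
    AtomType.ATOMIC_NUMBER | AtomType.AROMATICITY | AtomType.CHIRAL
    | AtomType.FORMAL_CHARGE | AtomType.HVY_DEGREE
    | AtomType.HYBRIDIZATION | AtomType.EQ_HALOGEN
)
DEFAULT_BOND = BondType.BOND_ORDER | BondType.CHIRAL

# Example: the default atom typing plus ring membership.
custom_atom = int(DEFAULT_ATOM | AtomType.IN_RING)
custom_bond = int(DEFAULT_BOND)

# SQL has only positional arguments, so every parameter before
# atom_type must be spelled out. "compounds" and "smiles" are
# hypothetical names.
query = (
    "SELECT oe_path_fp(smiles, 4096, 0, 5, "
    f"{custom_atom}, {custom_bond}) FROM compounds"
)
```

Here int(DEFAULT_ATOM) is 191 and int(DEFAULT_BOND) is 3, matching the
defaults in the oe_path_fp signature, and adding InRing (64) to the
default atom typing gives 255.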