IMPORTANT NOTICE

If you are interested in ChemDB, please first look at my follow-up project
MoleculeDatabaseFramework. It has more features and is more reliable and
stable.

https://bitbucket.org/kienerj/moleculedatabaseframework

INTRODUCTION

This framework is intended for applications that need simple, basic
functionality for storing, managing and searching chemical structures.

Use at your own risk! Currently it works with HSQLDB, MySQL, PostgreSQL or
SQL Server Express on one PC with one concurrent user. No further testing has
been done, in particular not with multiple users searching at the same time
or with other RDBMS.

Note: Interfaces are not stable!!!
See https://bitbucket.org/kienerj/chemdb/wiki/future%20developments

********************************************************************************

DEPENDENCIES

1. Chemistry Development Kit - CDK
    * cdk-1.2.7.jar (or newer) with all dependencies
   or only the required CDK dependencies:
    * cdk-pure.jar
    * cdk-isomorphism.jar
    * jgrapht-0.6.0.jar
    * vecmath-1.3.1.jar

2. Logging
    * slf4j-api-1.6.1.jar (or newer)
    * slf4j-ext-1.6.1.jar (or newer)
    * optional: slf4j binding to enable logging (logback, log4j)

3. JDBC driver for the database used

Note: It is recommended NOT to use the full CDK jar because it overrides the
logging settings of this framework and possibly of applications built on it.

INSTALLATION / SETUP

a) Using only required dependencies (recommended)

You need to have Ant (http://jakarta.apache.org/ant/) 1.7.1 or later and
Java 1.6.0 or later installed.

1. Download the ChemDB jar
2. Download the latest cdk-src+libs
3. Unpack cdk-src+libs
4. Open cmd and change to the directory created in the previous step
5. For further info read the README file in that directory
6. Execute "ant dist-all"
7. Go to /dist/jar to see the jar files that were created
8. Add the following jars to your project
    * cdk-pure.jar
    * cdk-isomorphism.jar
9. Add the following dependencies of CDK (in /jar) to your project
    * jgrapht-0.6.0.jar
    * vecmath-1.3.1.jar
10. Download the latest slf4j release
11. Extract it and add the following to your project
    * slf4j-api-<version>.jar
    * slf4j-ext-<version>.jar
    * a logging library and the corresponding slf4j binding
      (logback is suggested, no binding needed)

Note: CDK and certain dependencies do not have an official Maven repository.
You will have to install the jars manually into your local repository.

b) Using full CDK-1.4.2 with all dependencies (for quick setup to try out
   ChemDB)

Download the ChemDB with dependencies jar and add it to your project.

********************************************************************************

INITIAL STEPS / CONFIGURATION

1. Database Setup

Two database tables are required, one holding the molecules and one for the
search index. The search index must be a separate table because maintenance
operations may truncate this table!

Note that PostgreSQL can be case-sensitive with table and column names. It is
suggested to use all lower case letters for table and column names in
PostgreSQL! Use _ instead of CamelCase, e.g. camel_case. This also applies to
the application.properties file: configurable columns should be in lower case
when using PostgreSQL.

Molecule Table creation example:

HSQLDB:

    CREATE TABLE molecule
    (
        molid Identity,
        structure character varying(10000)
    )

MySQL:

    CREATE TABLE molecule
    (
        molid INT(11) NOT NULL AUTO_INCREMENT,
        structure VARCHAR(10000) NULL DEFAULT NULL,
        PRIMARY KEY (molid)
    )

PostgreSQL:

    CREATE TABLE molecule
    (
        molid SERIAL,
        structure character varying(10000),
        CONSTRAINT "Molecule_PK" PRIMARY KEY (molid)
    )

SQL Server 2008 R2:

molid must be an IDENTITY type for this framework to work properly!
8000 is the maximum size for varchar. This can be too small if you store
molfiles of large molecules.

    CREATE TABLE molecule
    (
        molid int IDENTITY(1,1) NOT NULL,
        structure varchar (8000) NULL,
        CONSTRAINT PK_molecule PRIMARY KEY (molid)
    );

Search Index Table creation:

Connect to your database and execute the following command:

HSQLDB:

    create table structuresearchindex
    (
        MOLID INT PRIMARY KEY,
        FINGERPRINT VARBINARY(256),
        ATOMS VARCHAR(400),
        BONDS VARCHAR(900),
        STRUCTUREKEY VARCHAR(200)
    )

MySQL:

    CREATE TABLE structuresearchindex
    (
        MOLID INT(11) NOT NULL,
        FINGERPRINT VARBINARY(256),
        ATOMS VARCHAR(400),
        BONDS VARCHAR(900),
        STRUCTUREKEY VARCHAR(200),
        PRIMARY KEY (MOLID)
    )

PostgreSQL:

    CREATE TABLE structuresearchindex
    (
        molid integer NOT NULL,
        fingerprint bytea,
        atoms character varying(400),
        bonds character varying(900),
        structurekey character varying(200),
        CONSTRAINT "SearchIndex_PK" PRIMARY KEY (molid)
    )

SQL Server 2008 R2:

    CREATE TABLE structuresearchindex
    (
        molid int NOT NULL,
        fingerprint varbinary(256) NOT NULL,
        atoms varchar(400) NOT NULL,
        bonds varchar(900) NOT NULL,
        structurekey varchar(200) NOT NULL,
        CONSTRAINT PK_structuresearchindex PRIMARY KEY (molid)
    );

The FINGERPRINT column must be able to store byte data, so use the type of
your RDBMS that supports this. (This limitation can be circumvented if you
create your own implementation of StructureSearchIndexDataAccessLayer.)

The length of the ATOMS, BONDS and STRUCTUREKEY columns might need to be
adjusted depending on the type of molecules (size/number of atoms) you store.

The table containing the chemical structures probably already exists. However,
it is required that the table has a unique, numeric (Integer), auto-increment
column identifying each molecule.

2. Create your Application

Can't help you with this.

3. Configuration

You need a properties file with the following entries:

    molTableName = <Name of the Table where chemical Structures are stored>
    molIdColumnName = <Name of the numeric unique key Column of %molTableName%>
    structureColumnName = <Name of the Chemical Structure Column of %molTableName%>
    useMoleculeCache = <false or true>
    moleculeDataAccessLayerClass = <Class used to read Chemical Structures from database>
    structureSearchIndexDataAccessLayerClass = <Class used to read search index from database>
    moleculeCreatorClass = <Class used to create Molecules from the stored Data like SMILES>
    fingerprinterClass = <CDK IFingerprinter implementation to use>
    structureSearchIndexManagerClass = <Class that manages search index>
    moleculeManagerClass = <Class for managing the molecules>

molIdColumnName should be the name of the numeric primary key column in the
molecule table.

For moleculeCreatorClass, SMILES or Molfile format is currently available:

    moleculeCreatorClass = org.bitbucket.kienerj.chemdb.formats.MolfileMoleculeCreator

or

    moleculeCreatorClass = org.bitbucket.kienerj.chemdb.formats.SmilesMoleculeCreator

You could also create your own implementations of the corresponding interfaces,
or subclass the existing classes and use those.

useMoleculeCache

Determines if additional information is held in memory. When enabled, the data
from the atoms and bonds columns is kept in an in-memory key-value store
(java.util.Map). Enabling the cache always increases memory consumption and can
affect search performance.

If this is set to true, the default implementation puts no load on the database
during substructure searches, so more of the CPU can be used for graph
matching. This is helpful if the database is running on the same server as the
application and a substructure search is configured to use all available CPU
cores.
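
The entries above can be put together as follows. This is a minimal sketch,
assuming the file uses the standard java.util.Properties key = value syntax;
this README does not say how ChemDB itself reads the file.
ChemDbConfigExample is a hypothetical class name used only for illustration,
and the example needs Java 8 or later.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of ChemDB.
class ChemDbConfigExample {

    // Example file content built from the keys described above. Table and
    // column names are lower case so the same values also suit PostgreSQL.
    static final String EXAMPLE = String.join("\n",
        "molTableName = molecule",
        "molIdColumnName = molid",
        "structureColumnName = structure",
        "useMoleculeCache = false",
        "moleculeCreatorClass = org.bitbucket.kienerj.chemdb.formats.SmilesMoleculeCreator",
        "fingerprinterClass = org.openscience.cdk.fingerprint.ExtendedFingerprinter");

    // Parses properties-file text into a Properties object.
    static Properties parse(String content) {
        Properties props = new Properties();
        try (Reader reader = new StringReader(content)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(EXAMPLE);
        // Print every configured entry to check the file before using it.
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + " -> " + props.getProperty(key));
        }
    }
}
```

Printing the parsed entries like this is a quick way to catch a misspelled key
or an upper-case column name before pointing the framework at the file.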
DEFAULT VALUES USED:

    molTableName = molecule
    molIdColumnName = molid
    structureColumnName = structure
    useMoleculeCache = false
    moleculeDataAccessLayerClass = org.bitbucket.kienerj.chemdb.data.DefaultMoleculeDataAccessLayer
    structureSearchIndexDataAccessLayerClass = org.bitbucket.kienerj.chemdb.data.DefaultStructureSearchIndexDataAccessLayer
    moleculeCreatorClass = org.bitbucket.kienerj.chemdb.formats.MolfileMoleculeCreator
    fingerprinterClass = org.openscience.cdk.fingerprint.ExtendedFingerprinter
    structureSearchIndexManagerClass = org.bitbucket.kienerj.chemdb.searchindex.DefaultStructureSearchIndexManager
    moleculeManagerClass = org.bitbucket.kienerj.chemdb.molecule.DefaultMoleculeManager

Default values are used when no configuration file is found.
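
The fallback to defaults can be pictured with plain java.util.Properties. This
is only a sketch of the idea, not ChemDB's actual loading code:
DefaultsFallbackExample and its methods are hypothetical names. It also assumes
a per-key fallback for files that set only some entries, which this README does
not confirm.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of ChemDB.
class DefaultsFallbackExample {

    // The default values listed above.
    static Properties defaults() {
        Properties d = new Properties();
        d.setProperty("molTableName", "molecule");
        d.setProperty("molIdColumnName", "molid");
        d.setProperty("structureColumnName", "structure");
        d.setProperty("useMoleculeCache", "false");
        d.setProperty("moleculeDataAccessLayerClass",
            "org.bitbucket.kienerj.chemdb.data.DefaultMoleculeDataAccessLayer");
        d.setProperty("structureSearchIndexDataAccessLayerClass",
            "org.bitbucket.kienerj.chemdb.data.DefaultStructureSearchIndexDataAccessLayer");
        d.setProperty("moleculeCreatorClass",
            "org.bitbucket.kienerj.chemdb.formats.MolfileMoleculeCreator");
        d.setProperty("fingerprinterClass",
            "org.openscience.cdk.fingerprint.ExtendedFingerprinter");
        d.setProperty("structureSearchIndexManagerClass",
            "org.bitbucket.kienerj.chemdb.searchindex.DefaultStructureSearchIndexManager");
        d.setProperty("moleculeManagerClass",
            "org.bitbucket.kienerj.chemdb.molecule.DefaultMoleculeManager");
        return d;
    }

    // Loads the file if it exists; otherwise every lookup returns a default.
    static Properties load(Path file) {
        Properties props = new Properties(defaults());
        if (Files.isRegularFile(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                props.load(in);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return props;
    }

    public static void main(String[] args) {
        // The file name is taken from the command line, or a placeholder is used.
        Path file = Paths.get(args.length > 0 ? args[0] : "application.properties");
        Properties props = load(file);
        System.out.println("molTableName = " + props.getProperty("molTableName"));
    }
}
```

Passing the defaults to the Properties constructor means getProperty returns the
default for any key that the file does not set, including every key when the
file is missing.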