IMPORTANT NOTICE

If you are interested in ChemDB, please first look at my follow-up project
MoleculeDatabaseFramework. It has more features and is more reliable and stable.

https://bitbucket.org/kienerj/moleculedatabaseframework

INTRODUCTION

This framework is intended for applications that need simple, basic
functionality for storing, managing and searching chemical structures.

Use at your own risk!

It currently works with HSQLDB, MySQL, PostgreSQL and SQL Server Express
on a single PC with one concurrent user.
No further testing has been done, in particular not with multiple users
searching at the same time or with other RDBMS.

Note: Interfaces are not stable!!! See https://bitbucket.org/kienerj/chemdb/wiki/future%20developments

********************************************************************************

DEPENDENCIES

   1. Chemistry Development Kit - CDK
          * cdk-1.2.7.jar (or newer) with all dependencies		  
          or only required cdk dependencies:
          * cdk-pure.jar
          * cdk-isomorphism.jar
          * jgrapht-0.6.0.jar
          * vecmath-1.3.1.jar 
   2. logging
          * slf4j-api-1.6.1.jar (or newer)
          * slf4j-ext-1.6.1.jar (or newer)
          * optional: slf4j binding to enable logging (logback, log4j) 
   3. JDBC Driver for the used Database 

Note: It is recommended to NOT USE the full CDK jar because it overrides logging
settings of this framework and possibly applications built on it.

INSTALLATION / SETUP

a) Using only required dependencies (recommended)

You need to have Ant (http://jakarta.apache.org/ant/) 1.7.1 or newer and
Java 1.6.0 or newer installed.

   1. Download the ChemDB jar
   2. Download the latest cdk-src+libs
   3. Unpack cdk-src+libs
   4. Open a command prompt and change to the directory created in the previous step
   5. For further information, read the README file in that directory
   6. Execute "ant dist-all"
   7. Go to /dist/jar to find the jar files that were created
   8. Add the following jars to your project
          * cdk-pure.jar
          * cdk-isomorphism.jar
   9. Add the following dependencies of CDK (in /jar) to your project
          * jgrapht-0.6.0.jar
          * vecmath-1.3.1.jar
  10. Download the latest slf4j release
  11. Extract it and add the following to your project
          * slf4j-api-<version>.jar
          * slf4j-ext-<version>.jar
          * a logging library and the corresponding slf4j binding
            (logback is suggested, no binding needed)

Note:
CDK and certain dependencies do not have an official Maven repository.
You will have to install the jars manually into your local repository.

b) Using full CDK-1.4.2 with all dependencies (for quick setup to try out ChemDB)

Download the ChemDB jar with dependencies and add it to your project.

********************************************************************************

INITIAL STEPS / CONFIGURATION

1. Database Setup

Two database tables are required: one holding the molecules and one for the
search index.
The search index must be a separate table because maintenance operations may
truncate it!

Note that PostgreSQL can be case-sensitive with table and column names. It is
suggested to use only lower-case letters for table and column names in PostgreSQL!
Use _ instead of CamelCase, e.g. camel_case. This also applies to the
application.properties file: configurable column names should be in lower case
when using PostgreSQL.
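
The naming advice above can be sketched as a small helper that turns a
CamelCase name into the lower-case, underscore-separated form. This is only an
illustration: the class PgNames and the method toSnakeCase are hypothetical
names and not part of ChemDB.

```java
import java.util.Locale;

/** Hypothetical helper, not part of ChemDB: derives lower-case names for PostgreSQL. */
final class PgNames {
    private PgNames() {}

    /** Converts e.g. "camelCase" or "CamelCase" to "camel_case". */
    static String toSnakeCase(String name) {
        // Insert an underscore at every lower-case/digit to upper-case boundary,
        // then fold the whole name to lower case.
        return name.replaceAll("([a-z0-9])([A-Z])", "$1_$2").toLowerCase(Locale.ROOT);
    }
}
```

Names that are already lower case, such as molid, pass through unchanged.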

Molecule Table creation Example:

HSQLDB:
CREATE TABLE molecule
(
  molid Identity,
  structure character varying(10000)
)

MySQL:
CREATE TABLE molecule
(
    molid INT(11) NOT NULL AUTO_INCREMENT,
    structure VARCHAR(10000) NULL DEFAULT NULL,
    PRIMARY KEY (molid)
)

PostgreSQL:
CREATE TABLE molecule
(
  molid SERIAL,
  structure character varying(10000),
  CONSTRAINT "Molecule_PK" PRIMARY KEY (molid)
)

SQL Server 2008 R2:
(molid must be an IDENTITY type for this framework to work properly!
8000 is the maximum size for varchar. This can be too small if you use molfiles
for large molecules.)
CREATE TABLE molecule (
  molid int IDENTITY(1,1) NOT NULL,
  structure varchar(8000) NULL,
  CONSTRAINT PK_molecule PRIMARY KEY (molid)
);


Search Index Table creation:
Connect to your database and execute the following command:

HSQLDB:
create table structuresearchindex (
    MOLID INT PRIMARY KEY,
    FINGERPRINT VARBINARY(256),
    ATOMS VARCHAR(400),
    BONDS VARCHAR(900),
    STRUCTUREKEY VARCHAR(200)
)

MySQL:
CREATE TABLE structuresearchindex (
	MOLID INT(11) NOT NULL,
	FINGERPRINT VARBINARY(256),
	ATOMS VARCHAR(400),
	BONDS VARCHAR(900),
	STRUCTUREKEY VARCHAR(200),
	PRIMARY KEY (MOLID)
)

PostgreSQL:
CREATE TABLE structuresearchindex
(
  molid integer NOT NULL,
  fingerprint bytea,
  atoms character varying(400),
  bonds character varying(900),
  structurekey character varying(200),
  CONSTRAINT "SearchIndex_PK" PRIMARY KEY (molid)
)

SQL Server 2008 R2:
CREATE TABLE structuresearchindex (
  molid int NOT NULL,
  fingerprint varbinary(256) NOT NULL,
  atoms varchar(400) NOT NULL,
  bonds varchar(900) NOT NULL,
  structurekey varchar(200) NOT NULL,
  CONSTRAINT PK_structuresearchindex PRIMARY KEY (molid)
);


The FINGERPRINT column must be able to store byte[] data, so use the
corresponding binary type of your RDBMS.
(This limitation can be circumvented if you create your own implementation of
StructureSearchIndexDataAccessLayer.)
The length of the ATOMS, BONDS and STRUCTUREKEY columns might need to be adjusted
depending on the type of molecules (size/number of atoms) you store.
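
To illustrate what "byte[] data" means for the FINGERPRINT column, here is a
minimal sketch that packs a bit fingerprint into a fixed-length byte array and
back. It assumes the fingerprint is available as a java.util.BitSet. The class
FingerprintBytes and its methods are hypothetical and not necessarily how
ChemDB encodes fingerprints internally.

```java
import java.util.BitSet;

/** Hypothetical helper, not part of ChemDB. */
final class FingerprintBytes {
    private FingerprintBytes() {}

    /** Packs the first numBits bits into a fixed-length array; bit i goes into byte i / 8. */
    static byte[] toBytes(BitSet bits, int numBits) {
        byte[] out = new byte[(numBits + 7) / 8];
        for (int i = bits.nextSetBit(0); i >= 0 && i < numBits; i = bits.nextSetBit(i + 1)) {
            out[i / 8] |= (byte) (1 << (i % 8));
        }
        return out;
    }

    /** Inverse of toBytes: rebuilds the BitSet from the stored bytes. */
    static BitSet fromBytes(byte[] bytes) {
        BitSet bits = new BitSet(bytes.length * 8);
        for (int i = 0; i < bytes.length * 8; i++) {
            if ((bytes[i / 8] & (1 << (i % 8))) != 0) {
                bits.set(i);
            }
        }
        return bits;
    }
}
```

With this encoding a 1024-bit fingerprint, for example, occupies 128 bytes and
fits into the VARBINARY(256) column shown above. Such an array can be written
with PreparedStatement.setBytes and read back with ResultSet.getBytes.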

The table containing the chemical structures probably already exists. However,
it is required that the table has a unique, numeric (integer), auto-increment
column identifying each molecule.
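
The following sketch shows how an auto-increment id column is typically used
from plain JDBC: the insert leaves out the id and asks the driver for the
generated key. It assumes Java 7 or newer (try-with-resources) and the example
table and column names from above. MoleculeInsertSketch is a hypothetical class
and not ChemDB's own data access code.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical sketch, not ChemDB's own data access code. */
final class MoleculeInsertSketch {
    private MoleculeInsertSketch() {}

    /** Builds the INSERT statement; the id column is omitted so the database assigns it. */
    static String insertSql(String table, String structureColumn) {
        return "INSERT INTO " + table + " (" + structureColumn + ") VALUES (?)";
    }

    /** Inserts one structure and returns the id generated by the database. */
    static int insertMolecule(Connection con, String structure) throws SQLException {
        String sql = insertSql("molecule", "structure");
        try (PreparedStatement ps = con.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, structure);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                if (!keys.next()) {
                    throw new SQLException("No generated key returned for: " + sql);
                }
                // Assumes the id is the first column of the returned key row.
                return keys.getInt(1);
            }
        }
    }
}
```

The table and column names are concatenated into the SQL text, so they should
come from trusted configuration only, never from user input.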

2. Create your Application

Can't help you with this.

3. Configuration

You need a properties file with the following entries:

molTableName = <Name of the Table where chemical Structures are stored>
molIdColumnName = <Name of the numeric unique key Column of %molTableName%>
structureColumnName =  <Name of the Chemical Structure Column of %molTableName%>
useMoleculeCache = <false or true>
moleculeDataAccessLayerClass = <Class used to read Chemical Structures from database>
structureSearchIndexDataAccessLayerClass = <Class used to read search index from database>
moleculeCreatorClass = <Class used to create Molecules from the stored Data like SMILES>
fingerprinterClass = <CDK IFingerprinter implementation to use>
structureSearchIndexManagerClass = <Class that manages search index>
moleculeManagerClass = <Class for managing the molecules>

molIdColumnName should be the name of the numeric primary key column in the
Molecule Table.

For moleculeCreatorClass, the SMILES and Molfile formats are currently available:
moleculeCreatorClass = org.bitbucket.kienerj.chemdb.formats.MolfileMoleculeCreator
or
moleculeCreatorClass = org.bitbucket.kienerj.chemdb.formats.SmilesMoleculeCreator

You can also create your own implementations of the corresponding interfaces or
subclass the existing classes and use those.


useMoleculeCache 

Determines whether additional information is held in memory. If set to true, the
data from the atoms and bonds columns is kept in an in-memory key-value store
(java.util.Map).

This greatly increases memory consumption and can affect search performance.
With the cache enabled, the load on the database during substructure searches
(default implementation) is 0, so more of the CPU can be used for graph matching.
This is helpful if the database is running on the same server as the application
and a substructure search is configured to use all available CPU cores.
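
The trade-off can be illustrated with a minimal sketch of a Map-based cache,
assuming Java 8 or newer. All names here (IndexCacheSketch, Entry, the loader
function) are hypothetical. The sketch fills the map lazily on first access,
which may differ from how ChemDB populates its cache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntFunction;

/** Hypothetical sketch of an in-memory index cache; not ChemDB's implementation. */
final class IndexCacheSketch {
    /** Values of the atoms and bonds columns for one molecule. */
    static final class Entry {
        final String atoms;
        final String bonds;

        Entry(String atoms, String bonds) {
            this.atoms = atoms;
            this.bonds = bonds;
        }
    }

    private final Map<Integer, Entry> cache = new ConcurrentHashMap<>();
    private final IntFunction<Entry> loader; // e.g. a database lookup by molid
    private final boolean useCache;

    IndexCacheSketch(boolean useCache, IntFunction<Entry> loader) {
        this.useCache = useCache;
        this.loader = loader;
    }

    /** Returns the entry for molId, calling the loader only when it is not cached. */
    Entry get(int molId) {
        if (!useCache) {
            return loader.apply(molId); // every call goes to the database
        }
        return cache.computeIfAbsent(molId, id -> loader.apply(id));
    }
}
```

With useCache set to true, repeated lookups of the same molid are served from
memory at the cost of keeping every loaded entry on the heap.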


DEFAULT VALUES USED:

molTableName = molecule
molIdColumnName = molid
structureColumnName = structure
useMoleculeCache = false
moleculeDataAccessLayerClass = org.bitbucket.kienerj.chemdb.data.DefaultMoleculeDataAccessLayer
structureSearchIndexDataAccessLayerClass = org.bitbucket.kienerj.chemdb.data.DefaultStructureSearchIndexDataAccessLayer
moleculeCreatorClass = org.bitbucket.kienerj.chemdb.formats.MolfileMoleculeCreator
fingerprinterClass = org.openscience.cdk.fingerprint.ExtendedFingerprinter
structureSearchIndexManagerClass = org.bitbucket.kienerj.chemdb.searchindex.DefaultStructureSearchIndexManager
moleculeManagerClass = org.bitbucket.kienerj.chemdb.molecule.DefaultMoleculeManager

Default values are used when no configuration file is found.
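
As an illustration of the fallback behaviour, the sketch below loads such a
properties file with java.util.Properties and falls back to the defaults listed
above. ChemDbConfigSketch is a hypothetical class, not the loader ChemDB uses.
It also assumes that defaults apply per key, whereas the framework may only
apply them when no file is found at all.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

/** Hypothetical sketch, not ChemDB's own configuration loader. */
final class ChemDbConfigSketch {
    private ChemDbConfigSketch() {}

    /** A subset of the default values listed in this README. */
    static Properties defaults() {
        Properties d = new Properties();
        d.setProperty("molTableName", "molecule");
        d.setProperty("molIdColumnName", "molid");
        d.setProperty("structureColumnName", "structure");
        d.setProperty("useMoleculeCache", "false");
        d.setProperty("moleculeCreatorClass",
                "org.bitbucket.kienerj.chemdb.formats.MolfileMoleculeCreator");
        d.setProperty("fingerprinterClass",
                "org.openscience.cdk.fingerprint.ExtendedFingerprinter");
        return d;
    }

    /** Reads settings from source; a null source or a missing key yields the default. */
    static Properties load(Reader source) {
        Properties settings = new Properties(defaults());
        if (source != null) {
            try {
                settings.load(source);
            } catch (IOException e) {
                throw new IllegalStateException("Could not read configuration", e);
            }
        }
        return settings;
    }
}
```

A caller could pass a java.io.FileReader for application.properties and then
read values with getProperty, e.g. getProperty("molTableName").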