The structure query collection ("SQC") project is meant for developers
of small-molecule chemical search systems who want to benchmark and
optimize their system to handle real-world queries.

The fundamental problem is that you don't know what you should
optimize. Sure, you can make a substructure search for benzene go
really fast, but who really does that for serious research? Or you can
make a new hash fingerprint algorithm which you believe improves
similarity quality or substructure screenout. Can you justify your

What you want is a representative sample of the types of queries that
people do, and preferably also for the same types of compounds your
users want to search for.

Unfortunately, that data is hard to get. Companies don't give out that
information because it's almost certain that proprietary compounds are
in the data set, as well as information which might reveal an internal
research direction at the company. PubChem by law can't reveal that
information either.

Hard perhaps, but not impossible. The SQC contains query data sets
contributed to the project, culled from the literature, and extracted
from other projects.

These data sets are:

 o BindingDB_exact
   User-generated structures sketched with Marvin and used for an
   exact search in BindingDB

 o BindingDB_similarity
   User-generated structures sketched with Marvin and used for a similarity
   search in BindingDB

 o BindingDB_substructure
   User-generated structures sketched with Marvin and used as a
   substructure query in BindingDB

 o Rarey_smarts
   Literature-based SMARTS reported in "Systematic benchmark of
   substructure search in molecular graphs - From Ullmann to VF2" by
   Hans-Christian Ehrlich and Matthias Rarey.

 o RDKit_smarts
   "Various smarts filters collected and contributed by Richard Lewis"
   to the RDKit project. 

   Cross-platform versions of the RDKit SMARTS patterns for the
   MACCS keys.

See the correspondingly named subdirectory for full details and data.
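
As a rough illustration of how a query set might be used for
benchmarking, here is a minimal sketch of a timing harness. It assumes
you have already read the queries into a list of strings (the file
layout of each data set is described in its subdirectory, not here).
The names benchmark, slowest, TARGETS and toy_search are hypothetical,
and toy_search is only a plain-text stand-in for your real search
system; it does no chemistry.

```python
import time


def benchmark(queries, search, repeats=3):
    """Run search(query) for each query and keep the best wall-clock time.

    Returns a list of (query, best_seconds) pairs in input order.
    """
    results = []
    for query in queries:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            search(query)
            best = min(best, time.perf_counter() - start)
        results.append((query, best))
    return results


def slowest(results, n=5):
    """Return the n slowest (query, seconds) pairs, slowest first."""
    return sorted(results, key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical stand-in for a real search system: a case-sensitive
# substring test over a few strings. Replace it with a call into the
# system you want to measure.
TARGETS = ["c1ccccc1O", "CCO", "CC(=O)O"]


def toy_search(query):
    return [target for target in TARGETS if query in target]


if __name__ == "__main__":
    timings = benchmark(["CC", "c1ccccc1", "N"], toy_search)
    for query, seconds in slowest(timings, n=3):
        print(f"{seconds:.6f}s  {query}")
```

Keeping the best of several repeats reduces noise from other activity
on the machine. Looking at the slowest queries, and not only the mean,
shows which real-world query types are worth optimizing.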


Thanks to Michael Gilson and Tiqing Liu for contributing the BindingDB
search query sets. BindingDB (see http://www.bindingdb.org/) is "a
public web-accessible database of measured binding affinities,
focusing chiefly on the interactions of protein considered to be
drug-targets with small, drug-like molecules."

Thanks to the RDKit, Open Babel and OEChem authors for providing their
  o OEChem - http://www.eyesopen.com/oechem-tk
  o Open Babel - http://openbabel.org/
  o RDKit - http://rdkit.org/