The structure query collection ("SQC") project is meant for developers
of small-molecule chemical search systems who want to benchmark and
optimize their system to handle real-world queries.
The fundamental problem is that you don't know what you should
optimize. Sure, you can make a substructure search for benzene go
really fast, but who really does that for serious research? Or you can
make a new hash fingerprint algorithm which you believe improves
similarity quality or substructure screenout. Can you justify your
What you want is a representative sample of the types of queries that
people do, and preferably also for the same types of compounds your
users want to search for.
Unfortunately, that data is hard to get. Companies don't give out that
information because it's almost certain that proprietary compounds are
in the data set, as well as information which might reveal an internal
research direction at the company. PubChem by law can't reveal that
Hard perhaps, but not impossible. The SQC contains query data sets
contributed to the project, culled from the literature, and extracted
from other projects.
These data sets are:
o User-generated structures sketched with Marvin and used as an exact
  search in BindingDB
o User-generated structures sketched with Marvin and used for a
  similarity search in BindingDB
o User-generated structures sketched with Marvin and used as a
  substructure query in BindingDB
o Literature-based SMARTS reported in "Systematic benchmark of
  substructure search in molecular graphs - From Ullmann to VF2" by
  Hans-Christian Ehrlich and Matthias Rarey.
o "Various smarts filters collected and contributed by Richard Lewis"
  to the RDKit project.
o Cross-platform versions of the RDKit SMARTS patterns for the
See the correspondingly named subdirectory for full details and data.
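
As a rough illustration of how a query set might be used for
benchmarking, here is a minimal timing harness in Python. It is only a
sketch: the function names time_queries and summarize are made up for
this example and are not part of SQC, the queries are passed in as
plain strings rather than read from any particular SQC file format,
and "search" stands in for whatever call runs one query against your
own system.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List, Tuple


def time_queries(
    queries: Iterable[str], search: Callable[[str], object]
) -> List[Tuple[str, float]]:
    """Run search(query) once per query and record elapsed wall-clock seconds.

    Both names are hypothetical: `search` is a placeholder for your own
    search system's entry point.
    """
    timings = []
    for query in queries:
        start = time.perf_counter()
        search(query)  # the result is discarded; only the elapsed time is kept
        timings.append((query, time.perf_counter() - start))
    return timings


def summarize(timings: List[Tuple[str, float]]) -> Dict[str, object]:
    """Return the query count, the median time, and the slowest (query, seconds) pair."""
    if not timings:
        raise ValueError("no timings to summarize")
    seconds = [elapsed for _, elapsed in timings]
    slowest = max(timings, key=lambda pair: pair[1])
    return {
        "count": len(timings),
        "median": statistics.median(seconds),
        "slowest": slowest,
    }
```

Reporting the slowest query next to the median is a deliberate choice
here: a real-world query set tends to be interesting for its outliers,
not only for its typical case.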
Thanks to Michael Gilson and Tiqing Liu for contributing the BindingDB
search query sets. BindingDB (see http://www.bindingdb.org/) is "a
public web-accessible database of measured binding affinities,
focusing chiefly on the interactions of protein considered to be
drug-targets with small, drug-like molecules."
Thanks to the RDKit, Open Babel and OEChem authors for providing their
o OEChem - http://www.eyesopen.com/oechem-tk
o Open Babel - http://openbabel.org/
o RDKit - http://rdkit.org/