The structure query collection ("SQC") project is meant for developers
of small-molecule chemical search systems who want to benchmark and
optimize their system to handle real-world queries.

The fundamental problem is that you don't know what you should
optimize. Sure, you can make a substructure search for benzene go
really fast, but who really does that for serious research? Or you can
make a new hash fingerprint algorithm which you believe improves
similarity quality or substructure screenout. Can you justify your

What you want is a representative sample of the types of queries that
people do, and preferably also for the same types of compounds your
users want to search for.

Unfortunately, that data is hard to get. Companies don't give out that
information because it's almost certain that proprietary compounds are
in the data set, as well as information which might reveal an internal
research direction at the company. PubChem by law can't reveal that
information either.

Hard perhaps, but not impossible. The SQC contains query data sets
contributed to the project, culled from the literature, and extracted
from other projects.

These data sets are:

 o BindingDB_exact
   User-generated structures sketched with Marvin and used as an exact
   search in BindingDB

 o BindingDB_similarity
   User-generated structures sketched with Marvin and used for a similarity
   search in BindingDB

 o BindingDB_substructure
   User-generated structures sketched with Marvin and used as a
   substructure query in BindingDB

 o Rarey_smarts
   Literature-based SMARTS reported in "Systematic benchmark of
   substructure search in molecular graphs - From Ullmann to VF2" by
   Hans-Christian Ehrlich and Matthias Rarey.

 o RDKit_smarts
   "Various smarts filters collected and contributed by Richard Lewis"
   to the RDKit project. 

   Cross-platform versions of the RDKit SMARTS patterns for the
   MACCS keys.

See the correspondingly named subdirectory for full details and data.
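
As a rough illustration of how a query set might be used for
benchmarking, here is a minimal Python sketch. It is not part of the
SQC itself. The names load_queries and benchmark are hypothetical, the
"one query per line, first field is the query" layout is an assumption
(check each subdirectory for the actual format), and search stands in
for whatever function your own system exposes.

```python
import statistics
import time


def load_queries(lines):
    """Hypothetical loader: assumes one query per line, with the query
    string (e.g. SMILES or SMARTS) as the first whitespace-separated
    field. Blank lines are skipped."""
    queries = []
    for line in lines:
        fields = line.split()
        if fields:
            queries.append(fields[0])
    return queries


def benchmark(queries, search):
    """Call search(query) once per query and time each call.

    `search` is a placeholder for your own system's search function.
    Returns (timings, summary), where timings is a list of
    (query, seconds) pairs in input order.
    """
    timings = []
    for query in queries:
        start = time.perf_counter()
        search(query)
        timings.append((query, time.perf_counter() - start))

    elapsed = [seconds for _, seconds in timings]
    summary = {
        "count": len(elapsed),
        "total": sum(elapsed),
        "median": statistics.median(elapsed) if elapsed else 0.0,
        # The slowest query is often the most useful one to look at.
        "slowest": max(timings, key=lambda item: item[1])[0] if timings else None,
    }
    return timings, summary
```

Reporting the median and the slowest query, rather than only the total
time, helps show whether a change speeds up typical queries or only a
few outliers.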


Thanks to Michael Gilson and Tiqing Liu for contributing the BindingDB
search query sets. BindingDB (See ) is "a
public web-accessible database of measured binding affinities,
focusing chiefly on the interactions of protein considered to be
drug-targets with small, drug-like molecules."

Thanks to the RDKit, Open Babel and OEChem authors for providing their
  o OEChem -
  o Open Babel -
  o RDKit -
