The structure query collection ("SQC") project is meant for developers
of small-molecule chemical search systems who want to benchmark and
optimize their system to handle real-world queries.

The fundamental problem is that you don't know what you should
optimize. Sure, you can make a substructure search for benzene go
really fast, but who really does that for serious research? Or you can
make a new hash fingerprint algorithm which you believe improves
similarity quality or substructure screenout. Can you justify your
belief?

What you want is a representative sample of the types of queries that
people do, preferably against the same types of compounds your users
want to search for.

Unfortunately, that data is hard to get. Companies don't give out that
information because it's almost certain that proprietary compounds are
in the data set, as well as information which might reveal an internal
research direction at the company. PubChem by law can't reveal that
information either.

Hard perhaps, but not impossible. The SQC contains query data sets
contributed to the project, culled from the literature, and extracted
from other projects. These data sets are:

  o BindingDB_exact
      User-generated structures sketched with Marvin and used as an
      exact search in BindingDB

  o BindingDB_similarity
      User-generated structures sketched with Marvin and used for a
      similarity search in BindingDB

  o BindingDB_substructure
      User-generated structures sketched with Marvin and used as a
      substructure query in BindingDB

  o Rarey_smarts
      Literature-based SMARTS reported in "Systematic benchmark of
      substructure search in molecular graphs - From Ullmann to VF2"
      by Hans-Christian Ehrlich and Matthias Rarey.

  o RDKit_smarts
      "Various smarts filters collected and contributed by Richard
      Lewis" to the RDKit project.

  o RDMACCS
      Cross-platform versions of the RDKit SMARTS patterns for the
      MACCS keys.

See the correspondingly named subdirectory for full details and data.

Thanks
======

Thanks to Michael Gilson and Tiqing Liu for contributing the BindingDB
search query sets. BindingDB (see http://www.bindingdb.org/ ) is "a
public web-accessible database of measured binding affinities,
focusing chiefly on the interactions of protein considered to be
drug-targets with small, drug-like molecules."

Thanks to the RDKit, Open Babel and OEChem authors for providing their
toolkits.

  o OEChem - http://www.eyesopen.com/oechem-tk
  o Open Babel - http://openbabel.org/
  o RDKit - http://rdkit.org/

b97862a - Moved the note even higher.
b4410c5 - Published a notice that the Ehrlich and Rarey analysis is out of date because
44c6725 - Added a reference link should people use the BindingDB queries.
f931408 - We don't need no stinking copyright.
c031562 - Once more, with correct English this time.
0a33046 - Links to some of the resources
fd8c515 - Added a copy of "table 1" from supplemental file 1.
a87c1ff - Added a THANKS section.
35d6cd8 - Improved the text.
353b44b - Changed the incorrect "OpenBabel" to the correct "Open Babel".