This is a chemical similarity search benchmark suite.

The primary use is to measure chemfp performance across a range of
parameters. I hope it will also be used as a more general-purpose
benchmark.

The primary benchmark does the following searches (see the code
sketch after the lists below):
  - k-nearest similarity searches for k=1, k=10, k=100, and k=1000
  - minimum-threshold search for thresholds of 0.99, 0.90, 0.80, and 0.70

against the following source data sets:
  -  166-bit OEChem/OEGraphSim MACCS keys, converted from the ChEMBL-23 sdf.gz file
  -  881-bit PubChem/CACTVS fingerprints, extracted from PubChem
  - 1024-bit Open Babel FP2 fingerprints, converted from the ChEMBL-23 sdf.gz file
  - 2048-bit RDKit Morgan fingerprints, from the ChEMBL-23 distribution
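
For a sense of what these searches look like in code, here is a
minimal sketch using chemfp's search API (the file names assume the
decompressed benchmark datasets described below):

  # Sketch: the two search types the benchmark times.
  import chemfp
  from chemfp import search

  targets = chemfp.load_fingerprints("datasets/targets_0166.fps")
  queries = chemfp.load_fingerprints("datasets/queries_0166.fps")
  query_id, query_fp = queries[0]

  # k-nearest: find the k most similar targets for one query
  knearest = search.knearest_tanimoto_search_fp(query_fp, targets,
                                                k=10, threshold=0.0)
  # minimum-threshold: find all targets with similarity >= 0.7
  over_0_7 = search.threshold_tanimoto_search_fp(query_fp, targets,
                                                 threshold=0.7)
  print(knearest.get_ids_and_scores())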

The program "source_datasets/create_source_datasets.py" was used to
create the source fingerprint datasets from the structure files. These
fingerprints are not available as part of this project because they
take too much space.

Instead, a subset of the source fingerprints was used to make the
actual benchmark datasets, and those are available from this project,
in the "datasets/" subdirectory. The program
datasets/create_benchmark_datasets.py randomly selected, without
replacement, 1,002,000 fingerprints from each of the source
datasets. 2,000 of these were saved as queries, in the files starting
"queries_". 1,000,000 of them were saved as targets, in the files
starting "targets_".

The in-memory benchmark uses the first 1,000 queries against all of
the targets. The file-scan benchmark uses the first 100 queries
against all of the targets.

There are other benchmarks which evaluate the effect of different
population count algorithms on overall chemfp performance, the
differences between Python 2 and Python 3, and so on.


Dependencies
============

You'll need chemfp, from chemfp.com. The benchmark will work with all
versions of chemfp.

You should install py-cpuinfo so the output includes detailed
CPU information. (This is not required.)

If you create the datasets from scratch then you'll need RDKit,
OEChem/OEGraphSim, and Open Babel.

Create the data sets from scratch
=================================

*** Note: You do not need to do this! ***

This is only if you want to regenerate the benchmark fingerprints from
scratch. Otherwise, go to the next section.

There are two steps to create the query and target datasets for the chemfp benchmark:

1) Download and process all of the reference data sets using
"create_source_datasets.py".

  cd source_datasets
  python create_source_datasets.py
  cd ..

You will need to configure it so it points to a local mirror of
PubChem. You can either use the "PUBCHEM_DIR" environment variable or
the "--pubchem-dir" command-line option.

166: For the 166-bit MACCS keys:
  - download ChEMBL 23 as an sdf.gz file
  - use OEChem to create the MACCS fingerprints
  - save the results to 'chemfp_0166.fps'

881: For the 881-bit CACTVS/PubChem keys:
  - you must have a local PubChem mirror. My mirror is from 2017-07-12.
  - extract the id and PUBCHEM_CACTVS_SUBSKEYS for each record
  - save the results to 'chemfp_0881.fps'

1024: For the 1024-bit FP2 fingerprints
  - download ChEMBL 23 as an sdf.gz file
  - use Open Babel to create the FP2 fingerprints
  - save the results to 'chemfp_1024.fps'

2048: For the 2048-bit RDKit Morgan fingerprints
  - download the ChEMBL 23 fps.gz file with pre-computed fingerprints
  - decompress the result to 'chemfp_2048.fps'
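
(If you only want one of the computed fingerprint files, chemfp's own
command-line tools can likely generate them directly; for example,
something like:

  oe2fps --maccs166 chembl_23.sdf.gz -o chemfp_0166.fps
  ob2fps --FP2 chembl_23.sdf.gz -o chemfp_1024.fps

where "chembl_23.sdf.gz" is a placeholder for the downloaded ChEMBL 23
file.)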

The program takes a while to run. It may need some tweaking to
work. Then again, you probably don't need to run it as you likely only
need the sampled subsets.

2) Create the benchmark datasets using "create_benchmark_datasets.py"

  cd datasets
  python create_benchmark_datasets.py --source-dir ../source_datasets --seed 20170718 \
       --num-queries 2000 --num-targets 1000000 --force 
  cd ..

3) I also compressed the files before committing.

  cd datasets
  gzip -9 *.fps

(Note: the commands for #2 and #3 are also in datasets/Makefile and
available as "make sample" and "make compress", respectively.)

Create the data sets from version control
=========================================

The benchmark fingerprint files are stored compressed. The benchmark
itself uses the uncompressed files. To uncompress:

  make decompress

This is effectively the same as:

  cd datasets
  gunzip --keep {queries,targets}_{0166,0881,1024,2048}.fps.gz

Run the benchmark
=================

The benchmark program is named "chemfp_benchmark.py". Here is the --help:

usage: chemfp_benchmark.py [-h] [--scan | --memory | --NxM]
                           [--k K | --threshold THRESHOLD | --wc]
                           [--wc-executable WC_EXECUTABLE] [--166] [--881]
                           [--1024] [--2048] [--alpha ALPHA] [--beta BETA]
                           [--select SELECT] [--warmup WARMUP]
                           [--alignment {16}] [--datasets-dir DATASETS_DIR]
                           [--output FILENAME]

optional arguments:
  -h, --help            show this help message and exit
  --scan                Do a file scan. By default use the first 100 queries.
  --memory              Do an in-memory search, processing one query at a
                        time. By default use the first 1,000 queries.
  --NxM                 Do an in-memory search processing all of the queries.
                        By default use the first 1,000 queries.
  --k K, -k K           do a k-nearest benchmark with the given value of k
  --threshold THRESHOLD, -t THRESHOLD
                        do a threshold benchmark with the given minimum
                        threshold
  --wc                  time the performance of 'wc' on the targets
  --wc-executable WC_EXECUTABLE
                        specify the 'wc' executable. GNU wc, available as
                        'gwc' from Homebrew, is twice as fast as the Mac wc.
  --166                 Run the benchmark using the 166-bit MACCS keys
  --881                 Run the benchmark using the 881-bit PubChem/CACTVS
                        keys
  --1024                Run the benchmark using the 1024-bit Open Babel FP2
                        fingerprints
  --2048                Run the benchmark using the 2048-bit RDKit Morgan
                        fingerprints
  --alpha ALPHA         Do a Tversky search with the given value of alpha
  --beta BETA           Do a Tversky search with the given value of beta
  --select SELECT       specify which queries to use. An integer like '100'
                        specifies the first 100 records. A range like '10-50'
                        means the 40 records starting with the 11th record.
                        'sample=30' means to randomly sample 30 records.
  --warmup WARMUP       number of times to run a warm-up query before doing
                        the actual benchmark
  --alignment {16}      specify arena alignment. Only needed because a bug
                        in the SSSE3 code requires --alignment 16
  --datasets-dir DATASETS_DIR
                        location of the chemfp benchmark datasets. (Default:
                        'datasets')
  --output FILENAME, -o FILENAME
                        save the benchmark timings to the named file

Examples
========

(See "Running the benchmark suite", below, for how to run the standard
set of benchmarks.)

I'll do an in-memory similarity search for the k=5 nearest neighbors
in the MACCS keys dataset, using the first 25 queries.

% python chemfp_benchmark.py --memory --k 5 --166 --select 25 -o output.fpbench

This sends a summary to stderr:

memory benchmark using selection 0-25
k=5
  166 min: 19.07 us avg: 144.85 us max: 343.80 us and 125 total hits
   19.07 us 25.99 us 25.99 us 25.99 us 34.81 us ... 261.07 us 265.12 us 282.05 us 304.94 us 343.80 us
   slowest query: CHEMBL503982 lowest score: 0.7

The first line, "memory benchmark using selection 0-25" says this is
an in-memory search using the first 25 queries.

After that is the information about the "k=5" search. The search
details for that specific search are indented.

The first detail line says the minimum time was 19.07 microseconds,
the average time was 144.85 microseconds, and the maximum was 343.80
microseconds. Overall it found 125 hits, which makes sense as there
were 25 queries for the 5-nearest neighbors and 25*5 = 125.

The next line is:
   19.07 us 25.99 us 25.99 us 25.99 us 34.81 us ... 261.07 us 265.12 us 282.05 us 304.94 us 343.80 us

which shows the 5 fastest times followed by "..." followed by the 5
slowest times. The idea is to see if the extremes are really outliers.

The last line of the details is:
   slowest query: CHEMBL503982 lowest score: 0.7

which gives the id of the slowest query and the score of its least
similar hit. The idea is to have a reproducible case if the time seems
very slow, and, in a k-nearest search, to see if the query was in a
sparse part of chemical space.
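
One way to follow up on a suspiciously slow query is to re-run just
that query by id. Here's a sketch, assuming chemfp's arena API:

  # Re-time only the reported slowest query.
  import time
  import chemfp
  from chemfp import search

  targets = chemfp.load_fingerprints("datasets/targets_0166.fps")
  queries = chemfp.load_fingerprints("datasets/queries_0166.fps")
  query_fp = queries.get_fingerprint(queries.get_index_by_id("CHEMBL503982"))

  t0 = time.time()
  result = search.knearest_tanimoto_search_fp(query_fp, targets,
                                              k=5, threshold=0.0)
  print("time: %.2f us" % ((time.time() - t0) * 1e6))
  print("lowest score:", min(result.get_scores()))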


Here's another example:

% python chemfp_benchmark.py --scan --select sample=5 --166 --881 -o output.fpbench

This does a file-based scan of the MACCS (166) and PubChem (881)
datasets using all of the k-nearest and threshold values, as well as
the 'wc' search. In this case I've asked it to time only 5 queries,
selected at random from the possible queries.

The output looks like this, though I've removed many of the lines:

scan benchmark using selection sample=5
k=1
  166 min: 143.69 ms avg: 167.13 ms max: 184.59 ms and 5 total hits
   143.69 ms 162.91 ms 170.45 ms 174.01 ms 184.59 ms
   slowest query: CHEMBL524356 lowest score: 1.0
  881 min: 695.42 ms avg: 728.54 ms max: 745.86 ms and 5 total hits
   695.42 ms 718.17 ms 740.57 ms 742.69 ms 745.86 ms
   slowest query: 59795762 lowest score: 0.860465116279
k=10
  166 min: 170.37 ms avg: 173.31 ms max: 177.42 ms and 50 total hits
   170.37 ms 171.40 ms 172.62 ms 174.76 ms 177.42 ms
   slowest query: CHEMBL3642314 lowest score: 0.923076923077
  881 min: 731.66 ms avg: 739.70 ms max: 762.74 ms and 50 total hits
      ... removed lines ...
threshold=0.99
  166 min: 180.58 ms avg: 187.53 ms max: 202.37 ms and 3 total hits
   180.58 ms 180.98 ms 183.36 ms 190.36 ms 202.37 ms
   slowest query: CHEMBL524356 lowest score: 1.0
  881 min: 810.39 ms avg: 820.96 ms max: 835.03 ms and 0 total hits
   810.39 ms 810.87 ms 819.05 ms 829.43 ms 835.03 ms
   slowest query: 13282900 lowest score: None
threshold=0.9
  166 min: 175.89 ms avg: 184.05 ms max: 196.80 ms and 150 total hits
   175.89 ms 180.72 ms 182.83 ms 184.02 ms 196.80 ms
   slowest query: CHEMBL3560297 lowest score: None
  881 min: 755.94 ms avg: 794.02 ms max: 813.96 ms and 26 total hits
      ... removed lines ...
wc
  166 min: 76.66 ms avg: 82.91 ms max: 87.96 ms and 1000007 total hits
   76.66 ms 78.98 ms 83.33 ms 87.61 ms 87.96 ms
   slowest query: None lowest score: None
  881 min: 271.18 ms avg: 280.40 ms max: 289.38 ms and 1005164 total hits
   271.18 ms 273.07 ms 279.65 ms 288.71 ms 289.38 ms
   slowest query: None lowest score: None

This reports the k=1, k=10, k=100, and k=1000 searches, followed by
the threshold=0.99, 0.9, 0.8, and 0.7 searches, followed by a 'wc'
search. For each search type it reports the times for 166 bits and
then 881 bits.

The "wc" results give an upper-limit to what the FPS search
performance could be. These are not query based, so there is no
slowest query or corresponding score.

In the previous example, the timings line showed the 5 fastest times,
followed by "...", followed by the 5 slowest times. This example has
only 5 timings, so no timings were removed, which is why there's no
"..." this time.

wc performance
==============

I tried several programs to count the number of lines in a file. The
fastest by far was GNU wc. In particular, it was much faster than the
default 'wc' on Mac OS X, which is the FreeBSD wc from 2004.

If you are on Mac OS X and want an accurate estimate of the
upper-bound performance, install and use GNU wc. I use the Homebrew
package manager, which distributes wc as part of 'coreutils':

  brew install coreutils

To avoid conflicts with the OS versions of those utilities, the
commands all have a 'g' prefix, so GNU wc is available as 'gwc'.
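
As a quick check that it's installed, and to see its raw line-counting
speed (the path assumes the decompressed benchmark datasets):

  % time gwc -l datasets/targets_2048.fps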

Here's the performance difference:

% python chemfp_benchmark.py --scan --wc --wc-executable wc --2048 -o wc.fpbench
scan benchmark using selection 0-100
wc
  2048 min: 572.10 ms avg: 637.46 ms max: 725.49 ms and 1000007 total hits
   572.10 ms 574.70 ms 582.42 ms 584.96 ms 586.26 ms ... 712.12 ms 713.12 ms 714.79 ms 718.33 ms 725.49 ms
   slowest query: None lowest score: None
   
% python chemfp_benchmark.py --scan --wc --wc-executable gwc --2048 -o wc.fpbench
scan benchmark using selection 0-100
wc
  2048 min: 132.42 ms avg: 143.58 ms max: 165.86 ms and 1000007 total hits
   132.42 ms 132.49 ms 132.68 ms 132.70 ms 132.72 ms ... 162.94 ms 163.63 ms 164.04 ms 164.96 ms 165.86 ms
   slowest query: None lowest score: None

Yes, that's a ratio of about 4.4 in the average times!

".fpbench" file format
======================

The examples all showed the summary data which is written to
stderr. The benchmark program will also write much more detailed
results to stdout, or alternatively the "-o"/"--output" flag will save
those details to a file.

Here's the output from doing a k=3 nearest search of the 166 dataset
using 5 queries. The first 5 output lines are the summary sent to
stderr. The rest, starting with "{", is a JSON document sent to
stdout:

% python chemfp_benchmark.py -k 3 --select 5 --166
memory benchmark using selection 0-5
k=3
  166 min: 15.02 us avg: 32.38 us max: 55.07 us and 15 total hits
   15.02 us 20.98 us 21.93 us 48.88 us 55.07 us
   slowest query: CHEMBL503982 lowest score: 0.714285714286
{
  "format": "chemfp-search-benchmark/1",
  "date": "2017-08-15T13:06:26",
  "python_version": "2.7.10 (default, Jul 30 2016, 19:40:32) \n[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)]",
  "chemfp_version": "3.0",
  "omp_get_num_threads": 8,
  "report": "== Configuration report for chemfp/3.0 ==\nAvailable method families: lut8 lut16 lauradoux gillies ssse3 popcnt\nAvailable methods: lut8_1_1 lut16_4_1 lut16_4_4 lauradoux_96_8 gillies_8_8 ssse3_64_64 popcnt_8_8 popcnt_24 popcnt_32_8 popcnt_32_32 popcnt_64 popcnt_128 popcnt_128_128\nSize methods:\n  size1_1: lut8_1_1\n  size8_1: lut16_4_1\n  size8_8: popcnt_8_8\n  size24: popcnt_24\n  size32_32: popcnt_32_32\n  size64: popcnt_64\n  size64_1: lut16_4_1\n  size64_8: popcnt_32_8\n  size128: popcnt_128\n  size128_1: lut16_4_1\n  size128_8: popcnt_32_8\n  size128_64: popcnt_32_32\n  size128_128: popcnt_128_128\n  size256: popcnt_128_128\nOption settings:\n  report-popcount: 0\n  report-intersect: 0\n",
  "benchmark_type": "memory",
  "argv": [
    "-k",
    "3",
    "--select",
    "5",
    "--166"
  ],
  "similarity_type": "tanimoto",
  "selection": "0-5",
  "hostname": "xebulon.local",
  "cpuinfo": {
    "count": 8,
    "model": 42,
    "hz_advertised": "2.2000 GHz",
    "family": 6,
    "bits": 64,
    "brand": "Intel(R) Core(TM) i7-2675QM CPU @ 2.20GHz",
    "vendor_id": "GenuineIntel",
    "cpuinfo_version": [
      3,
      3,
      0
    ],
    "flags": [
      "acpi",
      "aes",
      "apic",
      "avx1.0",
      "clfsh",
      "cmov",
      "cx16",
      "cx8",
      "de",
      "ds",
      "dscpl",
      "dtes64",
      "em64t",
      "est",
      "fpu",
      "fxsr",
      "htt",
      "lahf",
      "mca",
      "mce",
      "mmx",
      "mon",
      "msr",
      "mtrr",
      "osxsave",
      "pae",
      "pat",
      "pbe",
      "pcid",
      "pclmulqdq",
      "pdcm",
      "pge",
      "popcnt",
      "pse",
      "pse36",
      "rdtscp",
      "sep",
      "ss",
      "sse",
      "sse2",
      "sse3",
      "sse4.1",
      "sse4.2",
      "ssse3",
      "syscall",
      "tm",
      "tm2",
      "tpr",
      "tsc",
      "tsci",
      "tsctmr",
      "vme",
      "vmx",
      "x2apic",
      "xd",
      "xsave"
    ],
    "raw_arch_string": "x86_64",
    "l2_cache_size": "256",
    "stepping": 7,
    "hz_actual_raw": [
      2200000000,
      0
    ],
    "hz_actual": "2.2000 GHz",
    "arch": "X86_64",
    "hz_advertised_raw": [
      2200000000,
      0
    ]
  },
  "benchmarks": [
    {
      "label": "k=3",
      "sizes": [
        {
          "num_bits": 166,
          "num_hits": 15,
          "min_time": 4.887580871582031e-05,
          "avg_time": 3.237724304199219e-05,
          "max_time": 5.507469177246094e-05,
          "all_times": [
            4.887580871582031e-05,
            2.193450927734375e-05,
            1.5020370483398438e-05,
            2.09808349609375e-05,
            5.507469177246094e-05
          ],
          "slowest_id": "CHEMBL503982",
          "slowest_hex_fp": "000000000000000000000000048002008000000608",
          "slowest_lowest_similarity": 0.7142857142857143
        }
      ]
    }
  ]
}

I'll break it down into its parts:

{
  "format": "chemfp-search-benchmark/1",

This is a format identifier.

  "date": "2017-08-15T13:06:26",

An ISO datestamp in GMT. I ran the tests at 13:06 (1 in the afternoon)
GMT. Since I'm in Sweden and it's Daylight Saving Time (CEST), that
means it's really 15:06 local time. And I did it in the middle of
August.

  "python_version": "2.7.10 (default, Jul 30 2016, 19:40:32) \n[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)]",

What Python reports as its version.

  "chemfp_version": "3.0",

The version of chemfp I tested.

  "omp_get_num_threads": 8,

The number of threads reported by omp_get_num_threads(), or "1" if not
compiled with OpenMP support.
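
chemfp also lets you query and change the thread count from Python
(if your chemfp version provides these functions), which is useful
for reproducible timings:

  import chemfp
  print(chemfp.get_num_threads())  # threads chemfp will use
  chemfp.set_num_threads(1)        # e.g. force single-threaded timings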

  "report": "== Configuration report for chemfp/3.0 ==\nAvailable method families: lut8 lut16 lauradoux gillies ssse3 popcnt\nAvailable methods: lut8_1_1 lut16_4_1 lut16_4_4 lauradoux_96_8 gillies_8_8 ssse3_64_64 popcnt_8_8 popcnt_24 popcnt_32_8 popcnt_32_32 popcnt_64 popcnt_128 popcnt_128_128\nSize methods:\n  size1_1: lut8_1_1\n  size8_1: lut16_4_1\n  size8_8: popcnt_8_8\n  size24: popcnt_24\n  size32_32: popcnt_32_32\n  size64: popcnt_64\n  size64_1: lut16_4_1\n  size64_8: popcnt_32_8\n  size128: popcnt_128\n  size128_1: lut16_4_1\n  size128_8: popcnt_32_8\n  size128_64: popcnt_32_32\n  size128_128: popcnt_128_128\n  size256: popcnt_128_128\nOption settings:\n  report-popcount: 0\n  report-intersect: 0\n",

Chemfp's internal configuration report. It describes which popcount
implementations are used for different alignment and fingerprint
sizes. Unpacked (and indented) it looks like:

  == Configuration report for chemfp/3.0 ==
  Available method families: lut8 lut16 lauradoux gillies ssse3 popcnt
  Available methods: lut8_1_1 lut16_4_1 lut16_4_4 lauradoux_96_8 gillies_8_8 ssse3_64_64 popcnt_8_8 popcnt_24 popcnt_32_8 popcnt_32_32 popcnt_64 popcnt_128 popcnt_128_128
  Size methods:
    size1_1: lut8_1_1
    size8_1: lut16_4_1
    size8_8: popcnt_8_8
    size24: popcnt_24
    size32_32: popcnt_32_32
    size64: popcnt_64
    size64_1: lut16_4_1
    size64_8: popcnt_32_8
    size128: popcnt_128
    size128_1: lut16_4_1
    size128_8: popcnt_32_8
    size128_64: popcnt_32_32
    size128_128: popcnt_128_128
    size256: popcnt_128_128
  Option settings:
    report-popcount: 0
    report-intersect: 0

Older versions of chemfp have a different report style.

  "benchmark_type": "memory",

This will be an in-memory benchmark. It could also be "scan" for a
file scan benchmark, or "NxM" for an in-memory benchmark where all the
queries are made at once.
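
The difference between "memory" and "NxM" is roughly one search call
per query versus a single arena-against-arena call. As a sketch:

  import chemfp
  from chemfp import search

  targets = chemfp.load_fingerprints("datasets/targets_0166.fps")
  queries = chemfp.load_fingerprints("datasets/queries_0166.fps")

  # "memory": process one query at a time
  for query_id, query_fp in queries:
      result = search.knearest_tanimoto_search_fp(query_fp, targets,
                                                  k=3, threshold=0.0)

  # "NxM": all queries in one call, which chemfp can run in parallel
  results = search.knearest_tanimoto_search_arena(queries, targets,
                                                  k=3, threshold=0.0)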

  "argv": [
    "-k",
    "3",
    "--select",
    "5",
    "--166"
  ],

The command-line arguments.

  "similarity_type": "tanimoto",

This is a Tanimoto search. A Tversky search (with --alpha and --beta
specified) would look like "tversky(0.300000,0.700000)".
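
For example, a Tversky benchmark with those parameters would be
invoked as:

  % python chemfp_benchmark.py --memory -k 10 --166 --alpha 0.3 --beta 0.7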

  "selection": "0-5",

The queries to use.

  "hostname": "xebulon.local",

The hostname for the machine this was run on. In this case, my laptop.

  "cpuinfo": {
    "count": 8,
    "model": 42,
    "hz_advertised": "2.2000 GHz",
    "family": 6,
    "bits": 64,
    "brand": "Intel(R) Core(TM) i7-2675QM CPU @ 2.20GHz",
    "vendor_id": "GenuineIntel",
    "cpuinfo_version": [
      3,
      3,
      0
    ],
    "flags": [
      "acpi",
      "aes",
      "apic",
      "avx1.0",
      "clfsh",
      "cmov",
      "cx16",
      "cx8",
      "de",
      "ds",
      "dscpl",
      "dtes64",
      "em64t",
      "est",
      "fpu",
      "fxsr",
      "htt",
      "lahf",
      "mca",
      "mce",
      "mmx",
      "mon",
      "msr",
      "mtrr",
      "osxsave",
      "pae",
      "pat",
      "pbe",
      "pcid",
      "pclmulqdq",
      "pdcm",
      "pge",
      "popcnt",
      "pse",
      "pse36",
      "rdtscp",
      "sep",
      "ss",
      "sse",
      "sse2",
      "sse3",
      "sse4.1",
      "sse4.2",
      "ssse3",
      "syscall",
      "tm",
      "tm2",
      "tpr",
      "tsc",
      "tsci",
      "tsctmr",
      "vme",
      "vmx",
      "x2apic",
      "xd",
      "xsave"
    ],
    "raw_arch_string": "x86_64",
    "l2_cache_size": "256",
    "stepping": 7,
    "hz_actual_raw": [
      2200000000,
      0
    ],
    "hz_actual": "2.2000 GHz",
    "arch": "X86_64",
    "hz_advertised_raw": [
      2200000000,
      0
    ]
  },

CPU details directly from py-cpuinfo.

  "benchmarks": [
     .... benchmarks records ...
  ]
}

The "benchmark records" are:

    {
      "label": "k=3",

The benchmark type, followed by a list of "sizes":

      "sizes": [
         .... size records ...
      ]
    }

Each "size record" is:

        {
          "num_bits": 166,

The number of bits in the fingerprint.

          "num_hits": 15,

The total number of hits in the search results.

          "min_time": 4.887580871582031e-05,

The fastest search time.

          "avg_time": 3.237724304199219e-05,

The average search time.

          "max_time": 5.507469177246094e-05,

The slowest search time.

          "all_times": [
            4.887580871582031e-05,
            2.193450927734375e-05,
            1.5020370483398438e-05,
            2.09808349609375e-05,
            5.507469177246094e-05
          ],

A list of all the times, in query order.

          "slowest_id": "CHEMBL503982",

The identifier corresponding to the slowest time, or null if there is
no meaningful id (as in the 'wc' timings).

          "slowest_hex_fp": "000000000000000000000000048002008000000608",

The hex-encoded fingerprint for the slowest time.

          "slowest_lowest_similarity": 0.7142857142857143
        }

The similarity score for the slowest time.
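
Since the ".fpbench" file is a JSON document, it's straightforward to
post-process. Here's a minimal sketch which summarizes the per-size
timings, using the field names documented above:

  import json

  with open("output.fpbench") as f:
      doc = json.load(f)

  assert doc["format"] == "chemfp-search-benchmark/1"
  for benchmark in doc["benchmarks"]:
      for size in benchmark["sizes"]:
          print("%s %d-bit: avg %.2f us over %d queries" % (
              benchmark["label"], size["num_bits"],
              size["avg_time"] * 1e6, len(size["all_times"])))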


Run the benchmark suite
=======================

The standard set of benchmarks is driven by the top-level
Makefile. Set the PREFIX variable to label the output files for your
machine; I used, for example:

  make PREFIX=laptop

(See the Makefile for exactly how PREFIX is used in the output
filenames.)

The main targets correspond to the benchmark types described above:

  make             run the standard set of benchmarks
  make scan        run the file-scan benchmarks
  make memory      run the in-memory benchmarks
  make nxm         run the NxM benchmarks
  make popcount    run the population count benchmarks

There are also housekeeping targets:

  make decompress    uncompress the benchmark datasets (see above)
  make clean         remove the generated benchmark output
  make clean-fps     remove the uncompressed .fps files
  make really-clean  remove both