# chemfp_benchmark 0.9

This is a chemical fingerprint similarity search benchmark suite.

The primary use is to measure chemfp performance across a range of
parameters. I hope it will also be used as a more general-purpose
fingerprint similarity search benchmark - the data sets and expected
scores are in text-oriented data files that can easily be read by
other tools.
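
For reference, the fingerprint files use the text-based FPS format:
after the '#' header lines, each data line is a hex-encoded
fingerprint, a tab, and the record identifier. As a rough sketch of
how another tool might read them:

    import gzip

    def read_fps(path):
        """Yield (id, fingerprint bytes) pairs from an FPS file (optionally gzipped)."""
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as infile:
            for line in infile:
                if not line.strip() or line.startswith("#"):
                    continue  # skip blank lines and header/comment lines
                hex_fp, _, record_id = line.rstrip("\n").partition("\t")
                yield record_id, bytes.fromhex(hex_fp)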

## Benchmark Data Sets


The benchmark data sets are:

  -  166-bit OEChem/OEGraphSim MACCS keys, converted from the ChEMBL-23 sdf.gz file
  -  881-bit PubChem/CACTVS fingerprints, extracted from PubChem
  - 1021-bit Open Babel FP2 fingerprints, converted from the ChEMBL-23 sdf.gz file
  - 2048-bit RDKit Morgan fingerprints, from the ChEMBL-23 distribution

The program "source_datasets/create_source_datsets.py" was used to
create the source fingerprint datasets from the structure files. These
fingerprints are not available as part of this project because they
take too much space.

Instead, a subset of the source fingerprints was used to make the
actual benchmark datasets, and those are available from this project,
in the "datasets/" subdirectory. The program
`datasets/create_benchmark_datasets.py` was used to randomly select,
without replacement, 1,002,000 fingerprints from each of the source
datasets. 2,000 of these were saved as queries, in the files starting
"queries_". 1,000,000 of them were saved as targets, in the files
starting "targets_".

The benchmarks all use the first 1,000 query fingerprints to search
all of the target fingerprints. The second 1,000 query fingerprints
are meant as a way to cross-check the timings.
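
If you use chemfp itself, the files can be loaded with
`chemfp.load_fingerprints()`. A minimal sketch, where the file names
below are placeholders for the actual "queries_" and "targets_" files
in datasets/:

    import itertools
    import chemfp

    # Placeholder names; use the matching "queries_*"/"targets_*" files in datasets/.
    queries = chemfp.load_fingerprints("datasets/queries_0166.fps.gz")
    targets = chemfp.load_fingerprints("datasets/targets_0166.fps.gz")

    # The benchmark uses only the first 1,000 of the 2,000 query fingerprints.
    first_1000 = list(itertools.islice(queries, 1000))  # (id, fingerprint) pairs
    print(len(first_1000), "queries against", len(targets), "targets")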


If you create the datasets from scratch then you'll need RDKit,
OEChem/OEGraphSim, and Open Babel.

## Create the data sets from scratch

**Note: You do not need to do this!**

This is only if you want to regenerate the benchmark fingerprints from
scratch. Otherwise, go on to the next section.

There are two steps to create the query and target datasets for the
chemfp benchmark:

1) Download and process all of the reference data sets using
"source_datasets/create_source_datasets.py".

    cd source_datasets
    python create_source_datasets.py
    cd ..

You will need to configure it so it points to a local mirror of
PubChem. You can either use the "PUBCHEM_DIR" environment variable or
the `--pubchem-dir` command-line option.

166: For the 166-bit MACCS keys:

  - download ChEMBL 23 as an sdf.gz file
  - use OEChem to create the MACCS fingerprints
  - save the results to 'chemfp_0166.fps'

881: For the 881-bit CACTVS/PubChem keys:

  - you must have a local PubChem mirror. My mirror is from 2017-07-12.
  - extract the id and PUBCHEM_CACTVS_SUBSKEYS for each record
  - save the results to 'chemfp_0881.fps'

1024: For the 1024-bit FP2 fingerprints

  - download ChEMBL 23 as an sdf.gz file
  - use Open Babel to create the FP2 fingerprints
  - save the results to 'chemfp_1024.fps'

2048: For the 2048-bit RDKit Morgan fingerprints

  - download the ChEMBL 23 fps.gz file with pre-computed fingerprints
  - decompress the result to 'chemfp_2048.fps'

The program takes a while to run. It may need some tweaking to
work. Then again, you probably don't need to run it as you likely only
need the sampled subsets.

2) Create the benchmark datasets using
`datasets/create_benchmark_datasets.py`

    cd datasets
    python create_benchmark_datasets.py --source-dir ../source_datasets --seed 20170718 \
         --num-queries 2000 --num-targets 1000000 --force 
    cd ..

3) I also compressed the files before committing.

    cd datasets
    gzip -9 *.fps

(Note: the commands for #2 and #3 are also in datasets/Makefile and
available as "make sample" and "make compress", respectively.)


## Benchmark tasks

The primary in-memory query benchmark does the following searches:

  - for 166-bit OEChem MACCS searches, 881-bit PubChem searches,
    and 1021-bit Open Babel FP2 searches
      - count and threshold searches at least 0.7 similar
      - knearest similarity searches for k=1 and k=1000
  - for 2048-bit RDKit Morgan, radius=2 searches
      - count and threshold searches at least 0.4 similar
      - knearest similarity searches for k=1 and k=1000
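
As a rough sketch of how these task types map onto the `chemfp.search`
API (using `count_tanimoto_hits_fp`, `threshold_tanimoto_search_fp`,
and `knearest_tanimoto_search_fp`; the file names are placeholders):

    import chemfp
    from chemfp import search

    # Placeholder names; use the matching "queries_*"/"targets_*" files in datasets/.
    queries = chemfp.load_fingerprints("datasets/queries_0166.fps.gz")
    targets = chemfp.load_fingerprints("datasets/targets_0166.fps.gz")

    query_id, query_fp = next(iter(queries))  # a single query fingerprint

    # count search: how many targets are at least 0.7 similar?
    n_hits = search.count_tanimoto_hits_fp(query_fp, targets, threshold=0.7)

    # threshold search: which targets are at least 0.7 similar, and their scores?
    hits = search.threshold_tanimoto_search_fp(query_fp, targets, threshold=0.7)

    # knearest search: the k most similar targets (the benchmark uses k=1 and k=1000).
    nearest = search.knearest_tanimoto_search_fp(query_fp, targets, k=1000, threshold=0.0)

    print(n_hits, len(hits), len(nearest))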

There will be other benchmarks to evaluate file-based search and the
NxN symmetrical search.

These benchmarks focus on chemfp's current design goals. If you have
other use cases, and can suggest a benchmark, let me know.


## Reference results

The file `query_benchmark.scores` contains the expected scores for the
above tasks with three different similarity methods:

  - Tanimoto
  - Dice (Tversky similarity with alpha = beta = 0.5)
  - Tversky_80_20 (Tversky similarity with alpha=0.8 and beta=0.2)
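
In terms of the popcounts a (query), b (target), and c (the
intersection), these are the usual definitions; a small sketch in
Python:

    def tversky(a, b, c, alpha, beta):
        """Tversky similarity from popcounts: a=|query|, b=|target|, c=|query & target|."""
        return c / (alpha * (a - c) + beta * (b - c) + c)

    def tanimoto(a, b, c):
        return c / (a + b - c)

    def dice(a, b, c):
        # Dice is Tversky with alpha = beta = 0.5, which simplifies to 2c / (a + b).
        return tversky(a, b, c, 0.5, 0.5)

    def tversky_80_20(a, b, c):
        return tversky(a, b, c, 0.8, 0.2)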

In addition, the scores have been generated for the first N=10, 100,
1,000, and 2,000 queries. While the benchmark is based on 1,000
queries, I often use 10 queries for a quick check that the benchmark
code works. The results for N=100 and N=2,000 are there because I
think they might be useful, but they may disappear in the future.

## Score file format

The 'scores' file is a line-oriented format containing comment lines
and data lines. A comment line is a line that contains only whitespace
characters or whose first non-whitespace character is a '#'.

A data line of the query scores file contains 6 whitespace-separated
columns (leading and trailing whitespace are ignored). Here are four
examples:

    Tanimoto 166 threshold T=0.7 10 64425.16744177775
    Dice 881 knearest k=1 1000 956.0828956923751
    Tversky_80_20 1021 knearest k=1000 1000 726183.3940483254
    Tanimoto 2048 count T=0.4 1000 393145

The first line of this example concerns a search of all
166-bit targets with a Tanimoto similarity at or above 0.7. The sum of
the scores for the first 10 queries should be around 64425.16744177775.

(Note: this number is not exact. Chemfp uses IEEE 754 doubles to store
the score, and different summation orders for doubles may give
slightly different answers. Other tools might use floats or even
rationals, and give a different answer. I think that +/-0.01% is close
enough.)

The second line of this example concerns a search of all 881-bit
targets to find the single most similar target (k=1) by Dice
similarity. The sum of the scores for the first 1,000 queries should
be around 956.0828956923751.

The third line of this example concerns a nearest neighbor search of
all 1021-bit targets to find the 1,000 targets with the highest
Tversky score, with alpha=0.8 and beta=0.2. The sum of the scores of
the 1,000 nearest neighbors, over the first 1,000 queries, should be
around 726183.3940483254.

The last line of this example concerns a search that counts the
number of 2048-bit targets with a Tanimoto score at or above 0.4. The
sum of the counts for the first 1,000 queries must be exactly 393145.

More details of the format can be found at the top of the
`query_benchmark.scores` file.
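
As an illustration, a minimal parser for the data lines described
above might look like the following sketch, which builds a lookup
table keyed on the first five columns:

    def load_scores(path="query_benchmark.scores"):
        """Map (similarity, num_bits, algorithm, parameters, N) -> expected value."""
        expected = {}
        with open(path) as infile:
            for line in infile:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue  # comment line
                similarity, num_bits, algorithm, parameters, n, value = line.split()
                key = (similarity, int(num_bits), algorithm, parameters, int(n))
                expected[key] = float(value)
        return expected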

## chemfp query benchmark

The file "cfp_query_benchmark.py" generates benchmark timings for
different chemfp configurations. It has been tested on chemfp 1.4,
1.5, 3.2, 3.2.1 and 3.3. Over time I expect I'll add support for other
versions of chemfp.

(Note: version 3.2.1 was the first version of chemfp to enable AVX2
support by default. Earlier versions in the 3.x line required a
compile-time flag to be set.)

By default it generates the fastest timings for the standard 1,000
query benchmark, evaluated one query fingerprint at a time, with both
the AVX2 and POPCNT popcount implementations. (The AVX2 popcount
implementation only works with 1024- and 2048-bit fingerprints, and is
faster than POPCNT for those cases.)

Here is an example of the default output (note: use the '-o' option to
get the output in a more machine-readable format):

    % ./cfp_query_benchmark.py
    chemfp: 3.3 hostname: haswell multiquery: 0 specialized: 1 num_threads: 1 indexed: 1
    Loading queries and targets from 'datasets'
    Loaded arenas.
    166 Tanimoto (OEChem MACCS) 1000 queries
      count T=0.7 popcnt 1.36 ms/query  (T2: 1.36 ms check: 5632389 run-time: 2.7 s)
      count T=0.7 avx2 N/A
      threshold T=0.7 popcnt 1.42 ms/query  (T2: 1.43 ms check: 4181442.50 run-time: 2.9 s)
      threshold T=0.7 avx2 N/A
      knearest k=1 popcnt 0.19 ms/query  (T2: 0.19 ms check: 959.41 run-time: 0.4 s)
      knearest k=1 avx2 N/A
      knearest k=1000 popcnt 1.86 ms/query  (T2: 1.86 ms check: 764917.04 run-time: 3.7 s)
      knearest k=1000 avx2 N/A
    881 Tanimoto (PubChem) 1000 queries
      count T=0.7 popcnt 4.58 ms/query  (T2: 4.58 ms check: 6392581 run-time: 9.2 s)
      count T=0.7 avx2 N/A
      threshold T=0.7 popcnt 4.70 ms/query  (T2: 4.70 ms check: 4706168.60 run-time: 9.4 s)
      threshold T=0.7 avx2 N/A
      knearest k=1 popcnt 1.22 ms/query  (T2: 1.22 ms check: 917.30 run-time: 2.5 s)
      knearest k=1 avx2 N/A
      knearest k=1000 popcnt 4.73 ms/query  (T2: 4.73 ms check: 754037.35 run-time: 9.5 s)
      knearest k=1000 avx2 N/A
    1024 Tanimoto (OB FP2) 1000 queries
      count T=0.7 popcnt 4.71 ms/query  (T2: 4.72 ms check: 314319 run-time: 9.4 s)
      count T=0.7 avx2 3.71 ms/query  (T2: 3.71 ms check: 314319 run-time: 7.4 s)
      threshold T=0.7 popcnt 4.78 ms/query  (T2: 4.78 ms check: 243672.56 run-time: 9.6 s)
      threshold T=0.7 avx2 3.65 ms/query  (T2: 3.65 ms check: 243672.56 run-time: 7.3 s)
      knearest k=1 popcnt 1.12 ms/query  (T2: 1.12 ms check: 928.25 run-time: 2.3 s)
      knearest k=1 avx2 0.86 ms/query  (T2: 0.86 ms check: 928.25 run-time: 1.7 s)
      knearest k=1000 popcnt 8.09 ms/query  (T2: 8.09 ms check: 566563.77 run-time: 16.2 s)
      knearest k=1000 avx2 6.26 ms/query  (T2: 6.26 ms check: 566563.77 run-time: 12.5 s)
    2048 Tanimoto (RDKit Morgan radius=2) 1000 queries
      count T=0.4 popcnt 18.77 ms/query  (T2: 18.78 ms check: 393145 run-time: 37.6 s)
      count T=0.4 avx2 13.97 ms/query  (T2: 13.97 ms check: 393145 run-time: 27.9 s)
      threshold T=0.4 popcnt 18.70 ms/query  (T2: 18.70 ms check: 189655.09 run-time: 37.4 s)
      threshold T=0.4 avx2 13.66 ms/query  (T2: 13.66 ms check: 189655.09 run-time: 27.3 s)
      knearest k=1 popcnt 7.32 ms/query  (T2: 7.32 ms check: 810.25 run-time: 14.6 s)
      knearest k=1 avx2 5.30 ms/query  (T2: 5.30 ms check: 810.25 run-time: 10.6 s)
      knearest k=1000 popcnt 20.01 ms/query  (T2: 20.01 ms check: 341267.50 run-time: 40.0 s)
      knearest k=1000 avx2 14.58 ms/query  (T2: 14.58 ms check: 341267.50 run-time: 29.2 s)

The first line says that this benchmark was run with chemfp 3.3 on the
machine named "haswell".

Chemfp supports a 'multiquery' mode where several queries are
submitted at the same time. This minimizes the Python overhead, and
lets chemfp process the queries in parallel, one query per thread. The
"multiquery: 0" says that multiquery mode is not used, that is, each
query is processed on its own.

Chemfp 3.3 added "specialized" searches for the POPCNT and AVX2
implementations. In short, these inline the popcount calculations with
the search algorithm. There is a run-time option to use the older
"generic" algorithm, which is what all searches used in chemfp 3.2.1 and
earlier. The "specialized: 1" means that chemfp will use specialized
search methods if available.

Multiquery search can parallelize the queries, with one query per
thread. The "num_threads: 1" says that the multiquery search will only
use a single thread.
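
As a sketch of the difference between the two modes (assuming the
`chemfp.search` arena functions and `chemfp.set_num_threads()`; the
file names are placeholders):

    import chemfp
    from chemfp import search

    # Placeholder names; use the matching "queries_*"/"targets_*" files in datasets/.
    queries = chemfp.load_fingerprints("datasets/queries_0166.fps.gz")
    targets = chemfp.load_fingerprints("datasets/targets_0166.fps.gz")

    # Single-query mode: one search call per query fingerprint.
    single_counts = [search.count_tanimoto_hits_fp(fp, targets, threshold=0.7)
                     for (query_id, fp) in queries]

    # Multiquery mode: submit the whole query arena in one call; with more than
    # one thread, chemfp can then process the queries in parallel.
    chemfp.set_num_threads(4)  # example value
    multi_counts = search.count_tanimoto_hits_arena(queries, targets, threshold=0.7)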

Chemfp supports the min/max bounds published in Swamidass and
Baldi. This is done by sorting the target fingerprints by popcount,
and creating a popcount index. Chemfp also supports searching
unsorted/unindexed fingerprints. The "indexed: 1" says that the
targets have a popcount index. An unindexed search can be used to
judge the effectiveness of the popcount bounds. Note: chemfp 1.5 and
3.3 implemented optimized versions of unindexed search which are about
5x to 10x faster than earlier versions.
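
For a Tanimoto threshold T and a query with popcount a, the bound says
that only targets whose popcount b lies between T*a and a/T can
possibly reach the threshold, so whole popcount bins can be skipped. A
small illustration:

    import math

    def tanimoto_popcount_bounds(query_popcount, threshold):
        """Target popcounts which could reach the given Tanimoto threshold.
        Since Tanimoto(a, b) <= min(a, b)/max(a, b), a score of at least T
        requires T*a <= b <= a/T (the Swamidass/Baldi bound)."""
        low = int(math.ceil(threshold * query_popcount))
        high = int(math.floor(query_popcount / threshold))
        return low, high

    print(tanimoto_popcount_bounds(48, 0.7))  # -> (34, 68)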

The benchmark program loads the data sets then iterates over them,
timing the searches for the different cases. Consider this section:

    166 Tanimoto (OEChem MACCS) 1000 queries
      count T=0.7 popcnt 1.36 ms/query  (T2: 1.36 ms check: 5632389 run-time: 2.7 s)
      count T=0.7 avx2 N/A
      threshold T=0.7 popcnt 1.42 ms/query  (T2: 1.43 ms check: 4181442.50 run-time: 2.9 s)
      threshold T=0.7 avx2 N/A
      knearest k=1 popcnt 0.19 ms/query  (T2: 0.19 ms check: 959.41 run-time: 0.4 s)
      knearest k=1 avx2 N/A
      knearest k=1000 popcnt 1.86 ms/query  (T2: 1.86 ms check: 764917.04 run-time: 3.7 s)
      knearest k=1000 avx2 N/A

This evaluates the 166-bit Tanimoto benchmarks for 1,000 queries and
each of the use cases.  The "count T=0.7 popcnt" line reports the
timings for the count search with a threshold of 0.7. The program runs
the benchmark twice and uses the smallest time to compute the average
search time. The benchmark is run twice to help minimize the effect of
memory warmup and jitter on the machine. In this case the fastest time
averages 1.36 milliseconds per query. The slower time is reported after
"T2:", which in this case is also 1.36 milliseconds per query. On a
stable machine, these timing numbers should be very close to each
other.

The "check:" contains the sum of the counts, for a 'count' search, and
the sum of the scores, for a 'threshold' or 'knearest' search. This is
used to verify that the search was correct. It must match the
corresponding value in the `query_benchmark.scores` file.
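
In other words, another tool can reproduce a check value by summing
its own counts or scores over the first N queries. A rough sketch
using chemfp (the file names are placeholders, and `get_scores()` is
assumed from chemfp's SearchResult API):

    import itertools
    import chemfp
    from chemfp import search

    # Placeholder names; use the matching "queries_*"/"targets_*" files in datasets/.
    queries = chemfp.load_fingerprints("datasets/queries_0166.fps.gz")
    targets = chemfp.load_fingerprints("datasets/targets_0166.fps.gz")

    count_check = 0    # sum of the counts, for a 'count' search
    score_check = 0.0  # sum of the hit scores, for a 'threshold' search
    for query_id, fp in itertools.islice(queries, 1000):
        count_check += search.count_tanimoto_hits_fp(fp, targets, threshold=0.7)
        hits = search.threshold_tanimoto_search_fp(fp, targets, threshold=0.7)
        score_check += sum(hits.get_scores())  # get_scores() assumed from SearchResult

    print(count_check, round(score_check, 2))  # compare against query_benchmark.scores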

If it does not match, then there will be one of two additional
lines. If the expected value is not present in the score file then the
extra output line will look like:

    !! ERROR: no expected result available: Tanimoto 166 count T=0.7 1000 5632389

If there is a mismatch then there will be a line like:

    !! ERROR: got Tanimoto 166 count T=0.7 1000 6743490 but expected 5632389

The AVX2 popcount implementation only works for 1024- and 2048-bit
fingerprints, so the "avx2" timings are marked as "N/A".

Finally, "run-time:" reports the total run-time for the given
benchmark.

## chemfp query benchmark command-line options

"cfp_query_benchmark.py" program in the previous section has many
command-line arguments, which can be used to limit a benchmark
evaluation to, for example, only the knearest tests for 166-bit
fingerprints, and to explore different tuning options in chemfp. Here
is the output from running the program with `--help`:

    usage: cfp_query_benchmark.py [-h] [-d DIRNAME] [-c] [-k] [-t]
                                  [-p {lut8,lut16,lauradoux,gillies,ssse3,popcnt,avx2}]
                                  [--166] [--881] [--1024] [--2048]
                                  [--similarity {Tanimoto,Dice,Tversky_80_20}]
                                  [-j NUM_THREADS] [-r] [-g] [-n]
                                  [--single | --multiquery] [--hostname NAME]
                                  [-D HEADER=VALUE] [-N {10,100,1000,2000}]
                                  [--scores FILENAME] [--once] [-o FILENAME]
    
    time chemfp
    
    optional arguments:
      -h, --help            show this help message and exit
      -d DIRNAME, --dirname DIRNAME
                            Location of the chemfp_benchmark 'datasets' directory
      -c, --count           run the count timings
      -k, --knearest        run the k-nearest timings
      -t, --threshold       run the threshold timings
      -p {lut8,lut16,lauradoux,gillies,ssse3,popcnt,avx2}, --popcount {lut8,lut16,lauradoux,gillies,ssse3,popcnt,avx2}
      --166                 run the 166-bit timings
      --881                 run the 881-bit timings
      --1024                run the 1024-bit timings
      --2048                run the 2048-bit timings
      --similarity {Tanimoto,Dice,Tversky_80_20}
                            similarity type; Dice is Tversky with alpha=beta=0.5
      -j NUM_THREADS, --num-threads NUM_THREADS
                            set the number of threads to use in --all searches
      -r, --report          enable chemfp's 'report-algorithm' option
      -g, --generic-only    disable support for chemfp's specialized search
                            functions
      -n, --no-index        use un-indexed search
      --single
      --multiquery
      --hostname NAME       text to use for the 'hostname' column
      -D HEADER=VALUE       additional column to include in the output (e.g. '-D
                            compiler=clang-7.0')
      -N {10,100,1000,2000}, --num-queries {10,100,1000,2000}
                            number of queries to use (default: 1000). I use '10'
                            to do a quick check that things work, '100' as a quick
                            timing comparison, '1000' as the benchmark, and '2000'
                            out of curiosity.
      --scores FILENAME     file containing the reference scores (default:
                            'query_benchmark.scores')
      --once                run the benchmark timings once, rather than twice.
                            This may be less accurate, though is about twice as
                            fast.
      -o FILENAME, --output FILENAME
                            append the timings to FILENAME as tab-delimited fields

The benchmark expects to be run in the home directory of the
distribution. If called from elsewhere, use `--dirname` and `--scores`
to tell it where to find the fingerprint files and expected score file.

By default it will run the count, knearest, and threshold
benchmarks. Use `--count`, `--knearest`, or `--threshold` to run only
those specific tests.

By default it will time the 'popcnt' and 'avx2' popcount
implementations. Use the `--popcount` option to have it run one of the
slower methods instead. Specify the option once for each popcount method.

By default it will evaluate the Tanimoto similarity. Use `--similarity`
to have it run with Dice similarity or with a Tversky similarity where
alpha=0.8 and beta=0.2.

The default method uses `--single` mode, where each query is timed
independently. Use `--multiquery` to submit all of the queries at the
same time. The multiquery search supports parallelization. Use
`--num-threads` to have it use more than the default of 1 thread.

By default it will use a specialized search method, if available. This
feature was added in chemfp 3.3. Earlier versions of chemfp used a
more generic search algorithm. Use `--generic-only` to have chemfp 3.3
revert to the generic algorithm.

By default it will use the min/max bound on the popcounts to reduce
the search space. Use `--no-index` to have it ignore the index.

The benchmark program will include the machine name on the output. By
default this comes from socket.gethostname(). Use `--hostname` to pass
in a different hostname.

The standard benchmark uses the first 1,000 queries from the query
file. Use -N to specify 10, 100, or 2,000 queries. These are
non-standard values and must not be used when reporting a
comparison.

Use "-D" to specify an additional column in the output.

By default the program will run each benchmark twice. This helps
determine if the timing numbers are stable, but it takes about twice
as long. Use `--once` if you don't care about timing stability and are
more interested in verifying the result values.

The `--output` option specifies the filename where the timing numbers
are reported. If the file exists then the output is appended to it.


## chemfp query benchmark driver

The "cfp_query_benchmark.py" program in the previous section can be
used to generate an incredible variety of cases, most of which are not
interesting.

The "cfp_query_benchmark_all.py" program is a driver which calls
"cfp_query_benchmark.py" for the different combinations of parameters
I think are interesting. Some of those questions are:

 - How much faster is chemfp 3.3 than 3.2.1 or 1.5, on the standard
     benchmark?

 - What is the effect of the specialized algorithms in 3.3 over
     the generic algorithm of 3.2.1?
   
 - What is the effect of the min/max bounds on search performance?

 - How much faster is AVX2 than POPCNT?

Here is the result of `--help` on the driver program:

    usage: cfp_query_benchmark_all.py [-h] [-d DIRNAME] [-N {10,100,1000,2000}]
                                      [-D HEADER=VALUE] [--scores FILENAME]
                                      [--dry-run]
                                      [filename]
    
    positional arguments:
      filename              output filename
    
    optional arguments:
      -h, --help            show this help message and exit
      -d DIRNAME, --dirname DIRNAME
                            Location of the chemfp_benchmark 'datasets' directory
      -N {10,100,1000,2000}, --num-queries {10,100,1000,2000}
      -D HEADER=VALUE       additional column to include in the output (e.g. '-D
                            compiler=clang-7.0')
      --scores FILENAME     file containing the reference scores
      --dry-run             show the command-line for the benchmark tests which
                            will be run, but don't actually run them
  
The `--dirname`, `-N`, `-D`, and `--scores` options are passed to the
underlying benchmark code.

The `--dry-run` option prints the list of command-line arguments that
will be passed to cfp_query_benchmark.py, but does not actually
execute those commands.

If the output filename is not given, and if -N is 1000, then the
default output filename will be based on the chemfp version number and
the hostname, with the extension "`.fpbench`".

For example, the default output file for chemfp 3.3 on the benchmark
machine named "haswell" is `chemfp33_haswell.fpbench`. If -N is some
other value then it is included in the filename, e.g., with -N 10 the
default output filename is `chemfp33_10_haswell.fpbench`.

The output filename is passed down to the cfp_query_benchmark.py
program as the `-o` parameter. All of the benchmark cases will be
appended to the given file.

## fpbench format

Use the `-o` option of cfp_query_benchmark.py to have it save the
benchmark information in a more machine-readable format, called the
"fpbench" format. The `-o` option appends the benchmark results to an
existing file.

For example:

    % ./cfp_query_benchmark.py -o cfp.fpbench > /dev/null
    % cat cfp.fpbench
    #cfp_query_benchmark/0.9
    # -o cfp.fpbench
    version	hostname	N	multiquery	specialized	num_threads	indexed	similarity	num_bits	algorithm	parameters	popcount	T_avg	status
    3.3	haswell	1000	0	1	1	1	Tanimoto	166	count	T=0.7	popcnt	1.365	ok
    3.3	haswell	1000	0	1	1	1	Tanimoto	166	count	T=0.7	avx2	0.000	N/A
    3.3	haswell	1000	0	1	1	1	Tanimoto	166	threshold	T=0.7	popcnt	1.420	ok
    3.3	haswell	1000	0	1	1	1	Tanimoto	166	threshold	T=0.7	avx2	0.000	N/A
    3.3	haswell	1000	0	1	1	1	Tanimoto	166	knearest	k=1	popcnt	0.191	ok
    3.3	haswell	1000	0	1	1	1	Tanimoto	166	knearest	k=1	avx2	0.000	N/A
    3.3	haswell	1000	0	1	1	1	Tanimoto	166	knearest	k=1000	popcnt	1.865	ok
    3.3	haswell	1000	0	1	1	1	Tanimoto	166	knearest	k=1000	avx2	0.000	N/A
    3.3	haswell	1000	0	1	1	1	Tanimoto	881	count	T=0.7	popcnt	4.590	ok
      ..

This is a line-oriented format with three possible line types: comment, header, and data.

A comment line is any line containing only whitespace or where the
first non-whitespace character is a `#`. In the above, the
`#cfp_query_benchmark/0.9` comment line says that the output was
generated by cfp_query_benchmark version 0.9, and the following
comment line records the command-line arguments, `-o cfp.fpbench`.

Comment lines may be ignored.

Header and data lines contain tab-separated fields. A header line
contains the header names for each of the fields in the data lines.
The first header field must be "version". The header line must also
have the fields "status" and "T_avg". If the status field is not "ok"
then the data line will be ignored in any analysis. If the status
field is "ok" then the "T_avg" field must be a float value.

In order to compare results against chemfp's, the file should also
have the fields 'algorithm' and 'parameters', as defined below.

Each data line must have the same number of fields as the most recent
header line.

A file may have multiple header lines. Except for the "version"
header, the order of the headers is arbitrary. A header line may have
a different number of fields, and different field names, than a
previous header line. This lets a benchmark include extra fields for,
for example, the compiler name and version.

The default headers from the cfp_query_benchmark.py program are:

  * version - the chemfp version as reported by `chemfp.__version__`
  * hostname - the machine name, as reported by socket.gethostname()
  * N - the number of queries used in the benchmark (10, 100, 1000, or 2000)
  * multiquery - 0 for --single mode, 1 for --multiquery mode 
  * specialized - 0 for generic, 1 if specialization is allowed
  * num_threads - number of threads to use in multiquery mode
  * indexed - 1 if the popcount index is used to set min/max bounds on
     the search, 0 for an unindexed search
  * similarity - one of "Tanimoto", "Dice", or "Tversky_80_20"
  * num_bits - the benchmark fingerprint size; one of 166, 881, 1024, or 2048
  * algorithm - one of 'count', 'threshold', or 'knearest'
  * parameters - the specific algorithm parameters; 'T=0.7' means a
     threshold of 0.7 or above, 'T=0.4' means a threshold of 0.4 or above,
     'k=1' and 'k=1000' mean k=1 and k=1000 nearest-neighbor searches.
  * popcount - the underlying popcount method to use. The fastest is 'avx2',
     if the AVX2 instruction set is available and the number of bits is 1024
     or 2048. The next fastest is 'popcnt', if the POPCNT instruction is
     available. The 'ssse3', 'gillies', 'lauradoux', 'lut16', and 'lut8'
     are present for historical reasons. 
  * T_avg - the average search time, per query, in milliseconds.
  * status - this is 'ok' if the benchmark computed the expected value, 'N/A'
     if the given combination of parameters was invalid, or some other
     string (like 'error') to indicate some other condition.
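
A minimal reader for this format might look like the following sketch:
it skips comment lines, tracks the most recent header line, and yields
one dict per "ok" data line:

    import csv

    def read_fpbench(path):
        """Yield one dict per 'ok' data line of an .fpbench file."""
        header = None
        with open(path, newline="") as infile:
            for fields in csv.reader(infile, delimiter="\t"):
                if not fields or fields[0].lstrip().startswith("#") or not "".join(fields).strip():
                    continue  # blank or comment line
                if fields[0] == "version":
                    header = fields  # a new header line; a file may contain several
                    continue
                row = dict(zip(header, fields))
                if row["status"] == "ok":
                    row["T_avg"] = float(row["T_avg"])  # average ms/query
                    yield row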

## Reference benchmarks

The chemfp_benchmark distribution includes the stdout output and the
fpbench output from cfp_query_benchmark_all.py for chemfp 1.4, chemfp
1.5, chemfp 3.2.1, and chemfp 3.3. These are:

  * chemfp14_haswell.fpbench and chemfp14_haswell.out for chemfp 1.4
  * chemfp15_haswell.fpbench and chemfp15_haswell.out for chemfp 1.5
  * chemfp321_haswell.fpbench and chemfp321_haswell.out for chemfp 3.2.1
  * chemfp33_haswell.fpbench and chemfp33_haswell.out for chemfp 3.3

These benchmarks were generated on the machine named "haswell", named
after its Intel hardware architecture.

The file "haswell.cpuinfo" contains the output from running:

    % cpuinfo > haswell.cpuinfo

where "cpuinfo" is a program installed by the "py-cpuinfo" package
(see https://pypi.org/project/py-cpuinfo/ ).

In all four cases, GCC version `gcc-6 (Ubuntu 6.2.0-3ubuntu11~16.04)
6.2.0 20160901` was used to compile chemfp.


## Compare benchmarks

The program `fpbench_compare.py` can be used to compare different
chemfp benchmarks.

I think it's easiest to explain with a walkthrough. If you run it
without any options, it finds the columns (other than the "T_avg" and
"status" columns) which contain at least two different values, and
generates timing ratios for each of them:

    % ./fpbench_compare.py chemfp14_haswell.fpbench chemfp15_haswell.fpbench
    too many differences - not generating comparisons for popcount: ['gillies', 'lauradoux', 'lut16', 'lut8', 'popcnt', 'ssse3']
    version=1.4/version=1.5 geometic mean: 1.655 from 128 comparisons
    multiquery=0/multiquery=1 geometic mean: 1.003 from 43 comparisons
    num_threads=1/num_threads=2 geometic mean: 1.913 from 32 comparisons
    num_threads=1/num_threads=3 geometic mean: 2.360 from 32 comparisons
        ..

This shows that version 1.5 is about 65% faster than version 1.4, and
that multiquery searches are not much faster than single query
searches. The ratio is computed by finding all data rows which differ
only in the 'version' field, with '1.4' and '1.5' respectively, then
computing the geometric mean of the ratios of their T_avg times.
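
That is, the reported ratio is an ordinary geometric mean of the
per-row T_avg ratios, along these lines:

    import math

    def geometric_mean_ratio(pairs):
        """pairs holds (T_avg_1, T_avg_2) times for rows which differ only in the
        compared field; the result is how many times slower the first set is
        than the second, on (geometric) average."""
        ratios = [t1 / t2 for (t1, t2) in pairs]
        return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

    print(geometric_mean_ratio([(26.362, 4.855), (26.128, 4.561)]))  # -> about 5.58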

However, that "65%" is not that useful. Let me focus on that ratio by
using that specific comparison (the `-c` flag is short for
`--compare`):

    %  ./fpbench_compare.py chemfp14_haswell.fpbench chemfp15_haswell.fpbench -c version=1.4,1.5
    version=1.4/version=1.5 geometic mean: 1.655 from 128 comparisons

Chemfp 1.4 had a very slow unindexed search, which is about 6x faster
in chemfp 1.5. I'll redo the calculation by selecting only those data
lines where indexed=0 (unindexed) and then where indexed=1 (indexed).

    % ./fpbench_compare.py chemfp14_haswell.fpbench chemfp15_haswell.fpbench -c version=1.4,1.5 -s indexed=0
    version=1.4/version=1.5 geometic mean: 5.828 from 32 comparisons
    % ./fpbench_compare.py chemfp14_haswell.fpbench chemfp15_haswell.fpbench -c version=1.4,1.5 -s indexed=1
    version=1.4/version=1.5 geometic mean: 1.088 from 96 comparisons

Use the `--details` option to see details about the 32 comparisons. If
you do, you'll see that the output includes times for the slower
popcount implementations, which aren't needed on modern hardware, and
that it includes comparisons for both 1 and 4 threads.

I'll add another `--select` option, for num_threads=1, and I'll have it
"ignore" the popcount value and simply choose the fastest time of any
available popcount.

(If the match algorithm finds that two entries for a given selection
have an equivalent "key", where the "key" is the tuple of values of
the non-ignored columns, then it selects the entry with the smallest
T_avg value. This results in comparing the best time for a given key.)

I'll also have it ignore a few other columns, because they are
constant, and not useful to see in the output:

    % ./fpbench_compare.py chemfp14_haswell.fpbench chemfp15_haswell.fpbench \
        -c version=1.4,1.5 -s indexed=0 -s num_threads=1 \
        -i popcount -i hostname -i N -i specialized -i status --details
    multiquery	similarity	num_bits	algorithm	parameters	T_avg1	T_avg2	ratio
    1	Tanimoto	166	count	T=0.7	26.362	4.855	5.43
    1	Tanimoto	166	threshold	T=0.7	26.128	4.561	5.73
    1	Tanimoto	166	knearest	k=1	25.994	8.358	3.11
    1	Tanimoto	166	knearest	k=1000	26.892	7.083	3.80
    1	Tanimoto	881	count	T=0.7	114.026	16.208	7.04
    1	Tanimoto	881	threshold	T=0.7	113.717	16.467	6.91
    1	Tanimoto	881	knearest	k=1	113.647	20.888	5.44
    1	Tanimoto	881	knearest	k=1000	114.602	21.015	5.45
    1	Tanimoto	1024	count	T=0.7	130.218	18.029	7.22
    1	Tanimoto	1024	threshold	T=0.7	129.806	18.196	7.13
    1	Tanimoto	1024	knearest	k=1	129.826	21.106	6.15
    1	Tanimoto	1024	knearest	k=1000	130.759	22.884	5.71
    1	Tanimoto	2048	count	T=0.4	252.360	33.903	7.44
    1	Tanimoto	2048	threshold	T=0.4	251.889	34.145	7.38
    1	Tanimoto	2048	knearest	k=1	251.982	36.116	6.98
    1	Tanimoto	2048	knearest	k=1000	252.932	39.128	6.46
    version=1.4/version=1.5 geometic mean: 5.936 from 16 comparisons

So single-threaded unindexed searches finish in about 1/6 the
time. That's nice, but few people use unindexed search. What about the
indexed, single-query performance changes between 1.4 and 1.5, again
with 1 thread? For that I'll select indexed=1 and num_threads=1, and
also select "multiquery=0" to get the single query mode:

    % ./fpbench_compare.py chemfp14_haswell.fpbench chemfp15_haswell.fpbench \
        -c version=1.4,1.5 -s indexed=1 -s num_threads=1 -s multiquery=0 \
        -i popcount -i hostname -i N -i specialized -i status 
    version=1.4/version=1.5 geometic mean: 1.116 from 16 comparisons

That's about 12% faster on average. Where did it come from? Version 1.4 had
a fast integer rejection test for 'threshold' searches. Version 1.5
added that support to both 'count' and 'knearest' searches. You can
see that the 'threshold' performance hasn't changed by selecting for
the 'threshold' algorithm:

    % ./fpbench_compare.py chemfp14_haswell.fpbench chemfp15_haswell.fpbench \
        -c version=1.4,1.5 -s indexed=1 -s num_threads=1 -s multiquery=0 \
        -s algorithm=threshold \
        -i popcount -i hostname -i N -i specialized -i status
    version=1.4/version=1.5 geometic mean: 1.003 from 4 comparisons
  
while the 'count' algorithm is generally faster (though oddly still
slower than the threshold search for 1024 and 2048 bit searches):

    % ./fpbench_compare.py chemfp14_haswell.fpbench chemfp15_haswell.fpbench \
        -c version=1.4,1.5 -s indexed=1 -s num_threads=1 -s multiquery=0 \
        -s algorithm=count \
        -i popcount -i hostname -i N -i specialized -i status --details
    similarity	num_bits	parameters	T_avg1	T_avg2	ratio
    Tanimoto	166	T=0.7	2.634	1.851	1.42
    Tanimoto	881	T=0.7	6.519	5.741	1.14
    Tanimoto	1024	T=0.7	6.279	5.606	1.12
    Tanimoto	2048	T=0.4	22.233	24.280	0.92
    version=1.4/version=1.5 geometic mean: 1.135 from 4 comparisons

and the knearest code is about 17% faster (though the 166-bit code is
a lot faster!):

    % ./fpbench_compare.py chemfp14_haswell.fpbench chemfp15_haswell.fpbench \
        -c version=1.4,1.5 -s indexed=1 -s num_threads=1 -s multiquery=0 \
        -s algorithm=knearest \
        -i popcount -i hostname -i N -i specialized -i status --details
    similarity	num_bits	parameters	T_avg1	T_avg2	ratio
    Tanimoto	166	k=1	0.359	0.246	1.46
    Tanimoto	166	k=1000	2.921	2.199	1.33
    Tanimoto	881	k=1	1.681	1.500	1.12
    Tanimoto	881	k=1000	6.272	5.644	1.11
    Tanimoto	1024	k=1	1.430	1.299	1.10
    Tanimoto	1024	k=1000	10.102	9.246	1.09
    Tanimoto	2048	k=1	8.445	7.757	1.09
    Tanimoto	2048	k=1000	22.988	21.182	1.09
    version=1.4/version=1.5 geometic mean: 1.167 from 8 comparisons
  
What about comparing chemfp 1.5 to chemfp 3.3 for single-query
searches? Your first thought might be to do:

    % ./fpbench_compare.py chemfp15_haswell.fpbench chemfp33_haswell.fpbench \
        -c version=1.5,3.3 -s indexed=1 -s num_threads=1 -s multiquery=0 \
        -i popcount -i hostname -i N -i specialized -i status --details
    similarity	num_bits	algorithm	parameters	T_avg1	T_avg2	ratio
    Tanimoto	166	count	T=0.7	1.851	1.374	1.35
    Tanimoto	166	threshold	T=0.7	1.721	1.422	1.21
    Tanimoto	166	knearest	k=1	0.246	0.191	1.29
    Tanimoto	166	knearest	k=1000	2.199	1.850	1.19
    Tanimoto	881	count	T=0.7	5.741	4.547	1.26
    Tanimoto	881	threshold	T=0.7	5.695	4.697	1.21
    Tanimoto	881	knearest	k=1	1.500	1.221	1.23
    Tanimoto	881	knearest	k=1000	5.644	4.732	1.19
    Tanimoto	1024	count	T=0.7	5.606	3.699	1.52
    Tanimoto	1024	threshold	T=0.7	5.515	3.643	1.51
    Tanimoto	1024	knearest	k=1	1.299	0.856	1.52
    Tanimoto	1024	knearest	k=1000	9.246	6.250	1.48
    Tanimoto	2048	count	T=0.4	24.280	13.956	1.74
    Tanimoto	2048	threshold	T=0.4	19.856	13.645	1.46
    Tanimoto	2048	knearest	k=1	7.757	5.290	1.47
    Tanimoto	2048	knearest	k=1000	21.182	14.563	1.45
    version=1.5/version=3.3 geometic mean: 1.371 from 16 comparisons

If you thought that, congratulations - you are right! Chemfp 3.3 is
roughly 20-75% faster than chemfp 1.5, depending on the case.

The performance gain comes from improvements in the popcount
algorithms (added in chemfp 2.x), AVX2 support in chemfp 3.x (enabled
by default in chemfp 3.2.1), and from the new "specialized" search
algorithms in version 3.3.

We can also compare chemfp 3.2.1 to 3.3:

    % ./fpbench_compare.py *.fpbench \
        -c version=3.2.1,3.3 -s indexed=1 -s num_threads=1 -s multiquery=0 \
        -s similarity=Tanimoto \
        -i popcount -i hostname -i N -i specialized -i status
    version=3.2.1/version=3.3 geometic mean: 1.289 from 16 comparisons

and break it down further by the number of bits:

    num_bits= 166, geometic mean: 1.616 (inlining really helps here!)
    num_bits= 881, geometic mean: 1.151
    num_bits=1024, geometic mean: 1.276 (improved AVX2 code)
    num_bits=2048, geometic mean: 1.164 (improved AVX2 code)

I hope this gives you an idea of how useful fpbench_compare.py can be
for comparing a range of benchmark results.

## Thanks!

Thanks to ChEMBL for providing the chemical structures used to
generate the MACCS, FP2, and Morgan fingerprints.

Thanks to OpenEye, for use of the OEChem/OEGraphSim toolkit, and
permission to include the MACCS key generation output for the 166-bit
fingerprints.

Thanks to PubChem and the CACTVS developers, for including the
fingerprints as part of the PubChem distribution.

Thanks to the Open Babel developers, for use of Open Babel to generate
the FP2 fingerprints used for the 1024-bit fingerprints.

Thanks to the RDKit developers, for use of RDKit to generate the
Morgan fingerprints for the 2048-bit fingerprints.

Thanks to Daniel Lemire for the use of his hardware as a reference
platform for this benchmark.