# Overview

Atlassian SourceTree is a free Git and Mercurial client for Windows.

Atlassian SourceTree is a free Git and Mercurial client for Mac.

# coresets: a package for creating and evaluating logistic regression coresets

The `coresets`

package was used produce the experiments for:

Jonathan H. Huggins,
Trevor Campbell,
Tamara Broderick.
*Coresets for Scalable Bayesian Logistic Regression*.
In *Proc. of the 30th Annual Conference on Neural Information Processing
Systems* (NIPS), 2016.

The package includes functionality to load data, construct coresets for logistic regression, run an adaptive Metropolis-Hastings sampler using the coreset, and compare performance of coreset inferences to those obtained with other methods.

## Compilation and testing

To compile and test the package (for development purposes):

python setup.py build_ext --inplace # compile cython code in place nosetests tests/ # run tests, which takes a minute or two

To install:

pip install .

## Usage

The key function is `coresets.algorithms.construct_lr_coreset_with_kmeans`

,
which constructs a logistic regression coreset. The implementation is quite
efficient and requires minimal memory overhead. On a 2012 MacBook Pro with a
2.9 GHz Intel Core i7 it takes about half a second to run on the Webspam
dataset, which contains 350,000 data points of dimension 127.

The script used for most of the experiments in the paper, which makes use of
`construct_lr_coreset_with_kmeans`

as well as other functionality, can be found
in
scripts/run_experiments.py.
The script loads data, constructs coresets from the data, and compares the
inferential accuracy of the coreset to a random subsample of the same sizes.
For example, the following would construct random subsets and coresets of
sizes 156, 312, 625, 1250, and 2500 for the Binary10 synthetic dataset.
5,000 iterations of adaptive MALA are used for inference, the radius R is
chosen adaptively, and each experimental condition is repeated 5 times
(running this will take some time -- as much a few hours).

lrcoresets$ python scripts/run_experiments.py --synth-bin -d 10 -s 156 312 625 1250 2500 -R=-3 -k=4 -r=5 -i=5000 Created output folder: results changed working directory to results Created output folder: ../data/synthetic Generating data with train path ../data/synthetic/binary-d-10-N-1000000.npz running experiment... Created output folder: synthetic_cmc-Ks=4-Ms=156,312,625,1250,2500-Rs=-3.0-d=10-iters=5000-experiment-MALA X is either low-dimensional or not very sparse, so converting to a numpy array X is either low-dimensional or not very sparse, so converting to a numpy array running target algorithm "Full" for 25000 iterations with a warmup of 12500 iterations File Binary-Full.cpk does not yet exist, so starting from scratch running algorithm "Coreset" for 5000 iterations with a warmup of 2500 iterations and with the following parameters: output_size_param=156, R=-3.0, K=4, kmeans_subsample_size=auto File Binary-Coreset-K=4-R=-3.0-kmeans_subsample_size=auto-output_size_param=156.cpk does not yet exist, so starting from scratch running algorithm "Coreset" for 5000 iterations with a warmup of 2500 iterations and with the following parameters: output_size_param=312, R=-3.0, K=4, kmeans_subsample_size=auto File Binary-Coreset-K=4-R=-3.0-kmeans_subsample_size=auto-output_size_param=312.cpk does not yet exist, so starting from scratch running algorithm "Coreset" for 5000 iterations with a warmup of 2500 iterations and with the following parameters: output_size_param=625, R=-3.0, K=4, kmeans_subsample_size=auto File Binary-Coreset-K=4-R=-3.0-kmeans_subsample_size=auto-output_size_param=625.cpk does not yet exist, so starting from scratch running algorithm "Coreset" for 5000 iterations with a warmup of 2500 iterations and with the following parameters: output_size_param=1250, R=-3.0, K=4, kmeans_subsample_size=auto File Binary-Coreset-K=4-R=-3.0-kmeans_subsample_size=auto-output_size_param=1250.cpk does not yet exist, so starting from scratch running algorithm "Coreset" for 5000 iterations with a warmup of 2500 iterations and with the following parameters: output_size_param=2500, R=-3.0, K=4, kmeans_subsample_size=auto File Binary-Coreset-K=4-R=-3.0-kmeans_subsample_size=auto-output_size_param=2500.cpk does not yet exist, so starting from scratch running algorithm "Random" for 5000 iterations with a warmup of 2500 iterations and with the following parameters: output_size_param=156, R=-3.0, K=4, kmeans_subsample_size=auto File Binary-Random-K=4-R=-3.0-kmeans_subsample_size=auto-output_size_param=156.cpk does not yet exist, so starting from scratch running algorithm "Random" for 5000 iterations with a warmup of 2500 iterations and with the following parameters: output_size_param=312, R=-3.0, K=4, kmeans_subsample_size=auto File Binary-Random-K=4-R=-3.0-kmeans_subsample_size=auto-output_size_param=312.cpk does not yet exist, so starting from scratch running algorithm "Random" for 5000 iterations with a warmup of 2500 iterations and with the following parameters: output_size_param=625, R=-3.0, K=4, kmeans_subsample_size=auto File Binary-Random-K=4-R=-3.0-kmeans_subsample_size=auto-output_size_param=625.cpk does not yet exist, so starting from scratch running algorithm "Random" for 5000 iterations with a warmup of 2500 iterations and with the following parameters: output_size_param=1250, R=-3.0, K=4, kmeans_subsample_size=auto File Binary-Random-K=4-R=-3.0-kmeans_subsample_size=auto-output_size_param=1250.cpk does not yet exist, so starting from scratch running algorithm "Random" for 5000 iterations with a warmup of 2500 iterations and with the following parameters: output_size_param=2500, R=-3.0, K=4, kmeans_subsample_size=auto File Binary-Random-K=4-R=-3.0-kmeans_subsample_size=auto-output_size_param=2500.cpk does not yet exist, so starting from scratch plotting experiment results... Created output folder: synthetic_cmc-Ks=4-Ms=156,312,625,1250,2500-Rs=-3.0-d=10-iters=5000-experiment-MALA_plots

The script saves the experimental data and figures the *results/* directory.
Some example plots:

*Subset size vs. 2nd degree polynomial MMD for coresets and random subsamples*

*Running time vs. 2nd degree polynomial MMD for coresets and random subsamples*

*True vs. estimated means for coresets and random subsamples of size 312*

This invocation of runs a timing experiment on two synthetic datasets and the three included real datasets:

lrcoresets$ python scripts/run_experiments.py --timing --synth-bin --synth-mix --chemreact --webspam --covtype -d 0 -s 156 312 625 1250 2500 5000 10000 -R=-3 -k=6 -r=5 -i=5000 changed working directory to results Generating data with train path ../data/synthetic/binary-d-10-N-1000000.npz Generating data with train path ../data/synthetic/mixture-d-10-N-1000000.npy running experiment... Created output folder: binary-mixture-chemreact-webspam-covtype-Ks=6-Ms=156,312,625,1250,2500,5000,10000-Rs=-3.0-d=0-iters=5000-timing-experiment-MALA X is either low-dimensional or not very sparse, so converting to a numpy array X is either low-dimensional or not very sparse, so converting to a numpy array running target algorithm "Full" for 2 iterations with a warmup of 1 iterations File Binary-Full.cpk does not yet exist, so starting from scratch running algorithm "Coreset" for 5000 iterations with a warmup of 2500 iterations and with the following parameters: output_size_param=156, R=-3.0, K=6, kmeans_subsample_size=auto File Binary-Coreset-K=6-R=-3.0-kmeans_subsample_size=auto-output_size_param=156.cpk does not yet exist, so starting from scratch running algorithm "Coreset" for 5000 iterations with a warmup of 2500 iterations and with the following parameters: output_size_param=312, R=-3.0, K=6, kmeans_subsample_size=auto File Binary-Coreset-K=6-R=-3.0-kmeans_subsample_size=auto-output_size_param=312.cpk does not yet exist, so starting from scratch running algorithm "Coreset" for 5000 iterations with a warmup of 2500 iterations and with the following parameters: output_size_param=625, R=-3.0, K=6, kmeans_subsample_size=auto File Binary-Coreset-K=6-R=-3.0-kmeans_subsample_size=auto-output_size_param=625.cpk does not yet exist, so starting from scratch running algorithm "Coreset" for 5000 iterations with a warmup of 2500 iterations and with the following parameters: output_size_param=1250, R=-3.0, K=6, kmeans_subsample_size=auto File Binary-Coreset-K=6-R=-3.0-kmeans_subsample_size=auto-output_size_param=1250.cpk does not yet exist, so starting from scratch running algorithm "Coreset" for 5000 iterations with a warmup of 2500 iterations and with the following parameters: output_size_param=2500, R=-3.0, K=6, kmeans_subsample_size=auto File Binary-Coreset-K=6-R=-3.0-kmeans_subsample_size=auto-output_size_param=2500.cpk does not yet exist, so starting from scratch running algorithm "Coreset" for 5000 iterations with a warmup of 2500 iterations and with the following parameters: output_size_param=5000, R=-3.0, K=6, kmeans_subsample_size=auto File Binary-Coreset-K=6-R=-3.0-kmeans_subsample_size=auto-output_size_param=5000.cpk does not yet exist, so starting from scratch running algorithm "Coreset" for 5000 iterations with a warmup of 2500 iterations and with the following parameters: output_size_param=10000, R=-3.0, K=6, kmeans_subsample_size=auto File Binary-Coreset-K=6-R=-3.0-kmeans_subsample_size=auto-output_size_param=10000.cpk does not yet exist, so starting from scratch . . . plotting experiment results... Created output folder: binary-mixture-chemreact-webspam-covtype-Ks=6-Ms=156,312,625,1250,2500,5000,10000-Rs=-3.0-d=0-iters=5000-timing-experiment-MALA_plots

*Percentage of time spent creating the coreset relative to the total inference
time*