CitiBike NYC

A collection of scripts to get snapshots from the CitiBike NYC bike sharing system, process the data and produce BBSS instances in the format used by research groups at TU Wien, Austria and University of Udine, Italy.

Note: the repository also includes snapshots (plus pre-processed data) covering about 6 months of service (from the beginning of May 2013 to the beginning of November 2013), and instances of size {10, 20, 30, 60, 90, 120, 150, 180, 210, 240} generated using station 294 (a central one) as the starting depot.


The process involves two steps: data collection and actual instance generation.

Data collection

Collecting the necessary data involves three steps:

  1. activating Tor,
  2. in crontab -e, adding:

    */10 * * * * &> /dev/null
    0 0 * * * &> /dev/null
  3. waiting for the data to come in.

The first cron entry runs the script that dumps a snapshot of the NYC bike sharing system and names it accordingly; running it every 10 minutes gives a good discretization of bicycle movements throughout the day.

The second cron entry runs the script that checks the snapshots and updates a matrix of reachability costs (time and distance between pairs of stations). It must run from cron because new stations can be added over time. The matrix is saved in distances.json.
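The incremental update described above can be sketched as follows. This is only an illustration, not the repository's code: the function name update_costs, the "a-b" key format, and the cost_fn callback (which stands in for whatever service provides time and distance) are all assumptions.

```python
import json
from itertools import permutations
from pathlib import Path


def update_costs(path, station_ids, cost_fn):
    """Add any missing ordered station pair to the cost matrix stored at `path`.

    `cost_fn(a, b)` is a hypothetical callback returning the cost entry
    (for example a dict with time and distance) for the pair (a, b).
    Returns the number of pairs that were added.
    """
    path = Path(path)
    matrix = json.loads(path.read_text()) if path.exists() else {}
    added = 0
    for a, b in permutations(sorted(station_ids), 2):
        key = f"{a}-{b}"  # assumed key format; JSON object keys must be strings
        if key not in matrix:
            matrix[key] = cost_fn(a, b)
            added += 1
    path.write_text(json.dumps(matrix, indent=2, sort_keys=True))
    return added
```

Because only missing pairs are computed, re-running the update after a new station appears costs a number of lookups proportional to the number of new pairs, not to the whole matrix.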

Note: configure the snapshots directory in both scripts and make sure everything is in place; it is currently set to the snapshots subdirectory.

Instance generation

In order to generate a BBSS instance you need two pieces of information: the current number of bikes in each station, and the desired number of bikes after the rebalancing. Unfortunately, while we have the former, we still lack the latter.

Our solution was to study, for each station, the distribution of bikes throughout the day. We then computed the desired number of bikes (and thus the required rebalancing) so that the minimum first quartile of bikes during the day is as far as possible from zero, and the maximum third quartile of bikes during the day is as far as possible from the maximum station capacity. This way, we decrease the chance that a station is full when we need to drop off a bike, or empty when we need to pick one up.
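One possible reading of this criterion, sketched in Python rather than R, is a shift d applied to a station's bike level that maximizes min(low + d, capacity - (high + d)), where low is the smallest first quartile and high the largest third quartile over the day. The maximum is reached at d = (capacity - low - high) / 2. The function names, the per-slot input layout, and the final clipping step are assumptions and may differ from what targets.R actually does.

```python
from statistics import quantiles


def quartile_band(counts_by_slot):
    """Return (lowest first quartile, highest third quartile) over all time slots.

    `counts_by_slot` maps a time-of-day slot to the bike counts observed
    in that slot on different days (assumed layout).
    """
    lows, highs = [], []
    for counts in counts_by_slot.values():
        q1, _, q3 = quantiles(counts, n=4, method="inclusive")
        lows.append(q1)
        highs.append(q3)
    return min(lows), max(highs)


def balanced_shift(low, high, capacity):
    """Shift d that maximizes min(low + d, capacity - (high + d))."""
    return (capacity - low - high) / 2


def target_bikes(current, shift, capacity):
    """Apply the shift to the current level, clipped to [0, capacity]."""
    return max(0, min(capacity, round(current + shift)))


if __name__ == "__main__":
    history = {"06:00": [2, 4, 6, 8, 10], "18:00": [10, 12, 14, 16, 18]}
    low, high = quartile_band(history)
    shift = balanced_shift(low, high, capacity=30)
    print(low, high, shift, target_bikes(6, shift, capacity=30))
```

With this shift, the lowest first quartile ends up as far from zero as the highest third quartile is from the station capacity.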

Moreover, the instances are generated so that there are always stations which require addition of bikes, and stations which require removal. This is needed to make instances interesting to solve. The steps to generate the instances are:

  1. delete targets.json and data.csv as they must be generated from scratch,
  2. run to generate data.csv,
  3. run targets.R to generate targets.json from data.csv, and finally
  4. run --size <n_stations> --depot <initial_depot>.

The script in step 2 just converts the content of the snapshots directory into a CSV file that R can parse. After data.csv has been produced, targets.R produces a JSON-formatted file with the desired number of bikes for each station, based on the procedure described above. This file is finally processed by the generator to produce n_stations-sized instances starting from depot initial_depot and based on the 6:00 AM snapshots, which usually yields a lot of instances. If an initial_depot is not specified, station 294 is used, as it is quite central.
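As a rough illustration of the conversion step, the sketch below flattens a directory of snapshots into a single CSV file. The snapshot layout is an assumption made for the example: each snapshot is taken to be a JSON file containing a list of station objects with the hypothetical fields id, bikes and free, and the file name (without extension) is used as the snapshot label. The real snapshots and the columns of data.csv may look different.

```python
import csv
import json
from pathlib import Path


def snapshots_to_csv(snapshot_dir, csv_path):
    """Write one CSV row per (snapshot, station) pair; return the row count."""
    rows = 0
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        # Assumed column names, chosen for this sketch only.
        writer.writerow(["snapshot", "station", "bikes", "free"])
        for snap in sorted(Path(snapshot_dir).glob("*.json")):
            for station in json.loads(snap.read_text()):
                writer.writerow(
                    [snap.stem, station["id"], station["bikes"], station["free"]]
                )
                rows += 1
    return rows
```

A long table with one row per observation is straightforward to load in R with read.csv and to group by station and time of day.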

The generator also supports a seed for the pseudo-random number generator (--random-seed parameter, in both generator scripts). If no seed is supplied, zero is used.
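A minimal sketch of seeded instance sampling, assuming the generator draws a random subset of stations and rejects subsets in which no station needs bikes added or no station needs bikes removed. The function name sample_instance, the output layout, the retry limit, and the choice not to count the depot towards the instance size are assumptions made for illustration.

```python
import random


def sample_instance(current, targets, size, depot=294, seed=0, max_tries=1000):
    """Pick `size` stations (depot excluded) with mixed positive and negative demand.

    `current` and `targets` map station id to the current and desired
    number of bikes. A positive demand means bikes must be added.
    """
    rng = random.Random(seed)  # a private generator keeps runs reproducible
    candidates = sorted(s for s in targets if s != depot and s in current)
    for _ in range(max_tries):
        chosen = rng.sample(candidates, size)
        demands = {s: targets[s] - current[s] for s in chosen}
        needs_bikes = any(d > 0 for d in demands.values())
        has_surplus = any(d < 0 for d in demands.values())
        if needs_bikes and has_surplus:
            return {"depot": depot, "demands": demands}
    raise ValueError("no subset with both additions and removals was found")
```

Calling the function twice with the same seed returns the same instance, which is what makes experiments on the generated instances repeatable.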

It is possible to generate instances from a single snapshot by using the (singular) script. All the Python scripts provide a --help command-line argument that explains the meaning of the various parameters. Instances are written to the instances directory.


The code is distributed under the MIT License. Data from CitiBike NYC is public and free to use.