CitiBike NYC

This repository contains 180 instances for the BBSS (Balancing Bike Sharing Systems) problem, generated by elaborating real data from CitiBike NYC. The format of the instances is the one defined and popularized by the ADS group at TU Wien and the mobility department at AIT. The instances themselves were generated by the Scheduling and Time-Tabling (SaTT) research group at the University of Udine.

Instance generation

The instances were generated from snapshots of the CitiBike NYC system, taken every 10 minutes for the whole month of September 2013. The following sections describe the generation process in detail.

Data analysis

From the 4320 snapshots, we computed the distributions of bikes at every hour of the day across the whole month. In this process we ignored data from the week-ends, as it is more likely to contain noise. From the analysis it was clear that, at certain times of the day, there were stations acting as sources and others acting as sinks. The following plots highlight this behaviour.

Source station Sink station

From some of the distributions, it was also clear that there was some artificial rebalancing happening overnight between 00:00 and 06:00 AM (mildly visible in the plot below at 04:00 AM).

Rebalancing step

Initial number of bikes

Every instance needs, for every station, the initial number of bikes, and the target number of bikes. Since from the analysis appeared that rebalancing started around midnight, we decided to use the number of bikes, in each station, at 00:00 AM as the initial number of bikes.

Target number of bikes

Since we don't know the company policy, it is difficult to find a reliable target number of bikes. By looking at the historical data (May 2013 to November 2013) we thus elaborated a simple policy to reduce the chance that of a user finding a station completely full or completely empty. For each station,

  1. we compute the minimum first quartile for the hourly distributions across the day,
  2. we compute the maximum third quartile for the hourly distributions across the day,
  3. we compute a displacement d for the distributions, so that both the minimum first quartile and the maximum third quartile are as far as possible respectively from 0 and the station capacity.

The target number of bikes, given the initial number of bikes p, is then p-d. Note that p is directly determined from the snapshot, i.e., the state of the system at a specific time, chosen to generate the instance.

Initial depot

To simplify the generation of instances, the initial depot is set at one of the stations (thus if a station is chosen as depot, it cannot be used as a station in the instance). For this set of instances, we choose the station with CitiBike ID 294, which is reasonably central with respect to the set of stations.

Choice of stations

Since we can generate instances of any size in [1..#stations], one issue is how to choose the stations to include in each instance, so that the instances are interesting. Our generator partitions the stations in stations with positive displacement (sinks), and stations with negative displacement (sources), and then chooses among the sets uniformly at random. The rationale behind this, is to generate instances where sources and sinks are balanced (in number), so that the range of the cost values is reasonably large.

Variability across instances

One could argue that the usage of the stations is more or less the same across the month, and thus the generated instances are similar. However, the sets of stations involved are different from instance to instance (less so in larger instances, as almost all stations will be included). Moreover, from day to day, there is a variability that in some cases (see Figure 2) is quite large.

Instance naming

Instance names are composed by the following parts, separated by a _ character

  1. the string CitiBike_NYC,
  2. the date in which the snapshot was taken, in YYMMDD format,
  3. the hour in which the snapshot was taken (in GMT+01 time and HHmm format),
  4. the CitiBike ID of the station used as starting depot,
  5. the number of stations used,
  6. the extension .bbs (no _ needed before this).


Data from CitiBike NYC is "public and free to use" (free both as in speech, and as in beer). The instances are distributed under the permissive MIT License (see LICENSE file). Credit is optional.