Code for generating spatio-temporal proposals from videos.
For more details, check out our paper.

The algorithm has two steps:

  • The clustering algorithm, in the svx directory.
  • The proposal generation algorithm, in the rp directory.


Installing the dependencies:

# Python packages for Ubuntu or Debian.
sudo apt-get install python-numpy python-scipy python-matplotlib
sudo apt-get install swig

# Python packages for Fedora.
sudo yum install numpy scipy python-matplotlib
sudo yum install swig

# Boost library for Ubuntu or Debian.
sudo apt-get install libboost-all-dev

# Boost library for Fedora.
sudo yum install boost-devel

# Structured edge detection and Piotr's toolbox.
cd svx

# Large displacement optical flow.

To compile the SWIG code, run the following command (again from the svx directory):

make all

Getting started

This is a short tutorial on how to use the code.
You can use the following data, the frames corresponding to the first video in the UCF Sports collection:


Dataset class

First you need to define a class with the following methods:

  • get_images_path(video) returns the path to the video frames.
  • get_edges_path(video) returns the path to the edges (SED features).
  • get_flow_path(video, direction) returns the path to the flow (direction can be forward or backward).
  • get_segmentation_directory(video) returns the path to where the segmentation will be stored (as images).

Add an instance of the class into the DATASETS dictionary, in the
For the UCF Sports dataset an example is already provided.
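
As an illustration, a dataset class with the four methods listed above might look like the sketch below. The class name, the directory layout and the local DATASETS dictionary are assumptions made for this example; they are not taken from the code base.

```python
import os


class MyDataset(object):
    """Hypothetical dataset; the directory layout below is an assumption."""

    def __init__(self, base_path):
        self.base_path = base_path

    def get_images_path(self, video):
        # Directory holding the frames of the video.
        return os.path.join(self.base_path, 'images', video)

    def get_edges_path(self, video):
        # Directory holding the edges (SED features).
        return os.path.join(self.base_path, 'edges', video)

    def get_flow_path(self, video, direction):
        # The direction is either forward or backward.
        if direction not in ('forward', 'backward'):
            raise ValueError('direction must be forward or backward')
        return os.path.join(self.base_path, 'flow', direction, video)

    def get_segmentation_directory(self, video):
        # Directory where the segmentation images will be written.
        return os.path.join(self.base_path, 'segmentation', video)


# Stand-in for the DATASETS dictionary; the key is a made-up dataset name.
DATASETS = {'my_dataset': MyDataset('data/my_dataset')}
```

The key used in the dictionary would then be the value passed to the -d option.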

Extracting features for segmentation

The script generates a list of commands for extracting the features.
You can either execute the commands one by one or launch them in parallel (for example, using GNU parallel):

python --video 001 -d ucf_sports -f edges flow-forward flow-backward | while read c; do eval $c; done
python --video 001 -d ucf_sports -f edges flow-forward flow-backward | parallel 'eval {}'
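
If GNU parallel is not available, the generated commands can also be launched concurrently from Python. The following is a minimal sketch, assuming Python 3 and one shell command per list entry; the function name run_commands is hypothetical and is not part of the code base.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_commands(commands, workers=4):
    """Run shell commands concurrently; return their exit codes in input order."""
    def run(command):
        # Each command is handed to the shell, as eval does in the lines above.
        return subprocess.run(command, shell=True).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, commands))
```

A non-zero exit code in the returned list indicates which command failed.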

Hierarchical clustering

The script computes the segmentation of the video.
For testing and debugging purposes,
you can use only a subset of frames (by specifying the start and end arguments) and
you can visualize various steps of the algorithm (by supplying arguments to the --viz option).

python --video 001 -d ucf_sports --start 0 --end 5 -vv --viz mb edges
python --video 001 -d ucf_sports

Randomized merging algorithm

The folder rp (standing for Random Prim) contains the scripts needed for generating spatio-temporal object proposals.
Before generating the proposals, we first compute weights between super-voxels based on color, flow and geometric features (these features are explained in the paper).
The script computes distances between each pair of neighbouring super-voxels for each feature.
We combine the distances for the eight different features using a learnt weight combination;
you can use our learnt weight combination: download it from here and copy it into the data folder.
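
To make the combination step concrete, here is a minimal sketch that assumes the combination is a plain weighted sum of the per-feature distances; the actual form used by the code may differ. The eight feature names are our reading of the proposal file name that appears later in this README, and the weight and distance values are placeholders.

```python
# Feature names as read from the proposal file name (an assumption).
FEATURES = ['color', 'flow', 'size', 'fill',
            'size_static', 'fill_static', 'size_time', 'fill_time']


def combine_distances(distances, weights):
    """Weighted sum of the per-feature distances of one super-voxel pair."""
    return sum(weights[f] * distances[f] for f in FEATURES)


weights = {f: 1.0 / len(FEATURES) for f in FEATURES}  # placeholder weights
distances = {f: 0.5 for f in FEATURES}                # placeholder distances
edge_weight = combine_distances(distances, weights)
```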

Here are examples of computing the weights between super-voxels and then generating the proposals:

python --video 001 -d ucf_sports -l 100
python --video 001 -d ucf_sports -l 100 -n 100

The proposals are stored as a three-dimensional matrix:

  • the first axis corresponds to the proposal;
  • the second axis corresponds to the frame number;
  • the third axis corresponds to the bounding box (hence it has length four: the first two values give the bottom corner and the last two give the top corner of the bounding box).
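
The layout above can be illustrated with a small synthetic array. This is only a sketch: the values are made up, and it does not show how the proposal file itself is read.

```python
import numpy as np

n_proposals, n_frames = 3, 5
# Synthetic proposals: every box is (0, 0, 10, 20), used only for illustration.
proposals = np.tile(np.array([0, 0, 10, 20]), (n_proposals, n_frames, 1))

tube = proposals[1]              # all boxes of the second proposal
box = proposals[1, 2]            # its box in the third frame
bottom, top = box[:2], box[2:]   # bottom corner and top corner
widths = proposals[..., 2] - proposals[..., 0]  # box width per proposal and frame
```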

The proposals are by default stored in NumPy format,
but they can be stored in MATLAB format as well,
by specifying the option --format mat to the script.

If you wish to visualize the generated proposals, you can use the script as in the following example:

python --video 001 -d ucf_sports -n 5 -p ../data/ucf_sports/proposals/001/proposals_level_100_features_color_flow_size_fill_size_static_fill_static_size_time_fill_time_no_temp_constraint_False.dat

If you wish to evaluate the quality of the proposals for a given video in terms of best average overlap (BAO) or correct localization (CorLoc20, CorLoc50),
then you can use the script.
The evaluation script needs to access the groundtruth tube.
The parse_groundtruth method in the svx/ file loads the groundtruth:
a bounding box for each frame that contains an annotation.
More precisely, the function parse_groundtruth should return the following:

  • frame number;
  • x and y coordinates of the bottom left corner;
  • x and y coordinates of the top right corner.
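
The following is a minimal sketch of how such groundtruth rows can be compared with the proposals. It assumes that the overlap between a proposal and the groundtruth is the per-frame intersection over union averaged over the annotated frames, that BAO is the maximum of this value over all proposals, and that frame numbers index directly into each proposal. The function names are hypothetical, and this is not a copy of the evaluation script.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)


def average_overlap(tube, groundtruth):
    """Mean IoU over annotated frames; rows are (frame, x1, y1, x2, y2)."""
    overlaps = [box_iou(tube[frame], (x1, y1, x2, y2))
                for frame, x1, y1, x2, y2 in groundtruth]
    return sum(overlaps) / len(overlaps)


def best_average_overlap(proposals, groundtruth):
    """Largest average overlap achieved by any proposal."""
    return max(average_overlap(tube, groundtruth) for tube in proposals)


# Made-up example: two annotated frames and two proposals.
groundtruth = [(0, 0, 0, 10, 10), (1, 0, 0, 10, 10)]
proposals = [
    [(0, 0, 10, 10), (0, 0, 5, 10)],
    [(20, 20, 30, 30), (20, 20, 30, 30)],
]
bao = best_average_overlap(proposals, groundtruth)
is_localized_50 = bao >= 0.5  # a thresholded variant, in the spirit of CorLoc50
```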

Here is an example of how to call the script to compute the BAO for the first video (video 001) of the UCF Sports dataset:

python -d ucf_sports --video 001 -p ../data/ucf_sports/proposals/001/proposals_level_100_features_color_flow_size_fill_size_static_fill_static_size_time_fill_time_no_temp_constraint_False.dat -m mbao

You can get the original or our annotations as follows (make sure you are in the top-level directory and not in the rp directory):