Realtime Online Spatiotemporal Topic Modeling

A library for unsupervised analysis and visualization of spatiotemporal data.


macOS Dependencies

Requirements: Homebrew package manager for installing dependencies.

  • $ brew tap homebrew/versions
  • $ brew tap homebrew/science
  • $ brew install opencv --cxx11 --with-ffmpeg
  • $ brew install boost --cxx11
  • $ brew install flann fftw libsndfile
  • $ pip install pysrt

Linux Dependencies

  • $ apt-get install libboost-all-dev libflann-dev libfftw3-dev libopencv-dev libsndfile1-dev cmake
  • $ pip install pysrt


Building

$ cd rost-cli
$ mkdir build
$ cd build
$ cmake ..
$ make

After building, all the executables are in the rost-cli/bin folder. It is recommended that you add this location to your PATH environment variable.


It is recommended you set the ROSTPATH environment variable in your .bashrc file to help ROST locate various resource files.

$ echo "export ROSTPATH=~/Projects/rost-cli/" >> ~/.bashrc
$ source ~/.bashrc

Tutorial: Sunshine

The sunshine application takes a video file or a camera stream, runs topic modeling on it, and visualizes the results. To run sunshine with default parameters using the default camera:

$ sunshine --camera=0

To run sunshine on a video file:

$ sunshine --video=/path/to/my/video.mp4

A dual-core i7 or faster processor is recommended for running sunshine in realtime on 640x480 video streams.

Tutorial: Topic Modeling using ROST

Let's assume you have a movie file called movie.mp4, which you would like to analyze using SoYummy. The workflow is as follows:

  • Extract words from the media file. Words are essentially quantized local features extracted from your media file. SoYummy comes with code to extract several different kinds of audio and video words from the media, in addition to text words taken from a subtitles file.
  • Mix different words files into one combined word file.
  • Run topic modeling
  • Run topic visualizer
  • Run summarizer
  • Run summary visualizer

Extracting words

Word file format

Each of the following programs produces a list of words for each time step, in CSV format. When --output-timestamps is enabled (the default), the first column of each line is the timestamp in milliseconds, followed by the word ids observed at that time step:

timestamp_ms, word_id_1, word_id_2, ..., word_id_n

The visual words command can be used to extract many different kinds of visual words from the media. Currently it supports:

  • Color words: color and intensity values at different points in an image. These words are uniformly spread over the entire image.
  • Texton words: they capture textures better than Gabor words in some cases.
  • Gabor words: they capture the orientation and scale of texture at different points in an image. These words are uniformly spread over the entire image.
  • ORB words: they describe local keypoints in an image. Extracting ORB words requires a vocabulary file, which can be thought of as a database of descriptions of all possible local patterns. A default vocabulary file is provided in the libvisualwords/data folder.

All of these word types can be extracted at the same time:

$ --help
  --help                                help
  --video arg                           Video file
  --camera arg                          Use Camera with the given id. 0 => 
                                        default source
  --image arg                           Image filename
  --subsample arg (=1)                  Subsampling rate for input sequence of 
                                        images and video.
  --fps arg (=-1)                       fps for input
  -N [ --numframe ] arg (=0)             Number of input images (0=no limit)
  --scale arg (=1)                      Scale image
  --output-timestamps arg (=1)          If true, first column of the words 
                                        output file has the timestamp.
  --images-out arg (=./images)          images are put in this folder
  --save-images                         save time stamped images
  --logfile arg (=visualwords.log)      Log file
  --gabor arg (=0)                      Enable Gabor words
  --gabor-cell-size arg (=32)           Gabor words cell size
  --gabor-out arg (=words.gabor.csv)    Output file name with extracted words
  --gabor-visualize arg (=0)            Show visualization of the features
  --color arg (=0)                      Enable Color words
  --color-cell-size arg (=32)           Color words cell size
  --color-out arg (=words.color.csv)    Output file name with extracted words
  --color-visualize arg (=0)            Show visualization of the features
  --color-no-intensity                  No intensity words
  --color-no-hue                        No hue words
  --orb arg (=0)                        Enable ORB words
  --orb-vocabulary arg (=/Users/misha/Projects/rost-cli/share/visualwords/orb_vocab/default.yml)
                                        Vocabulary file name
  --orb-num-features arg (=1000)        Number of features
  --orb-out arg (=words.orb.csv)        Output file name with extracted words
  --orb-visualize arg (=0)              Show visualization of the features
  --texton arg (=0)                     Enable Texton words
  --texton-vocabulary arg (=/Users/misha/Projects/rost-cli/share/visualwords/texton.vocabulary.baraka.1000.csv)
                                        Vocabulary file name
  --texton-cell-size arg (=64)          cell size
  --texton-out arg (=words.texton.csv)  Output file name with extracted words
  --texton-visualize arg (=0)           Show visualization of the features

$ --video=movie.mp4 --subsample=15 --scale=0.5 --texton=true --color=true --orb=true --orb-vocabulary=/Users/yogesh/Projects/rost-cli/libvisualwords/data/orb_vocab/default.yml 
Initializing Hue Words
Initializing Intensity Words
Writing color words to: words.color.csv
Writing gabor words to: words.gabor.csv
Initializing Feature BOW with detectors:ORB
Using ORB descriptor
Read vocabulary: 5000 32
Writing orb words to: words.orb.csv
Using word extractors: 
name -- V
HueLight -- 436
Gabor -- 4096
ORB -- 5000
Opening video file: ../movie.mp4
Video duration: 503.71seconds
Connected videofile >> subsample
Connected subsample >> scale
Done reading files. 

You will have three different word files after the above command finishes its execution: words.color.csv, words.texton.csv, and words.orb.csv. Note their corresponding vocabulary sizes, as these will be needed in the next step. The program also saves a log file (default name visualwords.log), which also contains the vocabulary information.
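These word files share the timestamped CSV format described earlier. A minimal sketch of reading one (the sample rows and ids below are made up for illustration):

```python
import csv
from io import StringIO

# Hypothetical contents of a words file: first column is the timestamp
# in milliseconds, the remaining columns are integer word ids.
sample = """0,12,7,303
625,7,7,91
1250,44
"""

def read_words(f):
    """Return a list of (timestamp_ms, [word_ids]) tuples."""
    rows = []
    for row in csv.reader(f):
        if not row:
            continue
        timestamp = int(row[0])
        word_ids = [int(w) for w in row[1:]]
        rows.append((timestamp, word_ids))
    return rows

docs = read_words(StringIO(sample))
print(docs[0])  # (0, [12, 7, 303])
```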

Face words

If you see the face of your supervisor, there is a good chance you are at work. If you see the face of Arnold Schwarzenegger in a movie, there is a good chance it's an action movie. To enable such inference, SoYummy implements the idea of face words, which combines detecting a face in an image with mapping it to the closest face in a database. The list of faces in the database can be thought of as a face vocabulary. SoYummy currently comes with a face vocabulary of 142 faces, trained using the Labeled Faces in the Wild dataset. Face detection is done using a Haar feature-based cascade classifier.
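The map-to-closest-face step can be sketched as a nearest-neighbor lookup. Everything below is illustrative: the real system uses FisherFaces or LBP features, while these 3-dimensional vectors and the tiny vocabulary are made up:

```python
import math

# Hypothetical face "vocabulary": one feature vector per known face.
face_vocab = {
    0: (0.9, 0.1, 0.3),
    1: (0.2, 0.8, 0.5),
    2: (0.4, 0.4, 0.9),
}

def nearest_face(feature):
    """Return the word id of the vocabulary face closest in L2 distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(face_vocab, key=lambda wid: dist(face_vocab[wid], feature))

print(nearest_face((0.85, 0.2, 0.25)))  # 0: closest to the first face
```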

$ words.extract.face --help
Extract visual words from a video file or camera.:
  --help                                help
  -d [ --detect-model ] arg (=./default_detect_model.xml)
                                        INPUT: Cascade classifier model 
                                        filename for detecting face.
  -o [ --recog-model ] arg (=./default_recog_model.yml.gz)
                                        INPUT: face recognition model filename
  -g [ --recog-type ] arg (=fisher)     fisher = FisherFaces, eigen = 
                                        EigenFaces, lbp = Locally Binary 
                                        Patterns
  -v [ --visualize ]                    visualize
  --video arg                           Video file
  --camera                              Use Camera
  --weight arg (=1000)                  number of words to emit for a full 
                                        screen face
  --subsample arg (=1)                  subsample (video file only)
  --scale arg (=1)                      video scaling. 0<scale<=1.0
  --threshold arg (=10)                 How many neighbors each candidate 
                                        rectangle should have to retain it.
  --out arg (=words.face.csv)           output timestamped csv word file

$ words.extract.face --detect-model=libfacewords/data/default_detect_model.xml --recog-model=libfacewords/data/default_recog_model.yml.gz --video=filename.ext --scale=0.5 --subsample=15

A default cascade classifier model is available in the libfacewords/data folder. We implement two different face recognition algorithms: Fisher Faces and Locally Binary Patterns. Model files for both algorithms are also available in the libfacewords/data folder.

MFCC audio words

Discretized MFCC audio words can be extracted from a WAV audio file. The extractor has only been tested with mono WAV files at a 44.1 kHz sampling rate; use mono files at that rate, or convert other files with:

$ ffmpeg -i file.ext -ac 1 -ar 44100 file.wav
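The --fft_buf_size and --overlap options below determine how the signal is sliced into analysis frames. A sketch of the frame-start computation, assuming the hop size is buf_size * (1 - overlap) (this matches the option descriptions but is an assumption about the exact implementation):

```python
def frame_starts(n_samples, buf_size=4096, overlap=0.1):
    """Start indices of successive analysis frames, where consecutive
    frames share `overlap` (a fraction < 1) of their samples."""
    hop = int(buf_size * (1.0 - overlap))  # samples to advance per frame
    return list(range(0, n_samples - buf_size + 1, hop))

# One second of 44.1 kHz mono audio with the default settings:
starts = frame_starts(44100, buf_size=4096, overlap=0.1)
print(starts[:3])  # [0, 3686, 7372]
print(len(starts))
```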
$ --help
Allowed options:
  --help                               produce help message
  --audio arg                          The wav file to be processed. Must be 
                                       specified.
  --vocab arg                          The name of an mfcc vocabulary file. 
                                       Must be specified
  --fft_buf_size arg (=4096)           Number of samples taken into account 
                                       when calculating the fft and mfcc.
  --overlap arg (=0.10000000000000001) Amount of overlap between successive 
                                       mfccs. Must be < 1.
  --out arg (=words.mfcc.csv)          The name of the file where the output 
                                       labels will be saved.
  --out-mfcc arg                       The name of the file where the raw mfcc 
                                       output will be saved.

A default vocabulary is available in the libaudiowords/data folder.

$ --audio=file.wav --vocab=libaudiowords/data/vocab/MontrealSounds2k.txt

Subtitles text words

A subtitles file in .srt format can be processed using the words.extract.subtitles program, which finds the stem of each word using the Lancaster stemmer and looks up its index in the provided vocabulary file.
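The stem-and-lookup step can be sketched as follows. A crude suffix-stripping normalizer stands in for the Lancaster stemmer, and the three-word vocabulary is made up:

```python
def crude_stem(word):
    """Very crude stand-in for the Lancaster stemmer: lowercase, strip
    punctuation, and drop a few common suffixes."""
    w = word.lower().strip(".,!?")
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

vocab = ["run", "jump", "fall"]  # hypothetical vocabulary file contents
index = {w: i for i, w in enumerate(vocab)}

def word_ids(text):
    """Map each word in `text` to its vocabulary index, skipping unknowns."""
    return [index[s] for s in map(crude_stem, text.split()) if s in index]

print(word_ids("He runs, she jumped, it falls"))  # [0, 1, 2]
```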

$ words.extract.subtitles --help
Usage: words.extract.subtitles [options]

Extracts timestamped word ids from a subtitles file, given a vocabulary.

  -h, --help            show this help message and exit
  -i SRT_FILE, --in=SRT_FILE
                        input subtitles file
  -v VOCAB_FILE, --vocab=VOCAB_FILE
                        input vocabulary file
  -o WORDS_FILE, --out=WORDS_FILE
                        output word list
  -t TIMESTEP, --timestep=TIMESTEP
                        time step in milliseconds

The program requires a vocabulary file, which can be generated using words.vocab.subtitles program.

$ words.vocab.subtitles --help
Usage: words.vocab.subtitles [options]

Appends unique words from a subtitles file to a vocabulary file. If the
vocabulary file does not exist, then creates it.

  -h, --help            show this help message and exit
  -i SRT_FILE, --in=SRT_FILE
                        input subtitles file
  -v VOCAB_FILE, --vocab=VOCAB_FILE
                        vocabulary file

If, for example, you have many different subtitles files in the following directory structure:


then you can generate a combined vocabulary using the following command:

$ for i in `find movies/ -name '*.srt'`; do words.vocab.subtitles -i $i -v vocab.csv; done

We can then generate a word CSV file for a given subtitles file using the command:

$ words.extract.subtitles -i movie.srt -v vocab.csv -o words.subtitles.csv -t 100

Mixing words

We now need to merge the word files generated above while taking the timestamps into account. This is done by the words.mix program. The program takes a list of word CSV files and the corresponding vocabulary sizes, and outputs a combined words file in CSV format. It automatically maps each word id to a new globally unique word id, using the provided vocabulary sizes to compute offsets. The program also takes a time-step value (in milliseconds), and combines all the words from all the sources that fall within each time step.

To figure out a suitable timestep, look at your subsampling rates from the word extraction steps. Pick a word type that is extracted at every time stamp, such as color or audio, and check the time step in the file. For example, if the timestamps differ by 625 ms (the first number in each line of the words file), then double that for use in the following command:

$ words.mix --timestep=1250 -o words.all.csv -i 436 words.color.csv 1000 words.texton.csv 5000 words.orb.csv 2000 words.mfcc.csv 142 words.face.csv

This will produce words.all.csv which has the mixed words, and a mixwords.log file, which shows the combined vocabulary size.
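The offset remapping and timestep binning that words.mix performs can be sketched as follows. This is a simplified model: the sources, vocabulary sizes, and rows below are made up, and the real program's exact ordering and output format may differ:

```python
def mix(sources, timestep_ms):
    """sources: list of (vocab_size, rows) pairs, where each row is
    (timestamp_ms, [word_ids]). Returns (mixed_rows, combined_vocab_size)."""
    # Cumulative offsets make each source's ids globally unique.
    offsets, total = [], 0
    for vocab_size, _ in sources:
        offsets.append(total)
        total += vocab_size
    # Bin words from all sources into shared timesteps.
    bins = {}
    for (vocab_size, rows), offset in zip(sources, offsets):
        for t, words in rows:
            bins.setdefault(t // timestep_ms, []).extend(w + offset for w in words)
    mixed = sorted((b * timestep_ms, ws) for b, ws in bins.items())
    return mixed, total

# Two hypothetical sources: color (vocab 436) and ORB (vocab 5000).
color = (436, [(0, [1, 2]), (625, [3])])
orb = (5000, [(0, [10]), (1250, [20])])
mixed, vocab_size = mix([color, orb], timestep_ms=1250)
print(vocab_size)  # 5436: the combined vocabulary size
print(mixed[0])    # words from t=0 and t=625 land in the same bin
```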

Topic modeling

This is where all the magic happens.

$ topics.refine.t --help
Topic modeling of data with 1 dimensional structure.:
  --help                                help
  -i [ --in.words ] arg (=/dev/stdin)   Word frequency count file. Each line is
                                        a document/cell, with integer 
                                        representation of words. 
  --in.words.delim arg (=,)             delimiter used to seperate words.
  --out.topics arg (=topics.csv)        Output topics file
  --out.topics.ml arg (=topics.maxlikelihood.csv)
                                        Output maximum likelihood topics file
  --out.topicmodel arg (=topicmodel.csv)
                                        Output topic model file
  --in.topicmodel arg                   Input topic model file
  --in.topics arg                       Initial topic labels
  --logfile arg (=topics.log)           Log file
  --ppx.rate arg (=10)                  Every _ iterations report perplexity.
  --ppx.out arg (=perplexity.csv)       Perplexity score for each timestep
  -V [ --vocabsize ] arg                Vocabulary size.
  -K [ --ntopics ] arg (=100)           Topic size.
  -n [ --iter ] arg (=1000)             Number of iterations
  -a [ --alpha ] arg (=0.10000000000000001)
                                        Controls the sparsity of theta. Lower 
                                        alpha means the model will prefer to 
                                        characterize documents by few topics
  -b [ --beta ] arg (=1)                Controls the sparsity of phi. Lower 
                                        beta means the model will prefer to 
                                        characterize topics by few words.
  -l [ --online ]                       Do online learning; i.e., output topic 
                                        labels after reading each 
                                        document/cell.
  --online.mint arg (=100)              Minimum time in ms to spend between new
                                        observation timestep.
  --tau arg (=0.5)                      [0,1], Ratio of local refinement (vs 
                                        global refinement).
  --refine.weight.local arg (=0.5)      [0,1], High value implies more 
                                        importance to present time. 
                                        (GeometricDistribution(X))
  --refine.weight.global arg (=0.5)     [0,1], High value implies more 
                                        importance to present time.
  --threads arg (=4)                    Number of threads to use.
  --g.time arg (=1)                     Depth of the temporal neighborhood (in 
                                        #cells)
  --g.space arg (=1)                    Depth of the spatial neighborhood (in 
                                        #cells)
  --cell.time arg (=1)                  cell width in time dim
  --cell.space arg (=32)                cell width in space dim
  --in.topicmask arg                    Mask file for topics. Format is k lines
                                        of 0 or 1, where 0 => don't use the 
                                        topic
  --in.topics.add arg (=1)              Add the given initial topic labels to 
                                        topic model. Only applicable when a 
                                        topic model and topics are provided
  --out.intermediate.topics arg (=0)    output intermediate topics
  --out.topics.online arg               topic labels computed online (only 
                                        valid in online mode)
  --in.position arg                     Word position csv file.
  --out.position arg (=topics.position.csv)
                                        Position data for topics.
  --topicmodel.update arg (=1)          Update global topic model with each 
                                        iteration
  --out.ppx.online arg                  Perplexity score for each timestep, 
                                        immediately after it has been observed.
  --batch.maxtime arg (=0)              Maximum time in milliseconds spent on 
                                        processing the data. 0 implies no 
                                        maximum.
  --retime arg (=1)                     If this option is given, then the 
                                        timestamp from the words is ignored, 
                                        and a sequential time is given to each 
                                        document/cell.

$ topics.refine.t --in.words=words.all.csv --iter=100 --alpha=0.1 --beta=0.5 --vocabsize=11674 -K 20

Here --vocabsize is the combined vocabulary size reported in mixwords.log.
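ROST belongs to the LDA family of topic models, where refinement repeatedly resamples each word's topic label. A minimal sketch of the smoothed sampling weights in the standard collapsed-Gibbs form, showing how alpha and beta enter (the counts below are hypothetical, and this is the generic LDA update, not ROST's exact spatiotemporal variant):

```python
def topic_weights(n_dk, n_kw, n_k, V, alpha=0.1, beta=0.5):
    """Sampling distribution over topics for one word w in document d:
    p(k) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta), where n_dk is
    the topic count in the document, n_kw the word count in the topic,
    n_k the topic's total count, and V the vocabulary size."""
    weights = [
        (n_dk[k] + alpha) * (n_kw[k] + beta) / (n_k[k] + V * beta)
        for k in range(len(n_dk))
    ]
    total = sum(weights)
    return [w / total for w in weights]

# 3 topics, vocabulary of 10 words; made-up counts for one word.
p = topic_weights(n_dk=[5, 1, 0], n_kw=[3, 0, 0], n_k=[20, 10, 5], V=10)
print([round(x, 3) for x in p])  # [0.945, 0.049, 0.007]: topic 0 dominates
```

Lower alpha concentrates mass on the topics a document already uses, and lower beta concentrates a topic on the words it already explains, which is exactly the sparsity behavior the help text above describes.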

Output files:

  • topics.csv: topic labels corresponding to each input word at each timestep. This file has exactly the same format as the input words file.
  • topics.maxlikelihood.csv: maximum likelihood topic labels for each word. These labels, not the ones in topics.csv, should be used for any classification or summarization task.
  • perplexity.csv
  • topics.log

Rendering topics

We can produce a visualization of each topic by selecting the parts of a clip that have a high representation of that topic. To do this, we first need to produce a topic histogram file using the words.bincount program and the topics.maxlikelihood.csv file produced by the topic modeler, which is a list of topic labels for each time step.

$ words.bincount --help
Given timestamped words or topics file, outputs timestamped distributions.:
  --help                              help
  -i [ --in.words ] arg (=/dev/stdin) timestamped word list csv file
  --in.words.delim arg (=,)           delimiter used to seperate words.
  -o [ --out ] arg (=/dev/stdout)     Output histogram
  -V [ --vocabsize ] arg (=0)         Vocabulary size. 0 => use the largest 
                                      wordid as vocab size.
  --alpha arg (=1)                    histogram smoothing (only used when 
                                      normalizing)
  --normalize                         Normalize the distribution so that 
                                      everything sums to 1.0
$ words.bincount -i topics.maxlikelihood.csv -o topics.hist.csv -V 20

Here the -V argument is the number of topics.
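What words.bincount computes can be sketched as a per-timestep histogram with additive smoothing and optional normalization (the input rows below are made up; the real program reads and writes CSV):

```python
def bincount(rows, vocab_size, alpha=1.0, normalize=False):
    """rows: list of (timestamp, [ids]). Returns (timestamp, histogram)
    pairs, smoothed by adding `alpha` to every bin."""
    out = []
    for t, ids in rows:
        hist = [alpha] * vocab_size  # additive smoothing
        for i in ids:
            hist[i] += 1
        if normalize:
            total = sum(hist)
            hist = [h / total for h in hist]
        out.append((t, hist))
    return out

# Two hypothetical timesteps of topic labels, 3 topics total.
rows = [(0, [2, 2, 0]), (1250, [1])]
hists = bincount(rows, vocab_size=3, alpha=1.0, normalize=True)
print(hists[0])  # smoothed counts [2, 1, 3], normalized to sum to 1
```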


To produce a summary, run the summary.kcenters program on the topic histogram file:

$ summary.kcenters -i topics.hist.csv --kcenters-pp -S 20

This will produce a kcenters file that can be fed into the summary renderer to produce an overall summary montage clip.
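The idea behind k-centers summarization can be sketched as a greedy farthest-point selection over the topic histograms. This is illustrative only: the actual program's seeding (e.g. the --kcenters-pp variant) and distance function may differ, and the histograms below are made up:

```python
def l1(a, b):
    """L1 distance between two histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))

def kcenters(hists, k):
    """Greedily pick k representative histogram indices: repeatedly add
    the point farthest from its nearest already-chosen center."""
    centers = [0]  # start from the first timestep
    while len(centers) < k:
        d = [min(l1(h, hists[c]) for c in centers) for h in hists]
        centers.append(max(range(len(hists)), key=lambda i: d[i]))
    return centers

# Four hypothetical timestep histograms over 3 topics.
hists = [[1, 0, 0], [0.9, 0.1, 0], [0, 0, 1], [0, 1, 0]]
print(kcenters(hists, 2))  # [0, 2]: the all-topic-2 step is farthest
```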

Visualizing results

Rendering the results requires ffmpeg version >= 1.1.

Rendering a summary

$ summary.render --help
Usage: summary.render [options]

  -h, --help            show this help message and exit
  -s FILE, --summary=FILE
                        read summary from FILE
  -v FILE, --video=FILE
                        read video from FILE
  -o FILE, --out=FILE   output summary video FILE
  -w WINDOW, --window=WINDOW
                        Window size in milliseconds

$ summary.render -s summary.kcenters-pp.csv -v filename.ext -o summary_filename.ext -w 1000

Given the topic histogram file, we can now render topic visualizations using topics.render program.

$ topics.render --help
Usage: topics.render [options]

  -h, --help            show this help message and exit
  -i FILE, --topichist=FILE
                        input topic histogram FILE
  -v FILE, --video=FILE
                        input video from FILE
  -o PREFIX, --out-prefix=PREFIX
                        output video prefix. each video file name is
                        PREFIX<i>.<ext>, where <i> is the topic number, and ext
                        is same as input video file's extension
  -w WINDOW, --window=WINDOW
                        window size in milliseconds for each clip
  -n INHIBITION, --inhibition=INHIBITION
                        after selecting a timestamp, do not select another one
                        within this time radius

$ topics.render -w 1000 -i topics.hist.csv -v filename.ext -o filename

Building your own vocabularies

  • ORB (Oriented BRIEF) video vocabulary
$ words.vocab.orb --video=movie.mp4 --subsample 30 --scale=0.5 --task=train --show.keypoints=false
  • Face vocabulary words.vocab.face
  • Text vocabulary (subtitles)
  • MFCC audio vocabulary