THIS PROJECT HAS MOVED TO WARPLab
Realtime Online Spatiotemporal Topic Modeling
A library for unsupervised analysis and visualization of spatiotemporal data.
Requirements: on OS X, use the Homebrew package manager to install the dependencies:
$ brew tap homebrew/versions
$ brew tap homebrew/science
$ brew install opencv --cxx11 --with-ffmpeg
$ brew install boost --cxx11
$ brew install flann fftw libsndfile
$ pip install pysrt
On Ubuntu/Debian, install the equivalent packages with apt-get:
$ apt-get install libboost-all-dev libflann-dev libfftw3-dev libopencv-dev libsndfile1-dev cmake
$ pip install pysrt
$ cd rost-cli
$ mkdir build
$ cd build
$ cmake ..
$ make
After building, all the executables are in the
rost-cli/bin folder. It is recommended that you add this location to your
PATH environment variable.
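For example, assuming the repository is checked out at ~/Projects/rost-cli:
$ echo 'export PATH="$PATH:$HOME/Projects/rost-cli/bin"' >> ~/.bashrc
$ source ~/.bashrc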
It is recommended that you set the ROSTPATH environment variable in your
.bashrc file to help ROST locate various resource files:
$ echo "export ROSTPATH=~/Projects/rost-cli/" >> ~/.bashrc
$ source ~/.bashrc
The sunshine application takes a video file or a camera stream, runs topic modeling on it, and visualizes the results. To run
sunshine with default parameters using the default camera:
$ sunshine --camera=0
To run sunshine on a video file:
$ sunshine --video=/path/to/my/video.mp4
A dual-core i7 or faster processor is recommended for running sunshine in realtime on video streams of size 640x480.
Tutorial: Topic Modeling using ROST
Let's assume you have a movie file called
movie.mp4, which you would like to analyze using SoYummy. The workflow is as follows (a condensed end-to-end sketch appears after this list):
- Extract words from the media file. Words are essentially quantized local features extracted from your media file. SoYummy comes with code to extract several different kinds of audio and video words from the media, in addition to the actual text words that can be taken from a subtitles file.
- Mix the different word files into one combined word file.
- Run topic modeling
- Run topic visualizer
- Run summarizer
- Run summary visualizer
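Each of these steps is described in detail below. As a roadmap, a minimal end-to-end run might look like the following sketch. The vocabulary sizes here (436, 1000, 5000) are the illustrative values used later in this tutorial; substitute whatever your own extraction log reports.
# 1. Extract visual words (color, texton, ORB) from the movie.
$ words.extract.video --video=movie.mp4 --subsample=15 --scale=0.5 --color=true --texton=true --orb=true
# 2. Mix the word files into one stream (vocabulary sizes come from visualwords.log).
$ words.mix --timestep=1250 -o words.all.csv -i 436 words.color.csv 1000 words.texton.csv 5000 words.orb.csv
# 3. Run topic modeling (vocabsize = 436 + 1000 + 5000).
$ topics.refine.t --in.words=words.all.csv --iter=100 --alpha=0.1 --beta=0.5 --vocabsize=6436 -K 20
# 4. Bin the topic labels per timestep and render per-topic clips.
$ words.bincount -i topics.maxlikelihood.csv -o topics.hist.csv -V 20
$ topics.render -w 1000 -i topics.hist.csv -v movie.mp4 -o movie
# 5. Summarize and render the summary montage.
$ summary.kcenters -i topics.hist.csv --kcenters-pp -S 20
$ summary.render -s summary.kcenters-pp.csv -v movie.mp4 -o summary.mp4 -w 1000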
Word file format
Each of the following programs produces a list of words for each time step, in CSV format: the first column of each line is the timestamp (in milliseconds), followed by the integer word ids observed at that time step.
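For example, a hypothetical words file with two time steps (at 0 ms and 625 ms) might look like:
0,12,407,35,35,101
625,12,12,88,407
The word ids here are purely illustrative; a repeated id simply means that word was observed multiple times in that time step.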
Visual words
The words.extract.video command can be used to extract many different kinds of visual words from the media. Currently it supports:
- Color words: Color and intensity values at different points in an image. These words are uniformly spread over the entire image.
- Texton words: They capture textures, and do so better than Gabor words in some cases.
- Gabor words: They capture orientation and scale of texture at different points in an image. These words are uniformly spread over the entire image.
- ORB words: They describe local keypoints in an image. Extracting ORB words requires a vocabulary file, which can be thought of as a database of descriptions of all possible local patterns. A default vocabulary file is provided in the share/visualwords/orb_vocab folder.
All of these word types can be extracted at the same time:
$ words.extract.video --help
  --help                                help
  --video arg                           Video file
  --camera arg                          Use Camera with the given id. 0 => default source
  --image arg                           Image filename
  --subsample arg (=1)                  Subsampling rate for input sequence of images and video.
  --fps arg (=-1)                       fps for input
  -N [ --numframe ] arg (=0)            Number of input images (0=no limit)
  --scale arg (=1)                      Scale image
  --output-timestamps arg (=1)          If true, first column of the words output file has the timestamp.
  --images-out arg (=./images)          images are put in this folder
  --save-images                         save time stamped images
  --logfile arg (=visualwords.log)      Log file
  --gabor arg (=0)                      Enable Gabor words
  --gabor-cell-size arg (=32)           Gabor words cell size
  --gabor-out arg (=words.gabor.csv)    Output file name with extracted words
  --gabor-visualize arg (=0)            Show visualization of the features
  --color arg (=0)                      Enable Color words
  --color-cell-size arg (=32)           Color words cell size
  --color-out arg (=words.color.csv)    Output file name with extracted words
  --color-visualize arg (=0)            Show visualization of the features
  --color-no-intensity                  No intensity words
  --color-no-hue                        No hue words
  --orb arg (=0)                        Enable ORB words
  --orb-vocabulary arg (=/Users/misha/Projects/rost-cli/share/visualwords/orb_vocab/default.yml)
                                        Vocabulary file name
  --orb-num-features arg (=1000)        Number of features
  --orb-out arg (=words.orb.csv)        Output file name with extracted words
  --orb-visualize arg (=0)              Show visualization of the features
  --texton arg (=0)                     Enable Texton words
  --texton-vocabulary arg (=/Users/misha/Projects/rost-cli/share/visualwords/texton.vocabulary.baraka.1000.csv)
                                        Vocabulary file name
  --texton-cell-size arg (=64)          cell size
  --texton-out arg (=words.texton.csv)  Output file name with extracted words
  --texton-visualize arg (=0)           Show visualization of the features

$ words.extract.video --video=movie.mp4 --subsample=15 --scale=0.5 --texton=true --color=true --orb=true --orb-vocabulary=/Users/yogesh/Projects/rost-cli/libvisualwords/data/orb_vocab/default.yml
Initializing Hue Words
Initializing Intensity Words
Writing color words to: words.color.csv
Writing gabor words to: words.gabor.csv
Initializing Feature BOW with detectors:ORB
Using ORB descriptor
Read vocabulary: 5000 32
Writing orb words to: words.orb.csv
Using word extractors:
  name -- V
  HueLight -- 436
  Gabor -- 4096
  ORB -- 5000
Opening video file: ../movie.mp4
Video duration: 503.71 seconds
Connected videofile >> subsample
Connected subsample >> scale
Done reading files.
You will have three different word files after the above command finishes its execution: words.color.csv, words.texton.csv, and
words.orb.csv. Note their corresponding vocabulary sizes, as these will be needed in the next step. The program also saves a log file (by default
visualwords.log), which also contains the vocabulary information.
Face words
If you see the face of your supervisor, there is a good chance you are at work. If you see the face of Arnold Schwarzenegger in a movie, there is a good chance it's an action movie. To enable such inference, SoYummy implements the idea of face words, which combines detecting a face in an image with mapping it to the closest face in a database. The list of faces in the database can be thought of as a face vocabulary. SoYummy currently comes with a face vocabulary of 142 faces, trained using the Labeled Faces in the Wild dataset. Face detection is done using a Haar feature-based cascade classifier.
$ words.extract.face --help
Extract visual words from a video file or camera.:
  --help                                help
  -d [ --detect-model ] arg (=./default_detect_model.xml)
                                        INPUT: Cascade classifier model filename for detecting face.
  -o [ --recog-model ] arg (=./default_recog_model.yml.gz)
                                        INPUT: face recognition model filename
  -g [ --recog-type ] arg (=fisher)     fisher = FisherFaces, eigen = EigenFaces, lbp = Locally Binary Pattern
  -v [ --visualize ]                    visualize
  --video arg                           Video file
  --camera                              Use Camera
  --weight arg (=1000)                  number of words to emit for a full screen face
  --subsample arg (=1)                  subsample (video file only)
  --scale arg (=1)                      video scaling. 0<scale<=1.0
  --threshold arg (=10)                 How many neighbors each candidate rectangle should have to retain it.
  --out arg (=words.face.csv)           output timestamped csv word file
$ words.extract.face --detect-model=libfacewords/data/default_detect_model.xml --recog-model=libfacewords/data/default_recog_model.yml.gz --video=filename.ext --scale=0.5 --subsample=15
A default cascade classifier model is available in the
libfacewords/data folder. We implement two different face recognition algorithms: Fisher Faces and Locally Binary Patterns. Model files for both of these algorithms are also available in the same folder.
MFCC audio words
Discretized MFCC audio words can be extracted from a WAV audio file using
words.extract.audio, which has only been tested with mono WAV files at a 44.1 kHz sampling rate. Mono files at other sampling rates may be used at your own risk, or converted into a compatible file with ffmpeg:
$ ffmpeg -i file.ext -ac 1 -ar 44100 file.wav
$ words.extract.audio --help
Allowed options:
  --help                        produce help message
  --audio arg                   The wav file to be processed. Must be specified.
  --vocab arg                   The name of an mfcc vocabulary file. Must be specified
  --fft_buf_size arg (=4096)    Number of samples taken into account when calculating the fft and mfcc.
  --overlap arg (=0.10000000000000001)
                                Amount of overlap between successive mfccs. Must be < 1.
  --out arg (=words.mfcc.csv)   The name of the file where the output labels will be saved.
  --out-mfcc arg                The name of the file where the raw mfcc output will be saved.
A default vocabulary is available in the libaudiowords/data/vocab folder:
$ words.extract.audio --audio=file.wav --vocab=libaudiowords/data/vocab/MontrealSounds2k.txt
Subtitles text words
Subtitles files in the
.srt format can be processed using the
words.extract.subtitles program, which finds the stem of each word using the Lancaster stemmer and looks up its index in the provided vocabulary file.
$ words.extract.subtitles --help
Usage: words.extract.subtitles [options]

Extracts timestamped word ids from a subtitles file, given a vocabulary.

Options:
  -h, --help            show this help message and exit
  -i SRT_FILE, --in=SRT_FILE
                        input subtitles file
  -v VOCAB_FILE, --vocab=VOCAB_FILE
                        input vocabulary file
  -o WORD_CSV_FILE, --out=WORD_CSV_FILE
                        output word list
  -t TIMESTEP, --timestep=TIMESTEP
                        time step in milliseconds
The program requires a vocabulary file, which can be generated using words.vocab.subtitles:
$ words.vocab.subtitles --help
Usage: words.vocab.subtitles [options]

Appends unique words from a subtitles file to a vocabulary file. If the
vocabulary file does not exist, then creates it.

Options:
  -h, --help            show this help message and exit
  -i SRT_FILE, --in=SRT_FILE
                        input subtitles file
  -v VOCAB_FILE, --vocab=VOCAB_FILE
                        vocabulary file
If, for example, you have many different subtitles files in the directory structure:
movies/movie1/subtitles.srt
movies/movie2/subtitles.srt
movies/movie3/subtitles.srt
movies/movie4/subtitles.srt
...
then you can generate a combined vocabulary using the following command:
$ for i in `find movies/* -name subtitles.srt`; do words.vocab.subtitles -i $i -v vocab.csv; done
We can then generate a word CSV file for a given subtitles file using the command:
$ words.extract.subtitles -i subtitles.srt -v vocab.csv -o words.subtitles.csv -t 100
Mixing word files
We now need to merge the word files generated above, while taking their timestamps into account. This is done by the
words.mix program. It takes a list of word CSV files and their corresponding vocabulary sizes, and outputs a combined words file in CSV format. Each word id is automatically mapped to a new, globally unique word id, using the provided vocabulary sizes to compute offsets. The program also takes a time-step value (in milliseconds), and combines all the words from all the sources that fall within each time step.
To figure out the proper timestep, look at the subsampling rate used in the word extraction steps. Pick a word type that is extracted at every time step, such as color or audio words, and check the spacing of the timestamps in its file. For example, if consecutive timestamps differ by 625 ms (the first number in each line of the words file), then double that value.
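A quick way to check this spacing is to look at the first column of a words file, for example (assuming color words were written to words.color.csv):
$ head -n 2 words.color.csv | cut -d, -f1
If this prints, say, 0 and 625, the extraction time step is 625 ms, and you would pass --timestep=1250 to words.mix, as in the following command: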
$ words.mix --timestep=1250 -o words.all.csv -i 436 words.color.csv 1000 words.texton.csv 5000 words.orb.csv 2000 words.mfcc.csv 142 words.face.csv
This will produce words.all.csv, which contains the mixed words, and a mixwords.log file, which records the combined vocabulary size.
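To make the offset arithmetic concrete (assuming word ids are offset in the order the files are listed on the command line), the mix above maps local ids to global ids as follows:
color  words: global ids 0    - 435   (offset 0)
texton words: global ids 436  - 1435  (offset 436)
ORB    words: global ids 1436 - 6435  (offset 1436)
MFCC   words: global ids 6436 - 8435  (offset 6436)
face   words: global ids 8436 - 8577  (offset 8436)
The combined vocabulary size is simply the sum of the individual sizes, here 436 + 1000 + 5000 + 2000 + 142 = 8578. (The vocabsize of 11674 used in the topic-modeling example below appears to come from a mix that used Gabor words (4096) in place of texton words (1000): 436 + 4096 + 5000 + 2000 + 142 = 11674.)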
Topic modeling
This is where all the magic happens.
$ topics.refine.t --help
Topic modeling of data with 1 dimensional structure.:
  --help                                help
  -i [ --in.words ] arg (=/dev/stdin)   Word frequency count file. Each line is a document/cell, with integer representation of words.
  --in.words.delim arg (=,)             delimiter used to seperate words.
  --out.topics arg (=topics.csv)        Output topics file
  --out.topics.ml arg (=topics.maxlikelihood.csv)
                                        Output maximum likelihood topics file
  --out.topicmodel arg (=topicmodel.csv)
                                        Output topic model file
  --in.topicmodel arg                   Input topic model file
  --in.topics arg                       Initial topic labels
  --logfile arg (=topics.log)           Log file
  --ppx.rate arg (=10)                  Every _ iterations report perplexity.
  --ppx.out arg (=perplexity.csv)       Perplexity score for each timestep
  -V [ --vocabsize ] arg                Vocabulary size.
  -K [ --ntopics ] arg (=100)           Topic size.
  -n [ --iter ] arg (=1000)             Number of iterations
  -a [ --alpha ] arg (=0.10000000000000001)
                                        Controls the sparsity of theta. Lower alpha means the model will prefer to characterize documents by few topics
  -b [ --beta ] arg (=1)                Controls the sparsity of phi. Lower beta means the model will prefer to characterize topics by few words.
  -l [ --online ]                       Do online learning; i.e., output topic labels after reading each document/cell.
  --online.mint arg (=100)              Minimum time in ms to spend between new observation timestep.
  --tau arg (=0.5)                      [0,1], Ratio of local refinement (vs global refinement).
  --refine.weight.local arg (=0.5)      [0,1], High value implies more importance to present time. (GeometricDistribution(X))
  --refine.weight.global arg (=0.5)     [0,1], High value implies more importance to present time. (T*BetaDistribution(1/X,1))
  --threads arg (=4)                    Number of threads to use.
  --g.time arg (=1)                     Depth of the temporal neighborhood (in #cells)
  --g.space arg (=1)                    Depth of the spatial neighborhood (in #cells)
  --cell.time arg (=1)                  cell width in time dim
  --cell.space arg (=32)                cell width in space dim
  --in.topicmask arg                    Mask file for topics. Format is k lines of 0 or 1, where 0=> don't use the topic
  --add.to.topicmodel arg (=1)          Add the given initial topic labels to topic model. Only applicable when a topic model and topics are provided
  --out.intermediate.topics arg (=0)    output intermediate topics
  --out.topics.online arg (=topics.online.csv)
                                        topic labels computed online (only valid in online mode)
  --in.position arg                     Word position csv file.
  --out.position arg (=topics.position.csv)
                                        Position data for topics.
  --topicmodel.update arg (=1)          Update global topic model with each iteration
  --out.ppx.online arg (=perplexity.online.csv)
                                        Perplexity score for each timestep, immediately after it has been observed.
  --batch.maxtime arg (=0)              Maximum time in milliseconds spent on processing the data. 0 implies no max time.
  --retime arg (=1)                     If this option is given, then timestamp from the words is ignored, and a sequntial time is given to each timestep
$ topics.refine.t --in.words=words.all.csv --iter=100 --alpha=0.1 --beta=0.5 --vocabsize=11674 -K 20
Here vocabsize is the combined vocabulary size, as reported in mixwords.log. The program produces two label files:
- topics.csv: a list of topic labels corresponding to each input word at each timestep. This file has exactly the same format as the input words file.
- topics.maxlikelihood.csv: the maximum likelihood topic labels for each word. These labels, not the ones in topics.csv, should be fed into any classification or summarization task.
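Like the words file, a hypothetical topic labels file has one line per time step, with a topic label in place of each word id (labels here are purely illustrative):
0,3,17,3,3,5
625,3,3,9,17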
We can produce a visualization of each topic by selecting the parts of the clip that have a high representation of that topic. To do that, we first need to produce a topic histogram file using the
words.bincount program and the
topics.maxlikelihood.csv file produced by the topic modeler, which is a list of topic labels for each time step.
$ words.bincount --help
Given timestamped words or topics file, outputs timestamped distributions.:
  --help                                help
  -i [ --in.words ] arg (=/dev/stdin)   timestamped word list csv file
  --in.words.delim arg (=,)             delimiter used to seperate words.
  -o [ --out ] arg (=/dev/stdout)       Output histogram
  -V [ --vocabsize ] arg (=0)           Vocabulary size. 0 => use the largest wordid as vocab size.
  --alpha arg (=1)                      histogram smoothing (only used when normalizing)
  --normalize                           Normalize the distribution so that everything sums to 1.0
$ words.bincount -i topics.maxlikelihood.csv -o topics.hist.csv -V 20
Here the -V argument takes the number of topics. Next, run the summarizer:
$ summary.kcenters -i topics.hist.csv --kcenters-pp -S 20
This will produce a k-centers summary file (summary.kcenters-pp.csv in the rendering example below) that can be fed into the summary renderer to produce the overall summary montage clip.
Rendering the results requires ffmpeg version >= 1.1.
Rendering a summary
$ summary.render --help
Usage: summary.render [options]

Options:
  -h, --help            show this help message and exit
  -s FILE, --summary=FILE
                        read summary from FILE
  -v FILE, --video=FILE
                        read video from FILE
  -o FILE, --out=FILE   output summary video FILE
  -w WINDOW, --window=WINDOW
                        Window size in milliseconds
$ summary.render -s summary.kcenters-pp.csv -v filename.ext -o summary_filename.ext -w 1000
Given the topic histogram file, we can now render topic visualizations using topics.render:
$ topics.render --help
Usage: topics.render [options]

Options:
  -h, --help            show this help message and exit
  -i FILE, --topichist=FILE
                        input topic histogram FILE
  -v FILE, --video=FILE
                        input video from FILE
  -o PREFIX, --out-prefix=PREFIX
                        output video prefix. each video file name is
                        PEFIX<i>.<ext>, where <i> is the topic number, and
                        ext is same as input video file's extension
  -w WINDOW, --window=WINDOW
                        window size in milliseconds for each clip
  -n INHIBITION, --inhibition=INHIBITION
                        after selecting a timestamp, do not select another
                        one within this time radius
$ topics.render -w 1000 -i topics.hist.csv -v filename.ext -o filename
Building your own vocabularies
- ORB (Oriented FAST and Rotated BRIEF) video vocabulary
$ words.vocab.orb --video=movie.mp4 --subsample 30 --scale=0.5 --task=train --show.keypoints=false
- Face vocabulary
- Text vocabulary (subtitles)
- MFCC audio vocabulary