
Potential Projects for Newcomers

This wiki page collects some ideas for feasible projects for newcomers. They are organized by project length (short-term, long-term, etc.).

  • Short-term projects generally would take a few weeks if working about 5 hours a week, require little to no mathematical expertise, and apply existing inference code to a new application.

  • Long-term projects generally would be a semester or two-semester effort that would alter the internals of bnpy inference code. You could usually turn these into some kind of honors thesis or independent study report; if very successful, they might even lead to a workshop or conference paper.

Short-term projects

Project: Exploring different learning rates for stochastic variational inference (SVI)

Our current implementation of SVI requires a fixed learning rate decay schedule. You could demonstrate how performance with this schedule changes as a function of the schedule parameters and the batch size. You could explore any dataset of interest (Wikipedia or NY Times for topic models, image patches for mixtures, etc.). Pick one dataset that is most interesting to you.

Expected deliverables:

  • Notebook describing an experiment

Make beautiful plots showing how performance varies with the learning rate settings on a relevant dataset. For bonus points, show the same settings on a different (but related) dataset to see if trends hold up. The goal is to have a notebook a newcomer can reference to immediately pick some "good" settings for a new experiment.
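To get oriented, here is a minimal sketch (not bnpy code) of the standard SVI decay schedule rho_t = (t + delay)^(-kappa). The names `delay` and `kappa` are illustrative; check the soVB options in bnpy for the exact keyword names it exposes.

```python
# Sketch: visualize how the standard SVI decay schedule
#   rho_t = (t + delay) ** (-kappa)
# changes with its two parameters. "delay" and "kappa" are illustrative
# names; bnpy's soVB options may call these something else.
import numpy as np
import matplotlib.pyplot as plt

n_updates = 200                    # number of batch updates to visualize
t = np.arange(1, n_updates + 1)

for delay, kappa in [(1.0, 0.55), (10.0, 0.55), (1.0, 0.9), (10.0, 0.9)]:
    rho = (t + delay) ** (-kappa)
    plt.plot(t, rho, label='delay=%.0f, kappa=%.2f' % (delay, kappa))

plt.xlabel('batch update t')
plt.ylabel('learning rate rho_t')
plt.legend()
plt.show()
```

Plots like this, overlaid with the corresponding objective traces, are the kind of sensitivity summary the notebook should contain.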

Project: Application of HDP-topics algorithms to Yelp dataset

You would apply topic model algorithms to the Yelp dataset. You'll need to develop scripts for turning raw plain-text reviews into sanitized bag-of-words data, and then run topic model experiments on the result.

Expected deliverables:

  • Code for creating sanitized bag-of-words datasets

  • Notebook with experiments on Yelp

This might include trace plots of the number of topics over time, visuals of the learned topics, and some application of these topics (do they correlate with review categories?).
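As a starting point, here is a hedged sketch of the sanitization step using scikit-learn's CountVectorizer. The toy `reviews` list stands in for text loaded from the Yelp dump, and the final conversion into a bnpy data object is left out because the exact class and constructor depend on your bnpy version.

```python
# Sketch: turn raw review text into sanitized bag-of-words counts.
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for reviews loaded from the Yelp dump.
reviews = ["The tacos were amazing and the staff was friendly.",
           "Terrible service, but the pizza was decent."]

vectorizer = CountVectorizer(stop_words='english',  # drop common words
                             min_df=1,              # raise this on real data
                             max_features=5000)     # cap vocabulary size
X = vectorizer.fit_transform(reviews)   # sparse (n_docs x vocab_size) counts
vocab = vectorizer.get_feature_names_out()

# Per-document (word, count) pairs, the usual sparse bag-of-words format.
for d in range(X.shape[0]):
    row = X.getrow(d)
    print('doc %d:' % d,
          [(vocab[i], int(c)) for i, c in zip(row.indices, row.data)])
```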

Project: Application of HDP-topics algorithms to bigger Wikipedia dataset

This project would involve collecting a larger dump of Wikipedia articles than the prepackaged dataset of ~7000 documents used in our AISTATS'15 paper. You could use some other dump of Wikipedia (size 300,000 docs or larger), or build your own.

Expected deliverables:

  • Code for creating sanitized bag-of-words dataset

  • Notebook with experiments

Maybe would include standard trace plots and visuals of the learned topics. Would also include some Wikipedia-relevant application of these topics (do they correlate with article categories? can you show how to browse articles by topic?).
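For collecting articles, one option is the public MediaWiki API; below is a hedged sketch that grabs plain-text extracts of random articles with `requests`. The endpoint and parameters are the standard MediaWiki query API, but proper rate limiting, a User-Agent header, and error handling are left out.

```python
# Sketch: pull plain-text article extracts from Wikipedia via the public
# MediaWiki API, one article per request, and append them to a text file.
import requests
import time

API = 'https://en.wikipedia.org/w/api.php'

def random_titles(n=5):
    # list=random with namespace 0 returns random article titles
    params = {'action': 'query', 'format': 'json',
              'list': 'random', 'rnnamespace': 0, 'rnlimit': n}
    pages = requests.get(API, params=params).json()['query']['random']
    return [p['title'] for p in pages]

def plain_text(title):
    # prop=extracts with explaintext returns the article body as plain text
    params = {'action': 'query', 'format': 'json', 'prop': 'extracts',
              'explaintext': 1, 'titles': title}
    pages = requests.get(API, params=params).json()['query']['pages']
    return next(iter(pages.values())).get('extract', '')

with open('wiki_articles.txt', 'a') as f:
    for title in random_titles(5):
        f.write(plain_text(title).replace('\n', ' ') + '\n')
        time.sleep(1)  # simple politeness delay
```

The saved text can then go through the same bag-of-words sanitization as in the Yelp project above.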

Project: Application of HDP-HMM algorithms to Bird Song dataset

There is a cool dataset of hundreds of bird song recordings here: http://sabiod.univ-tln.fr/ulearnbio/challenges.html.

You would apply the HDP-HMM algorithms from our NIPS '15 paper to see if you can find any cool state structure here. Note: we expect that unsupervised learning may not be so successful on this data, but your work could be an important baseline.

Expected deliverables:

  • Code for sanitizing the dataset into bnpy format

  • Notebook with experiments

This might include standard trace plots and visuals of the learned states. It would also include some relevant application of these states: can you classify bird songs by the segmentation a trained model produces? How does this compare to baseline supervised methods?
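For the sanitization step, a common route is per-frame MFCC features; here is a hedged sketch using librosa. The final packaging into a bnpy data object is only outlined, since the exact class and field names (e.g. GroupXData with X and doc_range) may differ across bnpy versions.

```python
# Sketch: convert raw bird song recordings (.wav) into per-frame MFCC
# feature vectors, the kind of continuous sequence data an HDP-HMM expects.
import glob
import numpy as np
import librosa

sequences = []
for wav_path in sorted(glob.glob('birdsong/*.wav')):
    y, sr = librosa.load(wav_path, sr=None)             # raw waveform
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, n_frames)
    sequences.append(mfcc.T)                            # (n_frames, 13)

# Stack all sequences into one big array plus start/stop indices per
# sequence, which is the usual layout for sequential data in bnpy.
X = np.vstack(sequences)
doc_range = np.hstack([[0], np.cumsum([len(s) for s in sequences])])
print(X.shape, doc_range)
```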

Long-term projects

Project: Add hyperparameter optimization

All of the inference algorithms are sensitive to model hyperparameters, especially concentration parameters of the Dirichlet process. In this project, you would add code that learns values of hyperparameters from data.

Background reading: TODO, please ask Mike

Expected deliverables:

  • Code

Choose one allocation model (DPMixtureModel, HDPTopicModel, HDPHMM, etc.) and add an "update_hyperparameter" method. This method would be (optionally?) called by "update_global_params" whenever a global update occurs. It would update the relevant parameter (usually a scalar named gamma0) via MAP estimation.
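As a rough illustration (not bnpy's actual internals), here is a sketch of the closed-form MAP update such a method might compute for a DP-style concentration gamma0 under an assumed Gamma(a, b) prior; the derivation it relies on is sketched under the math writeup below, and all names here are illustrative.

```python
# Hypothetical sketch (not bnpy's actual internals) of the closed-form MAP
# update an update_hyperparameter method might compute. Names are
# illustrative: K is the number of active components, sum_Elog1mV is the
# sum over components of E_q[log(1 - v_k)] (always <= 0), and (prior_a,
# prior_b) parameterize an assumed Gamma prior on gamma0.
def update_gamma0_map(K, sum_Elog1mV, prior_a=1.0, prior_b=1.0):
    """Return the MAP estimate of gamma0 given expected sufficient statistics."""
    return (K + prior_a - 1.0) / (prior_b - sum_Elog1mV)
```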

  • Math writeup

You'll need to document the derivation of how to update the hyperparameter. Start by writing the variational objective as a function of that parameter, then take derivatives, set to zero, and solve. Be sure to answer these gotcha questions: Is this objective convex? Does this procedure always deliver (conditionally) optimal results?
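To make the expected derivation concrete, here is a hedged sketch for the DP concentration parameter gamma0, assuming stick-breaking weights v_k ~ Beta(1, gamma0) and a Gamma(a, b) prior; the parameterization is illustrative and may not match bnpy's exactly.

```latex
% Terms of the variational objective that depend on \gamma_0, assuming
% stick weights v_k \sim \mathrm{Beta}(1, \gamma_0) and a
% \mathrm{Gamma}(a, b) prior on \gamma_0 (additive constants dropped):
\mathcal{L}(\gamma_0)
  = K \log \gamma_0
  + (\gamma_0 - 1) \sum_{k=1}^{K} \mathbb{E}_q\!\left[\log(1 - v_k)\right]
  + (a - 1)\log \gamma_0 - b\,\gamma_0

% Setting the derivative to zero and solving gives the MAP update:
\frac{d\mathcal{L}}{d\gamma_0}
  = \frac{K + a - 1}{\gamma_0}
  + \sum_{k=1}^{K} \mathbb{E}_q\!\left[\log(1 - v_k)\right] - b = 0
\quad\Longrightarrow\quad
\gamma_0^{*} = \frac{K + a - 1}{\, b - \sum_{k=1}^{K} \mathbb{E}_q\!\left[\log(1 - v_k)\right]\,}
```

Note that under these assumptions the second derivative is -(K + a - 1)/gamma0^2, which is negative whenever a >= 1, so this piece of the objective is concave in gamma0 and the stationary point is the conditional maximum; your writeup should check whether the same holds for the model you actually choose.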

  • Notebook describing an experiment

Make at least one IPython notebook demo that showcases how performance of some inference algorithm changes with and without update_hyperparameter enabled. Use toy data with an ideal hyperparameter, and show your algorithm recovers it.

Project: Adaptive learning rates for stochastic variational inference (SVI)

Our current implementation of SVI requires a fixed learning rate decay schedule. You could read recent research papers and implement an adaptive scheme.

Background reading:

Expected deliverables:

  • Code

This would be a new algorithm, related to SOVBLearnAlg.py.
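One concrete candidate is the adaptive rate of Ranganath et al., "An Adaptive Learning Rate for Stochastic Variational Inference" (ICML 2013). Here is a hedged, standalone sketch of its running-average update, written outside bnpy so all names are illustrative.

```python
# Sketch of an adaptive learning rate in the style of Ranganath et al.
# (ICML 2013): keep running averages of the noisy natural gradient g and
# of g^T g, and set the rate to the ratio of the two. All names here are
# illustrative, not bnpy identifiers.
import numpy as np

class AdaptiveRate(object):
    def __init__(self, dim, tau0=10.0):
        self.tau = tau0                 # effective averaging window
        self.g_bar = np.zeros(dim)      # running mean of gradients
        self.h_bar = 1.0                # running mean of squared gradient norm

    def next_rate(self, g):
        w = 1.0 / self.tau
        self.g_bar = (1.0 - w) * self.g_bar + w * g
        self.h_bar = (1.0 - w) * self.h_bar + w * float(g @ g)
        rho = float(self.g_bar @ self.g_bar) / self.h_bar
        self.tau = self.tau * (1.0 - rho) + 1.0   # shrink window after big steps
        return rho
```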

  • Notebook with experiments

Project: Extend HDP-topics algorithm to streaming setting

Do you like the idea of scraping the web and feeding documents into a topic model algorithm in (near) real time? This project might be for you. You could extend our topic model algorithms to this streaming setting, and show off a really awesome demo that just keeps on learning.

Background reading: TODO, please ask Mike

Expected deliverables:

  • Web-scraping code

Using some existing Python libraries (maybe Beautiful Soup), write a scraper that can keep reading documents/articles from Wikipedia (or similar) and saving them to disk in a format readable by bnpy algorithms. You'll also need to define a new object within bnpy/data/ that can manage reading the scraped data into a bnpy learning algorithm.

  • Inference Code

Duplicate the functionality of "MOVBLearnAlg.py" into a new file "StreamingMOVBLearnAlg.py" so that instead of a fixed-size dataset, you can provide a scraping script that just keeps adding new documents.
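Here is a hedged sketch of the kind of streaming loop this would need; `load_batch` and `do_memoized_update` are hypothetical placeholders, not existing bnpy functions.

```python
# Sketch of the streaming loop the new learning algorithm would need:
# poll a directory for freshly scraped documents and hand each new batch
# to an update step.
import glob
import time

def stream_batches(scrape_dir, poll_seconds=30):
    seen = set()
    while True:
        for path in sorted(glob.glob(scrape_dir + '/*.txt')):
            if path not in seen:
                seen.add(path)
                yield path
        time.sleep(poll_seconds)   # wait for the scraper to add more files

# for path in stream_batches('scraped_wiki'):
#     batch = load_batch(path)          # hypothetical: build a bnpy data object
#     do_memoized_update(model, batch)  # hypothetical: one MOVB-style step
```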

Project: Reproduce posterior predictive check experiments for topic models

There are some neat papers out there that do posterior predictive checks to explore topic models. The goal of a posterior predictive check is to see how well a trained model fits the data. This project would involve reading the existing literature, picking a nice experiment to reproduce (Mike has some ideas), and then reproducing that experiment on a topic model dataset (NY Times/Wikipedia/etc.). You could then easily extend this to try other predictive checks or make better visualizations.

Background reading:

Expected deliverables:

  • Notebook describing an experiment
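As one possible shape for such an experiment, here is a hedged sketch of a simple check: simulate replicated documents from the learned parameters and compare a per-document discrepancy statistic against the real data. The arrays `theta` (docs x topics), `phi` (topics x vocab), and `real_docs` (per-document word counts) are assumed inputs from a trained model and the original corpus, not bnpy attributes.

```python
# Sketch of one simple posterior predictive check for a fitted topic model:
# simulate replicated documents from the learned parameters and compare a
# discrepancy statistic (here, the fraction of a document taken up by its
# single most frequent word) between real and replicated data.
import numpy as np

rng = np.random.default_rng(0)

def top_word_fraction(counts):
    return counts.max() / counts.sum()

def replicate_doc(theta_d, phi, n_words):
    z = rng.choice(phi.shape[0], size=n_words, p=theta_d)     # topic per word
    words = [rng.choice(phi.shape[1], p=phi[k]) for k in z]   # word per topic
    return np.bincount(words, minlength=phi.shape[1])

def ppc(real_docs, theta, phi):
    """Return (real, replicated) statistic pairs, one per document."""
    stats = []
    for d, counts in enumerate(real_docs):
        rep = replicate_doc(theta[d], phi, int(counts.sum()))
        stats.append((top_word_fraction(counts), top_word_fraction(rep)))
    return np.array(stats)
```

Scatter-plotting the real statistic against the replicated one, document by document, is one natural visualization for the notebook.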
