yc-utils -- Collection of NLP and data processing utilities

This is an assortment of utilities and functions that I have written and found useful for data manipulation and NLP. It is a mishmash of scripts, libraries, modules, headers, etc., written in various languages.

Getting started


To run Python scripts, you will need:

To use the Java packages and tools, you will need:

To use the C++ headers and compile the code, you will need:

  • GCC 4.9.1 -- All development work has been done with this compiler. The C++ codebase uses mostly standard C++11 (I think?) techniques, so it could possibly work with other C++11 compilers as well; your mileage may vary.
  • Boost libraries (>=1.56.0)

Applications and scripts

I have implemented a few applications and scripts for common data handling tasks.

For manipulating vocabularies

  • samplelines -- Utility for rapidly sampling lines from a large file.
  • shufflelines -- Utility for shuffling lines from a large file.
  • splitlines -- Quick utility for splitting lines into cross-validation or training/dev/test sets.
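
As an illustration of how lines can be sampled from a large file without loading it into memory, here is a minimal sketch of single-pass reservoir sampling (Algorithm R). It is not the implementation used by samplelines; the function name `sample_lines` and its parameters are hypothetical and chosen only for this example.

```python
import random
from typing import Iterable, List, Optional


def sample_lines(lines: Iterable[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Return up to k lines chosen uniformly at random in one pass.

    Hypothetical helper for illustration; not part of yc-utils.
    """
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Keep the new line with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir


if __name__ == "__main__":
    # In-memory stand-in for a large file; an open file object works the same way.
    demo = [f"line {i}\n" for i in range(100)]
    print("".join(sample_lines(demo, 5, seed=42)), end="")
```

Because only k lines are held at any time, memory use does not depend on the file size. Any iterable of lines can be passed in, including an open file handle.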

Shell scripts and wrapper code

  • -- A Java wrapper around Stanford CoreNLP for tokenizing, sentence splitting, lemmatizing, POS tagging and NER labeling. You can run this using the bash shell script
  • -- Phrasinator is a Python application that extracts phrases from text, where a phrase is a sequence of tokens whose POS tags match the pattern (ADJ*)(NN+).
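
To make the (ADJ*)(NN+) pattern concrete, here is a minimal sketch of matching it over already-tagged tokens. It is not Phrasinator's code. It assumes Penn Treebank style tags, where adjective tags start with "JJ" and noun tags start with "NN"; the function name `extract_phrases` is hypothetical.

```python
from typing import List, Sequence, Tuple


def extract_phrases(tagged: Sequence[Tuple[str, str]]) -> List[str]:
    """Return maximal token runs whose tags match (ADJ*)(NN+).

    Assumes Penn Treebank tags: JJ* for adjectives, NN* for nouns.
    Hypothetical helper for illustration only.
    """
    phrases: List[str] = []
    i, n = 0, len(tagged)
    while i < n:
        start = i
        # Consume zero or more adjectives.
        while i < n and tagged[i][1].startswith("JJ"):
            i += 1
        noun_start = i
        # Consume one or more nouns.
        while i < n and tagged[i][1].startswith("NN"):
            i += 1
        if i > noun_start:
            # At least one noun was found, so the span is a phrase.
            phrases.append(" ".join(tok for tok, _ in tagged[start:i]))
        elif i == start:
            # Neither an adjective nor a noun: skip this token.
            i += 1
        # Otherwise: adjectives with no following noun are dropped.
    return phrases


if __name__ == "__main__":
    sentence = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
                ("jumps", "VBZ"), ("over", "IN"), ("lazy", "JJ"), ("dogs", "NNS")]
    print(extract_phrases(sentence))
```

Note that under this pattern a lone noun counts as a phrase, while adjectives that are not followed by a noun are discarded.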

Libraries and packages

C++ headers

Several components are provided as header-only C++ source, so no separate compilation or linking is required to use them.

  • collections.h -- Suite of C++ functions for working with collections.
  • ioutils.h -- A collection of utility functions for reading and writing (I/O).
  • fastmath.h -- A wrapper around exponential and logarithm functions using high performance math libraries.
  • math.h -- A collection of utility functions for doing math.
  • eigen.h -- Helper methods for use with Eigen library.
  • feature_map.h -- Implements the two-way feature mapping class.
  • lbfgs.h -- Helper methods for use with Naoaki Okazaki's libLBFGS.
  • mathtables.h -- Performs fast exponential and logarithmic operations by precomputing their values.
  • metropolis_hastings.h -- Implementation of the Metropolis-Hastings algorithm.
  • program_options.h -- Utility methods for dealing with boost::program_options.
  • random.h -- Helper functions for dealing with random number generation.
  • sampling.h -- A collection of functions for handling sampling.
  • stopwords.h -- Utility functions for managing stopwords.
  • vecmath.h -- A collection of utility functions for doing vectorial math.
  • vocabulary.h -- Implementation of a vocabulary -- includes saving to and loading from files, set-like operations, etc.
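
For readers unfamiliar with the idea behind a two-way feature mapping (as in feature_map.h above), here is a minimal sketch of the data structure: feature names map to dense integer indices, and indices map back to names. It is written in Python for brevity and does not reflect the actual API of the C++ header; the class and method names are hypothetical.

```python
from typing import Dict, List


class FeatureMap:
    """Bidirectional mapping between feature names and dense integer ids.

    Hypothetical illustration; not the interface of feature_map.h.
    """

    def __init__(self) -> None:
        self._index_of: Dict[str, int] = {}  # name -> id
        self._name_of: List[str] = []        # id -> name

    def add(self, name: str) -> int:
        """Return the id for name, assigning the next free id if it is new."""
        idx = self._index_of.get(name)
        if idx is None:
            idx = len(self._name_of)
            self._index_of[name] = idx
            self._name_of.append(name)
        return idx

    def index(self, name: str) -> int:
        """Look up the id of a known feature name (raises KeyError if unknown)."""
        return self._index_of[name]

    def name(self, idx: int) -> str:
        """Look up the feature name for an id (raises IndexError if out of range)."""
        return self._name_of[idx]

    def __len__(self) -> int:
        return len(self._name_of)


if __name__ == "__main__":
    fmap = FeatureMap()
    for feat in ["word=fox", "pos=NN", "word=fox"]:
        fmap.add(feat)
    print(len(fmap), fmap.index("pos=NN"), fmap.name(0))
```

Keeping a dictionary for the forward direction and a list for the reverse direction gives constant-time lookups both ways, at the cost of storing each name twice.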

C++ library

Some functions require linking against the ycutils library. They are:

Python modules

Some Python modules that I wrote for quick and dirty data processing tasks.

  • -- A Python module that interfaces with @ref

Other included libraries that are not mine


See TODOs.