yc-utils -- Collection of NLP and data processing utilities
This is an assortment of utilities and functions that I have written and found useful for data manipulation and NLP. It is a mishmash of scripts, libraries, modules, headers, etc., written in various languages.
To run Python scripts, you will need:
- Python 2.7 (>= 2.7.3)
To use the Java packages and tools, you will need:
- Java SE 7 or later
To use the C++ headers and compile the code, you will need:
- GCC 4.9.1 -- All development work has been done with this compiler. The C++ codebase mostly uses standard C++11 features, so it may also work with other C++11 compilers; your mileage may vary.
- Boost libraries (>=1.56.0)
Applications and scripts
I have implemented a few applications and scripts for common data handling tasks.
For manipulating vocabularies
- buildvocab -- App for building n-gram vocabulary files from tokenized text.
- prunevocab -- App for pruning vocabulary files created by buildvocab.
- mergevocab -- App for merging multiple vocabulary files created by buildvocab.
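The core operation behind these tools is n-gram counting over tokenized text. The sketch below (in Python, purely for illustration; `build_ngram_vocab` is a hypothetical name, not the actual buildvocab implementation) shows the basic idea:

```python
from collections import Counter

def build_ngram_vocab(lines, n=2):
    """Count n-grams over whitespace-tokenized lines (illustrative sketch)."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

vocab = build_ngram_vocab(["the cat sat", "the cat ran"], n=2)
```

In these terms, pruning (prunevocab) amounts to dropping low-count entries, and merging (mergevocab) to summing counts across several such tables.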
Data processing related utilities
- samplelines -- Utility for rapidly sampling lines from a large file.
- shufflelines -- Utility for shuffling lines from a large file.
- splitlines -- Quick utility for splitting a file into cross-validation or training/dev/test sets.
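A common way to sample lines from a large file uniformly in a single pass is reservoir sampling; whether samplelines uses this exact method is an assumption on my part, but the sketch below illustrates the technique:

```python
import random

def sample_lines(lines, k, seed=0):
    """Keep a uniform random sample of k items from a stream (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)  # fill the reservoir first
        else:
            # Keep the new line with probability k / (i + 1),
            # overwriting a uniformly chosen reservoir slot.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir

sample = sample_lines((str(i) for i in range(1000)), k=10)
```

Because only k lines are ever held in memory, this works even when the input does not fit in RAM.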
Shell scripts and wrapper code
- corenlpwrapper.sh -- A Java wrapper around Stanford CoreNLP for tokenizing, sentence splitting, lemmatizing, POS tagging and NER labeling; you can run it via this bash shell script.
- phrasinator.py -- Phrasinator is a Python application that extracts phrases from text, where phrases are sequences of tokens whose POS tags are of the form <tt>(ADJ*)(NN+)</tt>.
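Assuming Penn Treebank tags (as produced by CoreNLP), where adjective tags start with JJ and noun tags with NN, matching the <tt>(ADJ*)(NN+)</tt> pattern can be sketched as follows (`extract_phrases` is a hypothetical illustration, not Phrasinator's actual code):

```python
import re

def extract_phrases(tagged):
    """Return token spans whose POS sequence matches (ADJ*)(NN+) -- a sketch."""
    # Encode each tag as one character so the pattern becomes a plain regex.
    symbols = "".join(
        "A" if tag.startswith("JJ") else "N" if tag.startswith("NN") else "x"
        for _, tag in tagged
    )
    return [
        " ".join(tok for tok, _ in tagged[m.start():m.end()])
        for m in re.finditer(r"A*N+", symbols)
    ]

phrases = extract_phrases([("the", "DT"), ("big", "JJ"), ("red", "JJ"), ("dog", "NN")])
```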
Libraries and packages
Several of the C++ sources are header-only, so no separate compilation or linking of libraries is required for most of them.
- collections.h -- Suite of C++ functions for working with collections.
- ioutils.h -- A collection of utility functions for I/O (reading and writing).
- fastmath.h -- A wrapper of exponentials and logarithm functions using high performance math libraries.
- math.h -- A collection of utility functions for doing math.
- eigen.h -- Helper methods for use with the Eigen library.
- feature_map.h -- Implements the two-way feature mapping class.
- lbfgs.h -- Helper methods for use with Naoaki Okazaki's libLBFGS.
- mathtables.h -- Performs fast exponentials and logarithmic operations by pre-computing their values.
- metropolis_hastings.h -- Implementation of the Metropolis-Hastings algorithm.
- program_options.h -- Utility methods for dealing with command-line program options.
- random.h -- Helper functions for dealing with random number generation.
- sampling.h -- A collection of functions for handling sampling.
- stopwords.h -- Utility functions for managing stopwords.
- vecmath.h -- A collection of utility functions for doing vector math.
- vocabulary.h -- Implementation of a vocabulary; includes saving/loading to/from files, set-like operations, etc.
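A two-way feature map like the one described for feature_map.h typically assigns each feature string a dense integer index and supports lookup in both directions. A minimal Python sketch of that idea (the names here are illustrative, not the header's actual API):

```python
class FeatureMap:
    """Two-way mapping between feature strings and dense integer indices."""

    def __init__(self):
        self._index = {}      # feature -> index
        self._features = []   # index -> feature

    def add(self, feature):
        """Return the feature's index, assigning a new one if unseen."""
        if feature not in self._index:
            self._index[feature] = len(self._features)
            self._features.append(feature)
        return self._index[feature]

    def feature(self, index):
        """Return the feature string for a given index."""
        return self._features[index]

fm = FeatureMap()
idx = fm.add("word=cat")
```

Such a mapping is handy when learning algorithms need contiguous integer feature IDs while the rest of the pipeline works with string features.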
Using some functions requires linking against the ycutils library. They are:
- vocabulary.cpp -- required due to its use of class static constants.
Some Python modules that I wrote for quick and dirty data processing tasks.
- corenlpwrapper.py -- A Python module that interfaces with @ref StanfordCoreNLPWrapper.java.
Other included libraries that are not mine
- Stanford CoreNLP 3.5.0 -- Stanford CoreNLP provides a set of natural language analysis tools written in Java. Used by corenlpwrapper.sh, phrasinator.py and corenlpwrapper.py.
- Splitta 0.1.0 -- Statistical sentence boundary detection library by Dan Gillick at Google. Used by phrasinator.py and corenlpwrapper.py.
- Apache Commons CLI 1.2 -- The Apache Commons CLI library provides an API for parsing command line options passed to programs. Used by StanfordCoreNLPWrapper.