To get a jar with everything necessary, clone the repository and run mvn clean package. You should then find a jar with dependencies in the target folder.


You can use and extend this project under the terms of the Apache License 2.0.

What is this repository for?

A simple package for computing classes of similar words from an unannotated text corpus.

Example from an older version

How do I get set up?

See the wiki.

Contribution guidelines

(to come)

Who do I talk to?

This guy


  • Review the code for possible optimizations.
  • Add a K-Centroids clustering algorithm.
  • Support input formats other than the Leipzig Wortschatz format.
  • Support Penn Treebank tokenization.
  • Better persistence.
  • Move all constants (especially strings) out of the main code and store them in one place.
  • Comment and test everything in the data package.