Overview

Word Embeddings

Introduction

Word embeddings are distributed representations of words. The tool calculates a vector representation of each word based on its distributional behavior observed in a large corpus. To learn good representations, the model tries to estimate the fluency of short phrases of text. For more information, check our online demo, presentation and the paper.

Dependencies

  • Theano (upstream version). You need to install the latest Theano available on GitHub, because the running-time optimizations we developed were only recently committed back to Theano.

Installation

To install the package:

Getting started

After the installation, you can use create_embeddings.py and provide it with the required configuration. We have created a shell script to run the tool with some default parameters.

create_embeddings.py --train-file train.txt --dev-file dev.txt --vocabulary vocab.txt

Formats

Train & Dev Files

UTF-8 encoded text files. Each line represents a sentence, and words are separated by spaces. The following is an example:

Riccardo Lombardi ( 16 August 1901 - 18 September 1984 ) was an Italian politician . Lombardi was born in Regalbuto .
He represented the Action Party in the Constituent Assembly of Italy from 1946 to 1948 and the Italian Socialist Party in the Chamber of Deputies from 1948 to 1983 .
References [ 1 ] Oleg Bolyakin ( , born September 5 , 1965 in Karaganda , Kazakh SSR , USSR ) is a former professional Kazakhstan ice hockey player .
He is honored coach of the Republic of Kazakhstan .
Bolyakin is a former head coach of Ertis Pavlodar , Saryarka Karaganda , Kazzinc-Torpedo and HC Almaty . Career Oleg Bolyakin is the graduate of Karaganda ice hockey school .
He started his career as a player of Avtomobilist Karaganda in 1981 .
In 1995 , he invited to play in Kazakhstan National Hockey Team and played 3 games with them .
In 1996 , Avtomobilist Karaganda was disbanded .
In 1998 , he sighed a contract with Amur Khabarovsk , but played only 9 games .
From 1999 to 2003 , he played for Yuzhny Ural Orsk at the Russian Major League .
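
To check that a file follows this format before training, it can help to read it back as token lists. The following is a minimal sketch, not part of the tool: the function name read_sentences is hypothetical, and skipping empty lines is an assumption made for illustration.

```python
import io
from typing import Iterator, List


def read_sentences(path: str) -> Iterator[List[str]]:
    """Yield one list of tokens per non-empty line of a UTF-8 text file.

    Hypothetical helper for inspecting train/dev files; the tool itself
    may read its input differently.
    """
    with io.open(path, encoding="utf-8") as handle:
        for line in handle:
            # Tokens are already separated by whitespace in this format,
            # so a plain split is enough; no further tokenization is done.
            tokens = line.split()
            if tokens:  # assumption: blank lines carry no sentence
                yield tokens
```

For example, list(read_sentences("train.txt")) returns one token list per line, which makes it easy to spot lines that were not tokenized (punctuation still attached to words).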

Vocabulary

Each line contains one token. Vectors are learned only for the tokens listed in the vocabulary. The following is an example of a vocabulary file:

the
known
made
three
then
about
United
than
later
some
there
On
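
A vocabulary file in this format can be produced from tokenized text in several ways. The sketch below keeps the most frequent tokens, which is an assumption chosen for illustration and not necessarily how the example above was built. The names build_vocabulary, write_vocabulary and load_vocabulary are hypothetical helpers, not part of the tool.

```python
import io
from collections import Counter
from typing import Iterable, List, Set


def build_vocabulary(sentences: Iterable[List[str]], max_size: int) -> List[str]:
    """Return up to max_size tokens, most frequent first (assumed criterion)."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [token for token, _ in counts.most_common(max_size)]


def write_vocabulary(tokens: Iterable[str], path: str) -> None:
    """Write one token per line, UTF-8 encoded, as the format requires."""
    with io.open(path, "w", encoding="utf-8") as handle:
        for token in tokens:
            handle.write(token + "\n")


def load_vocabulary(path: str) -> Set[str]:
    """Read a vocabulary file back into a set for fast membership checks."""
    with io.open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}
```

Tokens are case-sensitive here ("On" and "on" are distinct entries), which matches the example file above, where "United" and "On" appear capitalized.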