CWN project

This is the documentation for the work-in-progress ColWordNet[1] Python API. ColWordNet is an automatic extension of WordNet 3.1 with fine-grained collocational information. Its Python API allows you to use custom training data (collocation bases and collocates), to filter each lexical function[2] with a custom threshold, and to load all or part of the original WordNet into a NetworkX[3] directed graph for navigating the WordNet taxonomy and accessing collocations, in addition to the already defined WordNet semantic relations.
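
As a rough illustration of the data structure (not the API's actual loading code), one can think of the result as a NetworkX DiGraph whose nodes are synsets and whose edges carry a relation label:

import networkx as nx

# Illustrative only: synset names as nodes, the relation type as an
# edge attribute (the labels mirror those used by the API below).
G = nx.DiGraph()
G.add_edge('feeling.n.01', 'desire.n.01', relation='hyponym')      # WordNet relation
G.add_edge('desire.n.01', 'ardent.a.01', relation='collocation')   # CWN relation

# Collocational neighbours can then be read off the edge attributes.
for source, target, data in G.edges(data=True):
    if data['relation'] == 'collocation':
        print source, '->', target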

After cloning the repository (git clone), you will need to download the required data from here and unzip it.
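
For example (the repository and data URLs below are placeholders):

git clone <repo-url> cwn
cd cwn
unzip <downloaded-data>.zip -d data_folder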

Dependencies

You must have the following dependencies installed:

  • gensim
  • numpy
  • pandas
  • nltk (and the wordnet corpus)
  • networkx
  • sklearn
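
Assuming a standard pip setup, the following should install everything, including the WordNet corpus for nltk:

pip install gensim numpy pandas nltk networkx scikit-learn
python -c "import nltk; nltk.download('wordnet')"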

Training

This repository comes with manually extracted and categorized collocations for a number of lexical functions (from the Macmillan Collocations Dictionary). For example, to train a transformation[4] for the `magn' lexical function, run the wordnet_collocations.py script as follows (Python 2.7) from the cwn/code folder:

python wordnet_collocations.py ../data/train_raw/magn.txt n a base_concepts data_folder

The 'n' and 'a' parameters are the parts of speech of the base and the collocate of the input lexical function (for magn, e.g. `ardent(a) desire(n)'). 'base_concepts' is a file containing the topmost WordNet synsets from which the algorithm will start traversing hyponym branches and extracting candidate collocates. An example file is provided in cwn/data/train_raw/magn_base_concepts.txt. Finally, 'data_folder' is the folder where the data (sense embedding models and BabelNet mappings) were unzipped.
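
For illustration, a base concepts file lists one WordNet synset identifier per line; hypothetical contents (see the bundled magn_base_concepts.txt for the actual entries) could be:

feeling.n.01
desire.n.01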

Building the Resource

The previous step generates disambiguated training data (also stored in ../data/train_raw/ with the suffix _sensembed_WORDNET.txt) and saves a file in the output folder with all the relations found, without filtering. After generating k output files (one for each lexical function), it is possible to combine all the relations into one single resource as follows:

python build_resource.py relations_folder out_cwn

'relations_folder' is the folder where the relations derived from the previous step (one file per lexical function) were saved. 'out_cwn' is the path where the CWN resource will be created.
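
For example, if the relation files were written to ../data/relations (an illustrative path), one could run:

python build_resource.py ../data/relations ../data/cwn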

Loading CWN as a NetworkX object

Finally, it is possible to load the resulting CWN on top of WordNet (or a subset of it) and then navigate its relations by calling the following (note the interactive mode):

python -i load_CWN_relations.py topmost_synset base_synset*

'topmost_synset' is the base concept whose hyponym branch will be traversed and enriched with CWN relations. To load the whole WordNet resource, pass 'entity.n.01' as the topmost synset. 'base_synset' (optional) is an example base synset (contained in the loaded CWN resource) whose collocational relations will be printed to standard output.
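
For example, to load the hyponym branch of feeling.n.01 and print the collocational relations of desire.n.01 (an illustrative base synset within that branch):

python -i load_CWN_relations.py feeling.n.01 desire.n.01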

If you do not pass a 'base_synset' parameter, you may run, for example:

python -i load_CWN_relations.py feeling.n.01

Then, in interactive mode, you may call the function:

get_neighbours(input_synset, by_relation='collocation')

in order to retrieve the synsets collocationally related to input_synset (here, feeling.n.01).
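
A minimal interactive session might look as follows (assuming synsets are identified by their name strings and that get_neighbours returns an iterable of them):

>>> input_synset = 'feeling.n.01'
>>> for s in get_neighbours(input_synset, by_relation='collocation'):
...     print s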

Likewise, to obtain all the collocates of a given base, you can call get_collocates(input_synset) as follows:

for edge, relation in get_collocates(input_synset, lf=''): print edge, relation

Note that you can pass an optional lf parameter which, if not None, filters collocations by lexical function.
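
For example, to keep only collocations of the `magn' lexical function:

for edge, relation in get_collocates(input_synset, lf='magn'): print edge, relation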

Note: The current version of this API implements a linear regression model instead of the Moore-Penrose pseudoinverse used in the original publication. Although we have not performed an extensive evaluation, this variant is faster and provides equally satisfactory results.
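
In both variants the idea is to learn a linear map from base-sense embeddings to collocate-sense embeddings. A minimal sketch with random stand-in data (not the repository's actual code):

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in data: rows are sense embeddings of training bases (X) and of
# their collocates (Y) for one lexical function.
X = np.random.rand(100, 50)
Y = np.random.rand(100, 50)

# Original publication: closed-form least-squares map via the
# Moore-Penrose pseudoinverse.
W = np.linalg.pinv(X).dot(Y)
Y_pred_pinv = X.dot(W)

# Current API: the same map fitted as an ordinary linear regression.
model = LinearRegression().fit(X, Y)
Y_pred_lr = model.predict(X)

Up to the intercept term, both solutions minimize the same squared error, which is consistent with the observation that the two variants behave similarly.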

Note 2: Retrofitted[5] pretrained vectors can be downloaded from https://drive.google.com/drive/folders/0B4dY7B_VR5judG4zVXdPYXctcEU?usp=sharing. We release the four models that we qualitatively evaluated in [1].

TO-DO

  • Integrate CWN as an nltk corpus


[1] Espinosa-Anke, L., Camacho-Collados, J., Rodríguez-Fernández, S., Saggion, H., & Wanner, L. (2016). Extending WordNet with Fine-Grained Collocational Information via Supervised Distributional Learning. In Proceedings of COLING 2016.

[2] Mel'čuk, I. (1998). Collocations and lexical functions. Cowie, AP (ed.), 23-53.

[3] networkx.github.io

[4] Mikolov, T., Le, Q. V., & Sutskever, I. (2013). Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

[5] Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., & Smith, N. A. (2014). Retrofitting word vectors to semantic lexicons. arXiv preprint arXiv:1411.4166.