
Multi-Sense LSTMs for Text-to-Entity Mapping

Table of contents

  1. Purpose
  2. Provided resources
  3. Requirements
  4. Pipeline
  5. MSEmbedding layer
  6. Pre-trained vectors
  7. Citing
  8. Licence
  9. Contact info

Purpose

This repository contains code and resources for the MS-LSTM model introduced in (Kartsaklis et al., 2018) [1]. The model efficiently maps unrestricted text to knowledge graph entities using the following process:

  1. The KB graph is extended with textual features weighted by their importance with respect to the entity nodes.
  2. A synthetic "corpus" of biased random walks is created and used as input to the skipgram model. This generates an enhanced KB space to be used as the target for the text-to-entity mapping process.
  3. The transformation from text to entities/concepts is achieved via a supervised multi-sense compositional model, which generates a point in the KB space for every input text. The model is an LSTM equipped with an attentional mechanism that dynamically disambiguates the embeddings of the input words given the surrounding context.

Provided resources

  • Folder deepwalk_tf: Contains a modified version of the DeepWalk software (Perozzi et al., 2014) [2] for creating entity vectors using biased random walks on a graph extended with textual features.
  • Folder ms-lstm: Keras/Python code for the MS-LSTM model.
  • Folder vectors: Contains pre-trained vector sets for WordNet and SNOMED CT.

Requirements

The code in this repository requires Python 2.x with the numpy and scipy libraries installed. Additionally:

  • For deepwalk_tf: gensim, six, wheel
  • For the MS-LSTM model: keras over Theano or TensorFlow, sklearn.

Instructions on installing Keras can be found in the Keras documentation. The MS-LSTM code has been tested with both Keras backends.

Pipeline

Training a text-to-entity mapping system with the code of this repository requires 5 main steps:

  1. Extend your graph with textual features
  2. Create your entity vectors
  3. Prepare your training and testing data
  4. Train the model
  5. Test the model

Specifically:

  • For creating entity vectors enhanced with textual features for a knowledge graph, use Steps 1 and 2.
  • For training the text-to-entity mapping system, use Steps 3 and 4.
  • For testing the text-to-entity mapping system, use Step 5.

Step 1: Extend your graph with textual features

As a first step, you need to extend the knowledge graph with nodes corresponding to textual features extracted from a list of texts. For this, use the following commands:

$ cd ms-lstm
$ python extend_graph.py --txt_data <textfile> --edgelist <currentgraph> --outfile <merged-edgelist> --threshold <tfidf-threshold>
  • For the parameter edgelist, provide the name of the file holding the current graph as an edge list of the form:
node1 node2
node3 node4
...

or

node1 node2 weight1
node3 node4 weight2
...
  • The txt_data file must contain the texts from which the textual features will be extracted. The form of this file is:
node1|text1
node2|text2
....

Note that each node can have more than one text assigned to it, e.g.

node1|text1
node1|text2
node2|text3
...
  • Textual features with TF-IDF values less than threshold will be ignored. By default this value is 0, i.e. all textual features are included in the merged edge list.
  • The labels of the textual features have the form TF:textual_feature, e.g. TF:fever.
  • As output, the script produces two files: one containing the merged edge list, named outfile (with both the edges of currentgraph and the edges extracted from textfile), and one reference file containing the mapping from graph entities to textual features (outfile.tfidf).
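
The weighting behind this step can be sketched as follows. This is a minimal, hypothetical illustration of the TF-IDF filtering, not the actual extend_graph.py code: each node's texts are treated as one document, and features scoring below the threshold are ignored.

```python
import math
from collections import Counter

def tfidf_edges(node_texts, threshold=0.0):
    """node_texts: dict mapping node label -> list of texts (hypothetical helper).

    Returns (node, TF:feature, weight) edges for features with TF-IDF >= threshold.
    """
    # Treat all texts attached to a node as one "document" of word counts.
    docs = {n: Counter(" ".join(ts).split()) for n, ts in node_texts.items()}
    n_docs = len(docs)
    df = Counter()                       # document frequency of each word
    for counts in docs.values():
        df.update(counts.keys())
    edges = []
    for node, counts in docs.items():
        total = sum(counts.values())
        for word, tf in counts.items():
            score = (tf / total) * math.log(n_docs / df[word])
            if score >= threshold:       # features below threshold are ignored
                edges.append((node, "TF:" + word, score))
    return edges
```

With the default threshold of 0, every textual feature survives the filter, matching the behaviour described above.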

Step 2: Create your entity vectors

In the second step you create the entity vectors that will be used as targets for the MS-LSTM. This can be done by running:

$ python deepwalk_tf/ --input <inputfile> --output <outputfile> --tf_mass <lambda>
  • Typically, your input file must be the merged edge list generated by Step 1 above. If you have generated this file by other means, its form should be:
node1 node2 weight1
node3 node4 weight2
...
  • If the node is a textual feature, its label must begin with TF: (e.g. TF:fever).
  • The weights do not have to correspond to probabilities at this stage; they will be normalised later, based on the tf_mass parameter (below), during the random walk generation.

    Example: Assume a random walk visits a node that is connected to 2 textual nodes t1 and t2, with weights 0.5 and 0.2 respectively, and to 3 normal entity nodes, c1, c2 and c3, each with weight 1.0. Further, let tf_mass = 0.6. The algorithm proceeds as follows:

    1. Normalise the textual weights to form a proper probability distribution, i.e. (t1:0.71, t2:0.29)
    2. Do the same for the entity nodes: (c1:0.33, c2:0.33, c3:0.33)
    3. Scale the textual probability distribution by tf_mass and the entity probability distribution by 1 - tf_mass. That is, for the text nodes we have (t1:0.6 * 0.71, t2: 0.6 * 0.29) = (t1:0.43, t2:0.17), and for the entity nodes (c1:0.4 * 0.33, c2:0.4 * 0.33, c3:0.4 * 0.33) = (c1:0.13, c2:0.13, c3:0.13)

    Putting together the two parts forms again a proper probability distribution (numbers add up to 1), namely (t1:0.43, t2:0.17, c1:0.13, c2:0.13, c3:0.13). The next node in the random walk will be sampled from that probability distribution.

  • The tf_mass parameter defines the probability mass that will be allocated to textual features during the random walks (see also the Example above). A value of 0 means no textual features, whereas a value of 1 means that half of the nodes in the random walk will be textual (default 0.5).

  • For the full list of available options type:

    $ python deepwalk_tf/ --help
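
The worked example above can be sketched in a few lines. This is an illustration of the biased transition distribution only, not the actual deepwalk_tf code: textual and entity neighbours are normalised separately, then scaled by tf_mass and 1 - tf_mass.

```python
def transition_probs(neighbours, tf_mass=0.5):
    """neighbours: dict mapping node label -> edge weight (hypothetical helper)."""
    # Split neighbours into textual features (TF: prefix) and entity nodes.
    text = {n: w for n, w in neighbours.items() if n.startswith("TF:")}
    ents = {n: w for n, w in neighbours.items() if not n.startswith("TF:")}
    probs = {}
    for group, mass in ((text, tf_mass), (ents, 1.0 - tf_mass)):
        if not group:                    # no neighbours of this kind
            continue
        total = sum(group.values())
        for n, w in group.items():
            # Normalise within the group, then scale by the group's mass.
            probs[n] = mass * w / total
    return probs
```

With the numbers from the example (tf_mass = 0.6), this yields (t1:0.43, t2:0.17, c1:0.13, c2:0.13, c3:0.13), a proper probability distribution from which the next node is sampled.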

Step 3: Prepare your training and testing data

This can be done by the following commands:

$ cd ms-lstm
$ python prepare_data.py --train_data <trainfile> --test_data <testfile> --dev_data <devfile>

The data files must have the following form:

node1|text1
node2|text2
...
  • The texts are expected to be short focused texts at the level of phrases or sentences.
  • Each node can have more than one entry in the data file.

The script tokenises the texts and replaces the words with integers. As output, the script produces 4 text files: one for the train set, one for the test set, one for the dev set (optional), and one containing the mapping of words to their corresponding indices.

Important: If the data is preprocessed by other tools, care must be taken that the assignment of indices is uniform across all files (train, test, and dev), i.e. each word must have the same index everywhere.
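
The indexing convention can be sketched as follows (a hypothetical helper, not prepare_data.py itself). The key points are that index 0 is reserved for padding (see the maxlen option in Step 4) and that one mapping must be shared by the train, test, and dev files:

```python
def build_index(texts):
    """Assign each word a unique integer, starting at 1 (0 is reserved for padding)."""
    word2idx = {}
    for text in texts:
        for word in text.lower().split():
            word2idx.setdefault(word, len(word2idx) + 1)
    return word2idx

def encode(text, word2idx):
    """Replace the words of a text with their indices from the shared mapping."""
    return [word2idx[w] for w in text.lower().split() if w in word2idx]
```

Building the index once over all texts and then encoding each file with it guarantees that a word gets the same index in every file.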

Step 4: Train the model

For this use the following command:

$ cd ms-lstm
$ python train.py --data <traindata> --vspace <vectorspace> --outfile <outputfile> --senses <number_of_senses> --vocsize <vocabulary_size> --maxlen <maxlen>
  • For the data parameter provide the train file generated by prepare_data.py, as described in Step 3.
  • For the vspace parameter provide the entity vector space generated by deepwalk_tf, as described in Step 2.
  • The senses option defines the number of sense embeddings (default 3).
  • The program also needs to know the size of your vocabulary, in order to create the embedding and multi-sense embedding matrices. To find this, check the word-to-index mapping file that has been prepared in Step 3 (the indices are sorted in ascending order).
  • The parameter maxlen defines the length of the number sequences representing the texts. Sequences with fewer numbers (i.e. texts with fewer words) are padded with zeroes (which are "masked" during training), while sequences with more numbers are truncated to the specified value.
  • Set option bilstm to 1 for using a bi-LSTM instead of an LSTM (default 0).
  • The model is saved in a file (option outfile) after each epoch. You will need this file during the testing stage.
  • For more options please use the command:

    $ python train.py --help
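
The maxlen behaviour described above amounts to the following sketch (the actual code relies on Keras utilities, which may pad at the front rather than the back):

```python
def pad_to_maxlen(seq, maxlen):
    """Truncate or zero-pad an integer sequence to exactly maxlen entries."""
    seq = seq[:maxlen]                       # longer sequences are truncated
    return seq + [0] * (maxlen - len(seq))   # shorter ones are zero-padded (masked)
```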

Step 5: Test the model

Use the following command:

$ cd ms-lstm
$ python test.py --data <testdata> --vspace <vectorspace> --model <modelfile> --maxlen <maxlen>
  • For the data parameter provide the test file generated by prepare_data.py, as described in Step 3.
  • For the vspace parameter provide the entity vector space generated by deepwalk_tf, as described in Step 2. This must be the same vector space you used during training (Step 4).
  • In model provide the Keras model file that was generated during the training (Step 4).
  • The maxlen option has the same usage as in training, and must be the same as the value you used during training (Step 4).
  • The script generates a point in the entity space for each input text and checks for the entity nodes whose vectors are closest to that point. It reports MRR, strict accuracy, and top-20 accuracy.
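
The reported metrics can be sketched as follows, given the 1-based rank of the correct entity in each ranked list of nearest neighbours (a simplified illustration, not the actual test.py code):

```python
def evaluate(ranks, k=20):
    """ranks: 1-based rank of the correct entity for each test text.

    Returns (MRR, strict accuracy, top-k accuracy).
    """
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n     # mean reciprocal rank
    strict = sum(r == 1 for r in ranks) / n   # correct entity is the nearest vector
    top_k = sum(r <= k for r in ranks) / n    # correct entity within the k nearest
    return mrr, strict, top_k
```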

MSEmbedding layer

The dynamic disambiguation functionality is conveniently wrapped within the class MSEmbedding (ms-lstm folder), which extends the Keras class Layer and can be used independently as part of any Keras model. The MSEmbedding layer must always follow an Embedding layer, as below:

emb_layer = Embedding(voc_size+1, dimensions, input_length=max_len)(input_sequences)
ms_emb_layer = MSEmbedding(voc_size+1,senses=number_of_senses)([emb_layer,input_sequences])
  • Note that the input to the layer is a list consisting of the "ambiguous" embedding layer and the sequences of numbers representing the input texts; this second input is necessary for the layer to know which embeddings to update.

    • emb_layer shape: (batch_size, max_len, dimensions)
    • input_sequences shape: (batch_size, max_len)
  • The dimensions of the sense embeddings are automatically set to be equal to the dimensions of the "ambiguous" embeddings.

  • The parameter input_length defines the length of the number sequences representing the texts. Sequences with fewer numbers (i.e. texts with fewer words) are padded with zeroes (which are "masked" during training), while sequences with more numbers are truncated to the specified value.
  • The output of the layer is a tensor of shape (batch_size, max_len, dimensions).

Pre-trained vectors

Folder vectors contains pre-trained vectors created from graphs extended with textual features for WordNet and SNOMED CT:

  • WordNet: Two sets (150 and 300 dimensions), 118k synsets
  • SNOMED CT: One set (150 dimensions), 380k concepts

The SNOMED CT vectors are published by permission of SNOMED International, as declared in the Licence section of this document.

Citing

If you find this material useful in your research, please cite:

@InProceedings{kartsaklis_etal:EMNLP2018,
  author={Dimitri Kartsaklis and Mohammad T. Pilehvar and Nigel Collier},
  title={Mapping {T}ext to {K}nowledge {G}raph {E}ntities using {M}ulti-{S}ense {LSTM}s},
  booktitle={Proceedings of the 2018 {C}onference on {E}mpirical {M}ethods in {N}atural {L}anguage {P}rocessing ({EMNLP})},
  year={2018},
  month={November},
  address={Brussels, Belgium},
  publisher={Association for {C}omputational {L}inguistics}
}  

Licence

The code in this repository is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License version 3 as published by the Free Software Foundation. The code is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

SNOMED Clinical Terms® (SNOMED CT®) is used by permission of SNOMED International. All rights reserved. SNOMED CT® was originally created by the College of American Pathologists. “SNOMED”, “SNOMED CT” and “SNOMED Clinical Terms” are registered trademarks of SNOMED International. Use of SNOMED CT is governed by the conditions of the SNOMED CT licence issued by SNOMED International.

Contact info

For questions or more information please use the following:

References

[1] D. Kartsaklis, M.T. Pilehvar, N. Collier (2018). Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium.

[2] B. Perozzi, R. Al-Rfou, S. Skiena (2014). DeepWalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, USA.