ChemListem: Chemical Named Entity Recognition with deep neural networks
ChemListem is a package for chemical named entity recognition, developed for the CEMP task of BioCreative V.5. ChemListem uses deep learning, as implemented using the keras package, to do this. ChemListem also uses scikit-learn, h5py and numpy, and has pre-trained models that require the use of TensorFlow.
ChemListem is written in Python 3, and is known to be compatible with Python 3.5 and Python 3.6. It has been tested on Windows 10, and Ubuntu 14.
ChemListem may be installed from PyPI via pip:
pip install chemlistem
Note that this does not install tensorflow by default - you will have to install that yourself; pip install tensorflow will do. It has been left out of the dependencies because keras can use other libraries instead of tensorflow, because there is a separate GPU-enabled version of tensorflow, and because of occasional version compatibility issues.
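Because tensorflow is not pulled in automatically, it can be handy to check which packages are importable before loading a model. The following is a minimal sketch using only the standard library. The helper name missing_packages is hypothetical and not part of ChemListem; the package list is taken from the dependencies named in this README.

```python
import importlib.util

def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names for the dependencies mentioned in this README.
# (scikit-learn is imported under the name "sklearn".)
required = ["keras", "tensorflow", "sklearn", "h5py", "numpy"]
missing = missing_packages(required)
if missing:
    print("Missing packages: " + ", ".join(missing))
else:
    print("All required packages were found.")
```

This only checks that a package can be located; it does not check version compatibility.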
At the time of writing there is a problem with the latest version of tensorflow - 1.11. We have found that using tensorflow 1.10
We have found that the WinPython distribution, which contains pre-built versions of the dependencies, works well on Windows. The following procedure had previously been found to work, but may need updating:
- Obtain WinPython 18.104.22.168Qt5-64bit, and install it.
- Update the keras package, using
pip install --upgrade --no-deps keras
- Install ChemListem
The pre-trained models for chemlistem were trained using the TensorFlow backend. If you are already using keras, then ensure that keras is set up to use TensorFlow - alternatively, if you need to use Theano, consider compiling your own model files (see "Training" below).
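Multi-backend keras normally reads its backend from the "backend" field of a keras.json file (usually in the .keras directory under your home directory), and the KERAS_BACKEND environment variable takes precedence over that file. The sketch below shows one way to check which backend is configured. The function name configured_backend and the sample configuration text are illustrative assumptions, not part of ChemListem or keras.

```python
import json

def configured_backend(config_text, env=None):
    """Return the backend named by keras.json text, honouring KERAS_BACKEND."""
    env = env or {}
    if "KERAS_BACKEND" in env:
        # The environment variable overrides the config file.
        return env["KERAS_BACKEND"]
    # Assumption: fall back to "tensorflow" when the field is absent.
    return json.loads(config_text).get("backend", "tensorflow")

# Sample text standing in for the contents of a real keras.json file.
sample_config = '{"floatx": "float32", "epsilon": 1e-07, "backend": "theano"}'

print(configured_backend(sample_config))
print(configured_backend(sample_config, {"KERAS_BACKEND": "tensorflow"}))
```

With a real installation, you could pass the contents of your own keras.json and os.environ instead of the sample values.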
ChemListem uses three models - a "traditional" model, a "minimalist" model and an ensemble model that combines the two. The following example shows how to use the ensemble model:
from chemlistem import get_ensemble_model
model = get_ensemble_model()
results = model.process("The morphine was dissolved in ethyl acetate.")
print(results)
The output should be as follows:
[(4, 12, 'morphine', 0.9738021492958069, True), (30, 43, 'ethyl acetate', 0.9788203537464142, True)]
The output is a list with one entry per chemical named entity. Each entry contains, in order:
- The start character position.
- The end character position.
- The string of the entity.
- The score of the entity - i.e. how confident chemlistem is that the entity is a true entity. 1.0 = maximum confidence, 0.0 = minimum confidence.
- Whether the entity is "dominant" i.e. not overlapping with a higher-score entity.
(There may also be some messages from TensorFlow talking about an "unknown op" - these can usually be ignored.)
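The fields of each entry can be unpacked directly. The short sketch below uses the example output above (copied in by hand, so it runs without loading a model) and shows that the end position is exclusive, so that slicing the input text with the start and end positions recovers the entity string.

```python
text = "The morphine was dissolved in ethyl acetate."

# Results copied from the example output above.
results = [
    (4, 12, 'morphine', 0.9738021492958069, True),
    (30, 43, 'ethyl acetate', 0.9788203537464142, True),
]

for start, end, entity, score, dominant in results:
    # The end position is exclusive, so a plain slice recovers the entity.
    assert text[start:end] == entity
    print("{}-{}\t{}\t{:.3f}\t{}".format(start, end, entity, score, dominant))
```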
ChemListem can be tuned for precision or recall by varying a threshold that confidence scores are tested against. Furthermore, chemlistem can be set up to report overlapping guesses. For example:
results = model.process("The morphine was dissolved in ethyl acetate.", 0.00001, False)
for r in results:
    print(r)

(0, 12, 'The morphine', 0.00017620387734496035, False)
(4, 12, 'morphine', 0.9738021492958069, True)
(4, 16, 'morphine was', 0.00012143117555751815, False)
(4, 26, 'morphine was dissolved', 7.890002598287538e-05, False)
(4, 43, 'morphine was dissolved in ethyl acetate', 1.1027213076886255e-05, False)
(4, 44, 'morphine was dissolved in ethyl acetate.', 1.1027213076886255e-05, False)
(13, 16, 'was', 1.1566731700440869e-05, True)
(17, 26, 'dissolved', 4.46309641120024e-05, True)
(17, 43, 'dissolved in ethyl acetate', 1.0192422223553876e-05, False)
(17, 44, 'dissolved in ethyl acetate.', 1.0192422223553876e-05, False)
(27, 35, 'in ethyl', 1.8829327018465847e-05, False)
(27, 43, 'in ethyl acetate', 0.00015375280418084003, False)
(27, 44, 'in ethyl acetate.', 3.4707042686932255e-05, False)
(30, 35, 'ethyl', 3.010855562024517e-05, False)
(30, 43, 'ethyl acetate', 0.9788203537464142, True)
(30, 44, 'ethyl acetate.', 3.4707042686932255e-05, False)
(36, 43, 'acetate', 7.422585622407496e-05, False)
(36, 44, 'acetate.', 1.424002675776137e-05, False)
The second argument to process is the threshold. The third argument is whether to exclude non-dominant entities or not. For example, model.process("The morphine was dissolved in ethyl acetate.", 0.00001, True) gives:
(4, 12, 'morphine', 0.9738021492958069, True)
(13, 16, 'was', 1.1566731700440869e-05, True)
(17, 26, 'dissolved', 4.46309641120024e-05, True)
(30, 43, 'ethyl acetate', 0.9788203537464142, True)
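The "dominant" flags in the example output are consistent with a greedy selection: take the candidate entities in order of decreasing score, and flag a candidate as dominant if it does not overlap any candidate already flagged. The sketch below is an assumption inferred from the sample output only, not a copy of ChemListem's code. The function names overlaps and mark_dominant are hypothetical, and the scores are rounded from the output above.

```python
text = "The morphine was dissolved in ethyl acetate."

# (start, end, score) for the 18 candidates listed above, scores rounded.
spans = [
    (0, 12, 1.76e-04), (4, 12, 0.9738), (4, 16, 1.21e-04),
    (4, 26, 7.89e-05), (4, 43, 1.10e-05), (4, 44, 1.10e-05),
    (13, 16, 1.16e-05), (17, 26, 4.46e-05), (17, 43, 1.02e-05),
    (17, 44, 1.02e-05), (27, 35, 1.88e-05), (27, 43, 1.54e-04),
    (27, 44, 3.47e-05), (30, 35, 3.01e-05), (30, 43, 0.9788),
    (30, 44, 3.47e-05), (36, 43, 7.42e-05), (36, 44, 1.42e-05),
]

def overlaps(a, b):
    """True if the half-open spans a and b share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def mark_dominant(spans):
    """Greedily flag non-overlapping spans, highest score first."""
    chosen = []
    for span in sorted(spans, key=lambda s: s[2], reverse=True):
        if not any(overlaps(span, c) for c in chosen):
            chosen.append(span)
    chosen_set = set(chosen)
    # Return the spans in their original order with a dominant flag added.
    return [(s[0], s[1], s[2], s in chosen_set) for s in spans]

for start, end, score, dominant in mark_dominant(spans):
    if dominant:
        print(start, end, text[start:end], score)
```

Under this reading, 'was' is dominant even though it overlaps the higher-scoring 'morphine was', because 'morphine was' is itself excluded by 'morphine'.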
To use the traditional and minimalist models, use
get_mini_model instead of
There are fast versions of these models which need to be run with a CUDNN-enabled GPU. To load these models, use
If you wish to process multiple lines quickly, there is a batchprocess method which accepts a list of strings, and gives a list of results. Neural networks run faster if they can process several items in parallel, so using batch processing can give a speed increase. For example:
import chemlistem
tm = chemlistem.get_trad_model()
results = tm.batchprocess(["This is ethyl acetate and ethanol.", "This is codeine and morphine."], 0.5, False)
for r in results:
    for l in r:
        print(l)
    print("---")

(8, 21, 'ethyl acetate', 0.9956316, True)
(26, 33, 'ethanol', 0.975327, True)
---
(8, 15, 'codeine', 0.98441076, True)
(20, 28, 'morphine', 0.9720572, True)
---
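Judging by the example output, batchprocess returns one list of entities per input string, in the same order as the inputs. Under that assumption, the results can be paired back up with the lines they came from. The sketch below uses the example output above, copied in by hand so that it runs without a model; the helper name entities_by_line is hypothetical.

```python
lines = [
    "This is ethyl acetate and ethanol.",
    "This is codeine and morphine.",
]

# Results copied from the example output above, one list per input line.
batch_results = [
    [(8, 21, 'ethyl acetate', 0.9956316, True),
     (26, 33, 'ethanol', 0.975327, True)],
    [(8, 15, 'codeine', 0.98441076, True),
     (20, 28, 'morphine', 0.9720572, True)],
]

def entities_by_line(lines, batch_results):
    """Pair each input line with the entity strings found in it."""
    return [(line, [ent[2] for ent in ents])
            for line, ents in zip(lines, batch_results)]

for line, names in entities_by_line(lines, batch_results):
    print(line, "->", names)
```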
ChemListem is bundled with model files. However, you may wish to train your own. The training data is available here - you will need to get the GPRO & CEMP training set 2016, and extract the files BioCreative V.5 training set.txt and CEMP_BioCreative V.5 training set annot.tsv. Optionally, you may also wish to obtain the GloVe pre-trained word vectors - get glove.6B.zip, unzip it and find
The traditional model may be trained using the following example:
from chemlistem import tradmodel
tm = tradmodel.TradModel()
tm.train("BioCreative V.5 training set.txt",
         "CEMP_BioCreative V.5 training set annot.tsv",
         "D:/glove.6B/glove.6B.300d.txt",
         "tradtest")
If you have a CUDNN-enabled GPU, you can include the option gpu=True in the tm.train call. This should speed up training. Note that models trained in this manner cannot be directly used on non-GPU-enabled systems.
If you do not wish to use GloVe, then instead of
Alternatively, in the BitBucket repository there is the file vectors_patents.zip. Download it, unzip it, and use the text file therein instead of the GloVe file. This has been prepared specially from pharmaceutical patent abstracts, with ChemListem's tokenisation and capitalisation controls, and gives better results.
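GloVe text files hold one token per line, followed by the space-separated components of its vector. Before starting a training run of several hours, it may be worth checking that a vectors file parses and has a consistent dimension. The sketch below assumes that the patent vectors file follows the same layout as GloVe (the text above only says it can be used in place of the GloVe file). The function name read_vectors is hypothetical and the sample values are made up.

```python
import io

def read_vectors(handle, limit=None):
    """Read up to `limit` token/vector pairs from a GloVe-style text file."""
    vectors = {}
    for i, line in enumerate(handle):
        if limit is not None and i >= limit:
            break
        parts = line.rstrip("\n").split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Tiny made-up sample standing in for a real vectors file.
sample = io.StringIO("the 0.1 0.2 0.3\nethanol -0.5 0.25 1.0\n")
vectors = read_vectors(sample)
dims = {len(v) for v in vectors.values()}
print(len(vectors), "tokens, dimensions:", dims)
```

With a real file, pass an open file handle (for example one opened with encoding="utf-8") and a small limit to avoid reading the whole file.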
This process will take several hours to run. It should eventually produce two files: tradmodel_tradtest.json (also several files of the form epoch_*_tradtest.h5). These are your model files. To use them, follow
from chemlistem import tradmodel
tm = tradmodel.TradModel()
tm.load("tradmodel_tradtest.json", "tradmodel_tradtest.h5")
print(tm.process("This test includes morphine."))
Training and loading the minimalist model is similar - however, this takes several days, does not use GloVe, and does not produce a JSON file. Examples:
# Training
from chemlistem import minimodel
mm = minimodel.MiniModel()
mm.train("BioCreative V.5 training set.txt", "CEMP_BioCreative V.5 training set annot.tsv", "minitest")

# Loading and using the trained model
from chemlistem import minimodel
mm = minimodel.MiniModel()
mm.load("minimodel_minitest.json", "minimodel_minitest.h5")
print(mm.process("This test includes morphine."))
Once you have produced these two models, then you may produce an ensemble model. Example:
from chemlistem import tradmodel, minimodel, ensemblemodel
tm = tradmodel.TradModel()
tm.load("tradmodel_tradtest.json", "tradmodel_tradtest.h5")
mm = minimodel.MiniModel()
mm.load("minimodel_minitest.json", "minimodel_minitest.h5")
em = ensemblemodel.EnsembleModel(tm, mm)
print(em.process("This test includes morphine."))
The train method here has several options:
- gpu - set this to True for fast training, as per the same option in the traditional system.
- unsupfile - give this the filename of a file containing sentences from patent abstracts. In the BitBucket repository there is a file called patent_lines.zip which is good for this.
- nunsup - how many lines to use from the file. If this is larger than the number of lines in the file, some or all of the lines will be used more than once. 0 means no unsupervised learning; -1 means all the lines, once only.
- unsupcfg - this contains options to control when the various unsupervised learning techniques take place. See the docstring for more details, or just leave it unset - there is a good default.
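The nunsup rules can be illustrated with a short sketch. This is not ChemListem's code; the function name select_unsup_lines is hypothetical, and taking lines in file order and wrapping around from the start is an assumption made for illustration.

```python
from itertools import cycle, islice

def select_unsup_lines(lines, nunsup):
    """Pick lines for unsupervised learning following the nunsup rules."""
    if nunsup == 0:
        return []              # no unsupervised learning
    if nunsup == -1:
        return list(lines)     # every line, once only
    # Otherwise take nunsup lines, wrapping around if the file is shorter.
    return list(islice(cycle(lines), nunsup))

lines = ["a", "b", "c"]
for n in (0, -1, 2, 5):
    print(n, select_unsup_lines(lines, n))
```

With three lines available, asking for five uses two of the lines twice, which matches the "some or all of the lines more than once" behaviour described above.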
Differences from published versions.
The system here is as described in a forthcoming full-text journal paper. Due to version compatibility difficulties, the model files supplied are not exactly the same as those used for publication - they were built using the same code with the same hyperparameters, but the random initialisation was different, giving slightly different results.
There are also some minor differences from the version used in the original BioCreative V.5 submission.
This repository also contains the source code for our entry in BioCreative VI Task 5, in the subdirectory cl_bc6_chemprot.
ChemListem has been developed by the Data Science group at the Royal Society of Chemistry.
ChemListem is distributed under the MIT License - see License.txt.
There is a paper describing ChemListem, published as part of the BioCreative V.5 proceedings: Peter Corbett, John Boyle. “Chemlistem - chemical named entity recognition using recurrent neural networks”. Proceedings of the BioCreative V.5 Challenge Evaluation Workshop (2017): 61-68. There is also a journal paper forthcoming. If you use chemlistem in your work, please cite it.
I would like to thank Adam Bernard for the initial Python translation of the tokeniser that chemlistem uses.
Peter Corbett, 2016-2018 The Royal Society of Chemistry