
Purpose

This manual describes how to train customized word2vec models. A general-purpose model is already trained and available with the Query Suggestion API (section "Local Installation"). To replace it with a customized one, edit the property file "WEB-INF\classes\application.properties" in the web archive and set the parameter "corpus.location" to the file path of your training result.
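As a minimal sketch of this edit, assuming the web archive has been unpacked so that application.properties is a plain file on disk, the helper below replaces any existing "corpus.location" entry with a new path. The function name set_corpus_location and the example paths are illustrative and not part of the project.

```shell
# Hypothetical helper (not part of the project): set corpus.location
# in a properties file, replacing an existing entry if there is one.
set_corpus_location() {
  props_file="$1"
  model_path="$2"
  tmp_file="$(mktemp)"
  # Copy every line except an existing corpus.location entry.
  grep -v -E '^[[:space:]]*corpus\.location[[:space:]]*[=:]' "$props_file" > "$tmp_file" || true
  # Append the new entry.
  printf 'corpus.location=%s\n' "$model_path" >> "$tmp_file"
  # Write back in place so the file keeps its permissions.
  cat "$tmp_file" > "$props_file"
  rm -f "$tmp_file"
}

# Example call (paths are assumptions):
# set_corpus_location WEB-INF/classes/application.properties /data/output/skipgrammodel.bin
```

Rewriting the file through a temporary copy keeps all other properties untouched. Windows paths with backslashes would need escaping in a properties file, which this sketch does not handle.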

Please note that the code is written in Java and that this is a Maven project. The following tools are required:

  • Java 8
  • Maven 3

Before you go to the next step, we strongly recommend that you set your Java environment variables. A description can be found under "How do I set or change the PATH system variable?".
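One way to confirm that the environment is set up is to check that both tools can be found on the PATH. The following is a small sketch; the function names tool_on_path and check_build_tools are illustrative, and the tool names java and mvn are the standard launcher names for Java and Maven.

```shell
# Return 0 if the named tool can be found on the PATH, 1 otherwise.
tool_on_path() {
  command -v "$1" >/dev/null 2>&1
}

# Report whether each required build tool is available.
check_build_tools() {
  missing=0
  for tool in java mvn; do
    if tool_on_path "$tool"; then
      echo "$tool: found"
    else
      echo "$tool: not found on PATH"
      missing=1
    fi
  done
  return "$missing"
}

# Example call: print the versions only if both tools are available.
# check_build_tools && java -version && mvn -version
```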

Build the executables

Please go to the project folder and run the commands

  • mvn clean package -P word2vec-tools
  • mvn clean package -P RESTful -Dassembly.skipAssembly=true

These Maven build commands compile the source code and package the compiled binaries into two archives in the subfolder "target":

  • lailapssuggestion.war: the RESTful web service. This Java web application archive can be deployed in all Java EE containers.

  • LAILAPS-QSM.jar: Bundle of tools to generate word2vec model used by the RESTful service

Retrieve a life science text corpus

The default text corpus is based on abstracts that can be downloaded from PubMed. Because of copyright issues, only the titles and abstracts of the articles can be downloaded. We strongly recommend batch downloading. For example, to download the articles published in 2016, type "("2016"[Date - Create] : "2016"[Date - Create])" into the search box and click the "search" button. On the left side of the search result page, click "Abstract" to select only the title and abstract of each article. Finally, click "Send to->File->Abstract(text)" on the search result page.

A program is provided to convert the exported PubMed text format into a text corpus; see the DataExtract tool below.

Export PubMed text format file

Command line tool

To train a word2vec model from a text corpus, execute the Java archive LAILAPS-QSM.jar:

* java -Xms20480M -Xmx40960M -jar LAILAPS-QSM.jar -i <path to input files> -o <output path> -m <feature size> -w <maximum word skip> -t <training iterations> -n <parallel threads> -f <format of supplied corpus>
* `-i`: The folder path of input files - all files in this folder are read in and must be of the same type.
* `-o`: The path to save the resulting word vectors.
* `-m`: The feature size of word vectors. Defaults to `200`.
* `-w`: The max skip length between words. Defaults to `5`.
* `-t`: The number of training iterations. Defaults to `1`.
* `-n`: The number of training threads. Defaults to `1`. More threads make training faster but reduce the accuracy of the trained model.
* `-f`: The format of the input files: `0` is the PubMed text format, `1` is any other format (one document per line).

The folder of input files

(1) DataExtract

This tool tokenizes all text documents in the input folder into a final text corpus. It supports PubMed abstracts and lists of text documents in which each line comprises one document. Please make sure to remove newlines within each document before you compile them into the container format below:

ID Document
1 Screening plant growth-promoting rhizobacteria for improving growth and yield of wheat.
2 Bioremediation of vegetable and agrowastes by Pleurotus ostreatus: a novel strategy to produce edible mushroom with enhanced yield and nutrition.
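The newline removal can be sketched as follows, assuming each document is stored as a separate *.txt file in one folder. The function name flatten_documents is illustrative. The source does not state which separator stands between the ID and the text or whether the "ID Document" header line is required, so the tab separator and the missing header in this sketch are assumptions to adjust.

```shell
# Hypothetical helper (not part of the project): flatten each *.txt file
# in a folder to a single line and print it as "<ID><TAB><text>".
flatten_documents() {
  input_dir="$1"
  id=0
  for doc in "$input_dir"/*.txt; do
    [ -f "$doc" ] || continue
    id=$((id + 1))
    # Replace line breaks with spaces, squeeze repeated spaces,
    # and trim a leading or trailing space.
    text="$(tr '\r\n' '  ' < "$doc" | tr -s ' ' | sed 's/^ //; s/ $//')"
    # Assumption: ID and text are separated by a tab.
    printf '%s\t%s\n' "$id" "$text"
  done
}

# Example call (paths are assumptions):
# flatten_documents /data/raw > /data/text/documents.txt
```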

The command line parameters are:

* `-i`: The folder path of input files - all files in this folder are read in and must be of the same type.
* `-o`: The folder path of the output file. This program generates a corpus file named corpus.txt in the output folder.
* `-f`: The format of the input files: `0` is the PubMed text format, `1` is any other format (one document per line). For PubMed abstracts, use `0`.
* For example: java -cp LAILAPS-QSM.jar -Xms2048M -Xmx4096M de.ipk_gatersleben.data.DataExtract -i /data/text -o /data/output -f 0

(2) Word2Phrase

This tool extends the text corpus with phrases. A phrase is a group of words that functions as a constituent in the syntax of a sentence, that is, a single unit within a grammatical hierarchy, such as "heading date" or "flowering time" in the life science field. The command line parameters are:

* `-i`: The path of the corpus text file (for example, the corpus.txt generated by DataExtract).
* `-o`: The path to the generated phrase corpus file.
* `-m`: Words that appear fewer times than this number are discarded. Defaults to `5`.
* `-t`: The threshold for forming phrases (a higher value means fewer phrases). Defaults to `100`.
* For example: java -cp LAILAPS-QSM.jar -Xms10240M -Xmx20480M de.ipk_gatersleben.model.Word2Phrase -i /data/output/corpus.txt -o /data/output/phrase.txt -m 5 -t 100

Please note: this tool consumes a lot of memory for big text corpora. Please assign suitable heap memory to your Java virtual machine. For details, please refer to "Tuning Java Virtual Machines". To build the multi-purpose text corpus of 1.5 billion terms, we set an initial heap of 10 GB and a maximum heap of 20 GB using the java command line arguments '-Xms10240M -Xmx20480M'.
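The heap arguments in the note above are given in megabytes. As a small worked example of the conversion, the helper below turns heap sizes in gigabytes into the matching -Xms (initial heap) and -Xmx (maximum heap) flags. The function name heap_flags is illustrative and not part of the project.

```shell
# Hypothetical helper (not part of the project): print the -Xms/-Xmx
# flags for an initial and a maximum heap size given in GB.
heap_flags() {
  initial_gb="$1"
  max_gb="$2"
  # 1 GB = 1024 MB, so 10 GB becomes 10240M.
  printf '%sM %sM\n' "-Xms$((initial_gb * 1024))" "-Xmx$((max_gb * 1024))"
}

# Example call: heap_flags 10 20 prints "-Xms10240M -Xmx20480M".
```

The output can be placed directly on the java command line, for example `java $(heap_flags 10 20) -cp LAILAPS-QSM.jar ...`, where the substitution is left unquoted on purpose so that the two flags are passed as separate arguments.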

(3) Word2Vec

This tool implements the actual training of the word2vec model. The command line parameters are:

* `-i`: The path of the training file.
* `-o`: The path to save the resulting word vectors.
* `-f`: The feature size of word vectors. Defaults to `200`.
* `-w`: The max skip length between words. Defaults to `5`.
* `-s`: Set threshold for occurrence of words. Those that appear with higher frequency in the training data will be randomly down-sampled; default is 1e-3, useful range is (0, 1e-5).
* `-t`: The training iterations (default 1).
* `-a`: Set the starting learning rate; default is 0.025.
* `-n`: The number of training threads. Defaults to `1`. More threads make training faster but reduce the accuracy of the trained model.
* For example: java -cp LAILAPS-QSM.jar -Xms20480M -Xmx40960M de.ipk_gatersleben.model.Word2Vec -i /data/output/phrase.txt -o /data/output/skipgrammodel.bin -f 200 -w 5 -s 0.001 -t 1 -a 0.025 -n 4

Please note: this tool consumes a lot of memory for big text corpora. Please assign suitable heap memory to your Java virtual machine. For details, please refer to "Tuning Java Virtual Machines". To build the multi-purpose text corpus of 1.5 billion terms, we set an initial heap of 10 GB and a maximum heap of 20 GB using the java command line arguments '-Xms10240M -Xmx20480M'.
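The three tools are meant to be run one after another, with each step reading the output of the previous one. The sketch below chains the example commands given for DataExtract, Word2Phrase and Word2Vec above. The names run, train_pipeline and DRY_RUN are illustrative and not part of the project; the class names, options and heap settings are copied from those examples, and the file names corpus.txt, phrase.txt and skipgrammodel.bin follow them as well.

```shell
# Print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper (not part of the project): run the three tools in order.
#   $1 = path to LAILAPS-QSM.jar
#   $2 = folder with PubMed text files
#   $3 = output folder
train_pipeline() {
  jar="$1"
  text_dir="$2"
  out_dir="$3"
  # (1) Tokenize the PubMed files into corpus.txt.
  run java -cp "$jar" -Xms2048M -Xmx4096M de.ipk_gatersleben.data.DataExtract \
    -i "$text_dir" -o "$out_dir" -f 0 || return 1
  # (2) Extend the corpus with phrases.
  run java -cp "$jar" -Xms10240M -Xmx20480M de.ipk_gatersleben.model.Word2Phrase \
    -i "$out_dir/corpus.txt" -o "$out_dir/phrase.txt" -m 5 -t 100 || return 1
  # (3) Train the word2vec model on the phrase corpus.
  run java -cp "$jar" -Xms20480M -Xmx40960M de.ipk_gatersleben.model.Word2Vec \
    -i "$out_dir/phrase.txt" -o "$out_dir/skipgrammodel.bin" \
    -f 200 -w 5 -s 0.001 -t 1 -a 0.025 -n 4
}

# Example call (paths are assumptions); set DRY_RUN=1 to only print the commands:
# train_pipeline target/LAILAPS-QSM.jar /data/text /data/output
```

Stopping at the first failing step avoids training on a missing or incomplete intermediate file.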

License

Copyright (c) 2017 Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany. All rights reserved. This program and the accompanying materials are made available under the terms of the GNU General Public License, version 2, which accompanies this distribution and is available at https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.