
Install the Airpedia tools

The Airpedia project is part of The Wiki Machine project, which includes a set of Java libraries to parse and manage the Wikipedia XML dumps, available for free from the Wikipedia community. In particular, the Airpedia part uses the ideas described in Palmero et al. 2013abc to create a set of new DBpedia mappings and to classify pages in the DBpedia ontology when they do not contain an infobox.

The process needed to create a new DBpedia chapter (mappings for classes and properties, classification of pages without an infobox) is quite long, but with this tutorial and the bash scripts provided with the Airpedia library it should be straightforward.

Step 0. Before starting

First of all, this set of scripts needs some common tools, which can be downloaded and installed for free on all major operating systems. Windows users may be at a disadvantage, because the bash scripts written to make things easier are not compatible with Windows.

The preprocess.sh script (see below) will check that all this software is already installed on the machine.

Step 1. Preprocessing script

There are some preliminary actions to do before launching any script contained in The Wiki Machine project and, of course, in its Airpedia part.

Choose a folder where all the scripts and tools will be installed and extracted. It should be an empty folder, to avoid overwriting any files already in it. In this tutorial, we'll call it the "data folder" and use ~/data as an example. This folder will contain a set of subfolders following The Wiki Machine folder layout, so it is convenient to download the project into the subfolder corresponding to the sources.

  • Let's create the data folder with mkdir ~/data. If a folder with the same name already exists in your home directory, just choose another name.
  • Create the sources folder, where we are going to download the Airpedia source code (or just the jar): mkdir ~/data/sources.
  • Now enter the folder by typing cd ~/data/sources.
  • Clone the Airpedia project using git: git clone https://bitbucket.org/fbk/airpedia. This may take a while (depending on the connection speed).
  • Enter the Airpedia folder: cd airpedia.
  • Go to the develop branch: git checkout develop.
  • When the download is complete, you need to set some configuration variables by creating the config.sh file in the scripts folder. There is a template file that can be used as a starting point. Enter the scripts folder: cd src/bin.
  • Copy the template file: cp config.sh.default config.sh.
  • Edit the newly created config.sh and set the first three variables (there are other variables, but there is no need to edit them at this point):
    • NUM_THREADS is the number of threads you want to use for the extraction. The higher the number, the faster the process.
    • MEM sets the memory (RAM) that can be used by the Java Virtual Machine. Again, a higher value results in faster processing.
    • DATAFOLDER is the data folder chosen above, ~/data in our example. It is better to enter the folder using the absolute path, such as /home/user/data, so that it remains valid if the scripts are run by a different user.
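
For example, a config.sh edited for an 8-core machine with 16 GB of RAM could start with the lines below (the values are only illustrative; adapt them to your machine):

# Illustrative values: adjust the thread count, the JVM memory limit and the path to your setup
NUM_THREADS=8
MEM=16G
DATAFOLDER=/home/user/data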

After doing that, you can launch the preprocessing tool.

./preprocess.sh

This script will: create the folder structure; check that all the needed software is installed on the machine; clone two additional repositories (The Wiki Machine library and DBpedia Extraction Framework); download some additional files that are used by the next steps.

You can skip any of these operations, see ./preprocess.sh -h for help.

Finally, edit the config.sh file again, and set the variables:

  • JAR is the path of the compiled Java file. It should be ~/data/sources/airpedia/target/airpedia-[VERSION]-jar-with-dependencies.jar, where [VERSION] depends on the Airpedia version of the downloaded package. Again, it is better to use the absolute path instead of the ~-based one.
  • CONFIGFOLDER is the path to the folder where the configuration files for the languages are stored. It is part of The Wiki Machine library. You can leave this variable at its default value, unless you customized those files and stored them in a different folder.
  • DBP_VERSION is the version of the DBpedia Ontology that is used for the extraction. It can be, for example, 3.9 or 2014.
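
After this step, that part of config.sh could look like the following sketch (the path is only an example, [VERSION] must be replaced by the version of the jar actually built in the target folder, and CONFIGFOLDER is left at its default value):

# Example only: use your own absolute path and the real jar version
JAR=/home/user/data/sources/airpedia/target/airpedia-[VERSION]-jar-with-dependencies.jar
DBP_VERSION=2014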

Step 2. Wikipedia extraction

The next step consists in downloading the Wikipedia dumps and extracting some structured information from them. This step can be very slow, as some chapters of Wikipedia are really big and both downloading and extracting them may take some time. For example, the English Wikipedia contains (at the time of writing this tutorial) almost 5 million articles: the XML dump is 10.8 GB in size (48 GB unpacked) and it takes a couple of hours to extract on an 8-core computer with 16 GB of RAM.

If you want to create a new chapter of DBpedia (say, Vulcanian, as a toy example), you first need to extract the structured information for that language (i.e. Vulcanian); then, you also need the same information for each language you want to use as "training", that is, languages that have already been manually mapped to the DBpedia ontology. You can check the list of these languages on the DBpedia mappings website. The more existing mappings there are, the higher the accuracy of the newly generated mappings. Note that you can also generate these mappings for a language that is already present on the DBpedia mappings website (this can be useful especially for languages with few mappings), and you do not need to remove it from the list of languages used as training. For example, if Vulcanian already has 10 mappings, you can use them in conjunction with the mappings of other languages: the Airpedia tool will generate a new set of mappings that may or may not contain the original 10, depending on how well they agree with similar mappings in other languages.

First of all, you should link the configuration folder.

. ./config.sh
ln -s $TWM_CONFIGFOLDER

For each language you want to extract information for, just use the extract-single.sh script. It accepts a mandatory option -l for the language, expressed as an ISO 639 code ("en" for English, "it" for Italian, and so on).

For example, for French you can use:

./extract-single.sh -l fr

The script also accepts some additional options, which can be listed using ./extract-single.sh -h.

As mentioned, you need to execute this command several times, once for each language you want to use. Make sure that the language whose dump you are going to download/extract has a properties file in the configuration folder, otherwise the script won't do anything. You can add a custom language by creating the corresponding properties file (more info to come, stay tuned).
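
For instance, a simple shell loop can run the extraction for several languages in sequence; the language codes below are only examples, and each of them must have a properties file in the configuration folder:

# Example only: replace the codes with the languages you actually need
for lang in en it de fr es; do
  ./extract-single.sh -l $lang
done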

There is an additional script, extract-all.sh, which lists all the properties files in the configuration folder and downloads/extracts every Wikipedia related to them. It is really slow (there are properties files for 33 languages) and executing it may take several days.

Step 3. WikiData extraction

Download the latest version of WikiData from the Wikimedia server. You can also use the faster your.org mirror. Select the version number, then download the file named wikidatawiki-<VERSION>-pages-articles.xml.bz2 into $DATAFOLDER/corpora/wikidata/ and bunzip it in the same folder.
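
A minimal sketch of this step, assuming the standard layout of the Wikimedia dump server (the your.org mirror can be used in the same way):

# <VERSION> is the 8-digit dump date; the URL layout is an assumption based on the standard Wikimedia dumps site
cd $DATAFOLDER/corpora/wikidata/
wget https://dumps.wikimedia.org/wikidatawiki/<VERSION>/wikidatawiki-<VERSION>-pages-articles.xml.bz2
bunzip2 wikidatawiki-<VERSION>-pages-articles.xml.bz2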

Extract it using this command (<VERSION> is the 8-digit version number, such as 20150330):

. ./config.sh
. ./include.sh
export CLASSPATH=$JAR
java eu.fbk.twm.wiki.xmldump.WikiDataExtractor \
  -o $MODELSFOLDER/wikidata/<VERSION>/ \
  -t $NUM_THREADS \
  -w $CORPORAFOLDER/wikidata/wikidatawiki-<VERSION>-pages-articles.xml

If you want to use it in production, just create the current symlink:

cd $MODELSFOLDER/wikidata/
rm -f current
ln -s <VERSION> current

This model should be re-created from time to time.

Step 4-TWM. Postprocessing

For each language, you should create the LSA models. You can do this once and for all, as the LSA model will not change in the future. You first need to compile the svdlibc package from here.

./create-lsa.sh -l <language> -c

Then, finish creating the models for each language.

./extract-more.sh -l <language>

Do not delete the dumps after this step, because they are used by the next ones. In general, the latest dump for each language should be kept.

Step 5-TWM. Multi-language models

There is a set of cross-language models that must be created every time you add a language (or extract a new model for an existing language). You need the resources package, which you can obtain by cloning it into the $DATAFOLDER folder.

cd $DATAFOLDER
git clone https://bitbucket.org/aprosio/resources.git

Now that the resources folder is ready, create the models:

./create-namnom.sh -c
./create-topic.sh -c
./create-dbpedia.sh -c

Step 6-TWM. Adding topics

The topic model relies on the $DATAFOLDER/resources/topic-type-mapping/datasets/ folder, which contains the mappings between the Wikipedia meta information and the topics. The topic list is contained in the /topic-type-mapping/topic.txt file. The mappings are files named in the form

<LANG>wiki-<XXX>2topic.properties

where <LANG> is the ISO code of the language, and <XXX> can be cat, suff, sec, infobox, nav or portal. For the cat mapping, the Wikipedia category hierarchy is used. There are some rules that can be used to tell the tool how to behave:

  • Life=: the tool stops ascending when the category Life is found.
  • Political_science=pol: the tool stops ascending when the category Political_science is found and a vote for pol is given.
  • #Flight=: the tool ignores the Flight category and continues ascending the hierarchy.
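
Putting these rules together, a fragment of a <LANG>wiki-cat2topic.properties file would look like this (the entries simply mirror the three rules above):

Life=
Political_science=pol
#Flight=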

Step 7-TWM. Starting the server

First of all, download and compile the twm-service library by cloning it into the $DATAFOLDER/sources folder.

cd $DATAFOLDER/sources
git clone https://bitbucket.org/cgiuliano/twm-service.git
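
The build step is not covered by the scripts; assuming twm-service is a Maven project (the jar-with-dependencies file name used below suggests the Maven assembly plugin), compiling it might look like this:

# Assumption: the project builds with Maven and produces the jar-with-dependencies used below
cd twm-service
mvn clean package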

To start the server, just run

java -cp target/jservice-2.3-jar-with-dependencies.jar \
  -Dfile.encoding=UTF-8 -mx12G com.machinelinking.jservice.server.MainHttpServer \
  -o <HOST> \
  -p <PORT> \
  --no-authentication -a

where <HOST> is the complete host (e.g. http://api.myserver.com) and <PORT> is the port.

Step 4-Airpedia. Download DBpedia RDF dumps

The Airpedia project works mainly with information that is already present in DBpedia, thanks to the effort of various communities around the world. In Step 1, we already downloaded the mappings using the DBpedia Extraction Framework. These XML files are very small and contain all the information needed to create the RDF triples that form a DBpedia chapter. While it is very easy and quick to extract classes, the steps needed to extract properties are quite complex and slow, therefore it is more convenient to download them directly from the DBpedia website.

The script below downloads the DBpedia properties files for every compatible language (i.e. languages having a properties file in the configuration folder). It is better to exclude automatically created languages, for example the ones created using Airpedia (Esperanto, Swedish and Ukrainian at the time of writing).

The command is:

./download-dbpedia.sh -x uk -x sv -x eo

The script also creates a Lucene index containing all the properties in all languages. This index is needed for the next steps, but its creation can be skipped using the -s option. See ./download-dbpedia.sh -h for more information.

Step 5-Airpedia. Extract classes from Wikipedia pages

The documentation of the following steps is still under development.

Script to run: extract-dbpedia.sh

Step 6-Airpedia. Extract properties from Wikipedia infoboxes

Script to run: extract-properties.sh

Step 7-Airpedia. Create a DBpedia release

Script to run: create-dbpedia.sh
