KBKranten Ngramviewer

The KBKranten Ngramviewer analyzes and indexes the <u>Historische Kranten</u> corpus of the <u>Royal Library</u>, up to 5-grams, in Elasticsearch, and provides an API with suggest/&lt;term&gt; and search/&lt;term&gt; endpoints on that corpus. The search/&lt;term&gt; endpoint filters the corpus on the &lt;term&gt; and returns a histogram of both the absolute frequency and the relative frequency (the frequency of that ngram in a year divided by the total number of ngrams with that n in that year).
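The relative-frequency computation can be illustrated with a small sketch. The function name and data layout below are hypothetical, not taken from the actual codebase:

```python
def relative_frequencies(ngram_counts, totals_per_year):
    """Compute per-year relative frequencies for a single ngram.

    ngram_counts: {year: frequency of the ngram in that year}
    totals_per_year: {year: total number of ngrams with the same n in that year}
    Returns {year: frequency / total}, mirroring the search/<term> histogram.
    """
    return {
        year: float(count) / totals_per_year[year]
        for year, count in ngram_counts.items()
        if totals_per_year.get(year)  # skip years without a known total
    }


# Example: an ngram occurring 50 times in 1910 out of 1,000,000 ngrams of
# that n, and 80 times in 1911 out of 2,000,000.
histogram = relative_frequencies({1910: 50, 1911: 80},
                                 {1910: 1000000, 1911: 2000000})
```

Years for which no total is known are simply left out of the histogram.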

Also, there is a <u>web interface</u> on top of the API that visualizes the &lt;term&gt; histograms and links back to the <u>Historische Kranten</u> website to provide the user with some context.

Requirements

  • Python 2.7.5
    • Flask 0.10.1
      • Flask-Assets 0.8
      • Flask-Script 0.6.2
    • elasticsearch 0.4.1
    • fabric 1.0.0
    • uWSGI 1.9.17.1
    • cssmin 0.1.4
    • closure 20121212
    • pytz 2013.7
  • Java >= 6
  • Elasticsearch 0.90.5
  • 7GB RAM (to fit autocompletion model and cache)
  • ± 75GB disk space

Installation & configuration

The KBKranten Ngramviewer consists of two major components. The <u>Elasticsearch component</u> handles the storage, search and autocompletion of the data. The <u>application component</u> functions as a proxy to <u>Elasticsearch</u>, relaying search and suggest requests, as well as fetching the initial results page from the <u>KB</u>. This way, <u>Elasticsearch</u> does not have to be exposed to the "outside world" and we have fine-grained control over what outsiders can send to the search engine.
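The "fine-grained control" idea can be sketched as follows: instead of forwarding raw request bodies, the application builds the Elasticsearch query itself from a small, whitelisted set of user inputs. All names below (the field names, the helper) are illustrative, not the actual application code:

```python
def build_search_body(term, start_year=None, end_year=None):
    """Build a restricted Elasticsearch query body from user input.

    Only the term and an optional year range reach the search engine;
    arbitrary user-supplied query DSL is never forwarded.
    """
    if not term or len(term.split()) > 5:  # corpus is indexed up to 5-grams
        raise ValueError("term must be a 1- to 5-gram")
    body = {"query": {"match_phrase": {"ngram": term}}}
    if start_year is not None and end_year is not None:
        # "filtered" queries are the 0.90-era way to combine query + filter
        body["query"] = {
            "filtered": {
                "query": body["query"],
                "filter": {"range": {"year": {"gte": start_year,
                                              "lte": end_year}}},
            }
        }
    return body


body = build_search_body("politieke partij", 1900, 1940)
```

The application then POSTs this body to Elasticsearch on behalf of the user, so the client only ever speaks the narrow suggest/search API.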

Elasticsearch

Download and extract the <u>proper Elasticsearch version</u>. Move to the directory Elasticsearch was extracted in, and create the data and logs directories. Then, install the <u>ICU analyzer plugin</u>:

$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.5.tar.gz
$ tar -xzf elasticsearch-0.90.5.tar.gz

$ cd elasticsearch-0.90.5
$ mkdir data
$ mkdir logs

$ bin/plugin -install elasticsearch/elasticsearch-analysis-icu/1.11.0

The directory contains a config directory with configuration files for Elasticsearch and logging. Edit the following parameters in config/elasticsearch.yml:

path.home: "path/to/elasticsearch/installation" # this is the folder Elasticsearch was extracted to
bootstrap.mlockall: true # locks the memory to prevent swapping
network.host: 127.0.0.1 # Only listen to requests from localhost
node.name: "<nodename>" # Optional, otherwise the node will be assigned a random name

Different configurations are certainly possible; check the Elasticsearch reference for more information.

Copy the indices to the data/ directory (there is a copy of the indices available on ilps-plexer):

$ cp -r /zfs/ilps-plexer/kbkranten/elasticsearch data/

Start Elasticsearch, giving it 10 GB of heap space (at least 7 GB is required to build the FST for the autocompleter; additional memory will be used for caching):

$ bin/elasticsearch -Xmx10g -Xms10g

On *nix, this will start Elasticsearch as a background process. Use the -f switch to run it in the foreground. Elasticsearch will recover the indices that are placed in the data directory and load the autocompletion FST.

Application

Clone this repository. Preferably, create a virtualenv associated with this application. Then, install the required Python libraries:

$ pip install -r requirements.txt

In the kb_ngramviewer folder, settings.py contains the configuration required for the application to function. Make sure the ES_HOSTS setting reflects your setup:

ES_HOSTS = [{'host': '127.0.0.1', 'port': 9200}]

You can check whether your setup works by running the development server (do NOT use this in production!):

python manage.py runserver

Point your browser to http://<myserver>:5000.


NOTE: The following instructions might be a bit different for setup on mashup2 (no proxying over http, for example)

Start the application by running the following command in the directory where wsgi.py is located:

uwsgi --http :5001 --pidfile uwsgi.pid -p 4 -w wsgi:application

This will start the application server on http://&lt;myserver&gt;:5001 with 4 worker processes and write the pid of the process to uwsgi.pid in the same directory. You will probably want to proxy it, so make sure there is an Apache vhost like this:

<VirtualHost *:80>
    ProxyPreserveHost Off
    ProxyPass / http://<myserver>:5001/
    ProxyPassReverse / http://<myserver>:5001/
    ServerName kbkranten.politicalmashup.nl
    Timeout 600
</VirtualHost>

It is advisable to use process control software to manage starting, stopping and monitoring the Elasticsearch and uWSGI processes. <u>Supervisor</u> is recommended.
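A minimal supervisord configuration for the two processes could look like the sketch below. The paths and the heap size are placeholders for your setup, and ES_HEAP_SIZE is assumed to be honored by the Elasticsearch start script; note the -f switch, since Supervisor requires its programs to run in the foreground:

```ini
[program:elasticsearch]
command=/path/to/elasticsearch-0.90.5/bin/elasticsearch -f
environment=ES_HEAP_SIZE="10g"
autostart=true
autorestart=true

[program:kb_ngramviewer]
command=uwsgi --http :5001 --pidfile uwsgi.pid -p 4 -w wsgi:application
directory=/path/to/pm-ngramviewers-kbkranten
autostart=true
autorestart=true
```

With this in place, `supervisorctl start all` brings up both processes and restarts them if they crash.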