Overview
KBKranten Ngramviewer
The KBKranten Ngramviewer analyzes and indexes the Historische Kranten corpus of the Royal Library (KB) up to 5-grams in Elasticsearch, and provides an API with `suggest/<term>` and `search/<term>` endpoints on that corpus. The `search/<term>` endpoint filters the corpus on the term and returns a histogram of the frequency and the relative frequency (the frequency of that ngram in a year divided by the total number of ngrams with that n in that year). There is also a web interface on top of the API that visualizes the term histograms and links back to the Historische Kranten website to provide the user with some context.
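The relative-frequency computation described above can be sketched in a few lines of Python. This is a hypothetical illustration of the formula, not the viewer's actual code, and the counts used in the example are made up:

```python
def relative_frequencies(ngram_counts, totals_per_year):
    """Map year -> (frequency, relative frequency) for one ngram.

    ngram_counts:    {year: occurrences of the ngram in that year}
    totals_per_year: {year: total number of ngrams with that n in that year}
    """
    return {
        year: (count, count / float(totals_per_year[year]))
        for year, count in ngram_counts.items()
    }

# Hypothetical counts for a single ngram in two years:
hist = relative_frequencies({1900: 50, 1901: 30}, {1900: 1000, 1901: 600})
# hist[1900] == (50, 0.05), hist[1901] == (30, 0.05)
```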
Requirements
- Python 2.7.5
- Flask 0.10.1
- Flask-Assets 0.8
- Flask-Script 0.6.2
- elasticsearch 0.4.1
- fabric 1.0.0
- uWSGI 1.9.17.1
- cssmin 0.1.4
- closure 20121212
- pytz 2013.7
- Java >= 6
- Elasticsearch 0.90.5
- elasticsearch-analysis-icu (1.11.0) plugin
- 7GB RAM (to fit autocompletion model and cache)
- ± 75GB disk space
Installation & configuration
The KBKranten Ngramviewer consists of two major components. The Elasticsearch component handles the storage, search and autocompletion of the data. The application component functions as a proxy to Elasticsearch, relaying search and suggest requests, as well as fetching the initial results page from the KB. This way, Elasticsearch does not have to be exposed to the "outside world" and we have fine-grained control over what outsiders can send to the search engine.
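The "fine-grained control" the proxy provides could look roughly like the sketch below: only a fixed set of endpoints is relayed, and everything else is rejected before it reaches Elasticsearch. This is a hypothetical illustration, not the application's actual code:

```python
ALLOWED_ENDPOINTS = ('suggest', 'search')  # the only operations relayed to Elasticsearch

def validate_request(path):
    """Return (endpoint, term) if the path is an allowed API call, else None."""
    parts = path.strip('/').split('/', 1)
    if len(parts) == 2 and parts[0] in ALLOWED_ENDPOINTS and parts[1]:
        return parts[0], parts[1]
    return None  # rejected: never forwarded to the search engine

# Examples:
validate_request('search/fiets')       # -> ('search', 'fiets')
validate_request('_cluster/settings')  # -> None (blocked)
```

Whitelisting at the proxy means administrative Elasticsearch endpoints (cluster settings, index deletion, and so on) are simply unreachable from outside.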
Elasticsearch
Download and extract the proper Elasticsearch version. Move to the directory Elasticsearch was extracted in, and create the `data` and `logs` directories. Then, install the ICU analyzer plugin:
$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.5.tar.gz
$ tar -xzf elasticsearch-0.90.5.tar.gz
$ cd elasticsearch-0.90.5
$ mkdir data
$ mkdir logs
$ bin/plugin -install elasticsearch/elasticsearch-analysis-icu/1.11.0
The directory contains a `config` directory, which holds the configuration files for Elasticsearch and logging. Edit the following parameters in `config/elasticsearch.yml`:
path.home: "path/to/elasticsearch/installation"  # the folder Elasticsearch was extracted to
bootstrap.mlockall: true                         # lock the memory to prevent swapping
network.host: 127.0.0.1                          # only listen to requests from localhost
node.name: "<nodename>"                          # optional; otherwise the node is assigned a random name
Other configurations are possible; see the Elasticsearch reference for more information.
Copy the indices to the `data/` directory (a copy of the indices is available on ilps-plexer):
$ cp -r /zfs/ilps-plexer/kbkranten/elasticsearch data/
Start Elasticsearch, giving it 10 GB of heap space (at least 7 GB is required to build the FST for the autocompleter; additional memory will be used for caching):
$ bin/elasticsearch -Xmx10g -Xms10g
On *nix, this will start Elasticsearch as a background process. Use the `-f` switch to run it in the foreground. Elasticsearch will recover the indices that are placed in the data directory and load the autocompletion FST.
Application
Clone this repository. Preferably, create a virtualenv associated with this application. Then, install the required Python libraries:
$ pip install -r requirements.txt
In the `kb_ngramviewer` folder, `settings.py` contains the required configuration for the application to function. Make sure the `ES_HOSTS` setting reflects your setup:
ES_HOSTS = [{'host': '127.0.0.1', 'port': 9200}]
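The application relays search requests to Elasticsearch; the exact query it sends is defined in the source, but as a hypothetical illustration, a per-year histogram for a term in the Elasticsearch 0.90-era query DSL could look like this (the field names `text` and `date` are assumptions):

```python
import json

def year_histogram_query(term):
    """Build a hypothetical term-per-year histogram query (0.90-era facet DSL)."""
    return {
        'query': {'match': {'text': term}},
        'facets': {
            'years': {
                'date_histogram': {'field': 'date', 'interval': 'year'},
            },
        },
        'size': 0,  # only the facet counts are needed, not the documents
    }

print(json.dumps(year_histogram_query('fiets'), indent=2))
```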
You can check whether your setup works by running the development server (do NOT use this in production!):
$ python manage.py runserver
Point your browser to http://<myserver>:5000.
NOTE: the following instructions may differ slightly for a setup on mashup2 (no proxying over HTTP, for example).
Start the application by running the following command in the directory where `wsgi.py` is located:
$ uwsgi --http :5001 --pidfile uwsgi.pid -p 4 -w wsgi:application
This will start the application server on http://<myserver>:5001 with 4 worker processes and write the `pid` of the process to `uwsgi.pid` in the same directory. You'll probably want to proxy it, so make sure there is a vhost like this:
<VirtualHost *:80>
    ProxyPreserveHost Off
    ProxyPass / http://<myserver>:5001/
    ProxyPassReverse / http://<myserver>:5001/
    ServerName kbkranten.politicalmashup.nl
    Timeout 600
</VirtualHost>
It is advisable to use process control software to manage starting, stopping and monitoring the Elasticsearch and uWSGI processes. Supervisor is recommended.
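For example, a supervisord configuration fragment along these lines could manage both processes. The program names and paths below are examples only; adjust them to your setup. Note that supervisord requires the processes it manages to run in the foreground, hence the `-f` switch for Elasticsearch:

```ini
; fragment for /etc/supervisord.conf -- paths and program names are examples
[program:elasticsearch]
command=/path/to/elasticsearch-0.90.5/bin/elasticsearch -f -Xmx10g -Xms10g
autorestart=true

[program:kb_ngramviewer]
directory=/path/to/kb_ngramviewer
command=uwsgi --http :5001 --pidfile uwsgi.pid -p 4 -w wsgi:application
autorestart=true
```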