Overview

HTTPS SSH

archie

Full-text search for mbox mailing list archives, written on top of PostgreSQL
using Python and CherryPy.

What this package does right now, is:

  • importing mbox archives (with handling various issues)
  • indexing the messages using a GIN index (built-in full-text)
  • searching the archives using a simple web application (Python + CherryPy)

Not (yet) supported features:

  • attachments (the whole messages are stored, but the attachments are not yet
    available for download)
  • multiple languages (all messages are handled using 'english' dictionary)

Requirements

To make this work, you need

  • Python 2.7 or newer
  • PostgreSQL 9.x

It might work with older versions, I haven't tried that.

Install

First, download the package somehow - either by cloning the repo, or as a .tgz.

$ git clone git@bitbucket.org:tvondra/archie.git

Create a database and tables:

$ createdb archie
$ psql archie < archie/sql/create.sql

Make sure you have a hstore extension available (the SQL script will throw an
error if it can't do a CREATE EXTENSION). Now you're ready to import the data.

Downloading the data

Importing the data is quite simple - just download all the archives you want to
import, ungzip them and then pass them to ./bin/load.py. There's a download
script ./bin/download.py that fetches selected lists from postgresq.org in an
intelligent way (skipping already downloaded files that did not change etc.).

To use it just create a directory and run the script in it:

# mkdir /tmp/archives
# cd /tmp/archives
# ~/archie/bin/download.py --all-lists

It takes quite some time to fetch the files, so be patient. There's also a
download.sh script that downloads all the lists in parallel, so it's much
faster.

Or you may use your custom script, no problem with that - just keep in mind
that the mailing list archives on postgresql.org are password protected, so
you'll have to handle this.

Importing the data

To actually load the data into the database, use the load.py script:

$ ~/archie/bin/load.py --db archie pgsql-general.*

The tool can use multiple worker processes to speed up the loading - by default
the number of cores is used, but you may override it using the "--workers" option.
The maximum number of workers is equal to the number of lists (files for each
list need to be loaded sequentially, to detect threads/replies).

You may also load the data file-by-file, but be very careful as the order is
important to properly detect threads - you should never load archives for one
list out of sequence (older after newer).

The load.py script directly supports reloading an archive - the already loaded
messages will be skipped and the remaining ones will be loaded. So you may load
a fresh copy of the mbox archive regularly (e.g. daily) to get incremental updates
of the archive.

For each loaded archive (file), the tool prints some basic info including number
of messages that were / were not imported. The messages are mostly due to the
repeated loads (already loaded messages).

Warming-up

It's quite common that the database needs to warm-up, i.e. load all the important
data from disk to RAM, and that may take some time. That's why there's a warmup.py
script that does exactly this. It may be used for benchmarking too - just tell
how long should it run:

$ ./bin/warmup.py --db archie --duration 60

You may disable parts of the process (e.g. accessing the messages by hash or using
thread ID), but that's mostly for benchmarking purposes.

It's difficult to say in advance how long it should run - the best you can do is
run multiple short executions and stop once the numbers stabilize. You may also
watch "iostat -x" output and terminate the script once the %util drops to near 0.

Starting the server

Then edit the server.conf file by customizing the db details (database name,
host, port and password). Then you may run the bin/run-server.py, which should
just start the cherrypy server on http://localhost:8080.

Searching

Just go to http://localhost:8080 and do some queries. The queries accept all the
common built-in full-text syntax - it's basically handed over to plainto_tsquery
function. So for example these are all valid queries

slow query
slow & query
slow & (! query)
(slow | fast) & query

and so on. More details are available in the PostgreSQL Full Text documentation,
i.e. http://www.postgresql.org/docs/9.2/interactive/datatype-textsearch.html.

The search works either with individual posts or whole threads - all you need to
do is use the checkbox next to the search field. It's also possible to choose a
time interval.