Full-text search for mbox mailing list archives, written on top of PostgreSQL using Python and CherryPy.
What this package does right now, is:
- importing mbox archives (with handling various issues)
- indexing the messages using a GIN index (built-in full-text)
- searching the archives using a simple web application (Python + CherryPy)
Not (yet) supported features:
- attachments (the whole messages are stored, but the attachments are not yet available for download)
- multiple languages (all messages are handled using 'english' dictionary)
To make this work, you need
- Python 2.7 or newer
- PostgreSQL 9.x
It might work with older versions, I haven't tried that.
First, download the package somehow - either by cloning the repo, or as a .tgz.
$ git clone email@example.com:tvondra/archie.git
Create a database and tables:
$ createdb archie $ psql archie < archie/sql/create.sql
Make sure you have a hstore extension available (the SQL script will throw an error if it can't do a CREATE EXTENSION). Now you're ready to import the data.
Downloading the data
Importing the data is quite simple - just download all the archives you want to import, ungzip them and then pass them to ./bin/load.py. There's a download script ./bin/download.py that fetches selected lists from postgresq.org in an intelligent way (skipping already downloaded files that did not change etc.).
To use it just create a directory and run the script in it:
# mkdir /tmp/archives # cd /tmp/archives # ~/archie/bin/download.py --all-lists
It takes quite some time to fetch the files, so be patient. There's also a download.sh script that downloads all the lists in parallel, so it's much faster.
Or you may use your custom script, no problem with that - just keep in mind that the mailing list archives on postgresql.org are password protected, so you'll have to handle this.
Importing the data
To actually load the data into the database, use the load.py script:
$ ~/archie/bin/load.py --db archie pgsql-general.*
The tool can use multiple worker processes to speed up the loading - by default the number of cores is used, but you may override it using the "--workers" option. The maximum number of workers is equal to the number of lists (files for each list need to be loaded sequentially, to detect threads/replies).
You may also load the data file-by-file, but be very careful as the order is important to properly detect threads - you should never load archives for one list out of sequence (older after newer).
The load.py script directly supports reloading an archive - the already loaded messages will be skipped and the remaining ones will be loaded. So you may load a fresh copy of the mbox archive regularly (e.g. daily) to get incremental updates of the archive.
For each loaded archive (file), the tool prints some basic info including number of messages that were / were not imported. The messages are mostly due to the repeated loads (already loaded messages).
It's quite common that the database needs to warm-up, i.e. load all the important data from disk to RAM, and that may take some time. That's why there's a warmup.py script that does exactly this. It may be used for benchmarking too - just tell how long should it run:
$ ./bin/warmup.py --db archie --duration 60
You may disable parts of the process (e.g. accessing the messages by hash or using thread ID), but that's mostly for benchmarking purposes.
It's difficult to say in advance how long it should run - the best you can do is run multiple short executions and stop once the numbers stabilize. You may also watch "iostat -x" output and terminate the script once the %util drops to near 0.
Starting the server
Then edit the server.conf file by customizing the db details (database name, host, port and password). Then you may run the bin/run-server.py, which should just start the cherrypy server on http://localhost:8080.
Just go to http://localhost:8080 and do some queries. The queries accept all the common built-in full-text syntax - it's basically handed over to plainto_tsquery function. So for example these are all valid queries
slow query slow & query slow & (! query) (slow | fast) & query
and so on. More details are available in the PostgreSQL Full Text documentation, i.e. http://www.postgresql.org/docs/9.2/interactive/datatype-textsearch.html.
The search works either with individual posts or whole threads - all you need to do is use the checkbox next to the search field. It's also possible to choose a time interval.