Overview

PyPI on CouchDB

Currently, there are two ways to retrieve data from PyPI (the Python Package Index): you can rely either on XML-RPC or on the "simple" API. The simple API is not as simple to use as its name suggests, and it has several drawbacks.

Basically, if you want to use information coming from the simple API, you have to parse web pages yourself and extract what you need with some black voodoo magic. Sadly, magic has a price: it's sometimes impossible to get exactly the information you want from the index. That's the technique currently used by distutils2, setuptools and pip.
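To give an idea of what that parsing looks like, here is a minimal sketch using only the Python 3 standard library. The sample page is made up for illustration (it only imitates the style of a "simple" index listing), and LinkCollector is a hypothetical helper, not the code used by distutils2, setuptools or pip.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs from the anchors of an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the anchor currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Only keep the text that sits directly inside an anchor.
        if self._href is not None:
            self.links.append((self._href, data.strip()))
            self._href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


# Made-up page imitating a "simple" index listing (not real PyPI output).
SAMPLE_PAGE = """
<html><body>
<a href="../../packages/source/f/foo/foo-1.0.tar.gz#md5=abc123">foo-1.0.tar.gz</a><br/>
<a href="http://example.org/foo" rel="homepage">1.0 home_page</a><br/>
</body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_PAGE)
links = collector.links
```

All you get back is a list of links and their text. The project name, the version and the kind of link still have to be guessed from file names and attributes, which is where the magic comes in.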

On the other hand, while XML-RPC works fine, it requires extra work from the Python servers each time you request something, which can lead to outages from time to time. It's also important to point out that, even though PyPI has a mirroring infrastructure, it only covers the so-called simple API, not the XML-RPC one.
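For comparison, here is a small sketch of what an XML-RPC request looks like, again assuming Python 3 and its standard xmlrpc.client module. It only builds and re-reads the request body, so nothing is sent over the network. package_releases is, to my knowledge, one of the methods exposed by the PyPI XML-RPC interface; the project name is just an example.

```python
import xmlrpc.client

# Serialize a call to "package_releases" for an example project name.
# This only builds the XML payload; no request is made.
payload = xmlrpc.client.dumps(("distutils2",), methodname="package_releases")

# Decode the payload again to show what the server would receive.
params, method = xmlrpc.client.loads(payload)
```

Against a live server, the same call would normally go through xmlrpc.client.ServerProxy, and every such call is handled by the server at request time.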

CouchDB

Here comes CouchDB. CouchDB is a document-oriented database that knows how to speak REST and JSON. It's easy to use, and it provides a replication mechanism out of the box.

So what?

Hmm, I'm sure you got it. This piece of software simply imports information from PyPI into a CouchDB instance. You can then replicate the whole PyPI index with a single HTTP request to the CouchDB server. You can also access the information from the index directly through a REST API that speaks JSON.
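As an illustration of that single HTTP request, here is a minimal sketch that prepares a call to CouchDB's /_replicate endpoint with the Python 3 standard library. The local server address (http://localhost:5984) and the target database name ("pypi") are assumptions, and build_replication_request is a hypothetical helper, not part of pypioncouch.

```python
import json
import urllib.request


def build_replication_request(target_server, source_db, target_db="pypi"):
    """Prepare (but do not send) a POST to CouchDB's /_replicate endpoint."""
    body = json.dumps({
        "source": source_db,    # remote database to pull from
        "target": target_db,    # local database name (assumed here)
        "create_target": True,  # create the local database if it is missing
    }).encode("utf-8")
    return urllib.request.Request(
        target_server.rstrip("/") + "/_replicate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_replication_request(
    "http://localhost:5984",               # assumed local CouchDB server
    "http://couchdb.notmyidea.org/pypi/",  # main node mentioned in this post
)
# Passing the request to urllib.request.urlopen() would start the replication.
```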

How does all this work?

Under the hood, it uses the PyPI XML-RPC API to get data from PyPI and generates the corresponding CouchDB records.
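The record layout is not described here, so the following is only a sketch of the general idea: take the metadata dictionary returned for a release and turn it into a JSON document that CouchDB can store. The "name-version" identifier scheme and the release_to_document helper are assumptions made for the example, not necessarily what pypioncouch does.

```python
import json


def release_to_document(metadata):
    """Turn a release metadata dict into a CouchDB-style document."""
    doc = dict(metadata)  # copy, so the input is left untouched
    # Hypothetical identifier scheme: one document per release.
    doc["_id"] = "{0}-{1}".format(metadata["name"], metadata["version"])
    return doc


# Made-up metadata, shaped like what an XML-RPC release lookup might return.
sample = {
    "name": "foo",
    "version": "1.0",
    "summary": "An example project",
    "author": "Jane Doe",
}

doc = release_to_document(sample)
body = json.dumps(doc, sort_keys=True)  # what would be sent to CouchDB
```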

Example

You can use pypioncouch from the command line or through the Python API.

Using the command line

For a full import, you can do something like this. It takes a long time, because it fetches all the projects on PyPI and imports their metadata:

$ pypioncouch --fullimport http://your.couchdb.instance/

I recommend simply replicating the main node, hosted at http://couchdb.notmyidea.org/pypi/, to avoid duplicating nodes. However, if you already have the data on your own CouchDB instance, you can update it with the latest information from PyPI:

$ pypioncouch --update http://your.couchdb.instance/

Using the python API

You can also use the Python API to interact with pypioncouch:

>>> from pypioncouch import XmlRpcImporter, import_all, update
>>> import_all('http://localhost')
>>> update('http://localhost')

And that's it! Enjoy.

What's next?

I want to build a couchapp to make it easy to navigate PyPI. Here are some of the features I'd like to offer:

  • List all the available projects
  • List all the projects, filtered by specifiers
  • List all the projects by author/maintainer
  • List all the projects by keywords
  • A page for each project
  • A PyPI "Simple" API equivalent. Even though I want to replace it, I think it would make mirrors really easy to set up, thanks to the out-of-the-box CouchDB replication
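Listings like "projects by author" map naturally onto CouchDB views, which are queried over HTTP with a JSON-encoded key. The sketch below only builds such a query URL. The design document name ("pypi") and the view name ("by_author") are hypothetical, since the couchapp does not exist yet.

```python
import json
from urllib.parse import quote, urlencode


def view_url(db_url, design_doc, view, key):
    """Build the URL that queries a CouchDB view for a single key."""
    query = urlencode(
        {"key": json.dumps(key), "include_docs": "true"},
        quote_via=quote,  # encode spaces as %20 rather than "+"
    )
    return "{0}/_design/{1}/_view/{2}?{3}".format(
        db_url.rstrip("/"), design_doc, view, query
    )


# "pypi" and "by_author" are made-up names for the example.
url = view_url("http://couchdb.notmyidea.org/pypi/", "pypi", "by_author", "Jane Doe")
```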

I also still need to polish the import mechanism, so that I can store the following directly in CouchDB:

  • The OPML files for each project
  • The upload_time in a CouchDB-friendly format (a list of integers)
  • The tags as lists
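The last two conversions could look something like the sketch below. It assumes the upload time arrives as a datetime object and the tags as a single string separated by commas or whitespace; both helper names are hypothetical.

```python
from datetime import datetime


def upload_time_to_list(moment):
    """Convert a datetime into [year, month, day, hour, minute, second]."""
    return [moment.year, moment.month, moment.day,
            moment.hour, moment.minute, moment.second]


def keywords_to_list(keywords):
    """Split a free-form keywords string on commas and whitespace."""
    return keywords.replace(",", " ").split()


uploaded = upload_time_to_list(datetime(2010, 11, 1, 14, 30, 5))
tags = keywords_to_list("couchdb, pypi mirror")
```

A list of integers is convenient as a view key, because CouchDB compares arrays element by element, which makes it possible to query by year, by month, and so on.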

Once all of that works properly, I'll write a small client for distutils2 that relies on CouchDB instead of XML-RPC or the simple API, and I'll try to run some nodes and sync them against PyPI.

The work I've done so far is available at https://bitbucket.org/ametaireau/pypioncouch/. It's still a work in progress, but any feedback is highly appreciated!