This pull request is for review of my work on parallel build (i.e. writing; reading is much harder). Do not pull into main repo yet.
HOW TO TEST: use sphinx-build -j N, where N is the number of parallel processes.
Timing results from writing HTML for the whole Python 3.3 documentation (write only, no reading) on a machine with 2x2 cores:
-j1 -- 94 secs (this is using the old code)
-j2 -- 68 secs
-j4 -- 60 secs
I.e. on this machine we can achieve 30% time reduction using 4 cores. (Why not more: in the end there's a lot of IO involved reading the doctree and writing the HTML to disk. Also, with the latest changeset not everything is actually done in the subprocesses.) This will help for even bigger projects, I hope, even if the reading stage currently is more time intensive (especially due to autodoc).