Including raw HTML images and videos generates an enormous search index
Recently I've been experimenting with using sphinx to automate evaluating and embedding notebooks in the documentation for the yt project. A typical example is the new yt bootcamp, which links to a list of evaluated embedded notebooks.
Embedding notebooks like this is convenient for a number of reasons. Because we evaluate cleared notebooks using runipy, it also makes it straightforward to test the code snippets we include in the docs during the sphinx build process.
To get the embed working, I wrote two custom sphinx extensions (packaged here). Both extensions create a raw HTML node produced by running nbconvert on an evaluated IPython notebook.
The HTML we get back from nbconvert quite often has images and video encoded as base64 strings. Unfortunately, when the search index generator parses these pages, it splits the raw strings into gibberish words, creating an enormous search index file that effectively encodes the raw image and video data. In our case, searchindex.js was 65 MB, enough to noticeably delay the appearance of search results, even over fast internet connections.
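The effect can be illustrated with a minimal, self-contained sketch. It does not use sphinx's real indexing code: count_words is a stand-in that assumes the indexer splits text on non-word characters, and strip_data_uris is a hypothetical helper (not part of sphinx or of the extensions described here) showing one possible way to drop base64 payloads before the text is indexed.

```python
import base64
import re

# Matches inline data URIs such as data:image/png;base64,iVBORw0...
# Assumption: the payload contains no whitespace or line breaks.
DATA_URI_RE = re.compile(r"data:[\w.+-]+/[\w.+-]+;base64,[A-Za-z0-9+/=]+")


def strip_data_uris(html):
    """Return html with every base64 data URI replaced by an empty string."""
    return DATA_URI_RE.sub("", html)


def count_words(text):
    """Stand-in for an indexer's word splitter: count runs of word characters."""
    return len(re.findall(r"\w+", text))


if __name__ == "__main__":
    # Fake "image" payload; a real plot or video would be far larger.
    payload = base64.b64encode(bytes(range(256)) * 64).decode("ascii")
    page = '<p>A plot:</p><img src="data:image/png;base64,%s">' % payload

    print("words before stripping:", count_words(page))
    print("words after stripping: ", count_words(strip_data_uris(page)))
```

Because base64 text contains "+" and "/" characters, a splitter like this breaks a single image into many distinct "words", none of which a user would ever search for.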
My solution to this issue is in this monkeypatch, which I've submitted as a pull request to the yt documentation repository. I'm not sure if it makes sense to include these changes in sphinx proper, so I'm bringing this issue up here in the hopes of getting some input from sphinx experts.