Issue #1308 resolved

Including raw html images and videos generates an enormous search index

Nathan Goldbaum
created an issue

Recently I've been experimenting with using sphinx to automate evaluating and embedding notebooks in the documentation for the yt project. A typical example is the new yt bootcamp, which links to a list of evaluated embedded notebooks.

Embedding notebooks like this is convenient for a number of reasons. Additionally, since we evaluate cleared notebooks using runipy, doing so also makes it straightforward to test code snippets we include in the docs during the sphinx build process.

To get the embed working, I wrote two custom sphinx extensions (packaged here). Both extensions create a raw HTML node produced by running nbconvert on an evaluated IPython notebook.

The HTML we get back from nbconvert quite often has images and video encoded as base64 strings. Unfortunately, when these pages get parsed by the search index generator, the raw strings are split into gibberish words, creating an enormous search index file that encodes the raw image and video data. In our case, searchindex.js was 65 MB, enough to noticeably delay the appearance of search results, even over fast internet connections.

My solution to this issue is in this monkeypatch, which I've submitted as a pull request to the yt documentation repository. I'm not sure if it makes sense to include these changes in sphinx proper, so I'm bringing this issue up here in the hopes of getting some input from sphinx experts.
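
To make the problem and the general idea of a fix concrete, here is a minimal, standalone sketch. It is not the monkeypatch itself and does not use any sphinx API: the function names `searchable_text` and `words`, the sample fragment, and the `\w+` word splitter are all assumptions made up for illustration. It strips the content of "style" and "script" elements, then removes the remaining tags, so that base64 data held in tag attributes never reaches the word splitter.

```python
import re


def searchable_text(raw_html):
    """Return roughly the human-readable text of an HTML fragment.

    Hypothetical helper for illustration only; regexes are a crude way
    to strip markup and are not a full HTML parser.
    """
    # Drop <style> and <script> elements together with their content.
    text = re.sub(r'(?is)<style.*?</style>', '', raw_html)
    text = re.sub(r'(?is)<script.*?</script>', '', text)
    # Replace every remaining tag, attributes included, with a space.
    text = re.sub(r'(?s)<[^<]+?>', ' ', text)
    return text


def words(text):
    """Stand-in for a search indexer's word splitter (an assumption)."""
    return re.findall(r'\w+', text)


fragment = (
    '<style>.cell { color: red; }</style>'
    '<p>Plot of gas density</p>'
    '<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA">'
)

# Splitting the raw fragment treats the base64 payload as a "word".
print(words(fragment))
# Splitting the stripped text leaves only the visible words.
print(words(searchable_text(fragment)))
```

With a real notebook the base64 payload runs to megabytes rather than a few characters, which is where the oversized searchindex.js comes from.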

Comments (5)

  1. Georg Brandl repo owner

    Thanks for the report! I actually quite like your monkeypatch; I extended it a bit to strip the content of "style" elements too, and otherwise to strip all kinds of tags.

    First I wanted to skip raw content altogether, since usually it's only used to put in specialized content (like you do), but then I could imagine people using "raw" to include legacy HTML content that should be searchable. Well, let's see if anybody complains about this solution.

  2. Nathan Goldbaum reporter

    Georg Brandl, did you mean to leave the raise SkipNode in the node.__class__ is raw block? My naive reading is that raw HTML will now no longer be included in the search index, unless the call to found_words.extend is sufficient to do that.
