Further ideas for import performance improvement

Issue #5 new
Samuel Lai repo owner created an issue

Rewrite the import process in Java so prepared SQL statements can be used with SQLite. https://bitbucket.org/xerial/sqlite-jdbc/ could serve as the SQLite connector.

I'm not sure the work is worth the potential benefit, though. The slowest part is still the lookup of comments when storing questions/answers.

An alternative idea is to look up comments on retrieval instead, storing them separately (either in SQLite or Solr). This, however, means comments would not be indexed (they currently are).
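
The lookup-on-retrieval idea could be sketched as follows, with comments kept in their own SQLite table keyed by (site, post) and fetched only at display time rather than during import. The table and column names here are illustrative, not taken from the actual codebase:

```python
import sqlite3

# Separate comment store, queried at display time instead of import time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (site_id TEXT, post_id INTEGER, body TEXT)")
# A composite index makes the per-post lookup cheap.
conn.execute("CREATE INDEX idx_site_post ON comments (site_id, post_id)")

conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [
        ("stackoverflow", 42, "First comment"),
        ("stackoverflow", 42, "Second comment"),
        ("superuser", 7, "Unrelated comment"),
    ],
)

def comments_for(site_id, post_id):
    """Fetch a post's comments on retrieval, in insertion order."""
    rows = conn.execute(
        "SELECT body FROM comments WHERE site_id = ? AND post_id = ? "
        "ORDER BY rowid",
        (site_id, post_id),
    )
    return [body for (body,) in rows]

print(comments_for("stackoverflow", 42))
```

The trade-off noted above still applies: anything stored this way is invisible to Solr's index.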

Comments (6)

  1. Alexei Baboulevitch

    From a quick profile, it seems that SQL access in finalise_question and Solr access in commit_questions take by far the longest: 11 and 13 seconds, respectively, of a 41-second run. For the latter, the DataImportHandler seems too promising to pass up: unless I'm reading it wrong, we're talking about an order-of-magnitude improvement on 30% of the total run time! Could the massaged data in finalise_question be written out to an intermediary XML file and then imported using DataImportHandler? A 10-30 minute write plus a 20-50 minute import vs. 3-7 hours!
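
    The intermediary-file idea might look something like the sketch below, serialising each massaged question dict into Solr's update-XML shape (`<add><doc><field name="...">`). The field names and the function are hypothetical; the real field set would come from the project's Solr schema:

```python
import xml.etree.ElementTree as ET

def questions_to_solr_xml(questions):
    """Serialise question dicts into Solr update-XML (illustrative fields)."""
    add = ET.Element("add")
    for q in questions:
        doc = ET.SubElement(add, "doc")
        for name, value in q.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

xml = questions_to_solr_xml([{"id": 1, "title": "Example question"}])
print(xml)
```

    Writing this out once per question batch, then bulk-importing the file, is what would trade the per-document HTTP round trips for a single streamed import.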

    As for those SQL lookups, it seems that prepared SQL statements would get you a 20% improvement at best (according to a quick Google search; I could be wrong), which might not be worth a Java rewrite. I tried a number of things to speed them up in Python, but nothing really helped. Maybe there's something lighter and quicker that could be used to temporarily store the comments?
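
    The lightest store imaginable would be a plain in-memory dict keyed by (site_id, post_id) - only viable if the comment set actually fits in RAM, which a full Stack Exchange dump may well not. A minimal sketch, with illustrative names:

```python
from collections import defaultdict

# In-memory temporary comment store; avoids SQL entirely but costs RAM.
comments_by_post = defaultdict(list)

def store_comment(site_id, post_id, body):
    comments_by_post[(site_id, post_id)].append(body)

def lookup_comments(site_id, post_id):
    # .get() on the plain-dict interface avoids creating empty entries.
    return comments_by_post.get((site_id, post_id), [])

store_comment("stackoverflow", 42, "First comment")
store_comment("stackoverflow", 42, "Second comment")
print(lookup_comments("stackoverflow", 42))
```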

  2. Samuel Lai reporter

    It has just occurred to me that I could probably move the Solr call in commit_questions to another thread so the main processing thread doesn't have to wait for the Solr request to complete (https://docs.python.org/2/library/queue.html would be useful here).
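
    That could be sketched with a bounded queue and a single worker thread, so the main thread keeps processing while Solr requests complete in the background. `solr_post` here is a stand-in for the real HTTP call, and the batch contents are illustrative (the linked docs are the Python 2 `Queue` module; this sketch uses the Python 3 `queue` spelling):

```python
import queue
import threading

work = queue.Queue(maxsize=4)  # bounded, so the importer can't outrun Solr
results = []

def solr_post(batch):
    # Placeholder for the actual HTTP request to Solr.
    results.append(len(batch))

def worker():
    while True:
        batch = work.get()
        if batch is None:  # sentinel: no more batches
            break
        solr_post(batch)
        work.task_done()

t = threading.Thread(target=worker)
t.start()

# Main processing thread keeps massaging data while the worker posts batches.
for batch in ([1, 2, 3], [4, 5], [6]):
    work.put(batch)

work.put(None)  # tell the worker to shut down
t.join()
print(results)  # one entry per committed batch
```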

    If I'm looking at writing out an intermediary file, I'd rather rewrite the whole import process in Java as a custom DataImportHandler. I suspect many users would run into disk-space issues with an intermediary file.

    Unfortunately, Python's SQLite library doesn't support prepared statements. Given that I only use the comments SQLite database to retrieve comments by siteId/postId, a simpler key/value store like Berkeley DB might be a better option, but I'll need to look at the cross-platform options available. (I considered a file per postId, but that would likely hit filesystem limits pretty quickly.)
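
    A key/value layout for this could look like the sketch below: one record per post, keyed by "siteId/postId", with the comment list serialised as JSON. The stdlib `dbm.dumb` backend stands in here for whichever cross-platform store is chosen; the key scheme and function names are assumptions:

```python
import dbm.dumb
import json
import os
import tempfile

# One dbm record per post; values are JSON-encoded lists of comment bodies.
path = os.path.join(tempfile.mkdtemp(), "comments")
db = dbm.dumb.open(path, "c")

def add_comment(site_id, post_id, body):
    key = "%s/%s" % (site_id, post_id)
    comments = json.loads(db[key]) if key in db else []
    comments.append(body)
    db[key] = json.dumps(comments)

def get_comments(site_id, post_id):
    key = "%s/%s" % (site_id, post_id)
    return json.loads(db[key]) if key in db else []

add_comment("stackoverflow", 42, "First comment")
add_comment("stackoverflow", 42, "Second comment")
print(get_comments("stackoverflow", 42))
```

    Since the whole value is rewritten on each append, this favours a workload where comments for a post arrive together during import - which matters when picking the actual store.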

    I think there are some useful ideas here!

  3. Samuel Lai reporter

    Because of the cross-platform requirement, most of the key/value store options are out (e.g. Berkeley DB). Semidbm does seem to fit the requirement, though I'm not sure how much it will help.

    Research -

    Also looked at writing custom bits to fit into the DataImportHandler setup. It looks very focused on a one-to-one relationship between data-source items and Solr documents. It might be possible to hook into the DataSource or EntityProcessor interfaces, but it's probably best to try the ideas in the post above before looking at this.

    Maybe look at portable Java key-value store options too, e.g. https://github.com/dain/leveldb ?

  4. Alexei Baboulevitch

    What about the Python-native CodernityDB? I haven't used it myself but it seems popular, works across platforms, and doesn't have any third-party dependencies.

  5. Samuel Lai reporter

    That looks like a good option too. The use case is pretty simple, so it shouldn't be hard to try them both out.
