Instead of a flat cache that stores only the latest result of each scrape, our database stores the new or changed information found during each scrape. This means we can walk the database from any start date to any end date and reconstruct a snapshot of what the data looked like at any point in that range.
This lets users graph recruitment velocity or track a user's membership changes without having recorded those statistics themselves as they occurred. It also exposes data that is no longer available in the live source, such as deleted accounts.
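The delta-and-replay idea above can be sketched roughly as follows. This is a minimal illustration, not the actual schema: the `Delta` record and its fields are hypothetical, and a real store would index deltas by entity and date rather than scanning a list.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Delta:
    """One scrape's worth of changes for one entity (hypothetical layout)."""
    scraped_on: date
    entity_id: str
    changes: dict  # field name -> new value observed during this scrape

def snapshot(deltas, entity_id, as_of):
    """Rebuild an entity's state as of a date by replaying its deltas in order."""
    state = {}
    for d in sorted(deltas, key=lambda d: d.scraped_on):
        if d.entity_id == entity_id and d.scraped_on <= as_of:
            state.update(d.changes)  # later deltas overwrite earlier values
    return state

deltas = [
    Delta(date(2023, 1, 1), "org1", {"name": "Org One", "members": 10}),
    Delta(date(2023, 2, 1), "org1", {"members": 25}),
]
snapshot(deltas, "org1", date(2023, 1, 15))  # {'name': 'Org One', 'members': 10}
snapshot(deltas, "org1", date(2023, 3, 1))   # {'name': 'Org One', 'members': 25}
```

Because only changes are stored, an entity deleted from the live source simply stops producing deltas; its last known state remains reconstructible.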
The benefit to the regular user is that this database acts as a cache that returns data much faster than live queries. If a user requests a value that isn't cached, the query falls through to the live data and the result is written back to the database for all future requests.
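The fall-through behavior is a standard read-through cache. A minimal sketch, with a hypothetical `fetch_live` callable standing in for the real live query:

```python
def get(key, cache, fetch_live):
    """Read-through cache: serve from the cache, else fall through to the live source."""
    if key in cache:
        return cache[key]          # cache hit: no live traffic
    value = fetch_live(key)        # cache miss: query the live source
    cache[key] = value             # write back so future requests are cached
    return value

cache = {}
live_calls = []

def fetch_live(key):
    live_calls.append(key)         # record live traffic for illustration
    return f"live:{key}"

get("org1", cache, fetch_live)     # miss: hits the live source
get("org1", cache, fetch_live)     # hit: served from the cache, no live call
```

After the two calls above, the live source has been contacted only once; every later request for `"org1"` is served from the cache.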
For massive queries like "All Organizations", we highly recommend using the cache, as it is orders of magnitude faster than walking through the live equivalent a few pages at a time. The database can return a list of all the orgs in under a second, while the live queries would take at least several dozen minutes.
Automation & Piggybacking
To keep the cache reliably up to date, we use two mechanisms: automation and piggybacking.
The first is automation via our Scheduler class. This is set up on a frequent cron job to refresh everything in our database. The Scheduler will start by searching for recently created orgs that aren't in our database yet. It then walks through each org and user account, sorted by the elapsed time since the last scrape. Using this method, our database has a reasonably short refresh cycle.
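The Scheduler's ordering can be sketched as below. This is an illustration of the stalest-first idea, not the actual `Scheduler` implementation; the entity layout is hypothetical.

```python
from datetime import datetime

def refresh_order(entities):
    """Order entities stalest-first: longest elapsed time since last scrape wins.

    New entities discovered by the recent-orgs search would be enqueued ahead
    of this list so they enter the database immediately.
    """
    return [e["id"] for e in sorted(entities, key=lambda e: e["last_scraped"])]

entities = [
    {"id": "orgA", "last_scraped": datetime(2024, 1, 9)},
    {"id": "orgB", "last_scraped": datetime(2024, 1, 1)},
]
refresh_order(entities)  # ['orgB', 'orgA'] -- orgB is staler, so it refreshes first
```

Run on a frequent cron job, this keeps the worst-case staleness of any entity bounded by the length of one full pass.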
The second solution is piggybacking. Each time a user queries live data for items tracked by our database, the results are cached as a side effect of that query. Items that are requested more often are therefore updated more often. Some queries also launch additional scrapers for items found in the results (such as orgs and users) to make sure new items are added to the database as quickly as possible.
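A rough sketch of the piggybacking flow, with hypothetical names throughout (`run_live` stands in for the real live query, and `scrape_queue` for whatever hands items to the background scrapers); deduplication against already-queued items is omitted:

```python
def live_query(query, cache, scrape_queue, run_live):
    """Run a live query, cache the results, and enqueue discovered orgs and
    users for a follow-up scrape so new items enter the database quickly."""
    results = run_live(query)
    cache[query] = results                    # piggyback: cache as a side effect
    for item in results:
        if item.get("type") in ("org", "user"):
            scrape_queue.append(item["id"])   # hand off to a background scraper
    return results

cache, queue = {}, []

def run_live(query):
    # Stand-in for a real live search returning tracked item types.
    return [{"type": "org", "id": "newOrg"}, {"type": "user", "id": "u1"}]

live_query("search", cache, queue, run_live)
# queue now holds ['newOrg', 'u1'] for the background scrapers
```

The user's query costs nothing extra; the database gets fresher wherever users are actually looking.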
A copy of the database can be found under the Downloads section of the repository. It is updated daily at midnight GMT.
We highly recommend using this copy of the database if you need large volumes of data at once. Doing so helps reduce excessive traffic to our site and to any sites our API hits. Users looking to query all orgs or all accounts should consider using this download instead.