Over the past few weeks, the Bitbucket Engineering team has been sharing our ongoing efforts and wins in our journey to achieving world-class reliability. In our previous post, Development Manager Dan Tao took you through our PIR process. In this next post, we’ll talk about database scalability.
Bitbucket’s usage is growing
Two services at the core of Bitbucket are the ones serving the Bitbucket website and our public REST API. These services are built on the Django web framework, serving millions of requests every hour. Many of these requests perform multiple database queries, some fairly intensive, putting significant load on our database. As Bitbucket Cloud's userbase continues to grow, this number will as well. In the past year, traffic has increased by roughly 40%. That’s why our database architecture is one of the areas we are investing in improving to help Bitbucket scale.
A stressed primary database
Bitbucket uses PostgreSQL with one primary read-write database and N read-only replicas. This is a good foundation to build on.
However, traffic from our core services until recently was routed almost* entirely to the primary database. This routing is done at the service level, reducing the benefit of replicas to failover scenarios. The primary database was protected from this load by connection pooling, but performance would degrade during peak hours as clients waited for a connection from the pool.
*Developers have the ability to force the use of replicas in their code paths but this is rarely used and not a scalable solution.
Moving 80% of the load to replicas
To improve scalability and performance, we debated various approaches to take better advantage of our replicas, and ultimately agreed on a solution. Here's a look at how this works.
As a user is using Bitbucket she is making requests to the website and the API. From a database point of view, each of these requests is a sequence of read/write operations. Our goal was to route all reads in a request to a replica until a write happens. A write and anything afterward in that request will go to the primary. This is to avoid a race condition where a user might fail to see data they're just written.
We speculated that doing this for all requests would take a majority of the load off our primary database. Analyzing our production traffic showed us that 80% of the traffic from the primary database could be moved to replicas.
Choosing the right replica
Replication from the primary to replicas is quite fast but it's not instantaneous. This mean that if we started using replicas we could run into issues with stale data as a result of replication lag. The race condition we mentioned earlier within a single request could happen during multiple requests. Imagine creating a pull-request, then going to the pull request list page and not seeing what you just created, because we chose a replica that didn't have that data yet. We knew we must be careful to choose a replica that is up to date, at least from the point of view of the user making the request.
PostgreSQL has a write-ahead log (WAL) and entries in this log have a log sequence number (LSN). When a user performs a write operation, it's recorded in the WAL and we can save the LSN. When the same user makes a subsequent request, we find a replica that is as up to date as the user's saved LSN and use that for read operations.
This approach is done at the routing level, so we send most of the read operations, across all requests, to replicas instead of the primary. No extra work for developers?existing and new code will automatically benefit from this.
Django's middleware allows us to save LSNs at the end of a user request and look them up at the beginning of a user request. We chose Redis as our LSN store. We store the LSNs keyed by the user's Atlassian ID so routing is on a per-user basis.
We originally shared an existing Redis cluster in our datacenter with other services. The configuration of that cluster and our load/needs soon pushed us out to Elasticache, an Amazon hosted Redis cluster. Even though LSNs don't use much storage, we update them frequently; sharding helps us spread this write load. We use a proxy called envoy to pool connections from our services to this cluster.
Along with LSNs we also store a blacklist of replicas. This allows us to catch any operational errors with a particular replica and blacklist it for a configurable time, routing around it to others.
The extra logic we've introduced to route to a replica isn't free. We have to query Elasticache and replicas, which adds latency. Our tolerable margin for this was roughly 10ms. If we could shift the majority of the load from the primary database while avoiding race conditions we considered this worth the additional small latency.
The other tradeoff was the introduction of dependencies like envoy and Elasticache. These add additional operational overhead but we felt this was worthwhile to help Bitbucket scale.
We did indeed shift the majority of traffic to replicas while keeping added latency close to 10ms. Below is a snapshot of requests, grouped by the database that was used for reads, before our new router was introduced. As you can see most of the requests (the small squares) were using the primary database.
This is a snapshot of requests once we rolled out the new router.
You can see the load being spread more to the replicas.
Looking at the primary database on a typically busy Wednesday we can see a reduction in rows fetched of close to 50%.
We’re looking forward to seeing how we can apply our learnings from our core website and API services to other Bitbucket services and continuing to scale our database to better support the next 10 million and beyond.