What we’re doing to achieve world-class reliability on Bitbucket

I recently joined the Bitbucket team to head up Engineering, after 25+ years of experience in software engineering, systems development, and operations. One of my first roles as a manager was to run a build and release team – I love the craft of developing and shipping software!

Bitbucket is a strong product sitting in a robust line of Atlassian tools. The team has been making steady investments in features that will pave the way for greater collaboration and automation in the coming months and years. I’m excited to be joining at a time of growth and change, but first I want to address that there’s some work to be done.

I want to personally apologize for our recent product reliability. In the last few months, Bitbucket Cloud has experienced a number of incidents that have disrupted your workflow, including outages to Git and Mercurial over HTTPS, and Pipelines.

I want to start by acknowledging that we know this is unacceptable for you. When Bitbucket’s systems are down, your teams can’t access their code, your builds can’t run, and you can’t deploy your software. It’s unacceptable to us as well – at Bitbucket, we strive for 100% uptime. Don’t #@!% the customer is a value that we deeply believe in on the Bitbucket team, and when our customers suffer, we feel it.

But words are cheap. The reason I’m writing is to tell you what we’re doing to improve reliability on Bitbucket. The team has already been working on several projects over the past few months to enhance our scalability and security, and our recent incidents have highlighted additional areas where we need to focus. Here are the steps we’re taking to invest in these areas and an overview of some of the progress we’ve made so far.

Architecture & scale

Bitbucket is one of Atlassian’s most popular cloud products, with millions of users and hundreds of millions of requests every day. While I’m proud of all the work the team has done to scale this far, we are seeing the limits of some core systems that need to be addressed at an architectural level.

Here is an example of an investment we’ve made recently in this area. Bitbucket comprises a collection of microservices along with some larger core services. One of our largest and most complex core services utilizes a master database along with a pool of read-only replicas, which have been underutilized, causing undue strain on the master database server. To remediate this, the team has been working on a project to intelligently route a substantially increased volume of database queries to the read-only pool, without showing stale data to users.
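To make the idea concrete, here is a minimal sketch of what read/write routing can look like in a Django-style Python service. This is an illustration only: the router class, database aliases, and pinning logic below are hypothetical examples of the general pattern, not our production code. The key idea is to spread reads across the replica pool while pinning a request to the master after it performs a write, so users always read their own writes.

```python
# Minimal sketch of read/write database routing (Django-style router).
# Aliases ("default", "replica_1", ...) and the pinning logic are
# illustrative assumptions, not Bitbucket's actual implementation.
import random
import threading

# Thread-local flag: once a request writes, keep it on the master so it
# reads its own writes instead of hitting a possibly lagging replica.
_request_state = threading.local()


class MasterReplicaRouter:
    """Send writes to the master and spread reads across the replica pool."""

    replica_aliases = ["replica_1", "replica_2", "replica_3"]

    def db_for_read(self, model, **hints):
        if getattr(_request_state, "pinned_to_master", False):
            return "default"                         # read your own writes
        return random.choice(self.replica_aliases)   # offload reads to replicas

    def db_for_write(self, model, **hints):
        _request_state.pinned_to_master = True
        return "default"                             # all writes go to the master

    def allow_relation(self, obj1, obj2, **hints):
        return True  # master and replicas hold the same data
```

In a Django setup a router like this would be registered via the DATABASE_ROUTERS setting, and the pinning flag would typically be reset at the start of each request (for example, in middleware) so that read-heavy requests keep going to the replicas.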

External dependencies

Atlassian is investing heavily in a shared cloud platform, which will provide long-term benefits to all of the company’s cloud products, including Bitbucket. As we integrate with this platform, as with any infrastructural change, there is a risk of short-term instability. We need to make our services more resilient to temporary outages in either the Atlassian platform or external providers.

A recent win we had in this area relates to our adoption of Atlassian-wide user privacy settings, which required close collaboration with multiple platform teams. Our scale presented enormous challenges to internal platform services, some of which had been built from scratch to support these new privacy settings. Our teams addressed these challenges by implementing a sidecar that provides a layer of caching and optimizes network access, and more recently by deploying automated circuit breakers that let us transparently adapt to upstream issues without impacting customers.
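For readers unfamiliar with the circuit-breaker pattern, here is a small, generic sketch in Python. It is not our sidecar code; the class, thresholds, and fallback behavior are hypothetical, but they show the basic mechanics: stop calling an upstream dependency after repeated failures, serve a fallback while the circuit is open, and probe again after a cool-down period.

```python
# Generic circuit-breaker sketch; thresholds and behavior are illustrative.
import time


class CircuitBreaker:
    """Stop calling a failing upstream dependency and fall back gracefully."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds before a retry probe
        self.failure_count = 0
        self.opened_at = None

    def call(self, func, fallback, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback(*args, **kwargs)    # circuit open: skip the upstream call
            self.opened_at = None                   # half-open: allow one probe request
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            return fallback(*args, **kwargs)
        self.failure_count = 0                      # success closes the circuit again
        return result
```

In practice, func would be the call to the upstream platform service and fallback would return a cached value or a sensible default, so a customer-facing request degrades gracefully instead of failing outright.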

Observability

While we have no shortage of metrics and dashboards, recent incidents have revealed the uncomfortable truth that we can still sometimes be taken by surprise. We are investing more in closing the gaps in our monitoring and in establishing standardized instrumentation across all of our services and components, so that we aren’t relying on bespoke metrics and alarms.
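As an illustration of what standardized instrumentation can look like, here is a small sketch using the Python prometheus_client library. The metric names, labels, and decorator are hypothetical examples, not our internal tooling; the point is that every service emits the same latency and error metrics with the same labels, so dashboards and alarms can be shared rather than rebuilt per team.

```python
# Sketch of shared instrumentation; metric names and labels are illustrative.
import time
from functools import wraps

from prometheus_client import Counter, Histogram

# One shared set of metric names and labels for every service and operation.
REQUEST_LATENCY = Histogram(
    "service_request_duration_seconds", "Request latency", ["service", "operation"]
)
REQUEST_ERRORS = Counter(
    "service_request_errors_total", "Request errors", ["service", "operation"]
)


def instrumented(service, operation):
    """Decorator that gives any handler the same latency and error metrics."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            except Exception:
                REQUEST_ERRORS.labels(service, operation).inc()
                raise
            finally:
                REQUEST_LATENCY.labels(service, operation).observe(
                    time.monotonic() - start
                )
        return wrapper
    return decorator


@instrumented("pipelines", "start_build")
def start_build(build_id):
    return {"build_id": build_id, "status": "queued"}  # placeholder handler body
```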

Incident response

The Bitbucket team has automated alerts and 24×7 on-call rotations for every critical service we own, in addition to an overarching rotation staffed by our engineering leaders. But there is still room for improvement in our internal documentation and in the automation of our incident response process, so that our on-call engineers can respond quickly and efficiently and resolve incidents within minutes when they happen. Our observability improvements will help in this area as well.

Security

This will continue to be a major investment area for us, as we have teams dedicated to platform and security work. These teams have been moving mountains to upgrade many of our core libraries and frameworks to the newest supported versions. In the past month, we upgraded to an LTS version of the web framework on which multiple Bitbucket services are built, with zero downtime and zero production issues. (I’ve encouraged the team responsible for this impressive feat to blog about how they pulled it off, so expect to see a more detailed version of that story in the future!)

What’s Next?

In the preceding paragraphs I’ve provided several examples of work that is already in flight or recently shipped.

Starting this week, the Bitbucket engineering team is participating in a reliability exercise, a short-term sprint that makes room for us to pause our day-to-day work and focus all of our efforts on improving reliability. This will provide the foundation for more improvements to come in the following weeks and months.

We’ll continue to take you along for the journey; this post is the first in what will be a series of blog posts. Look out for more updates from other leaders on the engineering team sharing our progress in addressing each of the key areas above, and more.

I will close by saying that Bitbucket engineers are a passionate and incredibly talented team. We know that our recent instability has taken a toll on your trust, and we are going to do everything we can to rebuild that trust stronger than ever.