Last October, I published a blog post describing the efforts we've committed to on the Bitbucket Cloud engineering team to achieve world-class reliability. A lot has happened in the past year (understatement of the year)! What the team has accomplished is tremendous, but we've also learned a thing or two about where we can improve further. In this post, I'd like to address our recent reliability issues, share the lessons we've learned from them, and provide an update on some of the performance work we've done over the last 12 months.
October 2020 incidents
Living up to one of Atlassian's values, "Open company, no bullsh*t", we want to lift the curtain and provide an overview of the reliability issues we saw in October. We strive for 99.9% availability across the board, but we have not lived up to this goal consistently. The team and I know all too well that even if our services are available the vast majority of the time, thirty minutes of degraded performance can be incredibly disruptive, especially if it occurs during your team's core working hours.
Over the past few weeks, we've had several incidents that may have impacted your teams. These incidents have highlighted that there is still plenty of room for us to grow. One incident in particular lasted for 11 hours, and I want to share with you a little about what happened.
On the morning of October 6, automated alerts started notifying our engineering teams that something was wrong, including increased memory usage on some hosts, elevated error rates, and extended end-to-end delivery times for outbound webhooks. The incident response team quickly identified that many of our queues responsible for managing background tasks were backing up, with worker processes failing to process tasks quickly enough.
We were able to mitigate customer impact by making configuration changes to ease the pressure on our queuing infrastructure and restarting many of the worker processes that were failing to keep up with load, noting that some had completely run out of available memory. These changes led to some short-term improvement, but before long the same issues resurfaced: hosts were running out of memory, worker processes were dying, and queues were growing. As a result, many of our background processes, such as webhooks and merging pull requests, were failing or timing out.
Under normal circumstances, the team would have quickly rolled back to the prior release as a precautionary measure, to rule out a code change as the culprit for the incident. We did not pursue that option as quickly as we should have in this case, for a few reasons.
- From our logging and metrics, we could see what looked like the beginning signs of this issue dating back almost a full week. It simply hadn't crossed the threshold for alerting our teams until the morning of the incident. Rolling back a single release is one thing, but rolling back an entire week's worth of changes carries very high risk – often more risk than we can tolerate, even in the face of a major incident, since the last thing we want to do is make things worse.
- The timing of the incident itself did not clearly line up with the latest code release. While this is no guarantee that the code release wasn't responsible, it is typically an indicator that the issues we're seeing are not the result of a code change.
- Nonetheless, we did have our engineers review the code changes in the last several releases just to look for anything that might explain the issues we were seeing. They found nothing suspicious.
For a period of time, much longer than any of us would have liked, this put us in a state where we were still trying to identify the root cause of the problem, and we didn't have a reliable mitigation for the issue other than restarting worker processes, which would only buy us a limited amount of breathing room before the problem came back. During this time, the team worked on parallel streams: automating the process of safely restarting workers every 1-2 hours while continuing to investigate the root cause.
A breakthrough came when we discovered that the cause of the increased memory pressure on the background worker servers was a proliferation of orphaned worker processes. These are processes that had become detached from their parent process and continued consuming memory but stopped processing tasks. A second breakthrough was the realization that the parent processes for these tasks were dying as a result of the SIGABRT signal. While the team looked into the cause of this signal, this gave us a specific set of conditions that we could detect and respond to in a more surgical way: no longer blindly restarting worker processes, but instead reacting to the SIGABRT by terminating the orphaned processes that were left over.
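To make the idea of detecting orphaned workers concrete, here is a minimal Python sketch. It is not the automation we actually run: the `ProcInfo` record, the process names, and both function names are hypothetical, and it assumes a Linux-style process table in which an orphaned process is re-parented to PID 1 (on systems with a subreaper the new parent may be a different process).

```python
import os
import signal
from dataclasses import dataclass
from typing import Callable, Iterable, List

INIT_PID = 1  # assumption: orphans are re-parented to PID 1


@dataclass(frozen=True)
class ProcInfo:
    """One row of a process-table snapshot (hypothetical record type)."""
    pid: int
    ppid: int
    name: str


def find_orphaned_workers(procs: Iterable[ProcInfo], child_name: str) -> List[ProcInfo]:
    """Return child workers whose parent is gone.

    A child counts as orphaned if it has been re-parented to init, or if its
    recorded parent PID no longer appears in the snapshot.
    """
    procs = list(procs)
    live_pids = {p.pid for p in procs}
    return [
        p for p in procs
        if p.name == child_name
        and (p.ppid == INIT_PID or p.ppid not in live_pids)
    ]


def terminate_orphans(
    orphans: Iterable[ProcInfo],
    kill: Callable[[int, int], None] = os.kill,
) -> None:
    """Ask each orphan to exit with SIGTERM; ignore ones that already exited."""
    for proc in orphans:
        try:
            kill(proc.pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # the process exited between the snapshot and the signal
```

Given a snapshot containing a healthy child (parent still alive) and a child whose parent PID is 1, `find_orphaned_workers` returns only the second one. The `kill` parameter is injectable so the cleanup step can be exercised in tests without signalling real processes.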
Putting this automation in place ensured that our customers were protected from the effects of this bug, whatever it might be, until we found it. When we did find it, what we discovered turned out to be a painfully ironic plot twist.
One of the most powerful tools we've had to maintain reliability without sacrificing our teams’ velocity has been the use of feature flags to roll out changes safely and incrementally. The irony of this incident is that it was actually the use of a feature flag that caused the SIGABRT that in turn caused so many ripple effects throughout our systems. Specifically, the problematic code checked a feature flag in a core logger. Since feature flagging inevitably depends on periodic calls to an external system (in our case, a Memcached server), every feature flag check carries a chance of logging an error due to transient issues with that system. Checking a feature flag within a core logging module is therefore a type of layer violation: the logger depends on the flag check, and the flag check depends on the logger. That cycle can cause unwanted effects such as infinite recursion or overflow errors.
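The cycle is easier to see in code. The following Python sketch is a deliberately tiny model, not the actual Bitbucket logger or flag client: `FlakyFlags`, `NaiveLogger`, `GuardedLogger`, and the flag name are all hypothetical. It assumes the worst case, where every flag lookup fails and reports the failure through the same logger.

```python
class FlakyFlags:
    """Hypothetical flag client whose lookups fail and log the failure."""

    def __init__(self):
        self.logger = None  # set after the logger is constructed

    def is_enabled(self, name):
        # A transient backend error is reported through the core logger...
        self.logger.error(f"flag store unreachable while checking {name!r}")
        return False


class NaiveLogger:
    """The layer violation: the logger consults a feature flag on every call."""

    def __init__(self, flags):
        self.flags = flags
        self.lines = []

    def error(self, msg):
        # ...and the logger calls back into the flag client, closing the loop.
        prefix = "[v2] " if self.flags.is_enabled("new-log-format") else ""
        self.lines.append(prefix + msg)


class GuardedLogger(NaiveLogger):
    """Same logger with a re-entrancy guard (single-threaded sketch)."""

    def __init__(self, flags):
        super().__init__(flags)
        self._checking = False

    def error(self, msg):
        if self._checking:
            # Re-entered from inside a flag check: log plainly, do not recurse.
            self.lines.append(msg)
            return
        self._checking = True
        try:
            super().error(msg)
        finally:
            self._checking = False
```

Calling `error` on a `NaiveLogger` wired to `FlakyFlags` never returns normally; in Python it ends in a `RecursionError`. The guarded variant records both the flag-store error and the original message. A guard like this only contains the damage; the cleaner fix is to keep flag checks out of the logging layer entirely, and a multi-threaded logger would need a thread-local guard rather than a plain attribute.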
Once the team was finally able to identify this change, we quickly reverted it and deployed the fix to our production servers. In a short time the team was able to validate that the problem had been resolved and the incident was closed internally.
We stand by feature flags as a best practice for development. They generally help teams avoid buggy product releases, reduce risk, and take a more experimentation-oriented approach to software development. It is my hope that by sharing lessons learned (in this case, the importance of accepting that checking feature flags does not always reduce risk, but sometimes increases it) we can help our customers and other software teams avoid making the same mistakes we did.
Some of the outcomes from our post-incident review for this and other recent incidents include:
- Implemented automation to detect and clean up orphaned workers gracefully, making our systems more efficient and more resilient to traffic spikes, paired with alerting so that our teams can respond and diagnose the issue without customers noticing.
- Increased automated testing to catch layer violations such as the one that triggered this incident before they reach production.
- Consolidation of our feature flagging mechanisms (we have several), which will allow us to realize a better return on investment from our reliability-focused efforts on better controls and better tooling.
- An organization-wide operational weekly rollup ritual where tech leads from every sub-team meet to share metrics and operational insights from the week, to facilitate knowledge sharing and accelerate our maturation process as an engineering org.
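As one example of what an automated check for layer violations could look like, here is a small Python sketch that scans a module's source for imports of packages it is not allowed to depend on. This is an illustration under assumptions, not the tests we actually added: the function name and the `featureflags` package name are hypothetical, and a static import scan will not catch dependencies introduced dynamically at runtime.

```python
import ast
from typing import Iterable, List


def forbidden_imports(source: str, forbidden: Iterable[str]) -> List[str]:
    """Return the imported module names whose top-level package is forbidden."""
    banned = set(forbidden)
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for "from . import x"; treat it as empty.
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if name.split(".")[0] in banned:
                found.append(name)
    return sorted(found)
```

A test suite could read the source of each low-level module (logging, for instance) and assert that `forbidden_imports(source, {"featureflags"})` returns an empty list, failing the build when a lower layer starts importing a higher one.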
The last 12 months – Performance and Reliability Improvements
The good news is that we have invested in performance as well as reliability improvements over the past 12 months, and we will continue to make improvements moving forward. Since October 2019, here are some of the improvements we have made:
- Slimming down the interaction between the pull request front-end code and Bitbucket's APIs to avoid serialization of unneeded data reduced mean response times by 3 seconds at peak as shown in the graph below. This change is a perfect example of a win-win for both reliability and performance: less load on Bitbucket's servers and faster response times for customers.
- In Pipelines, we have improved the automatic failover that schedules new pipelines to a secondary cluster in a separate region when our primary build cluster is under stress. This has also been used to good effect when we've had issues communicating with one of the clusters due to drops in network reliability.
- Refactoring the low-level library call to leverage libgit2 rather than the git binary had a major impact on rendering pull request diffs, especially large diffs, in some cases reducing rendering times by up to 80%.
- Modifying our front-end code base to avoid redirects when fetching pull request diffs and diffstats resulted in a speed boost making every diff and diffstat request 400 ms faster on average – and reducing backend requests at the same time!
- More aggressively packing Git refs and cleaning up leftover data on the file system resulted in a 75% to 85% speed increase for Git operations.
- In Pipelines, we have added a lot of resilience handling in the face of small intermittent errors. As Pipelines depends upon a lot of other internal services, the impact of small errors is multiplied across our executing pipelines, and this resilience handling keeps the impact as low as possible. The system now retries many operations, including uploading and downloading caches and artifacts, cloning, sending logs, and reporting status.
- Our team discovered a particular function call that was invoked many times in a loop for requests routed through our Connect API proxy, resulting in a high number of calls adding up to a significant percentage of the latency for these requests. By updating the code path to introduce memoization of results, the team was able to make a 5x speed improvement to these endpoints.
- We completed a long-term project to extract one of the largest tables in one of our largest databases, reducing the size of the primary and all replicas by 28%. This change singlehandedly reduced not only our overall DB footprint but network throughput as well, reducing congestion for the rest of our infrastructure.
- We sped up Bitbucket Cloud with AWS Global Accelerator for further performance improvements, which we described in a separate post.
- We recently started rolling out server side rendering (SSR) of our front-end single page application. This allows us to fully render a modern SPA on the server, reducing latency and time to first interaction and dramatically improving Apdex. Since we rolled this out, the average Apdex for all Bitbucket pages improved by 6 points.
While the team has done far more over the past year than I have shared here (this post could easily have been 10x longer!), hopefully this has been an interesting glimpse into the reliability and performance work we've been doing and will continue to do for Bitbucket Cloud. I look forward to our engineers sharing more progress and exciting new developments as we work to make Bitbucket faster, safer, and more reliable every day.
I wanted to share the details of this recent incident for two reasons. First, I want Bitbucket's customers to trust us, and I know that in order to earn your trust we need to be transparent about incidents like these when they happen. Second, these details highlight the fact that no matter how far we've come, there will always be lessons to learn and further improvements to make. Rest assured, we will remain vigilant.
Security and reliability are top priorities for Atlassian. We will continue making investments in these areas until incidents like the one I described above are a thing of the past, and bring you along on the journey as we do it.