Combating Flaky Builds

Strategies for controlling randomly failing tests in a growing engineering organization

It is a scene familiar to many of us: an excited programmer picks up a JIRA ticket and works day and night to deliver their best work. They put in their heart and soul and craft a beautiful feature that is sure to delight users. Hours before the end of the sprint, our hero submits a pull request, only to be disappointed several minutes later by an embarrassing red ❌.

A bold error message stands defiantly on the page: BUILD FAILED.


But, that is somebody else’s test! Something broke in a package far far away! This feature is totally unrelated! Surely, our work is solid, right? Should we just ignore the failure and merge anyway, since it is probably unrelated?

Hey, I don’t know what this is, and it seems unrelated to what I am working on. Can I ignore it?
— errbody

A Battle That Cannot Be Won

At smaller scales, it may be possible to achieve builds that pass deterministically 100% of the time. However, most programming teams will reach a point where the rate at which flaky tests and bugs are introduced matches or exceeds the rate at which they are fixed. As the business and organization grows, context gets split, legacy code starts to rot, business logic becomes richer, and execution gets more asynchronous.

This is one of the many forms of technical debt. Arguably, this is a necessary outcome of rapid development: perfect software never gets shipped. The mantra is, move fast and try not to break anything.

At Google, somewhere between 1.5% and 2% of all test runs are flaky, and 16% of tests show some level of flakiness. Moreover, they learnt (to no one’s surprise) that larger tests tend to be flakier.

Much has been written on specific tactics that can reduce the P(flake) of individual tests, e.g. Martin Fowler’s Eradicating Non-Determinism in Tests. Yet, flaky builds continue to plague programming teams everywhere.
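A quick back-of-the-envelope calculation shows why even tiny per-test flake rates hurt at the build level. This sketch assumes tests flake independently, and the numbers are illustrative, not drawn from any cited source:

```python
# Sketch: how small per-test flake rates compound across a build.
# Assumes each test flakes independently with the same probability.
def p_build_flakes(n_tests, p_test_flake):
    """Probability that at least one test flakes in a single build."""
    return 1 - (1 - p_test_flake) ** n_tests

# Even a 0.01% per-test flake rate breaks roughly 1 in 10 builds
# once the suite reaches 1,000 tests:
# p_build_flakes(1_000, 0.0001) ≈ 0.095
```

In other words, a suite can be 99.99% reliable per test and still fail often enough to train everyone to distrust red builds.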

Tragedy of the Commons, Broken Windows

Red builds tend to get redder. Flaky builds tend to accumulate more flakes, and really quickly. It becomes difficult to tell if a test failure is a signal of some real underlying problem. By the time someone puts their foot down and declares a moratorium, everyone is guilty in some way.

In practice, P(flake) tends to hover around the same level. Any higher, and it causes enough agony to warrant additional attention; eventually, some poor soul buckles, and one can only hope it is not always the same volunteer. Any lower, and the flakes get ignored and forgotten, and everyone on the team hits the Rebuild button. This P(flake) is the team’s implicit level of tolerance, and is passed on as tribal knowledge from hire to new hire.

We can learn to accept this and manage it, while reducing programmer angst. Solving the problem will require changes to processes and engineering practices. We need systems, not goals; tools, not ideals.

It is important to identify what we wish to avoid:

  • desensitization, which causes programmers to ignore build failures and, eventually, actual problems
  • automatic retries (e.g. the flaky plugin), which mask underlying bugs in tests

The latter is particularly insidious. Once introduced, the incentives and disincentives are completely out of whack. What stops us from increasing the number of retries from 3 to 5, and from 5 to 10? Where does it end?

“People are already doing this” is never a good reason to automate bad behavior.

A Proposal

First, we need to own our technical debt, including the debt left by our predecessors. Acknowledge it on the metaphorical balance sheet, and stop tossing it away as someone else’s problem. This has to be part of the job. It has to be recognized and rewarded as part of performance review, promotions, and bonuses.

Managers need to assign responsibilities and ownership. Tech Leads are obliged to identify commons and utils packages and assign them owners, lest they become dumping grounds for half-baked ideas and undocumented functions. Product Managers have to budget time between sprints to burn down bugs and flakes, before they grind software development to a halt.

This is the first of many good organizational habits that we must have. Without it, your codebase will be filled with abandonware written by hackers who quit within 10 months of joining the startup, and employees will start warning their friends to stay away.

It is easy to know if a test failure is a flake — just ask around. Chances are, programmers have run into the same test failures over and over again, and have learnt to memorize their names.

We can use automation to do this at scale. Most of us have some form of CI that verifies the build passes on trunk. Use any spare capacity to re-run the build on trunk multiple times, hourly or nightly. Record the data from the test reports (e.g. xunit) and group them by test case to calculate P(flake) for each.
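The grouping step can be sketched in a few lines. This assumes a directory of JUnit/xunit-style XML reports collected from repeated trunk builds; the file layout and the `classname`/`name` attributes are assumptions you would adapt to your CI’s actual output:

```python
# Sketch: estimate P(flake) per test case from a directory of xunit XML
# reports gathered by re-running the same trunk commit many times.
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path

def flake_rates(report_dir):
    runs = defaultdict(int)      # test case -> number of runs observed
    failures = defaultdict(int)  # test case -> number of failures/errors
    for report in Path(report_dir).glob("*.xml"):
        for case in ET.parse(report).getroot().iter("testcase"):
            name = f'{case.get("classname")}.{case.get("name")}'
            runs[name] += 1
            if case.find("failure") is not None or case.find("error") is not None:
                failures[name] += 1
    # On a known-green trunk commit, any failure is treated as a flake.
    return {name: failures[name] / runs[name] for name in runs}
```

Because every run is against the same trunk commit, any failure observed here is, by construction, non-deterministic.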

Set a maximum threshold on P(flake) and annotate the tests that exceed it. File tickets and assign them. Exclude these tests from the main build until some reasonable deadline. All of this can be automated.
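The quarantine step is a simple filter over the measured rates. The threshold value below is an illustrative assumption, and wiring the output to ticket filing and skip annotations is left to your issue tracker and test framework:

```python
# Sketch: pick which tests to quarantine from the main build, based on
# measured flake rates. The threshold is an assumed team tolerance.
FLAKE_THRESHOLD = 0.01

def quarantine(rates, threshold=FLAKE_THRESHOLD):
    """Return test names to skip in the main build, worst offenders first."""
    flaky = [(name, p) for name, p in rates.items() if p > threshold]
    return [name for name, _ in sorted(flaky, key=lambda kv: -kv[1])]
```

Sorting worst-first means the generated tickets surface the highest-pain tests at the top of the backlog.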

I do not recommend publicly shaming the assignees. Instead, use the management hierarchy and reward them.

Carrots > Shame

Publish the data from above, and share them at team meetings to remind stakeholders of technical debt. Highlight the cost of lost productivity concretely. Use this to get buy-in from product/project/engineering managers, and technical leads.
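One concrete way to state that cost is in engineering hours per week. Every input in this sketch is an illustrative assumption; substitute your team’s own build counts and durations:

```python
# Back-of-the-envelope sketch of productivity lost to flaky builds.
# All inputs are illustrative assumptions, not measured data.
def weekly_cost_hours(builds_per_week, p_build_flake,
                      build_minutes, context_switch_minutes=15):
    """Hours/week spent re-running flaky builds and context-switching."""
    reruns = builds_per_week * p_build_flake
    return reruns * (build_minutes + context_switch_minutes) / 60

# e.g. 500 builds/week, a 10% build flake rate, 20-minute builds:
# 500 * 0.10 * (20 + 15) / 60 ≈ 29 hours of engineering time per week
```

A number like “29 hours a week” lands in a roadmap discussion far better than “the build is flaky again.”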

Slowly but surely, tune your culture to reduce tolerance. Empower programmers to volunteer time to work on flakes. Do the groundwork to make it worth their time. With enough buy-in and recognition, the last piece of this proposal is claiming a spot on our Gantt charts and roadmaps for fixing flaky tests.

Create a practice that allocates time to burn down flakes instead of risking the loss of test coverage when bad tests are annotated as flakes. Allocate more time to writing good tests, knowing that a well-written one can silently save us weeks of productivity in the long run.

These changes, together with strong ownership policies and the right incentives, will form the backbone of a resilient engineering organization. We can rely upon this to experiment, build, and innovate with certainty and without fear of breakage. Let’s get to work.

Engineering@Samsara (ex-Affirm)