Flaky Tests: The Silent Killer of Engineering Velocity


Amar Rawat

22 min read

TL;DR: A flaky test is not a broken test. It is a test that lies to your team on a schedule - passing 70% of the time, failing 30% for reasons nobody can reproduce. And a lie with a 70% truth rate is more dangerous than silence, because it produces false confidence while systematically training your engineers to ignore red CI builds. This article covers the four root causes of flakiness, the exact behavioral damage it does to engineering velocity, how to debug a test that refuses to fail consistently, and the only stabilization system that scales.

Here is a pattern every engineering team eventually lives through: A developer pushes a pull request. CI runs. Two tests fail - tests that have failed randomly for weeks. The developer reruns the pipeline. Everything goes green. They merge.

Nobody files a bug. Nobody investigates. Nobody even mentions it in the PR review. Because everyone already knows those tests are flaky. They fail sometimes. You rerun and they pass. That is just how it is.

Two weeks later, a real regression ships. A genuine bug - introduced in that PR or one of the twelve that followed it - slips past CI because the team has learned, through weeks of accumulated experience, that red CI builds are not signal. They are noise. You rerun until green and ship.

That is the damage flaky tests do. Not just the time wasted on reruns. Not just the engineering hours spent investigating test failures that are not real failures. The deeper damage is what they do to the team's relationship with their own test suite. They teach engineers that the safety net is optional. And once a team learns that, unlearning it takes months.

This is why more tests do not automatically mean more confidence. A test suite with 20% flaky tests does not have 80% of its value intact. It has significantly less, because the flaky tests have contaminated the trust in the whole system. You cannot look at a red build and know what it means.

What a Flaky Test Actually Is (and What It Is Not)

The technical definition is simple: a flaky test is one that produces different results - pass or fail - for the same code, without any change to the code itself.

But that definition understates the problem. Let's be more precise.

A flaky test is a test that:

  • Fails under a specific timing, environment state, or execution order that nobody explicitly set up
  • Cannot be reproduced on demand - you cannot make it fail by running it alone
  • Passes when you rerun it, which trains your team to treat reruns as a fix
  • Reports nothing meaningful about the correctness of the code it is supposed to test

That last point is the one teams miss. A flaky test is not testing anything reliably. Every time it passes, you learn nothing about your code. Every time it fails, you also learn nothing about your code. It is a pure noise generator dressed up as a quality gate.

The distinction matters because teams often treat flaky tests as merely annoying - a productivity tax they tolerate because fixing them is slower than rerunning them. The correct framing is harsher: a flaky test is a broken instrument in your quality measurement system. It gives you numbers that look like data but tell you nothing.

A test suite with flaky tests is not a partially reliable safety net. It is an unreliable one. There is no such thing as "mostly trustworthy" for a CI gate.

The Four Root Causes of Flakiness (and Why "It Just Does That" Is Never the Answer)

Flaky tests are not random. That is the first thing to internalize. They appear random because the conditions that trigger them are not part of your test's declared inputs - but the conditions exist. Every flaky test has a deterministic cause. Finding it is a debugging problem, not a luck problem.

There are four root causes that account for the overwhelming majority of flaky tests in mobile test suites.

1. Asynchronous Timing Dependencies

This is the most common cause in mobile testing, and it is almost entirely an E2E and UI test problem.

The test clicks a button, then immediately asserts that a screen element is visible. The element requires a network call to populate. On a fast CI machine the call completes in 80ms, the element renders, the assertion passes. On a slower runner under load, the call takes 300ms, the assertion fires before the element renders, the test fails.

Same code. Same test. Different infrastructure load on different runs.

The broken fix most teams apply: add a static sleep. Thread.sleep(2000). Now the test passes more often, but it is brittle - increase your backend latency in staging and it starts failing again - and it makes your entire test suite slower for no quality gain.

The correct fix: explicit waits tied to observable conditions, not time. Espresso for Android provides IdlingResource - a mechanism that tells the test runner to wait until a specific condition is met (an animation completes, a network call finishes) rather than waiting an arbitrary number of milliseconds. XCUITest on iOS provides waitForExistence(timeout:) for element conditions. Appium provides explicit wait strategies via WebDriverWait.

The rule is: never assert on time. Assert on state.
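
Here is a minimal sketch of that rule on Android, using Espresso's CountingIdlingResource. The EspressoIdling holder object and the view IDs are hypothetical stand-ins for your own app code; the point is that the assertion waits on observable state, never on a timer.

```kotlin
import androidx.test.espresso.Espresso.onView
import androidx.test.espresso.IdlingRegistry
import androidx.test.espresso.action.ViewActions.click
import androidx.test.espresso.assertion.ViewAssertions.matches
import androidx.test.espresso.idling.CountingIdlingResource
import androidx.test.espresso.matcher.ViewMatchers.isDisplayed
import androidx.test.espresso.matcher.ViewMatchers.withId
import org.junit.After
import org.junit.Before
import org.junit.Test

// Hypothetical production-side hook: increment before async work starts, decrement when it finishes.
object EspressoIdling {
    val resource = CountingIdlingResource("network")
    fun begin() = resource.increment()
    fun end() { if (!resource.isIdleNow) resource.decrement() }
}

class ResultListTest {

    @Before
    fun registerIdling() {
        // Espresso now blocks each action and assertion until the resource reports idle.
        IdlingRegistry.getInstance().register(EspressoIdling.resource)
    }

    @After
    fun unregisterIdling() {
        IdlingRegistry.getInstance().unregister(EspressoIdling.resource)
    }

    @Test
    fun listIsShownAfterLoad() {
        // R.id.load_button / R.id.result_list are hypothetical view IDs.
        onView(withId(R.id.load_button)).perform(click())
        // No Thread.sleep(): this assertion runs only once the pending network call has finished.
        onView(withId(R.id.result_list)).check(matches(isDisplayed()))
    }
}
```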

2. Test Order Dependencies and Shared State

The test passes when run alone. It fails when run as part of the full suite. Specifically, it fails after test #47 runs.

This is a state pollution problem. Test #47 modifies a shared resource - a database, a singleton, a static variable, a cached file - and does not clean it up. When your test runs after it, it inherits a state it was not written for and produces unexpected results.

On mobile, the most common shared state vectors are:

| Shared State Source | What Gets Polluted | How It Manifests |
| --- | --- | --- |
| Shared app database (Room, SQLite) | Rows from a previous test persist | Assertion finds wrong count or record |
| Singleton services (auth state, session) | Auth token from test A bleeds into test B | Test B skips login when it should not |
| SharedPreferences / UserDefaults | Settings set in test A affect test B | Unexpected app behaviour in the subsequent test |
| File system cache | Media and config files from a previous test | File-not-found or wrong-content failures |
| In-memory app state | Static flags, feature toggles | Features enabled/disabled incorrectly |

The correct fix is strict test isolation: each test must set up its own state before running and tear it down completely afterwards, regardless of whether it passed or failed. On Android, that means @Before / @After lifecycle hooks (or JUnit @Rules) that reset state, an in-memory test database instead of the production database, and explicit resets of any singletons the tests touch. On iOS, it means fresh app launches or XCUIApplication().terminate() between test classes when state cannot be reliably reset.

If you cannot make a test pass reliably in any order when run in isolation, the test has a design flaw, not just a timing problem.
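
A minimal sketch of what that isolation looks like on Android with a Room database. AppDatabase, UserDao, and User are hypothetical app classes; the structure is what matters: every test gets a fresh in-memory database, and teardown runs whether the test passed or failed.

```kotlin
import androidx.room.Room
import androidx.test.core.app.ApplicationProvider
import androidx.test.ext.junit.runners.AndroidJUnit4
import org.junit.After
import org.junit.Assert.assertEquals
import org.junit.Before
import org.junit.Test
import org.junit.runner.RunWith

@RunWith(AndroidJUnit4::class)
class UserDaoTest {

    // AppDatabase, UserDao, and User are hypothetical; swap in your own entities and DAOs.
    private lateinit var db: AppDatabase
    private lateinit var dao: UserDao

    @Before
    fun createDb() {
        // Fresh in-memory database per test: nothing from a previous test can leak in.
        db = Room.inMemoryDatabaseBuilder(
            ApplicationProvider.getApplicationContext(),
            AppDatabase::class.java
        ).allowMainThreadQueries().build()
        dao = db.userDao()
    }

    @After
    fun closeDb() {
        // Runs whether the test passed or failed, so no state survives into the next test.
        db.close()
    }

    @Test
    fun insertThenCount() {
        dao.insert(User(id = 1, name = "Ada"))
        assertEquals(1, dao.count())
    }
}
```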

3. Environment and Infrastructure Instability

The test is correctly written and well-isolated. It still fails intermittently on CI but never locally.

This is an infrastructure problem, not a test problem. Common causes:

  • CI runners under resource pressure - a CPU-throttled container means animations run slower, timeouts expire before UI transitions complete, and async operations take longer than your waits account for
  • Network calls to external services in tests - if your test hits a staging backend, backend latency variance directly becomes test flakiness. One call returns in 100ms, another in 2000ms, your hardcoded timeout is 500ms
  • Emulator instability - Android emulators in particular are notorious for intermittent rendering issues, ANR dialogs that appear mid-test, and slow cold starts that break assumptions about initial app state
  • Parallel test execution conflicts - when tests run in parallel and share device ports, file paths, or process names, races between concurrent tests produce failures that are nearly impossible to reproduce in serial execution

The correct approach: mock all external network calls in unit and integration tests. Use WireMock or OkHttp MockWebServer on Android, or OHHTTPStubs on iOS, to serve deterministic responses without touching a real network. Isolate E2E tests that genuinely need network calls onto their own execution tier with appropriate timeouts. Audit CI runner resource allocation - a test that passes on 4 CPU cores may flake consistently on 2.

4. Race Conditions in the Code Being Tested

This is the most important root cause because it is the one that reveals a real bug.

Sometimes a test is flaky because the code it is testing is flaky. There is a race condition, a thread safety issue, or a non-deterministic execution path in the production code - and the test is exposing it, just inconsistently.

This is the case where the flaky test is actually doing its job. The fact that it fails intermittently is not a test problem - it is a signal that the production code behaves non-deterministically under certain timing conditions that the test occasionally triggers.

The mistake teams make: they rerun the test until it passes and close the investigation. The correct response: when a test fails intermittently and you cannot reproduce it by manipulating the test itself, suspect the code under test, not the test. Look for shared mutable state accessed from multiple threads, callbacks that assume execution order, and async operations without proper synchronisation.
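
To make that concrete, here is a small, self-contained Kotlin sketch of the pattern to look for: a plain counter incremented from multiple threads loses updates only some of the time, while the atomic version never does - exactly the intermittent behaviour a flaky test reports.

```kotlin
import java.util.concurrent.atomic.AtomicInteger
import kotlin.concurrent.thread

// A sketch of the kind of production bug a flaky test exposes:
// two counters, one updated with a plain read-modify-write, one atomically.
class DownloadCounter {
    var unsafeCount = 0                  // not thread-safe: increments can be lost
    val safeCount = AtomicInteger(0)     // the fix: atomic increment

    fun recordUnsafe() { unsafeCount++ }
    fun recordSafe() { safeCount.incrementAndGet() }
}

fun main() {
    val counter = DownloadCounter()
    val workers = List(4) {
        thread {
            repeat(10_000) {
                counter.recordUnsafe()
                counter.recordSafe()
            }
        }
    }
    workers.forEach { it.join() }
    // safeCount is always 40000; unsafeCount is usually lower, but not on every run -
    // which is exactly the intermittent failure pattern a test against this code would show.
    println("unsafe=${counter.unsafeCount} safe=${counter.safeCount.get()}")
}
```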

Debugging mobile apps is always harder when the failure is non-deterministic. Race conditions in production code that manifest as flaky tests are exactly the category of bug that only shows up in production under load - which makes the flaky test that intermittently catches it genuinely valuable, if you treat it correctly instead of suppressing it.

How Flaky Tests Kill Engineering Velocity: The Behavioural Economics

The direct costs of flaky tests are easy to measure: engineering time spent on reruns, CI minutes wasted, investigation hours that yield no fix. These are real. Studies from the Google Testing Blog showed that internally, Google identified over 1.6 million flaky test runs per day across their codebase - and this is at a company with world-class testing infrastructure.

But the direct costs are not the main problem. The main problem is the indirect, behavioural damage.

The Ignore/Retry Culture


The moment a team starts routinely rerunning failing tests instead of investigating them, a cultural norm is being established: red builds are not automatically meaningful.

This norm is rational from an individual engineer's perspective. The flaky test has failed 40 times without a real cause being identified. Rerunning takes 4 minutes. Investigating takes 4 hours. The incentive structure rewards ignoring.

But from a system perspective, the norm is catastrophic. Because now when a real failure appears - a genuine regression - it enters a team that has been trained for months to treat red builds as noise. The regression sits in the pipeline, failing intermittently between reruns, until it either gets caught in manual QA (expensive) or ships to production (more expensive).

The Coverage Illusion Gets Worse

Test coverage metrics are already misleading because they measure code execution, not correctness. Flaky tests make this worse. A test that passes 70% of the time counts toward your coverage percentage 100% of the time. Your coverage dashboard shows 80% coverage. What it is actually showing is: 80% of your code is executed by tests, some fraction of which may or may not be asserting anything meaningful depending on what time of day CI runs.

Velocity Slows Without Anyone Declaring It Slowed

This is the stealth damage. Nobody announces "we are now 15% slower due to flaky tests." It just becomes the new baseline. PRs take longer because pipelines need reruns. Releases are delayed because the pre-release suite cannot get a clean green build. Developers context-switch to other work while waiting for reruns to complete.

A Microsoft Research study on flaky tests found that flaky tests are the number one cause of CI pipeline failures across large engineering organisations - ahead of actual bugs, infrastructure failures, and configuration errors combined. The hidden cost is not the reruns. It is the disruption to flow state and the accumulated calendar time that disappears into waiting.

Every rerun is not just wasted CI minutes. It is an engineer breaking context, losing momentum, and mentally categorising their test suite as optional.

How to Debug a Flaky Test That Refuses to Fail Consistently

The specific frustration of debugging flakiness is that you cannot reproduce it on demand. Here is a systematic approach that works.

Step 1: Classify It Before Debugging It

Before spending any time on root cause analysis, determine which category of flakiness you are dealing with:

| Classification Question | Answer | Likely Cause |
| --- | --- | --- |
| Does it fail alone or only in the suite? | Only in the suite | Shared state / test order dependency |
| Does it fail locally or only on CI? | Only on CI | Environment / infrastructure instability |
| Does it fail consistently on slow machines? | Yes | Async timing - static sleeps instead of condition-based waits |
| Does the failure location change between runs? | Yes | Race condition in production code |
| Does it fail after a specific other test? | Yes | Shared mutable state from that test |

Five minutes of classification saves hours of unfocused debugging.

Step 2: Run It 50 Times in a Row

This sounds blunt. It is the right tool. Running a test 50 consecutive times on your local machine produces the failure rate and - critically - the failure pattern. Does it fail on run 3 every time? That is a timing dependency tied to something that happens consistently during initialisation. Does it fail roughly every 15 runs with no pattern? That is a race condition. Does it never fail locally but fails on CI? Stop looking at the test and start looking at the runner.

For Android: run adb shell am instrument -w -r -e class com.your.test.Class, followed by your test package and runner component, in a shell loop. For iOS: run xcodebuild test -scheme YourScheme in a shell loop. Both are automatable and will produce reproducible failure rates.
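
If you prefer to stay inside the test framework rather than a shell loop, a small JUnit 4 rule can do the repetition for you. This is a sketch, not a library API - RepeatRule is a name invented here:

```kotlin
import org.junit.rules.TestRule
import org.junit.runner.Description
import org.junit.runners.model.Statement

// Runs the annotated test N times in a row and fails on the first intermittent failure,
// reporting which repetition broke so you can estimate a local failure rate.
class RepeatRule(private val times: Int) : TestRule {
    override fun apply(base: Statement, description: Description): Statement =
        object : Statement() {
            override fun evaluate() {
                repeat(times) { run ->
                    try {
                        base.evaluate()
                    } catch (t: Throwable) {
                        throw AssertionError("Failed on repetition ${run + 1} of $times", t)
                    }
                }
            }
        }
}

// Usage in a test class:
// @get:Rule val repeat = RepeatRule(times = 50)
```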

Step 3: Isolate the Execution Context

Run the test:

  1. Alone, in isolation
  2. After the test that runs immediately before it in the suite
  3. After a test that you suspect modifies shared state
  4. With and without mocked network calls
  5. On CI, with verbose logging enabled

The execution context in which the failure appears or disappears tells you exactly which root cause you are dealing with.

Step 4: Add Logging at the Failure Point

When the test fails, the error message is rarely enough. Add structured logging at every meaningful state transition in the test: before the action, after the action, before the assertion, what the assertion is comparing. When the test fails, you now have a trace of what actually happened versus what was expected.

For Android, use Log.d with a consistent tag during debugging. For iOS, print() or XCTContext.runActivity. These logs appear in CI output and in local test runs alike.
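
On Android, that can be as small as a helper that wraps each phase of the test. A sketch - the tag and the step() function are hypothetical conventions, not framework APIs:

```kotlin
import android.util.Log

private const val TAG = "FlakyDebug"

// Wrap each meaningful phase of a test so CI logs show exactly
// which step completed last before a failure.
fun step(name: String, block: () -> Unit) {
    Log.d(TAG, "BEGIN $name")
    block()
    Log.d(TAG, "END   $name")
}

// Usage inside a test body:
// step("tap load button")     { onView(withId(R.id.load_button)).perform(click()) }
// step("assert list visible") { onView(withId(R.id.result_list)).check(matches(isDisplayed())) }
```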

Step 5: Look at the Production Code, Not Just the Test

Once you have classified the failure and isolated the context, if nothing in the test itself explains the non-determinism - look at the code under test. Find every asynchronous operation, every shared singleton, every thread that the code under test interacts with. The flaky test is telling you something non-deterministic is happening. Trust it.

Stabilisation Strategies: Prioritised by ROI


Not all flaky tests are worth fixing with the same urgency. Here is a prioritisation framework.

Priority 1: Quarantine First, Fix Second

The worst thing you can do with a known flaky test is leave it in the main suite untagged. It will continue to train your team to ignore red builds while you work on the fix.

The right immediate response: quarantine it. Most test frameworks support tagging or categorisation:

  • JUnit 5 (Android): @Tag("flaky") - run with excludeTags = ["flaky"] in the main pipeline, run in a separate nightly flaky-test-investigation pipeline
  • XCTest (iOS): Custom test plans in Xcode allow test exclusion by class or method
  • Appium / Detox: Configuration flags to skip specific tests in the main run

Quarantining does not mean ignoring. It means the test no longer poisons the main pipeline while a tracked issue works through the fix backlog. Every quarantined test must have a corresponding bug ticket with a deadline.
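
For a JVM test module using JUnit 5, the Gradle side of that quarantine can look like the sketch below. The task name and the "flaky" tag are conventions, not requirements, and Android instrumented tests need a different mechanism (test plans or runner arguments):

```kotlin
// build.gradle.kts

// In the test source, mark the quarantined test:
// @Tag("flaky")
// @Test fun intermittentCheckoutFlow() { ... }

// Main pipeline runs everything except quarantined tests.
tasks.test {
    useJUnitPlatform {
        excludeTags("flaky")
    }
}

// Separate nightly task runs only the quarantined tests for investigation.
tasks.register<Test>("flakyTests") {
    useJUnitPlatform {
        includeTags("flaky")
    }
    testClassesDirs = sourceSets["test"].output.classesDirs
    classpath = sourceSets["test"].runtimeClasspath
}
```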

Priority 2: Fix Async Timing Issues Immediately

These are the cheapest to fix and the most common. Replace every Thread.sleep() and every hardcoded timeout with explicit condition-based waits. This is a line-by-line audit of your E2E and UI test suite that can be done incrementally.

Immediate payoff: the test becomes deterministic. It either passes (the condition is met) or fails (the condition is not met within the timeout). No more 70/30 outcomes.

Priority 3: Enforce Test Isolation in Setup/Teardown

For every test class in your suite, audit the @Before / setUp() and @After / tearDown() methods. Every test must:

  1. Set up all state it needs, not rely on state left by a previous test
  2. Clean up all state it creates, regardless of pass or fail
  3. Use fresh instances of in-memory databases, not production databases
  4. Reset any singletons or static state it touches

This is a larger refactor than fixing timing issues, but it eliminates the entire category of order-dependency flakiness permanently.

Priority 4: Mock External Dependencies in Unit and Integration Tests

No test below the E2E tier should make real network calls. Not to staging. Not to a local server. Only to a mock. This is non-negotiable for deterministic tests.

OkHttp MockWebServer for Android and OHHTTPStubs for iOS let you declare the exact HTTP responses your tests receive. The test does not know or care whether the backend is up. It receives a canned response in 5ms, every time, deterministically.

The objection teams raise: "but then we are not testing the real integration." Correct. That is what E2E tests are for. Unit and integration tests test logic under known conditions. Real network calls in unit tests are a test design error.
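
A minimal sketch of that setup with OkHttp's MockWebServer. The endpoint path and response body are made up; in a real suite the canned response would feed the parser or repository under test:

```kotlin
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.mockwebserver.MockResponse
import okhttp3.mockwebserver.MockWebServer
import org.junit.After
import org.junit.Assert.assertEquals
import org.junit.Before
import org.junit.Test

class ProfileApiTest {
    private val server = MockWebServer()

    @Before fun startServer() { server.start() }
    @After fun stopServer() { server.shutdown() }

    @Test
    fun profileRequestReceivesCannedResponse() {
        // The test owns the "backend": this exact response is returned, every time.
        server.enqueue(
            MockResponse()
                .setResponseCode(200)
                .setBody("""{"id": 42, "name": "Test User"}""")
        )

        val client = OkHttpClient()
        val request = Request.Builder().url(server.url("/profile/42")).build()
        val response = client.newCall(request).execute()

        assertEquals(200, response.code)
        // In a real suite, feed response.body into the code under test and assert on its output.
    }
}
```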

Priority 5: Audit CI Infrastructure

Review your CI runner configuration:

  • Are resource limits (CPU, memory) appropriate for the test workload?
  • Are emulator instances being shared across parallel test runs in a way that creates port or file system conflicts?
  • Are external staging dependencies in your test path subject to independent downtime?
  • Are test runners being warmed up correctly before E2E test execution begins?

These are infrastructure changes that require coordination with whoever manages your CI pipeline, but they eliminate entire categories of environment-driven flakiness that cannot be fixed in the test code itself.

The System That Keeps Flakiness From Coming Back

Fixing existing flaky tests is a one-time project. Preventing new ones is a continuous practice. The teams that keep flakiness under control long-term do three things structurally different from everyone else.

They Track Flakiness as a Metric

Flakiness rate per test, per test class, and per pipeline run should be a tracked metric on the same dashboard as test coverage and build time. When a test's pass rate drops below a threshold - say, 98% over the last 30 runs - it is automatically flagged for review.
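
The computation itself is trivial once results are recorded. A sketch, assuming you export one (test name, passed) record per test per CI run - the record type and function are hypothetical:

```kotlin
// One record per test per CI run, exported from your pipeline's JUnit XML or test API.
data class TestRun(val testName: String, val passed: Boolean)

// Flag any test whose pass rate over its recorded runs falls below the threshold,
// skipping tests that have not yet accumulated enough runs to judge.
fun flagFlaky(
    history: List<TestRun>,
    threshold: Double = 0.98,
    minRuns: Int = 30
): List<String> =
    history.groupBy { it.testName }
        .filterValues { runs -> runs.size >= minRuns }
        .mapValues { (_, runs) -> runs.count { it.passed }.toDouble() / runs.size }
        .filterValues { passRate -> passRate < threshold }
        .keys
        .sorted()
```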

Buildkite and CircleCI ship built-in test analytics, and GitHub Actions can track results across runs via reporting actions. Tools like Trunk Flaky Tests and BuildPulse are purpose-built for flakiness detection and tracking at the test level. You cannot manage what you do not measure.

They Have a Zero-Tolerance Policy for Unquarantined Flaky Tests

The norm in most teams is: flaky tests live in the main suite until someone gets around to fixing them, which is never, because fixing them is always lower priority than shipping the next feature.

The norm in high-functioning teams is: a test that fails three times in a row for no code change is automatically quarantined and a ticket is created. Quarantine is not shame - it is maintenance. The flaky test is still being tracked and fixed. It is just not being given the ability to train the rest of the team to ignore red builds.

They Do Flaky Test Retrospectives, Not Just Bug Retrospectives

When a real regression ships to production and post-mortem analysis reveals that it was present in the CI pipeline as a failing test that was rerun into green - that is a flaky test retrospective moment. Not a blame moment. A systems question: which test should have caught this, why did it not catch it reliably, and what do we change so this category of failure cannot slip through the same way again.

Continuous testing in CI/CD only functions as a release gate when the tests in that gate are trustworthy. Flakiness audits are the maintenance that keeps the gate trustworthy over time.

The Question That Reveals Your Flakiness Problem


Here is the direct audit for your team:

When a CI build goes red on your main branch, what is the first thing your engineers do?

If the answer is "rerun it to see if it is a flaky test" - your test suite has a trust problem. The engineers already know that a red build is not necessarily a real failure. They have internalised the statistical reality of your flakiness rate and are compensating for it manually.

How many tests in your current suite have a pass rate below 95%? How many have a pass rate below 99%? Do you know? If you do not have this data, you do not have visibility into your flakiness problem - you only have the subjective impression that "some tests are flaky," which is exactly the kind of vague awareness that allows the problem to persist indefinitely.

Test coverage metrics that look healthy can mask a deeply unreliable suite. Flakiness rate is the metric that tells you whether the tests you have are actually trustworthy, not just present.

A team that reruns CI builds as a matter of routine is a team that has stopped using CI as a quality gate and started using it as a checkbox.

Conclusion: Unreliable Tests Are Worse Than No Tests

This is a strong claim, and it is defensible.

No tests: your team knows the safety net does not exist. They compensate with more manual testing, more cautious release practices, more conservative change scope. The risk is visible and managed accordingly.

Unreliable tests: your team believes a safety net exists. They ship with confidence. The safety net fails intermittently and nobody can tell the difference between "this test failed because of a real regression" and "this test failed because it is Tuesday." They compensate by ignoring red builds. The risk is invisible and compounds.

The invisible risk is worse. It is the kind that only becomes visible when a real regression reaches production and the post-mortem reveals that the CI pipeline failed this exact test and someone reran it until green six days ago.

Flaky tests are not a testing problem. They are a systems problem - a signal that your test suite has grown faster than your test discipline. The causes are always deterministic. The fixes are always achievable. The only thing that keeps flaky tests alive in most codebases is the short-term incentive to rerun instead of investigate, accumulated over months until the team's relationship with their own CI is broken.

Reliability engineering requires that your quality signals are trustworthy. A test suite where 15% of tests are known flaky is not a quality signal. It is a source of noise that happens to occasionally correlate with real problems. That is not good enough - and treating it as good enough is how real failures ship.

Fix the flaky tests. Not because it will make your coverage percentage look better. Because without trustworthy tests, everything else in your quality system is built on an unstable foundation.

Key Takeaways

  • A flaky test is not a broken test - it is a test that produces no reliable information about your code. Every pass is meaningless. Every fail is also meaningless. It is pure noise in your quality measurement system.
  • There are exactly four root causes of flakiness: async timing dependencies (wrong waits), shared state between tests (order dependency), environment instability (CI infrastructure), and race conditions in the production code (the one case where the flaky test is exposing a real bug).
  • The worst damage flaky tests do is behavioural, not operational. They train engineers to rerun CI instead of investigating failures - which means real regressions get rerun into green along with the noise.
  • Quarantine flaky tests immediately when identified. Do not leave them in the main pipeline while the fix is worked on. They will continue to erode trust in the suite every day they remain.
  • The metric that matters: per-test pass rate over the last N runs. Any test below 99% reliability deserves investigation. Any test below 95% should be quarantined now.
  • High-functioning teams treat flakiness rate as a tracked metric, not a subjective complaint. They have a zero-tolerance policy for unquarantined flaky tests and run flaky test retrospectives when regressions reveal that the suite failed to catch a real bug.
  • The systemic fix is not a debugging sprint - it is three permanent practices: flakiness rate tracking on the same dashboard as coverage, automatic quarantine when a threshold is breached, and test isolation standards enforced at code review.


Frequently Asked Questions

What is a flaky test?
A flaky test is an automated test that produces inconsistent results - passing sometimes and failing other times - for the same code, without any change to the code itself. The failure is not random; it is triggered by conditions that are not part of the test's declared inputs, such as timing dependencies, shared mutable state between tests, CI infrastructure variability, or race conditions in the production code. Flaky tests are damaging precisely because they cannot be trusted: their failures cannot be distinguished from real regression failures, which trains engineers to ignore red CI builds.
What causes flaky tests?
The four root causes of flaky tests are: async timing dependencies (tests that assert on time rather than observable state, causing failures when execution is slower than expected), test order dependencies (tests that depend on or pollute shared state that is not properly reset between tests), environment instability (network calls to external services, resource-throttled CI runners, or parallel execution conflicts), and race conditions in the production code (where the test is actually exposing a non-deterministic behaviour in the code under test, making the flaky test a real signal rather than a noise source).
Are flaky tests worse than no tests?
Yes, in a specific and important sense. No tests means your team knows the safety net does not exist and compensates accordingly. Flaky tests create the illusion that a safety net exists while systematically training engineers to ignore the safety net's warnings. The result is that real regressions - genuine failures caught by the test - get rerun into green alongside the noise. The invisible false confidence produced by an unreliable test suite is more dangerous than the visible absence of tests.
How do you fix a flaky test?
The fix depends on the root cause. For async timing issues: replace static sleeps with explicit condition-based waits using IdlingResource (Android), waitForExistence (iOS), or WebDriverWait (Appium). For shared state issues: add proper setup and teardown to every test class, use in-memory test databases instead of production databases, and reset any singletons explicitly. For environment instability: mock all external network calls in unit and integration tests using MockWebServer or OHHTTPStubs, and audit CI runner resource allocation. For race conditions in production code: fix the production code - the flaky test is a real signal, not a test problem.
How do you manage flaky tests while fixing them?
Quarantine them. Most test frameworks support tagging or test plan exclusion - mark flaky tests explicitly and run them in a separate pipeline from the main CI gate. This stops the flaky test from training your team to ignore red builds while a tracked fix works through the backlog. Every quarantined test must have a corresponding bug ticket with an owner and a deadline. Quarantine is not the same as deletion - the test is still being tracked and fixed, just not poisoning your primary quality signal.
What is a healthy flaky test rate?
Google's engineering team has documented targeting a flakiness rate below 1% per test (99%+ pass rate across repeated runs). Any test with a pass rate below 95% across recent runs should be treated as broken and quarantined immediately. At the suite level, if more than 2-3% of your tests are known flaky, your team has already developed dysfunctional norms around rerunning builds - the culture problem is likely more severe than the raw percentage suggests.
How do you track flakiness across a test suite?
You need test result history, not just pass/fail for the current run. Tools like Trunk Flaky Tests, BuildPulse, or the built-in test analytics in CircleCI and Buildkite track per-test pass rates across runs and surface the flakiest tests automatically. In GitHub Actions, publishing JUnit XML reports through a test-reporting action enables the same historical tracking. The minimum viable approach: log test results to a simple spreadsheet or database after each CI run and review the per-test pass rate weekly. What you cannot measure, you cannot manage.