TL;DR: A flaky test is not a broken test. It is a test that lies to your team on a schedule - passing 70% of the time, failing 30% for reasons nobody can reproduce. And a lie with a 70% truth rate is more dangerous than silence, because it produces false confidence while systematically training your engineers to ignore red CI builds. This article covers the four root causes of flakiness, the exact behavioral damage it does to engineering velocity, how to debug a test that refuses to fail consistently, and the only stabilization system that scales.
Here is a pattern every engineering team eventually lives through: A developer pushes a pull request. CI runs. Two tests fail - tests that have failed randomly for weeks. The developer reruns the pipeline. Everything goes green. They merge.
Nobody files a bug. Nobody investigates. Nobody even mentions it in the PR review. Because everyone already knows those tests are flaky. They fail sometimes. You rerun and they pass. That is just how it is.
Two weeks later, a real regression ships. A genuine bug - introduced in that PR or one of the twelve that followed it - slips past CI because the team has learned, through weeks of accumulated experience, that red CI builds are not signal. They are noise. You rerun until green and ship.
That is the damage flaky tests do. Not just the time wasted on reruns. Not just the engineering hours spent investigating test failures that are not real failures. The deeper damage is what they do to the team's relationship with their own test suite. They teach engineers that the safety net is optional. And once a team learns that, unlearning it takes months.
This is why more tests do not automatically mean more confidence. A test suite with 20% flaky tests does not have 80% of its value intact. It has significantly less, because the flaky tests have contaminated the trust in the whole system. You cannot look at a red build and know what it means.
What a Flaky Test Actually Is (and What It Is Not)
The technical definition is simple: a flaky test is one that produces different results - pass or fail - for the same code, without any change to the code itself.
But that definition understates the problem. Let's be more precise.
A flaky test is a test that:
- Fails under a specific timing, environment state, or execution order that nobody explicitly set up
- Cannot be reproduced on demand - you cannot make it fail by running it alone
- Passes when you rerun it, which trains your team to treat reruns as a fix
- Reports nothing meaningful about the correctness of the code it is supposed to test
That last point is the one teams miss. A flaky test is not testing anything reliable. Every time it passes, you learn nothing about your code. Every time it fails, you also learn nothing about your code. It is a pure noise generator dressed as a quality gate.
The distinction matters because teams often treat flaky tests as merely annoying - a productivity tax they tolerate because fixing them is slower than rerunning them. The correct framing is harsher: a flaky test is a broken instrument in your quality measurement system. It gives you numbers that look like data but tell you nothing.
A test suite with flaky tests is not a partially reliable safety net. It is an unreliable one. There is no such thing as "mostly trustworthy" for a CI gate.
The Four Root Causes of Flakiness (and Why "It Just Does That" Is Never the Answer)
Flaky tests are not random. That is the first thing to internalize. They appear random because the conditions that trigger them are not part of your test's declared inputs - but the conditions exist. Every flaky test has a deterministic cause. Finding it is a debugging problem, not a luck problem.
There are four root causes that account for the overwhelming majority of flaky tests in mobile test suites.
1. Asynchronous Timing Dependencies
This is the most common cause in mobile testing, and it is almost entirely an E2E and UI test problem.
The test clicks a button, then immediately asserts that a screen element is visible. The element requires a network call to populate. On a fast CI machine the call completes in 80ms, the element renders, the assertion passes. On a slower runner under load, the call takes 300ms, the assertion fires before the element renders, the test fails.
Same code. Same test. Different infrastructure load on different runs.
The broken fix most teams apply: add a static sleep. Thread.sleep(2000). Now the test passes more often, but it is brittle - increase your backend latency in staging and it starts failing again - and it makes your entire test suite slower for no quality gain.
The correct fix: explicit waits tied to observable conditions, not time. Espresso for Android provides IdlingResource - a mechanism that tells the test runner to wait until a specific condition is met (an animation completes, a network call finishes) rather than waiting an arbitrary number of milliseconds. XCUITest on iOS provides waitForExistence(timeout:) for element conditions. Appium provides explicit wait strategies via WebDriverWait.
The rule is: never assert on time. Assert on state.
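Here is a minimal Kotlin sketch of the Espresso approach. CountingIdlingResource and IdlingRegistry are the real Espresso APIs; the NetworkIdler object and where you call begin()/end() are illustrative, not part of the library.

```kotlin
import androidx.test.espresso.IdlingRegistry
import androidx.test.espresso.idling.CountingIdlingResource

// Counter the app increments just before async work starts and decrements
// when it completes. Espresso blocks the next action or assertion until
// the counter returns to zero - a wait on state, not on time.
object NetworkIdler {
    val resource = CountingIdlingResource("network")

    fun begin() = resource.increment()  // call right before the network request
    fun end() = resource.decrement()    // call in the success/error callback
}

// In the test class:
// @Before fun setUp() { IdlingRegistry.getInstance().register(NetworkIdler.resource) }
// @After fun tearDown() { IdlingRegistry.getInstance().unregister(NetworkIdler.resource) }
```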
2. Test Order Dependencies and Shared State
The test passes when run alone. It fails when run as part of the full suite. Specifically, it fails after test #47 runs.
This is a state pollution problem. Test #47 modifies a shared resource - a database, a singleton, a static variable, a cached file - and does not clean it up. When your test runs after it, it inherits a state it was not written for and produces unexpected results.
On mobile, the most common shared state vectors are:
| Shared State Source | What Gets Polluted | How It Manifests |
|---|---|---|
| Shared app database (Room, SQLite) | Rows from a previous test persist | Assertion finds wrong count or record |
| Singleton services (auth state, session) | Auth token from test A bleeds into test B | Test B skips login when it should not |
| SharedPreferences / UserDefaults | Settings set in test A affect test B | Unexpected app behaviour in subsequent test |
| File system cache | Media, config files from previous test | File-not-found or wrong-content failures |
| In-memory app state | Static flags, feature toggles | Features enabled/disabled incorrectly |
The correct fix is strict test isolation: each test must set up its own state before running and tear it down completely afterwards, regardless of whether it passed or failed. On Android, that means @Before/@After lifecycle hooks (often packaged as @Rule implementations), an in-memory test database instead of the production database, and explicit resets of singletons. On iOS, it means fresh app launches, or XCUIApplication().terminate() between test classes when state cannot be reliably reset.
If a test cannot pass reliably both in isolation and in any position within the suite, it has a design flaw, not just a timing problem.
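As a sketch of what that isolation looks like on Android - AppDatabase is a stand-in for your own Room database class:

```kotlin
import android.content.Context
import androidx.room.Room
import androidx.test.core.app.ApplicationProvider
import org.junit.After
import org.junit.Before

class UserDaoTest {
    private lateinit var db: AppDatabase

    @Before
    fun createDb() {
        // In-memory database: rows exist only for this test and are never
        // shared with the real app database or with other tests.
        db = Room.inMemoryDatabaseBuilder(
            ApplicationProvider.getApplicationContext<Context>(),
            AppDatabase::class.java
        ).allowMainThreadQueries().build()
    }

    @After
    fun closeDb() {
        // Runs whether the test passed or failed, so no state leaks forward.
        db.close()
    }
}
```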
3. Environment and Infrastructure Instability
The test is correctly written and well-isolated. It still fails intermittently on CI but never locally.
This is an infrastructure problem, not a test problem. Common causes:
- CI runners under resource pressure - a CPU-throttled container means animations run slower, timeouts expire before UI transitions complete, and async operations take longer than your waits account for
- Network calls to external services in tests - if your test hits a staging backend, backend latency variance directly becomes test flakiness. One call returns in 100ms, another in 2000ms, and your hardcoded timeout is 500ms
- Emulator instability - Android emulators in particular are notorious for intermittent rendering issues, ANR dialogs that appear mid-test, and slow cold starts that break assumptions about initial app state
- Parallel test execution conflicts - when tests run in parallel and share device ports, file paths, or process names, races between concurrent tests produce failures that are nearly impossible to reproduce in serial execution
The correct approach: mock all external network calls in unit and integration tests. Use WireMock or OkHttp MockWebServer on Android, or OHHTTPStubs on iOS, to serve deterministic responses without touching a real network. Isolate E2E tests that genuinely need network calls onto their own execution tier with appropriate timeouts. Audit CI runner resource allocation - a test that passes on 4 CPU cores may flake consistently on 2.
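A minimal sketch of the MockWebServer approach on Android; the class name, response body, and where you wire in the base URL are illustrative:

```kotlin
import okhttp3.mockwebserver.MockResponse
import okhttp3.mockwebserver.MockWebServer
import org.junit.After
import org.junit.Before

class ProfileApiTest {
    private lateinit var server: MockWebServer

    @Before
    fun startServer() {
        server = MockWebServer()
        // Deterministic response: same body, same status, every single run.
        server.enqueue(
            MockResponse()
                .setResponseCode(200)
                .setBody("""{"name":"Ada","plan":"pro"}""")
        )
        server.start()
        // Point the app's API client at server.url("/") instead of staging.
    }

    @After
    fun stopServer() {
        server.shutdown()
    }
}
```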
4. Race Conditions in the Code Being Tested
This is the most important root cause because it is the one that reveals a real bug.
Sometimes a test is flaky because the code it is testing is flaky. There is a race condition, a thread safety issue, or a non-deterministic execution path in the production code - and the test is exposing it, just inconsistently.
This is the case where the flaky test is actually doing its job. The fact that it fails intermittently is not a test problem - it is a signal that the production code behaves non-deterministically under certain timing conditions that the test occasionally triggers.
The mistake teams make: they rerun the test until it passes and close the investigation. The correct response: when a test fails intermittently and you cannot reproduce it by manipulating the test itself, suspect the code under test, not the test. Look for shared mutable state accessed from multiple threads, callbacks that assume execution order, and async operations without proper synchronisation.
Debugging mobile apps is always harder when the failure is non-deterministic. Race conditions in production code that manifest as flaky tests are exactly the category of bug that only shows up in production under load - which makes the flaky test that intermittently catches it genuinely valuable, if you treat it correctly instead of suppressing it.
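To make the category concrete, here is the shape of production code that produces exactly this kind of intermittent failure - a hypothetical tracker whose counter is updated from multiple threads:

```kotlin
import java.util.concurrent.atomic.AtomicInteger

// Racy: completed++ is a read-modify-write. Two callbacks arriving on
// different threads can interleave and lose an update, so a test asserting
// the final count fails only when the scheduler happens to interleave them.
class DownloadTracker {
    var completed = 0
    fun onDownloadFinished() { completed++ }
}

// Fixed: the update is atomic, so the count survives any interleaving.
class SafeDownloadTracker {
    private val completed = AtomicInteger(0)
    fun onDownloadFinished() { completed.incrementAndGet() }
    fun count(): Int = completed.get()
}
```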
How Flaky Tests Kill Engineering Velocity: The Behavioural Economics
The direct costs of flaky tests are easy to measure: engineering time spent on reruns, CI minutes wasted, investigation hours that yield no fix. These are real. Google has reported on its Testing Blog that almost 16% of its tests show some level of flakiness - and this is at a company with world-class testing infrastructure.
But the direct costs are not the main problem. The main problem is the indirect, behavioural damage.
The Ignore/Retry Culture
The moment a team starts routinely rerunning failing tests instead of investigating them, it establishes a cultural norm: red builds are not automatically meaningful.
This norm is rational from an individual engineer's perspective. The flaky test has failed 40 times without a real cause being identified. Rerunning takes 4 minutes. Investigating takes 4 hours. The incentive structure rewards ignoring.
But from a system perspective, the norm is catastrophic. Because now when a real failure appears - a genuine regression - it enters a team that has been trained for months to treat red builds as noise. The regression sits in the pipeline, failing intermittently between reruns, until it either gets caught in manual QA (expensive) or ships to production (more expensive).
The Coverage Illusion Gets Worse
Test coverage metrics are already misleading because they measure code execution, not correctness. Flaky tests make this worse. A test that passes 70% of the time counts toward your coverage percentage 100% of the time. Your coverage dashboard shows 80% coverage. What it is actually showing is: 80% of your code is executed by tests, some fraction of which may or may not be asserting anything meaningful depending on what time of day CI runs.
Velocity Slows Without Anyone Declaring It Slowed
This is the stealth damage. Nobody announces "we are now 15% slower due to flaky tests." It just becomes the new baseline. PRs take longer because pipelines need reruns. Releases are delayed because the pre-release suite cannot get a clean green build. Developers context-switch to other work while waiting for reruns to complete.
Microsoft Research has studied flakiness at scale and identified flaky tests as one of the leading causes of CI pipeline failures in large engineering organisations. The hidden cost is not the reruns. It is the disruption to flow state and the accumulated calendar time that disappears into waiting.
Every rerun is not just wasted CI minutes. It is an engineer breaking context, losing momentum, and mentally categorising their test suite as optional.
How to Debug a Flaky Test That Refuses to Fail Consistently
The specific frustration of debugging flakiness is that you cannot reproduce it on demand. Here is a systematic approach that works.
Step 1: Classify It Before Debugging It
Before spending any time on root cause analysis, determine which category of flakiness you are dealing with:
| Classification Question | Answer | Likely Cause |
|---|---|---|
| Does it fail alone or only in the suite? | Only in the suite | Shared state / test order dependency |
| Does it fail locally or only on CI? | Only on CI | Environment / infrastructure instability |
| Does it fail consistently on slow machines? | Yes | Async timing - static sleeps instead of condition-based waits |
| Does the failure location change between runs? | Yes | Race condition in production code |
| Does it fail after a specific other test? | Yes | Shared mutable state from that test |
Five minutes of classification saves hours of unfocused debugging.
Step 2: Run It 50 Times in a Row
This sounds blunt. It is the right tool. Running a test 50 consecutive times on your local machine produces the failure rate and - critically - the failure pattern. Does it fail on run 3 every time? That is a timing dependency tied to something that happens consistently during initialisation. Does it fail roughly every 15 runs with no pattern? That is a race condition. Does it never fail locally but fails on CI? Stop looking at the test and start looking at the runner.
For Android: run adb shell am instrument -w -r --no-hidden-api-checks -e class com.your.test.Class com.your.package.test/androidx.test.runner.AndroidJUnitRunner in a shell loop. For iOS: run xcodebuild test -scheme YourScheme in a shell loop. Both are automatable and will produce reproducible failure rates.
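If you would rather stay inside the test framework than script a shell loop, a small JUnit 4 rule does the same job. This is a sketch; RepeatRule is not a built-in, it is a helper you define yourself:

```kotlin
import org.junit.rules.TestRule
import org.junit.runner.Description
import org.junit.runners.model.Statement

// JUnit 4 rule that runs each test body N times and reports which run failed.
class RepeatRule(private val times: Int) : TestRule {
    override fun apply(base: Statement, description: Description): Statement =
        object : Statement() {
            override fun evaluate() {
                for (run in 1..times) {
                    try {
                        base.evaluate() // executes the test body once
                    } catch (t: Throwable) {
                        throw AssertionError(
                            "${description.displayName} failed on run $run of $times", t
                        )
                    }
                }
            }
        }
}

// Usage in a test class:
// @get:Rule val repeat = RepeatRule(50)
```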
Step 3: Isolate the Execution Context
Run the test:
- Alone, in isolation
- After the test that runs immediately before it in the suite
- After a test that you suspect modifies shared state
- With and without mocked network calls
- On CI, with verbose logging enabled
The execution context in which the failure appears or disappears tells you exactly which root cause you are dealing with.
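On Android with JUnit 4, you can pin the execution order with a test suite to reproduce a suspected ordering failure on demand. SuspectStateTest and FlakyTest are placeholders for your own classes:

```kotlin
import org.junit.runner.RunWith
import org.junit.runners.Suite

// Pins execution order so the suspected polluter always runs first.
@RunWith(Suite::class)
@Suite.SuiteClasses(SuspectStateTest::class, FlakyTest::class)
class OrderReproSuite
```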
Step 4: Add Logging at the Failure Point
When the test fails, the error message is rarely enough. Add structured logging at every meaningful state transition in the test: before the action, after the action, before the assertion, what the assertion is comparing. When the test fails, you now have a trace of what actually happened versus what was expected.
For Android, use Log.d with a consistent tag during debugging. For iOS, print() or XCTContext.runActivity. These logs appear in CI output and in local test runs alike.
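A sketch of what that looks like in an Espresso test - the view IDs, expected text, and tag are illustrative and come from your own app:

```kotlin
import android.util.Log
import androidx.test.espresso.Espresso.onView
import androidx.test.espresso.action.ViewActions.click
import androidx.test.espresso.assertion.ViewAssertions.matches
import androidx.test.espresso.matcher.ViewMatchers.withId
import androidx.test.espresso.matcher.ViewMatchers.withText
import org.junit.Test

private const val TAG = "FlakyDebug"

class ProfileScreenTest {
    @Test
    fun loadProfile_showsUserName() {
        // Log every state transition: before the action, after it, and
        // around the assertion, so a failure leaves a trace of what happened.
        Log.d(TAG, "before action: tapping loadProfileButton")
        onView(withId(R.id.loadProfileButton)).perform(click())

        Log.d(TAG, "after action: tap dispatched, asserting on userName")
        onView(withId(R.id.userName)).check(matches(withText("Ada")))

        Log.d(TAG, "assertion passed: userName showed expected text")
    }
}
```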
Step 5: Look at the Production Code, Not Just the Test
Once you have classified the failure and isolated the context, if nothing in the test itself explains the non-determinism - look at the code under test. Find every asynchronous operation, every shared singleton, every thread that the code under test interacts with. The flaky test is telling you something non-deterministic is happening. Trust it.