Mobile app automation promises faster releases and better software quality—but those benefits disappear when test suites become unstable. A single flaky test can trigger false failures, force developers to rerun pipelines, and slow down delivery.

Over time, teams begin to distrust their CI results, treating failures as noise instead of signals of real issues. This is why stability testing metrics are essential in mobile automation.

Rather than relying on intuition or sporadic debugging, engineering teams need measurable indicators that reveal how reliable their automated tests actually are.

Metrics such as flaky test rate, crash-free sessions, retry rate, and mean time to resolve (MTTR) help quantify test reliability, detect instability early, and prioritize fixes before they impact release cycles.

In this guide, we break down the most important stability testing metrics for mobile app automation, including their definitions, formulas, and practical ways to measure them in CI pipelines and device farms.

You’ll also learn how to interpret these metrics, build stability dashboards, and implement remediation practices that keep automated test suites trustworthy as your mobile application scales.

Core Stability Testing Metrics, Formulas, Examples, And Deep Dives

Quick Reference: Key Stability Testing Metrics

| Metric | Formula (copy-ready) | Typical data source | Starter target |
| --- | --- | --- | --- |
| Flaky test rate (%) | (distinct_flaky_tests / total_distinct_tests) * 100 | CI history / rerun logs | < 3–5% |
| Test stability (per test %) | (pass_count / N) * 100 | CI rerun aggregation | > 95% (critical tests) |
| Crash-free sessions (%) | (sessions_without_crash / total_sessions) * 100 | Crashlytics / device farm logs | > 99% (prod-like) |
| Retry rate (%) | (retried_runs / total_runs) * 100 | CI rerun metadata | ≤ 2–4% |
| MTTD (time) | avg(time_detected - time_first_flaky) | CI timestamps / issue tracker | < 24–72 hours |
| MTTR (time) | avg(time_fixed - time_detected) | Issue tracker / PR timestamps | < 48 hours (blocking) |
| Execution time mean & variance | mean(runtime), variance(runtime) | CI timing logs | Low variance |
| Environmental failure rate (%) | (env_failures / total_failures) * 100 | Device farm + infra logs | Monitor for spikes |

1. Flaky Test Rate

Definition And Why It Matters

Flaky test rate is the percentage of distinct tests that record both passes and failures across a chosen window of runs. A high flaky rate undermines trust in CI and drives reruns and ignored failures.

Formula:

Flaky test rate (%) = (distinct_flaky_tests / total_distinct_tests) * 100

Worked Example

Over the last 7 nightly runs you executed 1,200 distinct tests. You reran unstable tests 3 times and discovered 48 distinct tests that passed at least once and failed at least once. Flaky test rate = (48 / 1200) * 100 = 4.0%.

Collection Source

Aggregate CI run results by test name and build id, mark tests with both pass_count > 0 and fail_count > 0 inside the window.
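The aggregation described above can be sketched in a few lines of Python; `runs` is a hypothetical list of `(test_name, status)` records pulled from CI history, not a vendor API.

```python
from collections import defaultdict

def flaky_test_rate(runs):
    """Compute the flaky test rate from (test_name, status) CI records.

    A test counts as flaky if, within the window, it has at least one
    PASS and at least one FAIL.
    """
    passes, fails = defaultdict(int), defaultdict(int)
    for test_name, status in runs:
        if status == "PASS":
            passes[test_name] += 1
        elif status == "FAIL":
            fails[test_name] += 1
    tests = set(passes) | set(fails)
    flaky = {t for t in tests if passes[t] > 0 and fails[t] > 0}
    return len(flaky) / len(tests) * 100 if tests else 0.0

# Example: 4 distinct tests, 1 of them flaky -> 25.0%
runs = [
    ("login", "PASS"), ("login", "FAIL"),      # mixed results: flaky
    ("checkout", "PASS"), ("checkout", "PASS"),
    ("search", "FAIL"), ("search", "FAIL"),    # consistently failing, not flaky
    ("deeplink", "PASS"),
]
print(flaky_test_rate(runs))  # 25.0
```

Note that a consistently failing test is a broken test, not a flaky one, so it is excluded on purpose.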

Remediation Note

Quarantine the top flaky tests for triage and reduce false noise by adding guarded retries for infra failures only; fix brittle locators and synchronization issues for test logic flakiness.

2. Test Stability (Per Test)

Definition And Formula

Test stability (%) = (pass_count / N) * 100 where N is the number of runs in the measurement window.

Example

A critical login test ran 10 times overnight and passed 9 times: stability = (9 / 10) * 100 = 90%. That test needs attention (target > 95%).

Collection And Action

Use nightly profiling to compute per-test stability and produce a ranked list for engineers. Low stability tests should be quarantined or prioritized for fix tickets.
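As an illustration of the nightly profiling step, a minimal Python sketch that ranks tests falling below a stability threshold; the `results` mapping is hypothetical sample data.

```python
def stability_ranking(results, threshold=95.0):
    """Rank tests by per-test stability: (pass_count / N) * 100.

    `results` maps test_name -> list of run statuses ("PASS"/"FAIL").
    Returns (test_name, stability_pct) pairs below `threshold`,
    least stable first, as a triage list for engineers.
    """
    ranked = []
    for name, statuses in results.items():
        stability = statuses.count("PASS") / len(statuses) * 100
        if stability < threshold:
            ranked.append((name, round(stability, 1)))
    return sorted(ranked, key=lambda item: item[1])

results = {
    "login_flow": ["PASS"] * 9 + ["FAIL"],     # 90% -> flagged
    "payment_flow": ["PASS"] * 10,             # 100% -> healthy
    "deep_link": ["PASS"] * 7 + ["FAIL"] * 3,  # 70% -> flagged
}
print(stability_ranking(results))  # [('deep_link', 70.0), ('login_flow', 90.0)]
```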

3. Crash-Free Sessions

Definition And Formula

Crash-free sessions (%) = (sessions_without_crash / total_sessions) * 100

Example

If 1,000 automated sessions executed during a release validation and 10 sessions experienced crashes or ANRs: crash-free sessions = (990 / 1000) * 100 = 99.0%.

Collection And Remediation

Collect crash stack traces and correlate them to test steps. If crashes are reproducible outside the test harness, escalate as product bugs; if they occur only in specific device images, flag infra/device provisioning.
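A small sketch of the per-image correlation, assuming each session record carries a device image label and a crash flag (a hypothetical shape, not a Crashlytics API):

```python
from collections import defaultdict

def crash_free_by_image(sessions):
    """Compute crash-free session % per device image.

    `sessions` is a list of (image_version, crashed) pairs; breaking the
    metric down per image makes a bad device image in the farm stand out.
    """
    total, crashed = defaultdict(int), defaultdict(int)
    for image, did_crash in sessions:
        total[image] += 1
        if did_crash:
            crashed[image] += 1
    return {
        image: round((total[image] - crashed[image]) / total[image] * 100, 1)
        for image in total
    }

# 100 sessions per image; the second image crashes far more often.
sessions = (
    [("android-13-img-42", False)] * 98 + [("android-13-img-42", True)] * 2
    + [("android-14-img-07", False)] * 80 + [("android-14-img-07", True)] * 20
)
print(crash_free_by_image(sessions))
# {'android-13-img-42': 98.0, 'android-14-img-07': 80.0}
```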

4. Retry Rate And Execution Variance

Definition And Why It Matters

Retry rate is the percentage of test runs that require one or more reruns due to transient failures (infrastructure, flaky network, device hiccups).

A high retry rate often masks underlying instability: teams may rely on retries to keep pipelines green while the underlying cause remains unresolved.

Execution variance is the statistical spread (variance or standard deviation) of a test’s runtime across runs.

High variance indicates unstable environments—device slowdowns, network contention, or background services causing non-deterministic timing.

Formula

Retry rate (%) = (retried_runs / total_runs) * 100
Execution variance = VAR_POP(runtime_seconds) -- or use STDDEV(runtime_seconds)

Worked Example

Daily snapshot: 10,000 total test runs, 420 retried runs → (420 / 10000) * 100 = 4.2% retry rate. That level justifies an investigation.

Separately, a critical scenario’s median runtime jumped from 90s to 160s on a subset of devices — this large shift + high variance suggests device-level contention or environmental issues rather than a deterministic test slowdown.

Collection And Measurement
  • Record a retry_count per run and a retry_reason (infra, timeout, flake, environment) so retries are classifiable.
  • Persist runtime in seconds for every run. Compute mean and variance per test and per device/OS slice.
  • Segment retry rate by failure_reason to distinguish infra retries from test-logic retries.
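The per-slice computation above might look like this in Python; the record shape (`slice`, `retried`, `runtime_s`) is an assumed schema for illustration.

```python
import statistics
from collections import defaultdict

def retry_and_variance(run_records):
    """Per device/OS slice, compute retry rate (%) and runtime spread.

    `run_records` is a list of dicts with keys: slice (device/OS label),
    retried (bool), runtime_s (float) -- a hypothetical shape for the
    per-run metadata persisted by CI.
    """
    by_slice = defaultdict(list)
    for rec in run_records:
        by_slice[rec["slice"]].append(rec)
    out = {}
    for label, recs in by_slice.items():
        retried = sum(1 for r in recs if r["retried"])
        runtimes = [r["runtime_s"] for r in recs]
        out[label] = {
            "retry_rate_pct": round(retried / len(recs) * 100, 1),
            "runtime_mean_s": round(statistics.mean(runtimes), 1),
            # population stdev; swap in statistics.stdev for sample stdev
            "runtime_stdev_s": round(statistics.pstdev(runtimes), 1),
        }
    return out

# One slice: 9 clean 90s runs plus one retried 160s outlier.
records = (
    [{"slice": "pixel7/android14", "retried": False, "runtime_s": 90.0}] * 9
    + [{"slice": "pixel7/android14", "retried": True, "runtime_s": 160.0}]
)
print(retry_and_variance(records))
```

A single outlier like this moves both the retry rate and the standard deviation, which is exactly the signal the boxplot widget below is meant to surface.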
Remediation And Action
  • If retry rate exceeds your threshold (suggested: a sustained rate above 2–4% triggers review), drill down by device/OS and retry reason.
  • Fix infra retries first (device provisioning, flaky device images), and address test-logic retries by improving synchronization and removing brittle assertions.
  • Reduce retries in CI with intelligent gating: allow one guarded retry for infra failures, but automatically create a triage ticket when retries exceed the threshold.
Monitoring Queries And Dashboard Widgets
  • Line chart: retry_rate (daily) with device/OS filter.
  • Bar: top tests by retry_count.
  • Boxplot or histogram: runtime distribution per test (shows variance).

5. MTTD And MTTR

Definition And Why It Matters

MTTD (Mean Time To Detect) measures how quickly flaky behavior is recognized (from first occurrence to detection/flagging). Fast detection minimizes noise and stops teams from continuing to rely on a test that has already become unreliable.

MTTR (Mean Time To Resolve) measures how long it takes, on average, to fix a flaky test after detection (detection → verified fix merged and green). Short MTTR indicates effective ownership, triage, and remediation processes.

Formula

MTTD = AVG(time_detected - time_first_flaky)
MTTR = AVG(time_fixed - time_detected)

Worked Example

Example dataset: A flaky test first failed on Jan 10 08:00, was detected (automated flag) at Jan 10 08:12, and fixed (PR merged + green) on Jan 12 16:00.

Detection latency = 12 minutes; resolution = ~56 hours. Aggregate many such events to compute mean MTTD and MTTR.

Collection And Instrumentation
  • Record timestamps for the following events: time_first_flaky (first observed failing run), time_detected (automated or manual flagging time), and time_fixed (PR merged + green or ticket resolved).
  • Automate detection: run nightly profiling and an automated rule that flags tests with mixed pass/fail over N runs and inserts detection timestamps into your tracking system.
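Given those three timestamps per event, MTTD and MTTR fall out of a short Python sketch; the event dicts are hypothetical sample data reusing the field names above (and reproducing the worked example's numbers).

```python
from datetime import datetime

def mean_hours(deltas):
    """Average a list of timedeltas, expressed in hours (1 decimal)."""
    return round(sum(d.total_seconds() for d in deltas) / len(deltas) / 3600, 1)

def mttd_mttr(events):
    """Compute (MTTD, MTTR) in hours from flaky-test lifecycle events.

    Each event carries ISO timestamps for time_first_flaky,
    time_detected, and time_fixed -- hypothetical field names matching
    the instrumentation described above.
    """
    parse = datetime.fromisoformat
    mttd = mean_hours(
        [parse(e["time_detected"]) - parse(e["time_first_flaky"]) for e in events]
    )
    mttr = mean_hours(
        [parse(e["time_fixed"]) - parse(e["time_detected"]) for e in events]
    )
    return mttd, mttr

events = [{
    "time_first_flaky": "2025-01-10T08:00:00",  # first observed failing run
    "time_detected": "2025-01-10T08:12:00",     # automated flag
    "time_fixed": "2025-01-12T16:00:00",        # PR merged + green
}]
print(mttd_mttr(events))  # (0.2, 55.8)  -> 12 minutes to detect, ~56 h to fix
```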
Remediation Workflow And SLAs
  • Define SLAs: e.g., MTTD < 24 hours for release-blocking tests; MTTR < 48 hours for high-impact flakiness.
  • Automate triage assignment: detection triggers label, owner assignment, and a Slack/email alert for on-call or the last committer.
  • Measure and report MTTD/MTTR weekly; use them as team KPIs and include in sprint planning for remediation capacity.
Dashboard Suggestions
  • Single-number tiles: current MTTD and MTTR (7d/30d averages).
  • Trend line: MTTD and MTTR over time to show improvement (or regression).
  • Table: top slow-to-resolve flaky tests (MTTR descending).

6. Environmental Failure Rate

Definition And Why It Matters

Environmental failure rate is the percent of total failures attributable to environment or infrastructure: device provisioning errors, device reboots, farm outages, or misconfigured images.

Because environment problems are outside test logic, conflating them with flaky tests can misdirect remediation efforts.

Formula

Environmental failure rate (%) = (env_failures / total_failures) * 100

Worked Example

In a 1,000-failure window, 230 failures are labeled as env (device offline, provisioning timeout, farm error) → environmental failure rate = (230 / 1000) * 100 = 23%.

How To Classify And Collect
  • Add structured failure_reason codes at run time (e.g., infra_timeout, device_reboot, image_mismatch, network_partition).
  • Correlate CI logs with farm health events and provider status pages if available; correlate device IDs to firmware/image versions.
  • Use automated metrics first (exit codes, known error patterns) and escalate ambiguous cases for manual triage to improve classification over time.
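Pattern-based classification could be sketched as follows; the regexes, reason codes, and log lines are illustrative assumptions, not any provider's real error catalog.

```python
import re

# Hypothetical mapping from known log patterns to structured failure_reason codes.
PATTERNS = [
    (re.compile(r"device .* offline|provisioning timed out", re.I), "infra_timeout"),
    (re.compile(r"unexpected reboot", re.I), "device_reboot"),
    (re.compile(r"image mismatch", re.I), "image_mismatch"),
    (re.compile(r"connection reset|network unreachable", re.I), "network_partition"),
]

ENV_REASONS = {"infra_timeout", "device_reboot", "image_mismatch", "network_partition"}

def classify(log_line):
    """Map a raw failure log line to a failure_reason code, or flag for triage."""
    for pattern, reason in PATTERNS:
        if pattern.search(log_line):
            return reason
    return "needs_triage"  # ambiguous -> escalate to manual triage

def environmental_failure_rate(log_lines):
    """Percent of failures classified as environmental."""
    reasons = [classify(line) for line in log_lines]
    env = sum(1 for r in reasons if r in ENV_REASONS)
    return env / len(reasons) * 100

failures = [
    "ERROR: device emu-42 offline during setup",
    "AssertionError: expected 'Welcome' but found ''",   # test-logic failure
    "ERROR: provisioning timed out after 300s",
    "java.net.SocketException: Connection reset",
]
print(environmental_failure_rate(failures))  # 75.0
```

Unmatched lines fall through to `needs_triage`, matching the advice to escalate ambiguous cases and refine the pattern list over time.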
Remediation And Operational Actions
  • If environmental failure rate spikes, pause non-critical runs, investigate provider/infra, and raise an infrastructure incident.
  • Maintain an inventory of device images and versions; roll back or reprovision images that correlate with spikes.
  • Implement retries only for idempotent infra issues, but create tickets for persistent device-image problems rather than masking them with retries.
Monitoring And Alerts
  • Alert when environmental failure rate rises above a threshold (suggested: >10% sustained for 1 hour).
  • Device health dashboard: failure count by device_id and image_version to help identify bad batches.
  • Correlate environmental spikes with CI provider status or network incidents to reduce mean time to remediate.

7. Device / OS Coverage Considerations

Why Coverage Impacts Stability

Device and OS coverage determines the breadth of test surface. While increasing coverage is often desirable for user reach, each added device/OS combination increases the probability of environment-specific flakiness (manufacturer-specific behavior, OEM skins, OS-level differences, hardware quirks).

Strategic Coverage Model

Use a prioritized matrix rather than attempting to cover every device. Prioritize by:

  • User analytics (top devices/OS versions by active users).
  • Crash impact (devices contributing most to production crashes).
  • Region / market-specific devices if your user base is concentrated.
Practical Measurement And Partitioning
  • Report flaky test rate and crash-free sessions per device/OS slice. This makes it obvious which device lines drive instability.
  • Establish a “core” device matrix (e.g., 10–15 prioritized combinations) for daily/nightly runs and a “broader” matrix for weekly or pre-release runs.
  • Use rotation sampling for the broader matrix to avoid persistent resource costs while still surfacing device-specific issues over time.
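Rotation sampling can be as simple as a deterministic sliding window over the broader matrix, sketched here with hypothetical device labels:

```python
def rotation_slice(broader_matrix, run_index, slice_size=5):
    """Pick a deterministic rotating subset of the broader device matrix.

    Successive runs walk through the full matrix, so every device is
    exercised periodically without running all of them each time.
    `broader_matrix` is a list of device/OS labels (hypothetical).
    """
    n = len(broader_matrix)
    start = (run_index * slice_size) % n
    # Wrap around the end of the matrix so no device is skipped.
    return [broader_matrix[(start + i) % n] for i in range(slice_size)]

matrix = [f"device-{i}" for i in range(12)]
print(rotation_slice(matrix, run_index=0))
# ['device-0', 'device-1', 'device-2', 'device-3', 'device-4']
print(rotation_slice(matrix, run_index=2))
# ['device-10', 'device-11', 'device-0', 'device-1', 'device-2']
```

Keying `run_index` to the weekly build number makes the rotation reproducible, which helps when correlating a device-specific failure back to the run that exercised it.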
Remediation Guidance By Device Class
  • When a device or OS slice shows high instability, isolate it in a separate job to gather additional logs and session video without polluting the main health signal.
  • If a specific OEM or OS version consistently causes instability, consider adding conditional skips or targeted tests that validate the problematic functionality more robustly (e.g., lower-level API checks vs UI flows).
  • Communicate device-specific findings to the product and engineering teams (some device issues may warrant product-level mitigations or user messaging).
Dashboard Ideas
  • Device heatmap: flaky rate by device model (rows) × OS version (columns).
  • Coverage tracker: percent of top N user devices included in the current matrix.
  • Rotation schedule: last test date per device to ensure broader coverage is exercised periodically.

How To Measure, Prioritize, Remediate, And Report

Instrumentation And CI Patterns

  • Emit structured test results (JSON) per run with fields: test_name, status, run_id, build_id, device_id, os_version, start_time, end_time, logs, rerun_reason.
  • Record session video and system logs for failed runs when possible to accelerate triage.
  • Attach a failure_reason tag at run time (infra | test_logic | app_bug | network | device_failure).
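As a sketch, one such record could be emitted like this; the field names mirror the schema above, and `emit_result` is a hypothetical helper, not part of any CI vendor's API.

```python
import json

def emit_result(test_name, status, run_id, build_id, device_id, os_version,
                start_time, end_time, logs="", rerun_reason=None,
                failure_reason=None):
    """Serialize one test run as a structured JSON line for CI ingestion.

    failure_reason uses the tag set from above:
    infra | test_logic | app_bug | network | device_failure.
    """
    record = {
        "test_name": test_name,
        "status": status,
        "run_id": run_id,
        "build_id": build_id,
        "device_id": device_id,
        "os_version": os_version,
        "start_time": start_time,
        "end_time": end_time,
        "logs": logs,
        "rerun_reason": rerun_reason,
        "failure_reason": failure_reason,
    }
    # Stable key order keeps diffs and log greps predictable.
    return json.dumps(record, sort_keys=True)

line = emit_result(
    "login_flow", "FAIL", "run-123", "build-77", "pixel7-003", "14",
    "2025-01-10T08:00:00Z", "2025-01-10T08:01:30Z",
    rerun_reason="timeout", failure_reason="infra",
)
print(line)
```

Writing one JSON object per line (JSONL) keeps the output directly loadable into warehouses like BigQuery for the queries below.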

Compute Flaky Tests (BigQuery-Style)

-- Conceptual: compute pass/fail per test over the last 30 days
SELECT
  test_name,
  SUM(CASE WHEN status='PASS' THEN 1 ELSE 0 END) AS pass_count,
  SUM(CASE WHEN status='FAIL' THEN 1 ELSE 0 END) AS fail_count,
  COUNT(*) AS runs,
  SAFE_DIVIDE(SUM(CASE WHEN status='FAIL' THEN 1 ELSE 0 END), COUNT(*)) AS fail_rate
FROM test_results
WHERE build_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY test_name
HAVING pass_count > 0 AND fail_count > 0; -- marks flaky tests

Grafana Panel JSON

Below is an illustrative panel snippet showing a line chart for flaky-test trend. Import via Grafana JSON import and adapt to your datasource names.

{
  "panels": [{
    "type": "graph",
    "title": "Flaky Test Rate (30d)",
    "targets": [{"rawSql": "SELECT date, flaky_rate FROM test_metrics WHERE date > $__from ORDER BY date"}]
  }]
}

Worked Case Study — Acme Mobile Stability Audit

Context: A mid-size app team ran a two-week stability audit during a sprint freeze to quantify test health before a release.

Initial State

The suite had 1,500 distinct tests. Over two weeks the nightly profiling (N = 10 runs per test) revealed 180 distinct flaky tests (12% flaky rate).

Critical release-blocking tests included: login flow, payment flow, and deep-link handling. MTTR averaged 5.2 days for release-blocking flakiness.

Actions Taken

  1. Quarantined the top 40 worst offenders (based on a weighted score combining failure frequency and execution frequency).
  2. Reproduced top 10 locally; fixed brittle locators and unstable waits (framework and test logic changes).
  3. Added network stubbing for third-party calls that caused intermittent timeouts during tests.
  4. Filed infra tickets for device farm provisioning issues after correlating crashes to a specific device image.

Outcome

After 3 weeks: flaky test rate dropped from 12% → 3.8%; MTTR for release-blocking flakiness decreased from 5.2 → 1.4 days; run-retry volume decreased by ~60%, saving CI credits and engineer time.

Vendor Integrations And Practical How-Tos

Use vendor telemetry to accelerate detection and triage. Integrate with trusted device farms and vendor tools that provide rerun histories, session logs, and session replay.

Prioritization Scoring And Remediation Playbook

Use a weighted scoring model to rank flaky tests. Example weights: failure frequency 35%, execution frequency 25%, business impact 25%, MTTR 15%. Run a weekly “top N” remediation sprint focused on the highest-scoring tests.
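The scoring model can be sketched directly from those weights; the factor values below are hypothetical and assumed to be pre-normalized to a 0–100 scale.

```python
# Weights from the scoring model above (must sum to 1.0).
WEIGHTS = {
    "failure_frequency": 0.35,
    "execution_frequency": 0.25,
    "business_impact": 0.25,
    "mttr": 0.15,
}

def priority_score(test):
    """Weighted score in [0, 100]; each factor is pre-normalized to 0-100."""
    return round(sum(test[k] * w for k, w in WEIGHTS.items()), 1)

def top_n(tests, n=3):
    """Return the n highest-scoring flaky tests for the weekly sprint."""
    return sorted(tests, key=priority_score, reverse=True)[:n]

tests = [
    {"name": "login_flow", "failure_frequency": 80, "execution_frequency": 90,
     "business_impact": 100, "mttr": 40},
    {"name": "settings_toggle", "failure_frequency": 60, "execution_frequency": 20,
     "business_impact": 10, "mttr": 10},
]
print([(t["name"], priority_score(t)) for t in top_n(tests)])
# [('login_flow', 81.5), ('settings_toggle', 30.0)]
```

Normalizing each factor before weighting is the important design choice: it keeps a test with a huge raw failure count from drowning out a lower-volume but release-blocking flow.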

Remediation Checklist

  1. Quarantine test in CI if stability < threshold (e.g., < 95%).
  2. Attach session logs and video to the triage ticket.
  3. Reproduce locally with deterministic seed and mock external networks where possible.
  4. Fix fragile selectors, add explicit waits, or convert to lower-level tests if UI is inherently unstable.
  5. Re-run profiling; unquarantine when the stability target is met.

Conclusion

Stability testing metrics provide the foundation for trustworthy mobile automation. Without measurable indicators such as flaky test rate, retry rate, MTTD, and environmental failure rate, teams are forced to rely on intuition when diagnosing instability.

By instrumenting CI pipelines, collecting structured test data, and continuously monitoring stability trends, engineering teams can quickly identify unreliable tests, separate infrastructure issues from real regressions, and maintain confidence in automated feedback.

FAQs

Q: What is a flaky test?

A: A flaky test is an automated test that passes intermittently and fails intermittently under the same code baseline. These failures are usually caused by timing issues, unstable environments, race conditions, or dependencies on external services.

Q: How many runs define flakiness?

A: For routine detection, running a test 3–10 times is typically sufficient to identify instability. For deeper statistical profiling and higher confidence, many teams run tests 20 or more times to measure true failure probabilities.

Q: Should I auto-retry failing tests?

A: Retries should be used cautiously. Most teams allow only one guarded retry for infrastructure or transient errors. Excessive retries can hide flaky tests and reduce the reliability of CI signals.

Q: How often should I compute stability metrics?

A: Stability signals should be computed in real time during CI runs to detect flaky behavior immediately. In addition, weekly or 30-day trend reports help engineering teams prioritize long-term test stability improvements.

Q: What is an acceptable flaky test rate?

A: Many engineering teams aim for an overall flaky test rate below 3–5% across the entire test suite. Critical tests used for release gating should typically achieve at least 95% stability.

Q: How do I separate infrastructure issues from test problems?

A: A common approach is to tag test runs with a failure_reason field during execution. This allows teams to compute a separate environmental failure rate and distinguish infrastructure instability from genuine test logic failures.

Q: Which metrics should product managers watch?

A: Product leaders typically focus on crash-free sessions and the number of release-blocking flaky tests. These metrics directly affect user experience and release velocity.

Q: What dashboards should I build?

A: Effective QA dashboards usually include a 30-day flaky-test trend, the most impactful flaky tests, crash-free session metrics, device failure heatmaps, and MTTR/MTTD indicators for debugging efficiency.

Q: How do I measure MTTR for flaky tests?

A: Track the timestamps for time_detected and time_fixed (for example when the pull request is merged and the test passes consistently). The average difference between these values over time provides the mean time to resolution.

Q: Can device farms help detect flakiness?

A: Yes. Device farms typically provide rerun histories, detailed logs, and session replay capabilities. These features make it easier to reproduce intermittent failures and diagnose the root causes of flaky tests.