{"id":4235,"date":"2026-03-10T15:50:20","date_gmt":"2026-03-10T10:20:20","guid":{"rendered":"https:\/\/www.getpanto.ai\/blog\/?p=4235"},"modified":"2026-05-16T11:30:06","modified_gmt":"2026-05-16T06:00:06","slug":"stability-testing-metrics-in-mobile-app-automation","status":"publish","type":"post","link":"https:\/\/www.getpanto.ai\/blog\/stability-testing-metrics-in-mobile-app-automation","title":{"rendered":"Stability Testing Metrics in Mobile App Automation \u2014 The Complete Guide"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Mobile app automation promises faster releases and <a href=\"https:\/\/www.getpanto.ai\/blog\/code-quality#code-quality-as-a-continuous-workflow\">better software quality<\/a>\u2014but those benefits disappear when test suites become unstable. A single flaky test can trigger false failures, force developers to rerun pipelines, and slow down delivery.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Over time, teams begin to distrust their CI results, treating failures as noise instead of signals of real issues. This is why <strong>stability testing metrics<\/strong> are essential in mobile automation. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Rather than relying on intuition or sporadic debugging, engineering teams need measurable indicators that reveal how reliable their automated tests actually are. 
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Metrics such as flaky test rate, crash-free sessions, retry rate, and mean time to resolve (MTTR) help quantify test reliability, detect instability early, and prioritize fixes before they impact release cycles.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this guide, we break down the <strong>most important stability testing metrics for mobile app automation<\/strong>, including their definitions, formulas, and practical ways to measure them in <a href=\"https:\/\/www.getpanto.ai\/blog\/how-to-reduce-ci-test-runtime#evolution-of-ci-runtime\">CI pipelines<\/a> and device farms.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You\u2019ll also learn how to interpret these metrics, build stability dashboards, and implement remediation practices that keep automated test suites trustworthy as your mobile application scales.<\/p>\n\n\n<h2 class=\"wp-block-heading\" id=\"core-stability-testing-metrics-formulas-examples-and-deep-dives\"><span class=\"ez-toc-section\" id=\"core-stability-testing-metrics-formulas-examples-and-deep-dives\"><\/span><strong>Core Stability Testing Metrics, Formulas, Examples, And Deep Dives<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n<h3 class=\"wp-block-heading\" id=\"quick-reference-key-stability-testing-metrics\"><span class=\"ez-toc-section\" id=\"quick-reference-key-stability-testing-metrics\"><\/span><strong>Quick Reference: Key Stability Testing Metrics<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Metric<\/th><th>Formula (copy-ready)<\/th><th>Typical data source<\/th><th>Starter target<\/th><\/tr><\/thead><tbody><tr><td>Flaky test rate (%)<\/td><td><code>(distinct_flaky_tests \/ total_distinct_tests) * 100<\/code><\/td><td>CI history \/ rerun logs<\/td><td>&lt; 3\u20135%<\/td><\/tr><tr><td>Test stability (per test %)<\/td><td><code>(pass_count \/ N) * 100<\/code><\/td><td>CI 
rerun aggregation<\/td><td>&gt; 95% (critical tests)<\/td><\/tr><tr><td>Crash-free sessions (%)<\/td><td><code>(sessions_without_crash \/ total_sessions) * 100<\/code><\/td><td>Crashlytics \/ device farm logs<\/td><td>&gt; 99% (prod-like)<\/td><\/tr><tr><td>Retry rate (%)<\/td><td><code>(retried_runs \/ total_runs) * 100<\/code><\/td><td>CI rerun metadata<\/td><td>\u2264 2\u20134%<\/td><\/tr><tr><td>MTTD (time)<\/td><td><code>avg(time_detected - time_first_flaky)<\/code><\/td><td>CI timestamps \/ issue tracker<\/td><td>&lt; 24\u201372 hours<\/td><\/tr><tr><td>MTTR (time)<\/td><td><code>avg(time_fixed - time_detected)<\/code><\/td><td>Issue tracker \/ PR timestamps<\/td><td>&lt; 48 hours (blocking)<\/td><\/tr><tr><td>Execution time mean &amp; variance<\/td><td><code>mean(runtime), variance(runtime)<\/code><\/td><td>CI timing logs<\/td><td>Low variance<\/td><\/tr><tr><td>Environmental failure rate (%)<\/td><td><code>(env_failures \/ total_failures) * 100<\/code><\/td><td>Device farm + infra logs<\/td><td>Monitor for spikes<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n<h3 class=\"wp-block-heading\" id=\"1-flaky-test-rate\"><span class=\"ez-toc-section\" id=\"1-flaky-test-rate\"><\/span><strong>1. Flaky Test Rate<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n<h4 class=\"wp-block-heading\" id=\"definition-and-why-it-matters\"><strong>Definition And Why It Matters<\/strong><\/h4>\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/www.getpanto.ai\/blog\/detect-flaky-tests#flaky-tests-a-deep-dive\">Flaky test rate<\/a> is the percent of distinct tests that show both passes and fails across a chosen window of runs. 
A high flaky rate undermines CI trust and leads to reruns and ignored failures.<\/p>\n\n\n<h5 class=\"wp-block-heading\" id=\"formula\"><strong>Formula:<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\"><code>Flaky test rate (%) = (distinct_flaky_tests \/ total_distinct_tests) * 100<\/code><\/p>\n\n\n<h5 class=\"wp-block-heading\" id=\"worked-example\"><strong>Worked Example<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\">Over the last 7 nightly runs you executed 1,200 distinct tests. You reran unstable tests 3 times and discovered 48 distinct tests that passed at least once and failed at least once. Flaky test rate = (48 \/ 1200) * 100 = 4.0%.<\/p>\n\n\n<h5 class=\"wp-block-heading\" id=\"collection-source\"><strong>Collection Source<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\">Aggregate CI run results by test name and build id, mark tests with both&nbsp;<code>pass_count &gt; 0<\/code>&nbsp;and&nbsp;<code>fail_count &gt; 0<\/code>&nbsp;inside the window.<\/p>\n\n\n<h5 class=\"wp-block-heading\" id=\"remediation-note\"><strong>Remediation Note<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\">Quarantine the top flaky tests for triage and reduce false noise by adding guarded retries for infra failures only; fix brittle locators and synchronization issues for <a href=\"https:\/\/www.getpanto.ai\/blog\/detect-flaky-tests#conclusion-flakiness-is-a-signal-integrity-problem\">test logic flakiness<\/a>.<\/p>\n\n\n<h3 class=\"wp-block-heading\" id=\"2-test-stability-per-test\"><span class=\"ez-toc-section\" id=\"2-test-stability-per-test\"><\/span><strong>2. 
Test Stability (Per Test)<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n<h5 class=\"wp-block-heading\" id=\"definition-and-formula\"><strong>Definition And Formula<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\"><code>Test stability (%) = (pass_count \/ N) * 100<\/code>&nbsp;where&nbsp;<code>pass_count<\/code>&nbsp;is the number of passing runs and&nbsp;<code>N<\/code>&nbsp;is the total number of runs of that test in the measurement window.<\/p>\n\n\n<h5 class=\"wp-block-heading\" id=\"example\"><strong>Example<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\">A critical login test ran 10 times overnight and passed 9 times: stability = (9 \/ 10) * 100 = 90%. That is below the &gt; 95% target for critical tests, so the test needs attention.<\/p>\n\n\n<h5 class=\"wp-block-heading\" id=\"collection-and-action\"><strong>Collection And Action<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\">Use nightly profiling to compute <a href=\"https:\/\/www.getpanto.ai\/blog\/why-do-tests-pass-locally-but-fail-in-ci#monitor-test-stability-with-metrics\">per-test stability<\/a> and produce a ranked list for engineers. Low-stability tests should be quarantined or prioritized for fix tickets.<\/p>\n\n\n<h3 class=\"wp-block-heading\" id=\"3-crashfree-sessions\"><span class=\"ez-toc-section\" id=\"3-crash-free-sessions\"><\/span><strong>3. 
Crash-Free Sessions<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n<h5 class=\"wp-block-heading\" id=\"definition-and-formula\"><strong>Definition And Formula<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\"><code>Crash-free sessions (%) = (sessions_without_crash \/ total_sessions) * 100<\/code><\/p>\n\n\n<h5 class=\"wp-block-heading\" id=\"example\"><strong>Example<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\">If 1,000 automated sessions executed during a release validation and 10 sessions experienced crashes or ANRs: crash-free sessions = (990 \/ 1000) * 100 = 99.0%.<\/p>\n\n\n<h5 class=\"wp-block-heading\" id=\"collection-and-remediation\"><strong>Collection And Remediation<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\">Collect crash stack traces and correlate them to test steps. If crashes are reproducible outside the test harness, <a href=\"https:\/\/www.getpanto.ai\/blog\/mobile-app-testing-ai-top-bugs#the-top-5-mobile-app-bugs-plaguing-development-tea\">escalate as product bugs<\/a>; if they occur only in specific device images, flag infra\/device provisioning.<\/p>\n\n\n<h3 class=\"wp-block-heading\" id=\"4-retry-rate-and-execution-variance\"><span class=\"ez-toc-section\" id=\"4-retry-rate-and-execution-variance\"><\/span><strong>4. 
Retry Rate And Execution Variance<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n<h5 class=\"wp-block-heading\" style=\"text-transform:capitalize\" id=\"definition-and-why-it-matters\"><strong>Definition and why it matters<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\"><strong>Retry rate<\/strong> is the percentage of test runs that require one or more reruns due to transient failures (<a href=\"https:\/\/www.getpanto.ai\/products\/code-security\/iac\">infrastructure<\/a>, flaky network, device hiccups).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A high retry rate often masks underlying instability: teams may rely on retries to keep pipelines green while the underlying cause remains unresolved.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Execution variance<\/strong> is the statistical spread (variance or standard deviation) of a test&#8217;s runtime across runs. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">High variance indicates unstable environments\u2014device slowdowns, network contention, or background services causing non-deterministic timing.<\/p>\n\n\n<h5 class=\"wp-block-heading\" id=\"formula\"><strong>Formula<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\"><code>Retry rate (%) = (retried_runs \/ total_runs) * 100<\/code><br><code>Execution variance = VAR_POP(runtime_seconds)   -- or use STDDEV(runtime_seconds)<\/code><\/p>\n\n\n<h5 class=\"wp-block-heading\" id=\"worked-example\"><strong>Worked Example<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\">Daily snapshot: 10,000 total test runs, 420 retried runs \u2192 <code>(420 \/ 10000) * 100 = 4.2%<\/code> retry rate. That level justifies an investigation. 
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Separately, a critical scenario&#8217;s median runtime jumped from 90s to 160s on a subset of devices. A shift that large, combined with high variance, suggests device-level contention or environmental issues rather than a deterministic test slowdown.<\/p>\n\n\n<h5 class=\"wp-block-heading\" style=\"text-transform:capitalize\" id=\"collection-and-measurement\"><strong>Collection and measurement<\/strong><\/h5>\n\n\n<ul class=\"wp-block-list\">\n<li>Record a <code>retry_count<\/code> per run and a <code>retry_reason<\/code> (infra, timeout, flake, environment) so retries are classifiable.<\/li>\n\n\n\n<li>Persist runtime in seconds for every run. Compute mean and variance per test and per device\/OS slice.<\/li>\n\n\n\n<li>Segment retry rate by <code>failure_reason<\/code> to distinguish infra retries from test-logic retries.<\/li>\n<\/ul>\n\n\n<h5 class=\"wp-block-heading\" id=\"remediation-and-action\"><strong>Remediation and action<\/strong><\/h5>\n\n\n<ul class=\"wp-block-list\">\n<li>If <code>retry_rate<\/code> exceeds your threshold (a rate above 2\u20134% is a reasonable trigger for review), drill down by device\/OS and retry reason.<\/li>\n\n\n\n<li>Fix infra retries first (device provisioning, flaky device images), and <a href=\"https:\/\/www.getpanto.ai\/blog\/why-do-tests-pass-locally-but-fail-in-ci#use-explicit-waits-and-avoid-fragile-patterns\">address test-logic retries<\/a> by improving synchronization and removing brittle assertions.<\/li>\n\n\n\n<li>Reduce retries in CI by adding intelligent gating: allow one guarded retry for infra, but automatically create a triage ticket when retries exceed the threshold.<\/li>\n<\/ul>\n\n\n<h5 class=\"wp-block-heading\" style=\"text-transform:capitalize\" id=\"monitoring-queries-and-dashboard-widgets\"><strong>Monitoring queries and dashboard widgets<\/strong><\/h5>\n\n\n<ul class=\"wp-block-list\">\n<li>Line chart: retry_rate (daily) with device\/OS filter.<\/li>\n\n\n\n<li>Bar: top tests by 
retry_count.<\/li>\n\n\n\n<li>Boxplot or histogram: runtime distribution per test (shows variance).<\/li>\n<\/ul>\n\n\n<h3 class=\"wp-block-heading\" id=\"5-mttd-and-mttr\"><span class=\"ez-toc-section\" id=\"5-mttd-and-mttr\"><\/span><strong>5. MTTD And MTTR<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n<h5 class=\"wp-block-heading\" style=\"text-transform:capitalize\" id=\"definition-and-why-it-matters\"><strong>Definition and why it matters<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\"><strong>MTTD (Mean Time To Detect)<\/strong> measures how quickly flaky behavior is recognized (from first occurrence to detection\/flagging). <a href=\"https:\/\/www.getpanto.ai\/blog\/code-duplication-detection-tools#automated-duplicate-code-detection-tools-to-the-re\">Fast detection minimizes noise<\/a> and prevents flaky tests from being relied upon.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>MTTR (Mean Time To Resolve)<\/strong> measures how long it takes, on average, to fix a flaky test after detection (detection \u2192 verified fix merged and green). Short MTTR indicates effective ownership, triage, and remediation processes.<\/p>\n\n\n<h5 class=\"wp-block-heading\" id=\"formula\"><strong>Formula<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\"><code>MTTD = AVG(time_detected - time_first_flaky)<\/code><br><code>MTTR = AVG(time_fixed - time_detected)<\/code><\/p>\n\n\n<h5 class=\"wp-block-heading\" id=\"worked-example\"><strong>Worked example<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\">Example dataset: A flaky test first failed on Jan 10 08:00, was detected (automated flag) at Jan 10 08:12, and fixed (PR merged + green) on Jan 12 16:00. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Detection latency = 12 minutes; resolution = ~56 hours. 
Aggregate many such events to compute mean MTTD and MTTR.<\/p>\n\n\n<h5 class=\"wp-block-heading\" style=\"text-transform:capitalize\" id=\"collection-and-instrumentation\"><strong>Collection and instrumentation<\/strong><\/h5>\n\n\n<ul class=\"wp-block-list\">\n<li>Record timestamps for the following events: <code>time_first_flaky<\/code> (first observed failing run), <code>time_detected<\/code> (automated or manual flagging time), and <code>time_fixed<\/code> (PR merged + green or ticket resolved).<\/li>\n\n\n\n<li>Automate detection: run nightly profiling and an automated rule that flags tests with mixed pass\/fail over N runs and inserts detection timestamps into your tracking system.<\/li>\n<\/ul>\n\n\n<h5 class=\"wp-block-heading\" style=\"text-transform:capitalize\" id=\"remediation-workflow-and-slas\"><strong>Remediation workflow and SLAs<\/strong><\/h5>\n\n\n<ul class=\"wp-block-list\">\n<li>Define SLAs: e.g., MTTD &lt; 24 hours for release-blocking tests; MTTR &lt; 48 hours for high-impact flakiness.<\/li>\n\n\n\n<li>Automate triage assignment: detection triggers label, owner assignment, and a Slack\/email alert for on-call or the last committer.<\/li>\n\n\n\n<li>Measure and <a href=\"https:\/\/www.getpanto.ai\/blog\/vibe-debugging-effortless-engineering#mean-time-to-diagnose-mttd-and-mean-time-to-resolve-mttr\">report MTTD\/MTTR weekly<\/a>; use them as team KPIs and include in sprint planning for remediation capacity.<\/li>\n<\/ul>\n\n\n<h5 class=\"wp-block-heading\" style=\"text-transform:capitalize\" id=\"dashboard-suggestions\"><strong>Dashboard suggestions<\/strong><\/h5>\n\n\n<ul class=\"wp-block-list\">\n<li>Single-number tiles: current MTTD and MTTR (7d\/30d averages).<\/li>\n\n\n\n<li>Trend line: MTTD and MTTR over time to show improvement (or regression).<\/li>\n\n\n\n<li>Table: top slow-to-resolve flaky tests (MTTR descending).<\/li>\n<\/ul>\n\n\n<h3 class=\"wp-block-heading\" id=\"6-environmental-failure-rate\"><span class=\"ez-toc-section\" 
id=\"6-environmental-failure-rate\"><\/span><strong>6. Environmental Failure Rate<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n<h5 class=\"wp-block-heading\" style=\"text-transform:capitalize\" id=\"definition-and-why-it-matters\"><strong>Definition and why it matters<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\"><strong>Environmental failure rate<\/strong> is the percent of total failures attributable to environment or infrastructure: device provisioning errors, device reboots, <a href=\"https:\/\/www.getpanto.ai\/blog\/cloudflare-outage#deep-dive-into-the-cloudflare-outage\">farm outages<\/a>, or misconfigured images.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Because environment problems are outside test logic, conflating them with flaky tests can misdirect remediation efforts.<\/p>\n\n\n<h5 class=\"wp-block-heading\" id=\"formula\"><strong>Formula<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\"><code>Environmental failure rate (%) = (env_failures \/ total_failures) * 100<\/code><\/p>\n\n\n<h5 class=\"wp-block-heading\" style=\"text-transform:capitalize\" id=\"worked-example\"><strong>Worked example<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\">In a 1,000-failure window, 230 failures are labeled as <code>env<\/code> (device offline, provisioning timeout, farm error) \u2192 environmental failure rate = (230 \/ 1000) * 100 = 23%.<\/p>\n\n\n<h5 class=\"wp-block-heading\" style=\"text-transform:capitalize\" id=\"how-to-classify-and-collect\"><strong>How to classify and collect<\/strong><\/h5>\n\n\n<ul class=\"wp-block-list\">\n<li>Add structured <code>failure_reason<\/code> codes at run time (e.g., <code>infra_timeout<\/code>, <code>device_reboot<\/code>, <code>image_mismatch<\/code>, <code>network_partition<\/code>).<\/li>\n\n\n\n<li>Correlate CI logs with farm health events and provider status pages if available; correlate device IDs to firmware\/image versions.<\/li>\n\n\n\n<li><a 
href=\"https:\/\/www.getpanto.ai\/blog\/ai-driven-mobile-qa-testing-metrics#how-ai-driven-qa-changes-the-game\">Use automated metrics<\/a> first (exit codes, known error patterns) and escalate ambiguous cases for manual triage to improve classification over time.<\/li>\n<\/ul>\n\n\n<h5 class=\"wp-block-heading\" style=\"text-transform:capitalize\" id=\"remediation-and-operational-actions\"><strong>Remediation and operational actions<\/strong><\/h5>\n\n\n<ul class=\"wp-block-list\">\n<li>If environmental failure rate spikes, pause non-critical runs, investigate provider\/infra, and raise an infrastructure incident.<\/li>\n\n\n\n<li>Maintain an inventory of device images and versions; roll back or reprovision images that correlate with spikes.<\/li>\n\n\n\n<li>Implement retries only for idempotent infra issues, but create tickets for persistent device-image problems rather than masking them with retries.<\/li>\n<\/ul>\n\n\n<h5 class=\"wp-block-heading\" style=\"text-transform:capitalize\" id=\"monitoring-and-alerts\"><strong>Monitoring and alerts<\/strong><\/h5>\n\n\n<ul class=\"wp-block-list\">\n<li>Alert when environmental failure rate rises above a threshold (suggested: &gt;10% sustained for 1 hour).<\/li>\n\n\n\n<li>Device health dashboard: failure count by device_id and image_version to help identify bad batches.<\/li>\n\n\n\n<li>Correlate environmental spikes with CI provider status or network incidents to reduce mean time to remediate.<\/li>\n<\/ul>\n\n\n<h3 class=\"wp-block-heading\" id=\"7-device-os-coverage-considerations\"><span class=\"ez-toc-section\" id=\"7-device-os-coverage-considerations\"><\/span><strong>7. 
Device \/ OS Coverage Considerations<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n<h5 class=\"wp-block-heading\" style=\"text-transform:capitalize\" id=\"why-coverage-impacts-stability\"><strong>Why coverage impacts stability<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\">Device and OS coverage determines the breadth of test surface. While increasing coverage is often desirable for user reach, each added device\/OS combination increases the probability of <a href=\"https:\/\/www.getpanto.ai\/blog\/detect-flaky-tests#organisational-practices-that-reduce-flakiness\">environment-specific flakiness<\/a> (manufacturer-specific behavior, OEM skins, OS-level differences, hardware quirks).<\/p>\n\n\n<h5 class=\"wp-block-heading\" style=\"text-transform:capitalize\" id=\"strategic-coverage-model\"><strong>Strategic coverage model<\/strong><\/h5>\n\n\n<p class=\"wp-block-paragraph\">Use a prioritized matrix rather than attempting to cover every device. Prioritize by:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>User analytics (top devices\/OS versions by active users).<\/li>\n\n\n\n<li>Crash impact (devices contributing most to production crashes).<\/li>\n\n\n\n<li>Region \/ market-specific devices if your user base is concentrated.<\/li>\n<\/ul>\n\n\n<h5 class=\"wp-block-heading\" style=\"text-transform:capitalize\" id=\"practical-measurement-and-partitioning\"><strong>Practical measurement and partitioning<\/strong><\/h5>\n\n\n<ul class=\"wp-block-list\">\n<li>Report flaky test rate and crash-free sessions per device\/OS slice. 
This makes it obvious which device lines drive instability.<\/li>\n\n\n\n<li>Establish a \u201ccore\u201d device matrix (e.g., 10\u201315 prioritized combinations) for daily\/nightly runs and a \u201cbroader\u201d matrix for weekly or pre-release runs.<\/li>\n\n\n\n<li>Use rotation sampling for the broader matrix to avoid persistent resource costs while still surfacing device-specific issues over time.<\/li>\n<\/ul>\n\n\n<h5 class=\"wp-block-heading\" style=\"text-transform:capitalize\" id=\"remediation-guidance-by-device-class\"><strong>Remediation guidance by device class<\/strong><\/h5>\n\n\n<ul class=\"wp-block-list\">\n<li>When a device or OS slice shows high instability, isolate it in a separate job to gather additional logs and session video without polluting the main health signal.<\/li>\n\n\n\n<li>If a specific OEM or OS version consistently causes instability, consider adding conditional skips or targeted tests that validate the problematic functionality more robustly (e.g., lower-level API checks vs UI flows).<\/li>\n\n\n\n<li>Communicate <a href=\"https:\/\/www.getpanto.ai\/blog\/device-farms-for-mobile-testing#device-farm-benefits-for-development-teams\">device-specific findings<\/a> to the product and engineering teams (some device issues may warrant product-level mitigations or user messaging).<\/li>\n<\/ul>\n\n\n<h5 class=\"wp-block-heading\" style=\"text-transform:capitalize\" id=\"dashboard-ideas\"><strong>Dashboard ideas<\/strong><\/h5>\n\n\n<ul class=\"wp-block-list\">\n<li>Device heatmap: flaky rate by device model (rows) \u00d7 OS version (columns).<\/li>\n\n\n\n<li>Coverage tracker: percent of top N user devices included in the current matrix.<\/li>\n\n\n\n<li>Rotation schedule: last test date per device to ensure broader coverage is exercised periodically.<\/li>\n<\/ul>\n\n\n<h2 class=\"wp-block-heading\" id=\"how-to-measure-prioritize-remediate-and-report\"><span class=\"ez-toc-section\" id=\"how-to-measure-prioritize-remediate-and-report\"><\/span><strong>How To Measure, Prioritize, Remediate, And Report<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n<h3 class=\"wp-block-heading\" id=\"instrumentation-and-ci-patterns\"><span class=\"ez-toc-section\" id=\"instrumentation-and-ci-patterns\"><\/span><strong>Instrumentation And CI Patterns<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n<ul class=\"wp-block-list\">\n<li>Emit structured test results (JSON) per run with fields:&nbsp;<code>test_name<\/code>,&nbsp;<code>status<\/code>,&nbsp;<code>run_id<\/code>,&nbsp;<code>build_id<\/code>,&nbsp;<code>device_id<\/code>,&nbsp;<code>os_version<\/code>,&nbsp;<code>start_time<\/code>,&nbsp;<code>end_time<\/code>,&nbsp;<code>logs<\/code>,&nbsp;<code>rerun_reason<\/code>.<\/li>\n\n\n\n<li>Record session video and system logs for failed runs when possible to 
accelerate triage.<\/li>\n\n\n\n<li>Attach a&nbsp;<code>failure_reason<\/code>&nbsp;tag at run time (infra | test_logic | app_bug | network | device_failure).<\/li>\n<\/ul>\n\n\n<h3 class=\"wp-block-heading\" id=\"compute-flaky-tests-bigquerystyle\"><span class=\"ez-toc-section\" id=\"compute-flaky-tests-bigquery-style\"><\/span><strong>Compute Flaky Tests (BigQuery-Style)<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n<pre class=\"wp-block-code\"><code>-- Conceptual: compute pass\/fail per test over the last 30 days\nSELECT\n  test_name,\n  SUM(CASE WHEN status='PASS' THEN 1 ELSE 0 END) AS pass_count,\n  SUM(CASE WHEN status='FAIL' THEN 1 ELSE 0 END) AS fail_count,\n  COUNT(*) AS runs,\n  SAFE_DIVIDE(SUM(CASE WHEN status='FAIL' THEN 1 ELSE 0 END), COUNT(*)) AS fail_rate\nFROM test_results\nWHERE build_time &gt; TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)\nGROUP BY test_name\nHAVING pass_count &gt; 0 AND fail_count &gt; 0; -- marks flaky tests\n<\/code><\/pre>\n\n\n<h3 class=\"wp-block-heading\" id=\"grafana-panel-json\"><span class=\"ez-toc-section\" id=\"grafana-panel-json\"><\/span><strong>Grafana Panel JSON<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n<p class=\"wp-block-paragraph\">Below is an illustrative panel snippet showing a line chart for flaky-test trend. 
Import via Grafana JSON import and adapt to your datasource names.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>{\n  \"panels\": &#91;{\n    \"type\": \"graph\",\n    \"title\": \"Flaky Test Rate (30d)\",\n    \"targets\": &#91;{\"rawSql\": \"SELECT date, flaky_rate FROM test_metrics WHERE date &gt; $__from ORDER BY date\"}]\n  }]\n}<\/code><\/pre>\n\n\n<h3 class=\"wp-block-heading\" id=\"worked-case-study-acme-mobile-stability-audit\"><span class=\"ez-toc-section\" id=\"worked-case-study-%e2%80%94-acme-mobile-stability-audit\"><\/span><strong>Worked Case Study \u2014 Acme Mobile Stability Audit<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n<p class=\"wp-block-paragraph\"><strong>Context:<\/strong>&nbsp;A mid-size app team ran a two-week stability audit during a sprint freeze to quantify test health before a release.<\/p>\n\n\n<h4 class=\"wp-block-heading\" id=\"initial-state\"><strong>Initial State<\/strong><\/h4>\n\n\n<p class=\"wp-block-paragraph\">The suite had 1,500 distinct tests. Over two weeks the nightly profiling (N = 10 runs per test) revealed 180 distinct flaky tests (12% flaky rate). <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Critical release-blocking tests included: login flow, payment flow, and deep-link handling. 
MTTR averaged 5.2 days for release-blocking flakiness.<\/p>\n\n\n<h4 class=\"wp-block-heading\" id=\"actions-taken\"><strong>Actions Taken<\/strong><\/h4>\n\n\n<ol class=\"wp-block-list\">\n<li>Quarantined the top 40 worst offenders (based on a weighted score combining failure frequency and execution frequency).<\/li>\n\n\n\n<li>Reproduced top 10 locally; <a href=\"https:\/\/www.getpanto.ai\/products\/automted-test-script-generation\">fixed brittle locators<\/a> and unstable waits (framework and test logic changes).<\/li>\n\n\n\n<li>Added network stubbing for third-party calls that caused intermittent timeouts during tests.<\/li>\n\n\n\n<li>Filed infra tickets for device farm provisioning issues after correlating crashes to a specific device image.<\/li>\n<\/ol>\n\n\n<h4 class=\"wp-block-heading\" id=\"outcome\"><strong>Outcome<\/strong><\/h4>\n\n\n<p class=\"wp-block-paragraph\">After 3 weeks: flaky test rate dropped from 12% \u2192 3.8%; MTTR for release-blocking flakiness decreased from 5.2 \u2192 1.4 days; run-retry volume decreased by ~60%, saving CI credits and engineer time.<\/p>\n\n\n<h3 class=\"wp-block-heading\" id=\"vendor-integrations-and-practical-howtos\"><span class=\"ez-toc-section\" id=\"vendor-integrations-and-practical-how-tos\"><\/span><strong>Vendor Integrations And Practical How-Tos<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n<p class=\"wp-block-paragraph\">Use vendor telemetry to accelerate detection and triage. 
Integrate with <a href=\"https:\/\/www.getpanto.ai\/blog\/device-farms-for-mobile-testing#understanding-device-farms-and-their-impact\">trusted device farms<\/a> and vendor tools that provide rerun histories, session logs, and session replay.<\/p>\n\n\n<h3 class=\"wp-block-heading\" id=\"prioritization-scoring-and-remediation-playbook\"><span class=\"ez-toc-section\" id=\"prioritization-scoring-and-remediation-playbook\"><\/span><strong>Prioritization Scoring And Remediation Playbook<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n<p class=\"wp-block-paragraph\">Use a weighted scoring model to <a href=\"https:\/\/www.getpanto.ai\/blog\/why-do-tests-pass-locally-but-fail-in-ci#timing-issues-and-flaky-tests\">rank flaky tests<\/a>. Example weights: failure frequency 35%, execution frequency 25%, business impact 25%, MTTR 15%. Normalize each factor to a common scale (for example 0\u20131) before applying the weights so that scores are comparable across tests. Run a weekly \u201ctop N\u201d remediation sprint focused on the highest-scoring tests.<\/p>\n\n\n<h4 class=\"wp-block-heading\" id=\"remediation-checklist\"><strong>Remediation Checklist<\/strong><\/h4>\n\n\n<ol class=\"wp-block-list\">\n<li>Quarantine the test in CI if its stability falls below the threshold (e.g., &lt; 95%).<\/li>\n\n\n\n<li>Attach session logs and video to the triage ticket.<\/li>\n\n\n\n<li>Reproduce locally with a deterministic seed and mock external networks where possible.<\/li>\n\n\n\n<li>Fix fragile selectors, add explicit waits, or convert to lower-level tests if the UI is inherently unstable.<\/li>\n\n\n\n<li>Re-run profiling; unquarantine when the stability target is met.<\/li>\n<\/ol>\n\n\n<h3 class=\"wp-block-heading\" id=\"conclusion\"><span class=\"ez-toc-section\" id=\"conclusion\"><\/span><strong>Conclusion<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n<p class=\"wp-block-paragraph\">Stability testing metrics provide the foundation for <a href=\"https:\/\/www.getpanto.ai\/\">trustworthy mobile automation<\/a>. 
Without measurable indicators such as flaky test rate, retry rate, MTTD, and environmental failure rate, teams are forced to rely on intuition when diagnosing instability.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">By instrumenting CI pipelines, collecting structured test data, and continuously monitoring stability trends, engineering teams can <a href=\"https:\/\/www.getpanto.ai\/products\/self-healing-test-automation\">quickly identify unreliable tests<\/a>, separate infrastructure issues from real regressions, and maintain confidence in automated feedback.<\/p>\n\n\n<h3 class=\"wp-block-heading\" id=\"faqs\"><span class=\"ez-toc-section\" id=\"faqs\"><\/span><strong>FAQs<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n<h4 class=\"wp-block-heading\" id=\"q-what-is-a-flaky-test\"><strong>Q: What is a flaky test?<\/strong><\/h4>\n\n\n<p class=\"wp-block-paragraph\"><strong>A:<\/strong> A flaky test is an automated test that sometimes passes and sometimes fails against the same code baseline. These failures are usually caused by timing issues, unstable environments, race conditions, or dependencies on external services.<\/p>\n\n\n<h4 class=\"wp-block-heading\" id=\"q-how-many-runs-define-flakiness\"><strong>Q: How many runs define flakiness?<\/strong><\/h4>\n\n\n<p class=\"wp-block-paragraph\"><strong>A:<\/strong> For routine detection, running a test 3\u201310 times is typically sufficient to identify instability. For deeper statistical profiling and higher confidence, many teams run tests 20 or more times to estimate failure probabilities.<\/p>\n\n\n<h4 class=\"wp-block-heading\" id=\"q-should-i-autoretry-failing-tests\"><strong>Q: Should I auto-retry failing tests?<\/strong><\/h4>\n\n\n<p class=\"wp-block-paragraph\"><strong>A:<\/strong> Retries should be used cautiously. Most teams allow only one guarded retry for infrastructure or transient errors. 
Excessive retries can hide flaky tests and reduce the reliability of CI signals.<\/p>\n\n\n<h4 class=\"wp-block-heading\" id=\"q-how-often-should-i-compute-stability-metrics\"><strong>Q: How often should I compute stability metrics?<\/strong><\/h4>\n\n\n<p class=\"wp-block-paragraph\"><strong>A:<\/strong> Stability signals should be computed in real time during CI runs to detect flaky behavior immediately. In addition, weekly or 30-day trend reports help engineering teams prioritize long-term test stability improvements.<\/p>\n\n\n<h4 class=\"wp-block-heading\" id=\"q-what-is-an-acceptable-flaky-test-rate\"><strong>Q: What is an acceptable flaky test rate?<\/strong><\/h4>\n\n\n<p class=\"wp-block-paragraph\"><strong>A:<\/strong> Many engineering teams aim for an overall flaky test rate below 3\u20135% across the entire test suite. Critical tests used for release gating should typically achieve at least 95% stability.<\/p>\n\n\n<h4 class=\"wp-block-heading\" id=\"q-how-do-i-separate-infrastructure-issues-from-test-problems\"><strong>Q: How do I separate infrastructure issues from test problems?<\/strong><\/h4>\n\n\n<p class=\"wp-block-paragraph\"><strong>A:<\/strong> A common approach is to tag test runs with a <em>failure_reason<\/em> field during execution. This allows teams to compute a separate environmental failure rate and distinguish infrastructure instability from genuine test logic failures.<\/p>\n\n\n<h4 class=\"wp-block-heading\" id=\"q-which-metrics-should-product-managers-watch\"><strong>Q: Which metrics should product managers watch?<\/strong><\/h4>\n\n\n<p class=\"wp-block-paragraph\"><strong>A:<\/strong> Product leaders typically focus on crash-free sessions and the number of release-blocking flaky tests. 
These metrics directly affect user experience and release velocity.<\/p>\n\n\n<h4 class=\"wp-block-heading\" id=\"q-what-dashboards-should-i-build\"><strong>Q: What dashboards should I build?<\/strong><\/h4>\n\n\n<p class=\"wp-block-paragraph\"><strong>A:<\/strong> Effective QA dashboards usually include a 30-day flaky-test trend, the most impactful flaky tests, crash-free session metrics, device failure heatmaps, and MTTR\/MTTD indicators for debugging efficiency.<\/p>\n\n\n<h4 class=\"wp-block-heading\" id=\"q-how-do-i-measure-mttr-for-flaky-tests\"><strong>Q: How do I measure MTTR for flaky tests?<\/strong><\/h4>\n\n\n<p class=\"wp-block-paragraph\"><strong>A:<\/strong> Track the timestamps for <em>time_detected<\/em> and <em>time_fixed<\/em> (for example when the pull request is merged and the test passes consistently). The average difference between these values over time provides the mean time to resolution.<\/p>\n\n\n<h4 class=\"wp-block-heading\" id=\"q-can-device-farms-help-detect-flakiness\"><strong>Q: Can device farms help detect flakiness?<\/strong><\/h4>\n\n\n<p class=\"wp-block-paragraph\"><strong>A:<\/strong> Yes. Device farms typically provide rerun histories, detailed logs, and session replay capabilities. These features make it easier to reproduce intermittent failures and diagnose the root causes of flaky tests.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Mobile app automation promises faster releases and better software quality\u2014but those benefits disappear when test suites become unstable. A single flaky test can trigger false failures, force developers to rerun pipelines, and slow down delivery. Over time, teams begin to distrust their CI results, treating failures as noise instead of signals of real issues. 
This [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":4237,"comment_status":"open","ping_status":"open","sticky":false,"template":"wp-custom-template-panto-blogs-v3","format":"standard","meta":{"footnotes":""},"categories":[110],"tags":[],"class_list":["post-4235","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-qa-testing"],"_links":{"self":[{"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/posts\/4235","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/comments?post=4235"}],"version-history":[{"count":0,"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/posts\/4235\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/media\/4237"}],"wp:attachment":[{"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/media?parent=4235"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/categories?post=4235"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/tags?post=4235"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}