{"id":2879,"date":"2025-11-26T01:16:05","date_gmt":"2025-11-25T19:46:05","guid":{"rendered":"https:\/\/www.getpanto.ai\/blog\/?p=2879"},"modified":"2025-11-26T01:17:19","modified_gmt":"2025-11-25T19:47:19","slug":"cloudflare-outage","status":"publish","type":"post","link":"https:\/\/www.getpanto.ai\/blog\/cloudflare-outage","title":{"rendered":"Cloudflare Outage: How One Feature File Took Down the Network"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Just weeks after <a href=\"https:\/\/www.getpanto.ai\/blog\/aws-outage-2025-retry-storm\">AWS disclosed a major DynamoDB outage<\/a> in late October 2025 (triggered by a latent DNS-management race condition), Cloudflare suffered its own large-scale outage on November 18, 2025. Both incidents underscore how fragile modern cloud systems can become when small bugs combine with high deployment velocity.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Some experts have described today\u2019s fast, AI-assisted \u201c<a href=\"https:\/\/www.getpanto.ai\/blog\/vibe-coding-vs-vibe-debugging-the-modern-developers-reality\">vibe coding<\/a>\u201d culture, which emphasizes shipping new features quickly, as trading away robustness for speed. 
The Cloudflare outage postmortem makes this clear: <strong>a seemingly innocuous database permission change cascaded into a network-wide failure<\/strong>.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1258\" height=\"854\" src=\"https:\/\/www.getpanto.ai\/blog\/wp-content\/uploads\/2025\/11\/image-136.png\" alt=\"Cloudflare outage error\" class=\"wp-image-2885\" srcset=\"https:\/\/www.getpanto.ai\/blog\/wp-content\/uploads\/2025\/11\/image-136.png 1258w, https:\/\/www.getpanto.ai\/blog\/wp-content\/uploads\/2025\/11\/image-136-300x204.png 300w, https:\/\/www.getpanto.ai\/blog\/wp-content\/uploads\/2025\/11\/image-136-768x521.png 768w, https:\/\/www.getpanto.ai\/blog\/wp-content\/uploads\/2025\/11\/image-136-200x136.png 200w\" sizes=\"auto, (max-width: 1258px) 100vw, 1258px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Cloudflare\u2019s <a href=\"https:\/\/www.getpanto.ai\/products\/code-security\/reports\">report<\/a> confirms no malicious attack was involved. Instead, on Nov 18 at <strong>11:20\u202fUTC<\/strong>, Cloudflare\u2019s network began seeing \u201csignificant failures to deliver core network traffic,\u201d with users getting the above error.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The culprit was a database change: a ClickHouse permission update at <strong>11:05<\/strong> caused the Bot Management feature-file query to more than double its output (via duplicate rows). 
That oversized file violated an internal 200-feature limit (set for performance), causing a Rust \u201cunwrap\u201d panic in <a href=\"https:\/\/www.getpanto.ai\/blog\/cloudflare-self-ddos-outage-breakdown\">Cloudflare<\/a>\u2019s new FL2 proxy code and triggering HTTP 5xx errors across the CDN.<\/p>\n\n\n<h2 class=\"wp-block-heading\" id=\"cloudflare-outage-timeline-and-sequence-of-events\"><span class=\"ez-toc-section\" id=\"cloudflare-outage-timeline-and-sequence-of-events\"><\/span><strong>Cloudflare Outage Timeline and Sequence of Events<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n<p class=\"wp-block-paragraph\">Key events unfolded as follows:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"461\" height=\"575\" src=\"https:\/\/www.getpanto.ai\/blog\/wp-content\/uploads\/2025\/11\/image-134.png\" alt=\"Timeline of Cloudflare Outage\" class=\"wp-image-2883\" srcset=\"https:\/\/www.getpanto.ai\/blog\/wp-content\/uploads\/2025\/11\/image-134.png 461w, https:\/\/www.getpanto.ai\/blog\/wp-content\/uploads\/2025\/11\/image-134-241x300.png 241w, https:\/\/www.getpanto.ai\/blog\/wp-content\/uploads\/2025\/11\/image-134-200x249.png 200w\" sizes=\"auto, (max-width: 461px) 100vw, 461px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>11:05\u202fUTC<\/strong> \u2013 Cloudflare deploys a ClickHouse <em>access control change<\/em> to expose underlying shard metadata.<br><\/li>\n\n\n\n<li><strong>11:28<\/strong> \u2013 The first errors appear on customer HTTP traffic as the new Bot Management feature file starts propagating. (Earlier, system <a href=\"https:\/\/www.getpanto.ai\/blog\/code-quality\">metrics <\/a>were normal.)<br><\/li>\n\n\n\n<li><strong>11:31\u201311:35<\/strong> \u2013 An internal monitoring test triggers an alert at ~11:31, and an incident call is convened by 11:35. 
Engineers initially see degraded response rates in Workers KV and attempt mitigations (rate limiting, traffic shaping).<br><\/li>\n\n\n\n<li><strong>13:05<\/strong> \u2013 To reduce pressure, teams bypass the core proxy for Workers KV and Access, falling back to an older proxy version. This lessens the downstream impact (the <a href=\"https:\/\/www.getpanto.ai\/blog\/mobile-app-testing-ai-top-bugs\">bug<\/a> was still present, but older code masked it).<br><\/li>\n\n\n\n<li><strong>13:37<\/strong> \u2013 The Bot Management configuration file is identified as the trigger. Multiple workstreams begin. The fastest path: restore the last-known-good version of the file.<br><\/li>\n\n\n\n<li><strong>14:24<\/strong> \u2013 New (bad) feature-file generation is halted. A tested, correct feature file is manually pushed out to all nodes.<br><\/li>\n\n\n\n<li><strong>14:30<\/strong> \u2013 Core services begin recovering. A correct Bot Management file is deployed globally; HTTP 5xx error rates drop sharply. (Cloudflare reports \u201ccore traffic was largely flowing as normal by 14:30\u201d.)<br><\/li>\n\n\n\n<li><strong>By 17:06<\/strong> \u2013 All remaining systems have been restarted or recovered, and traffic returns to normal levels. (The long tail on the errors chart reflects final service restarts.)<br><\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">Early in the incident, errors followed an unusual \u201czig-zag\u201d pattern, spiking every ~5 minutes as good and bad feature files alternated. 
Eventually the bad file prevailed and errors stabilized until remediation took effect.<\/p>\n\n\n<h2 class=\"wp-block-heading\" id=\"deep-dive-into-the-cloudflare-outage\"><span class=\"ez-toc-section\" id=\"deep-dive-into-the-cloudflare-outage\"><\/span><strong>Deep Dive into the Cloudflare Outage<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n<h3 class=\"wp-block-heading\" id=\"root-cause-botmanagement-feature-file-and-query-bug\"><span class=\"ez-toc-section\" id=\"root-cause-bot-management-feature-file-and-query-bug\"><\/span><strong>Root Cause: Bot-Management Feature File and Query Bug<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n<p class=\"wp-block-paragraph\">Cloudflare\u2019s Bot Management module embeds a <a href=\"https:\/\/www.getpanto.ai\/products\/ai-code-review\/reinforcement-learning\">machine-learning<\/a> model that scores each request as bot or human. This ML model consumes a \u201cfeature\u201d configuration file (a list of request-trait parameters). The feature file is regenerated every few minutes via a ClickHouse query so that it stays up to date.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">On Nov 18, the scheduled ClickHouse update at 11:05 changed query semantics: it granted explicit access to underlying shard tables in the r0 schema. Previously, the metadata query returned only the default schema\u2019s columns, and the code implicitly assumed it always would; after the change, the same query returned columns from r0 too.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In practice, the Bot feature-generation query (a system.columns lookup) suddenly returned <strong>all<\/strong> columns from both default and r0 tables. Concretely, the report notes that <em>&#8220;the response now contained all the metadata of the r0 schema, effectively more than doubling the rows&#8221;<\/em> in the resulting feature file. 
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In other words, a permission tweak caused <a href=\"https:\/\/www.getpanto.ai\/blog\/code-duplication-detection-tools\">duplicate<\/a> feature entries. The new feature file exceeded Cloudflare\u2019s built-in limit of 200 features. As soon as the bloated file hit production (in the FL2 proxy code path), the proxy panicked:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>thread fl2_worker_thread panicked: called Result::unwrap() on an Err value<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This unhandled Rust panic (during feature-file parsing) turned into HTTP 5xx errors for any request going through the bot-management path. In summary, a database permission change more than doubled the size of the ML feature config file, which triggered an internal error and caused core requests to fail.<\/p>\n\n\n<h3 class=\"wp-block-heading\" id=\"failure-propagation-and-impact-across-systems\"><span class=\"ez-toc-section\" id=\"failure-propagation-and-impact-across-systems\"><\/span><strong>Failure Propagation and Impact Across Systems<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n<p class=\"wp-block-paragraph\">Once the error occurred, it rippled through most Cloudflare services that depended on the core proxy or bot logic. The effect on customers varied depending on their proxy version: Cloudflare was midway through migrating traffic to a new proxy engine (FL2). <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Per the report, customers on FL2 <em>all<\/em> saw 5xx errors, while customers still on the legacy proxy (FL) saw no crashes \u2013 but all bot scores defaulted to zero. This meant any customer using bot scores to <em>block<\/em> bots would suddenly block legitimate traffic as well (false positives). 
In short:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Core CDN and <\/strong><a href=\"https:\/\/www.getpanto.ai\/security\"><strong>Security<\/strong><\/a><strong>:<\/strong> Returned HTTP 5xx to requests (customers saw error pages like the one above).<\/li>\n\n\n\n<li><strong>Turnstile (CAPTCHA):<\/strong> Failed to load entirely, so new logins to the dashboard were blocked.<\/li>\n\n\n\n<li><strong>Workers KV:<\/strong> Front-end gateway saw a surge of HTTP 5xx errors as its proxy calls failed.<\/li>\n\n\n\n<li><a href=\"https:\/\/www.getpanto.ai\/products\/code-security\/security-dashboard\"><strong>Dashboard<\/strong><\/a><strong>:<\/strong> Mostly up, but users couldn\u2019t log in (Turnstile was down on the login page).<\/li>\n\n\n\n<li><strong>Access (Zero Trust Auth):<\/strong> Widespread authentication failures from the start of the incident until rollback, meaning users got errors instead of reaching protected apps. (Existing sessions were unaffected, but no new logins succeeded.)<\/li>\n\n\n\n<li><strong>Email Security:<\/strong> Largely unaffected; a minor reputation feed went offline briefly but did not critically impact customers.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">In addition, Cloudflare saw unusually high CPU use on proxy nodes as its error-<a href=\"https:\/\/www.getpanto.ai\/blog\/vibe-debugging-ai-qa-testing\">debugging<\/a> systems kicked in, which added latency for surviving traffic. At one point, even the Cloudflare status page (hosted outside Cloudflare\u2019s network) went down by coincidence, initially confusing the team into thinking this might be an external attack. 
In short, the bug turned into a cascading failure \u2013 taking core traffic offline and then affecting most dependent control-plane services until fixed.<\/p>\n\n\n<h3 class=\"wp-block-heading\" id=\"incident-diagnostics-and-mitigation\"><span class=\"ez-toc-section\" id=\"incident-diagnostics-and-mitigation\"><\/span><strong>Incident Diagnostics and Mitigation<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n<p class=\"wp-block-paragraph\">Initially, the errors looked like an external event (the status page outage and high error volumes spooked the team). However, internal telemetry soon pointed to an internal config issue. An automated test at ~11:31 picked up the anomaly, and by 11:35 an incident was declared. Early mitigation attempts included traffic shaping and account limiting on Workers KV to reduce load. Within about an hour, engineers used knowledge of Cloudflare\u2019s <a href=\"https:\/\/www.getpanto.ai\/products\/code-security\/iac\">architecture<\/a> to narrow the cause: the Bot Management module\u2019s feature file was the trigger.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To blunt the impact while <a href=\"https:\/\/www.getpanto.ai\/blog\/vibe-debugging-effortless-engineering\">debugging<\/a>, at <strong>13:05<\/strong> the team bypassed the core proxy for Workers KV and Access, sending them through an older proxy build. This reduced error rates (since the legacy FL engine did not crash on the oversized file). With that reprieve, engineers focused on restoring the bot-config file. By <strong>13:37<\/strong> they had confirmed the feature-file bug was the culprit and set out to roll back to a known-good version. A parallel workstream halted further bad file generation and propagated the last working file.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">By <strong>14:24<\/strong>, the team had stopped new feature-file deployments altogether and tested the rollback: as expected, the old file cured the errors in FL2. 
With confidence, they globally pushed the good file and restarted the proxy at <strong>14:30<\/strong>. The core proxy was then largely stable: 5xx rates fell back to baseline and most traffic flowed normally. It took until ~17:06 for all lingering effects (restarting any services still in a bad state) to clear up.<\/p>\n\n\n<h2 class=\"wp-block-heading\" id=\"lessons-learned-and-future-protections\"><span class=\"ez-toc-section\" id=\"lessons-learned-and-future-protections\"><\/span><strong>Lessons Learned and Future Protections<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"433\" height=\"577\" src=\"https:\/\/www.getpanto.ai\/blog\/wp-content\/uploads\/2025\/11\/image-removebg-preview.png\" alt=\"Lessons Learned from Cloudflare Outage\" class=\"wp-image-2889\" srcset=\"https:\/\/www.getpanto.ai\/blog\/wp-content\/uploads\/2025\/11\/image-removebg-preview.png 433w, https:\/\/www.getpanto.ai\/blog\/wp-content\/uploads\/2025\/11\/image-removebg-preview-225x300.png 225w, https:\/\/www.getpanto.ai\/blog\/wp-content\/uploads\/2025\/11\/image-removebg-preview-200x267.png 200w\" sizes=\"auto, (max-width: 433px) 100vw, 433px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n<h3 class=\"wp-block-heading\" id=\"cloudflares-leadership-response-and-architectural\"><span class=\"ez-toc-section\" id=\"cloudflares-leadership-response-and-architectural-hardening\"><\/span><strong>Cloudflare\u2019s Leadership Response and Architectural Hardening<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n<p class=\"wp-block-paragraph\">Cloudflare\u2019s leaders stress that this outage \u201cis unacceptable\u201d and describe it as the company\u2019s worst since 2019. 
They have outlined several architectural hardening steps: treating internally generated config files like external input (with validation checks), adding more global kill switches for features, preventing diagnostic logs or core dumps from choking resources, and systematically reviewing failure modes in all proxy modules. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In essence, every control plane and configuration path will need stronger guardrails. Cloudflare also notes that each major incident has pushed it to build more resilient systems, and that it has consistently responded to failures by adding redundancy and <a href=\"https:\/\/www.getpanto.ai\/blog\/ai-qa-automation-code-review-quality\">automation<\/a>.<\/p>\n\n\n<h3 class=\"wp-block-heading\" id=\"key-takeaway-for-sredevops-community\"><span class=\"ez-toc-section\" id=\"key-takeaway-for-sredevops-community\"><\/span><strong>Key Takeaway for SRE\/DevOps Community<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n<p class=\"wp-block-paragraph\">For the SRE\/DevOps community, the key takeaway is a reconfirmation of \u201cdesign for failure.\u201d Even mature CDNs can be tripped up by a subtle data bug if unchecked configuration changes ripple through the network. This incident \u2013 following closely on the heels of AWS\u2019s outage \u2013 highlights how high-velocity engineering (\u201cvibe coding\u201d with <a href=\"https:\/\/www.getpanto.ai\/blog\/best-ai-code-review-tools\">AI tools<\/a>) can create hidden failure surfaces. The rapid iteration culture demands equally rapid observability and testing countermeasures. 
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In practice, this argues for investing more in <a href=\"https:\/\/www.getpanto.ai\/blog\/ai-powered-testing\">AI-powered testing<\/a> pipelines (even for mobile and edge apps) and richer runtime diagnostics to catch complex bugs early.<\/p>\n\n\n<h3 class=\"wp-block-heading\" id=\"automation-and-testing-frameworks-importance\"><span class=\"ez-toc-section\" id=\"automation-and-testing-frameworks-importance\"><\/span><strong>The Importance of Automation and Testing Frameworks<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n<p class=\"wp-block-paragraph\">For example, <a href=\"https:\/\/www.getpanto.ai\/blog\/automated-mobile-qa-ai-testing\">automated end-to-end testing<\/a> and chaos testing can exercise new configuration flows before they reach production. Similarly, enhanced tracing can pinpoint failures in ML-driven modules. Ultimately, as Cloudflare\u2019s outage experience shows, strong automation and testing frameworks are essential to support the fast <a href=\"https:\/\/www.getpanto.ai\/blog\/death-of-manual-qa-ai-mobile-app-testing\">AI-driven workflows<\/a> that define today\u2019s engineering culture.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Just weeks after AWS disclosed a major DynamoDB outage in late October 2025 (triggered by a latent DNS-management race condition), Cloudflare suffered its own large-scale outage on November 18, 2025. 
Both incidents underscore how fragile modern cloud systems can become when small bugs combine with high deployment velocity.&nbsp; Some experts have dubbed today\u2019s fast, AI-assisted [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":2886,"comment_status":"open","ping_status":"open","sticky":false,"template":"wp-custom-template-panto-code-review-blog","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-2879","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-coding"],"_links":{"self":[{"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/posts\/2879","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/comments?post=2879"}],"version-history":[{"count":0,"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/posts\/2879\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/media\/2886"}],"wp:attachment":[{"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/media?parent=2879"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/categories?post=2879"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.getpanto.ai\/blog\/wp-json\/wp\/v2\/tags?post=2879"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}