
When the Internet Blinked: Inside the November 18 Cloudflare Outage and What Really Happened  

On November 18, 2025, the Internet hit a strange and unexpected speed bump. Websites worldwide — from small businesses to major enterprises — began showing error pages. Apps struggled to connect. Authentication systems failed. Even Cloudflare’s own status page briefly went offline.

At first glance, it looked like the start of a massive cyberattack.

But the truth was far more surprising.

In this detailed breakdown, we at IHA Cloud walk through exactly what happened, why it happened, and what the incident tells us about the hidden complexity of the Internet’s infrastructure.


A Normal Day — Until 11:20 UTC  

Cloudflare operates one of the most widely distributed networks in the world. At 11:20 UTC, that network suddenly began returning 5xx errors — HTTP status codes that signal a server-side failure — for millions of requests.

Visitors saw Cloudflare-branded error messages when trying to access sites.

Behind the scenes, engineers saw erratic behavior: traffic would fail, then suddenly recover, then fail again. This “flickering” made the situation look eerily like a high-volume, targeted DDoS attack. But the real cause was much quieter — and buried deep inside Cloudflare’s internal systems.

Not an Attack — A Database Permission Change Gone Wrong  

The outage began with an update to Cloudflare’s ClickHouse database cluster. The update was meant to improve permission management and make queries more secure.

But one unexpected side-effect changed everything.

How it spiraled:  

  1. A database permission change caused a query to return duplicate rows.
  2. That query was responsible for generating a “feature file” — a configuration file used by Cloudflare’s Bot Management system, which helps distinguish humans from bots.
  3. The feature file doubled in size unexpectedly.
  4. Cloudflare’s routing software downloaded this file globally.
  5. The software had a hidden size limit.
  6. When the file exceeded that limit, the routing process crashed.
  7. Crashed routing = failed traffic = global 5xx errors.
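
To make the crash mechanism concrete, here is a minimal Rust sketch of steps 4 to 6. It is not Cloudflare's code: the limit value, file format, and function names are our assumptions, but it shows how a fixed cap plus an unhandled error can turn an oversized config file into a dead process.

```rust
// Hypothetical sketch of the crash mechanism described above.
// MAX_FEATURES, the file format, and all names are illustrative assumptions.

const MAX_FEATURES: usize = 200; // hidden hard limit inside the proxy

/// Parse a feature file (one feature name per line) into a bounded list.
fn load_features(file_contents: &str) -> Result<Vec<String>, String> {
    let mut features = Vec::with_capacity(MAX_FEATURES);
    for line in file_contents.lines() {
        if features.len() >= MAX_FEATURES {
            return Err(format!(
                "feature file exceeds limit of {MAX_FEATURES} entries"
            ));
        }
        features.push(line.to_string());
    }
    Ok(features)
}

fn main() {
    // A normal-sized file loads fine...
    let good_file: String = (0..120).map(|i| format!("feature_{i}\n")).collect();
    // ...but the same content with every row duplicated is twice the size.
    let doubled_file = good_file.repeat(2);

    load_features(&good_file).expect("normal file should load");
    // Here, as in the incident, the oversized file takes the process down:
    load_features(&doubled_file).expect("feature file within limits"); // panics
}
```

Running the sketch, the doubled file trips the limit and the process aborts, which is the per-machine version of the global 5xx errors described above.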

To make things worse, the feature file was regenerated every 5 minutes. Sometimes it was generated correctly. Sometimes it wasn’t. That’s why Cloudflare experienced a cycle of:

✔ normal service
✖ failure
✔ normal service
✖ failure

This back-and-forth made diagnosis extremely difficult.
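
The flapping is easier to picture with a toy simulation (our own illustration, not Cloudflare's pipeline). It assumes, consistent with the on/off behavior described above, that the generating query only sometimes saw the extra metadata introduced by the permission change.

```rust
// Toy simulation of the 5-minute regeneration cycle described above.
// The schema names, feature names, and strict alternation are assumptions
// made purely for illustration.

/// Metadata rows visible to the feature-file query. After the permission
/// change, the same columns also show up under an extra underlying schema,
/// so the query returns every row twice.
fn visible_rows(sees_extra_schema: bool) -> Vec<(&'static str, &'static str)> {
    let base = vec![("default", "feature_a"), ("default", "feature_b")];
    if sees_extra_schema {
        let mut doubled = base.clone();
        doubled.extend([("underlying", "feature_a"), ("underlying", "feature_b")]);
        doubled
    } else {
        base
    }
}

fn main() {
    // Each iteration stands in for one 5-minute regeneration; whether the
    // query "sees" the extra schema alternates here, standing in for a query
    // that only sometimes reached a part of the cluster with the new permissions.
    for cycle in 0..4 {
        let rows = visible_rows(cycle % 2 == 1);
        let status = if rows.len() > 2 {
            "✖ oversized file, proxies fail"
        } else {
            "✔ normal service"
        };
        println!("cycle {cycle}: feature file has {} rows -> {status}", rows.len());
    }
}
```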

Why Engineers First Suspected a DDoS Attack  

During this chaos, another unrelated glitch occurred:
Cloudflare’s status page — which is hosted outside Cloudflare — went down due to an entirely separate issue.

To engineers dealing with fluctuating errors, massive 5xx spikes, and a dead status page… it looked exactly like a coordinated large-scale attack.

Even internal chats reflected this suspicion.

Only later did the team trace the root cause: the oversized Bot Management feature file.


How Cloudflare Stabilized the Internet Again  

It took several steps to untangle the issue:

1. Stopping the spread of the faulty configuration file  

Cloudflare paused generation of the feature file to prevent new bad versions from propagating.

2. Rolling back to a last-known-good version  

A clean feature file was manually injected into the distribution system.

3. Restarting core proxy services globally  

Once devices had the correct file, the routing layer (FL and FL2) began recovering.

4. Fixing downstream services such as:  

  • Workers KV
  • Access
  • Turnstile
  • Dashboard authentication

5. Full recovery:  

By 14:30 UTC, most traffic was back to normal.
By 17:06 UTC, all Cloudflare systems were fully recovered.


Who Was Impacted?  

Because Cloudflare sits in front of a huge portion of the Internet, the outage affected:

  • Websites (HTTP 5xx errors)
  • API requests
  • Login systems relying on Cloudflare Access
  • Workers KV-based internal and external services
  • CAPTCHA/Turnstile authentication flows
  • CDN performance due to increased CPU load during error handling

Some users even saw false-positive bot detections because bot scoring failed.


Why the Issue Became So Big  

This outage wasn’t caused by a single bug — it was a combination of:

✔ A database permissions change  

Exposing additional metadata by accident.

✔ A configuration file generator depending on that metadata  

Which doubled its output size.

✔ A strict size limit in the bot module  

Causing a panic when exceeded.

✔ Global propagation of changes  

Which meant the incorrect file hit machines everywhere, almost instantly.

✔ A coincidence: the Cloudflare status page failing simultaneously  

Creating confusion during early investigation.

This rare “perfect storm” turned a simple metadata change into a multi-hour global outage.

How Cloudflare Plans to Prevent This in the Future  

Cloudflare has publicly committed to several improvements:

1. Harden validation of internal config files  

Even internally generated files will be treated like user input — validated before they roll out.
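
As a rough illustration of what “treat internal files like user input” can mean in practice, the sketch below validates a generated feature file against explicit bounds before it is allowed to propagate. The limits, error cases, and names are our assumptions, not Cloudflare's actual checks.

```rust
// Illustrative pre-deployment validation for a generated feature file.
// Limits and error cases are assumptions, not Cloudflare's actual checks.

const MAX_FEATURES: usize = 200;
const MAX_FILE_BYTES: usize = 64 * 1024;

#[derive(Debug)]
enum ConfigError {
    EmptyFile,
    TooLarge(usize),
    TooManyFeatures(usize),
    DuplicateFeature(String),
}

/// Validate a generated feature file before it is published to the fleet.
fn validate_feature_file(contents: &str) -> Result<(), ConfigError> {
    if contents.is_empty() {
        return Err(ConfigError::EmptyFile);
    }
    if contents.len() > MAX_FILE_BYTES {
        return Err(ConfigError::TooLarge(contents.len()));
    }
    let mut seen = std::collections::HashSet::new();
    let mut count = 0usize;
    for line in contents.lines() {
        count += 1;
        if count > MAX_FEATURES {
            return Err(ConfigError::TooManyFeatures(count));
        }
        if !seen.insert(line) {
            // Duplicate rows were exactly the symptom in this incident.
            return Err(ConfigError::DuplicateFeature(line.to_string()));
        }
    }
    Ok(())
}

fn main() {
    let doubled = "bot_score\nbot_score\n";
    match validate_feature_file(doubled) {
        Ok(()) => println!("publish new feature file"),
        Err(e) => println!("reject and keep last-known-good: {e:?}"),
    }
}
```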

2. Global kill switches  

Allowing teams to instantly disable problematic features.
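
A kill switch at the consuming end might look something like the sketch below (again, the flag name and defaults are assumptions on our part): a single flag that bypasses bot scoring entirely, so a broken feature feed degrades one product instead of crashing the proxy.

```rust
// Hypothetical global kill switch for the bot-scoring path.
// The flag name, default score, and plumbing are assumptions for illustration.

use std::sync::atomic::{AtomicBool, Ordering};

/// Flipped by operators (e.g. via a control plane) to disable bot scoring fleet-wide.
static BOT_MANAGEMENT_DISABLED: AtomicBool = AtomicBool::new(false);

fn bot_score(request_path: &str) -> u8 {
    if BOT_MANAGEMENT_DISABLED.load(Ordering::Relaxed) {
        // Fail open: without scoring, treat traffic as likely human rather than
        // refusing to serve it at all.
        return 99;
    }
    // Placeholder for the real feature-based scoring.
    if request_path.contains("/wp-login") { 5 } else { 80 }
}

fn main() {
    println!("score before kill switch: {}", bot_score("/index.html"));
    // Incident response: flip the switch instead of shipping an emergency build.
    BOT_MANAGEMENT_DISABLED.store(true, Ordering::Relaxed);
    println!("score after kill switch:  {}", bot_score("/index.html"));
}
```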

3. Improved error handling  

Eliminating unbounded memory allocations and avoiding system panics when limits are exceeded.
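
Our reading of this change, sketched with assumed names: when a newly delivered file fails validation, the module logs the problem and keeps serving with the last configuration it successfully loaded, rather than panicking.

```rust
// Illustration of "don't panic on a bad config": keep the last-known-good
// feature set and log the failure. All names and limits are assumptions.

const MAX_FEATURES: usize = 200;

struct BotModule {
    /// Last configuration that passed validation; never replaced by a bad one.
    active_features: Vec<String>,
}

impl BotModule {
    fn reload(&mut self, new_file: &str) {
        let parsed: Vec<String> = new_file.lines().map(|s| s.to_string()).collect();
        if parsed.is_empty() || parsed.len() > MAX_FEATURES {
            // Previously this kind of condition could escalate into a process
            // panic; here it is reduced to a logged, recoverable event.
            eprintln!(
                "rejecting feature file with {} entries; keeping {} active features",
                parsed.len(),
                self.active_features.len()
            );
            return;
        }
        self.active_features = parsed;
    }
}

fn main() {
    let mut module = BotModule { active_features: vec!["bot_score".to_string()] };
    module.reload(&"feature\n".repeat(400)); // oversized: rejected, service continues
    module.reload("feature_a\nfeature_b");   // valid: applied
    println!("active features: {}", module.active_features.len());
}
```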

4. Better safeguards between systems  

So database metadata changes can’t silently propagate into runtime systems without checks.

5. Updated failure mode reviews  

Across all core proxy modules (FL & FL2).

Why This Outage Matters (Even If Your Site Wasn’t Down)  

Incidents like this remind us how interconnected the Internet is.

A change inside a globally distributed cloud platform — even a small one — can ripple into major outages. At the same time, such incidents highlight how much engineering effort goes into ensuring reliability every day.

At IHA Cloud, we study incidents like this not to point fingers, but to improve our own infrastructure practices and resilience models.

Understanding these failures helps the entire industry evolve.


Final Thoughts  

Cloudflare hasn’t had an outage of this scale since 2019. This one was painful, unexpected, and complex — and Cloudflare has openly acknowledged it.

The silver lining?

The Internet recovered quickly, lessons were learned, and the global cloud ecosystem becomes stronger each time we analyze such events.

If you want a simpler summary:

👉 A database change caused a config file to double in size →
A size limit caused the routing software to crash →
The crash caused 5xx errors worldwide →
Cloudflare fixed it by rolling back the file and stabilizing the network.

We hope this breakdown gives you a clear and understandable look at what happened on November 18, 2025 — one of the Internet’s most unusual days in recent memory.
