I’m surprised at the delay in impact detection: it took their internal health service more than five minutes to notice (or at least alert) that their main protocol’s traffic had abruptly dropped to around 10% of expected and was staying there. Without ever having been involved in monitoring at that kind of scale, I’d have pictured alarms firing for something that extreme within a minute. I’m curious to hear an explanation of how and why that might be, and whether it’s reasonable or surprising to professionals in that space too.
Interesting to see that they probably lost 20% of 1.1.1.1 usage from a roughly 20-minute incident.
Not sure how Cloudflare keeps struggling with issues like these; this isn't the first (and probably won't be the last) time they've had these 'simple', 'deprecated', 'legacy' issues occurring.
8.8.8.8 + 8.8.4.4 hasn't had a global[1] second of downtime for almost a decade.
[1]: Localized issues did exist, but those are really the fault of the internet, and the resolvers kept running even when Google itself suffered severe downtime across various services.
> It’s worth noting that DoH (DNS-over-HTTPS) traffic remained relatively stable as most DoH users use the domain cloudflare-dns.com, configured manually or through their browser, to access the public DNS resolver, rather than by IP address.
Interesting, I was affected by this yesterday. My router (supposedly) had Cloudflare DoH enabled but nothing would resolve. Changing the DNS server to 8.8.8.8 fixed the issues.
Many commenters assume fallback behavior exists between DNS providers, but in practice DNS clients, especially at the OS or router level, rarely implement robust failover for DoH. If you're using cloudflare-dns(.)com and it goes down, then unless the stub resolver or router explicitly supports multi-provider failover (and uses a trust-on-first-use or pinned cert model), you're stuck. The illusion of redundancy with DoH needs serious UX rethinking.
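For illustration, a rough sketch of what per-query multi-provider failover could look like, using the public JSON endpoints of Cloudflare and Google (a real stub resolver would speak the binary wire format, cache answers, and pin certs; this only shows the failover shape):

    # Sketch only: fail over across two DoH providers' JSON endpoints.
    import json
    import urllib.request

    DOH_ENDPOINTS = [
        "https://cloudflare-dns.com/dns-query",  # Cloudflare JSON API
        "https://dns.google/resolve",            # Google JSON API
    ]

    def resolve(name, rtype="A", timeout=2):
        last_err = None
        for endpoint in DOH_ENDPOINTS:
            url = f"{endpoint}?name={name}&type={rtype}"
            req = urllib.request.Request(url, headers={"accept": "application/dns-json"})
            try:
                with urllib.request.urlopen(req, timeout=timeout) as resp:
                    answers = json.loads(resp.read()).get("Answer", [])
                    return [a["data"] for a in answers]
            except OSError as err:   # URLError/timeout: try the next provider
                last_err = err
        raise RuntimeError(f"all DoH providers failed: {last_err}")

    print(resolve("example.com"))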
This is a good time to mention that dnsmasq lets you set up several DNS servers and can race them. The first responder wins. You won't ever notice one of the services being down:
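Something along these lines in dnsmasq.conf (the upstream choices here are just examples):

    # Send each query to every listed upstream; the first reply wins.
    all-servers
    no-resolv            # don't pull additional upstreams from /etc/resolv.conf
    server=1.1.1.1
    server=1.0.0.1
    server=8.8.8.8
    server=8.8.4.4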
Interesting that traffic didn't return to completely normal levels after the incident.
I recently started using the "luci-app-https-dns-proxy" package on OpenWrt, which is preconfigured to use both Cloudflare and Google DNS, and since DoH was mostly unaffected, I didn't notice an outage. (Though if DoH had been affected, it presumably would have failed over to Google DNS anyway.)
An outage of roughly 1 hour is 0.13% of a month or 0.0114% of a year.
It would be interesting to see the service level objective (SLO) that cloudflare internally has for this service.
I've found https://www.cloudflare.com/r2-service-level-agreement/ but this seems to be for paid services; this outage would put July in the "< 99.9% but >= 99.0%" bucket, so you'd get a 10% refund for the month if you paid for it.
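For reference, the arithmetic behind those numbers (July has 31 days):

    1 h / (31 d * 24 h)  = 1/744  ≈ 0.134% of the month  →  ≈ 99.87% monthly uptime
    1 h / (365 d * 24 h) = 1/8760 ≈ 0.0114% of the year
    99.0% ≤ 99.87% < 99.9%  →  the 10% service credit tier in that SLA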
> Even though this release was peer-reviewed by multiple engineers
I find it somewhat surprising that none of the multiple engineers who reviewed the original change in June noticed that they had added 1.1.1.0/24 to the list of prefixes that should be rerouted. I wonder what sort of human mistake or malice led to that original error.
Perhaps it would be wise to add some hard-coded special-case mitigations to DLS such that it would not allow 1.1.1.1/32 or 1.0.0.1/32 to be reassigned to a single location.
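Purely as a sketch (DLS internals aren't public, so every name here is hypothetical), the kind of pre-deployment guard I mean:

    # Hypothetical guard: refuse any topology change that would pin the
    # public resolver addresses to a single location.
    import ipaddress

    PROTECTED = [ipaddress.ip_network("1.1.1.1/32"),
                 ipaddress.ip_network("1.0.0.1/32")]

    def check_change(prefixes, target_locations):
        """prefixes: CIDR strings touched by the change; target_locations: sites they route to."""
        nets = [ipaddress.ip_network(p) for p in prefixes]
        for guarded in PROTECTED:
            covered = any(n.version == guarded.version and guarded.subnet_of(n) for n in nets)
            if covered and len(target_locations) < 2:
                raise ValueError(f"refusing to pin {guarded} to a single location")

    # The June change would have tripped this:
    check_change(["1.1.1.0/24"], ["test-location"])   # raises ValueError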
I'd been lazy and was only using Cloudflare's resolver recently. In hindsight I should probably just set up two instances of Unbound on my home network that don't rely on upstream resolvers and call it a day. It's unlikely both will go down at the same time, and if I'm having a total Internet outage (unlikely, as I have Comcast as primary + T-Mobile Home Internet as a backup), it doesn't matter whether DNS is resolving or not.
I used to configure 1.1.1.1 as primary and 8.8.8.8 as secondary, but noticed that Cloudflare was on aggregate quicker to respond to queries and changed everything to use 1.1.1.1 and 1.0.0.1. Perhaps I'll switch back to using 8.8.8.8 as secondary, though my understanding was that DNS clients round-robin between primary and secondary rather than using the secondary ONLY when the primary is down. Perhaps I am wrong though.
EDIT: Appears I was wrong; it is failover, not round-robin, between the primary and secondary DNS servers. Thus, using 1.1.1.1 and 8.8.8.8 makes sense.
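At least that's the classic glibc stub resolver behavior; other platforms differ. Roughly:

    # /etc/resolv.conf (glibc): servers are tried in listed order by default
    nameserver 1.1.1.1
    nameserver 8.8.8.8
    options timeout:2 attempts:2   # 2s per server per try, cycle the list up to 2 times
    # add "options rotate" if you actually want round-robin between them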
This is a good post mortem, but improvements only come with changes to processes. It seems every team at Cloudflare is approaching this in isolation, without central problem management. Every week we see a new Cloudflare global outage. It seems like the change management process is broken and needs to be looked at.
I never noticed the outage because my ISP hijacks all outbound UDP traffic to port 53 and redirects it to their own DNS servers so they can apply government-mandated censorship :)
> The way that Cloudflare manages service topologies has been refined over time and currently consist of a combination of a legacy and a strategic system that are synced.
This writing is just brilliant. Clear to technical and non-technical readers. Makes the in-progress migration sound way more exciting than it probably is!
> We are sorry for the disruption this incident caused for our customers. We are actively making these improvements to ensure improved stability moving forward and to prevent this problem from happening again.
This is about as good as you can get it from a company as serious and important as Cloudflare. Bravo to the writers and vetters for not watering this down.
> A configuration change was made for the same DLS service. The change attached a test location to the non-production service; this location itself was not live, but the change triggered a refresh of network configuration globally.
Say what now? A test triggered a global production change?
> Due to the earlier configuration error linking the 1.1.1.1 Resolver's IP addresses to our non-production service, those 1.1.1.1 IPs were inadvertently included when we changed how the non-production service was set up.
You have a process that allows some other service to just hoover up address routes already in use in production by a different service?
Oh, this explains a lot. I kept having random connection issues, and when I disabled AdGuard DNS (self-hosted) it started working, so I just assumed it was something with my VM.
> It’s worth noting that DoH (DNS-over-HTTPS) traffic remained relatively stable as most DoH users use the domain cloudflare-dns.com, configured manually or through their browser, to access the public DNS resolver, rather than by IP address.
I use their DNS over HTTPS and if I hadn't seen the issue being reported here, I wouldn't have caught it at all. However, this—along with a chain of past incidents (including a recent cascading service failure caused by a third-party outage)—led me to reduce my dependencies. I no longer use Cloudflare Tunnels or Cloudflare Access, replacing them with WireGuard and mTLS certificates. I still use their compute and storage, but for personal projects only.
Question: Years ago, back when I used to do networking, Cisco Wireless controllers used 1.1.1.1 internally. They seemed to literally blackhole any comms to that IP in my testing. I assume they changed this when 1.0.0.0/8 started routing on the Internet?
It is designed to be used in conjunction with 1.0.0.1. DNS has fault tolerance built in.
Did 1.0.0.1 go down too? If so, why were they on the same infrastructure?
This makes no sense to me. 8.8.8.8 also has 8.8.4.4. The whole point is that it can go down at any time and everything keeps working.
Shouldn’t the fix be to ensure that these are served out of completely independent silos and update all docs to make sure anyone using 1.1.1.1 also has 1.0.0.1 configured as a backup?
If I ran a service like this I would regularly do blackouts or brownouts on the primary to make sure that people’s resolvers are configured correctly. Nobody should be using a single IP as a point of failure for their internet access/browsing.
Interesting side effect: the Gluetun Docker image uses 1.1.1.1 for DNS resolution, so as a result of the outage Gluetun's health checks failed and the containers stopped.
If there were some way to view torrenting traffic, no doubt there'd be a 20 minute slump.
It's no surprise that Cloudflare is having a service issue again.
I use Cloudflare at work. Cloudflare has many bugs, and some technical decisions are absurd, such as the Workers cache.delete method, which only clears the cache contents in the data center where the Worker was invoked!
https://developers.cloudflare.com/workers/runtime-apis/cache...
In my experience, Cloudflare support is not helpful at all, trying to push the problem onto the user with responses like "just avoid holding it that way."
At work, I needed to use Cloudflare. At my next job I'll put a limit on my responsibilities: I won't work with Cloudflare.
I will never use Cloudflare at home and I don't recommend it to anyone.
Next week: A new post about how Cloudflare saved the web from a massive DDOS attack.
This was quite annoying for me, having switched my DNS server to 1.1.1.1 only about 3 weeks ago to get around a DNS outage at my ISP. Is reasonably stable DNS really so much to ask for these days?
Don't you normally have 2 DNS servers listed on any device? So was the second also down? If not, why didn't it fail over to that?
I guess now we should start using a completely different provider as a DNS backup. Maybe 8.8.8.8 or 9.9.9.9.
Maybe there is a noticeable difference?
I have seen more outage incident reports from Cloudflare than from Google, but this is just a personal anecdote.
Secondary DNS is supposed to be in an independent network to avoid precisely this.
But I do appreciate these types of detailed public incident reports and RCAs.
Not sure what the "advantage" of stub resolvers is in 2025 for anything.
Very frustrating.
I know.