Please Do Not Destroy The DNS In Order To Save It
Would that one character could really save the day here.
There’s a lot wrong here, the key fact being there are just so many ways around TTL, which itself was never designed to be a security technology in the first place. Gabriel’s trick addresses one particular scenario. It’s not at all enough. Consider:
First of all, you don’t actually know that a nameserver is ever going to provide you a record, or that that record is going to be cached. We’re seeing bugs in both conditions. For example, PowerDNS wasn’t providing responses on strange query types. CNN doesn’t reply at all to nonexistent names. So there may not be a TTL to bypass.
Secondly, the more major the site, the smaller the TTL. One of the issues described in my slides was that nothing prevents an attacker from replying multiple times to a single outbound query. Presume you can get 500 replies in before the real server does. Against a 16-bit transaction ID space, that gives you about a 1 in 131 chance of hijacking the record on each attempt. With Google Analytics' TTL at 300 seconds, that's about 5 hours on average. And you don't have to send 4 billion packets; you're still sending just a couple tens of thousands.
If Google Analytics gets taken, the web pretty much gets taken. Welcome to the power of <script src="http://www.google-analytics.com"> putting foreign code into DOMs around the world.
And it’s not like 300 is unusually low. Facebook’s at 30 seconds. That translates to about 30 minutes of security for Facebook — or their pizza’s free 🙂
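The arithmetic above can be sanity-checked with a naive model: 16-bit transaction IDs, 500 spoofed replies per race (the figure from the text), and one race per TTL expiry. This model is cruder than the attack described in my slides, since it assumes exactly one attempt per TTL window, so treat the times as order-of-magnitude estimates: hours, not years.

```python
# Naive model of TTL-limited cache poisoning. Assumes a 16-bit
# transaction ID space and exactly one poisoning race per TTL expiry.

TXID_SPACE = 65536  # 2**16 possible DNS transaction IDs

def hijack_odds(replies_per_race):
    """Chance that one of our spoofed replies matches the real TXID."""
    return replies_per_race / TXID_SPACE

def expected_hours(ttl_seconds, replies_per_race):
    """Expected wall-clock time to win: geometric trials, one per TTL."""
    expected_races = TXID_SPACE / replies_per_race
    return expected_races * ttl_seconds / 3600

print(f"per-race odds: 1 in {1 / hijack_odds(500):.0f}")  # ~1 in 131
print(f"TTL 300s: ~{expected_hours(300, 500):.0f} hours")  # ~11h in this model
print(f"TTL  30s: ~{expected_hours(30, 500):.1f} hours")   # ~1h in this model
```

Even this pessimistic-for-the-attacker model lands in the same ballpark: a 300-second TTL buys a major site hours of security, not years.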
But there are records that do have long TTLs, and that's where things get really dicey. The records with the longest TTLs in the world are all name server records. Google's NS records have TTLs of 345K seconds (about four days). Microsoft's NS records have TTLs of 143K seconds. Whether that's a good idea or a bad idea, it's reality. We allow in-bailiwick overwrite of cached NS records precisely because these very long-TTL records sometimes need to be overwritten anyway. When Gabriel writes:
What’s the downside to my patch ? I guess we are now holding an
authoritative server to the promise not to change the NS record for
the duration of the TTL, which is kinda what the TTL is for in the
first place 🙂
What he's saying is that Google and Microsoft should accept situations where their website is down for up to 95 hours (still too long). Now, granted, almost nobody's going to actually hold onto a cached record for that long. But a single point of failure causing up to a week of residual outage out in the field is a very bad thing. A one-character patch that caused such failures would be a serious problem indeed.
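To make that trade-off concrete, here's a toy cache sketch. The class, names, and logic are mine, purely illustrative of the behavior described above, and not any real resolver's code. With overwrite allowed, an in-bailiwick answer can replace a cached NS record before its TTL expires; with the patch's behavior, the resolver honors the old TTL, and a nameserver change strands clients until the record ages out:

```python
import time

class NSCache:
    """Toy resolver cache for NS records (illustrative only)."""

    def __init__(self, allow_overwrite=True):
        self.allow_overwrite = allow_overwrite  # False ~ the one-character patch
        self.records = {}  # zone -> (nameservers, expiry timestamp)

    def store(self, zone, nameservers, ttl, now=None):
        now = time.time() if now is None else now
        cached = self.records.get(zone)
        if cached and cached[1] > now and not self.allow_overwrite:
            return False  # patched behavior: honor the old TTL, refuse update
        self.records[zone] = (nameservers, now + ttl)
        return True

# An NS record cached with a Google-scale TTL (~345K seconds, ~4 days):
patched = NSCache(allow_overwrite=False)
patched.store("example.com.", ["ns-old.example.com."], ttl=345_000, now=0)

# The zone renumbers its nameservers a minute later; the in-bailiwick
# update is refused under the patch...
assert not patched.store("example.com.", ["ns-new.example.com."],
                         ttl=345_000, now=60)
# ...so resolvers keep chasing the dead servers until the old TTL runs out.
```

The overwrite path is the escape hatch that lets operators recover from stale long-TTL NS records; it is also the path the attacker races, which is why closing it with one character is not a free lunch.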
Now, all this being said, there's lots of interesting thinking going on out there, and one of the things we all fully expected was a healthy discussion of all the possible options on the table. Maybe there's a little more press than expected on one of those options, but I do think it's good that we can now all see just how careful we need to be in fixing this bug. There are a couple of approaches that are in fact converging on a safe and effective fix to the DNS, and I'll be writing about them soon. In the meantime, nobody should presume any easy fix will actually solve the problem.