Please Do Not Destroy The DNS In Order To Save It

So someone put together a “one character” patch to fix the “dns flaw”, and it hit Slashdot.

Would that one character could really save the day here.

There’s a lot wrong here, the key fact being there are just so many ways around TTL, which itself was never designed to be a security technology in the first place. Gabriel’s trick addresses one particular scenario. It’s not at all enough. Consider:

First of all, you don’t actually know that a nameserver is ever going to provide you a record, or that that record is going to be cached. We’re seeing bugs in both conditions. For example, PowerDNS wasn’t providing responses on strange query types. CNN doesn’t reply at all to nonexistent names. So there may not be a TTL to bypass.

Secondly, the more major the site, the smaller the TTL. One of the issues described in my slides was the fact that nothing prevents an attacker from replying multiple times to a single outbound query. Presume you can get 500 replies in before the real server does. Given that, you have about a 1 in 131 chance of hijacking the record each time the resolver sends a query. With Google Analytics’ TTL at 300 seconds, that’s about 5 hours on average — and you don’t have to send 4 billion packets, you’re still sending just a couple tens of thousands.
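
If you want to check that arithmetic, it runs like this (the 500 replies per window is the assumption; the “cover half the txid space on average” step is the usual rough heuristic):

    TXID_SPACE = 65536        # 16-bit transaction IDs: 64K possible values
    REPLIES_PER_WINDOW = 500  # assumed spoofed replies landed per query
    TTL = 300                 # Google Analytics' TTL, in seconds

    odds = TXID_SPACE / REPLIES_PER_WINDOW           # ~131 to 1 per window
    windows = (TXID_SPACE / 2) / REPLIES_PER_WINDOW  # ~65.5 windows on average
    hours = windows * TTL / 3600                     # ~5.5 hours
    packets = windows * REPLIES_PER_WINDOW           # ~33K packets

    print(f"1 in {odds:.0f} per window; ~{hours:.1f} hours; ~{packets:,.0f} packets")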

If Google Analytics gets taken, the web pretty much gets taken — welcome to the power of <script src="http://www.google-analytics.com"> putting foreign code into DOM’s around the world.

And it’s not like 300 is unusually low. Facebook’s at 30 seconds. That translates to about 30 minutes of security for Facebook — or their pizza’s free 🙂
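
You can peek at these TTLs yourself. A quick sketch, assuming the third-party dnspython library is installed (note that the value your resolver hands back is partway through its countdown):

    import dns.resolver  # pip install dnspython

    for name in ("www.google-analytics.com", "www.facebook.com"):
        answer = dns.resolver.resolve(name, "A")
        print(f"{name}: TTL = {answer.rrset.ttl} seconds")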

But there are records that do have long TTL’s, and that’s where things get really dicey. The records with the longest TTL’s in the world are all name server records. Google’s NS records have TTL’s at 345K seconds. Microsoft’s NS records have TTL’s at 143K seconds. Whether that’s a good idea or a bad idea, it’s reality. We allow in-bailiwick overwrite of cached NS records precisely because these very long TTL’d records sometimes need to be overwritten anyway. When Gabriel writes:

What’s the downside to my patch ? I guess we are now holding an
authoritative server to the promise not to change the NS record for
the duration of the TTL, which is kinda what the TTL is for in the
first place 🙂

What he’s saying is that Google and Microsoft should accept situations where their website is down for up to 95 hours (still too long). Now, granted, almost nobody’s going to actually hold onto a cached record for that long. But a single point of failure causing up to four days of residual outage out in the field is a very bad thing. A one character patch that caused such failures would be a serious problem indeed.

Now, all this being said, there’s lots of interesting thinking going on out there, and one of the things we all fully expected was a healthy discussion of all the possible options on the table. Maybe there’s a little more press than expected on one of those options, but I do think it’s good that we can now all see just how careful we need to be fixing this bug. There are a couple of approaches that are in fact converging on a safe and effective fix to the DNS, and I’ll be writing about them soon. In the meantime…nobody should presume any easy fix will actually solve the problem.

  1. Gabriel Somlo
    August 29, 2008 at 9:35 am

    > What he’s saying is that Google and Microsoft
    > should accept situations where their website
    > is down for up to 95 days. Now, granted,
    > almost nobody’s going to actually hold onto a
    > cached record for that long. But a single
    > point of failure causing up to a week of
    > residual outage out in the field is a very
    > bad thing. A one character patch that caused
    > such failures would be a serious problem indeed.

    Picking a TTL value is a compromise between how much you want to prevent people on the Internet from repeatedly asking you the same question, and how long you can tolerate being invisible in the event of a failure. Pick a small one, and your period of “invisibility” is small, but you get to answer the same question over and over, a lot. Pick a large one, and everyone holds on to your answers for longer, asking you questions less often — but if you go away, you’ll stay gone for that much longer.
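
    To put rough, illustrative numbers on that tradeoff:

        SECONDS_PER_DAY = 86400

        # per-resolver refresh load vs. worst-case invisibility, by TTL
        for ttl in (30, 300, 86400, 345600):
            refreshes = SECONDS_PER_DAY / ttl
            print(f"TTL {ttl:>6}s: ~{refreshes:>6.0f} re-queries/day per busy "
                  f"resolver, up to {ttl / 3600:.1f}h invisible on failure")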

    I guess all I’m saying is that if Microsoft or Google don’t want to be invisible for 4 days, they should simply pick a lower TTL. Or advertise multiple NS records pointing at geographically redundant servers. Or both. Not a big deal, IMHO…

  2. Silva
    August 29, 2008 at 9:49 am

    “PowerDNS wasn’t providing responses on strange query types. CNN doesn’t reply at all to nonexistent names. So there may not be a TTL to bypass.”

    Who is destroying DNS order here?

    “Presume you can get 500 replies..”

    That’s still less than infinity, which is the case without the patch. This means that even with the random src port it is still possible to poison the cache.

    “The records with the longest TTL’s in the world are all name server records. Google’s NS records have TTL’s at 345K seconds. Microsoft’s NS records have TTL’s at 143K seconds. Whether that’s a good idea or a bad idea, it’s reality.”

    It’s a reality that can be changed without problems.

    The question is: does the RFC say that a received authoritative nameserver record should prevail over the one in the cache?

  3. August 29, 2008 at 10:05 am

    As far as Google and MS’s long TTLs go, I think they should accept those long potential outages. After all, that’s what they’re claiming with those long TTLs: that it’s valid to keep referring to those records for that long. If they consider that unacceptable, the standard way of dealing with it is to lower the TTL to an acceptable value. As long as I’ve been familiar with DNS, the need to pick a TTL that balances the desire to keep the load low against the need to propagate changes acceptably quickly has been there.

    I still have to ask, though: why are additional records that don’t match the query even being looked at? I know why they were allowed, but do we truly need to look at any that aren’t either for the name we’re querying or glue A records for NS records in a delegation response? I can’t help but think that if non-matching additional records were simply ignored it’d eliminate a large portion of the attack surface. Am I missing some detail of this?
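
    For concreteness, the filter I have in mind would be something like this minimal sketch (real resolvers implement considerably more nuanced bailiwick rules):

        def in_bailiwick(name: str, zone: str) -> bool:
            """True if name is at or below the delegated zone."""
            name, zone = name.rstrip(".").lower(), zone.rstrip(".").lower()
            return name == zone or name.endswith("." + zone)

        def accept_additional(rec_name: str, query_name: str, zone: str) -> bool:
            # Keep an additional record only if it is the name we asked
            # about, or glue that sits inside the delegating zone.
            return (rec_name.rstrip(".").lower() == query_name.rstrip(".").lower()
                    or in_bailiwick(rec_name, zone))

        # Glue for the zone's own nameserver is fine; a stray A record
        # for someone else's name gets dropped:
        assert accept_additional("ns1.example.com", "www.example.com", "example.com")
        assert not accept_additional("www.google.com", "www.example.com", "example.com")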

  4. Travis
    August 29, 2008 at 11:25 am

    While I agree with what’s written, I would like to point out that 345,000 seconds is 95.8 hours, not 95 days. And since 345,600 seconds is exactly 4 days, I’m going to take a WAG and say that the actual timeout is likely 345,600 seconds.

  5. Steve Rhoton
    August 29, 2008 at 11:45 am

    > A one character patch that caused such failures
    > would be a serious problem indeed.

    Not quite sure how the number of characters plays into the seriousness of a problem that a patch may or may not cause, but perhaps the forthcoming patches mentioned in this rebuttal will be much more elegant and will contain many more characters, so that any problems these patches may or may not cause will be less serious.

  6. August 29, 2008 at 11:54 am

    As others have suggested, the onus is on the Googles and Microsofts of the world to fix their TTLs. If you don’t understand the implications of setting a really big TTL, you shouldn’t be a DNS administrator.

  7. August 29, 2008 at 12:55 pm

    The patch doesn’t fix everything in DNS, but IMO it prevents the glue record attack you found.
    From your answer I didn’t understand whether you are saying that the patch shouldn’t be applied because some software and configurations that don’t respect the protocol would remain vulnerable.

    A high value for an NS record is not a protocol flaw; it’s a configuration decision by the sysadmin.

    Even if your math isn’t correct, as someone has pointed out, you have to agree that the time window is very short (500 packets? is the real NS on some kind of dial-up connection?) compared to the infinite time without the patch.

    My conclusion is that you agree with the patch, but not as an isolated solution.

  8. August 29, 2008 at 2:15 pm

    Goncalo–

    Oh, I found lots of attacks. I just decided to spend my talk time showing why we needed comprehensive fixes, rather than spelling out each nasty variation.

    Ultimately, DNS requires the ability for in-bailiwick names to trigger a TTL bypass, for reliability purposes. A fix that breaks reliability will not be deployed, no matter what I say. The network must stay up. There are really good solutions in development that don’t screw this up, so I’m supporting that rather than this partial fix which doesn’t even cover Google Analytics.

    TTL’s are dead. We can do better.

  9. August 29, 2008 at 2:27 pm

    Btw, I do not have a math degree, but even in that scenario (500 packets and TTL 300, making the 5-hour _average_ to send all 64K), you are not trying to guess the same number: the txid and src port are also being changed every 300 seconds. So I wonder if this doesn’t change the probability 😉

  10. August 29, 2008 at 2:31 pm

    Goncalo–

    The point is that the safety is coming from port randomization, not from preventing RRset override.

    I do think there’s some potential merit in evaluating when and how we override RRsets, but we can’t just ignore the reliability issues and go “full steam ahead”. People just won’t deploy the patches.
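
    And as for whether changing the txid every window changes the math: it doesn’t. Each window is an independent 500-in-65,536 trial either way, as a quick simulation (using the same assumed numbers as the post) shows:

        import random

        TXID_SPACE = 65536   # 16-bit transaction ID space
        GUESSES = 500        # spoofed replies landed per TTL window
        WINDOWS = 65         # ~5.4 hours of 300-second windows
        TRIALS = 20_000

        def poisoned(rerandomize: bool) -> bool:
            """One attack run: did any window's guesses hit the live txid?"""
            txid = random.randrange(TXID_SPACE)
            for _ in range(WINDOWS):
                if rerandomize:
                    txid = random.randrange(TXID_SPACE)  # fresh txid per query
                # model the 500 distinct guesses as a random contiguous block
                start = random.randrange(TXID_SPACE)
                if (txid - start) % TXID_SPACE < GUESSES:
                    return True
            return False

        for mode in (False, True):
            rate = sum(poisoned(mode) for _ in range(TRIALS)) / TRIALS
            print(f"resolver re-randomizes txid: {mode} -> success rate {rate:.2f}")
        # both land around 0.39, i.e. 1 - (1 - 500/65536)**65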

  11. hcf
    August 29, 2008 at 2:44 pm

    Dan, micro correction:

    If all of the servers named by google’s NS records were down, then the in-bailiwick overwrite would do no good. You need one working server to reply in order to get the new NS records out.

    The analysis is still the same: “95 hours / 2” (statistically, 1/2 of the recursive caches would be 1/2 of the way through the TTL at the time of any failure) is a long time to run in a degraded fashion, with one nameserver taking all queries from all the recursive resolvers that have it cached… and dropping queries left and right.

    If you turn the TTL into a security extension, as the patch supposes:

    – You remove this tool from DNS operators.

    – You place a larger burden on the root nameservers (even though your DNS server can reliably talk to example.com’s authoritative servers, you refuse to accept in-bailiwick extensions of their TTL, and the records all TTL out at the same time… so you wind up querying for the root/TLD delegations all over again). This is an everyday thing, not part of an attack or corner case, since most people get their glue NS records all in one additional section with the same TTL.

    – You gain maybe 6 months before being back in the same situation.

    The existing TTL interpretation is the right one. Skip the patch and go read up on DNSSEC.

  12. August 29, 2008 at 2:46 pm

    Dan, even with only the 16 bits of txid changing, preventing RRset override made the probability of success very low.
    But I agree with you: reliability is an important issue, and if you folks are aware of better solutions, go ahead 🙂

  13. Brose
    August 29, 2008 at 3:04 pm

    > Gabriel’s trick addresses one particular
    > scenario. It’s not at all enough.

    The problem is that none of the options currently available to protect us from this flaw are enough. Gabriel’s one character trick may be just that, but we all know that the source port randomization patch we all applied doesn’t buy us much more than time in the grand scheme of things.

    Gabriel’s patch may not fundamentally fix the flaw, and I believe there are other, more accepted fixes in the works, but this patch gives DNS admins an additional layer of defense (even if just a temporary one). It also gives DNS admins the choice between applying an additional protection or worrying about other DNS admins’ TTL-related decisions.

    > What he’s saying is that Google and Microsoft
    > should accept situations where their website is
    > down for up to 95 hours (still too long).

    In my opinion, it’s not so much that Google and Microsoft *should* accept these situations as it is that they *have* accepted these situations by implementing the TTLs they implemented for their authoritative records.

    And while I don’t know much of anything about their specific infrastructure implementations, I have to believe that they are pretty robust, geographically diverse, and unchanging (I know I’d want my infrastructure to be all of those things to hand out large authoritative TTLs).

  14. August 29, 2008 at 3:17 pm

    TTL should be obsoleted. This isn’t 1988; there should be a more elegant way to manage things.

    I wish I understood more of how BIND works. However, DNS and BIND are not the real problems here. People’s resistance to change is the problem.

    Fix the TTL issue. If idiot admins make poor decisions with their TTL’s, more fool them. Either educate them, or force the change.

  15. TotalNoob
    August 30, 2008 at 3:19 pm

    Can you not introduce a random choice of how LONG to keep the TTL’s? If it is a choice between being as little vulnerable as possible (but having to answer lots of the same queries) OR being quite vulnerable (but not having to re-do the queries), then why not switch randomly between a long and a short interval, in such a way that STATISTICALLY you have the best of both? It won’t solve the problem, but it will certainly make the life of the attacker much more difficult…

    Now tell me to go away 😉

  16. August 30, 2008 at 6:18 pm

    Noob,

    There’s been discussion about that; specifically, the idea is to have TTL only actually last some random value between TTL/2 and full TTL. Then, theoretically, the attacker doesn’t know the exact second the value will expire.
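
    In sketch form (illustrative only, not code from any shipping resolver):

        import random

        def effective_ttl(ttl: float) -> float:
            """Cache a record for a random duration between TTL/2 and the
            full TTL, so the expiry moment is unpredictable."""
            return random.uniform(ttl / 2, ttl)

        # a 300-second record would actually be evicted somewhere
        # between 150 and 300 seconds after caching
        print(effective_ttl(300))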

    However, the attacker can often flood the resolver with DNS queries with the RD bit set to 0, so they know the moment the record leaves the cache.

    The reality is that none of it is enough. It might be a nice thing to add to a further comprehensive fix, but as long as I know I’m going to get a chance every couple of minutes, I’m going to win a 1/65K race with a relatively small amount of traffic.

    –Dan

  17. August 30, 2008 at 6:18 pm

    Stefan–

    What we’re doing is decommissioning TTL as a security technology, since, you know, it never was one in the first place 🙂

  18. August 30, 2008 at 6:19 pm

    Brose,

    Nobody’s deploying anything that might lead to services going down. Not happening, not even in the range of consideration.

  19. Tels
    September 1, 2008 at 5:40 pm

    Would it be feasible to deploy some kind of TTL override locking mechanism in DNS implementations, of the sort that would disable bailiwick overwrites for a given domain if more than x non-existent sub-domains were looked up?

    My understanding is that the initial spoofing relies on sending massive numbers of requests for non-existent subdomains; this is suspicious behavior, and could be a trigger for the server to not accept cached record overwrites for a pre-configured time.

    Notwithstanding performance issues, can you see any other problems with this approach?
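
    Something like this, perhaps (a rough sketch with made-up thresholds):

        import time
        from collections import defaultdict

        NXDOMAIN_LIMIT = 50   # made-up threshold: suspicious lookups per window
        WINDOW_SECONDS = 60
        LOCK_SECONDS = 3600   # refuse cached-record overwrites for an hour

        nxdomains = defaultdict(list)  # zone -> timestamps of NXDOMAIN answers
        locked_until = {}              # zone -> time at which overwrites unlock

        def record_nxdomain(zone: str) -> None:
            now = time.time()
            # keep only this window's hits, then check the threshold
            nxdomains[zone] = [t for t in nxdomains[zone] if now - t < WINDOW_SECONDS]
            nxdomains[zone].append(now)
            if len(nxdomains[zone]) > NXDOMAIN_LIMIT:
                locked_until[zone] = now + LOCK_SECONDS

        def overwrite_allowed(zone: str) -> bool:
            return time.time() >= locked_until.get(zone, 0)

        # e.g. after 51 NXDOMAINs inside a minute, the zone locks:
        for _ in range(51):
            record_nxdomain("example.com")
        print(overwrite_allowed("example.com"))  # -> False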

  20. Malcolm Parsons
    September 3, 2008 at 5:14 am

    > DOM’s
    > TTL’s

    No apostrophes are needed here.

    http://answers.google.com/answers/threadview?id=499296

  21. Brose
    September 3, 2008 at 11:23 am

    Dan,

    I think there was a disconnect between my last comment and your related reply. My comment revolved around DNS administration practices, and I think your response revolved more on the code fixes in the works. I’m not a developer, I don’t audit code for vulnerabilities or fixes, and I don’t read DNS-related RFCs when I’m at the beach, but I have administered authoritative DNS servers for a long time now. I read your comments about Microsoft and Google going down for long periods of time because of this patch, and I either don’t buy it or, like I said before, I’m missing something:

    Why, as a DNS admin, would I ever need to change an NS record, let alone all of them at the same time? Between being able to rebuild a dead server, publish one IP for multiple backend servers (e.g. Anycast, hardware load balancing, etc.), and publish multiple NS records for a zone (geographically distributed, highly available, etc.), I’m missing the real-world application for needing to quickly/frequently push NS record changes to other caching servers.

    Given the massively redundant infrastructures that are likely at play (this goes back to my aforementioned comment), I can’t think of any good reason for an organization like Microsoft or Google (to keep your example going) to need to change ALL of their NS records at the same time. Only this unrealistic situation could cause the type of outage you described in your first post.

  22. September 3, 2008 at 11:51 am

    Brose,

    The problem is you’re assuming best case scenarios, and missing that there’s weird bizarre mojo floating around the DNS because of content distribution networks. Never underestimate just how weird Akamai and Limelight are.

    The blocking change here is the alteration of DNS semantics. The constraint is that the same things must resolve after a fix as resolved before. Changing the semantics of DNS has consequences that are impossible to predict. People will simply not deploy that fix, even if 99% of the time it’d be fine.

  23. September 5, 2008 at 7:21 am

    Now, what is also needed is a true VPN fix as well. Hopefully you, Dan K., are up to this challenge. Even with a fully updated XP Pro SP2 machine behind a hardware router, with the associated security and safety protections, hackers broke into my fully updated computer in September 2007, after the APS domain was compromised in the middle of 2007. However, the most clever hackers could only cause a denial-of-service error on Windows 98 Second Edition, because of its internal safety and the comparative vulnerabilities of 98 Second Edition and XP Professional as seen on secunia.com, which Chris Quirke, MVP of southern Africa, always talks about. My question for you is: how do we as individuals, corporations, and governments safeguard and secure individual computers if domains are not properly safeguarded and secured, and are only protected by dumb default settings?
