Serve-Stale Update

The Day the Internet Broke

The successful massive DDoS attack on Dyn during the fall of 2016 caused some popular sites - including Amazon, Netflix, and Twitter - to be unreachable for hours, because Dyn was unable to answer authoritative DNS queries. That attack shocked a lot of people on the Internet, because it was such a massive DDoS and it successfully brought down a well-managed DNS service. At the time, a lot of DNS administrators clamored for a “solution.” Some of BIND 9’s users hoped that a feature that allowed BIND to continue serving stale content could help them ride out another prolonged successful DDoS attempt against a significant authoritative provider.

The serve-stale feature was added in the BIND 9.11.4-S subscriber edition in 2018, and was then included in the open source version as of BIND 9.12.0. At the time, there was no Internet standard for how to serve stale answers, but there were a few early draft proposals, including https://tools.ietf.org/html/draft-wkumari-dnsop-ttl-stretching-00, followed by https://tools.ietf.org/html/draft-tale-dnsop-serve-stale-02.

RFC 8767 was eventually published in March 2020. This article explains the differences between the initial BIND 9 implementation and the updated implementation in BIND 9.17.7 and 9.16.9, which followed the publication of RFC 8767.

Original Serve-Stale Options

In mid-2018, when ISC released our first implementation in BIND 9, there were two deployed implementations that were discussed at the IETF: one from NLnet Labs’ Unbound, and one from Akamai.

The Unbound Method

Find an answer in cache. Then:

Regardless of whether the answer has expired, serve it (if the answer has expired, serve it with TTL=0).
If the answer rdataset has expired or is expiring, start a prefetch for it.
If no answer is found, perform recursion.

This method prioritizes fast answers from cache, as explained in the Unbound blog post listed in the References.

The Akamai Method

Find an answer in cache. Then:

If the answer has not expired, serve it, starting a prefetch for it if it is expiring.
If the answer has expired, start a fetch for it. If the fetch is taking too long, serve the stale answer while the fetch continues to run to timeout.
If no answer is found, perform recursion.

In essence, if a record in the cache has expired, the Unbound method serves the stale record (with a TTL of 0), while at the same time initiating a fetch for it. The Akamai method issues the fetch first, and only serves the stale record if no answer is received. Akamai had been using this implementation internally and found it helped them during the original attack on Dyn.
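The contrast between the two methods can be sketched in a few lines of Python. This is an illustrative model only, not code from Unbound, Akamai, or BIND: the names `CacheEntry`, `unbound_method`, `akamai_method`, and `PREFETCH_WINDOW` are hypothetical, a fetch that returns `None` stands in for "the fetch is taking too long", and TTL rewriting (such as Unbound's TTL=0 on expired answers) is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

PREFETCH_WINDOW = 2.0  # hypothetical: seconds before expiry that count as "expiring"


@dataclass
class CacheEntry:
    answer: str
    expires_at: float  # absolute time at which the published TTL runs out


# A fetch returns the fresh answer, or None if the authority did not reply in time.
Fetch = Callable[[], Optional[str]]
# (answer sent to the client, list of actions taken)
Result = Tuple[Optional[str], List[str]]


def unbound_method(entry: Optional[CacheEntry], now: float, fetch: Fetch) -> Result:
    """Answer from cache first; any refresh happens behind the client's back."""
    if entry is None:
        return fetch(), ["recurse"]
    actions = ["serve-cached"]
    if entry.expires_at - now <= PREFETCH_WINDOW:
        # Expired or expiring: the client is not kept waiting for the refresh.
        actions.append("background-prefetch")
    return entry.answer, actions


def akamai_method(entry: Optional[CacheEntry], now: float, fetch: Fetch) -> Result:
    """Try the authority first; fall back to the stale answer only if that fails."""
    if entry is None:
        return fetch(), ["recurse"]
    remaining = entry.expires_at - now
    if remaining > 0:
        actions = ["serve-cached"]
        if remaining <= PREFETCH_WINDOW:
            actions.append("background-prefetch")
        return entry.answer, actions
    fresh = fetch()  # expired: the client waits for this fetch
    if fresh is not None:
        return fresh, ["fetch", "serve-fresh"]
    return entry.answer, ["fetch", "serve-stale"]
```

With an expired entry and an unreachable authority, both functions return the same stale answer; the difference is that `akamai_method` only does so after the fetch has been tried, which is the latency cost discussed later in this article.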

ISC discussed both options with some BIND users who were asking for this feature, and the preference at that time was for the Akamai method. Because Akamai had applied for patent protection on their method, and they had already implemented their algorithm in BIND, we asked them if they would donate their patch to ISC. After some internal review at Akamai, they generously contributed the implementation they had been using in-house, which was based on an earlier release of BIND, to ISC.

Initial BIND Implementation

There were three options provided with BIND 9’s initial implementation of serve-stale, with a fourth (stale-cache-enable) added more recently:

stale-answer-enable: If yes, enables the return of “stale” cached answers when the name servers for a zone are not answering. The default is not to return stale answers. (Answering from stale cache can also be enabled and disabled dynamically by the BIND server administrator via rndc serve-stale on|off.)
max-stale-ttl: If stale cache is enabled, max-stale-ttl sets the maximum time for which the server retains records past their normal expiry, so that they can be returned as stale records when the servers for those records are not reachable. The default is 12 hours (initially implemented as one week but reduced in the June 2020 BIND maintenance releases).
stale-answer-ttl: This specifies the TTL to be returned on stale answers. The default was originally set at 30 seconds. As of the 9.16.6 and 9.11.22-S1 update, it defaults to one second.

By the time this feature was available in the stable open source version of BIND, the industry had adapted to the threat of massive DDoS by diversifying. Since the 2016 Dyn outage, many authoritative providers have contracted for a strong secondary presence. Perhaps because of that trend, the Internet has not seen another authoritative DNS outage on that scale since then. So the original goal of preserving answers in case of an hours-long outage is no longer as important as it once was.

Recently, ISC has received some complaints that our serve-stale implementation is not efficient in production. Every client that asks for a record that is stale but still available in cache waits for a lengthy timeout, as BIND re-queries the authority before sending the stale answer. That was part of the original design, which was to serve as a last resort in case of a lengthy outage, but it provides a slower response.

The timeout BIND uses is based on an option called resolver-query-timeout. The default value of this timeout is 10 seconds, but it can be configured from 301 milliseconds to 30 seconds. Although 10 seconds is a very long time for most browser clients to wait for a response, we don’t recommend reducing resolver-query-timeout below 10 seconds in most operational environments, as this is known to cause a higher rate of SERVFAIL responses to clients, because the resolver lacks the time to finish loading complex answers into cache.

Another observation was that we had not provided operators with an option to disable the stale cache; in response to this we added the following parameter in 9.16.6 and 9.11.22-S1:

stale-cache-enable: If yes (the default), enables the retention of expired cache records so that they are available to be returned from cache if either stale-answer-enable is set to yes, or is switched on later using rndc serve-stale on.

At the same time, we made another tweak. As of 9.16.6 and 9.11.22-S1, answers that are received with TTL=0 are ineligible for serve-stale.

Revised BIND Implementation Prioritizes Faster Responses

As of the releases of BIND 9.17.7 and 9.16.9 in November 2020, we have revised our implementation more significantly to prioritize faster responses. BIND now replies with the stale answer in cache immediately if an attempt to refresh the RRset has previously failed, and continues to provide the stale answer for an amount of time specified by stale-refresh-time. After that stale-refresh-time has expired, the stale answer is regarded as “unusable” and is not served. Specifically:

stale-refresh-time: The period of time that BIND serves a stale answer. The default stale-refresh-time is 30 seconds, as RFC 8767 recommends. A value of zero disables the feature, meaning that normal resolution takes place first, and named returns “stale” cached answers only if that fails. (A value of zero results in the same behavior as the original BIND serve-stale feature.)
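A minimal Python sketch of the window logic follows, for an RRset that is already stale. It is a model for illustration, not BIND source code: the class `StaleRRset` and the result labels are hypothetical, a `refresh` callable that returns `None` stands in for a refresh that ran to the resolver query timeout, and the sketch assumes that once the window has lapsed the next query triggers a new refresh attempt.

```python
from typing import Callable, Optional, Tuple

STALE_REFRESH_TIME = 30.0  # seconds; the default named in the text


class StaleRRset:
    """A cached RRset that is past its TTL but still within max-stale-ttl."""

    def __init__(self, answer: str) -> None:
        self.answer = answer
        self.last_refresh_failure: Optional[float] = None

    def resolve(
        self,
        now: float,
        refresh: Callable[[], Optional[str]],
        stale_refresh_time: float = STALE_REFRESH_TIME,
    ) -> Tuple[str, str]:
        in_window = (
            stale_refresh_time > 0  # zero disables the window entirely
            and self.last_refresh_failure is not None
            and now - self.last_refresh_failure < stale_refresh_time
        )
        if in_window:
            # A refresh failed recently: answer from stale cache at once.
            return self.answer, "stale-immediate"
        fresh = refresh()  # the client waits while resolution is attempted
        if fresh is not None:
            self.answer = fresh
            self.last_refresh_failure = None
            return fresh, "fresh"
        # The refresh failed: remember when, so that later queries skip the wait.
        self.last_refresh_failure = now
        return self.answer, "stale-after-timeout"
```

In this model only the query that discovers the failure pays for the timeout; queries arriving within the next `stale_refresh_time` seconds are answered immediately. Passing `stale_refresh_time=0` makes every query try resolution first, which corresponds to the original behavior described above.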

This enhancement speeds up responses for nearly all of the users that are in need of a stale answer: the very first user that queries for a record that has just become unavailable from the authority will still have to wait for the query timeout, but all the subsequent users will get the stale answer from cache. There is another option we would like to add, stale-answer-client-timeout (described below), which will help this first user. The “first user” may also be impacted by fetch limits (more on this topic below).

Compliance with RFC 8767

These recent changes do not make the BIND 9 implementation completely compliant with RFC 8767; we have two relatively small changes left to implement:

Add stale-answer-client-timeout, which is the maximum amount of time a recursive resolver should allow between the receipt of a resolution request and the sending of its response (only to be used if stale-answer-enable is set). https://gitlab.isc.org/isc-projects/bind9/-/issues/2247
Update the defaults to the RFC 8767 recommended values. https://gitlab.isc.org/isc-projects/bind9/-/issues/2248

| Parameter | Current default | RFC 8767 recommendation |
|---|---|---|
| `stale-answer-ttl` | 1 second | 30 seconds |
| `max-stale-ttl` | 12 hours | 1-3 days |
| `stale-refresh-time` | 30 seconds | 30 seconds or higher |

Rate-Limiting and Serve-Stale

Unfortunately, cache cleaning and cache maintenance are very complex topics. We have a more detailed Knowledgebase article on the BIND implementation of serve-stale and its interactions with other BIND features, but here are the most important points.

How does serve-stale interact with fetch-limits?

Fetch-limits allow you to rate-limit the number of requests for a specific zone or to a specific server. Fetch-limits were implemented as a mitigation for the pseudo-random subdomain DDoS attack. The reason we are concerned about fetch-limits is that the primary use case for serve-stale is when the authoritative server is unavailable due to a successful DDoS.

When a query is dropped due to fetch-limits, BIND checks whether there is stale data it could send instead, before sending SERVFAIL or dropping the query (depending on how fetch-limits is configured). A query dropped due to fetch-limits does not activate stale-refresh-time, because this is not considered a real failure to contact the name servers while attempting to refresh the given RRset.

The fetch-limits implementation does not block all requests. Some will succeed in bypassing the rate-limiting process. As soon as a refresh has been attempted (and failed), the stale-refresh-time window will be activated. The reduction in client queries due to serving of stale data should also help to increase the likelihood that subsequent refresh attempts will not be blocked by fetch-limits.
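The interaction described above can be summarized as a small decision function. This is a hedged sketch rather than BIND's actual code path: `Outcome`, `handle_expired`, and the `limit_action` parameter are hypothetical names, and the outcome of the refresh is passed in as a flag instead of being performed.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    response: str                # "fresh", "stale", "SERVFAIL" or "DROP"
    starts_refresh_window: bool  # whether stale-refresh-time is activated


def handle_expired(
    has_stale: bool,
    blocked_by_fetch_limits: bool,
    refresh_succeeded: bool,
    limit_action: str = "SERVFAIL",  # or "DROP", depending on configuration
) -> Outcome:
    """Decide the response for a query whose cached RRset is past its TTL."""
    if blocked_by_fetch_limits:
        # No refresh was attempted, so this is not treated as a real failure:
        # stale data is used if present, but the refresh window is not started.
        return Outcome("stale" if has_stale else limit_action, False)
    if refresh_succeeded:
        return Outcome("fresh", False)
    # A refresh was attempted and failed. The window only matters when there
    # is stale data to serve during it.
    return Outcome("stale" if has_stale else "SERVFAIL", has_stale)
```

For example, `handle_expired(True, True, False)` models a query that is blocked by fetch-limits while stale data exists: the client gets the stale answer, but the window stays inactive until a query gets past the limits and its refresh actually fails.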

Serve-stale and prefetch

Prefetch, implemented first in BIND 9.10, is a technique for refreshing the cached information for popular data, even without a pending query. The theory is, if this is information that is frequently requested, BIND can anticipate that it will be needed again soon.

In the case of prefetch, the client request prior to the RRset expiry initiates an early refresh of the cache content. Client queries received during the period that stale-refresh-time is active, however, do not initiate an early attempt to refresh the stale RRset.

The following section was updated in March 2021 to reflect changes made in BIND 9.17.11, 9.16.13, and 9.16.13-S1.

Serve-stale and negative answers

If there is a stale NXDOMAIN or NXRRSET in cache, BIND returns it only if the resolver query times out (stale negative data is not returned on stale-answer-client-timeout). Although stale-answer-client-timeout is not used to provide an early response to clients from negative stale cache RRsets, once a refresh of these RRs has timed out, the client receives the negative stale cached answer, and stale-refresh-time starts so that subsequent client queries receive the negative stale response immediately.

RRset Aging

With serve-stale, BIND now has four stages in the aging of an RRset:

“active” = within the published TTL. This is a fresh product, analogous to yogurt with a “sell by” date in the future.
“expired” = past the published TTL. This is analogous to yogurt that is past its “sell by” date, but may still be edible. (“expired” includes both “stale” and “ancient” RRsets.)
“stale” = “expired,” but not by more than max-stale-ttl seconds. This is analogous to yogurt that is past its “sell by” date, but may still be edible because it is not yet past its “use by” date.
“ancient” = “expired” by more than max-stale-ttl seconds, which also means that it’s ready to be removed from the cache as soon as the references, locks, and opportunity allow. This is analogous to spoiled yogurt that is inedible, but has not yet been thrown away. On servers that do not have stale-cache-enable yes;, all “expired” cache content is “ancient” (the yogurt’s “sell by” and “use by” dates are the same).
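The four stages reduce to a comparison against two thresholds, as the following Python sketch shows. The function names are hypothetical and the code is not taken from BIND; the default of 12 hours for `max_stale_ttl` is the current default mentioned earlier, and treating an RRset at exactly its TTL as still “active” is an assumption of this sketch.

```python
def is_expired(seconds_past_ttl: float) -> bool:
    """ "expired" covers both "stale" and "ancient" RRsets."""
    return seconds_past_ttl > 0


def rrset_stage(
    seconds_past_ttl: float,
    max_stale_ttl: float = 12 * 3600,
    stale_cache_enable: bool = True,
) -> str:
    """Classify an RRset by how far it is past its published TTL.

    A negative seconds_past_ttl means the TTL has not yet run out.
    """
    if not is_expired(seconds_past_ttl):
        return "active"
    if stale_cache_enable and seconds_past_ttl <= max_stale_ttl:
        return "stale"
    # Past max-stale-ttl, or stale cache retention is switched off:
    # the RRset is only waiting to be cleaned out of the cache.
    return "ancient"
```

With `stale_cache_enable=False`, the function never returns “stale”, which matches the case where the “sell by” and “use by” dates coincide.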

Real-Network Scenarios

While we were discussing the design of serve-stale, Cathy Almond, our Support team lead, raised questions about interactions with rate-limiting under two scenarios.

Scenario 1: This resolver is innocent and is sending only queries for “good” names, but because the authoritative server/zone is under attack, it is not reliable. If there are RRsets available, does BIND respond from stale cache instead of SERVFAIL or DROP?

In this case, the revised implementation performs better than the prior implementation for almost all users. BIND sends any relevant answers from cache, but may not attempt to refresh the data if the server or zone is subject to fetch-limits. The fact that fetch-limits are active is a clear indication that there is a problem with getting answers for queries from these servers or from this zone.

Scenario 2: This resolver is participating in the attack via its compromised botnet clients. Note that in a Pseudo-Random Subdomain (PRSD) attack, the resolver receives queries for a series of apparently random, unique names. Because these are generated, nonexistent names, the response should be NXDOMAIN. This is the scenario that fetch-limits was implemented for, and fetch-limits should apply here.

In this case, BIND sends SERVFAIL if a new query is attempted and fails. If fetch-limits are triggered, BIND responds with SERVFAIL or drops the query, depending on the configuration. There is unlikely to be an eligible stale answer in cache to serve instead (but if there were one, it would be used). This is exactly the behavior desired.

References

Wikipedia article on Dyn attack: https://en.wikipedia.org/wiki/2016_Dyn_cyberattack
Original serve-stale draft: https://tools.ietf.org/html/draft-wkumari-dnsop-ttl-stretching-00
Second serve-stale draft: https://tools.ietf.org/html/draft-tale-dnsop-serve-stale-02
RFC 8767: https://tools.ietf.org/html/rfc8767
Blog on Unbound implementation: https://medium.com/nlnetlabs/some-country-for-old-men-7b9add7820c9
ISC Knowledgebase article with further BIND implementation details: https://kb.isc.org/v1/docs/serve-stale-implementation-details

AUTHOR

Posted by: Victoria Risk

PUBLISHED

24 Nov 2020