Monday, December 12, 2011

DNS Negative caching

In the past I have frequently run into delays in seeing new DNS records across a large DNS environment.  For example, if we put a new server in with a HOST record, it may take an hour or more before it can be seen throughout an enterprise or by external entities and customers.  Several factors can cause the delay.  First of all, if you have Primary/Secondary servers, there can be delays in zone transfers of the new record.  In Active Directory environments, you have AD replication delays.  In other environments, systems may be specifically configured for negative lookup caching, or act this way by default.

In this post, I will focus on Negative Caching.  You may ask, what is that?  Simply put, if a system that does negative caching performs a DNS lookup for a record and gets no result, it remembers the failed lookup and holds on to it for a period of time.  This prevents the system from trying to look up the record again until the timeout has passed and the negative cache entry is flushed.  You can view the cache in Windows with ipconfig /displaydns, and a negative entry looks like this:
   elwood.bobscountrybunker.com
   ----------------------------------------
   Name does not exist.
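
To make the idea concrete, here is a minimal sketch of a negative cache in Python.  It is not how Windows or BIND implement it; the class name, the fixed 10 minute timeout, the simulated clock and the second host name (jake) are all assumptions made for illustration.  The point it shows is that every failed name gets its own countdown.

```python
import time


class NegativeCache:
    """Remembers failed lookups, each with its own expiry time (illustrative only)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._expiry = {}  # name -> absolute time at which the entry is flushed

    def record_failure(self, name):
        # Start a fresh countdown for this name only.
        self._expiry[name.lower()] = self.clock() + self.ttl_seconds

    def remaining(self, name):
        """Seconds left before the name would be looked up again (0 if not cached)."""
        key = name.lower()
        expiry = self._expiry.get(key)
        if expiry is None:
            return 0
        left = expiry - self.clock()
        if left <= 0:
            del self._expiry[key]  # timeout reached, flush the negative entry
            return 0
        return left

    def is_negatively_cached(self, name):
        return self.remaining(name) > 0


# Simulated clock so the example runs instantly.
now = [0.0]
cache = NegativeCache(ttl_seconds=600, clock=lambda: now[0])

cache.record_failure("elwood.bobscountrybunker.com")
now[0] = 180.0  # three minutes later
cache.record_failure("jake.bobscountrybunker.com")  # hypothetical second name

print(cache.remaining("elwood.bobscountrybunker.com"))  # 420.0
print(cache.remaining("jake.bobscountrybunker.com"))    # 600.0
```

While remaining() is above zero, a resolver built this way would answer "name does not exist" from memory instead of asking upstream again.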

In Windows clients, there are registry settings for the DNS Client service to do this, but the Windows DNS server does not have it.  BIND servers will do it when acting as an intermediate DNS system in the lookup process.  So if we have a client machine doing a lookup for elwood.bobscountrybunker.com and it is sending this lookup to 8.8.8.8 (Google DNS), this server tries to find a record at the authoritative holder of the zone bobscountrybunker.com.  If no result is found, the 8.8.8.8 server will negatively cache this failure for a specified period of time.

The amount of time the server will cache it comes from the SOA record of the bobscountrybunker.com zone.  The last value of an SOA record (the minimum TTL) is used for the negative cache time period; under RFC 2308, a resolver uses whichever is lower, that value or the TTL of the SOA record itself.  Each lookup has its own timeout on a per record basis.  So if the TTL is 10 minutes, our original lookup counts down in cache from 10 minutes.  If we look up another record a few minutes later and it also fails, this new lookup starts at 10 minutes while our original lookup may be down to 7 minutes.

This factor is important when thinking about how fast you want new records to be seen, and also how long DNS lookup failures will cause unavailability.  At the same time, you don't want the value so low that it increases load on your DNS infrastructure.  If you are troubleshooting these issues from a client perspective, you can use nslookup to see the remaining timeout of a particular record, which shows you the delay in some intermediate DNS system.  For example:

nslookup -type=a -nosearch -d2 elwood.bobscountrybunker.com 8.8.8.8

This does a lookup for HOST (A) records with no DNS suffix search, runs in debug mode, and points the DNS query at 8.8.8.8.  The debug mode shows extra information on the lookup.  At the end of the output, look at the authority record:

Got answer (104 bytes):
    HEADER:
        opcode = QUERY, id = 2, rcode = NXDOMAIN
        header flags:  response, want recursion, recursion avail.
        questions = 1,  answers = 0,  authority records = 1,  additional = 0

    QUESTIONS:
        elwood.bobscountrybunker.com, type = A, class = IN
    AUTHORITY RECORDS:
    ->  bobscountrybunker.com
        type = SOA, class = IN, dlen = 46
        ttl = 1754 (29 mins 14 secs)
        primary name server = ns0.phase8.net
        responsible mail addr = support.phase8.net
        serial  = 2009042001
        refresh = 28800 (8 hours)
        retry   = 3600 (1 hour)
        expire  = 604800 (7 days)
        default TTL = 86400 (1 day)

Here the ttl line shows how long the negative entry has left in cache on that server (29 minutes 14 seconds in this run).  If you keep running the command, you can watch this value decrease.  The same idea works for viewing valid records that are cached.  Individual records have their own TTL (not always the same as the zone SOA record), which tells you how long they will be cached.  So if you changed a record and want to know why your DNS queries are not returning the new data, you can use the same methodology to track it down.
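
If you want to track that countdown without reading the debug output by eye, something like the following Python sketch could pull the number out.  It assumes the output has the same layout as the nslookup run above (an AUTHORITY RECORDS section with a "ttl = N" line); the function name is made up for this example, and the sample text is a trimmed copy of that output.

```python
import re

# Trimmed copy of the nslookup -d2 output shown above.
SAMPLE = """\
    AUTHORITY RECORDS:
    ->  bobscountrybunker.com
        type = SOA, class = IN, dlen = 46
        ttl = 1754 (29 mins 14 secs)
        default TTL = 86400 (1 day)
"""


def authority_ttl_seconds(nslookup_output):
    """Return the ttl of the first record under AUTHORITY RECORDS, or None."""
    _, found, authority = nslookup_output.partition("AUTHORITY RECORDS:")
    if not found:
        return None
    # Match the lowercase "ttl = N" line, not the "default TTL = N" line.
    match = re.search(r"^\s*ttl\s*=\s*(\d+)", authority, re.MULTILINE)
    return int(match.group(1)) if match else None


print(authority_ttl_seconds(SAMPLE))  # 1754
```

Feeding it the captured output of repeated runs would give you the remaining cache time in seconds for each run.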

If you want to change the TTL of a zone in Microsoft DNS, open the zone properties and go to the SOA tab.  Here you will see two TTL values.  One is Minimum (default) TTL, which sets how long your resource records are cached.  The other is TTL for this record, which is entered in DDDDD:HH:MM:SS format, and this is where you control negative caching of lookups into this zone.  Keep in mind that under RFC 2308 a resolver uses the lower of the two values for negative caching, so check both.  By default, Microsoft DNS sets this to 1 hour.
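
As a quick sanity check on the arithmetic, here is a small Python sketch that converts a DDDDD:HH:MM:SS value to seconds and picks the negative cache time the way RFC 2308 describes it (the lower of the SOA record's own TTL and the SOA minimum field).  The function names are my own and the input values are only examples.

```python
def ttl_field_to_seconds(text):
    """Convert a 'DDDDD:HH:MM:SS' string to a number of seconds."""
    parts = text.strip().split(":")
    if len(parts) != 4:
        raise ValueError(f"expected DDDDD:HH:MM:SS, got {text!r}")
    days, hours, minutes, seconds = (int(p) for p in parts)
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def negative_cache_ttl(soa_record_ttl, soa_minimum):
    """RFC 2308: negative answers are cached for the lower of the two values."""
    return min(soa_record_ttl, soa_minimum)


print(ttl_field_to_seconds("0:1:0:0"))   # 3600
print(ttl_field_to_seconds("0:0:10:0"))  # 600
print(negative_cache_ttl(3600, 86400))   # 3600
```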

On the client side, the Microsoft DNS Client service (Dnscache) will cache negative results for a default of 15 minutes.  This can be adjusted with the registry value: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DNSCache\Parameters\MaxNegativeCacheTtl (ref).
