At Vena we've recently been working to upgrade our servers to take advantage of AWS's Autoscaling and Elastic Load Balancing (ELB) services, which allow servers that are in an unhealthy state to be replaced by new, healthy servers. During this process we ran into a bit of a problem with DNS, and I thought I'd share my discoveries with you as well as a potential solution.
First for some context
Our clients might be logging into one of multiple datacentres, but we don't want them to have to keep track of which datacentre their data lives in (since they might be moved around for various reasons), so we tell them all to go to vena.io to log in. Previously, under our old non-Autoscaling solution, we simply had a DNS record that made vena.io resolve to any one of our public web servers, and each of these servers would then figure out where the client's data actually lives and forward them to the right place. After that login, all other requests go to the subdomain they've been forwarded to. Simple, right?
As we transitioned to Autoscaling we discovered this would no longer be so simple, because our web servers might go down at any time and we can't keep updating our DNS entry, since each update takes too long to take effect. Instead, the load balancer comes with a URL provided by Amazon that resolves to the servers that are currently healthy enough to accept traffic. The problem with this URL is that it's very long and no one will ever be able to remember it; for example, it could be something like
prod-ca2-20160309-nginx-elb-1656592959.us-east-1.elb.amazonaws.com. To solve this we can use what is called a CNAME DNS record.
For those who aren't familiar with all the ins and outs of DNS, a CNAME is a type of DNS record which tells your DNS client that one domain, say foo.example.com, is an alias for another domain, bar.example.com. Your web browser will see this, resolve bar.example.com and then go to the resulting IP address. This helps us because we can create a CNAME record for vena.io pointing at each of our ELBs' URLs. The catch is that with our non-autoscaling setup we could create a single A record (a type of DNS record that returns just IP addresses) holding the addresses of all our web servers, whereas a CNAME can only hold a single value, so we need to create multiple CNAME records for vena.io, one for each of our datacentres. We're able to create multiple records with the same name because Amazon's DNS service, Route53, allows what they call a "weighted routing policy". This lets us create multiple records with the same name and associate a weight with each one. Amazon's servers will then return each record a certain percentage of the time, based on its weight relative to all the other weights.
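To make the weighting concrete, here is a small Python sketch of how one answer could be picked in proportion to its weight. It is not Route53's actual implementation, only the proportional rule described above, and the hostnames and weights are made up for illustration.

```python
import random
from collections import Counter

# Hypothetical weighted records for one name: (CNAME target, weight).
# The targets and weights are illustrative, not real endpoints.
RECORDS = [
    ("elb-datacentre-a.example.com", 70),
    ("elb-datacentre-b.example.com", 30),
]


def selection_probability(records):
    """Return each target's share of answers: weight / sum of all weights."""
    total = sum(weight for _, weight in records)
    return {target: weight / total for target, weight in records}


def pick_record(records, rng=random):
    """Pick one target at random, in proportion to its weight."""
    targets = [target for target, _ in records]
    weights = [weight for _, weight in records]
    return rng.choices(targets, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(selection_probability(RECORDS))
    # Simulate many lookups to see the split approach the configured weights.
    print(Counter(pick_record(RECORDS) for _ in range(10_000)))
```

Giving every datacentre the same weight would make the choice uniformly random, which is what the login scenario here calls for.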
Since each query can only return one answer, we want the datacentre to be chosen at random each time a user logs into vena.io. We pay a bit of a performance hit when we need to forward a user to another datacentre, and if their browser has cached the wrong response we don't want them to pay this hit every single time. To solve this we create these DNS records with a short time-to-live (TTL) value.
Time-to-live is a value associated with a DNS record that tells the client/server how long to cache the answer before asking for a new one. Instead of asking for a new answer every single time, it's usually better for performance to cache the result and use the locally saved answer. Once the TTL runs out (say after 5 minutes) your browser should ask for a new answer. A short TTL ensures that every time a user goes to log in, their cached DNS record for vena.io will likely have expired and they will get a new answer, randomly chosen from the various datacentres we have.
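As a rough illustration of the TTL behaviour described above, here is a toy cache in Python. It is a sketch only: real resolvers are far more involved, and the `lookup` callback and `TtlCache` name are hypothetical stand-ins for "ask the next server up the chain".

```python
import time


class TtlCache:
    """A toy resolver cache: answers are reused until their TTL runs out."""

    def __init__(self, lookup, clock=time.monotonic):
        self._lookup = lookup   # function: name -> (answer, ttl_seconds)
        self._clock = clock     # injectable so the behaviour can be tested
        self._entries = {}      # name -> (answer, expires_at)

    def resolve(self, name):
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]    # still fresh: no upstream query needed
        # Expired or never seen: ask upstream and remember the new answer.
        answer, ttl = self._lookup(name)
        self._entries[name] = (answer, now + ttl)
        return answer
```

A cache that misbehaves in the way this post is worried about would simply ignore `ttl` and keep the entry for as long as it pleased.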
Now caching DNS records can give performance gains, but in our case, because of the cost of a cross-datacentre login, caching the wrong result can have unfortunate consequences, so let's look a bit at how caching works. First you have caching at a local level, where your computer holds onto an answer until it decides it needs a new one and asks its DNS server. Your computer's DNS server only knows so many answers, and when it's asked to resolve a domain it doesn't know, it has to ask other servers for the answer. Each of these in turn might need to ask another server, and so on, until finally the request reaches the server that controls the domain (called the "Authority"), which is able to return the correct answer. Because of this long chain of requests, each of these name servers caches the result of a DNS query, just like your browser does, so it doesn't have to ask servers up the chain every time. A problem comes into play when servers don't respect the TTL.
I know what you're about to ask: "But Josh, isn't there some sort of specification that says that all servers and clients must throw away an expired DNS record once it's reached the TTL?" Good question. You would be totally correct in thinking that. There is in fact a specification that dictates that DNS records must be fetched again once they reach their TTL, and that caches must set their own TTL based on the answer they receive. Unfortunately, out on the Internet not all servers play by the rules.
In an ideal world all clients, servers, proxies, operating systems, etc. would follow the specification and fetch a new record when the old one has expired, but sadly this is not the case. It's a bit of a known problem that TTL values aren't always respected: some responses might be cached for a day (regardless of the TTL value set), some for a week and, worse yet, some might be cached indefinitely. I did a little research and found that Stack Overflow, Hacker News and countless other sites are filled with horror stories of companies switching over their domain name, waiting a few weeks and bringing down the old servers once the traffic died down, only to discover that a few customers were still trying to hit the old servers. Fortunately this problem can often be remedied by telling your clients to clear their DNS caches, but I don't want to have to do that.
I decided that all this anecdotal evidence wasn't enough for me, so I set out to get a bit of hard data. First I found a site that provides a list of public DNS servers. As of this writing it knows about 65,723 nameservers in 205 countries. Then I found a nice little library to help with DNS querying, dnspython. Next I changed a test record set to hold a different value and set the TTL to 10 seconds. I then needed to wait at least a few hours because of the hierarchical nature of DNS: once Route53's servers are updated, we need to wait for the servers that get their answers from Route53 to be updated, then for the servers that get their answers from those servers, and so on.
Once a few hours had gone by I ran a script I wrote (available on my GitHub) which simply takes the list of DNS servers and checks each one for the record set I had updated. It checks that the server is reachable, that it is returning the newly updated result and, most importantly, it checks the TTL value that comes back.
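The script itself isn't reproduced here, but the three checks could look roughly like the Python sketch below. This is my reconstruction, not the original script. It assumes dnspython 2.x (`pip install dnspython`, imported lazily so the rest of the code loads without it), assumes the test record is an A record, and treats a TTL as "wrong" only when it is larger than the one that was set, since a cache legitimately counts the TTL down. The function names are hypothetical.

```python
def query_server(server_ip, name, timeout=5.0):
    """Ask one nameserver for `name`; return (addresses, ttl)."""
    import dns.resolver  # third-party: dnspython

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server_ip]   # query only this server
    resolver.lifetime = timeout          # total seconds allowed for the query
    answer = resolver.resolve(name, "A")
    return [rdata.address for rdata in answer], answer.rrset.ttl


def check_server(server_ip, name, expected_address, expected_ttl,
                 query=query_server):
    """Classify one server as 'ok', 'timeout', 'wrong address' or 'wrong ttl'."""
    try:
        addresses, ttl = query(server_ip, name)
    except Exception:
        # Assumption: any failure to get an answer is lumped in as a timeout.
        return "timeout"
    if expected_address not in addresses:
        return "wrong address"
    if ttl > expected_ttl:
        # The server is promising to cache longer than the record allows.
        return "wrong ttl"
    return "ok"
```

Running `check_server` over the whole server list and counting the returned labels would give a tally of timeouts, wrong addresses and wrong TTLs.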
I ran the script both against all the DNS servers and again against just the countries Vena has clients in and in both cases the results were pretty concerning. For the entire world here are the results:
global: 6236 out of 67440 failed:
    2044 out of 67440 (3.03%) had query timeouts
    71 out of 67440 (0.11%) had the wrong address
    4121 out of 67440 (6.11%) had the wrong TTL
Now the 3% of servers with a timeout is not too concerning, since if they happen to be operating near you, you can always reconfigure to hit a different server. Fortunately, less than 1% of servers had the wrong value. The odd thing about these servers is that they weren't returning the old value or the new value; they were returning a completely different value. There weren't too many of these cases, but the truly alarming thing is that 6% of the DNS servers tested were reachable and had been updated, yet decided to invent their own TTL value. I dug into a few of these failed servers and found that a lot had a TTL of less than an hour, but a few decided to cache that response for several days, and one was even going to hold onto it for two weeks! When I tested just the countries in which Vena has clients the results were a bit better: about 2.5% of all servers had the wrong TTL value.
What to do?
Given that there are servers out there that your customers could be using, and that may or may not hold onto a DNS answer for a long period of time, what can you do? Do you continually tell your users to clear their DNS caches whenever a public server's IP needs to change? It's 2016, that can't be an acceptable answer, and it won't work for everything. How can you ever take advantage of features that modern nameservers provide, such as healthchecks (the ability for a nameserver to know about a server's health and not return a record corresponding to an unhealthy server)? Fortunately there is an answer out there.
The answer lies in what Amazon calls "Cache Busting". DNS records are allowed to be defined with a wildcard in them, meaning that a simple pattern can be defined that will match multiple names. For example, I could define *.example.com and the domains foo.example.com and bar.example.com would both match that record (the bare example.com is not covered by the wildcard and still needs its own record). What this allows you to do is, instead of having your applications make requests to example.com, generate a semi-random string to prepend to that domain and make requests to u123nsb.example.com. Since you are making a request to a random domain, there is a very high chance that this particular entry has not been cached yet, and the name server will have to fetch a fresh response. The next time your application makes a request it will use a new random name and demand a fresh response as well. This ensures that your application always gets an up-to-date response.
Now this approach does come with a bit of a performance hit, since it requires more trips to your clients' nameservers and makes all the servers down the line to Amazon's servers do a bit more work. However, what you pay in performance you make up in resiliency. You can define healthchecks on your DNS entries, so even if your public front ends go down, as long as you build the proper failover your clients may never notice a difference, and this is something we strive hard to achieve at Vena.
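Generating the random prefix is the easy part. Here is a minimal Python sketch, assuming a wildcard record already exists for the base domain; the function name, alphabet and label length are my own choices for illustration, not anything prescribed by Amazon.

```python
import secrets
import string

# Lowercase letters and digits are always valid in a DNS label.
ALPHABET = string.ascii_lowercase + string.digits


def cache_busting_hostname(base_domain, length=8):
    """Return `<random label>.<base_domain>` for a one-off DNS lookup."""
    label = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return f"{label}.{base_domain}"


if __name__ == "__main__":
    # Each call yields a name that is very unlikely to be in any cache yet.
    print(cache_busting_hostname("example.com"))
```

An application would call this before each request (or each login) and connect to the returned hostname instead of the bare domain.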