
Go-Daddy: Low TTL DNS Resolution Failures

Some of you may have noticed recently that www.nonfunctionalarchitect.com was not resolving correctly much of the time. At first I thought this was down to DNS replication taking a while, though that shouldn't really explain inconsistent results from the same DNS servers (once a record is picked up it should stick, assuming I don't change the target, which I hadn't).

So eventually I called Go-Daddy support, who weren't much help and kept stating that "it works for us", suggesting it was my problem. This despite confirmation from friends and colleagues that they see the same issue from a number of different ISPs. They also didn't want to take the logs I'd captured demonstrating the problem, or give me a reference number - a far cry from the recorded message in the queue promising to "exceed my expectations"! But hey, they're cheap...

Anyway... I'd set the TTL (Time To Live) on my DNS records to 600 seconds. This is something I've done since working on migration projects, where you want the DNS TTL to be short to minimise the time clients keep pointing at the old server (note: you need to make the change at least one full legacy-TTL period before you start the migration... and not every DNS server obeys your TTL... but it's still worth doing). This isn't an insane value normally, but it really depends on whether your nameservers can handle the increased load. I asked the support guy if this was OK and he stated that it was and that all looked fine with my DNS records... Cool, except that my problem still existed!
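
As an aside, if you want to see what TTL is actually being served for a record, dig shows it as the second column of the answer section. A rough example, assuming you have dig installed (the nameserver below is a placeholder - substitute one of your domain's actual nameservers):

# Ask your local resolver - the TTL shown counts down as the cached entry ages
dig +noall +answer www.nonfunctionalarchitect.com

# Ask an authoritative nameserver directly to see the configured TTL
dig @ns1.your-nameserver.example +noall +answer www.nonfunctionalarchitect.com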

I had to try something, so I set up a simple shell script on a couple of servers to perform a lookup (nslookup) on Google.com, www.nonfunctionalarchitect.com and pop.nonfunctionalarchitect.com, and set the TTL to 1 day on www and 600 seconds on pop. This should hopefully (a) prove that DNS resolution is working (Google.com resolves), (b) confirm that I am suffering a problem on www and pop and, with a bit of luck, (c) demonstrate whether increasing the TTL makes any difference.
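
For what it's worth, the script was nothing clever - a minimal sketch along these lines, where the log file name (dnscheck.log) and the 60-second interval are just illustrative choices:

#!/bin/sh
# Look up each host in turn, once a minute, and append the output to a log
while true; do
  for host in google.com www.nonfunctionalarchitect.com pop.nonfunctionalarchitect.com; do
    echo "== $(date) $host" >> dnscheck.log
    nslookup "$host" >> dnscheck.log 2>&1
  done
  sleep 60
done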

The results show no DNS resolution failures for either Google.com or www.nonfunctionalarchitect.com. On the other hand, pop fails around 10% of the time (12 failures from 129 requests). Here are a few of the results:
pop.nonfunctionalarchitect.com canonical name = pop.secureserver.net.
pop.nonfunctionalarchitect.com canonical name = pop.secureserver.net.
** server can't find pop.nonfunctionalarchitect.com: NXDOMAIN
pop.nonfunctionalarchitect.com canonical name = pop.secureserver.net.
pop.nonfunctionalarchitect.com canonical name = pop.secureserver.net.
pop.nonfunctionalarchitect.com canonical name = pop.secureserver.net.
** server can't find pop.nonfunctionalarchitect.com: NXDOMAIN
** server can't find pop.nonfunctionalarchitect.com: NXDOMAIN
pop.nonfunctionalarchitect.com canonical name = pop.secureserver.net.
pop.nonfunctionalarchitect.com canonical name = pop.secureserver.net.
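
Tallying the failures from the captured output is a one-liner (assuming the lookups were logged to a file as in the sketch above - dnscheck.log is my own naming):

# Failed lookups vs. successful lookups for the pop record
grep -c "server can't find pop.nonfunctionalarchitect.com" dnscheck.log
grep -c "pop.nonfunctionalarchitect.com.*canonical name" dnscheck.log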

I can think of a number of reasons why this may be happening, including load on the Go-Daddy nameservers, overly aggressive DoS counter-measures, or some misalignment between the nameserver and CNAME configuration. The configuration seems OK, but I do wonder about the nameservers' 1hr TTL versus the CNAMEs' 600s TTL. For now it seems more stable at least, and I'll do some experimentation with TTL values later to see if I can pin this down.
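
If you want to compare the two TTLs yourself, dig will show them (again, the TTL is the second column of the answer):

# TTL on the zone's NS records
dig +noall +answer nonfunctionalarchitect.com NS

# TTL on the CNAME for the pop record
dig +noall +answer pop.nonfunctionalarchitect.com CNAME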

In the meantime, if you're getting DNS resolution failures with Go-Daddy and have relatively low TTL values set (<1hr), then consider increasing them to see if that helps.
