Downtime happens fast, especially if the NIC (domain registry) deletes your domain by accident




As you can see on my LinkedIn profile, I'm working for one of the leading news corporations in Switzerland. For a news portal, you can imagine how important the domain is. And what if this domain is suddenly deleted? This is exactly what happened last Friday, April 15th 2016. But let's start at the beginning.

Monitoring alerts

At 2.14pm our satellite monitoring, which runs in the AWS cloud to give us an outside view of our web services, reported a failed HTTP check of our main domain. And not just an HTTP 500 or something like that - it was the following error:

Name or service not known HTTP CRITICAL - Unable to open TCP socket

When I got this alert (by both e-mail and SMS) I immediately knew something was off. "Name or service not known" means that the domain name could not be resolved. I first suspected a resolving problem on the AWS DNS servers. I logged onto the satellite server and confirmed that our domain did not resolve - only to find out that other domains resolved without a hiccup. What the hell...?!
A few minutes later (the delay was most likely caused by the domain's TTL, as our internal resolvers still had the records cached), we received alerts on our internal systems as well. Now we were in deep trouble.
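The check I did by hand on the satellite server (does our domain fail while other domains still resolve?) can be sketched in a few lines of Python. This is only a minimal illustration, not what our monitoring actually runs: the function names and the idea of passing a list of "control" domains are my own assumptions, and the `lookup` parameter exists purely so the logic can be tried without touching the network.

```python
import socket


def resolves(hostname, lookup=socket.getaddrinfo):
    """Return True if hostname resolves to at least one address."""
    try:
        return len(lookup(hostname, 80)) > 0
    except socket.gaierror:
        # Raised for resolver errors such as "Name or service not known".
        return False


def diagnose(domain, control_domains, lookup=socket.getaddrinfo):
    """Tell a domain-specific DNS failure apart from a broken resolver.

    control_domains is a hypothetical list of unrelated domains that
    are expected to resolve when the local resolver is healthy.
    """
    if resolves(domain, lookup):
        return "ok"
    if any(resolves(name, lookup) for name in control_domains):
        # Other names work, so the resolver is fine and only our
        # domain is affected.
        return "domain-specific failure"
    return "resolver or network failure"
```

If the result is "domain-specific failure", the resolver itself is healthy and the next place to look is the delegation of the domain, which is exactly where this story goes next.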

Problem at the domain registrar?

A whois of our domain no longer showed any DNS nameservers, so I suspected a problem at our domain registrar (Gandi). Maybe someone had deleted the nameservers from the domain's configuration? But when I logged into our account, the DNS servers were there. No modification had been made. I called Gandi to ask for help in figuring out what was going on with our domain - but they confirmed that the DNS configuration seemed correct and said they couldn't explain why the domain wasn't working.
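The mismatch described here (nameservers configured at the registrar, but nothing published by the registry) can also be checked against DNS directly. The following is a rough sketch under assumptions: it expects NS records as text in the standard "owner TTL class type target" presentation format, for example the authority section captured from a query sent straight to one of the registry's name servers (dig can print it with `+noall +authority`). The function names and the sample host names in the usage note are hypothetical.

```python
def parse_ns_records(rr_text, domain):
    """Collect the NS targets listed for `domain` in zone-file style text."""
    wanted = domain.rstrip(".").lower()
    servers = set()
    for line in rr_text.splitlines():
        if line.lstrip().startswith(";"):
            continue  # comment line
        fields = line.split()
        if len(fields) != 5:
            continue  # not an "owner TTL class type target" record
        owner, _ttl, _rclass, rtype, target = fields
        if rtype.upper() == "NS" and owner.rstrip(".").lower() == wanted:
            servers.add(target.rstrip(".").lower())
    return servers


def delegation_status(expected, published):
    """Compare the registrar-side name servers with what the registry serves."""
    expected = {name.rstrip(".").lower() for name in expected}
    if not published:
        return "not delegated"
    if published == expected:
        return "in sync"
    return "mismatch"
```

With the name servers from the registrar's control panel as `expected`, a result of "not delegated" would point at the registry rather than the registrar, which is what turned out to be the case here.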

Calling the Swiss domain registry (SWITCH)

After Gandi's response, I decided to call SWITCH, the registry operator, also called the NIC (network information center), for domains ending in .ch (Switzerland) and .li (Principality of Liechtenstein). That was at exactly 2.59pm. In a few short sentences I explained our domain problem to the first level support guy and he asked me to hold on while he checked with the responsible team (which I know sits just a few feet away, as I visited their offices back in 2012). A few minutes later he was back and explained to me that our domain was blocked - probably because of malware (those were his words). I should contact the SWITCH security team by e-mail. He couldn't give me any additional information. I sent the mail, describing the situation as briefly as possible and asking for an immediate call back to explain what was going on. That was at 3.06pm. I didn't get a call back.

At 3.15pm I called again, reached the same guy as before and demanded to speak directly to the security team or to a supervisor. That didn't work; the excuse was that they don't have a direct phone number. My ass. Our company is completely down (e-mails as well) and I'm being kept on hold... At least he went to see his colleagues from the security team again at my request. A few minutes later he was back on the phone and told me that the domain would be reactivated shortly. But still no answer to my question "But why? What happened?!". I was told the security team would contact me.

Monitoring shows DNS recoveries

At 3.29pm we received the first recovery alerts. A whois command showed the DNS nameservers again. But of course this is only a direct whois query against the registry's central servers - the caching DNS resolvers at the big providers had already "deleted" our domain, meaning they had cached the answer that it does not exist. It would take more than a few minutes to get the domain "back in" everywhere.
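How long a resolver keeps such a "does not exist" answer is governed by negative caching (RFC 2308): the answer may be cached for the smaller of the SOA record's own TTL and the MINIMUM field of that SOA record. As a back-of-the-envelope sketch, the worst case is a resolver that cached the negative answer just before the domain was restored. The numbers in the usage note are made-up placeholders, not the real values of the .ch zone, and the function names are my own.

```python
from datetime import datetime, timedelta


def negative_ttl(soa_ttl, soa_minimum):
    """Negative caching TTL per RFC 2308: min(SOA TTL, SOA MINIMUM)."""
    return min(soa_ttl, soa_minimum)


def latest_cache_expiry(restored_at, soa_ttl, soa_minimum):
    """Worst case: the negative answer was cached right before restoration."""
    return restored_at + timedelta(seconds=negative_ttl(soa_ttl, soa_minimum))
```

For example, with placeholder values of 3600 seconds for the SOA TTL and 900 seconds for MINIMUM, `latest_cache_expiry(datetime(2016, 4, 15, 15, 29), 3600, 900)` gives 15:44. Individual resolvers may apply their own limits on top of this, so treat it as a rough estimate only.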

At 4.18pm I got word from a colleague who has a direct contact at SWITCH and was able to talk to him. It turned out that a human error had occurred and that our domain had been accidentally deleted. It took until 4.40pm before we saw normal incoming traffic again.

Wrapping up this incident

Besides the downtime which was costly, avoidable and, as you can imagine, hectic, there are a few facts which still anger me:

1) Communication disaster.
To this day, nobody has called or mailed me back to explain (technically) what happened.

2) Technically in shape?
What kind of official registry operator/network information center just deletes a domain by "error"? What are your monitoring tools? Is there no prevention and verification before "accidentally" deleting a domain? Can anyone working at SWITCH just delete a domain without validation? Let's say you "accidentally" delete a domain like SBB.ch (the Swiss Federal Railways) - oh congrats, you've just brought down a huge part of Switzerland's transportation system.

3) Lies - sweet, sweet lies
SWITCH told my colleague that they "found the problem ourselves at around 3pm". Remember the times when I called and sent an e-mail? At least be honest and acknowledge that the end user had to report your mistake.

Later that day, SWITCH posted a "sorry" on Twitter: "nzz.ch is back online. We're sorry for the erroneous manipulation on our side!".

Interestingly, on the very same date this "accident" happened to our domain, the Swiss government released a public document stating:

"Technical management of the .ch domain in relation to the global internet domain name system is being provided by Switch until 2017"

and:

"On 15th April 2016, OFCOM launched a public invitation to tender to award the management mandate for .ch domain names (registry function)."

So after 2017 a new private or public organization may take over the registry function currently held by SWITCH. After last Friday I welcome this very much.

