As if there weren’t already enough cloud skeptics out there, any high-profile outage – such as the CloudFlare disruption on March 3 – puts the flaws of cloud computing under the microscope at an inopportune time.
On March 3, CloudFlare dropped off the Internet, taking down all of its services, including DNS and anything that relies on the company’s Web proxy. The router crash brought down CloudFlare’s entire network and knocked thousands of websites offline over the weekend.
“During the outage, anyone accessing CloudFlare.com or any site on CloudFlare’s network would have received a DNS error. Pings and Traceroutes to CloudFlare's network resulted in a ‘No Route to Host’ error,” CloudFlare’s co-founder and CEO Matthew Prince explained in a post-mortem blog post.
The cause of the outage was a system-wide failure of the company’s Juniper edge routers, he said. CloudFlare currently runs 23 data centers worldwide.
“These data centers are connected to the rest of the Internet using routers. These routers announce the path that, from any point on the Internet, packets should use to reach our network,” Prince said. “When a router goes down, the routes to the network that sits behind the router are withdrawn from the rest of the Internet.”
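The announce-and-withdraw behavior Prince describes can be illustrated with a toy model. This is only a sketch, not real BGP: the `RouteTable` class, the router names and the `192.0.2.0/24` prefix (a range reserved for documentation) are all hypothetical and do not come from CloudFlare’s post.

```python
class RouteTable:
    """Toy view of which routers currently announce a path to each prefix."""

    def __init__(self):
        # prefix -> set of router names announcing it (hypothetical model)
        self._announcers = {}

    def announce(self, router, prefix):
        """Record that `router` advertises a path to `prefix`."""
        self._announcers.setdefault(prefix, set()).add(router)

    def withdraw_all(self, router):
        """Simulate a router going down: every route it announced is withdrawn."""
        for prefix in list(self._announcers):
            self._announcers[prefix].discard(router)
            if not self._announcers[prefix]:
                # No router announces this prefix any more.
                del self._announcers[prefix]

    def reachable(self, prefix):
        """A prefix is reachable only while at least one router announces it."""
        return prefix in self._announcers


table = RouteTable()
table.announce("edge-1", "192.0.2.0/24")
table.announce("edge-2", "192.0.2.0/24")

table.withdraw_all("edge-1")
one_router_down = table.reachable("192.0.2.0/24")   # still announced by edge-2

table.withdraw_all("edge-2")
all_routers_down = table.reachable("192.0.2.0/24")  # nothing announces it now
```

In this model the network stays reachable while any one router keeps announcing it, and disappears only when every router has gone down, which is the situation the post-mortem describes.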
The routers crashed accidentally while the company was trying to thwart a Distributed Denial of Service (DDoS) attack, taking all of its clients’ websites offline, Prince explained.
The DDoS attack being launched against one of its customers specifically targeted the customer’s DNS servers, Prince said, adding that CloudFlare has an internal tool that profiles attacks and outputs signatures that its automated systems and ops team can use to stop attacks.
“Often, we use these signatures in order to create router rules to either rate limit or drop known-bad requests,” Prince said.
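As a rough illustration of how a signature might become a rule that drops or rate limits traffic, consider the following sketch. The `Packet`, `Signature` and `Rule` types and the `decide` function are hypothetical; CloudFlare’s internal tooling and Juniper’s rule syntax are not described in the post, so none of this should be read as their actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    protocol: str
    dst_port: int
    length: int  # total packet length in bytes


@dataclass(frozen=True)
class Signature:
    """Hypothetical attack signature: protocol, port and a length range."""
    protocol: str
    dst_port: int
    min_length: int
    max_length: int

    def matches(self, packet: Packet) -> bool:
        return (
            packet.protocol == self.protocol
            and packet.dst_port == self.dst_port
            and self.min_length <= packet.length <= self.max_length
        )


@dataclass(frozen=True)
class Rule:
    signature: Signature
    action: str  # "drop" or "rate-limit"


def decide(rules, packet: Packet) -> str:
    """Return the action of the first matching rule, or "accept" if none match."""
    for rule in rules:
        if rule.signature.matches(packet):
            return rule.action
    return "accept"


# Assumed example values: a DNS-targeted attack with unusually long packets.
rules = [Rule(Signature("udp", 53, 900, 1000), "drop")]
attack_verdict = decide(rules, Packet("udp", 53, 950))
normal_verdict = decide(rules, Packet("udp", 53, 550))
```

Here the long packet matches the signature and is dropped, while a packet of ordinary size falls through to the default and is accepted.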
In this case, the attack profiler reported that the attack packets were between 99,971 and 99,985 bytes long, far longer than the 500- to 600-byte average.
“What should have happened is that no packet should have matched that rule because no packet was actually that large,” Prince wrote. “What happened instead is that the routers encountered the rule and then proceeded to consume all their RAM until they crashed.”
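One way to see why no packet could match: the IPv4 header stores a packet’s total length in a 16-bit field, so a packet cannot exceed 65,535 bytes. The sketch below shows a sanity check that would flag such a rule before it is pushed to a router. It is a hypothetical safeguard, not something the post says CloudFlare or Juniper used; `LengthRule` and `is_satisfiable` are made-up names.

```python
from dataclasses import dataclass

# The IPv4 Total Length header field is 16 bits wide.
MAX_IPV4_PACKET_BYTES = 65_535


@dataclass(frozen=True)
class LengthRule:
    """Hypothetical rule that matches packets within a byte-length range."""
    min_bytes: int
    max_bytes: int


def matches(rule: LengthRule, packet_length: int) -> bool:
    """True if a packet of this length falls inside the rule's range."""
    return rule.min_bytes <= packet_length <= rule.max_bytes


def is_satisfiable(rule: LengthRule) -> bool:
    """True if at least one valid IPv4 packet length could match the rule."""
    return (
        0 <= rule.min_bytes <= rule.max_bytes
        and rule.min_bytes <= MAX_IPV4_PACKET_BYTES
    )


# The length range reported by the attack profiler.
bad_rule = LengthRule(min_bytes=99_971, max_bytes=99_985)
bad_rule_ok = is_satisfiable(bad_rule)            # False: range is above the limit
typical_packet_hit = matches(bad_rule, 550)       # False: ordinary packet does not match
sane_rule_ok = is_satisfiable(LengthRule(500, 600))
```

A check like this treats an unsatisfiable rule as a sign that the profiler’s output is wrong, rather than relying on the router to handle it gracefully.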
Despite the inconvenience the outage caused customers, CloudFlare had the problem fixed within one hour.