Last month’s outage on Windows Azure was blamed on an expired SSL certificate. The company has taken a few steps to prevent a reoccurrence.
Several Windows Azure services were impacted by the worldwide outage – and led to some very frustrated users. The Feb. 22 outage was cleared up on Feb. 23 with the SSL certificate updated and all service restored, according to a company blog post.
Azure apologized and will credit customers impacted by the outage. Also, it undertook a root cause analysis to prevent a reoccurrence, according to Steven Martin, general manager, Windows Azure Business & Operations.
The Feb. 22 outage began at 3:29 PM (ET). It impacted customers who were accessing Windows Azure Storage Blobs, Tables and Queues using HTTPS. Service was restored by 3:09 AM (ET) on Feb. 23.
“When the certificate expiration time was reached, the certificates became invalid prompting a rejection for those connections using HTTPS with the storage servers,” according to a recent blog post from the company. “While the expiration of the certificates caused the direct impact to customers, a breakdown in our procedures for maintaining and monitoring these certificates was the root cause. Additionally, since the certificates were the same across regions and were temporally close to each other, they were a single point of failure for the storage system.”
It was traced to events on Jan. 7, when a storage team updated three certificates and included them in a future release. “However, the team failed to flag the storage service release as a release that included certificate updates. Subsequently, the release of the storage service containing the time critical certificate updates was delayed behind updates flagged as higher priority, and was not deployed in time to meet the certificate expiration deadline. Additionally, because the certificate had already been updated in the Secret Store, no additional alerts were presented to the team, which was a gap in our alerting system,” the company explained in the blog post.
Azure put in place steps to ensure a quicker response if a similar incident takes place in the future. The company will improve how it detects expiring certificates, as well. Manual processes will be improved, too, to ensure how updates are tracked and prioritized. In addition, the company took steps so any uncaught expirations will not lead to a widespread outage.
It was positive that during the outage, Windows Azure Storage service was working for customers who use HTTP to access data and some of the customers switched to HTTP as a temporary option.
However, the outage got a lot of attention in the media. It was described by The Associated Press as “sloppy housekeeping” and an “embarrassing lapse” for Microsoft (News - Alert), causing “major aggravation” for users.
There were some positives coming from the service outage. For instance, Network World reported how Matt Watson of Stackify, based in Kansas City, Mo., lost Azure service for 12 hours, then came up with a free service “to help certificate administrators avoid Microsoft's mistakes. … After Azure went down, Watson and company began looking for a service that reminded administrators about certificate expirations and couldn't find any. So the company, which specializes in creating remote application monitoring and troubleshooting tools, set up a cert alert service up itself.”