In a post on Microsoft’s MSDN blog, Windows Azure manager Mike Neil analyzes the root cause of the Windows Azure storage disruption on February 22, 2013. The service interruption affected customers who were accessing Windows Azure storage blobs, tables and queues using HTTPS. According to the blog post, availability was restored worldwide by 9 AM PST on February 23, 2013.
Besides providing more information on the components associated with the interruption, the blogger also presented its root cause, the recovery process and the steps Microsoft is taking to improve service reliability for its customers.
The blogger first described the internal components of Windows Azure associated with the event before looking into the details of the interruption.
The overview shows that Windows Azure runs many cloud services across various data centers and geographic regions around the globe, and that Windows Azure Storage runs as a cloud service on Windows Azure. The Windows Azure Fabric Controller provides the resource and management layer on the Windows Azure platform. Windows Azure also uses an internal service called the “Secret Store” to securely manage the certificates needed to run the service. This internal management service automates the storage, distribution and updating of platform and customer certificates in the system. As a result, and for compliance and security purposes, personnel do not have direct access to the secrets.
Diving into the root cause, the blogger wrote that Windows Azure Storage uses a unique Secure Sockets Layer (SSL) certificate to secure customer data traffic for each of the main storage types: blobs, tables and queues. The certificates allow traffic to be encrypted via HTTPS for all subdomains, each of which represents a customer account (e.g., myaccount.blob.core.windows.net).
The blog post further shows that internal and external services leverage these certificates to encrypt traffic to and from the storage systems. The certificates originate from the Secret Store, are stored locally on each of the Windows Azure Storage nodes and are deployed by the Fabric Controller. The certificates for blobs, tables and queues were the same for all regions and stamps.
The blog post also listed the expiration times of the certificates in operation last week.
When the expiration time was reached, the certificates became invalid, and HTTPS connections to the storage servers were rejected. HTTP transactions remained operational throughout.
“While the expiration of the certificates caused the direct impact to customers, a breakdown in our procedures for maintaining and monitoring these certificates was the root cause,” wrote Neil, continuing, “Additionally, since the certificates were the same across regions and were temporally close to each other, they were a single point of failure for the storage system.”
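The monitoring gap Neil describes can be illustrated with a short sketch. The Python below is not Microsoft's tooling; it is a minimal example, under stated assumptions, of flagging certificates that are close to expiry. The host names, expiration dates and 30-day warning window are hypothetical and chosen only for illustration, and `parse_not_after` and `expiring_soon` are names invented for this example.

```python
import ssl
from datetime import datetime, timedelta, timezone


def parse_not_after(not_after: str) -> datetime:
    """Convert a certificate 'notAfter' string into an aware UTC datetime."""
    seconds = ssl.cert_time_to_seconds(not_after)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def expiring_soon(expirations, now, warn_window=timedelta(days=30)):
    """Return (name, time_left) pairs for certificates that expire within
    warn_window of 'now' (already expired ones included), soonest first."""
    flagged = [
        (name, expires - now)
        for name, expires in expirations.items()
        if expires - now <= warn_window
    ]
    return sorted(flagged, key=lambda item: item[1])


# Hypothetical inventory: these names and dates are illustrative only
# and are not taken from the blog post.
now = datetime(2013, 2, 1, tzinfo=timezone.utc)
expirations = {
    "blob.example.test": parse_not_after("Feb 22 20:00:00 2013 GMT"),
    "archive.example.test": parse_not_after("Dec 31 23:59:59 2013 GMT"),
}

flagged = expiring_soon(expirations, now)
for name, time_left in flagged:
    print(f"WARNING: certificate for {name} expires in {time_left}")
```

A check of this kind, run on a schedule against every deployed certificate, would surface an approaching expiration well before connections start to fail. Staggering expiration dates across regions would address the single point of failure Neil mentions, but that is a separate measure from the monitoring shown here.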
To improve service in the future, Microsoft organized its further analysis of the problem into four areas: detection, recovery, prevention and response.
“The Windows Azure team will continue to review the findings outlined above over the coming weeks and take all steps to continually improve our service,” Neil concluded.