Recently there was an 11-hour outage of Microsoft's Azure storage services.
Again, users were hard-pressed to get details on the outage, as "the Service Health Dashboard and Azure Management Portal both rely on Azure."
I commend Microsoft for owning up to the root problem quickly and succinctly.
"Unfortunately the issue was widespread, since the update was made across most regions in a short period of time due to operational error, instead of following the standard protocol of applying production changes in incremental batches."
One of the comments summed it up best:
So much tied into itself that there is no dependency tree - it is a pure network - thus issuing bad changes take down the entire net.
It can be a spectacular update process - minimum to no outage... but only if the updates work.
It also shows a major vulnerability. That central update can take down the entire company if it gets penetrated.
20 November, 2014 12:46
- Diversify - Don't build your notification tool on top of what you're monitoring.
- Manage change - Don't let operational error bite you in the a**. Your execution has to be perfect. Users are unforgiving.
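The "incremental batches" protocol that Microsoft says was skipped can be sketched roughly as below. This is only an illustration of the general idea, not Microsoft's actual tooling: the function name staged_rollout, the apply_change and is_healthy callbacks, and the region names are all hypothetical. The change goes to one small batch at a time, and the rollout stops as soon as a batch looks unhealthy, so a bad update hits a few regions rather than most of them.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    regions: Iterable[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    batch_size: int = 1,
) -> List[str]:
    """Apply a change batch by batch and stop at the first unhealthy batch.

    Returns the regions that received the change, including the batch
    that failed its health check (hypothetical helper, for illustration).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    pending = list(regions)
    updated: List[str] = []
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for region in batch:
            apply_change(region)
            updated.append(region)
        # Check the whole batch before touching any further regions.
        if not all(is_healthy(region) for region in batch):
            break
    return updated


if __name__ == "__main__":
    # Assumed scenario: the update breaks the second region it reaches.
    broken = {"region-b"}
    touched = staged_rollout(
        ["region-a", "region-b", "region-c", "region-d"],
        apply_change=lambda region: None,
        is_healthy=lambda region: region not in broken,
    )
    print(touched)
```

In this made-up run the rollout halts after the second region, so the last two regions never see the bad change. A real process would also need a soak time between batches and a rollback step, both left out here.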
My previous posts on this topic:
When Clouds Go Bump
When Clouds Go Thump
Lessons from the Cloud
When Clouds Go Bump Revisited
To Be Fair
To Be Fair, Again
To Be Fair, Again and Again
Update: Microsoft has published a thorough analysis of the problem, along with the corrective action. Good job, Microsoft!