Sunday, March 05, 2017

Cloudy Day

To some, I'm sure March 1st felt like April 1st. Really, that couldn't be happening, could it? Amazon's S3 (Simple Storage Service) went down in its US East region (OK, it just had "high error rates").

But there are a couple of lessons to be learned from this.

First, it seems nobody is listening to me.

Cloud services aren't magical (even Apple's). They rumble. They go bump.

Don't abdicate your responsibilities to the cloud provider. If you need high availability, make sure that is what your contract guarantees.

In the March 1st S3 outage, either lots of customers didn't feel they needed high availability, or they misunderstood what they had contracted for.
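To make "what you contracted for" concrete, it can help to translate an availability percentage into the downtime it actually permits. The sketch below is plain arithmetic. The percentages are illustrative only and are not taken from any provider's agreement, and the function name `allowed_downtime_minutes` is my own.

```python
def allowed_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # Illustrative percentages, not quoted from any real contract.
    for pct in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% over 30 days allows {minutes:.1f} minutes down")
```

If the number that comes out is longer than your business can tolerate, the contract does not give you the availability you need, whatever the marketing page says.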

Make sure you are not surprised like "Docker's Registry Hub, Trello, Travis CI, GitHub and GitLab, Quora, Medium, Signal, Slack, Imgur, Twitch.tv, Razer, heaps of publications that stored images and other media in S3, Adobe's cloud, Zendesk, Heroku, Coursera, Bitbucket, Autodesk's cloud, Twilio, Mailchimp, Citrix, Expedia, Flipboard, and Yahoo! Mail (which you probably shouldn't be using anyway)." (source)

At the same time, don't overbuy. One of my customers was considering migrating their on-premises servers to Azure. As part of their on-premises setup they had a dedicated backup system and service. When I investigated Azure's service commitments, I found that Azure's committed backup and availability met my customer's needs, so the customer could discontinue their own backup system and service.

By the way, Amazon published a thorough post mortem on the outage.

Second (and more concerning, since they should know better), even Amazon had highly visible services down.

Reminiscent of one of Microsoft's outages, Amazon's own online public dashboard was down, along with many of Amazon's customer-facing services, e.g. Amazon Fire tablets.

From Amazon's post mortem, on the Service Health Dashboard (SHD): "we have changed the SHD administration console to run across multiple AWS regions."

Amazon, hadn't you thought of this before? What else have you overlooked?
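For what it's worth, the general idea of not depending on a single region can be sketched as trying each region in turn and using the first one that answers. This is only an illustration of the principle: the region names, the `broken` and `healthy` stand-in functions, and `first_available` are all hypothetical, and none of this is Amazon's implementation or any AWS API.

```python
from typing import Callable, Dict, Tuple


def first_available(fetchers: Dict[str, Callable[[], str]]) -> Tuple[str, str]:
    """Try each region's fetcher in order; return (region, result) from the first that works."""
    errors = {}
    for region, fetch in fetchers.items():
        try:
            return region, fetch()
        except Exception as exc:  # a real client would catch narrower errors
            errors[region] = exc
    raise RuntimeError(f"all regions failed: {errors}")


def broken() -> str:
    # Hypothetical stand-in for a region that is having a bad day.
    raise ConnectionError("high error rates")


def healthy() -> str:
    # Hypothetical stand-in for a region that is still serving.
    return "status page"


if __name__ == "__main__":
    print(first_available({"region-a": broken, "region-b": healthy}))
```

The point is not the dozen lines of code but the dependency they remove: a status page that lives only in the region it reports on goes dark exactly when you need it.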
