On 28th February a lot of websites and online services slowed down or downright stopped working - Amazon S3 had a service disruption and outage. This tells us two things - just how many organizations rely on AWS for their web services and also how little attention is paid to business continuity by organizations when moving to the cloud.
If companies had added redundancy safeguards, this service disruption which hit AWS would not have impacted customers as badly.
But it wasn’t as dark as it sounds. Sure Amazon broke the internet, but Twitter came to the rescue as twitterati responded with humor
Good news. Amazon has found the cause of the massive outage and is implementing the fix. Live video: pic.twitter.com/FIUwd3OZEf— Jon Passantino (@passantino) February 28, 2017
wow this amazon outage is really taking a toll pic.twitter.com/7efX84789P— Jarry (@jarry) February 28, 2017
The outage lasted 4 hours and had a domino effect - it took down many other services which relied on S3. The net result of the outage was that 54 of the top 100 eComm sites saw a 20% or more decrease in performance (source: Apica). Even companies like Nike saw longer load times and slower service.
Companies can implement various contingency plans to safeguard against such cloud outages. Some of them include
- Making use of the AWS cross-region replication ability
- Using one of the disaster-recovery services available
- Spreading workloads across regions
- Backing up to other public cloud providers
For instance, when the S3 service was affected, one of AWS’ biggest proponents, Netflix reported no issues.
But there is a trade-off many companies are willing to make - between redundancy for continuous uptime and relying completely on vendors for high SLAs. For all its exponential growth, public cloud is largely used by startups and websites, many willing to suffer an occasional outage - which would probably be unacceptable for some mission-critical applications.
What options do I have?
A lot of companies opt to host their workloads on a region other than U.S. East-1, even if the region gives them better performance. This is in the face of the fact that East-1 generally gets quicker access to new features. Their reason - U.S. East-1 is highly crowded and not as stable. So it makes more sense for them to host mission-critical workloads in a different region.
Another option is replication. But replication is not always foolproof either. Also, even if a company does have replication in place, it might just make sense for them to not conduct a failover in case of an outage and wait for the issue to get resolved, especially if it doesn’t affect operations much.
Yet other companies prefer to spread their cloud infrastructure to use multiple providers - for instance Apple iCloud reportedly relies on AWS, Azure and Google Cloud Platform.
Still uncertain how to maintain business continuity for your cloud offerings? Get in touch.