With increasing reliance on the cloud, and in many cases on a single cloud service provider, the probability of a widespread (though infrequent) outage grows. On Tuesday, AWS S3 storage experienced a major outage, taking down the back-ends of many sites, including Netflix, Slack, and HubSpot, two of which we use at Cavirin. Enterprises that were single-threaded on S3 simply had to wait it out, and though the actual outage lasted only about four hours, recovery took many of them the remainder of the day. To give you an idea of the magnitude of the impact, AWS S3 supports over 150K sites and upwards of three trillion data elements. Thousands of tweets questioned whether the Internet had gone down, just as during the Mirai attack last October. Compounding the problem, the storage service is shared across multiple AWS zones, so even though an enterprise may distribute compute across geographies, for practical or cost reasons it may still depend on a single storage instance.
Despite immense amounts of automation, the human element may still be the weak link, as reported by USA Today: "The most common causes of this type of outage are software related," said Lydia Leong, a cloud analyst with Gartner. "Either a bug in the code or human error. Right now, we don't know what it was." The publication Slate took a more somber view: "At this point, we practically expect that whatever personal information we enter into websites will be stolen."
So how can enterprises combat these types of outages, as well as the human element of risk?
First off, the larger enterprises do in fact have a cloud DR strategy. For example, if AWS fails, the enterprise may have warm-standby capability on GCP, Microsoft Azure, or on-premises. Though most DR programs fail over into the cloud, nothing precludes a scenario where an enterprise keeps critical applications on-premises, runs less critical ones in the cloud, and retains the option to rehome those cloud workloads on-premises in an emergency.
What this implies is that the enterprise must have a security compliance architecture that spans these multiple domains.
Any sound DR strategy depends on continuous replication of critical data, so that failover is possible and business continuity is maintained. In addition, the replicated systems must meet the same rigorous, continuous security monitoring and assessment requirements that are expected of live production systems. That way, when failover happens during an outage, the restored systems and services do not introduce new vulnerabilities. The scope of any security platform such as Cavirin must therefore include DR-replicated systems in addition to live production assets.
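One way to keep DR replicas inside the assessment scope is to treat production and replicated assets as a single inventory and flag anything the scanner misses. The sketch below illustrates the idea; the asset names and scan-coverage structure are hypothetical, not Cavirin's actual API.

```python
# Illustrative asset inventory: production systems and their DR replicas.
# All names are hypothetical examples.
PRODUCTION_ASSETS = {"web-01", "db-01", "cache-01"}
DR_REPLICAS = {"web-01-dr", "db-01-dr", "cache-01-dr"}


def scan_coverage_gaps(scanned):
    """Return every asset (production or DR) missing from the scan scope."""
    return (PRODUCTION_ASSETS | DR_REPLICAS) - scanned


# If only the production systems are scanned, every DR replica shows up
# as a coverage gap -- exactly the blind spot that bites during failover.
gaps = scan_coverage_gaps({"web-01", "db-01", "cache-01"})
print(sorted(gaps))
```

The point of the sketch is simply that DR replicas must be first-class members of the scanned inventory, not an afterthought discovered during failover.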
If enterprises have implemented AWS hardening benchmarks and their workloads move to GCP, they should ensure that the same protections are in place. This applies not only to conventional virtualized workloads but to containers as well. They need to ensure that the hardening applied to a given OS on one cloud provider is also applied on another, and that compliance is agentless and continuous, so the baseline can be built quickly and any risk identified.
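In practice this means expressing a hardening rule once and evaluating it against any host, whatever the provider. A minimal sketch, loosely modeled on a CIS-style OS benchmark check; the rule, config fields, and host records are illustrative assumptions, not a real benchmark API:

```python
def check_ssh_root_login_disabled(host_config):
    """One benchmark rule, provider-agnostic: sshd PermitRootLogin must be 'no'."""
    return host_config.get("sshd", {}).get("PermitRootLogin") == "no"


# Hypothetical host records from two different cloud providers.
aws_host = {"provider": "aws", "sshd": {"PermitRootLogin": "no"}}
gcp_host = {"provider": "gcp", "sshd": {"PermitRootLogin": "yes"}}

# The same rule runs identically on both providers; only the result differs.
for host in (aws_host, gcp_host):
    status = "PASS" if check_ssh_root_login_disabled(host) else "FAIL"
    print(host["provider"], status)
```

Because the rule inspects the OS configuration rather than a provider-specific API, moving the workload from AWS to GCP does not change what "hardened" means.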
It is in times of outages that IT is stressed the most and likely to make mistakes.
Here, automation of the security compliance process is critical. Likewise, if workloads move from the cloud to on-premises or vice versa, the same benchmarks, rules, and automation must span these domains. Having to use one tool on one CSP and another in-house is yet another potential point of failure.
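The alternative to one tool per environment is a single rule set run across every domain. A hedged sketch of that idea, with hypothetical domain names, rules, and configurations chosen purely for illustration:

```python
# One shared rule set, applied to every domain (CSPs and on-premises alike).
# Rule names and config fields are illustrative assumptions.
RULES = {
    "password_max_age<=90": lambda cfg: cfg.get("password_max_age", 999) <= 90,
    "firewall_enabled": lambda cfg: cfg.get("firewall_enabled", False),
}

DOMAINS = {
    "aws": {"password_max_age": 60, "firewall_enabled": True},
    "gcp": {"password_max_age": 120, "firewall_enabled": True},
    "on-prem": {"password_max_age": 60, "firewall_enabled": False},
}


def compliance_report(domains, rules):
    """Map each domain to the list of rules it currently fails."""
    return {
        name: [rule for rule, check in rules.items() if not check(cfg)]
        for name, cfg in domains.items()
    }


for domain, failures in compliance_report(DOMAINS, RULES).items():
    print(domain, "FAIL:" if failures else "PASS", *failures)
```

Because every domain is measured against the same rules, a workload that fails over from one environment to another cannot silently drop out of compliance.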
We will never be able to prevent outages entirely, but by implementing best practices built on the available security tools, the enterprise can more effectively protect against negative customer impact, or worse.
Snippet of AWS Eastern US status during outage.
As many noted, even accurate reporting of the outage was unavailable for a while, which harkens back to the Mirai attack on US DNS infrastructure last October.