Adam Stern | Channel Futures
Amazon, that mighty river of e-commerce and cloud computing, recently sprang a leak. You may have heard about it.
When an errant keystroke from an AWS engineer took down the world’s largest public cloud for five hours on Feb. 28, the glitch exposed a fundamental truth: the bigger you are, the harder you fall.
Those on the receiving end of that typo-generated AWS outage were not amused. It turns out that Amazon, the cloud industry’s behemoth – with a very fat 31 percent share of the market – is accountable neither to its users nor to the Internet at large. The moral is simple: bigger is not necessarily safer. Or as Yaron Haviv, co-founder of Israel-based big data cloud provider iguazio, put it in SiliconANGLE, “the real question is: why have we created such a dependency on services such as AWS?”
AWS fell on the mighty and the masses alike. With an estimated one-third of all Internet traffic passing through AWS servers, sites from Slack to Quora to the U.S. Securities and Exchange Commission were out of commission for much of prime time that Tuesday.
According to Gizmodo, “in theory, a series of fail-safes should keep the fallout from such errors localized, but Amazon says that some of the key systems involved hadn’t been fully restarted in many years and ‘took longer than expected’ to come back online. Amazon says that its S3 service is ‘designed to deliver 99.999999999 percent durability’ and ‘99.99 percent availability of objects over a given year.’ But when one piece of the infrastructure fails, AWS fails big.” And that makes Amazon a giant dust cloud.
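To put those percentages in perspective, here is a back-of-the-envelope sketch. It assumes a 365-day year and treats the roughly five-hour outage as the only downtime in that year; the function names are illustrative, and Amazon's figure is a design target for object availability, not a promise about any single region.

```python
# Rough arithmetic only: a 365-day year, and the five-hour outage
# treated as the sole downtime in that year (both are assumptions).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Downtime per year that a given availability target leaves room for."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def achieved_availability_percent(outage_minutes: float) -> float:
    """Availability over the year if this outage were the only downtime."""
    return 100 * (1 - outage_minutes / MINUTES_PER_YEAR)


budget = allowed_downtime_minutes(99.99)  # the S3 availability design target
outage = 5 * 60                           # the five-hour outage, in minutes

print(f"Downtime budget at 99.99 percent: {budget:.1f} minutes per year")
print(f"Five-hour outage: {outage} minutes, {outage / budget:.1f}x the budget")
print(f"Resulting availability: {achieved_availability_percent(outage):.3f} percent")
```

On those assumptions, a 99.99 percent target leaves room for under an hour of downtime a year, so a single five-hour outage overshoots it several times over.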
With all due respect to the AWS organization, just try parsing this, from Amazon’s official statement on the outage: “The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate.”
Remember, this was a self-inflicted wound. There was no volumetric attack, no nefarious hack from Moldova anywhere in sight. As Haviv aptly observed, “What [Amazon’s statement is] saying is that big chunks of the Internet depend on just one or two local services to function.”
Where mega providers like Amazon are concerned, no one knows what’s under the covers. People assume Amazon – and Microsoft and IBM, for that matter – are doing things the right way, but the lack of transparency is precisely the problem. Amazon hadn’t rebooted its systems in years? What’s up with that? The very same customers put due diligence away…
AWS could have put customers in separate silos, but opted for one big pool in that geography. Mega providers go out of their way to use homegrown products, the designs of which remain trade secrets, and any efficiency gains they achieve can be undone by an absurdly minor human error. Bottom line: just because Amazon and Microsoft are big doesn’t mean they’re safe. Amazon’s share of the public cloud market currently stands at 30+ percent, with Microsoft at 9 percent and growing rapidly, and IBM SoftLayer at 7 percent. But size doesn’t equal smart.
It does matter what products and architecture a provider chooses. Smaller companies are by definition much more transparent, much more open to demonstrating to customers that they’re in a safe place. Some AWS users now want to know how to use Amazon to protect themselves from Amazon’s failures. That’s crazy.
Does “fail-safe” no longer have any meaning? Aiming for market dominance, Amazon neglected not only the computing masses it wants most fervently to woo, but also the corporate mandarins whose loyalty, up to now, has been unquestioned. In a climate where DDoS attacks are relentless and wreaking havoc, no provider – least of all the largest among us – can afford to allow a typographical error (a typo!) to slam into the S&P, as it did on the last day of February.
Fact is, multiple geographies need multi-provider redundancy. Where was AWS disaster recovery? It’s not a rhetorical question. The Sarbanes-Oxley Act mandates that businesses understand risk, online outages included, and take steps to ensure business continuity.
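For readers who want to see what multi-provider redundancy looks like in practice, here is a minimal sketch. Everything in it is hypothetical: the provider names and the stand-in fetch functions are placeholders, not any vendor's real API. The idea is simply to try each provider in order and serve the first copy that answers.

```python
from typing import Callable, Sequence

# A provider here is just a (name, fetch function) pair; both are hypothetical.
Provider = tuple[str, Callable[[str], bytes]]


def fetch_with_failover(key: str, providers: Sequence[Provider]) -> tuple[str, bytes]:
    """Try each provider in order and return the first successful response."""
    errors = []
    for name, fetch in providers:
        try:
            return name, fetch(key)
        except Exception as exc:  # any failure means: move on to the next provider
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def primary_down(key: str) -> bytes:
    # Stand-in for a provider whose region is unreachable.
    raise ConnectionError("region unavailable")


def secondary_ok(key: str) -> bytes:
    # Stand-in for a second provider holding a replicated copy.
    return b"copy of " + key.encode()


served_by, body = fetch_with_failover(
    "report.pdf",
    [("primary", primary_down), ("secondary", secondary_ok)],
)
print(served_by, body)
```

The catch, of course, is that the second copy has to exist before the first provider goes dark, which is exactly the kind of continuity planning at issue here.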
In my view, the lesson here is that the biggest players in the game need to clear the air, get out from under that dust cloud and model both transparency and accountability.