Cloud failures and failures that could be fixed with clouds.

[Note: I use the words "could" and "may" carefully here. As shown below, "cloud" and "cloud-like" can be highly divergent ideas.]

In this article I will introduce two failure models for large-scale systems, one classic and one new. Through examples I will show that server stability through “cloudification” has not yet been achieved: IBM failed to cloudify the critical infrastructure of Air New Zealand, and Amazon’s cloud-like solution failed in a manner that geographically dispersed clouds should not. Each example fits one of two types of system failure. The first type is large-scale computer system failures that could have been avoided or mitigated had the system designers used a cloud architecture or cloud principles. The second type is software and system deployments that, despite being implemented in a cloud-like environment, have failed to live up to the promises made around the technology. Both are important to understand fully, because a great number of people are either completely in the dark as to what clouds are and what benefits they provide, or believe that putting their applications “on the cloud” is a silver bullet that will magically slay their uptime, scaling, and security demons.

Put it on the cloud

The first type of failure is exemplified by The Register’s report on the IBM mainframe crash that took down Air New Zealand’s check-in desks, online bookings and call centers. In this case, the issue was definitely not a cloud problem: there was no cloud involved in any way on the systems side. According to the original report from Australia’s The Age, Air New Zealand had outsourced management of its mainframe and mid-range systems to IBM, who then dropped the ball, and the crash occurred. Theoretically, had those systems been deployed in a private cloud that embodied the core values of geographical dispersion, multi-network homing, and self-healing of resources, this issue could have been avoided or, at a minimum, greatly reduced. Such a design tolerates multiple failures at the software and hardware level: they may cause a service slowdown or even a loss of real-time access to lower-priority services, but the system as a whole remains online and able to serve customers.
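To make the idea of losing lower-priority services while the system as a whole stays online a little more concrete, here is a minimal Python sketch. It is not a description of Air New Zealand's or IBM's systems. The service names, region names, priorities, and capacity figures are all made up for illustration, and the policy (admit services in priority order while capacity remains) is just one simple assumption among many possible ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    priority: int  # 1 = most critical (hypothetical ranking)
    demand: int    # capacity units the service needs (hypothetical)


def plan_degradation(services, region_capacity, failed_regions):
    """Decide which services stay up after some regions fail.

    Services are considered in priority order; each one is admitted
    only if the surviving regions still have enough capacity for it.
    Returns a (served, shed) pair of service-name lists.
    """
    available = sum(
        capacity
        for region, capacity in region_capacity.items()
        if region not in failed_regions
    )
    served, shed = [], []
    for svc in sorted(services, key=lambda s: s.priority):
        if svc.demand <= available:
            available -= svc.demand
            served.append(svc.name)
        else:
            shed.append(svc.name)
    return served, shed


# Hypothetical deployment spread over three regions.
services = [
    Service("check_in", priority=1, demand=40),
    Service("online_booking", priority=2, demand=40),
    Service("reporting", priority=3, demand=30),
]
region_capacity = {"region_a": 60, "region_b": 60, "region_c": 60}

one_down = plan_degradation(services, region_capacity, {"region_a"})
two_down = plan_degradation(services, region_capacity, {"region_a", "region_b"})
```

With one region lost, everything in this toy setup still fits. With two regions lost, only check-in survives, which is the degraded-but-online outcome described above rather than a total outage.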

If the cloud doesn’t hold, service evaporates

Turning this around to see what happens when a service is deployed on a cloud-like platform and fails despite the promise, we have the Bitbucket DDoS attack that occurred early this month. The most detailed account of the affair comes from the blog of Jesper Nøhr, the developer who runs the code repository hosting service. To summarize, it was an extended, multi-phase botnet attack launched against Bitbucket’s website as part of a dispute between warring factions of developers over a project source repository hosted there. Most of the resulting news coverage focused heavily on Amazon’s incredible lack of immediate reaction to a customer under attack: Amazon ignored and downplayed the analysis and recommendations from Jesper’s network administrators, and a full 19 hours of downtime passed before initial service was resumed. This is alarming from a customer service point of view, but it is not the most concerning element. What the news reports seem to have largely ignored is just how stunning it is that a seemingly large-scale, established, distributed “cloud” service could be taken offline so easily. Related, but secondary, is the question of why it was so difficult for Amazon’s technicians to trace the issue back to its source and implement a fix. While no one can truly say exactly why the DDoS attack was so successful, as Amazon’s cloud service is a black box, a service implemented according to true cloud fundamentals would be able to withstand the beating Amazon took without the complete shutdown of site and service availability that Bitbucket experienced.

Denial of Service, the cloud way

Now, while the Bitbucket attack was definitely a traditional DDoS, this is a perfect place to mention a variation on the attack which has been dubbed “EDoS,” or Economic Denial of Service. The twist comes from a subtle difference in goals. In a DDoS, the intent is to use brute force to wipe a service or presence off the Internet completely, making it wholly unavailable to users for the duration of the attack and its fallout. In an EDoS, the intent is to slowly suffocate a service to death without the service provider realizing the problem until it’s too late. Chris Hoff (@beaker) wrote a great intro article on the topic in which he says:

EDoS attacks are death by 1000 cuts. EDoS can also utilize distributed $evil_doers as well as single entities, but works by making legitimate web requests at volumes that may appear to be “normal” but are done so to drive compute, network and storage utility billings in a cloud model abnormally high. Example: a botnet is activated to visit a website whose income results from ecommerce purchases. The requests are all legitimate but the purchases never made. The vendor has to pay the cloud provider for increased elastic use of resources where revenue was never recognized to offset them.
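As a rough illustration of why this is hard to spot, here is a small Python sketch of one way a site owner might watch for the pattern in the quote: traffic that looks legitimate but never converts into purchases, while the usage bill keeps growing. This is my own simplification, not something taken from Hoff's article. The baseline conversion rate, the per-request cost, the tolerance, and the traffic figures are all assumed numbers for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    requests: int   # web requests served in the billing window
    purchases: int  # completed purchases in the same window


def flag_edos(windows, baseline_conversion, cost_per_request, tolerance=0.5):
    """Flag windows whose conversion rate falls far below the baseline.

    A window is flagged when purchases / requests drops below
    tolerance * baseline_conversion. For each flagged window, the
    function also estimates the cost of the requests beyond what the
    observed purchases would normally have needed.
    Returns a list of (window_index, estimated_excess_cost) pairs.
    """
    flagged = []
    for index, window in enumerate(windows):
        if window.requests == 0:
            continue
        conversion = window.purchases / window.requests
        if conversion < baseline_conversion * tolerance:
            expected_requests = window.purchases / baseline_conversion
            excess = (window.requests - expected_requests) * cost_per_request
            flagged.append((index, round(excess, 2)))
    return flagged


# Hypothetical figures: a normal window, then one inflated by a botnet.
windows = [
    Window(requests=10_000, purchases=200),
    Window(requests=500_000, purchases=210),
]
suspects = flag_edos(windows, baseline_conversion=0.02, cost_per_request=0.001)
```

A real attacker could keep the ratio just inside whatever threshold is chosen, which is exactly the "death by 1000 cuts" problem: no single window looks abnormal enough to raise an alarm.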

Full circle

So, what should be made of all of this? Cloud deployments and systems, if properly set up and maintained, can help in a number of cases to maintain and improve uptime and availability while mitigating or sidestepping losses and outages from [D]DoS attacks and system failure. However, as with most technologies, “Cloud” is not a panacea for all service woes. It requires diligence from both the service provider and the application deployer to deliver on its promise. I will be going into depth on these and other issues in future posts. Stay tuned.