Lessons from Amazon Cloud Lightning Strike Outage

Amazon lost both its primary and backup power in a single lightning strike.

A lightning strike in Dublin took out a power transformer. In and of itself, that isn't all that unusual or noteworthy, but this particular lightning strike also affected the backup power systems at Amazon's cloud data center, knocking the service offline. Looking back, there are some lessons to be learned, both for Amazon and for businesses that rely on cloud services.

We're talking about a massive Amazon data center. Data centers are built from the ground up with backups and failovers designed to address virtually any scenario and ensure the survivability and availability of the data center no matter what sort of catastrophe strikes. Amazon, of course, has redundant mechanisms in place, but obviously they didn't work in this case.

On its Service Health Dashboard site for the European EC2 cloud service, Amazon explains, "Normally, upon dropping the utility power provided by the transformer, electrical load would be seamlessly picked up by backup generators. The transient electric deviation caused by the explosion was large enough that it propagated to a portion of the phase control system that synchronizes the backup generator plant, disabling some of them. Power sources must be phase-synchronized before they can be brought online to load. Bringing these generators online required manual synchronization."

In a nutshell, the lightning strike was direct and powerful enough that it simultaneously took out the transformer and the phase control system needed to bring the backup generators online. Amazon is in the process of restoring service and data for customers, a process that is taking longer than expected and has required Amazon to add server capacity to handle the load.

So, what are the lessons to be learned here? Well, Amazon should do a post mortem once the service is fully recovered. First, Amazon should analyze the circumstances that led to both primary and backup power being impacted at the same time. It should determine the likelihood of such an event occurring again, and what--if anything--can be done to avoid it. Perhaps the backup power should be on a different grid from the primary power, or maybe this is such a fluke incident that such an investment is cost-prohibitive.

Next, Amazon should review the recovery and restoration process. It should consider the hurdles and stumbling blocks it has encountered--like needing additional server capacity to handle the load more efficiently--and it should revise incident response processes and procedures to make any future disaster recovery operations more effective and efficient.

If you are a customer of Amazon, of Microsoft (which was also affected by the Dublin lightning storm), or of any other cloud data or server service, there are lessons to be learned as well. As I explained a few months ago following a cloud outage for Amazon in the United States, "Don't use cloud services unless you can adequately answer the question 'what happens to my business if the cloud service is unavailable?'"

You should have your own redundancy and disaster recovery systems in place. Depending on how crucial your cloud servers or data storage are to normal business operations, you could contract with more than one cloud service provider to hedge your bets and prevent an outage at one provider from taking down all of your operations at once.
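
To make the multi-provider idea concrete, here is a minimal Python sketch of the selection logic: try each provider in order of preference and use the first one that passes a health check. The provider names and the `first_available` helper are hypothetical, and the health check is simulated with a dictionary; a real setup would probe an actual service endpoint and would also need to keep data in sync between providers, which this sketch does not attempt.

```python
from typing import Callable, Optional, Sequence


def first_available(providers: Sequence[str],
                    probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first provider whose health probe succeeds, or None."""
    for name in providers:
        try:
            if probe(name):
                return name
        except Exception:
            # A probe that raises (timeout, refused connection) counts as down.
            continue
    return None


# Simulated outage: the hypothetical primary provider is offline.
status = {"provider-a": False, "provider-b": True}
active = first_available(["provider-a", "provider-b"],
                         lambda name: status[name])
print(active)  # provider-b
```

If every provider fails the probe, the function returns `None`, which is the case your disaster recovery plan still has to answer.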

You should also make sure you understand the failover and redundancy mechanisms offered by your cloud provider. Amazon offers Availability Zones that enable customers to set up their own redundancy within the cloud.

The ultimate lesson, though, is that nothing is 100 percent guaranteed. Even the most reliable service can be knocked offline by a fluke natural disaster, or even catastrophic human error. Your mission is to develop a system that enables you to continue business operations no matter what.

