Lessons from Amazon’s Massive AWS Outage

by Mark Harris Mar 22, 2017

It’s not breaking news to say that humans aren’t perfect. Yet many organizations operate on the unrealistic expectation that their IT teams will never make a mistake. According to Uptime Institute’s ongoing research, IT is actually falling behind in keeping systems and services running: more outages are being reported, and each one lasts longer and does more damage to the business. And migrating your IT services to a cloud provider is NOT the answer.

The 2017 Amazon Web Services (AWS) outage is a perfect example. Hysteria ensues after any major outage, and the pressure on IT teams to quickly identify and fix the problem can be overwhelming. Yet the cause can be something as mundane as a typo. A simple human error, and yet it caused havoc across the Fortune 2000 globally.

In Amazon’s case, that’s exactly what happened when an engineer tried to address a problem with its billing system:

“An authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”

Like most human errors, this one could have been avoided, and not just with a little more attentive typing. In fact, changes are often made to individual devices only for the team to realize later that the IT services traversing those devices have been affected unintentionally. In the networking world, the problem can be quite acute. Traditionally, network engineering has required a lot of manual work, from data collection to troubleshooting. Manual work, particularly tedious manual work, often leads to human error. And rarely are all of the applications and services that depend on the changed devices proactively run through quality control to confirm that they are still fully operational. In the case of AWS, an engineer was working through an established playbook and made a simple typing error, but the change could just as easily have been made correctly and still had unintended consequences for IT services. It happens ALL THE TIME.
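To make "more than attentive typing" concrete, here is a minimal sketch of a guardrail that a removal command could run before touching anything. It is purely illustrative: the names `RemovalPolicy`, `plan_removal`, and `UnsafeRemovalError`, along with the limits used, are hypothetical and do not describe AWS's tooling or NetBrain's product. The idea is simply that the command refuses any request that names unknown servers, removes more than a fixed number in one run, or would leave the pool below a capacity floor.

```python
from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal request exceeds the configured safety limits."""


@dataclass(frozen=True)
class RemovalPolicy:
    # Hypothetical limits; real values would come from capacity planning.
    max_servers_per_run: int  # hard cap on servers removed by one command
    min_remaining: int        # capacity floor the pool must never drop below


def plan_removal(pool, requested, policy):
    """Return the sorted servers to remove, or raise if the request is unsafe."""
    unknown = sorted(s for s in set(requested) if s not in pool)
    if unknown:
        raise UnsafeRemovalError(f"unknown servers: {unknown}")

    targets = sorted(set(requested))
    if len(targets) > policy.max_servers_per_run:
        raise UnsafeRemovalError(
            f"{len(targets)} servers requested, limit is "
            f"{policy.max_servers_per_run} per run"
        )

    remaining = len(pool) - len(targets)
    if remaining < policy.min_remaining:
        raise UnsafeRemovalError(
            f"only {remaining} servers would remain, minimum is "
            f"{policy.min_remaining}"
        )
    return targets


if __name__ == "__main__":
    pool = {f"srv-{i:02d}" for i in range(10)}
    policy = RemovalPolicy(max_servers_per_run=2, min_remaining=6)
    # Simulate a mistyped input that selects far more servers than intended.
    mistyped_request = sorted(pool)[:5]
    try:
        plan_removal(pool, mistyped_request, policy)
    except UnsafeRemovalError as err:
        print(f"refused: {err}")
```

A check like this does not replace verifying the affected services after a change, but it turns one class of typo into a refused command rather than an outage.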

At NetBrain, we’ve designed our entire Network Problem Diagnostic Automation System to help minimize tedious and inconsistent manual work by implementing network automation through Executable Runbooks. And by leveraging our real-time model of the network, together with the outcomes a change is intended to produce, we can verify that the change has been good for the business.

Instead of relying on traditional grassroots efforts, where knowledge is often found on a piece of paper or isolated within a team of experts, network engineers can codify their proven best-practice processes into executables that can be shared with colleagues and then run with minimal human intervention. The power of intent-based automation extends beyond reducing errors. It also accelerates troubleshooting while distributing the workload of advanced tasks across multiple team members. This helps reduce over-reliance on tribal knowledge and builds a stronger culture of collaboration across the network, security, and change management teams. It’s a means to scale knowledge and experience across any organization.

Digitizing best practices and automating their execution is key. If AWS had leveraged something similar to Executable Runbooks, it’s entirely possible that the outage could have been avoided. In our world, network teams can easily create, run, and share Executable Runbooks. And with them, they can troubleshoot issues, diagnose network slowness, proactively guard against misconfiguration, and more, all without the fear that the fat-fingered lady will sing.

Learn more about Executable Runbooks and how network engineers can share knowledge, reduce manual work and improve the network.