Lessons from Amazon’s Massive AWS Outage

March 22, 2017

It’s not breaking news to say that humans aren’t perfect. Yet many organizations operate on the unrealistic expectation that their IT teams will never make a mistake.

The recent Amazon Web Services (AWS) outage is a perfect example. Hysteria ensues after any major outage, and the pressure on IT teams to quickly identify and fix the problem can be overwhelming. Yet the cause can be something as mundane as a typo. A simple human error.

In Amazon’s case, that’s exactly what happened when an engineer tried to address a problem with the S3 billing system:

“An authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”

Like most human errors, this one could have been avoided, and not just with a little more attentive typing. In the networking world, the problem can be equally acute. Traditionally, network engineering has required a lot of manual work, from data collection to manual troubleshooting. Manual work, particularly tedious manual work, often leads to human error. In the case of AWS, an engineer was working through an established playbook and made a simple typing error.

At NetBrain, we’ve designed our entire system to minimize tedious manual work by implementing network automation through Executable Runbooks.

Instead of relying on traditional playbooks, where knowledge is often found on paper or isolated to a team of experts, network engineers can codify processes into lightweight apps (Executable Runbooks) that can be run without human intervention. The power of these apps extends beyond reducing error. They also accelerate troubleshooting while distributing the workload of advanced tasks across multiple team members. This helps reduce over-reliance on tribal knowledge and builds a stronger culture of collaboration across the network, security, and change management teams.
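To make the idea of codifying a playbook step concrete, here is a minimal sketch in Python. It is not NetBrain’s Executable Runbook format and not AWS’s tooling; every name in it (`plan_removal`, `max_batch`, `min_remaining`, the `srv-N` server names) is a hypothetical chosen for illustration. It shows one way a scripted step might refuse a mistyped input instead of acting on it: the request is checked against the known fleet, a batch-size cap, and a minimum remaining capacity before anything is removed.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


class GuardrailError(Exception):
    """Raised when a removal request fails a safety check."""


@dataclass(frozen=True)
class RemovalPlan:
    servers: Tuple[str, ...]  # servers approved for removal, in request order
    remaining: int            # fleet size after the removal


def plan_removal(
    fleet: Sequence[str],
    requested: Sequence[str],
    max_batch: int = 2,      # hypothetical cap on servers removed per run
    min_remaining: int = 3,  # hypothetical minimum capacity to keep online
) -> RemovalPlan:
    """Validate a removal request and return a plan without removing anything."""
    fleet_set = set(fleet)

    # Reject names that are not in the fleet (a typical symptom of a typo).
    unknown = [name for name in requested if name not in fleet_set]
    if unknown:
        raise GuardrailError(f"unknown servers: {unknown}")

    # De-duplicate while preserving the order the operator typed.
    unique = tuple(dict.fromkeys(requested))

    # Reject requests that are larger than a routine change should be.
    if len(unique) > max_batch:
        raise GuardrailError(
            f"refusing to remove {len(unique)} servers (limit {max_batch})"
        )

    # Reject requests that would leave too little capacity behind.
    remaining = len(fleet_set) - len(unique)
    if remaining < min_remaining:
        raise GuardrailError(
            f"only {remaining} servers would remain (minimum {min_remaining})"
        )

    return RemovalPlan(servers=unique, remaining=remaining)


if __name__ == "__main__":
    fleet = [f"srv-{i}" for i in range(6)]
    print(plan_removal(fleet, ["srv-0", "srv-1"]))
    try:
        # An input that selects the whole fleet by mistake is refused.
        plan_removal(fleet, fleet)
    except GuardrailError as err:
        print("blocked:", err)
```

Separating the plan from the action is a deliberate choice in this sketch: the step that validates the input never touches a server, so a second engineer or an automated check can review the plan before it is executed.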

Digitizing best practices and automating their execution is key. If AWS had leveraged something similar to Executable Runbooks, it’s entirely possible that the outage might have been avoided. In our world, network teams can easily create, run, and share Executable Runbooks. And with them, they can troubleshoot issues, diagnose network slowness, proactively guard against misconfiguration, and more, all without the fear that the fat-fingered lady will sing.

Learn more about Executable Runbooks and how network engineers can share knowledge, reduce manual work, and improve the network.