Runbook Automation: Convert Actions into Shareable Knowledge

By Jason Baudreau May 15, 2018 4 minute read
Jason Baudreau has a passion for enterprise networking, writing and geeking out with Smart Home technology. With 8 years' experience in Systems and Network engineering, Jason now specializes in network automation technologies at NetBrain.

How to transfer tribal knowledge with self-documenting runbooks

There’s an old saying, “there is nothing new under the sun,” which emphasizes the cyclical nature of life. You can take this in many contexts, but there’s one worth applying to enterprise networks and the cyclical (i.e. recurring) problems we find ourselves troubleshooting time and again. I challenge you to find a network problem that is truly unique. But unless you’re the person who was in the trenches when it reared its ugly head, you won’t benefit from the lessons previously learned. You’ll likely need to escalate to someone who has that experience.

 

What has been will be again, what has been done will be done again; there is nothing new under the sun.

 

In my experience, problems that are escalated all the way to Tier-3 are either complex (deep design expertise is required) or obscure (the underlying problem is very rare). In either case, it’s the “Tribal Leaders” who have the expertise and experience to swoop in and save the day. These people earned their reputation because they either designed the network (so they understand the complexity) or have managed the infrastructure for a long time (think Farmers Insurance: “we know a thing or two, because we’ve seen a thing or two”).

 


 

The thing is, it’s both inefficient and frustrating to have this knowledge locked away in a “single point of failure”. If you’re that Tribal Leader, your time is never truly your own… imagine getting that escalation call while you’re vacationing in the Maldives.

 

Maybe you’re thinking to yourself, “Isn’t that what network wikis and runbooks are for?” These are a great way to document knowledge, in theory. In practice, I’ve rarely seen it executed well. If you’re like most Tribal Leaders, you’re probably too busy putting out fires and designing network upgrades to write that network wiki. Wouldn’t it be better if those runbooks could write themselves?

This is our thinking behind Runbook Automation – that knowledge should be so easy to capture, digitize, and share that everyone can do it without thinking. That’s why I’m excited about the new self-documenting aspect of runbooks which we introduced in NetBrain 7.1. Now, whenever you use NetBrain to troubleshoot, your actions are recorded – whether that’s a ping, traceroute, CLI command… or maybe more advanced automation like a Qapp. That means your methods are documented automatically and translated into a repeatable (and executable) process. The next time a similar problem occurs, another member of the staff can take that runbook and execute it themselves, before they call you. Poof – knowledge transferred.
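To make the idea concrete, here is a minimal sketch of what “self-documenting” could look like in code. It is not NetBrain’s API or internal design; the `Runbook` and `Step` classes and the `fake_ping` and `fake_traceroute` helpers are hypothetical names invented for illustration, and the helpers return canned text instead of touching a real network.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    """One recorded troubleshooting action."""
    action: str   # e.g. "ping", "traceroute", "cli"
    target: str   # device or address the action was run against
    output: str   # what the action returned at the time


@dataclass
class Runbook:
    """Hypothetical recorder: every action run through it is logged as a Step."""
    title: str
    steps: List[Step] = field(default_factory=list)

    def run(self, action: str, target: str, fn: Callable[[str], str]) -> str:
        """Execute the action and record it as a side effect."""
        output = fn(target)
        self.steps.append(Step(action, target, output))
        return output

    def replay(self, handlers: Dict[str, Callable[[str], str]]) -> List[str]:
        """Re-execute the recorded steps, in order, with the supplied handlers."""
        return [handlers[step.action](step.target) for step in self.steps]


# Stand-ins for real tools so the sketch runs offline (assumptions, not real probes).
def fake_ping(target: str) -> str:
    return f"{target} is reachable"


def fake_traceroute(target: str) -> str:
    return f"path to {target}: 3 hops"


runbook = Runbook("Slow application")
runbook.run("ping", "firewall", fake_ping)
runbook.run("traceroute", "db-gateway", fake_traceroute)

# The engineer never wrote documentation, yet the method is now on record.
recorded = [(step.action, step.target) for step in runbook.steps]
replayed = runbook.replay({"ping": fake_ping, "traceroute": fake_traceroute})
```

The design point is that recording happens inside `run`, so documentation is a by-product of troubleshooting, not a separate chore.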

Let’s take an example. Suppose you’re a network architect and a ticket was just escalated to you at Tier-3… you need to diagnose a multi-tier app which is running slowly. Here’s a map of the problem area:

[Figure: NetBrain A/B path map. Problem Area – Slow Application]

 

As you begin to troubleshoot, notice that each action you take to diagnose the path is documented inside the runbook automagically. Maybe you start with a basic ping to the firewall:

…Next, you run a traceroute from the gateway of the web server to the gateway of the database server:

…And then you collect some data by executing some common CLI commands across the map:

…Finally, you check the overall health of the network and then run a Qapp, which determines that a basic configuration mistake is to blame:

What you’ve done (besides resolving a critical incident) is document the method you used to arrive at your conclusion. Now for the most important part… don’t let that knowledge go to waste! Save this runbook as a template so that it can be reused by Tier-1 and Tier-2.
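As a rough illustration of the “save as a template” step, the sketch below strips the incident-specific output from a recorded session and keeps only the method, so someone else can follow the same steps. The JSON layout, the `to_template` and `checklist` functions, and the target names are all assumptions made for this example; they do not describe NetBrain’s actual template format.

```python
import json
from typing import Dict, List


def to_template(title: str, steps: List[Dict[str, str]]) -> str:
    """Keep the method (action + target) and drop the incident-specific output."""
    method = [{"action": s["action"], "target": s["target"]} for s in steps]
    return json.dumps({"title": title, "steps": method}, indent=2)


def checklist(template_text: str) -> List[str]:
    """Turn a saved template into an ordered list a Tier-1 engineer can follow."""
    template = json.loads(template_text)
    return [
        f"{number}. {step['action']} -> {step['target']}"
        for number, step in enumerate(template["steps"], start=1)
    ]


# A recorded session mirroring the walkthrough above (placeholder targets and results).
session = [
    {"action": "ping", "target": "firewall", "output": "reply received"},
    {"action": "traceroute", "target": "web-gateway to db-gateway", "output": "3 hops"},
    {"action": "cli", "target": "all devices on map", "output": "data collected"},
    {"action": "qapp", "target": "config check", "output": "mistake found"},
]

template_text = to_template("Slow application", session)
steps_for_tier1 = checklist(template_text)
```

Dropping the output is a deliberate choice in this sketch: the results belong to one incident, while the sequence of steps is what transfers to the next one.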

This executable knowledge is even more valuable than the outage you resolved. Why? Because there will be another problem like this one (there’s nothing new under the sun). But next time, maybe you won’t get the call while you’re drinking a salted margarita from your beach bungalow.