
Network Troubleshooting: Using a Magnet to Find That Needle in a Haystack

July 9, 2018

According to recent Enterprise Management Associates (EMA) research, the average enterprise network management team spends 75% of its time troubleshooting problems. Specifically, teams spend about 35% of their time on reactive firefighting and another 40% on proactive problem prevention.

While one can argue pretty convincingly that ensuring uptime is job #1 for network teams, this singular focus presents a serious problem. It leaves only a couple of hours a day for working on strategic projects that deliver value to the business. Today’s enterprises are asking their network teams to support major shifts in technology strategy and architecture. (Think SDN, IoT, cloud, new virtualization technologies.) And with troubleshooting the network-as-is monopolizing their time, there just aren’t enough hours in the day for the network-as-it-could-be.

Check out this 1-minute video to see how automation cuts network troubleshooting time in half

 

NetBrain’s own research has found that the vast majority (80%) of network outages last over an hour, with roughly 40% of network teams saying that it takes more than 4 hours to troubleshoot a “typical” network issue. Given the volume, variety and velocity of trouble tickets hitting the NOC (and the number of tickets getting escalated to senior Level 2 and 3 engineers — the very guys tasked with working on those “strategic projects”), it becomes crucial to accelerate network troubleshooting time.

Finding the needle in a haystack: Reducing MTTI

Whether we’re talking about IT, manufacturing or the power grid, the formula for MTTR (the total amount of time it takes to get back to normal operations ÷ the number of problems) is well understood. What’s less intuitive is the dividend, or top half, of the equation. It really comprises two parts: the time it takes to actually repair something, and the time it takes to figure out what needs fixing. That second piece is referred to as Mean Time to Identify (MTTI). About 80% of our MTTR is spent trying to identify and locate the problem (trying to find the needle in a haystack); only 20% of our time is truly spent on repair.
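To make that 80/20 split concrete, here's a minimal sketch in Python; the incident count and hours below are made up purely for illustration:

```python
# Illustrative only: the figures below are hypothetical, not measured data.

incidents = 25                 # problems resolved this month
total_restore_hours = 100.0    # total time spent getting back to normal operations

mttr = total_restore_hours / incidents   # Mean Time to Repair (overall average)
mtti = mttr * 0.80                       # ~80% goes to identifying the problem
repair_time = mttr * 0.20                # ~20% is the actual fix

print(f"MTTR: {mttr:.1f} h per incident")
print(f"  identification (MTTI): {mtti:.1f} h")
print(f"  actual repair:         {repair_time:.1f} h")
```

With these hypothetical numbers, each incident averages 4 hours, of which roughly 3.2 hours go to finding the problem and only 0.8 hours to fixing it.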

Further, since the network is so often guilty until proven innocent, MTTI can also stand for Mean Time to Innocence.

MTTI graphic

For example, say an application is running slow. Immediately all fingers point to you, the network guy. Must be your fault. So you spend two hours investigating the issue, only to conclude definitively that it’s not the network. (Everything is running perfectly.) Now the systems guy or the application team jumps in to troubleshoot. You didn’t spend a single minute on repair, but you still burned a couple of hours dealing with the situation.

Or say the network is the guilty party. Detecting the problem is almost instantaneous, thanks to modern 24×7 monitoring solutions. Fixing the problem usually isn’t what takes up all our time. It’s finding the problem that’s so laborious and that sends your SLAs down the drain. It’s finding that needle in a haystack. The EMA report states that the single most time-consuming aspect of troubleshooting is “identifying the problem (e.g., information gathering, symptom analysis, etc.).” The time to identify can take hours, even days.

EMA research finds that the single most time-consuming aspect of troubleshooting is identifying the problem.

 

The typical manual workflow today takes a lot of time and involves a lot of duplication of effort.

Say you come in one morning and there’s a ServiceNow ticket that came in overnight. First you need to assemble all the documentation and hope that the diagrams are up to date. (Spoiler alert: They won’t be.) You probably have some kind of predefined set of procedures outlined in a playbook somewhere. This usually involves a lot of manual data collection via the CLI, issuing one command at a time, one device at a time. Then you have to sift through reams of raw text output in “stare and compare” mode to find the pertinent information. Not to mention stitching together siloed data from other tools, jumping from screen to screen, to get a clear, coherent picture of what’s really going on in the network. And after all that manual effort, most trouble tickets still get escalated to more senior network engineers, who usually perform the exact same basic diagnoses to verify the data received and then dig a little deeper. The same thing happens again if the ticket gets escalated to a Level 3 engineer. At each successive stage of escalation, engineers struggle to verify information, with either too little diagnostic data to draw any conclusions from or too much (e.g., log dumps).

Traditional manual troubleshooting workflow
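To see why that stage is so slow, here's a minimal sketch of that device-by-device, command-by-command collection as a script. The device list, credentials and commands are placeholders, and netmiko is just one common way to drive the CLI over SSH:

```python
# Minimal sketch of the serial, device-by-device collection described above.
# Device details, credentials and commands are placeholders for illustration.
from netmiko import ConnectHandler

DEVICES = [
    {"device_type": "cisco_ios", "host": "10.0.0.1", "username": "admin", "password": "secret"},
    {"device_type": "cisco_ios", "host": "10.0.0.2", "username": "admin", "password": "secret"},
]

COMMANDS = ["show running-config", "show ip route", "show interfaces"]

for device in DEVICES:
    conn = ConnectHandler(**device)          # one SSH session at a time
    for cmd in COMMANDS:
        output = conn.send_command(cmd)      # one command at a time
        # Raw text piles up in files; a human still has to "stare and compare".
        with open(f"{device['host']}_{cmd.replace(' ', '_')}.txt", "w") as f:
            f.write(output)
    conn.disconnect()
```

Even scripted, this approach just produces more raw text faster; someone still has to read it all and correlate it by hand.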

So what’s the best way to find the needle in a haystack? (Probably burn the haystack, but let’s assume that’s not an option.) Use a magnet. For network engineers, that magnet is automation.

Reduce MTTI with automated network troubleshooting

Just like a high-powered magnet is the fastest way to find the exact location of a needle in the haystack, NetBrain automation — Dynamic Maps, Executable Runbooks and API integration — zeroes in on the network problem with pinpoint accuracy and at a speed no human can match when troubleshooting by hand. Automation reduces the number of repetitive, time-sucking steps a human is required to perform when troubleshooting.

Here’s how a troubleshooting workflow enhanced by automation would look.

Let’s say you’re a Level 1 NOC engineer troubleshooting poor VoIP quality. When the ServiceNow ticket was created, NetBrain was triggered to automatically build a Dynamic Map of the L2 path that traffic was flowing along when the problem was detected. NetBrain’s path framework looks into VRF and MPLS labels, evaluates ACLs and PBR, etc., so you know you’ve got the problem scoped out accurately. An Executable Runbook was also auto-triggered by the ServiceNow ticket notification. The Runbook ran a heat map of top performance indicators (Up/Down, utilization, CPU, memory) and saved the results right within the map. The map is further enriched with critical data from other systems via API (e.g., 24×7 monitoring information). The Runbook also contained a node that pulled live data from a bunch of different show commands (config files, route tables) across all devices simultaneously, compared the data from these devices to see if anything changed since your last benchmark, and again documented the results automatically. This all happened before you even saw the ServiceNow ticket, and all this data and analysis are just waiting for you. Turns out that nothing had changed since the last benchmark, and this is as far as your company expects a Level 1 analysis to go. You escalate the ticket up to a Level 2 engineer.

Automated troubleshooting workflow
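For a rough idea of how that kind of ticket-driven trigger can be wired together, here's a hypothetical sketch: a small webhook handler that fires when ServiceNow creates a ticket and asks the automation platform to map the path and run a diagnostic Runbook. The endpoint URL, payload fields, Runbook name and token are invented for illustration; this is not NetBrain's documented API.

```python
# Hypothetical sketch of the event-driven integration described above.
# The automation endpoint, payload fields and token are illustrative placeholders.
import requests
from flask import Flask, request, jsonify

app = Flask(__name__)

AUTOMATION_URL = "https://netbrain.example.com/api/trigger"   # placeholder
API_TOKEN = "REPLACE_ME"                                      # placeholder

@app.route("/servicenow/ticket-created", methods=["POST"])
def on_ticket_created():
    ticket = request.get_json()
    # Ask the automation platform to map the path and run the triage Runbook
    # for the endpoints named in the ticket, before an engineer ever opens it.
    payload = {
        "ticket_id": ticket["number"],
        "source": ticket["caller_ip"],
        "destination": ticket["callee_ip"],
        "runbook": "VoIP Quality Triage",
    }
    resp = requests.post(
        AUTOMATION_URL,
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    return jsonify({"triggered": resp.ok}), 202

if __name__ == "__main__":
    app.run(port=8080)
```

The point is the sequencing: the mapping and data collection happen at ticket-creation time, so the diagnostics are already attached to the ticket by the time a human picks it up.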

Let’s switch roles and now you’re the Level 2 guy. When the ticket gets escalated to you, you can see exactly which diagnoses were and weren’t performed, and the results are visualized in context on the Dynamic Map. All you have to do is click on the URL of the map from within the ServiceNow ticket. Your first step might be to check out the interface policy across all L3 devices, but you have a smorgasbord of devices from different manufacturers. No problem: NetBrain can parse CLI commands from more than 90 vendors. All you have to do is drag and drop the Highlight Interface Policy node into the Runbook. It looks like everything has QoS policies except one router. That could be the problem, but since there wasn’t a lot of traffic, you keep going to see if there are other issues you can detect quickly, say, by looking into each queue along the path to check for drops and review configurations. All results from your Detect Queue Drops and Provide Details node are annotated on the same Dynamic Map. You immediately see which policy and class is experiencing drops and in which direction. Hovering over QoS configuration labels lets you quickly review the policies. In a matter of moments, you’ve discovered that QoS is doing its job but you’ve exceeded your allocated bandwidth, which is causing the drops. All the steps you performed are automatically captured in the self-documenting Runbook. You’ve effectively digitally documented your Level 2 troubleshooting process and codified your tribal knowledge. Add notes or comments to the Runbook so it can be executed with literally just a mouse click or two by your Level 1 first responder. You just empowered your Level 1 to handle an issue that previously necessitated escalation!
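Under the hood, “detect queue drops” boils down to collecting the QoS counters and picking out the classes that are actually dropping. Here's a rough, hand-rolled sketch of that kind of parsing, assuming Cisco-IOS-style "show policy-map interface" output; it's illustrative only (not how NetBrain implements its parsers), and the exact text varies by platform and software version.

```python
# Rough sketch of scanning "show policy-map interface" text for queue drops.
# QoS output formats vary by platform and version; these patterns assume a
# Cisco-IOS-style layout and are illustrative, not a production parser.
import re

CLASS_RE = re.compile(r"Class-map:\s+(\S+)")
DROPS_RE = re.compile(r"\(total drops/bytes drops\)\s+(\d+)/(\d+)")

def classes_with_drops(show_output: str) -> dict:
    """Return {class_name: packet_drops} for classes reporting non-zero drops."""
    current_class, results = None, {}
    for line in show_output.splitlines():
        m = CLASS_RE.search(line)
        if m:
            current_class = m.group(1)
            continue
        d = DROPS_RE.search(line)
        if d and current_class and int(d.group(1)) > 0:
            results[current_class] = int(d.group(1))
    return results
```

Run that against the output collected from each device along the path (in both directions) and you get exactly the picture described above: which class, on which hop, is dropping traffic.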

Resolve known issues before they become problems

The other way to improve your MTTR is, of course, to reduce the number of problems in the first place. Continuing the VoIP example, let’s say you make a fix by adjusting the queue size or tracking down the device that’s congesting the traffic. Now that it’s a known problem, you’ll want to keep an eye on it and search across the rest of the network for similar issues. Leverage your lessons learned by scheduling NetBrain to continuously monitor (every hour, for example) specifically for this problem to make sure it never crops up again. Regularly scheduled “problem-based monitoring” helps pin down those ephemeral, intermittent problems and turns reactive firefighting into proactive problem prevention.
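Conceptually, problem-based monitoring is just a diagnosis you already trust, re-run on a schedule. A bare-bones sketch of that loop might look like the following; the check and alert functions are placeholders for your own diagnostic and notification hooks:

```python
# Minimal sketch of hourly "problem-based monitoring" for a known issue.
# check_voip_queue_drops() stands in for whatever diagnosis found the original
# problem (e.g., re-running the queue-drop check along the same path).
import time

def check_voip_queue_drops() -> bool:
    """Return True if the known QoS drop condition has reappeared (stub)."""
    return False  # placeholder: wire this to your real diagnostic

def alert(message: str) -> None:
    print(message)  # placeholder: send to your ticketing or chat system instead

while True:
    if check_voip_queue_drops():
        alert("Known VoIP queue-drop issue has recurred; investigate before users notice.")
    time.sleep(3600)  # repeat every hour
```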

 

Three ways to see NetBrain troubleshooting capabilities in action

1. Want to see NetBrain automation but not ready to engage with sales?

Check out our public 20-minute engineer-to-engineer Live Demo with Q&A. See the upcoming schedule here.

2. Test-drive NetBrain in our trial environment.

Get a hands-on feel for the technology in our curated lab environment. Sign up for a 14-day free trial here.

3. Schedule a personalized demo.

Want to get down to brass tacks and discuss NetBrain for your specific workflows and challenges? Schedule a private demo tailored to your individual needs.
