Three Tips to Reduce MTTR through Network Automation

October 7, 2019

Troubleshooting must be smarter

In this blog, you’ll see why the application of network automation is one of the few fundamental changes that can be made in the NetOps function to increase efficiency, enable rapid scale and ultimately improve customer experience by significantly reducing MTTR.

“Nearly 70 percent of workers say the biggest opportunity of automation lies in reducing time wasted on repetitive work,” says the report Automation in the Workplace. How can that be? Well, as it turns out, the vast majority of all network service tickets can actually be grouped into just a couple dozen buckets of problem types *IF* you understand and can leverage the concept of ‘similar’. For instance, when an HA pair of devices needs to stay mirrored, it doesn’t matter whether the vendor of the pair is Cisco or Citrix; the task is the same: keeping the pair mirrored. Without the concept of ‘similar’, handling the verification activity would require a separate ticket for each pair. In fact, across the infrastructure, there may be a hundred HA pairs from various vendors that need to remain in lock-step. That is pretty easy to address if you understand the concept of similar, but labor- and time-intensive if you must treat everything as unique. (And better yet, with the right automation platform, the synchronization of all similar HA pairs could even be tested continuously and proactively, eliminating the need for tickets altogether!)
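To make the idea of ‘similar’ concrete, here is a minimal Python sketch of one vendor-neutral mirror check applied to every HA pair. It is an illustration only, not NetBrain’s implementation: the `HAPair` class, the config dictionaries, and the pair names are all hypothetical, and a real check would pull normalized settings from the devices themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HAPair:
    """Hypothetical record for one HA pair, with settings already normalized."""
    name: str
    vendor: str
    primary_config: dict
    secondary_config: dict


def config_drift(pair, keys):
    """Return the settings whose values differ between the two members."""
    return sorted(
        k for k in keys
        if pair.primary_config.get(k) != pair.secondary_config.get(k)
    )


def check_all_pairs(pairs, keys):
    """Run the same mirror check on every pair; report only pairs with drift."""
    report = {}
    for pair in pairs:
        drift = config_drift(pair, keys)
        if drift:
            report[pair.name] = drift
    return report


# Made-up sample data: the vendor differs, the check does not.
pairs = [
    HAPair("fw-edge", "Cisco",
           {"vlans": [10, 20], "ntp": "10.0.0.1"},
           {"vlans": [10, 20], "ntp": "10.0.0.1"}),
    HAPair("lb-core", "Citrix",
           {"vlans": [30], "ntp": "10.0.0.1"},
           {"vlans": [30, 40], "ntp": "10.0.0.1"}),
]

drift_report = check_all_pairs(pairs, keys=["vlans", "ntp"])
print(drift_report)  # {'lb-core': ['vlans']}
```

The vendor field never enters the comparison, which is the point: one check covers a hundred pairs, and it could be scheduled to run continuously instead of waiting for a ticket.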

And what about resolving reported network problems of other types? Well, think about what happens every time a network engineer or technician begins to work on a service ticket. What do they do? They try to determine the vicinity of what may be involved, they look for available documentation, they try to determine what is connected to what, they run a series of health checks on devices and paths, and they look at all of the involved devices’ operating conditions: CPU utilization, memory, firmware versions, etc. All of this preliminary work gets repeated over and over for every service ticket and could consume more than HALF of the total repair time needed to close the ticket. Again, if an automation platform were able to automatically capture the SME’s set of best preliminary practices (like those above) and simply execute them the moment every similar problem is reported, it would be able to conduct all of the time-consuming investigation, health checking, and mapping, and ultimately hand the RESULTS to the network engineer for reference the moment they begin their work. Hours would be saved on every service ticket, multiplied by hundreds or thousands of service tickets per month. It adds up pretty quickly!
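The preliminary work described above can be sketched as a small “runbook” of health checks that runs as soon as a ticket arrives and attaches its findings to the ticket. This is a simplified illustration under stated assumptions, not how NetBrain PDAS is built: the ticket fields, the `RUNBOOKS` table, the thresholds, and the device inventory are all hypothetical.

```python
def threshold_check(label, field, limit):
    """Build a check that flags a device metric above `limit` percent."""
    def check(device):
        value = device[field]
        status = "OK" if value <= limit else "HIGH"
        return f"{label}: {status} ({value}%)"
    return check


# Hypothetical mapping from problem category to the SME's preliminary checks.
RUNBOOKS = {
    "application_slowness": [
        threshold_check("cpu", "cpu_percent", 80.0),
        threshold_check("memory", "memory_percent", 85.0),
    ],
}


def enrich_ticket(ticket, inventory):
    """Run the category's checks on every involved device and attach results."""
    checks = RUNBOOKS.get(ticket["category"], [])
    findings = {
        name: [check(inventory[name]) for check in checks]
        for name in ticket["devices"]
    }
    return {**ticket, "findings": findings}


# Made-up inventory snapshot and ticket.
inventory = {
    "core-sw1": {"cpu_percent": 92.0, "memory_percent": 40.0},
    "db-fw1": {"cpu_percent": 15.0, "memory_percent": 88.0},
}
ticket = {
    "id": "INC001",
    "category": "application_slowness",
    "devices": ["core-sw1", "db-fw1"],
}

enriched = enrich_ticket(ticket, inventory)
for device, results in enriched["findings"].items():
    print(device, results)
```

Because the checks run when the ticket is created, the engineer opens it to find the results already attached rather than starting the investigation from scratch.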

That is where NetBrain’s PDAS solution shines! Let me explain what it can do in more detail…

Tip 1: Fully automate the diagnostics of common problems

IT Service Management (ITSM) platforms manage the lifecycle of your network issues on an ongoing basis. Everything that needs attention in the hybrid network typically takes the form of a service ticket being created, worked, and then closed. It’s how the business of NetOps is managed. Over time, ITSM reports give you the insight to identify the most common problems you are experiencing and can be used to determine the kind of similarity that exists across hundreds or thousands of problem reports. With a little analysis, you can easily identify the top 10 or so and determine how best to automate all of the preliminary steps needed for each of them. A common mistake initial adopters make is to think that they should start with the most complex problems. Big, complex problems are difficult to automate and, most importantly, they occur only once a month or once a quarter. Even if you could automate them, it would be a lot of work with little return since the problem occurs so infrequently. So it’s best to start at the other end of the spectrum, with the most straightforward problems that occur all the time. That is where automation is most effective and where you’ll realize huge savings in time and other resources.
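As a rough illustration of that analysis, the sketch below counts tickets per problem category and returns the most frequent ones. It assumes the tickets have been exported from the ITSM tool as a list of records with a `category` field; that field name and the sample data are hypothetical.

```python
from collections import Counter


def top_problem_types(tickets, n=10):
    """Return the n most frequent ticket categories with their counts."""
    counts = Counter(t["category"] for t in tickets)
    return counts.most_common(n)


# Made-up export: frequent simple problems dominate the list.
tickets = (
    [{"category": "interface_down"}] * 3
    + [{"category": "ha_out_of_sync"}] * 2
    + [{"category": "multi_site_outage"}] * 1
)

top_two = top_problem_types(tickets, n=2)
print(top_two)  # [('interface_down', 3), ('ha_out_of_sync', 2)]
```

Ranking by frequency, rather than by severity or complexity, surfaces the everyday problems where automating the preliminary steps pays back fastest.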

By integrating NetBrain PDAS with your service ticketing or monitoring systems, like BMC, SolarWinds, ServiceNow, or Gigamon, your service desk agents are armed with instant visibility (via Dynamic Maps) and diagnosis/analytics (via Triggered Executable Runbooks) into the problem area. And best of all, the problem context is captured at the instant the problem is reported or an external event occurs. Once the support engineer gets involved, they will have all the data they need to isolate the root cause at the moment the problem occurred, not hours later when they are assigned the ticket.

Ticket enrichment in ServiceNow, showing the work that has already been performed BEFORE the engineer gets involved

To understand how that works, watch our video: Event-Triggered Automation.

Tip 2: Enable engineers with interactive automation and real-time intent visualization

For most incidents, a network engineer still gets involved. As seen above, most of the tedium has been reduced or eliminated, but there is still some final work to be performed by the assigned engineer. NetBrain PDAS includes an interactive visual console that is built upon a robust data model and a real-time network rendering and visualization engine. The interactive console is deeply aware of every device and of the relationships and intents of all of the connectivity, so that changes can be made safely. This awareness helps prevent human error, since the console blocks inadvertent mistakes from being applied. Things like mismatched MTUs happen all the time and can wreak havoc on a network. The NetBrain PDAS console won’t allow that since it knows the MTUs must match. In addition, the PDAS mapping is aware of traffic and performance detail, in both directions, right down to the protocols and quality of service. At a glance, the network comes to life and most problems can be spotted without even having to think about where to look! This map is at the core of the diagnostic process. The engineer working on the incident will use these interactive views to display additional information (routing protocols, interface information, QoS metrics, …) on that map. This gives them visual support to easily understand the state of the network, without having to manually telnet into a multitude of devices. With NetBrain PDAS, they can visualize the intents of the network…
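The MTU example lends itself to a short illustration. The sketch below is not the PDAS console’s logic; it only shows, under simplifying assumptions, the kind of rule such a console could enforce: both ends of a link must use the same MTU, and a proposed change that breaks that rule is rejected. The `Link` structure and the device and interface names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Link:
    """Hypothetical link record: two interface names and their MTUs."""
    a_end: str
    a_mtu: int
    b_end: str
    b_mtu: int


def find_mtu_mismatches(links):
    """Return the links whose two ends are configured with different MTUs."""
    return [link for link in links if link.a_mtu != link.b_mtu]


def validate_mtu_change(links, interface, new_mtu):
    """Return True only if setting `interface` to `new_mtu` adds no mismatch."""
    proposed = []
    for link in links:
        if link.a_end == interface:
            link = replace(link, a_mtu=new_mtu)
        elif link.b_end == interface:
            link = replace(link, b_mtu=new_mtu)
        proposed.append(link)
    return len(find_mtu_mismatches(proposed)) <= len(find_mtu_mismatches(links))


# Made-up topology with matching MTUs on both links.
links = [
    Link("core1:Gi0/1", 1500, "dist1:Gi0/1", 1500),
    Link("core1:Gi0/2", 9000, "dist2:Gi0/1", 9000),
]

print(validate_mtu_change(links, "core1:Gi0/1", 9000))  # False: peer stays at 1500
print(validate_mtu_change(links, "core1:Gi0/1", 1500))  # True: both ends still match
```

Checking the change against the model before it is applied is what turns topology awareness into a guard against this class of error.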

NetBrain Data Views – NSX Fabric statistics

To learn more, watch the short video: Simplifying Network Complexity with Interactive Automation

Tip 3: Getting engineers and specialists on the same page enhances collaboration to solve complex problems

Sometimes multiple IT professionals from different disciplines are needed to solve complex problems that span the infrastructure. Perhaps a security expert, an application expert, and a network expert need to collaborate to identify the root of a problem. Or maybe a similar problem has already been solved by a Subject Matter Expert. Both cases are a tip of the hat to the power of collaboration.

NetBrain PDAS does both. It allows the experience of subject matter experts to be captured without the use of code, and then shared throughout the organization. Level 1 technicians can actually use the best practices of their Level 3 experts, reducing the number of escalations required. And when multiple engineers are needed to solve tough problems, the PDAS collaboration portal enables all of these experts to get on the same console, in real time, interacting with the network itself. This real-time collaboration portal eliminates the loss of context or content of any kind, and problems are addressed much faster and with fewer human errors.

Troubleshooting Is a Team Sport: Automation That Promotes Collaboration

NetBrain PDAS is a powerful network automation platform that understands the intention of the network and, by leveraging this top-down understanding, focuses on reducing the time it takes to resolve network service tasks and on proactively eliminating problems before they begin to impact production. Maybe it’s time to take a deeper look.