Best Practices for Automating the Troubleshooting Workflow in Multi-Vendor Network Environments

by Valerie Dimartino Mar 27, 2024

Let’s face it: manually troubleshooting hybrid networks is painful and time-consuming. Every problem is treated as if it had never occurred before, and different network engineers apply different solutions to similar problems based on their own expertise and experience. As a result, today’s network troubleshooting is more of an art than a science, which is a significant problem in itself! To make matters worse, escalation engineers are limited in number and, when engaged, must repeat the initial investigation steps because of limited context, workflow challenges, and differing approaches.

Case in point: a well-known electric vehicle manufacturer was drowning in service requests for specific networking data from other IT departments. For example, the security and IT infrastructure teams often needed to know which switch port a specific device, such as a camera, was plugged into, or wanted to find unused ports where they could deploy additional devices. On top of that, keeping up with weekly device change requests, password rotations, and hardware refreshes was challenging. Keeping all of this running in support of the business while responding to the constant stream of requests from other IT departments was overwhelming.

Now, more automation is required due to the sheer volume of service tickets and the leveling off of the NOC personnel available to handle them. Budgets are tight and skilled resources are limited. And, as we all know, there aren’t exactly skilled network engineers to spare every time a slow app is reported or a connection drops. Many of these tickets are common, repetitive issues that could easily be handled by level 1 operations, if only they had the diagnostic tools. Yet troubleshooting is still treated as a team effort that follows a manual response protocol:

👨‍💻  Level 1 engineer diagnosis   >   🎫  Ticket escalation   >  👨‍💼  Level 2 engineer diagnosis, and so forth.

What’s more, false positives from our monitoring tools, such as flapping, lead to endless chasing of ghosts. If only there were a way to filter these transient problems out.
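To make the idea of filtering transient problems concrete, here is a minimal sketch of one common technique: raise an alert only when a condition persists for several consecutive polls. It is an illustration under stated assumptions, not a description of any particular monitoring product. The `TransientFilter` class, the threshold of 3 polls, and the `sw1:Gi0/1` style keys are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TransientFilter:
    """Suppress alerts that clear before lasting `threshold` consecutive polls."""

    threshold: int = 3  # hypothetical value; tune per environment
    _streaks: dict = field(default_factory=dict, init=False, repr=False)

    def observe(self, key: str, is_down: bool) -> bool:
        """Record one poll result; return True only when an alert should be raised."""
        if not is_down:
            # Condition cleared: reset the streak so brief flaps never accumulate.
            self._streaks[key] = 0
            return False
        self._streaks[key] = self._streaks.get(key, 0) + 1
        # Raise exactly once, on the poll where the streak reaches the threshold.
        return self._streaks[key] == self.threshold


flt = TransientFilter(threshold=3)

# A flapping port (down, up, down, up, ...) never reaches three polls in a row.
flapping = [True, False, True, False, True, False]
flap_alerts = [flt.observe("sw1:Gi0/1", state) for state in flapping]

# A port that stays down is reported once, on the third consecutive poll.
sustained = [True, True, True, True]
down_alerts = [flt.observe("sw1:Gi0/2", state) for state in sustained]
```

The trade-off in this kind of filter is latency: a genuine outage is reported a few polls later than it would be otherwise, in exchange for not opening tickets on conditions that clear by themselves.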

While automation has long been the desired solution to speed up troubleshooting, most of these efforts become developer-led projects that fail to deliver enough results. And when grass-roots automation is attempted, it takes the form of user-specific scripting, which also fails to meet the efficiency goal. Neither of these approaches transforms the organization’s core reference workflow. Neither can be reused across the organization, reduce MTTR, scale to a multi-vendor network, enhance collaboration, prevent issues from recurring, or maximize efficiency in any significant manner.

To address these operational shortcomings, an entirely new, machine-centric approach to network operations must be implemented. It requires a fundamental change in the way network engineers think about operations, with network automation included in everything they do. That, in turn, requires an automation platform that is available to all engineering resources, without the need for code. These skilled engineers already know how to solve problems; they need a simple way to capture their deep problem-solving experience and make it machine-executable by anyone who wants to address the same troubleshooting situation, anywhere in the infrastructure. With the right platform, every engineer becomes a network automation engineer, able to create network automation for any problem, big or small, in minutes, not months.

Best practices to consider in investigating network automation: