
Why Problem Diagnosis Automation is so Hard

by Lingping Gao Mar 7, 2022

On December 7th, 2021, Amazon Web Services (AWS) had a major outage that started with a disruption in Northern Virginia and quickly spread across the entire country. Before long, many business sites such as Google, Netflix, DoorDash, and Southwest Airlines were impacted by the outage. At the height of it, over 600 people from AWS were on a conference call bridge to troubleshoot the issue. The outage lasted more than 8 hours. Think of the long-term business implications of an 8-hour outage.

AWS went on to have 2 more outages that month.

This raises the question: can it be better?

It can. AWS is in fact one of the world’s most automated networks, but according to AWS’s own post-mortem summary, this outage took so long to isolate because the outage itself impaired access to its automated diagnosis capability.

In our modern IT world, problem diagnosis has to be automated, even though it is really hard to do. A 2021 NetBrain survey of hundreds of our customers revealed that two-thirds of these network engineers do not have any automation capability during troubleshooting. What do they use? The plain old command-line interface console. What suffers is the mean time to repair: outages are prolonged, with much longer-term impacts on customer satisfaction and retention, valuations, reputation, etc.

For the enterprises that did aspire to leverage automation for problem diagnosis, the journey was very rough. RCA (root cause analysis) tools attracted a lot of attention 20 years ago, but the outcomes were far from satisfactory. Most innovators have been absorbed by big IT solution players and simply stopped innovating. More recently, AIOps solutions have been trying to fill this void with a black-box approach. All AIOps solutions leverage machine learning or traditional statistics-based AI functions to discover root causes from large amounts of machine data. But for most IT problems, a clean set of data is very hard to come by, on top of many other challenges, including needing a PhD to operate such a tool. As one of NetBrain’s customers put it when discussing their early efforts with AIOps, they waited 6 months to see their first issue diagnosed through their AIOps tool, and that was a very simple issue. (The name of the tool is omitted here to avoid confusion.)

Despite many attempts, problem diagnosis automation remains the largest unsolved IT challenge today. NetBrain started working on this problem more than 10 years ago, using a white-box approach centered around Network Intent. The so-called intent-based Next-Gen can be connected to most ITSM tools, which enables it to begin solving problems the moment they occur. It can address more than 95% of the network problems that come to IT systems and could potentially impact an organization’s business applications, and it helps to prevent many recurring problems as well. In the next few blogs, we will explain the inner workings of NetBrain’s intent-based automation system for hybrid networks in more detail.