Go back

Would You Rather Fix a Network Outage, or Prevent it from Occurring

September 16, 2022

Getting on the Front Page

In today’s connected business, network infrastructure is its lifeblood. What’s hard to imagine is that enterprises are reporting catastrophic outages in their network at an INCREASING rate year over year, just when most people would assume the importance of these networks has been addressed with the highest priority in IT. Sad but true, major network outages occur every day somewhere in the world. As enterprises grow in size and their hybrid networks become more complex and distributed, the number of outages has increased steadily. The result: pick up any major newspaper or visit any online news outlet and you can almost be guaranteed to see front-page stories related to service disruptions somewhere in the world.

When there is a network outage, it is a race against time to get things working again. Panic sets in, users begin burying their help desks in service tickets, and in some cases, reporters start publishing news articles about what happened, which becomes interesting reading for shareholders over their dinner. Obviously, this is not a comfortable situation. The fact is, most network operators and engineers spend more than three-quarters of their professional lives in this chaotic race, fixing problems AFTER they become front-page news.

The Bigger Picture

Typical enterprises have hundreds or thousands of applications running at any point in time across their hybrid network. Front-office and back-office, manufacturing, R&D, audio and video streaming, database access, web, and a litany of other application types. And while each of these applications have been deployed initially to provide a specific level of service needed by the business, as new applications come into place, rarely do designers consider the potential impact to previously installed applications. Most applications are installed as if there are no other constraints imposed by the years of previous work. In the end, for the business to be successful, each of these applications, new and old, must be running at the level they were designed to run. In the end, the success of earlier applications almost always suffers over time, as shiny new applications take the attention of the enterprise architects and operations with less regard or formal processes in place to assure previous applications are preserved.

Finding The Holy Grail

The holy grail of networking is to prevent potential problems from becoming production nightmares. Sounds simple, but not intuitively obvious how to do so. The answer is by proactively looking for potential problems and resolving them before they impact production. For instance, when two network devices are set up in some form of load-balancing and resilient configuration, the proactive approach would be to confirm that the pair of devices are always and continuously configured identically. Again, while this sounds straightforward, most enterprises simply don’t have the automated tools required to test the mirroring of this pair, and the thousands of other pairs continuously. And this design compliance situation is painfully revealed when resiliency is supposed to kick in, resulting in disaster because the two devices were NOT mirrors of each other.

And the same kind of verification requirements exists across every hybrid network. Not just mirrored pairs, but connectivity performance, security zones, access control, throughput requirements, etc. The list of network behaviors that must be proactively verified can easily be 100 times the number of devices in a complex hybrid network. Years of brute force management of all of those devices affects the behavior of the network, which unfortunately is tested at the worse possible times.

It’s All About Scale

Prevent network outage

By focusing on all of the desired behaviors of the network rather than the low-level individual device health, an entire network can be described as thousands of network “intents”. Each application architect and deployment team must simply describe their unique requirements, which are then combined with all of the other application requirements to form the long list of network behaviors (intents) for the business. And then by automating the testing of each of these desired intents, the entire network can be verified continuously before any sort of production incident occurs.

Cultural Change

As any network becomes larger, operators and engineers find themselves consumed by reacting to problems that have already manifested into significant production problems needing immediate attention (“Fire-Drills”). The situations occur hundreds or even thousands of times per month in a large network, with desired behaviors (network intents) needing to be tested and verified. However, as the number of network intents can easily exceed 100 times the number of devices, it’s not feasible to manually verify all of these thousands of network intents. Proactive intent-based design verification simply doesn’t happen and the result: Outages.

Now I know there are a ton of dollars being spent on resiliency strategies and many readers may be saying, “I don’t have single points of failure or outages”. Let me assure you that you do whether you are aware of them or not. And while it is true that total failures are rare, in modern times “outages” are less binary and more graduated. The perfect network may be able to service 100,000 transactions per minute, whereas a network with some sort of failed component and resilient designs active may reduce the number of transactions to 60,000. Or the perfect network may offer crystal clear voice calls, whereas the impacted network may offer call that sound metallic or choppy, with voice dropouts. And to compound the problem, many of these kinds of reduced scenarios go unreported!

It’s THE BIG Opportunity!

Many businesses and IT operators have stubbed their toes on automation projects for years, typically taking huge amounts or resources to generate little or no results. Failed IT automation projects are everywhere. That is why many IT leaders don’t consider the power of automation for NETWORK problem solving and largely the tools have not existed that were specific to networks, until NetBrain.  Many operators continue to focus on brute-force remediation of problems “the way it’s always been done before.” For many IT leaders, taking an automated proactive approach to outage prevention does not even seem possible. But it is…

It is estimated that the majority of all network outages could have been prevented if automated verification and continuous design compliance enforcement were in place. By describing your hybrid network as the compilation of all of the design intents needed by all of the business applications in use, potential problems can be corrected long before they cause problems. And while this kind of compliance enforcement at the enterprise scale has never been feasible when attempted manually, the NetBrain solution changes all of this by automating this proactive verification and enforcement of desired behaviors at any scale.

NetBrain technology is doing this for thousands of customers and can show you how to incorporate proactive network automation today. For more information, please contact us today.

Related