This webinar leverages NetBrain’s map-driven automation software to showcase 10 rules for minimizing network outages. You’ll learn how to use automation to proactively ensure your network doesn’t go down, and how to improve response times if it does.
Erin: Hi everybody, welcome. Today we’ll be talking about network outages, how to minimize them when they occur and how to prevent them completely by following a couple of key rules. My name is Erin Komm, and I’m a member of the marketing team here at NetBrain, and I’ll be your host this afternoon. I’m joined by our panel. NetBrain’s Marketing Manager, Jason Baudreau will be conducting the majority of the presentation, in addition to our Product Specialist, Vincent Smith who will be running through the demo. And last but not least, Ray Belleville will help field your questions.
Now, before we get started here, there are just a couple of housekeeping items that I’d like to mention. First off, while the webcast is in progress, all of our lines will be on mute, but if you do have questions at any point, please feel free to type them in, and Ray will try to get through as many as he can. We’ll also have an open Q&A session at the end of the presentation. Second of all, the webinar will be recorded and a link will be sent out for future viewing. This usually takes about a day or so after the conclusion of the event, so be on the look-out for the follow-up email in your inbox. All right, well, I think we’re ready to get started here, so let me pass it along to Jason to kick us off.
Jason: All right, thanks, Erin, and thank you to everybody who’s dialed in today for spending an hour with us. As Erin mentioned, we’re going to discuss network outages. My name is Jason Baudreau and I’m the US Marketing Manager here at NetBrain. I was previously a network engineer at Raytheon, and I managed networks and networking projects for the US Army and the Port Authority of New York and New Jersey, amongst others. Today I’m going to talk about some of my observations during that time, specifically relating to network outages and troubleshooting.
So, let’s continue with the agenda. First, I want to talk about the impact of network outages. Most of us understand this impact, but I’m going to give you some numbers to bring that impact to life. Next, I’m going to talk about “Why is troubleshooting so hard?” I will share my own opinions here and propose an argument for why there are so many inefficiencies in the troubleshooting process.
Next, I’m going to discuss five rules to shorten network outages. And of course, here I’m really talking about accelerating troubleshooting and making troubleshooting more effective and efficient. After each rule that I present, I’ll hand the baton back to Vincent Smith, our Product Specialist, who will conduct a brief NetBrain demo to show you how automation can be applied to each of these five rules.
In the last third of this presentation, I’m going to talk about five rules to avoid network outages. We’ll go through these a little bit more quickly, but mainly I’ll be talking about the most common causes of network outages and how you can prevent them. Vincent will do a brief demo here, as well.
To conclude, I will share a couple of slides on customer testimonials and how they use visual troubleshooting to accelerate their troubleshooting. Finally, we’ll open it up for Q&A, we’ll take some of your questions. Ray is going to help address those at the very end.
So, let’s get into it. We’ve all experienced a network outage at some point, whether we can’t watch a movie on Netflix or maybe our email is down. A lot of us on this call are probably network engineers, though, so we’ve been in the hot seat. We’ve been troubleshooting the network when there’s an outage, and we understand this urgency quite well, right? But let’s put some numbers behind this impact.
According to a recent study of 7,000 businesses, 25% of enterprises suffered unplanned outages of greater than four hours last year, so these are significant outages. These outages cost maybe $1.7 billion, roughly, in lost profits. And, of course, there’s an impact beyond just the revenue here, too. A business’s reputation is on the line when there’s an outage.
According to that same study, about one quarter of the outages were network-related. But as most of us who troubleshoot the network know, we’re usually involved no matter what. The network is usually the first thing to get blamed, right? But it’s also the hardest to vindicate sometimes, so we’re involved, and it turns out that one quarter of the time the network was indeed at fault last year.
So, I’ll take a minute on this slide to talk about some of the worst outages in the recent past. The first one is, last year, the NASDAQ stock exchange was forced to halt trading for three hours. This shutdown was prompted by a disruption in the connection between the New York Stock Exchange Arca networks and the data processing subsidiary of NASDAQ. When the backup system kicked in, there was a software bug that was found, and it resulted in a total failure on the backup network. So, even the backup network couldn’t save NASDAQ in this case.
Another one that was popular in the news headlines this past March was Xbox Live. They had an outage that was caused by a configuration error during a routine maintenance. The impact here was that it crippled the launch of one of their largest online games, “Titanfall.” So, this had a big impact to their business bottom line.
And there was a brief outage in Facebook, August this past year. Again, it was a configuration error. A routine maintenance, a small configuration error had a ripple effect on the network. And even though the network was down for 30 minutes, people across the world, they couldn’t access Facebook, and they lost about $500,000 estimated in potential ad revenue.
So, on the next slide here, I want to talk about what are the most popular causes, the primary causes of network outages. Of course, there may be one way for a network to perform optimally, but there could be dozens or hundreds of ways for it to fail.
A separate study from Cisco I’m looking at here concluded that almost a quarter of network outages are a result of a router or a switch failure. A third of outages, roughly, come from a link failure, including fiber cuts and network congestion issues. Greater than a third, the largest chunk, in fact, resulted from a network change, meaning an upgrade or a configuration change.
In two out of those three headline outages last year, Facebook and Xbox Live, configuration change was the culprit. But in fact, finding the source of the problem is the hardest part, and we know that once the root cause is determined, usually resolving it is kind of a cakewalk at that point. But finding the source of the problem, that’s really where the challenge lies.
So, I want to examine that a little bit more. Why is troubleshooting so hard? Again, these are my own observations, but I’m going to start by looking at a common troubleshooting methodology. It’s where we’re usually presented with a problem. Maybe I can’t connect to my application server. In order to start to diagnose, we need to begin gathering information about that problem and analyzing that data. If we’re lucky, we’re able to eliminate some variables or propose a hypothesis about what’s going on. You know, maybe it’s a congested link that’s causing the problem. Is it possible there’s a routing issue, maybe a misconfigured ACL?
As we propose these hypotheses, we need to test them. And of course, in order to do that, we go around this circle here, and we gather the data, analyze it, and around and around we go, right, until we determine the solution.
Where are we spending all of our time in this cycle? I propose that we spend the bulk of our time here, in this area highlighted in red, which is the diagnosis stage, where we’re gathering information and analyzing it. It’s estimated that about 80% of the time spent diagnosing a network outage, or any kind of network issue, goes to gathering information, and only 20% goes to analyzing it. So, there are a lot of inefficiencies here. It’s a very manual process to collect data on one network device at a time.
Some questions that we might have to gather information to address: What’s connected? How’s it configured? What’s happening, in terms of performance? Also, what’s changed? If it’s a configuration change, sometimes it can be beneficial to know what’s changed to help answer the challenge there.
But I propose that we run into issues even proposing a hypothesis to test. Is it top-down, starting from the top of the stack? Is it a bottom-up approach? Or are we able to shoot from the hip? And I’d argue here that there’s often insufficient network experience to know what could be going wrong. In some cases, and this is an issue that many organizations run into, they need to call in a network expert or a network hero, and this is one of the key reasons why.
So, I’m proposing that we have limited visibility into these areas where I’ve put bullets on this slide, and I want to look at that a little bit deeper here. I’ve identified the challenges of network visibility. In order to determine what’s connected, we have some insights. A lot of us leverage network diagrams when we’re troubleshooting. I kind of put this circle, this red circle here, halfway into this visibility window because I propose that a lot of our diagrams, they’re outdated. They’re incomplete, they’re not comprehensive, and they’re just not reliable. So, I say that this provides some visibility.
How’s it configured? Well, we have the CLI for that. Most of us spend most of our time in the CLI, but it’s a very serial process. We’re looking at one device or one interface at a time. This is what I was talking about earlier, it’s a very manual process. Like I said, we have a lot of visibility here, but not completely because of the manual nature of it.
Also, what’s happening? We’re getting into performance. In a lot of enterprises, they deploy a performance monitoring solution to answer that question. I argue, actually, that a lot of performance monitoring tools, they provide you with dashboards of data, but sometimes too much. You can have information overload. Perhaps you have information about what’s changed, there’s change logs. Or, “Have I seen this before?” And again, this is where expert knowledge comes in.
The ideal scenario here, of course, right? In an ideal world, one tool, total network visibility. And this is something that the network management industry has kind of been talking about. It’s a buzz word, it’s that single pane of glass. That’s something that we’re all striving toward. So, that’s what we’re looking for, right?
But what I’m gonna talk about here is five rules to shorten network outages. And I’m gonna offer suggestions in how we can apply automation for each of these five rules. So, let me get into that. And ideally here we’re talking about “with efficiency,” and that’s why I’m talking about automation.
So, rule number 1, “Update Network Diagrams Often.” And so, why? A lot of us might know the answer already, but I’ll go through this. An accurate network diagram is a troubleshooter’s best asset. We use these all the time to define the scope of the problem, which devices on the network are responsible for forwarding traffic, you know, for a particular application, perhaps. For understanding the logical connectivity, the logical redundancy that you have in the network. Of course, what’s the intended design? What kind of routing, dynamic routing protocols that have been configured, which VLANs? Are there any firewall rules established? Even understanding how application traffic can travel across the network. I know a great network diagram can help answer this question for you, too.
So, how do we update our diagrams often? You know, there’s kind of…it’s an obvious answer, perhaps. Ensure processes are in place to capture your network changes and update them accordingly. The challenge, of course, is a lot of us have larger networks, and they’re changing all the time. Sometimes, you know, processes aren’t enough here without spending significant manpower. So, you know, if the network’s changing frequently, I propose an automation solution may be necessary. And at this point, I’m actually gonna ask Vincent to demonstrate a little bit of how that can work and how NetBrain might be able to provide that sort of automation solution for your diagram. So, Vincent, could you take it over?
Vincent: Absolutely. Thank you very much, Jason. How are you doing, everyone, on the webinar? My name is Vincent Smith. I’m gonna be the Product Specialist here today. So, I’m gonna demonstrate how we can use NetBrain to update our diagrams often.
Now, before we begin to create our diagrams, we would want to discover our network using NetBrain. So, before we begin the discovery, we want to put in our Telnet/SSH logins, as well as the SNMP read-only community string. Once this information is in the software, you’re free to start the discovery. So, what I’m gonna do is I’m gonna demonstrate the Seed discovery method here. So, I’m typing in the IP address of a core router on my network.
Now, when I begin the discovery here, we’re starting on that core device. We’re looking at routing tables, CDP, LLDP tables as well there, determine the neighboring devices, and then discovering them as well. So, what is going on here is NetBrain is creating a comprehensive data model on the backend, and then we can use this data model to draw dynamic network maps. We call them Qmaps inside of the software.
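The seed discovery described here can be pictured as a breadth-first walk over neighbor tables. The following is a minimal Python sketch of that idea, not NetBrain's implementation. The `NEIGHBORS` dictionary and the device names are hypothetical stand-ins for what CDP, LLDP, or routing-table queries would return on a real network.

```python
from collections import deque

# Hypothetical neighbor tables. On a real network this data would come from
# CDP/LLDP or routing-table queries against each device.
NEIGHBORS = {
    "core-1": ["dist-1", "dist-2"],
    "dist-1": ["core-1", "access-1"],
    "dist-2": ["core-1", "access-1"],
    "access-1": ["dist-1", "dist-2"],
}


def lookup(device):
    """Return the neighbors recorded for a device (empty list if unknown)."""
    return NEIGHBORS.get(device, [])


def discover(seed, get_neighbors):
    """Breadth-first discovery starting from a seed device.

    Returns the devices in the order they were first discovered.
    """
    discovered = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue:
        device = queue.popleft()
        for neighbor in get_neighbors(device):
            if neighbor not in seen:
                seen.add(neighbor)
                discovered.append(neighbor)
                queue.append(neighbor)
    return discovered
```

Calling `discover("core-1", lookup)` visits the seed first, then its direct neighbors, then their neighbors. The `seen` set keeps the walk from looping on redundant links.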
So, now that I’ve discovered my network here, let’s go ahead and search for a device inside of the software that maybe had some issues on it. So, we’ll just go to this visual search bar here, and I’ll type in the Hostname of the device. And as you’ll be able to see, NetBrain will give us categorized results. What we’re interested in right now is the actual Dynamic Network Map. We call them Qmaps inside of the software here.
So, you can see that we mapped out this Bos-WAN device and the surrounding devices as well. You may also notice that there are some red plus marks on these devices, which means that neighbors are already discovered and we can build out our diagram accordingly. These diagrams are dynamic and data-driven, so we can drag these items around to clean up our diagram, and we can zoom into these maps to learn more about our network.
So, take a look here. All I’m gonna do is push my mouse wheel forward. As I do push my mouse wheel forward, we learn more about the network here. Looks like we’re running OSPF, we have multicasting configured as well, and let’s say that we’re interested in, maybe, this routing protocol here. What we could do is, we could add a note directly to our diagram. And this note that I’m adding, it doesn’t affect the live configuration in any way whatsoever. Well, so we can delete this and add our own notes if we please.
Using NetBrain, you can very easily visualize your network as well. What I’m gonna do now is I’m gonna go to the Highlight menu here and I’m gonna highlight the routing protocol on the map. So we’ll instantly get a color-coded diagram showing us where we have everything configured. Looks like we’re running OSPF and ISIS, and this red dotted line represents an EBGP connection.
So, that’s one example of how we can create our diagram, searching for a troubled device. What we can also do is, maybe, if we’re troubleshooting a routing issue, since NetBrain is a comprehensive database here, we can search for lines inside of configuration files. So, what I’m gonna do now is I’m gonna search for a static route that I configured on a network device. I’m gonna put it in quotes so NetBrain knows to look for that specific line inside of the configuration file. As you can see, it looks like we have that static route configured on three of our network devices. So, based on this search, we can create a dynamic map here based on our results. And of course, we can zoom in to this diagram to learn more about our network.
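Searching saved configurations for an exact line is simple to sketch. The Python below assumes configurations are already available as text keyed by hostname. The `CONFIGS` data and the `devices_with_line` helper are hypothetical and only illustrate the quoted-line search, not how NetBrain indexes its data.

```python
# Hypothetical saved configurations keyed by hostname.
CONFIGS = {
    "rtr-a": "hostname rtr-a\nip route 10.1.0.0 255.255.0.0 192.0.2.1\n",
    "rtr-b": "hostname rtr-b\nrouter ospf 1\n",
    "rtr-c": "hostname rtr-c\n ip route 10.1.0.0 255.255.0.0 192.0.2.1\n",
}


def devices_with_line(configs, wanted):
    """Return sorted hostnames whose config contains `wanted` as a whole line.

    Leading and trailing whitespace is ignored so indented lines still match.
    """
    wanted = wanted.strip()
    matches = []
    for hostname, text in configs.items():
        if any(line.strip() == wanted for line in text.splitlines()):
            matches.append(hostname)
    return sorted(matches)
```

Matching whole lines, rather than substrings, mirrors the effect of putting the search term in quotes: a search for just `ip route` would return nothing here.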
And let’s say, maybe, we need to document and are troubleshooting an application path. What we could do with the software…so, let’s first create a new map. Let’s come up to this A to B path calculator and we’ll type in a source IP address and a destination address, and we’re gonna find this path here on our live network.
So, the way this works is once we hit find path, we’ll start with that source IP address, and we jump out to the default gateway. We’ll log into that device using CLI commands to view the routing table. But as you can see from the execution log down here, we’re not just looking at the routing table. It’s a comprehensive investigation of each device along the path. It’ll be looking at any ACLs, VRF, NATing that could potentially change the traffic path. So, it looks like we made our way from A to B just fine. If we’re curious about the return trip, all we have to do is change this arrow around and find this path, now, from Los Angeles to Boston. So, let’s go ahead and take a look at that as well.
So, now we see that purple arrow there go, and as you can see from this demonstration here, it looks like we’re mapping out an asymmetric path on our live network. So, it didn’t come back through this DMVPN tunnel like it did on the way out. It’s going through this MPLS cloud on the way back to Boston.
So, that’s how you can use NetBrain to update your diagrams often. At this point, I’d like to turn it back over to Erin, who has a polling question here for us.
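The hop-by-hop idea behind a path calculation can be sketched with a longest-prefix-match lookup at each device. This Python example is a simplification under stated assumptions. The `ROUTES` tables and device names are hypothetical, and it only consults routing tables, ignoring the ACLs, NAT, VRFs, and policy-based routing that the demo also checks.

```python
import ipaddress

# Hypothetical routing tables: device -> list of (prefix, next-hop device).
# A next hop of None means the prefix is directly connected.
ROUTES = {
    "bos-core": [("10.2.0.0/16", "bos-wan"), ("10.1.0.0/16", None)],
    "bos-wan": [("10.2.0.0/16", "la-core"), ("10.1.0.0/16", "bos-core")],
    "mpls-pe": [("10.1.0.0/16", "bos-core"), ("10.2.0.0/16", "la-core")],
    "la-core": [("10.1.0.0/16", "mpls-pe"), ("10.2.0.0/16", None)],
}


def next_hop(device, destination):
    """Longest-prefix match of `destination` in the device's routing table."""
    address = ipaddress.ip_address(destination)
    best = None
    for prefix, hop in ROUTES[device]:
        network = ipaddress.ip_network(prefix)
        if address in network and (
            best is None or network.prefixlen > best[0].prefixlen
        ):
            best = (network, hop)
    if best is None:
        raise LookupError(f"{device} has no route to {destination}")
    return best[1]


def trace(first_hop, destination, max_hops=16):
    """Follow next hops from `first_hop` until the destination is connected."""
    path = [first_hop]
    while len(path) <= max_hops:
        hop = next_hop(path[-1], destination)
        if hop is None:
            return path
        path.append(hop)
    raise RuntimeError("hop limit exceeded; possible routing loop")


def is_symmetric(forward, reverse):
    """True when the reverse path retraces the forward path exactly."""
    return forward == list(reversed(reverse))
```

With this sample data, the forward trace from `bos-core` crosses `bos-wan`, while the return trace from `la-core` crosses `mpls-pe`, so `is_symmetric` reports the pair as asymmetric.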
Erin: Thanks, Vince. So, I’ll go ahead and launch that question. So, we’re curious, how often your organization updates your network diagrams. And I’ll give you a couple minutes here to poll your answers, and then I’ll go ahead and share the results. All right, and it looks like 48% of you update your network diagrams every time the network changes.
Jason: All right, so looking at performance visibility here, ideally, you’d be able to identify a tool which can help you look at how your network’s performing in real-time. So, you know, why? How can you troubleshoot, if you can’t see what’s happening on the network? And by that, I mean performance, how it’s performing.
So, the top five performance problems that we usually see when we’re troubleshooting, it’s the CPU utilization on the router or switch. Maybe it’s running high, it indicates a health issue on the hardware. Or a memory utilization issue. Also, link congestion or link utilization issues, dealing with bottlenecks, link errors, and latency. So, these are the type of performance issues that, if we can get some insight in here, these are gonna provide strong clues about where the issues are in the network.
Ideally, you want to deploy a performance monitoring tool, one that specializes in diagnostic monitoring. And what I mean by that is, instead of dashboards of data, you know, charts and graphs, something that can give you information about a particular problem area and how the health of that area might be affecting the performance of the network. So this is, like, all contextualized performance data. And just like with that, I’d actually like to hand it right over back to Vince to talk about how you can visualize performance data on a live map in NetBrain.
Vincent: Thanks, Jason. All right, so let’s go ahead and take a look at how we can enable our performance visibility here. So, if you recall this A to B path that we mapped out on our live network, now what I’m gonna do is I’m gonna turn on the live diagnostic monitor. So, we’re gonna evaluate some performance hotspots directly onto our diagram.
Right when we turn this monitor on, we’re using SNMP data to gather CPU utilization and memory utilization, and even the bandwidth utilization of the links will appear on this map as well. We’ll be able to mouse over this information, and a graph of that data will instantly appear in front of us. As you can see, this data will also be available in chart format here below. We’re taking a look at the device data itself. We can also take a look at the interface data.
In a couple of moments here, we’ll get to see the interface traffic come in, the interface utilization in and out, and any errors on the links that are on our diagram here as well. So, anything that is over a threshold that we set is appearing in dark red onto our Qmap, as we can see here. So, this is a Layer 3 diagram. What we can also do is take a look at the performance on Layer 2. So, let’s go ahead and stop this monitor, and now let’s just go ahead and right-click this A to B path and view a Layer 2 map.
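Flagging anything over a threshold is the simplest part of this to illustrate. The sketch below assumes the metrics have already been polled, for example over SNMP. The `THRESHOLDS` values, metric names, and `hotspots` helper are hypothetical, not NetBrain defaults.

```python
# Hypothetical thresholds, in percent. Real values depend on the environment.
THRESHOLDS = {"cpu": 80.0, "memory": 85.0, "link_util": 70.0}


def hotspots(samples, thresholds=THRESHOLDS):
    """Return sorted (device, metric, value) tuples that exceed a threshold.

    `samples` maps device name -> {metric name: percent value}. Metrics with
    no configured threshold are ignored.
    """
    flagged = []
    for device, metrics in samples.items():
        for metric, value in metrics.items():
            limit = thresholds.get(metric)
            if limit is not None and value > limit:
                flagged.append((device, metric, value))
    return sorted(flagged)
```

A map view would then color each flagged device or link, which is the "dark red" highlighting in the demo.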
So, based on this diagram here on Layer 3, what we can do is just right-click on that map there and view the path the application would take as it comes into which port, as it leaves which port, all the way until we reach our destination on the other side.
So, let’s take a look at this diagram here as it loads up. I can just clean up the map a little bit here. And now, again, let’s turn on the live diagnostic monitor. When we do so, on Layer 2, we’ll now get to see active and non-active ports as well, again, that interface data here as well come in.
So at this point, I’d like to turn it back over to Erin here, who has another polling question for us.
Erin: Thanks, Vince. I’ll go ahead and launch that. So, now we’re curious if you guys use performance monitoring tools when you’re troubleshooting. And again, I’ll give you a couple of minutes here to vote, and then we’ll go ahead and share those results. All right, let me go ahead and close that out, and then I’ll share. So, it looked like 45% of you troubleshoot with another performance monitoring tool.
Jason: Just moving on, this is gonna get…you know, between defining a map of the problem area and then visualizing performance on that map, I think you’re off to a good start in, kind of, identifying, usually, the top performance issues on a network, but I think we can go a little bit deeper.
So, my next rule is to baseline the network consistently. That means, collect a lot of data about your network, and do that on a routine basis. So, the reason why is something I alluded to at the beginning of the presentation, which is that greater than a third of all outages are caused by a network change, so troubleshooting can be a heck of a lot faster if we can concisely answer, “What’s changed?”
And by “What’s changed?” I, of course, mean what’s changed in configuration files, but beyond that, what’s changed in routing tables, what’s changed in the topology. Have there been any changes in the application traffic paths, and even something as detailed as a NAT table, a spanning tree table, is the interface status now different than it was, perhaps last week.
So, this list here really represents the impact of a change. So, someone might have made an intentional configuration change, but if you can get visibility into the impact that change has, then you’re gonna be able to learn valuable insights during a troubleshooting event.
So, the tip here is to leverage a tool that collects and baselines that network data for you frequently and to avoid tools, you know, which only back up configuration files. There’s, of course, NCCM, network configuration and change management tools that will do that. They’ll back up your configs, and that’s how you’re able to have a backup configuration file, but if you can get more information, then you can really leverage it during troubleshooting. And Vincent’s gonna talk to you a little bit more about baselining your network.
Vincent: Thanks, Jason. So, yeah, let’s take a look at how NetBrain is going to baseline your network. Well, we have something built into the software called a Benchmark here. So, let’s go ahead and launch the server benchmark webpage. What a server benchmark does is, at a designated time that you set, whether it be every day, every week, every month, or on demand, go out to the devices that you have discovered. It will automatically gather configuration files, routing tables, CDP inventory information, and ARP, MAC, and STP tables, and save this information to your server. So, all this data will build up over time, and you’ll be able to compare data from different points in time.
Let’s go ahead and see how this can be useful. So, if you recall this A to B path here on our live network, it’s asymmetric. I’m curious about what this A to B path looked like at a previous point in time. So, now what we’re gonna do here is we’re gonna type in that same A to B path. The only difference now is we’re not gonna find it on the live network, but we’re gonna review cached data. We’re gonna see what this path would have looked like last week.
So, once we hit this Find Path button, again, it’s still gonna do a comprehensive investigation of each device along the path, as you can see. But this time here, look at this. Last week, it looks like the path was symmetric, but if you recall, our live network here has an asymmetric path. So, that tells us something. It tells us there’s probably been a change made this week that caused this to occur. So, let’s go ahead and compare historical data across the board on all these network devices to determine what changed.
So, as you can see here on the list, there’s a lot of information you can instantly get comparisons between, but for this example, let’s stick with the basics here, configuration files and routing tables. And I want to compare last week’s data to our live network.
Let’s hit Compare. The first thing that pops up for us now is a comparison of the routing table. As we can see, there’s only been a few new routes added this week, one of them on LA-Core 1-Demo. So, let’s go ahead and double-click this and learn more about it.
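Comparing two routing-table snapshots comes down to set differences over prefixes. The following Python sketch assumes each snapshot has already been parsed into a dictionary of prefix to (protocol, next hop). That layout and the `diff_routes` helper are hypothetical.

```python
def diff_routes(old, new):
    """Compare two routing-table snapshots keyed by prefix.

    Each snapshot maps prefix -> (protocol, next hop). Returns three
    dictionaries: routes added, routes removed, and routes whose
    attributes changed (as (old value, new value) pairs).
    """
    added = {p: new[p] for p in new.keys() - old.keys()}
    removed = {p: old[p] for p in old.keys() - new.keys()}
    changed = {
        p: (old[p], new[p])
        for p in old.keys() & new.keys()
        if old[p] != new[p]
    }
    return added, removed, changed
```

Separating "added" from "changed" matters during troubleshooting: a brand-new route points to a new advertisement, while a changed next hop points to a path shift for an existing prefix.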
As we can see here, it’s a route learned from EIGRP. The next hop is LA-WAN-1-Demo. So, to visualize this route, let’s just throw it right onto the diagram so we can see that this week, we’re hopping to LA-WAN-1-Demo now. So, what we can do, is we can take a look at the configuration files and see what’s changed. So, on LA-WAN-1-Demo, from last week to this week, it looks like the configuration file changed. It’s the only device that did so. So, let’s go ahead and double-click this, and now we’ll get a side-by-side comparison of the configuration file from this week to last week with highlighted differences, and let’s use the Next arrow to locate those differences.
Whoa! Look at this. And as we can see here, this week, it looks like there’s a redistribute static in the configuration file that was not here last week, which is surely the cause of this asymmetric path we found by comparing historical data.
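A side-by-side configuration comparison like this one can be approximated with Python's standard `difflib` module. This is a minimal sketch, assuming two saved configuration snapshots are available as text. The `config_changes` helper is hypothetical.

```python
import difflib


def config_changes(old_text, new_text):
    """Return (added, removed) configuration lines between two snapshots.

    Uses difflib.ndiff, which prefixes new lines with "+ " and dropped
    lines with "- ". Indentation is stripped from the reported lines.
    """
    added, removed = [], []
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:].strip())
        elif line.startswith("- "):
            removed.append(line[2:].strip())
    return added, removed
```

For two EIGRP stanzas that differ only by a new `redistribute static` line, the function returns that single line as added and nothing as removed.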
Now, this feature is also available in something called the observer mode here. What we can do is we can zoom in to this device, and this will open up the observer deck here. It gives us a deck of cards with relevant information about the device. You can go through these cards on all the devices on the map and be able to compare data as well. For instance, let’s look at last week compared to this week again and hit Compare. And we can see that change inside of the configuration file this way as well.
At this point in time, I’d like to pass this over to Erin again, who has a polling question for us.
Erin: Thanks, Vince. I’ll go ahead and launch that. So, we’re wondering, how often you baseline your network. And again, I’ll give you just a few minutes here to vote. All right. And let me close that out and then share the results. And it looked like 55% of you don’t have a way to baseline your network currently.
Jason: Okay. Thanks, Erin. So, I want to talk a minute now about making troubleshooting diagnoses more efficient by automating them. So, the idea here is, of course, I said earlier, and you probably have experienced that, that idea of trying to find a needle in a haystack when you’re troubleshooting. Because hundreds of things can possibly go wrong, testing one hypothesis at a time, it can take forever.
You know, suppose, like I’ve mentioned, one hypothesis is a routing issue. Well, you’re gonna do, you know, a lot of data collection and analysis in order to, you know, come to a resolution here. And, you know, the same thing for another hypothesis, and so on and so forth.
And this gathering and analyzing this data, it’s very manual. You know, you’re gonna do that through that CLI window more often than not, and it’s just slow. So, you know, wouldn’t it be nice if we could automate that and diagnose many things in parallel? So, the idea here is really troubleshooting automation at its finest, and I want Vince to talk a little bit more about what that looks like, perhaps, in NetBrain.
Vincent: All right. Thanks, Jason. So, I’m gonna pose a scenario to you. Let’s say I’m troubleshooting an issue, and I have a hypothesis that there could be an ACL blocking the traffic from an A to B path. So, what I’m gonna do with NetBrain is, I’ll type in a source IP address and a destination address, and now I’ll check if we can make our way from a specific source port to a specific destination port here.
So, now we’re gonna find this path again on our live network here. So, once we hit the Find Path button, as you can see, we’ll start with that source IP address, and like I said before, we’re taking a look at routing tables as well as any ACLs, policy-based routing, NATing, or VRFs that could change the traffic path. And if anything does change the traffic path here, we’re gonna be visually notified with NetBrain.
So, look at this here. It looks like on this A to B path, it looks like we never made our way to the destination because NetBrain has pointed out that we have an ACL denying the traffic. So, if you recall how we’d learn more about our network, all we have to do is zoom in and we learn more about it.
Looks like that ACL appeared directly on to our diagram now. Well, we can just mouse over that ACL and add the configuration right to the map. Okay, well, this is a pretty easy fix. We can’t get to port 8080 because there’s an ACL denying that traffic. So, we could essentially just SSH right into the devices here and fix that issue.
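Testing the "is an ACL blocking it?" hypothesis amounts to first-match evaluation of the rule list. The Python sketch below assumes a simplified rule format. The `Rule` fields and sample addresses are hypothetical, and real ACLs carry more conditions (protocol, source port, wildcard masks).

```python
import ipaddress
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    action: str              # "permit" or "deny"
    source: str              # source prefix in CIDR notation
    destination: str         # destination prefix in CIDR notation
    dst_port: Optional[int]  # None matches any destination port


def evaluate(acl, src_ip, dst_ip, dst_port):
    """Return the action of the first matching rule.

    Falls back to "deny" when nothing matches, mirroring the implicit deny
    at the end of a Cisco-style access list.
    """
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    for rule in acl:
        if (
            src in ipaddress.ip_network(rule.source)
            and dst in ipaddress.ip_network(rule.destination)
            and rule.dst_port in (None, dst_port)
        ):
            return rule.action
    return "deny"
```

Because evaluation stops at the first match, a specific deny for port 8080 placed above a general permit blocks that traffic while everything else still passes.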
I also want to discuss how NetBrain can automate troubleshooting for you with something called a Qapp, so I’m gonna hop over to my Layer 2 diagram again. What a Qapp is, is a way to automate repetitive and difficult troubleshooting steps. There are over 150 built into the software ready to use out of the box. Let’s take a look at one of them now.
I’m actually going to run a Qapp here to check interface errors. So, what’s going on in the background here when I run this? Well, we’re issuing that show interface command to all these network devices, and it’s parsing out those errors and populating it directly onto our diagram.
So, if we can see there are no errors, the link will appear in green, but if there are errors, take a look over here. We have over 9,000 collisions and over 2,000 CRC errors between these two switches. So, this leads me to believe we could have a problem with, maybe, a speed or duplex mismatch between these two network devices.
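The parsing step behind an interface-error check can be sketched with regular expressions. The sample text below is an abbreviated, hypothetical excerpt in the style of Cisco IOS `show interface` output. Other platforms format these counters differently, so the patterns are assumptions tied to that style.

```python
import re

# Abbreviated sample in the style of Cisco IOS "show interface" output.
SAMPLE = """\
FastEthernet0/1 is up, line protocol is up
     2041 input errors, 2041 CRC, 0 frame, 0 overrun, 0 ignored
     0 output errors, 9312 collisions, 2 interface resets
FastEthernet0/2 is up, line protocol is up
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 output errors, 0 collisions, 1 interface resets
"""

HEADER = re.compile(r"^(\S+) is .*line protocol is ")
COUNTER = re.compile(r"(\d+) (CRC|collisions)\b")


def interface_errors(output):
    """Map interface name -> {"CRC": n, "collisions": n} from show output."""
    results = {}
    current = None
    for line in output.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            results[current] = {"CRC": 0, "collisions": 0}
        elif current is not None:
            for count, name in COUNTER.findall(line):
                results[current][name] = int(count)
    return results


def links_with_errors(results):
    """Return interfaces that have any CRC errors or collisions."""
    return sorted(
        name for name, c in results.items() if c["CRC"] or c["collisions"]
    )
```

A map would draw the interfaces returned by `links_with_errors` in a warning color and the rest in green.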
So, what we can do is, we can certainly do it the old-fashioned way, CLI into the devices, issue some show commands. But why do that when we can automate that with another Qapp? Let’s go ahead and run one to check the interface speed and duplex mismatch on all the network devices across the board here.
Once we hit run, what’s going on in the background is we’re issuing a show CDP neighbors detail command to these devices, then we’re issuing a show interface command to check the speed and duplex, then we’re hopping over to the neighboring devices and issuing that show interface command again. And if there’s a mismatch, NetBrain will automatically label it directly onto the diagram here.
Ah, there we go. So it looks like we do have a duplex mismatch between these two switches. And because we’re such a visual product here, we can just hover over these ports just to verify. Looks like we have full duplex here connected to half duplex on the other side.
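The comparison at the heart of that check can be sketched in a few lines. This assumes the neighbor relationships and per-port duplex settings have already been collected and parsed; the device names, the `DUPLEX` and `LINKS` structures, and `find_duplex_mismatches` are hypothetical and are not part of NetBrain.

```python
# Hypothetical duplex setting per (device, interface), as it might be parsed
# from "show interfaces" output.
DUPLEX = {
    ("SW1", "Fa0/1"): "full",
    ("SW2", "Fa0/1"): "half",
    ("SW2", "Fa0/2"): "full",
    ("SW3", "Fa0/2"): "full",
}

# Hypothetical neighbor links, as they might be learned from CDP.
LINKS = [
    (("SW1", "Fa0/1"), ("SW2", "Fa0/1")),
    (("SW2", "Fa0/2"), ("SW3", "Fa0/2")),
]


def find_duplex_mismatches(links, duplex):
    """Return (local, local_duplex, remote, remote_duplex) for each mismatched link."""
    mismatches = []
    for local, remote in links:
        local_duplex = duplex.get(local)
        remote_duplex = duplex.get(remote)
        if local_duplex is None or remote_duplex is None:
            continue  # One side was not collected, so nothing to compare.
        if local_duplex != remote_duplex:
            mismatches.append((local, local_duplex, remote, remote_duplex))
    return mismatches


if __name__ == "__main__":
    for mismatch in find_duplex_mismatches(LINKS, DUPLEX):
        print("Duplex mismatch:", mismatch)
```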
So, these Qapps, like I said, there are many built into the software ready to use out of the box, but the power behind them is they are very easy to create your own. So, let’s run through a quick example here.
Let’s say that I did not have SNMP enabled on these devices, but I still wanted to check the CPU utilization here. So, let’s go ahead and create a Qapp to do so. Let’s check the CPU utilization. Step one is we want to issue the show command to check that. So, show process CPU, and we want to retrieve it from a device in the network. So, let’s select one to gather this data.
Now, what’s going on is we’re going to this device, issuing the show command, and then we’re presented with the information here on the left. Now, as you can see, all we have to do is parse out the variables that we care about. So, in this example, I care about this one-minute CPU utilization. Let’s highlight it and define it as a variable.
Now, what we need to do, is we just want to name it something, this is CPU UTILIZATION, and we want to put it somewhere directly on our map. If you look at the legend below here, let’s put it right under the device itself, so we’ll put it under position one.
And that’s it. We just created our first Qapp by parsing out data from a show command, and now when we’re ready to run it, we can just go to our Qapp menu here and run that check CPU Qapp here.
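Outside of the tool, the same idea of pulling one variable out of show command output can be sketched with a regular expression. This is only an illustration, not how a Qapp is defined; the sample line follows the summary line that Cisco IOS prints for show processes cpu, and `parse_one_minute_cpu` is a hypothetical helper.

```python
import re

# Hypothetical sample of the summary line from "show processes cpu".
SAMPLE = "CPU utilization for five seconds: 8%/2%; one minute: 6%; five minutes: 5%"

ONE_MINUTE = re.compile(r"one minute:\s*(\d+)%")


def parse_one_minute_cpu(text):
    """Return the one-minute CPU utilization as an int, or None if absent."""
    match = ONE_MINUTE.search(text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print("CPU UTILIZATION:", parse_one_minute_cpu(SAMPLE))
```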
Let’s go ahead and run it now. And now we can see that on all the devices on the Qmap, we get to see that information appear directly onto the device like we set it up to do. And at this point in time, I’d like to pass it over back to Erin here, who has another polling question for us.
Erin: Thanks, Vince. I’ll go ahead and launch that. And we’re wondering if you automate your troubleshooting diagnoses currently. And again, I’ll just give you a few minutes here. All right, let me go ahead and close that and then share the results. And it looks like 69% of you still use the CLI for this.
Jason: Great. Thanks, Erin. I want to conclude my fifth point regarding, you know, accelerating troubleshooting, and that is to document what you’re doing. This might sound counterintuitive because we’re really injecting process here, adding a step to a troubleshooting methodology, but the trick here, of course, is automation.
So, the reason why I propose this is helpful is especially aimed at escalation issues, where a tier one team is going to escalate an issue that they weren’t able to resolve to tier two, the idea being to minimize the repetition of that data mining after escalation.
So, the idea here is that tier one is gathering a lot of information, and they’re gonna run their analysis on it. But if they’re not able to resolve the problem, wouldn’t it be nice if when they escalate they could send all the data they collected up to tier two, so that tier two doesn’t have to repeat the work, the data mining work that tier one did? And they can gather less information, kind of just the additional information that they need.
And the same thing as they escalate up to tier three, if that’s necessary. Tier three can spend all of their time analyzing and, kind of, debugging. They don’t necessarily need to collect a lot of information manually. So, if you can automate the data collection and analysis and, sort of, use that as a collaboration medium, then you’re gonna save a lot of time during escalation events.
Documenting what you’re doing. Now, this is beneficial to help you later on, after the fact, understanding lessons learned. So the idea is if it took, you know, 12 hours to troubleshoot an issue one time, it may improve that next time if we see similar problems. We learn from our issues here. And, you know, documentation is great, too, for root cause analysis reporting for management purposes. So, there’s a lot of benefit to your documenting what you’re doing.
The key is automation, to build this documentation into your troubleshooting process such that it’s seamless and it’s not gonna inhibit your workflow. Find a tool that lets you share the information effortlessly, without adding any overhead to your tasks. So, Vince’s gonna talk about how the Qmap can be that collaboration medium, in at least one example of how NetBrain can help there.
Vincent: Thanks, Jason. So again, I just brought up this Qmap of our Layer 2 diagram, and here I want to point out something inside of NetBrain called the data pane. So, let’s go ahead and open it up. Inside of the data pane, all the information we’ve been gathering here to troubleshoot an issue is saved. So, any show commands we’ve issued are here in the data pane, as well as the monitor data that we’ve been saving while monitoring our network devices.
Let’s first go ahead and take a look at this. So, I’m gonna take a look at the overall health monitor. We had this running for a little bit, and then, using this time bar feature, we’d be able to scroll around and see what was going on at a specific point in time. So, a great use case if you had an issue overnight. You could turn the monitor on before you go home. You could come back in the morning and then review that data at a specific point in time to review some of that CPU utilization, memory utilization, bandwidth utilization as well.
Now, it’s not just this basic monitor whose data we’re saving here. If you recall, we ran some Qapps as well on this diagram. That data is also available here to review over time. So, if you left that monitor on to check interface errors over time, you could then come back and use the time bar and see how many errors were on a specific interface at any given point.
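The time bar idea, looking up what a counter read at a chosen moment, can be sketched as a lookup over timestamped samples. The sample data and the `value_at` helper are hypothetical; this only shows one simple way to find the most recent sample at or before a given time.

```python
from bisect import bisect_right

# Hypothetical monitor samples: (seconds since monitoring started, CPU %).
# The list must be sorted by timestamp.
SAMPLES = [(0, 12), (60, 15), (120, 87), (180, 20)]


def value_at(samples, timestamp):
    """Return the most recent value recorded at or before timestamp, else None."""
    times = [recorded_at for recorded_at, _ in samples]
    position = bisect_right(times, timestamp)
    if position == 0:
        return None  # The requested time is before the first sample.
    return samples[position - 1][1]


if __name__ == "__main__":
    print(value_at(SAMPLES, 130))
```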
As well as this monitor data here in the data pane, you can collect and save information, particularly by issuing show commands here. Let’s go ahead and see an example. If I’m troubleshooting an issue and I need to document what I’m doing, maybe I’m a tier three NOC engineer, what I can do is I can issue show commands here across the board. So, on all of the devices on the map, I want to issue a show interfaces command and a show process CPU sort of command. And I’m gonna start up that process.
So, NetBrain will then go into each device, issue those show commands, gather the information, and then save it directly into our map. So, now that’s also gonna be collected here in the data pane. So if I have to escalate an issue to a tier two guy, if I’m tier three, I just save all this information right to the map itself, and I can even add hyperlinks directly to my diagram. Any notes can be added, or we can bold items here as well and point out specific pieces of the configuration file for my other team members.
Now, if I need to share this map with my team, we have something built into the software as well called the Qmap center. Let’s take a look at that.
So, using this map center, we have the ability to create folders for our team members and share the maps there. So, when my tier two guys launched NetBrain here, I saved the map to the map center. They can just come here and click on that map and have it instantly open with all the data directly in front of them.
And at this point, I would like to pass Presenter over to Jason, who will now discuss five rules to avoid network outages.
Jason: So, those were the five rules I promised about shortening a network outage, accelerating troubleshooting. Now I want to talk about, let’s get to the next slide, five rules to avoid network outages. I’m gonna go through these a little bit quicker. The focus of this webinar is mostly on troubleshooting, and these are about how to avoid outages in the first place.
So, I’m gonna be focusing on the 36% of outages that are resulting from network change. So, I’m really focusing on how we can improve that change management process. So, step one here is to document your proposed changes. The idea here is to collaborate and plan a change workflow, make this a smooth process.
Tip number two is to carefully plan the network change. Now, this might seem obvious, but a lot of changes, they happen sort of ad hoc, and they can be rushed. And, of course, a well-thought-out change is a safer change.
For a safer change methodology, too, leverage automation for repetitive changes. So, suppose you need to configure an SNMP community string, and you have to do it across 50 devices. Rather than risk fat-fingering a configuration command on a device, which could have an impact downstream, you can automate it, right? So, you want to minimize the probability of that, and it’ll certainly accelerate your change deployment process as well.
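As a small illustration of why a template is safer than typing the same command 50 times, here is a sketch that renders one change for every device from a single source. The device names, the community string, and `render_change` are hypothetical; the command follows Cisco IOS syntax for a read-only community, and actually pushing it to devices is out of scope here.

```python
# Hypothetical inventory: 50 device names invented for illustration.
DEVICES = [f"switch-{number:02d}" for number in range(1, 51)]

# One-line change template; "RO" means read-only in Cisco IOS syntax.
TEMPLATE = ["snmp-server community {community} RO"]


def render_change(devices, community):
    """Return {device: [config lines]} with the same rendered lines per device."""
    lines = [line.format(community=community) for line in TEMPLATE]
    return {device: list(lines) for device in devices}


if __name__ == "__main__":
    plan = render_change(DEVICES, "example-community")
    print(len(plan), "devices,", plan["switch-01"])
```

Because every device gets lines rendered from the same template, a typo can only exist in one place, where it is easy to review before anything is deployed.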
You want to validate every network change, too. Minimize the probability of those downstream issues. So if a configuration change has an unanticipated effect on routing downstream, that might have a ripple effect that causes a catastrophe, a network outage.
So, if you could validate the network changes, in other words, understand what the impact of that change is on your configuration, your routing, and your topology, then you’re gonna be much better equipped to catch that, and also to test those changes as well. So that if there is a problem, you capture it in your change window, and ideally you’re able to roll back that change automatically so that it’s not a network user on a Monday morning making a phone call. You can learn within the change window.
So, these are five steps of a methodology, really, a change management methodology. And the last handoff I’d like to do to Vince is to talk about how that can be automated through NetBrain.
Vincent: Thanks, Jason. All right, I’m going to demonstrate for you the NetBrain change management module. So, let’s say that we need to make changes on these network devices on the map. So, let’s go ahead and first define a network change. It’s gonna ask us to save the map first, and then NetBrain breaks it up into a seven-step process to pushing out your changes.
Step one is going to be adding a summary of the change. Now, this is just gonna be for internal use there. You can have a summary for your team.
Then you can go ahead to move to step two, which is defining your network change. So, you can potentially go to each device individually here and apply that unique configuration to that device. However, we also have the ability to use this Config Template here, so we’re gonna push out the same change across the board. I have this configuration that I want to push out to my devices here. Let’s use that. I’m gonna just copy and paste it directly into the Configlet.
During this step, we’re also going to want to define our rollback plan. So, this rollback plan is going to be used to negate the configuration change in case there’s a mistake, as you can see here, so we have a stable state to go back to if need be.
So, let’s go ahead and define that rollback plan as well. Now when you select your network devices, you get to see the change you will push out and a rollback plan in case something goes wrong.
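One simple way to think about a rollback plan is as the negation of each change line, applied in reverse order. The sketch below assumes Cisco IOS-style global configuration commands, where prefixing a command with "no" removes it; that does not hold for every command, and lines that live under a sub-mode such as an interface would need their context preserved. `build_rollback` is a hypothetical helper, not a NetBrain function.

```python
def build_rollback(config_lines):
    """Negate each global config line, in reverse order of application."""
    rollback = []
    for line in reversed(config_lines):
        stripped = line.strip()
        if not stripped:
            continue  # Ignore blank lines.
        if stripped.startswith("no "):
            # The change removed something, so rollback adds it back.
            rollback.append(stripped[3:])
        else:
            rollback.append("no " + stripped)
    return rollback


if __name__ == "__main__":
    change = ["snmp-server community example RO", "no ip http server"]
    for line in build_rollback(change):
        print(line)
```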
Now we’re ready to move to step three, which is, take a benchmark of your network before any change is made. So, here we’re gonna automatically gather configuration files, so you can add any CLI command you care about. Maybe you want to ping a device to see if it’s available before and after, or you can select any or all of this pre-defined data as well. So, we’re essentially grabbing a snapshot of what our network looks like before any change is made.
After we gather that first benchmark, let’s just go ahead and push out those changes. So, it’s just as easy as selecting this play button here to push out those changes you defined at step two. And this is an SSH window, so you’ll be able to follow along with the progress as it goes. And let’s say that you realize, maybe, there’s a syntax error or a mistake. You get visually notified in the execution status, but no worries because you can just stop the process and go to that rollback plan that you defined, so you have that stable state to go back to if need be.
After you’re happy with those changes, go ahead and take a benchmark when you’re done. So, you have a snapshot before the changes. Now you’ll take a snapshot after the changes. The next logical step is gonna be comparing the differences. So, you’ll get a snapshot of the configuration files side-by-side with highlighted differences, any new routes added or removed. Or, any of those CLI commands you told the software to look at, we’d show you the differences there as well.
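The before-and-after comparison can be approximated with a plain text diff. This sketch uses Python's standard `difflib` module on two hypothetical configuration snapshots; it shows only line-level differences and is not the comparison engine in the product.

```python
import difflib

# Hypothetical configuration snapshots taken before and after a change.
BEFORE = """\
hostname SW1
interface FastEthernet0/1
 duplex half
"""

AFTER = """\
hostname SW1
interface FastEthernet0/1
 duplex full
snmp-server community example RO
"""


def config_diff(before, after):
    """Return unified diff lines between two configuration snapshots."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )


if __name__ == "__main__":
    print("\n".join(config_diff(BEFORE, AFTER)))
```

Lines starting with "-" exist only in the first snapshot and lines starting with "+" only in the second, so an unexpected "+" or "-" is a prompt to investigate before the change window closes.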
And the last step here is NetBrain will automatically document this process as well. It can create multi-page change management documentation you can hand off in a couple of minutes, including this, a comparison of the configuration files, an attachment with a ZIP file of all the configuration files inside of it. A nice little Visio diagram here in the report as well. A hundred percent customizable, and you can have them ready to hand off to management or a client within minutes.
And at this point, I’d like to pass it back to Jason.
Jason: Cool. Thanks, Vince. So, we’re starting to wrap up here towards the end. We’re getting towards the end of our hour. I have a slide here, this is something maybe I’ll just let you guys read if you choose to download the slides afterward. I wanted to include some testimonials so that people get a sense of some use cases for how NetBrain’s customers are applying automation during their troubleshooting and the benefits they’ve seen. I won’t bother reading these out loud, but I wanted to include that slide. Really, the next thing I’d like to do is open it up to see if Ray, who’s been answering our questions via the chat window, if he has any that stood out to him and he wanted to address, kind of, verbally here on this webinar.
Ray: Okay. Thanks, Jason. So, I have been answering questions as we go along. Hopefully, you’ve seen the answers. Just to recap, there were a couple of highlighted questions, the first one being, do we monitor QoS usage? And the answer is yes, we have Qapps built that will monitor different queues. We’re constantly evolving these Qapps as well and working with customers to improve the features and the information that’s displayed. And, of course, as Vincent showed you, you can edit these and build them yourself, so you don’t have to wait for us to implement those features.
The other question had to do with firewalls, if we support routing through firewalls, and the answer is yes. I posted the list of the different firewalls that we do support today.
And that kind of follows into the third question, is which vendors do we support? If you go to netbraintech.com, under our FAQ you will see a link to our complete supported vendors and which level of support we have for each. I also pasted the URL into the chat log. And that’s it for the questions.
Jason: Well, the big ones. Okay, cool, thanks Ray, and thanks everybody that’s been submitting their questions as well. We’re happy to answer those. I just have a couple more slides, maybe, as we wrap up, and if there’s any last-minute questions that anybody has, feel free to type them in as well. We’ll try to get to those.
I just have a slide for some of NetBrain’s customers. I won’t spend too much here on this kind of a sales slide. But we’ve been around, NetBrain, since 2004. We’ve been innovating on the concept of computer-aided network engineering. That was the premise that our company was founded on, trying to take the benefits of CAD technology, computer-aided drafting, and applying it to network engineers so they can benefit.
Today, what we’re doing is, we’re moving on to 10 years of innovation along that path, trying to establish visual troubleshooting and change management, and we’re working hard on our NG, which is NetBrain’s next-generation track, which will include, kind of, more of a thin client, IPv6 support, amongst other things. And there’s a lot of great stuff in our roadmap as well.
So, we have a lot going on, and I think it’s definitely an exciting time in the industry, and I just want to thank everybody today for participating on this webinar.
If you’re not a customer already and you’d like to get a chance to try, to lay your hands on NetBrain, we have a free trial on our website, www.netbraintech.com/trial. And so, it’s unlimited features for 30 days. I think it’s a couple of hundred nodes you can discover and map out, and even use it to start troubleshooting on your network. A lot of our customers that used it for troubleshooting, they’ve said that there’s ROI even in that trial, so it’s definitely worth trying out. We also have a slide on architecture that I want to include in this slide deck, so you can see how NetBrain achieves scalability.
And with that, I’d like to, again, thank everybody for attending today. I hope you enjoyed the webinar. Have a good afternoon, everybody.