Webinar

5 Rules to Reduce Network Downtime

Learn how to reduce network downtime using Dynamic Network Mapping

This webinar will teach you 5 rules to reduce network downtime using network automation. You’ll learn how to use network mapping to proactively protect your network and how to use automation to speed up incident response times.

Full Video Transcript

Jason Baudreau: My name is Jason Baudreau. I lead the marketing team at NetBrain and I'm excited about today's webinar. The topic, as you probably recognize, is "5 Rules to Minimize a Network Outage." We're going to be talking a little bit today about network troubleshooting and how we can be more effective with our time when the network's going down. So let's get right into it.

I have a panel of people helping me out and we’ll do some introductions real quick around the table. Ross Merkle, one of our senior engineers here is going to be running a demo portion of our webcast today. Ross, could you introduce yourself?

Ross: Sure. Thank you, Jason. My name is Ross Merkle. I've got just about 30 years of experience in the IT field. Currently I am a senior engineer over at NetBrain and I have been here for about 16 months.

Jason Baudreau: Thank you, Ross. So Ray Belleville is in the room with us too. He’s going to be answering questions through the chat question panel on the GoToWebinar so if anybody has questions Ray will be on the other end. Ray, could you introduce yourself?

Ray: Hi, everyone. Ray Belleville as Jason mentioned. I’ve been in networking for 20 some years. I’ve worked with vendors, telcos, startups, service providers and research and education organizations mainly to design, deploy and operationalize leading-edge technologies which is what NetBrain is. So, glad to be here and will help you answer any questions you have.

Jason Baudreau: Cool. Thanks, Ray. And last we have a guest host joining us today, Jason Neumann, owner of LAN Technologies. He’s going to be talking about some of his technology and the use of tools in his workplace. Jason, could you introduce yourself?

Jason Neumann: Yeah, sure, thank you. My name is Jason Neumann. I'm the author of "The Book of GNS3," which is about a free product that I'm going to talk a little bit about today, one that's pretty handy for networking. And I'm also the owner of a company called LAN Technologies, and I've got about 20 plus years of networking experience.

Jason Baudreau: Cool. Thank you, Jason, and thanks to everybody on the panel. So let's get into just one slide of housekeeping before we get into the meat of things. I want to let everybody know that the webinar is being recorded. We'll make sure we share that with you next week. So we'll shoot out an email so everybody has access to the recording. Same thing with the slides I'm going over. We'll share those too.

Number two is you met Ray. He’s going to be on the other end of the chat so I want to just remind everybody if you have a question please don’t hesitate to type it into the GoToWebinar chat panel. We’ll be happy to take as many questions as you have. We’ll also try to leave about 10 minutes at the end of the presentation and take some of those questions and address them over the phone as well.

So real quick on the agenda. I’m going to talk a little bit about the impact of a network outage. A lot of us have faced them before but we’ll look at:
• The business cost. We'll look at just a few examples to put some concrete numbers to the impact of an outage.
• Address why troubleshooting is so hard. This is kind of from my perspective. I’ll use a common troubleshooting methodology to identify a couple of factors I think are slowing us down when we’re troubleshooting.
Then we'll get to the bulk of it: the five rules to shorten a network outage. I say rules; I just thought it sounded more compelling. They're suggestions, and what I'm really proposing is a methodology for troubleshooting, some best practices. And Ross is going to run through some demos of how automation can help through NetBrain.
And as I mentioned we’ll save some time for open dialogue question and answer at the end of the call.

So getting right into it. The impact of a network outage. I have some numbers to make this a little bit more concrete right off the bat. A recent survey of seven thousand businesses found that 25%, a quarter, of enterprises suffered a major outage last year, defined as four hours or more. Those outages cost those businesses over $1.5 billion in lost profit. So this is just raising urgency to the fact that when there's an outage, every minute truly does count. And only about a quarter of those outages were actually found to be network related. Even though today we are talking about troubleshooting networks and network outages, I think it's important to note that usually, if there's a problem, the network is guilty until proven innocent; it's usually not the other way around. So we're tasked with troubleshooting the network regardless of where the actual fault is. We need to vindicate the network if it truly isn't at fault.

So a couple of examples I’ll just run through. We kind of see these headlines in the news. A couple of years ago the NASDAQ went down and forced trading to halt for a few hours. That was traced to a bug on the backup network. Facebook was down last year for 30 minutes and that cost them about half a million dollars in revenue right off the bat. And Google too had an outage earlier this year so that brought down some of the Google services. So just a point on this slide is that no company is immune and the bigger the company kind of the harder they can fall.

So looking at some of the causes of network outages. I have a Dilbert here which kind of makes me chuckle, about discovering the cause of a network outage. I'll let you read that yourself, but I have more data from another study that Cisco ran recently. They said that 23% of outages were caused by basically a hardware failure, a router or a switch. They identified just less than a third were from a link failure, things like network congestion or a fiber cut. And the bulk of outages were actually the result of a network change, so if we could identify what that change was, we'd be in a pretty good space: 36% were due to an upgrade or a configuration change. And then there's a miscellaneous category bringing up the bottom. So the takeaway I have is that finding the source of the problem is really the hardest part. It's where we spend a lot of our time. Once we find the source of a problem, resolving it is usually secondary.

So I want to talk a little bit more about that, what makes that so hard. Why is troubleshooting so difficult?
I have kind of the needle in a haystack analogy here. So it really is truly hard sometimes like trying to find a needle in a haystack. So I’m going to kind of illustrate a point by just outlining kind of a high level troubleshooting methodology. When we’re faced with not just defining but troubleshooting a problem, we need to gather a lot of information about that problem and we spend a lot of time sort of understanding what information we collected by analyzing that data.
If we're lucky we can eliminate possible variables, and really we need to propose what could be going wrong, a hypothesis. Maybe there's a congested link, a routing issue, or an ACL. These are all things that experience might tell us could be contributing to a possible problem. So in order to narrow that down, we need to test it. We need to collect information and analyze it. And so we go through this loop. This, I've found, is where we're spending all of our time when we're troubleshooting, until we can find a solution.

But specifically I want to highlight a particular area. I call it the diagnosis phase of troubleshooting. It's where we're gathering the information and analyzing the data. That's where really manual and time-consuming processes are in place. Collecting information can take a long time across multiple devices, across multiple interfaces and things like that. So we face questions like, "What's connected? How is it configured?" These are the questions we seek answers to.

So that kind of drives in and sort of sets the stage for what I propose as a new methodology or five rules that we can follow, five steps we can follow to minimize a network outage.

Part one is really focused on what I identified a minute ago, which is automating manual tasks. So really when we're troubleshooting, it's not changing the way that we troubleshoot, it's not changing the processes that we go through, but it is doing it with more efficiency, and that's going to be enabled through automation. Rule number one, first and foremost: we need to maintain accurate network diagrams. They're our best resource when we need to understand things like logical connectivity and the design of the network. That includes redundancy design, firewall rules, whatever it may be. That's all wrapped up in our documentation. We also visualize how application traffic flows across the network. So if we're troubleshooting application slowness, we want to understand how that application leverages the network to pass its data. And diagrams are really what we lean on for isolating the potential source of the problem. We're visual beings. Understanding problems at a visual level, through a network diagram or a network map, is really critical.

And so I just want to identify the challenges that are really inhibiting this. These diagrams are manual. We spend a lot of time creating network documentation. When they are finally, you know, just right the network’s going to change. The networks are always changing and when they change those diagrams become outdated and are no longer reliable.

And more than that, even with a very disciplined approach to network documentation, you're left with what I often call a one-size-fits-all documentation solution. So every site might be documented, but what if you're troubleshooting an issue that traverses multiple sites? Well, then you've got to find the right documentation. Just getting it can be a challenge even if you have it.

So there’s an opportunity here and we think there’s an opportunity to automate this. We can create maps instantly. We want maps that can be updated automatically so we can trust that they’re always reflecting the latest state of the network. And more than that having the right map at the right time – I call it a contextualized map so it’ll help you troubleshoot the task at hand. So to talk more about this I want to showcase this technology a little bit and we’re going to do this through a live demo. I’m going to ask Ross to take over controls and he’s going to share his screen and show you how NetBrain can help automate network documentation.

Ross: Thank you, Jason. So what you’re looking at here is our NetBrain tool and what I want to do is leverage one of our search features which is our A to B live path discovery. What I’ve done is I’ve gone ahead and put in two IP addresses for a couple of systems and what the system is going to do is it’s going to start by logging in to this first device running a series of show commands and you’ll see down here, as it’s checking for things like IPSEC tunnels, ACLs, VRFs, policy-based routes, NATs and it’s going to look for anything that would prevent the traffic from going from point A to point B as well as mapping out all of the devices along that path. Once it’s completed working its way from point A to point B it’s going to start working its path back.
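
NetBrain's live path discovery is its own engine, but the core idea Ross describes, walking a path hop by hop from routing data, can be sketched in a few lines. The sketch below is a minimal illustration and not NetBrain's implementation: it assumes a hard-coded ROUTING_TABLES dictionary standing in for live "show ip route" output and simply follows longest-prefix-match next hops from point A toward point B.

```python
import ipaddress

# Hypothetical static data standing in for live "show ip route" output;
# in practice this would be pulled from each device over SSH or SNMP.
ROUTING_TABLES = {
    "R1": {"10.2.0.0/16": ("R2", "GigabitEthernet0/1"),
           "0.0.0.0/0":   ("R3", "GigabitEthernet0/2")},
    "R2": {"10.2.1.0/24": ("edge-sw", "GigabitEthernet0/0")},
}

def next_hop(device, dest_ip):
    """Longest-prefix match against the device's routing table."""
    dest = ipaddress.ip_address(dest_ip)
    candidates = [(ipaddress.ip_network(prefix), hop)
                  for prefix, hop in ROUTING_TABLES.get(device, {}).items()
                  if dest in ipaddress.ip_network(prefix)]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0].prefixlen)[1]

def trace_path(src_device, dest_ip, max_hops=16):
    """Follow next-hops device by device until the route runs out."""
    path, device = [src_device], src_device
    for _ in range(max_hops):
        hop = next_hop(device, dest_ip)
        if hop is None:
            break
        device, interface = hop
        path.append(f"{device} via {interface}")
    return path

print(trace_path("R1", "10.2.1.25"))
```

A real tool also has to account for ACLs, NAT, policy-based routing and tunnels along the way, which is exactly the per-device checking Ross mentions above.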

You’ll see here it’s adding some additional devices so we have found an asymmetric route. But we were able to get from point A to point B. Now in focusing on the idea of having the right information, the next thing I want to do is actually start zooming into some of these devices. As I zoom in you’re going to see these additional labels appear. As I hover over these labels the configuration that drives them appears. So here you have the accurate up-to-date information and all of these labels are again based on the configuration that is within these devices. So as your configuration changes these dynamic labels can automatically be updated. Now as I zoom in like I said I get additional information. If I hover over a single device and zoom in far enough I’m going to reach what’s called the observer mode. And the observer mode is like a deck of cards focused in on one single device. I’ve got a picture card, properties card, topology card, a design card where you can get to the full configuration. We can do performance monitoring here. We can also do a comparison based on just that one device.

So as well as having that focused information another element I can do is I can pick this path and go ahead and create it from a layer two perspective. And now what this is going to do is take that layer three path which is the routing path and convert it over to the actual physical path. And this is my physical path. So it’s going to show me all of my switches and routers. I don’t know if you remember from the previous diagram, right, we only had that one switch in there but we are now going to see every switch, every port that that traffic is going into and out of. So I’ve effectively taken my entire network, boiled it down to a half a dozen devices that allow me to do a focused investigation of only the devices that would be involved in this traffic path to look for any issues.

Jason Baudreau: Thank you, Ross. I will grab the presenter role back over here. So we can see it's not just about having up-to-date documentation but having the right documentation at the right time. In the case of what Ross was showcasing, he was potentially troubleshooting an application issue; mapping that application flow gives you a contextualized map that's updated and has the right information you need.

Number two is 'perform an overall health assessment.' So once you have the right documentation, looking at network health is the next step. Up to 50% of problems can be traced back to a common set of symptoms: things like bandwidth congestion, high memory or CPU utilization at the device level, increasing link errors and high latency. These are all examples of performance characteristics.

The challenge that we have when we're looking at the health of a network is that we often lack what I call a holistic view of it. We can drill down to the device level: I can log into a device, look at the interfaces and see what kind of interface errors they're having. Or we can go wider and look at network-wide dashboards from monitoring tools, but those aren't really contextualized, or else we're back to looking at an individual device. If we had a way to see the health and performance of the specific devices we're interested in, at the map level, that would be a lot more valuable: monitoring every device and interface visually, and also being able to record and rewind that performance.

So if there's a problem that's happening, but it's happening with some degree of intermittence, can we record that over time? Can we look for peaks and valleys in the network performance? Can we trace those back to a particular network event or a particular user event? Being able to monitor performance at the map level, visualize it, and then record and rewind it gives us a really powerful view of not just what the network looks like from a topology perspective but also a pulse of the network. And so I'm going to do this handoff a couple more times and ask Ross to showcase what this looks like at the map level. So Ross will take that back and take a look at NetBrain from a demo perspective again.

Ross: Thank you, Jason. Yeah, so here's our map that we went ahead and created. Now I did already have this map running with a monitor going on it, because I want you guys to see what the data looks like after we've had a moment to gather it. So the first thing I want you to see is that we took that map and we turned everything red and green. Green of course means good. Red means that you're over threshold. I'm going to hover over this and you'll see that the threshold is set right around 30%. These thresholds are all user definable, and I did specifically set that one because I wanted you guys to see what happens when something goes over threshold.

Now, as well as giving you, you know, the graphical overview perspective, we’re also going to show you the latest point in time. This is a polling process. Every 30 seconds it’s going out and gathering that data so you can come down to this data point here and see for any device, these are, you know, sortable. I can also select any one of these. So as I select something you’ll notice that it goes ahead and flashes it on the screen so you’ll always know what you’re looking at. So as well as having the latest point in time, I also have this history graph. As Jason had mentioned, this will gather history over time so I could look for my peaks, my valleys, that type of information.
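
To make the polling idea concrete, here is a minimal Python sketch of the same pattern Ross describes: poll on an interval, compare each sample against a user-definable threshold, and keep a history so you can look for peaks and valleys later. The get_cpu_utilization helper is a placeholder, not a real NetBrain or SNMP call.

```python
import random
import time
from collections import defaultdict

POLL_INTERVAL = 30          # seconds, matching the demo's polling cadence
THRESHOLDS = {"cpu": 30.0}  # user-definable, e.g. flag CPU above 30%

history = defaultdict(list)  # device -> list of (timestamp, value) samples

def get_cpu_utilization(device):
    """Placeholder for a real SNMP/CLI collector; returns percent CPU."""
    return random.uniform(0, 60)

def poll_once(devices):
    """One polling pass: record history and flag anything over threshold."""
    for device in devices:
        value = get_cpu_utilization(device)
        history[device].append((time.time(), value))
        status = "RED" if value > THRESHOLDS["cpu"] else "GREEN"
        print(f"{device}: cpu={value:.1f}% -> {status}")

if __name__ == "__main__":
    for _ in range(3):                     # a few passes for illustration
        poll_once(["LA-Core1", "WAN1"])
        time.sleep(POLL_INTERVAL)
```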

Now overall, I'm looking at it and, except for this little bit of memory utilization going on, it looks pretty healthy. But let's say that I had an interface that was being oversubscribed. I could just right click on it and bring up the NetFlow information as well. So what the system is going to do is bring the latest NetFlow cache out of the device, and it's going to show me who my top talkers are, who's using my bandwidth. Also, because I know the source IP address, the source port, the destination IP and the destination port, I could create a path from here as well, and so I could have the two paths overlaid on the same map to see if it's not only impacting one device; it should be impacting many.
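
Pulling a NetFlow cache is device- and vendor-specific, but the "top talkers" analysis itself is just an aggregation over flow records. A rough sketch, using made-up flow tuples in place of a real cache:

```python
from collections import Counter

# Made-up flow records standing in for a device's NetFlow cache:
# (src_ip, src_port, dst_ip, dst_port, bytes)
flows = [
    ("10.1.1.20", 51023, "10.2.1.25", 443, 1_200_000),
    ("10.1.1.21", 49811, "10.2.1.25", 443,   300_000),
    ("10.1.1.20", 51024, "10.2.1.30",  80,   900_000),
]

def top_talkers(records, n=5):
    """Sum bytes per source IP and return the heaviest senders."""
    usage = Counter()
    for src_ip, _sport, _dst_ip, _dport, nbytes in records:
        usage[src_ip] += nbytes
    return usage.most_common(n)

for ip, nbytes in top_talkers(flows):
    print(f"{ip}: {nbytes / 1_000_000:.1f} MB")
```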

Now the monitors not only run on the layer three map; I can also run those on our layer two map as well. So I'm going to go ahead and bring that monitor up, but what I also wanted to point out is that we have a variety of different monitors based on different technologies. So if you're not looking for only overall health, we can help you with a lot of different monitors. But as you can see, we see that same information here. We can now see things like the interfaces that are active versus the interfaces that are down, the bandwidth, CPU and memory utilization; all that information is available on the L2 map just like our L3 map.

Jason Baudreau: Absolutely. Thank you, Ross. So stepping through the methodology that we're outlining, it's mapping out a particular problem area and getting an initial diagnosis, as if you're going to a doctor's office and checking out, "What could be wrong with me?" It's getting your blood pressure taken and looking at how your heart rate is doing, everything like that. That's the analogy for what we're doing at the network level from a performance perspective.

So I'll take over as presenter. Moving on with the step by step: if you've identified a problem you might want to drill in, but if you haven't yet, there are other things that you might want to dive into. Rule number three that I outlined is a way for us to capture and document network changes. So beyond just network performance, understanding what's changed.
If you remember, earlier on I identified the top causes of network outages. The study from Cisco identified that 36% of outages were caused by some sort of a network change. So the challenge is identifying what's changed. Sometimes that can be the hardest part. We're not always the ones making the changes ourselves. Sometimes something happened overnight and we come into the office the next morning and things aren't working as they were. So, identifying what's changed.

In fact, there's a study I came across that said the average number of changes on a network is about two changes per device per year. So depending on how large your network is, if you have a couple of thousand network devices, you could experience thousands of network changes per year. Usually they're small, maybe just a change to a configuration name or some sort of a bandwidth constraint, but something that small can have a really nasty ripple effect across the network. Any change can have an unexpected impact downstream.

So the opportunity for automation here is to capture every change that you make to your network and enable that to happen automatically. Configuration changes are one thing: making sure that we understand how our configurations look today, how they looked yesterday or last week or last month, and what's changed. But beyond that, what about the routing tables? Have there been any modifications to the route tables, the MAC tables, changes in topology, even changes in the way an application traffic path flows across the network? I think it's important to highlight that while somebody might make a configuration change on one device, the impact of that change can be seen across the network. So the opportunity for automation is to identify a way to capture all changes across the network. It doesn't just end with capturing configuration changes.
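
Outside of any particular product, the "capture everything automatically" idea boils down to collecting configurations and state tables on a schedule so there is always a baseline to compare against. A minimal sketch, with fetch_output as a placeholder for a real SSH or API collector:

```python
import os
from datetime import datetime, timezone

SNAPSHOT_DIR = "snapshots"
COMMANDS = ["show running-config", "show ip route", "show mac address-table"]

def fetch_output(device, command):
    """Placeholder for a real SSH/API collector returning command output."""
    return f"# output of '{command}' on {device}\n"

def take_snapshot(devices):
    """Write one timestamped file per device per command."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    for device in devices:
        folder = os.path.join(SNAPSHOT_DIR, device, stamp)
        os.makedirs(folder, exist_ok=True)
        for command in COMMANDS:
            name = command.replace(" ", "_") + ".txt"
            with open(os.path.join(folder, name), "w") as handle:
                handle.write(fetch_output(device, command))

take_snapshot(["LA-Core1", "WAN1"])
```

Run this on a schedule (cron, a CI job, or a product's benchmark feature) and "what did this look like last week?" becomes a question you can always answer.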

So there is a kind of a solution there as well. Ross will help us walk through that with NetBrain.

Ross: What I really wanted to get to was the idea of looking at a map at a time instead of just simply a device at a time. So here we have this asymmetric route, and we don't know if this is a new phenomenon or it's been existing for a while. So there are a couple of things I can do. The first thing I'm going to do is re-map this out, but instead of looking at the live network, as Jason had referred to, we're going to gather all of your routing tables, ARP, CDP, MAC, all the state tables, as well as your configuration files, so I can map this out from a previous point in time as well. So now instead of reading the live data it's actually reading all of the cached data. And you'll have to excuse me, I did make one small mistake that I do need to address, and that is our default gateway.

So I have to set that. Because it is looking at cached data I do have to select the appropriate default gateways, but once I've done that the system will go ahead and draw that out. Now one of the things I do want to highlight here is you'll notice that the traffic path is symmetrical; it almost writes over top of itself. I can take these paths and drag them out, make them a little easier to see. And we went out to this DMVPN cloud and we came back through the same DMVPN cloud, whereas currently we're going out through the DMVPN cloud and coming back through this MPLS cloud. So now that we've identified that this happened within the last week, what I want to do is use our change analysis to determine where this came from. I'm going to compare our historical data.

So what I'm going to do is select two different benchmarks in time, basically what we're looking at today versus what we were looking at last week, and when I do the comparison I can compare a lot of different tables of data, but I'm just going to focus in on the configuration and the routing table. When I do the comparison, you'll see very quickly it comes back with the results. Now I have these badges that I can hover over that'll tell me what's changed on each device, so I can stay focused on the map, or I have this results window over here. By clicking on each of these devices I can see very quickly which of the tables have changed.

So to leverage this information, let's look at LA core one demo, because that's where our traffic path started to diverge. I can see that I have a routing table change as well as a configuration file change. To analyze it, all I have to do is double click on it. So very quickly we know that we're looking for 103-20 and here it is. We have a new route. It's an external route, EIGRP, going to WAN1 demo. So if I click on WAN1 demo you'll see it automatically brings up that routing table, already compared for me. Scanning down through it really quick, I can see that I don't have a new route. Let me just validate that that route was already there. So sure enough, we already had that static route. So if the routes didn't change, I can look at the config file, again, with a click of a button.

Now we know it's EIGRP because of the letter in front of it. So I'm just going to walk down through our changes to the EIGRP and see a redistribute static. So one of the benefits and values of using our automation is that a lot of time is spent with engineers gathering data, setting it up for comparison, running the comparisons, figuring out what has changed. We're going to do all of that for you. You're going to spend your time analyzing change, not going out grabbing the information and documenting it. Now as easy as this was, it was only the second hop. But let's say it wasn't. I could just as easily have clicked on any device on the map that had a change in it and seen that change automatically, and that's also true for the configuration file as well.

So we’re really going to make an effort to maximize your time and minimize basically the wasted grunt work of gathering that data.
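
The comparison step Ross walks through, two benchmarks of the same table diffed to surface what changed, can be approximated with nothing more than Python's standard difflib. The snippet below is illustrative only; the EIGRP stanza is invented to mirror the "redistribute static" finding from the demo.

```python
import difflib

# Two benchmark copies of the same device's EIGRP stanza (illustrative only).
last_week = """router eigrp 100
 network 10.0.0.0
""".splitlines()

today = """router eigrp 100
 network 10.0.0.0
 redistribute static
""".splitlines()

def what_changed(old, new, label):
    """Return only the added/removed lines between two benchmarks."""
    diff = difflib.unified_diff(old, new, fromfile=f"{label} (last week)",
                                tofile=f"{label} (today)", lineterm="")
    return [line for line in diff
            if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))]

for change in what_changed(last_week, today, "LA-Core1 config"):
    print(change)   # prints: + redistribute static
```

The same pattern works for routing tables, MAC tables or any other text snapshot, which is why capturing them on a schedule (as sketched earlier) pays off.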

Jason Baudreau: Thank you, Ross. That's excellent. So if we're walking through a set of diagnoses, we've looked at things from the performance perspective and we've looked at things from the what's-changed perspective. We did highlight that there was some sort of a change, a routing change since last week and a change in the application traffic path. We haven't necessarily zeroed in on what could be causing some slowness, but let's continue and look at rule number four.

We've looked at various ways to diagnose the network, but rule number four is a broader suggestion: automate troubleshooting diagnoses. For any problem there can be hundreds of possible causes; networks are complex. So the challenge here is that manually diagnosing these potential issues one at a time is extremely time consuming. This is the part of the troubleshooting methodology I walked through earlier, on the right side: the gathering information and analyzing data loop. We're constantly doing that manually and we're spending a lot of time there. Is there a way that we can automate this loop?

So, you know, there are monitoring tools that will collect data for us automatically. There's a degree of automation there, but I have the question, "What's next?" If you're aware that there's a problem, how do you begin to troubleshoot that problem? The most popular tool that we have would be, you know, the CLI, and beyond that we're fairly limited in tools that would help us analyze a live network.

So there's an opportunity for automation. You see that I kind of tried to make that haystack smaller and make the needle bigger. Diagnose the entire map. Diagnose the network at the map level rather than just one device at a time. This means automatically collecting live data across every device and every interface on the map, and generating alarms based on the output of those devices. If there's something that's not what you'd expect, is there a way that you can have an alarm generated automatically and sent to your alerting tools? And then, again, contextualizing and visualizing that data on the map. So there's an opportunity to leverage the map as that single pane of glass to automate your troubleshooting diagnoses. And that's number four. Ross, could you give us a look?

Ross: Absolutely. And this time I won't be on mute first. Okay. So one of the things that Jason referred to was, let's go ahead and look at some CLI data. That's very common. I could just right click onto something, launch a telnet session from here and go ahead and look at that CLI data, but what I'm going to do instead is actually ask the system to log into each of these devices and show that information for me. Now what I did is I just selected a list of commands from a template. You can build a template of any particular area that you want to look at. We have some built in for you, but one of the things I also want to point out is that this is a read-only tool. You can do a show, a get or a display. You specifically cannot run a config t or a debug, because an improperly done debug could take a system down and we want our tool to be safe for anybody to run.

So I've got this list of commands. I'm going to go ahead and just hit the start button. You'll see here it's going to log into each one of those devices, run that command for me and then save that data for us. And that is one of the key elements: when you're doing your investigation you need not only access to the data but a way to save it. So any time I run these commands, the output will be saved over here on our map data pane. And I'm just going to create a hyperlink to make it easy for me to get to that data.

So as you can see here, I have my show ip interface brief. I can click on any one of these devices and it'll show me that one command on any of them. I could also select any of the different commands. So whether it's the same command over multiple devices or multiple commands on a single device, I can have all of that.
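
For teams scripting this by hand rather than using a product, the "same commands across many devices, saved for later" workflow might look like the following sketch. It assumes the open-source netmiko library; the hostnames and credentials are placeholders.

```python
from netmiko import ConnectHandler

DEVICES = [
    {"device_type": "cisco_ios", "host": "la-core1.example.net",
     "username": "readonly", "password": "secret"},
    {"device_type": "cisco_ios", "host": "wan1.example.net",
     "username": "readonly", "password": "secret"},
]
COMMANDS = ["show ip interface brief", "show ip route summary"]

# (host, command) -> raw output, a rough stand-in for the "map data pane"
results = {}
for params in DEVICES:
    conn = ConnectHandler(**params)          # SSH to the device
    for command in COMMANDS:
        results[(params["host"], command)] = conn.send_command(command)
    conn.disconnect()

for (host, command), output in results.items():
    print(f"=== {host}: {command} ===\n{output}\n")
```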

Now the other side of this is I've got an interface that's in an up/down state. Well, maybe that's not the appropriate state and I want to ask someone a question about that. I can come in here, maybe highlight it, maybe even give it a different color, and then save this off to say, "Oh, please investigate."

I can then take this map, send it off to another engineer, just tell him, “Hey, bring that up. Take a look at the interface I highlighted.” So we also act as a, you know, a shared communication point so you can share that information with others.

Now the next thing I want to do is get back, as Jason had said, to looking for that performance issue. And that's going to be on our layer two map. Now there are some additional things we can do from here. The first one I want to introduce you to is our Qapp platform. This is where we have a number of prebuilt tools that will allow you to basically take any command and turn it into a monitor. So as an example, this particular one is checking for interface errors. Over here you can see where all the items are green with zeroes; that means it was healthy. Over here you can see items that are showing that we have some collisions, we've got some input errors, we've got some CRC errors. Now what I want to show you is how easy and simple it is to make those, and I'm going to edit that one right here for you. Any of these Qapps within our tool are editable.

So if you find one that's close and you want to modify it, it's just this simple. The first thing it's going to ask me is how often I want it to loop. These are all just drag and drop objects. Then it's going to basically ask me what command I want to run; I want to do a show interface. Down here I can see the different fields that are being gathered. Say I wanted to add overruns. I can just highlight it and click define variable. It's going to do the parsing for me; I just have to call it overrun.
From there it’ll add it to the table so I now have my overruns. I’ll add it to the command. Then I’ll add it to the table and then finally I would come over here and determine where I wanted to put it on the screen. So it’s just that simple to take any output from any show command and turn it into a custom monitor. So as my example here, which is showing me that I do have some errors between these interfaces.
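
The "define variable" step Ross describes is essentially parsing a named field out of raw show-command output. A minimal sketch of that parsing, using an invented excerpt of "show interface" output:

```python
import re

# A short, illustrative excerpt of "show interface" output.
sample = """GigabitEthernet0/1 is up, line protocol is up
     0 input errors, 3 CRC, 0 frame, 7 overrun, 0 ignored
"""

# Each "defined variable" is just a pattern over the raw CLI text.
PATTERNS = {
    "input_errors": r"(\d+) input errors",
    "crc":          r"(\d+) CRC",
    "overrun":      r"(\d+) overrun",   # the newly added field from the demo
}

def parse_counters(text):
    """Pull the numeric counters out of raw CLI output."""
    counters = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        counters[name] = int(match.group(1)) if match else None
    return counters

print(parse_counters(sample))   # {'input_errors': 0, 'crc': 3, 'overrun': 7}
```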

Now another Qapp that I want to show you is the one for our speed/duplex mismatch, because if I looked at that information I might say, "I think that is a speed/duplex mismatch." Now I can check this a couple of ways. I could also just hover over it, because there I get my half duplex or my full duplex, but there's value in allowing the system to do this for us, and that's when it comes to documentation, because it's not enough that we simply do a root cause analysis. We have to have the ability to document that, and now I have a fully documented map: I do have a speed/duplex mismatch. I could turn this into a Visio diagram with a click of a button. I could also just export it out to an image file.
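
A speed/duplex mismatch check is a good example of a diagnosis that is easy to automate once both ends of each link have been collected: just compare the two sides. A small sketch with hypothetical link data:

```python
# Hypothetical per-link data: each side's (device, port, speed, duplex),
# as would be parsed from "show interface" on the two ends.
links = [
    {"a": ("SW1", "Gi0/1", "1000", "full"),
     "b": ("SW2", "Gi0/24", "1000", "half")},
    {"a": ("SW1", "Gi0/2", "1000", "full"),
     "b": ("SW3", "Gi0/24", "1000", "full")},
]

def find_mismatches(link_list):
    """Flag links whose two ends disagree on speed or duplex."""
    problems = []
    for link in link_list:
        dev_a, port_a, speed_a, duplex_a = link["a"]
        dev_b, port_b, speed_b, duplex_b = link["b"]
        if speed_a != speed_b or duplex_a != duplex_b:
            problems.append(f"{dev_a} {port_a} ({speed_a}/{duplex_a}) <-> "
                            f"{dev_b} {port_b} ({speed_b}/{duplex_b})")
    return problems

for issue in find_mismatches(links):
    print("Speed/duplex mismatch:", issue)
```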

Jason Baudreau: Thank you, Ross. That's excellent. So really what we're looking at is a way to spend less time in the command line interface and ask the system to perform those commands for us, not one device at a time but across multiple devices, and even to take it a step further and perform a set of analyses on the output to determine if something is within a tolerance or if something is unusual. That's really the power of the Qapp technology: you can customize what you want to perform and ask the system to do it for you automatically.
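
Putting the last two sketches together, a map-level diagnosis amounts to "run the parser against every device on the map and flag anything outside its tolerance." Here is a small sketch with hypothetical, already-collected counters; none of the names are NetBrain's.

```python
# Hypothetical per-device counters already collected from the mapped devices
# (e.g. by the command runner sketched earlier), plus user-defined tolerances.
MAP_COUNTERS = {
    "LA-Core1": {"Gi0/1": {"crc": 0, "overrun": 0},
                 "Gi0/2": {"crc": 14, "overrun": 3}},
    "WAN1":     {"Gi0/0": {"crc": 0, "overrun": 0}},
}
TOLERANCES = {"crc": 0, "overrun": 0}   # anything above these gets flagged

def diagnose_map(counters, tolerances):
    """Compare every counter on every mapped device against its tolerance."""
    alarms = []
    for device, interfaces in counters.items():
        for interface, values in interfaces.items():
            for name, value in values.items():
                limit = tolerances.get(name, 0)
                if value > limit:
                    alarms.append(f"{device} {interface}: {name}={value} "
                                  f"exceeds tolerance {limit}")
    return alarms

for alarm in diagnose_map(MAP_COUNTERS, TOLERANCES):
    print("ALARM:", alarm)   # this is where an alerting-tool hook would go
```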

So the last thing I want to touch upon is part two. We spent a lot of time on part one, which is enhancing the way that we troubleshoot our networks. Next I want to talk about empowering engineers at all levels to troubleshoot with what I'll call a collective experience. So what do I mean by that? Looking back at the methodology that I outlined early in the discussion, we looked at where we spend a lot of time, but in terms of expertise, obviously a lot of network teams have those heroes, the network heroes who need to be called in when there's a particularly large challenge. A lot of what they offer, and what experience guides them in doing, is proposing a hypothesis: "What could be going wrong?"

Advanced problems mean there might be advanced possibilities for what the issue could be, so questions like, "What can cause this? Have we seen this before? How did we fix it last time?" come up. The takeaway I'm trying to articulate is, "How can we prepare everybody on the team to troubleshoot with that level of experience?" And really, it's not something that is simple to come by, but it's important. I have this quote here: "All this has happened before. And all of it will happen again." Kind of a big mystical quote, but what it means is that very few things are unique, and that goes for troubleshooting as well. Most problems have come up before. Maybe we haven't seen them ourselves, but somebody has. Somebody figured out how to solve that problem in the past. How can we take that knowledge and wrap it up?

And so what I'm building a case for is to leverage the troubleshooting playbook. I touched on the challenge, but what does a playbook offer? Maybe it's a document, maybe it's a process, but it's crowdsourced: everybody who troubleshoots the network is ultimately lending their experience to this playbook. It should be a living document. And the opportunity here for automation is, could we take it a step further and automate various plays in that playbook? So it's really talking about two disciplines. The first is the discipline of capturing our knowledge from experience. The second is the opportunity for automation: how can we automate those tasks? If we create a really excellent playbook, can we put some automation power behind it? Each diagnosis is part of a play, and every diagnosis can be automated, right? We saw that a second ago with what Ross helped showcase, automating network diagnoses. So imagine the power of diagnosing a network through a series of best practices implemented across an entire team and doing that with NetBrain automation; really, building a catalogue of Qapps into the playbook is what we're talking about.
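
One simple way to picture an automated playbook is as a catalog of small diagnosis functions ("plays") that all run against the same collected data and pool their findings. The sketch below is purely illustrative and is not NetBrain's Qapp format; the plays and data are invented.

```python
# A minimal "playbook" sketch: each play inspects collected data and reports.

def check_interface_errors(data):
    bad = [i for i, c in data["interfaces"].items() if c["crc"] > 0]
    return [f"CRC errors on {i}" for i in bad]

def check_duplex_mismatch(data):
    return ["duplex mismatch on Gi0/1"] if data.get("duplex_mismatch") else []

def check_recent_config_change(data):
    return [f"config changed: {line}" for line in data.get("config_diff", [])]

PLAYBOOK = [check_interface_errors, check_duplex_mismatch,
            check_recent_config_change]

def run_playbook(data):
    """Run every play and collect findings so the whole team benefits."""
    findings = []
    for play in PLAYBOOK:
        findings.extend(play(data))
    return findings or ["no findings; escalate with the collected data attached"]

sample = {"interfaces": {"Gi0/1": {"crc": 12}}, "duplex_mismatch": True,
          "config_diff": ["+ redistribute static"]}
print(run_playbook(sample))
```

Each engineer who solves a new class of problem adds one more play, which is the "collective experience" point above.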

So this is a way to wrap it up. We talked about how we can better map a particular problem area, how we can isolate performance issues, isolate changes on the network, and even begin to drill down by diagnosing automatically what can happen. And rule number five is really just a way to wrap it all up together and make this into a process that can be followed, something that the entire team can leverage and, ultimately, even automate. That would be the best of both worlds.

I just have an image here that I found by doing a Google search for "network troubleshooting playbook." It's an example for a failure to apply an ACL; that could be one of those hypotheses I talked about. And it's really just a set of checks in a decision tree, like, "Well, is the QoS policy filter configured correctly?" And depending on how that tree flows, you might run a set of plays according to the structure. Each one of these has the potential for automation. I just thought it was interesting to point out an example.

So that wraps up my five rules. I wanted to give a little bit of the spotlight to Jason Neumann, who we asked to join the call. Jason is very much into network troubleshooting and network tools and ways they can help enable his small business, and one of his favorites I know is GNS3. So, Jason, I'd like to invite you to open up the mic a little bit and talk about your experience in this regard.

Jason Neumann: Sure, thank you. Yeah. So we're going to shift gears a little bit and talk about what GNS3 is and how you can use it. So, you know, what is GNS3? Well, GNS3 is graphical network simulation software, and "The Book of GNS3" is basically a user's manual for the program. It's free, it's open source software, and it runs on Linux, Windows and OS X. It's developed on Linux and then ported to those other operating systems, so in my opinion it runs most gracefully, I'll say, on Linux. But it can run on Windows and OS X just fine and lots of people do that. What it does is create a virtual network environment so that you can design and test multi-vendor equipment. And that even includes NetBrain's free version of their software, which is called DevOps, which I got running alongside GNS3. I created a simple GNS3 network, installed DevOps on a little Windows virtual machine, pointed it at my network and did, you know, some pretty cool things with the little B2B network.

But really what GNS3 is good for is research and development type stuff, you know, proof of concept. It's good for education. It's really used a lot, I think, in certification, so people that are doing exam preparation will use it to build these virtual networks to get a CCNA or CCNP or CCIE, or maybe a Juniper network certification, something like that. But the nice thing about GNS3 is it's really easy to use. It has a simple graphical interface. Off to one side of the screen you see your devices, routers, switches, etc., and you can just drag them to a workspace and link their interfaces up together virtually inside of GNS3 with just a click of the mouse. You can boot up those devices, log in and start configuring your network. So, you know, in closing I'm just going to say that for me GNS3 is pretty much an invaluable tool. I use it all the time, and if you haven't tried it, or DevOps for that matter, you really should give them both a try. They're free and they work pretty well.

And the other thing about GNS3 that I should probably mention is it scales pretty well, because it does support balancing across multiple PCs. So you can fire up the GNS3 server application on a whole bunch of PCs, run the GUI on one computer, and run all these different routers and devices across all of those PCs to build a much more scalable network. And GNS3 has some pretty exciting stuff coming out. Right now it uses Dynamips, which I mentioned, and QEMU, which is the Quick Emulator, and it integrates with VirtualBox. Pretty soon, they're working on a new development where it's going to integrate with VMware, so ESXi and VMware Workstation, and that's going to add a whole other bunch of devices that you can integrate into GNS3.

So that’s kind of my talk on GNS3 but if you haven’t tried it, I really recommend that you just take a look at it.

Jason Baudreau: Thanks, Jason, yeah. I just want to point out some interesting parallels; one of the reasons I asked Jason to join is the parallel between GNS3 and NetBrain. They both offer a live map into a network. In the case of GNS3 it's more often a virtualized network, but you get a live view into the network from a perspective that you wouldn't otherwise see. And similarly, NetBrain has a free version, the DevOps edition. So you could actually play with virtualized network routers and switches through GNS3 and plug that into NetBrain, and it's a free tool as well. So thanks, Jason.

So I think now's a good time to wrap up, and once again I just want to say thank you to everybody that's still on the call. I appreciate you dialing in today. Hope to see you next time. Thanks again. Have a good rest of your day.

Watch a Demo - Learn About NetBrain

See how Dynamic Mapping and Runbook Automation help you reduce network downtime.

