NetOps teams lack end-to-end network visibility across their networks. This webinar will show you what End-to-End Visibility looks like, and how to achieve it in your network.
Jason: All right. Hello, everybody, and welcome to the event today. We’re talking about end-to-end network visibility. We think it’s the holy grail in network management, and it looks like a lot of you agree: this is our most anticipated webinar ever, I’m happy to report, with over a thousand people registered. We’re really excited. And so, first let me just say thank you guys so much for joining us on the call.
I want to introduce myself and the panel. My name is Jason Baudreau, I’m the U.S. Marketing Manager here at NetBrain. I have some slides and a presentation I’m going to share with you for the next several minutes. I’ll also introduce you to Ross Merkle. He’s a Senior Network Engineer here at NetBrain and he’s got a great demo lined up for this presentation as well. Ross, would you like to say a quick hello?
Ross: Greetings, everyone. I look forward to the opportunity to show you some of the fantastic new features that we have available in NetBrain.
Jason: Thank you, Ross. And also I’d like to introduce Todd Bristol. Todd Bristol is a Principal Design Engineer at Move, Inc. Move, Inc. is a customer of NetBrain, they run the website, realtor.com and Todd’s going to talk a little bit about a design project he’s been working on. Todd, would you like to say a quick hello?
Todd: Yes. Hello, everybody, and I look forward to speaking with you a little later on in the presentation.
Jason: Awesome. Thank you, Todd. So let’s get moving. We also have Martin in the room with us here, Martin Venkov and he’ll be answering your questions through the Q&A window. And so, I’d like to remind everybody who’s on the phone and on the WebEx, to please use the Q&A panel at the top of your window if you have any questions throughout the event. We’ll be here to answer those and we’ll try to leave some time at the end of the event as well for some open questions. And the webinar is being recorded. We’re going to share the recording with everybody who registered next week via email. Check for that and we’ll share these slides as well. So don’t worry if you miss something.
So let’s start by looking at three priorities for network operations. For some organizations, these may be divided among separate teams. The network design function is responsible for network upgrades and enhancements to support enterprise growth. The troubleshooting function, of course, is responsible for responding to network problems promptly, to minimize business interruptions. The security function is responsible for assessing network vulnerabilities and defending the network to protect the business and customer data. And you’ve probably heard the news recently about the Yahoo! security breach. It’s a good reminder of the threats that face us.
And ultimately, the network is one of an organization’s most important and valuable assets. So really supporting the network means supporting the business.
And so, you know, there’s a set of challenges associated with that, and I’d like to focus on the complexities of these networks and how those impact the network team. So there’s complexity that is really compounded by a series of trends in the industry. Most organizations are looking closely at these. The first row here represents trends driving network change, right? So the first is the Internet of Things. It’s multiplying the number of endpoints on the networks, because virtually everything nowadays, from coffee makers to TVs, can have an IP address. The second, cloud computing, is really changing the traffic patterns in our networks; it’s pushing a lot of the traffic off to the WAN. And, of course, there’s mobile computing, which brings with it increased bandwidth requirements from all the apps and the streaming that’s happening.
Right now, virtualization and software-defined networking are emerging technologies to address some of these complexities, but they each come with their own learning curves associated with adoption. And all this has happened while we’re continuing to fight what I’m calling an asymmetric war against cyber-attacks. And so, I’ve identified four challenges that stem from all this complexity. The first is that our networks are growing so vast and so complex that keeping them documented is very much a manual process, and it’s just not practical or scalable.
The second is that when there is a problem, because of all the complexity and depth, the troubleshooting is very much like trying to find a needle in a haystack. The third challenge is that the network changes themselves are very risky. The latest studies that we’ve seen show that about 50% of network outages can be traced back to a network change as the root cause. And Todd, who you met a moment ago, is going to talk a little bit about how Move, Inc. is leveraging NetBrain in the design phase to make these changes a little bit less risky.
And the fourth: with all this happening, teams are relying very heavily on what I’ve heard referred to in the industry as “tribal leaders.” So you have virtualization and cloud computing experts, and you have data center experts, maybe voice experts. These are all different areas of expertise that require someone to sort of stand out and be the leader. And when there’s an issue, we rely all too heavily on these leaders. And so, what this all adds up to is that this increased complexity is really reducing the amount of visibility we have in our network.
And I think about this, you know, in the analogy of the network, comparing it to an ocean. It’s both extremely vast and extremely deep. And so, to get end-to-end visibility, you think you need to zoom out really far. You see from one side, maybe from North America, all the way over to Africa and Europe. You get end-to-end visibility, but what you lose is any sort of depth. You only see the surface, kind of a superficial layer, and that’s kind of how it is when you have a wide-spanning network. If you want to get deep into how that network is configured and designed, you really need to dive in.
And then there’s the various complexities here. Back to the analogy: there’s an iceberg as you approach the water now. We know that 90% of the mass of an iceberg is below the surface of the water, but this analogy actually even goes beyond that. It goes beyond the depth where you discover aquatic life and coral reefs, all the way down to sunken ships. And then when you get to the bottom, where there’s not even light, there are still, you know, new organisms being discovered. So we have the challenge of understanding the complexity both as far as the network expands wide and as far as it goes deep, and that means looking at how it’s designed and how it’s performing. We’ll talk more about that.
And so, there are really three priorities that we’ve talked about before, but I want to zero in on one in particular to bring this example home, and that’s troubleshooting. Troubleshooting is where the challenge of visibility is most severe, you know, when every minute counts. So allow me to dive deeply, for just a minute, into a troubleshooting methodology, because I think this does a good job of highlighting the challenges associated with visibility. This is just a high-level methodology, but really what happens is that in order to diagnose a problem, we have to gather a lot of information and analyze that data.
And, you know, what happens next is we have an idea of what could be wrong, maybe a hypothesis, but we need to test that hypothesis. And, again, we have to go around this loop, gathering information and analyzing that information, until we understand a little bit more about what could be going wrong. This can take a long time before we determine a solution, and in some cases, depending on how manual this process is, it can actually take days.
And so, you know, an area that we’ve identified here at NetBrain is that people are spending a lot of time in what we call the diagnosis phase. That’s gathering information, looking at the data, and analyzing the data. And when it comes to troubleshooting, there are really four questions, you know, to be asked, and answering them gives us better visibility. So the first question is, for any given problem, what’s the path of that problem? Networks are always designed to move traffic from a source to a destination. So what’s the path of that traffic flow for a given problem? Two is looking at how the network is configured and designed along that path. Do we understand, first, how traffic is designed to flow across that path? Without that, we won’t understand what could be going wrong.
Number three is what’s happening on the network. That means, in terms of the network’s live performance, are there any issues on the interfaces? Are the devices up or down? Are they stable? And the last question, perhaps the most important, is what changed? Remember, we found that 50% of outages can be traced back to some kind of a change. Visibility into that change is potentially going to give us visibility into 50% of the problems. And so, just to show what this looks like from the perspective of ‘in the trenches,’ what happens when we’re troubleshooting is we find that there’s a problem. There are two ways: either you’re going to get an alert or an alarm, or maybe someone’s going to complain about an issue. And in either case, what happens next is very manual.
First, we have to rely on network diagrams. When it comes to visualizing the network, this is the best resource that we have. But there are two questions you have to ask: “Where is the diagram?” and “Is it up-to-date?”, because the networks are changing so frequently. Without those, you really have only ping and traceroute to understand how traffic is propagating.
Next, you know, we’re left with few choices when troubleshooting except to get in there and go deep into the weeds. And that’s through the command line interface. We have to look at the configs in order to understand the design, we have to, you know, issue a bunch of show commands, and then pull up some design documents to better understand that. And then, of course, when it comes to finding out what’s changed, we might not know to ask. So there’s this frantic aspect of trying to track down the change. And what I’m arguing is that this is all very manual and very time consuming, at a time when we can least afford it. So the question, again, that we ask ourselves here is: what does end-to-end visibility look like, and how would it change this process if we had it?
And so the first thing is, what’s the path? Again, understanding the path of a problem is really the most important step. And everything in NetBrain happens on the map. So you can create a map of any application path, or of any slow voice traffic that you’re experiencing, by just inputting a source and a destination IP address and mapping that out in real time. The second thing is, okay, now that we have the path, how is the network configured? You don’t necessarily have to log in to a bunch of devices to figure that out. You can zoom into the map and understand virtually hundreds of design attributes.
Next is what’s happening, or how is the network performing along that path? We’re looking at a view here of a heat map, which we see in red. If you look closely here between the DMVPN and the WAN router in Los Angeles, we see that there’s a bandwidth bottleneck, and we can see there’s a spike in the utilization of the bandwidth. So this has given us some visibility into the performance, or what’s happening. And, again, what’s changed? This is a view into what that looks like. You can see what’s changed by looking at incremental changes on the map. That brings up a timeline on the bottom here, and you can drill into any change and find out if it’s been a change in configuration, a change in routing, or maybe a change in a MAC table or an ARP table.
So you can go very deep. The key here is that we talked about both the breadth and the depth of the complexity: for breadth, you can really zero in on a problem area with a map, and for depth, you can zero in on any sort of design attribute, performance attribute, or even what’s changed. And so, it’s one thing to talk about this and show you some pictures. What I’d like to do next is hand it over to Ross to run a little bit of the demo and show you what that looks like in action.
Ross: Thank you, Jason. So what we’re looking at here is where we’re going to start. And where we’re going to start, as Jason talked about with that end-to-end visibility, is really by building a path. So I have a couple of IP addresses. Now, we’re looking for a performance issue.
So the first thing we’re going to do is actually draw that path out. The system is going to start by logging into this first device, running a series of commands to determine what the next hop in that path is going to be. As you can see, we’re going through things like ACLs, we’re going to do policy based routes, NATs, etc. This is more than just a simple ping or a traceroute from one side of the network to the other.
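As a rough illustration of the hop-by-hop idea Ross describes here, the sketch below walks a path by doing a longest-prefix lookup on each device in turn. It is not NetBrain’s implementation: the device names, the prefixes, and the `ROUTES` table are all made up for the example, and it ignores the ACL, policy-based routing, and NAT checks mentioned above.

```python
import ipaddress

# Hypothetical per-device routing tables: prefix -> next-hop device name.
# A next hop of None means the destination is directly connected.
ROUTES = {
    "branch-rtr": {"0.0.0.0/0": "wan-rtr"},
    "wan-rtr": {"10.20.0.0/16": "core-sw", "0.0.0.0/0": None},
    "core-sw": {"10.20.30.0/24": None},
}

def next_hop(device, dst_ip):
    """Pick the longest-prefix match for dst_ip in the device's table."""
    dst = ipaddress.ip_address(dst_ip)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in ROUTES[device].items()
        if dst in ipaddress.ip_network(prefix)
    ]
    if not matches:
        raise LookupError(f"{device} has no route to {dst_ip}")
    # The most specific (longest) prefix wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

def trace_path(start_device, dst_ip, max_hops=32):
    """Walk device by device until the destination is directly connected."""
    path = [start_device]
    device = start_device
    for _ in range(max_hops):
        hop = next_hop(device, dst_ip)
        if hop is None:
            return path
        path.append(hop)
        device = hop
    raise RuntimeError("hop limit exceeded; possible routing loop")

print(trace_path("branch-rtr", "10.20.30.40"))
```

Running the same walk from the far end toward the original source would give the return path, which is the second step in the demo.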
Now, after I’ve built the path from point A to point B, I’m going to go ahead and turn that around and then build our return path as well, because it’s important to know what it looks like on both sides of the equation. And then very quickly, I now have that end-to-end visibility. Now, this is also very surface level, as Jason had alluded to, we’ve got to not only see the path, we have to start getting additional detail about it. And for us, that can be as simple as starting to zoom in. Now, as I zoom in, you’re going to see additional pieces of information start appearing on these different links.
So here I have things like the IP address and the EIGRP. If I hover over the EIGRP, I can bring up the configuration that drives why it appears on this map. Now, because these are dynamic labels created by the system based on your configuration, the important thing to remember is that these will automatically update anytime you update your configurations. Now, I can also add static configurations as well. So let’s say I wanted my static IP addresses: I’m just going to click on full config, go down to my static IP addresses, select them, and just have the system put that on the map for me. So whether it’s dynamic configurations that the system puts on there automatically, or supplemental configurations or annotations that I want to put together myself, this is just one way we can go ahead and start annotating the map.
Another way is to leverage the database that we built around all the devices. And within our design tab, I can do that by bringing out all of the routing protocols. So with a click of a button, I can see every routing protocol we’re running and the interfaces that are configured to use them. These red and blue lines are demonstrating my internal and external BGP as well as the AS numbers for my different BGP routers. So at any level, you can start getting more and more detailed about what your particular design is. I can even pull out specific information based on any one of these devices, but instead of doing a single device at a time, I’m going to do an entire map at a time. And that’s by just running CLI commands, but instead of running them against a single device, I’m going to run them against all of them. So with the click of a button, you’ll see, it’s going to log into each of these devices and, at the same time, pull all that information out for me.
Now from here, I get to see all of those commands run against a single device or multiple devices. And again, this is also annotatable. Maybe I don’t like the fact that that particular interface is showing up and down. I can highlight it, make it bold, make it red, and I can even leave myself notes. Click the save button, and that information is now saved in this map for me so I can come back and review it at any time.
Now, the next exciting feature, and this is a new feature for us that I want to bring out, is what we call Instant QApps. Now, Instant QApps allow you to pull out virtually any piece of information on your network and quickly put it on your map. So let’s say that I want to know the version of my devices. All I have to do is type in the word “version,” look for it in the list, and drag and drop it out. These green boxes are going to show me each of the devices that understand that particular show command, and it’ll populate it for me. It’s now literally logging in to those devices and adding that information to the screen.
Let’s look for another piece of information. Let’s say I want to see the MTU. Now, the MTU is going to be coming out of the show interface command, because that’s on the interface table. All I have to do is look for the one that’s highlighted in red. And when I click on it, it’ll actually show you where we’re getting the data from. It’s looking at a show interface command, and it’s looking for the MTU size right there. So not only will you get to see the data, we’re also going to tell you exactly where we’re getting that data from. Again, all I have to do is drag and drop, put it on the screen, and it’ll start annotating the map with these labels wherever they appear on our interfaces.
So the next piece I want to do is, let’s say, I want to see our CRCs. It’s the same thing. I can click on it, validate that that is exactly what I wanted to see, and drag and drop it out. Now, with CRCs we’re starting to get away from pure design, because what I really care about with CRCs is whether these errors are incrementing. Not necessarily that I have them, but am I generating more of them all the time? Because if I’ve got a couple of CRCs, those could easily be from, you know, six months ago, when I had an error and no one cleared the counters. So at any moment, I can click a button and turn what I just pulled out into a monitor. And now the system will go in and periodically start pulling out that CRC information for me. So really we’ve kind of made a bit of a shift from how is this configured to what is actually happening.
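To make the “is it incrementing?” point concrete, here is a minimal sketch that compares two polls of per-interface CRC counters and reports only the interfaces whose count grew. The interface names and counter values are invented, and this is only an assumption about how such a check could be written, not a description of how the NetBrain monitor works.

```python
def incrementing_crc(previous, current):
    """Return {interface: delta} for CRC counters that grew between two polls.

    An interface with no earlier sample has no baseline, so it is treated
    as unchanged rather than flagged. Counters that went down (for example
    after someone cleared them) are ignored as well.
    """
    growing = {}
    for interface, count in current.items():
        delta = count - previous.get(interface, count)
        if delta > 0:
            growing[interface] = delta
    return growing

# Hypothetical samples taken a polling interval apart.
poll_1 = {"Gi0/1": 42, "Gi0/2": 0, "Gi0/3": 7}
poll_2 = {"Gi0/1": 42, "Gi0/2": 0, "Gi0/3": 19}

print(incrementing_crc(poll_1, poll_2))
```

Here `Gi0/1` carries 42 stale errors but is not reported, because the count did not move between polls; only `Gi0/3` is.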
Now, you don’t have to drag and drop out everything you want to see because we understand that there are certain things that, from a monitoring perspective that you really want to check first. And that’s all contained within our overall health monitor. So if you want to look for things like bandwidth utilization, memory utilization, CPU utilization, even this little proxy ping here, it’s going to show up very quickly on the map.
The other thing is you have the ability to set thresholds. Now, the threshold for this one I had originally set to 30%. You saw it flash for just a second. But if anything were over its threshold (and thresholds are user-configurable), it would actually highlight itself in red, what we call a hotspot. So not only can you tell very quickly what the data value is, but you also have these hotspots to draw your eye immediately to what you need to be focusing on.
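The threshold idea can be sketched in a few lines: compare each reading to a user-set limit and collect the ones that exceed it. The metric names, device names, and most limits below are assumptions for illustration; only the 30% figure echoes the value mentioned in the demo, and the talk does not say which metric it applied to.

```python
# Hypothetical, user-configurable limits per metric (percent).
THRESHOLDS = {"cpu_pct": 80.0, "mem_pct": 90.0, "bw_util_pct": 30.0}

def find_hotspots(readings, thresholds=THRESHOLDS):
    """Return (device, metric, value) for every reading above its threshold."""
    hot = []
    for device, metrics in readings.items():
        for metric, value in metrics.items():
            limit = thresholds.get(metric)
            # Metrics without a configured threshold are never flagged.
            if limit is not None and value > limit:
                hot.append((device, metric, value))
    return hot

readings = {
    "la-wan": {"cpu_pct": 12.0, "bw_util_pct": 71.5},
    "bos-core": {"cpu_pct": 85.0, "mem_pct": 40.0},
}

for device, metric, value in find_hotspots(readings):
    print(f"HOTSPOT {device} {metric}={value}")
```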
Now, let’s say that we looked at this traffic and we go, “All right, we want to know where this asymmetric route is coming from.” There are a couple of things I can do to help you with that as well. The first is we’re going to map this out. I don’t know if you remember, we talked about how you can map things out as they currently look, but I can also map this out from a historical perspective, because, as Jason had alluded to earlier, having the ability to compare a known good state to the current state and look for those differences allows you to quickly get to what an issue might be. So what I’m going to do is actually have the system map this path out from last week’s perspective.
So what I’m looking at here now is what did it look like last week, and what does it look like today. And it looks pretty clear that we have an asymmetrical route today as opposed to our symmetrical route in the past. I’m going to drag and drop these out so you can see very clearly that last week the traffic went through this DMVPN cloud on the way out and came back through that DMVPN cloud. Yet today, our traffic is going out through that DMVPN cloud, but coming back through this MPLS cloud. So now we’ve found it, but we really want to see where it’s coming from. If you want to do that, we can help you with that, and that’s where the comparison comes in. We have a couple of different tools to help you with that.
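The comparison Ross walks through can be expressed as two small checks: whether the return path retraces the forward path, and which hops differ between two snapshots. The hop names below are placeholders standing in for the DMVPN and MPLS clouds on the map, and the snapshot structure is an assumption made for the example.

```python
def is_symmetric(forward, reverse):
    """A round trip is symmetric when the return path retraces the forward hops."""
    return forward == list(reversed(reverse))

def changed_hops(old_path, new_path):
    """Hops that appear in one snapshot of a path but not the other."""
    return sorted(set(old_path) ^ set(new_path))

# Hypothetical snapshots of the same source/destination pair.
last_week = {
    "forward": ["site-a", "dmvpn", "site-b"],
    "reverse": ["site-b", "dmvpn", "site-a"],
}
today = {
    "forward": ["site-a", "dmvpn", "site-b"],
    "reverse": ["site-b", "mpls", "site-a"],
}

print(is_symmetric(last_week["forward"], last_week["reverse"]))
print(is_symmetric(today["forward"], today["reverse"]))
print(changed_hops(last_week["reverse"], today["reverse"]))
```

The forward path is identical in both snapshots, so the difference is isolated to the return direction, which is where the rest of the investigation would focus.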
The first is to compare historical data. So the idea behind comparing the historical data is, we have all of these data points that we can compare, this week versus last week, to find out exactly what happened. And if you’re interested to see more, just sign up for a personal demo; we’d love to show you how this thing works. So not only can I do that to find a specific change, but if I just want to get a good overview, that’s our ‘show incremental changes.’ And what this does…now, we’re in a demo environment, so we can’t have changes happening all the time. But what this would do is, in a colorful stack graph, show you all of the changes that happened, benchmark by benchmark by benchmark. So if you do a daily benchmark, you’ll be able to see which days actually had the most changes and then drill into that as well. So that’s how we can provide end-to-end visibility: by quickly mapping it out and then diving in to get as much detailed, granular information as you need.
And with that, I’m going to hand it back to you, Jason.
Jason: Thanks, Ross. That was great. Let’s make sure you can see my screen again; I hope that pops back up for everybody. So to wrap up Ross’s demo while that’s coming up: he had a lot to show there. Hopefully you guys were able to see that. We talked about what’s the path of any problem. We zoomed in to see how it is configured, and we looked at any sort of data that we want to see, dragged and dropped onto the map. We turned that into a monitor to see what’s happening, and we were even able to see what’s changed. And here’s a view of what that would look like if we were able to benchmark some data on the screen right now.
So following Ross’s demo, I’d like to take the time to ask a polling question. And the question is, how would you characterize your network visibility today? And so, we’re going to throw this poll up. Just a minute here while we pull it up. Now, for those of you that don’t have the Webex in full screen, you might have to pull up the poll from the drop-down menu at the top of your Webex to see that come up. But you should see the question, “How would you characterize your network visibility?” So I’ll give everybody just a few seconds to please submit your answers, and now let’s see how we all did.
It looks like about 50% of you said that you have limited or outdated documentation. And this is very similar to what we see when we talk to our customers as well. So 50% have limited or outdated documentation, and 28% say their documentation is up-to-date, which is great, but the challenge is that end-to-end visibility, meaning how deep you can go, is a little bit manual.
And then 15% of you are saying you’re using NetBrain for mapping. So the takeaway here, I hope, is that there’s a lot deeper you can go when you use NetBrain. And then there’s about 10% of people here saying, “We use NetBrain for network design, performance, and history,” which is really realizing the full value. So thank you all for participating; that was kind of an interesting way to get everybody’s feedback.
So going back to this slide here, we talked about the top priorities for the network operations team. We just saw a demo from Ross, where he was really looking at troubleshooting along the path. You know, the next thing I’d like to zero in on is how end-to-end visibility benefits the network design and upgrade phase. And so, I’d like to introduce, again, Todd Bristol. Todd, Principal Design Engineer at Move, Inc., has a case study where they’re actually using NetBrain’s visibility capabilities in the design for a migration to Amazon Web Services. So, Todd, are you there?
Todd: Yeah, I’m here. Thanks a lot, Jason. And again, thanks for the opportunity to be here. You know, looking back at some of your previous slides, there was one where you had mentioned new trends equals increased complexity. And I love that slide and the points that are highlighted there. And if I could add one slide to the deck right there, it would probably say, “New technology equals increased complexity.” Most of us here on the call, including myself, if we’re fortunate, have had the opportunity to design or implement a new solution, method, or technology, be it switches, storage, or something like that. And along those lines, we also introduced increased complexity. Now for some, it may have been the highlight of your career; for others, sometimes not so much. I’ve kind of had both, but that’s just how it goes.
Another fact is that most of us, and I feel it’s pretty safe to say this, get the better part of our understanding of new technology, or begin to master it, in production. And that doesn’t always have to be the case, and sometimes that’s not the best place to do that. And here’s an example. I have a six-month-old daughter. And 15 weeks into the pregnancy, we started catching a glimpse of who she was before she showed up, right? We started seeing, not just MRIs, but ultrasounds and 3D ultrasounds. And as the baby grew, we could see her heart and other organs; they started counting fingers and toes, determined whether it was a male or a female, and all those kinds of things happened before she showed up. Now, in the same way we see value in that, we should also see the value in becoming familiar with new technology before it goes into production. And my point is production doesn’t have to be the place where we begin to understand or master the complexity associated with new technology.
So here’s an example, and this is pretty much why I’m here, where you can build such an environment away from production using NetBrain and VIRL. And for those of you who may not be familiar with VIRL, it’s Cisco’s virtual lab environment, pretty similar to GNS3, and we’re not going to spend a lot of time talking about that here. But if you’re not familiar with it, I strongly suggest that you take a look at it.
So let’s jump into our environment and what we’re doing in the next one. Earlier this year, we were given the challenge of connecting two additional 10-gig circuits between our AWS environment and our data center, and you can kind of see that here. I’ll go over some of the components. You can pretty clearly see there’s a BGP level, and we also highlighted eBGP because that is part of it, obviously. There’s an OSPF level, there’s the spanning tree level, there’s, you know, a physical switch; there’s a whole bunch. And this is really just a high-level, you know, 50,000-foot view.
But in this, and actually I want to make one quick comment: the design that I’m talking about is really focused on West. We have a requirement to do the same type of thing in AWS East, and we just figured, “Hey, if we’re going to design for the west, we might as well design for the east at the same time; it doesn’t really take that much more because we’re pretty much just going to copy it.” But in any case, this is what we’re going to eventually build out. And when you build out an environment like this, there’s a bunch of things you have to consider. There’s a big list of challenges and complexities; there are some on the screen, and some I even thought of as I was sitting here waiting.
The most recent one we ran into in the design was that AWS changed the private AS range that you can use now as a customer. So we have an existing circuit going up to AWS, a DX. And that’s on one private AS number. We’re not allowed to use that number for these two new circuits. So there’s one right there. You also have things like route poisoning, right? You want to make sure that routes that actually came from your environment don’t come back to you. You know, you have to think about inbound route filtering, inbound route redistribution, inbound traffic steering, route map placement, and all kinds of things. And there’s absolutely zero room to figure all that out in production. You really need to have a place to not only figure it out, but master it and tweak it before you show up in production.
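One of the concerns Todd lists, routes that originated in your own environment coming back to you, is what BGP’s AS-path loop check addresses: an advertisement whose AS path already contains the local AS number is rejected. The sketch below models only that rule. The AS numbers and prefixes are invented, and it does not reflect Move, Inc.’s actual route maps or AWS’s permitted private AS range.

```python
LOCAL_AS = 65001  # hypothetical private AS number for the data center

def accept_route(as_path, local_as=LOCAL_AS):
    """Reject any advertisement whose AS path already contains our own AS."""
    return local_as not in as_path

# Hypothetical inbound advertisements: prefix -> AS path as received.
advertised = {
    "10.50.0.0/16": [64600, 64700],
    "10.60.0.0/16": [64600, 65001],  # passed through our AS already
}

accepted = {
    prefix: path for prefix, path in advertised.items() if accept_route(path)
}
print(accepted)
```

A lab is a good place to confirm that a check like this still behaves as intended once a second private AS number is in play, since the filter is only as good as the AS numbers it is told to look for.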
Now what normally happens, and, you know, we’re guilty of this, is you get a package like NetBrain and you point it toward production, and in that, you get lots of practice adding devices into production. You get a lot of practice and a lot of playtime with adding maps or creating maps and things like that. But you seldom experience the personality of an outage because, after all, you don’t really want a lot of outages happening in production. Then there’s the flip side. You get something like VIRL, which is perfect for validating your methodology, your syntax, all your configs and things like that. You build out 88% of your production configs, and I say 88 because, you know how it is, you’re never going to totally build everything out in VIRL. You’re going to actually have to bring it into production and get the final things ironed out, but you pretty much get the point.
But the problem is, when it comes to testing your methodology and testing the ways that you switch from one circuit to another during failures and things like that, you have to manually go and grab all of that stuff. So what we’ve done, and this next slide is going to actually show this, is we took NetBrain and pointed it to VIRL running on Packet, which is basically VIRL running in the cloud. Now, this was huge, because now not only do we have the virtual environment to test and play around and build out, you know, this complex design, but we also have NetBrain, which is effortlessly reporting the state of my environment as it changes.
So in this picture, you can see an example of that, and it’s pretty similar to what was shown in the previous demo. Basically what we’re doing is asking the question, how do we get from a source to a destination? And the destination is the 10.19216 host and it’s pretty straightforward. The way we built this out, we said, “We want normal traffic in and out to flow through,” for lack of a better phrase, “the router on the left.” And in the event that there’s a failure with any of the components or circuits associated with that, we want it to take the right path. So here you can see it’s doing exactly what we wanted. You can see the inbound and the outbound path, and this is in VIRL, and NetBrain is reporting that back to us.
Okay, so let’s move on to the next slide. Then we induced the failure and here’s an example of one of the sweet things that NetBrain brought back to us. Firstly, in the upper left hand corner, you can see a list of devices that had changed, and the change count. That’s beautiful because we introduced one change, but we can see on one router that actually ended up in four changes as it relates to that router’s current environment including the configuration. And we also see another router that’s downstream that had nothing to do with the change and was far away from it, but we saw how that change impacted the router as well.
And on that one router, we can see the things that changed: a route table change, an ARP table change, a CDP table change, all changes that were expected. And in the middle, where the arrow is, you can actually see where we introduced the change, by the highlighted shutdown. And even though this is in our lab environment, you can imagine how this will help in production as well. It’s pretty much what we looked at earlier: 50% of outages are caused by changes, by us, right? So if there is a situation where there was a change in production, you can see how NetBrain could help you figure that out too, especially if it’s something like an interface that was shut down that maybe shouldn’t have been. In this case, it’s what we wanted, and it’s showing us exactly what we did.
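The highlighted shutdown is, in effect, a configuration diff between two snapshots. A minimal way to get the same kind of view outside any particular tool is Python’s standard `difflib` module, as sketched below. The interface name, description, and address are made up for the example; only the added `shutdown` line mirrors what Todd describes.

```python
import difflib

# Hypothetical interface config before and after the induced failure.
before = """interface GigabitEthernet0/1
 description Primary circuit
 ip address 192.0.2.1 255.255.255.252
""".splitlines()

after = """interface GigabitEthernet0/1
 description Primary circuit
 ip address 192.0.2.1 255.255.255.252
 shutdown
""".splitlines()

def config_changes(old, new):
    """Return only the added ('+ ') and removed ('- ') lines between snapshots."""
    return [
        line
        for line in difflib.ndiff(old, new)
        if line.startswith(("+ ", "- "))
    ]

for change in config_changes(before, after):
    print(change)
```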
Okay, on the next slide we can kind of move forward. And now we can see after the change, we went ahead and told NetBrain to regather information about the environment and display that graphically and that’s what we see. And now we can see the traffic pattern is actually flowing the opposite way. And this is good for us because, at a high level, we can say there’s more that we have to dive into here, but initially, at a high level, we can see the network – according to how we designed it – is behaving the way we designed it to behave.
Okay, the next slide, here’s another view, and this is more of a routing table view. It’s still in a table format, but it kind of dives a little bit deeper. And here we can actually take a look at the routing table and this is really, really cool. So if we look at the orange or, actually, green and yellow segment that’s just highlighted there, you can actually see what happened to the route that this host was sitting on. So initially before the change, we see that that router had an iBGP route for 10.192160. It’s iBGP because we can see the AD is 200. Then after the change, we see that route turned into an OSPF route with an AD of 110 and a metric or cost of 400. That’s exactly what we want it to do.
So basically what happened was its iBGP peer said, “Hey, you know what, I have this route because my link is up. I’m going to hand that route to you.” And that route came across through OSPF. I also want to kind of make note of this: the metric is 400, and that’s because we’re balancing out the route. As you deal with outbound steering, we just kind of gave that route a little heavier weight so that it wouldn’t be preferred. And the fact that 400 is in our table lets us know that the primary route went away. So this is huge. This makes it really, really easy to validate and verify how we designed the network.
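The before and after behavior can be illustrated with a small route-selection sketch in Python. It assumes Cisco’s default administrative distances (200 for iBGP, 110 for OSPF) and uses the cost of 400 mentioned above. The `Route` class, the prefix placeholder, and the iBGP metric of 0 are hypothetical and are here only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    protocol: str
    admin_distance: int
    metric: int


def best_route(candidates):
    """Pick the lowest administrative distance, then the lowest metric."""
    return min(candidates, key=lambda r: (r.admin_distance, r.metric))


PREFIX = "host-subnet"  # placeholder; the real prefix is not reproduced here

ibgp = Route(PREFIX, "iBGP", admin_distance=200, metric=0)  # metric is assumed
ospf_backup = Route(PREFIX, "OSPF", admin_distance=110, metric=400)

# Before the failure only the iBGP route is known on this router.
before = best_route([ibgp])
# After the failure the OSPF route with cost 400 is learned and wins on AD.
after = best_route([ibgp, ospf_backup])

print(before.protocol, "->", after.protocol)
```

Under these assumptions the installed route moves from iBGP to OSPF as soon as the OSPF route appears, because a lower administrative distance is compared before any metric.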
And this screen, I actually like this screen a little bit better for some reason, just ’cause I’m a CLI guy. This screen actually kind of gives us the raw data. The other screens were kind of graphically representing things or putting them in tables, but this is actually the raw data from the router. And we can see for that route, the 10.192160 route, that it’s experiencing a RIB failure. Again, this is exactly what we expect because we broke that circuit and now that route doesn’t know how to get where it’s supposed to get.
So if you prefer getting things back from the CLI or if you prefer getting things graphically, NetBrain is going to basically tailor itself to however you decide to look at your environment. One more quick thing I’m going to show, is the graph off to the right. This graph pretty much represents ‘how do I get to my destination before and after a change.’ So the green path, what we’re seeing here is the path that represents how those bottom three routers get to the destination. And then kind of like the pinkish purple path is a change in that, so we can see two routers had a change in how they get to the destination.
Again, there are all kinds of tools that you can use to help yourself figure things out within NetBrain when you point it to this environment. And at the end of the day, you know what this does for us? Now, we can actually enjoy the art of an outage. I’ve never heard that before, so I’m going to kind of coin it. But there really is an art to an outage. Normally, when there’s an outage in production, we don’t have the opportunity to sit back and look at it because we’re running around with a fire hose trying to put things out, answering to the service desk or answering to our manager who’s reporting up to somebody else what’s going on and how quickly we’re going to fix it. But in this environment you can actually enjoy it and get a little more intimate with it. And when you get more intimate with it, you actually begin to understand the complexity associated with new technology, which is what we began this conversation with.
Last thing I want to talk about is without NetBrain, if I were to have to get this information and gather it and get this comfort level, I probably wouldn’t, for a couple of reasons. It’s not because I don’t want the comfort level, but the effort that it would take to get this and to make a change in your config and then go back and have to redo it again, it would probably take too long and I probably would end up just not doing a good job of it. So I’m super glad that I do have NetBrain to assist us here and to help us get familiar with this project and the technology and complexity associated with it before it goes into production. Anyway, that’s it. Thank you, everybody. I’d like to thank everybody for your time. I really hope this was helpful. And let’s turn this back over to Jason.
Jason: Todd, thank you so much. When I talked to Todd, everybody, it was a really interesting use case to see NetBrain deployed on a VIRL virtual environment and to see how he’s using it. I want to thank you, Todd, for sharing that story with us. So we’re about two-thirds through the hour here and, you know, for the last 10 or 15 minutes or so, I just want to talk about one last point, which is the question of knowing what to look for. You know, when you’re talking about end-to-end visibility, there still is the question, what are we looking for? And I have the subtext here that the value of network experience is really the key.
So for any network problem, you know, going back to the troubleshooting scenario from earlier, there can be hundreds of possible causes, right? So here are just a few examples of things that can go wrong and here are dozens of examples of underlying reasons behind them. Not to go into the details, these are just examples that kind of came to my head. The point being, back to that analogy, it’s like trying to find the needle in the haystack, right? And so knowing where to look is the first step, and in order to know where to look, you must first ask the right questions. That’s the point. So going back to that troubleshooting methodology, this is actually zeroing in on the bottom right quadrant here. Remember, we glossed over it: proposing a hypothesis for what could be the challenge. And, you know, the argument here is that that only comes with experience. Again, it’s tribal knowledge. Without having seen it before, you might not know what could be causing the problem.
And so, the challenge is, without that tribal knowledge, even with the best automation and visibility tools, we still struggle. And this is something that we’ve been thinking about for a long time here at NetBrain as well. And here are a couple of examples of those types of hypotheses. But what we came up with is pretty interesting. It’s the concept of what we call digitizing that tribal network knowledge. And this is what it looks like. You might have a data center guru, a voice guru, a security guru, and they each have their own, you know, specific tribal knowledge. And how can we digitize that for the entire team to benefit from? What we came up with is a concept called Runbook automation.
So what’s a Runbook? A Runbook is basically a place where we can encapsulate a methodology, a process, and not only document that process (a lot of people are familiar with the concept of a playbook) but also automate it. And this is really following a very similar flow to what we saw in the first part of the demo where we looked at the topology, the design, the performance, the changes. And, you know, the specific methodology for each instance might be different while the general flow might be the same.
And so to give an example, we actually have two Runbooks here. The first is diagnosing the hypothesis that there is a layer-two problem. This is going back to the example that Ross ran at the beginning, where there is some slowness on that path. And the second is maybe there’s a QoS problem. And so we’re going to run these two Runbooks, and we’re actually going to hand this off to Ross to demo again, to look at how Runbook automation can benefit not only the end-to-end visibility, but also help instill the best practices from that tribal knowledge. So that’s really the best of both worlds. And so for that, our last demo here, Ross is going to take the controls again, or maybe I should give them to him. And we’re going to see what that looks like in real time. So, Ross, here’s the ball.
Ross: Thank you, Jason. So, yeah, the first thing that I want to do, getting back to our previous example, is if I want to look at something from a layer-two perspective, the easiest way to do that is to have a layer-two map. So with the click of a button, I can create that layer-two map for us. Now, this provides some initial value because we’ve been talking about that end-to-end visibility and how important that is. And this is the best example of our end-to-end visibility.
So now I have the source and the destination. Every device, whether it’s a firewall or router, a DMVPN cloud, all of the interfaces from point A to point B, to tell me exactly where I need to look. But maybe I don’t know what to look for. So let’s say that, you know, I’m the NOC engineer, and so I get the ticket, I build the path, but what information do I need to gather so I can escalate, or even if I wanted to investigate it myself? So as Jason referred to, we created this layer-two troubleshooting path.
Now, the idea behind this, to borrow off of Jason’s analogy, is that anybody can jump into the ocean, but wouldn’t it be great if the first time you jumped into the ocean, you had an experienced dive master to tell you where to go and what to do to see all the latest and greatest cool things? Well, that’s what the Runbook does. The Runbook lets you take that experience and package it up for you.
So to do that, all I have to do, I’m going to click on the Runbook, gives me a quick description of what the runbook is going to do and the steps it’s going to take. This is just free form. You can put anything you want in there. And then all I have to do is click this Run button and the system is going to pre-load all of the questions or all of the pieces of information that I would want to gather. I hit the start button, it’s going to log in to each of these devices, gather that information, and it’s going to store it for me right here in the Runbook. So when I want to review that data, all I have to do is click on the results and it comes back up.
Next, let’s do some monitoring to see if we have some interface errors. All I have to do is click that button. And just that quick, a monitor is going to come up that’s going to show me that I have input errors between these two switches. Where that becomes important is that if we look…just let me stop this for just a moment, because what I want to point out is that if you remember from the beginning, we only showed one switch, but from the layer-two perspective, we actually have multiple switches involved with that.
The next piece that we’re going to be looking at is one of our QApps. Our QApps give us the ability to basically take any show command and turn it into a monitor. You saw our Instant QApp, but what I wanted to show you here is one of our procedural QApps. But what I just wanted to point out really quick was, if you remember back here on our original map, we only had that one switch, but in reality, we’ve got two.
So the next one we’re going to check is our speed duplex mismatch. And, again, this system is going to log in to each of these devices. And it’s not only going to look exactly where we saw those error messages, it’s actually going to check it across the entire map. And that’s one of the important things, because it’s easy to become so focused on what you’re fascinated by that, you know, you may need a dive master to tell you, “Hey, you’re out of time. You need to get up.” This is going to tell us not only about that particular pair, but any place in the network. And so very quickly, we now have created a map that shows us the entire path from point A to point B and these highlighted map notes to tell us exactly what the issue is. So that would be running the Runbook.
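The speed and duplex check across every link on the map can be pictured with a short Python sketch. This is only an illustration of the comparison, assuming the interface settings have already been collected. The device names, interface names, and the `find_mismatches` helper are hypothetical and do not represent how a QApp is written.

```python
def find_mismatches(links):
    """Return a description of every link whose two ends disagree on speed or duplex."""
    problems = []
    for a, b in links:
        for field in ("speed", "duplex"):
            if a[field] != b[field]:
                problems.append(
                    f"{a['device']} {a['interface']} <-> "
                    f"{b['device']} {b['interface']}: "
                    f"{field} {a[field]} vs {b[field]}"
                )
    return problems


# Hypothetical settings gathered from both ends of each link on the path.
links = [
    (
        {"device": "SW1", "interface": "Gi0/1", "speed": "1000", "duplex": "full"},
        {"device": "SW2", "interface": "Gi0/1", "speed": "1000", "duplex": "half"},
    ),
    (
        {"device": "SW2", "interface": "Gi0/2", "speed": "1000", "duplex": "full"},
        {"device": "R1", "interface": "Gi0/0", "speed": "1000", "duplex": "full"},
    ),
]

mismatches = find_mismatches(links)
for line in mismatches:
    print(line)
```

Checking every link rather than only the pair that showed input errors mirrors the point Ross makes about looking across the entire map.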
Now, I could save this map, forward this off to, maybe not an escalation engineer because we found the issue, but maybe there’s a change control team so they can review this and determine exactly how they want to apply that change.
So the next piece I actually want to show you is another Runbook based off of a map of a VoIP issue that we already have. So the idea behind this is I’m the VoIP engineer. I don’t want to spend my time gathering all of this data, I want someone else to be able to do that for me. I want to empower my first-level engineer to go out and gather the information I need so I can do my job. And so this is a pre-run Runbook, and all I have to do is come over to the individual steps and click on the results. So I want to annotate the config file and that means the system’s going to go in there and pull out the relevant VoIP information across my path. So first I’ve got that end-to-end network visibility and now I’m really starting to dig in. So I can see things like my class maps, with just the click of a button, for all of these devices and I can scan through and see exactly what I want to see.
The next piece I can bring out here is executing the commands. Now you saw us executing them multiple times. I’m not actually running them. The first-level engineer ran them for me and gave me the results. He can even annotate them. So as an example, “Hey, Dave, does this look right to you?” So he’s checking out this DSCP value. But all of that information is available to me within a click of a button and I don’t have to be live on the network. I could literally be sitting in a swimming pool on vacation somewhere helping them troubleshoot this map just because they sent me the map and all the data is there for me. Same thing with highlighting the queuing strategy, with the click of a button now…Of course, you would never choose this as a queuing strategy and I don’t like that color too close to white. Let’s make that one orange. So very quickly, you would never design a network that way, but within a couple clicks of a button, it’s very obvious that, you know, we don’t have a good grasp on what our queuing strategy is across this network path.
The next one I want to show you here is the monitoring results, so even your monitoring results can be stored in the individual maps. So here we have all of that information, so whether it’s incremental drops, remember incremental changes or incremental issues are oftentimes the most important. We have total class drops. So when we built this, you know, we said, “Hey, monitor the VoIP class and tell me if I have any VoIP drops.” So that’s important to know as well. And then I actually have all that information down here as well. So I’ve got history graphs, I’ve got the last point in time and then I have the beautiful graphical version that highlights and tells me very easily what I want to see.
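The difference between total drops and incremental drops can be shown with a few lines of Python. This sketch assumes a series of cumulative drop-counter readings for the VoIP class taken at regular polling intervals. The numbers and the `incremental_drops` helper are made up for illustration.

```python
def incremental_drops(samples):
    """Convert cumulative drop counters into per-interval increments."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


# Hypothetical cumulative drop counts for the VoIP class, one per polling interval.
voip_class_drops = [120, 120, 135, 135, 180]

deltas = incremental_drops(voip_class_drops)
print("total:", voip_class_drops[-1], "incremental:", deltas)
```

A large total only says that drops happened at some point, while a nonzero increment says drops are happening in the interval being watched, which is why the incremental view is often the more useful one.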
And then the final piece here, of course, is to document your findings. You can always add a map note to put in what you want to see, what that change should be. You know, this can then also get fed into, you know, a change control group where you can detail out what you want to be done as that escalation engineer. And then the last piece I want to go ahead and show you: I want to bring up an additional Runbook that we have available to us. And that is, maybe I want to get some additional information from, say, Cisco. We’re working on a pilot program with Cisco, as Jason had mentioned, for an integration between our network or our tool and their talents.
And so, there’s a couple of things that I can do as an example, the Cisco EOX. With the click of a button…well, let’s just take the first two devices. I can find out if any of this stuff is end of sale, end of life, end of support. We’re literally logging in – packaging all of our information – and then logging in with a secure API to Cisco. So this is literally what Cisco has to say about these devices right now. If you were to call Cisco, that’s exactly the same information that they would give you on those devices because we’re literally logging in and pulling data from them.
Now the next one I want to show you is the Cisco Diagnostics. Now on this one, I did cheat a little because I pre-loaded it. Because it’s gathering a full config file, plus a whole bunch of other information, and sending it off, this one takes about two minutes to run. I didn’t want to have to try to entertain you for two minutes because my sense of humor isn’t that good. But now you can see very quickly we have detailed information coming from Cisco.
Now, these are just the most important ones. Down here, you can see everything that Cisco had done, whether it’s a warning, an information message, a system message, you can see the raw data. And we also have the ability to pull this stuff back, or I should say not the ability. What we do is we pull this all back, and we also drop it into a table for you. So if you want to have access to this data outside of our tool, that’s all available for you automatically in the form of an Excel spreadsheet. And with that, that wraps up the second part of my demo. And I’m going to turn it back to Jason.
Jason: All right. Thank you so much, Ross. I’ll just take the controls back and make sure I can share my screen for everybody to see again. So the takeaway, again, and we’ll have a poll for you. We’ll launch that poll for you, [webinar cut out] institutionalize or what we call digitize all that knowledge. And that one more thing, at the end there, that Ross showed: you’re not even limited to the knowledge within your own organization. Through our pilot program, we can actually leverage the expertise of Cisco’s tech. What Ross ran in the last step was basically a diagnosis, an analysis across hundreds of diagnoses in, sort of, Cisco’s tech database, and it showed the results there on the map. So extremely powerful for end-to-end visibility.
And so the question that you’re all seeing up on the poll here is, how does your team currently document what we’re calling tribal knowledge?
And so, you know, what’s interesting is the most common response here that I’m seeing, over 50%, are saying when there’s a problem, we call the tribal leader. So this isn’t just a concept that we’re talking about. Thank you all for sharing that. This is a very real challenge and that was my experience as well. So I have 35% of you saying network design is documented, but not our processes, and certainly B and D are not mutually exclusive. And only a small fraction, maybe those of you that work on smaller networks, 1% of you, are saying the whole team knows the network intimately, but really with the complexity we talked about, that’s really no longer the case. And it is interesting that 13% of you are already familiar with this concept of a playbook for documenting the knowledge that we’re trying to institutionalize. So I think those are really interesting results and thanks a lot for sharing those with me.
We’ll close this poll and I’ll wrap up. So we’re getting to the end of the hour, I realized. I want to thank everybody for sticking with us.
So, the last slide, I want to talk about…actually I have two more. The second to last slide I want to talk about is the concept that we’re striving for. The map becomes the single pane of glass. So with NetBrain, everything that you need to know about your network, we hope to be able to grab that from the map. And that means leveraging external systems. We’re not saying that we’re the only solution you need and certainly that’s not the case. Whether it’s an event system, a ticketing system, a traffic analyzer, there’s a lot of value in being able to visualize all that information. Whether it’s triggering an alert from an event system and creating a map automatically, or leveraging our Runbook automation and embedding it into the ticketing system for you, there’s a lot of value in integrating with your other systems. And the vision that we’re executing here and we’re continuing to expand upon is our vendor APIs.
And so, you know, with that, I guess, certainly, as we saw in some of the polling questions earlier, there are a number of customers on the call here that are using NetBrain and even using NetBrain for this end-to-end visibility. There are over a thousand more large enterprises with global networks that are using NetBrain. I don’t necessarily like to leave this slide up because it is kind of a marketing slide, a little bit of who our customers are and how they’re using it. But, you know, we’re not a small company. We’re not a startup. We’ve been around for a long time and we’ve really had our eye on this challenge for a long time and we’re really excited about the solution.
So like I mentioned, it’s about the end of the hour. We’re going to be on the lines here. Martin’s still over there typing away, answering questions. And what we want to see is any more questions you have, we’re going to probably pause the, you know, mute the line for a few minutes. At this time, I want to say thank you so much for joining us. The four of us will be here to continue answering your questions, but that is the end of our formal webinar and thank you so much for joining today’s webcast on “End-to-End Network Visibility.” Have a good day, everybody. And as a reminder, we will be here.
One thing I’m glad came up, because I didn’t address it yet: for those of you that are customers, some of you are asking what version of NetBrain we’re running in our live demo here and it’s a good question. We’re actually running 6.2, which is currently in pre-release, coming early next month for beta customers and for mass release in November. So 6.2 brings with it a couple of really, really powerful things. It could’ve easily been a 7.0 release, and the Runbook automation is really one of the most exciting new features in 6.2. And Ross showed a little bit of what was called Instant QApp, where you were dragging and dropping any data fields, you know, CLI data, to the map. That’s also part of 6.2. So this is a big release for us.
So another important question I want to make sure to address that came up: in most of our demo lab, people are observing, we’re leveraging Cisco devices. And the question is, are we a multi-vendor solution? And the answer is an emphatic yes. We do tend to use Cisco in our demo environment. It’s a familiar technology to a lot of people that see it, but we have a multi-vendor solution. So any routers, switches, firewalls, wireless access points, and things like that, we support most major vendors.
While we’re still on the line, I just want to make sure I thank, again, Ross Merkle for running a demo for us. Todd Bristol, it was really excellent to have you on the call with us today sharing your story. And Martin as well, thank you for being on the line to answer the questions. Again, we’re still sticking around. I still see that there’s a couple hundred people on the line so we’re not going away until all of the questions are answered. Thanks, again.
For those of you who are still on the phone, yes, we are still here. And the last reminder is that we are going to be sharing the recording with everybody. The webinar has been recording the whole time and we’ll share the recording and the slide deck through email next week.