Did you know that 22% of data center outages (on-premises and cloud) occur due to human error? Outages at the data center, and arguably your remote offices and campuses, have a significant impact on your organization. Beyond the downtime, inconvenience, loss of productivity, and impact on future revenue, you might be responsible for fines, penalties, compensation, or refunds (or depending on your industry – all of the above). In 2016, the average partial unplanned outage took 64 minutes to restore, and a total outage took 130 minutes to restore. In the study, Ponemon Institute found the average cost was $9,000 per minute for an outage!1 By far the greatest costs were business disruption (including reputation damage and customer churn), lost revenue, and end user productivity.
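To put the Ponemon figures above in perspective, a quick back-of-the-envelope calculation shows what those restore times translate to in dollars. This is purely illustrative arithmetic using the cited averages:

```python
# Back-of-the-envelope outage cost, using the Ponemon 2016 averages cited above.
COST_PER_MINUTE = 9_000  # average cost of an outage, USD per minute

def outage_cost(minutes_to_restore: int) -> int:
    """Estimated total cost of an outage lasting the given number of minutes."""
    return minutes_to_restore * COST_PER_MINUTE

print(f"Partial outage (64 min):  ${outage_cost(64):,}")   # $576,000
print(f"Total outage  (130 min): ${outage_cost(130):,}")   # $1,170,000
```

Even the "average" partial outage runs well over half a million dollars before fines, penalties, or refunds enter the picture.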
The average partial unplanned outage takes 64 minutes to restore — at a cost of $9,000 per minute.
Now that we’re caught up on the severity of an outage — one caused nearly a quarter of the time by human error — what can we do to prevent it? I’d like to share a moment in which NetBrain helped me justify saying, “I wouldn’t do that if I were you.”
I had a team working on a large-scale WAN that involved multiple MPLS circuits across three different carriers (depending on location in the USA) plus direct internet access, be it local broadband or 4G LTE. A conversion was under way to reduce the strain on engineering personnel by simplifying the environment, but it required temporary routing policies to prevent loops as we slowly shut down tertiary circuits. Due to the intermixing of sites, we couldn’t let routes from MPLS carrier A leak into MPLS carrier B, and the DMVPN over the internet had to be controlled as the primary path at some locations and the backup at others.
This all sounds simple if you’re a routing guru, but not everyone on the team was. The complexity was intense after years of rapid expansion, Band-Aids, and repeated “I’ll get to it.” Each site’s configuration reflected whatever the “standard” happened to be in the month and year it was rolled out or acquired. As you can expect, the sites didn’t all follow the same standards — local usernames, SNMP, and NTP were all over the place — so you can bet that routing was equally atrocious.
A configuration was presented with the claim that it could be pushed to every location and, voilà, all the world’s problems would be solved. Unfortunately, the configuration wouldn’t do that. After discussing it with the team, there was still disagreement about its effectiveness, as they swore all bases were covered, I’s dotted and T’s crossed.
Enter NetBrain. Using NetBrain, we ran an updated discovery of the topology and saved it as “current state.” Next, we turned on all the toggles for routing neighbors and zoomed in on the map to see the IP addresses next to the links between locations. Everything looked great, with eBGP and iBGP neighbors shown graphically in color-coded boxes alongside EIGRP for internal site communication. Now that things were visually stunning, we started modeling the change. We edited a handful of sites’ configurations with the proposed changes and saved the result as “proposed state.”
Use NetBrain to benchmark device configuration before and after a major IT initiative, such as a network refresh or data center migration, to quickly assess changes, and rollback, if necessary.
Now, with NetBrain — if you didn’t know — you can look at the difference between two instances of a single map, whether those instances came from manual changes or from automatic scans run at specific times. In our case, we wanted to compare proposed versus current. Once we picked our two states and ran a compare, one by one we saw links change to a grey line with a red ‘x’ over it (indicating a link that would no longer be there). Each grey line and red ‘x’ combo represented an outage that would have taken a link down — and in some instances the entire site — ultimately bricking the box until someone could get on-site (unless a ‘reload in x minutes’ had been scheduled). One of the sites we modeled was the primary data center, which would have lost 2 of 3 MPLS connections, rendering those WANs inaccessible to the data center. It was at this time I reiterated, “I wouldn’t do that if I were you.”
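NetBrain’s compare is a point-and-click feature, but the underlying idea — diffing the link sets of two benchmarked topology states — can be sketched in a few lines of Python. Everything below (the device names, the tuple-based data model) is hypothetical and only illustrates the concept, not NetBrain’s actual implementation:

```python
# Conceptual sketch: diff the links of two benchmarked topology states.
# The data model and site names are hypothetical, for illustration only.

def normalize(link):
    """Treat a link between two devices as undirected."""
    return tuple(sorted(link))

def diff_links(current, proposed):
    """Return the links that would disappear or appear if 'proposed' went live."""
    cur = {normalize(l) for l in current}
    new = {normalize(l) for l in proposed}
    return {
        "removed": cur - new,  # links that would vanish (the grey line / red 'x')
        "added": new - cur,    # links the change would create
    }

# Example: the primary data center losing 2 of its 3 MPLS connections.
current_state = [("DC1", "MPLS-A"), ("DC1", "MPLS-B"), ("DC1", "MPLS-C")]
proposed_state = [("DC1", "MPLS-A")]

result = diff_links(current_state, proposed_state)
print(result["removed"])  # the two MPLS links the data center would have lost
```

The value of the tool is that this comparison happens visually, across the whole map, without anyone having to trace the routing tables by hand.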
What every engineer needs is a way to test, validate, and prove changes before they go live.
The point of this story is that it doesn’t matter whether you’re an expert-certified individual, have 20+ years of real-world experience, or are the new kid on the block still learning what CIDR and VLANs are. What every engineer needs is a way to test, validate, and prove changes before they go live. Look at any development team and you will find development, QA, testing, staging, or some other arrangement of multiple environments before things go live. Not every network engineer or architect works at a place with the resources to build a lab and replicate the environment to do the same. However, you can leverage a tool like NetBrain — which is already providing you value in troubleshooting, Dynamic Maps, and more — to model your changes and see the impacts before they occur. Next time you’re looking at a change, ask yourself: do you want to be part of the 22% or the 78%?
1 Ponemon Institute Research Report, Cost of Data Center Outages (January 2016)
Check out the 90-second overview of NetBrain change management capabilities