
“Human Error” is a Big Bucket

by Mark Harris Nov 18, 2021

We all use the term “Human Error” as a catch-all as it relates to the cause of outages and service availability issues. We typically think of an operator hitting the wrong key on a keyboard, which we all know could be disastrous, but let’s put this in a bigger context before those keyboard images start dancing in our heads.

Human error comes in many forms and can start much earlier than the moment an incident begins. (Pause and think about this for a moment.) Everyone involved in the delivery of information technology has a different background, perspective, and set of knowledge and experiences. For any given task, individual approaches can vary widely. This creates the opportunity for human error in a litany of places, some obvious and others less frequently discussed. This range of human error sources needs to be carefully considered when designing a strategic operations plan.

Let’s examine some of the specific issues that increase the risk of outages that can be exposed through human error:

  • Hardware or Software Suitability over time – This is an often overlooked source of problems because the selection of hardware and software is done at a single point in time, but as any infrastructure grows, morphs and changes, those choices may no longer be valid. In most cases, a significant portion of an infrastructure is rarely (if ever) re-evaluated to confirm suitability for the changing job at hand. The human error is failing to re-evaluate each component every year or two to confirm that it still handles its function the way the business needs it to.
  • Staff Availability or Skill Sets – Infrastructures are complicated, and getting more so. Keeping the right number of people in the right locations, with the kinds of skills they must possess, can be daunting. With all of the pressure on the IT organization to innovate and bring new capacity online, there is often a lag in building the support organization required to operate and support it. The human error is building new infrastructures without a definitive, complete support plan that is updated every time a change is made.
  • Equipment Configurations – There are any number of ways to build physical and logical infrastructure. That infrastructure includes two domains: the facility itself, which provides power and cooling, and the active components, which provide computing and connectivity. In the era of software-defined everything, many equipment configurations may be sub-optimal yet work perfectly fine under limited conditions, only causing issues under stress or higher loads. The human error can be found when designers overlook the need to establish the context and ‘normal’ operating conditions for every component, and how those components behave over the range of operating conditions.
  • Software Licensing – Software licensing causes problems more often than you might think. The most apparent case is when a license expires, but there may be capacity or usage restrictions as well. Most of us have experienced SSL certificates or domains that expire, which causes all kinds of cascading effects. The human error stems from a lack of discipline and process to establish the business parameters for all licensed usage and to verify the license terms and scope on a regular basis.
  • Security & Access – So many times production infrastructures will be up and running, only to experience service degradations or complete failures due to security intrusions or responses to them. While security issues impact the performance of service delivery, they are addressable. The human error is relinquishing the responsibility for service delivery because of certain 3rd party events. IT professionals must ALWAYS own service delivery, and must have comprehensive support and contingency plans for these kinds of scenarios.
  • Configuration of Equipment Parameters – If you were to survey 100 people about human error, more than two-thirds would immediately identify configuration errors as the cause of most outages. It is easy to imagine, and oftentimes the case, that an operator enters a single digit incorrectly and causes catastrophic results. The human error is two-fold here: 1) the operator mistyped or misunderstood a parameter that was part of the information critical path, and 2) the operator was tasked to solve mission-critical problems manually, rather than leveraging known working procedures which had been tested, QA’d and proven to yield the desired results.
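
As one illustration of the last two configuration points, here is a minimal Python sketch of checking a proposed parameter change against a documented ‘normal’ operating range before it is applied. The parameter names, the limits and the validate_change helper are all hypothetical, chosen only to show the idea; this is not a description of any particular product or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParameterRange:
    """Documented 'normal' operating range for one setting (hypothetical)."""
    minimum: float
    maximum: float


# Hypothetical parameter names and limits, for illustration only.
NORMAL_RANGES = {
    "fan_speed_percent": ParameterRange(20, 100),
    "link_mtu_bytes": ParameterRange(1280, 9000),
}


def validate_change(name: str, proposed: float) -> list[str]:
    """Return a list of problems; an empty list means the change may proceed."""
    if name not in NORMAL_RANGES:
        # No documented range is itself a finding: nobody established 'normal'.
        return [f"{name}: no documented operating range, so the change cannot be checked"]
    allowed = NORMAL_RANGES[name]
    if not allowed.minimum <= proposed <= allowed.maximum:
        return [f"{name}: {proposed} is outside {allowed.minimum}..{allowed.maximum}"]
    return []
```

With these assumed ranges, validate_change("link_mtu_bytes", 15000) reports a problem, which is the kind of single extra digit the bullet above describes, while a parameter with no documented range is flagged rather than silently accepted.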

So you can see that there are many causes of outages and service degradations that are not commonly discussed in the context of human error, but in fact belong there. The human error itself may have happened months or years before an outage occurs, and only when looking over a longer period of time does that detail surface.

What can you do? Start by building your own list like the one above. Realize that each party in the mix has a lifecycle and that all aspects of that lifecycle need to be supportable and defendable. Any weak link will increase the risk to production. Most importantly, work through each item on the list individually to optimize it, reformulate a support plan, create contingency plans, add operational processes, etc. Work with the facilities and enterprise application designers to better understand their expected range of normal operating conditions. Talk with business line owners about needed capacity over time and map it to equipment refresh cycles: look at today’s workloads and then project the expected loads in 2 or 3 years. And lastly, invest in the management tools that continuously and proactively confirm the performance of the infrastructure and which help make any required operations consistent and repeatable regardless of the staff members involved over time.
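
The workload projection mentioned above can be roughed out in a few lines. The following Python sketch assumes simple compound annual growth, which is only one possible model; the function names and any numbers fed into it are hypothetical and would come from your own conversations with business owners.

```python
from typing import Optional


def projected_load(current_load: float, annual_growth: float, years: int) -> float:
    """Compound-growth estimate: load after `years` at `annual_growth` (0.25 = 25%)."""
    return current_load * (1 + annual_growth) ** years


def first_year_over_capacity(
    current_load: float,
    annual_growth: float,
    capacity: float,
    horizon_years: int = 10,
) -> Optional[int]:
    """Return the first whole year in which projected load exceeds capacity.

    Returns None if capacity is not exceeded within the horizon.
    """
    for year in range(horizon_years + 1):
        if projected_load(current_load, annual_growth, year) > capacity:
            return year
    return None
```

Comparing the year returned by first_year_over_capacity with the planned equipment refresh date gives a quick, defendable answer to whether a refresh cycle arrives before or after the installed capacity runs out.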

Above all, challenge your teams to defend their work, to defend their plans, to defend their contingency plans… at scale. Remember, infrastructures are getting bigger, so defendable problem solving is a matter of scale. By looking at the ‘Bigger Bucket’, IT operations will become a strategic partner to the business, rather than a tactical provider.
