by Phillip Gervasi Sep 20, 2017
I clicked “Answer” on my softphone and took the call. Part of me wanted to ignore it, but this issue had been annoying me for too long.
It was one of the guys from the Applications team. Even though he sat only a few hundred feet away on the other side of the building, it felt like we were a million miles apart most of the time.
The project manager on this new implementation was hovering now, too. I liked her, but she was getting nervous. The customer, and everyone else for that matter, wanted this working yesterday.
“Could it be an issue with the VLANs?” the voice over the phone asked.
It was a nonsensical question, but I wasn’t going to say that. Frankly, I was surprised he didn’t ask if it was a problem with the firewall, since that’s usually where these things seem to start. I assured him that there was no blocking at all within or between any of the VLANs involved in this issue. I really didn’t believe this was a network issue, but that’s usually the first thing to get blamed.
We reached another impasse, so we scheduled a meeting for after lunch with several other team leads.
I left the office and grabbed some Burger King, a guilty pleasure I find too easy to justify on stressful days. The call was for 1pm, which annoyed me since I had to rush back, but at least when I sat back at my desk I had some fries left to munch on and a tall Coke to sip.
Almost immediately after I joined the conference call, the security manager began questioning why we had no segregation among certain VLANs. Of course this didn’t pertain to the issue whatsoever, but I was obliged to explain our network and why, in this case, it was a good thing that there was nothing in between the app server and the backend databases. It was all layer 2, and frankly I was annoyed that our security person still didn’t know our network.
The project manager interjected and politely asked how I troubleshot connectivity in order to come to my conclusions. She used terms like “level-set” and “bubble-up” which I heard but summarily dismissed as gibberish.
How was I supposed to answer? She was smart, I knew that, but she didn’t know anything about networking. I started to talk about tracing paths and that sort of thing, which immediately resulted in someone bringing up firewalls in the path. I didn’t know who brought that up at first; I think it was one of the guys on the Applications team.
I rolled my eyes, but at least it got someone on the storage or Applications team, I’m not sure which, talking about timeouts.
Bingo!
I jumped in excitedly to confirm what he said and explained that if the servers were talking but timing out, connectivity itself was fine. I assured them that there were no configurations that would cause timeouts, especially since it was all layer 2 between the devices. This, for some reason, the entire group accepted. We were finally starting to make some progress.
This issue was on the application side of the house, but it took several meetings to explore possible network issues with people who had only a cursory knowledge of networking. There was certainly a desire to collaborate, and we had a good laugh after the problem was resolved, but the process of getting there was painful.
First, I didn’t know anything about the application or the backend storage in question. How did they actually communicate with each other? What ports were being used? Was this really a complete outage or was it intermittent?
Second, most of the other team leads knew very little about networking but assumed right away that the network was the problem. That meant I had to first make sure it wasn’t and then prove it to a group of folks who knew little of broadcast domains, TCP/IP, access control lists, and stateful firewalls (the sort of proof sketched just after this list).
Third, the person who led the charge wasn’t a full stack engineer; in other words, the project manager didn’t have a reasonable understanding of all the technical areas related to the incident. That meant we didn’t move forward in a clear direction; the effort was certainly collaborative, but it also felt disjointed, like we were shooting in the dark.
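For anyone wondering what that kind of proof looks like in practice, it often boils down to something as simple as a TCP connect test. Here’s a minimal sketch in Python; the hostnames and ports are placeholders for illustration, not the actual servers from this incident:

```python
#!/usr/bin/env python3
"""Minimal reachability check: can we complete a TCP handshake to the
application's backend ports, and how long does it take? The hostnames
and ports below are placeholders, not the servers from this story."""

import socket
import time

# Hypothetical app server and database targets; substitute real values.
TARGETS = [
    ("app-server.example.local", 8080),
    ("db-backend.example.local", 1433),
]

for host, port in TARGETS:
    start = time.monotonic()
    try:
        # A completed three-way handshake means nothing in the path
        # (ACL, firewall, VLAN misconfiguration) is blocking this port.
        with socket.create_connection((host, port), timeout=5):
            elapsed = time.monotonic() - start
            print(f"{host}:{port} reachable, handshake took {elapsed:.3f}s")
    except OSError as err:
        print(f"{host}:{port} NOT reachable: {err}")
```

If the three-way handshake completes, nothing in the path (no ACL, no stateful firewall) is dropping that port; if the application still times out after that, the problem lives above the network.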
I don’t think my experience is unique. Many of us technology professionals have experienced the blame game in an attempt at collaboration. Tribal knowledge among silos is still very real in enterprise IT, and though there may be the occasional full stack engineer and project manager with deep technical knowledge, by and large we’re still dealing with walled off teams and minimal knowledge sharing.
The issues I faced in the application troubleshooting I described illustrate this very thing. Individually, each team lead knew their area extremely well, but we knew very little of each other’s areas. For example, I didn’t know how the application talked to its backend servers, and, until this troubleshooting session, I didn’t know which servers talked to each other at all.
Also, the application people didn’t know that these particular servers had no firewalls in between them. I have to wonder how long they discussed that one theory alone before creating the incident in our ticketing system.
And on top of that, the project manager handling the incident had no unified view into the technical aspects of the incident other than poorly worded emails and ambiguous notes in the ticket.
When teams collaborate on a specific problem or design, sharing information is critical. But often that information is either difficult to share or difficult to make sense of at a high level. Log files, data dumps, email, and ticket notes all have their place, but we need a better solution to bridge the gap among teams trying to collaborate in the dark.
Automating the most common troubleshooting steps we like to start with is one way to get a baseline of information that is easily shared. In fact, a sophisticated automation tool should not only automate those common troubleshooting tasks but also run a variety of platform-related show commands to capture a baseline right at the moment the issue is occurring.
Multiple teams would then be able to access the same information at the same time and analyze it, both in technical depth and in graphical form, thereby democratizing knowledge that would otherwise be in the hands of individual teams, or worse, individual engineers.
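As a rough illustration, here’s a minimal sketch of such a collector, assuming the open-source Netmiko library; the device names, credentials, and command list are placeholders, not anything from my environment:

```python
#!/usr/bin/env python3
"""Sketch of a baseline collector: SSH to each device, run a fixed set
of show commands, and write the output to one timestamped file that
any team can read. Device details and the command list are
placeholders; this assumes the open-source Netmiko library."""

from datetime import datetime
from netmiko import ConnectHandler

# Hypothetical devices involved in the incident.
DEVICES = [
    {"device_type": "cisco_ios", "host": "core-sw1.example.local",
     "username": "netops", "password": "REPLACE_ME"},
    {"device_type": "cisco_ios", "host": "core-sw2.example.local",
     "username": "netops", "password": "REPLACE_ME"},
]

# The "common first steps" worth capturing the moment an issue is raised.
COMMANDS = [
    "show interfaces status",
    "show mac address-table",
    "show spanning-tree summary",
    "show logging",
]

stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
outfile = f"baseline-{stamp}.txt"

with open(outfile, "w") as report:
    for device in DEVICES:
        conn = ConnectHandler(**device)
        report.write(f"===== {device['host']} =====\n")
        for cmd in COMMANDS:
            report.write(f"\n--- {cmd} ---\n")
            report.write(conn.send_command(cmd) + "\n")
        conn.disconnect()

print(f"Baseline written to {outfile}")
```

Even something this crude means the application lead and the security manager can look at exactly the same output I am, captured while the problem is actually happening, instead of getting it secondhand over a conference bridge.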
I’m no expert in storage, and I’m no application development guru. My guess is that in most enterprise organizations, engineers still specialize in one or only a few areas just like I do. If we’re going to make quick progress to identify the cause of slowness in an application or why a path in our infrastructure is completely broken, we’re going to need to work together.
The way forward is a culture of collaboration that rejects the blame game and uses tools that democratize knowledge and make it easily available across teams, not hoarding it on our local drives, and certainly not blaming the network first.