Using Network Automation to Solve the Multicast Puzzle

May 31, 2018

As engineers, we love networking puzzles. We get a thrill from figuring out why something is broken and then fixing it. But when services are offline, a leisurely troubleshooting session isn’t an option.

Multicast is one of those technologies that isn’t terribly difficult to understand (once it clicks) but can be tedious to configure and cumbersome to troubleshoot on a large network. Every device in a multicast tree must be configured, and in many environments, that’s a lot of devices. And when something isn’t working right, solving the multicast puzzle has traditionally meant visiting every single device one at a time.

Multicast routing is the process of sending network traffic to multiple destinations at once. Unlike broadcast, which sends traffic to every node in a broadcast domain, multicast relies on multicast groups, group members, rendezvous points, and functioning underlying unicast routing to build the shortest-path tree (SPT) and deliver one stream to multiple listeners at the same time.

Each member of a multicast group must be configured appropriately so that it generates an IGMP message requesting membership in the group. Once it has joined, the new member actively receives the group’s multicast streams rather than discarding the traffic.
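On an ordinary host, the membership request described above is usually triggered through the sockets API rather than written by hand. The following is a minimal sketch, assuming an IPv4 receiver; the function names, the group address 239.1.1.1, and the port 5004 are illustrative assumptions, not values from this article.

```python
import ipaddress
import socket
import struct


def build_membership_request(group: str, local_ip: str = "0.0.0.0") -> bytes:
    """Pack the 8-byte ip_mreq structure used by IP_ADD_MEMBERSHIP (IPv4 only)."""
    if not ipaddress.IPv4Address(group).is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    # ip_mreq = 4-byte group address followed by 4-byte local interface address.
    # 0.0.0.0 lets the operating system choose the interface.
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(local_ip))


def join_group(group: str, port: int, local_ip: str = "0.0.0.0") -> socket.socket:
    """Open a UDP socket and ask the kernel to join the multicast group.

    The kernel, not this code, sends the IGMP membership report.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(
        socket.IPPROTO_IP,
        socket.IP_ADD_MEMBERSHIP,
        build_membership_request(group, local_ip),
    )
    return sock
```

A receiver would call something like `join_group("239.1.1.1", 5004)` and then read datagrams with `recvfrom`. Whether the join succeeds depends on the host having a multicast-capable interface.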

Finding the RPF Mismatch Needle in a Haystack

Among the most common issues is Reverse Path Forwarding (RPF) failure. Every time a multicast packet is received by a router running PIM (Protocol Independent Multicast), the router must determine how it would send traffic back to the source IP. This is normally a simple unicast route lookup on the source IP address. The expectation is that the interface on which the multicast traffic arrived and the outgoing interface toward the source are the same, which implies a loop-free path.
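The lookup-and-compare logic can be sketched in a few lines. This is a simplified model, assuming a plain longest-prefix match over a small routing table; the `ROUTES` table, interface names, and function names are hypothetical, and a real router also considers static mroutes and other RPF sources.

```python
import ipaddress
from typing import Dict, Optional

# Hypothetical unicast routing table: prefix -> outgoing interface.
ROUTES: Dict[str, str] = {
    "10.1.1.0/24": "Gi0/1",
    "10.0.0.0/8": "Gi0/2",
    "0.0.0.0/0": "Gi0/0",
}


def rpf_interface(source: str, routes: Dict[str, str]) -> Optional[str]:
    """Return the interface of the longest-prefix route covering the source."""
    addr = ipaddress.ip_address(source)
    best_len = -1
    best_iface = None
    for prefix, iface in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_len = net.prefixlen
            best_iface = iface
    return best_iface


def rpf_check(source: str, incoming_iface: str, routes: Dict[str, str]) -> bool:
    """Pass only if the packet arrived on the interface used to reach its source."""
    return rpf_interface(source, routes) == incoming_iface
```

With this table, a packet from 10.1.1.10 passes the check when it arrives on Gi0/1 and fails when it arrives on Gi0/2. A source with no covering route fails on every interface.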

In the image below, notice the RPF problem identified on R-22. This comparison, known as the RPF check, fails when the incoming interface and the outgoing interface are not the same, which suggests that there may be a loop in the network. When this happens, the router drops the multicast traffic that failed the check.

Figure: RPF failure. Fixing an RPF mismatch is easy, but finding it can be time-consuming. That’s where automation comes in.

The problem network engineers face with multicast routing is that many network topologies use multiple layers of reachability: simple GRE tunnels, complex DMVPN topologies, or the various scenarios in which asymmetric routing occurs. In these designs, the incoming and outgoing interfaces will likely not be the same.

Troubleshooting an RPF failure is a matter of troubleshooting multicast configuration and simple interface configurations, but this presupposes that the underlying control plane and data plane are functioning properly. Oftentimes troubleshooting multicast routing turns into troubleshooting the underlying routing topology.

This is where network automation takes what is a multi-layered, multi-faceted problem and turns it into something much simpler to deal with. It still requires an understanding of multicast networking, of course, but automating these configuration checks and troubleshooting steps greatly reduces the time it takes to find and resolve a problem.

Automating Multicast Troubleshooting with NetBrain Runbooks

That’s what NetBrain is doing with network automation. It’s not about changing networking itself; it’s about increasing efficiency and reducing errors so that repetitive configuration tasks and issue resolution take less time.

Below is a relatively simple topology from NetBrain’s Online Lab (which you can check out here) that shows a typical multicast tree, including a multicast sender, receivers, and rendezvous points.

Figure: Multicast online lab. NetBrain’s Dynamic Maps highlight the multicast sender, receivers, and rendezvous points.

Even in a somewhat small network such as this one, troubleshooting multicast routing is still a matter of logging into each device one at a time and executing a variety of commands. NetBrain’s built-in multicast Runbooks check for typical errors across the entire network programmatically with just a few mouse clicks. Notice in the screenshot below how NetBrain automates multicast troubleshooting by programmatically checking multicast configurations and reachability.

Figure: Multicast source tree health check. NetBrain’s built-in multicast Runbooks check for typical errors across the entire network programmatically.
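To illustrate what a network-wide check of this kind involves, here is a minimal sketch that sweeps already-collected per-device state for RPF mismatches. It is not NetBrain’s implementation. The `MrouteState` record, the sample interface values, and the devices R-21 and R-23 are assumptions made for illustration, and how the data is gathered from each device is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MrouteState:
    device: str
    source: str
    group: str
    incoming_iface: str  # interface the (S,G) traffic actually arrives on
    rpf_iface: str       # interface the unicast table uses toward the source; "" if no route


def find_rpf_mismatches(states: List[MrouteState]) -> List[str]:
    """Return one human-readable finding per device that would fail the RPF check."""
    findings = []
    for s in states:
        if not s.rpf_iface:
            findings.append(f"{s.device}: no unicast route to source {s.source}")
        elif s.rpf_iface != s.incoming_iface:
            findings.append(
                f"{s.device}: ({s.source}, {s.group}) arrives on {s.incoming_iface} "
                f"but the route to the source uses {s.rpf_iface}"
            )
    return findings


# Hypothetical sample data; the interface values are invented for this sketch.
SAMPLE_STATES = [
    MrouteState("R-21", "10.1.1.10", "239.1.1.1", "Gi0/1", "Gi0/1"),
    MrouteState("R-22", "10.1.1.10", "239.1.1.1", "Gi0/1", "Tunnel0"),
    MrouteState("R-23", "10.1.1.10", "239.1.1.1", "Gi0/2", ""),
]
```

Calling `find_rpf_mismatches(SAMPLE_STATES)` reports R-22 and R-23 and stays silent about R-21. A single pass over every device replaces the per-router comparison an engineer would otherwise do by hand.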

But because there are many moving parts in a multicast routing topology, NetBrain provides additional Runbooks for analyzing various components of the multicast domain. Notice in the image below that we can simply run the Multicast Shared Tree Health Check against the entire network and programmatically check for reachability to rendezvous points and to the multicast group address.

Figure: Multicast shared tree health check. This Runbook automatically checks the multicast shared tree configuration, including verifying the reachability of all RPs and destination multicast addresses.
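One ingredient of an RP reachability check can be sketched as a route-coverage test: does every router hold at least one route that covers each rendezvous point address? This is a rough sketch under stated assumptions, not the Runbook’s actual logic. The function name and all addresses are hypothetical, and a covering route is a necessary condition for reachability, not proof of it.

```python
import ipaddress
from typing import Dict, List


def missing_rp_routes(
    rps: List[str], routes_by_device: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Return {device: [RP addresses with no covering route]}.

    Devices that have a route for every RP are omitted from the result.
    """
    missing: Dict[str, List[str]] = {}
    for device, prefixes in routes_by_device.items():
        networks = [ipaddress.ip_network(p) for p in prefixes]
        unreachable = [
            rp for rp in rps
            if not any(ipaddress.ip_address(rp) in net for net in networks)
        ]
        if unreachable:
            missing[device] = unreachable
    return missing
```

For example, with RPs 10.255.0.1 and 10.255.0.2, a router that only carries 10.255.0.1/32 would be reported as missing a route to 10.255.0.2, while a router carrying 10.255.0.0/24 would not appear in the result.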

However, because multicast routing problems may result from an underlying reachability problem, we can use NetBrain’s Dynamic Network Maps, which can execute specific CLI commands and present a custom view of the relevant configuration snippets.

As you can see in the screenshot below, it’s very simple to add any custom commands needed to look at several parts of the multicast puzzle and get clean output from all devices at once, in one place. This output can also be exported, saved, shared, and added to an IT department’s knowledge base.

Figure: Multicast CLI output. Simply drag and drop any CLI command onto a Dynamic Map to visualize the output from all devices at once in context.
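The general idea of running one command everywhere and collecting the results in one place can be sketched as a simple fan-out. This is a generic illustration, not how NetBrain does it. The `run_command` callable is a hypothetical stand-in for whatever transport is available (SSH, an API, and so on), and the device names are invented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def collect_outputs(
    devices: List[str],
    command: str,
    run_command: Callable[[str, str], str],
    max_workers: int = 8,
) -> Dict[str, str]:
    """Run one command on every device in parallel and return {device: output}."""

    def safe_run(device: str) -> str:
        # One unreachable device should not abort the whole sweep.
        try:
            return run_command(device, command)
        except Exception as exc:
            return f"ERROR: {exc}"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input list.
        outputs = list(pool.map(safe_run, devices))
    return dict(zip(devices, outputs))
```

A caller might invoke `collect_outputs(["R1", "R2"], "show ip mroute", my_ssh_runner)`, where `my_ssh_runner` is supplied by the caller. Failures are recorded per device as `ERROR: ...` strings so that the combined output still covers the whole network.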

Troubleshooting an RPF failure this way is dramatically more efficient than logging into each of these routers individually. As much as I love the feeling of solving a networking puzzle, a production environment isn’t the place for a lengthy troubleshooting session. NetBrain automates the cumbersome elements of this process so that an engineer can focus on the task at hand. The goal is solving the multicast puzzle and restoring services, not wasting time logging into devices one at a time.

Click here to test-drive NetBrain in the Multicast Online Lab (or any of the other 15 curated technology environments).
