“The network is down!” — I’m sure you heard that before.
Despite your best efforts as a network engineer, network failures happen, and you have to fix them. Hopefully, you’ve implemented a network observability platform in advance, so you should be collecting a wealth of information about your network, making troubleshooting easier.
But what happens when it’s time to activate troubleshooting mode?
In this post, I’m going to talk about the steps to troubleshoot your network. And then I’ll provide some best practices, as well as provide examples of troubleshooting with Kentik’s Network Observability solutions.
First off, what exactly is network troubleshooting?
It’s the process of solving problems that are occurring on your network. A simple definition for what can often be a hard task!
When users complain, whether they’re internal or external to your organization — or ideally, before they do — you need to figure out what the cause of their problem is. The goal is to troubleshoot and fix whatever issue underlies the problems.
Troubleshooting requires taking a methodical approach to resolving the issue as quickly as possible. Unfortunately for you, the user doesn’t care what your service-level objective for fixing the problem is. In today’s “gotta have it fast” culture, more often than not, you need to fix it now — or revenue is affected.
Let’s get into some ways you can troubleshoot your network and reduce your mean time to repair (MTTR).
When you’re troubleshooting network issues, complexity and interdependency make it complex to track down the problem. You could be solving many different issues across several different networks and planes (underlay and overlay) in a complex, hybrid network infrastructure.
The first thing you want to do is identify the problem you’re dealing with. Here are some typical problems:
Fixing these kinds of troubleshooting problems needs more than identification. To paraphrase French biologist Louis Pasteur — where observation is concerned, chance favors only the prepared mind.
No network engineer can troubleshoot without being prepared with their tools and telemetry. So once you’ve identified that there is a problem, it’s time to use your network troubleshooting tools.
Ideally, you have tools and telemetry in advance, so your network observability toolchain is using AI to automatically identify problems and linking you to a jumping off point so you can drive down both MTTK (Mean Time to Know) and either MTTR (Mean Time to Repair) or MTTI (Mean Time to Innocence).
When your network is down, slow, or suffers from some other problem, your first job is to send packets across the network to validate the complaint. Send these pings using the Internet Control Message Protocol (ICMP) or TCP to one or any of the devices you believe to be involved.
Ping is a utility that’s available on practically every system, be it a desktop, server, router, or switch.
There’s a sports analogy that says “the most important ability is availability” for systems. If you can’t reach it, it’s not available to your users.
Sending some ICMP packets across the network, especially from your users’ side, will help answer that question, if your platform isn’t presenting the path to you automatically. In some cases if ICMP is filtered, you can usually switch to TCP (Transmission Control Protocol) and use tcping, telnet, or another TCP-based method to check for reachability.
If you’re not getting any ping responses, you need to find out where the ping is stopping. You can use another ICMP-based tool to help, and that’s traceroute.
Your ping could be getting stopped because ICMP isn’t allowed on your network or by a specific device. If that’s the case, you should consider TCP Traceroute on Linux, which switches to TCP packets.
From traceroute, since you will see the path of IP-enabled devices your packets take, you will also see where the packets stop and get dropped. Once you have that, you can further investigate why this is happening. Could it be a misconfiguration issue? A misapplied access list?
Kentik Synthetic Monitoring enables you to continuously test network performance (via ICMP, TCP, HTTP, and other tests) so you can uncover and correct issues before they impact customer experience. Ping and traceroute tests performed continuously with public and/or private agents generate key metrics (latency, jitter, and loss) that are evaluated for network health and performance.
To get ahead of the game, Kentik also allows you to set up autonomous tests, so there’s already test history to your top services and destinations. You can also run these continuously (every second, like the ping command default) for high resolution.
Now that you’ve identified the device or group of devices that could be the culprit, log into those devices and take a look. Run commands based on your device’s network operating system to see some of the configuration.
Take a look at the running configuration to see what interfaces are configured to get to the destination. You can take a look at system logs that the device has kept for any routing or forwarding errors. You can also look at antivirus logs on the destination systems that could be blocking access.
At this point, you may find yourself unable to get enough detail about the problem. Commands are telling you how things should work. What if everything’s working the way it should? What now? Or you might be getting overwhelmed by the amount of log data.
Many network outages relate to changes that humans made! Another key step on the troubleshooting path is to see if anything changed at about the same time as issues started.
This information can be found in logs of AAA (Authentication, Authorization, and Accounting) events from your devices. Ideally stored centrally, but often also visible by examining the on-device event log history.
The old saying about packet captures is that packets don’t lie! That’s also true for flow data, which summarized packets.
Both packets and flows provide information about the source and destination IPs, ports, and protocols.
When getting flow data, you’re not as in the weeds as during a packet capture, but it’s good enough for most operational troubleshooting. Whether it’s with NetFlow, sFlow, or IPFIX, you’ll be able to see who’s talking to whom and how with flow data going to a flow collector for analysis.
Capturing packet data is truly getting into the weeds of troubleshooting your network. If it’s unclear from flow, and often if it’s a router or other system bug, you may need to go to the packets.
Unless you have expensive collection infrastructure, it’s also often more time consuming for you than any of the other tools above. Whether it’s tcpdump, Wireshark, or SPAN port, you’ll be able to get needle-in-the-haystack data with packet captures.
One great middle ground is augmented flow data, which captures many of the elements of packets. This can be great if you can get performance data, but not all devices can watch performance and embed in flow — in fact, the higher speed the device, the less likely it is to support this kind of enhancement.
Collecting and analyzing packets and flows is where you start to venture into the next step. You’re using a mix of utility tools (tcpdump) and software (Wireshark, flow collector). If you’re expecting to keep a low MTTR, you need to move up the stack to software systems.
If you can’t find issues using these tools and techniques at the network level, you may need to peek up the stack because it could be an application, compute, or storage issue. We’ll cover more on this cross-stack debugging in a future troubleshooting overview.
Kentik Network Explorer provides an overview of the network with organized, pre-built views of activity and utilization, a Network Map, and other ways to browse your network, including the devices, peers, and interesting patterns that Kentik finds in the traffic.
To make NetOps teams more efficient Kentik provides workflows for troubleshooting and capacity management. These are some of the most basic tasks required to operate today’s complex networks, which span data center, WAN, LAN, hybrid and multi-cloud infrastructures.
The Network Explorer combines flow, routing, performance, and device metrics to build the map and let you easily navigate. And everything is linked to Data Explorer if you need to really turn the query knobs to zoom way in.
If you can’t find the obvious issue with something unreachable or down, it’s key to look beyond the high level and into the details of your network.
Kentik Data Explorer provides a fast, network-centric, easy-to-use interface to query real-time and historic network telemetry data. Select from dozens of dimensions or metrics, 13 different visualizations and any data sources. Set time ranges and search 45 days or more of retained data. Query results within seconds for most searches.
This lets you see traffic, routing, performance, and device metrics in total, by device, by region, by customer, by application, or any combination of dimensions and filters that you need to zoom in and find underlying issues.
Marc Andreessen of Netscape fame once said that “software is eating the world.” But when it comes to network troubleshooting, software has made things a lot easier. It has taken over from the manual tools run from a terminal or network device.
There are software tools that ping not just to one device but multiple devices at the same time for availability and path. Many are flow and packet data stores with software agents sending network data. All this is done and put on a nice dashboard for you. Network troubleshooting is still hard, but software makes it easier.
However, in this era of cloud-native and multi-cloud infrastructure, some software makes it easier than others. For that, you need to move beyond traditional monitoring software because it’s not enough anymore. You need to move to observability software.
With software tools like products from Kentik, you can use the devices to send data to observe the state of your network instead of pulling it from the network.
Whether you’re using network observability tools, or have a network small enough where the other tools are sufficient, here are some best practices you should consider.
When troubleshooting, you should develop a checklist of steps like what I’ve outlined above.
In his book The Checklist Manifesto, Dr. Atul Gawande discusses how checklists are used by surgeons, pilots, and other high-stress professionals to help them avoid mistakes. Having a checklist to ensure that you go through your troubleshooting steps in a timely and correct manner can save your users big headaches. And save you some aggravation.
Over time, this checklist will likely become second nature, and having and following it ensures you’re always on top of your game.
You want to have already picked the tools you need to troubleshoot a network problem before you get an emergency call. That isn’t the time to research the best software tool to use. By then, it’s too late.
If you run into a network troubleshooting problem that took longer than you hoped with one tool, research other tools for the next time. But do this before the next big problem comes along.
It’s tough to jump on a network troubleshooting call and not know much about the network you’re going to, well, troubleshoot. IT organizations are notorious for not having enough documentation. At times, you know it’s because there aren’t enough of you to go around.
But you have to do what you can. Over time, you should compile what you learn about the network. Document it yourself if you have to, but have some information. Identify who owns what and what is where. Otherwise, you could spend lots of troubleshooting time asking basic questions.
If you closely follow the toolset above, you may have noticed that I’m moving up the stack with each tool.
In some ways, I’m following the Open Systems Interconnection (OSI) stack. When troubleshooting, you want to start at the physical layer and work your way up. If you start by looking at the application, you’ll be masking potential physical connection problems such as interface errors or routing issues happening at layer 3. Or any forwarding problems at layer 2.
So follow the stack, and it won’t steer you wrong.
And there it is. When the network is down, troubleshooting can be a daunting task, especially in today’s hybrid infrastructure environments.
But if you follow the steps I’ve outlined, you can make things easier on yourself. Create your network troubleshooting checklist, decide on your toolset, and get ready. If it’s not down now, the network will likely be down later today.
Now that you know this about network troubleshooting, you’ll be ready when the network issues affect traffic in the middle of the night. You won’t like it; nobody likes those 1:00 A.M. calls. But you’ll be prepared.