Do you work with distributed software systems? Designed well, they’re normally more robust and reliable than single systems, but they have a more complex network architecture. Many teams spend long hours at the keyboard querying different tools and nodes, trying to figure out why things have failed, and we’re sure you’ve been there too. And while it’s great that cloud providers often hide much of this complexity, cloud services fail differently from compute and networking that you control.
You probably already use tools to monitor your network. Common monitoring metrics are latency, packet loss, and jitter. But these metrics are usually collected at the level of an individual service, like a particular internet gateway or load balancer. Because metrics and logs live at the service level, tracing a problem through the whole system is difficult. Additionally, if your service spans multiple cloud providers, you’ll waste even more time debugging between them. Observability helps you understand what’s going on in your system so you can debug faster and make better decisions.
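To make the three metrics named above concrete, here is a minimal sketch of how they could be derived from a series of probe results for a single path. The function name `summarize_probes` and the input shape (a list of round-trip times in milliseconds, with `None` for a lost probe) are assumptions made for illustration. Jitter is taken here as the mean absolute difference between consecutive round-trip times, which is one common simplification rather than the only definition.

```python
from statistics import mean
from typing import Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize probe results for one path.

    Each entry is a round-trip time in milliseconds, or None if the
    probe got no reply (hypothetical input format).
    """
    sent = len(rtts_ms)
    received = [r for r in rtts_ms if r is not None]

    # Packet loss: share of probes that never came back.
    loss_pct = 100.0 * (sent - len(received)) / sent if sent else 0.0

    # Latency: average round-trip time of the probes that did come back.
    latency_ms = mean(received) if received else None

    # Jitter (simplified): mean absolute change between consecutive replies.
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter_ms = mean(deltas) if deltas else 0.0

    return {"latency_ms": latency_ms, "loss_pct": loss_pct, "jitter_ms": jitter_ms}


if __name__ == "__main__":
    print(summarize_probes([10.0, 12.0, None, 11.0]))
```

Numbers like these describe one path or one service at a time, which is exactly why they are hard to stitch together into a view of the whole system.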
This post explains the roles of observability and monitoring. It also covers why network observability is the missing piece for complete visibility.
Although distributed systems are more robust, they come with added complexity. You can debug each component individually, but problems in these systems are often due to network issues between services. The complexity grows when communication crosses multiple cloud providers or on-premises machines. To better support these systems, you need a more effective way of understanding how the different components communicate. More specifically, when something goes wrong, you need to find the root cause as fast as possible. You achieve this by understanding your system as a whole, without wasting time debugging each node.
Observability and monitoring are entirely different concepts. However, you may often hear the terms mixed up or used interchangeably. And you’d be forgiven if you thought the two meant the same thing. Observability is a measure of how well you can understand your system from only its external outputs. The meaning of “external outputs” is described in the three pillars of observability: metrics, logs, and distributed tracing. Specifically for network observability, the output is telemetry. We described network telemetry in detail in our recent blog post The Network Also Needs to be Observable, Part 3: Network Telemetry Types. It’s important to note that this definition treats observability as a measure, not a final state or an activity.
Observability increases your understanding and visibility of different components of your network and infrastructure. You might be wondering why visibility into your infrastructure is essential. Well, we’re sure you’ll agree that maintaining and updating components of a production system is a huge pain. Changing even the smallest part of the networking infrastructure may leave you feeling a little sick with worry. When you look deeper into why you feel this way, you may find it’s because you have no idea what is happening in different parts of the system. Documentation and fancy diagrams that try to explain how a system works are nearly always out of date. The only way to understand how information flows through your system is by observing what’s happening right now.
Monitoring refers to the activity of capturing data, usually metrics or flow data, on different nodes in a system. A common metric is the health of a device. For example, this could include a server’s CPU usage over time or total memory usage. The goal of this type of metric is to warn you when resources are nearing their limits. Application performance monitoring (APM) is a type of monitoring that gives a great understanding of your application health. However, instrumenting your code with an APM tool will not give you coverage of your network.
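As an illustration of the "warn when resources are nearing their limits" idea, the sketch below flags a device once its most recent samples all sit above a fraction of the limit. The function name `nearing_limit`, the 80% warning ratio, and the three-sample window are illustrative assumptions, not values from any particular monitoring tool.

```python
from typing import Sequence


def nearing_limit(
    samples: Sequence[float],
    limit: float,
    warn_ratio: float = 0.8,  # assumed: warn at 80% of the limit
    window: int = 3,          # assumed: require 3 consecutive high samples
) -> bool:
    """Return True if the last `window` samples are all at or above
    `warn_ratio * limit`. Requiring several samples in a row avoids
    warning on a single short spike."""
    if window <= 0 or len(samples) < window:
        return False
    threshold = warn_ratio * limit
    return all(s >= threshold for s in samples[-window:])


if __name__ == "__main__":
    cpu_percent = [40, 85, 90, 95]
    print(nearing_limit(cpu_percent, limit=100))
```

A check like this tells you that one node is under pressure. It says nothing about the traffic between nodes, which is the gap the rest of this post is about.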
It’s essential to understand which metrics will give you the best insight into your system’s health. In general, types of monitoring include user activity monitoring, network monitoring, and event monitoring. The type of monitoring you choose depends on what’s going to give you the most value. For example, if the network is the area that most often causes you pain, you could focus on network monitoring first.
The type of observability that you see most often is compute observability. This usually takes the form of compute metrics, logs, distributed tracing, and APM. However, when you’re trying to get a clearer picture of how your network is functioning, distributed tracing and metrics alone don’t give you enough visibility. This is where network observability fits in. You can learn more about the specifics of network observability from The Network Also Needs to be Observable, Part 1. The aim is to gather all types of telemetry from all networks, together with business metadata. Without this visibility, answering questions like “who are my top talkers?” or “am I under attack?” is manual and time-consuming.
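To show what answering “who are my top talkers?” looks like once flow telemetry sits in one place, here is a small sketch that ranks source addresses by bytes sent. The record fields (`src_ip`, `dst_ip`, `bytes`) and the function name `top_talkers` are hypothetical and chosen for illustration. Real flow exports carry many more fields, and this is not how any specific product implements the query.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Tuple


def top_talkers(flows: Iterable[Mapping], n: int = 3) -> List[Tuple[str, int]]:
    """Rank source IPs by total bytes sent across all flow records."""
    totals: Counter = Counter()
    for flow in flows:
        totals[flow["src_ip"]] += flow["bytes"]
    # most_common returns (src_ip, total_bytes) pairs, largest first.
    return totals.most_common(n)


if __name__ == "__main__":
    # Hypothetical flow records using documentation-range addresses.
    flows = [
        {"src_ip": "192.0.2.10", "dst_ip": "198.51.100.5", "bytes": 1200},
        {"src_ip": "192.0.2.11", "dst_ip": "198.51.100.5", "bytes": 300},
        {"src_ip": "192.0.2.10", "dst_ip": "198.51.100.9", "bytes": 800},
        {"src_ip": "192.0.2.12", "dst_ip": "198.51.100.5", "bytes": 5000},
    ]
    for ip, total in top_talkers(flows, n=2):
        print(ip, total)
```

The query itself is trivial. The hard part is having flow records from every network collected and normalized so that a single query like this can run at all.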
Observability is very useful for finding the root cause of issues fast. However, once you’ve implemented observability across multiple systems, a few other benefits prove just as valuable. Firstly, it’s great for capacity planning done right. One of the hardest parts of designing a system is planning for capacity. If you understand how your capacity requirements have grown over time, you can make a more informed decision. No more guessing. Secondly, it makes onboarding new members of staff much easier. Someone joining the team doesn’t need pages and pages of (probably out-of-date) documentation. Finally, it’s really good for team confidence and experimentation. You can stop fearing that a small change will have enormous consequences.
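As a rough illustration of the capacity planning point, the sketch below fits a straight line to historical usage and estimates how many more periods remain before the trend reaches capacity. The function name `periods_until_full` and the assumption of linear growth are ours. Real traffic is often seasonal or bursty, so treat this as a starting point rather than a forecast model.

```python
from typing import Optional, Sequence


def periods_until_full(usage: Sequence[float], capacity: float) -> Optional[float]:
    """Estimate periods remaining until a linear trend reaches `capacity`.

    `usage` holds one measurement per period (e.g. monthly peak Gbps),
    oldest first. Returns None if usage is flat or shrinking.
    """
    n = len(usage)
    if n < 2:
        raise ValueError("need at least two measurements to fit a trend")

    # Ordinary least-squares fit of usage against the period index 0..n-1.
    x_mean = (n - 1) / 2
    y_mean = sum(usage) / n
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(usage))
    variance = sum((x - x_mean) ** 2 for x in range(n))
    slope = covariance / variance
    if slope <= 0:
        return None

    intercept = y_mean - slope * x_mean
    # Period index where the fitted line hits capacity, relative to the
    # most recent measurement (index n - 1).
    return max(0.0, (capacity - intercept) / slope - (n - 1))


if __name__ == "__main__":
    monthly_peak_gbps = [40, 50, 60, 70]
    print(periods_until_full(monthly_peak_gbps, capacity=100))
```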
Worrying less about making changes and being able to solve issues quickly sounds great. However, there’s a catch. You need to collect all the telemetry data and use it. Tools like Kentik make this process easier by automating most of the collection of the data you need. Once you have the data, Kentik can alert you to any abnormal behavior and give you relevant visualizations and metrics to understand what is going on.
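To give a feel for what “abnormal behavior” can mean in practice, here is one simple, generic approach: compare the latest measurement with a baseline built from recent history and flag it when it deviates by more than a few standard deviations. This is only an illustrative sketch and not a description of how Kentik detects anomalies. The function name `is_abnormal` and the threshold of three standard deviations are assumptions.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_abnormal(history: Sequence[float], latest: float, z_limit: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `z_limit` standard deviations
    from the mean of `history` (a simple z-score check)."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # A perfectly flat baseline: any change at all is unusual.
        return latest != baseline
    return abs(latest - baseline) / spread > z_limit


if __name__ == "__main__":
    recent_mbps = [100, 102, 98, 101, 99]
    print(is_abnormal(recent_mbps, 150))
    print(is_abnormal(recent_mbps, 101))
```

Unlike a fixed threshold, a baseline check adapts to what is normal for each link or service, which matters when traffic levels differ widely across a network.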
Monitoring provides information and visibility, but observability brings you deep insights into how your application, infrastructure, and network perform, all from external outputs. The classic outputs are metrics, logs, and distributed tracing. However, for a fuller understanding, you need to know about the control plane. Network observability covers the control plane by gathering all types of telemetry along with the required context. But why is this insight useful? Instead of querying each part of the system to debug issues, all the information you need is easy to query in a central place. This increased understanding also helps you design and plan for changes or migrations.
It may take some trial and error to get the right amount of data with the right amount of detail. However, once you start adding different types of observability, you’ll see how useful it is. But it’s up to you to collect this data and use it to build your understanding. Tools like Kentik can help collect and organize this data so you can focus on making the right decisions instead of spending all your time collecting data and building visualizations.