Do you work with distributed software systems? Designed well, they’re normally more robust and reliable than single systems, but they have a more complex network architecture. Many teams spend long hours at the keyboard querying different tools and nodes, trying to figure out why things have failed — and we’re sure you’ve been there too. And while it’s great that cloud providers often hide much of this complexity, they fail differently from the compute and network you control.
You probably already use tools to monitor your network, often at an individual service level or networking layer. You may be monitoring a particular internet gateway or load balancer, or only seeing device metrics, or focusing on only flow or synthetic measurements. Additionally, if your service is cross-platform, you’ll waste even more time debugging between the various providers. Since you’re supporting applications and users, once you cover the basics, getting visibility synchronized between the app and network layer is important as well. The emerging principles and practices of observability help you understand what’s going on in your system to speed up your debugging and make the best decisions.
This post explains the emergence of observability and how it relates to traditional monitoring. In addition, it will cover how network observability is a critical requirement for gaining complete visibility.
Although distributed systems are more robust, they come with added complexity. You can debug each component individually, but network issues between services often cause problems in these systems. The problem you’re seeing is usually triggered by a root cause a few layers or services over. This complexity is magnified when communication spans multiple cloud providers or on-premises machines. To better support these systems, you need a more effective way of understanding how the different components communicate. More specifically, when something goes wrong, you need to figure out the root cause as fast as possible. You achieve this by having the best possible understanding of your system without wasting time debugging each node.
Observability and monitoring are entirely different concepts. However, you may often hear the terms mixed up or used interchangeably. And you’d be forgiven if you thought the two meant the same thing. Observability measures how well you understand your system from only its external outputs. It’s important to note the definition specifies observability as a measure, not a final state or an activity.
The meaning of “external outputs” is often described in the application-centric world by the three pillars of observability: metrics, logs, and distributed tracing (or sometimes MELT, when including “events”). Specifically for network observability, the output is a broad set of telemetry and metadata. Network telemetry usually includes device metrics, traffic telemetry, and synthetic telemetry as the core. We see advanced solutions combining these and other sources. Metadata for infrastructure-focused observability usually includes routing, customer, application, user, cost, DNS, IPAM, and other orchestration data. We described network telemetry (and its relationship to observability) in detail in our recent blog post, The Network Also Needs to be Observable, Part 3: Network Telemetry Types.
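To make the idea of combining telemetry with metadata concrete, here is a minimal sketch in Python. It assumes a simplified flow record and a hand-written prefix-to-application table; the names `FlowRecord`, `PREFIX_TO_APP`, `lookup_app`, and `enrich` are hypothetical and do not come from any particular product. Real platforms draw this context from sources such as IPAM, DNS, and orchestration systems.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class FlowRecord:
    """One traffic-telemetry record (hypothetical, simplified fields)."""
    src_ip: str
    dst_ip: str
    bytes_sent: int


# Hypothetical metadata: which application owns which prefix.
# In practice this would come from IPAM or orchestration data.
PREFIX_TO_APP = {
    ip_network("10.0.1.0/24"): "checkout",
    ip_network("10.0.2.0/24"): "search",
}


def lookup_app(address: str) -> str:
    """Return the application owning the address, or 'unknown'."""
    ip = ip_address(address)
    for prefix, app in PREFIX_TO_APP.items():
        if ip in prefix:
            return app
    return "unknown"


def enrich(flow: FlowRecord) -> dict:
    """Attach application metadata to a raw flow record."""
    return {
        "src_ip": flow.src_ip,
        "dst_ip": flow.dst_ip,
        "bytes_sent": flow.bytes_sent,
        "src_app": lookup_app(flow.src_ip),
        "dst_app": lookup_app(flow.dst_ip),
    }


if __name__ == "__main__":
    print(enrich(FlowRecord("10.0.1.5", "10.0.2.9", 1200)))
```

With the application names attached, you can ask which service is talking to which other service instead of working only with IP addresses.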
Observability gives you full access to enriched data to see the inputs and activity in your infrastructure and application systems. With the right implementation, you can interact with the underlying data and signals to detect, diagnose, and repair issues as they occur.
Observability increases your understanding and visibility of different components of your network and infrastructure. You might be wondering why visibility into your infrastructure is essential. Well, we’re sure you’ll agree that maintaining and updating components of a production system is a huge pain. Changing even the most minor section of the network infrastructure may cause you to feel a little sick with worry in the pit of your stomach. When you look deeper into why you felt this way, you may find it’s because, at the time, you had no idea what was happening in different parts of the system. Most documentation and fancy diagrams trying to explain how a system works are nearly always out of date. The only way to understand how information flows through your system is by observing what’s happening — right now.
Monitoring refers to the activity of capturing data and querying it in known ways. Traditional monitoring often focuses purely on data capture and query, without the combination of telemetry types and metadata to help achieve observability.
Usually, these queries present as dashboards and alerts that look for well-known patterns, such as interfaces with errors or poorly performing links.
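As an illustration of what “querying in known ways” looks like, the sketch below checks interface samples against fixed thresholds. The field names and the limits (`ERROR_RATE_LIMIT`, `UTILIZATION_LIMIT`) are assumptions chosen for the example, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceSample:
    """A single device-metric sample (hypothetical fields)."""
    device: str
    interface: str
    error_rate: float   # errored packets / total packets, 0.0 to 1.0
    utilization: float  # fraction of link capacity in use, 0.0 to 1.0


# Assumed thresholds for the example only.
ERROR_RATE_LIMIT = 0.01
UTILIZATION_LIMIT = 0.90


def check(sample: InterfaceSample) -> list[str]:
    """Return alert messages for the well-known patterns this sample matches."""
    name = f"{sample.device}:{sample.interface}"
    alerts = []
    if sample.error_rate > ERROR_RATE_LIMIT:
        alerts.append(f"{name} error rate {sample.error_rate:.2%} exceeds limit")
    if sample.utilization > UTILIZATION_LIMIT:
        alerts.append(f"{name} utilization {sample.utilization:.0%} exceeds limit")
    return alerts


if __name__ == "__main__":
    for message in check(InterfaceSample("edge-1", "eth0", 0.03, 0.95)):
        print(message)
```

A check like this only catches the failure modes someone already thought to encode, which is the limitation the rest of this post is about.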
As organizations embrace DevOps cultural principles to extend their operations maturity, retrospectives often wind up with additional monitoring deliverables as various failure modes become known as patterns.
Modern observability platforms can also support monitoring techniques and allow proactive notification and interactive analysis, turning those successful investigations into saved queries and alerts.
As with observability platforms, there are different monitoring platforms, including user activity monitoring, application monitoring, network monitoring, and event monitoring. The type of monitoring you choose to focus on often depends on where it’s going to give your organization the most value. For example, if network monitoring is the space that causes you the most pain, you are more likely to focus on that first.
The three common elements of compute observability are metrics, logs, and distributed tracing. However, when you’re trying to get a clearer understanding of how your network is functioning, distributed tracing and metrics alone don’t provide enough visibility.
This is where network observability fits in. You can learn more about the specifics of network observability from The Network Also Needs to be Observable, Part 1. The aim is to gather all types of telemetry from all networks, along with business metadata, and to use it to provide the most valuable insights and action-focused workflows for the people working on the networking front lines.
Observability is very useful for finding the root cause of issues fast. However, once you have implemented different levels of observability in multiple systems, you’ll find a few other benefits that are just as valuable. First, it’s great for human-assisting workflows such as capacity planning. One of the hardest parts of designing a system is planning for capacity. If you understand how your capacity requirements have grown over time, you can make a more informed decision. No more guessing.
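One simple way to turn historical growth into a capacity estimate is to fit a straight line to past utilization and see when it crosses the link’s capacity. The sketch below assumes evenly spaced samples (for example, one peak value per month) and roughly linear growth; `fit_line` and `periods_until_full` are hypothetical helper names, and real traffic often needs a more careful model.

```python
def fit_line(values):
    """Least-squares slope and intercept for samples at x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def periods_until_full(values, capacity):
    """Periods after the last sample until the trend reaches capacity.

    Returns None when the trend is flat or shrinking.
    """
    slope, intercept = fit_line(values)
    if slope <= 0:
        return None
    x_full = (capacity - intercept) / slope
    return max(0.0, x_full - (len(values) - 1))


if __name__ == "__main__":
    monthly_peak_gbps = [40, 45, 50, 55]  # made-up sample data
    print(periods_until_full(monthly_peak_gbps, capacity=100))
```

The estimate is only as good as the history behind it, which is why keeping telemetry over long periods matters for planning.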
Worrying less about making changes and being able to solve issues quickly sounds great. However, there’s a catch — you need to collect all the telemetry data and use it. Tools like Kentik make this process easier by automating most of the collection of the data you need. And once you have the data, Kentik can alert you to any abnormal behavior and give you relevant visualizations and metrics to understand what is going on.
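To give a feel for what alerting on abnormal behavior can mean, here is a generic baseline check in Python. It flags a value that sits far from the recent average, measured in standard deviations. This is a textbook technique used purely for illustration, not a description of how Kentik or any other product detects anomalies; the function name `is_abnormal` and the default threshold of 3.0 are assumptions.

```python
from statistics import mean, pstdev


def is_abnormal(history, latest, threshold=3.0):
    """Return True if `latest` deviates strongly from the recent baseline.

    `history` is a list of recent measurements of the same metric.
    """
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # Perfectly flat history: any change counts as a deviation.
        return latest != baseline
    return abs(latest - baseline) / spread > threshold


if __name__ == "__main__":
    recent_mbps = [100, 102, 98, 101, 99]  # made-up sample data
    print(is_abnormal(recent_mbps, 150))
    print(is_abnormal(recent_mbps, 101))
```

Unlike a fixed threshold, a baseline adapts to what is normal for each link or service, at the cost of needing enough history to be meaningful.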
Monitoring provides information and visibility (especially around questions you knew were important to ask). But observability brings you deep insights into how your application, infrastructure, and network perform, all from external outputs, and available for both novice and expert humans to explore. The classic application and DevOps-focused outputs are metrics, logs, and distributed tracing. However, you need to know about orchestration and control planes and other business metadata for a more complete understanding. Adding network observability and seeing a wide variety of infrastructure telemetry along with the required context makes this even more valuable, and not just to network teams. Instead of querying each part of the system to debug issues, you have all the information you need in a form that is easy to query and integrate to support your regular and unscheduled designs, plans, and operational workflows. It may take some trial and error to get the right amount of data with the right amount of detail.
But it’s up to you to collect this data and use it to help with your understanding. Tools like Kentik at the network layer, and New Relic at the application layer, can help collect and organize this data so you can focus on making the right decisions instead of spending all your time collecting data and building visualizations.