Collecting and enriching telemetry data with DevOps observability data is key to ensuring organizational success. Read on to learn how to identify the right KPIs, collect vital data, and achieve critical goals.
Network observability is critical. You need the ability to answer any question about your network—across clouds, on-prem, edge locations, and user devices—quickly and easily.
But network observability is not always easy. To be successful, you need to collect network telemetry, and that telemetry needs to be extensive and diverse. Once you have that raw telemetry data, you need to interpret it. Even then, key questions, such as “Am I using my network resources effectively?”, are not always easy to answer.
To answer the business-level questions that can move the needle, you need to enrich your network telemetry. This post will provide concrete guidance on how to do just that. We’ll look at how, by combining DevOps observability data with network telemetry, you can get strong, network-focused observability. Let’s begin with a discussion of KPIs.
The first step toward comprehensive network observability is identifying your key performance indicators. Here are some examples of network-related KPIs:
Note that these KPIs can be aggregated at different levels of the hierarchy—individual endpoints or instances, multi-instance services, entire data centers, across regions, and globally.
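To make the idea of aggregation levels concrete, here is a minimal sketch that rolls one KPI (average latency) up through the hierarchy. The sample records, field names, and the `aggregate` helper are all hypothetical and only illustrate the shape of the computation; a real platform would do this inside its query engine.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical latency samples (ms), each tagged with its place in the hierarchy.
samples = [
    {"region": "us-east", "datacenter": "dc1", "service": "api", "instance": "api-1", "latency_ms": 10.0},
    {"region": "us-east", "datacenter": "dc1", "service": "api", "instance": "api-2", "latency_ms": 20.0},
    {"region": "us-east", "datacenter": "dc2", "service": "db", "instance": "db-1", "latency_ms": 30.0},
    {"region": "eu-west", "datacenter": "dc3", "service": "api", "instance": "api-3", "latency_ms": 40.0},
]


def aggregate(records, level_keys):
    """Average latency_ms over the raw samples, grouped by the given hierarchy keys."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[k] for k in level_keys)
        groups[key].append(record["latency_ms"])
    return {key: mean(values) for key, values in groups.items()}


per_instance = aggregate(samples, ["region", "datacenter", "service", "instance"])
per_service = aggregate(samples, ["region", "datacenter", "service"])
per_region = aggregate(samples, ["region"])
global_avg = aggregate(samples, [])  # an empty key list groups everything together
```

Averaging suits a KPI like latency. For a KPI like throughput, you would likely sum instead of average when moving up a level.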
After identifying and categorizing the relevant KPIs, you need to gather data about these KPIs. Network monitoring tools use various techniques for data gathering, including polling, collecting metrics from network devices, and scraping traffic logs.
In the cloud, you can ingest network telemetry data from cloud providers into your network observability platform. In your own data centers, you will need to choose, install, and configure network monitoring tools.
The next step is to collect the auxiliary data that will be used to enrich the network telemetry data. Let’s cover the different major types of auxiliary data.
Log files from network devices, servers, and applications can contain information relevant to your network observability KPIs. The basic process looks like this:
Events, such as alerts generated by network devices, can also be ingested into the observability platform, potentially triggering a higher-level alert.
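As a rough illustration of how device-level events could trigger a higher-level alert, the sketch below escalates when several distinct devices at the same site report events within a short window. The `DeviceEvent` type, the `escalate` function, and the thresholds are assumptions made for this example, not the behavior of any particular platform.

```python
from dataclasses import dataclass


@dataclass
class DeviceEvent:
    timestamp: float  # seconds since epoch
    site: str
    device: str
    kind: str  # e.g. "link_down"


def escalate(events, window_s=60.0, min_devices=3):
    """Return one site-level alert for each site where at least `min_devices`
    distinct devices raised an event within `window_s` seconds."""
    by_site = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        by_site.setdefault(event.site, []).append(event)

    alerts = []
    for site, site_events in by_site.items():
        start = 0
        for end, event in enumerate(site_events):
            # Slide the window start forward until it is within window_s of this event.
            while event.timestamp - site_events[start].timestamp > window_s:
                start += 1
            devices = {e.device for e in site_events[start:end + 1]}
            if len(devices) >= min_devices:
                alerts.append({"site": site, "at": event.timestamp, "devices": sorted(devices)})
                break  # one alert per site is enough for this sketch
    return alerts


# Hypothetical events: site A has a burst, site B has isolated events.
events = [
    DeviceEvent(0.0, "A", "r1", "link_down"),
    DeviceEvent(10.0, "A", "r2", "link_down"),
    DeviceEvent(20.0, "A", "r3", "high_latency"),
    DeviceEvent(0.0, "B", "s1", "link_down"),
    DeviceEvent(100.0, "B", "s2", "link_down"),
    DeviceEvent(200.0, "B", "s3", "link_down"),
]
site_alerts = escalate(events)
```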
Endpoint telemetry refers to data collected from devices that are connected to the network, such as laptops, tablets, and smartphones. This data may include performance metrics and resource usage of the devices, as well as the applications and services running on them. This endpoint telemetry data, too, can be used to enrich network telemetry.
For example, if you see a spike in CPU usage on endpoint devices, this might indicate a network issue that is causing the devices to work harder than usual.
As another example, let’s assume you see an increase in network latency. As part of your investigation into the issue, you can use endpoint telemetry data to see if there are changes in network access patterns on endpoint devices.
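One way such an investigation might look in code: compare how much traffic each endpoint sent to each destination before and after the latency increase began. This is only a sketch. The record fields, the `access_pattern_changes` function, and the threshold are hypothetical, and it assumes the "before" and "after" samples cover time windows of similar length.

```python
def access_pattern_changes(endpoint_samples, incident_start, ratio_threshold=2.0):
    """Flag (device, destination) pairs whose byte volume is new or grew by at
    least `ratio_threshold` after `incident_start`."""
    before, after = {}, {}
    for sample in endpoint_samples:
        bucket = before if sample["timestamp"] < incident_start else after
        key = (sample["device"], sample["destination"])
        bucket[key] = bucket.get(key, 0) + sample["bytes"]

    changes = []
    for (device, destination), after_bytes in after.items():
        before_bytes = before.get((device, destination), 0)
        if before_bytes == 0 or after_bytes / before_bytes >= ratio_threshold:
            changes.append({
                "device": device,
                "destination": destination,
                "before": before_bytes,
                "after": after_bytes,
            })
    # Largest absolute increase first.
    return sorted(changes, key=lambda c: c["after"] - c["before"], reverse=True)


# Hypothetical endpoint telemetry; latency started rising at t=100.
endpoint_samples = [
    {"timestamp": 90, "device": "laptop-1", "destination": "backup.example.com", "bytes": 100},
    {"timestamp": 110, "device": "laptop-1", "destination": "backup.example.com", "bytes": 5000},
    {"timestamp": 95, "device": "laptop-2", "destination": "mail.example.com", "bytes": 200},
    {"timestamp": 105, "device": "laptop-2", "destination": "mail.example.com", "bytes": 220},
    {"timestamp": 120, "device": "phone-1", "destination": "video.example.com", "bytes": 3000},
]
suspects = access_pattern_changes(endpoint_samples, incident_start=100)
```

Here `laptop-1` and `phone-1` would surface as changed access patterns worth a closer look, while `laptop-2` would not.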
Application-level telemetry refers to data collected from the applications and services running on the network, such as web servers, databases, and custom business applications. This data includes performance, errors, and resource usage by these applications and services.
Imagine that your monitoring of application-level telemetry shows a spike in response times. This might indicate an issue on the network that is causing the application to wait longer for network responses. Application-level telemetry can help you determine if your network is having problems. When properly correlated with network telemetry, it can even help you with root cause analysis.
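A simple first step toward that kind of correlation is to line up an application metric and a network metric over the same time buckets and measure how closely they move together. The sketch below computes a Pearson correlation by hand. The per-minute series are made-up values, and a high correlation is only a hint about where to look, not proof of a root cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


# Hypothetical per-minute series covering the same six minutes.
response_ms = [100, 110, 105, 300, 320, 115]      # application response time
retransmit_pct = [0.1, 0.2, 0.1, 4.0, 4.5, 0.2]   # network TCP retransmit rate
server_cpu_pct = [50, 52, 49, 51, 50, 50]         # server CPU, for comparison

r_network = pearson(response_ms, retransmit_pct)
r_cpu = pearson(response_ms, server_cpu_pct)
```

In this made-up data, response time tracks the retransmit rate far more closely than it tracks server CPU, which would point the investigation toward the network.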
When considering observability at the application level, take advantage of distributed tracing, making sure to use it comprehensively. This can be especially helpful for enriching network telemetry if your system is based on a microservice architecture.
Your network observability platform should have dashboards and visualizations for humans to understand overall network health and performance. However, at scale, humans alone can’t detect and respond to issues fast enough.
Machine learning (when implemented and trained properly) excels at digesting high-dimensionality data like enriched network telemetry. It can identify trends, predict future outcomes, and discover anomalies. These are network observability insights that even keen-eyed human operators are likely to miss at scale.
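To give a feel for automated anomaly detection, here is a deliberately simple stand-in: a rolling z-score over a single metric. This is a basic statistical baseline, not a trained machine learning model, and the function name, window size, and threshold are assumptions chosen for the example.

```python
from statistics import mean, pstdev


def zscore_anomalies(values, window=5, threshold=3.0):
    """Return (index, value, z_score) for points that deviate from the mean of
    the preceding `window` points by at least `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = pstdev(baseline)
        if sigma == 0:
            continue  # flat baseline: z-score is undefined, so skip in this sketch
        z = (values[i] - mu) / sigma
        if abs(z) >= threshold:
            anomalies.append((i, values[i], z))
    return anomalies


# Hypothetical throughput samples (Mbps) with one obvious spike.
throughput = [100, 102, 98, 101, 99, 100, 250, 101, 99]
found = zscore_anomalies(throughput)
```

A production system would work across many dimensions at once and account for seasonality, which is where properly trained models earn their keep.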
In addition, AI/ML-backed tools can be used to summarize and consolidate complex data to make it digestible by humans. As they help human operators understand the state of the network, these tools can also recommend courses of action during incidents.
Now, let’s look at a few key considerations to keep in mind when you manage and store all this enriched telemetry data.
First, when collecting the data, accounting for user privacy is imperative. You need to be aware of the types of data you feed into your network observability platform and ensure you comply with all relevant laws and regulations.
Next, observability doesn’t come cheap. It is easy to collect a lot of data, but you must consider the cost of collection, storage, and analysis and weigh that against the value that you derive from your data. For example, do you need to capture and analyze every network packet, or is it sufficient to analyze only 10% of the packets? Do you need to store your flow logs forever, or can you purge them after two years?
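A back-of-the-envelope calculation can help frame these trade-offs. The sketch below estimates stored volume for different sampling rates and retention periods, and shows a simple retention check. The daily volume is a made-up figure, and the estimate assumes storage scales linearly with the sampling rate, which may not hold for your data.

```python
from datetime import datetime, timedelta, timezone


def estimated_storage_gb(daily_raw_gb, sample_rate, retention_days):
    """Rough stored volume, assuming storage scales linearly with sampling."""
    return daily_raw_gb * sample_rate * retention_days


def is_expired(record_time, now, retention_days):
    """True if a record is older than the retention period and can be purged."""
    return now - record_time > timedelta(days=retention_days)


# Hypothetical: 500 GB of raw telemetry per day, kept for two years (730 days).
full_capture = estimated_storage_gb(500, 1.0, 730)   # keep everything
sampled = estimated_storage_gb(500, 0.1, 730)        # keep 10%

now = datetime(2024, 1, 1, tzinfo=timezone.utc)
old_record = datetime(2021, 6, 1, tzinfo=timezone.utc)
recent_record = datetime(2023, 6, 1, tzinfo=timezone.utc)
purge_old = is_expired(old_record, now, retention_days=730)
purge_recent = is_expired(recent_record, now, retention_days=730)
```

Under these assumptions, sampling at 10% cuts the stored volume by a factor of ten. Whether the lost detail is acceptable depends on the questions you need your data to answer.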
Finally, the value of enriching network telemetry is clear. However, an organization must manage and store all that data appropriately in order to reap the benefits. This is where a network observability platform like Kentik comes in. You need a solid platform that follows industry best practices, integrates with all the standard tools and network providers, and offers a turnkey (yet customizable) solution.
Let’s recap. Network telemetry is fundamental to network observability, but it can be much more useful if you enrich it with data from auxiliary sources such as logs, events, endpoint device telemetry, and application-level telemetry. Once you identify your network observability KPIs, you can collect all the relevant data and feed it into your network observability platform.
Meanwhile, you should leverage AI/ML-backed tools to understand your network, detect problems early, and provide predictive analysis.
Ongoing analysis of network telemetry data is crucial for maintaining network health and performance, because your network (and the traffic attempting to pass through it) is dynamic and constantly changing. Real-time analysis of the current state and behavior of the network can help network administrators (and automation safeguards) identify issues and take proactive measures to resolve them. Enriching your network telemetry can make that analysis, and your network observability as a whole, significantly more effective.
To take advantage of enriched network telemetry and realize the goal of true network observability, you need a robust network observability platform like Kentik.