When coupled with a network observability platform, device telemetry provides network engineers and operators with critical insight into cost, performance, reliability, and security.
For cloud network specialists, the landscape for their observability efforts includes a mix of physical and virtual networking devices. These devices generate signals (by design or through instrumentation) that provide critical information to those responsible for managing network health.
In this article, I will provide some background on different types of telemetry, discuss key network performance signals, and highlight ways network specialists can leverage this device telemetry in their network observability efforts.
Telemetry, in its broadest sense, is any signal that is automatically measured, transmitted, and then processed/stored. This can be an error message transmitted from your phone to the manufacturer, any of the myriad signals sent from your vehicle’s many sensors to their respective CPUs, or life-preserving health monitors updating nurses on their patients’ conditions.
Regarding telemetry in cloud networks, the measurement, transmission, collection, and processing of these signals is both a tremendous challenge and an opportunity for network operators. Traditional network monitoring relies on telemetry sources such as Simple Network Management Protocol (SNMP), sFlow, NetFlow, CPU, memory, and other device-specific metrics. That said, network observability draws on an even more comprehensive set of telemetry sources for network specialists to leverage.
In the context of cloud networks, network telemetry becomes a much more significant concept, as these signals are being generated in a vast, deep system of networks. With so many network boundaries being navigated (application, service, cloud providers, subnets, SD-WANs, etc.), network operators and engineers must cast as wide a net as possible to source their telemetry.
In the network observability world, one of the principal telemetry types operators have to concern themselves with is device telemetry. Your switches, servers, transits, gateways, load balancers, and more are all capturing critical information about their resource utilization and traffic characteristics. This is the case for both physical devices and their digital abstractions.
Whether or not this telemetry is being collected and analyzed depends on an organization’s needs, constraints, and budgets. Still, it holds immense value for operators making cost, performance, and reliability decisions.
Endpoint telemetry, a subset of device telemetry, includes physical sources such as mobile phones, handheld payment processors (think Square or Stripe hardware), personal computers, and heavy machinery, as well as telemetry from applications operating at those endpoints. These endpoints represent the very edge of modern networks and have considerable operational and security implications.
The bread and butter of the DevOps world, application-level, or “layer 7,” telemetry is finding increasing value under the purview of NetOps. Representing diverse sources such as application functions, schedulers, orchestration tools like Kubernetes, and more, application-level telemetry is critical to providing the context that operators and engineers need to make sense of traffic, performance, and security in their networks.
With the level of detail that application-level telemetry provides, operators can quickly answer: Is this even a network problem?
The Internet of Things refers to the networks that power and support enterprises at the edge. It includes devices like the ones we covered in the “endpoint telemetry” section above, plus the connectivity layer that attaches them to an enterprise’s more extensive network. This connectivity layer can include tech like WiFi and Bluetooth or network abstractions like WANs and SD-WANs, among others.
IoT is about more than just thermostats that can connect to WiFi. A real business scenario to consider is a fast food chain using employees outside in the drive-thru to manage traffic during the busiest hours. Armed with a tablet to run point-of-sale, communicate with staff inside, and manage wait times, these employees rely on IoT to delight their customers and keep things running smoothly. Can you imagine what happens to sales if these tablets have trouble connecting to the network or are not adequately secured?
At scale, network issues affecting IoT, like the devices at this fast food chain, can be devastating to an organization’s bottom line and reputation.

Common network device metrics for understanding network health

So far, we’ve covered common sources for device telemetry, but network operators want to know which signals prove their value amid all this data noise.
In this section, we’ll take a closer look at a few of these critical signals:
One of the more foundational device signals is uptime. Usually communicated as a percentage (ideally in the range of 99.999% or “five nines” for most of today’s offerings), uptime is the ratio of the network device or service being available versus not.
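To make those percentages concrete, a quick calculation shows how little downtime each extra “nine” actually permits per year. The function name below is illustrative, not from any particular tool:

```python
# Translate an uptime percentage into allowed downtime per year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (non-leap year)

def allowed_downtime_seconds(uptime_pct: float) -> float:
    """Seconds of downtime per year permitted at a given uptime percentage."""
    return SECONDS_PER_YEAR * (1 - uptime_pct / 100)

for nines in (99.9, 99.99, 99.999):
    print(f"{nines}% uptime allows {allowed_downtime_seconds(nines):.0f} s/year of downtime")
```

At “five nines,” that works out to roughly five minutes of downtime per year, which is why lagging uptime deserves immediate attention.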
Lagging uptime is a key signal that performance is suboptimal; in complex, distributed systems it should be addressed swiftly, as degraded services deep within a network can cause severe performance issues further up the stack.
As a network’s uptime begins to be better described as “downtime,” there is more at stake than client expectations and user experience: the integrity of the network’s data. Malicious actors can target specific network devices that provide security layers, creating isolated points of vulnerability that can be difficult to detect if uptime telemetry isn’t being collected and analyzed.
Probably the hallmark signal for network specialists, a device’s bandwidth refers to its maximum capacity for data transfer. A more detailed picture of how a device handles its role in the network can be seen when bandwidth is coupled with a device’s throughput, the amount of data actually moving across a device in a given time frame.
Bandwidth and throughput telemetry are instrumental in capacity planning, identifying cyberattacks in their earliest stages, and providing meaningful baselines for optimization efforts.
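The relationship between the two metrics is simple: utilization is throughput divided by bandwidth. A minimal sketch of that calculation, assuming two SNMP-style octet counter samples (counters are cumulative, so we take the delta and handle a wrap):

```python
def link_utilization(octets_prev: int, octets_now: int,
                     interval_s: float, bandwidth_bps: float,
                     counter_bits: int = 64) -> float:
    """Percent utilization from two cumulative octet-counter samples.

    bandwidth_bps is the device's reported maximum (e.g. 1e9 for a
    1 Gbps port). Handles a single counter wrap between polls.
    """
    delta = octets_now - octets_prev
    if delta < 0:  # counter wrapped around its maximum between polls
        delta += 2 ** counter_bits
    throughput_bps = delta * 8 / interval_s  # octets -> bits per second
    return 100.0 * throughput_bps / bandwidth_bps

# 7.5 GB transferred in 60 s on a 1 Gbps link is exactly line rate:
print(link_utilization(0, 7_500_000_000, 60, 1e9))  # 100.0
```

Sustained utilization near 100% is a capacity-planning signal; a sudden, unexplained spike can be an early indicator of an attack.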
Monitoring a device’s CPU and memory utilization gives operators insight into several aspects of network health, allowing them to ask and answer key questions about load, saturation, and remaining headroom.
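One such question is whether high utilization is a momentary blip or a sustained condition worth acting on. A simple, illustrative check (the window and limit are assumptions to tune per environment):

```python
def sustained_high(samples: list[float], limit: float = 90.0,
                   window: int = 3) -> bool:
    """True if the most recent `window` utilization samples all exceed `limit`%."""
    return len(samples) >= window and all(s > limit for s in samples[-window:])

# One spike is ignored; three consecutive high readings are not.
print(sustained_high([50.0, 95.0, 60.0, 55.0]))        # False
print(sustained_high([50.0, 95.0, 96.0, 97.0]))        # True
```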
Interfaces provide entry points between devices and networks. Collecting error telemetry from these interfaces can give network operators insight into authorization issues or security threats.
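In practice, raw error counts matter less than the error rate relative to traffic volume. A hedged sketch of such a check; the 0.1% threshold is an illustrative assumption, not a standard:

```python
ERROR_RATE_THRESHOLD = 0.001  # 0.1% of packets; tune per network (assumption)

def should_alert(in_errors: int, in_packets: int,
                 threshold: float = ERROR_RATE_THRESHOLD) -> bool:
    """Flag an interface whose inbound error rate exceeds the threshold."""
    rate = in_errors / in_packets if in_packets else 0.0
    return rate > threshold
```

A busy interface with a handful of errors is normal; the same count on a quiet interface may warrant investigation, which is exactly why the rate, not the count, is the signal.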
Unfortunately, this telemetry is not very meaningful if left to its own devices. But, when incorporated into a big data approach like network observability, it provides robust statistical baselines for the network engineers and operators making decisions around cost, performance, reliability, and security.
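One common form such statistical baselining takes is flagging a reading that sits far outside the historical distribution. A minimal sketch using a z-score; the three-sigma threshold is a conventional starting point, not a prescription:

```python
from statistics import mean, stdev

def zscore(history: list[float], latest: float) -> float:
    """Standard deviations between `latest` and the historical mean."""
    mu, sigma = mean(history), stdev(history)
    return 0.0 if sigma == 0 else (latest - mu) / sigma

def is_anomalous(history: list[float], latest: float,
                 threshold: float = 3.0) -> bool:
    return abs(zscore(history, latest)) > threshold

baseline = [100.0, 102.0, 98.0, 101.0, 99.0]  # e.g. Mbps on a steady link
print(is_anomalous(baseline, 100.0))  # False: within normal variation
print(is_anomalous(baseline, 200.0))  # True: far outside the baseline
```

Real platforms use far richer models, but even this illustrates why a broad, unified dataset matters: the baseline is only as good as the history behind it.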
Even businesses of modest scale can generate petabytes of device telemetry data. Collecting, processing, and storing this data requires significant engineering efforts that can span entire organizations. Managing this data separately across distributed teams sets organizations up for failure or, at the very least, underperformance.
Assuming it can scale, a unified data and analytics platform provides a central, single source of truth, reducing the risk of miscommunication and accelerating incident resolution and optimization in large, complex networks where different teams are responsible for various aspects of network operations.
Unified data platforms also facilitate using machine learning and artificial intelligence algorithms to detect and alert on network issues automatically. Although the full utility of AI and ML in NetOps is emerging, having access to a unified data platform gives these technologies richer datasets.
In short, data platforms make the “big data” analysis central to network observability a possibility.
To see what network observability can do for you, get a Kentik demo today.