In the final entry of the Data Gravity series, Ted Turner outlines concrete examples of how network observability solves complex issues for scaling enterprises.
So far in this series, I’ve outlined how a scaling enterprise’s accumulation of data (data gravity) struggles against three consistent forces: cost, performance, and reliability. This struggle changes an enterprise; this is “digital transformation,” affecting everything from how business domains are represented in IT to software architectures, development and deployment models, and even personnel structures.
As datasets scale and networks become distributed to free their data, the data gravity story begins to morph into a data complexity story. The more distributed an enterprise’s data, the more heterogeneous its sources. This means moving more data more frequently, carrying out more complex transformations, and managing this data across more complex pathways. To accommodate this, new data infrastructures must be implemented to facilitate the data lifecycle across multiple zones, providers, customers, etc.
There are several main problems with this complexity for networks:
- The complexity worsens as networks scale.
- Finding the root cause of issues across these distributed networks takes time and effort, and is ultimately expensive.
- Cloud networks give operators a much wider surface area to protect against cyberattacks.
- Optimizing for cost, performance, and reliability is difficult with so many moving parts.
Solving these problems for distributed cloud networks has required a big data approach, ultimately resulting in the evolution of network observability.
Tenets of network observability
A detailed explanation of network observability itself is out of the scope of this article, but I want to focus on its core tenets before exploring a couple of brief case studies.
Network observability, when properly implemented, enables operators to:
- Ingest telemetry from every part of the network. The transition from monitoring to observability requires access to as many system signals as possible.
- Have full data context. Instrumenting network telemetry with business, application, and operational context gives operators multifaceted views of traffic and behavior.
- Ask any questions about their network. Rich context and real-time datasets allow network engineers to dynamically filter, drill down, and map networks as queries adjust.
- Leverage automated insights and response flows. The “big data” approach enables powerful automation features to initiate in-house or third-party workflows for performance and security anomalies.
- Engage with all of these features on a central platform. Features and insights siloed in different teams or services have limited impact. Unifying network data onto a single platform provides a single observability interface.
As mentioned earlier, engineering data pipelines for the intense flow of telemetry is a huge component of making systems observable. Extract, transform, and load (ETL) systems are used to modify the data received and coordinate it with other data streams. Often this is the first tier of “enriching the data,” where correlations between network details like IP addresses, DNS names, application stack tags, and deployment versions can be made.
These ETL/ELT servers can provide a standard method of synthesis, which can be replicated across as many servers as it takes to ingest all of the data. This clustering of servers at the beginning of the pipeline enables the growth of data sets beyond the capabilities of most legacy commercial or open source data systems. It represents the technical foundation of observability features like a centralized interface, network context, network maps, and the ability to “ask anything” about a network.
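As a concrete (and deliberately simplified) illustration of that first enrichment tier, the sketch below joins raw flow records against DNS and application metadata. All field names, IP addresses, and lookup tables are hypothetical; a production pipeline would pull this context from live DNS, CMDB, or deployment systems.

```python
# Minimal sketch of first-tier enrichment: attaching DNS names and
# application-stack tags to raw flow records. Lookup tables here are
# hard-coded stand-ins for real context sources.

DNS_NAMES = {"10.0.1.5": "checkout.internal", "10.0.2.9": "inventory.internal"}
APP_TAGS = {"checkout.internal": {"stack": "payments", "version": "v2.3"}}

def enrich(flow: dict) -> dict:
    """Return a copy of a raw flow record with DNS and app context attached."""
    enriched = dict(flow)
    name = DNS_NAMES.get(flow["dst_ip"])
    if name:
        enriched["dst_name"] = name
        enriched["app"] = APP_TAGS.get(name, {})
    return enriched

raw = {"src_ip": "10.0.9.1", "dst_ip": "10.0.1.5", "bytes": 48_000}
print(enrich(raw))
```

Replicating a stateless function like this across a cluster of ingest servers is what lets the pipeline scale horizontally with telemetry volume.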
Two case studies: Network observability in action
Ideas about networks are great, but nothing beats seeing them play out in production. Here are two scenarios from current and former clients managing data gravity, along with how network observability could have been used to triage or prevent each issue.
Unexpected traffic patterns
For the first case study, I want to discuss an international DevOps team running a 50-node Kubernetes cluster. The question they brought to us: which nodes/pods were ultimately responsible for pushing traffic from region to region? They were looking for more information about two pain points: exorbitant inter-region costs and degrading performance.
A hallmark of cloud networking is that many networking decisions are 1) limited by the provider and 2) simultaneously made by developers at many different points (application, orchestration, reliability, etc.). This can make for a very opaque and inconsistent traffic narrative for network operators.
It turned out that there was a significant transaction set in region one, but the K8s cluster was deployed with insufficient constraints; automated provisioning pushed the whale traffic flows toward another, less utilized (and more distant) region. Without nuanced oversight, this reliability measure (going multi-region) increased latency and degraded performance. And by not prioritizing the movement of smaller traffic flows, the automated networking decisions attached to this reliability setup proved very expensive.
With a network observability solution in place, these mysterious traffic flows would have been mapped and quickly accessible. Assuming proper instrumentation, observability’s use of rich context would have made short work of identifying the K8s components involved.
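For illustration, here is a minimal sketch of the kind of query an observability platform answers in this scenario: aggregating enriched flow records by region pair and pod to surface the workloads driving inter-region traffic. The records, pod names, and regions are invented for the example.

```python
# Aggregate bytes per (src_region, dst_region, pod), keeping only
# cross-region flows -- the ones that incur inter-region transfer costs.
from collections import defaultdict

flows = [
    {"pod": "checkout-7f9", "src_region": "us-east", "dst_region": "us-east", "bytes": 10_000},
    {"pod": "replicator-2c1", "src_region": "us-east", "dst_region": "eu-west", "bytes": 9_500_000},
    {"pod": "replicator-2c1", "src_region": "us-east", "dst_region": "eu-west", "bytes": 8_200_000},
]

totals = defaultdict(int)
for f in flows:
    if f["src_region"] != f["dst_region"]:
        totals[(f["src_region"], f["dst_region"], f["pod"])] += f["bytes"]

# Rank the cross-region flows; the top entries are the pods to constrain.
for key, byte_count in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(key, byte_count)
```

In a real platform this aggregation runs continuously over the enriched telemetry stream, so the "whale" flows show up on a network map rather than in a script's output.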
In another case, a large retailer with a global footprint asked for help locating some “top talkers” on their private network. With some regularity, one (or multiple) of their internal services inadvertently shut down the network. Despite what was supposed to be a robust and expansive series of ExpressRoutes with Azure, the retailer’s system kept experiencing cascading failures.
After quite a bit of sleuthing, it became clear that site reliability engineers (SREs) were implementing data replication pathways (distributing data gravity) that were inadvertently causing cascading failures because of incorrect bandwidth assumptions.
This scenario highlighted the need for the following:
- A platform to provide a comprehensive, single source of truth to unite siloed engineering efforts
- Network maps and visualizations to quickly diagnose traffic flows, bottlenecks, and complex connections
- Rich context like service and network metadata to help isolate and identify top talkers
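A toy sketch of that last point: once service metadata is attached to traffic, flagging top talkers can be as simple as comparing per-service throughput against an assumed circuit capacity. The service names, rates, and capacity figure below are illustrative, not the retailer's actual numbers.

```python
# Flag services whose sustained throughput approaches an assumed
# circuit capacity. All figures are hypothetical.
LINK_CAPACITY_MBPS = 1_000   # assumed private-circuit capacity
ALERT_THRESHOLD = 0.8        # alert at 80% utilization

observed_mbps = {
    "sql-replication": 850,
    "object-sync": 120,
    "web-frontend": 40,
}

top_talkers = {
    svc: rate for svc, rate in observed_mbps.items()
    if rate >= ALERT_THRESHOLD * LINK_CAPACITY_MBPS
}
print(top_talkers)
```

Here the replication traffic alone nearly saturates the link, which is exactly the kind of finding that turned the retailer's "mystery outages" into an addressable capacity-planning problem.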
New data gravity concerns with observability
While I am quick to celebrate the pragmatism of a network observability implementation, in any data gravity discussion it has to be pointed out that the significant ingress of network telemetry is itself challenging to handle. The highly scalable model of today’s observability data pipelines requires robust and extensive reliability frameworks (similar to those of their parent networks).
Besides the inherent transit and storage costs involved in this additional level of reliability, observed systems present other constraints that engineers need to negotiate:
- Devices being monitored can be pestered to death; SNMP queries for many OIDs (object IDs) can cause CPU or memory pressure on the system being monitored.
- Devices with low CPU/memory capacity, like network equipment, can sometimes cause outages during high-bandwidth events (DDoS attacks, large volumetric data transfers, many customers accessing network resources concurrently, etc.).
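One common mitigation for the first constraint is simply polling politely: requesting OIDs in small batches and pausing between batches so a low-powered device is never flooded. The sketch below uses a placeholder `snmp_get` function standing in for whatever SNMP client library you actually use; batch sizes and pauses are illustrative.

```python
# Sketch of a "polite" SNMP poller: cap the OIDs per request and
# space out batches so the monitored device's CPU gets room to breathe.
import time

def snmp_get(device: str, oids: list[str]) -> dict:
    # Placeholder: a real implementation would issue an SNMP GET here.
    return {oid: 0 for oid in oids}

def poll(device: str, oids: list[str], batch_size: int = 10, pause_s: float = 0.5) -> dict:
    """Query OIDs in small batches with a pause between each batch."""
    results = {}
    for i in range(0, len(oids), batch_size):
        results.update(snmp_get(device, oids[i:i + batch_size]))
        time.sleep(pause_s)
    return results

# Example: poll 25 interface-counter OIDs from a hypothetical edge router.
stats = poll("edge-router-1", [f"1.3.6.1.2.1.2.2.1.10.{i}" for i in range(25)], pause_s=0.05)
```

Tuning `batch_size` and `pause_s` per device class trades collection latency for device safety, which is usually the right trade on low-capacity network gear.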
Observability is about saving precious time and optimizing your network, but it carries a significant data footprint of its own. Many of the same hidden costs associated with cloud networking can creep up as the network observability platform scales and achieves “mission critical” status.
Data gravity series conclusion
For the scaling enterprise, data gravity represents a severe challenge for application, network, and data engineers. Distributing this data gravity across multiple DCs, zones, and providers offers organizations a competitive edge in pursuit of lower costs, higher performance, and more reliable systems. But this distributed data leads to a complex networking infrastructure that, at scale, can become an availability and security nightmare.
The best way to manage this complexity is with a network observability platform.
Want to talk to a cloud networking professional about data gravity concerns in your network? Reach out to Kentik today.