In the final entry of the Data Gravity series, Ted Turner outlines concrete examples of how network observability solves complex issues for scaling enterprises.
So far in this series, I’ve outlined how a scaling enterprise’s accumulation of data (data gravity) struggles against three consistent forces: cost, performance, and reliability. This struggle changes an enterprise; this is “digital transformation,” affecting everything from how business domains are represented in IT to software architectures, development and deployment models, and even personnel structures.
As datasets scale and networks become distributed to free their data, the data gravity story begins to morph into a data complexity story. The more distributed an enterprise’s data, the more heterogeneous its sources. This means moving more data more frequently, carrying out more complex transformations, and managing this data across more complex pathways. To accommodate this, new data infrastructures must be implemented to facilitate the data lifecycle across multiple zones, providers, customers, etc.
This complexity creates real problems for networks, and solving them for distributed cloud networks has required a big data approach, ultimately resulting in the evolution of network observability.
A detailed explanation of network observability itself is out of the scope of this article, but I want to focus on its core tenets before exploring a couple of brief case studies.
Network observability, when properly implemented, gives operators a centralized, context-rich view of the network and the ability to “ask anything” about its state.
As mentioned earlier, engineering data pipelines for the intense flow of telemetry is a huge component of making systems observable. Extract, transform, and load (ETL) systems are used to modify the data received and coordinate it with other data streams. Often this is the first tier of “enriching the data,” where correlations between network details like IP addresses, DNS names, application stack tags, and deployment versions can be made.
These ETL/ELT servers can provide a standard method of synthesis, which can be replicated across as many servers as it takes to ingest all of the data. This clustering of servers at the beginning of the pipeline enables the growth of data sets beyond the capabilities of most legacy commercial or open source data systems. It represents the technical foundation of observability features like a centralized interface, network context, network maps, and the ability to “ask anything” about a network.
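The enrichment tier described above can be sketched in a few lines. This is a minimal, hypothetical illustration, assuming invented field names and lookup tables rather than any specific vendor's schema:

```python
# Hypothetical sketch of the first enrichment tier: correlating raw flow
# records with network context (DNS names, application tags, versions).
# All field names and lookup tables are illustrative assumptions.

RAW_FLOWS = [
    {"src_ip": "10.0.1.7", "dst_ip": "10.0.2.9", "bytes": 1_200_000},
    {"src_ip": "10.0.2.9", "dst_ip": "10.0.1.7", "bytes": 40_000},
]

# Context tables an observability pipeline might maintain.
DNS = {"10.0.1.7": "checkout.internal", "10.0.2.9": "inventory.internal"}
APP_TAGS = {
    "checkout.internal": {"stack": "payments", "version": "v2.3.1"},
    "inventory.internal": {"stack": "fulfillment", "version": "v1.9.0"},
}

def enrich(flow):
    """Attach DNS names and application context to a raw flow record."""
    out = dict(flow)
    for side in ("src", "dst"):
        name = DNS.get(flow[f"{side}_ip"], "unknown")
        out[f"{side}_name"] = name
        out[f"{side}_app"] = APP_TAGS.get(name, {})
    return out

enriched = [enrich(f) for f in RAW_FLOWS]
```

Because each record is enriched independently, this step parallelizes cleanly, which is what makes the clustered ingest model described above possible.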
Ideas about networks are great, but nothing beats seeing them play out in production. Here are two scenarios from current and former clients managing data gravity, along with how network observability could have been used to triage or prevent each issue.
For the first case study, I want to discuss an international DevOps team running a 50-node Kubernetes cluster. The question they brought to us: which nodes and pods were ultimately responsible for pushing traffic from region to region? They were looking for insight into two pain points: exorbitant inter-region costs and degrading performance.
A hallmark of cloud networking is that many networking decisions are made automatically by the provider and orchestration layers, abstracted away from operators.
It turned out that there was a significant transaction set in region one, but the K8s cluster was deployed with insufficient constraints; automated provisioning pushed the whale traffic flows toward another, less utilized (and farther) region. Without nuanced oversight, this reliability measure (going multi-region) increased latency and degraded performance. And because the automated networking decisions attached to this setup did not prioritize moving the smaller traffic flows, they also proved very expensive.
With a network observability solution in place, these mysterious traffic flows would have been mapped and quickly accessible. Assuming proper instrumentation, observability’s use of rich context would have made short work of identifying the K8s components involved.
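The kind of question this team asked, “who is pushing traffic between regions?”, reduces to a simple aggregation over enriched flow records. A minimal sketch, assuming hypothetical field names (`pod`, `src_region`, `dst_region`) rather than any particular product's data model:

```python
# Illustrative sketch: attribute inter-region traffic to the pods that
# generate it, using enriched flow records. Field names are assumptions,
# not a specific vendor schema.
from collections import defaultdict

FLOWS = [
    {"pod": "orders-7f9", "src_region": "us-east", "dst_region": "us-west", "bytes": 9_000_000},
    {"pod": "orders-7f9", "src_region": "us-east", "dst_region": "us-west", "bytes": 6_000_000},
    {"pod": "cache-2ab",  "src_region": "us-east", "dst_region": "us-east", "bytes": 50_000_000},
]

def inter_region_top_talkers(flows):
    """Sum bytes per pod for flows that cross a region boundary,
    largest first."""
    totals = defaultdict(int)
    for f in flows:
        if f["src_region"] != f["dst_region"]:
            totals[f["pod"]] += f["bytes"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Note that the pod moving the most bytes overall (`cache-2ab` here) is not necessarily the one driving inter-region cost; the region-boundary filter is what surfaces the expensive flows.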
In another case, a large retailer with a global footprint asked for help locating some “top talkers” on their private network. With some regularity, one (or more) of their internal services would inadvertently take down the network. Despite what was supposed to be a robust and expansive series of Azure ExpressRoute circuits, the retailer’s system kept experiencing cascading failures.
After quite a bit of sleuthing, it became clear that site reliability engineers (SREs) were implementing data replication pathways (distributing data gravity) that inadvertently caused cascading failures because of faulty bandwidth assumptions.
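The check that would have caught this is conceptually simple: compare per-service traffic on a circuit against its provisioned bandwidth. A hedged sketch with invented service names and numbers (not the retailer's actual topology):

```python
# Hypothetical check: flag services whose traffic alone consumes a large
# share of a circuit, and detect oversubscription of the link overall.
# Capacity, rates, and service names are all invented for illustration.

CIRCUIT_CAPACITY_MBPS = 1_000  # e.g. a 1 Gbps private circuit

SERVICE_RATES_MBPS = {
    "inventory-replication": 700,
    "image-sync": 450,
    "metrics-export": 40,
}

def saturating_talkers(rates, capacity, threshold=0.5):
    """Return (heavy_services, oversubscribed): services individually
    above `threshold` of capacity, and whether combined load exceeds it."""
    heavy = sorted(
        (svc for svc, rate in rates.items() if rate > capacity * threshold),
        key=rates.get,
        reverse=True,
    )
    return heavy, sum(rates.values()) > capacity
```

Here no single assumption looks unreasonable in isolation, but the combined replication load (1,190 Mbps) exceeds the circuit, which is exactly the pattern behind cascading failures like the one described above.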
This scenario highlighted the need for continuous visibility into top talkers and for validating the bandwidth assumptions behind replication pathways.
While I am quick to celebrate the pragmatism of a network observability implementation, it has to be pointed out in any data gravity discussion that the significant ingress of network telemetry is itself challenging to handle. The highly scalable model of today’s observability data pipelines requires robust and extensive reliability frameworks, similar to those of their parent networks.
Besides the inherent transit and storage costs involved in this additional level of reliability, observed systems present other constraints that engineers need to negotiate.
Observability is meant to save precious time and optimize your network, but because of its significant data footprint, many of the same hidden costs associated with cloud networking can creep up as the network observability platform scales and achieves “mission critical” status.
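One common lever for keeping that telemetry footprint in check is flow sampling: keep one in N records deterministically and scale the byte counts so aggregates remain estimable. A minimal sketch of the idea, not a description of any specific product's sampling feature:

```python
# Sketch of deterministic 1-in-N flow sampling as a telemetry
# cost-control lever. The hashing key and record shape are assumptions.
import zlib

def sample(flows, rate_n=10):
    """Keep roughly 1/rate_n of flows, chosen deterministically by a
    hash of the src/dst pair, scaling bytes by rate_n so that totals
    over the sample approximate totals over the full stream."""
    kept = []
    for f in flows:
        key = f"{f['src_ip']}->{f['dst_ip']}".encode()
        if zlib.crc32(key) % rate_n == 0:
            scaled = dict(f)
            scaled["bytes"] = f["bytes"] * rate_n
            kept.append(scaled)
    return kept
```

Hashing the endpoint pair (rather than sampling randomly) keeps the decision stable across ingest servers, so the same conversation is either always sampled or never sampled, which matters when records are enriched and correlated downstream.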
For the scaling enterprise, data gravity represents a severe challenge for application, network, and data engineers. Distributing this data gravity across multiple DCs, zones, and providers offers organizations a competitive edge in pursuit of lower costs, higher performance, and more reliable systems. But this distributed data leads to a complex networking infrastructure that, at scale, can become an availability and security nightmare.
The best way to manage this complexity is with a network observability platform.
Want to talk to a cloud networking professional about data gravity concerns in your network? Reach out to Kentik today.