Getting the most out of cloud networks requires new tools and strategies, captured in the idea of network observability. Read the final entry in our series on managing the hidden costs of cloud networking.
I used the first two parts of this series to lay out my case for how and why cloud-based networks can effectively “Trojan horse” costs into your networking spend, and to highlight some real-world instances I’ve come across in my career.
In the third and final installment of this series, I want to focus on ways you can optimize how you organize your people and your cloud infrastructure to prevent or offset some of these novel costs. When considering network optimizations, I like to group them into three categories: cost, performance, and reliability.
As engineering goals, cost, performance, and reliability are often at odds with one another. The cheapest path for your traffic is often not the most performant, and a highly reliable network may not be the most cost-effective. Given these natural trade-offs, how do we decide what to prioritize? Cost, performance, and reliability must be constantly balanced against clearly defined business objectives.
The first step is understanding what kind of cloud network your customers and business objectives demand. Does your network infrastructure need to scale dynamically to handle highly variable traffic volumes? Should your network span multiple zones, and if so, how should your network configurations differ across those zones? Which cloud provider, or combination of providers, is best for your infrastructure? What peering strategy makes the most sense, and for which customers? Are your resources, cloud or otherwise, being used in the cheapest, most performant, and most reliable way? And, of course, is this constantly shifting network secure?
Being able to answer these and similar questions about your network on a regular basis is the foundation of any serious attempt at cloud optimization. Unfortunately for network operators, the scale and complexity of networking in cloud, hybrid cloud, and multi-cloud environments have largely made traditional network monitoring strategies obsolete. Network observability has emerged to fill this gap. It is a body of tools and strategies designed to help you answer any question about your networks.
Optimizing your org chart
Before I go over some optimizations that network observability makes possible, I want to take a couple of steps back up the development chain and discuss some best practices at the organizational level. Cloud networks are certainly “software problems.” Still, my experience with customers and my own work history have shown me that how a business organizes itself around its software problems is a crucial component of success.
Many software stacks have transformed from monoliths into distributed, service-oriented architectures. In the same way, the modern network has evolved from a single, monolithic data center into a distributed web of networks with service-specific configurations and infrastructure. If you’ll recall from Part I, one of the benefits of working with cloud resources is the pace of development enabled by loosely coupled services with independent architectures (and networks). This is undoubtedly an advantage, but the same independence can lead to siloed decision-making and to costly networking oversights and miscommunications.
In this kind of development environment, I’ve found it essential to have a platform-level NetOps team that maintains visibility across the many service-level networks. This organizational layer establishes clear ownership of networking issues that span services or extend beyond them. It also gives a network observability tool a dedicated team to manage it.
Optimizing your cloud network
DevOps observability relies on metrics, logs, and traces. As a complement, network observability takes additional, network-specific data into account: flow logs, network hardware telemetry, prefixes, paths, underlays and overlays, software-defined networks, and more. A good network observability tool incorporates this data and gives platform-level teams real-time visibility into the state of the network and its impact on the business.
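Flow logs are among the most common of these network-specific data sources. The snippet below is a minimal sketch of turning one flow log record into a structured object. It assumes the default (version 2) AWS VPC Flow Logs format, whose 14 space-separated fields are listed in `V2_FIELDS`. Other providers use different schemas, and the `FlowRecord` type and `parse_v2_record` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Field order of the default (version 2) AWS VPC Flow Logs record format.
V2_FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)

@dataclass
class FlowRecord:
    """Hypothetical normalized flow record used for illustration."""
    src: str
    dst: str
    dst_port: int
    protocol: int  # IANA protocol number, e.g. 6 = TCP
    bytes: int
    action: str    # ACCEPT or REJECT

def parse_v2_record(line: str) -> Optional[FlowRecord]:
    """Parse one default-format record; return None if it carries no traffic data."""
    parts = line.split()
    if len(parts) != len(V2_FIELDS):
        raise ValueError(f"expected {len(V2_FIELDS)} fields, got {len(parts)}")
    fields = dict(zip(V2_FIELDS, parts))
    # NODATA and SKIPDATA records use "-" placeholders instead of addresses and counters.
    if fields["log_status"] != "OK":
        return None
    return FlowRecord(
        src=fields["srcaddr"],
        dst=fields["dstaddr"],
        dst_port=int(fields["dstport"]),
        protocol=int(fields["protocol"]),
        bytes=int(fields["bytes"]),
        action=fields["action"],
    )
```

An observability pipeline would apply a step like this to millions of records and then enrich and aggregate the results. Real tools also handle custom log formats, which this sketch does not.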
The visibility provided by such a tool allows NetOps to answer any question about the network and forms the technical foundation for optimizing away your cloud network’s “hidden costs.”
Optimizing network costs
Keeping track of costs can get tricky in cloud networks that involve a range of provider contracts, variable cost models, router configurations, and peering relationships. An effective network observability tool can provide actionable insights by letting you overlay configurable financial models on your traffic accounting data. It’s one thing to see that some traffic was dropped. It’s quite another to be shown immediately which customers were affected, which SLAs might have been breached, and what the dollars-and-cents impact is in real time.
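To make "overlaying a financial model" concrete, here is a minimal sketch that attaches a per-GB price to each traffic record and totals the cost per customer. The rates, provider names, and path types in `RATES_PER_GB` are invented placeholders, not real prices. The sketch also treats 1 GB as 10^9 bytes for simplicity. Actual contracts may involve tiered pricing, commits, or 95th-percentile billing that a flat rate table cannot capture.

```python
from collections import defaultdict

# Hypothetical $/GB rates keyed by (provider, path type); substitute your own contract terms.
RATES_PER_GB = {
    ("cloud-a", "internet_egress"): 0.09,
    ("cloud-a", "inter_region"): 0.02,
    ("transit-x", "transit"): 0.004,
    ("peer-y", "settlement_free_peering"): 0.0,
}

def cost_by_customer(flows):
    """Sum estimated cost per customer.

    `flows` is an iterable of (customer, provider, path_type, byte_count) tuples.
    Unknown (provider, path_type) pairs raise KeyError, so gaps in the model surface loudly.
    """
    totals = defaultdict(float)
    for customer, provider, path_type, nbytes in flows:
        rate = RATES_PER_GB[(provider, path_type)]
        totals[customer] += nbytes / 1e9 * rate  # bytes -> GB (decimal) -> dollars
    return dict(totals)
```

With a model like this in place, the same traffic can be re-priced under a different peering or routing choice, which is the comparison that makes costly paths visible.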
By surfacing immediate, actionable insights into the real costs of networking decisions, network observability can help you optimize costs by:
- Spotting and avoiding inefficient and costly traffic routing
- Implementing peering policies that protect you from overspending
- Avoiding congested and degraded paths
- Identifying the specific applications affected when a path is impacted
- Identifying impacted availability zones, regions (cloud PaaS), or geographies (ISP, WAN, data center, campus, retail location…)
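The last two items above boil down to a join between "which paths are degraded" and "which traffic uses those paths." The snippet below is a minimal sketch of that join. It assumes flow records have already been enriched with an application, a zone, and a path label. The `Flow` type and `impact_of` function are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set

class Flow(NamedTuple):
    """Hypothetical enriched flow: sending application, its zone, the path used, and volume."""
    app: str
    zone: str
    path: str
    bytes: int

def impact_of(degraded_paths: Set[str], flows: Iterable[Flow]) -> Dict[str, Dict[str, int]]:
    """Return bytes per affected application, broken down by zone, for flows on degraded paths."""
    impact: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for f in flows:
        if f.path in degraded_paths:
            impact[f.app][f.zone] += f.bytes
    # Convert nested defaultdicts to plain dicts for a clean return value.
    return {app: dict(zones) for app, zones in impact.items()}
```

For example, if `transit-x` degrades, this returns only the applications and zones that actually send traffic over it, along with how much traffic each one sends.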
Optimizing network performance
As mentioned above, DevOps-centric observability focuses on log, metric, and trace telemetry and analyzes this data to provide useful ways of exploring performance in distributed systems. By combining visualizations such as service maps with statistical analysis of key signals, DevOps teams can do more than surface root causes when issues arise. They can also generate the data needed to establish meaningful performance baselines and thresholds to optimize against.
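A baseline and threshold can be as simple as a few percentiles over historical measurements. The snippet below is a minimal sketch, assuming you have a list of latency samples in milliseconds for one path or service. The 1.2 headroom multiplier is an arbitrary illustrative choice, not a recommended value. It uses the standard library's `statistics.quantiles`.

```python
import statistics

def latency_baseline(samples_ms, headroom=1.2):
    """Return (p50, p95, alert_threshold) computed from historical latency samples.

    The alert threshold is p95 scaled by an illustrative headroom factor.
    """
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 cut points; cuts[k - 1] is the k-th percentile
    p50, p95 = cuts[49], cuts[94]
    return p50, p95, p95 * headroom
```

Percentiles are used here rather than the mean because latency distributions tend to be long-tailed, and a handful of slow requests can skew an average.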
A good network observability tool ingests network performance telemetry from your private, cloud, multi-cloud, or hybrid cloud infrastructure. It adds contextual details such as customer, cloud provider, region, or traffic type to this telemetry, giving you the complete picture of your data when weighing optimization strategies.
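Adding context usually means mapping IP addresses to business metadata. The snippet below is a minimal sketch of that step using Python's `ipaddress` module. It takes the most specific (longest) matching prefix, as routers do. The prefixes, providers, regions, and customer names in `PREFIX_CONTEXT` are hypothetical. In practice that mapping might come from an IPAM system, cloud provider APIs, or a CMDB.

```python
import ipaddress

# Hypothetical prefix-to-context table; the values are placeholders for illustration.
PREFIX_CONTEXT = [
    (ipaddress.ip_network("10.1.0.0/16"), {"provider": "cloud-a", "region": "us-east-1", "customer": "acme"}),
    (ipaddress.ip_network("10.1.5.0/24"), {"provider": "cloud-a", "region": "us-east-1", "customer": "globex"}),
    (ipaddress.ip_network("172.16.0.0/12"), {"provider": "on-prem", "region": "dc-1", "customer": "internal"}),
]

def enrich(flow: dict) -> dict:
    """Return a copy of `flow` tagged with context for its destination (longest prefix wins)."""
    dst = ipaddress.ip_address(flow["dst"])
    matches = [(net, ctx) for net, ctx in PREFIX_CONTEXT if dst in net]
    enriched = dict(flow)
    if matches:
        _, ctx = max(matches, key=lambda m: m[0].prefixlen)
        enriched.update(ctx)
    else:
        enriched["customer"] = "unknown"
    return enriched
```

A linear scan is fine for a sketch. At production scale, a radix tree or a similar prefix-matching structure would be the usual choice.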
One of the most powerful ways network observability can help you optimize performance is by allowing you to ask (and answer) any question about your cloud network:
- What is the most performant path for my traffic?
- Do we need to invest in more hardware to meet capacity demands, or can we improve routing or peering efficiency for our cloud on-ramps?
- Is my traffic optimally distributed across cloud regions?
- What is our highest priority traffic?
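As an illustration of the first question above, here is a minimal sketch that picks the "most performant" path from probe measurements. It defines performance as the lowest median round-trip time among paths whose packet loss stays under a cap. Both that definition and the 1% loss cap are assumptions made for illustration, and the path names in any input are hypothetical.

```python
import statistics

def best_path(measurements, max_loss=0.01):
    """Pick the path with the lowest median latency among paths with acceptable loss.

    `measurements` maps a path name to a list of (latency_ms, lost) probe results,
    where `lost` is True if the probe got no reply. Returns None if no path qualifies.
    """
    candidates = {}
    for path, probes in measurements.items():
        if not probes:
            continue
        loss = sum(1 for _, lost in probes if lost) / len(probes)
        rtts = [ms for ms, lost in probes if not lost]
        if loss <= max_loss and rtts:
            candidates[path] = statistics.median(rtts)
    if not candidates:
        return None
    return min(candidates, key=candidates.get)
```

The loss filter matters. A path with great latency but frequent drops would win on latency alone, yet it can still perform worse for real traffic.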
Optimizing network reliability
It’s important to have a performant network, and it’s very important to have a cost-effective one, but it is absolutely critical to have a reliable network. Being up and running is a prerequisite for both performance and cost efficiency.
Network reliability has two main facets: security and availability. The baselines and thresholds you establish through performance optimization, combined with a powerful analytics engine, can give you advance warning of capacity or security issues. An effective network observability tool can detect outliers in traffic quality or source as soon as they appear. Automated processes can then be triggered to reroute traffic or re-provision resources accordingly.
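The detect-then-act loop described above can be sketched in a few lines. The snippet below flags a new measurement (for example, traffic volume on a link) as an outlier when it exceeds the historical mean by more than `k` standard deviations. It then calls a callback that stands in for your automation. The one-sided check, the default `k = 3`, and the `on_anomaly` hook are illustrative assumptions. Real systems typically account for seasonality and use more robust statistics.

```python
import statistics
from typing import Callable, Sequence

def check_and_react(history: Sequence[float], current: float,
                    on_anomaly: Callable[[float, float], None], k: float = 3.0) -> bool:
    """Flag `current` if it exceeds mean + k * stdev of `history`, then call `on_anomaly`.

    `on_anomaly(value, threshold)` is a placeholder for automation such as a
    hypothetical reroute or re-provisioning hook. Only upward spikes are checked.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    threshold = statistics.fmean(history) + k * statistics.stdev(history)
    if current > threshold:
        on_anomaly(current, threshold)
        return True
    return False
```

Keeping the reaction behind a callback separates detection from remediation, so the same check can page a human in one environment and trigger a reroute in another.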
Failures will happen. As a network engineer, you must ensure that systems of record, sources of truth, and customer and internal data are accurate and available in multiple regions. Validating that data stores can communicate over the paths between them without errors or excessive latency ensures that data replication strategies can execute. When errors or latency do crop up, as they inevitably will, knowing that the data stores are out of sync is critical to making service restoration choices. Unknowingly writing stale data into a data store could be devastating to a business.
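One way to avoid restoring from stale data is to compare how current each replica is before choosing a source. The snippet below is a minimal sketch. It assumes you can obtain, for each replica, the timestamp of the most recent write it has applied. How you get that value is deployment-specific, and the replica names and lag tolerance in any input are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List

def fresh_replicas(last_applied: Dict[str, datetime], max_lag: timedelta) -> List[str]:
    """Return replicas whose last applied write is within `max_lag` of the newest replica's.

    `last_applied` maps a replica (e.g. a region) to the timestamp of the most recent
    write it has applied. Replicas outside the tolerance are treated as stale.
    """
    if not last_applied:
        return []
    newest = max(last_applied.values())
    return sorted(r for r, ts in last_applied.items() if newest - ts <= max_lag)
```

This only measures lag relative to the most up-to-date replica. If every replica has fallen behind the true source of truth, you need a separate check against that source.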
Working in the cloud isn’t for every business, and if cost savings are your sole motivation for migrating, you should be wary. If your primary goals are scalability, availability, and an infrastructure that supports distributed, service-oriented development and deployment, running your network in the cloud can significantly improve cost, performance, and reliability.
For networks at scale, managing the complexity of public, hybrid, or multi-cloud traffic and infrastructure in a way that captures actionable insights and deepens business value is a tall order. Getting the most out of these cloud networks requires new tools and strategies, captured in the idea of network observability. Network observability tools such as Kentik take a big data approach and center on the ability to ask and answer any question about the network. In doing so, they offer NetOps teams and engineers an unrivaled way to secure, optimize, and strengthen their cloud networks. Get a demo and see for yourself.