Hyperscale data centers are true marvels of the age of analytics, enabling a new era of cloud-scale computing that leverages Big Data, machine learning, cognitive computing and artificial intelligence. Architected to scale up smoothly in order to accommodate increasing demand, these massive data centers are based on modular designs that allow operators to easily add compute, memory, storage and networking resources as needed. Yet massive scale creates network visibility challenges unlike those faced by operators of existing enterprise data centers based on the classic three-tier architecture.

Hyperscale data centers achieve massive scale by racking and stacking cost-effective, commodity hardware platforms like those specified by the Open Compute Project. With thousands of servers built on multicore processors (each with up to 32 CPU cores on the latest Intel Xeon processors), these data centers offer staggering compute capacity. Advanced virtualization techniques enable them to execute hundreds of thousands to millions of individual workloads.

To further complicate matters, these workloads are highly distributed and dynamic. Container-based applications composed of numerous discrete microservices typically span multiple servers and racks within a data center. Auto-scaling orchestration mechanisms spin workload instances up and down as needed, consuming available compute capacity on demand and distributing workloads in a virtual topology with little correlation to the underlying physical topology. At the same time, workloads that access data sets stored on multiple servers within the data center, and even on servers in other data centers, can generate a high volume of internal traffic. The net result is constantly shifting and unpredictable “east-west” traffic flow patterns traversing the leaf-spine switching fabric inside and between data centers.

In a classic three-tier data center, traffic flows predominantly “north-south” from the ingress/egress point through load balancers, web servers and application servers. In this architecture, it is straightforward to identify bottlenecks and performance anomalies. In hyperscale data centers, the trend is for only about 15% of traffic to flow north-south while the remaining 85% flows east-west. At the same time, the overall flow of data is increasing such that the link speed of typical spine connections is increasing from 10G to 40G and moving to 100G. The sheer volume of data traffic and the number of flows is such that existing methods for instrumenting and monitoring classic data center networks either won’t work or are not cost-effective in the hyperscale domain.
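To make the north-south versus east-west distinction concrete, here is a minimal Python sketch that labels flow records by direction and computes each direction's share of bytes. It rests on simplifying assumptions that are not from the article: the `10.0.0.0/8` prefix stands in for all addresses inside the data center, the helper names are hypothetical, and the two sample flows are invented purely to illustrate an 85/15 split. Traffic between data centers, which the article also counts as east-west, would need additional prefixes.

```python
import ipaddress

# Hypothetical address space for hosts inside the data center (an assumption).
INTERNAL_NETS = [ipaddress.ip_network("10.0.0.0/8")]


def is_internal(addr: str) -> bool:
    """Return True if the address falls inside one of the internal prefixes."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in INTERNAL_NETS)


def classify(src: str, dst: str) -> str:
    """East-west if both endpoints are internal, otherwise north-south."""
    if is_internal(src) and is_internal(dst):
        return "east-west"
    return "north-south"


def traffic_split(flows):
    """Take (src, dst, byte_count) records and return each direction's byte share."""
    totals = {"east-west": 0, "north-south": 0}
    for src, dst, nbytes in flows:
        totals[classify(src, dst)] += nbytes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {direction: 0.0 for direction in totals}
    return {direction: count / grand_total for direction, count in totals.items()}


# Invented sample records, chosen only to illustrate the calculation.
sample_flows = [
    ("10.1.1.5", "10.2.7.9", 850),     # server to server
    ("203.0.113.7", "10.1.1.5", 150),  # external client to server
]
split = traffic_split(sample_flows)
```

In a real deployment the flow records would come from a flow-export or telemetry pipeline rather than a hard-coded list, and the volume of records is precisely what makes this simple bookkeeping hard at hyperscale.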

Massive scale presents data center operators with new types of network visibility and performance management challenges.

With thousands of servers interconnected via a leaf-spine switching architecture and traffic flowing predominantly east-west, the resulting network topologies are so complex that most operators employ BGP routing within the data center, with equal-cost multi-path routing (ECMP) spreading flows across the available equal-cost paths. Operators need new tools for gaining visibility into these topologies and for continuously monitoring traffic flows in order to discover bottlenecks and detect anomalies rapidly enough to take corrective action in real time.
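ECMP implementations commonly choose among equal-cost next hops by hashing fields of each flow's header, which is one reason the path a given flow takes is hard to predict from the topology alone. The Python sketch below illustrates the idea under stated assumptions: the function name, the spine names and the use of SHA-256 are illustrative choices, not how any particular switch computes its hash.

```python
import hashlib


def ecmp_next_hop(flow, next_hops):
    """Pick one of several equal-cost next hops by hashing the flow's 5-tuple.

    flow is (src_ip, dst_ip, protocol, src_port, dst_port).
    SHA-256 is a stand-in here; real switches use their own hash functions.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical spine switches reachable at equal cost from one leaf.
spines = ["spine1", "spine2", "spine3", "spine4"]
flow = ("10.1.1.5", "10.2.7.9", 6, 49152, 443)  # TCP flow to port 443
chosen = ecmp_next_hop(flow, spines)
```

Because the hash is deterministic, every packet of a flow follows the same path, but two flows between the same pair of servers can land on different spines. A monitoring tool therefore has to observe or reproduce the hashing decision to know which links a flow actually traversed.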

It is also critical that operators be able to gain visibility into application dependencies external to the data center. How is traffic flowing to other data centers? Which services or microservices are being accessed? Is the performance of these connections impacting the application? Is traffic engineering required to improve the performance of Internet connections or the data center interconnect? Or perhaps traffic engineering is needed to ensure the resiliency of these connections in the event of connectivity failures or performance anomalies? Data center operators need new tools in order to address these challenges.

The bottom line is that adopting hyperscale principles presents data center operators with visibility challenges at massive scale. Yet this evolution is inevitable as businesses race to exploit a new generation of data-intensive, real-time cloud computing applications for competitive advantage. Operators therefore need to prepare by acquiring and mastering the new tools needed to assure the performance, reliability and security of these applications.