
The Evolution of Data Center Networking for AI Workloads

Phil Gervasi, Director of Tech Evangelism


Summary

Traditional data center networking can’t meet the needs of today’s AI workload communication. We need a different networking paradigm to meet these new challenges. In this blog post, learn about the technical changes happening in data center networking from the silicon to the hardware to the cables in between.


As artificial intelligence continues its march into every facet of modern life and business, the infrastructure behind it has been undergoing rapid and significant change. Data center networking, meaning the infrastructure that connects AI compute nodes, can’t meet the needs of today’s artificial intelligence workload communication as traditionally designed, which means it must evolve to meet these new challenges.

What makes AI workloads unique?

AI workloads are fundamentally different from traditional data center tasks. They:

  • Rely on extremely high-performance computing nodes.
  • Are typically distributed across many CPUs and GPUs that need to communicate with each other in real time.
  • Predominantly use IP networking but require extremely low/no latency, non-blocking, and high bandwidth communication.
  • Cannot afford “time on network,” where one GPU waits on data from another, leading to inefficiencies and delays in the overall “job completion time.”

The underlying premise is that the scalability of these workloads doesn’t come from a single, giant monolithic computer, but from distributing tasks among numerous connected devices. It echoes Sun Microsystems’ famous tagline, “the network is the computer,” which has never been more true than it is today.

Traffic patterns of AI workloads

Traditional data center traffic typically consists of many asynchronous flows. These could be database calls, end-users making requests of a web server, and so on. AI workloads, on the other hand, involve ‘elephant flows’ where vast amounts of data can be transferred between all or a subset of GPUs for extended periods.

These GPUs connect to the network with very high bandwidth NICs, such as 200Gbps and soon even 400Gbps and 800Gbps. GPUs usually communicate with each other in a synchronized mesh or partial mesh. For example, when a pod of GPUs completes a particular calculation, it sends that data to another entire pod of GPUs, possibly numbering in the thousands, which then uses that data to train an ML model or perform some other AI-related task.

As opposed to traditional networking, the pod of GPUs, and in fact each individual GPU, requires all the data before it can move forward with its own tasks. In that way, we see huge flows of data among a mesh of GPUs rather than numerous lightweight flows that can sometimes even tolerate some missing data.

And since each GPU relies on all the data from another GPU, any stalling of one GPU, even for a few milliseconds, can lead to a cascade effect, stalling many others. This makes job completion time (JCT) crucial, as the entire workload then relies on the slowest path in the network. In that sense, the network can easily become the bottleneck.
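
To make that concrete, here is a minimal sketch in Python, using made-up per-transfer times, showing how the completion time of a synchronized step is set by the slowest transfer rather than the average:

```python
# Hypothetical illustration: in a synchronized exchange, job completion time (JCT)
# is gated by the slowest GPU-to-GPU transfer, not the average one.
transfer_ms = [12.0, 11.5, 12.2, 11.8, 48.0]  # assumed values; one stalled transfer

average_ms = sum(transfer_ms) / len(transfer_ms)
step_ms = max(transfer_ms)  # every GPU waits for the last transfer to finish

print(f"average transfer: {average_ms:.1f} ms")
print(f"step completion:  {step_ms:.1f} ms (set entirely by the slowest path)")
```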

Challenges with traditional data center networking

Since AI workloads are so large and synchronous, one flow can cause collisions and delays in a pathway shared by other elephant flows. To get around this, we need to re-evaluate the old data center design principles of oversubscription, load balancing, how we handle latency and out-of-order packets, and what type of control plane we use to keep traffic moving correctly.

In traditional data center networking, we might have configured a 2:1, 3:1, 4:1, or even 5:1 oversubscription of downstream to upstream bandwidth under the assumption that not all connected devices would be communicating at maximum bandwidth all the time.
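
As a rough illustration, the sketch below computes the oversubscription ratio of a leaf switch from hypothetical port counts and speeds; the numbers are assumptions, not a reference design:

```python
def oversubscription(downlink_count, downlink_gbps, uplink_count, uplink_gbps):
    """Ratio of server-facing bandwidth to fabric-facing bandwidth on a leaf switch."""
    return (downlink_count * downlink_gbps) / (uplink_count * uplink_gbps)

# Hypothetical traditional leaf: 48 x 25G server ports, 4 x 100G uplinks -> 3:1
print(oversubscription(48, 25, 4, 100))    # 3.0

# Hypothetical AI fabric leaf: 32 x 400G GPU ports matched by 32 x 400G uplinks -> 1:1
print(oversubscription(32, 400, 32, 400))  # 1.0
```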

We also need to consider how we load balance across paths and links. Using a technology like a simple LAG or ECMP hashing technically works, but with very high bandwidth elephant flows that don’t tolerate packet loss, latency, and so on, we end up with flow collisions and network latency on some links while others sit idle.

We also end up with an uneven distribution of traffic, since ECMP load-balances entire flows, from the first packet to the last, rather than individual packets. That not only results in collisions but can also cause ingest bottlenecks.
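
Here is a minimal sketch of how per-flow ECMP hashing behaves, assuming a hypothetical four-uplink switch; the hash function, addresses, and ports are illustrative, not any vendor’s implementation:

```python
import hashlib
from collections import Counter

UPLINKS = ["uplink-0", "uplink-1", "uplink-2", "uplink-3"]

def ecmp_pick(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    """Classic ECMP: hash the 5-tuple once and pin every packet of the flow to one link."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return UPLINKS[int(hashlib.md5(key).hexdigest(), 16) % len(UPLINKS)]

# Eight hypothetical GPU-to-GPU elephant flows: whenever two flows hash to the same
# uplink, that link stays congested for the lifetime of both flows and never rebalances.
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791) for i in range(8)]
print(Counter(ecmp_pick(*flow) for flow in flows))
```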

What AI interconnect networks require

The AI interconnect, or the new data center network purpose-built for AI workloads, has several requirements:

  1. Non-blocking architecture
  2. 1:1 subscription ratio
  3. Ultra-low network latency
  4. High bandwidth availability
  5. Absence of congestion

To get us there, we need to consider exactly how we can engineer traffic programmatically in real-time. This means intelligence at the data plane level, with minimal (or no) latency from control plane traffic.

AI interconnect

An AI interconnect will solve the issues of traditional data center networking by making use of several new technologies and several repurposed old ones.

  • Multi-pathing (packet spraying)
  • Scheduled fabric
  • Hardware-based solutions such as RDMA
  • Adaptive routing and switching

Packet spraying is the process of distributing the individual packets of a flow across multiple paths or links rather than sending all the packets of that flow over a single path. ECMP typically pins a flow to a single link, which won’t work well for AI workload traffic. The objective of packet spraying is to make more effective use of available bandwidth, reduce congestion on any single link, and improve overall network throughput.
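
As a minimal sketch of the idea, the example below sprays the packets of one flow round-robin across four hypothetical links; real fabrics may pick links based on queue depth or credits rather than a strict rotation:

```python
from itertools import cycle

LINKS = ["link-0", "link-1", "link-2", "link-3"]
_next_link = cycle(LINKS)

def spray(packets):
    """Per-packet spraying: each packet of a flow may take a different link,
    so a single elephant flow spreads across all available bandwidth."""
    return [(pkt, next(_next_link)) for pkt in packets]

for pkt, link in spray([f"pkt-{i}" for i in range(8)]):
    print(pkt, "->", link)

# Trade-off: packets can now arrive out of order, so the receiving NIC or transport
# has to reorder them -- one reason spraying is paired with smarter endpoints.
```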

A scheduled fabric, in particular, will manage the packet-spraying intelligence and solve the inability of ECMP to avoid flow collisions. The catch is that path/link selection on a packet-by-packet basis needs to be local on the switch, or even on the NIC itself, so that there isn’t a need for runtime control plane traffic traversing the network. So though there may be a controller involved, policy is pushed and decisions are made locally.

Next, RDMA, or Remote Direct Memory Access, is a protocol that allows computers in a network to exchange data in main memory without involving the processor, cache, or operating system of either computer. By bypassing this additional overhead, RDMA improves data transfer speed and reduces latency. It essentially permits memory-to-memory transfer without the need to continuously interrupt the processor.

Adaptive routing and switching is the ability to change routing or switching decisions based on the current state and conditions of the network. This is different from static or predetermined routing, which always follows the same path for data packets regardless of network conditions. Instead, adaptive routing can adjust paths based on factors like congestion, link failures, or other dynamic variables, all pointing ultimately to link and path quality.

This kind of runtime dynamic routing improves performance, fault tolerance, reliability, and is exactly what is needed for the type of traffic produced by a distributed AI workload.
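
A minimal sketch of what a locally made adaptive forwarding decision might look like, assuming the switch tracks per-uplink queue depth and liveness; the link names, values, and selection rule are illustrative only:

```python
# Hypothetical per-uplink state a switch might track locally (assumed values).
links = {
    "uplink-0": {"queue_depth": 620, "up": True},
    "uplink-1": {"queue_depth": 45,  "up": True},
    "uplink-2": {"queue_depth": 300, "up": False},  # failed link
    "uplink-3": {"queue_depth": 90,  "up": True},
}

def adaptive_next_hop(state):
    """Pick the healthy link with the shallowest egress queue -- decided locally,
    per packet, with no round trip to a central control plane."""
    healthy = {name: s for name, s in state.items() if s["up"]}
    return min(healthy, key=lambda name: healthy[name]["queue_depth"])

print(adaptive_next_hop(links))  # "uplink-1" under these assumed conditions
```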

And lastly, apart from networking, infrastructure challenges also include the choice of cabling solutions, power, cooling requirements, and the extremely high cost of optics.

The Ultra Ethernet Consortium

The Ultra Ethernet Consortium comprises mostly network vendors and one hyperscaler, Meta. They’re working collectively to address the networking challenges posed by AI interconnects. The consortium aims to release its first set of standards by 2024, along with awareness efforts such as conferences and published literature.

The importance of visibility

For efficient AI workloads, there’s a need for robust telemetry from the AI interconnect. This allows for monitoring of switches, communication between hosts and switches, and the identification of issues affecting job completion time.

Ultimately, without flow-based and very granular telemetry, a switched fabric doesn’t have the information needed to make path selection decisions in real time or to schedule path selection for the next packet, in other words, short-term predictive path selection.
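
As a rough sketch of how granular telemetry could feed that kind of decision, the example below scores hypothetical paths from utilization, latency, and drop samples; the metric names, weights, and scoring rule are assumptions, not any vendor’s actual algorithm:

```python
# Hypothetical per-path telemetry samples (assumed fields and values).
paths = {
    "path-a": {"utilization": 0.92, "latency_us": 14.0, "drops": 3},
    "path-b": {"utilization": 0.41, "latency_us": 8.5,  "drops": 0},
    "path-c": {"utilization": 0.55, "latency_us": 9.0,  "drops": 1},
}

def path_score(sample, w_util=1.0, w_lat=0.05, w_drop=2.0):
    """Lower is better: fold utilization, latency, and drops into one quality score."""
    return (w_util * sample["utilization"]
            + w_lat * sample["latency_us"]
            + w_drop * sample["drops"])

best = min(paths, key=lambda name: path_score(paths[name]))
print(best)  # "path-b" with these assumed samples
```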

As AI continues to shape the future, it’s important for data centers focused on this kind of compute to evolve and accommodate the unique requirements of AI workloads. With innovations in networking, the industry is taking steps in the right direction, ensuring that the backbone of AI – the purpose-built AI data center – is robust, efficient, and future-proof.

For more information, watch our recent LinkedIn Live with Phillip Gervasi and Justin Ryburn in which we go deeper into AI interconnect.

