
The Critical Role of Networks in AI Data Centers

Phil Gervasi, Director of Tech Evangelism
AI, Data Center Networking

Summary

The fastest GPU is only as fast as the slowest packet. In this post, we examine how the network is the primary bottleneck in AI data centers and offer a practical playbook for network operators to optimize job completion time (JCT).


We’re living through a new truth of artificial intelligence infrastructure — the fastest GPU is only as fast as the slowest packet. Teams keep upgrading compute, but for large-scale AI training and inference, the interconnect fabric sets your time-to-market. If your network quietly stretches tail latency or forces retries and retransmits, your job completion time grows, and every extra hour is cash burned and momentum lost.

This is why JCT (job completion time), the clock time from job admission to final checkpoint, is so critical in AI data centers. In modern GPU clusters, communication phases, including gradient exchange, all-reduce, parameter sync, and checkpoint I/O, regularly dominate p99 latency even when p50 looks fine. In some workloads, the network slice of total JCT can be 10-20%, but in other scenarios it can climb far higher, approaching 50%. This means that the main goal for network operators managing these environments is to minimize the communication duty cycle, so GPUs spend more time computing and less time waiting.
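
As a rough illustration of why the communication duty cycle matters, here is a minimal back-of-the-envelope sketch. All step times, step counts, and overlap fractions are illustrative assumptions, not measurements, and the model ignores stragglers and checkpoint I/O.

```python
# Illustrative back-of-the-envelope model of how the communication share of
# each training step inflates job completion time (JCT).
# All numbers below are assumptions for illustration only.

def estimate_jct_hours(num_steps: int, compute_s: float, comm_s: float, overlap: float) -> float:
    """Estimate JCT in hours for a job of num_steps iterations.

    compute_s: seconds of pure GPU compute per step
    comm_s:    seconds of collective communication (all-reduce, etc.) per step
    overlap:   fraction of communication hidden behind compute (0.0 to 1.0)
    """
    exposed_comm = comm_s * (1.0 - overlap)  # communication the GPUs actually wait on
    return num_steps * (compute_s + exposed_comm) / 3600.0

baseline = estimate_jct_hours(num_steps=100_000, compute_s=0.80, comm_s=0.40, overlap=0.5)
tuned = estimate_jct_hours(num_steps=100_000, compute_s=0.80, comm_s=0.25, overlap=0.7)

print(f"baseline: {baseline:.1f} h, tuned: {tuned:.1f} h "
      f"({(1 - tuned / baseline) * 100:.0f}% faster)")
```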

In a recent episode of Telemetry Now, we explored this critical role of the network in AI data centers.


You have three networks now

First, there’s the familiar “front-end” network that links CPUs, storage, and the internet. Traditional data centers stop there, but AI data centers add two more networks.

Second, we have the scale-out (back-end) network, which carries GPU-to-GPU communication across the fabric, often with 3-4 times the bandwidth of the front-end.

Third, we have the scale-up network. This high-performance layer is all about ultra-low latency and high bandwidth, making tens to hundreds of GPUs behave like a single giant shared-memory device.

An AI data center topology still uses a spine-leaf design, but when we’re talking about GPU communication, hop count matters more than it ever has. Fewer tiers mean fewer hops, lower power, and simpler operations. High-radix switches (fixed or modular) help reduce tiers and oversubscription; AI data centers target a 1:1 subscription ratio rather than the 3:1 or 5:1 oversubscription we see in traditional data centers.
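
For intuition, here is a small sketch that computes a leaf switch’s subscription ratio from its downlink and uplink capacity. The port counts and speeds are assumed for the example, not a recommendation.

```python
# Illustrative subscription-ratio check for a leaf switch.
# Port counts and speeds are assumptions for the example.

def subscription_ratio(down_ports: int, down_gbps: int, up_ports: int, up_gbps: int) -> float:
    """Ratio of host-facing (downlink) capacity to fabric-facing (uplink) capacity."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# Traditional leaf: 48 x 25G to servers, 6 x 100G to spines -> 2:1 oversubscribed
print(subscription_ratio(48, 25, 6, 100))    # 2.0

# AI back-end leaf: 32 x 400G to GPUs, 32 x 400G to spines -> 1:1, non-blocking
print(subscription_ratio(32, 400, 32, 400))  # 1.0
```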

AI data center networks: Lossless at scale

AI data center traffic patterns aren’t a collection of many small flows. Instead, AI training produces periodic compute windows punctuated by synchronized bursts of traffic at line rate—perfect conditions for microbursting, elephant flows, and incast, all typical traffic scenarios in a cluster-based compute environment.

Thus far, several technologies have emerged to mitigate the adverse effects of this kind of traffic, namely:

  • RDMA over Converged Ethernet (RoCE), which provides lossless networking with extremely low latency, along with CPU offload.
  • Priority Flow Control (PFC) to prevent dropped frames during times of network congestion.
  • Explicit Congestion Notification (ECN) marking to allow for end-to-end signaling of congestion between ECN-enabled nodes, preventing packet drops (a threshold sketch follows this list).
  • Load balancing methods to account for low entropy (a small number of large flows) and avoid hash polarization.
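
As one example of how these knobs interact, the sketch below derives illustrative ECN marking thresholds and PFC headroom from a switch’s shared buffer and link speed. It is a heuristic under assumed buffer sizes, speeds, and constants, not a vendor recommendation; real deployments should follow switch and NIC guidance.

```python
# Heuristic sketch: derive illustrative ECN marking thresholds and PFC headroom
# from a switch's shared buffer and link speed. The constants and inputs are
# assumptions for illustration only.

def ecn_pfc_plan(buffer_mb: float, link_gbps: int, rtt_us: float, ports: int) -> dict:
    buffer_bytes = buffer_mb * 1024 * 1024
    per_port = buffer_bytes / ports                # naive equal share of the shared buffer
    bdp = (link_gbps * 1e9 / 8) * (rtt_us * 1e-6)  # bandwidth-delay product in bytes

    return {
        "ecn_min_bytes": int(0.2 * per_port),      # start marking early in the queue
        "ecn_max_bytes": int(0.8 * per_port),      # mark aggressively near saturation
        "pfc_headroom_bytes": int(2 * bdp),        # absorb in-flight data after a PAUSE
    }

print(ecn_pfc_plan(buffer_mb=64, link_gbps=400, rtt_us=8, ports=32))
```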

Why do optics and power matter in data center networking?

In AI fabrics, the network can exceed 10-15% of the total power budget, largely because the number of links and their speeds have exploded. Inside the switch, optics now dominate power, with the DSP in pluggable modules playing a major role.
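
To make that share concrete, here is a rough estimate of the network’s slice of a cluster’s power budget. The GPU count, per-GPU draw, optics-per-GPU ratio, and wattages are all assumed, illustrative numbers rather than vendor specifications.

```python
# Rough, illustrative estimate of the network's share of cluster power.
# All wattages and counts are assumptions for the example, not vendor specs.

gpus = 1024
gpu_watts = 1000                 # assumed per-GPU draw
optics_per_gpu = 4               # assumed back-end + front-end + storage links per GPU
optic_watts = 15                 # assumed DSP-based 800G pluggable module
switch_count = 96
switch_asic_watts = 500          # assumed per-switch ASIC and cooling, excluding optics

compute_power = gpus * gpu_watts
network_power = gpus * optics_per_gpu * optic_watts + switch_count * switch_asic_watts
total = compute_power + network_power

print(f"network share of power: {network_power / total:.1%}")  # ~10% with these assumptions
```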

Two directions are reshaping optics:

  • Co-packaged optics (CPO) move optics next to the switch ASIC to shorten electrical reach and cut DSP power.
  • Linear-drive pluggable optics (LPO) keep modules serviceable on the front panel but eliminate the DSP by relying on very clean electrical channels and stronger SerDes (serializer/deserializer) in the switch.

Both approaches promise significant energy savings and easier heat dissipation. Efficiency matters beyond the power bill because it improves reliability: link flaps force rollbacks to the previous checkpoint, inflating JCT.

How can AI data center network observability help predict and reduce JCT?

A lossless network doesn’t necessarily mean congestion-free; it means congestion shows up as queues and pauses rather than drops. Visibility into these behaviors during AI training can accelerate job completion. At sub-second resolution, there are several markers that reveal congestion even in a lossless network (a detection sketch follows this list).

  • Rising PFC “time-in-pause” on GPU traffic classes combined with deep per-queue occupancy excursions (greater than around 70%) is a red flag even if actual drops are zero.
  • A rising ECN mark rate at modest average utilization (around 40–60%) suggests classic microbursts.
  • FEC histograms and optics health are worth watching because a trend toward heavier correction before hard errors appear can signal eroding link margin (dirty fiber, for example) that precedes link flaps.
  • Selective drops and retransmits clustered in short bursts (not evenly distributed) suggest transient, localized congestion rather than an overall lack of capacity.
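
Here is a minimal sketch of how those markers could be flagged in a sub-second polling loop. The counter names (such as pfc_pause_us or ecn_mark_rate_pct) and the thresholds are assumptions for illustration, not a specific device’s telemetry API.

```python
# Minimal sketch of flagging congestion markers from sub-second telemetry.
# `sample` is a hypothetical dict of per-interface/per-queue counters; the
# field names and thresholds are assumptions, not a real device API.

def congestion_flags(sample: dict) -> list[str]:
    flags = []
    if sample["pfc_pause_us"] > 0 and sample["queue_occupancy_pct"] > 70:
        flags.append("deep queue excursion with PFC pauses despite zero drops")
    if sample["ecn_mark_rate_pct"] > 1 and 40 <= sample["utilization_pct"] <= 60:
        flags.append("ECN marks at modest utilization -> likely microbursts")
    if sample["fec_corrected_per_s"] > sample["fec_corrected_baseline"] * 10:
        flags.append("FEC correction trending up -> link margin eroding")
    return flags

sample = {
    "pfc_pause_us": 1200, "queue_occupancy_pct": 83,
    "ecn_mark_rate_pct": 2.5, "utilization_pct": 48,
    "fec_corrected_per_s": 5_000, "fec_corrected_baseline": 200,
}
for flag in congestion_flags(sample):
    print(flag)
```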

A practical playbook to speed up JCT

So, how can network operators design AI training data centers to mitigate these problems and keep the network slice of JCT as low as possible?

First, design for bursty traffic, not averages. That means minimizing switching tiers (and therefore hop count), keeping link subscription at 1:1 where training happens, and sizing buffers and ECN thresholds for the synchronized, cluster-wide traffic typical of AI training.
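
As a concrete example of sizing for bursts rather than averages, this sketch estimates the buffering an incast event can demand at a single egress port. The sender count, burst size, and link speed are assumed for the example.

```python
# Illustrative incast math: how much buffer a single synchronized burst can demand.
# Sender count, burst size, and link speed are assumptions for the example.

senders = 16                   # GPUs replying to the same destination at once
burst_kb_per_sender = 256      # assumed synchronized burst per sender
egress_gbps = 400

arriving_bytes = senders * burst_kb_per_sender * 1024
drain_time_us = arriving_bytes * 8 / (egress_gbps * 1e9) * 1e6

print(f"instantaneous demand: {arriving_bytes / 1e6:.1f} MB, "
      f"drains in {drain_time_us:.0f} us at line rate")
# If the shared buffer can't absorb the excess, ECN/PFC must react before drops occur.
```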

Second, scope “lossless” precisely. That could mean enabling PFC only on the classes that carry collective traffic and verifying no head-of-line blocking leaks into other classes.

Third, instrument sub‑second polling for high‑resolution visibility. 30- to 60‑second polling leaves blind spots that mask performance issues and can delay a training job.

Fourth, close the loop with hosts. We need to treat this type of infrastructure as a single system, encompassing both switches and hosts. This involves configuring DCQCN parameters, paying attention to NIC firmware, and making flow pinning decisions, all of which impact fabric behavior.
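
One way to treat switches and hosts as a single system is a simple desired-state check. The keys below are hypothetical placeholders standing in for DCQCN and NIC settings, not a real vendor schema; the point is that host and fabric configuration drift should surface as one report.

```python
# Sketch of a desired-state consistency check across hosts and switches.
# The keys (e.g., "ecn_enabled", "pfc_priority") are hypothetical placeholders
# for DCQCN/NIC settings, not a real vendor schema.

desired = {"ecn_enabled": True, "pfc_priority": 3, "roce_mtu": 4096}

def config_drift(node_name: str, reported: dict) -> list[str]:
    """Return the settings where a node disagrees with the desired state."""
    return [
        f"{node_name}: {key} is {reported.get(key)!r}, expected {value!r}"
        for key, value in desired.items()
        if reported.get(key) != value
    ]

print(config_drift("gpu-host-07", {"ecn_enabled": True, "pfc_priority": 0, "roce_mtu": 4096}))
```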

Lastly, prevent flaps proactively by acting immediately on optical power anomalies. Swapping an optic and avoiding a 30-minute rollback is probably the cheapest JCT win.
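
A small sketch of catching optical drift before it becomes a flap is below. The 2 dB drift and -10 dBm floor are assumed thresholds for illustration; actual alerting limits depend on the optic type.

```python
# Illustrative check for optical receive-power drift ahead of a link flap.
# The 2 dB drift and -10 dBm floor are assumed thresholds for the example.

def optic_at_risk(rx_dbm_history: list[float], drift_db: float = 2.0, floor_dbm: float = -10.0) -> bool:
    """Flag an optic whose Rx power has drifted down or is approaching its floor."""
    baseline, latest = rx_dbm_history[0], rx_dbm_history[-1]
    return (baseline - latest) >= drift_db or latest <= floor_dbm

history = [-3.1, -3.3, -4.0, -4.9, -5.4]  # dBm samples over recent polls
if optic_at_risk(history):
    print("schedule optic swap / fiber cleaning before the link flaps")
```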

Why this matters to the business

JCT is the currency of competitive AI. You pay per GPU-hour. If tuning fabric thresholds, cleaning fibers, or removing a switch tier cuts a training run by 10-20%, that’s immediate cash and earlier model iteration. Inference isn’t exempt, either. For reasoning-heavy models, tail latency dominates user experience and infrastructure costs. In the end, this boils down to the metrics an executive should watch most closely: drops and emerging congestion, the signals that move JCT.

