Telemetry Now  |  Season 2 - Episode 56  |  August 29, 2025

From GPUs to Packets: The Critical Role of Networks in AI Datacenters

AI training isn’t just about GPUs, it’s about the network that ties them together. Host Phil Gervasi sits down with Vijay Vusirikala to unpack why job completion time is the true metric of success, how optical interconnects shape AI datacenter performance, and why power efficiency and observability are becoming mission-critical.

Transcript

The fastest GPU is only as fast as the slowest packet.

As enterprises and hyperscalers race to train larger and more powerful AI models, the network, and yes, I'm really talking about the network, the interconnect fabric between all those GPUs, is becoming the bottleneck. It's the competitive edge, and it's the silent force behind time to market.

Joining me today is Vijay Vusirikala, an industry leader with deep experience in large scale optics and in data center fabrics.

Together, we're gonna unpack why job completion time, not just GPU count, is the metric that really matters, and how everything from RoCE and InfiniBand to co-packaged optics and power budgets is shaping the next generation of AI infrastructure.

So stick around, because if you're building, operating, or just interested in learning about infrastructure for AI, this is the episode that you can't afford to miss. My name is Philip Gervasi, and this is Telemetry Now.

Vijay, thank you so much for joining me today. This is really gonna be an interesting show, because on this podcast, in our blog, in my own personal life, I've been so focused on the speeds and feeds coming out from folks like Arista, like yourself, from different writing from the UEC, and from others in the industry. Very exciting stuff. And as a former network engineer, I love seeing those huge numbers as far as bandwidth, and, you know, the elimination of latency in the data center and in the network is very exciting.

But today, we're gonna talk about that for sure, but we're also gonna talk about some other aspects of data center networking, and the data center in general in terms of power consumption, which is really interesting. So before we get started, again, thank you for joining. Would you give our audience a little bit of background about what you do, your professional background, and how you got to where you are today?

Thanks, Phil. Absolute pleasure being on this podcast. This is a field that, I am very passionate about. Obviously, Arista is very passionate about.

I'd love to talk to you about the emerging trends in AI networking and some of the considerations and innovations happening on the energy efficiency and low power side. So by way of background, what I do at Arista is look at next generation systems. I work with our cloud titan customers, look at where their networks and their architectures are going, and ensure that our road map and what we have in the pipeline matches what they're looking for. So it's a bit of a customer facing role combined with an internal systems development role.

Prior to that, I was at a sovereign cloud company focused on building large data centers, again with a focus on the most energy efficient way of building them. It was a new cloud where the business model was GPU as a service. Before that, I spent more than a decade at Google focused on a variety of networking technologies, from the optical side to network planning to the software and control systems that run one of the largest global networks in the world.

Prior to that, I was at a few equipment companies in a variety of roles. So, a few decades of experience. And as Phil was saying, it's very interesting to see an additional zero get added to the speeds and feeds every few years. It feels like just the other day that gigabit Ethernet seemed insurmountable in terms of the size of the pipe we could fill with bits. And today, we've added a few zeros to that.

We have. Yes. And it is exciting. I wish I could get that to my house, but I think that's still a little ways off.

But I remember, I've been in networking for a little over fifteen years personally, as more of a traditional network engineer, and I do remember when ten gigabit came out. I worked for VARs at the time, and I would talk to customers, and they'd look at me saying, why would we ever need ten gigabit? That's an incredible amount of capacity.

And then, lo and behold, a couple of years later, we're putting in port channels and LAGs to bundle that, and then talking about twenty five, forty, a hundred gig.

By the time I was specking and installing hundred gig, I was transitioning out of being a field engineer. So I actually never installed four hundred gig, though I spec'd some projects. And now we're talking about multiple terabits per second for this unique environment that we're gonna talk about today: the data center that's been specially designed, built, architected, and spec'd, all these components of infrastructure, but also power, cooling, physical footprint, specifically for training AI models.

And I assume we're talking about training, pretraining, and inference as well, but by and large, we're talking about training.

So, having a networking background and now talking in the context of model training, what's different about a data center that would do that? And we can go beyond the network, of course, Vijay, whatever you think is important. What's the difference between that and just, you know, a pretty high performing data center? I've put many data centers together over the years. What's the difference?

Yeah. And it's a great question.

And as we go on this journey, you highlighted something very interesting. Right? Some of the drivers for this bandwidth increase. For a long time, the driver was just the ability for humans to consume this bandwidth, and that was in some ways limited by our sensory absorption rates. How many video streams can you watch at a time, etcetera.

And the paradigm changed from that to the cloud where it was primarily driven by machine to machine. It was CPU to CPU communications, and that was the next inflection point.

And now we are at the third inflection point, where it is not CPU to CPU but GPU to GPU, with much more intense compute cycles. And the value we are deriving is not just the bandwidth consumption; we are really leveraging the intense compute cycles of the GPUs, and the value is a good token coming out of it. So tokens are getting generated, some of them are getting recirculated, etcetera.

And the ROI of a good token is what is driving a lot of the economics. To generate a good token, there is an intense amount of high bandwidth connectivity between the GPUs. That's primarily the case for the training clusters we are looking at, but the same concept is now valid for inference clusters, especially with the emerging reasoning models. So to your point, the data centers themselves have evolved quite significantly from the traditional cloud data centers to the current AI centers.

And in terms of the architecture, the primary difference has been the number of additional connectivity points across the GPUs. Traditionally, we had the CPU to CPU front end network that connected to the storage network and went out to the Internet through data center interconnect. It had a few management networks. It had connections to databases, etcetera.

So now we have a whole second and third layer of interconnection in an AI cluster.

The first of those layers is the back end network that's used to connect the GPUs to each other, and these can scale to ten thousand GPUs, twenty thousand GPUs, even a hundred thousand, and people are talking about half a million GPUs interconnected with each other in this scale-out fashion. And for context, the back end network is typically three to four times the size of the front end network in terms of the speeds and feeds, because the GPU to GPU communication requires extremely high bandwidth.

And the third network is the scale up network, where the GPUs talk to each other over an even higher bandwidth network with even lower latency, so that all the GPUs work together with a common memory structure. It looks like a giant GPU constructed out of multiple GPUs. And we are talking about seventy two, or a hundred, or one fifty, going up to multiple hundreds of GPUs in a scale up domain.

So fundamentally, what was just a front end network has now expanded to two additional networks, each of which is much higher bandwidth. The back end network is roughly three to four times the bandwidth, and the scale up network is eight to ten times the bandwidth of the back end network, on a per GPU basis. And this additional bandwidth is what is driving the power consumption, the energy efficiency focus, and the capital expense for the switches as well as the connecting cables and optics.
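
The tiering Vijay describes can be put into rough numbers. Only the ratios come from the conversation (back end roughly three to four times the front end, scale up roughly eight to ten times the back end, per GPU); the 100 Gb/s front-end baseline is an assumption for illustration.

```python
# Back-of-envelope sketch of per-GPU bandwidth across the three networks.
# The ratios are from the conversation; the absolute baseline is assumed.

def per_gpu_bandwidth_gbps(front_end: float = 100.0,
                           back_end_ratio: float = 3.5,
                           scale_up_ratio: float = 9.0) -> dict[str, float]:
    back_end = front_end * back_end_ratio   # back end ~3-4x front end
    scale_up = back_end * scale_up_ratio    # scale up ~8-10x back end
    return {"front_end": front_end, "back_end": back_end, "scale_up": scale_up}

print(per_gpu_bandwidth_gbps())
# {'front_end': 100.0, 'back_end': 350.0, 'scale_up': 3150.0}
```

Under these assumptions, a single GPU's scale-up connectivity dwarfs its front-end connectivity by more than thirty to one, which is why the new networks dominate the cabling, optics, and power budget.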

So never more than now in history can we rightly say that the network is the computer, as in that old Sun Microsystems quote. You know, you spoke about multiple tiers of network connectivity.

GPU to GPU connectivity, and the second tier where we have clusters of GPUs speaking to other clusters of GPUs.

And that is because we don't have one giant monolithic computer. We don't have one giant monolithic GPU, which I suppose, if we had literally limitless resources, we could construct.

But ultimately, it's this clustering of multiple discrete computing units, in the form of a GPU, that allows this model training to happen. That's where we have the power of all of these resources to train. And what we're talking about here, by and large, is probably large language models. Right, Vijay? We're not necessarily talking about running, you know, logistic regression models against some data. Although we certainly could at scale, and that would require those resources. But I think, by and large, we're talking about training the next iteration of something along the lines of a GPT or a Claude or something like that.

That's exactly right. These are the super large foundational models that require multiple tens of thousands of GPUs to run, and it still takes multiple weeks to get these foundational models trained. And then there are the smaller versions of those that get trained or fine tuned with domain specific training. But, yeah, the super large clusters that folks are building are really to enable the foundational model training.

Okay. So then, clearly the network matters in the sense that it's interconnecting these nodes to behave as one computing unit, in a sense. But what specifically about the network is different, and matters now? Just to throw something out there: in traditional data centers, I would design maybe a three to one or a five to one oversubscription ratio. That's not necessarily the case now, especially with the way that traffic flows east west, GPU to GPU, cluster to cluster. Can you explain a little bit about how the traffic moves and why the network matters at that level in AI training?

Yep. So there are a few fundamental differences in the traffic patterns themselves. If you contrast an AI cluster traffic flow with a traditional cloud traffic flow, the first difference is the traffic pattern itself. In the cloud data center, you typically have tens of thousands of flows, essentially distributed across the CPUs.

The bandwidth of each flow is relatively small, but it's characterized by the number of flows that you have.

If you contrast that with an AI cluster, the traffic pattern is very different. The traffic tends to be very quiescent for the period that the GPUs are cranking away at their compute cycles, so the flops are churning, churning, churning. But when the time comes for them to exchange the weights, exchange the gradients with each other, they slam the network to the maximum available capacity. So if you look at the traffic pattern over time, it sits essentially close to zero, goes to a hundred in a matter of a few milliseconds, and it is highly synchronized, because every GPU is trying to communicate with every other GPU. This places requirements for a very different architecture, in terms of congestion control, in terms of load balancing, in terms of the speeds and feeds that you need to accommodate.

Fundamentally, what you are trying to optimize here is GPU utilization. You want the GPUs to be utilized to the maximum extent possible, which means the time the GPUs spend communicating with each other needs to be minimized.

And how do you do that optimization?

By having the highest bandwidth network so that the duty cycle of communication is small relative to the duty cycle of computation.
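
The duty-cycle argument above can be sketched in a few lines. The step time, gradient volume, and link speeds here are illustrative assumptions, not figures from the episode; the point is only the shape of the relationship.

```python
# A training step alternates a compute phase with a synchronized gradient
# exchange; the exchange time scales inversely with per-GPU bandwidth,
# so faster links directly raise GPU utilization.

def step_utilization(compute_ms: float, exchange_bytes: float, link_gbps: float) -> float:
    """Fraction of one training step spent computing rather than communicating."""
    comm_ms = exchange_bytes * 8 / (link_gbps * 1e9) * 1e3  # transfer time, ms
    return compute_ms / (compute_ms + comm_ms)

# Assuming 5 GB of gradients exchanged per 100 ms compute phase, doubling
# the link speed from 400G to 800G lifts utilization from 50% to ~67%.
u_400g = step_utilization(compute_ms=100, exchange_bytes=5e9, link_gbps=400)
u_800g = step_utilization(compute_ms=100, exchange_bytes=5e9, link_gbps=800)
print(round(u_400g, 2), round(u_800g, 2))  # 0.5 0.67
```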

I see. Okay. Is it true that the time spent in the network for a training run is approaching fifty percent, when we look at the entire set of activities that occur in a single training run? Is that about right still?

So I think it varies by workload and the parallelism techniques, but we've seen anywhere from about ten percent to twenty percent, and in some cases about fifty percent as well. But it does vary quite a bit by workload.

So Okay.

So then it's imperative that we do what we can to reduce the amount of time that a packet or a frame spends in processing and in transit. Right? Absolutely. Hence this incredible bandwidth requirement and a one to one oversubscription ratio. And then you mentioned a couple of things. You mentioned load balancing, and you also mentioned congestion control. Can you expand upon those a little bit for us?

So, as you know, the considerations for load balancing and congestion control depend quite acutely on the entropy in the system. If you have very few flows, you tend to get hash polarization, or hash collisions. So you need a congestion control mechanism that enables proper load balancing of the flows so you don't get into these hotspots; otherwise, you have the obvious challenges of either having to retransmit or having to throttle at the host side.
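
The entropy point can be illustrated with a toy simulation (not from the episode): a per-flow ECMP hash is effectively a random assignment of flows to uplinks, so with only a handful of elephant flows some links carry several flows while others sit idle.

```python
import random

# Toy model of per-flow ECMP: few fat flows -> hotspots; many small
# flows -> statistical multiplexing evens the load out.

def ecmp_link_loads(n_flows: int, n_links: int, seed: int = 0) -> list[int]:
    """Count how many flows a per-flow hash places on each uplink."""
    rng = random.Random(seed)  # stand-in for a 5-tuple hash
    loads = [0] * n_links
    for _ in range(n_flows):
        loads[rng.randrange(n_links)] += 1
    return loads

few = ecmp_link_loads(n_flows=8, n_links=8)        # AI cluster: few fat flows
many = ecmp_link_loads(n_flows=10_000, n_links=8)  # cloud: many small flows
print(few)   # uneven: hotspots and idle links are likely
print(many)  # near-uniform across the eight links
```

This is why AI fabrics lean on flowlet or packet-level spraying and tighter congestion signaling rather than plain per-flow hashing.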

Well, yeah. But, I mean, doesn't TCP take care of that? Like, a retransmit isn't the end of the world, or am I incorrect? Is a retransmit very, very bad for a training run? I'm thinking, because of the amount of traffic, there's probably an exponential increase in the adverse effect.

So here, you want to build lossless networks in an RDMA configuration, and you want to have direct reads and writes without having to go through your normal path. And hence, compared to a cloud based network where you have the opportunity to retransmit, here you're trying to truly build a high bandwidth, lossless network that avoids retransmissions.

Okay. Lossless, minimal latency, incredibly high bandwidth, a changing nature of how traffic works. So there's also a visibility component, on what's going on with my traffic. Right.

You know, what's stalling GPUs, that sort of thing. So we've talked about the importance of the network. We've talked about the differences between this kind of network for AI training and more traditional data centers, not even mentioning things like campus and wide area networks, which of course are yet another thing, and why it matters.

Is it a matter of only cost, that we want to reduce the amount of time spent in the network because it's just a waste of money? Or does it have an adverse effect on the training itself, you know, on the business perhaps? I'm assuming yes, but I'd like to hear it from you.

So great question. So the figure of merit here is the job completion time.

And to optimize for job completion time, you might ask the question: how does it matter if it takes twice as long? How does it matter if it takes ten percent longer?

And the other related question is: the network typically is a fraction of the cost and power consumption of the GPUs, or XPUs. So why do we care so much about optimizing the network?

So these are questions that we get all the time. Right? So let's tackle them one by one. Job completion time is the key figure of merit.

And what you want to do is ensure, as we discussed previously, that the GPUs are doing computation most of the time and spending the least amount of time possible on communication. This is what enables you to reduce the job completion time. An associated part of that is that the network has to be performant as well as reliable. If you lose a link, you lose a set of GPUs that are part of this hive computation, and then you have to go back to the previous checkpoint.

Then you lose thirty minutes, forty five minutes, whatever your checkpoint frequency is, and you've got to restart from the previous checkpoint. That adds to the job completion time. So in the end, you have the most expensive and scarce resource, which is the GPUs, and you want to do everything from the network perspective to make sure they are fully utilized, fully up, and fully performant.

So in that context, spending a little more on the network, or doing more on the network to optimize everything you can to increase, essentially, the utility of the XPUs, is the right thing to do as a data center level or cluster level optimization.
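
The checkpoint arithmetic above can be written as a toy model. All the numbers here (run length, failure rate, checkpoint interval) are illustrative assumptions: each failure rolls the job back to the last checkpoint, losing on average half a checkpoint interval of work across the whole fleet.

```python
# Expected job completion time under failures, for the rollback-to-
# checkpoint recovery described above. Illustrative numbers only.

def expected_jct_hours(ideal_hours: float,
                       failures_per_day: float,
                       checkpoint_interval_min: float) -> float:
    """Ideal job time plus the expected rework added by rollbacks."""
    n_failures = failures_per_day * (ideal_hours / 24)
    # On average a failure throws away half a checkpoint interval of work.
    lost_hours = n_failures * (checkpoint_interval_min / 2) / 60
    return ideal_hours + lost_hours

# A two-week (336 h) run, 4 link/GPU failures per day, 30-minute checkpoints:
print(expected_jct_hours(336, 4, 30))  # 350.0 -- 14 extra hours of the whole fleet
```

Note that the lost hours multiply across every GPU in the cluster, which is why link reliability shows up so directly in job completion time.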

So job completion time then has a direct impact on how long a particular training run takes, which therefore has a direct impact on actual cost, on money, because it lengthens how long it takes to complete the overall training and get the model to market.

And I assume now, with multiple companies competing, with their foundational models for sure, but also some of the smaller models we're seeing, a lot of it is that they're competing on benchmarks around accuracy, how fast they are, how good they are, and, if you hook into their API, what their latency looks like. There are all these different things they're competing on. But the bottom line is that they're competing.

And so when a new model comes out, everybody's gonna flock to that, run their benchmarks, and that becomes the new baseline. And so it sounds like job completion time and specifically the network portion of the job completion time has a direct impact on money, on my ability to make money if I'm putting this model out there.

Absolutely. It's as simple as that. Right? You're paying per hour per GPU. So what you want to do is maximize the GPU rental, so to speak, to complete your training job in the least amount of time, or GPU hours.
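
That economics really is one line of arithmetic. The fleet size and the $2.50 per GPU-hour rate below are illustrative assumptions, not figures from the episode:

```python
# Cost of a training run ~= GPUs x hours x hourly rate, so any
# network-induced stretch in job completion time is a direct dollar cost.

def training_cost_usd(n_gpus: int, jct_hours: float, usd_per_gpu_hour: float) -> float:
    return n_gpus * jct_hours * usd_per_gpu_hour

base = training_cost_usd(16_384, 336, 2.50)          # two-week run, assumed rate
slow = training_cost_usd(16_384, 336 * 1.10, 2.50)   # the same run, 10% longer
print(f"${slow - base:,.0f} extra")  # $1,376,256 extra for a 10% JCT slip
```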

So let's talk a little bit about the topology, and then some of the components that relate specifically to power. We're talking about a data center, so immediately I think of how the industry has settled, by and large, on spine leaf, you know, two or three tiers, whatever it happens to be, or a super spine, so maybe you add more.

And then there are variants, keeping it as flat as possible to minimize the hops. That's what I have in my mind when I think of a high performing data center. Is that the fabric topology that we're talking about here?

Yep. That's exactly right. The predominant architecture in AI clusters is a spine leaf topology. And as you said, the goal is to interconnect as many GPUs as possible with the least amount of oversubscription, for the reasons previously discussed: we want extremely high bandwidth, any to any connectivity.

Okay.

And the main consideration there is to minimize the number of tiers. The more tiers you have, the more interconnection between them, and those add cost, complexity, and power. So the best way to reduce the number of tiers is to have a larger radix switch. Radix here refers to the amount of connectivity you have from the switch to the GPUs. The higher the radix, the fewer tiers you need. And there are multiple ways to increase the radix. You can start out with a fixed form factor box that typically has, say, five hundred and twelve lanes, or sixty four ports.

Mhmm.

And you can have a modular chassis where you put in a bunch of these line cards interconnected with a fabric, and that has, at the box level, a much higher radix. Based on the number of GPUs that you want to build for, you can use a combination of these switches to minimize the number of tiers. Beyond a point, you just literally run out of radix and have to add a third tier, but most of the time, you want to minimize the number of tiers to make it most power efficient and capital efficient.
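
The radix-versus-tiers trade-off follows from standard Clos arithmetic, sketched here as an illustration of the point: in a non-oversubscribed two-tier leaf/spine fabric, half of each leaf's ports face the GPUs and half face the spines, and each spine can reach at most `radix` leaves, so capacity grows with the square of the switch radix.

```python
# Maximum GPUs in a 1:1 (non-oversubscribed) two-tier Clos built from
# switches of a single radix. Standard Clos math, shown for illustration.

def max_gpus_two_tier(radix: int) -> int:
    leaves = radix              # bounded by the spine's port count
    gpus_per_leaf = radix // 2  # the other half of the leaf ports go up
    return leaves * gpus_per_leaf

print(max_gpus_two_tier(64), max_gpus_two_tier(512))
# 2048 131072 -- a higher-radix box defers the third tier by ~64x
```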

Okay. I'm just looking at our notes, and there are so many directions we can go in this conversation. And we're already halfway through our recording today, yet we could easily make this an entire series and focus on one piece at a time, one hour at a time. There's just so much.

So today, focusing on this power component, and, you know, certainly we could talk about how you do queue engineering, how you do straggler mitigation. We could talk about all those things and make each one a podcast, and it would be interesting. But focusing on power, I know that can be, from a cost perspective, a very significant component.

But there's also the aspect of heat dissipation. And there's a difference between the optics that some might be familiar with and the optics that we would use for a very east west heavy AI fabric, you know, the fan in, fan out fabric that you just mentioned.

What is the difference between the kind of optics being deployed for these kinds of environments, as compared to, say, a traditional data center that my local health care facility might have?

So if you look at a traditional data center, the contribution of the network to the overall power footprint was relatively modest. Let's call it five percent or so. But what has happened, as you saw earlier, is that in an AI cluster we are adding additional networks. We are adding a back end scale out network. We are adding a scale up network. And as you saw, those are extremely high bandwidth.

So the amount of connectivity has increased very substantially, and the speed of these switches has also increased substantially. Now we are at a situation where the network is contributing more than ten percent, sometimes even fifteen percent, of the overall data center power. So it behooves us to do everything we can to make that more optimal, for a few reasons. Right?

First, any power that you save on the network can go towards the GPU footprint. You can monetize that directly, and that improves your business case. The second one is just the TCO consideration of your electrical bill. You minimize that.

And the third one is that if you have more power efficient networks, switches and optics, you can increase the density in a rack, because you have a constraint on the amount of power you can support in a rack. So that improves your overall number of racks and the density in the data center. For all these reasons, there is a very strong imperative to look at switches that have a lower power footprint. And now let's go to the next level and see what the contributors to the power in a switch are.

Again, traditionally, optics accounted for less than fifty percent of the power. But as we took advantage of Moore's law and the power per bit on the switching silicon came down, we did not get the same Moore's law benefit on the photonics side. So as a relative fraction, optics now contribute more than fifty percent of the switch power. And with every generation, as we went from four hundred gig to eight hundred gig, and now to sixteen hundred gig, that fraction has been increasing quite significantly.

So if we do not do anything, optics will contribute sixty or sixty five percent of the overall switch power.
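
The power shares just described compose into a simple back-of-envelope calculation. The percentage splits are the rough figures from the conversation; the 100 MW facility size and the sixty percent module saving applied at the end are assumptions for illustration.

```python
# Facility-level optics power: datacenter MW x network share x optics
# share of switch power. Shares per the conversation; facility size assumed.

def optics_power_mw(dc_mw: float, network_share: float, optics_share: float) -> float:
    """MW consumed by optical modules across the whole facility."""
    return dc_mw * network_share * optics_share

today = optics_power_mw(dc_mw=100, network_share=0.12, optics_share=0.55)
with_dsp_less = today * (1 - 0.60)  # LPO/CPO: ~60% module power reduction
print(round(today, 2), round(with_dsp_less, 2))  # 6.6 2.64
```

Several megawatts recovered this way is power that can be handed straight back to the GPU footprint, which is the first reason Vijay lists.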

So then the optics community, being super innovative, loves a challenge. They decided to take a look at it and say, alright, what can we do here to reduce the optics power?

The main contributor within an optical module is the digital signal processor.

This is the device that conditions the signal as it goes from the switch chip to the module through electrical traces or cables. Any distortions that appear are corrected, and the losses that you incur are compensated, at this digital signal processor.

It is like a reset button: it cleans up the signal, and the clean signal then drives the optics across the link distance that you want. So now, if that is the biggest part of the optics power, is there a way we can eliminate that DSP to get a very substantial reduction in power? And how do you eliminate that DSP? You look at the function it's doing. It is signal conditioning. Your electrical signal and your optical signal are at the same data rate; it's the channel that is causing those distortions.

How do we figure out a way in which we reduce the channel distortions?

So the best way to do it is to bring the optics closer to the chip, so you reduce the channel length.

Okay.

Or the other way is to design a very pristine channel, with a very good set of optical drivers, so that you can drive the optics directly with the signal coming from the chip without having to put in a digital signal processor.

So those are the two big trends that folks have been following. The first one, where you bring the optics right next to the chip, is the co-packaged solution, or CPO. The other one, where your optics are still on the front panel, serviceable and pluggable, is where you design the channel from your switch to the module to be so pristine that you don't need the signal processor.

Both of these have equivalent benefits in terms of power reduction, and in fact they result in a sixty percent power reduction on a per module basis.

So the two optics we're talking about here are the co-packaged optics and the linear drive pluggables.

Pluggables. Thank you. Right? So the co-packaged optics are reducing the electrical path in order to reduce power, and I assume latency as well. Right.

And linear drive pluggables, LPOs, were created to remove the DSP from the equation. So they're DSP-less modules, which also reduces the power requirement. So now we have power savings there. And you mentioned something about possibly some forward error correction implications for the LPOs as well.

Right. So we can walk through that. Actually, in both cases, the digital signal processor is removed. In the co-packaged case, because you're pulling the optics right next to the chip, the channel is very short, and there's not much distortion or loss that requires a DSP.

Okay. In the linear pluggables case, you have the regular channel in terms of the distance between the chip and the module. But what you do is design that channel so that it is within the loss budget that can be handled by the driver of the optical module. And the serializer/deserializer, the so-called SerDes that sits on the switch, has the ability to compensate on the receive side for any distortions that happen through the link.

So the combination of having a very good SerDes and designing that channel between the switch and the module to be the lowest loss possible gives you the ability to eliminate the DSP from the pluggable module.

And because we're reducing the power requirement, does that mean there's also an improvement with regard to heat dissipation, and therefore our need to cool and all that?

Absolutely. So you get the additional flywheel benefits. Right? You get a reduction in the overall heat dissipation, which directly results in an improvement in reliability, because heat is the number one factor that affects the reliability of these optical modules. And that has a huge flow through effect on cluster level reliability. As we saw previously, job completion time is very, very dependent on the reliability and uptime of the links as well as the modules. So taking out the heat producing components, like a DSP, reduces the overall thermal profile and improves the reliability.

It improves the power dissipation. And naturally, when you take some components out, it improves the cost as well.

Okay. So the reliability has also increased as a result of not having heat as a factor in the failure of optics and things like that. Exactly. So I assume that also means there's an improvement to operations, where, with fewer optics failing that people have to go and replace, you have a more efficient operations team, data center operations, network operations, or whatever you call it in an AI data center. Right?

That's right. That's absolutely right. One of the items that fails in these switches is the pluggable optical module, and the frequency at which you need to replace them roughly halves because of this improvement. Yeah.

So I didn't know that. That's a significant decrease.

It is, indeed. Yeah. So another interesting aspect: we discussed two types of optics that both accomplish the low power. One is the CPO, and one is the LPO.

While they're equivalent in terms of the power footprint, they're very different in terms of the operations that you just alluded to. In the CPO case, the optics are embedded in the switch itself. And if you have sixty four channels in your switch, even if only one channel fails, or there's a problem with that one channel, your replacement unit is the entire switch. That can be very challenging unless the components are outstandingly reliable. Whereas in the LPO case, you are replacing at the one module level rather than at the sixty four module level. So there are fundamental differences in the serviceability and operations profile of the two optics, and the industry is going to work through some of those challenges.
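
The serviceability math behind that point is simple to sketch. The two percent annual per-channel failure probability below is an illustrative assumption, not a quoted figure; the lesson is how fast the risk compounds when the replacement unit spans sixty four channels.

```python
# With pluggables (LPO) the field-replaceable unit is one module; with
# co-packaged optics (CPO) a single bad channel out of 64 can mean
# swapping the whole switch. Per-channel failure probability is assumed.

def p_any_channel_fails(p_channel: float, n_channels: int) -> float:
    """Probability that at least one of n_channels fails in the period."""
    return 1 - (1 - p_channel) ** n_channels

p_switch_event = p_any_channel_fails(p_channel=0.02, n_channels=64)
print(round(p_switch_event, 3))  # 0.726
# At this per-channel rate, most 64-channel CPO switches would see at
# least one channel event per year, which is why channel reliability
# must improve dramatically before whole-switch replacement is viable.
```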

So from our perspective, the pragmatic approach is to start with LPO. And as the CPO system matures and some of its reliability profile gets proven out, we believe operators will embark on that.

Okay.

So we've addressed problems with heat, with power, with reliability, even efficiency of my operations.

But are there any outstanding problems that have yet to be solved, that you're facing in the projects you're working on, the customers you're working with? And I am speaking specifically in the context of optics and that aspect of this.

Yep. Yeah. So good question. So this is really a race to get to the next speed, the next capacity, and the next radix.

So the problem statement is that people want a higher number of GPUs in a scale up domain. Today, that is limited to a rack, because within a rack you can use very low power interconnects, like copper, to interconnect those. Right. Of course. But what people really want is to take it to, say, a few hundred GPUs. That means you need to put those GPUs in multiple racks.

If you have multiple racks, you need to interconnect them with very low-power, low-cost optics, and that is still an unsolved problem in the industry. And this is the reason why a lot of venture capitalists are super excited about this industry because Right. It's a huge market. It's a huge unsolved problem.

There are no obvious solutions. There are three or four solutions that are vying for that space, all the way from very exotic technologies like RF waves that go through a guided wave to make this connection, to LEDs like the ones you use in your display systems that go into a fiber bundle and across a few meters. So it's very interesting to see the level of innovation that is going toward solving that interconnect problem.

Just so long as we keep on consuming these models and AI in general, and I don't see an end in sight. You know? I hear about people talking about small language models devoted to a particular task, and therefore, instead of a trillion parameters, you maybe have a hundred thousand parameters to deal with, and therefore smaller requirements. We're also hearing about some of these webscale organizations, whether they're buying or building or planning a small modular reactor, a very small, you know, nuclear facility. So I see and hear and read about all these things, and it is amazing. The innovation is still happening, out of necessity. But you mentioned that as we scale up, there are some problems with these networks.

Can you walk me through what that looks like? I mean, I know what a rack of, you know, old Dell PowerEdge servers looks like interconnected with some fiber and things. What does this new network look like, you know, from server to rack to pod, and how do the GPUs fit in?

Yep. Absolutely. So at a high level, there are two paradigms that are emerging. The first paradigm is you have an extremely high density compute rack that you interconnect with an independent network rack. And the other paradigm is you co design your compute and network together, and you create this extremely high density, extremely high power rack.

The main evolution there is folks are going from the traditional ten-kilowatt racks that we've used for a very long time to hundred-kilowatt racks, and now people have designs for five-hundred-kilowatt liquid-cooled racks. And this is a stunning amount of high-density compute and high-density networking. So that's the first paradigm that is happening. And, as we discussed, if you have compute and network together, the interconnection is through either copper backplanes or copper cables.

Yeah. The next big paradigm is now you look at this as not just one rack, but essentially a row of racks, and how you figure out the best way to interconnect that row of racks. But to your earlier question, in either paradigm, think of this as essentially a stack of compute trays. Each compute tray has, like, four GPUs, and then you have your network within the rack or adjacent to the rack.

And there is a cooling system, a CDU rack; all the modern ones are liquid cooled. So that's one paradigm. Similarly, for the network rack, you think of this as a series of network switches. They have an equivalent of a top-of-rack switch.

They have a patch panel, and you have some kind of rack management controller. So those are the canonical designs for a network rack even in the AI cluster. So, fundamentally, it's not that different from traditional Dell-based servers. It's just that each server now has the CPUs as well as the GPUs with the associated network interface cards, and those are connected to a bunch of these network switches.

Right. Which means that once you start inside the rack and you're looking at GPU-to-GPU connectivity, what is that? Is that NVLink or something like that that folks are using?

So with an NVIDIA GPU, the scale-up domain is NVLink and Yeah.

For an NVLink seventy-two. So if you have seventy-two GPUs, those are connected through NVSwitches. So that's the NVLink. Yep. In a non-NVIDIA paradigm, that can be Ethernet. It can be some of the emerging technologies like UALink. So there are a few options for the scale-up domain in the non-NVIDIA case.

But in any case, the cable plant and the planning that this takes is significantly more than in a traditional data center. That's clear.

That's right.

But I'm also thinking about how, and this has always been the case, your infrastructure architecture is directed by the activity that you're going to do with it. So we wanna think to ourselves, I have this application that I'm running in the cloud, and my front end is in AWS. And where do I put my back end? Well, I'm not gonna put it in Azure because it's making database calls, and I wanna keep latency low. So there's an example in traditional networking where your architecture is related to the application, to the activity.

That is an order of magnitude more important. Now, I say that, and it's just a mathematical number, but it's more important now because you have cluster upon cluster, pods upon pods. And so you wanna design this in such a way, considering this incredible complexity of your cable plant and all of that, that having a rack farther away could add a millisecond of latency, or maybe, you know, how do I design this so I can reduce a hop? Is that correct, that you're seeing more of a tight coupling between architecture and, you know, the purpose for which this data center is built? Right?

That's absolutely right. And the latency considerations are a lot more important for the scale-up domain. So as you rightly pointed out, optimizing the switch hops, optimizing the number of tiers in the leaf-spine architecture, as well as the physical connectivity because of the time-of-flight latency. All those are very important. On the scale-out side, it is a little more forgiving, and you really have no option. If you are putting in a hundred thousand GPUs, you have a natural distance that you need to contend with. What folks are looking at are very large data halls as the first unit of compute, but some of the cloud titans and hyperscalers are looking at connectivity across buildings, and some of them even across the campus, and they have the ability to tolerate that latency in some of how the model training is architected.
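To put a rough number on that time-of-flight point, here's a back-of-the-envelope sketch in Python. The fiber group index and the distances are assumed values for illustration, not figures from the conversation:

```python
# Rough one-way time-of-flight estimate for fiber runs at different scales.
# Light in fiber travels at roughly c divided by the fiber's group index
# (about 1.468 for standard single-mode fiber), i.e. close to 5 ns per meter.

SPEED_OF_LIGHT_M_PER_S = 299_792_458
FIBER_GROUP_INDEX = 1.468  # assumed typical value for single-mode fiber

def fiber_latency_us(distance_m: float) -> float:
    """One-way propagation delay over a fiber run, in microseconds."""
    return distance_m * FIBER_GROUP_INDEX / SPEED_OF_LIGHT_M_PER_S * 1e6

# Illustrative distances: within a rack, across a data hall, across a campus.
for label, meters in [("within a rack", 3),
                      ("across a data hall", 100),
                      ("across a campus", 2_000)]:
    print(f"{label:>20}: {fiber_latency_us(meters):8.3f} us one way")
```

At roughly five nanoseconds per meter, a cross-campus run of a couple of kilometers adds about ten microseconds each way, which is why the latency-sensitive scale-up domain stays inside the rack or the row while scale-out can tolerate the distance.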

So in this data center where architecture and its purpose are so tightly coupled and there's complexity upon complexity, what happens when I wanna make a change? I wanna add a rack. I wanna move something. I mean, it sounds like that's not really possible, not just because of the complexity and the ridiculous amount of time and effort it's gonna take to move things, but also because I'm taking down a data center or a portion of the data center, which, as we've discussed, is extremely expensive to operate and run.

Right.

So that's a good question. Today, the blocks of compute are homogeneous in structure. So as you build new blocks, you can connect them at a higher level. But what most people are doing today is they're staying with, say, a cluster size they determine ahead of time. Let's say I'm building a ten-thousand-GPU cluster or a twenty-thousand-GPU cluster. They design it for that maximum size. They have cookie-cutter versions of those, multiple instances that they can tie together with a super-spine layer, so to speak.

But as the compute infrastructure changes, you get the next generation of GPUs.

They are upgrading that entire compute block into a new homogeneous block.

I see. Okay. Alright. Well, I'd like to move on then for the last portion of our show today to observability in AI training data centers.

That's obviously something that at Kentik we're very familiar with, very familiar with in traditional data centers, looking at your VXLAN overlay and mapping your VTEPs and your VNIs. That's fine. And I have to assume it's a little different now because, sure, we are looking at those metrics. I wanna know if my switch's fans are operating at a hundred percent and things like that.

And we wanna look at flow data and things like that. But what are the golden signals, that's the term we like to use in observability, at least, for AI fabrics that you'd like to call out?

So there are perhaps three things that are super important in an AI cluster. So we discussed the importance of reliability, and this is even more acute in an AI data center because it's very, very sensitive to, what we call flaps, which are short transient loss of connectivity.

And this has a very deleterious effect in terms of, like, resetting the GPUs, going back to a checkpoint, etcetera. So in a cloud data center, it was fairly straightforward. If a link failed, you are able to move your workload from that specific machine to another machine fairly seamlessly. Life went on.

That's not the case in an AI data center. So we have to have much richer telemetry, specifically for determining the fiber faults or fiber flaps and identifying the root cause. And as you were alluding to earlier, some of the forward error correction statistics give us a clue as to how close you are to a link degradation, whether you need to take any proactive action, etcetera.

And it's very interesting what causes some of these issues.

The most dominant cause is a speck of dust on the end of a fiber facet, and that causes reflections. And those reflections actually mess with the link performance, and they cause a burst of errors. And the burst of errors result in actually a link flap, and that causes, the downstream effects that we're discussing. So identifying those and finding the telemetry that indicates some of those issues is definitely a new focus area for AI.

So the other one, as we discussed, the transient behavior is very different between an AI cluster and a cloud cluster. So you need very, very fine-granularity telemetry to catch those spikes that transition from the dormant state to, like, the hundred percent state where all the GPUs are slamming the network with their packets. So being able to monitor those spikes, being able to distinguish between the very high bandwidth states and the regular states, and having that granularity in the telemetry is super important. And the third one is, as we were discussing earlier, congestion control is absolutely critical.

So having the telemetry to look at both the NIC level as well as the network level and having a holistic view on a workload by workload performance basis, that's definitely new for the AI case because it's not just the network. You got to look at end to end and identify which workloads or which jobs are getting impacted by any of the congestion events.
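The fine-granularity spike detection described here can be sketched roughly as follows. The line rate, poll interval, threshold, and counter samples are all illustrative assumptions, not a real collector:

```python
# Minimal sketch: detect bursty transitions from near-idle to near-line-rate
# by polling an interface's byte counter at fine granularity. A coarse 30 s
# SNMP poll would average these bursts away; 10 ms sampling catches them.

LINE_RATE_BPS = 400e9      # assume a 400G port
POLL_INTERVAL_S = 0.01     # assumed 10 ms polling interval
SPIKE_THRESHOLD = 0.8      # flag utilization jumping above 80% of line rate

def find_spikes(byte_counter_samples):
    """Given successive byte-counter readings, return the sample indices
    where utilization crosses from below to above the spike threshold."""
    spikes = []
    above = False
    for i in range(1, len(byte_counter_samples)):
        delta_bytes = byte_counter_samples[i] - byte_counter_samples[i - 1]
        utilization = delta_bytes * 8 / POLL_INTERVAL_S / LINE_RATE_BPS
        if utilization >= SPIKE_THRESHOLD and not above:
            spikes.append(i)
            above = True
        elif utilization < SPIKE_THRESHOLD:
            above = False
    return spikes

# A dormant link that suddenly goes near line rate for two samples:
idle = 1_000          # ~0.8 Mbps background per 10 ms window
burst = 450_000_000   # ~360 Gbps in one 10 ms window
samples = [0, idle, 2 * idle,
           2 * idle + burst, 2 * idle + 2 * burst,
           2 * idle + 2 * burst + idle]
print(find_spikes(samples))  # the onset of the burst is flagged
```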

So our telemetry sources need to reflect the ability to observe actual traffic. Like, elephant flows are not something that we mentioned by name, but that's what you described when you talked about these large volumes of traffic, which is generally gonna be derived from flow data, sFlow, IPFIX, whatever.

But you also mentioned device metrics down to the NIC. So that would be things like gNMI streaming telemetry, probably SNMP for certain devices if they require that.

You're talking about mapping workloads. So I assume that's gonna be enriching all of that telemetry with, like, DNS. Is that correct? I'm not sure. But you wanna map something, a workload, a cluster name, or something to the flow.

And then what do we use for optics health? Because these are DSP-less, are they still able to provide us that telemetry?

Good question. So, if you do not have the DSP, you still have the metrics at the receiver SerDes. So you do have the metrics that you can pick up from there. And at the optical module itself, you still have all the optical-level parameters like your transmit and receive power levels, etcetera.
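As an illustration of the kind of optics-health check those module-level readings enable, here's a hedged sketch that compares DDM/DOM-style transmit and receive power values against alarm thresholds. The threshold numbers and the reading format are assumptions for the example, not taken from any particular module's datasheet:

```python
# Hedged sketch: flag an optical module whose tx/rx power (in dBm) has
# drifted below assumed low-alarm thresholds. Real thresholds come from the
# module vendor; these values are purely illustrative.

RX_POWER_LOW_DBM = -10.0   # assumed receive-power low alarm
TX_POWER_LOW_DBM = -6.0    # assumed transmit-power low alarm

def check_module(reading: dict) -> list:
    """Return a list of alarm strings for one module's power readings."""
    alarms = []
    if reading["rx_power_dbm"] < RX_POWER_LOW_DBM:
        alarms.append("rx power low: possible dirty connector or fiber fault")
    if reading["tx_power_dbm"] < TX_POWER_LOW_DBM:
        alarms.append("tx power low: laser may be degrading")
    return alarms

# A module with healthy tx power but degraded rx power:
print(check_module({"rx_power_dbm": -12.3, "tx_power_dbm": -2.1}))
```

A drop in receive power with normal transmit power on the far end points at the fiber plant, the dust-on-the-facet scenario mentioned earlier, rather than at the laser itself.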

But the likelihood is, well, not the likelihood, but the most common failure scenario that we have, as you said, is link flaps. And whether we're able to detect the link flap directly or not, that's different. I mean, you've described several proxies that we can use to identify why a link is flapping. But that sort of tells me that there are then events or incidents or problems on the network that we could likely correlate to be able to identify that this is likely a link flap.

Because there is no metric that we're gonna get from our streaming telemetry ingest that says link flap. We're gonna see other things like a packet drop or whatever it happens to be. So is that something that we're doing now, where we can identify, you know, idle GPUs over here that shouldn't be idle as a result of this path change, which is a result of this link flap? I mean, that kind of a thing.

So what we do have is the ability to monitor the health of the physical links, including the link flaps, based on the statistics that we can see at the switch layer.

And this helps us not only to proactively see which links might fail. We have some algorithms that train on the previous set of failures to anticipate what is likely to fail. We can also get an idea as to how much margin we have because of something called the FEC histograms.

So as it gets closer to the limit of the link budget, we will see more errors that are getting corrected without the link itself taking any errors. So it's almost like, an early indication of future issues. So these will help us in terms of debugging the links, etcetera, and that can then be used to infer the implications on the job completion.
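The FEC-histogram idea can be illustrated with a small sketch. RS(544,514) FEC, the code used on many 400G links, corrects up to fifteen symbol errors per codeword; the histogram bins below are made-up example data, not measurements:

```python
# Illustrative sketch of the FEC-histogram margin idea: a histogram counts
# "codewords that needed N corrected symbols". RS(544,514) can correct up to
# 15 symbols per codeword, so mass in the high bins means the link is running
# close to its correction limit even though it is still error-free post-FEC.

FEC_CORRECTABLE_LIMIT = 15  # RS(544,514) corrects up to 15 symbols/codeword

def margin_estimate(histogram: dict) -> float:
    """Return the worst observed bin as a fraction of the correctable limit.
    A value near 1.0 means codewords are at the edge of correction."""
    worst_bin = max((n for n, count in histogram.items() if count > 0),
                    default=0)
    return worst_bin / FEC_CORRECTABLE_LIMIT

# Made-up example histograms {corrected_symbols: codeword_count}:
healthy  = {0: 10_000_000, 1: 1_200, 2: 3}
degraded = {0: 9_000_000, 5: 40_000, 11: 600, 14: 7}
print(margin_estimate(healthy))    # small fraction: plenty of headroom
print(margin_estimate(degraded))   # near 1.0: nearly out of margin
```

The degraded link above still delivers zero post-FEC errors, which is exactly the early-indication property described: the histogram warns before anything actually breaks.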

If you were managing an AI data center, or if you were talking to a CEO of a company or, like, maybe the CIO, right, and you were gonna put on the front page of the executive summary one metric that would give them an idea of the health of the training run, of the data center and how it's operating, what would that be? So there are two hot spots that you will look for.

The one is the obvious dropped packets, and you can definitely aggregate the dropped packets and isolate them to where those are happening. The other one is any emerging congestion points that will have an impact on your job completion time, both of which can very readily be dashboarded, and you can have an overall executive view of what is happening in your network.
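A minimal sketch of that first rollup, aggregating per-interface drop counters into a top-offenders list for the executive view. The record format and the device names are hypothetical:

```python
# Sketch: roll per-interface packet-drop counters up into the handful of
# worst offenders for an executive dashboard. Device and interface names
# here are invented examples.

from collections import Counter

def top_drop_sites(interface_stats, n=3):
    """interface_stats: iterable of (device, interface, dropped_pkts).
    Returns the n (device, interface) pairs with the most total drops."""
    drops = Counter()
    for device, interface, dropped in interface_stats:
        drops[(device, interface)] += dropped
    return drops.most_common(n)

stats = [
    ("leaf-12", "eth48", 9_500),
    ("spine-03", "eth7", 120),
    ("leaf-12", "eth48", 4_200),   # same port reported again; counts add up
    ("leaf-07", "eth12", 2_800),
]
print(top_drop_sites(stats, n=2))
# [(('leaf-12', 'eth48'), 13700), (('leaf-07', 'eth12'), 2800)]
```

The same pattern applies to the second hot spot: swap the drop counter for ECN-mark or PFC-pause counters and the rollup surfaces emerging congestion points instead.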

That's excellent. It really sounds like job completion time is the currency for these companies building, and then really offering as a service, these large language models.

The dollars per model, the dollars per megawatt, knowing when upgrading your network is cheaper than buying more GPUs.

And we've discussed power, heat dissipation, thermals, and all of that kind of thing. You even mentioned liquid cooling, which I think would be a fascinating podcast unto itself. We really just glossed right over that, but that would be really neat. So there's certainly much more that we can discuss, and much of it at a very deep level. So I would love to speak to you again one day on maybe a different aspect, a different sliver of AI training or data centers for AI training.

So, Vijay, thank you so much for your expertise, your insight, your experience, and your willingness to come on and talk to me today. It was thoroughly enjoyable, and I'm sure our audience feels the same.

Thank you, Phil. Absolute pleasure.

So to our audience, if you have a comment about today's episode or a question, I'd love to hear from you. You can reach out to us at telemetrynow@kentik.com. So for now, thanks so much for listening. Bye bye.

About Telemetry Now

Tired of network issues and finger-pointing? Do you know deep down that, yes, it probably is DNS? Well, you're in the right place. Telemetry Now is the podcast that cuts through the noise. Join host Phil Gervasi and his expert guests as they demystify network intelligence, observability, and AIOps. We dive into emerging technologies, analyze the latest trends in IT operations, and talk shop about the engineering careers that make it all happen. Get ready to level up your understanding and let the packets wash over you.