Telemetry Now  |  Season 1 - Episode 25  |  October 10, 2023

How data center networking is changing in the age of AI

Justin Ryburn
Field Chief Technology Officer, Kentik

Justin Ryburn is the Field Chief Technology Officer for network observability company Kentik. He has 25 years of experience in network operations, engineering, pre-sales, and pre-sales leadership with service providers and vendors. Justin contributed content to Cyber Forensics (Auerbach Publishing, 2007) and authored Day One: Deploying BGP FlowSpec (Juniper, 2015). He has also spoken at numerous industry conferences on the topics of network monitoring and security.

Connect with Justin on LinkedIn

Transcript

Philip Gervasi: Buzzword alert, artificial intelligence, it's probably the buzzword of the year, and that's both in tech and in popular media, especially if you're thinking about things like large language models or LLMs and platforms like ChatGPT. The thing is that AI, it's really not just a buzzword anymore. We really are getting into a new world of very advanced computing and data analysis with more powerful computers running more complex workloads and huge data sets. The thing is that these powerful computers, they're not just standalone mainframe boxes, these giant monolith computers that sit in data centers or in university labs. What we're seeing today is that these are actually collections of distributed computers working together to spread an AI workload across many GPUs, usually connected over a pretty traditional network in some ways. So if you think about it, the network that connects all of these GPUs is absolutely critical to how we perform AI tasks, how those AI jobs get done today. And what we're finding is that traditional data center networking isn't cutting it anymore. So what we're going to talk about today is how networking has adapted, or is adapting, to accommodate these new types of distributed AI workloads. Now I'm being joined by Justin Ryburn, the Field CTO at Kentik, and he's also a veteran network engineer in both the service provider and enterprise spaces. And I'm Philip Gervasi. This is Telemetry Now. Hey, Justin, it is good to have you on again. How are you?

Justin Ryburn: I'm doing well, Phil. How are you?

Philip Gervasi: I'm also doing well. Very well, actually. I'm recording on the road today. I'm at my mom's house actually because my brother is getting married tomorrow. So we're all here packed in the house and a lot of commotion, a lot of fun. And looking at your calendar, I see that you have a lot of PTO lined up for the next few days, actually more than a few days. Are you doing anything exciting?

Justin Ryburn: I am. Well, first of all, congratulations to your brother. I'm going to be taking a vacation in the Northeast; my wife and I are taking my parents on a little trip for their anniversary. We're flying into Boston and taking a cruise ship that goes up the Northeast coast and makes some stops in Maine and in Canada. So this time of year, as we're recording this in late September, it should be absolutely beautiful in that part of the country. We're really excited and looking forward to that.

Philip Gervasi: Yeah, Upstate New York and then into New England really is special this time of the year. The colors of the leaves changing, the weather, it tends to be drier and it's in the sixties and sunny, so it's very comfortable. It is interesting to me how it seems like somebody flips a switch on September 15th to take us from summer to fall that quickly and dramatically. It's neat how that happens every year, but we got pumpkin spice everything. We got apple scented candles everywhere. It really is nice like a Norman Rockwell painting that's all unfolding before your eyes every morning. So we really like this time of the year, and then of course when we hit the end of November, it goes into the brutal winter of Upstate New York and New England for the few months till about April. Actually, one thing that you should consider checking out if you're into that kind of thing is since you're heading into that part of the country in late September, early October, a lot of the microbreweries around Upstate New York and New England are having their Oktoberfest celebrations this time of year.

Justin Ryburn: Yeah, we have some of those on our list actually.

Philip Gervasi: Well, I hope you guys really enjoy it, you and your whole family. It's going to be great. So the last time that you and I spoke, though, was maybe about a year ago, as far as on this podcast, I mean, you and I speak all the time, but the last time you were on Telemetry Now was when you shared the opinion that network engineers out there in the world are underutilizing flow data, NetFlow, sFlow, J-Flow, all that stuff, and how there's just so much more we can do with it. But today I don't want to talk about flow necessarily. We can if we need to, but what I want to get your opinion on is what's going on with data center networking, how it's changing or how it needs to change to accommodate new AI workloads. And I know that's not all data centers, but it is those data centers that are purpose-built to run these types of workloads. There are new networking requirements, very specific requirements for how traffic moves and how these artificial intelligence workloads function. So Justin, let's start with that. What is different about networking for AI workloads? What is special about it?

Justin Ryburn: I think the workloads themselves are very different from what we were used to when it came to how we built, architected, and scaled data centers for traditional web applications. I mean, most of the applications that are served up by your normal spine-leaf or three-tier switching architectures in a data center were really designed for web applications of one form or fashion. And AI workloads are much different. If you read a little bit about how they're solving these large data problems in these data centers for AI, it's really a huge distributed computer. The entire data center becomes one big computer. You can no longer put enough GPUs, enough CPUs, into a single piece of sheet metal in a rack, Moore's law just can't keep up, to be able to process these data sets in one machine. So what you wind up doing is you distribute those GPUs all across the data center and then you interconnect them across the network, and like I said earlier, the entire data center becomes one huge computer for munging all these data sets.

Philip Gervasi: Yeah, yeah. And like Sun Microsystems said back in the mid-eighties, the network is the computer, and I think you'd probably agree, based on what you just said, that that's never been more true than it is today. I mean, if you think about it, we're not standing up one giant monolith mainframe that's doing all our computational analysis and database pulls and all that stuff. We can't physically do that. Given where we are with wafers and chip technology, down to something like five nanometers, the physics won't allow us to get much smaller and add more resources on those wafers without extraordinary increases in cost that don't result in that much of an increase in the ability to do these calculations and process these workloads. So the answer, like you said, is to distribute them among many, many nodes, and I think some data centers that are doing this kind of activity are up to thousands or tens of thousands, like 30,000 or 32,000 GPUs, in a particular AI interconnect, and that's a term I want to throw out there. We're talking about networking for artificial intelligence workloads. So we have all these GPUs connecting, and we're going to call that an AI interconnect. It's the networking that connects all these GPUs that are doing their work together, not necessarily the network that's connecting that entire group of GPUs to the rest of your traditional network, especially your web servers that you mentioned earlier. The AI interconnect is the thing we're talking about today. Now is this just IP networking, or are we talking about some kind of fancy proprietary vendor-specific thing?

Justin Ryburn: There are some competing standards. I mean, it's IP, but at layer two of the OSI model there are still competing standards between InfiniBand and ethernet, and some people land on one side of that and some people land on the other side. I think if I were going to make a bet on this, I'm going to bet on ethernet just because it's very ubiquitous and we've used it in a lot of other applications. I mean, even your modern cars now are using ethernet for connectivity between the chip that drives the vehicle and all the various components in it. It's just become so easy to cable out, the protocol is so well understood, and so many of your staff already understand it. If I were a betting man, I'm going to bet on the side of ethernet on that, but there are arguments I think to be made for InfiniBand because, again, we're talking about a high-bandwidth, low-latency type of network for these GPUs, and InfiniBand has worked very well in storage area networks, a very similar type of environment where you have similar engineering requirements for what you're trying to accomplish.

Philip Gervasi: Yeah, it sounds like the problem then with InfiniBand and similar technologies is that, one, they're proprietary, right? So they're not, like you said, ubiquitous across the entire industry. Ethernet is plug and play, it's everywhere, it's very straightforward and simple, and everybody's familiar with it, but also think about where people are putting their research time and money. So moving forward, since the industry has rallied around ethernet, that's where we're going to be in five years; we're not putting a tremendous amount of effort into InfiniBand. And also, I think for scaling purposes, we have the technology in ethernet, especially as we get into very high bandwidth, and I get it, there are costs with optics and things, but we can run entire data centers off of ethernet, and then run that interconnect between that AI workload and our traditional data center with ethernet, and so on and so forth. So I would agree with you there, but we're still just talking about the fact that these GPUs are talking over a network. Okay, who cares? What's the actual problem here? Why don't we run it over a traditional data center network? I mean, can't the GPUs talk to each other that way?

Justin Ryburn: Well, it's interesting because the coordination between these GPUs is super high. In a traditional data center, if you had multiple web applications, they're operating independently. Your traffic patterns are very north-south in nature. What I mean by that is you may have a user out on the internet that comes and accesses your web application, and then your web application responds to them with a page. So you have a very north-south-heavy traffic flow. Sure, you have some east-west where you may have to pull information from a database or something along those lines, but the majority of your traffic is north-south. In the AI interconnect environments that we're talking about, there's heavy coordination between the various GPUs because they rely on one another to process this data. So just as a tangible example here, you may have one GPU that's processing one dataset, and the results of that are an input into the next run of a language model that needs to be processed. So until one GPU completes and passes its data along to another, you're hung up waiting for that GPU to complete, and the entire job has to be coordinated as one job. So you need a high amount of bandwidth between these devices. You need low latency, ideally no loss or very low loss. You need a non-blocking architecture. There are a lot of requirements on the interconnect network that you don't really have in a traditional data center fabric that's serving up web applications.
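
To make that gating effect concrete, here is a minimal Python sketch, with an assumed per-step time and an assumed straggler delay (none of these numbers come from the episode), of how a synchronous step across many GPUs can only finish when its slowest participant does:

```python
# Hypothetical sketch: in a synchronous training step, every GPU must finish
# its compute and network transfer before the collective step can complete,
# so the step time is gated by the slowest participant.
import random

def synchronous_step_time(per_gpu_times_ms):
    # The step can only complete once the last GPU has produced and
    # transferred its results.
    return max(per_gpu_times_ms)

# 32,000 GPUs that each nominally take 10 ms, with one straggler delayed
# by 2 ms of assumed network congestion.
times = [10.0] * 32_000
times[random.randrange(len(times))] += 2.0

print(synchronous_step_time(times))  # 12.0 -- every GPU effectively waits the extra 2 ms
```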

Philip Gervasi: Yeah, yeah. You're talking about the job completion time, which is ultimately how long it takes for these GPUs, working in a synchronous fashion, not autonomously in an asynchronous fashion, working together in a partial mesh or a full mesh, to complete this AI job. And the job completion time is going to be determined by the slowest link, which could be a particular GPU that is stalled because of some congestion on the network that that one single GPU among 30,000 is experiencing. And when you have one GPU that sits idle for a moment, it stalls all the GPUs, because everything is now waiting and sitting in that idle state for maybe a millisecond or two. But keep in mind that 1.5 milliseconds or two milliseconds, which may seem like nothing, like a throwaway number, compounds over time, over many seconds, minutes, hours, even months; that's how long some of these large AI workloads take to complete, literally weeks and months. You're talking about huge delays in the completion of the job, the job completion time. So that means, yeah, we end up with this movement of individual flows that are very, very large, because we're transferring entire data sets from maybe one pod of GPUs to another pod of GPUs. And yeah, I think you're right that that's in contrast to the way a lot of traditional data center networking is, where it's a lot of north-south: you have a lot of hits on web servers and other services, and that web server or cluster of web servers might in turn be making calls on backend databases, so there's some east-west, but it's not nearly at the scale we're seeing with AI workloads. So yeah, we're looking at what I guess you could call elephant flows. That's what we called them back in the day, when you have these large individual flows as opposed to many, many individual lightweight flows, which is typical for web traffic. Most web traffic is on the lighter side and asynchronous also. So it sounds like one of the goals of this new type of networking in the data center for AI workloads is to find ways to reduce that job completion time, because of the danger of the network itself becoming the bottleneck. And so how do we do that? You talked about having higher bandwidth, and I know that some of these high-end nodes can connect into the network at 200 gigs. That's pretty serious bandwidth. And I know there are standards coming down the pike for double and quadruple that. You mentioned something about being non-blocking; maybe you can define that for us. And oversubscription, we should definitely talk about that as well. So yeah, how do we solve this problem? In what specific ways is this different than my traditional data center?
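
As a rough, hypothetical back-of-the-envelope calculation (the stall size and step count are assumptions, not figures from the episode), here is how a small per-step network stall compounds over a long training run:

```python
# Hypothetical sketch: a small per-iteration network stall, multiplied across
# the many synchronous steps in a long training run, adds up to a meaningful
# extension of job completion time.
stall_ms_per_step = 1.5      # assumed stall from one congested link per step
steps = 5_000_000            # assumed number of synchronous steps in the run

added_hours = stall_ms_per_step * steps / 1000 / 3600
print(f"Extra job time: {added_hours:.1f} hours")  # ~2.1 hours from a 1.5 ms stall per step
```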

Justin Ryburn: Yeah, I mean, there are some really interesting articles out there on the concept of job completion time. And like you said, a small millisecond delay may not seem like much, but once you multiply it over a long run time on a job, as well as over 32,000 GPUs like you mentioned earlier, it becomes an exponential problem. And it's really fascinating to read about how they calculate those job completion times. But to your point, I could show my age here a little bit, but I can remember when one gig on an interface was a lot of bandwidth, and now we're talking about 200 as the standard bare minimum in some of these AI interconnects, with standards being ratified for 400 and 800 gig. I think we'll talk a little later about the Ultra Ethernet Consortium that's coming up and working on trying to drive these speeds and standards even faster, because they just need so much bandwidth on any given link. But the only way you can really service the type of elephant flows and the type of bandwidth demands that we're talking about here is to have multiple links. And so you have to take that traffic and spray it, as the term goes, across all the links that are available to you. I can remember early on in my career, that was how we did traffic across multiple links: we actually did spraying. The downside was that you then had to have some way to reorder those packets on the other end, because if you put one packet that's part of a larger flow on one link and another part of it on another link, and they come out the opposite end of that path in a different order due to whatever delays in the network, then you have to buffer and reorder those packets. And so for a lot of your TCP-type connections and a lot of your web applications, people don't want to do that, because it causes more headache than it's worth to reorder and reassemble those packets on the far end. With the AI workloads, that's actually the state of how they're doing things. They're actually doing this concept of packet spraying, and they schedule jobs, we can talk a little bit more about that, so that they can handle that reordering. So we're back to an earlier technology that we had in the early days of ATM and the like, where we're intentionally spraying, doing per-packet load balancing across all of the links available to us. I think one other thing you mentioned there, Phil, that I'll talk about is oversubscription. We used to, at least I used to, when I was involved in data center designs, a lot of times intentionally design oversubscription between the spine and the leaf layer. So we may have two 10-gig links down from a top-of-rack switch to a server, and then maybe half as much capacity going up to the spine layer, or maybe three-to-one, four-to-one, five-to-one oversubscription. And that was considered to be okay in the design. Like anything in engineering, it's a trade-off: it lowers your cost, and the presumption was that not all of your web applications would be filling up your links at any given time. So that oversubscription was a good engineering trade-off to get to a reasonable cost and still provide a good experience to the applications. When you start talking about these huge elephant flows that these AI workloads are creating in an AI interconnect, you can't really do that. You have to be one-to-one; you can't have the oversubscription in your design that we're used to.
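
Here is a minimal sketch of what that oversubscription ratio means in practice; the port counts and speeds are illustrative assumptions, not from any specific design discussed here:

```python
# Hypothetical sketch: the oversubscription ratio of a leaf/top-of-rack switch
# is its total downlink (server-facing) bandwidth divided by its total uplink
# (spine-facing) bandwidth. AI fabrics aim for 1:1, i.e. non-blocking.
def oversubscription_ratio(downlink_ports, downlink_gbps, uplink_ports, uplink_gbps):
    return (downlink_ports * downlink_gbps) / (uplink_ports * uplink_gbps)

# Classic web-era design: 48 x 10G down, 4 x 40G up -> 3:1 oversubscribed
print(oversubscription_ratio(48, 10, 4, 40))     # 3.0

# AI interconnect leaf: 32 x 400G down, 32 x 400G up -> 1:1 non-blocking
print(oversubscription_ratio(32, 400, 32, 400))  # 1.0
```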

Philip Gervasi: Right. And the reason we don't want that oversubscription, the reason we want one-to-one, is because we want to make use of all available paths and links. Remember that we can't even tolerate a one-millisecond delay on a particular GPU's data transfer. And so we want this packet spraying, and that's how we're going to load balance. That means every link matters, and we can't drop a packet, we can't drop one of those packets on any of the many links that we're choosing. So we need everything available, and we need high bandwidth and ultra-low latency with no packet loss, no blocking, none of that stuff. And that's different from the way we used to do it with a LAG or ECMP, which is flow-based. So you have some kind of a hashing algorithm and you're pinning an entire flow to a link or to a path, and then from the very first packet of that flow to the very last packet of that flow, you're taking that path; it's deterministic. Whereas packet spraying, or multi-pathing as I've also seen it called, makes the most use of all your physical links in your fabric. So it gets us away from flow-based path decision making, and like you said, it's per packet. But to do per-packet load balancing when it's not just a hash, that has to mean there is some serious intelligence making those decisions in a runtime environment. So how does the control plane work for a packet spraying environment like we've been talking about?
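
To illustrate that difference, here is a small, hypothetical Python sketch contrasting a 5-tuple ECMP-style hash, which pins every packet of a flow to one link, with a simplistic per-packet round-robin "spray"; real fabrics schedule far more intelligently than this:

```python
# Hypothetical sketch: ECMP/LAG hashing pins a whole flow to one link, while
# packet spraying distributes the same flow's packets across every link
# (at the cost of possible reordering at the far end).
import hashlib
from itertools import cycle

LINKS = ["link0", "link1", "link2", "link3"]

def ecmp_pick(src_ip, dst_ip, src_port, dst_port, proto):
    # Hash the 5-tuple: every packet of this flow lands on the same link.
    key = f"{src_ip}{dst_ip}{src_port}{dst_port}{proto}".encode()
    return LINKS[int(hashlib.md5(key).hexdigest(), 16) % len(LINKS)]

spray = cycle(LINKS)  # simplistic per-packet round-robin "spraying"

flow = ("10.0.0.1", "10.0.0.2", 49152, 4791, "UDP")
print([ecmp_pick(*flow) for _ in range(4)])   # same link four times
print([next(spray) for _ in range(4)])        # all four links used
```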

Justin Ryburn: Yeah, so that's another area that's very interesting as far as innovation in the industry. There's a lot of competing standards. It'll be interesting to see where we land if one standard wins out over another. But a lot of your hyperscalers that are building these AI interconnects, they each have their own approach on how they're doing it, but generically, they have some sort of scheduler that's figuring out the path through the network that a particular AI workload is going to take and figuring out what bandwidth is available and then spraying the traffic across those links to maximize the utilization of every single link that is available to it. Now, that's typically a centralized controller that's making those pathing decisions, but then pushing that intelligence, that knowledge down to the individual switches because at the end of the day, the individual switch has to know what do I do with a packet that arrives on a particular inbound interface, like what interface do I shove it out of on the outbound side? So the forwarding plane, the forwarding decisions still have to be made by the individual switches in your AI interconnect fabric. But the scheduling of that and figuring out how to program those ASICs is typically done by a centralized controller, which is, I don't know, maybe the term's SDN, this sounds like a term we've had before here, Phil, as an industry, right?
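
A minimal sketch of that split, with assumed link names and a made-up load increment, might look like this: a central scheduler picks the least-loaded path from fabric-wide state, and the resulting plan is what would then be programmed down into the switches' forwarding hardware:

```python
# Hypothetical sketch of the scheduler/switch split described above: a central
# controller computes per-flow placements from fabric-wide state, then pushes
# those entries down; the switches still make the per-packet forwarding decisions.
def schedule_paths(link_utilization, flows):
    # link_utilization: {link_name: fraction_in_use}; returns a per-flow plan.
    plan = {}
    for flow in flows:
        best = min(link_utilization, key=link_utilization.get)
        plan[flow] = best
        link_utilization[best] += 0.25  # assume each placement adds ~25% load
    return plan

fabric_state = {"leaf1-spine1": 0.20, "leaf1-spine2": 0.60, "leaf1-spine3": 0.35}
print(schedule_paths(fabric_state, ["gpu7->gpu1912", "gpu8->gpu2044"]))
# {'gpu7->gpu1912': 'leaf1-spine1', 'gpu8->gpu2044': 'leaf1-spine3'}
```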

Philip Gervasi: Sure, absolutely. And these days we're always looking at how SDN, which was kind of a marketecture five, seven, eight, ten years ago, is now starting to manifest itself in these various ways: SD-WAN, programmatic infrastructures, and now an entire fabric that is purpose-built for AI workloads. But I do understand what you're saying. We're going to have a controller of some sort with policy, whether that be dynamic thresholds or hard thresholds for the quality of each link, but that decision-making process, again in a real-time, runtime environment where we're looking at individual packets, has to happen on the NIC or on the switch. Number one, we can't be waiting for the control plane to send its decision; that bidirectional conversation is out. And we also don't want to dump control plane traffic onto the network, thereby possibly causing additional issues with congestion and latency and things like that. Now, you could probably have an out-of-band network to do that for you. But what we're seeing with chip manufacturers, I'm thinking Broadcom, Nvidia, things like that, is that this decision-making process is happening locally on the box in order to eliminate those problems, to move traffic as fast as possible and make those decisions as efficiently as possible. And the decisions are, first, which link is best to use right now, based on whatever metrics. So it's not just a path; we are now gauging the quality of a link. And that's really interesting, but we're also looking at what we're going to do with the next packet. So there is a sort of predictive component here, where this is how this link has been behaving in the past two seconds, I don't know exactly how these things work, but in these short amounts of time, and then we're going to queue up the next packet or the next series of packets, depending on what you're doing, to use this same link or this other link. So there is both a current decision on where we're going to forward this packet in the immediate term and then this queuing activity going on as well. And I believe we're also starting to see things like RDMA, for example, which is a method to offload this traffic directly to the NIC, so you're doing direct memory-to-memory communication, as opposed to traditional TCP/IP, which uses the kernel to process and send the data and all that kind of stuff. So what are the things that we're looking for then? I mean, obviously we're talking about the quality of the path and stuff like that. We're talking about how much bandwidth we have, and you mentioned a scheduler, but we're connecting all this stuff with good old-fashioned cables. I mean, we're talking about a physical data center, so how is that different, or are we just using traditional copper and fiber?

Justin Ryburn: Well, I mean, it can be done that way, but that brings up another interesting engineering challenge, which is power density, right? We're talking about 32,000 GPUs in a given data center, combined with all of the AI interconnect fabric equipment we've been talking about. That's a lot of switches, a lot of ports on those switches, and that draws a lot of power. So you come up with a power and heat density type of problem. That's another engineering challenge you have to deal with. And so what we're seeing is a lot of companies trying to figure out ways they can reduce points of failure, reduce power draw, and still get these high bandwidths, and there are a couple of interesting trends or innovations that we're seeing in this area. One is going back to using DAC cables. I mean, they've been around for a while, and at least in the last data center I designed, they were really popular between the NIC and the top-of-rack switch, but typically from the top-of-rack or leaf switch up to the spine you went ahead and did fiber optic, because you likely already had it run between your various racks, and especially if you were having to go from one cage to another or from one room to another in the data center, most of the cabling between those was already fiber optic, so you were using fiber optic. But most of these AI interconnects are completely DAC, because now you can reduce some power draw by not having all of those lasers, and you remove a point of failure. One of the number one failure scenarios in an optic is the light itself, the laser itself. And so by doing DAC cables, you don't have active electronics there, so you have less power draw and one less failure domain, one less place where a failure can occur. Another interesting thing is this concept of linear pluggable optics, and I won't pretend to be the expert on all things optics, but I encourage the people listening to go check this out. Essentially they're removing the DSP module from the pluggable optic. It makes it maybe a little bit less flexible in what you can use it for, what kind of other things you can plug it into, but it essentially allows your fiber to communicate directly with the [inaudible] and removes some of the intelligence of the optic and some of the things that it does. And the trade-off there, again, engineering is all about trade-offs, is that it's lower cost and lower power draw, so there's less power we have to provide to a switch full of these optics if you use these LPOs, linear pluggable optics, if I can get the term correct. So that's a really fascinating innovation to me as well.

Philip Gervasi: Yeah, and a reduction in the amount of heat that it throws off as well. So that's absolutely right. And yeah, I remember using DAC cables. Saying DAC cable is like saying ATM machine, automatic teller machine machine; it's a direct attach cable cable. But I do remember using DAC cables in data centers a lot of the time. When I was setting up storage networks and things like that, that was the most common use, and then running single-mode or multi-mode to the rest of the network from there, whatever was necessary. But we're seeing that again, like you said, for low latency, low power draw, fewer failures. I do see a lot of activity happening on the ethernet side too, whether it's the development of optics, and you mentioned LPOs, just a lot of time and research and effort being put into improving ethernet so that it does become the de facto standard for connecting everything. In fact, there's something called the Ultra Ethernet Consortium, which, if folks aren't familiar, go look it up, but it is a collection of mostly network vendors. I think one hyperscaler is involved, Meta; maybe today it's two, I don't know, but last time I checked it's mostly network vendors trying to solve these problems with AI interconnection, primarily over ethernet. And so they put out literature, they do research, they have conferences, they do all that kind of stuff. And from what I understand, they're going to have their first UEC standards coming out next year, in 2024. So we're going to see some movement there. It's really interesting that it's not just that we're all relying more on artificial intelligence and this more advanced data analysis to empower the way we're living our lives these days, but that the entire industry has to change to accommodate that as well. Now, as you were talking about scheduled fabrics and offloading intelligence here and there and doing all these things, if we are gauging the quality of our connections, and we're doing it at that sub-second level, and it is also vital, like mission-critical, all of these things coming together suggest to me that visibility, network telemetry, whether it be traditional or maybe some new form that I'm not aware of, is incredibly, incredibly important. I mean, I have to assume you agree, Justin.

Justin Ryburn: Yeah, it's amazing we've gotten this far into the podcast and are just now bringing up telemetry, right? But yeah, it's absolutely critical. If you're going to schedule your traffic on particular links, you need to know the quality of those links. You need real time, and that's definitely faster than a five-minute SNMP poll, to be able to figure out, okay, how loaded is this link? Do I have loss on that link? What's the latency across that link? You need to be able to answer questions like that to figure out which links to schedule particular traffic on. I mean, it's a full loop. In order to do that type of automation, that type of full scheduling, you have to have that telemetry data. You have to have that information in order to make those decisions; you can't make them without that knowledge. And so yeah, we're seeing some interesting innovations from Broadcom and some of the other chipset makers where, right on the chip, they're going to have the ability to export some amount of telemetry data. That could be things like: what's the queue depth? What's the utilization on the link? What's the packet loss? What's my latency? So yeah, I think it'll be really fascinating to see what kind of telemetry data we're able to get out of these newer chipsets that are coming out.
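
Here is a minimal sketch of the kind of per-link, sub-second sample a fabric scheduler would need, with a simple health check over it; the field names and thresholds are illustrative assumptions, not any particular chipset's export format:

```python
# Hypothetical sketch: a per-link telemetry sample of the metrics mentioned
# above (queue depth, utilization, loss, latency) and an assumed policy for
# deciding whether a link is currently worth scheduling traffic onto.
from dataclasses import dataclass

@dataclass
class LinkSample:
    link: str
    utilization: float   # fraction of line rate in use
    queue_depth: int     # packets (or cells) waiting in the egress queue
    loss_pct: float      # drops as a percentage of packets sent
    latency_us: float    # latency across the link in microseconds

def schedulable(s: LinkSample) -> bool:
    # Assumed policy: avoid links that are nearly full, dropping, or queuing deeply.
    return s.utilization < 0.9 and s.loss_pct == 0.0 and s.queue_depth < 1000

sample = LinkSample("leaf3-spine2", utilization=0.72, queue_depth=120,
                    loss_pct=0.0, latency_us=4.5)
print(schedulable(sample))  # True
```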

Philip Gervasi: And a lot of the telemetry that we're going to need, and that we do need today, is really the same information that we've been gathering for a long time, but now we're talking about connecting to the network at the NIC at 200 gigs, or even 400 or 800, which we're going to see in time. So if that's the line rate we're operating at, and we need to know what's going on at a packet-per-packet, sub-millisecond level, then yeah, there are going to be some interesting advancements in how we do telemetry moving forward. Whether it's just advancements and changes in flow sampling and how we monitor the state table of a particular switch and things like that, I'm not exactly sure, but it is-

Justin Ryburn: That's a really fascinating one. We hadn't talked about the state tables.

Philip Gervasi: Yeah, absolutely. Where are my individual flows? How do we know, as far as packet reordering on the other side, and how do we put all that back together in such a fast way, without error, that we don't affect job completion time? A lot of those things, I'm not exactly sure how we're going to do yet, but I am interested to see all that. I mean, we are looking at hyperscalers leading a lot of this charge as well. We've been talking about the chip makers themselves, but there are hyperscalers out there that are building their own SmartNICs so they can design their own protocols, whether that be for scheduling fabrics or for avoiding congestion, congestion control protocols in general. So everybody has a very vested interest in this, and sure, there's the for-profit component, and then there's of course the R&D component from the large universities. But really, all of this is so that we can reduce the job completion time and the network isn't a bottleneck to the completion of the AI workload, because when it comes down to it, some of these data centers, and these are purpose-built data centers, are built to do artificial intelligence tasks, right?

Justin Ryburn: Yeah, for sure.

Philip Gervasi: They can be hundreds of millions of dollars, if not over a billion. So if I can find a way to reduce power consumption, reduce cooling needs, reduce the number of optics, and make use of every single path I have available so that nothing is idle, so I can use it to its maximum and therefore be efficient, just like we do with compute, where we run our GPUs at 99%, that's really the key here. And that's why I think a lot of these hyperscalers, as well as the vendors and academia, are so interested in this, because there is money to be made here. And of course there's also money to be made when you save money and make your operations more efficient. Really interesting stuff.

Justin Ryburn: Yeah, and I mean, that's one of the reasons I love this industry, and I think you'd probably echo this, Phil: there's never an end to interesting challenges to solve. It seems like every day there's something new, different, and unique. Sometimes you get to reapply solutions that you've had before, like we were talking about earlier; I mean, packet spraying or multi-pathing has been around for decades, but being able to apply that solution to a new problem statement, a new problem domain, is really fascinating to me.

Philip Gervasi: Yeah, it's interesting that a lot of the technology we've talked about isn't new. It's actually old technology that's maybe being applied in a new way or being updated in some way.

Justin Ryburn: Applied in a new way. Yeah.

Philip Gervasi: Absolutely. To solve a problem. And that's how I've always defined engineering too. It's like, okay, here's the problem. How can we solve the problem? Not what new box can I buy necessarily. Although sometimes it means I need a new box so I can have greater port density or more bandwidth, but it's what tools do we have available, fancy or not. I remember hearing somebody say that BGP is old and dumb, and I'm just like, what? That makes no sense. The idea of a technology that works well, being not good simply because of its age makes no sense to me. So it is interesting to see that we are resurrecting things like InfiniBand and DAC cables and that kind of stuff for moving forward. So in any case, Justin, really interesting conversation, great to have you on again. Look forward to doing something like this again soon.

Justin Ryburn: Yeah, thanks for having me.

Philip Gervasi: Absolutely. So if folks want to reach out to you, if they have a question about artificial intelligence, how we do networking with artificial intelligence workloads, how can they find you online?

Justin Ryburn: Yeah, sure. So probably LinkedIn is where I'm spending the most time these days. I'm Justin Ryburn, last name spelled R-Y-B-U-R-N, on LinkedIn. Same handle on Twitter, I guess we call it X these days, though I don't spend nearly as much time there. Or you can always feel free to drop me an email, jryburn@kentik.com.

Philip Gervasi: And you can find me online @network_phil on Twitter. I am also Philip Gervasi on LinkedIn. My blog is networkphil.com, which I have really neglected in the past year, so I'm going to try to get busy with that again. Now Justin is Kentik's Field CTO. So if he doesn't get back to you right away, keep emailing him, keep sending him notes on X and on LinkedIn because I know he loves that.

Justin Ryburn: Persistence is key.

Philip Gervasi: And Justin, I know that you love the engagement and being a part of the networking community, I know that that's important to you.

Justin Ryburn: For sure.

Philip Gervasi: Yeah, we're definitely all about the network and community at Kentik. We love the community and being a part of it, and that's what Telemetry Now is all about. And on that note, if you have an idea for an episode or if you'd like to be a guest on Telemetry Now, I'd love to hear from you. Our email address is telemetrynow@kentik.com. Just shoot us a note and we can start from there. So until next time, thanks very much for listening. Bye-bye.

About Telemetry Now

Do you dread forgetting to use the “add” command on a trunk port? Do you grit your teeth when the coffee maker isn't working, and everyone says, “It’s the network’s fault?” Do you like to blame DNS for everything because you know deep down, in the bottom of your heart, it probably is DNS?

Well, you're in the right place! Telemetry Now is the podcast for you!

Tune in and let the packets wash over you as host Phil Gervasi and his expert guests talk networking, network engineering and related careers, emerging technologies, and more.
