Telemetry Now  |  Season 2 - Episode 55  |  August 14, 2025

Neocloud and the Future of AI Operations

In this episode of Telemetry Now, Phil and guest David Cliffe explore “neocloud”—what it is, why it’s growing fast, and how it’s reshaping AI infrastructure. From GPUs and high-speed interconnects to orchestration layers and energy constraints, we discuss what makes neocloud different from traditional cloud, and what it means for data center operators, service providers, and AI architects.

Transcript

Years ago, in nineteen eighty-four, John Gage from Sun Microsystems famously coined the phrase, "the network is the computer."

And when we think about how AI models work today, especially large language models, both training and inference, that's really never been more true than it is now.

Training AI models like GPT and Claude, and even much smaller models, means thousands, tens of thousands, or with some models even approaching a hundred thousand GPUs connected together with ultra-high-performance networking in specialized data centers, so that they can all work together to make the magic of ChatGPT and Llama and Claude actually happen.

And the problem is, for most of us, for most organizations, training your own model is a real challenge. The cost and the logistics of retrofitting or even building a specialized AI data center are just crazy.

GPUs are insanely expensive, and sometimes it really is just a moot point because you can't get the power to the data center you wanna retrofit.

So what do you do? This is where neoclouds come in, a new type of cloud service provider, and it's what we're talking about on today's show. So with me is Dave Cliffe, a product manager at Kentik and an SME on cloud technologies, and we'll be unpacking neoclouds.

My name is Philip Gervasi, and this is Telemetry Now.

Cliffe, thanks so much for joining the podcast today. It's been great to know you, to meet you, only this last, what was it? Last March, I think, at our all-company off-site. So that was great, to chitchat with you and meet you for the first time.

And now here we are recording a podcast together. But before we get started: I call you Cliffe. The name on the screen here in our recording software says Cliffe, but the social media card for this podcast, and, you know, when I email you, it says Dave or David. What's going on there?

That is so strange. I wonder why. No, I have some background there for sure.

Yeah, this goes all the way back to high school calculus class. I remember Mr. Hayhurst affectionately used to call me Cliffe, and he called on me often. Not entirely sure why.

Calculus was something that I enjoyed a lot, so maybe it stemmed from that, but it definitely picked up again when I spent a lot of time at PagerDuty, from series A through IPO. We had a lot of Daves at the time, even in product management, where I am with Kentik now. And so, yeah, in order to disambiguate the many Daves, Cliffe became the moniker and stuck.

I think my Twitter handle for a while was Cliffehangers.

Uh-huh. Cliffe with the e, of course, and that's kind of stuck on LinkedIn as well now. So, yep, definitely still respond as Cliffe. There are fewer Cliffes in the world, and thank goodness for that.

Okay. Very good. So you're a product manager at Kentik.

And I know that you're focused very much on the cloud, but what does that mean? I know what a product manager does, but tell us about you personally and what your role is, especially for Kentik, and specifically in the context of cloud, since we're gonna be getting into the whole concept of neocloud in a couple minutes.

Yeah, I'm glad that you know what a product manager is, because sometimes in my day to day I wonder if I do. A dictionary, that's right. Yeah. No, you can ask Gemini that nowadays.

Yeah, product management, I mean, man, I've been doing this for probably close to fifteen years now. It's definitely evolved as a discipline from, like, the early Microsoft days when I was there and it was called program manager.

There was a whole manual written on what it was, and then program management kind of split a little bit. And I'm definitely more of the Marty Cagan kind of product manager, in terms of just focusing on what is valuable, usable, and feasible, breaking it down that way. And at Kentik, that's thankfully consistent. I love that we work in more of a triad model with designers and engineering counterparts, bringing those three elements into what we build at Kentik.

And so in terms of the cloud experience, yeah, we've got our heads in the cloud, in many clouds, in fact. We support AWS, Azure, GCP, OCI, with definitely more to come in the future as customers continue to bring workloads from all over the place, multi-cloud that I'm sure we'll touch on in a little bit, and the network implications of that. But, yeah, it turns out there's a lot of network activity that people wanna analyze and understand in their cloud environments.

And so we're trying to make sense of all of that activity that people are seeing, because, I mean, the network is truth. That's what I heard from a customer recently.

Mhmm. That is the way they think about it, especially in their cloud environment.

That's the way you really know what's going on: what's on the wire.

Well, why don't we expand upon that a little bit? The network is truth.

You just said that's how you know what's going on, because you see what's on the wire. Can you expand on that? What does that mean exactly?

Yeah. I mean, traditionally, I think, a lot of observability in particular has been focused heavily on metrics.

And, you know, those are really important elements of understanding what is going on within your environment.

You know, logging of individual processes on your servers, etcetera.

But, really, what is going on within the network is ultimately going to be the truth of what you're trying to analyze.

And, you know, sometimes the metrics can lead you astray. You really do need to get down to that log level. Metrics in many ways are a way of optimizing the way that you look, finding that needle in a haystack oftentimes.

Mhmm.

But if you really wanna get down to, okay, what happens specifically in the communication between this instance and that instance, or this instance out to the Internet, to some other cloud potentially.

Really, you need to be able to look at and understand what that flow is.

And so it's really exciting and interesting being at Kentik and trying to make sense of the massive amount of network flow data that we look at, the trillions and trillions of records that we help to process in our customers' environments every year.
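To make that concrete, here is a minimal sketch of the kind of flow-level view being described: raw flow records (who talked to whom, and how much) aggregated into a conversation view that metrics alone can't give you. The record format and addresses below are invented for illustration, not Kentik's actual data model.

```python
# A minimal sketch of flow-level analysis: given raw flow records
# (5-tuple plus byte counts), find the conversations between instances.
# The record format and addresses here are invented for illustration.
from collections import defaultdict

flows = [
    # (src_ip, dst_ip, src_port, dst_port, protocol, bytes)
    ("10.0.1.5", "10.0.2.9", 44312, 443, "TCP", 18_200),
    ("10.0.1.5", "10.0.2.9", 44320, 443, "TCP", 2_400),
    ("10.0.1.5", "8.8.8.8", 51000, 53, "UDP", 120),
]

# Aggregate bytes per (src, dst) pair -- the "who talks to whom" view.
totals = defaultdict(int)
for src, dst, _sport, _dport, _proto, nbytes in flows:
    totals[(src, dst)] += nbytes

for (src, dst), nbytes in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{src} -> {dst}: {nbytes} bytes")
```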

Yeah. You know, the way I've said it many times on this podcast and in other places is that I'm sitting here at my computer, and the vast majority of the apps on my computer are not on my computer. It's a web interface, or maybe some kind of local application that's connecting over the Internet to the actual application. And so an application's slow, or it's not working properly, or the data is not properly populating in the fields.

Very often, it's a network problem. And if it's not a network problem, then maybe it's the interface between the network and the application, or something along those lines. Now, yes, sometimes it's the application. Right? We can't just give them a free pass here.

A lot of the time it is. Everybody points at the network, though, Phil.

Come on. I know. Yeah.

Well, I mean, I'm just trying to be realistic that if you have a front end in AWS, a back end in Azure, and you're doing some sort of AI training in a neocloud, which we're gonna talk about today, and then you have your own flaky, old-school wireless modem-router thing two floors away from you... I mean, the network is critical to an application's performance and your experience with it.

And that's just what it is today. So I would say that most people are living in this hybrid, multi-cloud, SaaS world whether they realize it or not. You know, when was the last time you installed Office products locally? Well, we use Google Apps, or whatever it's called now, Workspace, whatever it is. But certainly, even simple, mundane productivity tools are not on my computer for the most part, other than some developer tools and things like that. And even that is kind of migrating away as well.

No. You're absolutely right.

Yeah. I agree with you, and I think that's a good segue into where we're going now. There's a cost-benefit analysis that people do with running stuff on premises, whether it's your campus or your branch office. Do I run my productivity tools on my folks' laptops and desktop computers, or can we run it in Microsoft 365 or in Google Workspace? So you have a cost-benefit analysis.

But it's the same with the applications, front end, back end, serious stuff like our storage environments, our high-performance computing environments, all of that. Do we also run that across campus in our on-prem data center? Maybe the answer is yes, but increasingly it's no, and we all know that. That's not news. And what I wanted to talk about today is how that relates to this concept. The concept isn't new, but specifically in the context of AI training, there's this new term, neocloud.

And, you know, if you're like me and you've got your Twitter feeds and your LinkedIn feeds and maybe RSS feeds still going, you've probably seen that term. I'm speaking to the audience, Cliffe. I know you have. And if you haven't and it's a new term, great. I'm hoping that you learn something today.

Let's unpack that. What is a neocloud?

What is it for?

Why did it develop? Why do they exist? Who are the notable players? Those are the kinds of things I wanna unpack now. So let's start with your definition, Cliffe.

Yeah. I mean, neo comes from the Latin for... no, I'm just kidding. I don't have that much education, honestly. Yeah.

Yeah. No, I mean, it really is just new cloud.

And largely, the way that I think of it, these are really just purpose-built clouds. In this particular case, purpose-built for AI workloads.

And so that certainly does mean GPUs, as you can imagine, as most of our listeners are probably picturing in their heads as well.

And a lot of it's been born out of GPU scarcity, to a large extent, which we can get into. But, yeah, it really is as simple as that: these are purpose-built clouds that are optimized for AI workloads, for training new models, in certain cases for fine-tuning as well, and kind of inference on top of that. And so that's really what neocloud is.

It's a really interesting set of workloads because, again, it's not general purpose like what people now think of as traditional cloud providers.

And that's gonna be the main difference: they are purpose-built, or in some cases purpose-rebuilt. Right? Because some of them are repurposing their crypto farms.

That's true. Yeah.

But in any case, they are different than our traditional clouds, and maybe hyperscalers, in that they aren't general purpose.

Right. It's what you call an effective pivot right there.

Yeah. There you go.

My goodness.

Yeah. So just to level set on why this is important: when we're training an AI model, and right now we're generally talking about large language models. That doesn't have to be the case, but let's stick with large language models.

You're talking about training runs that could take a very long time, where the entire training of a new model, like GPT-5, which just came out recently, might take six months, seven months, something like that.

And in that time, you have thousands, perhaps tens of thousands, perhaps even approaching a hundred thousand GPUs operating in clusters that all have to talk to each other in order to process an enormous amount of data. And, ultimately, they're going through math and backpropagation and all that kind of stuff to come up with a model that's accurate, that's consistent, and everybody says, okay, we're good, we're done training.

The interesting thing is, you remember that quote by John Gage from Sun Microsystems, where he said the network is the computer. Right? So you gotta think of it like this: all of these GPUs have to speak to each other, and they speak to each other in a very specific way, which we're not gonna really get into today, but it's synchronous communication, and they usually speak to each other in these huge flows, not consistent dribbles. That means in order for anything to happen with AI, training or, as you mentioned, inference, it all relies on that underlying network.
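As a concrete illustration of that synchronous pattern, here is a minimal sketch using PyTorch's collective API: every worker blocks on an all-reduce until everyone's data has crossed the network, which is exactly why one slow or lossy link stalls the whole job. It uses the CPU "gloo" backend and four local processes standing in for GPUs; real training would use NCCL over the high-speed fabric.

```python
# A minimal sketch of the synchronous collective pattern described above:
# every worker computes a local "gradient," then all of them block on an
# all-reduce until the sum has crossed the network.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Each worker's local "gradient"; in real training this is a huge tensor.
    grad = torch.full((4,), float(rank))

    # Synchronous step: nobody proceeds until every worker's data has been
    # exchanged -- one slow or lossy link stalls all of them.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    print(f"rank {rank} sees summed gradient {grad.tolist()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)
```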

So I think a lot of the time, when we think of traditional data center networking, except for, like, finance, where there are super-high-performance requirements, and maybe scientific computing, things like that... Yeah.

Generally, especially campus networking, but data center networking too, it's not exactly the same. You sort of start to think, I just need those big, dumb, fat pipes of bandwidth and plenty of them, and we're good. And then we don't even think about QoS because we have so much bandwidth. And we don't think about dropped packets because we have so much bandwidth, and we oversubscribe these links five to one, three to one, whatever. And that's not the case for training an AI model in that kind of a data center. It's a little bit less so when you're talking about inference, but it's similar.

So that's the kind of stuff happening in neoclouds. So let's talk about some of those characteristics. What is it about the infrastructure, the speeds and feeds, the requirements of gear and hardware and architecture in a neocloud that sets them apart? Because you mentioned that AWS is more general purpose. What do you mean by that exactly?

Yeah. I mean, it's really born out of the flexibility that's needed for providing customers with different options for different types of apps that they might be building.

You know, whether you're writing a basic client-server app versus a bunch of back-end processing or asynchronous workflows.

There are very different requirements around the different storage solutions you may wanna use for building your app. And so the major cloud providers, certainly, AWS, Azure, GCP, OCI, have really gone and provided customers with a lot of options, making sure that they have tailored capabilities for these different types of apps, which is really nice. And then in terms of networking, it's fairly consistent.

It's your typical TCP/IP, Ethernet-based communication, at least in terms of how they build out data centers. And that's where it's been fascinating to see this neocloud build-out process. A couple of the top players in this space are customers of ours that we have conversations and interactions with. So it's really neat to be able to peek under the covers a little bit and understand what's going on.

This is not just GPUs, certainly, because that's the compute side of things. You still need storage. You still need networking to ensure that these things can talk, and then you can actually build processing on top of it. And so storage systems that are more optimized for this kind of workload, as well as the network in particular, are very, very different: in the way that routing works, in the expectations around job completion, in hyper-low-latency communications.

So it's really fascinating to see what the requirements are. And, again, this is where, when you're purpose-built, you can focus in on the requirements of your customers much more effectively, which is fantastic, honestly. I think it's a boon to customers in general, giving us options where we can do that cost-benefit analysis that you were talking about earlier.

Yeah. Because when it comes down to it, it is eye-wateringly expensive. That's an understatement.

Yes.

To build your own AI training data center, if you're trying to build some sort of model that's even remotely on par with a foundation model, like a GPT or a Claude, which, you know, is crazy. But even if you're training a model with just a couple hundred thousand parameters, which I know happens in industry and in universities and education, I've worked with folks that do that. They're building their own models, and they're dramatically smaller because they're very, very focused and specific.

It's still very, very expensive. I mean, they're gonna pick up, like, one NVIDIA H100 or something like that, and that's their whole budget. And here, we're talking about building a data center, like a neocloud. Right?

Where there's racks upon racks upon racks of these things to accommodate these very large AI training workloads. So you mentioned a couple things. One is, you mentioned Ethernet, and so I'll just throw out there that I think it's pretty clear that Ethernet won the Ethernet-versus-InfiniBand battle with regard to model training. Yeah.

And you still see InfiniBand out there, and I know there are people that are religiously devoted to it, which is fine. But I think the underlying message there is, why do people care about InfiniBand? Well, lossless connectivity, extremely high-speed connectivity.

And that speaks to the very, very stringent requirements of training an AI model using this kind of GPU architecture. Right?

Yeah.

So you have 400 gig, 800 gig, 1.2 terabit, 1.6 terabit.

I'm seeing specs for over three terabits per second of connectivity, which I don't believe is in production yet. A lot of that stuff I see coming out from, like, Arista and the Ultra Ethernet Consortium. Just insane bandwidth, but also that super-ultra-low-latency, lossless connectivity.

You cannot abide one packet drop.

And then you get a retransmit, and your one millisecond of delay, over months of training, translates to huge amounts of time and money lost. So to optimize a general-purpose data center for that is just crazy. And, Cliffe, I don't know about you, but I don't foresee most organizations building a data center like that. It just makes no sense. Or even trying to say, well, let's just portion off a third of our data center with this number of NVIDIA GPU clusters.

I don't think that's feasible. It's just too expensive, and it requires different networking skills, things like that.
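To put rough numbers on that compounding effect, here is a back-of-the-envelope calculation. Every figure below is invented for illustration; real jobs vary widely.

```python
# Back-of-the-envelope math for the claim above: tiny per-step stalls
# compound over a long training run. All numbers are invented for
# illustration; real jobs vary widely.
steps_per_day = 86_400      # assume roughly one training step per second
training_days = 180         # a six-month run, as discussed above
stall_ms_per_step = 1.0     # one retransmit-induced stall per step
gpu_count = 10_000          # the whole cluster idles while a stall resolves
gpu_hour_cost = 2.0         # hypothetical dollars per GPU-hour

stall_hours = steps_per_day * training_days * stall_ms_per_step / 1000 / 3600
wasted = stall_hours * gpu_count * gpu_hour_cost
print(f"~{stall_hours:.1f} cluster-hours stalled, ~${wasted:,.0f} burned")
# -> ~4.3 cluster-hours stalled, ~$86,400 burned, from one millisecond a step
```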

No, I think, yeah, in many ways it's some of the same impetus: customers doing that analysis themselves of what types of apps they should move to your traditional public cloud. I mean, this is something that we went through ten years ago, and history is kind of repeating itself. But, again, it's more just to teach us that, hey, some of the principles and the approach of doing that analysis are still critical. Right? Just because you have different options now doesn't necessarily mean that you shouldn't follow the same process.

Yeah. Remember just not long ago when everybody started to talk about repatriation and bringing workloads back on prem? I think what happened during that whole conversation out there in the tech world was not necessarily a realization that we were wrong and everything needs to be on prem. I think it was a realization that maybe we should look at this more strategically, workload by workload, app by app, case by case.

But I don't know. I agree with you, Cliffe, but I think it is a little bit more difficult to apply that completely. It's not analogous when we're talking about AI training because the cost is just so much more... Enormous. Yeah.

Specific as far as what you're trying to do. Again, it's not general purpose. Right? So, I mean, of the folks that I talk to, the vast majority of people are network teams, IT teams.

They're interested in AI stuff, but either they're not interested in training a model, they just wanna use some model as a service. Yep. Or if they are gonna train something, it could be a fine-tuning process, or they're gonna run a very large footprint of a local model like Llama, except they don't wanna do it locally because they're gonna run advanced inference.

And those are all very few and far between. I really think that we're talking about, I don't know, four or five percent of enterprises out there that are gonna do something like that.

That's why I think neocloud makes a lot of sense, because now it's like, well, I don't have to build anything, number one.

So the build-versus-buy question is out, and you can scale up, scale down. It's as a service, so you know how that goes.

Yep.

So I think... You are still looking at lead times.

Right? I mean, ultimately, which I think is interesting and why a lot of people are moving away from your traditional cloud providers, AWS, Azure, GCP, OCI. You're looking at neoclouds because they have access to the GPUs that you cannot get access to, in the regions that you need, potentially, in any given time frame.

So that means faster speed to market?

Yep. Yeah. And, you know, the hyperscalers are building their own stuff, and there's demand from AI startups now. So there's a whole new cohort of companies that require, or at least want, whether they require it or not, who knows, but they desire this hardware infrastructure to do what they're doing, their ideas. There are research institutions that are pushing this harder.

Yep.

You know, I remember in my own research over the past four or five years, I did this survey of the history of AI, and I really dug into some of the AI winters.

And it's interesting to see that during a couple of those AI winters, like in the seventies, and there was another one in the mid-nineties... Because of Terminator, wasn't it?

Because of Terminator. No, there was no Terminator. Yep. And so people were like, what are we doing here? This is a waste of time.

Yeah. Yeah. No.

But what happened was, people didn't see the results that they expected. So it sort of died off as a thing, but also funding died off. And so you saw less money and less interest, and therefore just less need. And so right now, I think that demand, from startups, from educational institutions that just wanna learn this stuff and develop models of their own and go down that road, that's what's driving a lot of this. And because of it, like you said, there's GPU scarcity.

And how do we do this? I wanna do something next month, or in a few months, and I can't. You know, the lead time to buy those NVIDIA GPUs is a year or two years.

So I think especially companies that are domain leaders in many cases and have the data, you know, are already in a place where, hey, it's labeled in the way that they need.

And it is more of a strategic advantage that they wanna take advantage of. Like, yeah, you gotta get on that now. Right? I mean, time matters here in terms of competitive pressure and time to market, of what you can really do with a purpose-trained model that is specific to your domain.

Yeah. And so we've talked about some advantages of why neoclouds exist. You don't have to worry about building an AI data center. There's a purpose-built data center that you can use as a service, and it has all of those infrastructure requirements built in: the ultra-low latency, the scheduled fabrics, the extremely high bandwidth. It's all there, all ready to go, so you're good. You're not worried about a general-purpose data center.

But also, thinking about power and cooling, these data centers are built for this. So they're gonna employ that liquid cooling; they're gonna employ maybe a dedicated power generation facility. Whereas if you're going to go down this road, let's say you wanna augment your own data center or spin up a new building on your campus if you're a larger organization, a larger enterprise, you may not be able to do it simply because there's no way for that local municipality to get the power to you. Your local grid is just not gonna support it. So there are other reasons that this just makes a lot of sense if this is the road you're going down, where you're training models and you just can't do it feasibly on your own. Yeah.

Yeah. Absolutely.

Yeah. And I think it brings other components into play, especially when you think about what your strategic direction is from a company perspective: what infrastructure is strategically important for you to have yourself, in comparison to what you should effectively not spend as much time or energy on, talent in particular. Right? Yeah.

Yep. So who are some of the main players? I know, like, one. Who are the main players in this space?

Yeah. I mean, the one you probably know of, certainly, is CoreWeave, which definitely kind of got there first.

But, yeah, we've seen Lambda Labs, Voltage Park, Together AI. There are a number that have really sprouted up and have either come into the space from adjacencies or through a pivot, because they had access to GPUs and good relationships with GPU vendors in particular. Yeah.

So then what would these neoclouds compete on? I mean, if it's all very similar architecture, there's VXLAN, scheduled fabrics, crazy-fast bandwidth. What are they competing on?

Yeah. I mean, in certain cases, certainly, there's a price component to it, in terms of a race to the bottom there.

But largely, I feel we're starting to see some of them launch things like managed services around the fringes of what they're working on. So it's being able to host inference APIs more effectively.

Maybe it's managed services for Kubernetes, or for inference specifically, attached to the GPU infrastructure that you can effectively rent from them.

And so some of those things, I think, are really interesting, because it does start to, if anything, encroach on what your traditional cloud providers obviously have. Right?

So those are at least a few of the things that come to mind.
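To make the inference-API idea concrete, here is a sketch of what renting hosted inference can look like from the application side. Many GPU clouds expose OpenAI-compatible endpoints; the base URL, key, and model name below are hypothetical placeholders, not any specific provider's API.

```python
# A sketch of calling a hosted inference API of the kind described above.
# The endpoint, key, and model name are hypothetical placeholders; the
# OpenAI-compatible wire format itself is widely offered by GPU clouds.
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.example-neocloud.com/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",                                # placeholder credential
)

resp = client.chat.completions.create(
    model="llama-3-70b-instruct",  # whatever model the provider hosts
    messages=[{"role": "user", "content": "Summarize what a neocloud is."}],
)
print(resp.choices[0].message.content)
```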

There's no real direct competition, though. I mean, Amazon has Amazon Bedrock, and, yeah, I know that they have their own AI models, but they're not offering them as a service per se, where you're renting the GPUs to do your own thing. You're using their model as a service.

Right. Other than that, I mean, with some of the traditional cloud providers, you can take a pretrained model, whether it's your model or obviously kind of theirs off the shelf, and you can do additional fine-tuning.

You can host inference APIs there.

So there are definitely opportunities, and you can see that the cloud providers certainly have moved in the space. They're not just sitting there waiting for GPUs to show up.

And I do think there are interesting decisions that customers have to make about when to start to leverage those services. When is it more beneficial to have the inference APIs closer to your application, closer to the benefits that you get from a cloud provider, around CDN and access to your customers and getting compute close to them?

There are still some benefits there where I think there are decisions to be made for individual applications, and they do depend on a bunch of your traditional nonfunctional requirements, if you will.

Yeah. That makes sense. What about... the assumption here is that you're almost certainly in a multi-cloud environment.

And so this app that you mentioned, maybe it lives in AWS, wherever. You're in the US East region, and maybe multiple regions, whatever. You have other instances of things running in other clouds, other vendors.

How does the neocloud fit into that? Because, I mean, we're doing that now in the sense that we have stuff moving around among clouds and back on prem, so it's not like it's a new thing.

No, that's right. Yeah. I mean, even over the last couple of years, certainly, it's been very common for customers to use OpenAI endpoints, or Azure OpenAI, but making those calls from their AWS environment, from their GCP environment.

So we've already been seeing some of this. Now, granted, it's been more of what would be considered SaaS traffic, if you will.

This is a little bit different in that it's IaaS, infrastructure as a service.

And so, again, you have some of that flexibility: do I wanna run my inference API over here? Do I wanna copy my model out into SageMaker, or into Azure Machine Learning, or into Vertex?

You do have a little bit more flexibility there, depending on what your application requirements are. But we definitely have started to see that multi-cloud is not just a given application team choosing to host their stuff here while a different application team or a different business unit has decided to host their app over there.

You're seeing a lot more crosstalk than ever before. And so that has some really fascinating network implications that we are certainly very interested in at Kentik, as you can imagine.

Yeah. I mean, the thing is, with what we do at Kentik, we're dealing with real-time or near-real-time telemetry. Right? And we know how to do that.

We have our data pipelines to accommodate that sort of thing. But if we are serving a model with new, fresh data in near real time, that poses challenges. Now, we do that already. We have an AI component to the platform that is able to answer questions about real-time data in natural language and then do some advanced analysis on it.

That's great. But I'm looking at this as a possible challenge. You know, if I'm training my own custom model, that's why I'm doing this neocloud thing. Right?

And I'm, quote, unquote, renting GPUs, but we know it's more than that. It's the network and the fabric and everything.

That's gotta be a heavy lift, to grab an entire model that I've trained and move it to a different neocloud provider, or to bring it on prem. So, you know, I personally don't believe that people have much of a problem with vendor lock-in, not nearly as much as the blogs would suggest.

And, you know, we're all in with AWS. Okay, is that vendor lock-in? Kind of. But you can move those workloads. Those workloads are not, like, proprietary AWS workloads.

It's not necessarily the same here in neocloud land; it is a different animal. I mean, do you think that is a potential challenge for folks that are training models in CoreWeave or Lambda Labs or something?

I feel like the community around AI and ML tools especially has done a really good job of ensuring that a lot of the vendors in this space, all of them, including OpenAI and Claude and others... I feel like there's been such a focus on making these more ubiquitous, but then also not specifically locking people into a particular model. And even with MCP and A2A and the back and forth there, those things are ultimately more fungible than people would... you know, there are definitely some more, I would say, religious debates there that you could get into, about thou shalt do it this way.

When in reality, to your point, there's not actually as much lock-in as people purport there to be. A lot of it does come down to operational skill set more than anything, because that's really where rubber meets road. It's not a day-one problem. It's day two. And maybe that's more of an AWS-ism, if anything, but it really is about building for what you can support as a team going forward.

That becomes a much more critical element, I think.

And, you know, ops teams are savvy. SRE teams, especially with automation and infrastructure as code and all of that, are able to adapt. But I do think that a lot of it does come down to ops teams being able to reduce operational burden, if anything.

Again, given my time in DevOps communities and all of that, I really, I think, grew to appreciate that a lot more, and the support burden that puts on ops teams.

Yeah. I mean, to be fair, we had whatever kind of engineer, and then cloud came, and then we had cloud engineers. So operations, we talk about them like, oh, we've unburdened them because all your workloads are in the cloud and we manage a lot of that for you. But now we have new titles and new certification pathways and all that kind of thing.

So I do wonder if we're gonna see that here as well. I mean... Yeah. One thing I was thinking of, with regard to staff and operations and the skills that you mentioned, is that we're sort of democratizing access to AI. And when I say AI, I don't mean access to more models that you can play with and vibe code your way into some app.

I mean access to the training environments where you can experiment, if you have the money.

It's a lot easier to do that. You're literally saying, I have this budget, I'm gonna do this for three months or six months and see what comes of it. It's a POC, maybe.

Who knows? But that's kind of new. Prior to that, it was restricted mostly to wealthy industry, not even academia as much, because most universities, not all, there are some with multibillion-dollar endowments, sure, aren't able to do that kind of stuff at the scale that an OpenAI or a Microsoft would. They're kinda the same now, but you know what I mean.

Yeah. So I suppose... yeah, you're totally right.

I mean, MLOps teams, that as a term, even, I think kind of evolved. I'm sure there are ops teams that are still in that vein today, but largely I do feel like some of that is becoming more of just an ops skill set, in the way that operational responsibility does matter for all of these teams. And you just need to know what kind of workloads you're running.

Yeah. As a side note, not really related to this topic, but I don't know if you remember, there were some blog posts and some discussion about a new type of engineer just called an operations engineer.

And it was that idea of understanding workflows and not necessarily being, like, a double CCIE or a VCDX. Does anybody do VMware anymore? Whatever the soup du jour is.

Yeah. But the idea is somebody who's really a specialist in understanding workflows and pipelines. They understand the basics of data engineering. They understand the basics of storage and networking, and they're able to be that operations person. So... Yeah.

Maybe that's more needed now than ever.

That's definitely my sense. I mean, I see it more from the communication side. Again, the more customers that I talk to across different teams... you know, software is a sociotechnical problem. Right? I mean, it's not fully technical, and it's not just communication; it's the combination of the two.

And I'm a firm believer in that. Yep. Anything we can do to effectively make it such that, in our case, network data can be communicated more effectively to more teams, so they can make better sense of it, then we're doing our job right there.

So let's talk about some of the future trends that you think of. Where do you see things going in the near term, in the short term? And I specifically mean with this idea of neoclouds: how people use them, maybe those regulatory bodies that start to get involved. They always do after something gets popular. So what do you think?

It's gotta be the case. Yeah. I mean, it wouldn't surprise me if, again, it follows the same sort of trends, where your traditional cloud environments eventually got to the place where data sovereignty and compliance and all of that came in.

You know, those boxes need to be checked, and they are very important. I really don't mean to brush that under the rug or anything like that. Yeah. Right.

But, yeah, I think that's definitely going to be the case, especially given the interest in neocloud in financial services environments, in media companies, and others. So I definitely think that's one, maybe the more obvious one.

I think the prevalence of multi-cloud, and this is maybe not specific to neocloud again, but neocloud is pushing it even more so, because, again, this is effectively another cloud environment that you need to think about now, from an operational standpoint: how it communicates, who has responsibility over it.

You know, going back to what I said specifically about communications, we interact with a lot of companies where, to your point, it's like, here's the cloud team, and then here's the network team otherwise. And it's like, why are those not one? Why are they not connected? Why can they not communicate with each other? Right. And so we're gonna see that uptick, I think, in the same way with neocloud.

So it'll be interesting to see how things shift on the business model side, because at some point, even with traditional compute and EC2 instances way back in the day, it was very much a race to the bottom, and scale was really the most important characteristic. I'm not sure for neocloud if that's exactly what is ultimately gonna be the driver here.

I do feel like, at some point, there are not enough Microsofts and OpenAIs and whatever to basically require that level, that number of GPUs, potentially.

And so the market is very different in comparison to traditional cloud computing.

Yeah. I feel like neoclouds are to AI today kinda like what AWS was to web applications in, what was that, two thousand five, two thousand six? Right? Something like that. So it's that transition and shift, and then we'll see a normalization.

I also wonder if we're gonna see a consolidation of vendors.

As you say, there's a race to the bottom, and vendors do what they can to cut costs, and some of them just aren't feasible financially. So perhaps from a business perspective, and also based on the high capital cost of building and maintaining these data centers. And I did read just recently that there are some data center companies, and I don't mean data center companies like an Equinix, I mean AI training facilities.

Mhmm. And they're becoming geographically dispersed, in that they can't get the power and cooling in one specific building, or the building isn't big enough. Yeah. And so you're talking about running clusters in multiple geographic locations, and then, of course, requiring the network to do that so it's successful.

I wonder if we're gonna see that happening with the future of neocloud, in that we're running certain workloads here, but perhaps other AI training workloads over there, for whatever reasons. And maybe it sounds silly right now, but who knows? Right? I mean, we didn't anticipate having five different clouds, and my front end is here, back end is there, storage is here, you know.

And so... We could just reserve the moon or something for neocloud data centers, and then everything else?

No.

Well, I mean, they've been talking about... when I say they, that's the ambiguous they. Who are we talking about?

But folks out there on social media and online and all that, I've seen those comments about data centers in space, you know, floating, orbiting data centers... There we go.

And also on the moon and things like that. Because the latency between the Earth and the moon is fine. That's no problem.

No problem. Exactly. InfiniBand. Right? Solves all your problems. Isn't that the way it works out?

Yeah. Well, anyway, Cliffe, I think we're gonna end there just for the sake of time. Thanks so much for joining me today. This was a really interesting conversation. You know that I've been really into learning about AI and experimenting with it. I will not be renting any GPUs anytime soon in any of the neoclouds, but it is so interesting to see the entire industry shift in this way.

And what I'm personally most excited about is just the availability to train models, as long as you have the cash. Right? The availability out there to train models without having to wait for three years of environmental studies before you build your data center that you can't afford anyway. You know? So that's what I'm personally most interested in and excited about.

Yeah. That's awesome. Totally agree.

Great. Well, listen, I'm looking forward to having you on again soon to talk about clouds and cloudy things, because I know that's what you're an SME in, of course.

So for our audience, if you have an idea for an episode, or you'd like to be a guest on Telemetry Now, or if you have a comment about today's show, I'd love to hear from you. You can reach out to us at telemetrynow@kentik.com. So for now, thanks so much for listening. Bye-bye.

About Telemetry Now

Do you dread forgetting to use the “add” command on a trunk port? Do you grit your teeth when the coffee maker isn't working, and everyone says, “It’s the network’s fault?” Do you like to blame DNS for everything because you know deep down, in the bottom of your heart, it probably is DNS? Well, you're in the right place! Telemetry Now is the podcast for you! Tune in and let the packets wash over you as host Phil Gervasi and his expert guests talk networking, network engineering and related careers, emerging technologies, and more.