Telemetry Now  |  Season 2 - Episode 59  |  October 9, 2025

From Data to Decisions: How AI Turns Network Noise into Clarity

Kentik CPO Mav Turner joins host Philip Gervasi to cut through the AI hype in NetOps. They discuss where ML and LLMs actually help—anomaly detection, root cause analysis, and agent-driven runbooks—and where deterministic methods still win. Join us for real talk on data pipelines, telemetry quality, model evaluation, human-in-the-loop guardrails, and the build-vs-buy trade-offs that transform network noise into informed decisions.

Transcript

In this episode of Telemetry Now, Mav Turner joins us to talk about the reality of applying AI to network operations. So no smoke and mirrors, just real talk about what AI can do, how we need to think about AI for our own operations, and also what building an AI solution really entails. I'm Philip Gervasi, and this is Telemetry Now.

Mav, thanks so much for joining today's episode of Telemetry Now. First of all, I love having nerds on the program. So no offense, but I do count you among this inner circle of nerds, of course. But also, having folks talk about the reality of what's going on with AI in the industry right now is really important to me, because I really don't like the whole smoke and mirrors thing that a lot of folks do, and you're not one of those folks. I'm really glad to be talking to you today. Before we get started with today's episode, though, would you just take a moment, introduce yourself to our audience, let them know who you are?

Sure. Sounds great. And I appreciate that I still have my nerd card. Glad that was not revoked over the years.

But yeah. So, Mav Turner, I'm the chief product officer at Kentik. I've been here only a couple months now, but it feels very comfortable given what I've been doing in the industry for over twenty-five years. I started in IT, everything from frontline help desk to system administration to data center and security management, mostly on the campus side versus the service provider side.

Did that for several years, then went to the channel with a Cisco reseller partner doing large data center migrations and a lot of IP telephony projects, that was kind of the thing at that time, plus a lot of wireless deployments because that was up and coming. And I have a computer science degree.

Then I went to work for SolarWinds, the network management, IT management company. I was there for almost fifteen years. I was part of the spin-out that we did with N-able, which is monitoring and management solutions for managed service providers. And usually these MSPs are managing SMBs, very small businesses.

So those are a little bit different technical challenges. Right? Whether you're a small business, a medium enterprise, or a very large enterprise, some of the core problems are the same, but the scale makes a big difference. And ultimately, that's what landed me here at Kentik: we've got amazing scale, and it was built for that from day one based on its service provider roots. So that's why I got excited to come here and play with some of this technology.

Yeah. Yeah. I will agree with you that when you look at our portfolio of customers at Kentik, like global customers and global companies and web scale, it's pretty neat.

But I will agree with you. I worked in SMB as well, like, years and years ago when I was first starting out, also at an MSP, and I loved it. Yeah. Because, like, every day was a new, like, oh, figure it out. We have no money. And you had to be, like, a pure, I don't know if this is the right way to say it, but like a pure engineer. You know?

Yeah.

Like, you know the command line, and you have, like, the Cisco textbook. Figure it out. And we didn't really have YouTube and Google quite as much in those days.

But I enjoyed it. So that's kind of where my background is and my heart is, just figuring out problems and meeting people where they really are. Again, like I said earlier, no smoke and mirrors. So I wanted to get into that today and basically address this idea of using artificial intelligence, AI, in network operations. That's the main theme of today's episode.

But, like, the reality. First of all, starting off with why. What are the use cases? Why do we even wanna go here when I can just run some really good cron jobs and runbooks or, you know, some Perl scripts or Python scripts? Why do I need to go this route, considering some of the heavy lifting, which we're gonna talk about, that needs to be done to make this work?

Yeah. It's it's a great question because to be honest, we could spend this entire podcast talking about AI or talking about NetOps individually as topics. To bring them together, there's so much ground that we could cover. What I will say is, a, I'm surprised you mentioned Perl, because I don't know how much new Perl is being written, but I used to use it back in the day a lot because I was a big fan.

I think Python is kind of the, you know, language of the day for network automation in a lot of cases. But I just like your reference to Perl because it's a fun and funky language. But yeah, AI and NetOps. To your point, there's all the buzz around, oh, this or that. And sometimes the answer is not AI.

This should be very clear. There are deterministic use cases where using AI actually makes things harder, more expensive, less reliable. So how you bring these things together is very important. So when I think about AI and NetOps, I think about a couple of categories.

So again, back to the long conversation about AI. When you think about traditional AI, you think about machine learning algorithms, right? How do you handle massive scale? How do you do anomaly detection?

Image detection, obviously, was a huge thing in the early 2000s. And then more recently, in the last couple of years, language models, these more natural language interfaces, what do they allow us to do, and how is AI continually evolving? And when you start to piece those apart, I would say that the industry overall has been taking advantage of AI techniques such as machine learning for a while, not all vendors, but vendors that have been smart, because, again, of the scale of data. How do I sift through all of this and find the signal in all of this noise? And how can I use these techniques to make my job easier versus looking at just screens and screens of log files, these gigabytes or now terabytes of log files, trying to find the one entry that actually matters?

Right?

Hopefully you're not having to pull out sed and awk too often to manually parse all these log files. You can use some anomaly detection to help you better understand things. I think that was one of the first places in network operations where we started to see AI make a real big impact. And then obviously, now, there are a lot of opportunities with large language models. But before we get into that, I'm sure you've got some good war stories on AI, machine learning, and anomaly detection in NetOps that might be worth chatting about.
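A minimal sketch of the rolling-baseline anomaly detection described above, in Python. The window size, threshold, and sample data are illustrative assumptions, not anyone's production settings:

```python
from collections import deque
from statistics import mean, stdev

def detect_anomalies(samples, window=60, threshold=3.0):
    """Yield (index, value) pairs where a sample deviates from the
    trailing window's mean by more than `threshold` standard deviations."""
    history = deque(maxlen=window)
    for i, value in enumerate(samples):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
        history.append(value)

# Hypothetical bits-per-second samples with one obvious spike.
bps = [1000 + (i % 7) for i in range(120)]
bps[90] = 50000
print(list(detect_anomalies(bps)))  # -> [(90, 50000)]
```

The point of the exchange above is exactly this: at terabytes of telemetry, you want something like this running continuously in the pipeline rather than an engineer eyeballing log files.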

War stories?

Well, from doing it manually, before having good tools to do that, maybe. Yeah. Yeah.

Well, you know, the thing is that before we started talking about LLMs, right, I saw the value in MLOps right away in the networking space, and some other vendors did as well. I mean, I know that there were companies that were applying various models to network data in order to do some sort of prediction. And you remember back in, like, two thousand three, everybody was talking about big data and Hadoop and things like that, and the talk of the day was predictive analytics. And so applying, like, clustering algorithms and regression models and things like that to data in order to do something useful, find some sort of insight, some sort of conclusion, so then you can send out a tech, you know, a human being, to actually make a change or whatever it happens to be.

And I always saw so much value in that. And then little by little, we started to see, like, the rise of network automation. You started to hear the buzzword of SDN, software defined networking, and intent based networking.

I always thought to myself, there's so much value, though, in ingesting the data, cleaning it up, and then applying a model to it to understand what's going on.

And so that's actually been my background for the past seven years, six years, something like that, from before any LLM. And so I do want to talk about that today, like the data pipelines, the more, I guess you can call it, traditional AI. Right? Which is basically, you know, the application of MLOps, for lack of a better term.

Because I think there's a tremendous amount of value there. But you know what? You can disagree with me here, Mav.

Sure. I actually think, from a, you know, what's-going-on-in-the-industry perspective, right? I think that large language models and the craze that we have over LLMs, by the way, I think it's awesome.

It's not like I'm trying to denigrate LLMs. It's just so interesting and so cool what they can do, and so useful. But I think that it has sparked a resurgence of interest in those more classical forms of AI. So, like, you and I talking about regression and clustering and DBSCAN and these kinds of models, they've been around since, like, what, the late seventies? And people have applied them in various industries for various reasons over the past three or four decades.

And now there seems to be a resurgence of interest in those things as a result of the explosion of interest in LLMs, because it's all part of the same world, right?
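As an aside, here is a toy illustration of the classical techniques being referenced, using DBSCAN from scikit-learn to separate ordinary flows from outliers. The two features (bytes, packets) and the DBSCAN parameters are assumptions made for the example:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 200 synthetic "normal" flows around 1500 bytes / 10 packets,
# plus two wildly different flows.
normal = rng.normal(loc=[1500, 10], scale=[200, 2], size=(200, 2))
weird = np.array([[90000.0, 400.0], [85000.0, 380.0]])
X = StandardScaler().fit_transform(np.vstack([normal, weird]))

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("outlier flow indices:", np.where(labels == -1)[0])  # likely 200, 201
```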

Yeah. So when people talk about AI, I like to say, well, that's such a broad area. And you started to go there earlier, and maybe we can come back to it: the data engineering challenge, because that's where they really overlap, right?

Oh, yeah. In order to be able to use MLOps, there's a massive data engineering pipeline you have to build. And that's true if you're training your own models, or you're trying to act on real data beyond a very simple POC, which LLMs, by the way, are amazing at. You're like, wow, look what I did with half a day of messing around with my POC.

But then scaling that to production is very hard, and it's usually because of the data problem. Right? And that's one of the things. There's the whole unstructured versus structured data issue, and the amount of data in any enterprise that is actually structured and usable is very little.

But taking advantage of MLOps, or new large language models, or AI in any business application comes down to being able to connect those things. And, you know, look, generative AI is really great at going through unstructured data and giving you some information, but it's still early days, I think, on being able to properly sort that and give us those learnings. And I think really the goal there is to use that as part of your pipeline, to say, look at all this stuff and try to get some signal, try to get some structure out of it, so now I can actually act at scale on this information. And to me, that's one of the things, not to toot our own horn too much at Kentik, but that's really, when I talk to people and they say, hey, what does Kentik do? Or why is it valuable?

Or doesn't it all go away now that we've got AI and you don't need anything? What we're doing is representing not only the current state of the network, but the past state. A lot of large language models can go inspect a device, right? You can point an LLM and say, hey, go SSH into this box and do something, which, by the way, I don't recommend without proper guardrails.

But the history that device has, and the network context it has, is going to be very limited. So if you're not collecting and storing that data for these large language models to go over, you're not really going to be able to achieve AI in NetOps like you talked about earlier. Right? There are some small use cases, but you're not going to fundamentally change the way you operate unless you have that structured data, that knowledge of the infrastructure and how that data works, and then enable AI to access and use that data. Because that AI layer is changing so rapidly now, right? It's the fastest-changing thing that we've seen in this industry ever, frankly.

And it needs these structured data sources and engagement points in order to be impactful.

Yeah, I mean, I do see value in the unstructured data too, of course, in the sense that large language models give us an opportunity to do things like sentiment analysis at scale, which is something we've done for a long time anyway within NLP, but also analyzing tech calls, transcripts, and tickets. So there's a lot there, but you're right. As a former network engineer, right, and a lot of the folks that listen to this podcast are either engineers or in that realm, it comes down to hard metrics. So how do you use an LLM to interrogate hard metrics and then make decisions?

Well, you sort of can't. You can to an extent, but you sort of can't. Maybe there's a text-to-SQL thing that you can kinda do for your query engine, but it's bigger than that. And so I like the way you put it, that it's a bigger workflow.

So now that we're talking about the application of AI in network operations, maybe even IT operations more broadly, we gotta think about the bigger workflow. The large language model is a piece, almost like a modular piece. And for a lot of workflows, it's very modular in the sense that you can swap out LLMs, you know, maybe one is more accurate, or one has been updated and now it's hosing everything, whatever it happens to be. And so it's really all of it, especially if we get into, like, agent systems where the LLM is your interpreter, you know, of the intent.

Oh, I have a whole thing I wanna talk about with you.

Yeah.

Yeah, I was just waiting for when we came to that.

Yeah. The whole intent-based networking thing, like, I'm starting to wonder if it's finally coming to fruition, because it kinda fizzled out. You know? Yeah.

Because how do you encapsulate intent?

It's very hard. Yeah. But in any case, I like how you put it, that there's an entire workflow there. And the entire workflow, as sophisticated as it may be, as technologically progressive as you are as a company, it all comes down to your data pipelines. Yep. And that's one of the things that I've been preaching about as far as what Kentik does and what Kentik offers in the marketplace. It's kinda data pipeline as a service, even though we don't put ourselves out there as that.

We do not.

I do kinda, yeah. You know, maybe I'll get in trouble for saying that. But I do kinda think that, like, you know, I was just talking to somebody earlier today about this.

You know, we have these network engineering teams out there. We have data engineering teams as well, and they don't necessarily work with each other. They've always been on very, very different sides of the cubicle wall. But now, in two thousand twenty five going into twenty twenty six, you have network engineers that are dealing with an enormous amount of data, a diversity of data, in order to figure out this whole application delivery thing, and they almost have to be data engineers, but they're not.

Right.

So how do we do this for them? I mean, one solution, Mav, I'm sorry if this is, like, totally shooting Kentik in the foot. One solution is that you hire a whole bunch of postgrads from, like, MIT or something to do this for you manually, and you hire teams of data engineers and incorporate them into your network operations team. That's ridiculous. So that's why I like the data-pipeline-as-a-service idea: we ingest all the information, streaming telemetry, SNMP, flow, metadata, all that qualitative and quantitative information.

And we do the data cleaning, we do the pipelines, we build the Kafka bus for you so you don't have to figure out streaming in real time. And then we can feed models, or we can provide the insight internally. So I think that's going to be a real need going forward. Maybe not just in networking, actually.

Really, any industry that deals with real-time data, any manufacturing for that matter. It's just a matter of, okay, now that we have all this data, how do I make use of it? You know?
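A skeletal version of that streaming-ingest step, assuming the kafka-python client; the broker address, topic name, and record shape are all hypothetical:

```python
import json
from kafka import KafkaConsumer

def process(sample: dict) -> None:
    # Placeholder for the rest of the pipeline: normalize, enrich
    # with metadata, feed models and alerting.
    print(sample.get("device"), sample.get("bps"))

consumer = KafkaConsumer(
    "network-telemetry",                   # hypothetical topic
    bootstrap_servers=["localhost:9092"],  # hypothetical broker
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for record in consumer:
    process(record.value)  # e.g. {"device": "edge-1", "bps": 123456}
```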

Yeah. And I think the whole, you know, data-rich, insight-poor concept that's been floating around for a while is even more relevant now than ever. And you're right, it's not just our industry, for sure.

And I think that's where you'll see the industry going overall. Whatever type of telemetry you're collecting, what are you doing with it? How do you act on it? How do you receive it, store it, normalize it, make it accessible, bring insights from it? It doesn't matter whether it's network or, like you said, factory-level information. I need to be able to pull all these signals, synthesize them, get insights, and then be able to act on them.

To me, that's why, when we think about the future, our business future, but also the industry as it evolves, there is a need more than ever for these types of platforms to exist. And I think the key is how you engage with these different AI technologies, I'll use that word, how you enable those use cases to exist, and how you infuse that domain language. Because another approach, a little bit like your grad school analogy, is you can just dump the raw data right into, like, an S3 bucket, push everything in there, dump it into storage, and then sort it out. And some people would say, hey, AI is really good at figuring things out, or, you know, LLMs, just store everything and then say, hey, tell me what's in there.

And again, I think you can actually do a decent POC that way, and you'll be like, hey, look, it found some things. But when I'm trying to build my operations practice and make it repeatable, and I have really low tolerance for outages or latency or any of these other business impacts, that's when you need that reliability. And this goes back to my earlier statement about the mix and the balance: when can you use non-deterministic systems, not versus, but with deterministic processes and systems, in order to achieve the best of both worlds. And I think that, unfortunately, if you take the other approach of just collecting all the data, dumping it somewhere, pointing some large language models at it, and hoping to get some insights, that's just not going to scale.

And that's unfortunately going to put a lot of blinders on you.

And, you know, that's that's where I think we've spent a lot of time trying to be really thoughtful about how we build our platform.

The other cool thing I think is important for anybody building or using AI is a common design pattern in AI systems: support for a router, basically. And I don't mean a network router. I mean a model router. Because these models are changing, and different models are applicable for different use cases.

So being able to quickly change to whichever model is the right one is super important. And it's a small change at the beginning when you architect a solution, but it's a lot harder to go back later and change. Being able to test performance between these models side by side every time a new model drops, and really being able to use the right tool for the job, is a super important topic.

Again, that's a whole other topic that we could spend a lot of time on.
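A bare-bones sketch of that model-router pattern: one interface in front of several models, so swapping or re-routing is a configuration change rather than a rewrite. The provider callables and the routing rule are purely illustrative:

```python
from typing import Callable, Dict

class ModelRouter:
    def __init__(self) -> None:
        self._models: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, call: Callable[[str], str]) -> None:
        self._models[name] = call

    def ask(self, task: str, prompt: str) -> str:
        # Trivial rule for the example; a real router would weigh cost,
        # latency, and per-task evaluation scores.
        name = "summarizer" if task == "summarize" else "reasoner"
        return self._models[name](prompt)

router = ModelRouter()
router.register("summarizer", lambda p: f"[cheap model] {p[:40]}...")
router.register("reasoner", lambda p: f"[strong model] analyzing: {p}")
print(router.ask("summarize", "Interface errors spiked on edge-7 at 02:00"))
```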

Yeah. It is. I just finished, well, not just finished. It was over the summer. We are recording in early October, for the audience's sake.

But over the summer, I read a technical book on model evaluation. It was specifically about LLMs, but I did read a book prior to that on model evaluation in general and how that's done, both mathematically and qualitatively. And it is big. You're right.

You know, that's gonna be a big part of this because, like you've said, like, nine times already, LLMs in particular are not deterministic. They're probabilistic models. Can you expand on that a little bit? What does that mean?

Why is that a problem?

Yeah. That's great. Great point. So when you think about a deterministic model, think about a function.

Right? I'm gonna put in two numbers. It has two inputs, x and y, and it does an operation like addition. Two and two, yeah, the outcome is four.

When you look at how that is being calculated, it's repeatable, and, some really crazy astrophysics event aside, it's going to return the same result all the time. Right? So you can then build on top of that. When you're using a probabilistic system, you're actually not able to guarantee the response.

It's a range of probabilities. Right? So it goes from that binary concept, is it one or zero, is it on or off, to a range of probabilities. Now, again, in a lot of cases the probability will be so high, it'll feel deterministic. Like, oh, every time I ask it, it gives me the same thing.

Actually, I think one of the best things that ChatGPT did for the industry is help introduce this concept to people, because most people at this point have used ChatGPT, have asked it something, have gotten a weird response, have asked it to respond again or told it it's wrong, and been surprised about the response and the interaction. And that's a really hands-on example of what it means when you have a probabilistic response versus a deterministic response. So those of you who took statistics in high school or college are able to pull some of that out and have fun with some of that math. But yeah, that's how I like to describe it, the super simple way. You know, I don't know if you have any good tricks for explaining that to people.
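A tiny illustration of the contrast: the function below returns the same output for the same inputs every time, while the sampler draws from a probability distribution, so repeated calls can differ. The vocabulary and weights are made up:

```python
import random

def add(x: int, y: int) -> int:
    return x + y  # deterministic: 2 + 2 is always 4

def next_word() -> str:
    words = ["up", "down", "flapping"]
    weights = [0.6, 0.3, 0.1]  # an invented distribution over next tokens
    return random.choices(words, weights=weights, k=1)[0]

print(add(2, 2), add(2, 2))             # 4 4, on every run
print([next_word() for _ in range(5)])  # varies from run to run
```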

Do I have any good tricks? I mean, like you said, there's literally a probability coefficient. Like, what is the most probable next word given the prompt that the user gave? Right? So you're exactly right.

That's why, bringing it to NetOps, you have one network engineer throw a prompt at a thing, you know, ChatGPT or Claude, whatever it is. You get a response, and then you do it again, and you get a slightly different response. That's not so bad with, like, just generating paragraphs and social media content and essays and poetry for whatever.

But it is a problem when it comes to network operations, when we need hard metrics. When we need, what is the latency? How many packets am I dropping on this link?

What is the cost to take this link over that link? And you start to build that out. So how can we mitigate that? I mean, I already know one of your answers is gonna be, well, the large language model is part of the overall system, so we can start looking at which model is better, and you have fewer points of latency and fewer points of failure when you have it as a modular system.

I'm sorry if I stole your thunder.

But in any case, how how do we mitigate that?

It's always gonna be probabilistic at some point, especially as we get into agent systems where a large language model interprets the user request. How do we make it so it's less probabilistic and more deterministic?

Yeah. And before answering your direct question, I'll give an example. Think of a probabilistic model as asking your peer. Right?

You're sitting next to somebody, like, hey, I've got this problem. What do you think? And they give you an answer. Oh, that sounds right.

But you're like, I'm gonna check. I'm gonna make sure that works. Versus if I pull out my calculator and put in a number, I expect it to give me the exact, deterministic response. Like you mentioned with the poetry, you know, it is like talking to somebody.

But then we start to string together different systems, different layers of complexity and non-determinism. Now you're kind of in the traditional situation, I think we've all been there: I'm the network team, I say this is the problem, the application team says this, the database team says this, the storage team says this. You've got all these different people making their best guesses, based on their ability to guess the next character, troubleshooting based on their experience, trying to come together. And I think the promise of a lot of this technology is, okay, that may happen, and it may feel a little chaotic when you think about it. But because of the information and the data they have access to in these overall systems, they should be able to quickly get to a proper root cause analysis when you have these ecosystems of agent-to-agent communications going on, each bringing in its expertise.

And this is something that, you know, we're actually working on with ServiceNow, I'm sure you're aware of this: how do we make that network data available in a ticket workflow? Right? A customer or user has an outage, they ask, it interrogates all the tools in the stack, gives them back an answer, and that reduces the mean time to resolution.

And what we can do there as we move forward is super exciting.

How do we take the risk out of a probabilistic system, or how do we manage it properly? It is a system, and when we think about it, there are a couple of things that we build around this. Right? So instead of the LLM alone providing the full answer, use that LLM as one piece. You know, one of the first things Kentik did with large language models was our natural language interface. Just make it simple to ask a question and get data back.

Great. That's baseline. But when you want to actually get to that next level, and you want the LLM to reason, it needs to have layers of information and tools it's pulling in. And you have logic around it, outside of the LLM itself.

Right? You need to be able to come back and ask recursive questions and be able to iterate. And this is where you'll see another example, where a lot of people have agents interrogating other agents, you know, have Gemini write a prompt for Anthropic or something like that, to try to get to these results. And so, creating those logic conditions where you say, oh, okay.

This person is troubleshooting a BGP peering issue. Great. They ask a question about it. You need to be able to bring in the right contextual information, look at that, and then have that interrogate additional systems or sources of data before coming back.

And that's what we've been spending a lot of time doing: the tools that the AI has access to. When you think about the different data sources in Kentik, the different telemetry, the metadata, the anomaly detection and root cause detection, the alerting, all these different systems and data sources, we're teaching it how to interrogate those properly. Some of that is prompt engineering, right, that's super important. You add that system prompt into it.

The other part is making sure we bound those questions, direct them, intercept specific types of questions and guide them down other paths, to ensure that we're giving good recommendations. And then, you know, I do personally believe, and I think most folks at Kentik do, that we're still very much in a person-in-the-loop model, where we want to provide as much information and make it as easy as possible, but we're not trying to get to full autonomy, you know, autonomous AGI where we can all sit on the beach and do no work because the robots are doing everything.

We still want to review, right? Just like we do with a pull request, you'd have somebody else take a look at it before committing and finishing that process. So, anyway, sorry, I rambled on for a little bit there, but hopefully that gave some additional color.
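A compressed sketch of the loop Mav is describing: the model proposes a structured tool call, deterministic code around it executes the call and feeds results back, and any state-changing action stops for human approval. The tool names and the llm_propose_step stand-in are hypothetical, not Kentik's implementation:

```python
def llm_propose_step(context: list) -> dict:
    # Stand-in for a real model call that returns a structured action.
    return {"tool": "get_interface_stats", "args": {"device": "edge-1"}}

TOOLS = {
    "get_interface_stats": lambda device: {"errors": 120, "utilization": 0.93},
    "apply_config": lambda device, snippet: f"applied {snippet!r} to {device}",
}
STATE_CHANGING = {"apply_config"}  # actions that require a person to approve

def run_agent(question: str, max_steps: int = 3) -> list:
    context = [question]
    for _ in range(max_steps):
        step = llm_propose_step(context)
        if step["tool"] in STATE_CHANGING:
            # Human-in-the-loop gate, analogous to a pull-request review.
            if input(f"Run {step['tool']}? [y/N] ").strip().lower() != "y":
                context.append("operator declined; stopping")
                break
        context.append((step["tool"], TOOLS[step["tool"]](**step["args"])))
    return context

print(run_agent("Why is edge-1 dropping packets?"))
```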

Oh, yeah, absolutely. I mean, you basically touched on a few key areas. You talked about the concept of RAG and limiting the large language model to a specific external database, in this case maybe flow data or a customer's streaming data, whatever it happens to be. And that puts bounds on it, so that it gives answers within the context of that database.
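A skeletal version of that RAG pattern: retrieve the most relevant records from a bounded corpus and pack them into the prompt, so the model answers from that data rather than its general training. The embed function here is a toy stand-in; real systems use an embedding model and a vector store:

```python
import math

def embed(text: str) -> list:
    # Toy "embedding": character-frequency vector, purely illustrative.
    return [text.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

corpus = [
    "edge-7 interface ge-0/0/1 shows rising CRC errors",
    "BGP session to AS64512 flapped three times overnight",
    "DNS resolver latency is normal across all sites",
]

def build_prompt(question: str, k: int = 2) -> str:
    qv = embed(question)
    ranked = sorted(corpus, key=lambda d: cosine(qv, embed(d)), reverse=True)
    context = "\n".join(ranked[:k])
    return f"Answer using only this context:\n{context}\n\nQ: {question}"

print(build_prompt("why is BGP unstable?"))
```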

So that's one thing. You talked about model evaluation by using one model to evaluate the results of another model, which is a really popular approach. You talked about human-in-the-loop, which is another thing that we still do and will continue to do for quite a while, I'm sure. Whether that's, you know, pure RLHF and human feedback or some other mechanism. You know, one of the things that we do is have engineers using these tools, both in beta and in production now, looking at those answers and providing that feedback, and that's very valuable.

I mean, that's qualitative for sure, but that helps us to understand how models are actually responding to a query, to a prompt.

And I always appreciate making it broader and talking about how it is an entire system, an entire ecosystem of technology, of modular code components that work together to produce this. Especially when we're talking about NetOps, when it has to be in real time, you're talking about a data pipeline where the gap from ingest to when the data is available to be queried might be on the order of milliseconds. So you need a very advanced, sophisticated data pipeline to get there. Again, going back to why I think that's a big part of the value Kentik brings to the marketplace right now.

Now you mentioned root cause analysis.

We can start with that, but my question is, what are the actual use cases for network operations that we're trying to solve? Because honestly, I've had enough eye rolls from people about putting an LLM wrapper on your platform. Oh, I got a little chat bot. I got an advanced Clippy. That cute POC, we're done with that. So what are the actual use cases that we can offer NetOps, and what are we doing at Kentik right now?

Yeah, absolutely. I think one of the things that we've been super excited about, and I would argue it's actually exceeded our expectations, is the reasoning ability. Given the proper system prompt, given the proper data, given the proper tools, our solution can actually reason and recommend. And we've had plenty of examples where it can say, oh, hey, based on all this data, I think I know the problem. If you gave me a config snippet, I could tell you exactly what to change.

Right? We've been able to get to that level. Again, we're not trying to just let it loose on the infrastructure to go make the change directly. Yeah.

But the ability for us to, another way we describe it is, take this kind of book-smart LLM that's learned all these things across all these training data sources, and then give it the real-world experience. And that's a lot of what we've done, and that allows you to solve real-world problems: what is this alert? Why is it generating? Why should I change it?

What's really the problem? We had a great example just yesterday. I think it was one of our internal teams creating a runbook to say, hey, whenever this problem happens, I wanna go look at some of the physical characteristics of my transceivers to see, is this failing? Right? Is it operating outside of its normal bounds?

Right? Here are the troubleshooting steps I would take as a human. We can capture that in a runbook, and when it sees those patterns, it goes and operates on that.
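A minimal sketch of capturing those human troubleshooting steps as a runbook that fires on a matching event pattern. The structure and field names are invented; the point is that the steps are written once and replayed consistently:

```python
RUNBOOKS = [
    {
        "name": "optic-health-check",
        "trigger": lambda event: event.get("type") == "link_errors",
        "steps": [
            "pull transceiver DOM stats (tx/rx power, temperature)",
            "compare readings against vendor thresholds",
            "check interface error counters over the last 24 hours",
            "open a ticket with findings if anything is out of bounds",
        ],
    },
]

def on_event(event: dict) -> None:
    for rb in RUNBOOKS:
        if rb["trigger"](event):
            print(f"runbook {rb['name']} triggered for {event.get('device')}:")
            for step in rb["steps"]:
                print("  -", step)  # a real system would execute tools here

on_event({"type": "link_errors", "device": "edge-1", "port": "et-0/0/3"})
```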

And that allows us to really tune and tweak and put that real-world experience into the customer's hands so that they don't have to go recreate it themselves. But of course they can if they want to. But yeah, that reasoning component is, I think, really the key to the difference we're seeing here and the impact our solution is having: the ability to say, this is why this is happening, because of this data source. And the other part that we've been really critical about is making sure that we are citing all of our data.

So whenever you're interacting with our solution, you'll see the reasoning process, right? It's like, here are the steps that I'm going to take. Now I'm gonna go pull data on the interface utilization of this device for the last thirty days to see if there's a history of this interface flapping, or whatever it is, right? And then you can see the chart of that data inline.

And you get this long, you know, thought process, kind of like a paper that a person has written out to describe their troubleshooting, and then recommendations. Right? We recommend that you allocate more bandwidth, or there might be a cable problem between these two sites because during the hot part of the day you start to see some higher errors and discards, or there's a peer flapping issue over here, maybe you should change peer providers, we recommend you peer with this AS in order to have better traffic resiliency or lower latency.

So those recommendations and that rationale and that reasoning, and demonstrating all of that, is something that I think is huge, huge value. You know, I'm kind of surprised at how far it got, how fast, given the power-ups we gave it with our context and our tools and all of that.

Yeah. Yeah.

I mean, it really is about giving it the appropriate tools and feeding it the appropriate data Yep.

That's gonna be the key. That's really interesting. So, I mean, really, what you're speaking about, because you did mention we're not gonna give it the ability to go out and make changes and everything yet. Maybe that's a "just yet."

I don't know. I don't wanna talk about roadmap today. But certainly, what you're talking about is the idea of data analysis. This idea that we ingest all of this data, and then we have an AI workflow with these multiple components, and we've discussed some of those, that ultimately can programmatically analyze the data and come up with some conclusion for you as a network engineer.

I presume, to make the job of solving tickets and fixing the network quicker and faster, and perhaps even more insightful, like identifying correlation where it's very difficult to do that unless you have a whole team of people. Right? Something like that. So there's a big operational aspect. Is that the main reason that we're doing this, just to improve the operations part of networking? Because you sort of hinted at a couple of other things there.

Yeah. So the way we think about the scope of this is, yes, operations. That's everybody's, like, I've got a burning fire, I need to put the fires out quickly. Great.

Yes, we wanna do tickets.

Absolutely. Yes. We wanna stop the stream of tickets. But like anything, the goal is to move to that next level up the hierarchy, right?

I want to be able to get to a better design, right? I want to spend more of my time, especially as a more senior architect, redesigning my network to minimize these problems. And so when we think about what we can do with our network intelligence capabilities, it's, you know, design, operate, and secure the infrastructure, protect the infrastructure, all of these different things. There's a lot of detail underneath those.

But we've talked a lot about the operations side. On the design side, again, it's about having a good sense of the infrastructure, whether that's the internal network infrastructure or the BGP view of the world, leveraging all the synthetics and the ability to test latency across the internet, and then being able to take that data and ask the network intelligence capabilities to give you recommendations on how to reconfigure your network. That to me is the ultimate. When I talk to customers that are using it for that purpose, those are the mature customers, and they're able to have such a massive impact on their business because those are the proactive moves, right?

That's what prevents the outage from occurring, versus the reactive, okay, an outage occurred, we have to respond as quickly as possible. I want to do as much as I can to prevent that. And that's where understanding the relationships, like I said, having this BGP perspective of the infrastructure along with all the flow data, and knowing where traffic is trying to go, if you're a service provider, where the subscribers are trying to reach, whether that's mostly YouTube or video game streaming services, and how do I reconfigure my network to make sure there's low latency and high bandwidth availability for those subscribers to those endpoints.

And now I can think about my business differently from a planning perspective versus just the reactive side.

Yeah.

So, you know, I wanna understand how this works now.

So from an architectural perspective, because we could talk about LLMs, and I know that a GPT or a Claude or a Llama, they've been trained on all the stuff that's out there on the Internet. So I get it. They've read all the Juniper and Cisco textbooks. They've read my blog, I assume.

I don't know. I assume. Right? Anything on the Internet. They've read, you know, Reddit and Wikipedia.

So they know, like, if I throw something in there, it'll give me, like, oh, yeah, you should try a shut, no shut on the interface. And lo and behold, it works. But that's because it's a language framework.

It doesn't really know my network. It doesn't know my circuit IDs, and it doesn't know any of that stuff. It's just reciting stuff that it learned in its training data. And it sounds great, and often it is great, so I'm not gonna try to denigrate it.

It's it is it is great.

How do we do that at Kentik, where we, in our architecture, are making all of the responses and the entire workflow based on the customer's data, so that it's as relevant and real as possible for what they're trying to do?

Yeah. Absolutely. This goes back to my original point about how LLMs are awesome at quick POCs, right? You ask it to do something and it's like, oh, this is amazing.

And you could say, hey, when this event happens, write a script to go in and automatically shut down the port. And here's my Python script or Perl script that will go and take this action. That's awesome. But the downside of having read everything is it could be wrong, right?

There could be changes that it's not aware of. It could be old data, it could be just wrong, or the wrong vendor, right? Like, okay, make sure you identify which vendor first, and maybe even which device class. You know, I could say, how do I do something on a Cisco device? Even that's broad, because even though they've worked over the years to consolidate their operating systems, there's still a wide range of wrong answers that you'll get.

I grew up on Catalyst and then had to make the migration to IOS.

And obviously, hopefully there are not too many Catalyst devices still floating around. I'm sure there are some core switches, some Catalyst 6500s, somewhere. And this is why we go back to that AI system, right? We think about what all the pieces are.

So when we're doing this, we're doing it while understanding what an interface means, right? The concept of a network interface in networking, the LLMs don't, quote unquote, know that. They can recite a definition, you can ask it what it is, you can ask it how to reset it. But it has no real concept of what it is and its relationships, in a way that would allow it to be effective, without being given more context.

And that's what we've done, right? So when we think about the architecture of our system, as I mentioned earlier, we have our AI gateway, or AI router, that can flip between any of the different providers and the different models within those providers, which allows us to do model evaluation very often.

And then we have our agent. And that agent has the ability to interact with other agents, right, through an MCP server, agent-to-agent communication. So we can integrate, or just use a standard API, right, you can hit our AI agent via API and create different workflows there if you haven't gotten far enough yet. And then it has access to different systems, as well as our RAG, which is mainly for our knowledge base and our documentation.

Right? So when we think about what information it should have, back to the earlier point, if you were to ask ChatGPT a question about Kentik, you may get an article from two years ago that's outdated. If you ask our AI advisor, it's going to be something that we are constantly streaming and constantly updating and refreshing. And that's important.

But that's, that's the knowledge base article stuff. That's great. The bigger thing are the tools. And you mentioned this earlier, I was hoping we'd be able to come back around to it.

So yes, all of our network telemetry, our metadata, our inventory information and the understanding of what that inventory really means and the relationships between those things, anomaly detection, root cause detection, forecasting, alerting, all these different systems are what our agent has data access to for that specific customer. Right? So this is not a generic, how should I solve this problem, here's how one would do that.

This is your specific device. Here's how the routing changed in your environment. Here's this alert that occurred. Here's what you should specifically do.

And then we have other tools. Today, it's fairly straightforward. You do something, and we might go hit DNS or whois, right, as part of the response, in order to give you some names as opposed to just a bunch of IP addresses.
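That enrichment step is simple enough to show concretely: a reverse-DNS lookup using only the Python standard library. The addresses are examples, and results depend on PTR records existing:

```python
import socket

def reverse_dns(ip: str) -> str:
    try:
        return socket.gethostbyaddr(ip)[0]  # PTR lookup
    except OSError:
        return ip  # fall back to the raw address if there is no PTR record

for ip in ["8.8.8.8", "1.1.1.1"]:
    print(ip, "->", reverse_dns(ip))
```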

Then there's the framework to reach out and engage with changes, either on Kentik itself, so, hey, I noticed that you've got a lot of alerts from this device. It looks like the threshold needs to be reset, or there are a lot of false positives. Do you want me to silence this alert for you, or change this alert, or add an exclusion for this alert in the future? Yeah, that sounds good. Right? Back to the human-in-the-loop point.

So it can act on Kentik. Right? Do you want me to create a dashboard for this? These are things that are very near for us, right around the corner.

So our agent acting on Kentik is super important. And then beyond that, ideally the interaction model is agent-to-agent or MCP, so that if you want to make a change to your network, I'd prefer that change to be an agent-to-agent communication versus a command line. You know, we can spit out the command line and tell you, here's what you should do. We could integrate with third-party tools to make those changes.

But ideally, the scalable way to do it eventually will be more agent to agent communication within the ecosystem.

But again, today we already spit out snippets: this is the thing that you should change, right? Interface duplex mismatch, here are the things to change on this one, right? We're able to do that today. That's already pretty straightforward.

So it's kind of a decision point, right? When do you start to open things up, and when do you stay limited so that we have, you know, a limited blast radius? Or, you know, simple things like, whenever a config change occurs, do I wanna reset the router if I notice different characteristics around it, and go back to the previous image if I notice performance degradation after pushing out a security patch, for example? Right? All those different examples are things to look forward to, hopefully in the very near future.

There's a lot of domain knowledge there, right, to make those decisions. And that's hard to embed in math or in a workflow.

So I get it. And that's where we're heading. But but I mean, you know, that's not to say that there's a tremendous amount of value in a programmatic root cause analysis workflow. I mean, that alone is huge.

I mean, I remember being an engineer trying to figure stuff out, crawling the network, you know, pinging around, doing show commands, trying to figure stuff out. What am I doing? I'm manually clue-chaining for, like, three days. And I was usually with VARs, by the way, Mav, so I wasn't on a network that I was super familiar with. So I was clue-chaining, you know, emailing my customer and all different people trying to figure stuff out.

And eventually, you know, eventually it worked, by the way. Like, I found the problem. Yeah. That's what engineers do.

Yeah. We did that. But it took days and a lot of heartache to get there. And so I think there's a lot of value, even if we're not pushing config and getting into a more autonomous sort of environment, an autonomously behaving environment.

There's a lot of value in identifying the cause, in looking out and getting all of that visibility information, pulling that in, doing some sort of analysis on it, and then providing a conclusion. That's huge.

Then being able to go to the next step and provide an easy way to interface with that data, perhaps push config using natural language, I think is really great. I mean, it's one of the things that I've talked about early on, so maybe like eighteen months ago early on, right?

Eighteen months ago is early on.

That's a weird thing to say, but that's how fast things are.

I was always a proponent of, among other things, the technical stuff aside, I love the idea of being able to democratize information and access to all of this and say, look, alright, you're not a CCIE. You don't know how to configure, like, you know, your VXLAN overlay. Fine.

But you don't need to. You're a level-one or level-two, you know, NOC engineer, and you're looking at this data center and you're trying to figure something out. If you have the ability to use natural language and it generates the query for you, maybe it's SQL, or it's a pandas function or something like that, whatever it happens to be, it gets you there without all of that other knowledge. So it makes it more accessible.

Now, that was cool at the time, and it's still cool. But if we bring it to the next level and incorporate that same ability into this more advanced workflow, you can use natural language to kick off very advanced runbooks, playbooks of whatever you're trying to do.
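A toy version of that natural-language-to-query idea over telemetry in pandas. A real system would have an LLM emit the query; here a hard-coded pattern stands in for that step, just to show the shape of the workflow:

```python
import pandas as pd

df = pd.DataFrame({
    "site": ["chicago", "chicago", "austin"],
    "link": ["cust-a", "cust-b", "cust-c"],
    "latency_ms": [42.0, 95.5, 12.3],
})

def nl_query(question: str) -> pd.DataFrame:
    # Stand-in for an LLM translating the question into a filter.
    q = question.lower()
    if "latency" in q and "chicago" in q:
        return df[(df.site == "chicago") & (df.latency_ms > 50)]
    return df

print(nl_query("Which Chicago links have latency above threshold?"))
```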

Granted, it's not quite autonomous networking yet, and we can debate how far we're going to get and how soon. Right?

But that's still a huge advancement where I can say, based on the conclusions that you had, we have this information that's coming in, this analysis that you've made, Mr. AI.

Adjust my network such that latency is below this threshold in this area, in this office in Chicago, right? And it makes those changes. It generates the runbook for me. Maybe it doesn't push the actual changes, but that's huge. That's huge.

And so I see that as a great advancement in and of itself. And I think that's actually probably where we are in the near term right now. I don't know if you disagree with me or not.

Yeah. No, I agree. I mean, when you think about autonomous, my definition of autonomous is that's when the human is out of the loop. Just to make sure we're talking about the same thing. Right?

Yeah. Fully autonomous. You're saying the human's not in the loop. Okay.

But what percentage of that can we get to autonomously, you know, today?

And that's where... The pieces? I mean, the little pieces. Isn't DHCP completely autonomous? When was the last time, like, you physically, you know, manually assigned an IP address?

Your IP address. No. Right.

So there are certain components that are autonomous. Right? You know, there are security systems that'll shut down an interface when there's an identified security breach. Right?

Shut down the interface. Yep. So there are certain things we can do. I mean, that's not AI, but... or maybe it is.

I don't know.

Yeah. Well, it depends on which vendor you're talking to. But yeah. So I want to be careful, because I also don't want to imply there's not an autonomous component of network automation; that is certainly a key part of this. But what I'm trying to differentiate from is, like, we're going to sit on the beach and, you know, everything is just automatically running and there are no problems and everything is solved.

Like, that's the dream of full autonomy. But we are definitely pushing that bar farther and farther to the right when we think about what can be done autonomously. So when I think about what you mentioned earlier, the workflow for troubleshooting an issue in the data center and the steps that you have to take, it's massive. Right?

Whereas today, what we try to tell people is, you can simply go into Kentik and just ask it, what's the problem, or what should I do? And it gives you an answer. Right? And that replaces days of crawling around and emailing and trying to find people. And, unfortunately, I say unfortunately because I feel bad for the customers in this environment, we've talked to customers who have struggled with this. Like, hey, we've got this outage or we've got this massive service issue, and it took us days and days of finding the right people and trying to track it down manually, back when they weren't customers. And now they use Kentik and they're able to solve that faster. And now, with how we've added AI, they're able to just ask it and get an answer. And that is powerful.

Right? So whether that's drastically reducing that mean time to resolution, or automating some of the specifics of my environment with runbooks and being able to say, hey, when this happens, go do these five things. And that's something that Kentik has always been really big on, the openness and extensibility of the platform. Like, I want to be able to give customers runbooks for common scenarios, and I want the AI to know what to do and just give these insights to customers without them having to ask.

But I know we're not gonna cover everything, and there are gonna be some environmental situations where customers are gonna say, hey, when this happens, I want you to do these five things, and I want that to be super simple. Right? Just write it out in text, just like you would an AI prompt. Write it once, and whenever that happens, you'll get the response you expect. Hopefully, you share that with the community, and, you know, we can get some goodness there.

But being able to make that simple, and being able to bring in your custom network context and define, you know, these are my IP address ranges, these are my naming schemes.

Like, yes, we can kind of infer some of those things by looking, but being able to override and provide some of that context out of the box in an easy way just makes that whole process so much simpler and faster. And so one of the things, when I show customers how to add their custom network context in there, they'll ask that same question again that they had asked a couple hours before, and they see, with their specific context, the difference it makes in the response. It's fun.

Yeah. Yeah. That probably makes, like, a huge difference. Yeah. And that makes it relevant and useful.

But for sure, I think that, when it comes down to it, getting away from all of the hype, the smoke and mirrors, and talking real about autonomous networks and what we can do with them today, I mean, there's a lot of value, and we're heading in a particular direction.

And it requires work. It requires a lot of data engineering. It requires a lot of forethought. I really think it raises the age-old question of build versus buy again, you know. Do you want to build an entire data pipeline? Maybe you do. I don't know.

But there's a cost to that, both in money and resources and time and grief.

So there are a lot of components here. Now that we're getting into the real-life applications of this stuff more and more, I think we're seeing the fruits of it, right? We're seeing the benefit, and we're seeing the cost. And so I think the whole build versus buy discussion, we're gonna see that more and more. In my opinion, we're gonna see it more and more in IT operations, and probably other industries, but I'm not as, you know, up on what's going on in those other industries.

So I'm looking forward I'm looking forward to what's coming down the pike in the next few months, next few years, especially.

Yeah. One more point on the build versus buy thing, if I may. Sorry. Okay. Because I've seen this in a couple of other industries, and this is why I kept repeating the whole point about, hey, LLMs demo very well.

Everybody gets very excited. Hey, we're gonna get rid of this vendor or that vendor. We're just gonna do it ourselves. And pretty much every time that a customer has done that, they have come back and said, oh, we realized all of our AI engineers are building something that we can just get off the shelf.

And so, what is your core business function? Where do you need to create innovation? And where can you rely on the industry experts to partner with? And that's how I view this for us: we're gonna come and we're gonna help if we can, as opposed to you having to roll your own.

There are always gonna be people doing it themselves. There's a lot of value in that, frankly. It creates a great community in a lot of areas. But honestly, there's so much to keep up with.

And there's probably a better use for your internal AI developers than to do something that other vendors are on top of. And speaking for Kentik, I think we are very on top of this, and I think it makes a difference. It's gonna cost you a lot more to roll your own, plus the maintenance of it. And so, you know, if you wanna talk about it, I'm always happy to talk with customers about that.

But I've seen this game play out many, many times over the years, even before AI. I think it's actually worse for customers with AI, given the critical resource scarcity.

Oh, yeah. Yeah. You mentioned what they can do better with their AI engineers. And I'm thinking to myself, yeah, that kind of presupposes that they have AI engineers.

I mean, you're not gonna see that as common, you know, and especially if you're talking about network teams, we are talking about technologies and workflows that are not common among most network teams. I'm sure there are always exceptions, and we're always gonna put those up on a pedestal and say, look, this is how we all should be.

But that's not really the case, especially when you get into enterprise organizations. You're not gonna see that as much. And so, certainly, you can start to build it. But I do know from experience, even prior to AI, just in general with POCs, you get those never-ending POCs that just keep going and then don't go anywhere, because, number one, there's no understanding of the amount of effort you need to make this thing successful.

I mean, that's a misalignment. A lot of folks don't realize that. And then, number two, there's no understanding of what the real business value of applying an AI initiative to network operations is. That's a thing that I think is lost on many people right now.

It's like, alright. I gotta use AI in twenty twenty six. And then my contention will be like, okay. Cool.

Why? Like, why? What are we trying to solve? We haven't figured that out yet. Yeah. But we need AI next year.

Okay. Cool. And then, again, you end up with your never-ending POC because you don't really understand what you're trying to solve. And is it reducing the number of tickets?

Is it trying to make your network better as far as uptime, or reduce whatever it happens to be? Yeah. All of those things are important questions. And, you know, if you don't have that in your NetOps team, that's a hard sell.

You're never gonna solve anything. You're just gonna circle around. Yep.

I mean, it's fun. It's fun. I got an old Dell PowerEdge server right here to my right, and I build stuff all the time and then tear it down and build other things. You know, that's fun.

But that's not necessarily what you're doing in your...

It could be your hobby, but I don't know if that needs to be your production environment.

Absolutely. So, Mav, thank you so much for joining today. I really appreciate it. I love talking about this stuff.

So whenever you want to come on and talk about AI, you can get as deep into the weeds as you want. And I love talking about the business case for AI as well, very much. So I'd love to talk to you about that too, especially as you continue to work with more customers and go down this road. So thank you.

Thank you again for joining. Appreciate it.

Thanks for having me, Phil.

And that wraps up another episode of Telemetry Now. If you have a comment or question about today's episode, I would love to hear from you. You can reach out to us at telemetrynow@kentik.com. So for now, thanks so much for listening. Bye bye.

About Telemetry Now

Tired of network issues and finger-pointing? Do you know deep down that, yes, it probably is DNS? Well, you're in the right place. Telemetry Now is the podcast that cuts through the noise. Join host Phil Gervasi and his expert guests as they demystify network intelligence, observability, and AIOps. We dive into emerging technologies, analyze the latest trends in IT operations, and talk shop about the engineering careers that make it all happen. Get ready to level up your understanding and let the packets wash over you.