
Streaming Telemetry and SNMP Monitoring with Kentik NMS

Welcome to Video Bytes today from the Packet Pushers. Our guest today is Chris O'Brien. Chris is a product manager at Kentik, and you guys have announced a new NMS, Chris.
Yeah. It's 2024. It's time for a new NMS. People have been complaining about SNMP for decades. SNMP was invented in 1988. And I keep hearing this phrase: SNMP is dead. Right? But the reality is that something like 99.9 percent of network monitoring, at least of metrics, is still through SNMP. I think the future is streaming telemetry. With streaming telemetry you can collect data much, much more quickly and in more detail, rather than poll. So there's a lot of value in switching to streaming telemetry, but almost no one is doing it. And fundamentally, I think the problem is that the tools to ingest and interact with streaming telemetry are designed around a cutover: turn SNMP off, turn streaming telemetry on. And most of the networks that I see are mixed networks. Some of their gear supports streaming telemetry, and some of it doesn't. It's just SNMP only. So that doesn't work. So we're trying to figure out ways that you can get those benefits of streaming telemetry in a mixed environment.
Alright. Well, then, show me some streaming telemetry. We're gonna have some SNMP support, but also streaming telemetry. I wanna see the streaming telemetry part, as it's exciting.
Yeah. Let's start with that. As an example, imagine you plan to increase load on a device, so you wanna check the bits of hardware and make sure you have enough headroom. Here's CPU. This is actually from SNMP polling at once per minute, which is about as fast as you can drive SNMP without it overloading the device.
It's aggressive.
Yeah. But even with this, we see one data point per minute, and we see a spike to about nine percent. So maybe I can increase the load by almost tenfold on this device, and it would be okay. Yeah.
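The risk of sizing headroom from coarse data can be sketched in a few lines of Python. This is only an illustration on synthetic numbers, not data from the demo: it assumes a CPU series sampled every two seconds and a once-per-minute view that reports the average over each minute (real devices and OIDs may report CPU differently). The function name `headroom_multiple` is made up for this sketch.

```python
def headroom_multiple(peak_pct):
    """Roughly how many times the observed peak fits into 100 percent."""
    return 100.0 / peak_pct

# Synthetic CPU series: one sample every 2 seconds for 2 minutes.
# Mostly idle at 2 percent, with a 10-second burst to 29 percent.
fine = [2.0] * 60
for i in range(20, 25):
    fine[i] = 29.0

# Assumed coarse view: one value per minute, averaged over 30 fine samples.
coarse = [sum(fine[i:i + 30]) / 30 for i in range(0, len(fine), 30)]

fine_peak = max(fine)      # the burst is visible
coarse_peak = max(coarse)  # the burst is averaged away

print(f"coarse peak {coarse_peak}% -> about {headroom_multiple(coarse_peak):.1f}x headroom")
print(f"fine peak   {fine_peak}% -> about {headroom_multiple(fine_peak):.1f}x headroom")
```

The same burst yields a much smaller headroom estimate once the short peak is visible, which is the kind of change in conclusion discussed in this part of the conversation.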
You think you'd have lots of headroom.
Yeah. Lots of headroom. So if we look at this data instead from streaming telemetry, streaming telemetry can get down to, like, every one or two seconds. And it's a very different picture here. I've got very low usage most of the time, but bumping up to twenty-nine percent. So now I'm thinking, if that twenty-nine percent task is common and I need to support it and I can't smooth that load out for the device, now I can only increase this by something like threefold. It's a very different conclusion.
So on this tab we're looking at, Chris, we're not polling. This is data flowing into the Metrics Explorer from the switch.
That's right. Our system has subscribed via streaming telemetry to this data point, and the device is sending us a new data point every two seconds.
Okay. So rather than the once-per-minute polling I was getting with SNMP, now I'm getting data every two seconds coming in from the device. And so my granularity is way better. Now I know I'm pushing almost thirty percent during that burst, and I have to change my capacity planning thinking accordingly. Okay. This is great. This is what I was excited about: being able to see just how granular we can get. Well, what else can I explore in this, Chris?
Yeah. So, one of the other most common problems with SNMP that we've been dealing with for decades is fake spikes and fake troughs. That's caused by the timestamps. Not to get too nerdy, but when we're calculating windows, we don't know which window a data point goes into, because the timestamp for it is applied all the way at the management system. There are lots of buffers and variable delays in between. With streaming telemetry, we get that timestamp at the source, which means we remove a lot of those fake troughs and fake spikes, making the data just more accurate overall.
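The fake spike and fake trough effect can be reproduced with a small Python sketch. It uses three synthetic data points with steady traffic and assumes one of them is delayed by 35 seconds on its way to the collector. The field names `source_ts` and `arrival_ts` and the 60-second window are assumptions for illustration, not anything taken from Kentik's product.

```python
from collections import defaultdict

WINDOW = 60  # seconds per aggregation window (assumed)

def bucket(points, ts_field):
    """Sum bytes per window, choosing the window from the given timestamp field."""
    totals = defaultdict(int)
    for p in points:
        totals[int(p[ts_field] // WINDOW)] += p["bytes"]
    return [totals[w] for w in range(max(totals) + 1)]

# Steady traffic: 6000 bytes reported once per minute at the source.
# The second report is delayed 35 seconds before the collector sees it.
points = [
    {"source_ts": 30, "arrival_ts": 30, "bytes": 6000},
    {"source_ts": 90, "arrival_ts": 125, "bytes": 6000},
    {"source_ts": 150, "arrival_ts": 150, "bytes": 6000},
]

by_arrival = bucket(points, "arrival_ts")  # timestamped at the management system
by_source = bucket(points, "source_ts")    # timestamped at the device

print("collector timestamps:", by_arrival)
print("source timestamps:   ", by_source)
```

With collector-side timestamps, the delayed point falls into the wrong window, leaving one window empty and the next one doubled. With source timestamps, every window shows the same steady value.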
The other thing that more frequent data points do for you is that the latest data point is much, much more recent. So imagine we're in an outage and, after some troubleshooting, we found that the problem seems to revolve around this guy, this Arista. And for whatever reason, our CPU is pegged. Go figure. So I'm gonna put in an attempted fix, press it now. And we'll see, now that we're getting data every two seconds versus every five or ten minutes, that it starts to come down very, very quickly. Like, in five, ten seconds, you can see that in your NMS.
Okay. This is interesting. I've done lots of NMSs in my career, and I'll tell you, the NMS is, like, the last thing I worry about when I'm in the middle of troubleshooting a problem, because it's always behind. If I'm doing SNMP polling, it never knows exactly when I'm working. [Five, ten minutes behind, right?] Yeah. And so, since I didn't have a system like this, which is effectively real time and has told you your fix worked and the CPU spike is gone, I would have to go on the device itself and run show commands or do other sorts of things to let me know that my fix had worked. And the last thing I'd check is, like, oh, yeah, the NMS is all back to green. Right? That kind of thing.
Yeah.
Because it wasn't especially useful for troubleshooting. This is immediately useful.
Yeah. And, you know, the NMS is at scale. So you're doing this across all of your devices that support streaming telemetry and all of the metrics you're getting there. It's like going from spindle to SSD. It's way, way faster. You don't wanna go back after that.
No. I would not.
How do we make this possible in a mixed environment? Because that's what everyone has. Like, everyone is ninety percent devices that only support SNMP. So it really needs to work great in that.
So we treat both SNMP and streaming telemetry as first-class citizens, and all of the data across the system, whether it's SNMP or streaming telemetry, is normalized into a single data model based on OpenConfig. So you can see this dashboard has devices and metrics on it. Some of these devices are being polled with SNMP. Some of this data is being collected through streaming telemetry, but it's all normalized, and your dashboards just look like what you're used to.
That's a big deal, actually. Okay. So you guys are handling the inbound data from streaming as well as SNMP polling and just presenting it to me in a unified way. I don't have to worry about what my source was particularly.
And your alerts work as you would expect.
It's not like my alerts have to know what data is coming in. Right? Well, that's a bigger deal even than people listening might realize, in that there's a lot of data modeling that has to go on for that. The format of the data coming in to you via streaming telemetry is gonna be a different animal than what you're gonna be polling from OIDs in an SNMP MIB.
Yeah. It's also true that there's a mixture of which devices support which, and even within one device, it may not support all of the data that you're used to in streaming telemetry. Some of it can sit in SNMP. So we've designed it so that, even looking at the data for a single device, some of it can come from one and some from the other, and you don't really have to care. And if you drill into one of these bits of data, you can access our Query Builder, and that gives you more direct access to the data model itself. When you apply a normalized data model to all of this data and you're consistent, then you can reliably start asking some really interesting questions. Most people, most network engineers, and most systems are really anchored on looking by device.
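To make the normalization idea concrete, here is a minimal Python sketch of mapping an SNMP varbind and a streaming telemetry update onto one record shape. It is not Kentik's schema or code. The `Metric` class and the `from_snmp` and `from_streaming` helpers are hypothetical. The OIDs are the standard IF-MIB `ifHCInOctets` and `ifHCOutOctets` columns, and the path follows the OpenConfig interfaces counters layout.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    """Hypothetical normalized record shared by both collection methods."""
    device: str
    interface: str
    name: str
    value: int
    source: str

# IF-MIB high-capacity octet counters mapped to OpenConfig-style leaf names.
SNMP_OID_TO_NAME = {
    "1.3.6.1.2.1.31.1.1.1.6": "in-octets",    # ifHCInOctets
    "1.3.6.1.2.1.31.1.1.1.10": "out-octets",  # ifHCOutOctets
}

def from_snmp(device, oid, value, ifindex_to_name):
    """Split off the trailing ifIndex and translate the column OID."""
    base, _, index = oid.rpartition(".")
    return Metric(device, ifindex_to_name[int(index)], SNMP_OID_TO_NAME[base], value, "snmp")

PATH = re.compile(r"^/interfaces/interface\[name=([^\]]+)\]/state/counters/([a-z-]+)$")

def from_streaming(device, path, value):
    """Pull the interface name and counter leaf out of an OpenConfig-style path."""
    match = PATH.match(path)
    if match is None:
        raise ValueError(f"unsupported path: {path}")
    return Metric(device, match.group(1), match.group(2), value, "streaming")

polled = from_snmp("sw1", "1.3.6.1.2.1.31.1.1.1.6.3", 1000, {3: "Ethernet3"})
streamed = from_streaming(
    "sw2", "/interfaces/interface[name=Ethernet3]/state/counters/in-octets", 2000
)
print(polled)
print(streamed)
```

Once both records share the same `name` and `interface` fields, a dashboard or alert can filter on `in-octets` without caring which collection method produced the value.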
But if we wanted to say, okay, don't look at just this one device. I've got all of these different sources of data, and within this BGP prefixes measurement, I'm gonna drop out these metrics. And I'm gonna think about, hey, what's my received prefix count versus how many am I installing and how many am I rejecting? And I don't wanna do that based on device. I wanna do that on something broader, maybe address families. So I'll run that query, and I can see that for my IPv4 prefixes via BGP, I'm receiving four hundred and twenty-one thousand and installing eighty thousand, about a fifth of them, versus IPv6, where I'm receiving two hundred and ninety thousand and installing a little bit over half of them. So it's a very different way to think, a different perspective on your network, when you have all of this data normalized.
So the way that Kentik is gathering the data, you're allowing me to query across the database broadly. And you're right. You made that point about engineers thinking more in device terms, and a lot of times we do. We're guilty of that, Chris, because we lovingly installed that router in the rack and brought it online ourselves, and we care about it. But in fact, it's just one component of a distributed system, and it's the system, the network as a whole, that you care about. So now if I can query the network as a whole and ask it questions and get interesting answers back like what you're illustrating here, that gives us more insight into what's really going on network-wide.
Yeah. We're pretty excited about more detailed data. And, man, in 2024, I don't think we should be polling every five minutes using SNMP from 1988. That's craziness. We should have more detailed, more accurate, fresher data.
Okay, Chris. This is a great illustration of what we can do with Kentik's new NMS. Thank you for pointing out that it's not just another NMS.
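The address-family query described earlier in this exchange amounts to grouping normalized records by a dimension other than device. A rough Python sketch of that idea follows. The rows, device names, and counts are synthetic, and `totals_by` is a hypothetical helper, not the Query Builder's actual interface.

```python
from collections import defaultdict

# Synthetic per-device BGP prefix counts, already normalized to one shape.
rows = [
    {"device": "edge-1", "afi": "ipv4", "received": 1000, "installed": 200, "rejected": 800},
    {"device": "edge-2", "afi": "ipv4", "received": 500, "installed": 100, "rejected": 400},
    {"device": "edge-1", "afi": "ipv6", "received": 300, "installed": 160, "rejected": 140},
    {"device": "edge-2", "afi": "ipv6", "received": 100, "installed": 60, "rejected": 40},
]

def totals_by(rows, dimension):
    """Sum the prefix counters across all rows that share the given dimension."""
    out = defaultdict(lambda: {"received": 0, "installed": 0, "rejected": 0})
    for row in rows:
        group = out[row[dimension]]
        for metric in group:
            group[metric] += row[metric]
    return dict(out)

by_afi = totals_by(rows, "afi")
for afi, counts in by_afi.items():
    share = counts["installed"] / counts["received"]
    print(f"{afi}: received {counts['received']}, installed {counts['installed']} ({share:.0%})")
```

Passing `"device"` instead of `"afi"` gives the familiar per-device view from the same rows, which is only possible because every row uses the same field names.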
Even though you haven't gotten rid of SNMP, because you recognize its importance in network management, we've gone beyond that now. We've got a way to manage and monitor and analyze all of that data in a unified way. So if people wanna find out more about the Kentik NMS, Chris, where do they go?
Yeah. You can check it out at kentik.com or shoot me an email. I'm c o brien at kentik dot com.
Chris, it was great to have you on Video Bytes today with the Packet Pushers.
Thanks, Ethan.

Packet Pushers host Ethan Banks gets an overview of Kentik’s new Network Monitoring System, Kentik NMS, from Chris O’Brien, Sr. Principal Product Manager at Kentik.

Chris demonstrates key features including how Kentik NMS allows network engineers to take full advantage of streaming telemetry. Faster access to network device data using streaming telemetry over SNMP ensures that NetOps professionals don’t miss critical events that happen between traditional polling intervals.
