Chris spent 10 years as a network engineer, starting out by manning the phones in the support department at a wireless ISP, helping customers trace cables and restart their computers. His experience culminated in leading network engineering at a company with 750 locations. Chris has always been obsessed with network monitoring, so when an opportunity popped up to build network monitoring tools, he jumped at it. Since then, he's been working with network and development teams to build monitoring solutions and has been fortunate to release some of the most popular and innovative network monitoring tools in the world.
Phil Gervasi: "SNMP is dead, long live SNMP." Now, I remember a blog post with that title or something like that, maybe seven or eight years ago. And that makes sense to me because that was right around the time, if you remember, that there were some engineers from Google at NANOG 73, that was back in 2018, that had a presentation called "SNMP is dead." And of course then there was a lot of buzz in the network industry around that time that we were all going to replace SNMP, like the whole industry altogether. And the solution was streaming telemetry. And for that specific presentation, it was a gRPC framework and then the associated gNMI, the management interface. But I have to say it's been a few years now and I have not seen everyone in the entire industry ditch SNMP in favor of streaming telemetry. Now it is true, I have seen streaming more and more. And I understand that we're always going to have those use cases and corner cases out there, where there's a network operations team that chooses to stick with SNMP or maybe stick with streaming for whatever reason. But what really happened over the last few years? Why hasn't SNMP really been totally abandoned and then replaced with streaming telemetry by the entire industry, like we all predicted? So maybe it's that it's too heavy of a lift to make the change. Or maybe it's that the benefits of streaming telemetry over SNMP were a little overhyped. And then when engineers got into the weeds, they decided, "You know what? This isn't worth the effort." Now we all know the reality. We all work in IT. We all know the reality is that the answer is, it depends. It's more complicated than that. This is networking and it's probably one of the reasons I really love this field. So in today's Telemetry Now, Chris O'Brien, a product manager at Kentik and a subject matter expert on network monitoring and telemetry, is with us to explore the answer to this question of whether SNMP is really dead or not.
My name is Phil Gervasi and this is Telemetry Now. So Chris, it's good to have you on. You and I have been chatting with regard to SNMP and streaming telemetry and networking in general for a little while. So it's really cool to have you on the show. But before we get into it, I would like to get a little bit more of your background. I know you are a subject matter expert in monitoring and telemetry. I mean, I know what you do for a living, but give me a little bit more about your technical background in engineering and in networking.
Chris O'Brien: So I was a network engineer for about a decade, at maybe six different places of varying sizes and complexity. I spent most of my career on the enterprise side, but I did start in service provider. So about a decade as a network engineer. And after that, about a decade working as a product manager, which basically means building software tools for network engineers. As my career suggests, even when I was a network engineer, I was pretty obsessed with monitoring. So I've been spending a lot, I suppose 20 years, which is the majority of my life, thinking about monitoring at this point, thinking about observability.
Phil Gervasi: And so when you say network engineer, are you talking about a traditional... What I would say, a traditional field engineer turning a physical and a virtual wrench at the command line, racking and stacking, designing networks, troubleshooting why this thing isn't working, that kind of an engineer?
Chris O'Brien: Maybe to add a little bit of precision: yes, all of that. And I started at a wireless ISP.
Phil Gervasi: Oh, nice.
Chris O'Brien: The largest in the country at the time. It later got acquired by AT&T. I worked in their call center and then in their network operations center on the phone. And then I became a network engineer and then a senior network engineer and a lead network engineer. So definitely leaning service provider. It was more running the gamut of running enterprise data centers back when that was much more popular. You would have one at your office or at a local colo. So both the physical side and configuration and then obviously later in my career, more of the architecture.
Phil Gervasi: Right, okay. And I ask you that question on purpose, and it's because I wanted to know if you have that in-the-weeds, field network engineer, network operations, day-to-day keeping-the-lights-on background, as opposed to a purely academic background or theoretical or quote-unquote "thought leader." Oh my goodness, I just said those words, but you know what I mean, right? The difference between somebody who has designed, fixed, operated networks and knows the reality, as opposed to just a purely academic perspective. And that's what I wanted to know, and so I'm glad to hear your answer. So is SNMP dead? That's kind of the topic here, and is it being replaced with streaming telemetry? What's happened over the last few years? But we've got to level set. What is... Really super short because our audience is mostly networking professionals. They're going to be technically minded, so they're going to probably know what SNMP is. But from your perspective, in your definition, in your words, what is SNMP? And why has it been dumped on so much over the last 5, 7, 8, 9 years by thought leaders and maybe even vendors in the network community?
Chris O'Brien: Yeah, so Simple Network Management Protocol. One of the biggest books about SNMP says, and it's amazing, that every single word in the acronym is wrong. It's not simple. It's not for the network only. Management isn't how it turned out to be used. So SNMP was invented a long time ago, and we've all been working with SNMP for a long time, and that means two things. Number one, it's aged a lot as networks around it have changed a lot. So it's becoming a less and less good fit for our current networks. And number two, because SNMP is ubiquitous and network engineers are working with it at scale, we have all experienced a ton of challenges and limitations and problems from SNMP as we apply it to our networks, and we're all too familiar with those. So I think that's a big part of it too.
Phil Gervasi: I'm glad you put it that way, where you said that it has aged and therefore maybe isn't appropriate for where networks are today or how we do networking today. Because there's a subtle point that you made that I kind of disagree with. I don't believe that you can logically conclude from the age of a technology that it's therefore bad, right? Like, BGP is therefore bad because it was invented in the late '60s? No. Or TCP/IP.
Chris O'Brien: '60s? I didn't know that about BGP, wow.
Phil Gervasi: Well, maybe the '70s. I don't know whenever that napkin was written.
Chris O'Brien: I don't know.
Phil Gervasi: Right. Either way, it's old, it's decades old. TCP/IP is decades old. Spanning Tree is decades old, maybe that is a bad technology. But my point is that the age of a technology doesn't by itself make it ineffective or no longer good in our year 2023, 2024 and beyond. However, the way we approach networking today can differ and therefore its usefulness can change as a result. So I don't think it's its age, because let's say for example that we still did networking the same way, then it would be fine. You see, I think that it's the changes in the networking industry over the past 40 years, especially in the last 10 years, that have resulted in us looking at SNMP differently and then asking, "Are there other monitoring solutions, data types, formats, protocols that we should be considering?" Is it that ineffective? I mean, why has the community been dumping on it in your opinion?
Chris O'Brien: I mean age is a proxy for understanding the degree of change that the network has gone through around it, right. And-
Phil Gervasi: All right, that's fair.
Chris O'Brien: Yeah. SNMP was originally invented in the '80s. We've had a lot of changes to our network since that time. And I mean I would almost flip this on its head and say, SNMP was a wild success, with nearly every network in the world depending on it for visibility and really no serious competitors in terms of percentage of use. SNMP has been 90% of it since the '80s. That's incredible. If you were designing a protocol, you could hardly hope for a better result. As far as why people hate it, vendors do implement SNMP in different ways. So there's sort of a constant hassle of figuring out where the data is and making sure the data is collected. This isn't always the case, but theoretically the data that's available from SNMP could change with every vendor, with every make, with every model, with the features that you deploy on that system, with every software change and with every software version. So because we're always building more equipment and more software, we're kind of forever trying to catch up with SNMP. And it's not one tool trying to do this. It's a bunch of different management tools. So dealing with missing data is a common frustration.
Phil Gervasi: So is there anything about SNMP as far as what it gives us in terms of information and metrics and the data that we want to collect from our network? Is there anything that's missing that we could say SNMP is lacking? Or is it that we're not really implementing it to its fullest potential? And I ask that mostly based on a conversation I had with a friend and colleague a few months back on flow data, and how a lot of people look down on flow as well when the reality is they're just not using it to its fullest potential. And there's just so much more we can do with it. Is it the same case with SNMP?
Chris O'Brien: I think it is a little bit different. I think some of the challenges or limitations people run into is the architecture of SNMP. The nature of the protocol is to request a data point. So the manager requests a piece of data from the network infrastructure like, let's call it a router for simplicity's sake. And then the router returns that data point. Well, we're all using this data to build graphs. Which means we're all saying our management station should ask that question of the router once every X minutes, X seconds. And so you get this kind of ridiculous constant asking the same question and then answering with a new value. That's happening forever for all of our devices, which is not a super intelligent way to get a stream of data, the same data.
Phil Gervasi: And that would lend itself to scalability problems. Constantly talking to the device, asking it for information that you probably don't even need. "Oh, your interface is still up?" "Great." "Oh, your interface is still up?" "Great." So there's redundant data, therefore there's a performance penalty possibly, both on the network side and then on the device side itself. And then you're telling me that there are no industry standards device by device, where individual vendors are going to implement SNMP differently on their own platforms. So that's a limitation. Then what is streaming telemetry in comparison? If I'm trying to get that same kind of information, is it really providing me any benefit? Or is it just solving those few problems I listed?
Chris O'Brien: Yeah, well maybe just to close the loop on what you just said. The other limitation with SNMP is, you ask these... Again, it wasn't made to build a graph of data, but that's kind of what we're all doing with those data points. You end up with challenges around your intervals, essentially. You might have some delay. So say you're trying to draw a graph with your rate of something for every five minutes over the last year, right? Then you need one data point for every five minutes. So you often will poll every five minutes, but if the return of that value gets near a boundary between those intervals, you may find that your first interval has no readout and your second interval has two. So then in your first interval you're saying, "Hey, there's no data being sent across this interface," as an example. And then in the second interval you're saying, "There's twice as much data as usual." So in this way, because of this interval problem, SNMP can create the sense that there are spikes and lulls that aren't actually true spikes and lulls. It's also true that because you poll so infrequently, whatever time there is between your polling intervals, you're averaging. In reality what's happening is you're saying, "How much traffic was sent outbound on this interface in total?" Let me get that. That's a counter. You do that again in five minutes and then you calculate the difference. And you say that, over that five-minute period, that was the rate. Well, that's the average rate over that five-minute period. And more and more frequently networks are dealing with microbursts and other spikes that are a whole lot shorter than five minutes. So that five minutes tends to really smooth those out. One-minute polling, which I think is the right sort of default for today's SNMP implementations, still does a lot of smoothing or averaging out of those spikes. So this is a common problem with SNMP, more indirectly because of its architecture, because it doesn't scale.
You can't poll your devices once a second or once every hundred milliseconds, because you're asking that same question over and over. And the nature of that is requesting that router, whose primary job is to move traffic, to sort of interrupt its CPU and ever so briefly spend its CPU time instead on preparing an answer to this question. So it's very inefficient, and the faster you go, the worse it gets in terms of efficiency. And the net result of that is no one's doing SNMP at one-second intervals. So you get this thing where you're averaging out the spikes and the lulls. And you get this thing where you're actually introducing spikes and lulls that don't exist. So that's a lot of loss of accuracy in your monitoring, when the core purpose of what a lot of us are doing with SNMP is drawing these darn graphs. These graphs need to be accurate.
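The counter arithmetic and the interval-boundary effect Chris describes can be illustrated with a small sketch. This is only an illustration under stated assumptions: the traffic profile, the poll timestamps, and the helper names `rates_from_counters` and `readouts_per_bucket` are all hypothetical, and counter wrap is ignored.

```python
from collections import Counter


def rates_from_counters(samples):
    """Turn (timestamp_s, octet_counter) polls into (timestamp_s, bits_per_second).

    Each rate is the AVERAGE over the gap between two consecutive polls.
    Counter wrap and device reboots are ignored in this sketch.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rates.append((t1, (c1 - c0) * 8 / (t1 - t0)))
    return rates


def readouts_per_bucket(arrival_times, bucket_seconds):
    """Count how many readouts land in each fixed-width graph bucket."""
    return Counter(int(t // bucket_seconds) for t in arrival_times)


# Hypothetical traffic: 1 Mbit/s for five minutes, except a 10-second burst at 100 Mbit/s.
per_second_bits = [100_000_000 if 100 <= s < 110 else 1_000_000 for s in range(300)]
total_octets = sum(per_second_bits) // 8

# Two polls five minutes apart only see the first and last counter values.
five_minute_rate = rates_from_counters([(0, 0), (300, total_octets)])[0][1]
print(five_minute_rate)      # average bits per second over the whole window
print(max(per_second_bits))  # the true peak, which the five-minute graph never shows

# Hypothetical jittered response times: two readouts fall in bucket 1, none in bucket 2.
buckets = readouts_per_bucket([305, 598, 905], 300)
print(buckets[1], buckets[2], buckets[3])
```

In this made-up profile the five-minute average comes out far below the burst rate, and the empty bucket is the kind of gap that a naive grapher could draw as a lull followed by a spike.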
Phil Gervasi: Yeah, that's the whole point obviously, especially in a very high transaction environment. In an environment where bandwidth is scarce and when you can't drop packets and therefore no congestion is tolerated. Or very little congestion, which I think is more and more common today. Especially in networks that operate AI workloads and things like that. So does the streaming telemetry... And we can get into the different types, where it came from, how it developed, does it solve those problems?
Chris O'Brien: The biggest difference with streaming telemetry is you are subscribing to a data source, and that's a huge difference. That means you're saying, "Hey, I have this question, just send me updates to that question at this interval." So that is a much more efficient way to do it. The router can schedule in the processing. It doesn't always have to be interrupt driven, so it can batch that. And it also prepares the... So we can timestamp these things at the router, and so it can batch a set of these things and send them over. So this is a much better fit if your goal is to draw these graphs and send the same data point over and over for years. Much, much better way to go about it.
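The subscribe-and-batch idea can be modeled in a few lines. This is a toy simulation, not the gNMI or SNMP API: the `Router` and `Subscription` classes, the 10-second sample interval, and the batch size of 3 are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Toy device with one octet counter. Not a real SNMP or gNMI API."""
    counter: int = 0
    on_demand_requests: int = 0

    def handle_poll(self):
        # Polling model: every request makes the device answer right now.
        self.on_demand_requests += 1
        return self.counter


@dataclass
class Subscription:
    """Hypothetical subscription: sample every N seconds, ship in batches."""
    sample_interval: int
    batch_size: int
    pending: list = field(default_factory=list)

    def on_tick(self, now, router):
        # The device samples on its own schedule and timestamps locally.
        if now % self.sample_interval == 0:
            self.pending.append((now, router.counter))
        # Once enough samples are queued, send them as one message.
        if len(self.pending) >= self.batch_size:
            batch, self.pending = self.pending, []
            return batch
        return None


def simulate(seconds=60):
    router = Router()
    sub = Subscription(sample_interval=10, batch_size=3)
    batches = []
    for now in range(1, seconds + 1):
        router.counter += 1000  # pretend 1000 octets cross the interface each second
        batch = sub.on_tick(now, router)
        if batch is not None:
            batches.append(batch)
    return router, batches


router, batches = simulate()
print(len(batches), router.on_demand_requests)
print(batches[0])
```

The collector asked its question once, when it created the subscription. After that the device produced timestamped samples on its own schedule and never had to service an on-demand request.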
Phil Gervasi: So that would speak to this concept of a push versus pull driven methodology. And then you mentioned a scheduled versus a purely event driven delivery methodology, right?
Chris O'Brien: Yeah. Now, it's scheduled in the time span of seconds here, right? It's not necessarily that you're going to schedule it once every 30 minutes, but imagine you're the CPU. There's a big difference between, "A question came in and I need to respond before that SNMP timer runs out," or ideally the request is to do it immediately. So I'm looking at a processor interrupt versus a scheduler like cron, or even something where you're trying to put that task a few seconds out. That's a lot of new flexibility for the CPU of that router. And then it knows what it needs to query, knowing ahead of time what the query is and that you're going to have to repeat it a thousand times. You can imagine in computer science, that's way easier than having to facilitate any question that comes in, in all of SNMP, at any moment, within a second.
Phil Gervasi: All right. So you're going to get the benefit of a higher degree of granularity. Or higher degree of definition in the data without necessarily the hit on CPU utilization and then ultimately performance of that device, right?
Chris O'Brien: That's right. People are running large chassis with hundreds, thousands of interfaces. And in 2023, when we're recording this, I think polling this data every five or 10 minutes is crazy, crazy slow. So you want to move to a minute. I would suggest as a minimum you want to be at a minute, but you can't poll one of these chassis that fast. Maybe it has a thousand interfaces. A lot of times a single interface will have 15, 20, 25 metric series. So think the state of the interface, whether it's up or down. Well, you need that for administrative state as well as operational state. Then think about traffic alone. You've got in bits per second, a lot of people will do multicast packet count, regular packet count or all packet count, broadcast packet count. And then both of those things you do in both directions. So you multiply all of that out and typically people are monitoring 15 to 25 series of metrics per interface. So you've got a thousand interfaces, you're doing 20,000 data points per interval. So if you do that at a one-minute interval, that's not an insignificant load on that router or that chassis. That's a significant load. And so as soon as people get into faster polling and larger devices in terms of interface count, or any other trigger causing them to monitor a whole bunch of metric series, they start running into scale problems. They start running into challenges. And typically really the only solution is to slow down. And I've talked to a lot of folks who are running at five minutes and they're having to slow down to 10 minutes-
Phil Gervasi: Really?
Chris O'Brien: This is not the right direction. This is not going to work.
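The load arithmetic in this exchange is easy to check. A minimal sketch, assuming the figures from the conversation (a thousand interfaces, roughly 20 metric series each) and hypothetical helper names; real pollers can pack several values into one request, so treat these as data points rather than packets.

```python
def datapoints_per_interval(interfaces, series_per_interface):
    """Total values the device must produce each polling interval."""
    return interfaces * series_per_interface


def datapoints_per_second(interfaces, series_per_interface, interval_seconds):
    """Sustained rate of values the device must produce."""
    return datapoints_per_interval(interfaces, series_per_interface) / interval_seconds


# The chassis from the conversation: 1000 interfaces, about 20 series per interface.
print(datapoints_per_interval(1000, 20))

# Same chassis at 10-minute, 5-minute, 1-minute, and 1-second polling.
for interval in (600, 300, 60, 1):
    print(interval, round(datapoints_per_second(1000, 20, interval), 1))
```

Going from five-minute to one-minute polling multiplies the sustained rate by five, which is why operators who hit the ceiling end up slowing down instead.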
Phil Gervasi: So then clearly streaming is much more scalable than SNMP. I think you've mentioned that just now, which is a major concern, especially as we get into larger networks like hyperscalers, web scale companies, service providers, things like that. And then what about the reliability of the data? Because I know that SNMP is going to send us information over UDP. And so there is a potential to drop some data when it's transiting the network itself. Is that how streaming works as well or is it a different method?
Chris O'Brien: Yeah, streaming uses TCP, amongst other things, to ensure reliability. So definitely more reliable. SNMP uses UDP, so theoretically you could drop packets, and that does happen. I mean if you imagine just how... I just talked about 25,000 requests per minute or per five minutes, whatever your polling interval is. Given that, it's really amazing how reliable SNMP is. It's amazing that you're not constantly having gaps in your data from SNMP. It took the industry some time to figure that out. But I would say little gaps in your charts have been a problem in SNMP for decades. And the problem is not gone. One of the contributors is certainly the fact that the data is delivered unreliably. So streaming telemetry takes a big step forward by simply moving to TCP. This is interesting when it comes to metrics. But it's way more interesting when it comes to events, because events are sent once, they're typically urgent. They're really important information, and if you miss it, you miss it. That's a real problem. So stuff like SNMP traps, we haven't even talked yet about sort of the different use cases with SNMP. I'd say the most common one is that asking the same question over and over and drawing a graph with it. SNMP traps were designed to facilitate that sort of event-based stream of information, which is super valuable, and sort of a different tool in your toolbox to monitor and troubleshoot the network. And then referencing back to that Simple Network Management Protocol, and none of these words are true: the management section, all of the "put" stuff in SNMP, a lot of the protocol is designed around that quote, "management." And what I mean really is writing changes to our network devices, but practically speaking, we don't use that. That's not how we use SNMP. I mean, because you don't have change control, versioning. You don't even have a decent facility to manage when you write the change versus when the change is just in memory.
So you have to be extremely careful and you get mediocre results when you try and use SNMP to push changes. And there's just better tools to do that. So it's almost like half of the protocol, the management half didn't really function how we wanted it to. It's just vestigial, right? It's there, but we don't use it.
Phil Gervasi: And then the specific framework that I remember hearing in the NANOG presentation a few years ago was gRPC, what is it, the Remote Procedure Call framework? Specifically, gNMI, which is the actual Network Management Interface. And they're going to communicate over things like NETCONF, RESTCONF, which a lot of folks are familiar with today. I mean, you're seeing vendors support this. And so does that mean that with streaming telemetry there is therefore less deviation vendor to vendor in how streaming is supported? Because you did mention that one of the weaknesses of SNMP is that it's going to be implemented differently. MIBs are going to be different by device, right? See, the audience can't see that you're smiling. This is not a video podcast, [inaudible]. Right now, Chris is smirking in the camera.
Chris O'Brien: I think unfortunately the opposite is true. So despite the fact that SNMP still varies so much, in SNMP you have this facility of the MIB that defines a lot of the data that's available. That doesn't exist to the same degree and the same availability for end users with streaming telemetry. Further, maybe even the more impactful point is that streaming telemetry is much newer, and where it came from was the Googles and the others of the world that can [inaudible] and apply overwhelming pressure to their vendors for the changes they would like to see done.
Phil Gervasi: But we do have things like the OpenConfig initiative, I'm sorry if that's sacrilegious to say that word. But we do have those initiatives to make that implementation more ubiquitous and industry standard, right? I mean isn't that where we're going with this?
Chris O'Brien: Yeah, that's true. So I think that's where we're going. But I think the fact of the matter is, sadly, despite being years into it, we're still very early on in that journey. And because the initial drive for streaming telemetry did not come from your average enterprise, but your average enterprise is looking to now use it, that has caused additional fragmentation. So we've got some teams working on streaming telemetry and we see a lot more variation in what's available, how it's available, and how we interact with the box. A lot more variation on the streaming telemetry side than we see on the SNMP side.
Phil Gervasi: Where I was going with my question earlier is that maybe that's a problem that streaming is solving, that it is, again, more consistent platform to platform. And you completely thwarted my argument, but that's why we're talking, I wanted to understand this. And I do know that there are some vendors out there that haven't made any public announcements whatsoever about support for streaming telemetry in general, let alone anything in particular. So then that's a negative. But based on our conversation so far, it's been 20 minutes or 30 minutes of discussing the weaknesses of SNMP. And now we're talking about some of these weaknesses of streaming telemetry, but the weaknesses aren't technical in nature. You're just telling me that we're not there yet, okay. But should we be getting there, should streaming replace SNMP? Is SNMP dead, or at least dying, while we're still making streaming telemetry more commonly adopted in the industry? Or is SNMP still very much alive and well and should be part of our overall strategy of monitoring?
Chris O'Brien: And if we step one back from that, it's like, if streaming telemetry is so great, then why are we all still using SNMP? I think there's something fundamentally wrong there with how we're approaching the situation. And I think the first step we need to take to solve that is to recognize that SNMP isn't dead or close to dead. SNMP is running in maybe 99.9% of networks today and is the primary method of monitoring in maybe a full 99% of networks. So if we're starting from the position of SNMP is dead, we don't understand even where we are. It's hard to figure out how to navigate to where you want to go without understanding where we are. Where we are today is, SNMP is absolutely ubiquitous and vital in almost every network in the world. So starting from that position, the next step is we've got this new tool, streaming telemetry, we want to use it more. The first thing that has to happen is manufacturers have to build support. They are and they have. But the number of devices most people have that support streaming telemetry is relatively few. So maybe it's 10%, maybe it's 5%, maybe it's less than 5%. So the bad news is you've got all this gear that doesn't support streaming telemetry. And in some networks it makes perfect sense to run that gear into the ground. You're going to have that gear seven years, 10 years, longer than that. So replacing your gear to support streaming telemetry is like a silly proposition for many or most companies. So a lot of the approach so far has been, how do we replace SNMP with streaming telemetry? And I think that is what is causing a lot of the delay and basically making it impossible for folks to adopt streaming telemetry. What the world is, is a place where maybe today it's 99% SNMP and 0.1% streaming telemetry. And maybe a year from now we want it to be... If everything was fantastic and we had our druthers and we got all the benefits of streaming telemetry as fast as we could, maybe the world would be 95% SNMP and 4.9% streaming telemetry. So my view is that's the ideal world. We have to design a way to interact with our networks that has SNMP not relegated as archaic technology from the past that you can get nothing from, but really treats SNMP like a first-class citizen. Like this is a large part of the lifeblood of the network. And then on this 5%, 10%, whatever portion of our gear that does support streaming telemetry... The good news here is this tends to be your more expensive gear and your more critical gear. So that 5% can be an outsized portion of the value and the importance of your network. So what you want to do is be able to collect data, use streaming telemetry on that gear, and really put these things side by side and make it so that if you're looking at a dashboard, if you're writing an alert or receiving an alert, if you're building some sort of query over your data, all of these things return data regardless of whether the data source is SNMP or streaming telemetry. It's just that with streaming telemetry, the data is much fresher, it's more frequent. Maybe there are other sources of data. We need to be able to use both of these things like first-class citizens.
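One way to picture the side-by-side approach Chris describes is a common record shape that both collection paths feed, so dashboards and alerts query it without caring about the source. This is a sketch under assumptions: the `Metric` fields, the adapter functions, and the device names are hypothetical and do not reflect any particular product's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """Hypothetical source-agnostic record for one data point."""
    device: str
    interface: str
    name: str
    timestamp: float
    value: float
    source: str  # "snmp" or "streaming"; kept for provenance, not for querying


def from_snmp_poll(device, interface, name, polled_at, value):
    # With polling, the timestamp is when the collector received the answer.
    return Metric(device, interface, name, polled_at, value, "snmp")


def from_stream_update(device, interface, name, device_timestamp, value):
    # With streaming, the device can timestamp the sample itself.
    return Metric(device, interface, name, device_timestamp, value, "streaming")


def latest(metrics, device, interface, name):
    """Return the freshest matching data point, whatever its source."""
    matching = [m for m in metrics
                if (m.device, m.interface, m.name) == (device, interface, name)]
    return max(matching, key=lambda m: m.timestamp, default=None)


metrics = [
    from_snmp_poll("access-sw-1", "eth0", "in_octets", 300.0, 1000.0),
    from_stream_update("core-rtr-1", "eth0", "in_octets", 300.0, 5000.0),
    from_stream_update("core-rtr-1", "eth0", "in_octets", 310.0, 5600.0),
]
print(latest(metrics, "access-sw-1", "eth0", "in_octets"))
print(latest(metrics, "core-rtr-1", "eth0", "in_octets"))
```

The same `latest` query serves both devices. The streamed device simply has fresher and more frequent points behind it.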
Phil Gervasi: So ultimately what you're describing is more of a device lifecycle, IT operations problem, or at least a hindrance to the mass adoption or quicker mass adoption of streaming telemetry. The reality is we have devices that we're trying to depreciate over time. At least that's what the finance team wants to do. And then also just the IT operations, the operational aspect of lifting, or rather ripping and replacing, all of my gear with other gear. I mean the reality is you have your hardware refreshes every five, seven years, maybe 10 years on core devices. And if you're in the middle of it, you're not touching it right now. If it doesn't support it, it doesn't support it. So it doesn't sound like streaming is inherently fraught with weaknesses and inaccurate information, or that it's just not a good telemetry tool or monitoring tool. It's just that it's much more difficult to move to it, at least based on what you're saying, than what the industry described back in 2018 and in the past seven years since. Seven years, not that many, six years, five years.
Chris O'Brien: Yeah, like a wholesale cutover. This [inaudible] cutover is completely inapplicable. We won't be able to get there. It's a crazy idea and it's really just not... Again, I think if we start from, "Man, is SNMP valuable," which seems like a foregone conclusion when you consider how many of us are using it and to what degree we're using it, then it's much easier to get to the point where we have a rational plan where we still use SNMP and then are starting to use streaming telemetry as a new tool to solve some of these problems we're seeing, where applicable, not as a full replacement. That's the path forward for sure.
Phil Gervasi: And in the interim, and for probably quite a few years, I mean you are going to have to have systems in place that are able to ingest all of these types of telemetry, including SNMP. Despite it having that moniker of legacy and archaic like you used before, it's still very much relevant and therefore has to be taken into consideration even among the most modern and forward-thinking monitoring platforms. And that's alongside what we're doing now, moving forward with ingesting streaming telemetry information. Just agree with me, Chris, just agree with that-
Chris O'Brien: Yeah, I agree. Yeah, I agree with what you're saying. Yeah.
Phil Gervasi: Right. Chris, this has been a great discussion. I would like to get into the weeds more specifically on how streaming works, on how SNMP works, how we can utilize it today better than we have in the past possibly. Maybe just flesh it out a little bit more in the weeds than I think we had time for today. But I did want to address this idea of SNMP is dead. I remember hearing once in a conference, in a small group I was in, somebody made the comment, "SNMP needs to be taken out back and shot." And I was just like, "Really? Really? Because I know a lot of people that are still using it and operating their networks with it." So it struck me as strange. So this has been a great conversation. Now Chris, I'm going to assume that somebody wants to yell at you that has heard this podcast today. So how can they reach out to you online with a question, comment, angry or positive?
Chris O'Brien: Yeah, you can contact me at kentik.com, cobrien, C-O-B-R-I-E-N, @kentik.com. I love talking about this stuff, so shoot me an email.
Phil Gervasi: Yeah, great. And then of course, to our audience, look out for future podcasts where Chris and I get into this more. Because I have a lot more questions and we have a lot more stuff in the show notes that we didn't even cover. And you can find me online at Twitter still, @network_phil. You can search me on LinkedIn, of course, and you can email me at pgervasi@kentik.com. Now, if you have an idea for an episode or if you'd like to be a guest on Telemetry Now, please reach out to us at telemetrynow@kentik.com. And of course, keep an eye out for future episodes coming your way. And for now, thanks for listening. Bye-bye.