Network observability is a popular term in the industry right now, and everyone seems to have their own definition. In this episode of Telemetry Now, Avi Freedman, CEO and co-founder of Kentik, joins us to talk about what network observability is really all about, what makes it work, and what problems it solves for packet nerds and router jockeys trying to keep the lights on.
Avi is the co-founder and CEO of Kentik. He has decades of experience as a leading technologist and executive in networking. He was with Akamai for over a decade, as VP Network Infrastructure and then Chief Network Scientist. Prior to that, Avi started Philadelphia’s first ISP (netaxs) in 1992, later running the network at AboveNet and serving as CTO for ServerCentral.
Phil Gervasi: This is Telemetry Now, and I'm your host, Phil Gervasi. With me today is a legend in the networking world and the co-founder and CEO of Kentik, Avi Freedman. Now, I just finished reading Avi's book, Network Observability for Dummies, and I do have a few questions to ask and also comments to make. For example, how do we get visibility into networks that we don't own? Why do we need just so much information from the network? I mean, what are we really trying to solve here with network observability? Let's get started. Hey, Avi. It's really good to see you. Thanks for joining today. Well, I guess nobody else can see you, this is an audio-only podcast. I mean, I can see you with our software, the rest of the audience will have to use their imagination, but it is good to see you and thanks for joining today.
Avi Freedman: Good to see you and thanks for having me on.
Phil Gervasi: Great. Before we get going, I want to establish just a foundation here, a kind of knowledge base between the two of us.
Avi Freedman: Okay.
Phil Gervasi: I know that we share a common passion for certain science fiction. Among all of the different Star Trek motion pictures out there, what would you consider your favorite one?
Avi Freedman: Wow, my favorite one. This is going to show that I'm not quite the Star Trek nerd that I should be. I would say four, yes.
Phil Gervasi: The Voyage Home, it was four.
Avi Freedman: On the bus, with the nerve pinch, yes.
Phil Gervasi: Okay, that was a good one, because I love whales and marine biology and that sort of thing. That was a big part of my childhood. However, I am disappointed that you did not say Star Trek II: The Wrath of Khan.
Avi Freedman: Well, I try not to read badmoviephysics.com and look at too much of it, but I just have to say, the idea that the Z axis in the third dimension was a new concept not taught at Starfleet Academy, I just couldn't really. "Oh, we're going to come up from the bottom and no one's going to expect us there." That sort of was like, "Eh."
Phil Gervasi: Right. I think most sci-fi nerds agree that that is probably the best one. Let's get started here, Avi. I just finished reading your eBook not long ago and I do have some questions and comments, and frankly I know that the idea of observability itself, that predates technology. I mean, we're talking about-
Avi Freedman: Well, predates networks especially, right? Network observability was the state of the resistors and electrical network type stuff.
Phil Gervasi: Yeah, but I mean, in its pure form, observability is about looking at the components or the outputs of a system to infer its health, right, or its status.
Avi Freedman: Mm-hmm. Yeah.
Phil Gervasi: Yeah.
Avi Freedman: I guess I haven't seen that in a non-tech setting, but I will go look and see.
Phil Gervasi: Mm-hmm. Okay, well, then let's talk about that in the context of networking. How does that really hold true for networking considering that there are so many disparate parts?
Avi Freedman: I guess that's sort of the point. If you're peering at one port on a network trying to scry and understand and reverse engineer all the inputs that might have caused what you're seeing, then usually you're sitting there scratching your head and you can't figure it out. Especially with internet-related networking, if it takes you longer to figure out what the problem is than the duration of the problem, that's not very helpful. That just wastes a lot of time. Sometimes things are real problems, but only for two or three hours: congestion on a network that is remote to you, or in towards Twilio if that's where you need to get to send your SMS. That's the whole point: the more complex your network, the more you need to be able to see the different parts of it. Otherwise you're just sort of trying to infer what might be the problem beyond it, which is pretty difficult. You've got cloud, you've got data center, you've got WAN, you've got the internet side, and really you've got the applications, which are, I call them nowadays, magic packet transporters, even CDNs, right? It's not like you have a prefix and your shortest path or distance vector and you're just trying to get somewhere. People are magically load balancing traffic around, and those are things that are important to see from a telemetry perspective too.
Phil Gervasi: Okay, so somewhere in the beginning of your ebook, I think it was page four or five, I don't remember, there was a bulleted list of a bunch of things that you personally consider as the major tenets of network observability. And your answer just now made sense: you established that it's observability by necessity, because of the complexity of the network.
Avi Freedman: Yeah.
Phil Gervasi: There's a bunch of bullets there. I want to focus on the first two though. The first one is you have to see all networks. Now, that to me, as an experienced network engineer, kind of bothers me. Not bothers me, but I really need you to elaborate on that for me, because we all know that maybe we could see our own network. We have some visibility there, but we're also dealing, like you said, with networks that we don't own, like public clouds, SaaS providers, CDNs, maybe the overlay to our SD-WAN, that's a thing now. How do we see all those networks?
Avi Freedman: Well, we live in the shadow world underneath the platonic ideal and you can't always get to a hundred percent, but you need to measure the internet, if that's something that's important to you. You can measure the internet by doing probes, synthetic testing. The academic folks don't like the name synthetic, so they would say active testing or probing. Or, if you actually have routing information that you can combine with performance-enhanced network traffic, which could come from a service mesh or from eBPF on a server, then you can project that across the AS paths that you have from BGP, and you can actually begin to look into the internet, even though it's not your network. If it's SD-WAN, you are often dependent on getting telemetry from the vendor. Or again, you can set up full meshes of probes and at least do testing to understand whether there could be some transport instability. In cloud, it depends on the cloud. Google actually will give you performance data in their VPC flow logs. For the others, you're not going to get performance for the network layer. Again, you're not going to get their BGP information, but you can look at path data from synthetic traces and add that all together. The less you control the network, the more you need to be measuring it differently than with SNMP and streaming telemetry and NetFlow, because they're not going to give you that data. You need to measure your own traffic and then you need to look at what the performance of that infrastructure is by testing it.
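As a rough illustration of the idea of projecting measured performance across BGP AS paths, here is a minimal Python sketch. It is not how Kentik or any other product does it; the routing table, the sample format, and the function names (`lookup_as_path`, `latency_by_as`) are all hypothetical, and the addresses and AS numbers come from documentation and private-use ranges.

```python
import ipaddress
from collections import defaultdict
from statistics import mean

# Hypothetical routing table: prefix -> AS path learned from BGP.
ROUTES = {
    "203.0.113.0/24": [64500, 64510, 64520],
    "198.51.100.0/24": [64500, 64530],
}

def lookup_as_path(dst_ip, routes=ROUTES):
    """Return the AS path of the longest prefix that contains dst_ip."""
    addr = ipaddress.ip_address(dst_ip)
    best = None
    for prefix, path in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, path)
    return best[1] if best else []

def latency_by_as(samples, routes=ROUTES):
    """Average the latency of all samples whose AS path traverses each AS."""
    per_as = defaultdict(list)
    for dst_ip, latency_ms in samples:
        for asn in lookup_as_path(dst_ip, routes):
            per_as[asn].append(latency_ms)
    return {asn: mean(values) for asn, values in per_as.items()}

# Assumed input: (destination IP, measured latency in ms) per flow or probe.
samples = [("203.0.113.10", 180.0), ("203.0.113.20", 220.0), ("198.51.100.5", 20.0)]
result = latency_by_as(samples)
```

With these made-up samples, the ASes that appear only on the slow paths (64510 and 64520) stand out with a high average, while the shared first hop (64500) sits in between. That is the kind of inference available when the network in question is not your own.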
Phil Gervasi: Ultimately there's a lot of inferring going on here, then. You have to extrapolate based on certain data that this is what's happening in the provider's network. Am I right?
Avi Freedman: Yeah. You're still doing some level of correlation, normalization, inference, but it's a lot easier to figure out. You may not be able to debug Amazon's problem for them, but you can at least know that there is a problem, which you can't if you don't have any data for that cloud, or any data that looks at the internet, and you're just seeing, "I can't connect somewhere."
Phil Gervasi: Okay, so this is actually really good, because what you're doing is talking about your second bullet point, which is the other thing that I wanted to discuss here. This idea of ingesting telemetry. In your book, you say you should have a system to receive or poll for all kinds of data, or all kinds of telemetry, and then ingest it so you can analyze it. I forgot what you said, correlate it, measure it, that kind of thing. That's what you're talking about just now. You're not just talking about flow data and maybe some SNMP information. You just mentioned a lot of different types of information. Is that really the key here, the underlying technology that we need to rely on, that diversity of visibility?
Avi Freedman: Well, I think even within there, there are three things you can look at. The first is, what's the kind of data that you want? That could be metrics, whether it's SNMP, streaming telemetry, or, God help us, CLI scraping, because some optics data you need to get from the CLI, but that's really just a time series of metrics. Or API. You would think that in theory, on all the routers, those four things would be the same, but they're not always, for various reasons. Then there's traffic data, which can be NetFlow, could be VPC flow logs, it could be from eBPF, it could be from PCAP, it could be up at the application layer from Envoy, a proxy. All those things are traffic data. It could be events, something went up or down, whether that's syslog or an SNMP trap. It could be configuration information or changes. There's a tremendous amount of telemetry that you can get from the network. And of course, traffic performance probes like we talked about that you do synthetically, whether it's at the network layer or up at the application layer, all that's telemetry. Then you need to be able to normalize it. It's better if you can say, "Show me traffic," and think about traffic not as NetFlow versus VPC flow logs versus whatever. Some of it may have V6 addresses, some of it may have VXLAN, some of it may have TCP flags, some of it may have HTTP return codes. But putting it all together lets you ask better questions. The third is you need to enrich it, because if you just have interface ID and not name, that doesn't help. If you have just name but you don't really know what part of your network it's going to, that doesn't help. If you have BGP, and I forgot routing as a whole category of telemetry, if you have routing separate, you're never going to put it together after the fact, because it's changing every second.
You need to take a wide variety, you need to be able to make the similar things look the same, and you need to be able to extend it so that people can ask better questions, so they know what customer that traffic is, et cetera.
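To make the "normalize, then enrich" steps concrete, here is a small Python sketch under stated assumptions. The `TrafficRecord` schema, the converter functions, and the interface-name table are hypothetical illustrations, not any vendor's data model. The input dictionaries are simplified stand-ins for a NetFlow-style record and a VPC-flow-log-style record.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrafficRecord:
    """Hypothetical common schema so every traffic source looks the same."""
    source: str
    src_ip: str
    dst_ip: str
    byte_count: int
    device: Optional[str] = None
    in_if_index: Optional[int] = None
    in_if_name: Optional[str] = None  # filled in by enrichment

def from_netflow(rec):
    """Normalize a simplified NetFlow-style dict (assumed field names)."""
    return TrafficRecord("netflow", rec["srcaddr"], rec["dstaddr"],
                         rec["dOctets"], device=rec["exporter"],
                         in_if_index=rec["input"])

def from_vpc_flow_log(rec):
    """Normalize a simplified VPC-flow-log-style dict (assumed field names)."""
    return TrafficRecord("vpc_flow_log", rec["srcaddr"], rec["dstaddr"],
                         int(rec["bytes"]))

# Assumed lookup built from metrics polling: (device, ifIndex) -> interface name.
IF_NAMES = {("edge1", 3): "xe-0/0/3"}

def enrich(record, if_names=IF_NAMES):
    """Attach the interface name when the device and index are known."""
    if record.device is not None and record.in_if_index is not None:
        record.in_if_name = if_names.get((record.device, record.in_if_index))
    return record

nf = enrich(from_netflow({"srcaddr": "10.1.4.7", "dstaddr": "203.0.113.9",
                          "dOctets": 5000, "exporter": "edge1", "input": 3}))
vpc = enrich(from_vpc_flow_log({"srcaddr": "10.2.0.5", "dstaddr": "198.51.100.5",
                                "bytes": "1200"}))
```

After this step both records answer the same questions (who talked to whom, how many bytes), and the router-sourced one also carries a human-readable interface name. Fields a source cannot supply stay empty rather than being invented.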
Phil Gervasi: Yeah, I want to ask you about that and elaborate on it, but I do want to go back to one point you made. You said you want to decide on what information you want to collect, but going back to our original definition of observability, don't we want to collect everything, or is that even not necessary?
Avi Freedman: In the platonic ideal, one would like to collect and keep everything. Unfortunately, as you said, we don't control everything. I mean, there are networks you control and networks you don't, and even in the networks that you control, there are vibrating AGS+es left in the corner that are leaking oil, flies swimming around them like in the Simpsons episode, that are going to fall over if you ask them to do too much. In fact, a little trivia thing: Kentik probably has, I don't know, 50 customers sending us streaming telemetry. As far as I know, zero of them send IF-MIB equivalent data at less than a 30-second interval, because they're concerned about overloading the control plane. So there are some problems. You can't really get everything, but you want to get enough of a view across the different networks, across the different devices where there are devices, and across the different categories of routing, traffic, metrics and such, to be able to stitch together an understanding by being able to look at the data and figure out what's actually happening inside, which is what observability is about. Or where the problem is, and ideally be told about it with a pointer to where to look, so you're not just scrying with the interface.
Phil Gervasi: Therein lies the benefit to network operations. Correct?
Avi Freedman: Yep.
Phil Gervasi: Yeah, so we talked about that kind of classical definition of observability and then how we want to have visibility into all the networks, see everything and then ingest that. But you also talked about correlating it, normalizing it, standardizing, those are all like machine learning terms right there. Why? Why are we doing that? Now, you did say so you have some sort of insight, so you can ask any question of the network. Well why though? Why do I need to do that? What's the problem that we're trying to solve here?
Avi Freedman: Well, there's a couple things that you want to be able to do. The first is you don't want to ask the same question of my NetFlow V4 enabled devices, or the traffic data I get from hosts, or the IPFIX devices, and have to ask all those questions separately, or go to five different databases, time series databases, platforms, boxes, whatever, to ask basically the same question about every part of your network. You want to put all that together and be able to ask it in a single way, which is, show me this across all the networks, really no matter what the source of traffic data. It's much better to be able to say, "Show me what applications this customer is accessing, or this division of my company." Break down that WAN link by application and remote AS, which you need BGP for, or by customer if you're a hosting company: "Which of my customers is using my expensive long-haul connectivity?" If everything is stuck in IP addresses, which is what you get without enrichment, that is torturous, laborious, or impossible, depending on your scale. And by the way, most people have different systems even for SNMP and streaming telemetry today, which is again even a basic bar for people to get over. If you don't put all the metrics together, all the traffic and all that, then you have to ask the same question multiple times. If you don't enrich it with the business, I don't know, there's no great IETF term, well, I don't really like IETF terms, but there's business identifiers. "What is the meaning of this? Is it by customer, by application?" Whatever that means to you as you debug things. Without that, it can be more difficult to actually get at the root of things also.
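The "ask it once, across every source" point can be sketched in a few lines of Python. This is an illustrative toy, assuming records from different collectors have already been normalized into one shape. The customer prefix table, the record fields, and the function names are hypothetical.

```python
import ipaddress
from collections import Counter

# Hypothetical business mapping: customer-assigned prefix -> customer name.
CUSTOMERS = {"10.1.0.0/16": "acme", "10.2.0.0/16": "globex"}

def customer_for(ip, customers=CUSTOMERS):
    """Translate a raw IP address into a business identifier."""
    addr = ipaddress.ip_address(ip)
    for prefix, name in customers.items():
        if addr in ipaddress.ip_network(prefix):
            return name
    return "unknown"

def bytes_by_customer(records, link):
    """Answer 'which customers use this link?' once, for every data source."""
    totals = Counter()
    for rec in records:
        if rec["link"] == link:
            totals[customer_for(rec["src_ip"])] += rec["bytes"]
    return dict(totals)

# Assumed, already-normalized records from three different collectors.
records = [
    {"source": "netflow", "link": "long-haul-1", "src_ip": "10.1.4.7", "bytes": 5000},
    {"source": "ipfix", "link": "long-haul-1", "src_ip": "10.2.9.1", "bytes": 1500},
    {"source": "host-agent", "link": "long-haul-1", "src_ip": "10.1.8.8", "bytes": 2500},
    {"source": "netflow", "link": "metro-2", "src_ip": "10.2.0.5", "bytes": 9999},
]
usage = bytes_by_customer(records, "long-haul-1")
```

The query never mentions NetFlow, IPFIX, or host agents. The answer comes back in customer names rather than IP addresses, which is the difference between an enriched and an unenriched data set.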
Phil Gervasi: Okay, so ultimately getting at the root of things more quickly. Right?
Avi Freedman: Exactly.
Phil Gervasi: To summarize what you said is that you're going to collect a ton of information, disparate information in order to get to the root cause of a problem, application delivery, service delivery problem very quickly, efficiently, all that kind of thing.
Avi Freedman: Yeah. Yeah.
Phil Gervasi: But you know, Avi, you mentioned SNMP, routing table information, enrichment, there's a ton of other stuff there. Those are all really different types of data.
Avi Freedman: Oh yeah. Mm-hmm.
Phil Gervasi: Yeah. I mean, how do you just take 862 different types of data and databases and then smoosh them all together? That seems like a major hurdle in and of itself.
Avi Freedman: It is. I mean, that's what a lot of people at Kentik do.
Phil Gervasi: Right.
Avi Freedman: It's more than that, because when the flow data comes in, if you want to know the interface name, you need to look in the metrics. And any kind of traffic data can come in, which might not be NetFlow: it could be VPC flow logs, it could be from a web server, it could be up at that layer, or from a Palo Alto firewall. If you want to see the intermediate network hops, you might want to put the entire BGP AS path on it. You need to be able to store all these things and you need to be able to enrich. At Kentik, when we started, we actually didn't store BGP Index separately. We just used it to enrich the traffic data, because we were very traffic sensitive. As we grow, we try to both store these things and link them together in real time at ingest. Maybe one of the craziest versions of that is what we do for some super large telcos. We'll take a feed of all the DNS queries in real time and all the BGP, and then when we get the traffic, we try to figure out what site someone was going to and know a token, not their actual ID, of what subscriber it was, so that people can do analysis of those magic packet transporters, right? What's Facebook doing in my network that comes from different CDNs or might come from their own AS? There's a huge amount of real-time correlation, and then there's storing these things, which are in different formats. We have a pretty generic database, and that evolves over time for that.
Phil Gervasi: Right. Okay. Well, I mean this has been a really great discussion today, but I honestly feel like we're just scratching the surface. I love getting into this stuff, into the weeds, and we're kind of heading in the direction that I've been very interested in lately, the past couple years. I'd love to talk to you again soon and dig into some of the other aspects of network observability, especially how we're doing correlation, that kind of thing. I'd love to have you on again soon.
Avi Freedman: Sure. I would love to talk about this or other nerding topics, Star Trek, network or otherwise, with you, Phil.
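The DNS-to-traffic correlation described here can be illustrated with a toy Python sketch. This is only a guess at the general shape of such a join, not a description of Kentik's pipeline. The `DnsCorrelator` class, its five-minute freshness window, and the salted-hash `subscriber_token` helper are all assumptions made for illustration.

```python
import hashlib

class DnsCorrelator:
    """Remember recent DNS answers so later flows can be labeled with a site."""

    def __init__(self, max_age_s=300.0):
        self.max_age_s = max_age_s  # assumed freshness window for an answer
        self._seen = {}             # (client_ip, answer_ip) -> (qname, timestamp)

    def observe_dns(self, ts, client_ip, qname, answer_ip):
        """Record that client_ip resolved qname to answer_ip at time ts."""
        self._seen[(client_ip, answer_ip)] = (qname, ts)

    def label_flow(self, ts, src_ip, dst_ip):
        """Return the name src_ip most recently resolved to dst_ip, if fresh."""
        hit = self._seen.get((src_ip, dst_ip))
        if hit is None:
            return None
        qname, seen_ts = hit
        if ts - seen_ts > self.max_age_s:
            return None
        return qname

def subscriber_token(ip, salt="example-salt"):
    """Hypothetical pseudonymous token so analysis never stores the raw ID."""
    return hashlib.sha256((salt + ip).encode()).hexdigest()[:12]

correlator = DnsCorrelator(max_age_s=60.0)
correlator.observe_dns(100.0, "192.0.2.1", "video.example", "203.0.113.9")
site = correlator.label_flow(130.0, "192.0.2.1", "203.0.113.9")
token = subscriber_token("192.0.2.1")
```

Because the join happens as the flow arrives, the label reflects what the subscriber had just looked up. Doing it after the fact would mean reconstructing which DNS answer was current at that moment, which is the "you're never going to put it together after the fact" problem mentioned earlier.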
Phil Gervasi: Great.
Avi Freedman: Thanks.
Phil Gervasi: Great. Before we close then, how can folks reach you online?
Avi Freedman: Avi Freedman at LinkedIn, Twitter, probably some other things, or avi@kentik.com.
Phil Gervasi: Great. You can find me on Twitter @network_phil and you can search my name, Philip Gervasi, on LinkedIn as well. I'm pretty active in both places, so until next time, thanks for listening to Telemetry Now.
Do you dread forgetting to use the “add” command on a trunk port? Do you grit your teeth when the coffee maker isn't working, and everyone says, “It’s the network’s fault?” Do you like to blame DNS for everything because you know deep down, in the bottom of your heart, it probably is DNS?
Well, you're in the right place! Telemetry Now is the podcast for you!
Tune in and let the packets wash over you as host Phil Gervasi and his expert guests talk networking, network engineering and related careers, emerging technologies, and more.