Telemetry Now  |  Season 1 - Episode 1  |  November 15, 2022

What does network observability really solve?


Network observability is a popular term in the industry right now, and everyone seems to have their own definition. In this episode of Telemetry Now, Avi Freedman, CEO and co-founder of Kentik, joins us to talk about what network observability is really all about, what makes it work, and what problems it solves for packet nerds and router jockeys trying to keep the lights on.


Key Takeaways

  • [00:33 - 02:03] A shared love of Star Trek
  • [02:04 - 04:10] Observability predates network technology, and inferring status
  • [04:11 - 06:35] Visibility into all networks and observability as necessity
  • [06:35 - 09:44] Ingesting telemetry and extrapolating inferences from data
  • [09:48 - 11:19] Collecting everything that can be collected, in an ideal world everything would be collected
  • [11:22 - 13:54] Why are we correlating, normalizing, and standardizing data acquisition? And what is the problem we're solving?
  • [13:54 - 16:09] The flow of data and how it gets enriched along the way


Transcript

This is Telemetry Now, and I'm your host, Phil Gervasi. And with me today is a legend in the networking world, and the co-founder and CEO of Kentik, Avi Freedman.

Now, I just finished reading Avi's book, Network Observability for Dummies, and I do have a few questions to ask and also comments to make. Like, for example, how do we get visibility into networks that we don't own? And why do we need just so much information from the network? I mean, what are we really trying to solve here with network observability?

So let's get started.

Hey, Avi. It's really good to see you. Thanks for joining today. Well, I guess nobody else can see you. This is an audio-only podcast. I mean, I can see you with our software. The rest of the audience will have to use their imagination. It is good to see you, and thanks for joining today.

Good to see you and thanks for having me on.

Great. So before we get going, I wanna establish just a foundation here, a kind of a knowledge base between the two of us.

Okay.

I know that we share a common passion for certain science fiction. Among all of the different Star Trek motion pictures out there, what would you consider your favorite one?

Well, my favorite one, and this is gonna show that I'm not quite the Star Trek nerd that I should be, so I would say four. Yes.

The Voyage Home, it was four.

Spock on the bus, you know, with the nerve pinch. Yes.

Okay. That was a good one, because I love whales and marine biology and that sort of thing. That was a big part of my childhood. However, I am disappointed that you did not say Star Trek II, The Wrath of Khan.

Well, you know, I try not to watch bad movie physics or read bad movie physics dot com and, you know, look at too much of it, but I just have to say, the idea that the z-axis in the third dimension was a new concept not taught at Starfleet Academy, I just couldn't really, you know, they, oh, we're gonna come up from the bottom, and no one's gonna expect us there. I didn't really. That sort of was like, yeah.

Right. I think most sci fi nerds agree that that is probably the best one.

So let's get started, Avi. I just finished reading your ebook not long ago, and I do have some questions and comments. And, you know, frankly, I know that the idea of observability itself predates technology.

I mean, you know, we're talking about... Well, it predates networks, especially.

Right?

Network observability was, like, the state of the resistors and electrical network type... Yeah.

But I mean, in its pure form, observability is about looking at the components or the outputs of a system to infer, you know, its health, right, or its status.

Mhmm. Yeah. I guess I haven't seen that in a non-tech setting, but I will go look.

And... Okay.

Well, then let's talk about that in the context of networking. How does that really hold true for networking, considering that there are so many disparate parts?

I guess that's sort of the point. If you're peering at one port on a network, trying to scry and understand and reverse engineer all the inputs that might have caused what you're seeing, then, you know, usually you're sitting there scratching your head and you can't figure it out. And especially with Internet-related networking.

If it takes you longer to figure out what the problem is than the duration of the problem, that's not very helpful. That's just wasting a lot of time. And so, you know, sometimes things are, you know, real problems, but only for two, three hours: congestion on a network that is remote to you. Or, you know, on the path towards Twilio, if that's where you need to get to, to send your SMS.

So that's sort of the whole point: the more complex your network, the more you need to be able to see the different parts of it, or you're just sort of trying to infer what might be the problem beyond it, which is pretty difficult when you've got cloud, you've got data center, you've got WAN, you've got the internet CIDR. And really, you've got the applications, which are, I call them nowadays, magic packet transporters, you know, even CDNs. Right? You know, it's not like you have a prefix and your shortest path or distance vector and you're just trying to get somewhere.

It's people magically load balancing traffic around, and those are things that are important to see from a telemetry perspective too.

Okay.

So somewhere in the beginning of your ebook, I think it was page four or five, I don't remember, there was a bulleted list of a bunch of things that you personally consider the major tenets of network observability. And you kind of established, and your answer just now made sense, that observability might be a necessity because of the complexity of the network.

So there's a bunch of bullets there. I wanna focus on the first two though. The first one is you have to see all networks. Now that to me, as an, you know, experienced network engineer, kinda bothers me.

Not bothers me, but I really need you to elaborate on that for me, because we all know that maybe we could see our own network. Right? We have some visibility there, but we're dealing, like you said, with networks that we don't own, like public clouds, SaaS providers, CIDR, maybe the overlays or SDN. That's a thing now.

So how do we, you know, how do we see all those networks?

Well, you know, we live in the shadow world underneath the platonic ideal, and you can't always get to a hundred percent.

But... Mhmm. You need to measure the internet if that's something that's important to you. And you can measure the internet by doing probes, you know, synthetic testing. With the academic folks, they don't like the name synthetics, so they would say active testing or probing.

Or if you actually have routing information that you can combine with performance-enhanced network traffic, so that could come from a service mesh, that could come from eBPF in a server, and you can combine that with routing, then you can project that across the AS paths that you have from BGP. You can actually begin to look into the internet even though it's not your network. If it's SD-WAN, you are dependent often on getting telemetry from the vendor, or again, you can set up full meshes of probes and do at least testing to understand whether there could be some transport instability.

In cloud, depending on the cloud, you know, Google actually will give you performance data in their VPC flow logs. For the others, you're not gonna get performance for the network layer. So, again, you're sort of left without... you're not gonna get their BGP information, but you can look at path data from synthetic traces and add that all together. So, you know, the less you control the network, the more you need to be measuring it differently than with SNMP and streaming telemetry and NetFlow, because they're not gonna give you that data. You need to measure your own traffic, and then you need to look at what the performance of that infrastructure is by testing it.
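
One way to picture "projecting traffic performance across the AS paths you have from BGP" is the rough Python sketch below. It is only an illustration under stated assumptions, not a description of any vendor's implementation: the prefixes, AS numbers, field names (`dst`, `rtt_ms`), and the helper names `lookup_as_path` and `rtt_by_asn` are all hypothetical. Each flow's destination is matched to its longest BGP prefix, and the measured round-trip time is then credited to every AS on that path, so a consistently slow AS stands out even though it is not your network.

```python
import ipaddress
from collections import defaultdict
from statistics import mean

# Hypothetical BGP table: prefix -> AS path (documentation prefixes, private-use ASNs).
BGP_TABLE = {
    "203.0.113.0/24": [64500, 64510],
    "198.51.100.0/24": [64500, 64520],
    "198.51.100.128/25": [64501, 64520],
}

# Hypothetical flow records carrying a measured round-trip time in milliseconds.
FLOWS = [
    {"dst": "203.0.113.10", "rtt_ms": 20.0},
    {"dst": "198.51.100.7", "rtt_ms": 30.0},
    {"dst": "198.51.100.200", "rtt_ms": 90.0},
    {"dst": "198.51.100.201", "rtt_ms": 110.0},
]


def lookup_as_path(dst, table):
    """Return the AS path of the longest matching prefix, or None if no prefix matches."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, path in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, path)
    return best[1] if best else None


def rtt_by_asn(flows, table):
    """Average the RTT of every flow whose AS path traverses each ASN."""
    samples = defaultdict(list)
    for flow in flows:
        path = lookup_as_path(flow["dst"], table)
        if path is None:
            continue  # no route known for this destination
        for asn in path:
            samples[asn].append(flow["rtt_ms"])
    return {asn: mean(values) for asn, values in samples.items()}


if __name__ == "__main__":
    for asn, rtt in sorted(rtt_by_asn(FLOWS, BGP_TABLE).items()):
        print(f"AS{asn}: {rtt:.1f} ms")
```

A linear scan over prefixes keeps the sketch short; a real system handling full routing tables would more likely use a trie or similar structure for the lookup.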

And ultimately, there's a lot of inferring going on here then. You have to extrapolate based on certain data that this is what's happening in the provider's network. Am I right?

Yeah. So you're still doing some level of correlation, normalization, inference, but it's a lot easier to figure out. You know, you may not be able to debug Amazon's problem for them, but you can know that there is a problem, versus if you don't have any data for that cloud, or if you don't have any data that looks at the internet, and you're just seeing, I can't connect somewhere.

Okay. So this is actually really good, because what you're doing is talking about your second bullet point, which is the other thing that I wanted to discuss here. You know, this idea of ingesting telemetry. In your book, you should have a system to receive or poll for all kinds of data or all kinds of telemetry and then ingest it so you can analyze it.

I forgot what you said: correlate, measure it, that kind of thing. And that's what you're talking about just now.

So you're not just talking about flow data and maybe some SNMP information. You just mentioned a lot of different types of information. Is that really the key here, the underlying technology that we need to rely on, that diversity of visibility?

Well, I think even within there, there's three things you can look at. The first is, what's the kind of data that you want. And that could be metrics, whether it's SNMP, streaming telemetry, God help us, CLI scraping, which some optics data you need to get from the CLI, but that's really just a time series of metrics. API. You would think in theory, on all the routers, those four things would be the same, but they're not always, for various reasons.

Traffic data, which can be NetFlow, could be VPC flow logs, it could be from eBPF, it could be from PCAP, you know, it could be up at the application layer from, you know, Envoy, you know, a proxy.

All those things are traffic data. It could be events, you know, of something went up or down, whether that's syslog or, you know, an SNMP trap. It could be configuration information, you know, or changes. Mhmm. There's a tremendous amount of telemetry that you can get from a network, and of course, you know, traffic performance probes, like we talked about, you know, that you do synthetically, whether it's at the network layer or up to the application layer.

Yep.

All that's telemetry. And then you need to be able to, you know, normalize it. It's better if you can think about traffic not as NetFlow versus VPC flow logs versus whatever. You know, some of it may have IPv6 addresses, some of it may have VXLAN, some of it may have TCP flags, some of it may have, you know, HTTP return codes, but putting it all together lets you ask better questions. And the third is you need to enrich it, because, you know, if you just have interface ID and not name, that doesn't help. If you have just name, but you don't really know what part of your network it's going to, that doesn't help. If you have BGP... I forgot routing as a whole category of telemetry. If you have routing separate, you're never gonna put it together after the fact, because it's changing every second. So you need to take a wide variety, you need to be able to make the similar things look the same, and you need to be able to extend it so that people can ask better questions.

Right? So they know what customer that traffic is, etcetera.
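
As a small illustration of "make the similar things look the same, then enrich", here is a hedged Python sketch. Everything in it is an assumption made for the example: the `Flow` record shape, the input key names standing in for decoded NetFlow and VPC flow log records, the `IF_NAMES` table, and the helper functions. The point is only that once two feeds are mapped to one shape and the interface name is attached at ingest, a single question can be asked of both.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Flow:
    """One traffic record in a common shape, whatever feed it came from."""
    source: str
    src_ip: str
    dst_ip: str
    byte_count: int
    device: Optional[str] = None
    if_index: Optional[int] = None
    if_name: Optional[str] = None  # filled in by enrich()


def from_netflow(rec: dict) -> Flow:
    # Input keys are hypothetical, standing in for a decoded NetFlow record.
    return Flow("netflow", rec["src_addr"], rec["dst_addr"], rec["in_bytes"],
                device=rec["exporter"], if_index=rec["input_if"])


def from_vpc_flow_log(rec: dict) -> Flow:
    # Input keys are hypothetical, standing in for a parsed VPC flow log line.
    return Flow("vpc_flow_log", rec["srcaddr"], rec["dstaddr"], int(rec["bytes"]))


# Hypothetical interface table, e.g. built from polled metrics: (device, index) -> name.
IF_NAMES = {("edge1", 3): "xe-0/0/3"}


def enrich(flow: Flow, if_names: dict) -> Flow:
    """Attach the interface name at ingest, when the record carries an index."""
    if flow.device is not None and flow.if_index is not None:
        flow.if_name = if_names.get((flow.device, flow.if_index))
    return flow


def bytes_to(dst_ip: str, flows: list) -> int:
    """One question asked once, regardless of which feed each record came from."""
    return sum(f.byte_count for f in flows if f.dst_ip == dst_ip)


RAW_NETFLOW = {"exporter": "edge1", "input_if": 3, "src_addr": "192.0.2.10",
               "dst_addr": "203.0.113.5", "in_bytes": 1200}
RAW_VPC = {"srcaddr": "10.0.0.4", "dstaddr": "203.0.113.5", "bytes": "800"}

FLOWS = [
    enrich(from_netflow(RAW_NETFLOW), IF_NAMES),
    enrich(from_vpc_flow_log(RAW_VPC), IF_NAMES),
]

if __name__ == "__main__":
    print(FLOWS[0].if_name, bytes_to("203.0.113.5", FLOWS))
```

The cloud record simply keeps `if_name` empty, which reflects the point made above: different sources carry different fields, and the common shape has to tolerate that.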

Yeah. I wanna ask you about that and elaborate on it, but I do wanna go back to one point you made. You know, you said you wanna decide on what information you wanna collect, but going back to our original definition of observability, don't we wanna collect everything, or is that even not necessary?

In the platonic ideal, we'd like to collect and keep everything.

Unfortunately, as you said, we don't control... I mean, there's networks you control. There's networks you don't. And even in the networks that you control, there's vibrating AGSes left in the corner that, you know, are leaking oil, with flies, you know, swimming around them, like in the Simpsons episode, that, you know, are gonna fall over if you ask them to do too much.

In fact, a little trivia thing: Kentik probably has, I don't know, fifty customers sending us streaming telemetry. As far as I know, zero of them send at less than a thirty-second interval for MIB-equivalent data, because they're concerned about overloading the control plane. So, you know, you can't really get everything, but you want to get enough of a view across the different networks, and across the different devices where there are devices, and across the different categories of routing, traffic, metrics, and such, to be able to stitch together an understanding, by being able to look at the data and figure out what's actually happening inside, which is what, you know, observability is about, or where the problem is.

And ideally, be told about it, with a pointer where to look, so you're not just scrying, you know, within there. They're in line with the benefits of network operations.

Correct?

Yep.

Yep.

So, you know, we talked about that kind of classical definition of observability and then, you know, how we wanna have visibility into all the networks, see everything, and then ingest that. But you also talked about correlating it, normalizing it, standardizing it. Those are all kinda like machine learning terms right there.

Why? Why are we doing that? Now you did say, so you have some sort of insight, so you can ask any question of the network. Well, why though? Like, why do I need to do that? What's the problem that we're trying to solve here?

Well, there's a couple things that you wanna be able to do. The first is, you don't wanna, say, ask the same question from my NetFlow v four enabled devices, or the traffic data I get from hosts, or, you know, the IPFIX devices, and have to ask all those questions separately, or go to five different, you know, databases, time series databases, platform boxes, whatever, to ask the same question basically about every part of your network. You wanna put all that together and be able to ask it in a single way, which is, show me this across all the networks, really no matter what the source of traffic data.

Right? And it's much better to be able to say, show me what applications this customer is accessing for this division of my company. Right? So, you know, break down that WAN link by application and, you know, remote AS, which you need BGP for.

Or by customer if you're a hosting company, right, which of my customers is using my expensive long-haul connectivity. And if everything is stuck in IP addresses, which is what you get without enrichment, that is torturous, laborious, or impossible, depending on your skill. So if you don't put all the traffic, all the metrics, all those things together, and by the way, most people have different systems even for SNMP and streaming telemetry today.

Which is, again, even like a basic bar for people to get over. If you don't put all the metrics together, all the traffic and all that, then you have to ask the same question multiple times. And if you don't enrich it with the business... I don't know. There's no great IETF term. Well, I don't really like IETF terms, but there's business identifiers. Like, what is the meaning of this?

Is it, you know, by customer, application, whatever that means to you as you debug things, then that can make it more difficult to actually get at the root of things also.
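
To make the "stuck in IP addresses" point concrete, the following Python sketch tags raw records with business identifiers and then answers a breakdown question along any combination of them. It is a minimal sketch under assumptions: the customer prefixes, the port-to-application map, the record keys, and the function names `tag` and `breakdown` are all invented for illustration, and guessing the application from the destination port is a deliberate simplification.

```python
import ipaddress
from collections import Counter

# Hypothetical business mappings; a real deployment would load these from inventory.
CUSTOMERS = {"192.0.2.0/25": "acme", "192.0.2.128/25": "globex"}
APPS = {443: "https", 53: "dns"}

# Hypothetical raw records: nothing but addresses, ports, and byte counts.
RECORDS = [
    {"src_ip": "192.0.2.10", "dst_port": 443, "bytes": 5000},
    {"src_ip": "192.0.2.20", "dst_port": 53, "bytes": 200},
    {"src_ip": "192.0.2.200", "dst_port": 443, "bytes": 7000},
    {"src_ip": "192.0.2.11", "dst_port": 443, "bytes": 1000},
]


def tag(record):
    """Return a copy of the record with customer and application names attached."""
    addr = ipaddress.ip_address(record["src_ip"])
    customer = next(
        (name for prefix, name in CUSTOMERS.items()
         if addr in ipaddress.ip_network(prefix)),
        "unknown",
    )
    application = APPS.get(record["dst_port"], "other")
    return {**record, "customer": customer, "application": application}


def breakdown(records, *dims):
    """Total bytes grouped by the given dimensions, largest first."""
    totals = Counter()
    for rec in records:
        totals[tuple(rec[d] for d in dims)] += rec["bytes"]
    return totals.most_common()


TAGGED = [tag(r) for r in RECORDS]

if __name__ == "__main__":
    print(breakdown(TAGGED, "customer"))
    print(breakdown(TAGGED, "customer", "application"))
```

With the tags in place, "which customer is using the link" and "which application per customer" are the same function call with different dimensions, rather than separate investigations over raw addresses.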

Okay. So ultimately, getting at the root of things more quickly. Right? Exactly. To summarize what you said, you're gonna collect a ton of information, disparate information, in order to get to the root cause of a problem, you know, application delivery, service, for every problem.

Quickly, efficiently, all that kind of thing. But, you know, Avi, you mentioned SNMP, routing table information, enrichment, you know, there's a ton of other stuff there. Those are all really different types of data. Oh, yeah.

Yeah. So, I mean, how do you just take, like, eight hundred and sixty-two different types of data, databases, and then smush them all together? Like, that seems like a major hurdle in and of itself.

It is. I mean, that's what we have a lot of people at Kentik that do. And it's more than that, because when the flow data comes in, if you wanna know the interface name, you need to look in the metrics. And when any kind of traffic data, which might not be NetFlow, comes in, you know, it could be VPC flow logs, it could be from a web server.

Right? It could be up at that layer, or a Palo Alto firewall. If you wanna see the intermediate network hops, you might wanna put the entire BGP AS path on it. So you need to be able to store all these things, and you need to be able to enrich.

At Kentik, you know, when we started we actually didn't store BGP index separately. We just used it to enrich the traffic data, because we were very traffic sensitive. As we grow, we try to both store these things and link them together in real time at ingest.

So, you know, maybe one of the craziest versions of that is the way, for some, you know, super large telcos, we'll take a feed of all the DNS queries.

In real time, and all the BGP. And then when we get the traffic, we try to figure out what site someone was going to. And, you know, a token, not their actual ID, but, like, of what subscriber it was, so that people can do analysis of those magic packet transporters. Right? What's Facebook doing in my network that comes from different CDNs or might come from their own AS? So there's a huge amount of real-time correlation, and then they're storing these things, which are, you know, in different formats. So we have a pretty generic database and, you know, that evolves over time for that.
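
The DNS correlation described here can be pictured with a short Python sketch. This is a guess at the general shape of the technique, not an account of how Kentik actually does it: the record fields, the subscriber tokens, the five-minute `max_age` window, and the `correlate` function are assumptions for illustration. DNS answers and flows are replayed in time order, and each flow is labelled with the name that the same subscriber most recently resolved to that destination address.

```python
# Hypothetical DNS answer feed: which (tokenized) subscriber resolved which name to which IP.
DNS_ANSWERS = [
    {"ts": 100.0, "subscriber": "tok-a", "name": "video.example", "ip": "203.0.113.5"},
    {"ts": 150.0, "subscriber": "tok-b", "name": "social.example", "ip": "203.0.113.5"},
]

# Hypothetical flows: the same CDN address serves different sites to different subscribers.
FLOWS = [
    {"ts": 110.0, "subscriber": "tok-a", "dst_ip": "203.0.113.5"},
    {"ts": 160.0, "subscriber": "tok-b", "dst_ip": "203.0.113.5"},
    {"ts": 900.0, "subscriber": "tok-a", "dst_ip": "203.0.113.5"},
]


def correlate(dns_answers, flows, max_age=300.0):
    """Label each flow with the site its subscriber last resolved to that IP.

    A DNS answer older than max_age seconds is treated as stale and ignored.
    Flows are returned in timestamp order.
    """
    # Merge both feeds into one timeline; at equal timestamps DNS answers go first.
    events = sorted(
        [(a["ts"], 0, a) for a in dns_answers] + [(f["ts"], 1, f) for f in flows],
        key=lambda event: (event[0], event[1]),
    )
    last_seen = {}  # (subscriber, ip) -> (name, timestamp of the answer)
    labelled = []
    for ts, kind, rec in events:
        if kind == 0:
            last_seen[(rec["subscriber"], rec["ip"])] = (rec["name"], ts)
        else:
            hit = last_seen.get((rec["subscriber"], rec["dst_ip"]))
            fresh = hit is not None and ts - hit[1] <= max_age
            labelled.append({**rec, "site": hit[0] if fresh else None})
    return labelled


if __name__ == "__main__":
    for flow in correlate(DNS_ANSWERS, FLOWS):
        print(flow["subscriber"], flow["dst_ip"], flow["site"])
```

Keying on the subscriber token as well as the address is what lets one shared CDN address map to different sites, and doing the join as events arrive matters because, as noted above, this kind of mapping cannot easily be rebuilt after the fact.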

Right. Okay. Well, I mean, this has been a really great discussion today. But I honestly feel like we're scratching the surface. I love getting into this stuff, into the weeds, and we're kind of heading in the direction that I've been very interested in lately, the past couple of years. So I'd love to talk to you again, you know, soon, and dig into some of the other aspects of network observability, especially how we're doing correlation, that kind of thing. So I'd love to have you on again soon.

Sure. I would love to talk about this or, you know, other nerdy topics, Star Trek nerdery or otherwise, with you, Phil.

So Great.

Thanks.

Great. So before we close then, how can folks reach you online?

Avi Freedman at LinkedIn, Twitter, probably some other things, or avi at kentik dot com. Great.

And you can find me on Twitter at network underscore phil, and you can search my name, Philip Gervasi, on LinkedIn as well. I'm pretty active in both places. So until next time, thanks for listening to Telemetry Now.

About Telemetry Now

Do you dread forgetting to use the “add” command on a trunk port? Do you grit your teeth when the coffee maker isn't working, and everyone says, “It’s the network’s fault?” Do you like to blame DNS for everything because you know deep down, in the bottom of your heart, it probably is DNS? Well, you're in the right place! Telemetry Now is the podcast for you! Tune in and let the packets wash over you as host Phil Gervasi and his expert guests talk networking, network engineering and related careers, emerging technologies, and more.