Avi is the co-founder and CEO of Kentik. He has decades of experience as a leading technologist and executive in networking. He was with Akamai for over a decade, as VP Network Infrastructure and then Chief Network Scientist. Prior to that, Avi started Philadelphia’s first ISP (netaxs) in 1992, later running the network at AboveNet and serving as CTO for ServerCentral.
Phil Gervasi: Today, most of us use applications that actually don't live on the computer or the phone or the tablet right in front of us. Many, if not most, of the applications that we use are actually delivered over the network. And they touch many different network devices and services and network-adjacent services, some of which we own and manage and some that we don't. And that's to go from wherever that application lives, in your own data center, in the public cloud, in a containerized architecture, all the way down to the screen in front of you. So that means being able to analyze application performance over the network requires a lot more than just collecting flow or SNMP information. It means more than collecting flow from a WAN router and, let's say, SNMP info from your MDF switches to really understand why an application is performing the way it is over the network. We need more-- it means collecting a tremendous amount of telemetry from all sorts of different devices, but it also means adding to that body of data the additional, less quantitative business context and network-adjacent data to help us really understand why an application is performing the way it is. So with me today is a returning guest, Avi Friedman, founder and CEO at Kentik, to talk about telemetry enrichment, or in other words, adding that additional contextual data to our network telemetry to enable modern network analytics. My name is Phil Gervasi, and this is Telemetry Now. Avi, welcome. It's great to have you on again, this time to talk about enrichment, network telemetry enrichment, a lot of which I learned through reading your ebook Network Observability for Dummies. So again, welcome.
Avi Friedman: Thank you very much. Thanks for having me and thanks for reading the book.
Phil Gervasi: Absolutely. So I'd like to start off by defining enrichment. Now I don't remember what page exactly, but I do remember some bullet points where you talk about all the variety of data that we collect as far as network telemetry. And so my first question for you is why do we need something else? I mean, what is enrichment that it is different from flow and SNMP and eBPF and all these other forms of data? What is it? And then of course, why do we need it to do this advanced form of network analytics?
Avi Friedman: So in traffic data, let's just pick on NetFlow, VPC flow logs, eBPF, things like that, things that look at traffic. Sometimes the router can look up in the BGP routing table what the origin AS is, for example, and put that in the NetFlow record, which is... as a template. Sometimes that works, sometimes it doesn't, but it doesn't, for example, do the entire BGP AS path, so you can see sort of more of the vector of how the traffic is flowing. But if you have a BGP feed and you have a traffic record, you can do that lookup yourself, which is what Kentik does, and put on BGP attributes like community or AS path, which lets the service provider then say, oh, my customer [inaudible] or is this community or this other attribute that isn't present in any one single telemetry stream. But by combining them or adding metadata from your IPAM, let's say from NetBox, about, let's say you have a telemetry stream effectively of what IP addresses are on what servers and what servers are in what locations and what data centers or clouds, that becomes a telemetry stream: your metadata that you can combine with the flow traffic data to say, oh, something from this data center went to that data center, or from this application to that application. Whereas if you just look at the underlying telemetry, it just says that this IP and port sent to that IP and port so many bytes and packets. So to really ask questions, having the raw data, especially at the current volumes, and joining all that after the fact is really hard. It's very important when you get the data to add as much color. In fact, another word for enriching is coloring-- coloring telemetry with additional attributes.
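The ingest-time lookup described in this answer can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Kentik's implementation: the `BGP_ROUTES` and `IPAM` tables, the field names, and the `enrich_flow` helper are all hypothetical, and a real system would use a radix trie rather than a linear scan for the longest-prefix match.

```python
import ipaddress

# Hypothetical BGP table learned from a BGP feed: prefix -> attributes.
# Prefixes and AS numbers come from the documentation ranges.
BGP_ROUTES = {
    "198.51.100.0/24": {"as_path": [64496, 64500], "communities": ["64496:100"]},
    "198.51.100.128/25": {"as_path": [64496, 64501], "communities": ["64496:200"]},
}

# Hypothetical IPAM export (for example from NetBox): IP -> metadata.
IPAM = {
    "10.1.2.3": {"site": "dc-east", "application": "checkout"},
}


def longest_prefix_match(ip, routes):
    """Return (network, attrs) for the most specific route covering ip, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for prefix, attrs in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, attrs)
    return best


def enrich_flow(flow):
    """Return a copy of the flow record with BGP and IPAM context added."""
    enriched = dict(flow)
    match = longest_prefix_match(flow["dst_ip"], BGP_ROUTES)
    if match is not None:
        net, attrs = match
        enriched["dst_prefix"] = str(net)
        enriched["dst_as_path"] = attrs["as_path"]
        enriched["dst_origin_as"] = attrs["as_path"][-1]  # last AS in the path
        enriched["dst_communities"] = attrs["communities"]
    src_meta = IPAM.get(flow["src_ip"], {})
    enriched["src_site"] = src_meta.get("site")
    enriched["src_application"] = src_meta.get("application")
    return enriched


# The raw record only knows IPs, ports, bytes, and packets.
flow = {"src_ip": "10.1.2.3", "src_port": 44321,
        "dst_ip": "198.51.100.200", "dst_port": 443,
        "bytes": 1500, "packets": 3}
enriched = enrich_flow(flow)
print(enriched)
```

After enrichment the same record can answer questions such as "which application in which data center sent traffic toward which origin AS," which the raw five-tuple alone cannot.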
Phil Gervasi: So, Avi, really quick, is that what you mean by context? I mean you're using both of those terms here? Yeah, enrichment and context.
Avi Friedman: Yes, thank you for being pedantic. I try to be a member of homo-pedantic, but I am not always capable of fully getting there. So the context is, for example, this IP address is on that server, that's context about the telemetry that you're getting, or that server has that role, or this application lives on that virtual server, or this matches a threat feed, or that is on a CDN, or that is a crypto mining network. Those are context pieces of information that you can use to enrich, to add to other telemetry that you get.
Phil Gervasi: Well, Avi, I mean doesn't that kind of imply that all of the other telemetry taken as a whole isn't enough then, otherwise we wouldn't be doing this enrichment? What is it that we're getting from enrichment that we don't get from all of the rest of the telemetry that we have been collecting?
Avi Friedman: It's not enough by itself to ask some kinds of meaningful questions that people want to ask. If you don't understand what the cost of a link is, not the OSPF cost, but the dollar cost of a link, then how can you ask a question of your telemetry that says, how expensive is the traffic that I get from this customer? In fact, if you don't enrich your data with what a customer is, that's context, and how much costs are, then you can't ask certain kinds of questions that are helpful to ask, or what's the performance of this application in this data center from seeing traffic. If you have a performance feed but you can't map IP addresses to applications, then there's just questions you can't ask unless you get that context and add it on.
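A small sketch of the "how expensive is this customer's traffic" question shows why both pieces of context are needed. Everything here is an assumption made for illustration: the `COST_PER_GB` and `CUSTOMER_BY_IP` tables, the field names, and the flat per-gigabyte pricing (real transit contracts are often billed on 95th-percentile rates instead).

```python
from collections import defaultdict

# Hypothetical business context: dollar cost per gigabyte on each link.
COST_PER_GB = {"transit-a": 0.02, "peering-b": 0.0}

# Hypothetical customer mapping: source IP -> customer name.
CUSTOMER_BY_IP = {"10.0.0.5": "acme", "10.0.0.9": "globex"}


def cost_by_customer(flows):
    """Sum the dollar cost of traffic per customer using link and customer context."""
    totals = defaultdict(float)
    for flow in flows:
        customer = CUSTOMER_BY_IP.get(flow["src_ip"], "unknown")
        rate = COST_PER_GB.get(flow["egress_link"], 0.0)
        totals[customer] += flow["bytes"] / 1e9 * rate
    return dict(totals)


flows = [
    {"src_ip": "10.0.0.5", "egress_link": "transit-a", "bytes": 50_000_000_000},
    {"src_ip": "10.0.0.5", "egress_link": "peering-b", "bytes": 10_000_000_000},
    {"src_ip": "10.0.0.9", "egress_link": "transit-a", "bytes": 25_000_000_000},
]
costs = cost_by_customer(flows)
print(costs)
```

Without the two lookup tables, the flow records contain only addresses, links, and byte counts, so the question cannot even be posed.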
Phil Gervasi: Yeah, okay. That makes sense. I do have to say that when you said the word cost, the first thing that came to my mind was like OSPF cost.
Avi Friedman: Sorry, I had to clarify. Yes, we're talking about networking. So yes,
Phil Gervasi: I'm still very much in that frame of thinking as far as being a network engineer, but just reading this back to you, it sounds like enrichment or network telemetry enrichment is really adding that additional information to tie the various types of data that we collect together. So tying an interface statistic with an application flow with a timestamp from some log somewhere, tying all of that together to make it make more sense.
Avi Friedman: Or if I'm trying to do SLOs and I want to understand what are the important applications, I can have a whole bunch of performance data coming in, but if I don't identify what's worth waking someone up about because it matters, then it's hard. Then that data becomes... it's the machine that goes bing, it'll just be ignored because it's too much. There's always problems. Is the network down? Yes, the network is down-- to somewhere. Somewhere, the network is down. Does it matter? That's the interesting question.
Phil Gervasi: Right. Doesn't even matter. I mean, obviously a hard down of the entire network is going to affect application performance and an end user's digital experience, but there are things that occur on a network. So we're going to collect those metrics. We're going to collect that telemetry of some event that occurred, something that might not look so good, but actually doesn't affect application performance in any meaningful way. And so do we care? Now, from a technical perspective, what is the mechanism that we're using to ingest all of this information and then ultimately also enrich it with that additional metadata, that business context, whatever it is that we want to add to that network telemetry? I'm familiar with KTranslate and that we use that as one of our mechanisms to do that.
Avi Friedman: So KTranslate is Kentik opening up as open source, in Kentik Labs, a lot of the ingest technology that Kentik uses to do ingest, normalization of data, and some kinds of enrichment. There's some kinds of enrichment which are too large-scale to be in KTranslate, but KTranslate can do threat feeds and BGP and can do a lot of those kinds of enrichment. KTranslate is more primarily a telemetry bus that can do the easy and medium levels of enrichment, of adding context. But there are some kinds of context that, even for the folks in the telemetry bus space, like what KTranslate does, what Cribl does, are too large-scale. For example, we have a dozen-plus global tier-one backbones sending us data, and we take all the BGP feeds, one from each RR at each PoP, and we do multi-level lookups, and we have all those BGP feeds at every point where we get their traffic data or metrics or whatever, and that's HA-clustered... KTranslate doesn't do that autonomously today. So it can do simple... I have one BGP feed and one flow or SNMP and do some basic correlation, but it's a little bit more of a bus: replicate, filter, roll up, and add a little bit of context.
Phil Gervasi: So it seems like a lot of this enrichment information isn't necessarily purely networking-focused. We're taking in information from other parts of the network, other parts of our infrastructure, stuff that's just subjective, like application tags and IDs, and, as you mentioned, routing tables, which suggests to me that we are interested in more than just the pure network function of application delivery. And we're also interested in helping those that are not necessarily in a purely network engineering role. So perhaps sysadmins, the help desk, the NOC, those that are more concerned with the application itself. Am I right?
Avi Friedman: Absolutely. So KTranslate today can take OpenTelemetry, it can take syslog. It can't yet take directly from some of the proprietary formats of the other observability and application platforms, but it can send to them, and you could imagine it will soon be able to do that. So it's like a Swiss Army knife to take data in, take data out, filter, do rollups, change the shape of it, send it in a different binary format, and one of the things it can do is add telemetry. KTranslate started with sending data from Kentik into other data lakes and things like that. But when we started playing with the other observability players, they couldn't really take the full cardinality of network data-- network data will just explode time series databases. And then we adapted it to take flow in and things like that, so.
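The rollup step mentioned here, reducing cardinality before handing data to a time series database, can be illustrated with a short sketch. This is not KTranslate code; the `rollup` function and the field names are hypothetical, and the input records are assumed to be already enriched with a site and a destination AS.

```python
from collections import defaultdict


def rollup(flows, keys):
    """Aggregate flow records by a small set of keys to cut cardinality."""
    totals = defaultdict(lambda: {"bytes": 0, "packets": 0, "flows": 0})
    for flow in flows:
        group = tuple(flow[name] for name in keys)
        totals[group]["bytes"] += flow["bytes"]
        totals[group]["packets"] += flow["packets"]
        totals[group]["flows"] += 1
    return {group: dict(counters) for group, counters in totals.items()}


# Three per-IP records collapse into two per-site series.
flows = [
    {"src_ip": "10.1.2.3", "src_site": "dc-east", "dst_as": 64501, "bytes": 1500, "packets": 3},
    {"src_ip": "10.1.2.4", "src_site": "dc-east", "dst_as": 64501, "bytes": 500, "packets": 1},
    {"src_ip": "10.9.9.9", "src_site": "dc-west", "dst_as": 64501, "bytes": 4000, "packets": 4},
]
summary = rollup(flows, keys=("src_site", "dst_as"))
print(summary)
```

Grouping by enriched attributes such as site or AS, instead of by raw IP and port, is what keeps the number of distinct series manageable for a downstream store.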
Phil Gervasi: All right, well, then how about this? I know that in yesteryear, when I was in network operations doing day-to-day network maintenance, building, troubleshooting, all that stuff, a lot of the time we were really just concerned with loss, latency, and jitter, probably because those were the only tools that we had. And when I say tools, it's probably singular. We just had one visibility tool, and whatever we had, we had. What would you say to those folks that claim that that's still enough today? Just looking at loss, latency, and jitter.
Avi Friedman: I mean, no. I'm going to have a link that has some packet loss; what was affected? Well, if I don't know what applications are going over the link, if I'm looking at applications by port and protocol, but it's all unified and dynamically provisioned and you don't have any concept of that, then no. I mean, whether you're a forward-thinking, SLO-waving, monitor, observe everything, excuse me, or whether you're a mean-time-to-innocence after the fact, or everyone's really both, you do your best. But observability is the platonic ideal, and we live in the shadow worlds of trying to understand what we can get and doing the best that we can. So no. I mean, I have a firm belief, this has always been something that I've believed, that you need to have the data and you need to add on things that help you, so when you are puckering and waking up at night and rubbing the crust from your eyes and trying to figure out what the hell's going on, you don't have to remember IP addresses, and you have as much data presented to you and ready as possible.
Phil Gervasi: And so enrichment really is adding more, sometimes qualitative data, sometimes quantitative data, sometimes very subjective data, to tie those other data points from traditional network telemetry together, adding more business context, like you said, making things relevant. And ultimately, I think that that kind of goes in line with that whole thing from Short Circuit where Johnny 5 says more data, or what does he say? More input, right?
Avi Friedman: Need input, yeah.
Phil Gervasi: But this also, to me at least, it reminds me of what we're doing with insights. We are trying to find some sort of meaning in the data, something that is meaningful to a network engineer, a human being, and then saying, Hey, we see these things going on in your network devices and because we are also enriching that data with these other elements of telemetry, we're able to imply that this is causing this and this thing is going on, you should pay attention. Is that sort of what insights is all about?
Avi Friedman: So, English is a wonderful language. I have a button somewhere that says it doesn't just borrow words. It follows other languages down the back alleys and beats them up and mugs them for their words and meanings. So insight can mean many things. Having the right data on a map or a pre-setup dashboard, that can give you insight. But one of the ways we talk about insights at Kentik is thinking about "let me show you something you didn't know," "let me show you something to look at." We now have KMI Insights, for Kentik Market Intelligence, so we can show someone what's going on with BGP routes in a given area. So our goal has been to target things that we think people will know what the hell to do with. So pretty early on, I mean machine learning, well no, statistics has been around for some time. We looked at, if you try to do generalized correlation across all the telemetry data that you have, you get a whole bunch of things that are correlated, but that no network engineer knows what to do about. So our approach has always been, let's come from understanding the semantics of, oh, if I tell you that a whole bunch of traffic that you didn't used to pay for, you're now paying for, that might be interesting, and you might know what to do about it, because then I can break down the interfaces and I know which ones you pay how much for. And so actually, for some networks, you might want to wake someone up if they're about to bust their 95th percentile. That's an example of correlation that you might want to do baseline factor, maybe not seasonality on, but it depends what your contracts say. If I just correlate high traffic with different ports and things like that, again, you're going to see a lot of things that are unusual. Do they matter? Who knows? But if it's likely to fill a link, we're projecting something is going on that's different, that's going to fill a link.
So I need to know the history, and if I'm focused on things that are generally garbage traffic, then again, you will probably know what to do about that. Or you might look at it and say, oh, actually we just didn't tell anybody, but this is a new application we have. Which happens if you're a hosting company with a lot of gaming customers. I've seen gaming use... God help us, ICMP as a production protocol to get around certain limitations, which, I don't know what they were thinking, but whatever. Or certainly UDP. A lot of DDoS detection, for example, fires against UDP. A lot of gaming uses UDP. So we have a whole bunch of insights that we don't show people because we think they're not that insightful, which is, when traffic to this AS goes up, traffic to this AS goes down, but who cares? What are you going to do? If it isn't meaningful to you, what are you going to do about it? So that's how we think about it, is try to show the best things that ideally someone will wake up and know, and 80% of the time care and know what to do about 90 to 100% of the time.
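The "about to bust their 95th percentile" check from this answer can be sketched as follows. It uses one common burstable-billing convention (sort the interval samples, discard the top 5%, bill on the highest remaining sample); actual contracts vary, and the function names, the `warn_ratio` threshold, and the sample data are all assumptions made for illustration.

```python
def percentile_95(samples_mbps):
    """Highest sample left after discarding the top 5% of the sorted samples."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    # Integer ceiling of 0.95 * n, converted to a zero-based index.
    index = (95 * len(ordered) + 99) // 100 - 1
    return ordered[index]


def near_commit(samples_mbps, commit_mbps, warn_ratio=0.9):
    """True when the billable rate is within warn_ratio of the committed rate."""
    return percentile_95(samples_mbps) >= warn_ratio * commit_mbps


# 100 samples (for example 5-minute averages); bursts in only 4 of them are free.
short_burst = [100] * 96 + [950] * 4
# Bursts in 10 of 100 samples push the billable rate up.
long_burst = [100] * 90 + [950] * 10

print(percentile_95(short_burst), percentile_95(long_burst))
print(near_commit(long_burst, commit_mbps=1000))
```

The commit rate is itself enrichment: it lives in a contract, not in any flow or SNMP stream, and the alert only becomes actionable once it is joined with the traffic samples.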
Phil Gervasi: Yeah, it is a matter of something being wrong and then something just being off or anomalous or weird. Having a hundred-gig link, or today 400... I actually haven't configured any 400-gig links because I've been out of engineering.
Avi Friedman: I've only done a hundred, and it was anticlimactic. It was like, wait, it's up, that was it?
Phil Gervasi: Yeah, anticlimactic is right. But, you know, you've got a couple hundred-gig links in your data center, maybe they're redundant links and you have them in a very active-standby situation. So you're not using a link, and it typically has a couple megs going on it, and then it goes up to a hundred megs. You still have a hundred-gig link. Statistically, it dramatically increased and you're going to fire off all these alerts. Nobody cares. So that's kind of the difference there. We have something to report on. And therein lies the problem with alert fatigue and all of that. So the insight is adding that context of, this is actually a problem, and this is why. It's got to be hard though to start correlating data that's just so different, especially if you're talking about, here's geolocation and here's interface statistics, and all these things are just so different, right?
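Phil's standby-link example can be written down as a small alert gate. This is a sketch under assumed thresholds, not a description of any product's alerting logic: the `should_alert` function, `min_ratio`, and `min_utilization` are hypothetical, and the link capacity is the piece of context that has to be added to the raw counters.

```python
def should_alert(baseline_bps, current_bps, capacity_bps,
                 min_ratio=5.0, min_utilization=0.5):
    """Alert only when traffic both jumped sharply and actually threatens the link."""
    grew = baseline_bps > 0 and current_bps / baseline_bps >= min_ratio
    busy = current_bps / capacity_bps >= min_utilization
    return grew and busy


HUNDRED_GIG = 100e9

# Standby link: 2 Mbps -> 100 Mbps is a 50x jump but only 0.1% of capacity.
standby = should_alert(baseline_bps=2e6, current_bps=100e6, capacity_bps=HUNDRED_GIG)

# Active link: 10 Gbps -> 60 Gbps is a 6x jump and 60% of capacity.
active = should_alert(baseline_bps=10e9, current_bps=60e9, capacity_bps=HUNDRED_GIG)

print(standby, active)
```

A purely statistical detector sees the first case as the bigger anomaly; the capacity context is what marks it as noise.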
Avi Friedman: We talked about this a little bit. At the scale of modern telemetry, we have customers that send us tens of millions of events per second. You really have to do that correlation, some of it, the enrichment, on ingest. There's actually a blog we did about enrichment as correlation, because if you try to join after the fact, let's take the following. How often does BGP change, or I'd say, how much of the BGP routing table is different day to day? I think it's something like 1%, but certainly, when there's flaps or things down, it's changing every second. So if I take a one-week, one-month, or current BGP snapshot, but some prefixes have moved, and then use that to analyze, I'm going to be wrong about some percentage of the traffic. Same thing with IP geo, which is a horribly craptastic thing to do anyway, because it's always inaccurate. To some extent, that's more than 1%. But over time, things certainly move around by geography because they move around in BGP. So if you use today's BGP or geo-data to look at telemetry from a month ago, it's going to be wrong. And if you think about the data challenge, which I'm happy to whiteboard for anyone who wants some time, of a time-versioned version of BGP routing tables and all that, and then querying trillions of records and correlating them sort of streaming after the fact, it's really hard. With network data, and with even just operational data and application data, there's some things you just really have to do. Think of it as streaming joins, ingest joins, do it in real time. So you're effectively putting some of it together. Now, you still need to learn based on it. You still need to baseline it, you still need to do whatever your statistical methods are to say what's normal and what isn't, but some of those things you just have to put on when you get it, or it's too late, unless you're willing to busy 4,000 machines and wait a day for your answer.
That's also the Hadoop methodology: oh, well, MapReduce, it'll be fine.
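The stale-snapshot problem described above can be made concrete with a toy time-versioned lookup table. This is only a sketch of the idea, with a hypothetical `VersionedTable` class and made-up timestamps; it keeps one key's history in memory and says nothing about how to do this across full routing tables and trillions of records, which is the hard part Avi is pointing at.

```python
import bisect


class VersionedTable:
    """Keeps every value a key has had, with the time each value took effect."""

    def __init__(self):
        self._times = {}   # key -> sorted list of effective timestamps
        self._values = {}  # key -> values aligned with _times

    def set(self, key, timestamp, value):
        times = self._times.setdefault(key, [])
        values = self._values.setdefault(key, [])
        i = bisect.bisect_right(times, timestamp)
        times.insert(i, timestamp)
        values.insert(i, value)

    def get_as_of(self, key, timestamp):
        """Return the value in effect at timestamp, or None if none existed yet."""
        times = self._times.get(key, [])
        i = bisect.bisect_right(times, timestamp)
        return self._values[key][i - 1] if i > 0 else None


# Hypothetical history: the prefix moved from AS 64500 to AS 64501 at t=2000.
origin_as = VersionedTable()
origin_as.set("198.51.100.0/24", 1000, 64500)
origin_as.set("198.51.100.0/24", 2000, 64501)

flow_time = 1500  # a flow observed before the prefix moved
as_at_flow_time = origin_as.get_as_of("198.51.100.0/24", flow_time)
as_from_today = origin_as.get_as_of("198.51.100.0/24", 2500)
print(as_at_flow_time, as_from_today)
```

Joining the old flow against today's table attributes it to the wrong AS. Enriching at ingest sidesteps the need for a versioned table at query time, because each record already carries the value that was current when it arrived.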
Phil Gervasi: So we started our conversation today talking about enrichment, really focused on that and how enrichment is all about adding that additional data to give our network telemetry, our flow and SNMP information and eBPF information, all that stuff, additional context, whether that's a business context, whether it's network-adjacent services like DNS and routing tables, that helps us understand our network telemetry better. And then we got into insights, which honestly I didn't expect, but it does make sense to me. I mean, ultimately that's what we're doing with enrichment: we're adding that additional information that may not be necessarily traditional network telemetry to help us understand what's going on better, to have more insight into application delivery. And Avi, I have to assume that you'd agree with me that the network, as much as we love it, is really nothing more than the mechanism, the substrate that we use to deliver services, normally in the form of applications, down to end users, human beings. So, Avi, it's been a pleasure having you on to talk about this again today. Many more questions. We could certainly go on and on, and I'd love to talk about correlation more, and of course the whole ML conversation's intriguing to me. But for now, how can folks reach you online, Avi, if they wanted to ask a question or make a comment?
Avi Friedman: Sure. So kentik.com, K-E-N-T-I-K dot com. I'm avi@kentik.com. I'm Avi Friedman on Twitter and LinkedIn and various other things, and happy to talk to people. It's not hard to get me started nerding about storage, networking, distributed systems, et cetera.
Phil Gervasi: Great. And you can find me online, still on Twitter, network_Phil. You can search my name in LinkedIn. I am on other various social media as well these days. And also, if you have an idea for an episode of Telemetry Now, please reach out to us at telemetrynow@kentik.com, or if you'd like to be a guest, we'd love to hear from you. So until next time, thanks for listening, and bye-bye.
Do you dread forgetting to use the “add” command on a trunk port? Do you grit your teeth when the coffee maker isn't working, and everyone says, “It’s the network’s fault?” Do you like to blame DNS for everything because you know deep down, in the bottom of your heart, it probably is DNS?
Well, you're in the right place! Telemetry Now is the podcast for you!
Tune in and let the packets wash over you as host Phil Gervasi and his expert guests talk networking, network engineering and related careers, emerging technologies, and more.