Telemetry Now  |  Season 1 - Episode 14  |  May 16, 2023

Enriching Your Network Telemetry for Real-World Insight



In this episode, Avi Freedman joins us again to talk about "enrichment", or in other words, adding additional context to our more traditional network telemetry so that we can truly understand what's happening in our network and with application delivery.


Today, most of us use applications that actually don't live on the computer or the phone or the tablet right in front of us.

Many, if not most, of the applications that we use are actually delivered over the network, and they touch many, many network devices and services and network-adjacent services, some of which we own and manage and some that we don't. And that's to go from wherever that application lives, whether that's your own data center, the public cloud, or a containerized architecture, all the way down to the screen in front of you. So that means being able to analyze application performance over the network requires a lot more than just collecting flow or SNMP information.

It means more than collecting flow from a WAN router and, let's say, SNMP info from your MDF switches to really understand why an application is performing the way it is over the network. We need more. It means collecting a tremendous amount of telemetry from all sorts of different devices, but it also means adding to that body of data the additional, more qualitative business context and network-adjacent data to help us really understand why an application is performing the way it is.

So with me today is a returning guest, Avi Freedman, founder and CEO at Kentik, to talk about enrichment, telemetry enrichment. Or in other words, adding that additional contextual data to our network telemetry to enable modern network analytics.

My name is Phil Gervasi, and this is Telemetry Now.

Avi. Welcome. It's great to have you on again. Time to talk about enrichment, network telemetry enrichment.

A lot of which I learned through reading your ebook, Network Observability for Dummies. So again, welcome.

Thank you very much. Thanks for having me, and, thanks for, reading the book.


So I'd like to start off by defining enrichment. Now, I don't remember what page exactly, but I do remember some bullet points where you talk about all the variety of data that we collect as far as network telemetry. And so my first question for you is, why do we need something else? I mean, what is enrichment, that it is different from flow and SNMP and eBPF and all these other forms of data? What is it? And then, of course, why do we need it to do this advanced form of network analytics?

So in traffic data, let's just pick on NetFlow, VPC flow logs, eBPF, things like that, things that look at traffic.

Sometimes the router can look up in the BGP routing table what the origin AS is, for example, and put that in the NetFlow record, which is, you know, a template. Sometimes that works, sometimes it doesn't, but it doesn't, for example, do the entire BGP AS path, which would let you see, you know, sort of more of the vector of how the traffic is flowing.

But if you have a BGP feed and you have a traffic record, you can do that lookup yourself, which is what Kentik does, and put in BGP attributes like community or AS path, which lets the service provider then say, oh, my customer Foo is this community, or, you know, this other attribute that isn't present in any one single telemetry stream. But by combining them, or adding metadata from your IPAM, let's say from NetBox, you know, let's say you have a telemetry stream, effectively, of what IP addresses are on what servers, and what servers are in what locations and what data centers or clouds, that becomes its own telemetry stream.

You have metadata that you can combine with the flow traffic data to say, oh, something from this data center went to that data center, or from this application to that application. Whereas if you just look at the underlying telemetry, it just says that IP and port to that IP and port in so many bytes and packets. So to really ask questions, having the raw data, especially at the current volumes, joining all that after the fact is really hard. It's very important when you get the data to add as much color as you can.
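To make the kind of lookup Avi describes concrete, here's a minimal Python sketch of enriching a raw flow record with BGP attributes and IPAM-style inventory at ingest. The tables, prefixes, and field names are entirely made up for illustration; they're not Kentik's actual schema:

```python
# Hypothetical enrichment of a raw flow record with BGP and IPAM context.
# All tables, prefixes, and field names are illustrative, not Kentik's schema.
import ipaddress

# A toy BGP table keyed by prefix: AS path and community attributes.
BGP_TABLE = {
    "203.0.113.0/24": {"as_path": [64500, 64510], "community": "64500:100"},
    "198.51.100.0/24": {"as_path": [64500, 64520], "community": "64500:200"},
}

# A toy IPAM/NetBox-style inventory: which server, site, and app an IP maps to.
IPAM = {
    "203.0.113.10": {"server": "web-01", "site": "dc-east", "app": "checkout"},
    "198.51.100.7": {"server": "db-02", "site": "dc-west", "app": "orders"},
}

def longest_prefix_match(ip, table):
    """Return the attributes of the most specific prefix covering ip, or None."""
    addr = ipaddress.ip_address(ip)
    best, best_len = None, -1
    for prefix, attrs in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best, best_len = attrs, net.prefixlen
    return best

def enrich(flow):
    """Color a raw flow record (src/dst IP, bytes) with routing and inventory context."""
    out = dict(flow)
    bgp = longest_prefix_match(flow["dst_ip"], BGP_TABLE)
    if bgp:
        out["dst_as_path"] = bgp["as_path"]
        out["dst_community"] = bgp["community"]
    for side in ("src", "dst"):
        meta = IPAM.get(flow[f"{side}_ip"])
        if meta:
            out[f"{side}_site"] = meta["site"]
            out[f"{side}_app"] = meta["app"]
    return out

flow = {"src_ip": "203.0.113.10", "dst_ip": "198.51.100.7", "bytes": 4200}
print(enrich(flow))
```

The point is that the join happens per record as it arrives, so the enriched fields ride along with the flow from then on, and you can ask "which data center talked to which application" without a giant join after the fact.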

In fact, another word for enriching is coloring, you know, coloring telemetry with additional... So, Avi, really quick, is that what you mean by context?

I mean, you're using both of those terms here. Yeah. Enrichment and context. Yes.

Thank you for being pedantic. I try to be a member of homo pedantes, but I am not always capable of fully going there.

So the context is, for example, this IP address is on that server. That's context about the telemetry that you're getting. Or that server has that role, or this application lives on that virtual server, or this matches a threat feed, or that is on a CIDR.

Or that is a crypto mining, you know, network.

Those are context pieces of information that you can use to enrich, to add to, other telemetry that you get.

Well, Avi, I mean, doesn't that kind of imply that all of the other telemetry taken as a whole isn't enough then? Otherwise, we wouldn't be doing this enrichment.

What is it that we're getting from enrichment that we don't get from, all of the rest of the telemetry that we have been collecting?

It's not enough by itself.

Right. It's not enough by itself to ask some kinds of meaningful questions that people wanna ask. If you don't understand what the cost of a link is, not the OSPF cost, but the dollar cost of a link, then how can you ask a question of your telemetry that says, how expensive is the traffic I get from this customer? In fact, if you don't enrich your data with what a customer is, that's context, and how much costs are, you can't ask certain kinds of questions that are helpful to ask. Or, what's the performance of this application in this data center? You know, from seeing traffic, if you have a performance feed but you can't map IP addresses to applications, then there's just questions you can't ask unless you get that context and add it on.
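As a toy illustration of the question Avi poses, how expensive is the traffic I get from this customer, here's a sketch that only works because customer and cost context has been joined onto the raw flows. The IPs, link names, and prices are all hypothetical:

```python
# Hypothetical example of a question that needs enrichment: "how expensive
# is the traffic from this customer?" Raw flows only carry IPs and bytes.
FLOWS = [
    {"src_ip": "203.0.113.10", "link": "transit-1", "bytes": 9_000_000_000},
    {"src_ip": "203.0.113.99", "link": "transit-2", "bytes": 2_000_000_000},
]

# Context that must be joined on: which IPs belong to which customer,
# and what each link costs per GB delivered (made-up numbers).
CUSTOMER_OF = {"203.0.113.10": "foo-corp", "203.0.113.99": "bar-inc"}
DOLLARS_PER_GB = {"transit-1": 0.02, "transit-2": 0.05}

def cost_by_customer(flows):
    """Sum per-customer dollar cost by joining flows against both context tables."""
    totals = {}
    for f in flows:
        customer = CUSTOMER_OF.get(f["src_ip"], "unknown")
        cost = (f["bytes"] / 1e9) * DOLLARS_PER_GB[f["link"]]
        totals[customer] = totals.get(customer, 0.0) + cost
    return totals

print(cost_by_customer(FLOWS))
```

Strip out either context table and the question simply can't be asked; the flows alone only say bytes went from one IP to another.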

Yeah. Okay. That makes sense. I do have to say that when you said the word cost, the first thing that came to my mind was, like, OSPF cost.

Yeah. Sorry. I had to clarify. Yes. We are talking about networking.

So I'm still very much in that frame of thinking as far as, being a network engineer.

But, just reading this back to you, it sounds like, enrichment or network telemetry enrichment is really adding that additional information to tie the various types of data that we collect together.

So tying an interface statistic with an application flow with a time stamp from from some log somewhere tying all of that together to make it, make more sense.

Or if I'm trying to do SLOs, and I wanna understand what are the important applications? I can have a whole bunch of performance data coming in. But if I don't identify what's worth waking someone up about, because it matters, then it's hard. Then that data becomes, you know, it's the machine that goes ping. It'll just be ignored because it's too much.

There's always problems. Is the network down? Yes. The network is down. To somewhere, somewhere the network is down.

Does it matter? That's the interesting question.

Right. Does it even matter? I mean, obviously, a hard down of the entire network is gonna affect application performance and an end user's digital experience. Right?

But there are things that occur on a network. So we're gonna collect those metrics. We're gonna collect that telemetry of some event that occurred, something that might not look so good, but actually doesn't affect application performance in any meaningful way. And so do we care?

Now from a technical perspective, what is the mechanism that we're using to ingest all of this information and then ultimately also enrich it with that additional metadata, that business context, whatever it is that we wanna add to that network telemetry? I'm familiar with ktranslate, and that we use that as one of our mechanisms to do that.

So ktranslate is Kentik opening up, as open source under Kentik Labs, a lot of the ingest technology that Kentik uses to do ingest and normalization of data and some kinds of enrichment. There are some kinds of enrichment which are too large scale to be in ktranslate, but ktranslate can have threat feeds and BGP and can do a lot of those kinds of enrichment. ktranslate is more primarily a telemetry bus that can do, you know, the easy to medium levels of enrichment, of adding context.

But there are some kinds of context that even for the folks in the telemetry bus space, like what ktranslate does, what Cribl does, are too large scale. For example, you know, we have a dozen-plus global tier one backbones sending us data, and we take all the BGP feeds, one from each, or at each PoP, and we do multilevel lookups, and we have all those BGP feeds at every point where we get their traffic data or metrics or whatever.

That would be, that's clustered. ktranslate doesn't do that autonomously today. So it can do simple, you know, I have one BGP feed and one flow or SNMP feed, and do some basic correlation, but it's a little bit more of a bus, like replicate, filter, roll up, and add a little bit of context.

So it seems like a lot of this enrichment information isn't necessarily purely networking focused. We're taking in information from other parts of the network, other parts of our infrastructure, stuff that's just subjective, like the application tags and IDs that you mentioned, routing tables, which suggests to me that we're interested in more than just the pure network function of application delivery, and we're interested in helping those that are not necessarily in a purely network engineering role. So perhaps sysadmins, the help desk, the NOC, those that are more concerned with the application itself. Am I right?

Absolutely. Yeah. So ktranslate today can take OpenTelemetry. It can take syslog. It can't yet take data directly from some of the proprietary formats of the other observability and application platforms, but it can send to them, and you could imagine it will soon, you know, be able to do that. So it's like a Swiss army knife to build, to take data in, take data out, filter, do roll-ups, change the shape of it, send it in a different binary format, and one of the things it can do is enrich telemetry.

And the goal is, you know, it started, ktranslate started, with sending data from Kentik into other data lakes and things like that. But when we started playing with the other observability players, they couldn't really take the full cardinality of network data. Network data will just explode time series databases. And then we adapted it to take flow in and things like that.

So...

Alright. Well, then how about this? I know that yesterday, when I was in network operations doing day-to-day network maintenance, building, troubleshooting, all that stuff, a lot of the time we were really just concerned with loss, latency, and jitter, probably because those are the only tools that we had. And when I say tools, probably singular. We just had one visibility tool, and whatever we had, we had.

What would you say to those folks that claim that that's still enough today, just looking at loss, latency, and jitter?

Like, like, I have a link... I mean, no.

I'm gonna have a link that has some packet loss. What was affected?

Well, if I don't know what applications are going over the link? If I'm looking at applications by port and protocol, but it's all ephemeral and dynamically provisioned and you don't have any concept of that, then no. When you're trying to do, I mean, whether you're a forward-thinking, SLO-waving, monitor-and-observe-everything type...

Excuse me, or whether you're a mean-time-to-innocence, you know, after-the-fact person, or, everyone's really both, right? You do your best, but observability is the platonic ideal, and we live in the shadow worlds of trying to understand what we can get and doing the best that we can. Then no. I mean, I have a firm belief, this has always been something that I've believed, that you need to have the data, and you need to add on things that help you. So when you are puckering and waking up at night and rubbing the crust from your eyes and trying to figure out what the hell's going on...

You don't have to remember IP addresses, and you have as much data presented to you, you know, and ready, as possible.

Right. And so enrichment really is adding more sometimes qualitative data, sometimes quantitative data, sometimes very subjective data to tie those other data points from traditional network telemetry together, adding more business context like you said, making things relevant.

And ultimately, I think that kinda goes in line with that whole thing from Short Circuit where Johnny Five says, more data, or what does he say? More input. Right?

Indeed. It could.


Yeah. But this also, to me at least, reminds me of what we're doing with Insights. We are trying to find some sort of meaning in the data, something that is meaningful to a network engineer, a human being, and then saying, hey, we see these things going on in your network devices. And because we are also enriching that data with these other elements of telemetry, we're able to sort of imply that this is causing this, and this thing is going on, you should pay attention. Is that sort of what Insights is all about?

So English is a wonderful language.

I have a button somewhere that says it doesn't just borrow words. It follows other languages down the back alleys and beats them up and mugs them for, you know, their words and meanings.

So, insight can mean many things.

You know, having the right data on a map or a preset-up dashboard, that can give you insight. But one of the ways we talk about insights is thinking about, let me show you something you didn't know. Let me show you something to look at. We now have KMI Insights, for Kentik Market Intelligence. So we can show someone what's going on with BGP routes in a given area.

So our goal has been to target things that we think people will know what the hell to do with.

So pretty early on, I mean, machine learning, statistics, has been around for some time. You know, we looked at, if you try to do generalized correlation across all the telemetry data that you have, you get a whole bunch of things that are correlated but that no network engineer knows what to do about.

So our approach has always been, let's come from understanding the semantics of, oh, if I tell you that a whole bunch of traffic that you didn't used to pay for, you're now paying for, that might be interesting, and you might know what to do about it, because then I can break down the interfaces, and I know which ones you pay how much for.

And so, actually, for some networks, you might wanna wake someone up if they're about to bust their ninety-fifth percentile. Right?

That's an example of correlation that you might wanna do, baseline, factor in, you know, maybe or maybe not seasonality, but it depends what your contracts say. Right?
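The ninety-fifth percentile check mentioned here can be sketched simply: take the month's five-minute throughput samples, discard the top five percent of bursts, and alert as the billable value approaches the commit. The commit level, headroom threshold, and sample numbers below are made up, not any real contract:

```python
# Sketch of a 95th-percentile burstable-billing check.
# Commit, headroom, and samples are made-up numbers, not a real contract.
import math

def billable_95th(samples_mbps):
    """Standard burstable billing: sort the 5-minute throughput samples and
    take the value at the 95th percentile (the top 5% of bursts are free)."""
    ordered = sorted(samples_mbps)
    idx = math.ceil(0.95 * len(ordered)) - 1
    return ordered[idx]

def should_wake_someone(samples_mbps, commit_mbps, headroom=0.9):
    """Alert when the running 95th percentile crosses 90% of the commit."""
    return billable_95th(samples_mbps) >= headroom * commit_mbps

# 100 quiet samples plus a handful of bursts: the bursts fall inside
# the free top 5%, so the billable value stays at the quiet level.
samples = [100.0] * 100 + [950.0] * 4
print(billable_95th(samples))
print(should_wake_someone(samples, commit_mbps=500))
```

Whether to also baseline or de-seasonalize before alerting is exactly the "depends what your contracts say" judgment call from the conversation.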

Or, you know, I've got a ton. If I just correlate, you know, high traffic with different ports and things like that, again, you're gonna see a lot of things that are unusual.

Do they matter? Who knows? But if it's likely to, you know, fill a link, you know, we're projecting something is going on that's different, that's gonna fill a link, so I need to know the history, and I'm focused on things that are generally garbage traffic, then again, you will probably know what to do about that. Or you might look at it and say, oh, actually, we just didn't tell anybody, but this is a new application we have. Which happens if you're a hosting company with a lot of gaming customers. I've seen gaming use, God help us, ICMP as a production protocol to get around, you know, certain limitations, which I don't know what they were thinking, but whatever.

You know, or certainly UDP. A lot of DDoS detection, for example, fires against UDP, and a lot of gaming uses UDP. So we have a whole bunch of insights that we don't show people, because we think they're not that insightful, which is, when traffic to this AS goes up, traffic to this AS goes down. But who cares? What are you gonna do? If it isn't meaningful to you, what are you gonna do about it?

So that's how we think about it. Try to show the best things that, ideally, someone will wake up and know, and eighty percent of the time care, and know what to do about ninety to a hundred percent of the time. Yeah.

It is a matter of, you know, something being wrong and then something just being, like, off or anomalous or weird. Like, you know, having a hundred-gig link, or today, four hundred.

I actually haven't configured any four-hundred-gig links, because I'm not in engineering.

I've only done one hundred, and it was anticlimactic. It was like, wait, it's up. What? That was it?

Yeah. Anticlimactic is right. You know, you got a couple of hundred-gig links in your data center. Maybe, you know, they're redundant links.

And you have them in an active-standby situation. So you're not using a link, and it typically has a couple megs going over it, and then it goes up to a hundred megs. You still have a hundred-gig link. You know, statistically it dramatically increased, and you're gonna fire off all these alerts.

Nobody cares. Like, nobody cares. So that's kind of the difference there. You know, yeah, we have something to report on, and then, you know, therein lies the problem with alert fatigue and all of that.

So, you know, the insight is adding that context of, this is actually a problem, and this is why. It's gotta be hard, though, to start correlating data that's just so different. Like, especially if you're talking about, you know, geolocation here and interface statistics there, and all these things are just so different. Right?

Well, at the scale, we talked about this a little bit.

At the scale of modern telemetry, we have customers that send us you know, tens of millions of events per second.

You really have to do that correlation.

Some of it, the enrichment, on ingest. There's actually a blog we did about enrichment as correlation.

Because if you try to join after the fact, let let's take the following.

How often does BGP change? Like, or I'd say, how much of the BGP routing table is different day to day?

I think it's something like one percent. But certainly when there's flaps or things down, it's changing every second. So if I take a one-week, one-month, or current BGP snapshot, but some prefixes have moved, and then use that to analyze, I'm gonna be wrong about some percentage of the traffic. Same thing with IP geo, which is a horribly craptastic thing to do anyway, because it's always inaccurate to some extent that's more than, you know, one percent.

But over time, things certainly move around by geography because they move around in BGP. So if you use today's BGP or geo data to look at telemetry from a month ago, it's gonna be wrong. And if you think about the data challenge, which I'm happy to whiteboard for anyone who wants some time, of a time-versioned version of BGP routing tables and all that, and then querying trillions of records and correlating them after the fact, it's really hard.

With network data, and with even just operational data and application data, there's some things you just really have to do, think of it as streaming joins, ingest joins, in real time. So you're effectively putting some of it together. Now, you still need to learn based on it. You still need to baseline it.

You still need to do whatever your statistical methods are to say what's normal and what isn't. But some of those things you just have to put on when you get the data, or it's too late.

Unless you're willing to wait, like, you know, unless you're willing to busy four thousand machines and wait, you know, a day for your answer, you know, that's the old Hadoop methodology.

Is, oh, we'll map reduce it. It'll be fine.
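A minimal way to picture the streaming join versus the after-the-fact join: keep the routing state current in memory and stamp each record with context the moment it arrives, so each record carries what was true when it was seen. Everything here is an illustrative sketch, not Kentik's implementation:

```python
# Illustrative contrast between ingest-time (streaming) enrichment and
# joining after the fact against whatever snapshot happens to be current.
import time

class StreamingEnricher:
    """Holds live lookup state and stamps each record on arrival."""
    def __init__(self):
        self.prefix_origin = {}  # prefix -> origin AS, updated by a BGP feed

    def on_bgp_update(self, prefix, origin_as):
        # BGP feed keeps the in-memory routing state current.
        self.prefix_origin[prefix] = origin_as

    def on_flow(self, flow):
        # The join happens here, with the routing state as of *this* moment.
        prefix = flow["dst_ip"].rsplit(".", 1)[0] + ".0/24"  # toy /24 match
        return {**flow,
                "dst_origin_as": self.prefix_origin.get(prefix),
                "enriched_at": time.time()}

enricher = StreamingEnricher()
enricher.on_bgp_update("198.51.100.0/24", 64520)
a = enricher.on_flow({"dst_ip": "198.51.100.7", "bytes": 512})

# The prefix later moves to a different origin AS (a flap, a re-homing).
enricher.on_bgp_update("198.51.100.0/24", 64999)
b = enricher.on_flow({"dst_ip": "198.51.100.7", "bytes": 512})

# Each record carries the context that was true when it was seen; an
# after-the-fact join against today's table would mislabel record "a".
print(a["dst_origin_as"], b["dst_origin_as"])
```

The design choice is exactly the one discussed above: a batch join against a single snapshot is cheap but wrong for any record whose context has since changed, while the streaming join pays a small per-record cost to keep every record correct.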

So we started our conversation today talking about enrichment, really focused on that and how enrichment is all about adding that additional data to give our network telemetry, our flow and SNMP information and eBPF information, all that stuff, additional context. Whether that's business context, or whether it's network-adjacent services like DNS and routing tables, it helps us understand our network telemetry better. And then we got into insights, which honestly I didn't expect.

But it does make sense to me. I mean, ultimately, that's what we're doing with enrichment is we're adding that additional information that may not be necessarily traditional telemetry to help us understand what's going on better, to have more insight into application delivery. And, Avi, I have to assume that you'd agree with me that the network much as we love it is really nothing more than the mechanism, the substrate that we use to deliver services normally in the form of applications down to end users, human beings.

So, Avi, it's been a pleasure having you on to talk about this again today. Many more questions. We could certainly go on and on, and I'd love to talk about correlation more, and of course the whole ML conversation is intriguing to me. But for now, how can folks reach you online, Avi, if they wanted to ask a question or make a comment?

Sure. So, kentik dot com, k e n t i k dot com, avi at kentik dot com, Avi Freedman on Twitter and LinkedIn and various other things.

And happy to talk to people. It's not hard to get me started, nerding about, storage, networking, distributed systems, etcetera.

Great. And you can find me online, still on Twitter, network underscore phil. You can search my name on LinkedIn. I am on various other social media as well these days.

And also, if you have an idea for an episode of Telemetry Now, please reach out to us at telemetrynow at kentik dot com. Or if you'd like to be a guest, we'd love to hear from you. So until next time, thanks for listening. Bye-bye.

About Telemetry Now

Do you dread forgetting to use the “add” command on a trunk port? Do you grit your teeth when the coffee maker isn't working, and everyone says, “It’s the network’s fault?” Do you like to blame DNS for everything because you know deep down, in the bottom of your heart, it probably is DNS? Well, you're in the right place! Telemetry Now is the podcast for you! Tune in and let the packets wash over you as host Phil Gervasi and his expert guests talk networking, network engineering and related careers, emerging technologies, and more.