My name is Philip Gervasi. I'm the head of tech evangelism at Kentik. I've been with Kentik for about four, five, six months now, so not quite as long as Justin. And what I'm gonna be talking to you about for the next thirty, thirty-five minutes or so is network observability. Now, you may be familiar with the term observability already, especially in the context of DevOps and if you're familiar with what SREs do, but what we're doing at Kentik is pretty much applying some of those methodologies and workflows to network telemetry. So it's network-centric observability, or network observability. That's where we get that term from. And you can use a different term if you want. I know that for some this is marketecture or buzzword bingo, so you can certainly call it advanced or next-gen monitoring. How about that? That's a good one. Now, what we believe at Kentik is that network observability is different, and I'm gonna take my watch off here, it's different than what I'm gonna call traditional visibility. It's the evolution of that. Network observability is the evolution of traditional network visibility. It's built on a foundation of traditional visibility. It's not like we don't care about that anymore. In fact, that's the core. And why are we doing this? It's twenty twenty-two, and it's pretty clear. I think all of you understand and know that most of the applications we use today, the mundane and the mission critical, are delivered over the network, and so there's a tremendous amount of application performance telemetry embedded in the network. Yeah, code-level stuff and looking at what's going on with PHP, that's all important for sure, but there's a tremendous amount of application performance telemetry in our network telemetry, a lot of which we're already ingesting. So this subtitle here, the ability to answer any question about your network, that's kind of a Kentik definition that our CEO and co-founder Avi Freedman came up with, and it's in his ebook, Network Observability for Dummies, which you can go and download for free. It's just a PDF ebook. But what does that mean, right? It certainly doesn't mean that you can literally ask your computer a question about your network. Now I'm being silly, obviously, a little bit of levity here, but this is the abstract concept that I'm gonna be speaking to you about for the next little while, defining what network observability really is. So it's gonna be kind of a high-level presentation, though I am gonna get into the weeds a little bit about how we do it, but I wanna spend some time defining it first. As a quick working definition that we can start with, network observability is the difference between more data and more answers. We love data, we want more data, right? Like Johnny Five says in Short Circuit, more input. But it's the difference between more data and more answers, and you gotta understand that we're not shunning legacy or traditional visibility. Data is the foundation of what we're doing with network observability, and so volume, having more diverse data sources, and making sure that data is as accurate as possible are central to doing network observability well. The old saying: garbage in, garbage out. And you can see, in a trite, short fashion here, that traditional visibility lets you see what's happening on your network. Very important. We still care about that.
But network observability then goes to the next step and helps you understand why a particular event occurred. Does that make sense? I understand that that's an abstract differentiation, and so therein lies the whole concern about marketecture and vaporware and what are you really talking about? But hopefully by the end of the presentation you'll understand. So before we get into how we're actually making network observability work, it's important to understand that it is not replacing traditional visibility. It's built on top of what we're doing with visibility, and that's the collection, the ingest, and the analysis of flows, SNMP, IPFIX, VPC flow logs, system logs, eBPF, what else? A lot of stuff, right? The results of synthetic tests, even metadata that Justin alluded to in the previous presentation. But all of that stuff ultimately stops at showing us what is happening on the network, and it populates all the charts and the graphs, and that's great. We want that, and it's critical for what we're doing with network observability. However, it's also critical that that data has volume, meaning we have a lot of it; that it's diverse, meaning we're not looking at only flows or only streaming telemetry; and that it's accurate, because, again, garbage in, garbage out is very, very important to having accurate results like predictions and correlations, which I'll be discussing later on. So one of the problems that we at Kentik have identified with just stopping at collecting more data and then showing it on really awesome charts, which again is important, is that having a lot of data and stopping there without doing any sort of advanced analysis lends itself to point solutions and disparate databases. You have a flow tool. You have a tool focused on SNMP. You have a security tool, completely standalone, focused on security. Maybe you have some kind of homegrown tool for eBPF, and maybe you're still trying to figure out how in the world you're gonna integrate your VPC flow logs into your NOC workflow, your network operations workflow. I think you're starting to see where I'm going with this, right? We're talking about a network operations problem. This lends itself to gaps and pockets of visibility. And then remember, we're talking about a human engineer having to look at all these tools and, in your mind, you're the computer, figure out how that data relates to each other and then make sense of it, the correlations happening in your brain. The data types themselves are gonna be drastically different as well. Think about an application tag. It's a tag. It's a random number. It's a label. And then over here you have billions of bits per second and millions of packets per second. And over here you have a scale, zero to one hundred; it's a percentage. These are all different things, so how do you make sense of that very quickly, right? Scale and format. Ultimately, again, going back to the human engineer, the root problem here is that the human activity of troubleshooting, forecasting, predicting, root cause analysis is very tedious, manual, and slow, so it's a network operations problem that we're looking at. The goal here is to augment you as a network engineer sitting at the Kentik UI trying to figure something out, an application performance problem running over the network, okay? And ultimately, I've been there, right? Maybe you have as well. Many of you, we know each other personally.
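To make that scale-and-format problem a bit more concrete, here's a minimal sketch, in Python, of what normalizing wildly different telemetry values onto one comparable scale can look like. The metric names and numbers are invented for illustration; this is not Kentik's actual pipeline.

```python
# Minimal sketch: bringing metrics with very different scales and formats
# onto a comparable footing before any correlation or clustering.
# Hypothetical values; not Kentik's actual pipeline.

samples = {
    "interface_bps": [2.1e9, 2.3e9, 9.8e9, 2.2e9],    # bits per second
    "cpu_percent":   [41.0, 44.0, 97.0, 43.0],         # a 0-100 scale
    "app_tag":       ["crm", "crm", "crm", "crm"],     # a label, not a number
}

def zscore(values):
    """Standardize a numeric series so different units become comparable."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5 or 1.0
    return [(v - mean) / std for v in values]

normalized = {
    name: zscore(vals)
    for name, vals in samples.items()
    if all(isinstance(v, (int, float)) for v in vals)   # tags stay categorical
}
print(normalized)
```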
I've been a network engineer for fifteen years, and I remember sitting, looking at the Webex screen or the Zoom screen on a team call with a Windows person, with the security person, with other engineers, trying to figure out why something isn't working, right? So we identify a clue and we chase that clue. And then it leads us to another clue and then another clue, and that's fine, right? Because ultimately, what happens? We find the solution. We're engineers. That's what we do. You could replace all of this with a team full of PhD data scientists, but even then you're still prone to human error and the time it takes for them to figure it out. And so the goal of network observability is to make that a programmatic, automated process as much as we can, and to the extent that we can, we're gonna give you useful, insightful intelligence with the Kentik Observability Cloud platform. Now, I chose those words very specifically. They have meaning, okay? Useful means, well, you know what useful means, right? Not useless. But let me give you an example. It has to have meaning to you as an engineer trying to figure out a problem. What if I showed you some correlation? You're like, who cares? I don't care if these things are correlated. What does it mean to me? What can I do to resolve this problem on my network? Let me give you an example. You have a four-hundred-gig interface in your data center, right? And it's chugging along day to day to day. You're kind of plodding along here at one meg per second, and then all of a sudden I see that it jumps to fifty megs per second. Who cares? Does that have any impact on user experience? Probably not. Does it have any impact on application performance? Probably not. So do I start firing off alerts? It is a statistically large increase. It's a weird thing. It's interesting. But we don't care about weird and interesting per se unless it does affect application performance, unless it affects the user experience, and then it becomes a bad thing. And then when I say insightful, imagine that one meg keeps growing week after week after week, right? It's one meg every day on a four-hundred-gig interface, and then it goes to one point two five megs for a week, and then one point four megs, and then one point seven five, and then it's two megs. It doubled over the course of a couple weeks. Again, who cares? Well, maybe you do, and that's where the insight comes in, because we do see a trend that you need to be aware of. Do we fire off all the fire trucks in your town and send them out on some mission-critical alert? Maybe not. Maybe it's a warning, a system-generated alert we send you saying that this trend is occurring and this might be the cause, which in our system we call factors, okay? I spent a lot of time on that first bullet because it's the crux of everything that we're doing. When I talk about statistical analysis and machine learning and some specific algorithms, that's not the point of what we're doing. We're not bolting on machine learning and then going around the community saying, we're an ML platform. No, no, no, no. We are a network observability platform with the express purpose of giving you useful, insightful information so we can augment an engineer and help you do root cause analysis faster, more efficiently, right?
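As a rough illustration of that "interesting versus insightful" distinction, here's a tiny sketch of the kind of check that separates a one-off spike from a sustained week-over-week trend. The numbers mirror the hypothetical one-meg example above; the threshold and logic are purely illustrative, not how Kentik actually computes its factors.

```python
# Minimal sketch: a brief spike on a 400G interface may not matter, but a
# steady week-over-week climb is a trend worth a low-severity warning.
# Numbers and thresholds are hypothetical.

weekly_avg_mbps = [1.0, 1.25, 1.4, 1.75, 2.0]   # weekly averages on the interface

def sustained_growth(series, min_weeks=3, growth_factor=1.5):
    """Flag a trend only if the metric keeps rising and ends well above where it started."""
    rising = all(b >= a for a, b in zip(series, series[1:]))
    grew_enough = series[-1] >= growth_factor * series[0]
    return len(series) >= min_weeks and rising and grew_enough

if sustained_growth(weekly_avg_mbps):
    print("warning: sustained upward trend, consider a capacity review")
```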
ML, statistical analysis, diversity of data, being able to query our databases faster, those are just tools in our tool belt, and we'll use whatever tools are necessary to get to that first bullet point. So we're gonna automate root cause analysis, and to the extent that we can, we're also gonna do it in the context of probability, because when we start applying ML algorithms, it is a matter of probability, a matter of certainty, right? Can you trust what this result is? That kind of presupposes that you have accurate data, doesn't it? We're also going to correlate diverse visibility information, but I want you to understand something. I'm gonna pause here and even put my little clicker down, because this is really heart to heart now. Correlation is not the end-all of network observability. We can correlate a ton of stuff in your network, and a lot of it is not useful, so we don't care and we're not gonna create alerts for it. And that's hard, because we're always working in this feedback loop at Kentik with our data scientists, our product managers, our product team, and with our customers themselves to figure out what's useful, what kind of insight we can generate, what's a weak correlation versus a strong correlation, because there's a difference, and what's a causal relationship. We're gonna infer visibility as well with this method. Imagine two ends of a conversation, a TCP conversation, right? A user and a server, whatever, and you have switches and routers on both sides. You collect a bunch of telemetry, and the application's running slow as it's being delivered. So you look at all that telemetry and you see, all right, there are no dropped packets between the core and the router. There's no jitter or latency. You look at that same thing on both sides. You look at how long it takes for the user to generate a request, no problem. You look at how long it takes the server to generate a response, no problem. But then you look at the round-trip time or the server response time overall, and there's this chunk in the middle that isn't the network we own, but something is going on there. So we'll start to be able to say there's a latency problem specifically, and it's happening maybe in this specific ASN, because now we're gonna start looking at the different hops in the public network. And we're gonna make predictions of potential issues and trends. Trends and baselines are different things; I'm gonna break that down a little bit for you. Seasonality, that's really important in what we're doing. How do we do that? You're gonna think I'm contradicting myself, because I said this is not just adding more data. Well, it is, because we want a more robust dataset that's highly accurate and diverse in its nature. So we are gonna start with adding more data. This is one of the pillars of what we do. Justin mentioned it in the initial presentation, the corporate intro. We're gonna do something called enrichment, telemetry enrichment, with metadata, to give our dataset, our raw data, our interface statistics, our flow data, all that stuff, more context, hopefully some sort of a business context that makes sense. Geolocation, security groups, application tags, DNS information, which is a big one. Threat feeds, that's actually really big for us. A lot of our customers really value our DDoS detection abilities, so this is part of that. I really like this meme a lot as well. I had a different one and then I swapped this one in last minute. I hope you like it too.
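Going back to that TCP conversation example for a second, here's a minimal sketch of the inference idea: if the client side and server side both look healthy but the overall round trip does not, the leftover latency points at the transit path in between. The millisecond values and the threshold here are made up purely for illustration.

```python
# Minimal sketch of inferred visibility: client request time and server
# response time both look fine, but the overall round trip does not, so the
# remaining latency likely sits in the transit path in between.
# Hypothetical milliseconds, not real measurements.

round_trip_ms  = 240.0   # measured end to end
client_side_ms = 12.0    # time for the user side to generate the request
server_side_ms = 18.0    # time for the server to generate the response

transit_ms = round_trip_ms - client_side_ms - server_side_ms

if transit_ms > 150.0:   # arbitrary threshold for the sketch
    print(f"~{transit_ms:.0f} ms unaccounted for: suspect the transit path / an upstream ASN")
```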
And this is another pillar: we make a distinction between passive and proactive telemetry. Justin mentioned synthetic testing; we call it synthetics. You're probably already familiar. This is not necessarily new in the industry, but the difference is that passive telemetry is stuff that already happened on your network. It might've happened just now, right? So you're never really in the present, that whole metaphysical thing. But in any case, we're talking about flows, streaming telemetry, what you see on your screen, SNMP, IPFIX, all that stuff. All right, technically it's not all generated from production traffic, because you could have an SFP go bad, the interface goes down, SNMP gives you an alert. Technically that's not production traffic, but generally speaking, you are gauging the health of your network and the performance of an application based on production traffic, users, stuff that already happened. So we balance that. We don't do one or the other, by the way; we combine these in our dataset. We use synthetic tests to check things like page load times, to check latency, loss, jitter, basic network-centric things like that. We do transaction tests where we can simulate an end user logging into something and doing stuff, and we'll capture each individual metric as those things occur. Really useful information. It's artificially generated traffic, not user traffic. Now, I do wanna pause for any of you purists. I'm looking at you, Peter Welcher. This is technically not observability. Synthetics is not observability in the pure sense, because observability means not making any changes to the system or affecting the system in any way, and synthetics is literally putting traffic onto the system. Nevertheless, we do these synthetic tests, we take the results of those, and we combine them with passive telemetry in our overall database, and I'm gonna explain what the database looks like in a little bit. So you're saying to me, you do observability by just adding more data? That's kind of counter to what you said before. Well, how do we actually do network observability? Let me go through the five main points here. First, we ingest a huge amount of data from a variety of sources, and we're always looking for more ways and more types of data to ingest. So if a new cloud platform comes up, whatever, and it becomes really huge, we're gonna look for a way to ingest that. Now, I admit that's gonna cause heartache for our product team and our engineers to figure out how we ingest it and what we do with it. Our data scientists have to figure that out, but nevertheless, we want it. We want that additional perspective of what's going on in the network. Next, number two, we're gonna classify, cluster, group, scale, and normalize data. That's interesting stuff. Those are all machine learning terms, but I don't want to downplay machine learning; I just want to remind you that it's one piece of what we do, okay? What we're doing is creating structure in otherwise unstructured data, and you have to remember a lot of network telemetry is unstructured, unlabeled, ephemeral, short-lived. When we're looking for correlations, sometimes that correlation is temporal, very short, and then that thing is no longer occurring, and then how do you compare that to a long-term, permanent correlation or dependency?
We're gonna recognize patterns in data, which is pretty much related to number two, and which kind of underpins how we do DDoS detection and other things as well, looking at seasonality and trend analysis. Then we're gonna automate baselining and perform anomaly detection, as opposed to you doing it. See? We didn't invent anomaly detection, but we are creating a system so you don't have to do it, and then we can combine that knowledge with the rest of what we're doing on the platform. And then we're gonna correlate, to learn how data points relate to each other. So how does this work? This is kind of a marketing slide, so I'm gonna spend a short amount of time on this one and more on the next one, which really digs into the platform. How we do this is we have a telemetry and data pipeline where we collect all sorts of information from our customers, and that gets ingested into our Kentik Network Observability Cloud. We are a SaaS company, so it's all coming into us, and there we have the compute and the storage to be able to do what we wanna do with it. And I'll explain how we ingest that shortly. At the point of ingest, we also do the enrichment, and we start to do the classification and the clustering, applying maybe a Holt-Winters algorithm to do exponential smoothing and find some kind of trend and make predictions, whatever we need to do at the point of ingest, for a reason which I'll get to in my next slide. But notice how that blue box there is in the middle, and it's kind of the foundation for all our products. I wanna make it clear that there is no ML menu in the Kentik UI. It underpins a lot of what we're doing. Now, we do have one particular product called Kentik Insights, which I'm gonna show you, but a lot of this stuff underpins how we do capacity planning, which is not part of Kentik Insights in the sense that it's not part of that menu, and trend analysis, and even cost analysis when we're looking at what your cost over time is with AWS or whoever, right? Our DDoS detection uses this technology, and there are actually a couple more products that aren't even on there. Now, how does this look as a platform? Let me go through this with you. First, we're gonna collect information from a bunch of sources, the lower left network data being kind of what we've been doing for years, but we will also collect information from CDNs, DNS information, which is actually very important, containers, eBPF information from those. Whatever it is that we can get, we're gonna ingest it. The way that works is that we just install a very lightweight Linux package at the location, whether that's in your public cloud environment, branch office, data center, whatever. It's called kproxy, and we use that to establish a secure SSL tunnel back to us, so there's a secure channel of communication back to us. Although some, I guess, don't even bother with that, but that's basically how we ingest information. At that point of ingestion, we're also going to enrich that data with the metadata that I mentioned to you before: geo IP, threat feeds, whatever's necessary.
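Here's a minimal sketch of what enrichment at the point of ingest can look like: attaching metadata such as geo, application tag, and a threat-feed hit to a raw flow record as it arrives, so the context is stored with the data instead of being bolted on at query time. The lookup tables and the flow record are hypothetical, not Kentik's actual schema.

```python
# Minimal sketch of enrichment at ingest: attach metadata to a raw flow record
# as it arrives. All lookup tables and values are made up for illustration.

GEO_BY_PREFIX = {"203.0.113.": "SYD", "198.51.100.": "NYC"}
APP_BY_PORT   = {443: "https", 3306: "mysql"}
THREAT_FEED   = {"198.51.100.99"}          # known-bad addresses (hypothetical)

def enrich(flow):
    """Return the flow record with metadata fields added at ingest time."""
    src = flow["src_ip"]
    enriched = dict(flow)
    enriched["src_geo"] = next(
        (geo for prefix, geo in GEO_BY_PREFIX.items() if src.startswith(prefix)),
        "unknown",
    )
    enriched["app"] = APP_BY_PORT.get(flow["dst_port"], "other")
    enriched["threat_hit"] = src in THREAT_FEED
    return enriched

raw_flow = {"src_ip": "198.51.100.99", "dst_ip": "203.0.113.10", "dst_port": 443, "bytes": 48213}
print(enrich(raw_flow))
```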
And at that point of ingest, we're gonna start to do all those things like classification, like looking for patterns in the data, on purpose. The reason is, imagine having a giant, growing database, and then later on you wanna query it and look for correlation. Well, good luck. That's silly. You wanna be able to do it right away, when it's manageable, number one. Number two, a lot of it's unlabeled. Are you just gonna go store it unlabeled? You need to classify that information right away so it's useful for you later on. Also check its accuracy; that's probably not the right word, but you wanna make sure it's accurate as it comes in. Then you take the results of that and feed the ability to query that data back to you as an engineer sitting at a client. You're gonna be able to query it with custom queries, which can take longer if they're weird queries. You're gonna be able to just sit at the UI and receive automatically system-generated insights, which is iterative, right? This is something that we've been doing, something that we do now, and it's also roadmap, all three at the same time. And so over time, the goal is to make those insights more useful, more insightful. We're also going to store that data. We store our data in columnar databases specifically, because we believe that's a good balance of being efficient and fast to query, and it also allows us to be compliant as far as security is concerned, because it allows us to separate customer data. We have a main data table for each individual customer, with two databases within it: one being short-term storage, where you can query something for root cause analysis or something that just happened, and therefore the queries are faster; and then something more long-term. That's one of the ways that we keep things fast and efficient for you as an actual engineer sitting at the Kentik UI. And then, of course, all of that is stored locally in the Kentik private cloud on encrypted disks. And then we're also gonna send that off to whatever tools you're using for notifications: email, Slack, ticketing systems, all that stuff that we're used to. So let me go into the specifics of what we're doing at that point of ingest. But before I do, I want you to understand that the machine learning component that I'm going to discuss with you now is not everything that we do. It's not, and it's not a bolt-on either, but it is one of the tools in our toolbox to figure out what is most useful for you. So for example, if we find that the accuracy of the results of a particular algorithm isn't what we want, we drop it. We do something simpler that's just more accurate. It's not the tool that we care about. It's the results. So at the point of ingest, one of the things we're gonna do is classification, which includes algorithms like k-nearest neighbors and decision trees, if you're familiar with ML. What we're doing is recognizing patterns in data. Classification is generally done in the context of supervised or unsupervised learning, or I guess semi-supervised as well, I don't know. But ultimately, what you're trying to do is classify information that's coming in. What is it? Learn from labeled data, things that have tags that are known entities, incorporate domain knowledge, which is enrichment. And then we're gonna cluster information. Clustering is very important, because a lot of the telemetry that we collect is unlabeled data. That's just the nature of network telemetry. But clustering, I should have had a slide of this, but imagine the four quadrants of colored scatter plots: these all go together, these all go together. That's clustering, basically. It's like a visualization. And so now I have an idea that the system recognizes these data points are related to each other in some way. This is a hypothetical. So you're creating structure among unstructured network telemetry data, which also allows you to group data. Grouping data allows me to treat a group as one object in the database, thereby reducing the amount of data I have and also making querying faster. That's what data reduction means. And we use clustering for outlier detection: what's not in that group, where things sit.
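Here's a minimal sketch of that clustering idea using scikit-learn: group unlabeled flow samples by a few numeric features so that related points can be treated as one object and the structure becomes visible. The features and sample values are made up; Kentik's actual feature set and algorithms may well differ.

```python
# Minimal sketch of clustering unlabeled telemetry to create structure in it.
# Features and sample values are hypothetical, purely for illustration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# columns: bits/sec, packets/sec, average packet size (made-up samples)
X = np.array([
    [9.0e8, 8.0e4, 1400], [1.0e9, 9.0e4, 1380], [9.5e8, 8.5e4, 1420],
    [1.0e7, 9.0e4, 80],   [1.2e7, 1.0e5, 75],   [1.1e7, 9.5e4, 78],
])

X_scaled = StandardScaler().fit_transform(X)          # put features on one scale
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)   # e.g. [0 0 0 1 1 1]: each group can now be handled as one object
```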
Now, I wanted to get into time series. Time series models are a family of models. We're talking about machine learning, but there are some algorithms that don't necessarily fall into the realm of machine learning that I'm gonna group into time series. So time series models are what we use, as opposed to other types of models like a simple baseline model, because of the accuracy of the results. And that includes things like the Holt-Winters algorithm, linear regression, autoregression, that kind of stuff. If you're not familiar with it, I encourage you to go check out some YouTube videos from Stanford or all that free stuff that's on YouTube to learn a little bit more about how this works, because it's being used everywhere. And so we'll use Holt-Winters, just as an example, to do prediction. It kind of smooths out some of the outliers over time, and then we're able to predict what's gonna happen and do some kind of trend analysis there. But ultimately, all of this gives us the ability to start finding correlation as well. And this is where it starts to break down, and it's one of the reasons I say correlation is not the end-all-be-all of what we're doing with network observability. So here's kind of a silly graphic for you. Remember that correlation is a matter of probability. Here we have a graph charting the number of people who drowned by falling into a pool against the films that Nick Cage, Nicolas, I'm not on a first-name basis, Nicolas Cage appeared in. Is there a correlation there? What do you think? You're kind of going like this. You don't have to raise your hand. Did our data scientists figure this out? Yeah, it took them weeks. They came across this, and it took weeks, with all the compute it took to figure it out. No, I'm being silly, but that's the point. You and I know that's not correct. Does a computer? I don't know. Maybe a trained model would over time, but the idea is that you have to be careful when you're looking at the results of running an algorithm. And by the way, this is not magic. This is math. I gave you some examples of algorithms that we use in Python, in Jupyter Notebooks, with GitHub and columnar databases. There's no magic here, okay? But anyway, someone asks, what's the p-value? That's how we're gonna determine if this is correlated. That's true, that's true. P95, everything is the ninety-fifth percentile. But here's another example. I put this one on the internet the other day. I love this one. There are two elements. Are they correlated? I don't know. They kind of are. What do you think? They sort of are, right? Well, here's the issue: they're kind of correlated, but it's not a causal relationship, and we kind of jump to that, don't we? There's a third variable there. And eventually you and I figure that out. By the way, this is the Northern hemisphere, right? The third variable is summertime and sun; obviously in the Southern hemisphere it would be the inverse of this. But there's an issue with just saying, we do correlation, and therefore we have a magical platform. Well, that's not exactly how it works.
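In the same spirit as that seasonal example, here's a tiny sketch showing why a strong correlation coefficient by itself isn't an insight: two made-up series that both peak in summer correlate almost perfectly even though neither causes the other.

```python
# Minimal sketch: two series driven by the same hidden third variable (summer)
# correlate strongly, but neither causes the other. Values are invented.

import numpy as np

rng = np.random.default_rng(0)
months = np.arange(12)
ice_cream_sales = 100 + 40 * np.sin((months - 3) * np.pi / 6)                    # peaks mid-year
sunburn_cases   = 20 + 9 * np.sin((months - 3) * np.pi / 6) + rng.normal(0, 1, 12)  # also peaks mid-year

r = np.corrcoef(ice_cream_sales, sunburn_cases)[0, 1]
print(f"correlation coefficient: {r:.2f}")   # close to 1.0, yet the real driver is summer sun
```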
So we don't put all our eggs in this basket. It is a tool in our toolbox to provide you useful, insightful intelligence as a human engineer trying to figure out why your application stinks over the network. This is a tough thing to do well. False positives have traditionally been very high in the industry across the board, although they're getting way better. Just a few years ago it was really bad, seventy percent false positives across the board with various companies. Today it's the inverse, way, way better, but it's still something that we want to address. As an example, look at the graph on the top. That's a baseline model that we use in our own testing. This is an internal screenshot; I got permission to use it. And on the bottom is the time series model. Notice the number of alerts: that's what the orange and red dots are, warnings and criticals. Using a time series model, we get a more accurate result. I'm sorry, the font is small and the pictures are small, but I think you get the idea: we go with one, we don't get the results we want, and so we have that constant feedback loop where we see, all right, this isn't working, and we try a different model for that specific reason. I'm sure you've suffered from alert fatigue; I have. The answer is not turning off all your alerts. That is an answer, an option, but the right answer is not turning off your alerts. How about alerting on things that you care about, that are accurate, right? Hire a NOC? What's that? Hire a NOC to take over? Hire a NOC to just take over everything? Okay, okay. Do you have the budget? Awesome. So internally, our data scientists, product team, and leadership are always working on feedback loops to make improvements to this, to increase the level of accuracy and to increase the diversity and accuracy of the data we have in our traditional visibility set. That's still very, very important to us at Kentik. We're gonna do things like use trained models to reduce I/O and CPU utilization, to make querying faster and more efficient. We're gonna look for ways to apply statistical analysis and ML algorithms where appropriate, and we reserve the right to change when the results aren't exactly what we want, because again, we're not just bolting ML onto Kentik; we're trying to provide you useful, insightful intelligence to make your root cause analysis faster and augment you as a network engineer. And then, of course, always underpinning everything is: can you trust what we tell you? When I say we, I mean the system. Can you trust it? That's very important. What's the level of certainty? Some organizations say confidence. What's my confidence level? That's fine. And so maybe you're looking at me funny as I'm talking to a bunch of network engineers about correlation. I was a network engineer for fifteen years. I'm not a data scientist. So it's important to me that all this cool, interesting stuff, and it is so interesting from an academic perspective, serves the actual practical goal here, which is to solve a network operations problem.
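To make that baseline-versus-time-series comparison from the two charts a little more concrete, here's a minimal sketch contrasting a fixed threshold with a rolling, variation-aware threshold built from a rolling mean and standard deviation. The traffic series is synthetic and the window and multiplier are arbitrary; it's only meant to show why the adaptive approach fires far fewer false positives.

```python
# Minimal sketch: static baseline threshold vs. a rolling mean + k*std threshold.
# The traffic series, window, and multiplier are all made up for illustration.

import numpy as np

rng = np.random.default_rng(1)
traffic = 100 + 30 * np.sin(np.linspace(0, 6 * np.pi, 200)) + rng.normal(0, 5, 200)
traffic[150] += 80                                     # one genuine anomaly

static_alerts = np.where(traffic > 120)[0]             # naive fixed baseline: noisy

window = 24
rolling_alerts = []
for i in range(window, len(traffic)):
    recent = traffic[i - window:i]
    if traffic[i] > recent.mean() + 3 * recent.std():  # adapts to the normal swing
        rolling_alerts.append(i)

print(len(static_alerts), "alerts from the fixed threshold")
print(rolling_alerts, "from the rolling threshold")     # ideally just the injected spike
```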
So network visibility, or rather observability, is built on that foundation of data: rich, robust, diverse, accurate data. We're still working on that, and it's very important to us, but it goes beyond the what. It's to tell you why something is happening, not just that it is happening in your network. That's kind of the goal here, okay? And how that fleshes out over time, again, is iterative, as we improve the platform, look for new ways to apply ML, and look for new data to ingest. So I want to talk to you for a few minutes now, as I wrap up, about Kentik Insights. But before I do, I want to remind you that a lot of this stuff we're doing permeates our entire platform, not just Kentik Insights, which I will show you. So I'm actually going to start off by showing you some other components of our platform first. This is the Network Explorer. Network Explorer is a common place in the dashboard where our customers like to camp out. I mean, there are others, and you can customize it. You can see your aggregate flow information, stuff about your clouds and sites. What's really cool on the side here, you notice we have our widgets for synthetics, capacity planning, connectivity costs, and down here, DDoS defense. A lot of this stuff, the DDoS detection function for instance, is powered by the things that I've been talking about. And it's not like we're just using one particular algorithm for all of it. No, different functions require different approaches. So we're not necessarily looking for correlation when we show what your AWS cost looks like over time and how it's trending, you see? Here in the upper right are some insights that the system is generating for you, in a very small font, so my apologies, but I can rattle a couple off. We have a device that typically sends flow, and it's not sending flow anymore. Normally that'd be an SNMP alert: oh, the device is down, right? The device isn't down, but you should still know, because we normally expect flow from it. That's normal behavior; this is abnormal. So it's something that we're gonna alert you to. We also know about different botnets; that is an object in our database, these IP addresses, this type of traffic, and if we see that on your network, we're gonna alert you to it. This is what's happening. Maybe it's not at the volume where things are hosed, and therefore you wouldn't notice as a human being looking at the data, but the system noticed. And so that's something maybe trivial, but maybe very, very important as a preventative measure before you really get hosed with a DDoS attack or something. You can see down here we have a device, what is that, Kentik SPDA NYC1, whatever, and we see fifty-nine percent more traffic this week compared to last week. So we're starting to see a trend analysis. What's that? School started again, so all the students are back. Yeah, there may be a specific reason for that or not, but it's something that we wanna look into. And you know what, Steve, it's important to remember that we're not just looking for what's interesting, you see? From an academic perspective, I want all of it. It's all interesting. Oh, this is cool, look at that. But if I'm sitting there trying to figure out why a mission-critical application is slow, interesting is fine, but is it useful to me? And that's hard, differentiating that on a programmatic and mathematical basis, so that what we produce for you automatically is useful to you.
So if I go over here to the left hamburger menu, you can see all the different products in our system. Where do I wanna start? Let's go to capacity planning and look at our external interfaces. So you look at the external interfaces here. This is a demo environment, by the way, so if some of the information looks a little funky, it's just a demo environment. But if you look at this, let's go with the first one, because it's got a lot of red, it's critical. You see, we have our run-out; it's a P95, so there's your percentile score, and we're looking at things like rolling standard deviations and all that stuff, and you see a trend over here. Looking at something like this is an easy way to see how we can do something very minor to give you a nice visualization of what's happening on your network over time. And if you wanted to drill into it, you go into the interface quick view and start to drill into what's causing that traffic, what devices behind it are causing it, because if it's a router, there's an entire network of devices behind it. How do you track that down really quickly without doing packet captures? Well, like I used to do years ago: create an inbound ACL on my switch or router and then look at traffic. That's crazy, but that's what I used to do. Now, I wanna stop here and also mention that I can start clicking on things. One of the things we believe about network observability is that yes, we'll give you these system-generated, guided insights, but we also wanna give you completely unbound exploratory power within the system, to stack data the way you wanna see it, to do custom queries, and to do it quickly. So we're looking not only to give you cool data, but to make it very efficient. Let's go over to synthetics, to the test control center. You can see over here on the right, under filters, these are all our different synthetic tests. I'm gonna select this one here because it's critical. This is just a network mesh test where we have a bunch of meshes. It's only like four agents, three agents, but you can imagine this might be hundreds of agents. You have a large network with many branches; maybe you're a service provider with many hundreds of branches. And we're doing things like testing, obviously, up/down, latency, loss, jitter, some basic network-centric stuff. As you can see here, we got a lot of critical alerts, so let's see what's going on. I click on this, go into view details, and let's give it a second here. This, again, is simple. We don't need to go nuts, because I want it to be fast. So I'm gonna use the appropriate algorithm to figure out what the rolling standard deviation is, so the system can automatically generate a baseline and automatically generate a threshold, two different things, right? And then alert on that. You can certainly create custom thresholds and baselines as well; that's completely fine. In fact, if you have a very custom application and it can only tolerate a hundred milliseconds of latency, whatever, you can do that. So there's the average latency, and let's see, down here, average jitter, same thing, right? And then when you see something interesting that you're concerned about, you can go and have the system generate a custom policy, as we call it, and it'll create a policy off of those things.
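As a rough illustration of the P95 run-out idea from that capacity planning view, here's a minimal sketch: take the ninety-fifth percentile of utilization per week, fit a simple linear trend, and project when it crosses the interface capacity. The utilization history is synthetic, and the math is deliberately simplified compared to anything in the actual product.

```python
# Minimal sketch of a P95-based capacity run-out estimate.
# The capacity, weekly samples, and trend fit are all hypothetical.

import numpy as np

capacity_gbps = 10.0
weeks = np.arange(12)
rng = np.random.default_rng(2)

# weekly P95 of utilization samples, creeping upward over time (synthetic)
weekly_p95 = np.array([
    np.percentile(6.0 + 0.2 * w + rng.normal(0, 0.3, 1000), 95) for w in weeks
])

slope, intercept = np.polyfit(weeks, weekly_p95, 1)     # linear trend on the P95s
if slope > 0:
    runout_week = (capacity_gbps - intercept) / slope
    print(f"P95 utilization hits capacity around week {runout_week:.0f}")
```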
Another thing I wanna show you, and for time's sake I'm just gonna go right to it, is DDoS Defense. Very popular with our customers, because we're good at it. But ultimately what we're doing is recognizing patterns in your data and saying you have a potential DDoS attack occurring or brewing, or you're gonna be hosed soon, at different levels. One of the reasons we do this stuff at ingest is so we can do it in real time. You don't wanna know about your DDoS attack two days from now; you wanna know now. So again, this is a demo environment, so the data is whatever; there's not that much going on. But as I scroll down here, take a look at this, all the different types of attacks, right? You have Azure attacks, IP scans, distributed TCP resets, and, let's see, keep going, a DDoS ICMP flood attack. If I click on full log, you can see a lot more of these here. And this is specifically DDoS just because it's popular with our customers and we're good at it. But ultimately, we have patterns that we recognize, because we have those as defined objects and we know what they look like, and so we can recognize them as they're happening in real time. Let me go over to Insights now; I'm gonna just open the tab for that. Our Insights tool is the very explicit, here's something interesting you should look at. So when you click on that in the menu, that's what we're looking at. I can expand it to the last thirty days and start scrolling through, and you see all sorts of interesting stuff. We're comparing various sites and saying this site compared to this site used to look like this, now it looks different, and this is something you should be concerned with. If I scroll all the way down to the bottom, there's one in particular I wanted to show you, this interface utilization spike. See there? I have it open in another tab. When I click on it, this is the issue that we're seeing: we see a spike in interface utilization, and the system is going to tell you when that happened and why that alert was fired, right? And then down here, notice that we're getting what we call factors. This is what the system thinks caused it. Now, that's a router. So if I click on that router, you see that on the bottom there? It's hard to see: traffic for source address whatever, this port and this value. Maybe our UI people can make this bigger in the future. So if I click on that IP address, I dig down further, and I find that it's this specific device. I can edit the data sources: what do I want to look at, flow data, interface statistics? You have an unbound ability to explore what's going on. If I click on the device itself, I can very quickly identify, all right, it's this device on the LAN behind the router, this wireless controller, whatever it is that's hosing my network. That's basically how Insights operates: as a service with a specific ability to give you some kind of useful, insightful intelligence that we can offer you unsolicited, but also underpinning everything else that we're doing.
Presented by Phil Gervasi, Head of Technical Evangelism
Phil Gervasi discusses how today’s complex environment of cloud-hosted applications, containerized services, and overlay networks requires a network-centric approach to observability. Network observability, built on a foundation of traditional visibility and powered by statistical analysis and machine learning, provides a greater understanding of today’s network and application performance than legacy visibility alone.


