For the last few years, the entire networking industry has focused on analytics and mining more and more information out of the network. This makes sense because of all the changes in networking over the last decade. Changes like network overlays, public cloud, applications delivered as a service, and containers mean we need to pay attention to much more diverse information out there.
After all, if we want to figure out why an application delivered over a network isn’t performing well, we need to look at all these new network components involved in getting that application down to me and you, sitting behind a computer screen.
The industry has looked to machine learning for the answer, sometimes resulting in severe eye-rolling. So let’s take a step back and look at what problems we’re trying to solve and how we actually solve them, whether that’s machine learning or not.
Let’s look at some of the problems we face today with network telemetry.
First, the sheer volume of network data we have available today eclipses the smattering of flows, SNMP information, and occasional packet capture we used to rely on. Consider what we can collect now — the sum of just one day’s worth of flows, packet captures, SNMP messages, VPC flow logs, SD-WAN logs, routing tables, and streaming telemetry can overwhelm a network operations team.
Add to that additional information like IPAM databases, configuration files, server logs, and security threat feeds, and a network operations team would be hard-pressed to find meaning in the ocean of information flooding their monitoring systems.
So to handle this volume of data, we first need to figure out how to:
This isn’t trivial, and most answers will have little or nothing to do with machine learning. These are database architecture decisions and workflow designs backed by sufficient compute resources. For example, a columnar database offers a good balance: fast queries, along with the ability to separate data vertically for multi-tenant scenarios. From a high level, there really isn’t any reason to cram an ML model into this process.
But dealing with large databases is just one problem. How do we analyze the variety of very different data types?
The second problem, which is really two distinct problems, is working with a variety of very different data types. Think about some of the very basic telemetry we’ve been collecting for years.
Flow data can tell us the volume of one protocol on our network relative to all the other protocols. So the data point is a percentage, like 66%, and describes that protocol’s share of total traffic.
SNMP can tell us what VLAN is active on an interface, which ultimately is just an arbitrary tag, not a percentage.
SNMP can also tell us the uptime of a device, represented in seconds, minutes, hours, days, and years. A measurement of time. Not a percentage and not a tag.
A packet collection and aggregation tool can tell us how many packets go over a wire in a given amount of time — some number in the millions or billions. A finite but dynamic number many orders of magnitude larger than a percentage.
It’s not just the huge volume of data but also how different the data are, and the scale each uses. Telemetry arrives as percentages, bits per second, arbitrary ID tags, timestamps, routing tables, and so on.
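To make the mismatch concrete, here is a minimal sketch of how those values might sit side by side. The flow records, field names, and numbers are all made up for illustration; only the 66% share mirrors the example above.

```python
from collections import Counter

# Hypothetical flow records as (protocol, bytes). Values are invented.
flows = [("https", 5000), ("dns", 400), ("ssh", 1000), ("https", 1600), ("ntp", 2000)]

def protocol_share(records):
    """Return each protocol's share of total bytes as a percentage."""
    totals = Counter()
    for protocol, byte_count in records:
        totals[protocol] += byte_count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {proto: 100.0 * count / grand_total for proto, count in totals.items()}

shares = protocol_share(flows)
print(shares["https"])  # 66.0

# Four data points about the same network, on four unrelated scales.
sample = {
    "https_share_pct": shares["https"],  # percentage, 0 to 100
    "vlan_id": 120,                      # arbitrary tag; arithmetic on it means nothing
    "uptime_seconds": 45 * 86400,        # a duration
    "packets_per_hour": 2_500_000_000,   # a very large count
}
```

Comparing `https_share_pct` with `packets_per_hour` directly would let the larger number dominate any calculation, which is the scale problem described above.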
We can solve these problems using standardization or normalization at the point of ingest into the system. Remember that although data scientists often use standardization and normalization in machine learning preprocessing, these are statistical functions, not machine learning in themselves.
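As a rough illustration of standardization, the sketch below computes z-scores: each value is re-expressed as the number of standard deviations it sits from the mean. The function name and the sample uptimes are assumptions for the example, not part of any particular product.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Every value is identical, so nothing deviates from the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical device uptimes in days.
uptime_days = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = standardize(uptime_days)
print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

No model is trained here. It is plain arithmetic on the mean and standard deviation.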
Using normalization, which is simple math, we can transform diverse data points on vastly different scales into new values all appearing on the same scale, usually 0 to 1. Now we can compare what was previously very different data and do more interesting things like finding correlations, identifying patterns, etc.
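One common form of this is min-max normalization, which maps the smallest value in a series to 0 and the largest to 1. The sketch below assumes two made-up metrics, a packet count and a percentage, to show how they land on the same scale.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series has no spread to rescale.
        return [0.0 for _ in values]
    span = hi - lo
    return [(v - lo) / span for v in values]

# Hypothetical measurements taken at the same four points in time.
packets_per_min = [1_200_000, 3_400_000, 2_300_000, 5_600_000]
cpu_percent = [12.0, 34.0, 23.0, 56.0]

print(min_max_normalize(packets_per_min))  # [0.0, 0.5, 0.25, 1.0]
print(min_max_normalize(cpu_percent))      # [0.0, 0.5, 0.25, 1.0]
```

After rescaling, the two series are directly comparable even though the raw values differ by five orders of magnitude. One caveat is that min-max scaling is sensitive to outliers, because a single extreme value compresses everything else toward 0.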
The right database design and fundamental statistical analysis are enough to do some amazing things. It isn’t necessary to start with ML. When we get to the point in our analysis where we can’t do much else with a more basic algorithm, we can apply an ML model to get the desired result. And that’s the key here: the result, not the method.
That’s why the reality of ML is that it’s another tool in our toolbox, not the only tool we use to analyze network telemetry. We use it when it makes sense to get us the desired result, and we don’t use it when it doesn’t.
So, for example, we can apply an ML model when we want to do more advanced analysis, such as:
Because network telemetry is inherently unstructured, dynamic, and usually unlabeled, it makes sense that we use ML to perform specific tasks.
We should use clustering to group unlabeled data and reduce the overall amount of data we have to deal with.
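A minimal sketch of the idea, using plain k-means on already-normalized values. The points, the choice of two clusters, and the fixed starting centroids are all assumptions made to keep the example small and deterministic.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

# Hypothetical flows described by (normalized bytes, normalized duration).
points = [(0.10, 0.20), (0.15, 0.10), (0.20, 0.15),
          (0.80, 0.90), (0.90, 0.85), (0.85, 0.80)]

centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Six unlabeled points are summarized by two centroids, which is the data reduction described above. A production system would more likely use a library implementation and choose the number of clusters from the data.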
We should use classification to recognize patterns in data and classify new data.
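As one simple illustration, a k-nearest-neighbors classifier labels a new data point by majority vote among the closest known examples. This sketch assumes a small labeled set already exists, which is often not the case with raw telemetry; the labels and values are invented.

```python
from collections import Counter
from math import dist

def knn_classify(training, sample, k=3):
    """Label a sample by majority vote among its k nearest training points."""
    nearest = sorted(training, key=lambda item: dist(item[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled flows: (normalized bytes, normalized duration) -> traffic class.
training = [
    ((0.10, 0.20), "interactive"),
    ((0.15, 0.10), "interactive"),
    ((0.20, 0.15), "interactive"),
    ((0.80, 0.90), "bulk"),
    ((0.90, 0.85), "bulk"),
    ((0.85, 0.80), "bulk"),
]

print(knn_classify(training, (0.82, 0.88)))  # bulk
print(knn_classify(training, (0.12, 0.18)))  # interactive
```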
We should apply a time series model to estimate seasonality, make predictions, detect anomalies, and identify short and long-range dependencies.
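The sketch below is not a full time series model. It is a rolling z-score check, a deliberately simple stand-in that shows the anomaly detection part of the idea: compare each new point against a baseline built from recent history. The window size, threshold, and traffic values are assumptions for illustration.

```python
from statistics import mean, pstdev

def rolling_anomalies(series, window=5, threshold=3.0):
    """Return indexes whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies

# Hypothetical interface throughput in Mbps, sampled at a fixed interval.
throughput_mbps = [100, 102, 98, 101, 99, 100, 103, 250, 101, 99]
print(rolling_anomalies(throughput_mbps))  # [7]
```

A seasonal or forecasting model would replace the rolling mean with a prediction that accounts for daily and weekly patterns, but the comparison of observed against expected values stays the same.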
At this stage of our analysis, ML serves a specific purpose. It isn’t just a bolt-on technology to join the fray of vendors claiming they use ML to solve the world’s problems. And when a model doesn’t produce the result we want, is inaccurate or unreliable, or takes too many resources to run, we should drop it and do something simpler that gets us most of the way there.
I don’t believe using machine learning (and artificial intelligence) is just marketing hype, though it can be when presented as a one-size-fits-all solution. The reality of ML in network observability is that it’s just another tool in the data scientist’s toolbox. We use ML to get a specific result. We use it when it makes sense and don’t use it when it doesn’t.