This post is based on a webinar presented by Jim Frey, VP Strategic Alliances at Kentik, on Network Performance Management for cloud and digital operations. The webinar looks at how changes in network traffic flows — due to the shift to the Cloud, distributed applications, and digital business in general — have upended the traditional approach to network performance monitoring, and how NPM is evolving to handle these new realities.
Defining Network Performance Monitoring
It’s important to understand, at a very high level, that network performance monitoring goes beyond recognizing whether or not the network is available and functional, to seeing how well it’s doing its job. Is the network doing what is expected for delivering applications and services? That’s slightly different from just asking, “Is the network there and available?”
Some things that you would see as typical activity measures on the network are not necessarily directly relevant to NPM. So this is not just looking at traffic volumes, or the counters or stats that you can get from interfaces. Or even the kind of data you get in logs and events. Those can be helpful in understanding network activity, and some forms of that data are relevant in simple ways to network performance, but really just on the side of utilization.
So NPM is about another set of metrics. Metrics such as round-trip time, out-of-order packets, retransmits, and fragments tell you how well the network is doing its job. More specifically, they tell you what role the network is playing when it comes to the bigger question, which is the end-user or customer experience.
Ideally, when you put a network performance monitoring solution in place, not only can you get these new metrics about what’s going on with the network, but you have a way to tie them back to the applications and services that you’re trying to deliver. That’s what helps you tie improving and protecting network performance to achieving your ultimate business goals.
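To make the idea of tying network metrics back to applications concrete, here is a minimal sketch in Python. It assumes a sensor or agent already hands you per-connection samples tagged with an application name. The `Sample` type, its field names, and the `summarize_by_app` helper are hypothetical and are not taken from any particular NPM product.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """One hypothetical per-connection observation from a sensor or agent."""
    app: str           # application or service label (assumed to be known)
    rtt_ms: float      # measured round-trip time in milliseconds
    packets: int       # packets sent on the connection
    retransmits: int   # packets that had to be resent
    out_of_order: int  # packets that arrived out of sequence


def summarize_by_app(samples):
    """Group samples by application and compute simple performance ratios."""
    grouped = defaultdict(list)
    for s in samples:
        grouped[s.app].append(s)

    report = {}
    for app, rows in grouped.items():
        total_packets = sum(r.packets for r in rows)
        retransmits = sum(r.retransmits for r in rows)
        out_of_order = sum(r.out_of_order for r in rows)
        report[app] = {
            "mean_rtt_ms": mean(r.rtt_ms for r in rows),
            # Guard against division by zero for idle applications.
            "retransmit_pct": 100.0 * retransmits / total_packets if total_packets else 0.0,
            "out_of_order_pct": 100.0 * out_of_order / total_packets if total_packets else 0.0,
        }
    return report


# Made-up sample values, for illustration only.
samples = [
    Sample("checkout", rtt_ms=40.0, packets=1000, retransmits=10, out_of_order=5),
    Sample("checkout", rtt_ms=60.0, packets=1000, retransmits=30, out_of_order=15),
    Sample("search", rtt_ms=20.0, packets=500, retransmits=0, out_of_order=0),
]
report = summarize_by_app(samples)
```

Grouping by application is only one possible choice. The same shape works for grouping by customer, site, or cloud region, depending on which business question you are trying to answer.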
There are two primary categories of data sources for network performance monitoring. There’s synthetic data, which you can generate by setting up robots that will send examples of traffic through the network, and then see what happens to it, in order to calculate the metrics mentioned above.
Then there’s data generated via passive observation of real, actual traffic. That’s done by either looking directly at packets that are going across the network or by looking at flow records generated from looking at those packets. Flow records are simply summaries about the packets that have been going across the network. So ultimately, both of those techniques come back down to understanding what’s happening at the packet level. The difference between them is that the data is transmitted and inspected in slightly different ways.
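As a rough illustration of how a flow record summarizes packets, the sketch below groups packets by the usual five-tuple (source, destination, source port, destination port, protocol) and keeps only counts and timestamps. The `Packet` type and `to_flow_records` function are hypothetical. Real flow export formats carry many more fields and use timeouts to decide when a flow ends, which this simplified version ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """A hypothetical, already-decoded packet header."""
    ts: float    # capture timestamp in seconds
    src: str     # source IP address
    dst: str     # destination IP address
    sport: int   # source port
    dport: int   # destination port
    proto: str   # transport protocol, e.g. "tcp"
    size: int    # packet size in bytes


def to_flow_records(packets):
    """Collapse packets into one summary record per five-tuple."""
    flows = {}
    for p in packets:
        key = (p.src, p.dst, p.sport, p.dport, p.proto)
        rec = flows.get(key)
        if rec is None:
            # First packet seen for this conversation: start a new record.
            flows[key] = {"first_seen": p.ts, "last_seen": p.ts,
                          "packets": 1, "bytes": p.size}
        else:
            # Later packets only update the running summary.
            rec["first_seen"] = min(rec["first_seen"], p.ts)
            rec["last_seen"] = max(rec["last_seen"], p.ts)
            rec["packets"] += 1
            rec["bytes"] += p.size
    return flows


# Addresses come from the ranges reserved for documentation.
packets = [
    Packet(0.00, "192.0.2.10", "198.51.100.5", 50000, 443, "tcp", 1500),
    Packet(0.02, "192.0.2.10", "198.51.100.5", 50000, 443, "tcp", 500),
    Packet(0.05, "192.0.2.11", "198.51.100.5", 50001, 443, "tcp", 100),
]
flows = to_flow_records(packets)
```

Three packets become two flow records here. The payload is gone, but who talked to whom, for how long, and how much is preserved, which is why flow data is far cheaper to move and store than raw packets.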
Logs are also a way to get some information about network performance, but you have to be fortunate enough to have log sources that will give you discrete performance metrics. There are some log sources out there that do so, however they are not common, and it’s not something that most people can make a primary part of their NPM strategy.
There are different types of NPM tool architectures. The most common is buying and deploying direct packet inspection appliances, connecting them to the network. This is the way most packet inspection happens today.
There are plenty of software-based solutions, too, that you can download and then deploy in whatever server you happen to have handy. That doesn’t tend to work as well when you’re doing packet inspection, because of the need to tune applications and servers to cope with a high volume of packet inspection.
The emerging way to do NPM is SaaS. SaaS is becoming an important way to access technologies that would otherwise be difficult to install and maintain. SaaS allows you to get levels of functionality that would otherwise be expensive or difficult to achieve.
Cloud and Digital Operations NPM Challenges
The real purpose of the discussion today is to talk about some of the challenges around network performance management with respect to cloud and digital operations. I want to start with pulling back a little bit and looking at the 10,000-foot view because this is an important context to think about.
Remember that the reason why organizations are moving to the cloud and remaking themselves into digital operations is to achieve business goals. In most cases, those goals are better productivity and revenue.
However, the operations and engineering guys in the technology team need to figure out how to do this while operating efficiently. They have to look at cost containment, optimizing what they have in place, and doing that quickly.
What is the challenge? A Gartner analyst named Sanjit Ganguli captured this well in a piece that was put out in May of this year, called “Network Performance Monitoring Tools Leave Gaps in Cloud Monitoring,” which laid out a few of the major issues. One point he made was that you’re still responsible for performance even if you don’t have control of the infrastructure. That’s certainly the case when you’re dealing with external cloud resources. Another point was that cloud and digital operations are creating a fundamental change in network traffic patterns. And that’s creating some gaps that are not well-filled by the existing tools.
Packet analysis, in particular, becomes far more challenging when you have a cloud or hybrid cloud environment because packets travel across networks to which you do not have direct access. So the question of instrumentation becomes, where are you going to get those traditional measurements? If you can’t get the data, how do you answer performance questions?
Instrumentation Challenges: When I owned all of my infrastructure, my data centers, my own WAN, and my own network, it was pretty straightforward for me to say, OK, I know where I can go and instrument for network performance data. I can deploy some appliance to look at packets and draw some data from it. So it’s pretty easy to get the data that I need to do NPM.
But with mixed and hybrid infrastructures, even though you can find instrumentation points in the cloud portions of your infrastructure, you are not going to be able to see the rest of the contextual environment in which those resources are operating. And this makes it hard to know if you have the metrics and measurements that you need.
The Challenge of Changing Dynamics: The second problem is the dynamics problem. Digital infrastructures, whether they’re internal, private Cloud, or they’re hybrid, or they’re external cloud, they are by definition very dynamic. They’ve been designed to be this way to provide agility for the organization, quick time to market, a faster ability for an organization to move and follow opportunity without having to go through long procurement and deployment cycles of traditional infrastructure.
Well, this means it’s moving and changing much more quickly. And therefore, not only do you have a problem with finding a point to instrument, but that point, if you do find it, keeps moving. So it’s a constantly changing environment, and it’s difficult to keep up with.
Scale Challenge: The last problem is scale. How do you keep up with the fact that you can start to create and generate a bunch of new infrastructure very quickly? And then turn it off, by the way. So part of the natural behavior of these digital and Cloud environments is this sort of elasticity. Essentially, the whole point is, how can you handle and maintain the volume of data you need for keeping on top of performance and understanding it?
Five Keys to Dealing with NPM Challenges
Synthetic and Passive Monitoring: I mentioned earlier that there were two categories of data used in NPM: synthetic and passive monitoring of real traffic. Synthetic test agents seem like a sound way to deal with cloud environments, because you can’t get in there necessarily and deploy your traditional hardware-based probes. But at least you can ping them, right? You can check to see if they’re up. And that has some value.
However, you need to measure real traffic, because it’s essential for understanding actual performance. You need to know more than whether a test message can make it through to an endpoint. You need to know what’s happening to all of the real production traffic that’s moving back and forth to those resources. When you have real traffic data, you can correlate more accurately between performance from the view of the endpoint, and the rest of the context of the traffic in the network and across the Internet.
There is still value in synthetic data in making sure resources are responsive and available, even when there’s no real traffic going to them, such as during off hours. Also, if you are fortunate enough to have very simple and repeatable transactions that you’re monitoring, you can use synthetic tests to reproduce those and get a good proxy reading.
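A synthetic availability check of this kind can be very small. The sketch below times a TCP connection to a target and reports whether it responded. The function names `tcp_connect_ms` and `probe` are hypothetical, and the demo targets a throwaway listener on localhost so that it runs without network access. In practice you would point it at the cloud resource you care about.

```python
import socket
import time


def tcp_connect_ms(host, port, timeout=2.0):
    """Return the TCP connect time in milliseconds, or None if unreachable."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return None


def probe(host, port, attempts=3):
    """Run several connect tests and summarize availability and latency."""
    results = [tcp_connect_ms(host, port) for _ in range(attempts)]
    ok = [r for r in results if r is not None]
    return {
        "available": bool(ok),
        "success_ratio": len(ok) / attempts,
        "best_ms": min(ok) if ok else None,
    }


# Demo: a local listener stands in for the remote resource being tested.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(5)
port = listener.getsockname()[1]
result = probe("127.0.0.1", port)
listener.close()
```

Note what this does and does not tell you: the target accepted a connection and how long the handshake took from the probe's vantage point. It says nothing about what real production traffic is experiencing.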
Deployable and Feasible: Not all tools and technologies can be used in the new, hybrid cloud environment. You’re probably going to need a mix of tools. You’re going to have to think about using agents, even though to some people that’s a dirty word, and you’re probably going to have to use some sensors. To other people, that’s a dirty word. Go ahead, get dirty, you’re going to have to use some mix to get your head around the complete environment and all of the potential viewpoints you’d need.
A key point that I want to make clear is that traditional appliance-based approaches just aren’t going to be enough on their own. Appliances are extremely useful for performing deep packet-based inspection. But you have to get access to the packets. And that’s just not practical when you’re dealing with external cloud-based infrastructure. There are some adaptations of appliance techniques to try to do this. None of them has found great favor. They all have limitations. You may still want to have a traditional appliance back in your private data center, for your local environment. But you’ve got to go beyond just using appliances. That just doesn’t work anymore.
Ultimately, you’ve got to be flexible with your approach here. But you’ve got to look at techniques and tools that are very cost effective, because remember, as I mentioned earlier, the reason for moving to Cloud, and the reason for rebuilding IT or back-end infrastructure in a digital sort of manner, is to save money. You are trying to be cost effective. So you can’t take an approach that’s going to kill you from a cost perspective.
You’ve got to find something that’s cost effective. It’s got to be easy to deploy, too. Really, the instrumentation and the methods you use for gathering performance metrics should be just as easy to deploy as the Cloud resources are themselves. You’ve got to find ways to be Cloud-like in the way you adapt, here.
And finally, it really needs to move with the workloads. Remember the dynamics problem I mentioned earlier. The things you want to keep an eye on won’t necessarily be in the same place an hour or a day from now. So the instrumentation approach, the strategy you use for gathering these metrics, needs to be able to float and stick with those workloads as they move around, or you’ll lose visibility.
Internet Path and Geo Awareness: If you’ve got digital operations, you’re a digital business, and your business is based on your employees and customers reaching your infrastructure and applications across the Internet. You have a huge level of dependence on that Internet connectivity being up and reliable and high performing. So you have to start thinking about how you get that view into the Internet path between you and your cloud services, or you and your customers. This wasn’t part of the picture for traditional enterprise networks. In the past, most people were able to just blissfully ignore the performance impact of the Internet. But that’s no longer valid.
Here’s why. The more networks or hops between your users and your resource, whether your data center or in the cloud somewhere, the more risk. The more hops, the more latency. Not all paths across the Internet are equal, and they do change regularly. Sometimes those paths work well for others, but not for you. So this is why, in the age of digital operations, it’s important to understand what’s going on with these Internet paths. Without this visibility, you are going to be at risk of not being able to influence or optimize customer experience.
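A toy comparison shows why path visibility matters. The sketch below assumes you already have per-hop latency estimates for two candidate paths to the same destination and simply adds them up. The path names, AS numbers (taken from the range reserved for documentation), and latency figures are all made up, and real paths change over time, so this is an illustration of the reasoning rather than a measurement method.

```python
def path_latency_ms(hops):
    """Total latency of a path, treating per-hop latency as additive."""
    return sum(latency for _, latency in hops)


# Each path is a list of (network, latency_ms) hops. All values are hypothetical.
paths = {
    "via_transit_a": [("AS64500", 4.0), ("AS64501", 11.0), ("AS64502", 9.0)],
    "via_transit_b": [("AS64500", 4.0), ("AS64510", 6.0)],
}

totals = {name: path_latency_ms(hops) for name, hops in paths.items()}
ranked = sorted(totals, key=totals.get)  # lowest total latency first
```

In this made-up case the two-hop path comes out well ahead of the three-hop one. Without per-path data you would only see the blended result and have no basis for steering traffic toward the better option.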
Cloud-Scale Data Analysis: NPM frankly is a big data problem in many ways, shapes, and forms. Most tools do not have a big data architecture. Without a big data architecture to keep all of the details, most NPM tools simply summarize and then throw away the details. That means you’re lacking the information you need to get to the bottom of problems.
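The cost of summarizing and discarding can be seen with a handful of records. In the sketch below, the overall average round-trip time looks merely elevated, while the retained detail shows that a single destination network accounts for the whole problem. The record layout and the `drill_down` helper are hypothetical, and the numbers are invented for illustration.

```python
from statistics import mean

# Hypothetical detailed records, with addresses from documentation ranges.
records = [
    {"dst_net": "203.0.113.0/24", "rtt_ms": 30.0},
    {"dst_net": "203.0.113.0/24", "rtt_ms": 34.0},
    {"dst_net": "198.51.100.0/24", "rtt_ms": 32.0},
    {"dst_net": "198.51.100.0/24", "rtt_ms": 28.0},
    {"dst_net": "192.0.2.0/24", "rtt_ms": 400.0},
    {"dst_net": "192.0.2.0/24", "rtt_ms": 436.0},
]

# What a summarize-and-discard tool would keep: one number.
overall_rtt_ms = mean(r["rtt_ms"] for r in records)


def drill_down(rows, dimension):
    """Average RTT per value of the chosen dimension; needs the raw rows."""
    groups = {}
    for r in rows:
        groups.setdefault(r[dimension], []).append(r["rtt_ms"])
    return {value: mean(rtts) for value, rtts in groups.items()}


by_net = drill_down(records, "dst_net")
worst_net = max(by_net, key=by_net.get)
```

The drill-down is only possible because the individual records were kept. Once only the overall average is stored, the question "which network is slow?" can no longer be answered.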
Big data architectures help with dealing with the volume of data, and making it more flexible. A lot of folks have tried to build their own NPM tools using open source and commercial big data platforms, but it’s very expensive. Not necessarily expensive due to the cost of licenses, but due to the time and effort required to set up the big data back-ends, to figure out how to feed them data, and how to build reports and analyses.
There are, though, tools that are coming along, Kentik is one of them, that are commercializing big data architectures. And that’s where the answer is going to be in the long run.
The cloud-scale tool concept offers an opportunity to deal with a cloud-sized problem that you have in your operating environment. The cloud approach lets you access resources and solutions that can handle all the data without you having to do all the hard work and having to build a back end by yourself. The cloud also makes it much easier to deploy these solutions. If you can access your NPM as a SaaS, it solves a lot of your total-cost-of-ownership pains.
API Friendliness: APIs are seeing tremendous adoption and use, and getting increased attention and value around this whole move to the cloud. We’re changing our application development environment, and APIs are now present in all aspects of cloud services and virtualization services. Whether you’re virtualizing internally to build a private cloud, or accessing Amazon Web Services, APIs are everywhere now.
Unfortunately, NPM data has not been well serviced by existing APIs. That’s too bad because your NPM system needs to support API integration so you can feed that data into your digital operations and business decision-making processes. That performance data contributes to understanding the customer experience, user experience, and total activity. You can use performance data to help you with billing, service activity, and success and product management decisions.
Listen to the rest of the webinar to learn about the Kentik NPM solution and how one Ad Tech company utilized Kentik’s nProbe agent and Kentik Detect to monitor and optimize revenue-critical performance issues. If you already know that you want big data network performance monitoring for your digital operations team, start a free trial today.