Hello, everyone, and thank you for joining today's webinar. I'm Jordan Sloop, your host from Kentik Marketing. Our topic for today is: what is network performance monitoring, and how is it evolving? Our presenter is Michael Patterson, Kentik network technologist, founder and former CEO of Plixer. In today's webinar, Michael will review what NPM has become and why legacy solutions met their demise. He will also discuss why these technologies bring deeper visibility into how your company's Internet connections and applications are being impacted, and by whom. During the webinar, if you have any questions, feel free to enter them in the Zoom chat, and we will answer as many as we can in the open Q&A session at the end of the presentation. With that, I turn it over to you, Michael.

Thank you, Jordan. Hello, everyone. We have an interesting topic for you today covering network performance monitoring, also known as NPM.

Years ago, many of us worked with one or more network monitoring packages that basically pinged devices on the network and turned icons green, yellow, or red if some IP address couldn't be reached. If you clicked on a device, you could get basic SNMP details like the description of the device, the uptime, the local contact, the name of the device, and the location. Over time, we learned that these OIDs were great for some information, like firmware revision, but not for details like the device name, as it often conflicted with the DNS name. SNMP is no doubt still useful today, but in a cloud-based enterprise, SNMP isn't always available, and we need more details than it can provide, specifically things like latency.
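For reference, the kind of system-group poll described above boils down to just a few lines. Here is a minimal sketch using the open-source pysnmp library; the device address and the "public" community string are placeholders, not recommendations:

```python
# Minimal SNMP GET sketch using pysnmp (pip install pysnmp).
# The device address and the "public" community string are placeholders.
from pysnmp.hlapi import (CommunityData, ContextData, ObjectIdentity,
                          ObjectType, SnmpEngine, UdpTransportTarget, getCmd)

# The classic system-group objects: description, uptime, contact, name, location.
for oid in ('sysDescr', 'sysUpTime', 'sysContact', 'sysName', 'sysLocation'):
    error_indication, error_status, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData('public', mpModel=1),       # SNMPv2c read community
        UdpTransportTarget(('192.0.2.1', 161)),   # placeholder device address
        ContextData(),
        ObjectType(ObjectIdentity('SNMPv2-MIB', oid, 0))))
    if error_indication or error_status:
        print(f'{oid}: {error_indication or error_status.prettyPrint()}')
    else:
        for var_bind in var_binds:
            print(' = '.join(x.prettyPrint() for x in var_bind))
```

Polling like this still works fine on gear you own; as the webinar explains, the trouble starts once the path runs through networks you don't control.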
Initially, network monitoring systems gave us trends, meters, pie charts, and more, but it was all real time. Nothing was historical or proactive. MRTG was one of the first tools we saw that provided historical information using SNMP, and it was free, open source, and web based. The problem was that you had to write scripts and schedule them to deal with the constant adds, moves, and changes; if you didn't, many trends went blank.

What we quickly realized about SNMP utilization trends is that although they're great for telling us how full connections are, we couldn't see who or what was consuming the bandwidth. To address this, the SNMP community created RMON, which in most cases required deploying new, more expensive switches. The technology had several shortcomings and was never widely adopted. So instead, many companies deployed packet capture probes, which worked great in some ways, but they were very expensive, and they weren't always where we needed them, which was increasingly a problem as the network grew.

Economical visibility into the who and what was finally resolved with NetFlow. Nearly every router, regardless of vendor, supported NetFlow or IPFIX at no extra cost. With the click of a button, NetFlow gave us the information we needed to find the root cause fast. But this visibility also falls short today unless it is enhanced with additional telemetry. Companies were adopting SaaS applications and moving to hybrid and multi-cloud environments. As a result, we learned quickly that NetFlow was telling us what we already knew: that all the traffic was headed to the Internet. In the last two years, massive numbers of employees have started working remotely. In some cases, entire companies have migrated to work from home, and most, if not all, of their traffic never touches the on-prem network.

This has been the biggest game changer for network performance monitoring. Facing a growing appetite for more bandwidth and fewer outages, Internet providers needed visibility into the routes traffic was taking, and enterprises needed end-to-end visibility into how their cloud applications were performing. Most ISPs don't hand out SNMP access to their routers, and they don't share their NetFlow information either. That left simple ping availability, which meant that most network management systems lost visibility, and as a result, they largely became shelfware. Since the Internet is really just a collection of networks, network performance monitoring vendors turned to additional techniques that would allow them to regain visibility. By working with technologies such as BGP and API monitoring, as well as traceroute, vendors brought new life to the NPM market and reemerged with greater network observability.

Let's talk about why BGP and API monitoring are so important, and then I'll touch on how traceroute can be very helpful when performed at scale. BGP is the routing protocol of the Internet. If you're a service provider or a digital enterprise, knowing that customers can reach your prefixes or IP addresses is very important. So why do companies want to monitor BGP updates? Largely, it's done to be proactive: for example, if relationships change with a service provider or a BGP peer. It can be used to detect DDoS attacks, and it's just as likely to be used to verify that routes have returned to normal. On the reactive side, BGP monitoring can also be used to detect misconfigurations, like AS path prepending errors, which can occur when a company has at least two BGP peerings or connections to the Internet. It can catch infrastructure failures, like when a peering gets severed. And of course, there are the infamous hijacks and leaks that commonly occur in BGP, as well as route flapping, which can result in routes not getting propagated for a period of time. All of these cases are monitored because they can prevent your customers and employees from reaching your business. Kentik uses hundreds of BGP feeds to ensure better network coverage, accuracy, and performance.

Here's one example of how you can monitor BGP routes. Up here in the top left, you enter the prefixes you want to keep an eye on in case someone other than the owner starts advertising them. You probably want to include any sub-prefixes as well. Up in the top right, you can enter the autonomous system numbers (ASNs) that you want to collect advertisements on. There is an OR relationship between the prefixes you want to monitor and the ASNs you want to keep an eye on. In other words, a route only has to match one of these criteria, not both. Once it's configured, the platform will pore through the routes it is receiving and look for either the prefixes you specified or the ASNs. As you can see here, we didn't specify any IPv6 prefixes, but we did specify the ASN. This is a great way to see updates related to your business.
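To make that OR relationship concrete, here is a minimal sketch of the matching logic in plain Python. The update fields (prefix, origin_asn) are hypothetical illustrations, not Kentik's actual schema:

```python
# Sketch of the OR-style matching a BGP monitor performs.
# Field names (prefix, origin_asn) are hypothetical, not Kentik's schema.
import ipaddress

WATCHED_PREFIXES = [ipaddress.ip_network('203.0.113.0/24')]  # include sub-prefixes too
WATCHED_ASNS = {64500}                                        # example ASN

def matches(update: dict) -> bool:
    """Flag a BGP update if it matches a watched prefix OR a watched origin ASN."""
    prefix = ipaddress.ip_network(update['prefix'])
    prefix_hit = any(prefix.subnet_of(w) for w in WATCHED_PREFIXES
                     if prefix.version == w.version)
    asn_hit = update['origin_asn'] in WATCHED_ASNS
    return prefix_hit or asn_hit   # OR relationship: one criterion is enough

# Caught even though the prefix isn't watched, because the origin ASN is.
print(matches({'prefix': '198.51.100.0/24', 'origin_asn': 64500}))  # True
```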
But what if you want to trigger an event based on certain criteria? In the Kentik platform, we need to add a test, specifically a BGP monitor. Let's take a look at how to trigger on an event. Setting up a BGP monitor is very straightforward. You enter the prefixes that you want to watch for, specify the ASNs that are allowed to originate the prefixes, check off RPKI if that's applicable, give the monitor a name, and select a notification. If you visit the advanced options, you can associate this monitor with a group of tests sharing the same label. You can then specify how many times the event has to occur within a certain time frame, then have the platform start watching for the event. That's all there is to it. When a matching event occurs, it will show up here in the results, and you can also narrow in on a time frame.

By monitoring your BGP traffic, you'll be able to see when peers are added or if you change transit. If you're the victim of a DDoS attack, you'll witness the rerouting of the targeted prefix until the attack subsides. Or if one of your prefixes is hijacked, you will know immediately, allowing you to contact the necessary service providers to help minimize the impact. I hope it's becoming obvious why BGP monitoring is important to your business. Other service providers simply aren't going to give you SNMP access to their networks or share syslogs or NetFlow information. If you want to know when the far reaches of the Internet cannot reach your network, you have to participate in BGP.

The second area where network performance monitoring has evolved is with HTTP and API monitoring. When it comes to our digital presence on the Internet, we can't rely solely on ping to verify that services are up and running, and we certainly can't use SNMP. To verify the availability of digital services, such as websites, it's also good to perform either a page load or an API test, and this should be performed from multiple locations. To do that, we need to add another test. But this time, instead of BGP, we're going to go down here to HTTPS or API, or even do a page load test. Let's talk about the differences. The HTTPS or API test is used to check a web server's response to an HTTP request; more on this in a minute. Page load tests are used to measure whether the end user is experiencing an undesirable time to first value, or what is commonly referred to as the first meaningful paint of the page.

Let's take a closer look at these tests. The first step in each test is to enter a URL or a fully qualified domain. Notice that with the HTTPS or API test, you can select whether you want to use GET, HEAD, PATCH, etc. Select the agents, which are located all over the world, that you would like to run the test from, and give the test a name. Then enter any additional criteria that you may want to include in the test. Headers allow users to satisfy use cases that involve things like basic authentication, API keys, the content type of the body, and so on. Parameters get converted into a query string that gets appended to the end of the URL; this is typically used by applications to support request options like filtering or sorting. You can specify some CSS selectors in order to verify that specific page elements load. And the body is often used to carry larger, more complex payloads in a friendlier format than embedding the same information in the URL string. Select a notification profile, and visit the advanced options for even more options, like setting thresholds. The global tests will be deployed to all of the selected global and private agents around the world, where they will start running the tests against the server you specified in the configuration.
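At its core, each of those checks amounts to something like the following minimal sketch, shown here with Python's requests library; the URL, header values, and thresholds are placeholders:

```python
# Minimal HTTP/API check sketch (pip install requests). The URL, headers,
# params, and thresholds are placeholders, not a Kentik configuration.
import requests

URL = 'https://api.example.com/v1/status'

resp = requests.get(
    URL,
    headers={'Authorization': 'Bearer <api-key>',   # e.g. API keys, basic auth
             'Accept': 'application/json'},
    params={'sort': 'desc'},                        # becomes the ?sort=desc query string
    timeout=10)

print('status code  :', resp.status_code)
print('response size:', len(resp.content), 'bytes')
print('elapsed      :', resp.elapsed.total_seconds() * 1000, 'ms')  # approximate latency

# Simple threshold checks, like the advanced options described above.
assert resp.status_code == 200, 'unexpected status'
assert resp.elapsed.total_seconds() < 2.0, 'latency threshold exceeded'
```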
The tests include the domain lookup time, which is from when the domain lookup starts to when it is finished. The connect time, which is from when the request to open a connection is sent to the network to when the connection is open. The response time, which is from when the first byte of the response is received to when the last byte of the response is received. And the average HTTP latency, which is the average time for HTTP requests made during the time slice, from making the request to receiving the last byte of the response; this figure is the sum of the domain lookup time, connect time, and response time columns. All of these metrics are stored and available for historical trends. Additional metrics included with each test are the status code and the response size, as well as the network-related metrics of latency, jitter, and packet loss. The combination of these metrics will help you determine whether a slowness problem is the Internet connection or the application.

By taking advantage of synthetic monitors, your IT team ends up with a more complete big picture of how your private and SaaS-hosted applications are performing for both customers and remote employees, regardless of where any of these users or services are located in the world. I'm going to dive into a quick demo to show you what I mean. All of your critical applications appear in the test control center, with the applications witnessing the worst performance at the top. Notice the number of agents running each test. If you click on the time frame of the test, a geographical view is provided. If latency for one of the tests to the targeted SaaS hits a threshold, the icon in the map changes color, and the metric is shown in the table below. If you click to view details, trends on all of the collected metrics are provided. The timeline makes it easy to identify when a threshold was exceeded. Notice the HTTP stages breakdown. In this example, the stages are mostly purple, indicating that the latency to the SaaS is caused by response time. This graph provides a trend on the average HTTP latency, which is the sum of domain lookup time, connect time, and response time. If you scroll down further on the page, you will see a trend of the HTTP status code as well as the average latency. And if you keep going down, trends on packet loss and jitter are also provided.
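If you want to reproduce those HTTP stages yourself, here is a rough sketch using only Python's standard library; the host is a placeholder, and for simplicity the TLS handshake is counted as part of the connect stage:

```python
# Rough sketch of the HTTP timing stages described above, using only the
# standard library. Host is a placeholder; the TLS handshake is folded
# into the connect stage in this simplification.
import socket
import ssl
import time

host = 'www.example.com'

t0 = time.perf_counter()
addr = socket.getaddrinfo(host, 443)[0][4][0]            # domain lookup
t1 = time.perf_counter()

raw = socket.create_connection((addr, 443), timeout=10)
conn = ssl.create_default_context().wrap_socket(raw, server_hostname=host)
t2 = time.perf_counter()                                 # connection open

conn.sendall(f'GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n'.encode())
chunk = conn.recv(65536)                                 # first byte(s) arrive
t3 = time.perf_counter()
while chunk:
    chunk = conn.recv(65536)                             # drain to the last byte
t4 = time.perf_counter()
conn.close()

lookup, connect, response = t1 - t0, t2 - t1, t4 - t3
print(f'domain lookup time : {lookup * 1000:7.1f} ms')
print(f'connect time       : {connect * 1000:7.1f} ms')
print(f'response time      : {response * 1000:7.1f} ms')   # first byte to last byte
# Per the definitions above, HTTP latency is the sum of the three stages.
print(f'HTTP latency (sum) : {(lookup + connect + response) * 1000:7.1f} ms')
```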
To dig even deeper, we need to talk about the hop-by-hop path that the end users are taking through the Internet in order to reach the SaaS. This brings us to the third topic. Traceroute is an ideal way to see into the hop-by-hop path taken by your users and customers to reach SaaS applications. If you're not familiar with traceroute, it's a tool used to figure out the path of routing hops along the way to a destination. It's important to understand that over time, the path taken to the same destination can change, which can obviously impact the performance of your connection to the same host. And keep in mind that the return path is not always the same coming back.

Let's take a closer look at traceroute. In this example, I ran a traceroute to baidu.com. At the very top, I typed tracert baidu.com. When this is executed, the first thing it does is go out to DNS to get the IP address for the domain. Then three traces are sent out, one at a time, each with a time to live (TTL) of one. This makes sure that the first router will respond instead of forwarding the packets to the next router. After each packet comes back, the trace sends out three more packets, this time each with a TTL of two, and so on. This continues until the destination is reached. Notice that for each hop, we receive a response time in milliseconds or an asterisk. The asterisk means that there was no response: either the router ignored the request or the packet was dropped.

So what I did here is run two tests to the same domain, baidu.com. This trace using ICMP had a total hop count of twenty-one, and this trace over TCP to the exact same destination had a total hop count of thirteen. This just goes to show that depending on the protocol and the time of day, the results may vary.
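For the curious, the TTL mechanics just described fit in a few lines. Here is a minimal ICMP traceroute sketch using the scapy library; raw packets require root privileges, and the destination is just an example:

```python
# Minimal ICMP traceroute sketch using scapy (pip install scapy).
# Sending raw packets requires root/administrator privileges.
from scapy.all import ICMP, IP, sr1

def traceroute(dst: str, max_hops: int = 30) -> None:
    for ttl in range(1, max_hops + 1):
        # TTL=1 makes the first router answer; TTL=2 the second, and so on.
        reply = sr1(IP(dst=dst, ttl=ttl) / ICMP(), timeout=2, verbose=0)
        if reply is None:
            print(f'{ttl:2d}  *')              # no response: ignored or dropped
        elif reply.type == 11:                 # ICMP time-exceeded from a router
            print(f'{ttl:2d}  {reply.src}')
        else:                                  # echo reply: destination reached
            print(f'{ttl:2d}  {reply.src}')
            break

traceroute('baidu.com')
```

A TCP variant would swap the ICMP() layer for something like TCP(dport=80, flags='S'), which is one reason the two hop counts above can differ.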
When you set up one of the monitors that I showed earlier, you can enable ping and traceroute toward the host, and this allows you to check the path of a connection when the latency of the application reaches a threshold. Let's take a look at what that looks like. Here, we're looking at a trace that is connecting to ESPN worldwide. During the day, you can see that the platform is executing a trace to the service every sixty seconds. As you move along the timeline, the traces executed at each interval are shown below. You can see autonomous system hops, as well as the dotted lines that indicate dropped traces. The red connection indicates excessive latency. Because you can identify the service providers responsible for a router that is consistently introducing problems, you can use this information as evidence that you need to find a new peering relationship, or possibly a different transit, to get around the problem hop.

As for the return path possibly taking a different route, Kentik's network of global and private agents means that you can test latency within the same cloud or between any other clouds in both directions, and that is for any application that your business relies on. Meshes of synthetic agents all testing back to one another can be set up across your global network, and again, they allow you to test connections in both directions. When a consistent, recurring problem is identified, the direction where the problem is introduced becomes obvious.

Before we wrap this up, let's talk about the insight a network observability platform can provide on an outage like the recent Facebook event. This outage impacted shareholders and the company's revenue by possibly as much as a hundred million dollars. According to Facebook, the outage was caused by, quote, configuration changes on the backbone routers, end quote. Notice in the BGP monitor that after a steady stream of BGP announcements and withdrawals, everything suddenly stopped; that was the outage. By collecting flow data from multiple regions, we could see the Facebook applications that were down for over five and a half hours. If your company depends on applications like WhatsApp or Instagram to reach customers, this outage impacted your business. Even prior to the release of our page load monitoring capability, you could see that although the IP address was responding to ping, the HTTP latency to the Facebook domain indicated the site was down, as seen from over a dozen monitors deployed all over the world. By monitoring the applications your business depends on, you can be proactive by reaching out to customers and employees to let them know that business operations will return to normal once Facebook is back online.

Alright, I'll start wrapping this up. When troubleshooting a latency problem, the app team or DevOps team needs to figure out whether the slowness with the digital service is caused by the application or the Internet. To verify that the problem isn't the network, the network observability cloud provides the performance dashboard. The first tab displays the status of the tests we discussed in today's webinar (BGP monitors, page load, HTTP, API, etc.) against all of the digital services your business depends on. The second tab displays the status of many of the most popular digital applications that you may not have thought to monitor. The third tab displays tests that Kentik is running within each of the major clouds, allowing you to verify the performance of their networks. And remember, Kentik has global agents installed and ready to help your business in all of the major cloud providers; private agents can also be deployed. The fourth tab allows you to stay on top of how reachable your prefixes are from the entire Internet. And the fifth tab provides stats on all major public DNS providers. The performance dashboard is your view into the status of the entire Internet. Alright, Jordan. That's all I have. Do we have any questions?

Thanks, Michael. That was a great presentation. A few questions came in via the chat, so let's open the Q&A panel now. Joining us, we also have Kevin Woods, head of product marketing at Kentik, to help with the questions. Our first question is: can you collect flow information? Kevin, why don't you take that one?

Thanks, Jordan. It wasn't a big focus in today's webinar, but NetFlow collection and analytics is a very big part of what Kentik does. And I want to point out there are some advantages to flow collection and flow analytics: it gives you good visibility into where your traffic is going, where it originates, and where it is exiting or destined in your network. Kentik can also enrich the flows with data such as which applications are associated with certain flows and which ISPs and other providers are in the path. So a lot of information can be provided through flows.

Awesome. Our next question is: does the system have any CDN monitoring capabilities?

Yes. When you select a CDN to monitor, the platform will use flow data to automatically find the top IP addresses that are currently being used on your network to reach that CDN. It's pretty sophisticated: every so many hours after that (the interval is configurable), it'll search the flow records again for the top ten IP addresses used to reach that CDN. And it'll tell you if the CDN becomes unreachable or if latency reaches a configurable, unacceptable level. So, yes, we have good CDN monitoring capabilities.
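As a rough illustration of that flow-driven discovery, here is a sketch in Python; the flow-record field names are hypothetical, not Kentik's actual schema:

```python
# Rough sketch of flow-based CDN endpoint discovery: every few hours,
# find the top destination IPs your network uses to reach a given CDN.
# Flow-record field names (cdn, dst_ip, bytes) are hypothetical.
from collections import Counter

def top_cdn_ips(flow_records, cdn_name, n=10):
    """Return the n busiest destination IPs for one CDN, ranked by byte count."""
    traffic = Counter()
    for flow in flow_records:
        if flow['cdn'] == cdn_name:
            traffic[flow['dst_ip']] += flow['bytes']
    return [ip for ip, _ in traffic.most_common(n)]

# These top IPs would then become the targets of availability and
# latency tests until the next rediscovery pass.
flows = [{'cdn': 'ExampleCDN', 'dst_ip': '198.51.100.7', 'bytes': 1_200_000},
         {'cdn': 'ExampleCDN', 'dst_ip': '198.51.100.9', 'bytes': 800_000}]
print(top_cdn_ips(flows, 'ExampleCDN'))
```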
Awesome. Thanks, Michael. Next: can you configure traceroute to use TCP, or is ICMP the only option? Could you take that, Michael?

Sure. Yes, ICMP and TCP are both configurable, and UDP is also supported.

Great. So what is the default TTL value for the traceroute feature?

Kevin, I'll take that one. I believe it's thirty, but that's configurable as well; you can make it something else.

Awesome. So if Kentik provides hundreds of global agents to be used by customers, why would someone want to deploy a private agent?

I'll take that one. There are cases, pretty frequently, where you want to perform testing from a very specific location and understand the performance metrics from within your own network. You may also be concerned with proactively monitoring the performance of a certain network path through a certain set of devices within your own network. And so you want to deploy a private agent to test those very specific locations in your own network.

Great. Thanks, Kevin. Since most traffic is encrypted, how does Kentik figure out the name of the application?

I'll take that one. We use several techniques that are proprietary to Kentik, some of which include looking at ASN information and DNS logs. We also rely on flow data. That's basically how we do it; hopefully, that's good enough.

Sounds great to me. And our last question is: how does the pricing work?

I'll take that one. Kentik is packaged as a software-as-a-service offering, and if you go to our website, there's a pricing page with prices listed. You can see three purchasable packages: Essentials, Pro, and Premier. They're organized to group together the most common features that different types of our customers use. I want to point out that each of those editions includes synthetic test credits, so you get some of the synthetic testing. The editions also package a number of other common features so that you get everything in that monthly or annual base price.

Awesome. Well, it looks like we've come up on time. We at Kentik appreciate everyone tuning in today, and we have covered quite a bit of material. This webinar was recorded, so you can review a replay of it on our website at kentik.com, and please share it with your interested colleagues. You'll get an email from me in the next couple of days so you can review the content, get a link to the recording, and get additional information if you'd like. If you have further questions, please email us at webinars@kentik.com. And please stay tuned for our next webinar, coming up in January, on what network observability is and why you should care. I hope to see you all there, and thanks again for joining.
Find out what NPM has become and why legacy solutions met their demise. Get in-depth details about these three must-have monitoring techniques:

- BGP route monitoring
- HTTP, API and page load testing
- Traceroute at scale
Presented by Michael Patterson, Kentik network technologist, founder and former CEO of Plixer
Watch this replay to learn why these technologies bring deeper visibility into how your company’s internet connections and applications are being impacted and by whom.
Identify the third parties to reach out to when problems occur and see where to focus your optimization efforts.



