Network performance monitoring (NPM) is the process of measuring, diagnosing and optimizing the service quality of a network as experienced by users. Network performance monitoring tools combine various types of network data (for example, packet data, network flow data, metrics from various types of network infrastructure devices, and synthetic tests) so that a network’s performance, availability and other important metrics can be analyzed.
NPM solutions may enable real-time, historical or even predictive analysis of a network’s performance over time. NPM solutions can also play a role in understanding the quality of end-user experience by using network performance data, especially data gathered from active, synthetic testing (in contrast to passive forms of network performance monitoring such as packet or flow data collection).
NPM requires multiple types of measurement or monitoring data on which engineers can perform diagnoses and analyses. Example categories of NPM monitoring data are:
Bandwidth: Measures the raw versus available maximum rate at which information can be transferred through various points of the network, or along a network path.
Throughput: Measures how much information is being or has been transferred.
Latency: Measures network delays from the perspective of clients, servers and applications.
Errors: Measures raw numbers and percentages of errors such as bit errors, TCP retransmissions, and out-of-order packets.
NPM solutions are sometimes referred to as “Network Performance Monitoring and Diagnostic” (NPMD) solutions. Most notably, industry analyst firm Gartner calls this the NPMD market which it defines (in the 2020 Market Guide for Network Performance Monitoring and Diagnostics) as “tools that leverage a combination of data sources. These include network-device-generated health metrics and events; network-device-generated traffic data (e.g., flow-based data sources); and raw network packets that provide historical, real-time and predictive views into the availability and performance of the network and the application traffic running on it.”
Network performance monitoring has traditionally drawn on data from sources including SNMP polling, traffic flow record export, and packet capture (PCAP) appliances. A host monitoring agent combined with a SaaS/big data back-end model provides an additional, more cloud-friendly approach. Modern NPM solutions can also ingest and analyze cloud flow logs created by cloud platforms such as AWS, Azure and Google Cloud.
SNMP is an IETF standard protocol and the most common method for gathering total bandwidth, utilization, available bandwidth, and error measurements on a per-interface basis. SNMP uses a polling-based approach via management information bases (MIBs) such as the standards-based MIB-II for TCP/IP-based networks. Typically, large networks poll only at five-minute intervals to avoid overloading the network with management data. A downside of SNMP polling is lack of granularity: multi-minute polling intervals can mask the bursty nature of network data flows, and interface counters only provide an interface-centric view.
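To make the polling arithmetic concrete, the following is a minimal sketch of how utilization might be derived from two successive readings of an interface octet counter. The function names and the sample numbers are illustrative assumptions, not part of any SNMP library; the only SNMP-specific behavior modeled is that counters are fixed-width and can wrap between polls.

```python
def counter_delta(prev: int, curr: int, bits: int = 64) -> int:
    """Difference between two counter readings, allowing for a single wrap."""
    if curr >= prev:
        return curr - prev
    # The counter rolled over past its maximum value between the two polls.
    return curr + (1 << bits) - prev


def utilization_pct(prev_octets: int, curr_octets: int,
                    interval_s: float, if_speed_bps: float,
                    bits: int = 64) -> float:
    """Average utilization (percent) over one polling interval."""
    octets = counter_delta(prev_octets, curr_octets, bits)
    bits_per_second = octets * 8 / interval_s
    return 100.0 * bits_per_second / if_speed_bps


# Hypothetical readings: a 1 Gbps interface polled 300 seconds apart.
print(utilization_pct(0, 3_750_000_000, 300, 1_000_000_000))
```

In this example, 3.75 GB transferred in 300 seconds averages out to 10% utilization, whether the traffic was spread evenly or arrived as a 30-second burst at line rate. The two cases are indistinguishable at this polling interval, which is the granularity problem described above.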
Traffic flow records are generated by routers, switches and dedicated software programs by monitoring key statistics for uni-directional “flows” of packets between specific source and destination IP addresses, protocols (TCP, UDP, ICMP), port numbers and ToS (plus other optional criteria). Every time a flow ends or hits a pre-configured timer limit, the flow statistics gathering is stopped and those statistics are written to a flow record, which is sent or “exported” to a flow collector server.
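The flow-tracking behavior described above can be sketched as a small in-memory flow cache. This is an illustration under stated assumptions, not how any particular router or exporter is implemented: the class name, the timeout defaults and the tuple layout of the flow key are all hypothetical, and flow termination signals such as TCP FIN or RST are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class FlowStats:
    first_seen: float
    last_seen: float
    packets: int = 0
    octets: int = 0


class FlowCache:
    """Tracks uni-directional flows keyed by
    (src_ip, dst_ip, protocol, src_port, dst_port, tos)."""

    def __init__(self, active_timeout: float = 60.0,
                 inactive_timeout: float = 15.0):
        self.active_timeout = active_timeout      # max lifetime of a flow entry
        self.inactive_timeout = inactive_timeout  # max idle time before export
        self.flows = {}      # key -> FlowStats for flows still being tracked
        self.exported = []   # (key, FlowStats) pairs "sent" to a collector

    def expire(self, now: float) -> None:
        """Export and remove every flow that has hit a timer limit."""
        for key, stats in list(self.flows.items()):
            idle = now - stats.last_seen
            age = now - stats.first_seen
            if idle >= self.inactive_timeout or age >= self.active_timeout:
                self.exported.append((key, stats))
                del self.flows[key]

    def observe(self, ts: float, key: tuple, length: int) -> None:
        """Account for one packet of `length` bytes seen at time `ts`."""
        self.expire(ts)
        stats = self.flows.get(key)
        if stats is None:
            stats = self.flows[key] = FlowStats(first_seen=ts, last_seen=ts)
        stats.last_seen = ts
        stats.packets += 1
        stats.octets += length


cache = FlowCache()
key = ("10.0.0.1", "10.0.0.2", "TCP", 12345, 443, 0)
cache.observe(0.0, key, 100)
cache.observe(1.0, key, 200)
cache.observe(20.0, key, 50)   # 19 s of idle time: the old entry is exported first
print(len(cache.exported), cache.flows[key].packets)
```

A real exporter would serialize each exported entry into a NetFlow, sFlow or IPFIX message rather than appending it to a list.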
There are several flow collection standards including NetFlow, sFlow and IPFIX. NetFlow was created by Cisco and has become a de facto industry standard. sFlow and IPFIX are multi-vendor standards, one governed by InMon and the other specified by the Internet Engineering Task Force (IETF).
Flow records are far more voluminous than SNMP records, but provide valuable details on actual flows of traffic. The statistics from flow records can be used to create a picture of actual throughput. Flow information can also be used to calculate interface utilization by reference to total interface bandwidth. Furthermore, since flow data must include source and destination IP addresses, it is possible to map recorded flows to routing data such as BGP routes and the internet paths they describe. This data integration is highly valuable for network performance monitoring because performance problems may correlate with particular networks (known as Autonomous Systems in BGP parlance) along an internet path.
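As a rough illustration of joining flow data with routing data, the sketch below maps flow destination addresses to an origin Autonomous System using longest-prefix matching, then totals bytes per AS. The route table, AS numbers and flow records are invented for the example (they use address ranges and AS numbers reserved for documentation); a production system would draw prefixes from live BGP feeds and use a trie rather than a linear scan.

```python
import ipaddress

# Hypothetical prefix -> origin ASN table, standing in for BGP-learned routes.
ROUTES = {
    "203.0.113.0/24": 64500,
    "198.51.100.0/24": 64501,
    "198.51.100.128/25": 64502,   # more specific route inside the /24 above
}


def origin_asn(ip: str, routes: dict = ROUTES):
    """Return the ASN of the longest matching prefix, or None if no route matches."""
    addr = ipaddress.ip_address(ip)
    best = None
    for prefix, asn in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, asn)
    return best[1] if best else None


def bytes_by_dst_asn(flows: list) -> dict:
    """Total flow bytes per destination ASN."""
    totals = {}
    for flow in flows:
        asn = origin_asn(flow["dst"])
        totals[asn] = totals.get(asn, 0) + flow["bytes"]
    return totals


flows = [
    {"dst": "203.0.113.9", "bytes": 1200},
    {"dst": "198.51.100.200", "bytes": 800},
    {"dst": "198.51.100.5", "bytes": 500},
    {"dst": "203.0.113.77", "bytes": 300},
]
print(bytes_by_dst_asn(flows))
```

Grouping traffic this way is what allows a latency or loss problem to be attributed to a specific network along the path rather than to an individual interface.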
NetFlow records statistics based only on packet headers, not on packet payload contents, so the information is metadata rather than payload data. In addition, while it is possible to measure every flow, most practical network implementations use some degree of “sampling,” where the NetFlow exporter only monitors one in a thousand or more flows. Sampling limits the fidelity of NetFlow data, but in a large network, even 1:8000 sampling is considered statistically accurate for network performance management purposes.
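A short sketch may help show why heavy sampling can still be usable on large networks. Sampled counts are scaled up by the sampling rate to estimate totals, and the expected error shrinks as the number of samples grows. The error bound used here (roughly 196 × √(1/c) percent at 95% confidence, where c is the number of samples in the class of traffic being measured) is a commonly cited approximation from packet-sampling literature and is included as an assumption, not as a property of any specific NetFlow implementation.

```python
import math


def estimate_totals(sampled_packets: int, sampled_bytes: int,
                    sampling_rate: int) -> tuple:
    """Scale 1-in-N sampled counts up to estimated totals."""
    return sampled_packets * sampling_rate, sampled_bytes * sampling_rate


def relative_error_pct(samples: int) -> float:
    """Approximate 95%-confidence error bound (percent) for `samples` samples.
    Assumed approximation: 196 * sqrt(1 / samples)."""
    return 196.0 * math.sqrt(1.0 / samples)


# Hypothetical interval: 250 sampled packets totalling 150 kB at 1:8000 sampling.
print(estimate_totals(250, 150_000, 8000))
print(round(relative_error_pct(250), 1), round(relative_error_pct(10_000), 2))
```

The error depends on how many samples were collected, not on the sampling ratio itself, so a busy network that still yields thousands of samples per interval at 1:8000 can produce tighter estimates than a quiet one sampled at 1:100.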
Similar to flow records generated by network infrastructure components, cloud-based applications, systems, and virtual private clouds can also export network flow data. For example, in AWS (Amazon Web Services) virtual private clouds can be configured to capture and export “VPC Flow Logs” which provide information about the IP traffic going to and from network interfaces in a given VPC.
As in NetFlow-type sampling, VPC Flow Logs record a sample of network flows sent from and received by various cloud infrastructure components (such as virtual machine instances, Kubernetes nodes, etc.) and these can be ingested by an NPM solution to provide monitoring and analytics for cloud-based networks.
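As an illustration of what ingesting such a log might involve, the sketch below parses a single record in the AWS default (version 2) flow log format, which is a space-separated line of fourteen fields. The helper names are hypothetical, the sample record is modeled on the style of record shown in AWS documentation, and custom log formats with additional fields are not handled.

```python
# Field order of the AWS default (version 2) VPC Flow Log format.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)
INT_FIELDS = ("version", "srcport", "dstport", "protocol",
              "packets", "bytes", "start", "end")


def parse_record(line: str) -> dict:
    """Parse one default-format flow log line into a dict."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    for name in INT_FIELDS:
        # "-" marks a field with no data (for example NODATA or SKIPDATA records).
        record[name] = None if record[name] == "-" else int(record[name])
    return record


def average_bps(record: dict):
    """Average bits per second over the record's capture window, if computable."""
    if record["bytes"] is None or record["start"] is None or record["end"] is None:
        return None
    duration = record["end"] - record["start"]
    return record["bytes"] * 8 / duration if duration > 0 else None


sample = ("2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
          "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK")
rec = parse_record(sample)
print(rec["srcaddr"], rec["dstport"], rec["action"], average_bps(rec))
```

Once parsed, these records carry the same kind of source, destination, port, protocol and volume information as device-generated flow records, so they can feed the same analytics.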
Packet capture involves the recording of every packet that passes across a particular network interface. With PCAP data, the information collected is granular, since it includes both packet headers and full payload. Since an interface will see packets going in and out, PCAP can be used to precisely measure latency between an outbound packet and its inbound response, for example. PCAP provides the richest source of network performance data.
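The latency measurement mentioned above can be sketched as pairing each outbound packet with its inbound response and subtracting timestamps. The code below is a simplified illustration that works on already-decoded packet summaries, not on a real capture file: the `(timestamp, direction, key)` tuple layout is an assumption, and the matching key stands in for whatever ties a response to its request in practice (for example a TCP SYN and its SYN-ACK, or an ICMP echo identifier and sequence number).

```python
def response_latencies(packets: list) -> list:
    """Pair outbound packets with inbound responses sharing the same key.

    `packets` is an iterable of (timestamp_seconds, direction, key) tuples,
    ordered by time, where direction is "out" or "in".
    Returns a list of (key, latency_seconds) pairs.
    """
    pending = {}      # key -> timestamp of the first unanswered outbound packet
    latencies = []
    for ts, direction, key in packets:
        if direction == "out":
            # Keep the earliest timestamp so retransmissions do not hide delay.
            pending.setdefault(key, ts)
        elif direction == "in" and key in pending:
            latencies.append((key, ts - pending.pop(key)))
    return latencies


# Hypothetical decoded capture: two requests and their responses.
capture = [
    (1.000, "out", ("10.0.0.5", 51000, "192.0.2.10", 443)),
    (1.250, "in",  ("10.0.0.5", 51000, "192.0.2.10", 443)),
    (2.000, "out", ("10.0.0.5", 51001, "192.0.2.10", 443)),
    (2.500, "in",  ("10.0.0.5", 51001, "192.0.2.10", 443)),
]
for key, latency in response_latencies(capture):
    print(key, f"{latency * 1000:.0f} ms")
```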
PCAP can be performed using software utilities such as tcpdump and Wireshark on an individual server. For a skilled technician, this can be a very effective way to understand network performance issues. However, since it is a manual process that requires fairly in-depth knowledge of the utilities, it is not a very scalable approach.
To improve on this manual approach, an appliance-based PCAP probe may be used. The probe has multiple interfaces connected to router or switch SPAN ports or to an intervening packet broker device (such as those offered by Gigamon or Ixia). In some cases, virtual probes can be used, but they are dependent on network links in one form or another.
A major downside to PCAP appliances is the expense of deployment. Physical and virtual appliances are costly from a hardware and (in the case of commercial solutions) software licensing point of view. As a result, in most cases, it is only fiscally feasible to deploy PCAP probes to relatively few, selected points in the network. In addition, the appliance deployment model was developed based on pre-cloud assumptions of centralized data centers of limited scale, holding relatively monolithic application instances.
As cloud and distributed application models have proliferated, the appliance model for packet capture has become less feasible, both because application components are widely distributed across VMs or containers and because many cloud hosting environments offer no way to deploy even a virtual appliance.
A cloud-friendly and highly scalable model for network performance management relies on lightweight host-based monitoring agents that export PCAP-based statistics gathered on servers and on open-source proxy servers such as HAProxy and NGINX. Exported statistics are sent to a SaaS repository that scales horizontally to store unsummarized data and provides big data-based analytics for alerting, diagnostics and other use cases.
While host-based performance metric export doesn’t provide the full granularity of raw PCAP, it provides a highly scalable and cost-effective method for ubiquitously gathering, retaining and analyzing key performance data, and thus complements PCAP. An example of a host-based NPM agent is Kentik’s kprobe.
Increasingly, modern Network Performance Monitoring solutions are incorporating synthetic monitoring features, which are traditionally associated with a process/market called “Digital Experience Monitoring”. In contrast to flow or packet capture (which we might characterize as passive forms of monitoring), synthetic monitoring is a means of proactively tracking the performance and health of networks, applications and services.
In the networking context, synthetic monitoring means imitating different network conditions and/or simulating differing user conditions and behaviors. Synthetic monitoring achieves this by generating different types of traffic (e.g., network, DNS, HTTP, web, etc.), sending it to a specific target (e.g., IP address, server, host, web page, etc.), measuring metrics associated with that “test” and then building KPIs using those metrics.
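The generate, send, measure and summarize cycle described above can be sketched with a very simple synthetic test: repeatedly opening a TCP connection to a target and recording how long the handshake takes. This is a minimal illustration using only the Python standard library; the function names, probe count and the choice of TCP connect time as the metric are assumptions for the example, and commercial synthetic monitoring agents measure far more (DNS resolution, HTTP timing, page load, path traces).

```python
import socket
import time


def tcp_connect_ms(host: str, port: int, timeout: float = 2.0):
    """Time one TCP handshake to (host, port) in milliseconds, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers timeouts, refused connections and name resolution failures.
        return None


def run_test(host: str, port: int, probes: int = 5) -> dict:
    """Send several probes and reduce the raw samples to simple KPIs."""
    samples = [tcp_connect_ms(host, port) for _ in range(probes)]
    ok = [s for s in samples if s is not None]
    return {
        "sent": probes,
        "loss_pct": 100.0 * (probes - len(ok)) / probes,
        "avg_ms": sum(ok) / len(ok) if ok else None,
        "max_ms": max(ok) if ok else None,
    }


if __name__ == "__main__":
    # Hypothetical target; replace with a host and port you are permitted to test.
    print(run_test("example.com", 443))
```

Running such a test on a schedule from several vantage points, and alerting when loss or latency KPIs drift from their baseline, is what makes the approach proactive rather than dependent on real user traffic.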
NPM data sources are not limited to the types discussed above and may encompass many types of events, device metrics, streaming telemetry and contextual information.
Kentik offers the industry’s only big data-based, SaaS network observability solution that integrates network agent performance metrics with billions of NetFlow, sFlow, IPFIX, cloud flow log, and BGP records matched with geolocation and other forms of enrichment data. Kentik’s solution also incorporates synthetic monitoring features that allow for proactive monitoring of all types of networks.