Network Performance Monitoring (NPM) refers to the process of measuring, diagnosing and optimizing the service quality of a network as experienced by users. Network Performance Monitoring tools combine various types of network data (for example packet data, network flow data and metrics from various types of network infrastructure devices) so that a network’s performance, availability and other important metrics can be analyzed. NPM solutions may enable real-time, historic or even predictive analysis of a network’s performance over time.
NPM requires multiple types of measurement or monitoring data on which engineers can perform diagnoses and analyses. Example categories of NPM monitoring data are:
Bandwidth: Measures the raw versus available maximum rate that information can be transferred though various points of the network, or along a network path.
Throughput: Measures how much information is being or has been transferred.
Latency: Measures network delays from the perspective of clients, servers and applications.
Errors: Measures raw numbers and percentages of errors such as Bit Errors, TCP retransmissions, and out-of-order packets
NPM solutions are sometimes referred to as “Network Performance Monitoring and Diagnostic” (NPMD) solutions. (Most notably, industry analyst firm Gartner calls this the NPMD market.)
Network Performance Monitoring has traditionally drawn on data from sources including SNMP polling, traffic flow record export, and packet capture (PCAP) appliances. A host monitoring agent combined with a SaaS/Big Data back-end model provides an additional, more cloud-friendly approach.
SNMP is an IETF standard protocol, the most common method for gathering total bandwidth, utilization, available bandwidth, and error measurements on a per-interface basis. SNMP utilizes a polling-based approach via Management Information Bases (MIBs) such as the standards-based SNMP MIB II for TCP/IP-based networks. Typically, large networks only poll in five minute intervals to avoid overloading the network with management data. A downside of SNMP polling is lack of granularity, since multi-minute polling intervals can mask the bursty nature of network data flows, and interface counters only provide an interface-centric view.
Traffic flow records are generated by routers, switches and dedicated software programs by monitoring key statistics for uni-directional “flows” of packets between specific source and destination IP addresses, protocols (TCP, UDP, ICMP), port numbers and ToS (plus other optional criteria). Every time a flow ends or hits a pre-configured timer limit, the flow statistics gathering is stopped and those statistics are written to a flow record, which is sent or “exported” to a flow collector server.
There are several flow collection standards including NetFlow, sFlow and IPFIX. NetFlow is the trade version created by Cisco and has become a defacto industry standard. sFlow and IPFIX are multi-vendor standards, one governed by InMon and the other specified by the Internet Engineering Task Force (IETF).
Flow records are far more voluminous than SNMP records, but provide valuable details on actual flows of traffic. The statistics from flow records can be utilized to create a picture of actual throughput. Flow information can also be used to calculate interface utilization by reference to total interface bandwidth. Furthermore, since flow data must include source and destination IP addresses, it is possible to map recorded flows to routing data such as BGP routing Internet paths. This data integration is highly valuable for network performance monitoring because the network or Internet path may correlate to performance problems occurring in particular networks (known as Autonomous Systems in BGP parlance) that comprise an Internet path.
NetFlow records statistics based only on the packet headers—and not on any packet data payload contents, so the information is meta data, rather than payload data. Secondly, while it is possible to measure every flow, most practical network implementations utilize some degree of “sampling” where the NetFlow exporter only monitors one in a thousand or more flows. Sampling limits the fidelity of NetFlow data, but in a large network, even 1:8000 sampling is considered statistically accurate for network performance management purposes.
Packet Capture involves the recording of every packet that passes across a particular network interface. With PCAP data, the information collected is granular, since it includes both packet headers and full payload. Since an interface will see packets going in and out, PCAP can be used to precisely measure latency between an outbound packet and its inbound response, for example. PCAP provides the richest source of network performance data.
PCAP can be performed using software utilities such as TCPDUMP and Wireshark on an individual server. For a skilled technician, this can be a very effective way to understand network performance issues. However, since it is a manual process, and requires fairly in-depth knowledge of the utilities, it is not a very scalable approach.
To improve on this manual approach, an appliance-based PCAP probe may be used. The probe has multiple interfaces connected to router or switch span ports or to an intervening packet broker device (such as those offered by Gigamon or Ixia). In some cases, virtual probes can be used, but they are dependent on network links in one form or another. A major downside to PCAP appliances is the expense of deployment. Physical and virtual appliances are costly from a hardware and (in the case of commercial solutions) software licensing point of view. As a result, in most cases, it is only fiscally feasible to deploy PCAP probes to relatively few, selected points in the network. In addition, the appliance deployment model was developed based on pre-cloud assumptions of centralized datacenters of limited scale, holding relatively monolithic application instances. As cloud and distributed application models have proliferated, the appliance model for packet capture is less feasible, because of the wide distribution of application components in VMs or containers, and because of the fact that in many cloud hosting environments, there is no way to deploy even a virtual appliance.
A cloud-friendly and highly scalable model for Network Performance Monitoring combines the deployment of lightweight host-based monitoring agents that export PCAP-based statistics gathered on servers and open source proxy servers such as HAProxy and NGNIX. Exported statistics are sent to a SaaS repository that scales horizontally to store unsummarized data and provides Big Data-based analytics for alerting, diagnostics and other use cases. While host-based performance metric export doesn’t provide the full granularity of raw PCAP, it provides a highly scalable and cost-effective method for ubiquitously gathering, retaining and analyzing key performance data, and thus complements PCAP. An example of a host-based NPM agent is nProbe™.
Kentik offers the industry’s only Big Data-based, SaaS NPM solution that integrates nProbe host agent performance metrics and billions of NetFlow, sFlow, IPFIX, BGP records matched with geolocation data. Visit the Kentik NPM solution brief to get an overview or start a free trial to try it yourself.
Updated: April 5, 2019