Overview of NetFlow Troubleshooting

NetFlow was originally developed to help network administrators gain a better end-to-end understanding of their network traffic. Once NetFlow is enabled on a router or other network device, it tracks unidirectional packet flow statistics for TCP/IP, UDP/IP or ICMP sessions, without storing any of the payload data carried in those sessions. By tracking only the metadata about the flows, NetFlow preserves highly useful traffic analysis and troubleshooting details without requiring full packet capture, which is very I/O- and storage-intensive.
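To make the idea of unidirectional, metadata-only flow tracking concrete, here is a minimal sketch in Python. It assumes a simplified 5-tuple flow key; real exporters key on additional fields, and the names FlowKey, FlowStats and account are hypothetical, not part of any NetFlow implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowKey:
    """Simplified 5-tuple identifying one direction of a conversation (assumption)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


@dataclass
class FlowStats:
    """Counters only: no payload bytes are ever kept."""
    packets: int = 0
    bytes: int = 0


def account(cache, key, length):
    """Update the counters for one observed packet of the given length."""
    stats = cache.setdefault(key, FlowStats())
    stats.packets += 1
    stats.bytes += length
    return stats


cache = {}
request = FlowKey("10.0.0.5", "192.0.2.10", 51514, 443, "tcp")
reply = FlowKey("192.0.2.10", "10.0.0.5", 443, 51514, "tcp")
account(cache, request, 120)
account(cache, request, 80)
account(cache, reply, 1500)
```

Because the key includes direction, the request and the reply of the same session end up as two separate flow entries.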

When combined and correlated with SNMP device and interface data, BGP, and performance metrics, NetFlow can be used to monitor and diagnose a variety of network issues.

NetFlow Troubleshooting Use Cases

NetFlow supports many network troubleshooting use cases, including:

  • Congestion
  • Application performance issues
  • DDoS attacks
  • Network security anomalies

1)   Troubleshoot network congestion problems:

  • Identify traffic bottlenecks at source/destination ports, interfaces and IP addresses by comparing traffic levels to interface capacity/bandwidth.
  • Drill into flow details to find the top contributing flows and who or what is generating them.
  • Determine whether those flows are anomalous or valid.
  • Compare top contributing flows to other timeframes for context.
  • Perform ad-hoc grouping of BGP routing, interface, port, IP, geolocation and other fields to find commonalities that will shed light on other aspects of the root causes, such as which applications, servers, user groups, or locations are factors.
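The first two steps above can be sketched as follows. This is an illustrative Python example, not a description of any particular tool: the record fields (out_if, src_ip, dst_ip, dst_port, bytes), the interface names and the capacities are all assumptions, and the byte counts are assumed to be unsampled.

```python
from collections import defaultdict


def interface_utilization(flows, capacity_bps, interval_s):
    """Return {interface: fraction of capacity used} over one collection interval."""
    bytes_per_if = defaultdict(int)
    for f in flows:
        bytes_per_if[f["out_if"]] += f["bytes"]
    return {
        name: (total * 8 / interval_s) / capacity_bps[name]
        for name, total in bytes_per_if.items()
    }


def top_contributors(flows, interface, n=3):
    """Return the n largest (src_ip, dst_ip, dst_port) groups on one interface."""
    totals = defaultdict(int)
    for f in flows:
        if f["out_if"] == interface:
            totals[(f["src_ip"], f["dst_ip"], f["dst_port"])] += f["bytes"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical flow records collected over a 60-second interval.
flows = [
    {"out_if": "eth0", "src_ip": "10.0.0.5", "dst_ip": "192.0.2.10",
     "dst_port": 443, "bytes": 4_500_000},
    {"out_if": "eth0", "src_ip": "10.0.0.7", "dst_ip": "192.0.2.10",
     "dst_port": 443, "bytes": 1_500_000},
    {"out_if": "eth1", "src_ip": "10.0.0.9", "dst_ip": "198.51.100.4",
     "dst_port": 53, "bytes": 75_000},
]
capacity_bps = {"eth0": 1_000_000, "eth1": 1_000_000}
utilization = interface_utilization(flows, capacity_bps, interval_s=60)
busiest = max(utilization, key=utilization.get)
```

Running top_contributors(flows, busiest) then lists the flows to drill into, and the same call over an earlier interval gives the comparison for context.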

2)   Troubleshoot application performance issues:

  • Identify which applications and protocols are consuming your network bandwidth by analyzing the source and destination IPs, ports and protocols.
  • Track the cumulative usage of a given application in aggregate, for example down to a specific region or country.
  • Analyze network performance by using metrics exported from packet capture exporters like nProbe™. Compare client versus server latency and TCP retransmits to find anomalies that indicate a network issue versus a server/application issue.
  • If using a per-server agent such as nProbe, look at destination IP + exporting server to see if there is any correlation of performance issues to a particular server.
  • Look for correlation to destination networks, geography, ASN, interfaces, etc. to find potential root causes.
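As a rough illustration of the bandwidth-by-application steps above, the sketch below maps flows to applications by protocol and destination port and optionally breaks the totals down by another field. The port table is deliberately tiny, and the field names (proto, dst_port, bytes, country) are assumptions about how enriched flow records might look.

```python
from collections import Counter

# Minimal well-known port table; a real deployment would use a fuller mapping.
APP_BY_PORT = {
    ("tcp", 443): "https",
    ("udp", 53): "dns",
    ("tcp", 22): "ssh",
}


def usage_by_app(flows, group_field=None):
    """Sum bytes per application, optionally split by one extra field."""
    totals = Counter()
    for f in flows:
        app = APP_BY_PORT.get((f["proto"], f["dst_port"]), "other")
        key = (app, f[group_field]) if group_field else app
        totals[key] += f["bytes"]
    return totals


# Hypothetical flow records enriched with a geolocation field.
flows = [
    {"proto": "tcp", "dst_port": 443, "bytes": 1000, "country": "DE"},
    {"proto": "tcp", "dst_port": 443, "bytes": 500, "country": "US"},
    {"proto": "udp", "dst_port": 53, "bytes": 60, "country": "DE"},
    {"proto": "tcp", "dst_port": 8081, "bytes": 10, "country": "DE"},
]
per_app = usage_by_app(flows)
per_app_country = usage_by_app(flows, group_field="country")
```

Passing a different group_field (for example an exporting server or ASN field, if present) gives the other correlations mentioned in the list.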

3)   Defend against DDoS attacks:

  • Detect a sudden overall rise in network traffic that departs from baseline behavior.
  • Identify which resource is being hit.
  • Look for unusual numbers of sending source IPs for evidence of botnets.
  • Look for traffic from unusual or known bad source geographies or ASNs.
  • Trigger automated mitigation if available (such as via RTBH, mitigation appliances or cloud services).
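The detection steps above might look like this in a minimal form. The threshold rule (mean plus k standard deviations) is one simple choice of baseline test, picked here for illustration; the function names and record fields are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def exceeds_baseline(history_bps, current_bps, k=3.0):
    """True if current traffic is more than k standard deviations above the mean.

    With a perfectly flat history the deviation is zero, so any rise is flagged.
    """
    threshold = mean(history_bps) + k * pstdev(history_bps)
    return current_bps > threshold


def sources_per_target(flows):
    """Count distinct source IPs seen per destination IP."""
    sources = defaultdict(set)
    for f in flows:
        sources[f["dst_ip"]].add(f["src_ip"])
    return {dst: len(srcs) for dst, srcs in sources.items()}


# Hypothetical traffic history (bits per second) and current flow records.
history = [100, 110, 90, 100]
flows = [
    {"src_ip": "203.0.113.1", "dst_ip": "192.0.2.10"},
    {"src_ip": "203.0.113.2", "dst_ip": "192.0.2.10"},
    {"src_ip": "203.0.113.3", "dst_ip": "192.0.2.10"},
    {"src_ip": "203.0.113.1", "dst_ip": "192.0.2.20"},
]
counts = sources_per_target(flows)
most_targeted = max(counts, key=counts.get)
```

A destination with a sudden jump in distinct sources, combined with a baseline violation, is a reasonable candidate for the mitigation step.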

4)   Analyze network security anomalies:

  • Baseline traffic volume between top subnet pairs and alert on new pairs.
  • Check for traffic to or from known bad IPs based on threat feeds from AlienVault, etc.
  • Identify unusual traffic peaks for unknown IPs, unusual ports, known bad ports and IANA reserved IPs.
  • Identify unusual numbers of flows from one host to many on the same port.
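Two of the checks above, new subnet pairs and one-to-many fan-out on a single port, can be sketched with the Python standard library. The /24 grouping, the min_targets threshold and the record fields are assumptions chosen for illustration.

```python
import ipaddress
from collections import defaultdict


def subnet_pair(flow, prefix_len=24):
    """Map a flow to its (source subnet, destination subnet) pair."""
    src = ipaddress.ip_network(f"{flow['src_ip']}/{prefix_len}", strict=False)
    dst = ipaddress.ip_network(f"{flow['dst_ip']}/{prefix_len}", strict=False)
    return (str(src), str(dst))


def new_subnet_pairs(baseline_pairs, flows, prefix_len=24):
    """Return subnet pairs seen in flows that are absent from the baseline."""
    return {subnet_pair(f, prefix_len) for f in flows} - set(baseline_pairs)


def fan_out(flows, min_targets=3):
    """Find (source, port) combinations that reach many distinct destinations."""
    targets = defaultdict(set)
    for f in flows:
        targets[(f["src_ip"], f["dst_port"])].add(f["dst_ip"])
    return {k: len(v) for k, v in targets.items() if len(v) >= min_targets}


# Hypothetical records: one host touching several peers on port 22.
flows = [
    {"src_ip": "10.0.1.5", "dst_ip": "10.0.2.1", "dst_port": 22},
    {"src_ip": "10.0.1.5", "dst_ip": "10.0.2.2", "dst_port": 22},
    {"src_ip": "10.0.1.5", "dst_ip": "10.0.3.9", "dst_port": 22},
    {"src_ip": "10.0.1.8", "dst_ip": "10.0.2.1", "dst_port": 443},
]
baseline = {("10.0.1.0/24", "10.0.2.0/24")}
unseen = new_subnet_pairs(baseline, flows)
scanners = fan_out(flows)
```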

NetFlow Sampling

NetFlow and other flow-based analysis solutions generate flow records in proportion to the volume of traffic flows. Generating a UDP-based flow record for every flow can create a lot of telemetry data, typically around 1% of operational traffic, which is significant overhead. Not all troubleshooting use cases require 1:1 flow export.

Fortunately, NetFlow, J-Flow and IPFIX provide for flow sampling, whereby exporting devices can be configured to sample 1:N flows (1 in every N) to reduce telemetry traffic volume. For most network operations and DDoS detection/defense purposes, sampling anywhere from 1:1000 to 1:8000+ flows (depending on overall traffic volume) provides sufficient insight. For network security purposes, if NetFlow must support detailed forensics or a full audit of traffic, then 1:1 flow export is required.

Since different portions of the network handle different volumes and types of traffic, it is possible to sample at different rates on different exporters. For example, internet-facing edge routers or datacenter core routers that handle huge volumes of traffic can be configured to sample heavily (a large N in 1:N). Routers and switches at the aggregation layer, where security anomalies become apparent, don’t handle nearly as much traffic and can be configured for 1:1 export to provide full granularity of insight.
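When exporters sample at different rates, their counters have to be scaled by each exporter's own N before they can be compared or combined. The sketch below shows that normalization in Python; the exporter names, rates and record fields are hypothetical, and the scaled values are estimates of the true traffic rather than exact counts.

```python
# Hypothetical per-exporter sampling configuration: N in 1:N.
SAMPLING_RATE = {
    "edge-1": 4000,  # high-volume internet edge, heavily sampled
    "agg-1": 1,      # aggregation layer, every flow exported
}


def estimated_bytes_by_exporter(records, sampling_rate):
    """Scale sampled byte counts by each exporter's rate to estimate real volume."""
    totals = {}
    for r in records:
        n = sampling_rate[r["exporter"]]
        totals[r["exporter"]] = totals.get(r["exporter"], 0) + r["bytes"] * n
    return totals


# Hypothetical sampled records as received by a collector.
records = [
    {"exporter": "edge-1", "bytes": 1500},
    {"exporter": "agg-1", "bytes": 1500},
    {"exporter": "agg-1", "bytes": 500},
]
estimates = estimated_bytes_by_exporter(records, SAMPLING_RATE)
```

The heavier the sampling, the coarser the estimate for small flows, which is why the 1:1 exporters remain the better source for forensic detail.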

Factors Limiting NetFlow Troubleshooting Effectiveness

NetFlow troubleshooting is most effective when sufficient detail is available and can be compared with other data points such as performance metrics, routing and location. Unfortunately, until recently the state of the art in NetFlow analysis tools has limited troubleshooting effectiveness because of data reduction. Even with sampling, flow records can add up to a lot of data. Since most NetFlow collectors and analysis tools are based on scale-up software architectures hosted on single servers or appliances, they have extremely limited storage, compute and memory capacity. As a result, the common practice is to roll up the details into a series of summary reports and to discard the raw flow records. The problem with this, of course, is that most of the detail needed for operationally useful troubleshooting is lost.
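A small Python sketch can show why roll-ups lose troubleshooting detail. It compares a predefined summary with an ad hoc query over retained raw records; the function names and record fields are hypothetical and do not describe any specific collector.

```python
from collections import Counter


def rollup(flows, key):
    """Pre-aggregate bytes by one field; every other field is discarded."""
    totals = Counter()
    for f in flows:
        totals[f[key]] += f["bytes"]
    return dict(totals)


def ad_hoc(flows, keys, where=lambda f: True):
    """Group raw records by any combination of fields, with an optional filter."""
    totals = Counter()
    for f in flows:
        if where(f):
            totals[tuple(f[k] for k in keys)] += f["bytes"]
    return dict(totals)


# Hypothetical raw flow records.
raw = [
    {"src_ip": "10.0.0.5", "dst_ip": "192.0.2.10", "dst_port": 443, "bytes": 900},
    {"src_ip": "10.0.0.5", "dst_ip": "192.0.2.20", "dst_port": 25, "bytes": 100},
    {"src_ip": "10.0.0.7", "dst_ip": "192.0.2.10", "dst_port": 443, "bytes": 300},
]

# The summary answers only the question it was built for: bytes per source.
summary = rollup(raw, "src_ip")

# With raw records, a new question can still be asked after the fact.
smtp_by_pair = ad_hoc(raw, ["src_ip", "dst_ip"], where=lambda f: f["dst_port"] == 25)
```

Once only the summary is kept, the port 25 traffic from 10.0.0.5 is indistinguishable from the rest of that host's bytes.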

Cloud-scale computing and big data techniques have opened up a great opportunity to improve both the cost and functionality of NetFlow analysis and troubleshooting. The market has long embraced SaaS as a delivery model for advanced products and capabilities, and it’s now possible to apply this cost-effective approach to network traffic visibility and analytics solutions.

Big data storage makes it possible to retain huge volumes of augmented raw flow records instead of rolling up the data into predefined aggregates that severely restrict analytical options. SaaS options save network managers from incurring CAPEX and OPEX costs related to dedicated, on-premises appliances. Scale-out NetFlow analysis can deliver faster response times to operational analysis queries on larger data sets than traditional appliances.

More Reading

To learn more about troubleshooting with NetFlow, read about Kentik’s flagship product, Kentik Detect, and our approach to big data NetFlow analysis. Also, check out this blog post on maximizing the value of network metadata, or view some of the video presentations and demos of Kentik Detect on our NFD12 info hub.