Comparing flow protocols for real-world large-scale networks
A lot of ink has been spilled over the years on the topic of flow protocols, specifically how they work and their relative accuracy. Historically, however, most of the testing, opinion, and coverage has been based on enterprise use cases and fairly low-bandwidth assumptions. In this post we’ll take a new look, focusing instead on use cases and bandwidths that are more representative of — and relevant to — large internet edge and datacenter operations.
One of the things that can be rather confusing is that there are a lot of different flow protocol names. Behind the many variants there are actually only two major technologies related to capturing and recording traffic-flow metadata. The first is based on stateful flow tracking, and the other is based on packet sampling. Both approaches are explained below. But first, based on that fundamental distinction, let’s classify common flow protocols accordingly:
- Stateful flow tracking:
– NetFlow: Originally developed by Cisco in 1996. The most used versions are v5 and v9.
– NetFlow by another name: Other vendors support NetFlow but call it something else, including J-Flow (Juniper), RFlow (Redback/Ericsson), cFlowd (Alcatel), Netstream (3Com/HP/Huawei).
– IPFIX: The IETF standards-based successor to NetFlow, sometimes referred to as NetFlow v10.
- Packet sampling:
– sFlow: Launched in 2003 by a multi-vendor group (sflow.org) that now includes Alcatel, Arista, Brocade, HP, and Hitachi.
So what’s the difference between the stateful flow tracking of NetFlow and the packet sampling of sFlow? A while back, Kentik’s CEO Avi Freedman wrote an excellent two-part blog post on flow protocol extensibility, so I’ll leverage his words:
Routers and switches running NetFlow/IPFIX designate a collection of packets as a flow by tracking packets, typically looking for packets that come from and go to the same place and share the same protocol, source and dest IP address, and port numbers. This tracking requires CPU and memory — in some circumstances, a huge amount of it. For example, with a forged source-address DDoS attack, every packet can be a flow, and routers have to try to maintain massive tables on the fly to track those flows! Also, to cut down on CPU and network bandwidth, flows are usually only “exported” on average every 10 seconds to a few minutes. This can result in very bursty traffic on sub-minute time scales.
sFlow, on the other hand, is based on interface counters and flow samples created by the network management software of each router or switch. The counters and packet samples are combined into “sFlow datagrams” that are sent across the network to an sFlow collector. The preparation of sFlow datagrams doesn’t require aggregation and the datagrams are streamed as soon as they are prepared. So while NetFlow can be described as observing traffic patterns (“How many buses went from here to there?”), with sFlow you’re just taking snapshots of whatever cars or buses happen to be going by at that particular moment. That takes less work, meaning that the memory and CPU requirements for sFlow are less than for NetFlow/IPFIX.
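The contrast described above can be sketched in a few lines of Python. This is only an illustrative model, not how any router or sFlow agent is actually implemented: the names `Packet`, `track_flows`, `sample_packets`, and `estimate_bytes` are hypothetical, the traffic is synthetic, and real exporters add flow timeouts, export logic, and interface counters that are left out here.

```python
import random
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    src_ip: str
    dst_ip: str
    protocol: int
    src_port: int
    dst_port: int
    size: int  # bytes


def track_flows(packets):
    """Stateful tracking: one table entry per distinct 5-tuple."""
    table = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.protocol, p.src_port, p.dst_port)
        table[key]["packets"] += 1
        table[key]["bytes"] += p.size
    return dict(table)


def sample_packets(packets, rate, rng):
    """Packet sampling: keep roughly 1 in `rate` packets, with no per-flow state."""
    return [p for p in packets if rng.randrange(rate) == 0]


def estimate_bytes(samples, rate):
    """Scale the sampled byte count back up by the sampling rate."""
    return sum(p.size for p in samples) * rate


# Synthetic traffic (assumed values): one long-lived flow of 10,000 packets,
# and 10,000 packets that each carry a different forged source.
normal = [Packet("10.0.0.1", "10.0.0.2", 6, 40000, 443, 1000)] * 10_000
forged = [
    Packet(f"198.51.100.{i % 250}", "10.0.0.2", 17, 1024 + i, 53, 100)
    for i in range(10_000)
]

normal_flows = track_flows(normal)  # a single table entry
forged_flows = track_flows(forged)  # one table entry per packet

rng = random.Random(0)
samples = sample_packets(normal, 100, rng)
estimated = estimate_bytes(samples, 100)  # approximates the true 10,000,000 bytes

print(len(normal_flows), len(forged_flows), len(samples), estimated)
```

In this toy model the flow table grows with the number of distinct 5-tuples, which is why forged-source traffic is so expensive to track, while the sampler's memory cost depends only on how many packets it happens to pick.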
Most published tests and blog posts have historically focused on two major points of comparison between NetFlow and sFlow:
- Aggregate traffic volume accuracy on a per interface basis.
- Per-host traffic volume accuracy as compositional elements of the interface traffic.
These tests have mostly been performed at very low traffic volumes (under 10Mbps in multiple tests). In those test scenarios, it’s been repeatedly observed that aggregate interface volumes are essentially the same for both protocols. Drilling down into the host-level details, however, tests with traffic rates in the Kbps range often indicate that NetFlow more reliably captures traffic flow statistics at greater granularity.
This observation makes sense when traffic volume is low, because in that situation it’s not too taxing for NetFlow to collect every flow (as distinct from sampling the flows) and thereby reliably capture all flow statistics. If you’re looking to examine traffic coming from individual client machines in an SMB or even an enterprise network setting, then that increased granularity is helpful.
At higher traffic volumes, however, network engineers and operators are focused on different issues, and the need for per-flow granularity starts to fade significantly. For example, if you’re operating a network for an ISP, cloud or hosting provider, or Web enterprise, it wouldn’t be unusual to push tens to hundreds of Gbps through your Internet edge devices. For large carrier ISP operations located in major metropolitan areas, with highly consolidated IP edge points of presence, it’s fairly commonplace for individual routers to push 1Tbps.
When you’re dealing with Internet edge networks at this scale, Kbps-level traffic coming from an individual host somewhere typically doesn’t matter enough to track individually. Instead, your primary concerns are with traffic engineering, with sudden volumetric shifts that can clog pipes and reduce service performance, and with volumetric DDoS attacks that most commonly range from single digits to the low tens of Gbps. If you’re a hosting, cloud, or Web provider, the types of hosts you’re concerned with are servers, which are each usually pushing traffic at tens to hundreds of Mbps.
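The gap between Kbps-level hosts and servers pushing tens to hundreds of Mbps can be made concrete with a rule of thumb often cited in sFlow documentation: at 95% confidence, the relative error of a sampled estimate is at most about 196 × sqrt(1/c) percent, where c is the number of samples seen for the traffic class in question. The sketch below applies that bound under assumed values (a 60-second interval, 1,000-byte average packets, and 1-in-1,000 sampling). The function names and all of the numbers are illustrative, not measurements from any real network.

```python
import math


def expected_samples(rate_bps, interval_s, avg_packet_bytes, sampling_rate):
    """Expected number of sampled packets for a host over one interval."""
    packets = rate_bps * interval_s / (8 * avg_packet_bytes)
    return packets / sampling_rate


def percent_error_95(samples):
    """Approximate upper bound on relative error (%) at 95% confidence."""
    if samples <= 0:
        return math.inf
    return 196.0 * math.sqrt(1.0 / samples)


# Assumed measurement parameters (hypothetical).
INTERVAL_S = 60
AVG_PACKET_BYTES = 1000
SAMPLING_RATE = 1000  # 1-in-1,000 packets

# A client at 100 Kbps versus a server at 100 Mbps.
host_samples = expected_samples(100e3, INTERVAL_S, AVG_PACKET_BYTES, SAMPLING_RATE)
server_samples = expected_samples(100e6, INTERVAL_S, AVG_PACKET_BYTES, SAMPLING_RATE)

host_error = percent_error_95(host_samples)
server_error = percent_error_95(server_samples)

print(f"host:   {host_samples:.2f} samples, error bound {host_error:.0f}%")
print(f"server: {server_samples:.0f} samples, error bound {server_error:.1f}%")
```

Under these assumptions the low-rate host yields less than one expected sample per minute, so its estimate is essentially noise, while the server yields hundreds of samples and a single-digit percentage error bound. That is consistent with the observation that sampling looks weak in Kbps-range tests yet holds up for the hosts that matter at the Internet edge.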
Our SaaS platform, Kentik Detect, collects and stores upwards of 125B flow records per day from 100+ customers operating networks whose use cases are weighted toward high-volume Internet edge and east-west traffic, both intra-datacenter and inter-datacenter. Based on their feedback and our own observation and analysis, our take is that for these applications the accuracy of sFlow and NetFlow/IPFIX is essentially the same. In addition, since packet samples are sent immediately while NetFlow records can queue for minutes before being sent, sFlow has the advantage of delivering telemetry data with lower latency.
Whether you’re running an enterprise, web company, cloud/hosting, ISP, or mobile/telco network, you are likely running significant volumes of traffic, which makes it critical to be able to retain flow records and analyze them in detail at scale. Most flow collectors of any note can handle a fairly high level of streaming ingest, but that’s where their scalability ends. With single-server or appliance systems, detail at the flow-record level isn’t retained for long, if at all. Instead you get summaries of top talkers, etc. We talk to tons of users of these legacy tools and they tell us what we already know, which is that those few summary reports are neither comprehensive nor flexible enough to be informative about anything below the surface.
Kentik was created to solve this problem. Our platform ingests massive volumes of flows, correlates them into a time-series with BGP, GeoIP, performance, and other data, and retains records for 90 days (more by arrangement). You can perform ad-hoc analysis on billions of rows of data without any predefined limits, using multiple group-by dimensions and unlimited nested filtering. You can run queries and get answers back in a few seconds, pivot your analysis, drill down, zoom in or out, and keep getting answers fast until you get just the data you need to make a decision. There’s much more to our solution; for additional information check out our website’s product pages.
If you’re not yet familiar with how Kentik Detect applies the power of big data to flow analysis, we’d love to show you; contact us at firstname.lastname@example.org or via our web chat and we can schedule a demo to walk you through it. Or you can dive in directly by starting a free trial; within 15 minutes you can be in the Kentik Detect portal looking at traffic on your own network. Either way, if you’re operating a serious network then Kentik’s scale, granularity, and power will enable you to see, understand, and respond to every aspect of your traffic.