Understanding Big Data sFlow Collectors: A Tutorial
Overview of Big Data sFlow Collectors
sFlow is a multi-vendor protocol, created by InMon Corporation, that is used to record statistical, infrastructure, routing, and other information about IP traffic traversing an sFlow-enabled router or switch. An sFlow collector is one of three typical functional components used for sFlow analysis:
- sFlow Exporter: an sFlow-enabled router, switch, probe, or host software agent that tracks key statistics and other information about IP traffic and generates flow records that are encapsulated in UDP and sent to a flow collector.
- sFlow Collector: an application responsible for receiving flow record packets, ingesting the data from the flow records, and pre-processing and storing flow records from one or more flow exporters.
- sFlow Analyzer: a software application that provides tabular, graphical and other tools and visualizations to enable network operators and engineers to analyze flow data for various use cases, including network performance monitoring, troubleshooting, and capacity planning.
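To make the exporter-to-collector exchange more concrete, the following minimal Python sketch decodes the fixed header that begins an sFlow version 5 datagram. The field order follows the published sFlow version 5 datagram layout; the function name `parse_sflow_header`, the dictionary keys, and the hand-built example datagram are illustrative assumptions rather than part of any particular collector. The samples that follow the header are not decoded here.

```python
import ipaddress
import struct

SFLOW_VERSION = 5
ADDRESS_LENGTHS = {1: 4, 2: 16}  # agent address type: 1 = IPv4, 2 = IPv6


def parse_sflow_header(datagram: bytes) -> dict:
    """Decode the fixed header of an sFlow version 5 datagram.

    All integer fields are big-endian unsigned 32-bit values.
    """
    version, addr_type = struct.unpack_from("!II", datagram, 0)
    if version != SFLOW_VERSION:
        raise ValueError(f"unsupported sFlow version: {version}")
    if addr_type not in ADDRESS_LENGTHS:
        raise ValueError(f"unknown agent address type: {addr_type}")

    offset = 8
    addr_len = ADDRESS_LENGTHS[addr_type]
    agent = ipaddress.ip_address(datagram[offset:offset + addr_len])
    offset += addr_len

    sub_agent_id, sequence, uptime_ms, num_samples = struct.unpack_from(
        "!IIII", datagram, offset
    )
    return {
        "agent": str(agent),
        "sub_agent_id": sub_agent_id,
        "sequence": sequence,
        "uptime_ms": uptime_ms,
        "num_samples": num_samples,
        "samples_offset": offset + 16,  # where the first sample would begin
    }


# Hand-built header for illustration: agent 192.0.2.1, sequence 42, no samples.
example = struct.pack("!II4sIIII", 5, 1, bytes([192, 0, 2, 1]), 0, 42, 60000, 0)
header = parse_sflow_header(example)
```

A real collector would typically receive these datagrams on a UDP socket (port 6343 is the one conventionally used for sFlow) and then walk the samples starting at `samples_offset`.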
sFlow Collector Deployment Models
There are multiple deployment models for utilizing sFlow collectors. The first model runs the sFlow collector application on dedicated hardware-based computing resources—typically a rackmount server appliance. This model is the most constrained because it requires deployment of hardware to scale as flow record volume increases.
The second model is virtualized: sFlow collectors are deployed as dedicated virtual versions of classic sFlow collector appliances. A virtual sFlow collector adds deployment flexibility because it can run on either private or cloud-based virtualized servers. It also allows collectors to be spun up on demand, though in the vast majority of use cases flow record volume is fairly constant, so capacity planning for sFlow does not usually require bursting additional collectors.
One key similarity between physical and virtual sFlow collectors is that they are generally designed in a monolithic fashion, which restricts their scalability and functional range. Each collector must:
- Ingest flow UDP datagrams from one or more sFlow-enabled devices
- Unpack binary flow data into text/numeric formats
- Store the resulting data in per-appliance flat files or a SQL database instance
- Synchronize flow data to the sFlow analyzer application running on a separate computing resource
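The unpack and store steps in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fixed-size record layout (`RECORD`) is hypothetical and is not the sFlow wire format, the table name `raw_flows` is invented for the example, and an in-memory SQLite database stands in for the per-appliance SQL instance.

```python
import ipaddress
import sqlite3
import struct

# Hypothetical record layout, for illustration only (NOT the sFlow wire format):
# timestamp (uint32), source IPv4 (4 bytes), destination IPv4 (4 bytes), bytes (uint32)
RECORD = struct.Struct("!I4s4sI")


def unpack_record(raw: bytes) -> tuple:
    """Turn one binary record into text/numeric fields."""
    ts, src, dst, nbytes = RECORD.unpack(raw)
    return ts, str(ipaddress.IPv4Address(src)), str(ipaddress.IPv4Address(dst)), nbytes


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the single local database that a monolithic collector would own."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS raw_flows "
        "(ts INTEGER, src TEXT, dst TEXT, bytes INTEGER)"
    )
    return db


def store_records(db: sqlite3.Connection, payloads) -> int:
    """Unpack each binary payload and insert the rows in one transaction."""
    rows = [unpack_record(p) for p in payloads]
    with db:  # commits on success, rolls back on error
        db.executemany("INSERT INTO raw_flows VALUES (?, ?, ?, ?)", rows)
    return len(rows)
```

Because ingest, decoding, and storage all share one process and one local database in this sketch, every step competes for the same machine's resources, which is the constraint the monolithic design imposes.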
This monolithic design places severe constraints on how appliances can deal with high volumes of flow data. As a result, selected data from the raw flows is rolled up into a number of summary tables. Raw flow data is retained for a short window and then discarded, both to save storage and because a single appliance’s compute power can post-process only a small amount of raw data at any time.
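The roll-up and discard cycle can be illustrated with a small Python and SQLite sketch. The table names (`raw_flows`, `hourly_by_src`), the two-hour retention window, and the choice of per-source hourly totals are all assumptions made for the example; actual appliances choose their own summaries and windows.

```python
import sqlite3

HOUR = 3600
RAW_RETENTION_SECONDS = 2 * HOUR  # assumed retention window for raw rows


def create_schema(db: sqlite3.Connection) -> None:
    db.execute("CREATE TABLE raw_flows (ts INTEGER, src TEXT, dst TEXT, bytes INTEGER)")
    db.execute("CREATE TABLE hourly_by_src (hour INTEGER, src TEXT, total_bytes INTEGER)")


def rollup_and_prune(db: sqlite3.Connection, now: int) -> int:
    """Summarize raw rows older than the retention window, then delete them.

    Returns the number of raw rows discarded.
    """
    # Align the cutoff to an hour boundary so each hour is summarized in one pass
    # (this assumes no records arrive late for an hour that was already rolled up).
    cutoff = (now - RAW_RETENTION_SECONDS) // HOUR * HOUR
    with db:  # roll-up and delete happen in one transaction
        db.execute(
            """
            INSERT INTO hourly_by_src (hour, src, total_bytes)
            SELECT ts - ts % 3600, src, SUM(bytes)
            FROM raw_flows
            WHERE ts < ?
            GROUP BY ts - ts % 3600, src
            """,
            (cutoff,),
        )
        deleted = db.execute(
            "DELETE FROM raw_flows WHERE ts < ?", (cutoff,)
        ).rowcount
    return deleted
```

Note what the summary no longer contains: once the raw rows are deleted, the destination address is gone, so any question that was not anticipated by a summary table can no longer be answered.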
Big Data sFlow Collector: Designed Differently
A big data sFlow collector takes a different architectural approach. In this model, clusters of computing and storage resources can be scaled out independently for different purposes. For example, a big data platform can allocate a scale-out cluster just to ingest and pre-process flow data in a way that preserves all raw flow fields. Rather than appliance-by-appliance local storage, a separate storage cluster can be used, which allows for deep retention of raw records. Another cluster can be used to perform queries against the storage layer on behalf of a GUI or API calls. This approach allows capacity to be scaled flexibly to meet stringent performance requirements for queries against large-scale data sets, even as flow record ingest volumes and analysis query rates grow significantly.
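As a rough illustration of the scale-out idea, the Python sketch below spreads raw flow records across several storage shards and answers a query by combining partial results from every shard. It is a toy model under stated assumptions: in-process lists stand in for storage nodes, the class name `ShardedFlowStore` and the record keys (`exporter`, `src`, `bytes`) are invented for the example, and simple hash-modulo placement is used where a production system would more likely rely on a scheme that tolerates resizing, such as consistent hashing.

```python
import hashlib
from collections import Counter


class ShardedFlowStore:
    """Toy stand-in for a storage cluster that keeps every raw record."""

    def __init__(self, num_shards: int):
        self.shards = [[] for _ in range(num_shards)]

    def shard_for(self, exporter: str) -> int:
        """Pick a shard from a stable hash of the exporter address."""
        digest = hashlib.sha256(exporter.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def ingest(self, record: dict) -> None:
        """Ingest tier: route the full raw record to its shard, nothing is rolled up."""
        self.shards[self.shard_for(record["exporter"])].append(record)

    def bytes_by_src(self) -> Counter:
        """Query tier: compute a partial result per shard, then merge them."""
        total = Counter()
        for shard in self.shards:
            partial = Counter()
            for record in shard:
                partial[record["src"]] += record["bytes"]
            total.update(partial)  # Counter.update adds counts together
        return total


store = ShardedFlowStore(num_shards=4)
store.ingest({"exporter": "192.0.2.1", "src": "10.0.0.1", "dst": "10.0.0.9", "bytes": 1000})
store.ingest({"exporter": "192.0.2.2", "src": "10.0.0.1", "dst": "10.0.0.8", "bytes": 500})
store.ingest({"exporter": "192.0.2.2", "src": "10.0.0.2", "dst": "10.0.0.9", "bytes": 500})
totals = store.bytes_by_src()
```

Because each shard computes its partial result independently, the per-shard loop is the part that a real query cluster could run in parallel, and adding shards increases both storage depth and query capacity.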
Big data architectures can be built on a variety of open source software platforms such as Hadoop, ELK (Elasticsearch, Logstash, Kibana), and other stacks. There is a major difference between big data sFlow collection and analysis built for real-time/operational use cases and systems built for post-process/planning-only use cases. Operational use of sFlow data requires both high scale and low latency at all functional points: ingest scale, time to query, and query response. Planning-only use cases may not require low latency at any functional point of the collection and analysis/query stages.
Big data sFlow collectors and analysis engines can be built for single-tenant or multi-tenant use. Most open source big data platforms are built as single-tenant engines, whereas SaaS big data engines require multi-tenancy.
Big data sFlow collectors are designed to meet the scale, flexibility, and response time needs of network operators and planners. Kentik offers the industry’s only SaaS-based, big data sFlow, NetFlow, and IPFIX analysis solution built for network operations speed and scale. To learn more about Kentik Detect, download the Kentik Detect overview white paper or visit the Kentik Detect product overview.