Choosing the right big data engine for NetFlow data analysis can be a challenge given the many alternatives that are offered in the market. In order to assess various big data alternatives the following key requirements need to be considered that have a high correlation to NetFlow analysis needs:
There are multiple big data technologies on which network managers can build NetFlow analysis functionality, including Hadoop, ELK, and Google BigQuery.
Hadoop: Hadoop is an Apache open source software framework written in Java that allows distributed processing of large datasets across computer clusters supporting various programming models and tools. It’s composed of two parts: data storage and data processing. The distributed storage part is handled by the Hadoop Distributed File System (HDFS). The distributed processing part is achieved via MapReduce. Hadoop is licensed under the Apache License 2.0.
When evaluating Hadoop against the key NetFlow analysis requirements list it falls short in multiple ways. The first is that it does not have any tools that help with data modeling that support the data analysis required when processing NetFlow records. Some vendors have implemented cubes to fill this gap, enabling a dataset to be depicted in a multi-dimensional manner in a cube format, for slicing and dicing, to see more granular detail. Cubes first rose to prominence in the 1990s, so they can be seen as one of the first pervasive forms of analytical data modeling.
Cubes fall short for very large volumes of data and when the data model needs to change in real-time, which is typical in NetFlow analysis applications, affecting the overall responsiveness of the solution. In general, the Hadoop stack was designed for batch processing, and is inappropriate for network operations use cases that require real-time response.
ELK: The ELK stack is a set of open source analytics tools. ELK is an acronym for a collection of three open-source products: Elasticsearch, Logstash, and Kibana. Elasticsearch is a NoSQL database that is based on the Apache Lucerne search engine. Logstash is a log pipeline tool that accepts inputs from various sources, executes various transformations, and exports the data to various targets. Kibana is a visualization layer that works on top of Elasticsearch.
When evaluating the ELK stack against the key NetFlow analysis requirements list it falls short in several areas such as:
Google BigQuery: BigQuery is a RESTful web service that enables interactive analysis of large datasets working in conjunction with Google Storage. It is an Infrastructure as a Service (IaaS) offered by Google. When evaluating BigQuery against the key NetFlow analysis requirements list its data throughput volumes falls short at 100K records per second given that a large operator network can generate tens of millions of NetFlow records per second.
In order to meet all the above requirements for a NetFlow big data back-end, Kentik chose to create a purpose-built big data engine that leverages the following key elements, all of which are critical to a successful implementation:
Effective real-time network visibility is a never-ending and ever-growing challenge that has outpaced the rate of innovation by traditional network management vendors and products. Kentik has pioneered a clustered big data approach that leapfrogs the scalability, flexibility, and cost-effectiveness barriers that have long limited legacy approaches.
Using any of the standard big data distributions (such as Hadoop, ELK, and BigQuery) can lead to partial success against the needs for truly effective real-time NetFlow analysis, but typically at a TCO level that is unacceptably high for any organization lacking legions of experience programmers, coupled with unlimited systems resources.