In this post, learn what a UDR is, how it benefits machine learning, and what it has to do with networking. Analyzing multiple databases using multiple tools on multiple screens is slow, tedious, and error-prone. Yet that’s exactly how many network operators perform analytics on the telemetry they collect from switches, routers, firewalls, and so on. A unified data repository brings all of that data together in a single database that can be organized, secured, queried, and analyzed more effectively than is possible with disparate tools.
What is a UDR (Unified Data Repository)?
A unified data repository (UDR) is a centralized storage system that consolidates, organizes, and manages data from different sources. It serves as a single point of access for many types of data, which in an extensive network can mean a significant volume of data in very diverse formats.
Why do we need a UDR?
In a typical network, so many different types of devices and services produce some form of telemetry that it is easy to end up with disparate visibility tools and databases. In fact, this is very common in network operations centers. It is a problem because having many separate databases makes it difficult for a network operations team to collect, organize, secure, and, most importantly, analyze all the data simultaneously.
We want to be able to do that simply because application delivery doesn’t rely on just one type of network device or part of a network infrastructure. Instead, application delivery touches a massive number of devices, network-adjacent devices and services, the public internet itself, and so on. Therefore, we have to analyze all of this data as a whole to truly understand application performance over the network.
The alternative is to have separate visibility tools that a human engineer looks at one at a time, searching for meaningful correlation and deriving insights in their own head. This process of manual clue-chaining is tedious and error-prone at best, and impossible to do at scale. The problem will only get worse as networks keep changing and new forms of telemetry are collected.
The primary purpose of a UDR is to help solve this problem and enable more comprehensive data analysis across data types, formats, and sources.
How a UDR benefits machine learning
Machine learning relies on large amounts of data to build models, make predictions, find correlations, and ensure the accuracy of its results. ML is inherently data-driven, so it’s vital to implement the right data management strategy for data analysis to be successful.
UDRs can be implemented using various technologies, such as data warehouses (for structured data), data lakes (for raw data, including unstructured data), or hybrid solutions that combine both. The choice of technology largely depends on an organization’s specific needs, data types, volume, and the desired level of scalability and flexibility.
A UDR plays a critical role in machine learning projects by:
- Providing high-quality, consistent data: A UDR streamlines data management and improves data quality, ensuring that machine learning models can access accurate and consistent data for training and validation. Because applications rely on so many devices and services, we must analyze their telemetry together, using consistent units and scaling.
- Accelerating the training process: By consolidating data into a single location, a UDR reduces the time spent on data collection and preprocessing, allowing engineers, data scientists, and ML practitioners to focus on developing and optimizing the data analysis workflow.
- Enhancing model performance: With access to a wide range of diverse data in a single database, ML models can be trained on more representative samples, leading to better generalization, better predictions, and improved performance in real-world scenarios. This matters especially in networking, where engineers care about understanding trends, seasonality, and the predictive capacity of a visibility tool.
- Facilitating collaboration: A UDR enables data scientists, engineers, and others to collaborate more effectively on data analytics projects by providing a centralized data source, reducing the risk of duplicated efforts or inconsistent results. This is a direct benefit to operational teams running networks day-to-day.
Remember that most network telemetry is unstructured and unlabeled data. This means that before certain ML models can be applied, the data must first be labeled, organized, and structured. To do this, we use a pre-processing workflow that standardizes and normalizes data of different types, formats, and scales.
Once this is complete, the UDR unifies this now normalized, standardized, and scaled data so that a single tool can perform automated data analysis efficiently, accurately, and much faster than a human engineer. That’s when we can start identifying correlations and patterns among multiple data types, from different sources, over time.
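The pre-processing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`ifInOctets_per_s`, `mbps`, `exporter`), the sample values, and the unified record schema are all hypothetical, and z-score standardization is just one common way to put values from different tools on a shared scale.

```python
from statistics import mean, pstdev

# Hypothetical raw telemetry from two separate tools, each with its own
# field names and units (these shapes are assumptions for illustration).
snmp_samples = [
    {"device": "edge-1", "ts": 1700000000, "ifInOctets_per_s": 1_250_000},
    {"device": "edge-1", "ts": 1700000060, "ifInOctets_per_s": 2_500_000},
    {"device": "edge-1", "ts": 1700000120, "ifInOctets_per_s": 3_750_000},
]
flow_samples = [
    {"exporter": "edge-1", "time": 1700000000, "mbps": 8.0},
    {"exporter": "edge-1", "time": 1700000060, "mbps": 22.0},
    {"exporter": "edge-1", "time": 1700000120, "mbps": 30.0},
]


def to_unified(source, device, ts, metric, value):
    """Build one record in the (hypothetical) unified schema."""
    return {"source": source, "device": device, "ts": ts,
            "metric": metric, "value": value}


def normalize_snmp(sample):
    # Convert octets per second to megabits per second.
    mbps = sample["ifInOctets_per_s"] * 8 / 1_000_000
    return to_unified("snmp", sample["device"], sample["ts"],
                      "throughput_mbps", mbps)


def normalize_flow(sample):
    # Already in Mbit/s; only the field names need to change.
    return to_unified("netflow", sample["exporter"], sample["time"],
                      "throughput_mbps", sample["mbps"])


def standardize(records):
    """Attach a per-source z-score so values from different tools share a scale."""
    values_by_source = {}
    for record in records:
        values_by_source.setdefault(record["source"], []).append(record["value"])
    stats = {src: (mean(vals), pstdev(vals))
             for src, vals in values_by_source.items()}

    standardized = []
    for record in records:
        mu, sigma = stats[record["source"]]
        zscore = (record["value"] - mu) / sigma if sigma else 0.0
        standardized.append({**record, "zscore": zscore})
    return standardized


unified = standardize(
    [normalize_snmp(s) for s in snmp_samples]
    + [normalize_flow(s) for s in flow_samples]
)
```

After this step, every record has the same keys and the same unit, regardless of which tool produced it, which is the property a single analysis tool needs.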
How a UDR benefits networking
For networking, a unified database means the algorithms have the data necessary to make accurate predictions and find meaningful correlations across data types and formats.
In other words, the more unified the data, the better the analytics and the better the results.
For example, a typical network operations center might have one tool to collect NetFlow information, another for SNMP, another for packet analysis, another for cloud logs, and so on. Each of these tools, though potentially excellent on its own, has its own separate underlying database. This means it’s up to a human engineer to log into each tool, look for the relevant data, then log into the next tool and search for the related data again.
This is not only error-prone but also incredibly tedious and slow, even in a medium-sized network environment. Therefore, we can improve day-to-day network operations using a UDR and appropriate data analysis workflows. A UDR will also allow ML models to make better predictions. Instead of identifying seasonality with only one data type, such as NetFlow, an ML model operating from a unified database of multiple forms of telemetry can go beyond seeing NetFlow seasonality and programmatically identify how NetFlow seasonality relates to a specific application’s latency, possible security vulnerabilities, and so on.
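As a rough illustration of relating one telemetry type to another, the sketch below aligns two series on a shared timestamp and computes their Pearson correlation. The series names and values are made up for the example, and a plain correlation coefficient stands in for whatever model a real platform would use.

```python
from math import sqrt

# Hypothetical per-minute series keyed by timestamp offset in seconds.
# In a unified repository both would share the same time key.
netflow_mbps = {0: 100.0, 60: 180.0, 120: 260.0, 180: 340.0, 240: 120.0}
app_latency_ms = {0: 20.0, 60: 28.0, 120: 36.0, 180: 44.0, 300: 50.0}


def align(series_a, series_b):
    """Keep only timestamps present in both series, in time order."""
    shared = sorted(series_a.keys() & series_b.keys())
    return [series_a[t] for t in shared], [series_b[t] for t in shared]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


traffic, latency = align(netflow_mbps, app_latency_ms)
r = pearson(traffic, latency)
```

With these invented numbers the two series rise together over the four shared timestamps, so `r` comes out at approximately 1.0. The point is that the join on a common key is trivial once the data lives in one place; with separate tools, that alignment is the manual step an engineer would otherwise do by eye.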
Kentik’s UDR - Kentik Network Observability Platform
Kentik uses a custom-built scalable columnar datastore called Kentik Data Engine. KDE ingests flow data (e.g., NetFlow or sFlow), correlates it with additional data such as GeoIP and BGP, and stores it (redundantly) in flow records that can be queried from the Kentik portal via Kentik APIs or a fully ANSI SQL-compliant individual PostgreSQL interface.
KDE keeps a discrete set of databases for the flow records of each customer. These databases are made up of “Main Tables” for the flow records and associated data received for each device as well as for additional data learned from the flow data. The columns of those Main Tables contain the data that Kentik queries, and most of these columns are represented as dimensions used for filtering and group-by in queries.
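To make the idea of dimensions used for filtering and group-by concrete, here is a generic sketch using Python's built-in `sqlite3` module. It is not Kentik's schema or API: the table name, column names, and rows are hypothetical, and SQLite is a row-oriented store used here only because it needs no setup.

```python
import sqlite3

# In-memory database standing in for a table of enriched flow records.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE flow_records (
        device      TEXT,     -- exporting device (hypothetical column)
        src_country TEXT,     -- enrichment learned from GeoIP (hypothetical)
        dst_asn     INTEGER,  -- enrichment learned from BGP (hypothetical)
        protocol    TEXT,
        bytes       INTEGER
    )
    """
)

# Made-up rows; the ASNs come from the range reserved for documentation.
rows = [
    ("edge-1", "US", 64500, "TCP", 1200),
    ("edge-1", "US", 64500, "TCP", 800),
    ("edge-1", "DE", 64501, "UDP", 500),
    ("edge-2", "US", 64501, "TCP", 300),
]
conn.executemany("INSERT INTO flow_records VALUES (?, ?, ?, ?, ?)", rows)

# Filter on one dimension (protocol) and group by two others.
top_talkers = conn.execute(
    """
    SELECT src_country, dst_asn, SUM(bytes) AS total_bytes
    FROM flow_records
    WHERE protocol = 'TCP'
    GROUP BY src_country, dst_asn
    ORDER BY total_bytes DESC
    """
).fetchall()
```

Because the GeoIP and BGP attributes sit in the same record as the flow counters, a single query can filter and aggregate across all of them without joining data from separate tools.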
Everything ingested or learned by backend data analytics is stored in the KDE, making it the unified data repository for the Kentik platform. This unified, data-driven approach to network observability allows advanced analytics across data sources and types in a single platform.
Watch the Data-Driven Network Observability presentation from Networking Field Day 31 to learn more.