In part 2 of this series, I talked about the range of network devices and observation points that generate telemetry data. Over time, this range has expanded, and networks are more diverse than ever. All of our operational concerns (planning, running, and fixing) need to be coordinated across the complete variety of networks that affect our traffic.
In this blog, I discuss the telemetry data itself. Telemetry is the key to seeing, and seeing is the first step in the practice of observability.
The wonderful thing about network telemetry is that there are so many types. That same variety is what makes it challenging to get started on the network observability journey!
Historically, many systems have taken one or two types of telemetry to answer a more limited set of questions. However, with modern data systems and techniques, it’s possible to take a broader set of telemetry, which opens up an even wider set of use cases and questions that can be answered.
Information on the flow of traffic across networks. In the classic sense this means NetFlow, sFlow, IPFIX, or an equivalent. In the modern sense it also includes cloud VPC Flow Logs, traffic data in JSON and other interchange formats, and data from service meshes and from service and security proxies.
Wire data can also help as a type of traffic data, but we see almost all cloud-focused customers using traffic summaries (flow) because packet observation is difficult to scale in distributed networks.
However you get it, traffic is the key “what is” that shows you what users and applications are up to and how they’re interacting with the network!
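As a rough illustration of how flow summaries answer “who is talking to whom” questions, here is a minimal Python sketch that totals bytes per source/destination pair. The record fields (`src`, `dst`, `bytes`) and the sample values are hypothetical. Real NetFlow, IPFIX, or VPC Flow Log exports carry many more fields and need a collector to decode them first.

```python
from collections import Counter

# Hypothetical, already-decoded flow records (documentation address ranges).
flows = [
    {"src": "192.0.2.10", "dst": "198.51.100.5", "bytes": 1200},
    {"src": "192.0.2.10", "dst": "198.51.100.5", "bytes": 800},
    {"src": "192.0.2.11", "dst": "203.0.113.9", "bytes": 500},
]


def top_talkers(records, n=10):
    """Return the n (src, dst) pairs that carried the most bytes."""
    totals = Counter()
    for record in records:
        totals[(record["src"], record["dst"])] += record["bytes"]
    return totals.most_common(n)


print(top_talkers(flows, n=2))
```

The same grouping idea extends to ports, protocols, or interfaces by changing the key used in the counter.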
Questions you can answer with traffic telemetry:
Questions you can’t answer with traffic alone:
Typically expressed as metrics, exposing state of the physical and logical network elements.
This typically covers high-level stats about both the control and forwarding planes, though usually not deep telemetry on the traffic flowing across the network. Historically this data came from the CLI, then mostly from SNMP, later with API access added, and more recently the energy has been around streaming telemetry.
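To make “metrics exposing state” concrete, here is a small sketch of one common calculation: link utilization derived from two polls of an interface octet counter (for example, a 64-bit counter such as `ifHCInOctets` polled over SNMP). The function name and parameters are my own for illustration, and the sketch assumes at most one counter wrap between polls.

```python
def utilization_pct(octets_prev, octets_now, interval_s, speed_bps, counter_bits=64):
    """Percent utilization of a link between two polls of an octet counter."""
    if interval_s <= 0 or speed_bps <= 0:
        raise ValueError("interval_s and speed_bps must be positive")
    # Modular subtraction tolerates a single counter wrap between polls.
    delta_octets = (octets_now - octets_prev) % (2 ** counter_bits)
    # Octets -> bits, divided by the bits the link could have carried.
    return delta_octets * 8 / (interval_s * speed_bps) * 100


# 375 MB received in 60 seconds on a 1 Gbps link.
print(utilization_pct(0, 375_000_000, 60, 1_000_000_000))
```

Note what this does not tell you: which users or applications generated those octets. That is the job of traffic telemetry.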
Questions you can answer with device telemetry alone:
Questions you can’t answer with device telemetry alone:
Updates about events occurring from the point of view of the network elements, including alerts such as threshold violations for temperature, CPU, optics, wireless/radio interfaces, or other element health, as well as config changes and routing session state. Typically such notifications are sent via syslog or SNMP trap.
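As a small example of turning raw events into something you can query, the sketch below decodes the priority value at the front of a syslog message into its facility and severity (the priority is the facility multiplied by 8, plus the severity). The sample message is made up for illustration, and the function name is my own.

```python
import re

# Severity names in numeric order, 0 (most severe) through 7.
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]
PRI_RE = re.compile(r"^<(\d{1,3})>")


def decode_pri(message):
    """Return (facility, severity_name) from a syslog line, or None if absent."""
    match = PRI_RE.match(message)
    if not match:
        return None
    pri = int(match.group(1))
    if pri > 191:  # 23 * 8 + 7 is the largest valid priority
        return None
    facility, severity = divmod(pri, 8)
    return facility, SEVERITIES[severity]


# Hypothetical message, shown only to exercise the parser.
sample = "<189>Oct 11 22:14:15 rtr1 %LINEPROTO-5-UPDOWN: Line protocol on Gi0/1 changed state to down"
print(decode_pri(sample))
```

Once severity is a field rather than a string buried in a log line, filtering and alerting on it becomes straightforward.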
Questions you can answer with events alone:
Questions you can’t answer with events alone:
Measurements from testing using “synthetic” traffic — apart from actual user traffic. While synthetics can be triggered or collected via device telemetry interfaces, they are actually a broader category spanning client and server endpoints, network elements, and internet-wide locations performing network and application-layer testing.
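Here is a minimal sketch of what a synthetic test can look like: time a TCP handshake to a target, repeat it, and summarize loss and latency. The function names are my own, and this is only one simple kind of probe. Real synthetic testing also covers ICMP, DNS, HTTP, and full application transactions.

```python
import socket
import statistics
import time


def tcp_connect_ms(host, port, timeout=2.0):
    """Time one TCP connection setup in milliseconds; None if it fails."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000
    except OSError:  # includes timeouts and name resolution failures
        return None


def summarize(samples):
    """Summarize probe results, where None marks a failed probe."""
    ok = [s for s in samples if s is not None]
    loss = 100 * (len(samples) - len(ok)) / len(samples) if samples else 0.0
    return {
        "sent": len(samples),
        "loss_pct": loss,
        "median_ms": statistics.median(ok) if ok else None,
    }
```

A run might look like `summarize([tcp_connect_ms("example.com", 443) for _ in range(5)])`, scheduled from several vantage points so the results can be compared.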
Questions you can answer with synthetics alone:
Questions you can’t answer with synthetics alone:
Dynamic updates and/or routing table state specifically for the routes or paths, determined and propagated by the network elements.
This information tells you (modulo bugs) how traffic or packets will flow through the network under different conditions. Broadly, this includes inter-domain (BGP), intra-domain (OSPF, IS-IS, RIP, BGP), and even switching (ARP and CAM) updates and tables. Routing is generally observed by participating in listen-only routing sessions or, for BGP, via BMP. Note: I also think of tables as something the observability data layer composes from updates, but that is a detail to skip for now.
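To show how routing table state predicts where a packet will go, here is a minimal longest-prefix-match sketch using Python's standard `ipaddress` module. The table contents and next hops are hypothetical, and a linear scan like this is for illustration only. Real routers and collectors use tries or similar structures.

```python
import ipaddress

# Hypothetical routing table: prefix -> next hop (documentation addresses).
routes = {
    "0.0.0.0/0": "192.0.2.1",
    "198.51.100.0/24": "192.0.2.2",
    "198.51.100.128/25": "192.0.2.3",
}


def lookup(table, destination):
    """Return (network, next_hop) for the most specific matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in table.items():
        network = ipaddress.ip_network(prefix)
        if addr in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, next_hop)
    return best


print(lookup(routes, "198.51.100.200"))
```

In this table, 198.51.100.200 falls inside all three prefixes, and the /25 wins because it is the most specific.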
Questions you can answer with routing alone:
Questions you can’t answer with routing alone:
The (typically static) configuration data representing the operating intent for all configurable network elements, such as addresses, IDs, ACLs, topology info, location data, and even device details such as hardware and software versions.
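One simple way to use configuration data is to compare snapshots over time. The sketch below uses Python's standard `difflib` to list the lines that changed between two snapshots. The configuration text is made up for illustration, and the function name is my own.

```python
import difflib

# Hypothetical configuration snapshots taken at two points in time.
before = "hostname rtr1\ninterface Gi0/1\n mtu 1500\n"
after = "hostname rtr1\ninterface Gi0/1\n mtu 9000\n"


def config_changes(old, new):
    """Return removed ('- ') and added ('+ ') lines between two snapshots."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    # ndiff also emits unchanged ('  ') and hint ('? ') lines; drop those.
    return [line for line in diff if line.startswith(("+ ", "- "))]


for change in config_changes(before, after):
    print(change)
```

Lining up a change like this against device metrics or events from the same time window is often the fastest way to explain a sudden shift in behavior.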
Questions you can answer with configuration data alone:
Questions you can’t answer with configuration data alone:
Often called “layer 8,” this is data about the use of the network that goes beyond strictly network scope. The business and operational context of what the network is used for is a critical source of telemetry for network observability.
There is a wide variety of metadata types to tap into, often already available on data buses. Examples include application orchestration from Kubernetes, VMware, and controllers; user association from IPAM, NAC, and RADIUS; threat intelligence curated by security groups; SaaS and cloud identity mapping; customer or department identification; and “business criticality” metadata including customer size or application criticality to business operations.
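As a sketch of how metadata adds context, the example below tags a flow record with application and criticality labels based on its source address. The prefixes, labels, and field names are all hypothetical. In practice this mapping would come from sources like the ones listed above.

```python
import ipaddress

# Hypothetical metadata, e.g. exported from IPAM or an orchestrator.
metadata = [
    (ipaddress.ip_network("10.1.0.0/16"), {"app": "checkout", "criticality": "high"}),
    (ipaddress.ip_network("10.2.0.0/16"), {"app": "batch-reports", "criticality": "low"}),
]


def enrich(flow, table):
    """Return a copy of the flow with the tags of the first matching prefix."""
    addr = ipaddress.ip_address(flow["src"])
    for network, tags in table:
        if addr in network:
            return {**flow, **tags}
    return dict(flow)  # no match: return the flow unchanged


print(enrich({"src": "10.1.4.7", "bytes": 4200}, metadata))
```

With tags attached, a question like “which traffic belongs to high-criticality applications?” becomes a simple filter rather than a lookup exercise.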
How metadata lets you ask better questions:
DNS query streams are also a very useful type of network telemetry, whether observed from the DNS service logs, or from the network traffic on the hosts running DNS servers. In addition, DNS telemetry can be helpful to put traffic and other telemetry types in context.
For example, if apps or sites are using cloud infrastructure, flow without DNS may not be able to “see” them distinctly, but adding DNS to traffic data can help you to peer better into your traffic to those properties.
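Here is a minimal sketch of that idea: build a lookup from observed DNS answers and use it to label flow destinations by name. The names, addresses, and record fields are hypothetical. The sketch also assumes each address maps to a single name, which often does not hold on shared cloud infrastructure, so a real system would track time windows and which client made each query.

```python
# Hypothetical DNS answers seen in resolver logs: (query name, answer IP).
dns_answers = [
    ("shop.example.com", "203.0.113.20"),
    ("cdn.example.net", "203.0.113.21"),
]

# Hypothetical flow records with only IP-level detail.
dns_flows = [
    {"dst": "203.0.113.20", "bytes": 900},
    {"dst": "203.0.113.99", "bytes": 100},
]


def label_flows(flows, answers):
    """Add a dst_name field to each flow using observed DNS answers."""
    ip_to_name = {ip: name for name, ip in answers}  # last answer wins
    return [{**f, "dst_name": ip_to_name.get(f["dst"], "unknown")} for f in flows]


for flow in label_flows(dns_flows, dns_answers):
    print(flow)
```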
Questions you can answer with DNS alone:
Questions you can’t answer with DNS alone:
One explicit note that’s critical to modern network observability is that some of the richest, most real-time, granular, and valuable data to shine light on the network comes from application-layer sources. Most application-layer traffic data has performance instrumentation that is simply not available from high-speed, silicon-accelerated network elements. While network and application observability teams still have work to do to obtain common telemetry, terminology, workflows, and platform interoperability, we see this unification as an active effort in 2021 across our customer base.
Questions you can answer with application telemetry alone:
Questions you can’t answer with application telemetry alone:
Gathering network telemetry data is the key to being able to ask questions, and is the first step in the practice of observability.
As I’ve tried to lay out in this blog, a wider and more varied set of telemetry types can answer many more questions, and this makes your network more observable! Many common questions require two or more telemetry types to answer, and each telemetry type you add multiplies the combinations you can query, which greatly expands the questions you can ask. That is what network observability is about.
Now that we see the need for many different types of network telemetry, collected from the key network elements, how do we create a practical solution that is capable of handling all this data?
That will be the subject of my next blog in this series — the Telemetry Data Platform.