In part 2 of this series, I talked about the range of network devices and observation points that generate telemetry data. Over time, this range has expanded, and networks are more diverse than ever. All of our operational concerns, planning, running and fixing need to be coordinated across the complete variety of the networks that affect our traffic.
In this blog, I discuss the telemetry data itself. Telemetry is the key to seeing, and seeing is the first step in the practice of observability.
The wonderful thing about network telemetry is that there are so many types. That variety is also what makes it challenging to get started on the network observability journey!
Historically, many systems have taken one or two types of telemetry to answer a more limited set of questions. However, with modern data systems and techniques, it’s possible to take a broader set of telemetry, which opens up an even wider set of use cases and questions that can be answered.
Traffic telemetry is information on the flow of traffic across networks: NetFlow, sFlow, and IPFIX (or equivalents) in the classic sense, and in the modern sense also cloud VPC Flow Logs, traffic data in JSON and other interchange formats, and data from service meshes and service and security proxies.
Wire data can also help as a type of traffic data, but we see almost all cloud-focused customers using traffic summaries (flow) because of the difficulty of scaling packet observation in distributed networks.
However you get it, traffic is the key “what is” that shows you what users and applications are up to and how they’re interacting with the network!
Questions you can answer with traffic telemetry:
Is this spike/congestion an attack, a misconfig, a distributed system dynamic, or something else?
Am I under attack? Am I attacking others? What are the sources? How can I mitigate?
What will break if I add filters/change policy?
Who did this IP address talk to? What did it do? What did those destinations then do?
Why is my bandwidth bill so high?
What can I do to localize traffic?
How much does this customer cost me?
What clouds am I sending the most traffic to?
How much traffic is each of my departments using so I can bill them properly?
Questions you can’t answer with traffic alone:
Is this drop in traffic to Google due to network, application, or other performance issues?
What users and applications are consuming my network bandwidth?
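To make a question like “Who did this IP address talk to?” concrete, here is a minimal sketch that sums bytes per destination for one source address. It assumes the flow records have already been decoded into dictionaries; the field names (`src`, `dst`, `bytes`) and the sample values are my own illustration, not a NetFlow, sFlow, or IPFIX schema.

```python
from collections import Counter

# Hypothetical, already-decoded flow records; field names are illustrative only.
flows = [
    {"src": "10.0.0.5", "dst": "192.0.2.10", "bytes": 1200},
    {"src": "10.0.0.5", "dst": "198.51.100.7", "bytes": 300},
    {"src": "10.0.0.9", "dst": "192.0.2.10", "bytes": 50},
    {"src": "10.0.0.5", "dst": "192.0.2.10", "bytes": 800},
]


def bytes_by_peer(records, ip):
    """Sum the bytes sent from `ip` to each destination it talked to."""
    totals = Counter()
    for rec in records:
        if rec["src"] == ip:
            totals[rec["dst"]] += rec["bytes"]
    return totals


peers = bytes_by_peer(flows, "10.0.0.5")
for dst, total in peers.most_common():
    print(dst, total)
```

Note that the output is still only IP addresses. Turning those into users and applications is exactly where the other telemetry types come in.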
Device telemetry is typically expressed as metrics that expose the state of the physical and logical network elements.
This typically covers high-level stats about both the control and forwarding planes, though usually not deep telemetry on the traffic flowing across the network. Historically this data was gathered via the CLI, then mostly via SNMP, later via API access as well, and the more recent energy has been around streaming telemetry.
Questions you can answer with device telemetry alone:
Is the device running out of memory, overheating, or otherwise showing it might stop working altogether?
What are my interface-level statistics and usage now and historically?
What are my optical power and optics temperature levels?
How much traffic is passing over each LSP?
What version of software am I running on my network device?
Is my interface down or up?
Questions you can’t answer with device telemetry alone:
Why is this interface full?
Are these interface errors causing performance problems?
What are my top talkers?
What is the traffic going through this interface or device?
What applications and users are passing through a specific interface?
Was my spike in traffic an attack?
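As a small illustration of the interface usage question above, this sketch turns two samples of an interface octet counter into a utilization percentage. The function name, the sample values, and the assumption of a 64-bit counter are mine; how the samples are collected (SNMP, API, or streaming telemetry) is out of scope here.

```python
def utilization_pct(octets_t0, octets_t1, interval_s, speed_bps, counter_bits=64):
    """Percent utilization between two samples of an octet counter."""
    # Modular subtraction tolerates a single counter wrap between samples.
    delta_octets = (octets_t1 - octets_t0) % (2 ** counter_bits)
    return delta_octets * 8 / (interval_s * speed_bps) * 100


# Hypothetical samples taken 300 seconds apart on a 1 Gbit/s interface.
pct = utilization_pct(1_000_000_000, 4_750_000_000, 300, 1_000_000_000)
print(f"{pct:.1f}% utilized")
```

This tells you how full the interface is, but not why. The “why” needs traffic telemetry.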
Event telemetry consists of updates about events occurring from the point of view of the network elements, including alerts such as threshold violations for temperature, CPU, optics, wireless/radio interfaces, or other element health, as well as config changes and routing session state. Such notifications are typically sent via syslog or SNMP trap.
Questions you can answer with events alone:
Did someone make a change they shouldn’t have?
Was the interface shut down or did it flap?
Who made a change, and what was changed?
Did a process crash on my device?
How long has my device been operational?
Is my network having trouble with routing stability?
Is someone attempting to log in to my device?
Questions you can’t answer with events alone:
Did config changes cause a customer-visible problem?
How did the change affect my network traffic and performance?
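To illustrate the “did it flap?” question, here is a rough sketch that counts link state transitions per interface in a batch of syslog lines. The message shape in the regular expression and the sample lines are assumptions for illustration; real syslog formats differ by vendor and software version, so the pattern would need adapting.

```python
import re
from collections import Counter

# Assumed message shape; real syslog formats differ by vendor and OS version.
LINK_RE = re.compile(r"Interface (\S+?),? changed state to (up|down)")

# Hypothetical sample lines.
log_lines = [
    "Mar  1 10:00:01 rtr1 LINK-3-UPDOWN: Interface ge-0/0/1, changed state to down",
    "Mar  1 10:00:04 rtr1 LINK-3-UPDOWN: Interface ge-0/0/1, changed state to up",
    "Mar  1 10:00:09 rtr1 LINK-3-UPDOWN: Interface ge-0/0/1, changed state to down",
    "Mar  1 10:02:00 rtr1 LINK-3-UPDOWN: Interface ge-0/0/2, changed state to down",
    "Mar  1 10:03:00 rtr1 SYS-5-CONFIG: Configured from console by admin",
]


def count_transitions(lines):
    """Count up/down transitions per interface name."""
    counts = Counter()
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def flapping(lines, min_transitions=3):
    """Interfaces with at least `min_transitions` state changes in the batch."""
    counts = count_transitions(lines)
    return sorted(name for name, n in counts.items() if n >= min_transitions)


print(flapping(log_lines))
```

The threshold of three transitions is arbitrary; in practice it would be tied to a time window.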
Synthetic telemetry consists of measurements from testing with “synthetic” traffic, generated separately from actual user traffic. While synthetics can be triggered or collected via device telemetry interfaces, they are actually a broader category spanning client and server endpoints, network elements, and internet-wide locations performing network and application-layer testing.
Questions you can answer with synthetics alone:
Are specific links experiencing packet loss, potentially only when tested above current levels of traffic?
What is my performance to specific endpoints, between data centers or from on-prem to cloud?
What path is my traffic taking to reach a destination and what is the performance?
Questions you can’t answer with synthetics alone:
Were any applications/users affected by this bad performance test result?
What is causing the performance problem?
Did a change in the network cause a performance problem?
Were there important customers, users, or application traffic affected by the performance issue?
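As a sketch of what a synthetic test result boils down to, the snippet below summarizes one batch of probes into packet loss and latency figures. The input format (a list of round-trip times in milliseconds, with `None` for a probe that got no reply) and the sample values are assumptions of mine, not the output of any particular testing tool.

```python
import statistics


def summarize(rtts_ms):
    """Summarize probe results; None marks a probe that got no reply."""
    if not rtts_ms:
        raise ValueError("no probes to summarize")
    replies = [r for r in rtts_ms if r is not None]
    loss_pct = 100 * (len(rtts_ms) - len(replies)) / len(rtts_ms)
    return {
        "loss_pct": loss_pct,
        "median_ms": statistics.median(replies) if replies else None,
        "max_ms": max(replies) if replies else None,
    }


# Hypothetical batch of ten probes to one endpoint.
probes = [12.1, 11.8, None, 12.4, 48.0, None, 12.0, 11.9, 12.2, 12.3]
summary = summarize(probes)
print(summary)
```

A result like this says the path is lossy and occasionally slow. It does not say which users or applications felt it.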
Routing telemetry consists of dynamic updates and/or routing table state for the routes or paths determined and propagated by the network elements.
This information tells you (modulo bugs) how traffic or packets will flow through the network under different conditions. Broadly this includes inter-domain (BGP), intra-domain (OSPF, IS-IS, RIP, BGP) and even switching (ARP and CAM) updates and tables. Routing is generally observed by participating in listen-only routing sessions, or for BGP, via BMP. (I also think of tables as something composed from updates in the observability data layer, but that is a topic for another time.)
Questions you can answer with routing alone:
Is someone hijacking my network reachability?
What path is my traffic taking through my network?
What AS_PATH is my traffic taking to reach an endpoint?
Which interface does my traffic egress through to reach a destination AS?
Am I announcing my customers’ networks properly?
Questions you can’t answer with routing alone:
Do the networks where I see routing instability carry any application or user traffic for me?
Is routing instability causing performance problems?
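For the hijacking question, one simplified check is to compare the origin AS of observed announcements against the origin you expect for your own prefixes, including more-specific prefixes carved out of them. The sketch below assumes BGP updates have already been parsed into dictionaries; the field names, prefixes, and AS numbers are illustrative (taken from documentation ranges), and it ignores details such as AS_SETs and legitimately delegated more-specifics.

```python
import ipaddress

# Hypothetical prefixes I originate, mapped to the origin AS I expect to see.
EXPECTED_ORIGINS = {"203.0.113.0/24": 64500}


def unexpected_origins(updates, expected=EXPECTED_ORIGINS):
    """Flag announcements covering my address space with an unexpected origin AS."""
    alerts = []
    for upd in updates:
        announced = ipaddress.ip_network(upd["prefix"])
        origin = upd["as_path"][-1]  # the origin AS is the last ASN in the path
        for prefix, expected_asn in expected.items():
            mine = ipaddress.ip_network(prefix)
            if (announced.version == mine.version
                    and announced.subnet_of(mine)
                    and origin != expected_asn):
                alerts.append((upd["prefix"], origin))
    return alerts


# Hypothetical parsed updates.
updates = [
    {"prefix": "203.0.113.0/24", "as_path": [64496, 64500]},    # expected origin
    {"prefix": "203.0.113.128/25", "as_path": [64496, 64511]},  # more-specific, other origin
    {"prefix": "198.51.100.0/24", "as_path": [64496, 64499]},   # not my address space
]
alerts = unexpected_origins(updates)
print(alerts)
```

A flagged announcement is a candidate for investigation, not proof of a hijack.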
Configuration data is the (typically static) data representing the operating intent for all configurable network elements, such as addresses, IDs, ACLs, topology info, location data, and even device details such as hardware and software versions.
Questions you can answer with configuration data alone:
Did I make an obvious mistake with a configuration change?
Questions you can’t answer with configuration data alone:
Did I block traffic that shouldn’t have been blocked?
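Spotting an obvious mistake often starts with seeing exactly what changed. This is a minimal sketch that diffs two configuration snapshots using Python's standard `difflib` module. The configuration lines are made up for illustration and do not follow any specific vendor syntax.

```python
import difflib

# Hypothetical configuration snapshots, one list entry per line.
before = [
    "interface ge-0/0/1",
    " description uplink",
    " ip access-group EDGE-IN in",
]
after = [
    "interface ge-0/0/1",
    " description uplink",
    " ip access-group EDGE-OUT in",
]


def changed_lines(old, new):
    """Return only the added (+) and removed (-) lines between two snapshots."""
    diff = difflib.unified_diff(
        old, new, fromfile="before", tofile="after", lineterm="", n=0
    )
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


changes = changed_lines(before, after)
for line in changes:
    print(line)
```

The diff shows that the ACL reference changed. Whether that change blocked traffic it shouldn't have is a question for traffic telemetry.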
Business and Operational Metadata
Often called “layer 8,” this is data about the use of the network that goes beyond strictly network scope. The business and operational context about what the network is used for is a critical source of telemetry for network observability.
There is a wide variety of metadata types to tap into, often already available on data buses. Examples include application orchestration from Kubernetes, VMware, and controllers; user association from IPAM, NAC, and RADIUS; threat intelligence curated by security groups; SaaS and cloud identity mapping; customer or department identification; and “business criticality” metadata including customer size or application criticality to business operations.
How metadata lets you ask better questions:
“Who did this IP address attack?” becomes “What users or hosts were attacked from this IP address? Was this IP address part of my production infrastructure, or was it one of my users or customers?”
“What IP addresses were affected?” becomes “What hostname, applications, users, and customers were affected?”
“What IP or subnet is using a certain amount of bandwidth?” becomes “What department, user, or customer can I bill for this usage?”
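Here is a small sketch of that kind of enrichment: mapping an IP address to its owner by picking the most specific matching subnet. The subnet-to-owner table stands in for whatever an IPAM or similar system would provide; its contents and the function name are hypothetical.

```python
import ipaddress

# Hypothetical subnet-to-owner table, e.g. exported from an IPAM system.
OWNERS = {
    "10.0.0.0/8": "corporate",
    "10.20.0.0/16": "engineering",
    "10.20.30.0/24": "build-farm",
}


def owner_of(ip, owners=OWNERS):
    """Return the owner of the most specific subnet containing `ip`."""
    addr = ipaddress.ip_address(ip)
    best = None
    for prefix, owner in owners.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, owner)
    return best[1] if best else "unknown"


for ip in ["10.20.30.7", "10.20.99.1", "10.1.1.1", "192.0.2.1"]:
    print(ip, owner_of(ip))
```

A linear scan is fine for a sketch; a large table would call for a prefix trie or similar structure.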
DNS query streams are also a very useful type of network telemetry, whether observed from the DNS service logs, or from the network traffic on the hosts running DNS servers. In addition, DNS telemetry can be helpful to put traffic and other telemetry types in context.
For example, if apps or sites are using cloud infrastructure, flow without DNS may not be able to “see” them distinctly, but adding DNS to traffic data can help you to peer better into your traffic to those properties.
Questions you can answer with DNS alone:
Am I getting or returning DNS failures (NXDOMAIN, SERVFAIL, or otherwise)?
What are my frequently queried domain names and talkative DNS clients?
What is my DNS request load for my servers?
Questions you can’t answer with DNS alone:
Is slowness in DNS response causing pages to load slowly?
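To show how DNS can put traffic data in context, this sketch labels flow destinations with the domain names that recently resolved to them, then sums bytes per name. Both inputs are hypothetical: the DNS answers are (name, IP) pairs and the flow records use made-up field names. It also glosses over real-world wrinkles such as one IP serving many names and answers expiring with their TTL.

```python
from collections import Counter

# Hypothetical DNS answers observed on the network: (queried name, returned IP).
dns_answers = [
    ("app.example.com", "192.0.2.10"),
    ("cdn.example.net", "198.51.100.7"),
]

# Hypothetical, already-decoded flow records.
flows = [
    {"dst": "192.0.2.10", "bytes": 2000},
    {"dst": "198.51.100.7", "bytes": 300},
    {"dst": "203.0.113.9", "bytes": 40},
    {"dst": "192.0.2.10", "bytes": 500},
]


def bytes_by_name(flow_records, answers):
    """Sum flow bytes per DNS name, falling back to the raw IP when unknown."""
    ip_to_name = {ip: name for name, ip in answers}
    totals = Counter()
    for rec in flow_records:
        label = ip_to_name.get(rec["dst"], rec["dst"])
        totals[label] += rec["bytes"]
    return totals


usage = bytes_by_name(flows, dns_answers)
for label, total in usage.most_common():
    print(label, total)
```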
Application Data Sources
One explicit note that’s critical to modern network observability: some of the richest, most real-time, granular, and valuable data for shining light on the network comes from application-layer sources. Most application-layer traffic data includes performance instrumentation that is simply not available from high-speed, silicon-accelerated network elements. While network and application observability teams still have work to do to arrive at common telemetry, terminology, workflows, and platform interoperability, we see this unification as an active effort in 2021 across our customer base.
Questions you can answer with application telemetry alone:
Did I return a slow response to a user? How often? When?
Am I returning errored responses to users or other application components?
Questions you can’t answer with application telemetry alone:
Were slow responses due to application or network problems? Which and where?
Putting It All Together
Gathering network telemetry data is the key to being able to ask questions, and is the first step in the practice of observability.
As I’ve tried to lay out in this blog, a wider and more varied set of telemetry types can answer many more questions, and this makes your network more observable! Many common questions require two or more telemetry types to answer, so each combination of telemetry types you add greatly expands the set of questions you can ask. That is what network observability is about.
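As a rough illustration of combining two telemetry types, the sketch below pairs config change events with synthetic test results to surface changes that were followed by a slow test within a time window. The record fields, the 15-minute window, and the 50 ms threshold are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Hypothetical config change events.
changes = [
    {"time": datetime(2021, 3, 1, 10, 0), "device": "rtr1", "user": "admin"},
    {"time": datetime(2021, 3, 1, 14, 0), "device": "rtr2", "user": "ops"},
]

# Hypothetical synthetic test results.
tests = [
    {"time": datetime(2021, 3, 1, 9, 55), "target": "dc2", "latency_ms": 12},
    {"time": datetime(2021, 3, 1, 10, 5), "target": "dc2", "latency_ms": 95},
    {"time": datetime(2021, 3, 1, 14, 3), "target": "dc2", "latency_ms": 13},
]


def changes_followed_by_slow_tests(change_events, test_results,
                                   window=timedelta(minutes=15), slow_ms=50):
    """Pair each change with slow test results seen within `window` after it."""
    hits = []
    for chg in change_events:
        for res in test_results:
            in_window = chg["time"] <= res["time"] <= chg["time"] + window
            if in_window and res["latency_ms"] >= slow_ms:
                hits.append((chg["device"], res["target"], res["latency_ms"]))
    return hits


suspects = changes_followed_by_slow_tests(changes, tests)
print(suspects)
```

Proximity in time is only a hint, not proof that the change caused the slowdown, but neither telemetry type could raise the question on its own.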
Now that we see the need for many different types of network telemetry, gathered from all of the key network elements, how do we create a practical solution that is capable of handling all this data?
That will be the subject of my next blog in this series — the Telemetry Data Platform.