Microburst Detection: How to Catch Sub-Second Traffic Spikes That Standard Monitoring Misses

Table of contents

Microburst Detection at a Glance
What is a Microburst?
Why Microbursts Matter
Why Standard Network Monitoring Misses Microbursts
How to Detect Microbursts: A Three-Step Approach
Step 1: Collect High-Frequency Device Telemetry
Step 2: Correlate Burst Events to Flow Records
Step 3: Investigate the Root Cause with AI-Assisted Analytics
Common Causes of Microbursts
Microbursts vs. Elephant Flows: Related but Distinct
Microbursts in AI Data Center Networks
How Kentik Detects and Investigates Microbursts
Related Articles
FAQs about Microburst Detection
What is a microburst in networking?
How long does a microburst typically last?
Why do microbursts cause packet loss when average link utilization is low?
What causes microbursts?
How do you detect and investigate microbursts using flow telemetry?
Can SNMP detect microbursts?
How do you correlate spikes in TCP retransmits with specific network segments?
What is the difference between microbursts and elephant flows?
Why are microbursts especially problematic in AI data center networks?
What is hardware queue telemetry and why does it matter for microburst detection?
How does Kentik detect and investigate microbursts?
Detect Microbursts and Accelerate Root Cause Analysis with Kentik

Reviewed for technical accuracy by: Eric Hian-Cheong, Senior Product Marketing Manager at Kentik, specializing in network monitoring, AI-assisted operations, and flow analytics.

A microburst is a sub-second spike in network traffic — typically lasting between 10 milliseconds and 1 second — that can saturate a link or switch queue to 100% capacity even when standard monitoring shows healthy average utilization. Microbursts cause tail drops in switch buffers, TCP retransmissions, latency spikes, and application slowdowns. They are increasingly common in AI/ML training fabrics, hyperscale data centers, storage networks, and high-frequency trading environments. This article explains what microbursts are, why they cause performance problems disproportionate to their duration, why traditional SNMP polling and flow aggregation miss them, the techniques used to detect them, and how to investigate the specific flows that caused them.

Microburst Detection at a Glance

What it is: A traffic spike lasting less than one second that exceeds available link or queue capacity, causing buffer overflows and packet drops even when 1-minute average utilization looks healthy.
Typical duration: 10 milliseconds to 1 second, with the most damaging bursts often in the 10–100 ms range.
Why they matter: Even a 50 ms burst that fully saturates a switch egress queue can drop hundreds of packets, triggering TCP retransmissions, degrading tail latency, and — in AI fabrics — significantly extending job completion time (JCT).
Why standard monitoring misses them: SNMP polling at 1-minute or 5-minute intervals averages bursts into background noise. NetFlow v5 and v9 aggregate counters over cache cycles that blur sub-second events.
How to detect them: Hardware-level queue and buffer monitoring on the switch (Cisco Nexus microburst monitoring, Arista LANZ, Juniper microburst features), streaming telemetry (gNMI) for high-frequency device metrics, flow telemetry (NetFlow, sFlow, IPFIX) to attribute bursts to specific source/destination pairs, and a correlation layer that ties device events to the responsible flows.
Where they are most common: AI/ML training networks (collective operations like all-reduce and all-to-all), data center east-west traffic, storage backup windows, microservice fan-in patterns, and TCP incast scenarios.

What is a Microburst?

A microburst is a brief, sharp surge in network traffic that pushes a link, interface, or switch egress queue to or beyond its capacity for a very short duration — typically 10 milliseconds to 1 second. Unlike sustained congestion, which shows up clearly on standard utilization charts, microbursts are invisible at typical monitoring resolutions: a switch egress queue can be 100% saturated for 50 ms and still appear as only 40% utilization on a 1-minute average chart.

The damage caused by microbursts is disproportionate to their duration. Once a switch egress buffer fills, additional packets are dropped — typically as tail drops — which triggers TCP retransmissions, increases tail latency, and in modern AI data center fabrics can stall thousands of GPUs waiting on synchronized collective communication. A single microburst lasting 50 ms can drop hundreds or thousands of packets, with cascading effects across distributed applications, storage clusters, and AI training jobs.

Microbursts are a structural feature of modern networks, not a misconfiguration. Most workloads do not produce smooth traffic; they produce bursty traffic with sharp peaks and idle gaps. The engineering challenge is not eliminating bursts but detecting them, attributing them to specific flows, and ensuring the network has enough buffer and bandwidth headroom to absorb them without dropping packets.

Kentik in brief: Kentik is a network intelligence platform that helps teams detect, investigate, and respond to microbursts by ingesting gNMI streaming telemetry from network devices alongside NetFlow, sFlow, and IPFIX flow records, then correlating device-level events with the specific flows and applications that caused them. With Kentik NMS, Data Explorer, and Kentik AI Advisor, teams can move from a device-level burst event to an evidence-backed root-cause narrative in a single workflow — without manually correlating data across vendor-specific dashboards. For an operator-focused view of microburst detection in GPU clouds and AI data centers, see Network Intelligence for Neoclouds and AI Data Centers.

Why Microbursts Matter

Microbursts cause performance problems that often look mysterious to NetOps teams using traditional monitoring tools. Common symptoms include TCP retransmissions and degraded throughput while average link utilization stays low, application latency spikes that show up only at the 95th or 99th percentile, small numbers of tail drops at switch interfaces that appear under-utilized, and extended job completion time in distributed AI training where the slowest GPU determines progress.

The business impact varies by environment. In a traditional enterprise data center, microbursts manifest as intermittent application slowness that is hard to reproduce and easy to misdiagnose as an application or storage problem. In an AI training fabric, the impact is direct and quantifiable: a single congested link in a collective-operation step can delay every GPU in the cluster, multiplying the cost of an already expensive training run. In a high-frequency trading environment, microsecond-scale packet drops can mean missed trades. In a hyperscale or service-provider network, microbursts in core or peering interfaces can degrade SLA performance across thousands of customers simultaneously.

The common thread across these environments is that the cost of an undetected microburst is far greater than the cost of detecting it. Detection requires investment in higher-frequency telemetry and analytics; the alternative is troubleshooting performance issues that disappear by the time an engineer logs in.

Why Standard Network Monitoring Misses Microbursts

Most network monitoring is built around two collection methods: SNMP polling for device and interface metrics, and flow telemetry (NetFlow, sFlow, IPFIX) for traffic data. Both methods, in their standard configurations, are blind to microbursts.

SNMP polling intervals are too coarse. A typical SNMP polling interval is 60 seconds, sometimes 30 or 15 seconds in performance-conscious deployments. At those intervals, a 50 ms burst that saturates an interface contributes a fraction of one percent to the polled average. A chart of 1-minute interface utilization looks flat at 40% — and the chart is technically correct, because the link spent the other 99.92% of that minute at lower utilization. The information that matters (that the queue was full for 50 ms and dropped packets) is invisible.
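The averaging effect can be checked with a few lines of arithmetic. The sketch below assumes a hypothetical link that sits at 40% utilization and is fully saturated for one 50 ms burst inside a 60-second polling window; the function name and the numbers are illustrative and do not come from any monitoring tool.

```python
def polled_average(baseline_util, burst_util, burst_seconds, poll_seconds):
    """Time-weighted average utilization over one polling interval."""
    quiet_seconds = poll_seconds - burst_seconds
    return (baseline_util * quiet_seconds + burst_util * burst_seconds) / poll_seconds


# Hypothetical link: 40% baseline, one 50 ms burst at 100%, 60 s SNMP poll.
average = polled_average(0.40, 1.00, 0.050, 60.0)
burst_share = 0.050 / 60.0  # fraction of the interval spent inside the burst

print(f"polled average: {average:.4%}")
print(f"time in burst:  {burst_share:.4%}")
```

Under these assumptions the polled value moves from 40% to about 40.05%, and the burst occupies roughly 0.08% of the interval, so the saturation leaves no visible mark on a 1-minute chart.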

Flow telemetry has its own averaging problem. NetFlow v5 and v9 aggregate packets into flow records using cache timers, typically with active timeouts of a minute or more and inactive timeouts of several seconds. By the time a flow record is exported, the burst is folded into an aggregate byte and packet count that lacks the time resolution to identify when within the flow’s lifetime the traffic peaked. IPFIX can carry higher-precision timestamps, but only if devices are configured to populate the relevant fields and exporters are tuned for short flow durations.

Sampled flow exacerbates the problem. Many production deployments use sampled flow telemetry (commonly 1:1000 or 1:4096) for scalability. Sampling is appropriate for capacity planning and traffic engineering, but at 1:4096 sampling, a 50 ms burst containing 20,000 packets yields on average about five sampled packets (20,000 / 4,096 ≈ 4.9). That is not enough to reconstruct the burst’s shape or attribute it confidently to a single source.
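The sampling arithmetic can be sketched as follows. The code assumes each packet is sampled independently with probability 1/N, which is a simplification of how real exporters sample; the function names and the 2,000-packet contributor are hypothetical.

```python
def expected_samples(packets, sampling_rate):
    """Expected number of sampled packets at 1:sampling_rate sampling."""
    return packets / sampling_rate


def prob_no_samples(packets, sampling_rate):
    """Probability that independent 1:N sampling sees none of the packets."""
    return (1 - 1 / sampling_rate) ** packets


burst_packets = 20_000       # the 50 ms burst from the text
contributor_packets = 2_000  # hypothetical source sending 10% of the burst

for rate in (1000, 4096):
    seen = expected_samples(burst_packets, rate)
    missed = prob_no_samples(contributor_packets, rate)
    print(f"1:{rate}  expected samples: {seen:.1f}  "
          f"chance the 10% contributor is never sampled: {missed:.0%}")
```

With these numbers, 1:4096 sampling is expected to capture about 4.9 packets of the burst, and a source responsible for a tenth of the burst has roughly a 61% chance of not appearing in the samples at all.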

Standard interface counters are insufficient. Interface counters track bytes and packets passing through an interface, but they do not measure how full the egress queue was while those packets were transmitted. A microburst can drop packets even though measured utilization never comes close to line rate, because during the burst the limiting factor is the egress buffer’s depth, not the interface’s average bandwidth.
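A small tail-drop queue simulation makes the buffer-depth point concrete. This is a minimal sketch with made-up numbers (1 ms ticks, a port that drains 100 packets per tick, a 500-packet buffer); it does not model any particular switch ASIC or shared-buffer scheme.

```python
def simulate_egress_queue(arrivals, drain_per_tick, buffer_packets):
    """Run a tail-drop FIFO over per-tick arrival counts.

    Returns (packets_sent, packets_dropped, peak_queue_depth).
    """
    depth = sent = dropped = peak = 0
    for arriving in arrivals:
        # Enqueue what fits; anything beyond the buffer is tail-dropped.
        accepted = min(arriving, buffer_packets - depth)
        dropped += arriving - accepted
        depth += accepted
        peak = max(peak, depth)
        # Drain at line rate for this tick.
        out = min(depth, drain_per_tick)
        depth -= out
        sent += out
    return sent, dropped, peak


# One second of 1 ms ticks: 40 packets/ms baseline, 50 ms burst at 300 packets/ms.
arrivals = [40] * 1000
arrivals[500:550] = [300] * 50

sent, dropped, peak = simulate_egress_queue(
    arrivals, drain_per_tick=100, buffer_packets=500
)
utilization = sent / (100 * len(arrivals))
print(f"sent={sent} dropped={dropped} peak_depth={peak} utilization={utilization:.1%}")
```

In this run the port averages about 43% utilization over the second while tail-dropping 9,600 packets, all of them inside the 50 ms burst. A byte counter on the interface would report only the first of those two facts.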

The result is that microbursts produce symptoms (application slowness, retransmissions, occasional drops) without the data needed to explain them. NetOps teams using traditional monitoring often spend hours or days investigating microburst-caused incidents and never find a definitive root cause.

How to Detect Microbursts: A Three-Step Approach

Effective microburst detection requires three capabilities working together: high-frequency device telemetry to see the burst, high-precision flow telemetry to attribute it to specific traffic, and an analytics workflow to correlate the two. The following three-step approach reflects how modern NetOps and AI infrastructure teams operationalize microburst detection.

Step 1: Collect High-Frequency Device Telemetry

The foundation of microburst detection is telemetry at sub-second resolution from the network devices most likely to experience congestion. There are three complementary techniques.

Streaming telemetry replaces polling. Streaming telemetry protocols such as gNMI push device metrics continuously from the switch or router to the collector, rather than waiting for a polling request. Streaming telemetry intervals are typically much shorter than SNMP polling — measured in seconds rather than minutes — and the highest-frequency metrics (such as CPU) can be exported at intervals as short as a few seconds. Exact intervals depend on the device platform, the metric, and how telemetry is configured. This is dense enough to surface short-lived device anomalies that minute-scale SNMP polling would miss. (For deeper context, see Network Device Monitoring.)

Hardware queue telemetry surfaces the bursts themselves. Modern switching ASICs from vendors including Cisco, Arista, Juniper, and others expose hardware-level queue and buffer occupancy metrics. These are the canonical indicators of a microburst: when an egress queue fills above a configured high-water mark, the switch can export a telemetry event or set a counter that downstream analytics can pick up. Cisco Nexus microburst monitoring, Arista LANZ (Latency Analyzer), and Juniper microburst features all provide variations on this capability.

Event-driven exports capture the moments that matter. Rather than streaming every metric continuously, some platforms support event-driven telemetry — exporting a record only when a threshold trips (queue rise above a watermark, packet drop, latency exceeded). Cisco Flow Telemetry Events (FTE), Arista LANZ events, and similar features reduce the volume of telemetry that needs to be ingested while still capturing the precise moments of burst activity.
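The threshold logic behind watermark-based and event-driven detection can be sketched in a few lines. The example below is an illustration only: the `BurstEvent` type, the sample format, and the use of a separate low watermark to close an event (hysteresis) are assumptions made for this sketch, not a description of how any vendor feature is implemented.

```python
from dataclasses import dataclass


@dataclass
class BurstEvent:
    start_ms: int
    end_ms: int
    peak_occupancy: float


def detect_bursts(samples, high_watermark, low_watermark):
    """Turn a list of (timestamp_ms, queue_occupancy) samples into burst events.

    An event opens when occupancy reaches high_watermark and closes when it
    falls back to low_watermark or below.
    """
    events = []
    current = None
    for ts, occupancy in samples:
        if current is None:
            if occupancy >= high_watermark:
                current = BurstEvent(ts, ts, occupancy)
        elif occupancy <= low_watermark:
            current.end_ms = ts
            events.append(current)
            current = None
        else:
            current.peak_occupancy = max(current.peak_occupancy, occupancy)
    if current is not None:  # still elevated when the data ends
        current.end_ms = samples[-1][0]
        events.append(current)
    return events


# Hypothetical occupancy samples every 10 ms (fraction of queue in use).
samples = [(0, 0.10), (10, 0.15), (20, 0.85), (30, 1.00), (40, 0.95),
           (50, 0.60), (60, 0.20), (70, 0.10)]
events = detect_bursts(samples, high_watermark=0.80, low_watermark=0.25)
print(events)
```

Only the event record (start, end, peak) needs to leave the device, which is why event-driven export costs far less than streaming every occupancy sample.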

The combination of these techniques produces a continuous, sub-second record of device behavior that can answer the question “did a microburst occur, and where?”

Step 2: Correlate Burst Events to Flow Records

Detecting that a burst occurred is only half the answer. The next question is which flows caused it — which application, source, destination, or workload was responsible. This requires flow telemetry with enough time resolution and detail to align with the device-level burst events.

The relevant flow telemetry is typically NetFlow, IPFIX, or sFlow, ideally with microsecond or millisecond timestamps and minimal sampling on the links most likely to experience bursts. IPFIX is particularly well suited for microburst attribution because its template structure can carry high-precision flowStartMilliseconds and flowEndMilliseconds fields.
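To see why timestamp precision and short flow durations matter, consider estimating how many of a flow's bytes fall inside a burst window. The sketch below uses the IPFIX field names flowStartMilliseconds, flowEndMilliseconds, and octetDeltaCount as dictionary keys; the records, the burst window, and the assumption that a flow sends at a constant rate over its lifetime are all hypothetical.

```python
def bytes_in_window(flow, window_start_ms, window_end_ms):
    """Estimate how many of a flow's bytes fall inside a time window.

    Assumes a constant sending rate over the flow's lifetime, because a flow
    record does not say when within that lifetime the bytes were sent.
    """
    start = flow["flowStartMilliseconds"]
    end = flow["flowEndMilliseconds"]
    octets = flow["octetDeltaCount"]
    if end <= start:  # single-packet or zero-length record
        inside = window_start_ms <= start <= window_end_ms
        return float(octets) if inside else 0.0
    overlap = min(end, window_end_ms) - max(start, window_start_ms)
    if overlap <= 0:
        return 0.0
    return octets * overlap / (end - start)


short_flow = {"flowStartMilliseconds": 1_000, "flowEndMilliseconds": 1_040,
              "octetDeltaCount": 4_000_000}
long_flow = {"flowStartMilliseconds": 0, "flowEndMilliseconds": 60_000,
             "octetDeltaCount": 600_000_000}
burst = (1_000, 1_050)  # hypothetical 50 ms burst window from queue telemetry

short_estimate = bytes_in_window(short_flow, *burst)
long_estimate = bytes_in_window(long_flow, *burst)
print(short_estimate, long_estimate)
```

The 40 ms record is attributed to the burst in full, while the 60-second record contributes only a smeared estimate of 500,000 bytes. The long flow may in fact have sent far more during those 50 ms, but the record cannot show it, which is the reason to prefer short export intervals on burst-prone links.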

The correlation workflow is straightforward in principle:

Identify the precise time window when the burst occurred (from device queue telemetry).
Filter flow records to that interface, in that time window.
Rank flows by byte and packet contribution to identify the dominant contributors.
Drill into source and destination IP, port, application, and ASN to characterize the responsible workload.
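The first three steps above can be sketched as a filter-and-rank over flow records. The record layout (`src`, `dst`, `egress_interface`, `start_ms`, `end_ms`, `bytes`), the addresses, and the interface names are hypothetical and stand in for whatever schema a flow collector actually exposes.

```python
from collections import defaultdict


def top_contributors(flows, interface, window_start_ms, window_end_ms, top_n=3):
    """Rank (src, dst) pairs by bytes on one interface during a burst window."""
    totals = defaultdict(int)
    for flow in flows:
        if flow["egress_interface"] != interface:
            continue
        # Keep only records whose lifetime overlaps the burst window.
        if flow["end_ms"] < window_start_ms or flow["start_ms"] > window_end_ms:
            continue
        totals[(flow["src"], flow["dst"])] += flow["bytes"]
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


flows = [
    {"src": "10.0.1.11", "dst": "10.0.2.50", "egress_interface": "Ethernet1/24",
     "start_ms": 1000, "end_ms": 1045, "bytes": 6_000_000},
    {"src": "10.0.1.12", "dst": "10.0.2.50", "egress_interface": "Ethernet1/24",
     "start_ms": 1005, "end_ms": 1048, "bytes": 5_500_000},
    {"src": "10.0.1.11", "dst": "10.0.2.50", "egress_interface": "Ethernet1/24",
     "start_ms": 1046, "end_ms": 1050, "bytes": 500_000},
    {"src": "10.0.3.7", "dst": "10.0.2.60", "egress_interface": "Ethernet1/24",
     "start_ms": 5000, "end_ms": 5100, "bytes": 9_000_000},   # outside the window
    {"src": "10.0.1.13", "dst": "10.0.2.50", "egress_interface": "Ethernet1/25",
     "start_ms": 1000, "end_ms": 1050, "bytes": 8_000_000},   # different interface
]

ranked = top_contributors(flows, "Ethernet1/24", 1000, 1050)
for (src, dst), total in ranked:
    print(f"{src} -> {dst}: {total} bytes")
```

The fourth step, drilling into port, application, and ASN, amounts to widening the grouping key beyond the source and destination pair.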

Figure: Kentik Data Explorer Sankey view of data center flows between leaf and spine switches. The query surfaces the flows transiting the fabric, ranked by contribution; this is the attribution step that ties a device-level burst event to the workload responsible for it.

In practice, this correlation is tedious without a platform that ingests device telemetry and flow telemetry into the same data model. Manually aligning device-level events from one tool with flow records from another is slow and error-prone, especially at the time scales involved (milliseconds).

Step 3: Investigate the Root Cause with AI-Assisted Analytics

Microburst patterns are rarely random. Once the responsible flow is identified, the next question is whether the burst is part of a recurring pattern — a scheduled backup job, an AI training step, a microservice fan-in event — and whether it can be predicted, rescheduled, or absorbed by a network design change.
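One simple way to test whether bursts follow a schedule is to look at the gaps between burst events. The sketch below is a rough heuristic, not an established algorithm: it treats events as recurring when every gap is within a tolerance of the median gap, and the function name, tolerance, and timestamps are all assumptions made for illustration.

```python
from statistics import median


def recurrence_period(event_times_s, tolerance=0.1):
    """Return the typical gap between events if they recur regularly, else None.

    Events count as regular when every gap is within `tolerance` (a fraction)
    of the median gap.
    """
    if len(event_times_s) < 3:
        return None
    times = sorted(event_times_s)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    typical = median(gaps)
    if typical <= 0:
        return None
    if all(abs(gap - typical) <= tolerance * typical for gap in gaps):
        return typical
    return None


# Hypothetical burst timestamps (seconds): roughly hourly vs. irregular.
hourly = [0, 3601, 7199, 10802, 14400]
irregular = [0, 40, 900, 4000, 4100]

print(recurrence_period(hourly))
print(recurrence_period(irregular))
```

A period close to 3,600 seconds points toward an hourly job such as a backup or export, while a result of `None` suggests the bursts need to be explained one event at a time.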

This investigation step is where AI-assisted analytics adds the most value. A natural-language interface that can answer questions like “What other times this week did this interface see a queue alarm, and what was the dominant flow each time?” compresses what would otherwise be hours of manual querying into a single conversation. Predictive workflows can identify which interfaces are at highest risk of future bursts and which workloads are most likely to cause them.

Figure: Kentik AI Advisor natural-language interface. Ask about a queue alarm in plain language, and it runs the multi-step queries across device and flow telemetry, showing its reasoning at each step.

The goal of the investigation step is not just to explain a past incident but to inform proactive changes: increasing buffer allocations on the affected interface, rebalancing traffic across ECMP paths, rescheduling batch workloads to non-overlapping windows, or upgrading capacity on chronically congested links.

Common Causes of Microbursts

Microbursts are produced by traffic patterns that are increasingly common in modern infrastructure. Understanding the source helps with both detection and prevention.

TCP incast and many-to-one fan-in. When many clients simultaneously respond to a single request — for example, a map-reduce shuffle, a distributed database scatter-gather query, or a microservice with many backend dependencies — the responses converge on the requester at the same instant. If the aggregate bandwidth of the responding clients exceeds the receiver’s link capacity, the result is a classic microburst at the egress queue feeding the receiver.
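The incast condition reduces to simple rate arithmetic. The sketch below uses a deliberately simplified model: all senders start at once and transmit at full rate with no congestion-control backoff, and the only buffer is the egress queue feeding the receiver. The function name and every number in the example are hypothetical.

```python
def incast_overflow(senders, sender_gbps, receiver_gbps, buffer_bytes, response_bytes):
    """Estimate buffer fill time and bytes dropped for a synchronized fan-in.

    Returns (seconds_until_buffer_full, bytes_dropped); the first value is
    None when the receiver link can absorb the aggregate rate.
    """
    arrival_bps = senders * sender_gbps * 1e9
    drain_bps = receiver_gbps * 1e9
    excess_bps = arrival_bps - drain_bps
    if excess_bps <= 0:
        return None, 0.0  # receiver link keeps up; the queue never builds
    burst_seconds = response_bytes * 8 / (sender_gbps * 1e9)
    fill_seconds = buffer_bytes * 8 / excess_bps
    excess_bytes = excess_bps * burst_seconds / 8
    dropped_bytes = max(excess_bytes - buffer_bytes, 0.0)
    return fill_seconds, dropped_bytes


# 16 senders at 10 Gbps, each returning 25 MB, to a 100 Gbps receiver port
# with 16 MB of egress buffer.
fill_seconds, dropped_bytes = incast_overflow(
    senders=16, sender_gbps=10, receiver_gbps=100,
    buffer_bytes=16e6, response_bytes=25e6,
)
print(f"buffer full after {fill_seconds * 1000:.2f} ms, "
      f"about {dropped_bytes / 1e6:.0f} MB dropped")
```

In this example the 20 ms fan-in fills the buffer in about 2.1 ms and would drop roughly 134 MB if nothing slowed the senders. Real transports back off well before that, but the retransmissions and stalls that result are exactly the symptoms described earlier.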

AI/ML training collective operations. Distributed AI training relies heavily on collective communication operations such as all-reduce, all-to-all, and broadcast. These operations synchronize gradients or activations across many GPUs and produce highly correlated, simultaneous traffic across many links of the AI fabric. The result is microbursts that occur with predictable timing but produce extreme transient load. (See also AI Networking 101.)

Elephant flow collisions. A single long-lived high-throughput flow — an “elephant flow” — that happens to share an ECMP path with another elephant flow can cause sustained congestion that interacts with smaller bursty traffic to produce drops. The two phenomena are related but distinct: elephant flows are about volume and duration, microbursts are about peak intensity over a short interval. See Elephant Flows: The Hidden Heavyweights of AI Data Center Networks for a deeper treatment of elephant flows specifically.

Storage backups, replication, and snapshots. Storage systems often produce highly synchronized traffic during backup windows, replication cycles, or snapshot operations. These workloads are designed for throughput, not smoothness, and routinely produce 100-ms-to-1-second bursts that saturate storage network links.

Scheduled batch jobs and cron events. Any workload that runs on a fixed schedule across many hosts simultaneously — log shipping, metrics export, security scanning, configuration sync — is a candidate to produce coincident bursts. The bursts are often predictable from the cron schedule but invisible to monitoring tools that don’t align their data to the same time window.

Multicast and content distribution events. Live video streams, software updates pushed to many hosts at once, and similar one-to-many distribution events produce short, sharp peaks in traffic that can stress aggregation and distribution links.

DDoS attacks and security events. Some DDoS attack patterns produce microburst-like signatures (sub-second floods designed to overwhelm specific buffers rather than sustain volume over time). Microburst detection capabilities can complement DDoS detection by surfacing low-volume, high-intensity attack patterns that volumetric thresholds may miss.

Microbursts vs. Elephant Flows: Related but Distinct

Microbursts and elephant flows are both phenomena where a small number of traffic events drive a disproportionate share of network impact. They are often discussed together, especially in AI data center contexts, but they describe different things.

An elephant flow is a single flow that transfers a large volume of data over an extended period — a single TCP session moving 10 GB over 30 seconds, for example. Elephant flows are about volume and duration; they distort ECMP load balancing, consume disproportionate buffer space, and can crowd out other traffic.

A microburst is a short, intense spike in aggregate traffic — often produced by many small flows occurring simultaneously, but sometimes caused by a single elephant flow’s peak rate intersecting with other traffic. Microbursts are about peak intensity over a brief interval, regardless of which flows produced them.

Figure: Leaf-spine data center fabric with an elephant flow pinned to a single path. When its sustained load intersects with the synchronized bursts of AI workloads, shallow switch buffers overflow in milliseconds.

The two phenomena often co-occur. A typical AI training workload generates both: elephant flows from gradient transfers between GPUs, and microbursts from the synchronized timing of those transfers across many parallel links. Detection strategies for the two are complementary: elephant flow detection uses thresholds on per-flow byte counts over time (a sustained, large flow), while microburst detection uses thresholds on aggregate queue occupancy at sub-second resolution. A complete data-center observability practice covers both.
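The difference between the two detection strategies shows up clearly in code. The following is a minimal sketch with hypothetical record fields, thresholds, and sample data: one function thresholds per-flow volume and duration, the other thresholds aggregate queue occupancy sampled every millisecond.

```python
def find_elephants(flows, min_bytes, min_duration_s):
    """Return ids of flows that are both large and long-lived."""
    return [flow["id"] for flow in flows
            if flow["bytes"] >= min_bytes and flow["duration_s"] >= min_duration_s]


def longest_saturation_ms(occupancy_per_ms, threshold):
    """Longest run of consecutive 1 ms samples at or above the threshold."""
    longest = run = 0
    for occupancy in occupancy_per_ms:
        run = run + 1 if occupancy >= threshold else 0
        longest = max(longest, run)
    return longest


# Per-flow view: one 10 GB / 30 s transfer among many small, short flows.
flows = [{"id": "gradient-sync", "bytes": 10_000_000_000, "duration_s": 30.0}]
flows += [{"id": f"rpc-{i}", "bytes": 2_000_000, "duration_s": 0.04} for i in range(50)]

# Aggregate view: queue occupancy per millisecond with a 50 ms saturated stretch.
occupancy = [0.10] * 100 + [0.95] * 50 + [0.10] * 100

elephants = find_elephants(flows, min_bytes=1_000_000_000, min_duration_s=10.0)
burst_ms = longest_saturation_ms(occupancy, threshold=0.90)
print(elephants, burst_ms)
```

The per-flow test never notices the fifty small flows, and the queue test says nothing about which flow is large, which is why the two checks complement each other.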

For a deeper look at elephant flows specifically — including how they distort ECMP hashing, what to do about them, and how Kentik’s flow analytics surface them — see the Kentik blog post Elephant Flows: The Hidden Heavyweights of AI Data Center Networks.

Microbursts in AI Data Center Networks

AI training workloads have moved microburst detection from a niche concern (for HFT and a few specialized environments) to a first-order requirement for any organization operating an AI fabric. Three properties of AI workloads make microbursts especially damaging.

Collective operations produce synchronized bursts at fabric scale. Distributed AI training is built around collective communication primitives (all-reduce, all-to-all, broadcast, scatter, gather) that move data among the participating GPUs in tightly coordinated steps. When a training step crosses from compute to communication, every GPU starts transmitting at almost the same instant. The resulting traffic pattern is the most extreme form of synchronized burstiness in modern networks.

RDMA transports are loss-averse. AI fabrics commonly use RoCEv2 (RDMA over Converged Ethernet v2) or InfiniBand transports that are designed for lossless operation. Even small packet loss percentages — far below the 0.1% that a traditional TCP application might tolerate — can cause severe performance degradation. A microburst that drops a handful of packets can stall an entire training step until retransmission completes.

Job completion time (JCT) is gated by tail latency, not average latency. In synchronous distributed training, every GPU must complete its collective operation before the next step begins. The slowest link determines the speed of the entire job. A microburst that adds 5 ms of latency to one link in one step doesn’t sound like much, but a delay that recurs on every step of a run spanning millions of steps adds up to hours of extra training time, and idle GPU time is the most expensive resource in a modern AI data center.
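The cumulative cost of a small per-step delay is easy to estimate. The sketch below assumes the same delay recurs on every step and that every GPU waits for the slowest link; the run length and cluster size are hypothetical round numbers.

```python
def added_hours(steps, delay_ms_per_step):
    """Wall-clock hours added when every step is delayed by the same amount."""
    return steps * delay_ms_per_step / 1000 / 3600


def idle_gpu_hours(steps, delay_ms_per_step, gpus):
    """GPU-hours spent waiting, since every GPU stalls on the slowest link."""
    return added_hours(steps, delay_ms_per_step) * gpus


# Hypothetical run: 1,000,000 steps, 5 ms added per step, 4,096 GPUs.
extra_hours = added_hours(1_000_000, 5)
wasted_gpu_hours = idle_gpu_hours(1_000_000, 5, 4096)
print(f"{extra_hours:.2f} extra hours, {wasted_gpu_hours:.0f} idle GPU-hours")
```

Under these assumptions a 5 ms stall per step adds about 1.4 hours to the run, which across 4,096 GPUs is several thousand GPU-hours of waiting.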

These properties mean that microburst detection in AI fabrics is not just about post-incident troubleshooting. It is about continuously validating that the fabric is delivering the lossless, low-jitter performance that AI workloads require, and quickly identifying interfaces, paths, or workloads that are degrading it. For background on the broader challenges of AI data center networking, see AI Networking 101: How AI Runs Networks and Networks Run AI and the Kentik blog post The Critical Role of Networks in AI Data Centers. For an operator-focused view aimed at neocloud and GPU cloud providers, see Network Intelligence for Neoclouds and AI Data Centers.

How Kentik Detects and Investigates Microbursts

Kentik approaches microburst detection by unifying the platform-layer capabilities required for a complete workflow: high-frequency device telemetry collection, flow telemetry ingestion at scale, correlation in a single data model, and AI-assisted investigation — all in one platform that complements (rather than replaces) the hardware burst-detection features of the switches themselves.

Streaming telemetry collection. Kentik NMS supports gNMI streaming telemetry alongside traditional SNMP polling for supported platforms (Cisco, Juniper, Arista). Default streaming telemetry collection is configured at 30-second intervals for general metrics, with CPU metrics collected at 2-second resolution and visible as live updates on the NMS Devices page. Custom streaming telemetry intervals can be configured per Monitoring Template where vendor support allows. Streaming telemetry data is normalized into an OpenConfig-aligned schema, so dashboards, queries, and alerts work consistently regardless of collection method.

Flow ingestion at scale. Kentik ingests NetFlow, sFlow, IPFIX, J-Flow, and cloud VPC flow logs at full network scale into the Kentik Data Engine (KDE), the columnar datastore that correlates flow data with SNMP and streaming telemetry metadata into a unified view. Flow records are stored at full fidelity and made queryable in Data Explorer, where engineers can slice by source, destination, application, ASN, interface, and time window — and correlate flow records with the device-level metrics that surfaced the burst in the first place.

Correlation in a single workflow. Because device metrics and flow records share the same data model in KDE, device-level events surfaced via streaming telemetry can be correlated to flow records in the same query interface. When an interface anomaly or device-level event surfaces, an engineer (or Kentik AI Advisor) can immediately pivot from the device event to the flows transiting that interface during the event window — without switching tools or manually aligning timestamps across systems.

Figure: Converting a Kentik Data Explorer query into an automated alert policy. Any Data Explorer query can be saved as an alert policy, codifying a burst-prone traffic condition once so the platform watches for it continuously.

AI-assisted investigation. Kentik AI Advisor accepts natural-language questions about network events and runs multi-step queries across device metrics, flow data, and synthetic test results, with its reasoning process visible in real time as it works. Questions like “What flows transited Switch-3 Ethernet1/24 during the high-utilization window this morning?” or “Is the burst pattern on this interface recurring this week?” are answered with supporting evidence at every step. The AI Advisor MCP Server also exposes these capabilities to standard MCP-compatible clients for programmatic integration. This compresses what would otherwise be hours of manual querying into a focused investigation.

Overlay-aware analytics for data center fabrics. For organizations operating data center and AI fabrics that use VXLAN overlays, Kentik supports VXLAN overlay/underlay visibility via sFlow header export, exposing VXLAN-specific dimensions for filtering and grouping. This lets congestion or anomalies on the underlay be attributed to the specific overlay tenants, services, or workloads driving them — particularly relevant for neocloud operators serving multiple AI customers from a shared fabric. (See the Optimize Data Center Networks and Network Intelligence for Neoclouds and AI Data Centers solution pages for fuller treatments.)

Kentik is not a replacement for vendor-specific hardware telemetry features like Cisco Nexus microburst monitoring or Arista LANZ — those features run on the switch ASIC and generate the events that downstream analytics consume. Kentik is the platform that ingests events and telemetry from those features (and from many other vendors) into a unified data model, correlates them with flow data, and applies AI to accelerate root-cause analysis. The result is a single workflow that spans the network, instead of a separate dashboard per vendor.

FAQs about Microburst Detection

What is a microburst in networking?

A microburst is a sub-second spike in network traffic — typically lasting between 10 milliseconds and 1 second — that can saturate a link or switch egress queue to 100% capacity even when standard 1-minute average utilization charts show healthy levels. Microbursts cause buffer overflows, tail drops, TCP retransmissions, and latency spikes that are difficult to diagnose with traditional monitoring. Kentik supports microburst detection by ingesting high-frequency streaming telemetry from network devices alongside flow records, and correlating device-level events with the specific flows responsible for them.

How long does a microburst typically last?

Microbursts typically last between 10 milliseconds and 1 second, with the most damaging bursts often falling in the 10–100 ms range. Even a 50 ms burst that fully saturates a switch egress queue can drop hundreds of packets and trigger meaningful application impact, despite being invisible on a 1-minute interface utilization chart. Detecting bursts at this time scale requires streaming telemetry at sub-second intervals.

Why do microbursts cause packet loss when average link utilization is low?

The bottleneck during a microburst is the switch egress buffer’s depth, not the interface’s bandwidth. When traffic arrives faster than the interface can forward it for even a brief interval, the buffer fills; once full, additional packets are dropped as tail drops. Average utilization over a 1-minute window can still be 40% — because the link spent the other 99.92% of that minute well below capacity — even though the link dropped packets during the burst. Kentik correlates buffer and queue telemetry with flow records to surface these drops and attribute them to specific traffic.

What causes microbursts?

Common causes include TCP incast (many clients responding to one request simultaneously), AI/ML training collective operations like all-reduce and all-to-all, storage backups and replication, scheduled batch jobs that run across many hosts at the same time, microservice fan-in patterns, multicast and content distribution events, and some DDoS attack signatures. Most causes are predictable patterns rather than random spikes, which makes detection a matter of having the telemetry resolution to see them and the analytics to attribute them.

How do you detect and investigate microbursts using flow telemetry?

Flow telemetry (NetFlow, sFlow, IPFIX) is the attribution layer of microburst detection: it identifies which flows caused a burst, while high-frequency device telemetry — streaming telemetry (gNMI) and hardware queue or buffer monitoring on the switch ASIC — identifies that a burst occurred and where. The investigation workflow is to take the burst’s time window from queue telemetry, filter flow records to that interface and window, and rank flows by byte and packet contribution to find the responsible source, destination, and application. Kentik provides this as a single workflow — ingesting gNMI streaming telemetry alongside flow records into the Kentik Data Engine, where Kentik NMS, Data Explorer, and Kentik AI Advisor support correlation and investigation without manually aligning timestamps across tools.

Can SNMP detect microbursts?

Standard SNMP polling cannot detect microbursts in any practical sense. Typical SNMP polling intervals (60, 30, or 15 seconds) are far too coarse: a 50 ms burst contributes a fraction of a percent to a 1-minute polled average and is statistically invisible. Detecting microbursts requires either high-frequency streaming telemetry from the device or event-driven hardware exports from the switch ASIC itself. For why polling-based collection misses these events while push-based telemetry catches them, see why streaming telemetry catches what SNMP polling can’t. Kentik NMS supports both SNMP polling and gNMI streaming telemetry, with custom streaming telemetry intervals configurable per Monitoring Template — so teams can mix collection methods to match the resolution needed for different metrics within a unified platform.

How do you correlate spikes in TCP retransmits with specific network segments?

Retransmit spikes are a symptom; locating the responsible segment requires correlating the timing of the retransmits with interface-level drop counters and queue telemetry across the path, then filtering flow records to the affected conversations to see which links they crossed. Microbursts are a frequent root cause — drops occur on a segment whose average utilization looks healthy — so sub-second telemetry on candidate interfaces is what separates a definitive answer from guesswork. Kentik supports this by joining flow records, interface metrics, and streaming telemetry in one data model, so a retransmit-heavy conversation can be traced to the specific interfaces and time windows where drops actually occurred.
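The timing correlation described above can be sketched as a scoring pass over per-interface drop events. This is an illustrative heuristic only: the interface names, timestamps, and the 100 ms matching window are assumptions, and a real investigation would also restrict the candidates to interfaces on the affected conversations' paths.

```python
def drops_near_retransmits(retransmit_times_ms, drops_by_interface, window_ms=100):
    """Rank interfaces by how many drop events fall near a retransmit spike."""
    scores = {}
    for interface, drop_times in drops_by_interface.items():
        scores[interface] = sum(
            1 for drop in drop_times
            if any(abs(drop - spike) <= window_ms for spike in retransmit_times_ms)
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical timestamps in milliseconds.
retransmit_spikes = [10_000, 70_050, 130_020]
drops = {
    "leaf2:Ethernet1/3": [45_000],
    "leaf1:Ethernet1/24": [9_960, 70_010, 129_990],
    "spine1:Ethernet2/1": [],
}

ranking = drops_near_retransmits(retransmit_spikes, drops)
print(ranking)
```

An interface whose drops line up with every retransmit spike, as leaf1:Ethernet1/24 does here, is the first place to pull flow records for the affected time windows.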

What is the difference between microbursts and elephant flows?

Both phenomena drive disproportionate network impact, but they describe different things. An elephant flow is a single flow that transfers a large volume of data over an extended period — for example, a single TCP session moving 10 GB over 30 seconds — and is about volume and duration. A microburst is a short, intense spike in aggregate traffic, typically lasting under one second, and is about peak intensity over a brief interval. The two often co-occur in AI data centers, where elephant flows from gradient transfers and microbursts from synchronized collective operations both contribute to fabric congestion. See the Kentik blog post on elephant flows in AI data center networks for a fuller treatment.
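A quick calculation makes the volume-versus-intensity distinction concrete. The elephant flow figures come from the example above (10 GB read as decimal gigabytes); the microburst figures, a 50 ms burst at line rate on a 10 Gbps link, are assumptions chosen for illustration.

```python
def average_rate_gbps(total_bytes, duration_s):
    """Average bit rate in Gbps for a transfer of `total_bytes` over `duration_s`."""
    return total_bytes * 8 / duration_s / 1e9


# Elephant flow from the example: 10 GB over 30 seconds.
elephant_bytes = 10e9
elephant_rate = average_rate_gbps(elephant_bytes, 30)

# Assumed microburst: 50 ms at line rate on a 10 Gbps link.
link_gbps = 10
burst_duration_s = 0.050
burst_bytes = link_gbps * 1e9 / 8 * burst_duration_s
burst_rate = average_rate_gbps(burst_bytes, burst_duration_s)

print(f"elephant flow: {elephant_bytes / 1e6:.0f} MB at {elephant_rate:.2f} Gbps average")
print(f"microburst:    {burst_bytes / 1e6:.1f} MB at {burst_rate:.2f} Gbps for 50 ms")
```

Under these assumptions the elephant flow moves 160 times more data, yet the microburst is the one that drives the link to line rate, which is why the two call for different detection methods.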

Why are microbursts especially problematic in AI data center networks?

Three properties of AI training workloads make microbursts especially damaging. First, collective communication operations like all-reduce produce synchronized bursts across many links at the same instant. Second, RDMA-based transports (RoCEv2, InfiniBand) tolerate far less packet loss than traditional TCP applications. Third, job completion time in synchronous training is gated by the slowest link, so a microburst on any one link can delay every GPU in the cluster. The combined effect is that small amounts of packet loss can translate into hours or days of additional training time and significant added cost.
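The "gated by the slowest link" effect can be illustrated with a toy model. The Python sketch below assumes compute and communication do not overlap within a step and that every worker waits for the slowest link; all timings and step counts are hypothetical.

```python
def step_time_ms(compute_ms, link_transfer_ms):
    """Toy model of one synchronous training step: compute, then wait for
    the slowest link to finish its transfer."""
    return compute_ms + max(link_transfer_ms)


def extra_hours(steps_affected, delay_ms_per_step):
    """Total added wall-clock time, in hours, across the affected steps."""
    return steps_affected * delay_ms_per_step / 3_600_000


# Hypothetical numbers: four links, one of which is delayed by a microburst.
baseline_ms = step_time_ms(100, [20, 20, 20, 20])
with_burst_ms = step_time_ms(100, [20, 20, 20, 45])
delay_ms = with_burst_ms - baseline_ms

print(f"step time: {baseline_ms} ms baseline, {with_burst_ms} ms with one delayed link")
print(f"added time over 1,000,000 affected steps: {extra_hours(1_000_000, delay_ms):.1f} hours")
```

A single delayed link sets the step time for the whole job in this model, so a per-step delay measured in milliseconds accumulates across every affected step.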

What is hardware queue telemetry and why does it matter for microburst detection?

Hardware queue telemetry refers to switch-ASIC-level metrics about egress queue and buffer occupancy — how full each queue is, how high the watermark has risen, and when high-water thresholds are tripped. These are the canonical indicators of a microburst, because microbursts are fundamentally about queue saturation rather than interface utilization. Modern switches from Cisco, Arista, Juniper, and others expose these metrics through vendor-specific features such as Cisco Nexus microburst monitoring and Arista LANZ, and via streaming telemetry paths where supported. Kentik NMS ingests streaming telemetry from these devices and makes the available metrics queryable alongside flow data in the Kentik Data Engine; the depth of available queue and buffer detail depends on what the device platform exposes.
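Threshold-based detection on queue occupancy can be sketched as follows. The sample format here, `(timestamp_ms, occupancy)` pairs with occupancy expressed as a fraction of queue capacity, is a hypothetical simplification; the actual fields exposed by features such as Arista LANZ or Cisco Nexus microburst monitoring differ by platform and are not reproduced here. The 0.8 threshold is likewise an arbitrary assumption.

```python
def find_burst_windows(samples, threshold):
    """Group consecutive samples at or above `threshold` into
    (start_ms, end_ms, peak_occupancy) windows."""
    windows = []
    current = None
    for t_ms, occupancy in samples:
        if occupancy >= threshold:
            if current is None:
                current = [t_ms, t_ms, occupancy]
            else:
                current[1] = t_ms
                current[2] = max(current[2], occupancy)
        elif current is not None:
            windows.append(tuple(current))
            current = None
    if current is not None:
        # The series ended while the queue was still above the threshold.
        windows.append(tuple(current))
    return windows


# Made-up 10 ms samples of egress queue occupancy (fraction of capacity).
samples = [
    (0, 0.05), (10, 0.10), (20, 0.85), (30, 0.97), (40, 0.91),
    (50, 0.20), (60, 0.10), (70, 0.88), (80, 0.15),
]
windows = find_burst_windows(samples, threshold=0.8)
for start_ms, end_ms, peak in windows:
    print(f"burst from {start_ms} ms to {end_ms} ms, peak occupancy {peak:.0%}")
```

Each resulting window supplies the interface-and-time filter needed to pull the matching flow records for attribution.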

How does Kentik detect and investigate microbursts?

Kentik supports microburst detection workflows by ingesting gNMI streaming telemetry from network devices alongside NetFlow, sFlow, IPFIX, and VPC flow records into a unified data model. Device-level events from switch ASIC features (Cisco Nexus microburst monitoring, Arista LANZ, Juniper microburst features) flow into Kentik via streaming telemetry, where they can be correlated with flow records in Data Explorer to attribute the burst to specific source/destination pairs. Kentik AI Advisor accepts natural-language questions about these events and runs multi-step queries with its reasoning visible at each step, compressing investigation into a focused workflow rather than manual correlation across vendor-specific dashboards.

Detect Microbursts and Accelerate Root Cause Analysis with Kentik

Kentik is the network intelligence platform that turns high-frequency device telemetry and flow records into answers — not just “did a burst happen,” but where it happened, which flows caused it, and whether it’s part of a recurring pattern.

Get a demo — See how Kentik correlates device-level events with specific flows to accelerate microburst root-cause analysis
Kentik NMS — Next-generation NMS with SNMP, gNMI streaming telemetry, and AI-assisted troubleshooting
Kentik AI Advisor — Natural-language investigation across device, flow, and synthetic telemetry
Network Intelligence for Neoclouds and AI Data Centers — Identify microbursts and elephant flows to minimize delays and protect job completion time (JCT)
Optimize Data Center Networks — Map overlay to underlay and accelerate troubleshooting in AI and enterprise fabrics

Updated: June 29, 2026