Question 1

What network challenges does Kentik address for neoclouds and AI data centers?

Accepted Answer

Neoclouds and AI data centers face network challenges that traditional infrastructure monitoring tools weren't designed to handle: ensuring lossless connectivity across high-speed GPU fabrics, detecting microbursts that stall AI training jobs, correlating overlay performance with underlay paths in VXLAN networks, monitoring east-west traffic patterns between GPU clusters, optimizing job completion times (JCT) for AI workloads, and protecting high-value inference endpoints from DDoS attacks. Kentik supports all of these in a unified network intelligence platform that ingests SNMP, gNMI streaming telemetry, NetFlow/sFlow/IPFIX, VPC flow logs, and synthetic test data — giving GPU cloud operators complete visibility from the leaf-spine fabric to the cloud and internet edge.

Question 2

Which tools are most effective for monitoring AI workloads in the cloud?

Accepted Answer

Effective AI workload monitoring requires capabilities that general cloud monitoring tools rarely deliver: visibility into the high-speed GPU fabric (typically VXLAN with gNMI streaming telemetry), microburst detection at sub-second granularity, east-west traffic analytics between GPU clusters, correlation of network behavior with job completion times, and DDoS protection for inference APIs without adding latency. Kentik is purpose-built for this environment, combining unified network telemetry, AI-driven investigation through Kentik AI Advisor, and the operational scale to monitor AI fabrics at the speeds modern accelerators demand. Cloud-native monitoring tools (Amazon CloudWatch, Azure Monitor, Google Cloud Monitoring) provide useful operational signals but don't reach the network depth required to operate production AI infrastructure.

Question 3

How do I get visibility into overlay tunnels and underlay paths in a VXLAN fabric?

Accepted Answer

VXLAN fabric visibility requires capturing telemetry from both layers: overlay traffic between virtual endpoints (tenant networks, GPU clusters, container workloads) and underlay performance across the physical leaf-spine network. Kentik supports this by ingesting flow data from VXLAN-enabled devices, correlating overlay flows with underlay paths through VTEP-aware analytics, and applying gNMI streaming telemetry to capture per-interface and per-queue metrics at high frequency. Operators can pivot from overlay-level symptoms (a tenant reporting slow training jobs) to underlay-level root cause (link saturation, microburst loss, asymmetric paths) without switching tools.
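As a rough illustration of the overlay-to-underlay correlation idea, the sketch below attributes overlay bytes to underlay links by VTEP pair. It is not Kentik's implementation: the `OverlayFlow` record, the `UNDERLAY_PATHS` table, and the link names are all hypothetical, and a single fixed path per VTEP pair is assumed (real fabrics usually spread traffic over ECMP paths).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverlayFlow:
    """Hypothetical flow record exported by a VXLAN-enabled leaf."""
    vni: int         # VXLAN network identifier (tenant segment)
    src_vtep: str    # underlay IP of the encapsulating leaf
    dst_vtep: str    # underlay IP of the decapsulating leaf
    byte_count: int


# Assumed underlay topology: one fixed path per VTEP pair (ECMP is ignored).
UNDERLAY_PATHS = {
    ("10.0.0.1", "10.0.0.2"): ["leaf1-spine1", "spine1-leaf2"],
    ("10.0.0.1", "10.0.0.3"): ["leaf1-spine2", "spine2-leaf3"],
}


def bytes_per_underlay_link(flows, paths):
    """Attribute overlay bytes to each underlay link, broken down by VNI."""
    usage = {}
    for flow in flows:
        # Flows whose VTEP pair is unknown are skipped rather than guessed at.
        for link in paths.get((flow.src_vtep, flow.dst_vtep), []):
            per_vni = usage.setdefault(link, {})
            per_vni[flow.vni] = per_vni.get(flow.vni, 0) + flow.byte_count
    return usage


flows = [
    OverlayFlow(5001, "10.0.0.1", "10.0.0.2", 4_000),
    OverlayFlow(5002, "10.0.0.1", "10.0.0.2", 1_000),
    OverlayFlow(5001, "10.0.0.1", "10.0.0.3", 2_500),
]
usage = bytes_per_underlay_link(flows, UNDERLAY_PATHS)
```

With a table like `usage`, a saturated underlay link can be traced back to the tenant segments (VNIs) that load it, which is the pivot from overlay symptom to underlay cause described above.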

Question 4

How do I detect and investigate microbursts using flow telemetry?

Accepted Answer

Microburst detection requires high-frequency telemetry that captures sub-second traffic spikes — events that traditional polling-based monitoring (1-minute or 5-minute intervals) misses entirely but that can stall AI training jobs or cause packet loss invisible at coarser timescales. Kentik supports microburst detection by ingesting gNMI streaming telemetry at native subscription intervals, correlating high-resolution interface metrics with flow data, and surfacing burst patterns through visualizations that show traffic spikes against capacity baselines. When a microburst is detected, operators can drill into the affected interfaces, identify the elephant flows or traffic patterns driving the burst, and adjust queuing policies or capacity allocation accordingly.
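A minimal sketch of why sampling interval matters, assuming interface throughput samples arrive every 100 ms from a streaming telemetry subscription. The function names, the 90% threshold, and the sample values are illustrative assumptions, not Kentik settings.

```python
def find_microbursts(samples_bps, capacity_bps, interval_s, burst_ratio=0.9):
    """Return (start_time_s, utilization) for samples at or above burst_ratio of capacity."""
    bursts = []
    for i, rate in enumerate(samples_bps):
        utilization = rate / capacity_bps
        if utilization >= burst_ratio:
            bursts.append((round(i * interval_s, 3), utilization))
    return bursts


def average_utilization(samples_bps, capacity_bps):
    """Utilization a coarse poller would report for the same window."""
    return sum(samples_bps) / len(samples_bps) / capacity_bps


CAPACITY_BPS = 100e9              # assumed 100 Gb/s link
samples = [10e9] * 10             # one second of 100 ms samples at 10 Gb/s
samples[4] = 98e9                 # a single 100 ms spike near line rate

bursts = find_microbursts(samples, CAPACITY_BPS, interval_s=0.1)
avg = average_utilization(samples, CAPACITY_BPS)
```

Here the window average is under 19% utilization, so a poller that only reports averages would show a lightly loaded link, while the per-sample view flags the spike at 0.4 s.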

Question 5

How do I monitor east-west traffic within AI data centers and GPU clusters?

Accepted Answer

East-west traffic in AI data centers — communication between GPUs during training, between training and inference workloads, between storage and compute — is often the dominant traffic pattern and the primary determinant of job performance. Kentik supports east-west visibility by ingesting flow data from spine and leaf switches, applying VTEP-aware enrichment to correlate physical and logical flows, capturing pod-to-pod traffic via the Kentik Kappa eBPF agent for Kubernetes-based AI infrastructure, and surfacing east-west patterns alongside north-south internet traffic in a unified workflow. Teams use this to identify GPU clusters that are network-bottlenecked, validate that distributed training jobs are using optimal paths, and detect unexpected traffic patterns that may indicate inefficient model architectures.
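One simple way to separate east-west from north-south traffic in flow records is to check whether both endpoints fall inside the data center's own address space. The sketch below uses Python's standard `ipaddress` module; the `INTERNAL_PREFIXES` list and the flow tuples are assumptions for illustration.

```python
import ipaddress

# Assumed address plan: everything in 10.0.0.0/8 belongs to the data center.
INTERNAL_PREFIXES = [ipaddress.ip_network("10.0.0.0/8")]


def is_internal(addr):
    """True if the address falls inside any internal prefix."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in INTERNAL_PREFIXES)


def classify(src, dst):
    """Label a flow east-west only when both endpoints are internal."""
    return "east-west" if is_internal(src) and is_internal(dst) else "north-south"


def bytes_by_direction(flows):
    """Sum bytes per direction for (src, dst, byte_count) tuples."""
    totals = {"east-west": 0, "north-south": 0}
    for src, dst, byte_count in flows:
        totals[classify(src, dst)] += byte_count
    return totals


flows = [
    ("10.1.0.5", "10.1.0.9", 9_000),      # GPU to GPU, same cluster
    ("10.1.0.5", "10.2.0.7", 6_000),      # compute to storage
    ("10.1.0.5", "203.0.113.10", 500),    # outbound to the internet
]
totals = bytes_by_direction(flows)
```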

Question 6

How does Kentik help protect AI inference APIs from DDoS attacks?

Accepted Answer

AI inference endpoints are increasingly high-value DDoS targets — they're customer-facing, often expensive to provision, and sensitive to added latency that mitigation services might introduce. Kentik supports DDoS protection by detecting volumetric and pattern-based attacks against inference APIs in real time using flow telemetry, triggering mitigation through BGP Flowspec, RTBH, or third-party scrubbing services (Cloudflare, Radware, A10) without forcing all traffic through inline inspection, and providing full-fidelity flow forensics after the event to identify attack sources and patterns. The combination of fast detection and flexible mitigation lets neocloud operators defend high-value inference workloads while preserving the low-latency characteristics their customers depend on.
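The detection step can be pictured as comparing per-destination packet rates from flow records against a baseline. The sketch below is a deliberately simple threshold check, not Kentik's detection logic; the baseline values, the 5x multiplier, and the record shape are assumptions, and the mitigation step (Flowspec, RTBH, or scrubbing) is not shown.

```python
from collections import defaultdict


def detect_volumetric(flows, baseline_pps, window_s, multiplier=5.0):
    """Return {destination: packets_per_second} for destinations over their limit.

    flows: iterable of (destination_ip, packet_count) seen during the window.
    baseline_pps: normal packets per second for each protected destination.
    Destinations without a baseline are ignored in this sketch.
    """
    packets = defaultdict(int)
    for dst, packet_count in flows:
        packets[dst] += packet_count

    alerts = {}
    for dst, total in packets.items():
        pps = total / window_s
        limit = baseline_pps.get(dst, 0) * multiplier
        if limit and pps > limit:
            alerts[dst] = pps
    return alerts


baseline = {"198.51.100.10": 1_000, "198.51.100.20": 1_000}
flows = [
    ("198.51.100.10", 40_000),
    ("198.51.100.10", 30_000),
    ("198.51.100.20", 8_000),
]
alerts = detect_volumetric(flows, baseline, window_s=10)
```

In this example only the first endpoint crosses five times its baseline, so only that destination would be handed to a mitigation workflow while the other keeps its normal path.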

Question 7

How does Kentik support job completion time (JCT) optimization for AI training?

Accepted Answer

Job completion time is the dominant performance metric for AI training infrastructure: a job that takes 12 hours instead of 10 runs 20% longer than necessary, on infrastructure that costs millions of dollars per cluster. Kentik supports JCT optimization by surfacing network conditions that contribute to training slowdowns: microburst events that cause packet loss and retransmissions, congestion on critical paths between GPU clusters, asymmetric routing that increases collective operation latency, and capacity bottlenecks on inter-rack or inter-cluster links. Network engineering teams use this data to validate that training infrastructure is performing optimally, identify network changes that are slowing training jobs, and prioritize capacity investments where they'll most improve JCT.
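The 12-hour versus 10-hour comparison can be expressed two ways, and a quick calculation keeps them apart: the slowdown relative to the ideal run, and the share of the actual run that was lost. The helper below is a hypothetical illustration of that arithmetic.

```python
def jct_overhead(actual_hours, ideal_hours):
    """Express extra job time both as a slowdown and as a share of the actual run."""
    extra = actual_hours - ideal_hours
    return {
        "extra_hours": extra,
        "slowdown_pct": 100 * extra / ideal_hours,          # relative to the ideal run
        "share_of_runtime_pct": 100 * extra / actual_hours,  # relative to the actual run
    }


result = jct_overhead(actual_hours=12, ideal_hours=10)
```

For the 12-hour job, the run is 20% longer than the 10-hour ideal, and the two extra hours make up roughly 16.7% of the GPU time actually consumed.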

Question 8

How does Kentik compare to traditional data center monitoring tools for AI infrastructure?

Accepted Answer

Traditional data center monitoring tools focus on device health and threshold-based alerting at minute-scale polling intervals. That is sufficient for general infrastructure operation but inadequate for AI fabrics, where sub-second events affect job performance. Kentik is built for the demands of modern AI infrastructure: gNMI streaming telemetry at native subscription rates, microburst detection at sub-second granularity, unified overlay/underlay analytics for VXLAN fabrics, full-fidelity flow retention for forensic investigation, and AI-driven investigation that can reason across all telemetry sources. For neocloud operators and AI data center teams, the question isn't typically "should we replace SolarWinds with Kentik?" but "do we have any network telemetry tools that work at the speeds modern AI accelerators demand?"

Network Intelligence for Neoclouds & AI Data Centers

