Kubernetes and Docker are increasingly familiar to DevOps and SRE teams, but they remain relatively unfamiliar to network teams, even though their network interactions are complex. To keep applications reliable in these new architectures, network teams must get more involved and proactively identify issues.
Those working in network operations or development are well acquainted with the “blame game”: The network is most often at fault… until the network team provides proof that it’s not. I can recall, 10 years ago, finding a strange CIFS configuration problem via packet capture just as we were about to abort a massive data center cutover. Once we fixed the issue, we cut over with just enough time to spare, and it was a big win for the team. Ultimately, packets do not lie, and analyzing packets is always the final diagnostic and analysis effort when issues occur.
Unfortunately, packet capture is becoming less feasible with the abstraction of networks via overlays and encryption.
In today’s infrastructures, these additional layers are deployed on top of the existing network. In a typical Kubernetes deployment, one of several network overlay technologies is used—the most common two being Flannel and Calico—but there are dozens more, as detailed in this excellent blog by Steven Acreman and his associated Google Sheet.
The challenge is that each of these network overlay technologies requires a new or modified set of tools capable of understanding the protocols, security, and routing of packets in these new layers.
But the benefits of network overlay technologies need not come at the cost of decreased network visibility. Based on our customers’ need to address these issues, Kentik has created a way to see inside the network overlay technologies used within public clouds, such as Amazon Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE), and Azure Kubernetes Service (AKS).
Kentik also supports on-premises Kubernetes deployments running plugins such as Flannel and Calico. These capabilities allow Kentik to collect data from within the network overlays and services to discover how the Kubernetes pods and nodes communicate. Although this helps a great deal, packet capture remains a challenge because legacy tools were not built to see inside overlays.
When debugging a service communication issue in a typical network environment, it’s difficult to determine whether the problem is in the physical network, firewalls, a logical configuration such as routing, or other access controls. It could even be at the host level with configuration of DNS, or external to the network at the load balancing layers.
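A quick way to work through those layers is to rule out host-level causes first. The sketch below is an illustrative helper (the function name and return shape are my own, not any particular tool’s): it checks DNS resolution, then TCP reachability, so a failure can be attributed to resolver configuration versus firewalls, routing, or the service itself.

```python
import socket

def check_service(host: str, port: int, timeout: float = 2.0) -> dict:
    """Triage a service issue layer by layer: DNS first, then TCP reachability."""
    result = {"host": host, "port": port, "resolved": None, "reachable": False}
    try:
        # DNS: a failure here points at resolver/host configuration,
        # not at the network path itself
        result["resolved"] = socket.getaddrinfo(
            host, port, proto=socket.IPPROTO_TCP
        )[0][4][0]
    except socket.gaierror:
        return result
    try:
        # TCP: a failure here points at firewalls, routing,
        # access controls, or the service not listening
        with socket.create_connection((result["resolved"], port), timeout=timeout):
            result["reachable"] = True
    except OSError:
        pass
    return result
```

Running `check_service("api.internal", 8080)` against a misbehaving service tells you immediately which layer to dig into next.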
These same services and requirements remain in place in today’s infrastructure—even on public cloud—but now there is significantly more complexity as you are running another network on top. (See this overview to learn more about Kubernetes networking.) In many of these overlay deployments, there are complexities around addressing, routing, and advertising. These are all in addition to the existing complexity of the physical network.
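To make the addressing complexity concrete, here is a minimal sketch of the scheme many overlays use: a cluster-wide pod CIDR carved into one smaller range per node, with pod IPs allocated from the node’s range. The 10.244.0.0/16-into-/24s split shown matches Flannel’s defaults; the node names are hypothetical.

```python
import ipaddress

# Cluster-wide pod CIDR, split into /24 ranges, one per node
# (Flannel's default scheme: 10.244.0.0/16 -> a /24 per node)
cluster_cidr = ipaddress.ip_network("10.244.0.0/16")
node_pod_cidrs = list(cluster_cidr.subnets(new_prefix=24))

# Each node owns one range; its pods' IPs come from that range
for node, cidr in zip(["node-a", "node-b", "node-c"], node_pod_cidrs):
    print(node, cidr)
```

Tracing a pod IP back through these allocations—on top of the physical network’s own addressing—is exactly the kind of bookkeeping that makes overlay troubleshooting hard by hand.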
Without a tool to help diagnose the traffic and paths, it is challenging to isolate problems. Some users even run BGP via Calico, which enables granular traffic and path control at the pod level but adds yet another layer to manage. Within these overlays, there are also security policies which may filter access by pod, service, IP, URL, or other methods.
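As a concrete example of that pod-level filtering, a standard Kubernetes NetworkPolicy (enforced by CNIs such as Calico; the namespace and labels here are hypothetical) might allow only frontend pods to reach an API pod:

```yaml
# Hypothetical policy: only pods labeled app=frontend may reach
# pods labeled app=api on TCP 8080; all other ingress is dropped.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
  namespace: demo
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```

A policy like this is invisible to traditional packet-level tooling on the underlay, which is precisely why overlay-aware visibility matters when traffic is unexpectedly dropped.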
Kentik deploys inside a Kubernetes cluster as a small container called kubetags, distributed via Docker Hub. This very lightweight agent adds Kubernetes metadata to the flow data captured with either Kentik’s kProbe agent (on the host or container) or via the native flow log support Kentik offers on Amazon, Microsoft, and Google clouds.
We are looking to extend these visibility features to address additional overlays in the future.