Kentik’s Operational DNA
One of the earliest lessons you learn when you’re running networks is that to control a network you have to know what that network is doing. With extensive experience building and running some of the largest network infrastructures in the world, including Akamai, Netflix, YouTube, and CloudFlare, the founders and executives here at Kentik have had plenty of opportunities to learn that lesson. Unfortunately, in the process we learned over and over that the systems available to provide network knowledge have massive gaps. Without an architecture that can truly handle vast volumes of network data, none of the “solutions” we’ve seen or used have ever been up to the task. And as the importance and complexity of networks have increased, those systems fall further behind every day.
When we started Kentik last January, our idea was that if we pooled our experiences and insights we could change this. We hoped to create a universal, open platform for network visibility and control, and to vastly improve the operational landscape of network management. In the process we knew we’d be making the lives of our friends and colleagues in network operations a whole lot easier.
My own experience with operations began at 16. I’d been an avid UNIX hacker since age 10, and after a few years I became part of a group out of Lawrence Livermore National Laboratory that was working on high-performance computing (HPC). That’s when I learned firsthand about the sad state of operational tools. Later, at CloudFlare, I focused on the large-scale provisioning system, automation engineering, and designing utilities and processes that help administrators manage large clusters efficiently.
While working in these various aspects of operations I’ve lived by a few basic tenets:
- For every common problem, there is a common cause.
- If you understand a problem, you can reproduce it.
- Once you’ve fixed something once, you should be able to automate the resolution of every subsequent event of the same kind.
The last point often remained just an aspiration, because fixes tend to hinge on one person’s intimate knowledge of the stack. Without an effective way to share operational intelligence, I had no way to make that aspiration a reality. And everyone I knew in network operations was struggling with similar issues. Operations groups and operational tools were very fragmented, with each team in an organization using a different handful of tools or services, each of which showed only a very small part of the overall picture. Even if you were lucky enough to identify a problem with those tools, it was a real challenge to track issues as they evolved or to share the data between teams.
It was this completely unsatisfactory situation that created the impetus for the Kentik Data Engine (KDE), a single source of truth for network data that can be easily integrated across many utilities. KDE gives operators the ability to speak the same language across their organization, making data more actionable, troubleshooting more efficient, and cost savings easier to identify. Unlike black-box vendors that force you to guess in advance what problems you’re going to need to look for, KDE enables Kentik to finally provide the full visibility that all of us in operations have been missing for so long.
Our earliest customers and beta users (including some of the world’s largest .coms, ISPs, and carriers) recognized the value of what we were trying to do and the way we were building it, and encouraged us to take our concept to general availability. Through their input we’ve gleaned more insights into how to proceed and gained confidence from knowing that the problem we were solving was as critical to them as it was to us. At Kentik I’ve been focused on our systems and deployment architecture, and on helping to ensure that the internal operational culture at Kentik matches the ideal that we set out to enable for the infrastructure operations community at large.
With our launch and the general availability of Kentik Detect this week, I can see that we’ve made great strides in a very short time. In partnership with some of the largest infrastructure operators in the world, we’re now able to bring our operational backgrounds to bear and change the quality of visibility for everyone involved in infrastructure operations, management, and protection. In coming posts I’ll drill down into specific challenges faced by operators and talk about ways that Kentik can help.