Last month at DENOG11 in Germany, Kentik Site Reliability Engineer Costas Drogos talked about the SRE team’s journey over the last four years of growing Kentik’s infrastructure to support thousands of BGP sessions with customer devices on Kentik’s multi-tenant SaaS (cloud) platform. Costas shared the challenges the team overcame, the actions they took, and, finally, the key takeaways.
Costas started off with a short introduction to how Kentik uses BGP, which frames the technical requirements:
At Kentik, we use BGP data not only to enrich flow data so queries can be filtered by BGP attributes, but also to compute a range of other analytics from routing data. For example, you can see how much of your traffic is associated with RPKI-invalid prefixes; you can do peering analytics; if you have multiple sites, you can see how traffic enters and exits your network (Kentik Ultimate Exit™); and you can even perform network discovery. Moreover, each BGP session can serve as the transport for pushing mitigations, such as RTBH and Flowspec, triggered by alerting from the platform.
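To make the enrichment idea concrete, here is a toy sketch of a longest-prefix-match lookup that attaches BGP attributes to a flow record; the route table, attribute values, and flow record are invented for illustration and are not our actual pipeline.

```python
"""Toy illustration of enriching a flow record with BGP attributes.

Not the actual pipeline: the route table, attributes, and flow record
below are made up. The point is the longest-prefix-match lookup that
attaches routing attributes (origin AS, AS path, RPKI state) to a flow
so queries can later filter on them.
"""
import ipaddress

# Hypothetical routes learned over a customer's BGP session.
ROUTES = {
    "203.0.113.0/24": {"origin_as": 64501, "as_path": [64510, 64501], "rpki": "valid"},
    "198.51.100.0/24": {"origin_as": 64502, "as_path": [64510, 64502], "rpki": "invalid"},
}

def lookup(dst_ip: str) -> dict:
    """Return the attributes of the longest matching prefix, or {}."""
    addr = ipaddress.ip_address(dst_ip)
    best_net, best_attrs = None, {}
    for prefix, attrs in ROUTES.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_attrs = net, attrs
    return best_attrs

flow = {"src": "192.0.2.10", "dst": "198.51.100.7", "bytes": 4200}
flow.update(lookup(flow["dst"]))
print(flow)  # the flow record now carries origin_as, as_path and rpki
```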
Costas then shared how the infrastructure has been built out from the beginning to today as Kentik’s customer base has grown.
Back in 2015, when we monitored approximately 200 customer devices, we started with 2 nodes in active/backup mode. The two nodes shared a floating IP that handled HA/failover, managed by ucarp, a userland implementation of OpenBSD’s CARP protocol (similar to VRRP). This setup was started at boot time by a script residing in /root, invoked via rc.local.
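As a rough sketch of that first phase (not our actual script), the boot-time ucarp launch for a shared floating IP could have looked something like this; the interface, addresses, vhid, password, and up/down script paths are all placeholders:

```python
"""Illustrative boot-time launcher for the original active/backup setup.

Not the actual /root script: the interface, addresses, vhid, password
and script paths are placeholders. ucarp elects a master between the
two nodes and runs the up/down scripts to add or remove the floating IP.
"""
import subprocess

subprocess.Popen([
    "ucarp",
    "--interface=eth0",
    "--srcip=10.0.0.11",                    # this node's real address
    "--vhid=1",                             # virtual ID shared by both nodes
    "--pass=example-secret",
    "--addr=192.0.2.1",                     # the floating BGP IP
    "--upscript=/usr/local/bin/vip-up.sh",  # e.g. 'ip addr add' the VIP
    "--downscript=/usr/local/bin/vip-down.sh",
])
```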
Obviously, this setup didn’t go very far given the rapid growth of BGP sessions. After a while, the single active node could no longer handle all peers, becoming overutilized in both memory and CPU. With Kentik growing quickly, the solution needed to evolve.
In order to fit more peers, we had to add extra BGP nodes. Looking at our setup, the first thing we did was to replace ucarp, because we had observed scaling issues with more than 2 nodes. We developed a home-grown shell script (called ‘bgp-vips’) that communicated with a spawned exaBGP. This took care of announcing our floating BGP IP, which was now provisioned on the host’s loopback interface. Each host announced the route with a different MED so that we had multiple paths available at all times.
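As a minimal sketch of the same idea (not the actual ‘bgp-vips’ script), a helper driving exaBGP’s process API could announce the floating IP with a host-specific MED; the hostnames, addresses, and MED values below are placeholders:

```python
"""Minimal exaBGP process-API helper (not the actual 'bgp-vips' script).

exaBGP runs this as a configured 'process' and reads announcements from
its stdout. Hostnames, addresses and MED values are placeholders; a
lower MED makes that node's path the preferred one.
"""
import socket
import sys
import time

FLOATING_IP = "192.0.2.1/32"       # the floating BGP IP on the loopback
NEXT_HOP = "10.0.0.11"             # this node's routable address
MED_BY_HOST = {"bgp-node-1": 100, "bgp-node-2": 200, "bgp-node-3": 300}

med = MED_BY_HOST.get(socket.gethostname(), 500)

# exaBGP parses plain-text commands written to stdout.
sys.stdout.write(f"announce route {FLOATING_IP} next-hop {NEXT_HOP} med {med}\n")
sys.stdout.flush()

# Stay alive so exaBGP keeps the announcement in place.
while True:
    time.sleep(60)
```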
The next big step was to scale out the actual connections by allowing them to land on different nodes. On top of that, since our BGP nodes were identical, the sessions should be distributed evenly across them. Given that only one node receives traffic for the floating IP at any time, the next step was to have this landing node act as a router for inbound BGP connections, with policy routing as the high-level design. The issue we then had to think about was how to achieve a sufficiently uniform distribution. After testing multiple setups, we ended up using wildcard masks as the sieve for marking connections.
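As a toy illustration of the sieve idea (not our production marking rules), keeping only the low-order bits of each peer’s source address with a wildcard mask splits connections into buckets that can each be marked and policy-routed to a different node:

```python
"""Toy illustration of spreading peers across nodes with a wildcard mask.

Not the production marking rules: the mask, node count and peer
addresses are examples. Keeping only the two low-order bits of the
source address (wildcard mask 0.0.0.3) yields four roughly even
buckets, and each bucket's connections are marked and policy-routed
to one BGP node.
"""
import ipaddress

WILDCARD = 0x00000003  # 0.0.0.3 -> keep the two low-order bits

def node_for_peer(src_ip: str) -> int:
    """Bucket (i.e. target BGP node index) for a peer's source address."""
    return int(ipaddress.IPv4Address(src_ip)) & WILDCARD

for peer in ["198.51.100.7", "198.51.100.8", "203.0.113.21", "192.0.2.44"]:
    print(f"{peer} -> bgp-node-{node_for_peer(peer)}")
```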
While we were able to scale connections and achieved a mostly uniform distribution across the peering nodes, the setup was still not IPv6-ready and required a full exaBGP restart on any topology change, resulting in BGP flaps for customers.
On top of that, we introduced RTBH for DDoS mitigation, which immediately raised the importance of a stable BGP setup, since we were now actively protecting customers’ networks.
With the fast growth of Kentik, when we hit the 1,300-peer mark, a few more issues surfaced:
Improvement of the Phase 2 setup became imperative. Customers were being onboarded so rapidly that the only way forward was continued innovation.
In the meantime, Kentik introduced Flowspec DDoS mitigations, so offering stable BGP sessions became even more important.
Today, Kentik continues to grow and now peers with more than 4,000 customer devices. As before, we approached the next phase in the spirit of continuous improvement, creating a new design that builds on our previous experience. We started by setting the requirements, including that we:
We tested different designs during an evaluation cycle and decided to go with LVS/DSR (Linux Virtual Server with Direct Server Return), a load-balancing architecture traditionally used in front of web servers that turned out to work well for BGP connections, too.
Here is how it works:
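In LVS/DSR, a director node owns the service IP and forwards each incoming connection at layer 2 to one of the real servers; every real server also carries that IP on its loopback (with ARP suppressed via the usual arp_ignore/arp_announce sysctls), so it accepts the traffic and replies directly to the client, bypassing the director on the return path. As a minimal sketch, assuming a plain ipvsadm-based configuration (the VIP, real-server addresses, and round-robin scheduler are illustrative, not necessarily what we run in production), the director could be provisioned like this:

```python
"""Illustrative LVS/DSR director provisioning (not our actual tooling).

Defines a virtual TCP service on the BGP port and adds each BGP node as
a real server in gatewaying (-g, i.e. direct-return) mode. Every real
server must also hold the VIP on its loopback and suppress ARP for it.
All addresses and the scheduler choice are placeholders.
"""
import subprocess

VIP = "192.0.2.1"          # hypothetical floating BGP service IP
BGP_PORT = 179
REAL_SERVERS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Create the virtual service with a round-robin scheduler.
run(["ipvsadm", "-A", "-t", f"{VIP}:{BGP_PORT}", "-s", "rr"])

# Add each BGP node in direct-routing mode so replies bypass the director.
for rs in REAL_SERVERS:
    run(["ipvsadm", "-a", "-t", f"{VIP}:{BGP_PORT}", "-r", rs, "-g"])
```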
Under the hood, the new design utilizes the following:
Today, we’re testing the new setup in our staging environment, evaluating the pros and cons and tuning it to ensure it will meet future scaling requirements as we begin to support tens of thousands of BGP connections.
Over the past four years of rapid growth, we have evolved the backend for our BGP route ingestion through four major phases to meet scaling requirements and improve our setup’s reliability:
To watch the complete talk from DENOG, please check out this YouTube video.
To learn more about Kentik, sign up for a free trial or schedule a demo with us.