Kentik - Network Observability
Case Study

Booking.com has no reservations about the value of Kentik

Overview

Booking.com is one of the world’s leading digital travel companies. Its high standards for availability and quality of service require a robust network infrastructure that demands close and constant monitoring. Booking.com uses the Kentik Network Observability Cloud to understand, in unmatched detail, all aspects of its networks to ensure reliability and achieve optimal performance.

Situation

Since its founding in Amsterdam in 1996, Booking.com has grown into one of the world’s premier digital platforms for booking travel reservations, including everything from holiday homes, hotels and apartments to rental cars, attractions and flights. It serves customers from more than 220 countries and territories in 44 languages and dialects, including more than 100 million monthly active app users globally, drawing on a pool of over 28 million accommodation listings in more than 155,000 destinations around the world.

The company has an equally global network infrastructure. Data centers and points of presence (PoPs) span four continents to ensure availability of services and customer support 24/7. Delivering on the promise of nonstop availability is Jurriën Rasing, Booking.com’s group product manager for Platform Engineering.

Rasing says one of the challenges of running global network infrastructure is having access to enough information to troubleshoot problems, improve operational efficiency and plan for future needs. “So many factors influence network operations in a complex global environment that if you’re not fully informed, you may be flying blind,” says Rasing.

Solution

Network complexity is a major reason Booking.com uses the Kentik Network Observability Cloud to monitor, in detail, all aspects of its network infrastructure. When the team adopted Kentik in 2020, the focus was on understanding the types and volumes of traffic hitting the company’s internet edge. Part of that data informed decisions on selecting or revising peering arrangements with carriers.

The use of Kentik soon expanded. “Once we saw that we could easily collect data about usage of our network, we immediately wanted to see if we could expand it to other areas,” Rasing recalls. A key consideration at that point was scalability, he adds. “We have experience with a lot of solutions that work fine until you start to scale them out to the size of our operation. That’s when it becomes a problem.”

That wasn’t the case with Kentik, because the Kentik platform can accommodate data center fabrics with a spine-leaf configuration – which is exactly what Booking.com has. According to Rasing, “This makes it possible to apply Kentik to our entire network infrastructure.”

Booking.com expanded its use of Kentik from an initial 50-device network to several thousand devices today. This expansion allowed the team to address one of its key objectives: improving the cost-effectiveness of the core IT infrastructure. That infrastructure is used by many internal clients, such as teams providing MySQL and Cassandra database services, and Graphite time-series services. Previously, these service groups could track utilization of servers, “but we didn’t know the sources, destinations and volumes of data being moved,” Rasing adds.

In these distributed services, traffic can move to and from anywhere in the company’s vast global network. Rasing notes that it’s important to understand traffic patterns and the data volumes behind them – information that becomes especially important as the company transitions more of its operations from on-premises data centers to cloud services.

Results

Greater observability leads to enhanced efficiency

With the highly granular insights delivered by Kentik, the Booking.com team can now give service teams a clear picture of how they are using the network – not for the purpose of billing, but as a basis for exploring how usage can be more efficient.

Rasing cited an example: “With Kentik, we can see utilization of every device connected to our network. We noticed that one device was almost always near 100 percent. So, we went to the team using that interface – who are responsible for downloading images over the internet – and we talked about what they were doing. It turned out they had set up load-balancing such that all traffic was directed to a single server when they could have distributed it more evenly. They made a simple change, and traffic normalized.”

To achieve this level of specificity in understanding network operations, the team created a tool called identity mapping: “Each IP address we have is mapped to a service owner. We add this information as a custom field in Kentik, so we know exactly how much traffic is being generated, and where it’s moving from and to. Now, we can talk knowledgeably about how internal teams can use our network most efficiently.”
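The identity-mapping idea can be sketched in a few lines of Python. This is an illustration only, not Booking.com's tool or Kentik's API: the owner table, the team names, and the `owner_of` and `bytes_by_owner_pair` helpers are all hypothetical, and the sketch assumes owners are assigned per address range rather than per individual IP.

```python
import ipaddress
from collections import defaultdict

# Hypothetical owner table; ranges and team names are illustrative assumptions.
OWNERS = {
    "10.1.0.0/16": "mysql-team",
    "10.2.0.0/16": "cassandra-team",
    "10.3.0.0/16": "graphite-team",
}
_NETWORKS = [(ipaddress.ip_network(prefix), owner) for prefix, owner in OWNERS.items()]


def owner_of(ip: str) -> str:
    """Return the service owner for an IP address, or "unknown" if unmapped."""
    addr = ipaddress.ip_address(ip)
    for network, owner in _NETWORKS:
        if addr in network:
            return owner
    return "unknown"


def bytes_by_owner_pair(flows):
    """Sum bytes per (source owner, destination owner) pair.

    flows: iterable of (src_ip, dst_ip, byte_count) tuples.
    """
    totals = defaultdict(int)
    for src, dst, nbytes in flows:
        totals[(owner_of(src), owner_of(dst))] += nbytes
    return dict(totals)
```

Tagging each flow record with an owner in this way is what turns raw source and destination addresses into a per-team view of who sends how much traffic to whom.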

Network observability enhances security

While Booking.com has the usual array of network-security tools, Kentik is providing an additional layer of valuable information. In a DDoS attack, Rasing notes, “you want to look at traffic volumes, but with Kentik we also can look at source IPs, AS numbers and other metrics to see if it’s a distributed attack. This is so easy to do in Kentik; you simply add the source IP address dimension to the analysis.”
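As a rough sketch of what adding the source IP and AS number dimensions reveals, the Python below summarizes how widely inbound traffic is spread across sources. It does not use Kentik's API; the `source_spread` function and its input format are assumptions made for illustration.

```python
from collections import Counter


def source_spread(flows):
    """Summarize how distributed a set of inbound flows is.

    flows: iterable of (src_ip, src_asn, byte_count) tuples (hypothetical format).
    Returns the number of distinct source IPs and AS numbers, plus the share
    of total bytes contributed by the single heaviest source IP.
    """
    bytes_by_ip = Counter()
    asns = set()
    for ip, asn, nbytes in flows:
        bytes_by_ip[ip] += nbytes
        asns.add(asn)
    total = sum(bytes_by_ip.values())
    top_share = max(bytes_by_ip.values()) / total if total else 0.0
    return {
        "sources": len(bytes_by_ip),
        "asns": len(asns),
        "top_source_share": top_share,
    }
```

Many sources across many AS numbers, with no single source dominating, points toward a distributed attack; a high top-source share points toward one heavy sender.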

Another aspect of Kentik’s contribution to security is real-time observability. “With some security tools, it’s already too late when you get a notification from them. But with the DDoS filters in Kentik turned on, we get notified immediately.”

But outside attacks are only one threat to network uptime, which Rasing says is “super-critical to our business.” Kentik quickly narrows the list of possible causes of a network problem by pinpointing where traffic flows may be impeded and identifying the internal or external source of the affected traffic. This greatly shortens the mean time to identify and resolve an issue.

Applying Kentik for capacity planning and cost control

Rasing says he’s had some pleasant surprises during his use of Kentik: times when he has discovered a useful feature he previously did not know about. One example is capacity planning, a feature built into the Kentik platform. “We have thousands of switches in our network that have tens of thousands of hosts connected to them,” he observes. “We never tried capacity management at the host level; who wants to look at 50,000 interfaces to see how each is being utilized? But with Kentik, we now can see at-a-glance how each is being used. It’s easy to set up threshold notifications, and the results are presented in a clear, graphical format.”
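The threshold idea behind host-level capacity management can be sketched as follows. This is not Kentik's capacity-planning feature: the `over_threshold` function, the input format, and the 80 percent default are assumptions chosen for illustration.

```python
def over_threshold(interfaces, threshold=0.8):
    """Return interfaces whose utilization meets or exceeds the threshold.

    interfaces: iterable of (name, bits_per_second, capacity_bps) tuples
    (hypothetical format). The result is sorted from busiest to least busy.
    """
    flagged = []
    for name, bps, capacity in interfaces:
        if capacity <= 0:
            continue  # skip interfaces with no known capacity
        utilization = bps / capacity
        if utilization >= threshold:
            flagged.append((name, round(utilization, 3)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

Applied across tens of thousands of interfaces, a filter like this reduces the review to the short list of ports that actually approach their limits.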

This gives the infrastructure team specific guidance on where to spend dollars on upgrades and expansions, he says – another contribution to cost control.

Kentik also is proving valuable as Booking.com expands its use of cloud providers. Kentik monitors all data flows no matter the location or carrier. In addition, it presents highly specific data that can help determine which cloud services are being used.

Despite the complexity of Booking.com’s global network, Kentik makes it easy to understand usage in even the most granular detail. “Kentik not only collects flow data at a level we never had before, it also stores that data in a very simple way. We are now able to build our own queries. Previously, we didn’t have anywhere near the granularity of data we have with Kentik. Looking back, we spent a lot of time trying to extract insights from what data we did have. It was such a hassle to get an answer that you’d rather not pose the question.”

With Kentik, he says, “all the functions and metrics you want are pre-built and easy to use. I can even give other teams access to Kentik, and they can work on the data themselves to explore issues unique to them.”

Key takeaways

As the complexity of Booking.com’s global network grew, its high standards for availability and quality of service did not change. The platform engineering team adopted the Kentik Network Observability Cloud to understand, in unmatched detail, all aspects of its networks to ensure reliability and achieve optimal performance. Once Booking.com saw how easy Kentik makes it to collect data about network usage, its use of the platform quickly scaled. The team soon used Kentik across several thousand devices and applied it to security, capacity planning and cost control.

Ultimately, the Booking.com team no longer feels it is flying blind. It now has access to the information needed to troubleshoot problems, improve operational efficiency and build for the future.

Category

  • Travel technology company

Challenge

  • Booking.com needed to deliver on the promise of nonstop availability
  • Complex global infrastructure threatened visibility

Solution

Results

  • Faster MTTR, increased security, and global network visibility
  • Improved cost-effectiveness of the core IT infrastructure