Webinar

Replay: Why you should monitor BGP and where to start

Hello, and thank you for joining today's webinar. I'm Jordan Sleep, your host from Kentik Marketing. Our topic for today is why you should monitor BGP and where to start. Our presenter is Kentik's director of product management, Anil Murty. In today's webinar, Anil will review why BGP is so important. He'll introduce Kentik's new proactive BGP monitoring capabilities and run through some common use cases for proactive BGP monitoring. During the webinar, if you have any questions, feel free to enter them in the Zoom chat and we'll answer as many as we can in the open Q&A session at the end of the presentation. So with that, I'll pass it over to you, Anil.

Thank you, Jordan. Hi, everyone, and thank you for taking the time out of your busy days to join us today. In the interest of time, I'll get right into it. For folks that are not familiar with Kentik, I'll do a quick intro of who we are. Kentik has been around for a little over seven years. We count about 1,300-plus enterprise customers among our customer base, and that includes some of the biggest brands in the world across service providers as well as digital businesses. You can see some of those logos at the bottom of the screen. The top reason customers choose us as a platform for monitoring, and give us really high customer satisfaction ratings year after year, is that they say Kentik improves the uptime of their services and networks by more than 25 percent. The way Kentik does this is by ingesting trillions of records of data every day. These records consist of network telemetry from various sources. To talk a little bit about how the Kentik platform works: Kentik's platform is called the Kentik Network Observability Cloud, and it has two parts to it. On the left side are the various sources of telemetry data that the Kentik platform ingests.
These might be sources of telemetry coming from devices located within corporate networks, on the edge of the corporate networks, or within public cloud (VPC flow logs, for example), as well as synthetic tests, SNMP metrics, and so on. In addition to those metrics, customers also have the option of enriching the data with their own business context. This might be geo-specific context, routing, or any threat analytics as part of the ingestion. All this data ends up in the Kentik cloud, and from there, Kentik exposes a set of features within its platform that include things like dashboarding and alerts, but also a free-form query engine that lets you ask any question you like about the data that's in the Kentik cloud. And then last but not least, we have integrations with the most common messaging and alerting platforms that you care about. The way all of this information gets surfaced within the Kentik platform is through a set of products, and they range from our Core product to Edge, Protect, and Service Provider analytics, as well as Cloud and then, of course, Synthetics. The focus for today's webinar is primarily going to be the Synthetics portion of our platform, and specifically we're going to talk about BGP. But just to give you a quick high-level overview of what Synthetics is: Synthetics is all about monitoring things proactively. As opposed to passive NetFlow, sFlow, and SNMP metric ingestion, Synthetics is all about creating traffic and then measuring the response or the performance of the traffic that has been created. In Synthetics, there is the concept of agents. We have a very large and growing network of agents located across the world, and customers have the option of installing their own agents as well.
Synthetics covers testing across the different layers of the network stack, going from the IP layer all the way to the DNS and web layers, and as of a few days ago, we've added BGP monitoring to that mix as well. I won't spend a lot of time talking about Synthetics in general. We have separate webinars that we do on the overall Synthetics topic. For today, I'm going to stay focused primarily on our BGP monitoring capabilities within Synthetics. Before we jump into the details of BGP monitoring, why you should care about it, and how we implement it within our product, we want to take a minute and get a sense for where people are today in terms of monitoring BGP. So, Jordan, would you mind running this poll really quickly?

Awesome. Well, thank you all for taking that poll. Looks like we've got a decent mix of people, maybe a little more skewed towards people that aren't monitoring BGP today and also people that aren't quite sure what BGP monitoring entails. So what I'll do is cover a little bit of both. I'll talk about why we should care about monitoring BGP. I'll talk about what BGP monitoring entails. And then for people that have a BGP monitoring solution today, I'll give you some reasons why you should consider a BGP monitoring solution like ours. With that, I'll jump into the next slide. Let's start with the basics of why someone should care about monitoring BGP. If you're a service provider or a content provider, somebody that manages a CDN, for example, you're no stranger to BGP. It's something that you've cared about for years, very likely, if not decades. And so most of the customers that are service providers and content providers are very familiar with the reasons to monitor this.
But over the last couple of decades, a lot of digital businesses have moved towards delivering things as a service. Literally everything that we consume today, including Kentik's own platform, is delivered as a service, as SaaS, right? As that shift has happened, there's been an increasingly higher dependency on the network to make sure that application and digital performance is good. And what's even more interesting is that in the last few years, the dependency has increased even more on the general internet, because not only are people delivering these SaaS services over a network, but the network in many cases is the broader internet. When that happens, BGP becomes a crucial part of it, because BGP ultimately is responsible for making sure that your packets get routed to the right places on the internet. When issues happen within the BGP layer, in the best case they end up impacting latency, which slows down your applications and causes you to potentially lose revenue. But in the worst case, they can make your services completely unavailable. A good example of that, if you recall, was the Facebook outage of last year, which happened in October. Essentially, what happened there is that Facebook ended up in a situation where a lot of their BGP routes got withdrawn, and that caused a lot of their DNS services to not be available to certain parts of the internet, which ultimately resulted in people not being able to resolve various domains of their applications. Ultimately, most of their applications, including WhatsApp, Instagram, and Facebook, ended up going offline for a decent amount of time. So the bottom line in all this is that monitoring BGP can help your organization not just preserve revenue but also grow it.
And then the more important aspect is that protecting the availability and uptime of critical services is more important than it's ever been, industry-wide. So those are some of the high-level reasons why people care about monitoring BGP. Going one layer below that, and talking a little bit about the technical use cases for why network engineering teams or site reliability teams should care about BGP monitoring, I look at it as six or seven main use cases, and I list them in these bullet points here. The very first one is what I call event tracking, which is BGP events. These might be announcements and withdrawals for the prefixes that belong to your AS or the prefixes that pertain to the devices that your services depend on. So, just being able to see that over time and make sure that it's happening according to plan. The second one is what I call hijack detection. This is the situation where you have one or more ASes and they are announcing a set of prefixes, and a different AS, more than likely inadvertently but sometimes even maliciously, ends up announcing your prefixes. That can result in your traffic getting diverted to a different part of the internet and becoming unavailable to your users. Being able to detect a hijack, alert on it, and then react to it is a pretty common use case. The next one here is route leak detection, which from a technical standpoint is very similar to a hijack, except it happens somewhere along the AS path as opposed to at the origin AS. The next one here is RPKI status checks. RPKI is becoming more and more prevalent as a way of securing what is otherwise a relatively insecure protocol on the internet. So being able to check the RPKI status of your prefixes is a very common use case that I've heard of as well. The next one here is reachability tracking.
The idea here is that you're announcing a certain set of prefixes. You want to make sure everybody on the internet, or at least a significant portion of the internet, is able to receive those announcements. Being able to know when a certain portion of the internet is unable to see those announcements, or when visibility drops below a certain percentage, is a crucial use case as well. This is what we refer to as prefix visibility. And then the last two bullet points here are primarily around the AS path. The AS path is essentially the sequence of ASes that your BGP announcements go through as they travel across the internet. So, number one, being able to see if there is an excessive number of AS path changes happening, which sometimes might be intended, but in other cases might be unintended and might be resulting in a lot of instability. Being able to keep track of that and get alerted when it occurs is a crucial aspect as well. And then last but not least, being able to visualize how AS paths have changed over time. So, those are some of the key use cases. Going one layer further down, let's talk specifically about our BGP monitoring capabilities within Kentik and how they implement a lot of the use cases I talked about on the previous slide. Broadly, we break it down into two main features within our product. There is the first feature that we refer to as the route viewer, and the way to think about this is essentially as a BGP looking glass. It lets you visualize and track all your BGP events, which is BGP announcements as well as withdrawals, over time. The important thing here is that the route viewer is absolutely free to all customers and even to trials that sign up, so we don't charge for this at all. But it also doesn't have any alerting built in, so it's purely a viewer. The second feature is what's called the BGP Monitor test type, and this is part of our Synthetics platform.
The one difference between this test type and the other test types is that all of the other synthetic test types actively generate traffic from a synthetic agent, whereas BGP Monitor is not actually generating traffic. It's just listening to route updates from various parts of the internet and then making sense of them. Within the BGP Monitor test type, we have all of these other features like hijack detection, route leak detection, and so on, and it also includes, of course, a pretty rich AS path visualization, as you can see over here, and I'll show you in the demo as well. And given that BGP Monitor is an actual test, it also includes alerting and the ability to get notified on various notification channels, the most common ones including email, Slack, Teams, and so on. I should mention here that even though the BGP Monitor test type is a paid test type, all Kentik customers that have access to the platform get either 2.5 million or 5 million credits to use per month, and the way this translates to BGP is that it's enough to monitor anywhere between 100 and 200 prefixes. If you think about it from a service provider standpoint, that probably doesn't cover everything, but if you look at it from a digital business standpoint, it more than covers the use cases that people want. So those are the two main features, and I'll go into them in a demo next. So, I'm going to share my screen, so here we go. First, I'll start with the main Kentik platform here. This is the default view that we refer to as the Network Explorer. Most of the data here is coming from passive NetFlow data being ingested. But if I click on the menu here, you can now see that the entire platform is broken down into the different products that I talked about on my second or third slide.
So, there's Core, Edge, Service Provider, Cloud, and Protect, but there's Synthetics as well. And BGP is part of Synthetics. To show you an example of where you start from a BGP standpoint, I'm going to go into the Test Control Center here. There are different types of tests running here. I click on Add Test, and you can see that within Synthetics, we support testing across the different layers: network, web, DNS, and then, of course, BGP. The BGP Monitor test type is the one you want to configure if you want to set up BGP monitoring. It's a pretty straightforward setup where you can enter one or more prefixes that you want to monitor, and if you happen to have a CSV file that contains a set of 100 or 200 prefixes, you can just import that CSV file here. You can choose to include covered prefixes as well. If you have a broader prefix and you want to include all the sub-prefixes under it, just turn on this toggle here and we'll go ahead and set up monitoring not just for that prefix, but for all the sub-prefixes under it as well. The next thing to do here is to specify what's called an allowed AS. An allowed AS is typically your origin AS. It's the AS that you consider authorized to announce this particular prefix or these prefixes. You can optionally also turn on the RPKI check, and what this will do is check the RPKI status, include it in the results, and alert you on RPKI-invalid status records. Give the test a name and then configure notification channels if you like. Options include email, Slack, all of that. So here are all the options that we support, which I think is pretty much everything that most customers use. But if there are others that you care about, we're certainly going to listen to that.
And then there are a few advanced options in terms of when you want the alerts to expire. To show you an example of one of these in action, I'm going to jump into this next window here. Here's a simple test where I'm saying I want to monitor the /21 prefix here. I think this belongs to AS46851, which belongs to Turnitin. And I also want to be checking RPKI. Once I configure this test and hit save and start testing, what you end up with is something like this. What you're looking at here is a visualization that shows you all the BGP events that have occurred in the last two weeks. I've selected a two-week time range in the time range selector here. I can scroll back and forth on this timeline, and every little green bar that you see on the positive axis is a count of the number of BGP announcements we have seen at that particular time. Where my mouse is right now is March 6, 17:20, and there are two announcements for the prefix that we're monitoring, no withdrawals, and no unexpected origins. As I scroll back and forth on this, I can find points in time where there were BGP events with an unexpected origin announcement. So that's a quick check. Just by looking at this, you can see there are points in time where there's red color, and that's indicative of an unexpected origin. Right below here is that same information represented in a table. You can see that, number one, there's the /21 prefix. There are a couple of sub-prefixes under it, the /24 and /23. The /24 and /23 are being announced only by Turnitin's AS. But for this /21, we see that there are some unexpected origin announcements here. When I open this up, I can now see that there were points in time where Indosat, a different AS, AS4761, was announcing what was supposed to be a prefix that belonged to AS46851.
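The unexpected-origin check described in this part of the demo can be sketched in a few lines of Python. This is a minimal illustration, not Kentik's implementation: the function name `classify_origin`, its parameters, and the `10.20.0.0/21` example prefix are all assumptions made for the sketch (the transcript does not give the actual prefix). Only the two AS numbers, AS46851 as the allowed origin and AS4761 as the unexpected one, come from the demo.

```python
import ipaddress


def classify_origin(announced_prefix, origin_asn, monitored_prefix,
                    allowed_asns, include_covered=True):
    """Classify one BGP announcement against a monitored prefix.

    Returns "expected", "unexpected", or "ignored" (announcement is
    outside the scope of the test). Hypothetical helper for illustration.
    """
    announced = ipaddress.ip_network(announced_prefix)
    monitored = ipaddress.ip_network(monitored_prefix)

    if announced == monitored:
        in_scope = True
    else:
        # "Covered prefixes": sub-prefixes under the monitored prefix.
        # The version check avoids a TypeError when mixing IPv4 and IPv6.
        in_scope = (include_covered
                    and announced.version == monitored.version
                    and announced.subnet_of(monitored))

    if not in_scope:
        return "ignored"
    return "expected" if origin_asn in allowed_asns else "unexpected"


# Hypothetical /21 with the allowed origin AS from the demo.
MONITORED = "10.20.0.0/21"
ALLOWED = {46851}

print(classify_origin("10.20.0.0/21", 46851, MONITORED, ALLOWED))
print(classify_origin("10.20.0.0/21", 4761, MONITORED, ALLOWED))
print(classify_origin("10.20.4.0/24", 46851, MONITORED, ALLOWED))
```

In this sketch, turning `include_covered` off mirrors leaving the covered-prefixes toggle disabled: announcements for sub-prefixes are simply not evaluated.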
I can see from here that there was a significant number of those happening over time. And then the other pieces of information here: of course, it's an unexpected origin, so that's called out here. The AS path going from the specific AS to the vantage point, meaning the place where we're monitoring this from, is indicated there. The total number of such announcements over the period of two weeks is indicated there. There's the RPKI status, which is of course invalid because the prefix actually belongs to Turnitin, and then of course the dataset, which in this case is RouteViews. So that's just one example of how we detect unexpected origins, also known as hijacks. Moving to another example here: I talked about the Facebook outage from last year. Here is a similar test to the previous one, monitoring Facebook's /23 prefix. That's Facebook's AS there. You can see I've selected a time range that goes from October 4, 2021 to October 5, so the one day where the outage occurred. What you can see here is that there were steady announcements and withdrawals of the /24 and /23 prefixes over time, and then as we got to the point where the outage occurred, you see all the announcements just going away. After a while, they resume again and then things were fine again. There's a similar kind of information out here. You can see that there are no invalid origins or anything of that sort in this case, and you can see that they belong to Facebook. This is the AS path, and so on. There's another way to look at the same information. This view is good because it tells you that there were withdrawals. What it doesn't give you is information about what percentage of vantage points were seeing each of these.
To look at that information, we can go to a view like this, where what we're doing now is taking Facebook's different prefixes, looking at data sets from all of our vantage points, and asking what percentage of vantage points are able to see routes to these particular prefixes. Each of these lines on this chart represents the percentage of vantage points that are seeing that specific prefix announcement. You can see again in this view, for that same time range, the fourth to the fifth of October last year, that everything was going along pretty nicely until the outage hit. Things dropped down. What's interesting, though, and this is an additional piece of information that surfaces here but not in the other view, is that as time went along, a portion of the prefixes recovered first, but we were still not back up to where we were before the outage, and then eventually things picked up and came back. So this is an additional piece of insight that you get from this kind of view. This is what we refer to as prefix visibility. And then the third one, which is kind of my favorite view here, is what we refer to as the AS path view. There are two things out here, and I'll talk about them both in order. First off is this AS path visualization. What this is doing is looking at all the AS paths that we've seen from our vantage points and plotting them in a visual representation. Here are all the prefixes that I'm monitoring in this particular test, which all belong to Facebook. Here is obviously Facebook's origin AS. We've got a vantage point there as well, so you can see that there's a RouteViews vantage point located there that's looking at information from that perspective. And then these are the AS hops that we've seen. You can see that we're going through, in this case, Telia and then all of these. Okay.
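The prefix visibility metric described above, the percentage of vantage points that currently see a route to each prefix, can be sketched as follows. This is an illustrative sketch under stated assumptions, not Kentik's implementation: the function names, the vantage point identifiers, and the documentation prefixes are all hypothetical.

```python
def prefix_visibility(seen_by, all_vantage_points):
    """Return {prefix: percentage of vantage points with a route to it}.

    seen_by maps each prefix to the set of vantage points that currently
    hold a route to it; all_vantage_points is the full set being polled.
    """
    vantage_points = set(all_vantage_points)
    if not vantage_points:
        raise ValueError("at least one vantage point is required")
    return {
        prefix: 100.0 * len(set(vps) & vantage_points) / len(vantage_points)
        for prefix, vps in seen_by.items()
    }


def below_threshold(visibility, min_percent):
    """List prefixes whose visibility dropped below min_percent."""
    return sorted(p for p, pct in visibility.items() if pct < min_percent)


# Hypothetical snapshot: four vantage points, two monitored prefixes.
vantage_points = {"vp1", "vp2", "vp3", "vp4"}
snapshot = {
    "192.0.2.0/24": {"vp1", "vp2", "vp3", "vp4"},
    "198.51.100.0/24": {"vp1"},
}

visibility = prefix_visibility(snapshot, vantage_points)
print(visibility)
print(below_threshold(visibility, min_percent=50.0))
```

Computing this per time bucket would give the kind of per-prefix line chart shown in the demo, where a partial recovery appears as some prefixes returning to full visibility before others.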
There are places on these boxes here, which represent ASes, where you see this little binocular-like icon. What that signifies is that that particular AS has one or more Kentik vantage points. A vantage point in this context is one or more BGP peering sessions that Kentik has established to specific BGP peers within that AS. In this case, there's a RouteViews collector located within the Kerlick AS out in SmartFabricator, and that's the BGP peer that's giving us the AS paths that are used to plot this in this case. You'll see that as we go towards the left of this, there are more and more vantage points located here. Once we click on those, we can see all the information out here as well. And then up on top here is a chart that shows you the path changes over time. As I talked about before, AS path changes are critical to knowing whether your network is stable or there's instability in the network. One of the things that we'll be doing, which isn't implemented as of today, is the ability to go back on this timeline and see the AS path change automatically. But in relation to that, you can also just look at this chart and see points in time where there were AS path changes. Like here, there's a spike. You can see that there were eleven AS path changes associated with that one prefix and then a few more on the other ones, and so on. So that's just a quick overview of what capabilities you have within Kentik for monitoring BGP. Hopefully that gives folks that have never cared about monitoring BGP proactively a sense of what you can get with the kind of data that we can give you. But for folks that are already monitoring BGP today, here are some good reasons why you might want to consider a new monitoring solution like ours. First on top: most of the BGP monitoring solutions that exist out there, commercial as well as non-commercial, are very dependent on what are called public monitors.
The most common ones here are RouteViews as well as RIS, which comes from the folks over at RIPE. Kentik ingests data from the public monitors, of course. But in addition to that, because Kentik has been in the flow monitoring business for several years now, we have BGP peering sessions established with several customer private peers as well. What that results in is a total number of vantage points that is three to five times the number of vantage points that you find in other competing solutions. And the way that translates in terms of value is that, number one, you get real-time data. With some of these other competitive solutions that you may have seen, the documentation talks about how it takes up to two hours to get your first initial readout, and up to fifteen minutes for new updates to arrive in your BGP data. In the case of Kentik, all of this is pretty much real time. You set up a test, you get the data right away in literally less than a minute, and alerts get fired pretty instantaneously as well, so that you know before the rest of the world knows when a BGP issue occurs. And then our customers give us very high ratings on our general UI. In fact, I've heard this said quite a few times on customer calls, where customers just tell other vendors to go build another Kentik if they want to do better. That's a pretty huge compliment that we get from our customers, and you can expect the same kind of best-in-class UX when you check out our BGP monitoring solution. And then, of course, we are very focused on supporting everything that we build within our platform with an API, but also on building software development kits if you want to do things programmatically. And then, of course, the more network telemetry data you can get into a single pane of glass, the better your full network observability. So, those are just some reasons for you to consider looking at our BGP solution.
And that's all I had for today. If this has got you interested in our BGP monitoring solution, here are some of the things you can do. There's a blog post that summarizes all of the things I talked about. Please do check that out and forward it to folks that weren't able to make it. Then there's a free trial that you can sign up for. And we're also going to be hosting a virtual design clinic, so if you sign up for a free trial or check out these features, this will be a much more hands-on session where we'll answer any questions you might have about this part of the product or just Kentik in general. I see that Jordan's already got the poll going, so hopefully you can take a minute or two to answer that poll as well so we can do better on future webinars.

Perfect. Thanks, Anil. That was a great presentation. A few questions did come in via the chat, so let's open the Q&A now. The first question that we have here is: if I am a New Relic customer and using ktranslate to monitor, can we opt for BGP paid tests?

Yeah. So not today, but it's something that we are going to discuss with New Relic, in terms of not just BGP but a tighter integration where either you get some or all of the data within New Relic, or you have an easy way to jump from New Relic into Kentik and look at this data. So these are definitely things that we have planned on our roadmap for this year.

Awesome. So how do you price this product?

Yeah, so it's really straightforward. Basically, it's priced as part of our Synthetics offering, and Synthetics is priced in terms of synthetic test credits. When it comes to BGP, we charge you about half a test credit for every minute that a BGP test is configured and running. To put that in perspective, if you had about 2.5 million credits, which is what the lower end of the platform gets you per month, you can monitor about 100 prefixes with that.
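The pricing arithmetic in that answer can be checked with a short calculation. The rate (about half a credit per minute) and the 2.5 million and 5 million monthly credit figures come from the webinar; the 30-day month and the assumption that each monitored prefix is billed as one continuously running test are simplifications made for this sketch, and the names below are hypothetical.

```python
CREDITS_PER_MINUTE = 0.5            # "about half a test credit" per minute
MINUTES_PER_MONTH = 60 * 24 * 30    # assumes a 30-day month


def monthly_credits_per_prefix():
    """Credits consumed by one prefix monitored for a full month."""
    return CREDITS_PER_MINUTE * MINUTES_PER_MONTH


def max_prefixes(monthly_budget):
    """Whole number of prefixes a monthly credit budget can cover."""
    return int(monthly_budget // monthly_credits_per_prefix())


print(monthly_credits_per_prefix())   # 21600.0
print(max_prefixes(2_500_000))        # 115
print(max_prefixes(5_000_000))        # 231
```

Under these assumptions the two credit tiers cover roughly 115 and 231 prefixes, which is in line with the "about 100" and "100 to 200" figures quoted in the webinar.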
So the next question is, with regards to hijacking and route poisoning, can you have it alert visually as well, or is it just notifications?

Yeah, so it's both. You'll get your alerts within the platform, so you can go and configure how you want to be alerted in terms of how often that issue should occur, within what period of time, and so on. This will result in this healthy/critical type of UI being populated. But you can also tie these to notifications so that you don't have to be looking at a screen all the time and you can get notified when an issue occurs.

Awesome. And the next one is, how accurately can you detect route leaks?

Very accurately. One of the key differentiators that we have relative to other BGP solutions is that we don't actually filter down to just the prefix that you are selecting. We are constantly looking at all the prefixes we can see, and we're maintaining state on these all the time. Think of it as a really massive router that's looking at pretty much all the prefixes that we can see across the internet and maintaining state all the time. So if a route leak were to happen, the chance of us not seeing it is pretty low. The one thing I will point out here is that today, as part of our configuration, we don't have the option to specify a next-hop AS, but that's something that we're going to be adding, and that'll add an additional level of monitoring on top of this, which will make it even easier to detect route leaks.

Great. The last question is, where do you get your BGP data feeds from?

Great one. As I pointed out earlier on the slide, essentially it comes from two sources. It comes from sources of public BGP data. These are RouteViews collectors, as well as RIS in the future.
And then, in addition to that, we have our own Kentik private peers, which are private BGP peering sessions that we've established with customer devices over the last five or six years. Collectively, the total amount of data that we surface is somewhere between three and five times the amount of data that you get from other solutions.

Great. So it looks like we got one last question. We can answer this one. Can you gather iBGP data as well for your on prem and possibly with B2B sites?

Yes, so the answer is yes, we can. We don't support it in the product today, but it's something that's on our roadmap. If this is something that's important to you, we should certainly talk about the best way for us to implement it.

Perfect. So it looks like we've just about come up on time. We appreciate everyone joining in today, and we've covered a lot of material. This webinar was recorded, so you can review a replay of it on our website. That's kentik dot com, and please share it with your interested colleagues. In the next couple of days, you'll get an email from me so you can review the content, get a link to the recording, and get any additional information you'd like. If you have further questions, please email us at webinars at kentik dot com. Again, that's webinars at kentik dot com. Thank you all for joining, and we hope to see you again soon.

Thanks, everyone. Thank you, Jordan.
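As a follow-up to the AS path change chart shown in the demo, here is one way such a per-prefix change count could be computed from a stream of route updates. This is a sketch under stated assumptions rather than Kentik's method: the tuple layout, the function name `count_path_changes`, the vantage point names, and the documentation AS numbers are all hypothetical.

```python
from collections import defaultdict


def count_path_changes(updates):
    """Count AS path changes per prefix.

    updates is an iterable of (timestamp, vantage_point, prefix, as_path)
    tuples, assumed to be sorted by timestamp. A change is counted when a
    vantage point reports a different AS path for a prefix than the one
    it reported previously; the first path seen is not a change.
    """
    last_path = {}
    changes = defaultdict(int)
    for _timestamp, vantage_point, prefix, as_path in updates:
        key = (vantage_point, prefix)
        path = tuple(as_path)
        if key in last_path and last_path[key] != path:
            changes[prefix] += 1
        last_path[key] = path
    return dict(changes)


# Hypothetical updates using documentation prefixes and AS numbers.
updates = [
    (1, "vp1", "192.0.2.0/24", (64500, 64501, 64496)),
    (2, "vp1", "192.0.2.0/24", (64500, 64502, 64496)),  # path changed
    (3, "vp1", "192.0.2.0/24", (64500, 64502, 64496)),  # same path again
    (4, "vp2", "192.0.2.0/24", (64510, 64496)),         # first path from vp2
]

print(count_path_changes(updates))
```

Bucketing the same counts by time interval would produce a chart like the one in the demo, where a spike indicates many path changes for a prefix in a short window.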

BGP isn’t just for ISPs and hosting providers anymore. As we saw with Facebook’s historic outage, it’s now a necessity for digital enterprises to proactively monitor BGP.

In this webinar, Director of Product Management Anil Murty will introduce Kentik’s new proactive BGP monitoring capabilities.

Join Anil to learn:

Why monitoring BGP is so important
Common use cases for proactive BGP monitoring
How BGP monitoring from Kentik is used for hijack detection, prefix reachability and path-change tracking

Webinar Presenter