Only two days into the new year, and we had our first BGP routing leak. It was followed by a couple more in subsequent days. Although these incidents were brief with marginal operational impact on the internet, they are still worth analyzing because they shed light on the cracks in the internet’s routing system.
In this blog post, I’m going to use some of Kentik’s unique capabilities to take a look at what happened, what the impacts were, and what might prevent these in the future.
Flagged first by our friends over at Qrator, the first leak was perpetrated by AS138805 of Indonesia when it passed several thousand routes learned from one transit provider (TELIN, AS7713) to another transit provider (Lintasarta, AS4800). The leak began at 05:37 UTC on 2 January 2023 and lasted only around five minutes, but it offers some insights worth analyzing.
Perhaps the first question to answer is, why would this leak propagate at all? The leak didn’t introduce more-specifics as is often the case with leaks involving route optimizers or leaked traffic engineering. To answer this, let’s look at the propagation of affected prefixes in Kentik’s BGP visualization.
Let’s start with a prefix from a Nova Scotian provider named for a colorful bovine. One of Purple Cow Internet’s prefixes (22.214.171.124/23) got sucked into this leak. In the graphic below, the upper stacked time series is a measure of route propagation over time. It shows how our BGP sources reach this prefix by each upstream of the origin. Normally only about 20% of our sources see this route at all (17.4% via Hurricane Electric), but during the leak that figure jumps to about 70% (with 51.7% suddenly seeing it via the Toronto Internet Exchange, TorIX, AS11670).
The ball-and-stick diagram at the bottom shows the AS-AS level adjacencies based on aggregated routes at the time of the leak. The diagram is pruned to exclude edges seen by less than 3% of our BGP sources.
The highlighted path shows the leak progression. Purple Cow Internet (AS397545) originates this prefix and was sharing it at TorIX, perhaps via a route server. TELIN (Telekom Indonesia, AS7713) picked it up there and shared it with its transit customers. However, one of those customers (AS138805) accidentally announced it to another of its transit providers, AS4800, who then shared it with the wider internet. There were a number of other Canadian routes leaked from TorIX in this manner.
Below is another impacted prefix from a different part of the world. This Taiwanese prefix is usually seen in the routing tables of 30% of our BGP sources; however, that figure jumped to 70% during the leak. This time the leak went through the Hong Kong Internet Exchange (HKIX, AS4635), as shown in the lower portion of the Kentik BGP visualization below.
You may have picked up on a commonality between these two prefixes: neither was globally routed to begin with. Networks will often intentionally limit the propagation of a route that they prefer to be used only in certain geographies, for example. The problem with these “regional routes” is that when a leak occurs, there is nothing for the leaked route to compete against.
Each of these leaked routes has an AS_PATH that is longer than normal (extended by “4800 138805 7713” at the very least), so each should lose BGP’s shortest-path comparison. But if there is no other path to compare it to, the leaked route gets selected and propagated.
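To make that point concrete, here is a minimal sketch of path selection reduced to two criteria: local preference, then AS_PATH length. Real BGP implementations apply many more tie-breakers, and the `Route` class, the `best_route` helper, the documentation prefix, and the private-use ASNs 64496 and 64511 are all illustrative assumptions rather than anything taken from the incident data.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Route:
    """A hypothetical, heavily simplified BGP route."""
    prefix: str
    as_path: Tuple[int, ...]   # most recent AS first, origin last
    local_pref: int = 100


def best_route(candidates: Sequence[Route]) -> Optional[Route]:
    """Prefer higher local_pref, then the shorter AS_PATH.

    This ignores every other BGP tie-breaker (origin type, MED, and so on).
    """
    if not candidates:
        return None
    return min(candidates, key=lambda r: (-r.local_pref, len(r.as_path)))


# Illustrative routes: 64496 and 64511 are placeholder ASNs.
normal = Route("192.0.2.0/24", (64496, 64511))
leaked = Route("192.0.2.0/24", (64496, 4800, 138805, 7713, 64511))

# A router that already holds the normal route keeps it.
assert best_route([normal, leaked]) is normal

# A router that never saw the regional route has nothing to compare
# against, so the longer leaked path is selected by default.
assert best_route([leaked]) is leaked
```

The second assertion is the regional-route case: the leaked path is not better, it is simply the only candidate.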
If we plot how many Routeviews BGP sources saw each leaked route against how many Routeviews BGP sources typically see the same route, we arrive at the chart below. There is a clear negative correlation between leak propagation and, let’s call it, steady state propagation.
In other words, leaked versions of routes with limited propagation propagate further. If those routes with limited propagation are more-specifics, a lot of traffic destined for the covering routes will get dropped or misdirected.
Maybe the next question is, how much traffic gets dropped versus misdirected in an incident like this? The joy of being a BGP analyst at a NetFlow analysis company is that I can dig into our aggregate NetFlow to explore the answer to this question. Kentik annotates NetFlow records upon intake with the AS_PATH of the source and destination IPs as seen by the router generating the NetFlow. This enables users of Kentik’s Data Explorer to essentially perform BGP analysis using NetFlow.
If we search for NetFlow records destined for an IP marked with an AS_PATH that includes the leak subsequence “4800 138805 7713”, we can see the leak’s impact on actual internet traffic. Below is a screenshot of the results of that query when grouped by destination country.
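The idea behind that query can be sketched in a few lines: keep the flow records whose destination AS_PATH contains the leak subsequence as a contiguous run, then sum the traffic per destination country. The record layout here (`dst_as_path`, `dst_country`, `bits`) and the sample records are invented for illustration; they are not Kentik’s schema or data.

```python
from collections import defaultdict

LEAK_SUBSEQUENCE = (4800, 138805, 7713)


def contains_subsequence(as_path, needle):
    """True if `needle` appears as a contiguous run inside `as_path`."""
    path, needle = tuple(as_path), tuple(needle)
    n = len(needle)
    return any(path[i:i + n] == needle for i in range(len(path) - n + 1))


def misdirected_bits_by_country(flows, needle=LEAK_SUBSEQUENCE):
    """Sum bits per destination country for flows that follow the leak."""
    totals = defaultdict(int)
    for flow in flows:
        if contains_subsequence(flow["dst_as_path"], needle):
            totals[flow["dst_country"]] += flow["bits"]
    return dict(totals)


# Made-up records with hypothetical field names, for illustration only.
sample_flows = [
    {"dst_as_path": (64496, 4800, 138805, 7713, 64511), "dst_country": "HK", "bits": 1200},
    {"dst_as_path": (64496, 64511), "dst_country": "HK", "bits": 900},
    {"dst_as_path": (4800, 138805, 7713, 64500), "dst_country": "ID", "bits": 300},
]
totals = misdirected_bits_by_country(sample_flows)
```

Matching on a contiguous run matters here: a path that merely contains the same three ASNs in another order or with other ASes in between would not be evidence of this particular leak.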
The graph shows a spike in traffic that corresponds to the leak. The top five most affected countries by total misdirected traffic (in bits/sec) based on Kentik’s aggregate NetFlow were Hong Kong, Guam, Indonesia, the United Kingdom, and Brazil. We’re no longer talking about theoretical impact, as is often the case with BGP leak analysis. Here we can show this leak misdirected actual internet traffic.
But here is where it gets interesting. Can we compare the amount of misdirected traffic to the amount of lost traffic? If we look at the traffic from around the world destined for the top 500 leaked prefixes (when ranked by leak route propagation), we can isolate the portion that followed “4800 138805 7713” using the BGP annotations in our NetFlow records.
This approach arrives at the following screenshot. In this 90-minute period, there is a steady flow of traffic to the affected prefixes. At the time of the leak, we can see two impacts: misdirected traffic and dropped traffic.
Again the leak was brief, but the ability to perform this style of traffic analysis of a BGP leak using NetFlow is unique to Kentik.
The subsequent leaks occurred two days later. The first was when B. Online (formerly Gulfnet, AS3225) briefly leaked routes from Zain (AS59605) to Telecom Italia Sparkle (AS6762) — again spotted first by Qrator. Like the previous leak, this was also a Type 1: Hairpin Turn leak, as defined by RFC 7908. It occurred twice on 4 January 2023, first at 9:39 UTC and then again at 11:33 UTC. Each instance lasted just a few minutes.
At 10:16 UTC the same day, Bangladesh Telecom (AS17494) leaked over 1700 routes from its peering sessions to its transit provider Bharti Airtel (AS9498). Unlike the other two, this leak is a Type 4: Leak of Peer Prefixes to Transit Provider from RFC 7908. Regardless, for the next eight minutes, providers around the world began sending Bangladesh Telecom traffic destined for faraway destinations such as Vietnam, South Korea, and Great Britain.
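The distinction between these leak types comes down to where a route was learned and where it was re-announced. The sketch below maps that pair of relationships onto the first four RFC 7908 types; the `Rel` enum, the `classify_leak` helper, and the short labels are my own paraphrases for illustration, and Types 5 and 6 are left out because they are not defined by neighbor relationships alone.

```python
from enum import Enum


class Rel(Enum):
    """Relationship of a neighbor, from the leaking AS's point of view."""
    PROVIDER = "provider"
    PEER = "peer"
    CUSTOMER = "customer"


def classify_leak(learned_from: Rel, announced_to: Rel):
    """Return a paraphrased RFC 7908 leak type, or None if not a leak.

    Only Types 1-4 are covered; the labels are informal summaries.
    """
    table = {
        (Rel.PROVIDER, Rel.PROVIDER): "Type 1 (hairpin turn)",
        (Rel.PEER, Rel.PEER): "Type 2 (lateral peer to peer)",
        (Rel.PROVIDER, Rel.PEER): "Type 3 (provider routes to peer)",
        (Rel.PEER, Rel.PROVIDER): "Type 4 (peer routes to provider)",
    }
    return table.get((learned_from, announced_to))


# AS138805 and AS3225: learned from one transit provider, sent to another.
hairpin = classify_leak(Rel.PROVIDER, Rel.PROVIDER)

# AS17494: learned from peering sessions, sent to a transit provider.
peer_to_provider = classify_leak(Rel.PEER, Rel.PROVIDER)

# Passing customer routes upstream is normal behavior, not a leak.
normal_export = classify_leak(Rel.CUSTOMER, Rel.PROVIDER)
```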
Below is an example affected prefix from UK-based ISP TalkTalk. The hump in the center is the leak from Bangladesh. If you look very closely, you can also see some marginal effects of the two AS3225 leaks at 9:39 UTC and 11:33 UTC.
Below is a graphic showing the internet traffic misdirected by the leaks on 4 January 2023 based on Kentik’s aggregate NetFlow.
Using the same approach we used for the first leak, we can observe that the amount of traffic dropped due to the Bangladesh leak was much greater than the amount misdirected.
There are two impacts from routing leaks: misdirected traffic and dropped traffic. Dropped traffic is a result of congested links and represents the disruption caused by the leak. By contrast, misdirected traffic usually incurs a performance penalty (i.e., higher latency) and introduces security concerns, specifically the possibility of interception or manipulation.
Which impact is greater will depend on a variety of factors, including the ASes involved in the leak — and our view of it is, of course, subject to the biases present in our aggregate NetFlow dataset. Regardless, it is fascinating to begin to be able to measure these impacts on actual internet traffic using aggregate NetFlow data.
I view the challenges facing routing security as a constellation of problems. That constellation necessitates multiple solutions. None of these adjacency leaks altered the origins of the routes or introduced more-specifics; therefore, RPKI ROV (despite significant growth in adoption) would not have had any beneficial effect.
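A simplified sketch of origin validation shows why. ROV looks only at the prefix, its length, and the origin AS at the end of the AS_PATH, so a leaked route with an untouched origin validates exactly like the legitimate one. The `rov_state` helper, the ROA tuple layout, the documentation prefixes, and the placeholder ASNs are assumptions made for this example.

```python
import ipaddress


def rov_state(prefix, origin_asn, roas):
    """Simplified route origin validation.

    `roas` is a list of (roa_prefix, max_length, asn) tuples.
    Returns "valid", "invalid", or "not-found".
    """
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, roa_asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        # subnet_of() raises TypeError across IP versions, so check first.
        if net.version == roa_net.version and net.subnet_of(roa_net):
            covered = True
            if roa_asn == origin_asn and net.prefixlen <= max_length:
                return "valid"
    return "invalid" if covered else "not-found"


# Hypothetical ROA: AS64511 may originate 192.0.2.0/24.
roas = [("192.0.2.0/24", 24, 64511)]

normal_path = (64496, 64511)
leaked_path = (64496, 4800, 138805, 7713, 64511)

# The origin is the last AS in the path; the middle is never inspected.
normal_state = rov_state("192.0.2.0/24", normal_path[-1], roas)
leaked_state = rov_state("192.0.2.0/24", leaked_path[-1], roas)
assert normal_state == leaked_state == "valid"
```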
To prevent leaks like these, network service providers typically filter the routes they receive from their transit customers. Yet leaks like these continue to slip through the cracks, which is why Autonomous System Provider Authorization (ASPA) was proposed. ASPA enables providers to enumerate their transit relationships within the RPKI system (ASPA records are published and validated using the exact same infrastructure as ROAs).
This enables providers participating in the system to evaluate routes with AS_PATHs containing valley-free violations as invalid and reject them. Using the first example above, if it were known a priori that AS138805 was a customer of both AS4800 and AS7713, then a route with “4800 138805 7713” can be evaluated as invalid and rejected, limiting the impact of the leak.
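As a rough illustration of that check, the sketch below walks an AS_PATH from the origin outward and flags any customer-to-provider hop that follows a provider-to-customer hop. This is a deliberately simplified valley-free test, not the verification procedure from the ASPA specification; the `has_valley` helper, the `providers` mapping, and the placeholder ASNs 64496 and 64511 are assumptions for the example.

```python
def has_valley(as_path, providers):
    """Detect an upward hop after a downward hop in an AS_PATH.

    `as_path` is written as BGP shows it: most recent AS first, origin last.
    `providers` maps a customer ASN to the set of its attested provider ASNs.
    """
    hops = list(reversed(as_path))  # walk from the origin outward
    gone_down = False
    for nearer_origin, next_as in zip(hops, hops[1:]):
        if nearer_origin in providers.get(next_as, set()):
            # Route passed from a provider down to its customer.
            gone_down = True
        elif gone_down and next_as in providers.get(nearer_origin, set()):
            # Route passed from a customer back up to a provider: a leak.
            return True
    return False


# Assumed attestation: AS138805 is a customer of both AS4800 and AS7713.
providers = {138805: {4800, 7713}}

leaked = (64496, 4800, 138805, 7713, 64511)
clean = (64496, 7713, 64511)

assert has_valley(leaked, providers)
assert not has_valley(clean, providers)
```

Note that a single attestation from AS138805 is enough to flag this path in the sketch; neither AS4800 nor AS7713 needs to have published anything.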
Just as RPKI ROV aims to limit the disruption due to accidental origination leaks, ASPA helps to address issues in the middle of the AS_PATH due to accidental adjacency leaks. The examples above were brief but show that the internet is still vulnerable to this all-too-common BGP routing mistake.
Thanks to Job Snijders of Fastly for his expert advice on this blog post.