For as long as the internet has existed, the challenge of securing its underlying protocols has persisted. The resulting lack of routing security, for example, has led to numerous BGP incidents such as hijacks and routing leaks that regularly result in misdirected traffic, dropped packets and increased latencies.
Of the proposed technical solutions that have surfaced over the years, Resource Public Key Infrastructure (RPKI) has emerged as the internet’s best defense against BGP hijacks due to typos and other routing mishaps. While it doesn’t defend against every adverse routing event, RPKI reduces the impact of accidental originations, a fairly common occurrence.
The challenge with any distributed security mechanism such as this one is that it only yields benefits when there is broad adoption. Many networks need to independently elect to create Route Origin Authorizations (ROAs) for their prefixes and configure their routers to reject RPKI-invalids.
Networks deciding whether to participate face a “chicken or the egg” dilemma: why bother rejecting RPKI-invalid routes if few are creating ROAs, and why create ROAs if no one is rejecting invalids? However, the data presented in the following analysis suggests that we may have finally reached critical mass with RPKI: the point at which the benefit of participating outweighs the cost of implementation combined with the risk of inaction.
In just the last couple of years, the list of tier-1 network service providers who now reject RPKI-invalids has grown to include NTT, GTT, Arelion (Telia), Cogent, Telstra, PCCW and Lumen. In other words, an RPKI-invalid route has a difficult time propagating across the internet these days.
At the same time, the number of ROAs, which assert the rightful origin and prefix length of a route, has started growing in recent years. As evidence of this, we can take a look at the chart below from NIST’s RPKI Monitor. In this chart, we can see the number of IPv4 BGP routes evaluated as RPKI-valid started increasing at a steady rate beginning at the end of 2018.
Every time a ROA is created for a BGP route, the count of RPKI-valid routes increases (green line) and the count of RPKI-unknown routes decreases (yellow line). There is also a red line hugging the x-axis, which represents the minuscule number of persistently RPKI-invalid routes caused by misconfigurations.
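To make the three states concrete, here is a minimal sketch of route origin validation in the spirit of RFC 6811. It is not the code behind NIST's monitor or any production validator; the `Roa` class, the `validate` function, and the documentation prefixes and AS numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class Roa:
    prefix: str      # authorized prefix, e.g. "192.0.2.0/24"
    max_length: int  # longest prefix length the origin may announce
    asn: int         # authorized origin AS


def validate(route_prefix: str, origin_asn: int, roas: list[Roa]) -> str:
    """Return 'valid', 'invalid', or 'unknown' for a single BGP route."""
    route = ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ip_network(roa.prefix)
        # A ROA only matters if its prefix covers the announced route.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        # Valid needs both the right origin and an allowed prefix length.
        if roa.asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    # Covered by at least one ROA but matched by none: invalid.
    # Covered by no ROA at all: unknown.
    return "invalid" if covered else "unknown"


# Hypothetical ROA using a documentation prefix and a documentation ASN.
roas = [Roa("192.0.2.0/24", 24, 64496)]

right_origin = validate("192.0.2.0/24", 64496, roas)    # "valid"
wrong_origin = validate("192.0.2.0/24", 64511, roas)    # "invalid"
too_specific = validate("192.0.2.0/25", 64496, roas)    # "invalid"
no_roa = validate("203.0.113.0/24", 64496, roas)        # "unknown"
```

Creating a ROA for 203.0.113.0/24 would move the last route from unknown to valid, which is the shift the green and yellow lines track.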
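There are two steps that must take place before an RPKI-invalid route can be rejected:
1. A ROA must be created for the address space in question.
2. Networks must configure their routers to reject RPKI-invalid routes.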
This analysis focuses on measuring how much progress has been made in that first step alone.
As referenced above, there are several public tools for measuring ROA creation, such as NIST’s RPKI Monitor and RIPE’s RIPEstat utility. However, these tools are based strictly on BGP data, and those of us who regularly work with BGP know that not all routes are equivalent in terms of traffic.
NIST’s RPKI Monitor presently reports that 34.89% of IPv4 routes and 34.28% of IPv6 routes are RPKI-valid based on published ROAs. The percentages of RPKI-unknown routes are 63.47% and 62.37% for IPv4 and IPv6 respectively. Regardless of protocol, the ratio of RPKI-unknown routes to RPKI-valid routes is roughly 2:1.
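The ratio can be checked directly from the percentages quoted above. This is only a quick arithmetic check; the variable names are made up for the example.

```python
# Percentages of routes per validation state, as quoted from NIST's RPKI Monitor.
valid_pct = {"IPv4": 34.89, "IPv6": 34.28}
unknown_pct = {"IPv4": 63.47, "IPv6": 62.37}

# Unknown-to-valid ratio for each address family.
ratios = {family: unknown_pct[family] / valid_pct[family] for family in valid_pct}

for family, ratio in ratios.items():
    print(f"{family}: {ratio:.2f} RPKI-unknown routes per RPKI-valid route")
```

Both families come out at about 1.82, a little under two unknown routes for every valid one.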
But then the question becomes: What proportion of overall traffic is safeguarded by that 34%?
A long, long time ago, long before the pandemic, my co-author Job Snijders (Principal Engineer at Fastly) was working to allay an initial hesitation around RPKI adoption. The worry was that rejecting RPKI-invalid routes would lead to the loss of important customer traffic due to persistent RPKI-invalid routes and thus be unacceptable to a business.
In February 2019, Job worked with Paolo Lucente (they were both at NTT at the time) to extend the pmacct network analysis tool to combine NetFlow analysis with RPKI evaluation to remove any guesswork about what traffic might actually be dropped. Job summarized their work in an email to the NANOG list. The email ended with a challenge to Kentik to incorporate this functionality into its product for the benefit of Kentik’s customers as well as the greater internet.
Kentik heeded the challenge and within a few months added RPKI evaluation to its NetFlow analysis platform, enabling a user to explore what specific traffic would be evaluated by RPKI as valid, unknown and invalid. The conclusion of users following this line of inquiry is that dropping persistently RPKI-invalid routes does not lead to loss of important traffic.
In March 2019, Job presented their preliminary findings based on NTT traffic at DKNOG-9 and included the following chart showing that the majority of traffic was destined for RPKI-unknown routes (blue), while traffic to RPKI-valid routes (orange) was a distinct minority. As expected, traffic to RPKI-invalid routes (red) was minuscule.
Job’s provocative claim in that presentation was that “not everyone needs to do RPKI.” His point was that given the consolidation of the internet industry, only a few major players (content providers and eyeball networks) needed to deploy RPKI before we started seeing large benefits. Keep this claim in mind as we review the following results.
How can Kentik help our understanding of RPKI adoption? Kentik has hundreds of customers providing live feeds of NetFlow, and almost half of those have opted in to allow their data to be used as part of aggregate analysis.
It is important to note that any resulting analysis based on this data is subject to the biases of Kentik’s customer set, which includes network service providers, content delivery networks as well as large digital enterprises. It’s also skewed towards the United States, where the majority of our customers are based. These caveats aside, this large NetFlow dataset is invaluable for understanding broader developments on the internet, whether they be military coups or historic social media outages.
As mentioned earlier, in response to Job’s challenge, Kentik incorporated RPKI evaluation into the intake process of NetFlow to allow our users to answer the question about dropped RPKI-invalid traffic. However, the flexibility of Kentik’s analysis platform enables us to answer other questions, such as the one from above…
Just like the pmacct extension that Job and Paolo built, Kentik tracks four outcomes of RPKI evaluation:
Case #4 is when an RPKI-invalid route has a covering prefix that wouldn’t be rejected because it was either valid or unknown. The result is that the destination address space is reachable through the covering prefix. This case only exists in the analysis plane and is not part of any IETF standard on BGP or RPKI.
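The covering-prefix case can be sketched as follows. This is an illustration of the idea only, not Kentik's or pmacct's implementation; the function names, the `invalid_covered` label, and the sample routing table are hypothetical.

```python
from ipaddress import ip_network


def has_usable_covering_route(prefix: str, table: dict[str, str]) -> bool:
    """True if a less-specific route that is valid or unknown covers `prefix`."""
    target = ip_network(prefix)
    for other, state in table.items():
        net = ip_network(other)
        if net.version != target.version:
            continue
        is_less_specific = net.prefixlen < target.prefixlen
        if is_less_specific and target.subnet_of(net) and state in ("valid", "unknown"):
            return True
    return False


def outcome(prefix: str, table: dict[str, str]) -> str:
    """Map a route to one of four outcomes; 'invalid_covered' stands in for case #4."""
    state = table[prefix]
    if state == "invalid" and has_usable_covering_route(prefix, table):
        return "invalid_covered"
    return state


# Hypothetical table: prefix -> validation state already computed elsewhere.
table = {
    "198.51.100.0/24": "unknown",
    "198.51.100.128/25": "invalid",  # covered by the /24 above
    "203.0.113.0/24": "invalid",     # no covering route
}

covered_case = outcome("198.51.100.128/25", table)   # "invalid_covered"
uncovered_case = outcome("203.0.113.0/24", table)    # "invalid"
plain_case = outcome("198.51.100.0/24", table)       # "unknown"
```

Dropping the invalid /25 leaves its addresses reachable through the /24, whereas dropping 203.0.113.0/24 leaves no route at all.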
Over a week-long period of analysis at the beginning of February, we observed the following breakdown of all of our aggregate NetFlow, measured in bits/sec, as shown here:
If this breakdown is truly representative of the broader internet, then the majority of the internet’s traffic goes to BGP routes secured by ROAs. This paints a far rosier picture than the NIST RPKI Monitor stats and suggests we’ve come a long way from the NTT traffic chart in Job’s DKNOG-9 presentation.
So how is it that a majority of traffic can be destined to a minority of RPKI-valid BGP routes? To answer this question, we need to identify the companies responsible for these RPKI-valid routes.
Let’s explore the internet’s largest market, the United States. According to RIPEstat, RPKI-valid routes account for 25.15% of U.S. IPv4 address space and 19.99% of U.S. IPv6 address space. When we look at our aggregate NetFlow destined to the U.S., we see that 58.5%, a clear majority, of bits/sec goes to RPKI-valid routes.
If we decompose that 58.5% further, we can see the companies which account for the most traffic are a collection of major U.S. eyeball networks (Comcast and Spectrum), as well as content providers (Amazon, Google and Cloudflare). All of these companies completed major RPKI deployments in recent years.
These companies might not account for the majority of the BGP routes of the U.S., but they do account for a large portion of U.S. traffic. Not everyone needed to deploy RPKI before we started seeing benefits.
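A toy calculation shows how a minority of routes can carry a majority of traffic. The figures below are invented purely for illustration and are not Kentik measurements; `shares` is a hypothetical helper.

```python
from collections import defaultdict


def shares(rows: list[tuple[str, float]]) -> tuple[dict[str, float], dict[str, float]]:
    """rows holds one (rpki_state, bits_per_sec) pair per route.

    Returns (share of routes, share of traffic) per state, both in percent.
    """
    route_count: dict[str, int] = defaultdict(int)
    traffic: dict[str, float] = defaultdict(float)
    for state, bps in rows:
        route_count[state] += 1
        traffic[state] += bps
    total_routes = len(rows)
    total_bps = sum(traffic.values())
    route_share = {s: 100 * n / total_routes for s, n in route_count.items()}
    traffic_share = {s: 100 * b / total_bps for s, b in traffic.items()}
    return route_share, traffic_share


# Invented example: two heavily used valid routes, four lightly used unknown routes.
rows = [
    ("valid", 600.0),
    ("valid", 300.0),
    ("unknown", 20.0),
    ("unknown", 30.0),
    ("unknown", 25.0),
    ("unknown", 25.0),
]

route_share, traffic_share = shares(rows)
print(f"valid routes: {route_share['valid']:.1f}% of routes, "
      f"{traffic_share['valid']:.1f}% of traffic")
```

In this made-up table the valid routes are a third of the route count but 90% of the bits, the same kind of skew that a handful of large eyeball and content networks produce in real traffic.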
Deploying RPKI is a best current practice, both creating ROAs for your routes and rejecting RPKI-invalid routes. Based on the analysis above, rejecting RPKI-invalid routes likely protects the majority of your outbound traffic from accidental BGP hijacks without posing a risk to legitimate traffic.
In addition, another best current practice is to avoid modifying LOCAL_PREF or BGP communities based on validation states. The risk here is that if a validator were to crash, all RPKI-valid routes would be re-classified as RPKI-unknown and potentially thousands of new BGP routes would be simultaneously announced, causing a dangerous level of BGP churn. This is a scenario Job expounded upon in a recent talk at CERN.
As always, if you’d like to explore how your company’s traffic would fare after deploying RPKI, sign up for a trial and take a look at the results.
This blog post was based on a recent presentation at NANOG 83. Watch the full video.