In this post, Doug Madory reviews the highlights of his wide-ranging internet analysis from the past year, which included covering the state of BGP (both routing leaks and progress in RPKI), submarine cables (both cuts and another historic activation), major outages, and how geopolitics has shaped the internet in 2023.
It’s the end of another eventful year on the internet, and time for another end-of-year recap of the analysis we’ve published since our last annual recap. This year, we’ve organized the pieces into three broad categories: BGP analysis, outage analysis, and finally, submarine cables and geopolitics. We hope you find this annual summary insightful.
Border Gateway Protocol (BGP) analysis
In June, I published my post, A Brief History of the Internet’s Biggest BGP Incidents, which began as a section I wrote for the 2022 Broadband Internet Technical Advisory Group (BITAG) comprehensive report on internet routing security. The piece began with the AS7007 leak of 1997 and covered the most notable and significant BGP incidents in the history of the internet, from traffic-disrupting BGP leaks to crypto-stealing BGP hijacks. It was intended to familiarize the reader with the major events that shape our understanding of the vulnerabilities and weaknesses of BGP.
In May, my friend Job Snijders of Fastly and I published updated RPKI ROV adoption numbers based on our measurement approach from last year. While the degree to which RPKI-invalids are rejected didn’t change noticeably from the previous year, the number of ROAs created continues to climb. The additional ROAs serve to increase the share of internet traffic eligible for the protection that RPKI ROV offers.
In fact, ROAs are being created at such a clip that the number of globally routed IPv4 BGP routes with ROAs will overtake those without at some point in the next year. IPv6 routes are already there. I’ve posted a survey on LinkedIn and X/Twitter to gather predictions from my fellow BGP nerds.
The above analysis was cited three times by speakers at the FCC’s Border Gateway Protocol Security Workshop held on July 31 in Washington, D.C. My friend Tony Tauber, Engineering Fellow at Comcast, cited this analysis to argue that traffic data (i.e., NetFlow) shows that we’re farther along in RPKI ROV adoption than the raw counts of BGP routes might suggest. Another friend of mine, Nimrod Levy of AT&T, cited our observation that a route that is evaluated as RPKI-invalid will have its propagation reduced by 50% or more.
In addition, I dug into a couple of BGP leak incidents this year. At the beginning of the year, I used two route leaks to explore the impacts on traffic using our extensive NetFlow data. A common concern with a routing leak is the misdirection of traffic through a suboptimal path, but the other, often greater, impact is how much traffic is simply dropped due to congested links or high latencies.
In other incidents, I analyzed the role of RPKI ROV in limiting the disruption caused by leaks. For example, in August, the government of Iraq attempted to block Telegram by announcing a BGP hijack within the country. The route leaked out but didn’t propagate very far: Telegram had a ROA for the IP space, and the international carriers serving Iraq rejected the RPKI-invalid route, which kept the hijack from spreading outside the country and limited the disruption of Telegram.
In that post, I concluded that “while it likely didn’t have the potential to be another Pakistan/YouTube incident, BGP non-events like this are what RPKI successes look like.”
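To make the mechanism concrete, here is a minimal Python sketch of route origin validation, loosely following the valid / invalid / not-found logic described in RFC 6811. It is an illustration only, not how any particular carrier or router implements ROV. The `Roa` class, the `validate` function, and the prefixes and AS numbers (taken from documentation ranges) are all assumptions made up for this example.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from typing import Iterable


@dataclass(frozen=True)
class Roa:
    """Hypothetical, simplified ROA: one prefix, one max length, one origin AS."""
    prefix: str      # address block the ROA covers
    max_length: int  # longest prefix length the origin may announce
    asn: int         # AS authorized to originate the prefix


def validate(route_prefix: str, origin_asn: int, roas: Iterable[Roa]) -> str:
    """Classify a BGP route as 'valid', 'invalid', or 'not-found'."""
    route = ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ip_network(roa.prefix)
        # Skip ROAs of the other address family or that do not cover the route.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        # A covering ROA matches only if origin and prefix length both agree.
        if roa.asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    # Covered by at least one ROA but matched by none: RPKI-invalid.
    return "invalid" if covered else "not-found"


# Illustrative data only (documentation prefixes and AS numbers).
roas = [Roa(prefix="192.0.2.0/24", max_length=24, asn=64496)]

legit = validate("192.0.2.0/24", 64496, roas)           # authorized origin
hijack = validate("192.0.2.0/24", 64511, roas)          # wrong origin AS
too_specific = validate("192.0.2.0/25", 64496, roas)    # longer than max_length
unsigned = validate("198.51.100.0/24", 64496, roas)     # no covering ROA

print(legit, hijack, too_specific, unsigned)
```

A network that rejects RPKI-invalid routes would drop the second and third announcements while still accepting the fourth, which is why creating ROAs matters: without a covering ROA, a hijacked route can only ever be evaluated as not-found.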
Outage analysis
As with any year, there were many internet service outages of various types, but the two that I spent time digging into were the Microsoft Azure connectivity outage in January and the Optus outage in Australia at the end of the year.
In the early hours of Wednesday, January 25, Microsoft’s public cloud suffered a major outage that disrupted their cloud-based services and popular applications such as SharePoint, Teams, and Office 365. Similar to the historic Facebook outage of October 2021, Microsoft blamed the outage on a flawed router command, which took down a significant portion of the cloud’s connectivity.
In my post, I analyzed the outage using a variety of unique datasets from Kentik, including our BGP visualizations, performance monitoring, and aggregate NetFlow analysis.
In the end, it was hard to explain why some parts of Azure’s connectivity were impacted while others were not without intimate knowledge of their architecture, but we could surmise that Azure’s loss of connectivity varied greatly depending on the source and destination.
A curious detail of this outage is that it surfaced BGP routes that purportedly showed Microsoft hijacking address space belonging to Saudi Telecom, T-Mobile, and the US Department of Defense, among other entities. After a lengthy email exchange about the hijacks involving numerous engineers from Microsoft, Vocus, and TPG, as well as me, the routes disappeared from the global routing table without further explanation.
The Optus outage in November was arguably much more impactful than the Azure glitch, and I looked into it in a post entitled Digging into the Optus Outage:
In the early hours of Wednesday, November 8, Australians woke up to find one of their major telecoms completely down. Internet and mobile provider Optus had suffered a catastrophic outage beginning at 17:04 UTC on November 7, or 4:04 am the following day in Sydney. It wouldn’t fully recover until hours later, leaving millions of Australians without telephone and internet connectivity.
In the post, I described how the Optus outage was similar to the Rogers outage of July 2022. In both cases, a portion of the networks’ BGP routes were withdrawn, but those withdrawals were merely symptoms of an internal issue, rather than the cause of the outage.
The Rogers outage was caused by the removal of a route filter, which leaked the global routing table into the internal routing, overwhelming their routers and bringing the network down. In the Optus outage, a sibling network (also owned by parent company Singtel) leaked routes into Optus’s network. The leak triggered the MAXPREF safeguard on Optus’s routers. MAXPREF instructs routers to tear down a session when the number of routes advertised across it exceeds a specified maximum number.
Unfortunately, the default retry time for Cisco routers is ‘forever,’ meaning that an engineer would need to manually intervene on each router to re-establish lost sessions. A safer practice is to set a retry time so that the router automatically re-establishes the session after a minute, once the routing leak has passed.
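The difference between the two behaviors can be sketched with a toy model in Python. This is not router code and not a description of Optus’s actual configuration; the `BgpSession` class, its field names, the 1,000-route limit, and the 60-second retry are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BgpSession:
    """Toy model of a BGP session protected by a maximum-prefix limit."""
    max_prefixes: int
    restart_after_s: Optional[float] = None  # None models 'forever': manual reset only
    established: bool = True
    prefixes: int = 0
    down_since: Optional[float] = None

    def receive(self, count: int, now: float) -> None:
        """Accept `count` more routes; tear the session down if the limit is exceeded."""
        if not self.established:
            return
        self.prefixes += count
        if self.prefixes > self.max_prefixes:
            self.established = False
            self.prefixes = 0
            self.down_since = now

    def tick(self, now: float) -> None:
        """Re-establish the session once the retry timer (if any) has expired."""
        if self.established or self.restart_after_s is None or self.down_since is None:
            return
        if now - self.down_since >= self.restart_after_s:
            self.established = True
            self.down_since = None


# Two sessions hit by the same hypothetical leak of 5,000 routes at t=0.
manual = BgpSession(max_prefixes=1000)                      # no retry timer
auto = BgpSession(max_prefixes=1000, restart_after_s=60.0)  # retry after 60 s

for session in (manual, auto):
    session.receive(5000, now=0.0)
    session.tick(now=120.0)  # two minutes later, the leak has passed

print("manual re-established:", manual.established)
print("auto re-established:", auto.established)
```

In this model, the session without a retry timer stays down until someone resets it by hand, while the other comes back on its own. If the leak were still in progress when the timer expired, the limit would simply trip again on the next batch of routes.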
Submarine cables and geopolitics
In January, we celebrated the 10th anniversary of the activation of the ALBA-1 submarine cable to Cuba. However, just a month earlier, the US Department of Justice recommended that the FCC deny the request by the ARCOS submarine cable to build a spur connecting Cuba to the cable system.
As I wrote in my blog post in January, the rejection of an ARCOS landing in Cuba showed that, almost a decade after the activation of the ALBA-1 cable, geopolitics continues to shape the physical internet — especially when it comes to Cuba.
A large part of the rationale for denying ARCOS a landing in Cuba was the possibility that ETECSA, the state telecom of Cuba, could employ BGP hijacks to intercept US traffic. Of course, nothing stops ETECSA from trying that now and the addition of a submarine cable doesn’t alter this risk, in my opinion.
In August, an undersea landslide in one of the world’s longest submarine canyons knocked out two of the most important submarine cables serving the African internet. The loss of these cables knocked out international internet bandwidth along the west coast of Africa.
In my post on the cable cuts, I reviewed some history of the impact of undersea landslides on submarine cables and concluded, “Make no mistake; the seafloor can be a dangerous place for cables.” I used some of Kentik’s unique data sets to explore the impacts of these cable breaks, which I later remotely presented at Angola NOG (AONOG) in November.
It is rare that we get to celebrate a new submarine cable activation like the one that occurred on October 1 in Saint Helena. In my blog post, I published the first evidence of the activation of the submarine cable connection to the remote British overseas territory and the final place of exile for Napoleon Bonaparte.
In the post, I went on to tell the epic story (more historically accurate than the recent Napoleon movie) of the advocacy work by my friend, German telecommunications expert Christian von der Ropp, to make this cable activation a reality:
Realizing that landing a submarine cable branch in Saint Helena wasn’t going to happen without dedicated advocacy, Christian founded the non-profit Connect St Helena in early 2012 and began the lobbying effort.
After being rebuffed by the government of the UK, Saint Helena ultimately received funding from the EU to build a spur to Google’s upcoming Equiano submarine cable.
While the UK had voted to leave the EU in 2016, the implementation of “Brexit” wasn’t finalized until February 2020. In 2018, the Saint Helena Government was still eligible to receive EU funding. It would be one of the last EU benefits given to the UK.
Finally, the conflict in Ukraine continues to rage on, impacting internet connectivity in war-ravaged parts of the country. I wrote two pieces of analysis early in the year focused on the situation in Ukraine.
The first focused on the Russification of Ukrainian IP addresses in RIPE registrations. Ukrainian residents of Russian-held regions have been forced to adopt all things Russian: language, currency, telephone numbers, and, of course, internet service.
Using a novel utility made available by RIPE NCC, I identified dozens of changes to RIPE registrations, revealing another target of this Russification effort: the geolocation of Ukrainian IP addresses.
An example of the output of the historical RIPE whois query is below:
$ whois --diff-versions 9:10 188.8.131.52 - 184.108.40.206
% Difference between version 9 and 10 of object "220.127.116.11 - 18.104.22.168"
@@ -2,3 +2,3 @@
@@ -10,3 +10,3 @@
The second post was a collaboration with the Wall Street Journal and used a novel data source that allowed us to explore connectivity inside Ukraine: the ARK dataset from the Center for Applied Internet Data Analysis (CAIDA).
Expanding on our collaborative work with the New York Times in the previous year, we explored how the war changed the path of traffic traversing the domestic Ukrainian internet. In the graphic below, we employed traceroutes from the ARK measurement server in Kyiv to various ISPs in the Kherson region over the course of a year.
There is a clear point when the latencies increase due to the Russian rerouting at the beginning of June 2022. The graphic also illustrates the result of the Ukrainian liberation effort in Kherson. Ukrainians recaptured half of the region, and we see a portion of the traceroutes reverting to a lower latency as those networks restore their Ukrainian transit connections. A few providers in the region of Kherson are still on Russian transit, presumably in the territory that is still under Russian control.
Lastly, I joined a team of researchers led by the Censored Planet group at the University of Michigan to author a paper presented this summer at the 32nd USENIX Security Symposium entitled Network Responses to Russia’s Invasion of Ukraine in 2022: A Cautionary Tale for Internet Freedom. My portion covered the Russian BGP hijack of Twitter, and a video of the presentation is available here.
Kentik provides network observability to hundreds of the most important service providers and enterprises in the world. As a result, it provides me, an internet measurement analyst, with unparalleled data and tools to conduct analysis of important internet developments.
As we look ahead to the new year, there is no shortage of challenges and opportunities for internet connectivity around the world. We intend to continue producing timely, informative, and impactful analysis that helps inform the public and industry about internet connectivity issues.