Between volcanoes erupting, misconfigurations, and nations purposely shutting down the internet to stop a protest, 2022 was a busy year for network outages. In this episode, Doug Madory, Kentik's Director of Internet Analysis, joins us to talk about some of the highlights of 2022 and also discuss some of the more common reasons we see large-scale network outages in the first place.
Doug Madory is the director of internet analysis for Kentik where he works on internet infrastructure analysis. The Washington Post dubbed him “The Man who can see the Internet” for his reputation in identifying significant developments in the global layout of the internet. Doug is regularly quoted by major news outlets about developments ranging from national blackouts to BGP hijacks to the activation of submarine cables. Prior to Kentik, he was the lead analyst for Oracle’s internet intelligence team (formerly Dyn Research and Renesys).
Phil: I recently read Doug Madory's blog post, A Year In Internet Analysis 2022, which was a great overview of all the major events on the internet, at a global scale I guess, for last year. It naturally reminded me of major internet events in previous years too, just because of the content he wrote about. And though I found that very interesting all on its own, I did start to notice a theme among all of these events. The ones in Doug's blog post of course, but also as I think back over years past. The biggest events, the biggest global scale internet disruptions, probably most memorable for me at least, that I can think of, seem to be caused by only a few things, namely natural disasters, human error, and lately intentional outages caused by national governments for whatever reason. I really can't remember any major global scale outages that were caused by an SFP going bad, or a router CPU being pegged, or something like that. Maybe a core switch hardware just failing. Now, from experience as a network engineer, I know those things do happen, but when I think about those huge, global scale disruptions, it seems like hardware going down, or otherwise good solid code just breaking for no reason, it really doesn't happen that often, at least not often enough and at the scale that affects huge parts of the world, if not the whole world. So, today with me, I have Doug Madory, the director of internet analysis at Kentik, to talk about his recent blog post, A Year In Internet Analysis 2022. Let's get started. So Doug, it's great to have you today. It's been a while since you and I have done something together collaboratively, so it's good to see you. Well, our audience can't really see you, this is mostly an audio only podcast, but it is good to see you. I read your blog post recently, A Year In Internet Analysis 2022. 
And I want to ask you as we start to get into this, you heard my intro, do you agree with me that there is this kind of theme among internet outages, not just in 2022, but in previous years, that they all seem to be a result of those three main pillars? Natural disasters, undersea cables being disrupted by a hurricane or a volcano, or something like in your blog post. Human error, configuration problems, people configuring BGP incorrectly, and you mentioned that in your blog post. And then what I really was interested in also toward the end of your blog post, the idea that there are countries that are now intentionally disrupting internet service for the folks in their countries, being a cause for a major outage. Do you see that as a main theme or am I reading too much into it? What are your thoughts?
Doug Madory: That sounds like three good categories, that covers a lot. I can't think of a counter example that doesn't fit into one of those, but yeah, I'd agree with that breakdown.
Phil: Yeah, I mean, you started off by talking about the eruption near Tonga and in your blog post you have that graphic image. It's so interesting to watch. But that was a volcano exploding as volcanoes do, from time to time, taking out an undersea cable. And therein lies major disruptions for the island nation of Tonga and I assume that region, I don't know, I have to look into it more. But why is it that it doesn't seem like devices themselves, like hardware, bad code, really cause a lot of the disruptions? Do you think that's because the internet is that resilient, that we're so good at creating hardware, the major vendors, or is it that we're just not hearing about those because they're not as sexy?
Doug Madory: Well, hang on. So, I wouldn't go so far as to say bad code doesn't lead to outages. In the Tonga example, there was no... bad code was not the issue on that one. In my blog post I went into, I was ... Years ago in my time at Renesys, and then Dyn Research, I had a real interest in trying to identify submarine cable activations because they were interesting. And then the Tonga one I had spotted back in like 2013, I think it was. A number of years ago, this was a case where for Pacific Island nations, it's very hard to get them to be connected to the global internet. They're relying on satellite, satellite's very expensive per megabit, especially given... It's especially bad in the South Pacific because the way that the business of satellite service works is that you get to divide the cost by the customer base. But the Pacific Ocean is a very large piece of real estate that has very few customers, so your denominator is pretty small. So, then I remember attending a submarine cable conference or speaking at one in 2013 and someone was talking about this, saying at the time, the wholesale bandwidth costs in North America are... I don't know if you even know these figures, if we use these figures anymore, but it was like a dollar a meg a month. There's always some sort of figure of what's the wholesale cost of bandwidth. It's probably like a penny or something now these days, I don't know. But in the South Pacific, it was on the order of a thousand. So, it was like a thousand times the cost and you were limited on capacity, high latency, there are all these other problems and you're paying a thousand times more. So, it was an act of humanity to... It was a humanitarian gift to Tonga that the Asian Development Bank and UN put together the money to put this cable in to try to modernize the society of this country. And then I was following that story and then part of my interest was there'd be a press release about a submarine cable activation. 
And then, I was just curious to see... I would see it, we would see it in our internet connectivity data. When did this thing actually start carrying traffic? Because those are two different dates. The cable may be ready to go, maybe they... there's no lie in the press release. It really did happen but then there's a moment where it's actually carrying traffic. So, we spotted that. So, I had a little bit of history on Tonga in particular, because I remember when I saw this, I was like, " I remember Tonga, we talked about this years ago." So then they were at that point... had turned down all their satellite services. They're completely relying on the submarine cable and they had to restore, I think whatever residual satellite antennas were knocked out by the aftermath of that blast from the undersea volcano.
Phil: But it wasn't just that undersea cable that caused major issues last year. I mean, you talked about the issue in Egypt, though I don't think that was caused by a natural disaster like a volcano or something. I don't know, if there's any volcano.
Doug Madory: No, I don't know that we got an explanation on that one. So, I wrote up a blog post on that one as well. And I know Egypt as a choke point in the global internet is a perennial theme, certainly in the submarine cable space, of trying to come up with an alternative path. The Egyptian government makes a lot of money off of that choke point, as they do the Suez Canal, they do it with the internet as well. You pay to cross your cables through that space. But occasionally, there are terrestrial outages. They try to build a lot of redundant overland links to connect the cables that go from the Mediterranean to the Red Sea. But every once in a while, there's an outage. I remember one, a number of years ago, that I mentioned in the blog post. We saw one, it looked a little like this one, there was an hours-long outage. And at the time I had a great contact in Telecom Egypt who managed the submarine or the fiber optics, the overland circuits. I was like, "What happened here? I saw an outage." And he's like, "Oh my God." He's like, "You're not going to believe it. We had people light fire to one of the COs trying to get copper out of the lines," not knowing these are all-
Phil: Oh, whoa.
Doug Madory: ...fiber optics and there wasn't much copper to be had, so they just burned this down. And so, at least if it's on land, fixing this stuff usually is a matter of hours. If it's under the sea, it could be days or weeks, depending on where it is.
Phil: Well, I started off by mentioning natural disasters and we talked about the volcano eruption near Tonga, but we're focused more on undersea cables now. And you did make the comment that there are very few undersea cable connections, specifically in the Pacific. Is that pervasive throughout the world or is that a very robust method for moving data between continents and among continents? Because I'm wondering, some of those cables also have to be very old, correct?
Doug Madory: Yeah. So let's see, there's a couple of things there. One is if you look at the submarinecablemap.com, I know you've seen this-
Phil: The best map of all time.
Doug Madory: ...yeah, it's a pretty cool thing put out by TeleGeography, it's a good reference. I've got one of these printed on my dining room wall, but... shows what kind of nerd I am. But these things follow the same maritime trade routes that people have been following with ships forever. So, the highly trafficked paths crossing the Atlantic, across the Mediterranean, South Asia, around the Far East, those lines are... there's lots of cables, there's lots of redundancy. The only risk is that there's ships also going those same paths. They may set an anchor down or drag an anchor, hit a cable. If you were to look at marinetraffic.com, you compare the two, they're going to look very similar, just where are the cables and where are the ships? They're going in the same places. And they're also going to show that there's no ships or there's very few ships in the South Pacific and there's very few cables in the South Pacific because it's very... a cable is an expensive endeavor. It is millions of dollars, at least a hundred million, onwards to a billion depending on the length-
Phil: Really? Wow.
Doug Madory: ...and the complexity. The ROI, this is another thing that gets talked about at submarine cable conferences usually, is there's a lot of investors in business, people trying to understand the business case around this and how to mitigate the risks and maximize the ROI because someone has to raise a lot of money and how much money you're going to make off this, it's not a ton, but you also have this risk if it breaks, you're getting no money and you have to pay to fix it. This is why all the Google and Amazon and Facebook have gotten into the submarine cable business because their business is just different. They don't have to make money off the cable. If it serves their greater business, it's good. And so now, they're driving that industry. But anyway, so I'll let you talk, but in the South Pacific, one other interesting thing that happens, it's not that often, but you mentioned some of these cables are old. So, as the cable gets old, someone had come up with an idea a while back of... and it's amazing this has actually happened. So, the cables, they'll pull up a cable off the seabed. I can only imagine all the lifeforms, barnacles and things that have attached to this thing. As they're pulling this up onto the ship and then relaying it somewhere else, and in the total cost, I mentioned these figures of either hundreds of millions of dollars, half of that or more, or no, it's more than half, it's the vast majority of that, is the fabrication of the cable. And then there's the installation, is the way it's termed in the industry, of actually putting it in the water is the installation. The fabrication of the cable is the most expensive part. And if you can pull one off the ocean, that's some cost there. But so there's been a couple cables that have been relayed as quote unquote donor cables is the term. And so, I forget the one... So, a cable may be no longer... this was in the area of Australia, South Pacific. It no longer served its purposes. 
It didn't have the capacity to handle a major route but it'd be plenty to hook up a smaller island nation-
Phil: I see, right.
Doug Madory: ... as far as capacity goes and so you could reuse the cable. And so, this has happened a couple times, which is again a pretty mind-blowing thing that this takes place at all.
Phil: Yeah, it is. That's pretty neat. Do you think that the movement toward more ubiquitous satellite connections and connectivity on a global scale is going to solve some of the inherent danger that undersea cables being dragged up by ship anchors and natural disasters and all that kind of stuff?
Doug Madory: Yeah, I guess it could. It could to some extent. I mean, satellites are never going to have the capacity of fiber optic cable, so that we know is never going to be the case. The other issue I mentioned earlier with Tonga or other countries that are relying on both satellite service, latency's an issue, depending if-
Phil: Exactly, right.
Doug Madory: ...it's geostationary satellite, then just due to the laws of physics, it takes a certain amount of time for light to travel to outer space and back. And it can't be shorter than a 480 millisecond round trip. It's probably going to be significantly more. So, then there was O3b came out with the first MEO, medium Earth orbit. And so, these are closer satellites, there's more of them. And then it gets more complicated on the ground because you have to now track satellites as they're crossing through the sky in the handoffs-
Phil: Because they're not geostationary, correct?
Doug Madory: Yeah, this is medium Earth orbit. So, geostationary, you can have one dish pointing in one place and then leave it. And so, it's real simple.
Phil: How far away is geostationary?
Doug Madory: I'd have to look it up. I don't have those numbers.
Phil: I'm googling it as you talk.
Doug Madory: Yeah, I'm sure there's some great diagrams. I don't have those figures memorized.
Phil: 22,236 miles above Earth's equator. So yeah, the laws of physics govern how long it takes light to travel round trip out from your location to that satellite.
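As a rough check on the physics described here, the sketch below computes the minimum round trip through a geostationary satellite from the 22,236-mile altitude Phil quotes. It assumes the satellite is directly overhead and that the signal travels at the vacuum speed of light, so real-world figures will be higher. The function name and constants are illustrative, not taken from the conversation.

```python
# Rough lower bound on round-trip latency through a geostationary satellite.
# Assumptions: satellite directly overhead, signal at the vacuum speed of
# light, no processing or queuing delay anywhere on the path.

SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum
KM_PER_MILE = 1.609344
GEO_ALTITUDE_MILES = 22_236        # altitude quoted in the conversation


def min_round_trip_ms(altitude_km: float) -> float:
    """Minimum request/response time via a satellite at the given altitude.

    A round trip crosses the ground-to-satellite gap four times:
    up and down for the request, then up and down for the reply.
    """
    one_crossing_s = altitude_km / SPEED_OF_LIGHT_KM_S
    return 4 * one_crossing_s * 1000


geo_altitude_km = GEO_ALTITUDE_MILES * KM_PER_MILE
geo_rtt_ms = min_round_trip_ms(geo_altitude_km)

print(f"GEO altitude: {geo_altitude_km:,.0f} km")
print(f"Minimum round trip: {geo_rtt_ms:.0f} ms")
```

The result comes out at roughly 477 ms, which lines up with the 480 millisecond floor Doug mentions. Passing a lower altitude to the same function shows why medium and low Earth orbit constellations can offer much lower latency.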
Doug Madory: So, you had O3b create medium Earth orbit. And so that brought it closer. There's more complexity on the ground equipment. The latency is a lot lower. And in some places where this was getting fielded, places where they were never going to get terrestrial, or hard to reach places, they were getting latencies similar to what you would have with terrestrial. But the latest is these low Earth orbit, Starlink, SpaceX, and then OneWeb, and Amazon has a project they're pretty early on in, Project Kuiper, and there's a bunch of Chinese, they call them mega constellations, and these require thousands of satellites. But yeah, could that help in a Tonga situation, an undersea volcano takes out Tonga? Yeah, I guess in the case of Tonga, Starlink was one of the first... I think there was other Pacific satellite operators that got in there first, but they were... Starlink was providing some of the capacity. They did need to set up a ground station in Fiji because it's only recently that they've had intersatellite links. So, this is their piece with low Earth orbit: if your satellite's very low, then the footprint's very small and you have to just ricochet up to the satellite and back to the ground, and now you need to have a ground station pretty close by. And so, this intersatellite link, being able to go up to a satellite and then from one satellite to another satellite, is super complex and really hard to do. And they're starting to do it now. So, that has to be really solved in order to do things like use low Earth orbit to cross the Atlantic. Right now you can't do that because you can't come back down. You have to go over intersatellite links and that's just starting to happen now. So, it could, you're just never going to have the capacity.
Phil: Not until they invent subspace communication in Star Trek, I assume.
Doug Madory: Yeah, I mean, there's belief that... I guess there's some science behind the intersatellite links because they're going through a vacuum in space, can actually carry a higher capacity than the link going from the ground to the satellite.
Phil: Okay.
Doug Madory: And so, there's some gain to be had in the fidelity of the links and the interspace links.
Phil: Yeah, and I have to assume that as that technology progresses and improves, that it will offer a... I mean, I know that it's never going to be the same as a hard fiber connection here on the ground, but as it improves in resiliency and in data transfer, speed, bandwidth, it would be more immune to natural disasters than undersea cables and things that are affected by hurricanes and earthquakes and things like that. I mean, I can remember when first learning about all of this choke point, you mentioned the choke point in the Middle East. There's a choke point I believe in the tri-state area of New York. I don't know if it's on the New Jersey side or in Manhattan. I just have to imagine if there was one problem in that building where all of these connections go through, that that's it for the northeast US, which is a few people. So, now I do want to move on to... and I could talk about undersea cables for the next three hours. So, we definitely have to do that again.
Doug Madory: Me too.
Phil: Do you want to get into... you mentioned you looked at the Rogers outage from last year, that was a big deal. I remember reading a ton of blog posts on that and all of your analysis as well. Really interesting stuff, a terrible thing to occur. But ultimately, I don't know if we know with full certainty what the true cause of that was, the root cause, but everything points to human error, correct?
Doug Madory: Yeah, so that was, I think arguably, Canada's largest internet outage ever in history. And it was long, I forget the duration, but this is many hours, maybe 24 hours before it was completely getting restored and there was some root cause published, these things are always... To people who are pretty techy, maybe who are going to read these, it's always insufficient. There's never enough detail. I would love to know some more and they're never going to... you were just never going to get there. But in this case, if you read between the lines, it seemed like what was happening was they had basically leaked the global routing table, which is over 800,000 routes or something, into their IGP, whether it's internal BGP or something, their internal routing, which is not going to be anything on that scale. And if you're using a protocol like OSPF, or something that is very talkative, to try to maintain total knowledge of links, these things do not go well together. They're very different styles of routing. So, there was basically, they leaked the table and it was just too many announcements and these routers were just melting down. And why it took so long is I think it seemed like there was some commonalities to the historic Facebook outage the previous fall, where there's unforeseen dependencies. The engineers were also using Rogers mobile service and the company's using its own communication services to coordinate its work. And when that goes down, they don't have a way to coordinate. And so, that extends the outage because it's very hard to... 
if you can't talk, if your normal tools for talking are no longer available. But it does reveal some stuff about the Canadian internet. This is an issue where there's different pockets of essentially monopolies of Rogers and of Bell, and this is... folks in the internet industry in Canada wrestle with this and it'd probably be... and they're actually moving towards greater consolidation still. If you have your major provider that's got a near monopoly in a region go down, there may not be a lot of great alternatives to use. So, that may have contributed to the duration of the outage. But yeah, it was a routing thing. It sounds like a filter was removed. I guess we're not going to know much more than that. But I would say that there were a lot of folks, when this outage took place, who were all looking at BGP as a thing that is easy to go to, for me and people like me who do internet measurement. And there was lots of routing instability going on at this time. A lot of routes got pulled, but we could see, because we have this... I have the benefit of having our aggregate NetFlow to look at, of what do we see on our customer base? What can we see as far as communications with Rogers? We could see routes that stayed up and the traffic stopped going. And so, that means that the route wasn't the problem. The routes were still up and available and at that time stable, but they weren't carrying any traffic. And so, you have multiple layers of this onion of their network. You've got some routers that are announcing their address space to the rest of the internet and you've got internal routers handling how the traffic moves within the network. Those were down, but they were still advertising their space. So, there were some initial claims that they left the global routing table. 
I mean, some routes did, but that's one thing I tried to pick apart in the blog post was to say, " All right, we can see traffic stopping to routes that are still up," so that it's not really a BGP thing in that case, it's an internal... I mean, not an exterior BGP. Maybe it's an internal BGP, depending on what ... protocol they're using internally. Anyway, so that was-
Phil: Yeah, it's interesting that you have this really deep visibility into what's going on in the public internet and then you can use that and parse it in such a way where you can actually infer what's going on in somebody's private network-
Doug Madory: To some extent.
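The reasoning Doug describes, comparing whether a route is still announced against whether traffic is still flowing to it, can be sketched roughly as follows. This is a minimal illustration, not Kentik's method: the `PrefixObservation` fields, the `classify` function, the 10% threshold, and the sample prefixes (documentation address ranges) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PrefixObservation:
    prefix: str
    routed: bool          # still visible in the global routing table?
    baseline_bps: float   # typical traffic before the incident
    current_bps: float    # traffic observed during the incident


def classify(obs: PrefixObservation, drop_ratio: float = 0.1) -> str:
    """Label a prefix by combining routing state with traffic volume.

    drop_ratio is an assumed threshold: traffic below this fraction of
    baseline is treated as effectively stopped.
    """
    if not obs.routed:
        return "withdrawn"          # the route itself disappeared
    if obs.baseline_bps > 0 and obs.current_bps < drop_ratio * obs.baseline_bps:
        return "routed-but-dark"    # route is up, traffic has stopped
    return "healthy"


# Hypothetical sample data using documentation prefixes.
observations = [
    PrefixObservation("192.0.2.0/24", routed=False, baseline_bps=5e6, current_bps=0),
    PrefixObservation("198.51.100.0/24", routed=True, baseline_bps=8e6, current_bps=1e4),
    PrefixObservation("203.0.113.0/24", routed=True, baseline_bps=2e6, current_bps=1.9e6),
]

results = {obs.prefix: classify(obs) for obs in observations}
for prefix, label in results.items():
    print(f"{prefix}: {label}")
```

A prefix in the middle category is the interesting case from the conversation: the announcement is still there, so external BGP is not the problem, and the fault is more likely somewhere inside the network.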
Phil: I remember just a few years ago, prior to Rogers and then prior to the Facebook outage, or was it prior? Didn't Facebook do something in Southeast Asia where they accidentally were a transit network for the public internet for a time or something like that?
Doug Madory: Let's see.
Phil: It was a major disruption as a result. I don't know if it was Facebook or somebody else, but I do ...-
Doug Madory: There's been a couple things like this. There was a Google incident where Google leaked... had a BGP leak that took down a lot of connectivity in Japan for a while. I remember I wrote something up on that at the time, they got called before the-
Phil: There's a ripple effect. There's a cascade effect when you get into routing and talking about... especially if you're talking about full feeds and things like that. And then redistribution of routes and how you filter, like you mentioned. I mean, it's a cascade effect, all the way down to endpoints sitting on your network. You really need to be very aware and careful as a human being, engineering, actively engineering and touching wiggling wires on your network. I remember the AWS outage as well, remember that? It was the S3 outage and it came out that it was somebody who misconfigured something, I don't remember what. I wrote a blog post about that. And my blog post was called Amazon S3 or something like that. We've all been there, because I've been there now. I don't work at the global... I didn't work as a network engineer at the global scale except for some consultant work I did for GE. But other than that, it was large enterprise. But yeah, if you configure something incorrectly, just one little error in an access list, or in a route map, or whatever it happens to be, or the one I like to joke about is maybe it's a major trunk port going in your backbone for layer two and you forget the add command if you're using Cisco devices and then boom, everything's down. It's very high impact and it's just you as a human being that can cause all of that disruption. I wonder if there is a way, I mean now that we talk about network automation and programmability, and this desire to eliminate that which is error prone in manual configuration, of course make things more efficient and cleaner and all that, but also eliminating that error prone component of manual configuration. I wonder if that is realistic or not. 
I mean ultimately, the code that we write in Python, and I think some people still use Ansible playbooks and things like that, is still written by human beings and requires an understanding of how BGP works and how to write an inventory list, or how to write a route map, and still requires a human being to know that and to figure that out and to parse that in such a way in code, where it interacts with everything else going on in the network. So, I don't know if we can ever get away from the human error component of these types of outages. Maybe we can.
Doug Madory: It's a lesson in humility.
Phil: It's a lesson in humility, for sure.
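The trunk-port slip Phil jokes about is the kind of mistake a simple pre-change check might catch. On Cisco IOS, `switchport trunk allowed vlan 30` replaces the allowed VLAN list, while `switchport trunk allowed vlan add 30` appends to it. The sketch below is a hypothetical lint function, not a real tool: `risky_trunk_lines` and the sample change are illustrative, and it only flags lines for a human to review.

```python
import re

# Matches the IOS command and captures whatever follows "vlan".
_TRUNK_VLAN = re.compile(
    r"^switchport trunk allowed vlan\s+(?P<rest>\S.*)$", re.IGNORECASE
)


def risky_trunk_lines(config_lines):
    """Return (line_number, text) for trunk VLAN commands that do not
    start with 'add' or 'remove', since those overwrite the allowed list."""
    risky = []
    for number, line in enumerate(config_lines, start=1):
        match = _TRUNK_VLAN.match(line.strip())
        if not match:
            continue
        first_word = match.group("rest").split()[0].lower()
        if first_word not in {"add", "remove"}:
            risky.append((number, line.strip()))
    return risky


# Hypothetical proposed change for illustration only.
proposed_change = [
    "interface GigabitEthernet0/1",
    " switchport mode trunk",
    " switchport trunk allowed vlan 30",
    " switchport trunk allowed vlan add 40",
]

flagged = risky_trunk_lines(proposed_change)
for number, text in flagged:
    print(f"line {number}: '{text}' replaces the allowed VLAN list")
```

As Phil points out, a check like this is still written by a person who has to know what the risky pattern looks like, so it narrows the room for error rather than removing it.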
Doug Madory: I think probably 2021, this will be with us forever, probably. But I think 2021 was definitely the year of learning, of humbling experiences of the greats of the internet falling to their knees. You had Facebook, you also had, Amazon had a couple outages, the second one wasn't as big, but the first one you mentioned, where there was an internal DNS issue. It turned out that a lot of internal services did not use multi-region, which is funny. I mean, that's a cloud operator, it's their stuff. They could be replicating this in every region. Everything was based, like the rest of the world, in US-East-1 and they were too. And singly homed with their internal services, including their internal DNS, it's kind of... I don't know. But it's easy for us on the outside to be like, "Ah, you should have known, da, da, da." I think these things, especially this scale, the scale that we're talking, of like a Facebook, AWS, earlier in the year it was Fastly, and Akamai had outages in that year. I think every major provider has had one of these. And it is hard to anticipate every dependency. And then you know, it's out of band, it's easy for us to say, "Oh, you should have had an out-of-band communication that doesn't have any dependency on your network and allows you to remote in to your stuff and configure it." That's actually a hard thing to number one, make and then two, secure. Can you imagine, you're creating back doors that have no reliance on anything and trying to secure that?
Doug Madory: I mean, it's easier said than done, building these things.
Phil: I've been there too. I mean, I've made those mistakes and have had those humbling experiences, but never at a global scale. I've taken down networks. I remember taking down subnets and then entire networks from time to time, but few enough that I learned from my mistakes and then really took the time to analyze and to investigate and to research prior to making a change. And even then, during that change window, you're sweating, literally sweating because you're like, "Okay, here we go." And the whole team is on pins and needles. "All right, hit the enter button," or the enter key. But it was never at this kind of global scale. And we've been talking about what, eight companies, 10 companies? That's all you've mentioned, the names of literally less than a dozen companies thus far in our conversation. And granted, that's just in the scope of our conversation, but the point I'm trying to make is this idea of a very few number of companies holding not the power, but all of the connectivity and the data transfer, and all of the content even, for so much of the global internet today, maybe that's part of the problem. Where somebody, a human being, makes one mistake at Facebook and then boom, you have a huge, huge disruption. Whereas, if there was something more decentralized or there were more companies... and I don't know, but it seems to be ...
Doug Madory: I think this topic definitely came up in the Amazon, AWS, outage in December. I mean, when Facebook went down, it was basically everything owned by Facebook was down, you understood that. The rest of the world was essentially, unless you are using Facebook to log into something else, which some people were doing with the credentials, the rest of the world carried on. With the AWS outage, we learned how much everybody is using US-East-1, just the one region, the one cloud provider is powering so much. And so, I got invited onto Fox Business with Neil Cavuto on live TV and it was like, "Why is this happening?" I was like, "Well, let's just take a step back." I mean, one thing is this is a service that is wildly popular. Cloud services are solving problems. And this outage is the flip side of that success. It's hard to know, this is an unknowable question, but let's say there was a thousand companies that got knocked out for that period of time. How long have they been on AWS? Maybe a year? How many little outages would those companies have had that they didn't have because... they had put this on AWS and they take the responsibility for running this. So, there is a trade-off, you're-
Phil: Yeah, ...
Doug Madory: ...not having outages normally that you're responsible for. You outsource that. AWS is keeping this online until they don't but that's quite rare. In the meantime, everybody's up all the time. I mean that didn't put a dent, in my opinion, at all in the cloud business, it still is a good... So then yeah, you still have this issue of consolidation, of you have a handful of companies that can take down a lot of connectivity and I don't know, I think this-
Phil: Well, what about on the providers' side then? We're talking about CDNs and other types of... Facebook and Google, however you want to define them, but what about actual providers?
Doug Madory: Like network service providers?
Doug Madory: Yeah, so there's two dynamics there. There's one at a national level, that ends up being governed by how much the regulator of that country tries to introduce and foster competition. And the lack of competition is a measure of the regulator's power to control the market. There's lots of countries that have dominant incumbents. Every country started the same way. Everybody started with a state telecom, the government started this thing, whether that's a hundred years ago with telegraph wires or something, that everybody started with a state telecom. And then that is now... whatever version of that exists today is now the incumbent. Sometimes that still is the state and government-owned thing. Sometimes it's been privatized, but then separately, there's a regulator that's trying to foster competition. I think, at least in the business of telecommunications, it's an accepted truth that more competition breeds lower prices, better service. But it also, depending on how strong that incumbent is, that may cost them jobs. They've got a lot of pull and sway, and so they'll fight some of those. Anyway, so at the national level you have this... So Canada's an example of one that's... it's not the worst, but it's not... they could have more competition. I mean, the United States could as well, I mean, you could take any... there's lots of countries that you could make this argument about. Then at a much higher level, you've got the backbone providers, the big global networks. I mean, I guess we've seen some consolidation in there. At that level, it's interesting to me that it's not... You have to move a lot of traffic to make any money and you have to be really big and it's a commodity thing. And anytime your product turns into a commodity, then it's just can you move large amounts? And it becomes cutthroat on price and scale. 
And so, you have these very large companies trying to move huge amounts of traffic, with very few people, as few engineers as you can, just as much equipment as you need. And there's been some consolidation. You've got, I guess they're Lumen now, who owned Level 3, who bought Global Crossing and XO, and that all those companies are now... CenturyLink is all one thing. So, there's been some consolidation there. And then there's others that are essentially national champions that are probably not going to be... they're too important to be acquired. NTT Japan's not going to be acquired. And even Tata, based in India, is not going to be... maybe they will, I guess...
Phil: It's interesting that you bring up this distinction between privatized carriers, carriers that were started as state-run businesses, things like that. Because it started to get me thinking of the last section, or maybe the middle section of your blog post. You talked about outages in Cuba, the protests in Iran, and then the resultant behavior from the national government there to shut things down. That's something that we're starting to see. Ukraine as well, right? There is this... You were going to say something? Go ahead.
Doug Madory: Oh, I was going to say. The last outage in Ukraine, not so much government directed-
Phil: Oh fair, that's right, yeah.
Doug Madory: ...I mean, they're getting bombed. It's happening still to this day.
Phil: Nevertheless, a result of geopolitical occurrences.
Doug Madory: For sure.
Phil: In this case, war, which is not the same as the government trying to shut down a protest, or stop the flow of information. So, that is true. But nevertheless, it's nation states making decisions to control information. And that's really where I was getting at with Cuba and Iran. That is a very different type of outage, isn't it? And disruption on the internet. We've talked about natural disasters, and undersea cables, and satellite and human error, which is probably the most ubiquitous type of outage. But now we're seeing this happen more and more and you're reporting on it. I see your Twitter feed, I see your LinkedIn posts, all the time about nations that say, "No, we're going to shut you down, citizenry." How is that playing into this? Is it becoming a more widespread thing that's happening among countries?
Doug Madory: Yeah, I mean on this topic, I got into it long ago with the Arab Spring, back when we were... I was with Renesys, we were already mapping out the global internet with a live picture of the internet in the country. And so, when Egypt went offline, we could pull that up immediately and understand what was up, what was down with the timing. And then we worked with the global media outlets to tell that story from a technical standpoint. I've never stopped covering this from... or trying to come up with ways to contribute some technical analysis to our understanding of these things, but they don't seem to stop. It's very hard to order a sovereign nation to stop doing something bad, whether that's shutting off internet or other things. But this one, I thought this fall, we saw a couple more instances of what's been termed an internet curfew, where internet service gets turned off for a period of time deliberately and then restored later. And it's often in the evening. It's often focused more on mobile services than fixed line. And what the issue is, in each case, going back to, we saw this first in Gabon and then, Myanmar last year did this after the military coup. In each case, the rationale is there's definitely some cost to shutting all your services off. There's cost to the businesses of your country. There's lots of things, it's disruptive to the government itself. And so to mitigate, to hedge that, then they want to try to isolate, be a little more surgical about shutting things off and blocking, censoring particular types of web services is one thing that's done quite a bit. And then another thing is to... the case of Iran is a good example. This went for, I don't know, more than a couple weeks where basically the three major mobile providers turn off their service, starting in the early evening and going into the early morning.
And what they're trying to do is try to disrupt the protestors in the street, their ability to communicate to each other and organize, their ability to report. I mean, eventually service is going to come back and they can report at that point, but at the moment they're unable to tell each other what's going on. And then at the same time, anybody who's using fixed line, those are people in offices, these are people... Those people, their service for the most part remains online. There are some things that get blocked, but for the most part the services aren't blocked. And so, that's their way to try to lower the cost of a government-directed shutdown. And we may see more of this, it was both Cuba and Iran happening almost simultaneously this fall, last year with Myanmar. There aren't that many examples of it, but I don't know, I think these guys look at other countries and see what they do and others may learn from that and be like, "Well, that's a good way to keep... not upset the business community and the government and get those pesky protestors to not be able to communicate."
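Editor's note: the curfew pattern Doug describes (service dropping at roughly the same hours on several days, then coming back) can be picked out of hourly traffic data with a very simple check. The sketch below is illustrative only and is not Kentik's method. The function names, the 20% drop threshold, the three-day minimum, and the traffic numbers are all assumptions made up for the example.

```python
from collections import Counter

def outage_hours(observed, baseline, drop_ratio=0.2):
    """Return the hours (0-23) where observed traffic fell below
    drop_ratio times the baseline for that hour."""
    return {
        hour
        for hour, (obs, base) in enumerate(zip(observed, baseline))
        if base > 0 and obs < drop_ratio * base
    }

def recurring_curfew(days, baseline, min_days=3, drop_ratio=0.2):
    """Return the sorted hours that were 'down' on at least min_days
    of the supplied days, which suggests a scheduled shutdown rather
    than a one-off failure."""
    counts = Counter()
    for observed in days:
        counts.update(outage_hours(observed, baseline, drop_ratio))
    return sorted(hour for hour, n in counts.items() if n >= min_days)

# Synthetic data: a flat baseline of 100 units per hour, and a
# "curfew day" where traffic collapses from 16:00 through 23:59.
baseline = [100] * 24
normal_day = [100] * 24
curfew_day = [100] * 16 + [5] * 8

curfew_hours = recurring_curfew([curfew_day] * 3 + [normal_day], baseline)
print(curfew_hours)
```

A real analysis would need a per-hour baseline that reflects normal daily cycles, and would look at mobile and fixed-line networks separately, since the shutdowns discussed here mostly target mobile providers.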
Phil: These pesky protestors. I mean, you say there aren't that many examples, but they are very top of mind. I mean, this notion of authoritarian governments shutting down protests in and of itself by shutting off the flow of information. Though it's very few, as far as the examples that we can point to, they're very profound I think, in their impact and in the discussion, the philosophical discussion on the future of the internet, its use by humanity just to propagate information and to be connected. Now, separate note, but just thinking now about the global internet, how BGP operates and how it's so much of a trust relationship. I can take in a full feed, I can advertise pretty much whatever I want. Beyond a password for an adjacency and some other things like that, do you see that as a problem? Now that this is not 1994, where the internet was this cool idea that's never going to take off, and now is the lifeblood of our society, at least in the United States and in many other nations, is the security component of the internet, maybe specifically BGP, but of the internet an issue? Because we are seeing hospitals, hospital systems with 80,000, 100,000 employees, in the New York tri-state area get... what is it? Hijacked and shut down until they pay a ransom, expand that to a global scale, shutting down a country unless you do what I say, that sort of thing.
Doug Madory: Let's see, I guess on the topic of routing security, I think for those of us that work in that space and think about it a lot, I think the number one message to communicate to people who don't, who aren't in that space is that this is a very hard problem and it is a constellation of problems. It is not one thing. There's a whole bunch of sub-issues here. We're not going to deploy a single technology in a day that solves this, it's just... you can only imagine, trying... You've got routers all across the world, an internet of routers that all have to... we've got to get them to do something different. That is really hard to do. So, the problem's very hard, but I would also... and it's not solved, but I guess I'm a glass half full kind of guy. And here's why. So, there's RPKI and ROV, route origin validation, which are the technologies getting pushed these days trying to limit certain types of these problems. You can imagine there's a spectrum of these issues. And on one end is the bonehead errors, just stupid stuff. And when I started doing this a little over 10 years ago, there was lots of stuff happening on that end of the spectrum and probably lots of stuff happening on the other end too. But we had made no progress on even easy or just dumb errors. I feel like these days, with the adoption of ROV, so what this is basically, folks who own internet resources, so IP addresses specifically we're talking about, can go into their... portal. If you're in North America, you're going into ARIN, you set up what is the proper origin that you would want, who should be originating this in BGP land? And that gets communicated to everybody in the system. If they see another origin, they'll drop that route, if they're also participating in the system. It's not foolproof, but it does reduce the impacts of origination leaks, when somebody accidentally barfs out a bunch of... a full table.
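Editor's note: the origin check Doug describes follows the outcome logic standardized in RFC 6811, where a route is "valid", "invalid", or "not-found" depending on the published authorizations (ROAs) that cover it. The sketch below is a minimal illustration of that logic, not a router implementation. The `Roa` class, the `validate_origin` function, and the sample prefixes and AS numbers (taken from the ranges reserved for documentation) are assumptions for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from ipaddress import ip_network

@dataclass(frozen=True)
class Roa:
    prefix: str      # address block the holder has authorized
    max_length: int  # longest prefix length allowed in an announcement
    origin_asn: int  # AS number authorized to originate the block

def validate_origin(route_prefix: str, origin_asn: int, roas: list[Roa]) -> str:
    """Classify a BGP announcement as 'valid', 'invalid', or 'not-found'."""
    route = ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ip_network(roa.prefix)
        # Skip ROAs for the other address family or that do not cover the route.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if roa.origin_asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    # Covered by at least one ROA but matched none: wrong origin or too specific.
    return "invalid" if covered else "not-found"

# Hypothetical ROA set: AS64500 may originate 192.0.2.0/24 and nothing longer.
roas = [Roa("192.0.2.0/24", 24, 64500)]

print(validate_origin("192.0.2.0/24", 64500, roas))     # authorized origin
print(validate_origin("192.0.2.0/24", 64501, roas))     # different origin
print(validate_origin("198.51.100.0/24", 64500, roas))  # no covering ROA
```

A network that participates in ROV drops the "invalid" routes. Routes that come back "not-found" are typically still accepted, which is one reason the protection only extends to address space whose holders have published ROAs.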
I mean, we haven't had one of those in a long time. And I don't think there's... there's not just RPKI, there's things like... I don't want to get into all these, but things like Peer Lock, something that... so this is where the major providers look at just... they just filter. They should not be receiving other... people in the, we call it the DFZ, default-free zone, the top of the internet. Top of the internet hierarchy. So, it's a fully connected mesh, is the way the internet has to operate. Any link gets disconnected, then there's a partition and one part can't touch another. But at the top, it's a full mesh. But they should... so let's say take Lumen, and let's say NTT, Lumen should not be receiving NTT stuff from one of its customers, is essentially the gist of it. And you can come up with a handful of rules that just what would be the AS paths that should be automatically rejected? That's widely deployed. And so, I would say that a lot of these major bonehead things, they don't happen that much anymore. I would add also that I was one of the people who was writing up a lot of sketchy internet routing issues involving China Telecom. And this got a lot of circulation in national security circles. It's led to the FCC revoking China Telecom's license to operate telecommunication services in the United States. We used to see them involved in these leaks. I can't tell you the last one that China Telecom was a part of. And they had joined MANRS, which is the Internet Society's effort to try to... and there's not really any mechanism. You join and you pledge to do a bunch of things and then they are vouching that you were actually doing these and there's some attempt to try to check up on this. It's mostly a pledge and then that you'd be shamed if you did something wrong, you have to.
Phil: You'd be shamed. It's a trust relationship.
Doug Madory: But then I don't know, they joined it and there isn't... I haven't seen them in one of these in a while. And so, maybe we are making some progress.
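Editor's note: the AS-path filtering Doug mentions above, where a large network refuses routes from a customer if the path passes through another top-tier network, can be reduced to a very small rule. The sketch below is an illustration under stated assumptions, not any provider's actual policy. The function name `accept_from_customer` is hypothetical, AS2914 stands in for NTT as the protected network from Lumen's point of view, and the customer AS numbers come from the range reserved for documentation.

```python
from __future__ import annotations

# ASNs that should only ever be heard over direct peering sessions,
# never through a customer. Here: AS2914 (NTT), as in Doug's example.
PROTECTED_ASNS = frozenset({2914})

def accept_from_customer(
    as_path: list[int],
    protected: frozenset[int] = PROTECTED_ASNS,
) -> bool:
    """Return False if a route learned from a customer has a protected
    ASN anywhere in its AS path, which indicates a likely route leak."""
    return not any(asn in protected for asn in as_path)

# A customer announcing its own routes and those of its downstream: accepted.
print(accept_from_customer([64500, 64501]))

# The same customer relaying a path that runs through NTT: rejected.
print(accept_from_customer([64500, 2914, 64510]))
```

The rule works because a top-tier network already reaches its peers directly, so a peer's ASN showing up in a customer's announcement almost always means the customer is leaking routes it learned elsewhere.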
Phil: Well, I have to imagine that it behooves any country that has any kind of nefarious intention to not do that when they are interacting with other nations as a means of doing business, of interacting with other countries to keep peace. So, as much as you might have this nefarious intention... I know this is going to be a strange analogy, Doug, but I just watched The Good, The Bad And The Ugly with Clint Eastwood, one of my favorite movies. I love Westerns. And my wife was with me and she's like, "Did they really act like that?" And I'm like, "Not really." I did some quick Googling on my phone while we were watching the movie, and I was just reading that a lot of this stuff is fantasy and hyperbole for movie sets.
Doug Madory: It's a good story though.
Phil: Yeah, it's a great story. I love it. I love the spaghetti Westerns very, very much. And the fact is that folks went out to the West US, cowboys and families and ranchers and all these people alike, to make a better life for themselves. You had a bad element, but it wasn't this Wild West that we see in Clint Eastwood movies and other movies like that, John Wayne movies. It was much, much more calm with, of course, those exceptions. I have to imagine that that's true on a nation state level, on a provider level, on a global scale with regard to the internet. We're all trying to do the best we can for ourselves, it's self-interest, and I get that, and that's okay. But it's in my self-interest as a country to make sure I'm not screwing up my relationship with other countries because that's a great source of income and again, keeping-
Doug Madory: No, it's a good point. It is a property of the internet. But I would also add that... I'll shift gears from being a Pollyanna to, there's a lot of assumptions on the internet. For example, when we've exhausted V6, all the V6 that's going to... IP address that's going to be used it's already been given out. There's no more to have.
Phil: You mean the V4?
Doug Madory: Sorry, V4, yeah, right.
Phil: Yeah, I was about to say.
Doug Madory: We've not run out of the V6 yet. Sorry, V4.
Phil: We have a few more V6.
Doug Madory: Yeah, excuse me, I misspoke. But if we got to a point where, let's say China is like, "You know what? We think..." they actually, let's say not that long ago before the DOD started announcing all this address space, it was a big story last year. They were sitting on a ton that nobody was using, like four billion addresses or something. Or maybe not four... yeah, there was some large amount, a few hundred million. And Chinese companies would use these internally, use US DOD address space internally. And there's nothing to stop them, as long as it's not out on the internet, and occasionally-
Phil: They're just numbers, routers don't care.
Doug Madory: ...if you ran a traceroute, it'd be funny because you'd be like, "China, China, China, DOD, China, China, China." But you could get to a point where countries no longer agree on what are the norms of respecting these boundaries because there's no way to enforce any of this. And right now, it's still, we're all in this... we have this common good. Even though some countries may be at war with each other, they still respect these boundaries. But I don't know, I think it's one of these things where that's a line that could get crossed at some point. So, let's say, Ukraine has... their vice president had called for this IT army to attack things in Russia. If you wanted to support Ukraine, you'd attack and people were participating in this, this still happens. I mean, if they really wanted to, they could start just announcing all the Russian address space and just screwing... and then Russia could start announcing all of the Ukrainian address space and it would just be a complete mess. And that's a line, I think we had just assumed that would never get crossed. But it could and then we'd be-
Phil: It's a hypothetical, isn't it?
Doug Madory: There is a possibility of this stuff breaking apart and...
Phil: There's no technical reason that none of those things could happen or couldn't happen.
Doug Madory: There's no technical reason, yeah. I'll just say one more thing, Phil on the-
Doug Madory: ...so we talked about the reasons to be optimistic on routing security, but then the flip side of that is that there's a lot of things that aren't solved. And there's been a few cryptocurrency attacks that were very profitable for the folks that pull these off and these are-
Phil: Oh interesting, yeah.
Doug Madory: ...the sophisticated attacks. So, I mentioned the spectrum of one end's bonehead, the other end's sophisticated... we call it a determined adversary, is the phraseology.
Phil: What was that term again?
Doug Madory: A determined adversary.
Phil: Determined adversary, yeah, that's right.
Doug Madory: If you were really determined to defeat RPKI ROV, you can do it. And there's ways to manipulate the system that just aren't... we just haven't solved yet. And there are people who are doing this now. I think as we clear and clean up the bonehead section and experts can keep moving the needle up towards the determined adversary, then I think we can start to tighten down and make that costlier, if we can't make it completely prevented. And maybe some of these scenarios can be prevented, but there's a lot happening in that space as well, in determined adversary BGP hijacks.
Phil: And with regard to this entire idea of outages that we've been talking about and major disruptions on a large scale, global scale, certainly more complex than the few things that we've mentioned today, the theme that I picked out, natural disasters, and human error, and nation states being authoritarian in their control of information. It's actually much more than that. And we touched on BGP security as well. So in any case, Doug, we're at time. Great conversation. So much we can unpack, I think we can turn this into about 20 different podcasts, maybe a series on internet analysis.
Doug Madory: Love it.
Phil: In fact-
Doug Madory: I love talking about this stuff.
Phil: Yeah, no, I appreciate that.
Doug Madory: Happy to do it.
Phil: So, as we wrap up here, I'd like to give you an opportunity, how can folks reach out to you if they don't know already, how they can reach out to you to ask a question, make a comment?
Doug Madory: I am still on Twitter. I haven't departed. I still think it's probably going to be around for a while, but that's probably the easiest. I put some things there. I'm on LinkedIn as well. You can feel free to send me an invite. If you're in the business, I usually accept it and then I try to start a conversation about what is it you're interested in, see if there's any commonality with our interests. Those are good ways to-
Phil: Right, what's your Twitter handle?
Doug Madory: It's just @DougMadory, D-O-U-G M-A-D-O-R-Y.
Phil: Great. And I believe you blog pretty frequently on the Kentik blog?
Doug Madory: I try to, yeah.
Phil: Right, okay. So make sure to check that out. You can find me on Twitter, @network_phil. I am still very active there. I'm also on LinkedIn. You can search me there. My blog is networkphil.com. Not as frequently posting recently, but you can still check it out. So until next time, thanks very much for listening. Bye-bye.
Do you dread forgetting to use the “add” command on a trunk port? Do you grit your teeth when the coffee maker isn't working, and everyone says, “It’s the network’s fault?” Do you like to blame DNS for everything because you know deep down, in the bottom of your heart, it probably is DNS?
Well, you're in the right place! Telemetry Now is the podcast for you!
Tune in and let the packets wash over you as host Phil Gervasi and his expert guests talk networking, network engineering and related careers, emerging technologies, and more.