Kentik - Network Observability
Telemetry Now  |  Season 1 - Episode 20  |  August 8, 2023

Understanding the health of the internet with Romain Fontugne


Dr. Romain Fontugne
Deputy Director at IIJ Research Lab

Romain is Deputy Director at IIJ Research Lab, leading the Internet Health Report project. He is also a WIDE project (MAWI working group) member and an associate member of the Japanese-French Laboratory for Informatics (JFLI). Romain was also a 2022 MANRS Research Ambassador.


Philip Gervasi: We've gotten pretty good at monitoring local networks using a variety of technologies, and we've been getting better and better at monitoring Public Cloud, SaaS providers, the latest network overlays and so on. But how do we monitor the health and activity of the whole system, the entire internet? How do we monitor performance on a global scale, and how can we identify dependencies among providers, something that's very important? And ultimately, when it comes down to it, how can that information actually help us? Today with us, we have Dr. Romain Fontugne, a subject matter expert in internet measurement and the creator of the Internet Health Report. We'll be discussing why and how we monitor routing and we'll be discussing the technology, the impetus to do so. We'll be talking about something called AS Hegemony, and we'll be unpacking the Internet Health Report as well. Doug Madory, Kentik Resident Director of Internet Analysis, who also specializes in the field of internet measurement, is also with us. So we really have a great episode lined up for you today. And my name is Philip Gervasi, and this is Telemetry Now. Romain, it's really great to have you today, thank you so much for joining. I'm really interested in this. I listened to a podcast that you did recently, I think it was with APNIC, correct? Very, very interesting. And I read some of your literature online, so I can't wait to dig in. And Doug, of course, it's great to speak with you again. You're both in Tokyo, Japan right now, correct?

Romain Fontugne: Yes, we are.

Doug Madory: Yeah, this is on location.

Philip Gervasi: On location. I'm on location still in upstate New York, so a little bit of a different scenery out my window. That's not exactly true. I don't have a window in my office, but outside it is evening and it is morning for you. So I appreciate very much that you're joining me from many, many time zones away. So before we get started and get into today's episode, Romain, would you give us a little bit of an introduction to yourself, a little background maybe from a professional standpoint, but also your history and how you ended up in Japan?

Romain Fontugne: Yeah, sure. Well, I first came to Japan for a six-month internship, and this was 15 years ago. I'm not an intern anymore, since I did my PhD in Japan, analyzing internet traffic. That was an interesting project. Then a postdoc at university, and now I am at IIJ in the research lab. That was a long trip, but very interesting. I started with analysis of internet traffic. There is an interesting project here in Japan where they record traffic on one of the academic networks. I was looking at this during my PhD, finding [inaudible] viruses in that traffic. That was very interesting. And now I've moved a bit more to looking at the topology of the internet.

Philip Gervasi: The topology of the global internet to be more precise of external networking and what's going on. I assume then that means we're going to be talking a lot about BGP and autonomous systems and that sort of thing.

Romain Fontugne: Oh yes. We love BGP.

Philip Gervasi: Oh yeah, my favorite stuff. And Doug, no stranger to our audience, but of course if you wouldn't mind, give us a little introduction to yourself and how you came to know Romain maybe.

Doug Madory: Right now I am on location in Tokyo at the IIJ lab, the research group. I'm here for a couple of weeks. I'm something between a visiting scholar and an intern, although I've just been cleaning whiteboards and fetching coffee so far, so it's going to get good, I think, any minute now. Let's see, so I think I met Romain's colleague Mark at a conference maybe five years ago, and he was like, "You got to come to Tokyo. If you're ever in Tokyo, you're going to come to IIJ." And I was like, "Oh, that sounds great. I'd love to go." I had a Japanese exchange student when I was a kid and so I always had an interest in Japan, but despite all my traveling around the world, both in the internet measurement business and in the US military, I never made it to Japan. So this is my very first time here, so it's been a little bit of a dream to come here. And then I think Romain and I were at an event in Paris at the end of last year and we were at the speakers' dinner coming out of that, and I made an offhand comment and I was like, "I harbor this fantasy of coming to IIJ one day," and he's like, "You should come." Anyway, we made it work. Of course I'm here in July and it's pretty warm in Japan in the summer, it's even warmer than upstate New York. So yeah, I'm here talking to the researchers here and then there's a handful of graduate students that are doing work here, so I'm having some good hallway and cubicle conversations about the research they're doing.

Philip Gervasi: I bet. That sounds great. And as the Resident Director of Internet Analysis for Kentik, this is right up your alley, so I'm really glad that we're all three here talking about this. But you mentioned the IIJ several times, Romain, what is the IIJ?

Romain Fontugne: So IIJ is a Japanese internet service provider and, in fact, it was historically the first commercial ISP in Japan. So it's a pretty old company, in terms of internet history. We celebrated the thirtieth anniversary last year. It has a long history, it's very important, I think, in Asia. One thing that people don't usually know, it was one of the first offices for APNIC. So when APNIC started, at that time there was RIPE already running, and in Asia they were thinking, "Okay, we have to try that." So there was a professor here called Jun Murai, he's known as the father of the internet in Japan or the internet samurai, and they decided to do that pilot project, APNIC, trying to replicate what RIPE was doing in Europe and what was happening in other parts of the world. At that time, I think IIJ had applied for its license to be an ISP. There was some delay and at the same time they hired David Conrad from the US. He was supposed to help with the work, but he was waiting for that license too. So Jun Murai told him, "Okay, you could start up that pilot project." So there were of course a lot of people involved. It's not only IIJ, there was [inaudible], NTT, IIJ, but he was doing most of the work and so, in the IIJ office, we were hosting the very first years of APNIC.

Doug Madory: So also for the internet in Japan, IIJ was the first ISP, is that correct?

Romain Fontugne: Yeah, that's correct. So at the very beginning, there was IIJ. Then NTT, which used to be the incumbent ISP here, was operating only within Japan, only within the country, and KDDI was doing the international connectivity. So IIJ was kind of the free electron that was doing both of them, domestic and international connectivity.

Philip Gervasi: Very good. And your role primarily with the IIJ is in research, is that right?

Romain Fontugne: Yes. So now I am Deputy Director at IIJ Research Lab. My main job is to do research. What is interesting in IIJ Research Lab is I think we have a foot in both academia and industry. So we are very present at conferences. We are part of a lot of the technical program committees for conferences. We publish a few papers also, and we try to work with IIJ and do research that is very applicable. We have about 20 people, I think, in the lab. We have a summer internship, so Doug is not here for that, but he can see-

Doug Madory: Kind of.

Romain Fontugne: Sitting with our interns now. We have a postdoc program. We are very open to academia. We have a lot of different, not a lot, but we have three or four different teams focusing on different topics. So we have one that is doing a bit more systems work, and they're now trying to push code to the Linux kernel, for example. We have a group that works more on IXPs, the Internet Exchange Points. My group is working on monitoring the internet, and we have another part on what they call Cloud Morphing. So it's working more on the Cloud too. IIJ, yeah, I haven't mentioned that, but IIJ is an ISP, and when you have such a large network then you can provide a lot of different services. So IIJ is providing internet connectivity but also has a Cloud division, a security division. We are a mobile operator also. So we have a lot of different aspects.

Doug Madory: It's a big company. We're in the IIJ tower, they've got a big building that we're presently in.

Philip Gervasi: Okay. And then the Internet Health Report, I read all about that over the past few days and then a lot today. Is that a work of the IIJ that you do or is that something that you do separately?

Romain Fontugne: Yes, so I'm leading that project. I would say mainly with my colleague Emile Aben at the RIPE NCC.

Philip Gervasi: Okay, I see.

Romain Fontugne: It's a research project sponsored by IIJ. Well, I should say up front, it's not a service that IIJ is providing. It's really like a proof of concept. And this is coming from the research lab, so it's very researchy, and the goal of the Internet Health Report is to provide an observatory for the internet. So we try to have an understanding of how the whole internet works, and try to monitor this. We are designing tools so we can document the evolution of the internet and document some of the rapid events that might happen on the internet. One of the singularities of this project is we are using only Open Data. So there is a very big internet measurement community. There's a lot of data that is available, like BGP data, traceroutes from RIPE Atlas, [inaudible] in the US providing a lot of data too. And we are trying to get as much as we can from this data. So there's a lot of data sitting out there and we are designing tools so we can get as much insight as we can from these data sets.

Philip Gervasi: But insight about what? You mentioned a couple of things specifically when you say internet measurement, getting an idea of the topology of the internet and what's going on as far as changes, the evolution. So I assume that means in a shorter term what's happening over the course of days, weeks, months, and years. Why, first of all, and is that correct? Is that what you're looking at?

Romain Fontugne: Yeah, it's correct. It's what we are looking at. One thing I give for the motivation is, let's step back and think a bit about what the internet is. The US government defines the internet as a critical infrastructure. And when you look at other critical infrastructures, you have the power grid, the water system, the hospitals, nuclear power plants, transport, the air traffic. And one thing all these critical infrastructures have in common is there is always a system to monitor them, usually in real time. You can think of those websites for air traffic where you see all those planes flying over. This is great. You can see the entire commercial air traffic going on. And as computer scientists working on the internet, we thought, "Okay, it would be cool to have this for the internet." It's very important, the internet is now the base for a lot of services. So we need to monitor it. We want to understand its strengths and weaknesses. If there is a problem in the network, we want to monitor this, if possible in near real time. And we also want to report if there is a problem for resiliency or if there's any problem that can happen there.

Philip Gervasi: So it's not necessarily a purely security-focused initiative or a purely performance-focused initiative. It's really the state of the internet as it stands today, measuring it the way that you do using the data that you do. So it's a little bit more of a holistic approach, it sounds like.

Romain Fontugne: Exactly. Yeah. And there are a lot of ISPs, also in Asia, that have their own system. They can monitor their own network, see how their traffic goes. Well, that's one thing, and it can take you quite far. You can see your network and see your traffic. But the approach we take is a bit different, because the internet is, as we usually say, a network of networks. And what it means is all the networks that are connected to the internet usually have to rely on other networks to have global connectivity. So if I'm in IIJ and I send a message to someone in the US at, I don't know, Comcast, then if IIJ and Comcast are not directly connected, my message might go through another provider, and that means our connectivity depends on that third-party provider. So we have one project in the Internet Health Report where we try to measure those dependencies. And this is, I think, a very important aspect that goes across ISPs, and you need that holistic view for that.

Philip Gervasi: Right. And the internet being a network of networks, we are talking about external connectivity and the internet backbone and intermediate connectivity and transit providers and that sort of thing. So ultimately, even a very large global enterprise is sitting behind all of that, kind of as end nodes on this intermediate and external system. So one of the things that you mentioned, I already know the answer to this because I heard you, I read it in something you wrote, but I'm going to ask anyway. What is the motivation specifically behind the Internet Health Report? Is there something lacking or is there something deficient with the data that's out there? I'm familiar with what APNIC does and RIPE and organizations like that. What gap are you filling in?

Romain Fontugne: The gap we are filling in with... Okay, let's take RIPE as an example. RIPE is great because they provide a lot of data. They have a project called RIS where they provide tons of BGP data, and another project called Atlas, where they provide tons of traceroute data. And this is, for us, it's gold, data is gold. There's a lot of information in that data, but they don't provide a lot of analysis on top of that. So that's what we are trying to do with it. Sometimes I see myself as a data analyst, so I just take that data, try to squeeze it as much as I can and get some new insight out of it. The difficulty for us is it's a lot of data. So this is just a technical challenge. How do you analyze that much data, especially if you want to do it in near real time? And it's also very noisy data. For anyone that's used to working with traceroutes, latency, even BGP, most of the data you receive is [inaudible], and the part that is not really interesting. So there's a lot of filtering to do. There is also a lot of expert knowledge that is required. You have to know exactly how those tools work to understand what is noise and what is a strong signal out of this.

Philip Gervasi: Right. Are you able to discern that programmatically or is that a largely manual approach with a team of data scientists?

Romain Fontugne: Well, that's where I think Doug shines. Yeah, because Doug is really good at that, and our approach is to try to automate that. So we are trying to make a bot version of Doug, if you want, somehow.

Doug Madory: Good luck. I might interject too, just to add that I think people who aren't practitioners in this space may not appreciate that it's a big enough space and there are a lot of questions. Romain was talking about the data, but different groups take different approaches, and so there's some uniqueness to the approach that the Internet Health Report takes that makes it different. And so every time you take a different angle, you have the potential for discovering something in a way that couldn't have been discovered through an existing approach. So they've got AS Hegemony, if you want to talk about that. There are a couple of things that are very unique here that are good at answering certain types of questions that we don't have another tool for, and there's room for other aspiring internet measurement practitioners out there with other approaches. There's probably 10 more that someone could be inventing. There are a lot of questions to be answered.

Philip Gervasi: Hence your summer internship, right, Doug?

Doug Madory: That's right, yes.

Romain Fontugne: Yeah. Well, one thing I want to [inaudible] on that is the Internet Health Report is completely open source. So our code is all on GitHub. We are welcoming anyone, and we provide data to anyone. We now have this platform that ingests all those big data sets, and if someone wants to write a tool on this and make it run on our platform, that's also possible. We are part of the Google Summer of Code. We get some students that also work on the project like that. We have interns. We are very, very open now.

Philip Gervasi: So it's not a matter of a deficiency in one of these other organizations that we've been discussing. It's more a matter of using their data as the foundation for your data analysis to find, like you said, the insight, what's really going on, which does beg the question what is going on? I thought that the internet never converged and we're talking about ephemeral networks that pop on the internet and then disappear and things and routes and prefixes that are pulled and pushed. Is it very difficult then when you're talking about a dynamic and not a static data set?

Doug Madory: It's all in the Cloud, it's-

Philip Gervasi: Okay.

Doug Madory: Just draw a cloud on the whiteboard. That's it.

Philip Gervasi: There we go.

Romain Fontugne: It's hard to analyze this. So again, to go back to the parallel I drew with the critical infrastructure, one big difference with air traffic or the power grid is the internet has two components. There's the physical component. We can see, we know where these submarine cables are. We know some operators are going to show some of their fiber network. There is a physical infrastructure. But the internet, the IP infrastructure, is on top of that. And it's more like a cyberspace where it's very hard to go from one to another. The physical side is more or less static. It's developing, but slowly. In IP you can have large reroutes. So you can see a lot of paths that go in one direction, and one minute later they're going in the other direction. Everything changes very quickly. You don't see this in air traffic, you don't see suddenly your plane just-

Philip Gervasi: I hope not. We like stability in most arenas, I think.

Romain Fontugne: And sometimes, and this is another thing that Doug is doing a lot, when we're interested in geopolitical events, we have to match an event that happened in a country and we have to find the IP resources. The mapping between, again, the physical infrastructure and the cyberspace in the internet is not an easy task. That's one thing. The other thing is we are looking at an object, the internet, that is evolving. It is growing all the time. So there are a lot of graphs online. Geoff Huston, for example, has a graph where you see the number of ASNs is always increasing. So that network is always growing, growing. But when you do outage detection or anomaly detection, it's hard to have a reference to say, "Okay, that's my picture of the internet. How different is it right now?" and detect those anomalies. It's hard because that thing is changing anyway.

Doug Madory: Another angle, I think, is how it's evolved over the decades of the internet, which really hasn't been around that long as a core technology of human society. Just think of the change from the '90s. On one end you have what in the industry we call an eyeball network or an access network, which is how you connect to the internet. If you're in the US, that's Comcast, Spectrum, that kind of thing. And then on the other end is where the content is, the webpage you're trying to visit. So you needed some sort of transit provider to connect you from the access layer to the content. And the evolution that's happened over the decades is that the content providers are directly connecting to, if not embedded in, the access networks. And so there's been this evolution of like, "All right, what the heck's the point of transit anymore, because I get all my content directly? Netflix and Comcast peer directly. So why do we need the internet anymore?" And if you were to count packets or bits per second or something, you would find that most of the traffic is satisfied by either local cache or content peering. And only a small portion ends up going out over transit. And so that would make the argument of, well, who cares? And the truth is you do care. Even despite all those developments, the internet still needs to remain connected, and problems with the internet will still affect you even if it's not the majority of the packets being sent. Your DNS queries have still got to traverse the internet. There's a handful of things that are always going to have important dependencies on it. And so I like to push back sometimes on what goes along with the death of transit discussion, which asks why have all this BGP analysis at all if everybody's just watching Netflix and Comcast is directly connected. That doesn't require a lot of internet analysis to ensure it's working well. But the whole thing still relies on this global network.

Philip Gervasi: Yeah, yeah, absolutely. That's how we get to the Cloud that you talked about that we drew on the whiteboard.

Doug Madory: Sure.

Philip Gervasi: We connect to the Cloud. Now, I do think it's important to mention for our audience who are not necessarily in the service provider space that there is a difference between the access network in the enterprise and the access network in the provider world. In the provider world, the access network could be a 30,000-person organization with [inaudible] active/standby BGP connection peering to the internet, and they, as a whole, are connecting via there. They're accessing the network through that peering relationship. Whereas in the enterprise you have that three-tier design, and the access layer is where end users plug into the network which, interestingly, is logically the same thing if you think about it from a logical perspective. Obviously, from a scale-

Doug Madory: We're kind of similar.

Philip Gervasi: It is, it's how you connect into the rest of the infrastructure. And then we have an enterprise backbone sometimes, from data center A to data centers B and C.

Doug Madory: So in the enterprise example you're bringing up, I would argue, again going way back to managing networks in the military, most of the traffic is local there too. You have local services and so that's the equivalent of the Netflix. Hopefully people aren't watching Netflix on the enterprise network. But you're setting up local services so you don't have to rely on your link out, your transit link out. Hopefully, for as much as you possibly can, you'd like to have some sort of an [inaudible] direct or local connection.

Philip Gervasi: And I think that's more for the administrative management component and less for accessing services. Because in my experience in the enterprise, a significant amount of traffic is now going up and out, not branch to branch. You're not putting services at your local branch. There's no IDF down the hall with DNS servers. The only thing in my local branch might be a print server, since it's a pain in the neck to spool print jobs to somewhere else across the ocean. So much of the traffic, even if it's owned by the organization, is somewhere else. Or not traffic, some of the services that we want are somewhere else. And I think that's very common, even in small enterprises now, hence the discussion around Cloud connectivity, Multi-Cloud, Hybrid Cloud and all of those things. And I also think it's important to make a distinction between interior gateway routing and BGP, an exterior gateway protocol. They are different. BGP is not the same type of deterministic routing that you have with OSPF. You're not necessarily looking at path selection in the same way, but you're advertising prefixes, reachability and paths. These are all different things when you compare an IGP versus BGP, specifically eBGP. And so it does, in this conversation, mean that we really are focusing on global routing and how we reach things over the internet between providers and among providers, transit providers. But that does presuppose that there is a limited number of pathways. So you mentioned one minute all my traffic is going one way and another minute all my traffic is going another way. But very often, I'm limited in the number of pathways simply because of where I am geographically in the world. So that's something that I think is probably... Considering that what we do on the internet is both the mundane stuff, our productivity tools and Office 365, and also the mission-critical things like a hospital accessing its EMR online. Those are the things that you measure. 
So I have a list of several of the things that you talked about in one of your articles, things that you measure. But you mentioned the first one, network dependency, several times, and then Doug mentioned AS Hegemony, that's a difficult one, a couple of times. Are those the same thing?

Romain Fontugne: Those are the same thing, yeah.

Philip Gervasi: Okay. All right. I wasn't sure because I saw how you were using them in your writing and I'm like, " I don't get it." So can you explain that a little bit? What is network dependency?

Romain Fontugne: Yeah, sure. Well, first, they are the same thing. When we wrote the research paper, we thought, "Oh, we need a fancy name." I don't remember why. And we called this AS Hegemony. Then we put it on our website, because in the Internet Health Report we have a main website where we show results. Well, if you're not techie, then you didn't really get it. I mean, if you didn't read the research paper, you couldn't really get what AS Hegemony is. So then we called it network dependency, which is a bit more intuitive. And this is looking at BGP data. We are looking at all the paths we see in BGP data, and we are going to find what are the main dependencies from one network to another. So the example I always give, and maybe if you listened to that podcast, I probably gave it there, is the University of Tokyo. So the University of Tokyo has its own AS, which is connected to the educational network here in Japan, which is called SINET. And SINET's main upstream provider is IIJ. And if you look at the results on the Internet Health Report, you're going to see that we measure that the University of Tokyo depends 100% on SINET and almost 100% on IIJ, even though the University of Tokyo is not directly connected to IIJ. This is a transitive property. Because SINET depends on IIJ and the University of Tokyo depends on SINET, we can see that. And that's information that a network operator could use for a new deployment, for example, if they want to diversify their connectivity. Well, it's a bit sad to say, but connecting to IIJ won't reduce their dependency on us, IIJ, so they could try it with another provider.
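
To make the transitive dependency idea concrete, here is a minimal Python sketch. It is not the AS Hegemony computation from the research paper; it assumes a much simpler score, namely the fraction of observed AS paths toward an origin that traverse each intermediate network. The labels (`UNIV`, `NREN`, `ISP_X`, `VP1` and so on) and the paths themselves are invented for illustration and are not measurements.

```python
from collections import Counter


def dependency_scores(as_paths, origin):
    """Fraction of observed paths to `origin` that traverse each transit AS.

    Each path is a list ordered from the vantage point (first element)
    to the origin (last element). This is a simplified stand-in for a
    dependency metric, not the published AS Hegemony method.
    """
    paths = [p for p in as_paths if p and p[-1] == origin]
    if not paths:
        return {}
    counts = Counter()
    for path in paths:
        # Skip the vantage point and the origin; count each AS once per path.
        for asn in set(path[1:-1]):
            counts[asn] += 1
    return {asn: n / len(paths) for asn, n in counts.items()}


# Invented paths: a university behind a research network (NREN) whose
# main upstream is ISP_X. The university never peers with ISP_X directly.
paths = [
    ["VP1", "ISP_X", "NREN", "UNIV"],
    ["VP2", "TransitA", "ISP_X", "NREN", "UNIV"],
    ["VP3", "TransitB", "ISP_X", "NREN", "UNIV"],
    ["VP4", "TransitA", "NREN", "UNIV"],
]
scores = dependency_scores(paths, "UNIV")
for asn, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{asn}: {score:.2f}")
```

In this toy input, `NREN` appears on every path and `ISP_X` on three of four, so the university shows a strong dependency on a network it is not adjacent to, which is the effect described above.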

Doug Madory: I think actually the insight that's useful, or at least that I find with this, is sometimes you can see that there's a [inaudible] that is singly homed behind somebody. Well, clearly they're dependent. It doesn't take a sophisticated service to figure that out. But sometimes that dependency can be a couple of hops away and still exist, and that becomes harder to figure out at a glance. And that's something that's getting bubbled up in this service. So then you pick out these dependencies that are not immediately adjacent, and then there's insight that you probably wouldn't otherwise have.

Romain Fontugne: One example of this would be a network in Iran. Even though it might be connected to a lot of other networks, to go out from the country, you're going to see that they have a single AS to go outside of the country. So this is going to show up as a dependency also, even though they're not connected to that. So it might not be obvious if you don't have these tools. Yeah, this is really useful. And I think the good thing with that tool is... Well, there are other tools that look at BGP, I'm thinking of IODA at Georgia Tech. This is a great tool, but what they're looking at is mainly whether the prefixes are on or off, is the prefix reachable or not? If it becomes unreachable then, for them, that's a signal. They're going to just report that. And we have, I think, an extra piece of information, which is how the paths are changing. The prefix might still be reachable, but we see there is a lot of rerouting that could be due to a BGP leak or some hijack. There could be a lot of different reasons. Or an outage actually, there's a big outage and you see all these networks that try to reroute around it. We can see that. And since we've put that data out there, there are a lot of other research groups that picked this up. So there's a group at MIT that made a BGP leak detector out of this. We had some research on classifying BGP hijacks using this. The Internet Society has a platform that they use to measure resiliency. They have an internet resilience index. It takes a lot of different data sets into account, but AS Hegemony, or network dependency, however you call it, is taken into account there also. It's a very basic metric, but we found it very useful there.
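
The difference between an on/off reachability signal and a path-change signal can be illustrated with a short Python sketch. It assumes two hypothetical snapshots that map each prefix to the AS path seen from one vantage point; the function name, the snapshot format, and the values (documentation prefixes and documentation AS numbers) are all assumptions made for this example, not the format of any real tool.

```python
def compare_snapshots(before, after):
    """Classify each prefix in `before` by what happened to it in `after`."""
    report = {"withdrawn": [], "rerouted": [], "unchanged": []}
    for prefix, old_path in before.items():
        new_path = after.get(prefix)
        if new_path is None:
            # A pure reachability monitor would flag only this case.
            report["withdrawn"].append(prefix)
        elif new_path != old_path:
            # Still reachable, but through a different sequence of ASes.
            report["rerouted"].append(prefix)
        else:
            report["unchanged"].append(prefix)
    return report


# Hypothetical snapshots taken a few minutes apart.
before = {
    "192.0.2.0/24": (64496, 64500, 64510),
    "198.51.100.0/24": (64496, 64501, 64511),
    "203.0.113.0/24": (64497, 64502, 64509),
}
after = {
    "192.0.2.0/24": (64496, 64500, 64510),
    "198.51.100.0/24": (64496, 64503, 64511),
}
report = compare_snapshots(before, after)
print(report)
```

A monitor that only tracks reachability would report the third prefix and miss the second one, whose path changed while it stayed announced.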

Philip Gervasi: And it is something that you see on the enterprise side as well. Let's say you have four or five data centers and you're configuring and designing your data center interconnects, you're designing your multi-homed environments, your large campuses. I've done work with large universities, with global pharmaceutical companies, and we look at who's our last mile provider, who are we peering with? What is the upstream provider from there? Because we are mapping that out so you don't have your data centers go offline because of an upstream provider. So it's not the same as the global scale you're talking about. But I think the idea is the same, that you need some sort of visibility into what's going on on a broader scale outside my little sphere. So I can prevent being down, I can avoid outages at least, not prevent them necessarily. So what else are you measuring? We've discussed network dependency which, to me honestly, coming from a networking background, is very logical. That's something that I understand right away. But what else are you measuring? Are you looking at any sort of performance metrics?

Romain Fontugne: Yes, we have those performance metrics. So we are also looking at, as I mentioned before, traceroutes. We are taking traceroute data and looking at the latency inside a traceroute. So RIPE Atlas has 10,000 to 12,000 monitors deployed on the internet. They're called Atlas probes. Those probes are doing traceroutes to a lot of different destinations on the internet. RIPE collects this data and we analyze it. And yeah, I didn't mean to say earlier that RIPE is not doing great, they're doing a lot of work. It's a lot of work just to collect all this data and archive all this data. We just build on top of that. So we take a lot of these traceroutes and keep track of the delay from point A to B, and A and B could be two ASes in the network. When we can, we try to map those to cities, or if we see IXPs on the path, we're going to try to monitor this, and we put all these results in a database so we can query the database and ask, "Okay, what was the delay from IIJ to the Amsterdam internet exchange?" And we have this time series that can show us this.
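
As a rough illustration of turning traceroutes into a delay time series between two points, here is a Python sketch. It assumes a simplified, hypothetical record format (a timestamp plus a list of hops already mapped to a node label with an RTT); this is not the RIPE Atlas result format, and the function name, field names, and numbers are invented. The delay between A and B is estimated as the RTT difference between the two hops in the same traceroute, and the median per time bin is used because individual samples are noisy.

```python
from collections import defaultdict
from statistics import median


def link_delay_series(traceroutes, a, b, bin_seconds=3600):
    """Median RTT difference (b minus a) per time bin, keyed by bin start."""
    bins = defaultdict(list)
    for tr in traceroutes:
        rtts = {hop["node"]: hop["rtt_ms"] for hop in tr["hops"]}
        # Only traceroutes that cross both points contribute a sample.
        if a in rtts and b in rtts:
            bin_start = tr["timestamp"] // bin_seconds * bin_seconds
            bins[bin_start].append(rtts[b] - rtts[a])
    return {t: median(samples) for t, samples in sorted(bins.items())}


# Invented sample data: node labels stand for ASes, cities, or IXPs.
traceroutes = [
    {"timestamp": 0, "hops": [{"node": "A", "rtt_ms": 10.0}, {"node": "B", "rtt_ms": 25.0}]},
    {"timestamp": 600, "hops": [{"node": "A", "rtt_ms": 12.0}, {"node": "B", "rtt_ms": 24.0}]},
    {"timestamp": 3700, "hops": [{"node": "A", "rtt_ms": 11.0}, {"node": "B", "rtt_ms": 41.0}]},
    {"timestamp": 3800, "hops": [{"node": "A", "rtt_ms": 11.0}, {"node": "C", "rtt_ms": 30.0}]},
]
series = link_delay_series(traceroutes, "A", "B")
print(series)
```

The resulting dictionary is the kind of per-hour series that could be stored in a database and queried for a given pair of points.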

Doug Madory: I have a question. So with the RIPE Atlas probes, someone has to have set up a test. You're not guaranteed that any probe has measured from anywhere at any time unless someone has set something up. Are you able to just query the whole corpus and say, "Hey, did anybody happen to have a measurement between here and here?"

Romain Fontugne: The way we work, again, because this data is very noisy, is we try to get a stable signal out of that. You're right, there are two types of measurement Atlas probes are doing. One is called built-in. So once you plug the probe into the network, it's going to just start doing traceroutes to the DNS root servers, to some servers managed by RIPE. There are traceroutes to the Google Public DNS resolver.

Doug Madory: They love that.

Romain Fontugne: Yeah, so these built-in ones are easier for us to analyze because that's a very stable signal. We have a traceroute, I think it's every 15 minutes, from one of these probes to any of the DNS roots inaudible. And then there is what I think RIPE calls super users. So there are some users that have special rights and can run measurements on all the probes. So those are usually the RIPE NCC staff.

Doug Madory: Emile.

Romain Fontugne: Yeah, like Emile. And they can run very big, wide measurements that are going to last forever, or until the user stops them. That's also a good signal for us. And then the last one is the usual user that is just going to run a measurement for a day, a week, just from a few sources to a few destinations. And for us, this is very, very hard to use because suddenly you're like, "Oh, I have a thousand probes to a server in IIJ." It's like, "Yeah, what do I compare this with? Is it normal? Is it not normal? I don't know." And then it disappears. So we have a way to filter. We have what I call the long measurements, Atlas long measurements. I have a list of those measurements and I take only this data. It's still a lot of data.
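
The filtering step described here can be sketched in a few lines of Python. The record fields (`id`, `start`, `stop`, `probes`) and the thresholds of 30 days and 100 probes are hypothetical, chosen only to illustrate the idea of keeping long-lived, widely sourced measurements and discarding short bursts; the transcript does not say what criteria the actual list uses.

```python
DAY = 86400  # seconds

def is_long_running(meas, now, min_days=30, min_probes=100):
    """True if a measurement has run long enough, from enough probes, to give a stable baseline."""
    # A measurement with no stop time is still running, so measure its age up to now.
    end = meas["stop"] if meas["stop"] is not None else now
    return (end - meas["start"]) >= min_days * DAY and meas["probes"] >= min_probes

# Hypothetical measurement metadata; times are seconds since an arbitrary origin.
measurements = [
    {"id": 1, "start": 0, "stop": None, "probes": 9000},               # built-in style, never stops
    {"id": 2, "start": 0, "stop": 400 * DAY, "probes": 5000},          # wide, long campaign
    {"id": 3, "start": 390 * DAY, "stop": 391 * DAY, "probes": 1000},  # one-day burst
    {"id": 4, "start": 0, "stop": None, "probes": 5},                  # long, but too few sources
]
now = 400 * DAY
long_ids = [m["id"] for m in measurements if is_long_running(m, now)]
print(long_ids)
```

The one-day burst from a thousand probes is exactly the case described above: there is plenty of data, but nothing earlier to compare it with, so it is dropped.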

Philip Gervasi: So you're looking at passive information, meaning you are doing classic observability where you're looking at what actual traffic is doing, what the system, in this case the entire internet, is doing, the health of the system, the status of the system, based on what's actually happening. And then you record that over time. So both real time and historically. But you're also using this artificial traffic, these probes, in order to generate whatever kind of testing information you want and you dump that on the network. So you're using both active and passive forms. But it does sound like you're focused very much on network-centric latency, even though there are probes that you can configure to do whatever you want and send out false GET requests and things like that. It's network-centric, is that right?

Romain Fontugne: Yeah, that's completely right. The advantage of using traceroutes as opposed to BGP is, for example, in the case of a submarine cable cut. We have this example where we show there was a submarine cable that was cut between Singapore and Australia, and we can see that the latency is going up because the traffic is rerouted through a different path and going through a different cable. We can see this in traceroute, it's very clear, but in BGP it might not appear, because if that rerouting happens inside one AS then the routing is exactly the same for BGP. So those are complementary to look at.

Doug Madory: Yeah, there's no latency in BGP. And then you get, we used to call it city provider pairs, of just like what's the path. In BGP, an AS is an abstraction of a company, and if you want any more detail, you're not going to get it out of BGP. And so things like traceroute help illuminate what's the precise city provider pair path of going from one place to another, and then what's the observed round trip time for latency.

Philip Gervasi: Yeah, I mean in effect, you're measuring other routing, not necessarily BGP, because BGP isn't that reachability matrix. It's going to be the underlying, probably MPLS, whatever they're doing. IS-IS, and a lot of providers are going to use some other mechanism to actually make forwarding decisions under the hood. So you're measuring other routing protocols in that sense. But that's beyond dependency and delay. You're talking about actual activity on a link, making decisions on forwarding and then why did we go, or not necessarily why, but we did go this way and here's what the ramifications are on the entire internet. So one of the things I read about was link monitoring. Is that what that means or did I just define it completely incorrectly?

Romain Fontugne: It is, yeah. So this link monitoring we are doing is exactly that, looking at the internal routing. This code is actually not currently running. We are in the process of rewriting most of this code. One thing I've learned from the Internet Health Report is that writing a research paper and having a piece of code for a research paper is very different from having a service you run in real time, all the time, that has to ingest billions of traceroutes. It's very different. And so we are in the process of making that code run again. The code was not very clean. I wrote it, so I can say that. But that's what we're doing. This link monitoring is to look at rerouting inside an AS and also the congestion we see inside an AS. We had some good examples of that. When there was a game update from Steam, some IXPs reported a peak, more traffic at IXPs, and we could see congestion in some of the tier ones that are actually also providing a lot of this. Yeah, this is really tricky to see from traceroute, you have to look at the right place. So that's the whole point of the Internet Health Report, showing those places where it happened, because there are millions of IPs to monitor. So our system is going to just try to automatically find, "Look, the delay between those two IPs used to be very stable, but today it's twice as high." That's the link monitoring part.
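
The rule of thumb quoted at the end of this answer ("used to be very stable, but today it's twice as high") can be written down directly. The sketch below is a simplified stand-in, not the detection method the Internet Health Report uses: the function name, the median absolute deviation stability check, and the `ratio` and `max_spread` thresholds are all assumptions.

```python
from statistics import median

def delay_alarm(history, today, ratio=2.0, max_spread=0.2):
    """Flag a link whose delay was stable in the past and is now much higher.

    history: past differential RTTs (ms) for one pair of IPs
    today:   today's differential RTTs (ms) for the same pair
    """
    base = median(history)
    if base <= 0:
        return False  # no meaningful baseline to compare against
    # Median absolute deviation relative to the baseline, used as a stability check.
    spread = median(abs(x - base) for x in history) / base
    current = median(today)
    return spread <= max_spread and current >= ratio * base

stable_history = [10.0, 10.5, 9.5, 10.0, 10.2]
noisy_history = [5.0, 30.0, 10.0, 50.0, 8.0]

print(delay_alarm(stable_history, [21.0, 22.0, 20.5]))  # stable before, doubled today
print(delay_alarm(stable_history, [10.0, 11.0]))        # stable before, unchanged today
print(delay_alarm(noisy_history, [21.0, 22.0, 20.5]))   # never stable, so not reported
```

Requiring a stable history before raising an alarm is what keeps millions of monitored IP pairs manageable: a link that is always noisy never produces a report in this sketch.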

Philip Gervasi: So then what tools, mechanisms, workflows, whatever are you using in order to ingest, analyze, and then find insight in the data? You mentioned Geoff Huston a couple of times. I'm familiar with what he does. You talked about, "Oh, then this network appears on the graph." So I assume that you're using some sort of graph methodology and looking at nodes on a graph and interdependencies. That seems to make sense to me, right?

Romain Fontugne: Yeah. So we have some mathematical modeling of the data. So I talked about the data we ingest. I'm a big fan of Kafka. We have this Kafka cluster where we can put all this data, it's just streamed there, and we have a lot of scripts that just plug into this Kafka cluster, read that data, do one very simple task, and return a result. I would say we are pretty good at doing those scripts that analyze the data and then, once we have results, we put them in the database and we try to show them on our website, but we are not as good at the front end design. So that's one thing we are trying to improve this year, trying to make it a bit more intuitive, a bit more accessible for a global audience, because we have a lot of people now that talk to us and say, "Oh, can you see this in this country? Can you see that?" Yeah, we have the network dependency, we do that dependency per network, but we also do it per country. And people are really interested in that, like seeing, "Oh, that country..." We had a good example recently with Italy. We measured that Italy relies too much on Telecom Italia. And recently, last February, there was an outage of Telecom Italia, the big transit, the big tier one network. And that completely disrupted the internet in Italy. So we could monitor that in advance. We were like, "Yeah, there's quite a few countries in that case."
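
A rough way to picture per-country dependency is to count how often each transit network appears on the AS paths toward networks in that country. The Python sketch below does only that, and it is much simpler than the AS hegemony metric mentioned at the start of the episode: the function name, the sample paths, and the decision to ignore the first and last AS on each path are assumptions, and all AS numbers come from the documentation range.

```python
from collections import Counter

def dependency_share(paths):
    """Fraction of AS paths that traverse each transit AS.

    Each path is a list of AS numbers from an observer to an origin network.
    The observer (first) and origin (last) are not counted as transit.
    """
    if not paths:
        return {}
    counts = Counter()
    for path in paths:
        # Count each transit AS at most once per path.
        counts.update(set(path[1:-1]))
    return {asn: n / len(paths) for asn, n in counts.items()}

# Hypothetical AS paths toward four networks located in the same country.
paths = [
    [64510, 64500, 64496],
    [64511, 64500, 64497],
    [64510, 64501, 64500, 64498],
    [64511, 64502, 64499],
]
shares = dependency_share(paths)
most_depended_on = max(shares, key=shares.get)
print(shares, most_depended_on)
```

In this toy data, three of the four paths cross AS 64500, which is the kind of concentration the Italy example describes: an outage of that one network would affect most of the country's reachability.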

Philip Gervasi: So if you're operating at that level, and you actually used the term geopolitical earlier, are you involved with the analysis of any kind of geopolitical events as they occur on the internet? Obviously from an analysis perspective, because you see it, but as far as being involved in the sense of presenting an analysis of what's going on, maybe working with organizations to figure out what's going on when there's a government making explicit decisions to withdraw prefixes, or whatever else is going on.

Romain Fontugne: Well, I think this is just starting now. Because now we've built our tool. We have shown that it makes sense, that it's useful. And now we're starting to discuss with, I mentioned the Internet Society before, which is doing a lot of this work too. I'm discussing with them. I was a MANRS ambassador last year, so this is more on the routing security side. But now we also discuss internet resilience and how we can improve this and measure this better. We discussed also with Google, so trying to see. BGP is tricky because it gives a very nice view of the whole internet, but it's only the paths that are active. If one of the links goes down, you don't really know if there's a backup or not. This is kind of a research project to do.

Doug Madory: We've also looked at the war in Ukraine, which is a topic in internet measurement as well as this continues. And then each of the different parties, us with Kentik, and Romain with his inaudible tools and others, I think we all try to look at the developments there. And then there's also a lot of back channel discussion among the folks that do this to try to support each other and make sure we're reporting something accurate and useful. But I think that'll be a topic that we're all focused on for a while.

Romain Fontugne: And inaudible, my colleague, is doing a lot of this work also, trying to look into the results and see. So he did some presentations about Ukraine. And one thing I want to mention also, it's really nice that we are different groups around the world. We have our own tools and we can crosscheck our results. There are always a lot of limitations in those tools, in those inaudible assets. So it's nice that I can see what Doug is seeing. I can check in my results and that gives us good confidence in what we're seeing.

Philip Gervasi: Yeah, that's very interesting. So then what is the future of the Internet Health Report and your work and what you're focused on?

Romain Fontugne: Well, there's a lot of things coming in.

Philip Gervasi: Broad question, true.

Romain Fontugne: Yeah. The first thing I'd like to do is to make it a bit more usable and a bit more reliable. So as I said at the beginning, it's more like a proof of concept. I see that there are quite a few people that are now using it, so we'd like to make this a bit more usable. The other big project that is coming is called the Internet Yellow Pages. It's a knowledge graph, or knowledge database, of internet resources. I'm very, very excited about this. It's a big database where we put everything we know about IP addresses, prefixes, inaudible, we put all of that, and then we can query the database and ask, "Yeah, please tell me what are the most popular websites? Who is hosting them, with which prefixes, and which ones are in RPKI or not, for example." So you can see which ones follow best practices. And we can ask a lot of very involved questions. We have a database that is now working. We are going to integrate this with the Internet Health Report too. So that'll give us a lot of insight also about specific resources. So if you want to look at a specific IP or specific domain name, specific prefix, then we just give like, "Oh, that's all the information we know about it."
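
The example question in this answer (popular websites, their hosting prefixes, and whether those prefixes are in RPKI) is a join across several kinds of facts. The Python sketch below is a toy, in-memory stand-in for that kind of query: it is not the Internet Yellow Pages schema or query interface, and the relation names, the sample data, and the function names are all hypothetical.

```python
# Toy knowledge graph as (subject, relation, object) triples; all values are made up.
TRIPLES = [
    ("example.com", "RANKED", 1),
    ("example.net", "RANKED", 2),
    ("example.com", "HOSTED_IN", "192.0.2.0/24"),
    ("example.net", "HOSTED_IN", "198.51.100.0/24"),
    ("192.0.2.0/24", "ORIGINATED_BY", 64500),
    ("198.51.100.0/24", "ORIGINATED_BY", 64501),
    ("192.0.2.0/24", "RPKI_STATUS", "valid"),
    ("198.51.100.0/24", "RPKI_STATUS", "not-found"),
]

def objects(subject, relation):
    """All objects linked to a subject by the given relation."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]

def popular_sites_report(top_n):
    """For the top_n ranked domains, list hosting prefix, origin AS, and RPKI status."""
    ranked = sorted((rank, domain) for domain, r, rank in TRIPLES if r == "RANKED")
    report = []
    for _rank, domain in ranked[:top_n]:
        for prefix in objects(domain, "HOSTED_IN"):
            report.append({
                "domain": domain,
                "prefix": prefix,
                "origin_asn": objects(prefix, "ORIGINATED_BY"),
                "rpki": objects(prefix, "RPKI_STATUS"),
            })
    return report

report = popular_sites_report(2)
for row in report:
    print(row)
```

The point of the sketch is only that once domains, prefixes, origin networks, and RPKI status live in one graph, a question spanning all four becomes a short traversal rather than a manual merge of separate datasets.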

Philip Gervasi: Yeah. That's amazing. It's so interesting. The more data that you have, the more ability you have to find answers to sometimes very abstract questions, like you said. And yeah, we're just scratching the surface. I'm looking forward to it as well. It's really interesting. And here we are, oceans apart, coming together through the commonality of BGP. BGP is what brought us together across vast distances and backgrounds.

Doug Madory: Making this call possible.

Philip Gervasi: Yeah, making this call possible as well. So Romain, thank you kindly for joining today. This has been very interesting, it really has, and I will put links to a lot of the resources that you mentioned in our show notes for folks to look at. But as we close out, if anyone has a question or comment, which I guess somebody will, how can they reach out to you online?

Romain Fontugne: Well, they can, I'm on Twitter, or they can send me an email at Romain @

Philip Gervasi: Great. And Doug, good to see you again. How can folks reach out to you online?

Doug Madory: I'm on Twitter and LinkedIn. I haven't adopted any new social media just yet, waiting to see how that shakes out. But Twitter and LinkedIn are probably the ways to reach me.

Philip Gervasi: You got it. Okay. And you can still find me on Twitter at network_phil. You can search my name in LinkedIn, find my blog networkphil.com. Now, if you are interested in joining the podcast as a guest or if you have an idea for a show, I'd love to hear from you. So reach out to us at So until next time, thanks for listening. Bye-bye.

About Telemetry Now

Do you dread forgetting to use the “add” command on a trunk port? Do you grit your teeth when the coffee maker isn't working, and everyone says, “It’s the network’s fault?” Do you like to blame DNS for everything because you know deep down, in the bottom of your heart, it probably is DNS?

Well, you're in the right place! Telemetry Now is the podcast for you!

Tune in and let the packets wash over you as host Phil Gervasi and his expert guests talk networking, network engineering and related careers, emerging technologies, and more.
