Kentik - Network Observability
Telemetry Now  |  Season 1 - Episode 20  |  August 8, 2023

Understanding the health of the internet with Romain Fontugne

In this episode, Dr. Romain Fontugne, a subject matter expert in internet measurement and understanding the internet as a whole, discusses his experience and work monitoring global routing, why that's critical to do today, the technology involved, and even some of the geo-political ramifications of understanding the internet as a dynamic, interdependent system.

Transcript

We've gotten pretty good at monitoring local networks using a variety of technologies, and we've been getting better and better at monitoring public cloud, SaaS providers, the latest network overlays, and so on. But how do we monitor the health and activity of the whole system, the entire internet?

How do we monitor performance on a global scale and how can we identify dependencies among providers? Something that's very important. And ultimately when it comes down to it, how can that information actually help us?

Today with us, we have Dr. Romain Fontugne, a subject matter expert in internet measurement and the creator of the Internet Health Report. We'll be discussing why and how we monitor routing, the technology, and the impetus to do so. We'll be talking about something called AS hegemony, and we'll be unpacking the Internet Health Report as well. Doug Madory, Kentik's resident director of internet analysis, who also specializes in the field of internet measurement, is also with us. So we really have a great episode lined up for you today.

And my name is Philip Gervasi, and this is Telemetry Now.

Romain, it's really great to have you today. Thank you so much for joining. I'm really interested in this. I listened to a podcast that you did recently. I think it was with APNIC. Correct?

Very good. And I read some of your literature online, so I can't wait to dig in. And Doug, of course, it's great to speak with you again. You're both in Tokyo, Japan right now. Correct?

Yes. We are. Yeah.

Yeah. This is on location.

On location. I'm on location still in upstate New York. So a little bit of a different scenery out my window.

That's not exactly true. I don't have a window in my office.

But outside it is evening, and it is morning for you. So I appreciate very much that you're joining me from many, many time zones away. So before we get started and get into today's show, Romain, would you give us a little bit of an introduction to yourself, a little background, maybe from a professional standpoint, but also your history and how you ended up in Japan?

Yeah. Sure.

Well, I first came to Japan for a six-month internship, and this was fifteen years ago.

I'm not an intern anymore. Since then, I did my PhD in Japan, analyzing internet traffic. Then I did a postdoc at a university, and now I'm at IIJ, in the research lab.

And, yeah, that was a long trip, but very interesting.

I started with analysis of internet traffic. There is an interesting project here in Japan where they record traffic on one of the academic networks.

I was looking at this during my PhD, finding worms and viruses in that traffic. That was very interesting.

And now I've moved a bit more toward looking at the topology of the internet.

Okay. The topology of the global internet, to be more precise, of external networking and what's going on there. I assume then that means we're gonna be talking a lot about BGP and autonomous systems and that sort of thing.

Oh, yes. We love BGP.

Oh, yeah. My favorite stuff.

And Doug, no stranger to our audience, of course. If you wouldn't mind, give us a little introduction to yourself and how you came to know Romain, maybe.

Right now, I am on location in Tokyo at the IIJ lab, the research group. I'm here for a couple of weeks. I'm something between a visiting scholar and an intern, although I've just been cleaning whiteboards and fetching coffee for a month so far. So it'll get good, I think, any minute now.

Let's see. I think I met Romain's colleague Mark at a conference maybe five years ago, and he was like, you gotta come to Tokyo. If you're ever in Tokyo, you've gotta come to IIJ. And I was like, sounds great. I'd love to go.

I had a Japanese exchange student when I was a kid, so I always had an interest in Japan. But despite all my traveling around the world, both in the internet measurement business and in the US military, I never made it to Japan. So this is my very first time here. It's kind of been a little bit of a dream to come here.

And then I think Romain and I were at an event in Paris at the end of last year, and we were at the speakers' dinner coming out of that. And I made an offhand comment. I was like, you know, I harbor this fantasy of coming to IIJ one day. And he said, you should come.

Anyway, we made it work. Of course, I'm here in July, and it's pretty warm in Japan in the summer.

It's even warmer than upstate New York. So, yeah, I'm here talking to the researchers, and then there's a handful of graduate students that are doing work here.

Yes, we're having some good hallway and cubicle conversations about the research they're doing.

I bet. That sounds great. And as the resident director of internet analysis for Kentik, this is right up your alley. So I'm really glad that all three of us are here talking about this. But you mentioned IIJ several times. Romain, what is IIJ?

So IIJ is a Japanese internet service provider. In fact, it was historically the first commercial ISP in Japan.

So it's a pretty old company in terms of internet history. We celebrated the thirtieth anniversary last year.

It has a long history. It's very important, I think, in Asia. One thing that people don't usually know is that it was one of the first offices for APNIC.

So when APNIC started, RIPE was already running, and in Asia they were thinking, okay, we have to try that. There is a professor here called Jun Murai. He's known as the father of the internet in Japan, or the internet samurai.

And they decided to do that pilot project, APNIC, trying to replicate what RIPE was doing in Europe and what was happening in other parts of the world.

At that time, I think, IIJ had applied for its license to be an ISP.

There was some delay.

And at the same time, they hired David Conrad from the US.

He was supposed to help with the work, but he was waiting for that license too. So Jun said to him, okay, you could start up that pilot project.

So there were, of course, a lot of people involved. It's not just IIJ; it was JPNIC and IIJ.

But he was doing most of the work.

And, yeah, so in the IIJ office, we were hosting the very first years of APNIC.

So also, for the internet in Japan, IIJ was the first ISP. Is that correct?

Yeah. Yeah. That's correct. Yeah.

So at the very beginning, there was IIJ, and then NTT, which used to be the incumbent ISP here. NTT was operating only within Japan, anywhere in the country, and KDDI was doing the international connectivity.

So IIJ was kind of the free electron that was doing both of them, domestic and international.

Very good. And your role with IIJ is primarily in research. Is that right?

Yes. So now I'm a deputy director at IIJ Research.

My main job is to do research. What is interesting at IIJ Research is, I think, we have a foot in both academia and industry.

Yeah.

So we are very present at conferences. We are part of a lot of the technical program committees for conferences.

We publish a few papers also.

And we try to work with IIJ and do research that is very applicable.

We have about twenty people in the lab. We have summer internships. So Doug is not here for that, but he can see.

Kind of.

He's sitting with our interns now.

We have a postdoc program. We are very open to academia.

We have three or four different teams focusing on different topics. We have one that does more systems work, and they are now trying to push code to the Linux kernel, for example. We have a group that works more on IXPs, the internet exchange points.

My group is working on monitoring the internet. And we have another part on what they call cloud morphing, so it's working on the cloud too. Yeah, I haven't mentioned that, but IIJ is an ISP.

But, you know, when you have such a large network, then you can provide a lot of different services. So IIJ is providing internet connectivity, but it also has a cloud division and a security division.

We are a mobile operator also. So we have a lot of different aspects.

Yeah. It's a big company. We're in the IIJ tower, or, you know, they've got a big building that we're presently in.

Okay. And then the Internet Health Report: I read all about that over the past few days, and then a lot today. Is that a work of IIJ, or is that something that you do separately?

Yes. So I'm leading that project, I would say mainly with my colleague Emile Aben at the RIPE NCC.

Okay. I see.

It's a research project sponsored by IIJ.

Well, I should say upfront, it's not a service that IIJ is providing. It's really like a proof of concept, and this is coming from research, so it's very research-y.

And the goal of the Internet Health Report is to provide an observatory for the internet. So we try to have an understanding of how the whole internet works and try to monitor this. We are designing tools so we can document the evolution of the internet and document some of the rapid events that may happen on the internet. One of the singularities of this project is that we are using only open data.

So there's a very big internet measurement community. There's a lot of data that is available, like BGP data and traceroutes from RIPE Atlas. CAIDA in the US is sharing a lot of data too.

And we are trying to get as much as we can from this data. There's a lot of data sitting out there, and we are designing tools so we can get as much insight as we can from these datasets.

But insight about what? I mean, you mentioned a couple of things specifically. When you say internet measurement, it's getting an idea of the topology of the internet and what's going on as far as changes, the evolution. So I assume that means in a shorter term, like what's happening over the course of days, weeks, months, and years. Why, first of all? And is that correct? Is that what you're looking at?

Yeah. It's correct. It's what we are looking at.

Well, one thing I give for the motivation is: let's step back and think a bit about what the internet is. You know, the US government defined the internet as a critical infrastructure.

And when you look at other critical infrastructures, you have the power grid, the water system, the hospitals, nuclear power plants, transport, like air traffic. And one thing all these critical infrastructures have in common is that there is always a system to monitor them, usually in real time. You can think of those websites for air traffic where you see all those planes flying over.

This is great. Right? You can see the entire commercial air traffic going on.

And as computer scientists working on the internet, we felt like, okay, it would be cool to have this for the internet.

It's very important. The internet is now, you know, the base for a lot of services.

So we need to monitor it. We want to understand its strengths and weaknesses. If there is a problem in the network, we want to monitor this, if possible in near real time, and we want to spot anywhere there is a problem for resiliency, or any other problem that can happen there.

So it's not necessarily a purely security-focused initiative or a purely performance-focused initiative. It's really the state of the internet as it stands today, measuring it the way that you do, using the data that you do. So it's a little bit more of a holistic approach, it sounds like.

Yes. Yeah. Exactly. And a lot of ISPs, and we have that also at IIJ, have their own system. They can monitor their own network.

See how the traffic goes. Well, that's one thing that Kentik does. Like, you can see your network and see your traffic. But the approach we take is a bit different, because the internet is, as we usually say, a network of networks.

Sure. Yeah.

And what it means is that all the networks that are connected to the internet usually have to rely on other networks to have global connectivity. So if I'm at IIJ and I send a message to someone in the US at, I don't know, Comcast, and IIJ and Comcast are not directly connected, my message might go through another provider.

And that means our connectivity depends on that third-party provider. So we have one project in the Internet Health Report where we try to measure those dependencies.

And this is, I think, a very, important aspect that goes across ISPs.

Then you you need that holistic view for that.

Right. Right. And the internet being a network of networks, we are talking about external connectivity and the internet backbone and intermediate connectivity and transit providers and that sort of thing. So ultimately, you know, even a very large global enterprise is sitting behind all of that, kind of as end nodes on this intermediate and external system. You know, I already know the answer to this because I read it in something you wrote, but I'm gonna ask anyway.

What is the motivation specifically behind the Internet Health Report? Is there something lacking, or is there something deficient with the data that's out there? I'm familiar with what APNIC does, and RIPE, and organizations like that. What gap are you filling in?

The gap we are filling in... okay, let's take RIPE as an example. RIPE is great because they provide a lot of data.

They have a project called RIS where they provide tons of BGP data, and another project called Atlas where they provide tons of traceroute data.

And for us, this is gold. You know, data is gold. There's a lot of information in the data.

But they don't provide a lot of analysis on top of that. So that's what we are trying to do. You know, sometimes I see myself as a data analyst. So I just take that data, try to squeeze it as much as I can, and get some new insight out of it.

The difficulty for us is that it's a lot of data.

So this is just a technical challenge: how do you analyze that much data, especially if you're going to do it in near real time?

And it's also very noisy data. For anyone who's used to working with traceroute, latency, even BGP, most of the data you receive is the part that is not really interesting.

So there's a lot of filtering to do, and there is a lot of expert knowledge also that is required. You have to know exactly how those tools work to understand what is noise and what is a strong signal in all of this.

Alright. Are you able to discern that programmatically, or is that a largely manual approach with a team of data scientists?

Well, that's where I think Doug shines, because Doug is really good at that.

And our approach is to try to automate that. So we are trying to make a bot version of Doug, if you want.

Good luck. Good luck with that.

Yeah, I might interject too, just to add: I think people who aren't practitioners in this space may not appreciate that it's a big enough space, and there are a lot of questions. Romain was talking about the data, but different groups take different approaches, and there's some uniqueness to the approach that the Internet Health Report takes that makes it different. And every time you take a different angle, you have the potential for discovering something that couldn't have been discovered through an existing approach. So they've got, you know, the AS hegemony, if you wanna talk about that.

There's a couple of things that are very unique here, that are good at answering certain types of questions that we don't have another tool for. And there's room for other aspiring internet measurement practitioners out there, for other approaches. There's probably ten more that someone could be inventing. There are just a lot of questions to be answered.

Hence, your summer internship. Right, Doug?

That's right. Yes.

Yeah. And one thing I wanna stress on that is that the Internet Health Report is completely open source. So our code is all on GitHub.

We're welcoming anyone, just like RIPE provides data to anyone.

We now have this platform that ingests all those big datasets. And if someone wants to write a tool on this and make it run on our platform, that's possible.

We are part of the Google Summer of Code. We get some students also working on the project like that. We have interns. We are very, very open.

So it's not a matter of a deficiency in one of these other organizations that we've been discussing. It's more a matter of using their data as the foundation for your data analysis to find, like you said, the insight, what's really going on. Which does beg the question: what is going on? I thought that the internet never converged. We're talking about ephemeral networks that pop up on the internet and then disappear, and routes and prefixes that are pulled and pushed. Is it very difficult, then, when you're talking about a dynamic and not a static dataset?

It's all in the cloud.

It's all.

Okay.

Yeah. I just saw a cloud on the whiteboard. That's it.

There we go.

It's hard. Yeah. It's hard to analyze this. So again, to go back to the parallel I drew with the critical infrastructure.

One big difference with air traffic or the power grid is that the internet has two components. There's the physical component. We can see that: we know where the submarine cables are, and some operators are going to show some of their fiber network. There is a physical infrastructure, but the IP infrastructure is on top of that, and it's more like a cyberspace where it's very hard to go from one to the other.

The physical is more or less static. It's developing, but slowly.

In IP, you can have large reroutes. So you can see a lot of paths that go in one direction.

One minute later, they're going in the other direction. Everything changes very quickly. You don't see this in air traffic.

Like, you don't see, suddenly, your plane just...

I hope not.

We like stability.

In most in most arenas, I think.

Yeah. And another thing that Doug is doing a lot: when we're interested in geopolitical events, we have to match an event that happened in a country, and we have to find the IP resources. Again, the mapping between the physical infrastructure and the cyberspace of the internet is not an easy task.

That's one thing.

The other thing is that we are looking at an object, the internet, that is evolving. It's growing all the time. There are a lot of graphs online. Geoff Huston, maybe, has graphs where you see the number of ASNs is always increasing. So that network is always growing, growing, growing.

But when you do outage detection or anomaly detection, it's hard to have a reference to say, okay, that's my picture of the internet.

How different is it right now? And then detect those anomalies. It's hard because that thing is changing anyway.
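
One common way to cope with a reference that keeps drifting is to compare each new value against a sliding window of recent values rather than against a fixed snapshot. The Python sketch below only illustrates that idea; it is not the Internet Health Report's actual detector. The function name `detect_anomalies`, the window size, the threshold, and the sample counts are all assumptions made up for the example.

```python
from statistics import median


def detect_anomalies(series, window=5, threshold=5.0):
    """Return indices whose value deviates strongly from a sliding reference."""
    anomalies = []
    for i in range(window, len(series)):
        reference = series[i - window:i]  # the most recent `window` values
        center = median(reference)
        # Median absolute deviation: a spread estimate that tolerates outliers.
        mad = median([abs(x - center) for x in reference])
        scale = mad if mad > 0 else 1.0  # arbitrary floor for perfectly flat windows
        if abs(series[i] - center) / scale > threshold:
            anomalies.append(i)
    return anomalies


# Made-up hourly counts of visible prefixes for a slowly growing network.
counts = [100, 101, 102, 103, 104, 105, 60, 107, 108]
flagged = detect_anomalies(counts)
print(flagged)
```

Because the reference slides along with the data, the slow growth from 100 to 108 is not flagged, while the sudden drop at index 6 is. A median-based reference is used here so that one bad sample does not distort the baseline for the samples that follow.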

And another angle to this, I think, is how it's evolved over the last decades of the internet, which really hasn't been around that long as a core technology of human society. It's the change from the nineties, when, you know, you were either what we in the industry call an eyeball network or access network, how you connect to the internet, like, in the US, Comcast, Spectrum, that kind of thing. And then on the other end is where the content is. And so to get to whatever web page you're trying to visit, you needed some sort of a transit provider to connect you from the access layer to the content.

And then the evolution that's happened over the decades is that the content providers are directly connecting to, if not embedded in, the access networks.

And so there's been this evolution of, like, alright, what the heck's the point of transit anymore? Because I get all my content directly, you know, Netflix and Comcast peering directly. So why do we need the internet anymore? And if you were to count packets or bits per second or something, you would find that most of the traffic is satisfied by either local cache or content peering, and only a small portion ends up going out over transit.

And so that would make the argument of, well, who cares? And the truth is you do care. Even despite all those developments, the internet still needs to remain connected, and problems within it will still affect you, even if it's not the majority of the packets being sent. Your DNS query has still gotta traverse the internet. There's a handful of things that you are always gonna have important dependencies on.

And so that's why I kinda like to push back sometimes on this. It goes along with the death-of-transit kind of discussion, where it's like, why have all this BGP analysis at all if everybody's just watching Netflix, and Netflix and Comcast are directly connected, so that doesn't require a lot of analysis to ensure that's working? But the whole thing still relies on this global network.

Yeah. Yeah. Absolutely. That's how we get to the cloud that you talked about, that we drew on the whiteboard. We connect to the cloud. Now, I do think it's important to mention for our audience who are not necessarily in the service provider space that there is a difference between the access network in the enterprise and the access network in the provider world. The access network could be a, you know, thirty-thousand-person organization, and they have dual, you know, active-standby BGP connections peering to the internet, and they, as a whole, are accessing the network through that peering relationship. Whereas in the enterprise, you have that three-tier design, and the access layer is where end users plug into the network, which interestingly is logically the same thing if you think about it. From a logical perspective, obviously, not from a scale perspective.

We've got a similar customer.

It is. Right?

You know, it's how you connect into the rest of the infrastructure. And then, you know, we have an enterprise backbone sometimes.

Data center A to data center B and C.

In the enterprise example you're bringing up, I would argue, again going way back to managing networks in the military, most of the traffic is local there too. You have local services, and so that's equal to the Netflix. Hopefully, people aren't watching Netflix on the enterprise network, but, yeah.

But you're setting up local services that they use.

So you don't have to rely on your transit link out. Hopefully, for as much as you possibly can, you'd like to have some sort of direct or local connection.

Yeah. And I think that's more for the administrative management component and less for accessing services. Because in my experience, in the enterprise, a significant amount of traffic is now going up and not branch to branch. You're not putting services at your local branch.

There's no IDF down the hall with, you know, DNS servers. The only thing in my local branch might be a print server, since that's a pain in the neck to spool up somewhere else across the ocean. So much of the traffic, even if it's owned by the organization, is somewhere else. Or, not traffic.

Some of the services that we want are somewhere else. And I think that's very common in even small enterprises now. Hence the discussion around cloud connectivity, multi-cloud, hybrid cloud, and all of those things. And I also think it's important to make a distinction between interior gateway routing and BGP, exterior gateway routing.

They are different, BGP not being the same type of deterministic routing that you have with, like, an OSPF, where you're not necessarily looking at path selection, but you're advertising prefixes, reachability, and paths. These are all different things when you compare an IGP versus BGP, specifically eBGP. Right?

And so, you know, it does mean in this conversation that we really are focusing on global routing and how we reach things over the internet between providers and among providers, transit providers.

But that does presuppose that there is a limited number of pathways. Right? So you mentioned one minute all my traffic is going one way, and another minute traffic is going another way. But very often, I'm limited in the number of pathways simply because of where I am geographically in the world. So that's something that I think is probably worth considering, given that what we do on the internet is both the mundane stuff, our productivity tools in Office 365, and also the mission-critical things, like a hospital accessing its EMR online. Right?

Those are the things that you measure. So I have a list of several of the things that you talked about in one of your articles, things that you measure. You mentioned the first one, network dependency, several times. And then Doug mentioned AS hegemony. Is that a different one?

Are those the same thing? Are those synonymous? Yeah. They are. Okay. Alright. I wasn't sure because I saw how you were using them in your writing, and I'm like, I don't get it. So can you explain that a little bit? What is network dependency?

Yeah. Sure. Well, first, they are the same thing here. When we wrote the research paper, we thought, oh, we need a fancy name, that's important to be remembered. And we called this AS hegemony.

And then we put it on our website, because the Internet Health Report has a main website where we show results. Well, if you are not techy, you didn't get it. I mean, if you didn't read the research paper, you couldn't really get what AS hegemony is. So then we called it network dependency, which is a bit more intuitive.

Yeah. And this is looking at BGP data. We are looking at all the paths we see in BGP data, and we find the main dependencies from one network to another. The example I always give, and if you listened to that podcast, I probably gave it there, is the University of Tokyo.

The University of Tokyo has its own AS, which is connected to the educational network here in Japan, which is SINET.

And SINET's main upstream provider is IIJ.

And if you look at the results on the Internet Health Report, you're gonna see that we measure that the University of Tokyo depends a hundred percent on SINET and almost a hundred percent on IIJ, even though the University of Tokyo is not directly connected to IIJ. This is the transitive property: because SINET depends on IIJ, and the University of Tokyo depends on SINET, we can see that.

And that's information that a network operator could use for a new deployment, for example. If they want to diversify their connectivity, well, it's a bit sad to say, but connecting to IIJ won't reduce their dependency on us, IIJ. So, yeah, they could try with another provider.
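
To make the transitive dependency concrete, here is a minimal Python sketch that scores how often each AS appears on the observed paths toward one origin AS. It is a deliberate simplification and should not be read as the published AS hegemony computation, which is more involved; this version only counts path fractions. The function name `network_dependency`, the AS numbers (taken from the range reserved for documentation), and the sample paths are all hypothetical.

```python
from collections import Counter


def network_dependency(paths, origin):
    """Fraction of observed AS paths toward `origin` that traverse each transit AS."""
    relevant = [p for p in paths if p and p[-1] == origin]
    counts = Counter()
    for path in relevant:
        # Count each AS at most once per path; skip the vantage point
        # (first element) and the origin itself (last element).
        for asn in set(path[1:-1]):
            counts[asn] += 1
    total = len(relevant)
    return {asn: n / total for asn, n in counts.items()} if total else {}


# Hypothetical AS numbers from the documentation range.
UNIVERSITY, ACADEMIC_NET, UPSTREAM, OTHER_TRANSIT = 64500, 64501, 64502, 64503

# Each path is read from a vantage point (left) to the origin AS (right).
paths = [
    [64510, UPSTREAM, ACADEMIC_NET, UNIVERSITY],
    [64511, OTHER_TRANSIT, UPSTREAM, ACADEMIC_NET, UNIVERSITY],
    [64509, UPSTREAM, ACADEMIC_NET, UNIVERSITY],
    [64508, OTHER_TRANSIT, ACADEMIC_NET, UNIVERSITY],
]

scores = network_dependency(paths, UNIVERSITY)
print(scores)
```

In this toy data the university's score for the academic network is 1.0, and its score for that network's upstream is 0.75 even though the two are never adjacent, which is the kind of non-obvious dependency described above.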

But I think, actually, the insight that's useful, or at least that I find useful with this, is: sometimes you can see that there's, you know, a stub AS that is single-homed behind something. Clearly, that one doesn't take a sophisticated service to figure out. But sometimes that dependency can be a couple of hops away and still exist. And that becomes harder to figure out at a glance, and that's something that's getting bubbled up in this service. So then you can pick out these dependencies that are not immediately adjacent.

And then there's insight that you probably wouldn't...

So one example of this would be a network in Iran. Even though it might be connected to a lot of other networks, to go out of the country you're gonna see that they have, like, a single AS to go outside of the country. So this is gonna show up as a dependency, even though they are not directly connected to it. So it might not be obvious if you don't have this kind of tool.

Yeah, this is really useful. And I think the good thing with that tool is... well, there are other tools that look at BGP. I'm thinking of IODA at Georgia Tech.

This is a great tool.

But what they are looking at is mainly whether the prefixes are on or off. Is the prefix reachable or not? If it becomes unreachable, then for them, that's a signal, and they can report that. And we have, I think, an extra piece of information, which is how the paths are changing. The prefix might still be reachable, but we see that there is a lot of rerouting, which could be due to a BGP leak or some hijack. There could be a lot of different reasons, or an outage, actually: there's a big outage, and you see all these networks that try to reroute around it.

We can see that. And since we've put that data out there, a lot of other research groups have picked it up. There's a group at MIT that made a BGP leak detector out of this.

We had some research on classifying BGP hijacks using this. The Internet Society has a platform that they use to measure resiliency. They have an internet resiliency index.

It takes a lot of different datasets into account, but AS hegemony, or network dependency, however you call it, is taken into account there also. It's a very basic metric, but we found it very useful there.
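
The distinction drawn above, between a prefix becoming unreachable and a prefix staying reachable while its paths change, can be sketched in a few lines of Python. This is an illustrative toy, not how IODA or the Internet Health Report actually work: the function name `path_change_report`, the vantage point labels, and the AS numbers are assumptions invented for the example.

```python
def path_change_report(before, after):
    """Compare two snapshots {vantage_point: as_path} taken for the same prefix."""
    still_seen = [vp for vp in before if vp in after]
    rerouted = [vp for vp in still_seen if before[vp] != after[vp]]
    lost = [vp for vp in before if vp not in after]
    return {
        # An on/off view only looks at whether anyone still sees the prefix.
        "reachable": len(after) > 0,
        "lost_fraction": len(lost) / len(before) if before else 0.0,
        # The extra signal: how many vantage points now use a different path.
        "rerouted_fraction": len(rerouted) / len(still_seen) if still_seen else 0.0,
    }


# Hypothetical snapshots: every vantage point still reaches the prefix,
# but three of the four now go through AS 64503 instead of AS 64502.
before = {
    "vp1": (64510, 64502, 64500),
    "vp2": (64511, 64502, 64500),
    "vp3": (64509, 64502, 64500),
    "vp4": (64508, 64503, 64500),
}
after = {
    "vp1": (64510, 64503, 64500),
    "vp2": (64511, 64503, 64500),
    "vp3": (64509, 64503, 64500),
    "vp4": (64508, 64503, 64500),
}

report = path_change_report(before, after)
print(report)
```

Here the prefix never goes dark, so a pure reachability check reports nothing, yet 75 percent of the vantage points changed path, which is the kind of event a leak, hijack, or outage elsewhere could produce.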

And it is something that you see on the enterprise side as well. Let's say you have four or five data centers and you're configuring and designing your data center interconnects, you're designing your multihomed environments, your large campuses. I've done work with large universities, with global pharmaceutical companies, and we look at who's our last mile provider, who are we peering with, what is the upstream provider from there, because we are mapping it out. You know, you don't want your data centers to go offline because of an upstream provider. So it's not the same as the global scale you're talking about. But I think the idea is the same: you need some sort of visibility into what's going on on a broader scale outside my little sphere so I can prevent being down, right, and avoid outages, at least, not prevent them necessarily.

Yeah. So, what else are you measuring? We've discussed network dependency, which to me, honestly, coming from a networking background, is very logical. That's something that I understand right away. What else are you measuring? Are you looking at any sort of performance metrics?

Yes. We have those performance metrics.

So we are also looking at, as I mentioned before, traceroute. We are taking traceroute data and looking at the latency inside of the traceroutes.

So RIPE Atlas has ten, twelve thousand monitors deployed on the internet.

Those are called Atlas probes.

Okay.

Those probes are doing traceroutes to a lot of different destinations on the internet. We collect this data. Well, RIPE collects this data, and we analyze it.

And, yeah, I didn't want to say that RIPE is not doing great, like before. Like, they are doing a lot of work. It's a lot of work just to collect all this data, and I want to be careful to say that. That's a lot of work.

And I just want to build on top of that. So we take a lot of these traceroutes and keep track of the delay from point A to B, and A and B could be two ASes in the network. When we can, we try to map those to, like, cities, or if we see IXPs on the path, we're gonna try to monitor this. And we put all these results in a database so we can query our database and ask, okay, what was the delay from IIJ to the Amsterdam Internet Exchange?

And we have this time series that can show us this.
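Editor's note: a minimal sketch of how a delay time series between two points could be derived from traceroute data. It assumes each sample is a traceroute in which hop A was followed by hop B, and it takes the difference of the two round-trip times, summarized by a median per time bin to tame the noise. The sample format, the function name, and the one-hour bin are assumptions for illustration, not the project's actual pipeline.

```python
from collections import defaultdict
from statistics import median

def delay_series(samples, bin_seconds=3600):
    """Median RTT difference (ms) per time bin for one (A, B) pair.

    samples: iterable of (unix_timestamp, rtt_to_a_ms, rtt_to_b_ms),
    one tuple per traceroute that saw hop A followed by hop B.
    """
    bins = defaultdict(list)
    for ts, rtt_a, rtt_b in samples:
        # Align the timestamp to the start of its bin.
        bins[ts - ts % bin_seconds].append(rtt_b - rtt_a)
    # The median is less sensitive to outliers than the mean.
    return {start: median(vals) for start, vals in sorted(bins.items())}

# Hypothetical samples: the A-to-B delay jumps in the second hour.
samples = [
    (0, 10.0, 12.0), (600, 10.5, 12.0), (1200, 9.0, 12.0),
    (3600, 10.0, 20.0), (4200, 10.0, 22.0), (4800, 11.0, 21.0),
]
series = delay_series(samples)
```

Querying such a series for one pair of endpoints over a date range gives the kind of answer described above.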

So, another question. So then, with the RIPE Atlas probes, someone has to have set up a test. Like, you're not guaranteed that any probe has measured from anywhere at any time, unless someone has set something up.

So are you able to just query the whole corpus and say, hey, did anybody happen to have a measurement between here and here?

So the way we work, again, because this data is very noisy, is we try to have a stable signal out of that.

And you're right. Like, there are two types of measurement Atlas probes are doing. One is called built-in. So once you plug the probe into the network, it's gonna just start doing traceroutes to the DNS root servers, to some servers managed by RIPE.

There are traceroutes to the public DNS resolvers.

Google Public DNS resolver.

They love that.

Yeah, so there's those built-ins that are, for us, easier to analyze because that's a very stable signal. We have, like, a traceroute. We know that every, I think it's every fifteen minutes, we're gonna have a traceroute from one of these probes to any of the DNS root servers.

And then there is what I think RIPE calls super users.

So there are some users that have special rights and can run measurements on all the probes.

Okay.

So those are usually the RIPE NCC, Camille. Staff. Yeah. Like, Camille.

And they can run, like, very big, wide measurements that are gonna last forever, or until the user stops them. That's also a good signal for us. And then the last one is the usual user that's just gonna run a measurement for a day or a week, just from a few sources to a few destinations.

And for us, this is very, very hard to use, because suddenly you're like, oh, I have a thousand probes to a server in IIJ.

And it's like, yeah, what do I compare this with? Like, is it normal? Is it not normal? I don't know. And then it just disappears. So we have a way to filter. We have what I call the long measurements, Atlas long measurements.

I have a list of those measurements, and I take only this data. It's still a lot of data.
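Editor's note: the filtering described here, keeping only measurements that run long enough to give a stable baseline, might look roughly like the sketch below. The record fields (`id`, `start`, `stop`), the 30-day threshold, and the function name are hypothetical and are not taken from the RIPE Atlas API or from the speaker's code.

```python
DAY = 86400  # seconds in a day

def long_running(measurements, now, min_days=30):
    """Return the ids of measurements that have run for at least min_days."""
    keep = []
    for m in measurements:
        # A measurement with no stop time is still running.
        end = m["stop"] if m["stop"] is not None else now
        if end - m["start"] >= min_days * DAY:
            keep.append(m["id"])
    return keep

# Hypothetical measurement records, times in seconds.
measurements = [
    {"id": 1, "start": 0, "stop": None},             # ongoing, built-in style
    {"id": 2, "start": 50 * DAY, "stop": 51 * DAY},  # one-day user measurement
    {"id": 3, "start": 10 * DAY, "stop": 80 * DAY},  # long user measurement
]
ids = long_running(measurements, now=100 * DAY)
```

Short one-off measurements are dropped because, as noted above, there is nothing stable to compare them with.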

So you're looking at passive information, meaning you're doing, like, classic observability where you're looking at what actual traffic is doing, the health of the system, in this case the entire internet, the status of the system based on what's actually happening, and then you record that over time. So both real time and historically. But you're also using this artificial traffic,

these probes, in order to generate whatever kind of testing information you want. You dump that on the network. So you're using both of these, active and passive forms.

But it does sound like you're focused very much on network-centric latency. Even though there are probes that you can configure to do whatever you want and send out false GET requests and things like that, it's network-centric. Is that right?

Yes. It's getting bigger. Right. Yeah. And so the advantage of using traceroutes as opposed to BGP is, for example, in the case of a submarine cable cut. We have this example where we show, like, there was a submarine cable that was cut between Singapore and Australia.

And we can see that the latency is going up because the traffic is rerouted through a different path, going through a different cable. We can see this in traceroute.

It's very clear, but in BGP, it might not appear, because if that rerouting happens inside one AS, then the routing is exactly the same for BGP.

So those are complementary to look at. Yeah.

There's no latency in BGP.

Yeah. Yeah. Yeah. And then you get, like, we used to call it city-provider pairs. Just like, you know, in BGP, an AS is an abstraction of, you know, a company, and if you want any more detail, you're not gonna get it out of BGP. And so things like traceroute help illuminate what's the precise city-provider pair path, you know, going from one place to another, and then what's the observed round trip time for latency.

Yeah. I mean, in effect, you are measuring things other than, you're measuring other routing, not necessarily BGP, because BGP isn't that reachability matrix. It's gonna be the underlying, probably MPLS, right, whatever they're doing. IS-IS, and a lot of providers, they're gonna use some other mechanism to actually make forwarding decisions under the hood. So you're measuring other routing protocols in that sense.

But that's beyond dependency and delay. You're talking about actual, like, activity on a link, making decisions on forwarding, and then why did we go this way? Or not necessarily why, but we did go this way and here's what the ramifications are on the entire internet. Right? So, one of the things I read about was link monitoring. Is that what that means, or did I just define it completely incorrectly?

That's right, it is. Yeah. So this link monitoring we are doing is exactly that, looking at the internal routing. This code is actually not currently running.

We are in the process of rewriting most of this code. One thing I've learned from the Internet Health Report is that writing a research paper and having a piece of code for a research paper is very different from, I mean, a service you run, you know, in real time, like, all the time, that has to ingest billions of traceroutes. It's very different.

And so we are in the process of making that code run again.

The code was, not directly in. I wrote it, so I can say that.

But that's what we're doing. Yeah. This link monitoring is to look at rerouting inside an AS and the congestion we see inside an AS also. We had some good examples of that when there was some game update from Steam.

There were some ISPs that reported, like, a peak, like, more traffic at IXPs. And we could see congestion in some of the tier ones.

That actually you're still providing, a lot of this.

Yeah, this is really tricky to see from traceroutes. You have to look at the right place. So that's why we make this.

I mean, that's the whole point of the Internet Health Report. Right? Sure. Right. Finding those places where it happened, because there are millions of IPs to monitor. So our system is gonna just try to automatically find: look, the delay between those two IPs used to be, like, very stable, but today it's, like, twice as high. That's the link monitoring part.
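Editor's note: the check described here, flagging a pair of IPs whose delay used to be stable and is now about twice as high, can be sketched as a comparison of medians. The factor of two comes from the conversation. The minimum sample count, the data layout, and the function name are assumptions, and a production detector would need a more careful notion of "stable" than this.

```python
from statistics import median

def delay_alarms(history, today, factor=2.0, min_samples=3):
    """Flag links whose median delay today is at least factor x the past median.

    history, today: dicts mapping (ip_a, ip_b) -> list of delays in ms.
    """
    alarms = []
    for link, recent in today.items():
        past = history.get(link, [])
        # Skip links without enough data on either side of the comparison.
        if len(past) < min_samples or len(recent) < min_samples:
            continue
        base, now = median(past), median(recent)
        if base > 0 and now >= factor * base:
            alarms.append((link, base, now))
    return alarms

# Hypothetical links, using documentation-range addresses.
link1 = ("192.0.2.1", "198.51.100.1")
link2 = ("192.0.2.9", "203.0.113.7")
history = {link1: [5.0, 5.2, 4.9, 5.1], link2: [20.0, 21.0, 19.0]}
today = {link1: [10.5, 11.0, 10.8], link2: [22.0, 21.0, 20.0]}
alarms = delay_alarms(history, today)
```

Only the first link is reported, since the second one moved by about a millisecond, well within its usual range.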

So then what tools, mechanisms, workflows, whatever, are you using in order to ingest, analyze, and then find insight in the data? You mentioned Geoff Huston a couple of times. I'm familiar with what he does. You talked about, oh, then this network appears on the graph. So I assume that you're using some sort of graph methodology and looking at nodes on a graph and interdependencies.

That seems to make sense to me. Right? Right?

Yeah. Yeah. Yeah. So we have, well, some mathematical modeling of the data.

So I talked about the data we ingest.

We have, I'm a big fan of Kafka. We have, like, this Kafka cluster where we can put all this data. It's just streamed there. And we have a lot of scripts that just plug into this Kafka cluster, read that data, do, like, one very simple task, and just return results.

I would say we are pretty good at doing those scripts that analyze the data. And then, once we have results, we put them in the database. We try to show this on our website.
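Editor's note: the pattern described here, many small scripts that each read a stream, do one simple task, and emit results, can be illustrated without a running broker. In the sketch below, plain `deque` objects stand in for Kafka topics so the example runs on its own. A real deployment would use a Kafka client library instead. The message fields and the `count_hops` task are hypothetical.

```python
from collections import deque

def run_worker(in_topic, out_topic, task):
    """Drain in_topic, apply one simple task per message, publish the results."""
    while in_topic:
        result = task(in_topic.popleft())
        # A task may return None to drop a message.
        if result is not None:
            out_topic.append(result)

def count_hops(msg):
    """One very simple task: report how many hops a traceroute had."""
    return {"probe": msg["probe"], "hops": len(msg["path"])}

# Hypothetical traceroute messages waiting in the input stream.
traceroutes = deque([
    {"probe": 1, "path": ["a", "b", "c"]},
    {"probe": 2, "path": ["a", "d"]},
])
results = deque()
run_worker(traceroutes, results, count_hops)
```

Because each worker does a single job and writes to its own output stream, workers can be chained or replaced independently, which fits the "lots of small scripts" approach described above.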

We are not as good at the front end.

Well, that's one thing we are trying to improve this year.

Trying to make it a bit more intuitive, a bit more accessible for a global audience, because we have a lot of people now that talk to us and say, like, oh, can you see this in this country? Can you see that? We have the network dependency. We do that dependency per network, but we also do it per country.

And people are really interested in that, like, seeing that, oh, that country, like, we had a good example recently with Italy. We had measured that Italy relies too much on Telecom Italia, and recently, last February, there was an outage of Telecom Italia, the big transit, the big tier one network, and that completely disrupted the internet in Italy.

So we could monitor that in advance. We were like, yeah, there are quite a few countries in that case.

So if you're operating at that level, and you actually used the term geopolitical earlier, are you involved with the analysis of any kind of geopolitical events as they occur on the internet?

Obviously, from an analysis perspective, because you see it. But as far as being involved in the sense of presenting an analysis of what's going on, or maybe working with organizations to figure out what's going on when there's a government making explicit decisions to withdraw prefixes or whatever else is going on?

Well, I think, like, this is just starting now. Yeah. We are. Because now we've built out a tool, we've proved that it makes sense. Beautiful.

And now we are starting to discuss with, I mentioned the Internet Society before, which is doing a lot of this work, so I'm discussing with them. I was a MANRS ambassador last year. So this is more on the routing security, but now we discuss internet resilience.

And how can we improve this and measure this, that we discussed also. We are still trying to see. BGP is tricky because it gives a very nice view of the whole internet, but it's only the paths that are active.

You don't really know, like, if one of the links goes down, you don't really know, like, if there's a backup or not. And this is kind of like a research project to do.

We've looked at, also, the war in Ukraine is a topic in internet measurement as well as this continues. And then, you know, different parties, us with Kentik and Romain with his tools, and others, I think we all try to look at, you know, the developments there. And then there's also a lot of back-channel discussion among the folks that do this to try to support each other and make sure we're reporting something accurate and useful.

But I think that'll be a topic that we're all focused on for a while.

And Mila, my colleague, is doing a lot of this work. So, like, trying to look into the results and see. So he did some presentations about Ukraine. And one thing I wanna mention is, it's really nice that we are different groups around the world. We have our own tools, and we can, you know, cross-check our results, because there are always a lot of limitations in those tools, in those data sets. So it's nice that we can see, like, well, I can see what he's seeing, I can check my results, and that gives us good confidence in what we're seeing.

Right. Yeah. That's very interesting.

So then what is the, what is the future of the internet health report and your work and what you're focused on?

Well, there's a lot of things coming.

That's a great question, true.

Yeah.

The first thing I'd like to do is to make it a bit more usable and a bit more reliable. So as I said at the beginning, it's more like a proof of concept. I see that there are quite a few people that are now using it. So we'd like to make this a bit more usable.

The other big project that is coming, it's called the Internet Yellow Pages, where we built what's called a knowledge graph, or knowledge database, of internet resources.

I'm very, very excited about this. It's a big database where we put everything we know about IP addresses, prefixes, ASNs. We pull all of that, and then we can query the database and ask, please tell me what are the most popular websites, who is hosting them, with which prefixes, and which ones are in RPKI or not, for example. So you can see, like, which ones are using the best practices. And we can ask a lot of very involved questions.

And I'm very, very excited about this. We have a database that is now working. We're gonna integrate this with the Internet Health Report too. That'll give us a lot of insights also about specific resources. So if you wanna look at a specific IP or specific domain name, specific prefix, then we just give, like, oh, that's all the information we know about it.
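Editor's note: to give a feel for the kind of question a knowledge graph of internet resources can answer, here is a toy in-memory version of the query mentioned above: popular websites, the prefixes hosting them, the originating AS, and the RPKI status. The triples, relation names, domains, and AS numbers are invented for illustration, and the sketch says nothing about how the Internet Yellow Pages database is actually stored or queried.

```python
# Toy knowledge graph as (subject, relation, object) triples. All hypothetical.
triples = [
    ("example.com", "RANK", 1),
    ("example.com", "HOSTED_ON", "192.0.2.0/24"),
    ("example.net", "RANK", 2),
    ("example.net", "HOSTED_ON", "198.51.100.0/24"),
    ("192.0.2.0/24", "ORIGINATED_BY", "AS64500"),
    ("198.51.100.0/24", "ORIGINATED_BY", "AS64501"),
    ("192.0.2.0/24", "RPKI", "valid"),
    ("198.51.100.0/24", "RPKI", "not_found"),
]

def objects(subject, relation):
    """All objects linked to `subject` by `relation`."""
    return [o for s, r, o in triples if s == subject and r == relation]

def first(values):
    """First element of a list, or None if the list is empty."""
    return values[0] if values else None

def popular_sites_report(top_n):
    """(rank, site, prefix, origin AS, RPKI status) for the top_n sites."""
    report = []
    for site, relation, rank in triples:
        if relation != "RANK" or rank > top_n:
            continue
        # Follow the graph: site -> prefix -> origin AS and RPKI status.
        for prefix in objects(site, "HOSTED_ON"):
            origin = first(objects(prefix, "ORIGINATED_BY"))
            rpki = first(objects(prefix, "RPKI"))
            report.append((rank, site, prefix, origin, rpki))
    return sorted(report)

report = popular_sites_report(top_n=2)
```

Each extra relation added to the graph opens up more joins of this kind, which is what makes the "very involved questions" possible.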

Yeah. Yeah. That's amazing. It's so interesting. The more data that you have, the more ability you have to find answers to sometimes very abstract questions like you said.

And yeah, just scratching the surface. I'm looking forward to it as well. It's really interesting. And here we are, oceans apart,

coming together through the commonality of BGP. BGP is what brought us together across vast distances and backgrounds.

So, Romain. Making this call possible.

Yeah. And making this call possible as well. So, Romain, thank you kindly for joining today. This has been very interesting. It really has. And I will put links to a lot of the resources that you mentioned in our show notes for folks to look at. But as we close out, if anyone has a question or comment, which I guess somebody will have, how can they reach out to you online?

Well, they can, I'm on Twitter, or they can send me an email at romain at iij dot ad dot jp.

Great. And, Doug, good to see you again. How can folks reach out to you online?

I'm on Twitter and LinkedIn. I haven't adopted any new social media just yet, waiting to see how that shakes out, but Twitter and LinkedIn are probably the ways to reach me.

You got it. Okay. And you can still find me on Twitter at network underscore phil. You can search my name on LinkedIn, or find my blog at networkphil dot com.

Now if you are interested in joining the podcast as a guest, or if you have an idea for a show, I'd love to hear from you. So reach out to us at telemetry now at kentik dot com. So until next time, thanks for listening. Bye bye.

About Telemetry Now

Do you dread forgetting to use the “add” command on a trunk port? Do you grit your teeth when the coffee maker isn't working, and everyone says, “It’s the network’s fault?” Do you like to blame DNS for everything because you know deep down, in the bottom of your heart, it probably is DNS? Well, you're in the right place! Telemetry Now is the podcast for you! Tune in and let the packets wash over you as host Phil Gervasi and his expert guests talk networking, network engineering and related careers, emerging technologies, and more.