Is data science, and specifically machine learning, just network industry marketecture, or do the process and workflows of ML actually solve real problems for network engineers working in the trenches? In this episode of Telemetry Now, Estefan Ortiz, Ph.D., joins us to talk about what ML has to do with network visibility and the truth of what it can do to solve real problems in networking.
Phil Gervasi: This is Telemetry Now, and I'm your host, Phil Gervasi. And trigger warning, we will be talking about machine learning in this episode. No, I'm just joking. Well, I'm actually not joking about the machine learning part, I'm joking about the trigger warning. Joining me is Estefan Ortiz, a senior data scientist with Kentik, and an expert in data science, machine learning in particular. What we're going to do is talk about how data science is being applied to networking today, specifically network visibility. Yes, we'll also touch on whether or not this is all just hype. Ultimately, that's the goal here today. Keep it real, keep it honest, and learn what we're actually doing with ML and network visibility. Let's get started. Estefan, it's great to have you here today. And I do appreciate that you took some time out of your schedule to talk. Now, I know you're from Texas but you went to grad school in Hawaii, is that right?
Estefan Ortiz: I did, yeah. Yeah, it's good to be here too. Yeah, I finished up undergrad at St. Mary's University in San Antonio. And I thought, "Well, let's see if I can go to grad school in a really nice area." I thought, "Why not Hawaii?" And so I picked an EE program that was strong in control theory and strong in... I think at the time when I was looking for it, it was error control coding. And so Hawaii it was. I went out there... Oh man, 2003 I believe, if I recall, 2003. And then I worked on a master's, graduated in 2006. And then I decided to stay out there until about 2010 or so, at which point I decided to go and pursue a PhD at the University of Notre Dame.
Phil Gervasi: That's pretty amazing. I have to imagine it was really awesome living in Hawaii for those years.
Estefan Ortiz: It was, it was great. It was at times hard to concentrate on school, wanting to get out, learn to surf, and then be out in nature and hike and whatnot. It was a lot of fun.
Phil Gervasi: Okay. I went to graduate school in Albany, New York, which some people like to call the Hawaii of New England. That is not true, I made that up. Nobody calls Albany that. In fact, we have bumper stickers around town... I don't live in the city of Albany but in the area, that say, "Keep Albany boring." And believe me, it is. Anyway, anyway. Before we get into it, you gave me a little bit about your background, but what specifically do you do as a data scientist and what is your doctorate in exactly? You mentioned EE, so I assume that's what it is.
Estefan Ortiz: Yeah. I started off in EE. I received a master's in electrical engineering at the University of Hawaii. Then I went to Notre Dame to pursue computer science and engineering. I received another master's there and then a PhD. My focus was broadly computer vision, specifically biometrics, with a focus on iris recognition and iris detection, so to speak.
Phil Gervasi: Very cool. Very interesting background, I got to say, but I do want to get into it now. Before we really unpack what data science is all about and how we apply it, can we establish a foundational definition of what data science is?
Estefan Ortiz: Good question. I guess, the way that I see it is that data science is somewhere in between applied statistics and software development. I forget the person who said it best, but data scientists tend to be better at software development than most applied statisticians, and then better at applied stats than most software developers. It's that middle ground. But I guess what motivates me, and I think maybe a lot of data scientists, is being able to sift through and look through data to pull out insights, specific insights that are actionable for a given problem. Yeah. And so I think in the network engineering space, can you detect interesting things in say, volumetric data over time? And then once you detect it, can you do something interesting with it?
Phil Gervasi: That's really cool. Now, I've been a network engineer for 15 years or so, and this conversation, the application of data science to networking, is relatively new. Now, I know technically I'm sure there were people like MIT and Stanford doing it for years and years, whatever. But generally speaking, it's a relatively new conversation in the field. Do you have any idea why? Why is it that we're only now starting to use the methods and workflows and processes of data science applied to this industry, networking?
Estefan Ortiz: Yeah. I guess, speaking just in generic terms from what I've seen in other places that I've worked, is that sometimes it's often just the adoption part of it, being able to express the things that machine learning or data science can do when compared to what's already being used. And so I think maybe the slow adoption is there because the field is new, and making the case for a given area is difficult, network engineering being one of those places. And so I think that's one of the reasons. The other reason is I think like other areas, you don't want to treat... Wait, you want to make sure that you're expressing the ability of a given say, trained model, in a way that gives some explanatory behavior of the underlying system. And oftentimes, it's very difficult with either... It's difficult because either the model you've chosen is complex and it's difficult to explain things correctly from it, or the underlying data set isn't as clear to those that are interested. Basically, what's the data that goes into this model? I want to know what are the factors if it's right. And being able to explain what data goes in I think, has been an ongoing process to let everyone know how things were built I guess, is the right way... Or how things were estimated, is the right way to say it.
Phil Gervasi: Yeah, I think one of the things too is that the past few years, maybe more than just a few, there seems to have been, at least in my naive perspective, a hockey stick, exponential growth of complexity in networking when you add all the overlays that we have now that we didn't when I first started. When I first started in networking, things were very simplistic. There was a [inaudible] edge and there was some stuff going on in your LAN. The complexity was like, "My wireless is acting funny." Today, it's crazy. There's so much stuff going on. And I think that's probably one of the things that's lending to this desire to solve that problem in a new way. How do we solve this problem of visibility or of configuration management, whatever it happens to be? Is that a problem that you're seeing with this industry, the type of data that we have, network telemetry is what we call it internally at Kentik, but is that an issue, the kind of data that we're using?
Estefan Ortiz: It is, yeah. Being able to I guess, make sense of the complex datasets that are there, not just a single dataset but multiple data sets. Being able to correlate things in a meaningful fashion at scale, really. It's nice to be able to do small analysis on your laptop but it's a whole different ballgame whenever you're trying to put in a somewhat real time system that does the same thing across millions of data points, across thousands of devices, across thousands of interfaces. And I think the scale of it all makes it very difficult to do things correctly and then to explain the models of the results in a meaningful fashion. Yeah, the underlying data, the underlying complex network topology really plays into that as well. And what we can sample from it too, yeah.
Phil Gervasi: Yeah. I have to imagine because you're not collecting one data type, it is very diverse data types and formats. And I was just looking at some examples of how scaling and normalization are done, and I'm like, "Well, that must be a huge part of what you have to do, considering the diversity of data that we have."
Estefan Ortiz: Yeah, yeah. And then I think the other piece that plays into it, if you're doing ML classification systems, is trying to capture knowledge through say, labeled data sets, which is a whole other issue to address. You want to make sure that it's a quality labeled set, you want to make sure that you're capturing subject matter experts' knowledge correctly within that data set, and you want to make sure that you can at least present a good I guess, stratified set so to speak, so that you can see edge cases in the classification type models, whether it's simple, "This network is doing well versus this network's doing poorly." Or, "Yes, this is a known outlier and it's not something that's expected from a given set."
Phil Gervasi: Now, you just mentioned machine learning a moment ago, is that really what we're talking about when we say data science and how it's applied to networking?
Estefan Ortiz: That's a good question. I feel like it's on a spectrum mix. Yeah. For me, it typically starts with applied stats, and how far can you get away with modeling things in terms of distributions? And then once that starts to hit its limits, can we use more flexible models that add a little bit more complexity, say classification type models or unsupervised learning if we're trying to do discovery of I guess, patterns within the data set? Yeah. And so for me, it behaves all on the same type of spectrum. It just depends on what problem you're trying to address.
Phil Gervasi: Well, I guess that begs the question, what are we trying to solve here? We're talking about some pretty cool stuff, and I have so many questions about correlation and causal relationships and strong correlation versus weak, so much stuff that I want to ask you about, but what are we actually trying to solve? Honestly, a lot of the industry is looking at this and saying, "Is this just a solution looking for a problem?"
Estefan Ortiz: Yeah, yeah. I would say, I guess in my day to day, what I look at is mostly time series data. How do things behave over time? And so I am trying to at least push what we currently do, and you can correct me if I'm wrong, what the industry does with these detection heuristics where you're looking at some mean and some standard deviation. You're asking, "How different is this new quantity, relative to those two statistical measures? Am I far from the mean, given some underlying variance?" And so there are ways that we can extend that to incorporate time series behavior. The one that quickly comes to mind is being able to capture say, seasonal patterns or periodic patterns so that if there is a peak, you can ask the question, "Is this peak known? Did it occur every day at 8:00 AM?" And so I expect those values to be high. And so outliers may not be the same, whereas if it's maybe in a trough where for whatever reason, the network isn't as busy. Being able to take the same concepts that you use for current outlier detection but incorporate a time component to it in saying, "This is my conditional expectation for this time of day. This is the center point, and this is the variability for that [inaudible] time of day." It's not so sensitive to these expected variations throughout the day or throughout the month or week, however we capture seasonal patterns.
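As a rough illustration of the idea Estefan describes here, the sketch below conditions the usual mean and standard deviation check on the hour of day, so a value that is normal at the 8:00 AM peak can still be flagged in an overnight trough. It is a minimal sketch, not Kentik's implementation: the function names (`hourly_baseline`, `is_outlier`), the three-sigma threshold, and the synthetic traffic values are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, stdev


def hourly_baseline(samples):
    """Build a per-hour baseline from (hour_of_day, value) samples.

    Returns {hour: (mean, standard deviation)}. Hours with fewer than two
    samples are skipped because a sample standard deviation needs two points.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def is_outlier(baseline, hour, value, threshold=3.0):
    """Compare a value to the expectation for its own hour of day."""
    mu, sigma = baseline[hour]
    if sigma == 0:
        # No observed variability: anything different from the mean is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Synthetic history (hypothetical numbers): 14 days with a busy 8:00 AM peak
# and a quiet 3:00 AM trough.
history = []
for day in range(14):
    history.append((8, 900 + (day % 5) * 10))  # peak traffic, 900 to 940
    history.append((3, 50 + (day % 5)))        # trough traffic, 50 to 54

baseline = hourly_baseline(history)

peak_value_at_peak = is_outlier(baseline, 8, 920)    # expected morning load
peak_value_at_trough = is_outlier(baseline, 3, 920)  # same load at 3:00 AM
```

The same traffic level gets a different verdict depending on when it occurs, which is the "conditional expectation for this time of day" described above. A production system would likely also handle weekly patterns and trends, which this sketch ignores.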
Phil Gervasi: And I guess, that's what we mean by insight then. There's something beyond just looking at the interface statistics, this is going beyond and doing some sort of... Not interpretation but inference.
Estefan Ortiz: Yeah, I would add to that that insights is both that predictability... Or not predictability, that forecasting ability but also the action that's associated with it. It's always been, "Hey, I've detected something cool"... which from an ML or data science perspective is awesome, but then it's, "What do we do with that?" And I think that's where a lot of the subject matter expertise comes into play. Can we present the scenario and say, "We've detected something, now go fix this interface. Or it's a problem with this link. Or we're having problems reaching this point but so are 10 other people or 10 other companies." Yeah. For me, it's both, the interesting part that you can detect but the call to action that drives it too.
Phil Gervasi: We're talking about ingesting a ton of information, doing applied statistical analysis, perhaps using some ML algorithms, all this really cool stuff, but ultimately it's so that an engineer can go fix an interface or whatever.
Estefan Ortiz: It is. No, it is.
Phil Gervasi: You can make the network run better and applications get delivered properly and all that.
Estefan Ortiz: Yeah. No, I laugh because the pattern seems to be consistent from back...
Phil Gervasi: What do you mean?
Estefan Ortiz: From back when I used to work on aircraft health monitoring at University of Hawaii, it was the same problem. Can we look at operational data? Can we tie it to maintenance information and then use that to say, "Hey, we've detected something wrong with the plane. What do we do with it? How do we act?"... so to speak.
Phil Gervasi: Yeah. Yeah, that makes sense. And that's that whole idea of actionable insight, not just an insight. Something that you mentioned was you're collecting a bunch of data and you're able to find some pattern, but to a subject matter expert, then you present it to them and they're like, "Yeah, who cares?" At Networking Field Day 29, a few weeks ago, last month, I made the comment that... Let's say you're analyzing this telemetry that's coming in and you have this 400 gig interface in your data center, which is not an uncommon bandwidth amount, and it's plugging away at one meg per second, not one gig but one meg. It's a tiny, tiny fraction. And then you see it jump to 10 megs, which is a statistically significant increase, yet it is so small, really to a subject matter expert who knows networking looking at that, they know immediately it has no bearing on application performance, has no bearing... Maybe it's something to check out and you put up a warning something's going on, but it's really not mission critical. And that's what I remember... A former colleague of mine talked about it in this way, he said it was the difference between weird and bad. It's weird but it's not really a bad thing so what do we do with it? That's the quality of the insight.
Estefan Ortiz: Yeah. And so for me, when you discuss it or when I hear the same thing, "Hey, you detected something interesting but it's not significant," I try to internalize that and say, "Well, how do I map that 'it's not interesting' part to some quantity or some algorithm?" Whether it's doing something simple, like an effect size that tries to translate that into some mathematical behavior where we've detected the outlier but relative to its magnitude change so to speak, it doesn't mean much. Or relative to the device itself... I forget, it was a 40 gig or 400 gig bandwidth, relative to that threshold or that measure, is it significant? And so being able to map that to something that we can incorporate in the algorithmic process is awesome. And so that would be the rule based approach. Not rule based but that would be incorporating that rule or rule of thumb, or being able to take that and go back and look at the data and have myself or others label it, saying, "Yes, these are insignificant, these are significant. Use those labels now to feed through the algorithm and say, 'Let's improve upon this because we've showed it to interested folks. They don't think it's very interesting, so now let's close that feedback loop to actually use that information.'"
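One way to picture the "relative to the device itself" check is to express a detected change as a fraction of link capacity and only escalate it when that fraction is large enough to matter. The sketch below applies this to Phil's example of a 400 gig interface jumping from 1 Mbps to 10 Mbps. The function name `change_summary`, the 1% utilization cutoff, and the returned field names are assumptions chosen for illustration, not a method described in the episode.

```python
def change_summary(old_bps, new_bps, capacity_bps, min_utilization_delta=0.01):
    """Summarize a traffic change relative to the link it happened on.

    relative_change: how many times larger the new rate is than the old one.
    utilization_delta: the change expressed as a fraction of link capacity.
    significant: True only if utilization moved by at least the cutoff.
    """
    if capacity_bps <= 0:
        raise ValueError("capacity_bps must be positive")
    relative_change = new_bps / old_bps if old_bps > 0 else float("inf")
    utilization_delta = abs(new_bps - old_bps) / capacity_bps
    return {
        "relative_change": relative_change,
        "utilization_delta": utilization_delta,
        "significant": utilization_delta >= min_utilization_delta,
    }


# The example from the conversation: 1 Mbps jumping to 10 Mbps on a 400 Gbps link.
weird_not_bad = change_summary(1e6, 10e6, 400e9)

# A contrasting, hypothetical case: 100 Gbps jumping to 300 Gbps on the same link.
worth_a_look = change_summary(100e9, 300e9, 400e9)
```

In the first case the rate grows tenfold, which a purely statistical detector may flag, but it moves utilization by a tiny fraction of the link, so the rule of thumb marks it as insignificant. This is the "weird versus bad" distinction encoded as a simple filter; the labeling feedback loop Estefan describes would be a separate, learned alternative.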
Phil Gervasi: It sounds like the process of using all of these methods and processes of data science from a high level is very iterative, it's not like you slap an algorithm and you got all the answers and now all your engineers are happy because all their problems are solved. It sounds like it's a constant process of maybe trial and error. I hate to use that because as a scientist, I'm sure you don't want to... You want black and white answers, but it's a process of getting things more accurate, more meaningful, is that right?
Estefan Ortiz: Yeah, sophisticated trial and error. Yeah, rely on the scientific method a little bit. But yeah, close. But you mentioned that, and there's a whole field that's starting to crop up behind just what you said.
Phil Gervasi: What's that?
Estefan Ortiz: There's MLOps, similar to DevOps, where they're doing the monitoring of these systems to say, "This one's drifting away from some set point. Let's go ahead and retrain. Or we've got feedback from the field that says these insights aren't very good, how do we bring that and feed it back into either the original dataset or the modeling portions of the algorithm?"... so to speak.
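The "drifting away from some set point" check can be sketched very simply: keep a reference window of some monitored quantity (for example, a model's error or an input feature) and compare a recent window against it. The version below is a deliberately minimal sketch under stated assumptions: the function name `needs_retraining`, the two-sigma tolerance, and the sample numbers are hypothetical, and real MLOps tooling typically uses richer drift tests than a shift in the mean.

```python
from statistics import mean, stdev


def needs_retraining(reference, recent, tolerance=2.0):
    """Return True if the recent mean has drifted from the reference set point.

    The shift is measured in units of the reference standard deviation, so
    the tolerance is comparable across metrics with different scales.
    """
    ref_mu = mean(reference)
    ref_sigma = stdev(reference)
    recent_mu = mean(recent)
    if ref_sigma == 0:
        # A perfectly flat reference: any change at all counts as drift.
        return recent_mu != ref_mu
    return abs(recent_mu - ref_mu) / ref_sigma > tolerance


# Hypothetical monitored metric, e.g. a model's daily error.
reference_window = [10, 11, 9, 10, 12, 10, 9, 11]
stable_window = [10, 11, 10, 9]
drifted_window = [15, 16, 14, 15]

stable_result = needs_retraining(reference_window, stable_window)
drifted_result = needs_retraining(reference_window, drifted_window)
```

A check like this would run on a schedule, and a positive result would trigger the retraining or relabeling step that closes the feedback loop discussed above.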
Phil Gervasi: Yeah, yeah. We've been pretty high level today. I really wanted to get an idea of this concept of data science and why and how we apply it to networking but so much of the stuff that you said has... I have so many questions that I want to follow up on. For example, you talked about anomalies earlier, and something I struggled with in my career was getting a platform in front of me that's firing off alerts for anomalies. And they're all false positives. I'm like, "That's not an issue." I end up not trusting the tool, and how do we deal with that? There's a lot of questions that I have for you but just because of time, I do want to close this out now. It has been really a great episode talking about data science and machine learning and getting to unpack some of the real meat behind what the industry is doing right now. Thank you Estefan, for joining me today, I really do appreciate it. And before we go, if folks want to reach out to you online, maybe ask a question or have a comment, how can they get in touch with you?
Estefan Ortiz: Sure. They can send me an email to my email address, eortiz@kentik.com. That's E-O-R-T-I-Z@kentik.com.
Phil Gervasi: Okay, great. And you can find me on Twitter, @network_phil, and you can search my name on LinkedIn. And until next time, thanks very much.