Phillip Gervasi: What do planets and stars and galaxies have to do with data, databases and even data centers? Well, just like when large amounts of mass begin to clump together out in space, and as that mass grows in volume it only attracts more mass until you end up with an entire planet, data kind of works the same way. As organizations collect and organize and store more and more data, as they grow or their needs change over time, what happens is an exponential growth of data coalescing in one place. Databases grow, sometimes exponentially, and so do the data centers and secondary applications and services needed to store, manipulate and access that data. This is what's called data gravity, and it's a phenomenon that technologists have had to contend with for years, including in more recent years with public cloud, containerized services and, I guess you could say, our almost wholesale reliance on applications. Now, with me again today is Ted Turner, a subject matter expert in cloud technology and a solutions architect with Kentik. And in this episode we'll be discussing why this happens, what challenges it creates, and how we can solve them. I'm Phillip Gervasi, and you're listening to Telemetry Now. Ted, it's great to have you back again. I'm really looking forward to this conversation today. I read your blog posts on data gravity, and then of course I went and Googled it as well because I wanted to learn more, and, well, it's a really interesting topic. But before we get into it, would you mind just introducing yourself to our audience again, just in case they're not familiar?
Ted Turner: Hey, I'm Ted Turner, Cloud Solutions Architect over at Kentik. I've been doing data centers and networking and the internet stuff for decades now.
Phillip Gervasi: And when you say cloud and internet and all that kind of stuff, did you start off in the networking space and then move to cloud or did you start off in, I guess purely data center space? Where are your roots?
Ted Turner: The roots come from small business, working my way up through medium-sized business and then helping out large organizations. The last major gigs I've done have been SaaS providers, providing their wares on the cloud for their customers.
Phillip Gervasi: Okay, yeah, me too. I certainly worked my way from small business, medium business, sometimes very small business, a dozen people, all the way to global enterprise service providers. So it's, I think, a really cool path to go personally because you get to see so much, you have to touch so many things, so it forces you to develop an understanding of how certain things work together. I didn't start my career in a silo, I had no choice but not to be in a silo, so I really think that's a cool way to go. Anyway, let's start off with defining data gravity. I have a definition in my head based on your blog posts and my own reading, but what does that mean?
Ted Turner: So, back in the day at the first cloud-scale provider I worked at, all the data ends up in a single data center, and so we had two data centers, one in San Diego, one in Plano. And at some point you have a very hard time replicating that data over into the alternate data center. The pipes just aren't big enough. You can't replicate fast enough, you don't have enough memory, you don't have enough bandwidth on the database to do that replication, and so you end up having to, quote, "ship tapes." In my head, that's where I started off with the concept of data gravity. You couldn't run and maintain two sites. So you have the concept of scaling up or scaling out, and that's where the cloud providers have come in. A lot of organizations are deciding to try to keep their data but move it away from that data gravity, to make sure that you can replicate it and do it in smaller chunks, smaller bite-sized pieces.
Phillip Gervasi: Yeah, sure. And I remember those days setting up DCIs between an active standby data center or maybe dual hub active data centers, whatever, and there's an entire level of complexity and requirements from a technical perspective on how to make that work. But what do you mean by gravity? I get it, so we have a large data center and things are consolidated there. Are you talking about physical location, data center real estate and everything being consolidated in one location?
Ted Turner: So it can be in one data center location, or one geography in the cloud at your cloud provider, once you try to move all of your data for your customers to another region. If you have a hurricane, if you have a massive power outage and all of a sudden your data center goes down, how do you keep your business up and running? The volume of data and the number of customers you have going through a single funnel to get access to that data limit you.
Phillip Gervasi: I see. Yeah.
Ted Turner: So things like content delivery networks for static content, for your newspapers, your Netflixes, they go generate the content one time and then they try to distribute it everywhere around the globe. That's one method of getting rid of the data gravity instead of pulling everybody back to one single point. But if you've got live data, banking, retail, credit card processing transactions, that's an atomic transaction. All the data's in one spot as much as it can be, so that you have financial transaction consistency.
Phillip Gervasi: Yeah, yeah. And you used a term there, well, it was a phrase, but pulling everybody back or pulling everything back. So there's that idea of gravity. So if you're in one single geographic or actual physical street address location, your data center is just getting bigger and bigger because of the nature of your growing business, which is awesome.
Ted Turner: It's success right there. You've defined success for your business.
Phillip Gervasi: But as a result of that success and your growing amount of data, so we're talking about physical racks of servers and network gear and all that stuff, what's going to happen is you have the network-adjacent and database-adjacent services and technology, sometimes physical, sometimes virtual, all on site, or at least all going in and out of that site to help facilitate the use of that data. And I remember doing a podcast with a friend of mine a while back, and we were talking about how, at the end of the day, what we're doing with networks and databases and data centers and all this stuff is really just helping human beings reach an application. And then he corrected me, this is my friend Tony, and he said, "It's not really helping people connect to an application. The application is actually helping people connect with the data." So even the applications are just a mechanism, a conduit, for me sitting here to look at my bank account information. And that really struck a chord with me. So when it comes down to it, as we rely more on applications for our day-to-day lives, so mundane things, but also stuff like banks and 911 services in my city, all that stuff relies on what's happening in data centers. All of it gets funneled more and more into one geographic location. Let's say, I live near the city of Albany, New York. Albany, I don't know if they have a single data center, I don't know what they do, but let's say they have a single data center downtown, and more and more stuff goes there, more and more stuff. So there's your idea of gravity, like a planet with greater mass attracting more satellites around it, and that causes problems. You mentioned a couple of problems. What are the problems? You mentioned not being able to replicate because of bandwidth constraints or constraints on your physical gear.
Ted Turner: On the database, the memory, the disk, stuff localized to the database, or the network. But as an example, like you were just asking, the first time I ran into this was with HP 3000 servers. We had four dedicated HP 3000 servers just for DNS, over-provisioned with memory. We were running an old version of Red Hat Linux, and it got constrained at the kernel and couldn't handle the DNS. So we had things load balanced with an F5 load balancer in front, and those boxes simply fell over. So we had to get in and start doing kernel tuning to handle the DNS volume and simply not drop DNS requests. If a DNS request doesn't go through, the application, the customer transactions, simply don't work.
Phillip Gervasi: Right, right. And then lifting and shifting your DNS services from on-prem to a cloud provider doesn't necessarily solve that, because then everybody just goes to that IP address, or two IP addresses if you're load balancing two, but it doesn't really-
Ted Turner: I found out that Route 53 will not respond to a DNS request shorter than 1.6 seconds or 1.8 seconds. So we had to set our unbound DNS caching to at least two seconds, because we had set it down to one second and we broke Amazon's DNS because we were just hammering it so hard. Now they protect themselves and they protect all of the rest of their customers by putting in those limits. But those are the types of things that you need to start figuring out.
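A minimal sketch of the kind of unbound.conf tuning Ted describes might look like the following. The two-second floor comes from his story; the other value is an illustrative assumption, not a recommendation from the episode:

```
# Sketch of a caching floor in unbound.conf so a chatty application
# can't hammer the upstream resolver on every request.
server:
    cache-min-ttl: 2        # don't expire cached answers faster than every 2 seconds
    cache-max-ttl: 86400    # keep a one-day ceiling on cached answers
```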
Phillip Gervasi: Yeah, yeah. And ultimately we're talking about some of those network-adjacent services. You mentioned DNS having performance problems, and those performance problems are going to directly impact the application's performance. I mean, if I'm trying to query DNS and there's a request and response with significant latency just to resolve an IP address, or all the different components on the webpage, that's really happening behind the scenes. You open the webpage, but there's a ton of other DNS requests going on, and that's going to make the application feel slow and my digital experience very poor as a result. And then you also mentioned the database architecture just being hammered, actual physical and virtual resources being hammered, and having to add more memory and compute and CPU. So I'm assuming there's also backend slowness as a result of everything being consolidated in one place, right?
Ted Turner: Yeah, I wrote a blog article when I first landed at Kentik. Latency stacks up: 10 milliseconds of latency at the backend database turns into 100 milliseconds at the application logic tier, which turns into 1,000 milliseconds by the time you get to the UI. So every tier that you have adds almost an order of magnitude of latency, and so very small pieces of latency that show up in the backend have very big impacts all the way throughout your infrastructure.
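A back-of-the-envelope sketch of why that happens, assuming each tier fans out roughly ten sequential calls to the tier below it (the fan-out counts here are illustrative, not from the episode):

```python
# How per-tier fan-out turns a small backend delay into a large end-to-end delay,
# matching Ted's 10 ms -> 100 ms -> 1,000 ms example.

def tier_latency(downstream_latency_ms: float, calls_per_request: int, local_work_ms: float = 0.0) -> float:
    """Latency a tier sees when it makes sequential calls to the tier below it."""
    return local_work_ms + calls_per_request * downstream_latency_ms

db_ms = 10.0                                            # one database query
app_ms = tier_latency(db_ms, calls_per_request=10)      # app logic issues ~10 queries per request
ui_ms = tier_latency(app_ms, calls_per_request=10)      # UI/API layer issues ~10 app calls per page

print(f"database: {db_ms:.0f} ms, app tier: {app_ms:.0f} ms, UI: {ui_ms:.0f} ms")
# database: 10 ms, app tier: 100 ms, UI: 1000 ms
```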
Phillip Gervasi: Right, right. So ultimately it is not... At Kentik we're very much focused on the network, I get it. And so we look at network latency and what is the round trip time and is there packet loss somewhere in the path? I get that. But there's a significant amount of potential for latency, which is the new outage, like the saying goes, that occurs as a result of just some old server taking a long time to respond because it's being hammered, and then tracking that down, I'm assuming is going to be a problem as well. So we're talking about performance problems as a result of just our services, our devices being overwhelmed, I get that because everything is consolidated into one place. But I have to assume then that also plays into reliability. You mentioned replication, and that's all about reliability. We're replicating our data center to another data center, probably a backup DC or a standby. So there's issues with reliability as well.
Ted Turner: You can start to replicate and have one read-write database and then have one or more read-only caches of that database, and that can be locally within that same availability zone, within one cloud provider, or within one data center that you manage. You can also start to make a read-only replica available in multiple availability zones. So now you're not only talking about mitigating the data gravity, but also the larger world around you; fault tolerance capabilities start coming together across all of that. And then you can also take that database and cache that content over in another region, so you're in US West and US East, and all of these things start to come together. A lot of organizations are starting to throw Redis in front of the database, handling all of that content and making it quickly available to an API or the application tier. And so there are several different ways you can start to look at that reliability aspect: making sure the data's in multiple places, and then potentially caching tiers to make sure that those databases are not hit so hard.
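A minimal cache-aside sketch of that Redis-in-front-of-the-database pattern, assuming the redis-py client and a placeholder query_replica() function standing in for a read-only replica query (hostnames and key names are illustrative):

```python
# Cache-aside read path: try a nearby Redis cache first, fall back to a
# read-only database replica on a miss, so the primary is hit less often.
import json
import redis

cache = redis.Redis(host="redis.local", port=6379, decode_responses=True)

def query_replica(account_id: str) -> dict:
    # Placeholder for a query against a read-only database replica.
    return {"account_id": account_id, "balance": 123.45}

def get_account(account_id: str, ttl_seconds: int = 30) -> dict:
    key = f"account:{account_id}"
    cached = cache.get(key)
    if cached is not None:                      # cache hit: the database never sees this read
        return json.loads(cached)
    row = query_replica(account_id)             # cache miss: hit the replica, not the primary
    cache.setex(key, ttl_seconds, json.dumps(row))
    return row
```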
Phillip Gervasi: Yeah. And you're speaking to the solution or at least one of the solutions here, but to just go back, ultimately we're talking about monolith data centers attracting more data as a result of them being located in one place and the growth of a business and organization. And the problem that results is reliability issues, maybe because of actual geographic, like weather stuff, weather emergencies or performance issues as a result of everybody in your organization coming into one data center. And I have to assume, we haven't mentioned it yet, the third issue has to be cost because if I am having my entire organization of thousands and thousands of end users coming in over two links or four links coming into my DC or two DCs, I need massive pipes. I need very expensive, high-quality connectivity to the internet or maybe private circuits out to other locations, like I did years ago when we were using SD-WAN.
Ted Turner: Internet exchanges, yeah.
Phillip Gervasi: Yeah, yeah.
Ted Turner: All of the above.
Phillip Gervasi: So reliability, latency, performance and cost are really our main problems that we have as data gravity continues to grow, and we develop this gigantic data center of services and resources all in one place.
Ted Turner: And that makes the network underneath complex because you have to make sure that those databases have a clear, clean path to replicate with each other, and you're usually going to make that happen someplace different than that front end application that's talking to the database.
Phillip Gervasi: Okay, then how do we solve this? My background in networking means that I'm primarily thinking about networking solutions, which I know is not all of it, there's much more there. But you already began to talk about what I think is one of the solutions, which is disaggregating your data in multiple data centers. Let's start with that.
Ted Turner: So with multiple data centers, you need to have that backend path, a clear, clean path, to make sure that whatever data replication is going on for your customers, that live atomic transaction is not impacted by a backup. When we first started doing these things, backups started taking a long time, and instead of ending at 5:00 AM, they were moving into the business day, 9:00 AM, 10:00 AM, 12:00 PM lunchtime, and all of a sudden the database can't keep up. So you need to make sure that you have services not interfering with each other. Database replication is one thing, going from site to site, and then there's the backup to tape back in the old days. You're making sure that these two things aren't happening at the same time, or that they have two separate pathways, or you're backing up a read-only database instead of the read-write database.
Phillip Gervasi: I mean, you're talking about multiple paths so that you don't overload any one individual path, so there's a networking component there. Scheduling your backup jobs and replication and all of that among your multiple data centers. That all makes sense, but that sort of speaks to the performance issue, doesn't it? That way there's plenty of bandwidth, I don't have any overutilized devices, and replication is happening perhaps at a different time than when the bulk of my users are active, so there's no contention. Isn't that what we're talking about, more of a performance thing here?
Ted Turner: But it's also going towards that reliability, that fault tolerance. And so if you can get it to be geographically distributed, most organizations will start off with kind of an A/B path, so they'll have the applications maybe in multiple geographies but calling on one database, and then, like you called out, that cost to replicate everything is expensive. So at some level of tiering, it's worth it to have the data available for your customers: that performance, that cost of not having the data when things go down. We were running in Amazon, we had our database on the backend, an Oracle database provisioned by Amazon, and poof, everything disappeared. Amazon brought everything back two minutes later, but everything queued up, and we had 10,000 transactions that we lost atomically. There were millions of transactions that day, but we had to go in and find those 10,000 transactions that, poof, went missing, and go notify the customers: "Hey, we had an anomaly within our data center processing. Please go check your data and validate that these transactions that were in process took place."
Phillip Gervasi: Just lifting and shifting everything from my private data center into AWS isn't necessarily going to solve everything then, especially if those are not cloud native applications, not written that way. So I'm basically just moving the problem from me to AWS if I am still a growing organization with a number of users. However, however, AWS on the backend is distributing that data across their regions and across their own infrastructure, right?
Ted Turner: If you select the check boxes, and when you select those check boxes, you incur the costs.
Phillip Gervasi: Okay. So it's always a cost constraint. So the more... And that's why we used to say, I remember talking to my own customers about fault tolerance and resilience and all that, and it's basically the more you get close to that 100% resilient, it's kind of an exponential growth in cost to get that one more 1% or fraction of a percent of reliability. And it's the same with looking at data reliability here, right?
Ted Turner: You got it. Three nines to four nines to five nines is $10 versus $100 versus $1,000.
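For context, each extra nine buys you roughly a tenth of the allowed downtime, which is where that step-up in cost comes from. A quick sketch of the arithmetic (the dollar figures above are Ted's illustration, not computed here):

```python
# Allowed downtime per year for each availability target ("nines").
HOURS_PER_YEAR = 365.25 * 24

for label, availability in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    downtime_minutes = (1 - availability) * HOURS_PER_YEAR * 60
    print(f"{label} ({availability:.3%} uptime): ~{downtime_minutes:.0f} minutes of downtime per year")

# three nines  -> ~526 minutes (~8.8 hours) per year
# four nines   -> ~53 minutes per year
# five nines   -> ~5 minutes per year
```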
Phillip Gervasi: Yeah. But what about distributing the applications themselves then? So we're talking about replicating database A, so it lives in multiple places and people can access them, and our resources are less hammered, fine. But what about distributing those applications so that way you don't have single processes entirely living in one location?
Ted Turner: So you can distribute. There are new technologies, dockerization or Kubernetes, for running the application around the planet and making sure that that application experience happens. You can put a small Redis cache out there closer to the edge; these are becoming the new concepts, having a small Redis cache at the edge and your application being served at the edge. The Redis cache will go ask the database for the most relevant data, write it all into that local memory cache, and then make it available for that customer transaction, that last 10 milliseconds, the last mile. And then when the customer writes, they'll write back to the Redis cache, so they still have access to the local data, and then Redis will handle sending it all the way back to the backend database.
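A minimal sketch of that write path at the edge, under the assumption that the application (rather than Redis itself) queues writes in a Redis list and a background worker ships them to the central database; the hostnames, key names and send_to_backend() callback are hypothetical placeholders:

```python
# Write-behind at the edge: the customer's write lands in the local Redis cache
# immediately, and a background worker drains a queue back to the central database.
# This is application logic layered on Redis, not built-in Redis behavior.
import json
import redis

edge = redis.Redis(host="edge-redis.local", decode_responses=True)

def write_at_edge(order_id: str, payload: dict) -> None:
    edge.set(f"order:{order_id}", json.dumps(payload))                      # local read-your-writes
    edge.rpush("pending_writes", json.dumps({"id": order_id, **payload}))   # queue for the backend

def drain_to_backend(send_to_backend) -> None:
    """Background worker: forward queued writes to the central database."""
    while True:
        item = edge.lpop("pending_writes")
        if item is None:
            break
        send_to_backend(json.loads(item))
```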
Phillip Gervasi: Okay. So one of the ways that we can solve this... That sounds very reminiscent to how CDNs work, by the way.
Ted Turner: This is an extension of CDN concepts.
Phillip Gervasi: Right, okay, so caching data. Back in the day, we also used to talk about things like WAN optimization, where there is a caching component in the local appliance, so it's a similar idea. Is that the primary way we solve the problem of data gravity? Obviously we talked about distributing your databases among multiple data centers, whether they be on-prem or in multiple cloud instances, fine, but now we're talking about caching. Is that going to be another way we solve this?
Ted Turner: You got it. It's just one more element to taking that single set of data, as that data gets big, you bring everything into it and you're just trying to take small pieces and move them closer to wherever you need them.
Phillip Gervasi: And wherever you need them is wherever people are. I mean, when it comes down to it, it's me accessing, let's say, Microsoft 365 applications, whatever productivity tools, and there's going to be a location geographically close to me in the Northeast that I'm going to be redirected to, so that my performance is better. And on the back end, I'm sure Microsoft has an entire network where they work all that out.
Ted Turner: I don't know if you've noticed, but when you walk through the airport, they've got these machines, vending machines, and you can get your latest iPhone accessories or Android accessories, whatever got damaged or destroyed in transit, in travel, you go to a mall, they've got stores, and then they have these small kiosks and you swipe. Having that transaction, there's inventory management for that kiosk or that vending machine, plus there's the business transaction, the credit card processing, all of these are happening at the edge, taking care of what you just called out, people.
Phillip Gervasi: Now that solves the latency problem, but it doesn't necessarily... I mean, I guess it also solves the servers on the backend being hammered as well, because you are caching that data locally or at least geographically nearby. And so you are not making the same number of requests to that backend database, is that correct?
Ted Turner: You got it.
Phillip Gervasi: All right. So you're solving several problems here. You're solving the latency problem, which is...
Ted Turner: Performance.
Phillip Gervasi: Yeah, it's the main contributor to performance degradation. You're also addressing to an extent your backend services being hammered and overutilized, which again is going to affect performance, but also reliability. And then also distributing that data on the backend among multiple data centers, cloud or otherwise, containerized or otherwise, will protect you from natural disasters and give you that fault tolerance. Man, Ted, this sounds really expensive though. Really expensive. I mean, we were talking about cost being a problem when I'm all in one data center. This sounds way more expensive.
Ted Turner: There's a level of cost benefit. So you've got reliability and performance, and at a certain point you can add more to your costs. When you're going and buying something at the airport at that kiosk, it's going to cost more than ordering it and waiting for it to show up two days later on Amazon, or a week, or two weeks, whatever that is; there's a time delay. So how fast, how reliable do you need it to be? There is a markup that we see in the marketplace today to make these things happen and be quickly available. There is a cost, yes.
Phillip Gervasi: Yeah. So it sounds like in an attempt to solve these other technical problems caused by data gravity, performance problems, reliability problems, we're not going to solve the cost problem. I mean, the cost problems that we had with having all our data in one place, I mean, we're really just shifting that cost to somewhere else in our environment, possibly even incurring additional cost, greater cost. Not to mention that there's probably an operational cost now because we have a more complex environment of caching services and multiple data centers and network overlays, and then the staff to manage all of that.
Ted Turner: The worst thing is trying to troubleshoot all this. Where is my problem occurring? Am I not getting the data from the database, doing the inventory management? Is my credit card transaction not happening? Is the application performance, I called out DNS earlier, is it just a problem locally resolving DNS or is it remotely resolving DNS? You have to be able to see all of these things. The troubleshooting of these things becomes nightmarish.
Phillip Gervasi: Yeah, so nightmarish sounds scary to me. Here's the thing, I'm going to be contrarian here. So then are we really solving anything? I mean, data gravity is a problem, I get it, everything's there. We addressed the potential issues, performance, reliability, cost. I feel like we're just talking about moving the performance and reliability problems closer to the edge, but they're still there. I mean, if I have a significantly more complex network, network and backend, fine, that is inherently going to be less reliable because you have more potential for problems, right?
Ted Turner: Yes. Murphy jumps in everywhere you put in a new additional component. This is where that concept of observability came from, and the application guys started saying, " Hey, we need metrics, logs, and traces from everything, everywhere put down someplace so I can start to figure out where things are at." Kentik, I love being here because we do the network observability portion of this, adding in underneath what the application teams were doing to try and understand what that edge site looks like, what that data center, what that cloud site looks like, threading together all of these pieces so that you can start to get those diagnostics.
Phillip Gervasi: Yeah, I feel like what we're doing is over the past 10 years, 12 years is we're trading one set of problems for another, but the new set of problems, they're the same, but they're different. And so because we had everything consolidated in one big data center, and let's say I had a pharmaceutical company with 10,000 employees, and everybody's going into data center A to do their work, and maybe there's a backup data center, fine. And that's starting to change over the past decade to solve the problem of performance and reliability. But we're moving the problem of performance and reliability down closer to the edge, but it's alongside the improvements that we're gaining. So we're never really eliminating those problems, we are just accommodating those problems, so they kind of work with the new system that we have. And like you said, we have observability to help address that, where we're gathering metrics from all sorts of different devices. Where back in the day it was like PRTG and I'm collecting SNMP, and that's basically all I looked at. Whereas today, we're looking at everything because who knows where that application is flowing through. It's not like server to client. It's server through a billion devices and services and clouds, and then finally to my tablet. And so I need more data to be able to figure out what's going on. But that's not inherently bad because the alternative is to leave everything sitting in that data center and nothing would work well.
Ted Turner: You got it, and at some point the database falls over or you have a bad database upgrade or you have some security patch or fix. We did that one time. There was a massive security breach in the database, we had to go patch everything and performance dropped in half. Now we're more secure, but now we have to figure out how to engineer everything to not hammer that database because it simply can't handle the amount of traffic coming in because of that simple security patch.
Phillip Gervasi: Right. And then distributing those databases or that database among several databases that are geo-located will solve that, great, but then it creates other problems. So it's just this constant balancing act and this constant tension between trying to solve one problem and inheriting some new problems. Maybe the new problems are better, or maybe you're worse off. So is there ever a reason to just say, "Hey, this problem of data gravity, not really a problem in our scenario, we're going to stay in this single or active standby data center, and we're not moving all our resources to AWS, or maybe, maybe we're going to be very selective about what we're going to move into AWS," and hence we have the hybrid-
Ted Turner: I think this is why we call it engineering, because it's always a trade-off. There's cost, there's performance, there's reliability. Pick two. So, if you pick reliability and performance, your costs are going to go up. If you're going to drive your costs down, one of the other two is going to have to suffer. So that pick-two concept comes up in many different places, but it boils down to: it's engineering. You have choices on how you want to run your business. And for smaller businesses that don't have high traffic volumes or don't have high margins, you can push those costs down, but you might have more latency, you might have less reliability.
Phillip Gervasi: What does the term escape velocity mean? I read that in one of your blog posts, and I couldn't figure out what you meant by that.
Ted Turner: So if you think of gravity pulling everything down to the planet, planet earth, when you send a rocket up into space, to get that rocket so that gravity is not impacting it anymore, you have to have enough thrust. You have to have that escape velocity for that rocket to get out of the gravity well here. So how do you make that available for your applications? How do you get your applications and your data out of that gravity well, so that you can go travel the universe, go throughout the solar system here? So how do you start to build those pieces, how do you engineer it so that you're not stuck in your gravity well?
Phillip Gervasi: Okay, so escape velocity is really the tools and processes and methods that we're using to basically fight against data gravity. So we're caching data, we have CDNs, we are doing all these things within a large enterprise that has the resources and capability to do that. But how am I going to do that if I'm utilizing cloud resources, what can I do there?
Ted Turner: So one of the techniques comes from the database provider Snowflake. Instead of keeping everything in memory, in a Microsoft SQL database or an Oracle database, they're leaving everything in storage. They're dumping everything into an S3 bucket and using ELT, extract, load, transform: pulling whatever you need out of the S3 bucket, loading it into fast memory, and then transforming it and making it available to the applications, to the users. And so it's just an order of operations on where things are, but you're trying to avoid loading things into memory if you don't need them.
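A minimal sketch of that pattern, keeping the data in object storage and pulling only what you need into memory, assuming boto3 and a hypothetical bucket, key, and CSV layout (this illustrates the general approach, not Snowflake's internals):

```python
# Keep the data in object storage; extract one object, load it into memory,
# transform it, and hand the result to the application.
import csv
import io
import boto3

s3 = boto3.client("s3")

def load_daily_transactions(bucket: str = "example-data-lake",
                            key: str = "transactions/2024-01-01.csv") -> list[dict]:
    obj = s3.get_object(Bucket=bucket, Key=key)                              # extract only the object you need
    rows = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))   # load it into memory
    return [r for r in rows if float(r["amount"]) > 0]                       # transform: filter/shape for the app
```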
Phillip Gervasi: That makes sense. So it's really the same philosophy; even if you were doing everything on-prem with your own resources, it's really the same thing. So Ted, this has been a really great podcast today, really great episode. It's great to have you on again and talk about data gravity. We talked about data gravity, we talked about data mass, we talked about escape velocity. Very interesting stuff. So ultimately, if folks have a question for you about data gravity, about what you do with cloud, how can folks reach out to you online?
Ted Turner: I'm ted@kentik.com. I am also on Twitter, TedTurnerInCal, and I'm on LinkedIn.
Phillip Gervasi: Great. Thanks very much. And you can find me on Twitter @network_phil. I'm still active there. You can search my name on LinkedIn, find me all over the place online these days. If you have an idea for an episode of Telemetry Now, we'd love to hear from you. Email us at telemetrynow@kentik.com, or if you'd like to be a guest on an episode, we'd still love to hear from you. Until next time, thanks very much. Bye-bye.
Do you dread forgetting to use the “add” command on a trunk port? Do you grit your teeth when the coffee maker isn't working, and everyone says, “It’s the network’s fault?” Do you like to blame DNS for everything because you know deep down, in the bottom of your heart, it probably is DNS?
Well, you're in the right place! Telemetry Now is the podcast for you!
Tune in and let the packets wash over you as host Phil Gervasi and his expert guests talk networking, network engineering and related careers, emerging technologies, and more.