Phillip Gervasi: What do planets and stars and galaxies have to do with data, databases and even data centers? Well, just like when large amounts of mass begin to clump together out in space, and as that mass grows in volume it only attracts more mass until you end up with an entire planet, data kind of works the same way. As organizations collect and organize and store more and more data, as they grow or their needs change over time, what happens is an exponential growth of data coalescing in one place. Databases grow, sometimes exponentially, and so do the data centers and secondary applications and services needed to store, manipulate and access that data. This is what's called data gravity, and it's a phenomenon that technologists have had to contend with for years, including in more recent years with public cloud, containerized services and, I guess you could say, our almost wholesale reliance on applications. Now, with me again today is Ted Turner, a subject matter expert in cloud technology and a solutions architect with Kentik. And in this episode we'll be discussing why this happens, what challenges it creates, and how we can solve them. I'm Phillip Gervasi, and you're listening to Telemetry Now. Ted, it's great to have you back again. I'm really looking forward to this conversation today. I read your blog posts on data gravity, and then of course I went and Googled it as well because I wanted to learn more, and, well, it's a really interesting topic. But before we get into it, would you mind just introducing yourself to our audience again, just in case they're not familiar?
Ted Turner: Hey, I'm Ted Turner, Cloud Solutions Architect over at Kentik. I've been doing data centers and networking and the internet stuff for decades now.
Phillip Gervasi: And when you say cloud and internet and all that kind of stuff, did you start off in the networking space and then move to cloud or did you start off in, I guess purely data center space? Where are your roots?
Ted Turner: The roots come from small business, working my way up through medium-sized business and then helping out large organizations. The last major gigs I've done have been SaaS providers, providing their wares on the cloud for their customers.
Phillip Gervasi: Okay, yeah, me too. I certainly worked my way from small business, medium business, sometimes very small business, a dozen people, all the way to global enterprise service providers. So it's, I think, a really cool path to go personally because you get to see so much, you have to touch so many things, so it forces you to develop an understanding of how certain things work together. I didn't start my career in a silo, I had no choice but not to be in a silo, so I really think that's a cool way to go. Anyway, let's start off with defining data gravity. I have a definition in my head based on your blog posts and my own reading, but what does that mean?
Ted Turner: So, back in the day at the first cloud-scale provider I worked at, all the data ends up in a single data center, and so we had two data centers, one in San Diego, one in Plano. And at some point you have a very hard time replicating that data over into the alternate data center. The pipes just aren't big enough. You can't replicate fast enough, you don't have enough memory, you don't have enough bandwidth on the database to do that replication, and so you end up having to, quote, "ship tapes." In my head, that's where I started off with the concept of data gravity. You couldn't run and maintain two sites. So you have the concept of scaling up or scaling out, and that's where the cloud providers have come in. A lot of organizations are deciding to try to keep their data but move it away from that data gravity, to make sure that you can replicate it and do it in smaller chunks, smaller bite-sized pieces.
Phillip Gervasi: Yeah, sure. And I remember those days setting up DCIs between an active standby data center or maybe dual hub active data centers, whatever, and there's an entire level of complexity and requirements from a technical perspective on how to make that work. But what do you mean by gravity? I get it, so we have a large data center and things are consolidated there. Are you talking about physical location, data center real estate and everything being consolidated in one location?
Ted Turner: So it can be in one data center location, or one geography in the cloud at your cloud provider, once you try to move all of your data for your customers to another region. If you have a hurricane, if you have a massive power outage and all of a sudden your data center goes down, how do you keep your business up and running? The volume of data and the number of customers you have going through a single funnel to get access to that data limit you.
Phillip Gervasi: I see. Yeah.
Ted Turner: So things like content delivery networks for static content, for your newspapers, your Netflixes, they go generate the content one time and then they try to distribute it everywhere around the globe. That's one method of getting rid of the data gravity instead of pulling everybody back to one single point. But if you've got live data, banking, retail, credit card processing transactions, that's an atomic transaction. All the data's in one spot as much as it can be, so that you have financial transaction consistency.
Phillip Gervasi: Yeah, yeah. And you used a term there, well, it was a phrase, but pulling everybody back or pulling everything back. So there's that idea of gravity. So if you're in one single geographic or actual physical street address location, your data center is just getting bigger and bigger because of the nature of your growing business, which is awesome.
Ted Turner: It's success right there. You've defined success for your business.
Phillip Gervasi: But as a result of that success and your growing amount of data, so we're talking about physical racks of servers and network gear and all that stuff, what's going to happen is you have the network-adjacent and database-adjacent services and technology, sometimes physical, sometimes virtual, all on site, or at least all going in and out of that site to help facilitate the use of that data. And I remember doing a podcast with a friend of mine a while back, and we were talking about how, at the end of the day, what we're doing with networks and databases and data centers and all this stuff is really just helping human beings reach an application. And then he corrected me, this is my friend Tony, and he said, "It's not really helping people connect to an application. The application is actually helping people connect with the data." So even the applications are just a mechanism, a conduit, for me sitting here to look at my bank account information. And that really struck a chord with me. So when it comes down to it, as we rely more on applications for our day-to-day lives, so mundane things, but also stuff like banks and 911 services in my city, all that stuff relies on what's happening in data centers. All of it gets funneled more and more into one geographic location. Let's say, I live near the city of Albany, New York. Albany, I don't know if they have a single data center, I don't know what they do, but let's say they have a single data center downtown, and more and more stuff goes there, more and more stuff. So there's your idea of gravity, like a planet with greater mass attracting more satellites around it, and that causes problems. You mentioned a couple of problems. What are the problems? You mentioned not being able to replicate because of bandwidth constraints or constraints on your physical gear.
Ted Turner: On the database, the memory, the disk, stuff localized to the database, or the network. But as an example, like you were just asking, the first time I ran into this was with HP 3000 servers. We had four dedicated HP 3000 servers just for DNS, over-provisioned with memory. We were running an old version of Red Hat Linux, and it got constrained at the kernel and couldn't handle the DNS. So we had things load balanced with an F5 load balancer in front, and those boxes simply fell over. So we had to get in and start doing kernel tuning to handle the DNS volume and simply not drop DNS requests. If a DNS request doesn't go through, the application, the customer transactions, simply don't work.
Phillip Gervasi: Right, right. And then lifting and shifting your DNS services from on-prem to a cloud provider doesn't necessarily solve that, because then everybody just goes to that IP address, or two IP addresses if you're load balancing two, but it doesn't really-
Ted Turner: I found out that Route 53 will not respond to a DNS request shorter than 1.6 seconds or 1.8 seconds. So we had to set our unbound DNS caching to at least two seconds, because we had set it down to one second and we broke Amazon's DNS because we were just hammering it so hard. Now they protect themselves and they protect all of the rest of their customers by putting in those limits. But those are the types of things that you need to start figuring out.
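A minimal sketch of the kind of unbound.conf tuning Ted describes might look like the following. The two-second floor comes from his story; the other value is an illustrative assumption, not a recommendation from the episode:

```
# Sketch of a caching floor in unbound.conf so a chatty application
# can't hammer the upstream resolver on every request.
server:
    cache-min-ttl: 2        # don't expire cached answers faster than every 2 seconds
    cache-max-ttl: 86400    # keep a one-day ceiling on cached answers
```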
Phillip Gervasi: Yeah, yeah. And ultimately we're talking about some of those network-adjacent services. You mentioned DNS having performance problems, and those performance problems are going to directly impact the application's performance. I mean, if I'm trying to query DNS and there's a request and response with significant latency just to resolve an IP address, or all the different components on the webpage, that's really happening behind the scenes. You open the webpage, but there's a ton of other DNS requests going on, and that's going to make the application feel slow and my digital experience very poor as a result. And then you also mentioned the database architecture just being hammered, actual physical and virtual resources being hammered, and having to add more memory and compute and CPU. So I'm assuming there's also backend slowness as a result of everything being consolidated in one place, right?
Ted Turner: Yeah, I wrote a blog article when I first landed at Kentik. Latency stacks up: 10 milliseconds of latency at the backend database turns into 100 milliseconds at the application logic tier, which turns into 1,000 milliseconds by the time you get to the UI. So every tier that you have adds almost an order of magnitude of latency, and so very small pieces of latency that show up in the backend have very big impacts all the way throughout your infrastructure.
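A back-of-the-envelope sketch of why that happens, assuming each tier fans out roughly ten sequential calls to the tier below it (the fan-out counts here are illustrative, not from the episode):

```python
# How per-tier fan-out turns a small backend delay into a large end-to-end delay,
# matching Ted's 10 ms -> 100 ms -> 1,000 ms example.

def tier_latency(downstream_latency_ms: float, calls_per_request: int, local_work_ms: float = 0.0) -> float:
    """Latency a tier sees when it makes sequential calls to the tier below it."""
    return local_work_ms + calls_per_request * downstream_latency_ms

db_ms = 10.0                                            # one database query
app_ms = tier_latency(db_ms, calls_per_request=10)      # app logic issues ~10 queries per request
ui_ms = tier_latency(app_ms, calls_per_request=10)      # UI/API layer issues ~10 app calls per page

print(f"database: {db_ms:.0f} ms, app tier: {app_ms:.0f} ms, UI: {ui_ms:.0f} ms")
# database: 10 ms, app tier: 100 ms, UI: 1000 ms
```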
Phillip Gervasi: Right, right. So ultimately it is not... At Kentik we're very much focused on the network, I get it. And so we look at network latency and what is the round trip time and is there packet loss somewhere in the path? I get that. But there's a significant amount of potential for latency, which is the new outage, like the saying goes, that occurs as a result of just some old server taking a long time to respond because it's being hammered, and then tracking that down, I'm assuming is going to be a problem as well. So we're talking about performance problems as a result of just our services, our devices being overwhelmed, I get that because everything is consolidated into one place. But I have to assume then that also plays into reliability. You mentioned replication, and that's all about reliability. We're replicating our data center to another data center, probably a backup DC or a standby. So there's issues with reliability as well.
Ted Turner: You can start to replicate and have one read-write database and then have one or more read-only caches of that database, and that can be locally within that same availability zone, within one cloud provider, or within one data center that you manage. You can also start to make a read-only replica available in multiple availability zones. So now you're not only talking about mitigating the data gravity, but also the larger world around you; fault tolerance capabilities start coming together across all of that. And then you can also take that database and cache that content over in another region, so you're in US West and US East, and all of these things start to come together. A lot of organizations are starting to throw Redis in front of the database, handling all of that content and making it quickly available to an API or the application tier. And so there are several different ways you can start to look at that reliability aspect: making sure the data's in multiple places, and then potentially caching tiers to make sure that those databases are not hit so hard.
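A minimal cache-aside sketch of that Redis-in-front-of-the-database pattern, assuming the redis-py client and a placeholder query_replica() function standing in for a read-only replica query (hostnames and key names are illustrative):

```python
# Cache-aside read path: try a nearby Redis cache first, fall back to a
# read-only database replica on a miss, so the primary is hit less often.
import json
import redis

cache = redis.Redis(host="redis.local", port=6379, decode_responses=True)

def query_replica(account_id: str) -> dict:
    # Placeholder for a query against a read-only database replica.
    return {"account_id": account_id, "balance": 123.45}

def get_account(account_id: str, ttl_seconds: int = 30) -> dict:
    key = f"account:{account_id}"
    cached = cache.get(key)
    if cached is not None:                      # cache hit: the database never sees this read
        return json.loads(cached)
    row = query_replica(account_id)             # cache miss: hit the replica, not the primary
    cache.setex(key, ttl_seconds, json.dumps(row))
    return row
```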
Phillip Gervasi: Yeah. And you're speaking to the solution or at least one of the solutions here, but to just go back, ultimately we're talking about monolith data centers attracting more data as a result of them being located in one place and the growth of a business and organization. And the problem that results is reliability issues, maybe because of actual geographic, like weather stuff, weather emergencies or performance issues as a result of everybody in your organization coming into one data center. And I have to assume, we haven't mentioned it yet, the third issue has to be cost because if I am having my entire organization of thousands and thousands of end users coming in over two links or four links coming into my DC or two DCs, I need massive pipes. I need very expensive, high-quality connectivity to the internet or maybe private circuits out to other locations, like I did years ago when we were using SD-WAN.
Ted Turner: Internet exchanges, yeah.
Phillip Gervasi: Yeah, yeah.
Ted Turner: All of the above.
Phillip Gervasi: So reliability, latency, performance and cost are really our main problems that we have as data gravity continues to grow, and we develop this gigantic data center of services and resources all in one place.
Ted Turner: And that makes the network underneath complex because you have to make sure that those databases have a clear, clean path to replicate with each other, and you're usually going to make that happen someplace different than that front end application that's talking to the database.
Phillip Gervasi: Okay, then how do we solve this? My background in networking means that I'm primarily thinking about networking solutions, which I know is not all of it, there's much more there. But you already began to talk about what I think is one of the solutions, which is disaggregating your data in multiple data centers. Let's start with that.
Ted Turner: So with multiple data centers, you need to have that backend path, a clear, clean path, to make sure that whatever data replication is going on for your customers, that live atomic transaction is not impacted by a backup. When we first started doing these things, backups started taking a long time, and instead of ending at 5:00 AM, they were moving into the business day, 9:00 AM, 10:00 AM, 12:00 PM lunchtime, and all of a sudden the database can't keep up. So you need to make sure that you have services not interfering with each other. Database replication is one thing, going from site to site, and then there's the backup to tape back in the old days. You're making sure that these two things aren't happening at the same time, or that they have two separate pathways, or you're backing up a read-only database instead of the read-write database.
Phillip Gervasi: I mean, you're talking about multiple paths so that you don't overload any one individual path, so there's a networking component there. Scheduling your backup jobs and replication and all of that among your multiple data centers. That all makes sense, but that sort of speaks to the performance issue, doesn't it? That way there's plenty of bandwidth, I don't have any overutilized devices, and replication is happening perhaps at a different time than when the bulk of my users are active, so there's no contention. Isn't that what we're talking about, more of a performance thing here?
Ted Turner: But it's also going towards that reliability, that fault tolerance. And so if you can get it to be geographically distributed, most organizations will start off with kind of an A/B path, so they'll have the applications maybe in multiple geographies but calling on one database, and then, like you called out, that cost to replicate everything is expensive. So at some level of tiering, it's worth it to have the data available for your customers: that performance, that cost of not having the data when things go down. We were running in Amazon, we had our database on the backend, an Oracle database provisioned by Amazon, and poof, everything disappeared. Amazon brought everything back two minutes later, but everything queued up, and we had 10,000 transactions that we lost atomically. There were millions of transactions that day, but we had to go in and find those 10,000 transactions that, poof, went missing, and go notify the customers: "Hey, we had an anomaly within our data center processing. Please go check your data and validate that these transactions that were in process took place."
Phillip Gervasi: Just lifting and shifting everything from my private data center into AWS isn't necessarily going to solve everything then, especially if those are not cloud native applications, not written that way. So I'm basically just moving the problem from me to AWS if I am still a growing organization with a number of users. However, however, AWS on the backend is distributing that data across their regions and across their own infrastructure, right?
Ted Turner: If you select the check boxes, and when you select those check boxes, you incur the costs.
Phillip Gervasi: Okay. So it's always a cost constraint. So the more... And that's why we used to say, I remember talking to my own customers about fault tolerance and resilience and all that, and it's basically the more you get close to that 100% resilient, it's kind of an exponential growth in cost to get that one more 1% or fraction of a percent of reliability. And it's the same with looking at data reliability here, right?
Ted Turner: You got it. Three nines to four nines to five nines is $10 versus $100 versus $1,000.
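For context, each extra nine buys you roughly a tenth of the allowed downtime, which is where that step-up in cost comes from. A quick sketch of the arithmetic (the dollar figures above are Ted's illustration, not computed here):

```python
# Allowed downtime per year for each availability target ("nines").
HOURS_PER_YEAR = 365.25 * 24

for label, availability in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    downtime_minutes = (1 - availability) * HOURS_PER_YEAR * 60
    print(f"{label} ({availability:.3%} uptime): ~{downtime_minutes:.0f} minutes of downtime per year")

# three nines  -> ~526 minutes (~8.8 hours) per year
# four nines   -> ~53 minutes per year
# five nines   -> ~5 minutes per year
```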
Phillip Gervasi: Yeah. But what about distributing the applications themselves then? So we're talking about replicating database A, so it lives in multiple places and people can access them, and our resources are less hammered, fine. But what about distributing those applications so that way you don't have single processes entirely living in one location?
Ted Turner: So you can distribute. There are new technologies, dockerization or Kubernetes, for running the application around the planet and making sure that that application experience happens. You can put a small Redis cache out there closer to the edge; these are becoming the new concepts, having a small Redis cache at the edge and your application being served at the edge. The Redis cache will go ask the database for the most relevant data, write it all into that local memory cache, and then make it available for that customer transaction, that last 10 milliseconds, the last mile. And then when the customer writes, they'll write back to the Redis cache, so they still have access to the local data, and then Redis will handle sending it all the way back to the backend database.
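A minimal sketch of that write path at the edge, under the assumption that the application (rather than Redis itself) queues writes in a Redis list and a background worker ships them to the central database; the hostnames, key names and send_to_backend() callback are hypothetical placeholders:

```python
# Write-behind at the edge: the customer's write lands in the local Redis cache
# immediately, and a background worker drains a queue back to the central database.
# This is application logic layered on Redis, not built-in Redis behavior.
import json
import redis

edge = redis.Redis(host="edge-redis.local", decode_responses=True)

def write_at_edge(order_id: str, payload: dict) -> None:
    edge.set(f"order:{order_id}", json.dumps(payload))                      # local read-your-writes
    edge.rpush("pending_writes", json.dumps({"id": order_id, **payload}))   # queue for the backend

def drain_to_backend(send_to_backend) -> None:
    """Background worker: forward queued writes to the central database."""
    while True:
        item = edge.lpop("pending_writes")
        if item is None:
            break
        send_to_backend(json.loads(item))
```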
Phillip Gervasi: Okay. So one of the ways that we can solve this... That sounds very reminiscent to how CDNs work, by the way.
Ted Turner: This is an extension of CDN concepts.
Phillip Gervasi: Right, okay, so caching data. Back in the day, we also used to talk about things like WAN optimization, where there is a caching component in the local appliance, so it's a similar idea. Is that the primary way we solve the problem of data gravity? Obviously we talked about distributing your databases among multiple data centers, whether they be on-prem or in multiple cloud instances, fine, but now we're talking about caching. Is that going to be another way we solve this?
Ted Turner: You got it. It's just one more element to taking that single set of data, as that data gets big, you bring everything into it and you're just trying to take small pieces and move them closer to wherever you need them.
Phillip Gervasi: And wherever you need them is wherever people are. I mean, when it comes down to it, it's me accessing, let's say, Microsoft 365 applications, whatever productivity tools, and there's going to be a location geographically close to me in the Northeast that I'm going to be redirected to, so that my performance is better. And on the back end, I'm sure Microsoft has an entire network where they work all that out.
Ted Turner: I don't know if you've noticed, but when you walk through the airport, they've got these machines, vending machines, and you can get your latest iPhone accessories or Android accessories, whatever got damaged or destroyed in transit, in travel, you go to a mall, they've got stores, and then they have these small kiosks and you swipe. Having that transaction, there's inventory management for that kiosk or that vending machine, plus there's the business transaction, the credit card processing, all of these are happening at the edge, taking care of what you just called out, people.
Phillip Gervasi: Now that solves the latency problem, but it doesn't necessarily... I mean, I guess it also solves the servers on the backend being hammered as well, because you are caching that data locally or at least geographically nearby. And so you are not making the same number of requests to that backend database, is that correct?
Ted Turner: You got it.
Phillip Gervasi: All right. So you're solving several problems here. You're solving the latency problem, which is...
Ted Turner: Performance.
Phillip Gervasi: Yeah, it's the main contributor to performance degradation. You're also addressing to an extent your backend services being hammered and overutilized, which again is going to affect performance, but also reliability. And then also distributing that data on the backend among multiple data centers, cloud or otherwise, containerized or otherwise, will protect you from natural disasters and give you that fault tolerance. Man, Ted, this sounds really expensive though. Really expensive. I mean, we were talking about cost being a problem when I'm all in one data center. This sounds way more expensive.
Ted Turner: There's a level of cost benefit. So you've got reliability and performance, and at a certain point you can add more to your costs. When you're going and buying something at the airport at that kiosk, it's going to cost more than ordering it and waiting for it to show up two days later on Amazon, or a week, or two weeks, whatever that is; there's a time delay. So how fast, how reliable do you need it to be? There is a markup that we see in the marketplace today to make these things happen and be quickly available. There is a cost, yes.
Phillip Gervasi: Yeah. So it sounds like in an attempt to solve these other technical problems caused by data gravity, performance problems, reliability problems, we're not going to solve the cost problem. I mean, the cost problems that we had with having all our data in one place, I mean, we're really just shifting that cost to somewhere else in our environment, possibly even incurring additional cost, greater cost. Not to mention that there's probably an operational cost now because we have a more complex environment of caching services and multiple data centers and network overlays, and then the staff to manage all of that.
Ted Turner: The worst thing is trying to troubleshoot all this. Where is my problem occurring? Am I not getting the data from the database, doing the inventory management? Is my credit card transaction not happening? Is the application performance, I called out DNS earlier, is it just a problem locally resolving DNS or is it remotely resolving DNS? You have to be able to see all of these things. The troubleshooting of these things becomes nightmarish.
Phillip Gervasi: Yeah, so nightmarish sounds scary to me. Here's the thing, I'm going to be contrarian here. So then are we really solving anything? I mean, data gravity is a problem, I get it, everything's there. We addressed the potential issues, performance, reliability, cost. I feel like we're just talking about moving the performance and reliability problems closer to the edge, but they're still there. I mean, if I have a significantly more complex network, network and backend, fine, that is inherently going to be less reliable because you have more potential for problems, right?
Ted Turner: Yes. Murphy jumps in everywhere you put in a new additional component. This is where that concept of observability came from, and the application guys started saying, " Hey, we need metrics, logs, and traces from everything, everywhere put down someplace so I can start to figure out where things are at." Kentik, I love being here because we do the network observability portion of this, adding in underneath what the application teams were doing to try and understand what that edge site looks like, what that data center, what that cloud site looks like, threading together all of these pieces so that you can start to get those diagnostics.
Phillip Gervasi: Yeah, I feel like what we're doing is over the past 10 years, 12 years is we're trading one set of problems for another, but the new set of problems, they're the same, but they're different. And so because we had everything consolidated in one big data center, and let's say I had a pharmaceutical company with 10,000 employees, and everybody's going into data center A to do their work, and maybe there's a backup data center, fine. And that's starting to change over the past decade to solve the problem of performance and reliability. But we're moving the problem of performance and reliability down closer to the edge, but it's alongside the improvements that we're gaining. So we're never really eliminating those problems, we are just accommodating those problems, so they kind of work with the new system that we have. And like you said, we have observability to help address that, where we're gathering metrics from all sorts of different devices. Where back in the day it was like PRTG and I'm collecting SNMP, and that's basically all I looked at. Whereas today, we're looking at everything because who knows where that application is flowing through. It's not like server to client. It's server through a billion devices and services and clouds, and then finally to my tablet. And so I need more data to be able to figure out what's going on. But that's not inherently bad because the alternative is to leave everything sitting in that data center and nothing would work well.
Ted Turner: You got it, and at some point the database falls over or you have a bad database upgrade or you have some security patch or fix. We did that one time. There was a massive security breach in the database, we had to go patch everything and performance dropped in half. Now we're more secure, but now we have to figure out how to engineer everything to not hammer that database because it simply can't handle the amount of traffic coming in because of that simple security patch.
Phillip Gervasi: Right. And then distributing those databases or that database among several databases that are geo-located will solve that, great, but then it creates other problems. So it's just this constant balancing act and this constant tension between trying to solve one problem and inheriting some new problems. Maybe the new problems are better, or maybe you're worse off. So is there ever a reason to just say, "Hey, this problem of data gravity, not really a problem in our scenario, we're going to stay in this single or active standby data center, and we're not moving all our resources to AWS, or maybe, maybe we're going to be very selective about what we're going to move into AWS," and hence we have the hybrid-
Ted Turner: I think this is why we call it engineering, because it's always a trade-off. There's cost, there's performance, there's reliability. Pick two. So, if you pick reliability and performance, your costs are going to go up. If you're going to drive your costs down, one of the other two is going to have to suffer. So that pick-two concept comes up in many different places, but it boils down to: it's engineering. You have choices on how you want to run your business. And for smaller businesses that don't have high traffic volumes or don't have high margins, you can push those costs down, but you might have more latency, you might have less reliability.
Phillip Gervasi: What does the term escape velocity mean? I read that in one of your blog posts, and I couldn't figure out what you meant by that.
Ted Turner: So if you think of gravity pulling everything down to the planet, planet earth, when you send a rocket up into space, to get that rocket so that gravity is not impacting it anymore, you have to have enough thrust. You have to have that escape velocity for that rocket to get out of the gravity well here. So how do you make that available for your applications? How do you get your applications and your data out of that gravity well, so that you can go travel the universe, go throughout the solar system here? So how do you start to build those pieces, how do you engineer it so that you're not stuck in your gravity well?
Phillip Gervasi: Okay, so escape velocity is really the tools and processes and methods that we're using to basically fight against data gravity. So we're caching data, we have CDNs, we are doing all these things within a large enterprise that has the resources and capability to do that. But how am I going to do that if I'm utilizing cloud resources, what can I do there?
Ted Turner: So one of the techniques comes from the database provider Snowflake. Instead of keeping everything in memory, in a Microsoft SQL database or an Oracle database, they're leaving everything in storage. They're dumping everything into an S3 bucket and using ELT, extract, load, transform: pulling whatever you need out of the S3 bucket, loading it into fast memory, and then transforming it and making it available to the applications, to the users. And so it's just an order of operations on where things are, but you're trying to avoid loading things into memory if you don't need them.
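A minimal sketch of that pattern, keeping the data in object storage and pulling only what you need into memory, assuming boto3 and a hypothetical bucket, key, and CSV layout (this illustrates the general approach, not Snowflake's internals):

```python
# Keep the data in object storage; extract one object, load it into memory,
# transform it, and hand the result to the application.
import csv
import io
import boto3

s3 = boto3.client("s3")

def load_daily_transactions(bucket: str = "example-data-lake",
                            key: str = "transactions/2024-01-01.csv") -> list[dict]:
    obj = s3.get_object(Bucket=bucket, Key=key)                              # extract only the object you need
    rows = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))   # load it into memory
    return [r for r in rows if float(r["amount"]) > 0]                       # transform: filter/shape for the app
```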
Phillip Gervasi: That makes sense. So it's really the same philosophy; even if you were doing everything on-prem with your own resources, it's really the same thing. So Ted, this has been a really great podcast today, really great episode. It's great to have you on again and talk about data gravity. We talked about data gravity, we talked about data mass, we talked about escape velocity. Very interesting stuff. So ultimately, if folks have a question for you about data gravity, about what you do with cloud, how can folks reach out to you online?
Ted Turner: I'm ted@kentik.com. I am also on Twitter, TedTurnerInCal, and I'm on LinkedIn.
Phillip Gervasi: Great. Thanks very much. And you can find me on Twitter @network_phil. I'm still active there. You can search my name on LinkedIn, find me all over the place online these days. If you have an idea for an episode of Telemetry Now, we'd love to hear from you. Email us at telemetrynow@kentik.com, or if you'd like to be a guest on an episode, we'd still love to hear from you. Until next time, thanks very much. Bye-bye.
Do you dread forgetting to use the “add” command on a trunk port? Do you grit your teeth when the coffee maker isn't working, and everyone says, “It’s the network’s fault?” Do you like to blame DNS for everything because you know deep down, in the bottom of your heart, it probably is DNS?
Well, you're in the right place! Telemetry Now is the podcast for you!
Tune in and let the packets wash over you as host Phil Gervasi and his expert guests talk networking, network engineering and related careers, emerging technologies, and more.