Telemetry Now  |  Season 2 - Episode 1  |  April 11, 2024

Demystifying Generative AI and Large Language Models with Ryan Booth

In this episode, Ryan Booth joins Telemetry Now to discuss Generative AI, Natural Language Processing, and Large Language Models. We dive into the history, components, and mechanisms of how LLMs work, how we can deal with hallucinations, and how AI can help improve IT operations.

Transcript

Generative AI and large language models in particular have absolutely become popular topics to discuss in every podcast, every blog post, all the news stations, even in mainstream popular culture, and there's a reason for that. Gen AI, large language models or LLMs, they're changing the way that people access information.

And for the tech industry, for our industry, the way that we interact with data.

Now, Telemetry Now is predominantly a technical podcast. So with me today is Ryan Booth, a CCIE turned software developer who's worked for years now very closely with machine learning and now in the generative AI space.

So we're gonna get a little deeper into what LLMs are and how they work today, how they developed, why we're talking about them right now, and how we deal with things like hallucinations. So lots of good stuff in today's episode.

My name is Philip Gervasi, and this is Telemetry Now.

Ryan, thank you so much for joining. This is the second time you've been on the podcast. Not the second time that you and I have ever talked, though. I mean, we're always chatting on the CIDR, and we've been on podcasts together.

We've seen each other at various events. Just recently, I saw you in Dallas at the Texas Network User Group. So, one of the things that I appreciate about you, though, before we get started is that, first, you have a serious networking background. You're a CCIE, and you worked in industry for a long time.

And then you transitioned into software development, and you've been doing that for almost a decade. And, in the networking space, but specifically focused on, machine learning more than anything else and now the generative AI space. So I really appreciate that about you, your experience, what you bring from actual production and industry. So, welcome again.

Yeah. No. Absolutely. It's good to be here. You know, over the past few years that we've been talking and having conversations back and forth, our knowledge of the whole situation keeps going up and the industry keeps changing. So it's good to continuously have these conversations to see where everybody's mind's at.

There are two terms that I wanna highlight and discuss before we get into talking about algorithms and models and that sort of thing. And those two terms are generative AI and general AI. I hear them used almost interchangeably incorrectly.

So I'd like to level set with you first. What do those terms mean, and how do they differ?

Yeah. Absolutely. So I agree with you that, you know, you've got these buzzwords and all these terms being thrown around more and more, and it's kinda starting to muddy the waters. Everybody's kinda like, well, what's this versus that? It all seems like AI.

You know, generative AI is one that's kind of been around for a while. It's picked up a ton of steam since we've seen the introduction of transformers and LLMs, because it really plays into that space. And that basically comes down to the AI, or a specific model, generating output for you, generating more than just a Boolean yes/no answer or just general text. It's generating a conversation with you. It's generating an image.

Anything like that you're going after is what generative AI is doing there.

Yeah. Exactly. So generative AI, like you said, and the name suggests, is all about generating some sort of new information. You said an output, and I'm gonna stick with information because specifically in the case of large language models, you have a giant dataset, of text usually.

And so a large language model, generative AI, is gonna look at that body of text and then generate something new, some new information. And that might be as mundane as summarizing a text, or it could be generating some code or a snippet of code, or, like a lot of people did with ChatGPT when it first came out, generating a poem, right, based on the underlying body of text of all the poetry that exists and that it can find on the Internet. Right? And that's different than general AI, which is much more about the data processing Right. Data analysis, clustering, finding patterns and correlation in the data, perhaps then performing some sort of action on the correlation that it finds, but ultimately not generating some sort of new information. It's more concerned with data analysis.

Right.

And so under the term AI, we have all the technical mechanisms. We have various algorithms in, you know, statistics. We have machine learning. We have all these things that we're gonna talk about today that are actually the mechanisms that allow us to do some sort of artificial intelligence. But AI itself is a very broad term that refers to this entire high level concept of trying to have a machine think like a person.

Right. Yeah. Yeah. Absolutely. And I think it's key there that it's not just generating the same thing a human would, but it's also exploring areas that humans aren't necessarily good at. That machinery or logic that we use, what we call deep learning or AI, can actually solve problems that we can't. And that's kind of a, you know, high end goal for the entire industry, to start to outperform a human.

And, you know, that makes me wonder. I'd like your opinion on this. Do you think that's why the industry, I mean, the tech industry, our industry, but really all industries, the world, is just so interested in generative AI right now? AI in general, but specifically what we're doing with large language models like GPT by OpenAI, and Meta has Llama, and Google has PaLM. What do you think?

I think that's where it's always been. And looking back as far as I can, and reading articles and talking to people that have been around the industry for a while, I think that's where we wanna be. What we wanna do is ease the job of humans and offset it with machines or intelligence that can take care of it. It's just a matter of, you know, it's taken this long to get somewhere where we can get some serious traction. Mhmm.

Yeah. And and when you say this long, it's taken this long, really, we're talking about decades, like, many decades.

Where we are today in the twenty twenties, twenty twenty-four and beyond, really began right after World War Two with the advent of what, I guess, we call modern computers. You know, I'm talking about vacuum tubes and things like that in the late nineteen forties, but nevertheless, the era of modern computers as we know it today. And that's where we can really start to trace the very beginnings of not just artificial intelligence on a broad level, but natural language processing and then eventually large language models. So rather than go into the history even deeper, let's talk about that high level concept of natural language processing. We mentioned it a few times, and it is the umbrella term that covers large language models. So, Ryan, what is natural language processing?

Yeah. Yeah. So natural language processing is exactly what it sounds like: being able to listen to natural conversation and translate that into text, into a transcript, into even, like, a SQL search or a Google search. It's being able to translate what's given and understand the meaning behind what they're asking.

The flip side of that is what most of us interact with, or what we see as the magic, and that's the output. So, you know, having a system that is able to take, like, a search result, or take a body of text and read through it, understand the full context, and then provide some sort of output in natural language. Whereas before, if we wanted to interact with databases, we had to know how to do SQL queries or how to write schemas and input data. Same thing with networks.

We're at the CLI. We gotta tell it to configure the network exactly how we want it.

Whereas with natural language processing, it's more like a conversation, more like what you and I are doing right now, and that's what we're seeing in all the tools out there. And so that's basically how it all lays out.

Yeah. Yeah. And the thing is that programming languages, if you think about it, if you're very, very familiar with a particular programming language, even if it's something as crazy as assembly language, it is very natural for that human being. And so in that sense, you do have this language that you're using to speak with the computer and then ultimately with the data that's behind it.

And so back in the fifties and the sixties, and then when you have modern programming languages like Python, for example, those are languages. That's a natural way to interact with the computer and with data. But there isn't that deeper understanding that we seem to want and that we're trying to go for with modern large language models, in the sense that it sort of understands the nuance of language. It's not just breaking down the syntax that you put in at the command line, but it's understanding the nuance of your question, of your prompt, and also the nuance and even the inconsistency, because people are not consistent, right, human beings, of all the text that it's using as its database to answer your question.

So, Ryan, when you put something into ChatGPT, it's understanding the nuance and the weirdness of your question, but it's also understanding the nuance and the weirdness of all of the data, the text data, that it's using to answer your question. So it's Right. It's beyond those early days of whatever kinda, like, you know, simplistic form of programming language that we used to interact with the computer.

Well, it's early days. Yeah. Correct. But it's also something that's been widely adopted and used all the way up to this day, and we still see it every single day.

And what you discussed there about getting beyond just guessing what the next word is or guessing, you know, what comes next, understanding the full thing, is understanding the attention Mhmm. From the model. And that's the technical term that everybody uses because that's where it came from. The paper is Attention Is All You Need.

Right. Attention Is All You Need. That was a paper by Ashish Vaswani and others, a group, which, of course, we'll link in the show notes for you.

And, yeah, the idea of the attention mechanism and attention is why we are where we are today Right. With large language models. So we'll absolutely get into that in some greater depth later in the episode. And, you know, it's just interesting for me, especially because of my English teacher background from decades ago. It does beg the question of what knowledge is. Is knowledge for a human being simply the understanding of the rules of grammar and words and how patterns of words come together, or is it deeper than that?

And I wonder if the attention mechanism, if transformers, if the way that large language models work today in twenty twenty-four and presumably into the future, more closely aligns with that ideal of what natural language processing is all about.

Yeah. And it went even further once you got into the twenty-first century. Mhmm. If you start thinking about tools like, well, Google, when you start typing a search, it starts recommending your next few words.

Right.

That's a very basic, rudimentary deep learning or machine learning model. Right. Either like an LSTM or an RNN Mhmm. Or even a CNN, to be able to dictate or guess what the next words are.

And so that's where it's kinda been in the modern era. But where it kinda stopped with those models, and where the shortcomings came in, is just the fact that it could only guess what the next word was. And it only had a probability of what that word was, but it didn't understand. Right.

Like you mentioned, it didn't have the attention to understand the the context.

What should the next four sentences be? Mhmm. Or if there's any type of negative prompt, or anything where the users don't want a response to be a certain way, you know, they had those limitations. So that's kinda where we ended up before the jump into transformers.

And and you gotta remember for folks listening that this is all just, mathematical computations.

These are formulas and functions in which words are tokenized to a numerical value that's plugged into math, and then the output of that function is a value that's used to determine what the next most probable word or sequence of words is, how things relate to each other in a text. So, ultimately, as much as we say, wow, this is how we used to do things back in the day with n-gram models, and nobody uses that anymore, this is still incredibly sophisticated, amazing stuff that's been going on.

Oh, yeah. Yeah. Yeah. So we have, for example, the advent of the n-gram language model.

And the n-gram model is still probabilistic, meaning we're still using prediction. We're still predicting the next item in a sequence, which for large language models is words in a sentence. Right?

So the term n-gram refers to, you know, a sequence of n items, in air quotes, I say n, to represent a number of items. And then the model is built on the idea that the probability of a word in a particular text can be approximated by the context provided by the preceding words.

So an n-gram model is given a sequence of words, and it's gonna determine that the next word should be this because this is what the preceding words were. And as sophisticated as that sounds, and it is, of course, there are some inherent limitations. So if you think about the context window that the model is using to determine the probability of the next word, it's just n minus one words. I say just in air quotes because n could be a gigantic number. But really, all of the probability is based on the preceding words in a sequence. That's the entire context. And so that isn't necessarily enough to fully understand the meaning of a complete sentence, and it could produce an inaccurate result.

So the probability is based on a limited context window with n-gram models, which is why we don't necessarily use n-gram models for modern large language models today. And, you know, modern LLMs are gonna use neural network models, some of which we already mentioned with RNNs and LSTMs, but ultimately a multilayered approach, rather than the statistical methods and n-gram models that were used previously. But I do wanna get into that. Before we do, Ryan, you mentioned attention, the attention mechanism. You mentioned transformers several times, and I think it's important to discuss that simply because it's just so important to how we build large language models today.
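To make the n-gram idea concrete, here is a minimal bigram (n = 2) sketch in Python. The toy corpus and function names are made up for illustration; the point is that the only context the model ever sees is the preceding word, which is exactly the limitation described above.

```python
from collections import defaultdict, Counter

# A minimal bigram (n=2) language model: the next word is predicted
# from the single preceding word, using counts from a toy corpus.
corpus = "the network is down the network is slow the application is down".split()

# Count how often each word follows each preceding word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the most probable next word and its estimated probability."""
    counts = following[word]
    total = sum(counts.values())
    best, freq = counts.most_common(1)[0]
    return best, freq / total

print(predict_next("network"))  # ('is', 1.0)
print(predict_next("is"))       # ('down', 0.666...) -- no wider context is ever considered
```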

Yeah. So, an attention mechanism. So there's a paper out there that we just discussed. It's Attention Is All You Need.

If you haven't heard of it, however many times it's been linked, or read it, it's a decent read. It's a relatively heavy read, but it's basically what LLMs and modern models built on deep learning are utilizing to do everything we just described: to be able to fluently discuss and pull information together in an amazingly accurate way to present to the user. And it's a new schema to pass data through both an encoding and a decoding block, which basically takes the input and processes it to output, encode decode, and then also shares that information back into both sides so that they're able to feed off each other and learn as they're progressing as well.

So that's that's basically the cornerstone of of what an LLM is right now.

Yeah. I definitely encourage listeners to check out that paper.

It's a great way to get into the weeds of the attention mechanism, what it is and how it works from a technical perspective, as well as what it actually gives us.

So how it does that is interesting. You know, it's similar, if you think about it, to how we as human beings pay attention to different aspects of what we see or what we hear. And when we read a text, we're pulling in so much more information than just the preceding words that we read in that particular sentence.

And then what happens is the attention mechanism assigns different levels of importance to all those different words, or weights, right, to various parts of all of that data.

And, and then the attention mechanism can prioritize what what is important and then determine what the next word should be based on that, not necessarily just the sequence of preceding words.

So I have to assume that also helps solve the problem of long range dependencies, meaning, like, this is what the sequence of words says, and so this is the most likely next word. But that doesn't factor in, kinda like you said earlier, how we figure out what the next several words or the next four sentences should be, which is something you mentioned earlier.
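Here is a minimal NumPy sketch of the scaled dot-product attention described in the Vaswani et al. paper. The token vectors are random and the learned query/key/value projections are omitted, so this shows only the mechanics of weighting every token against every other token, not a trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: each output row is a weighted mix of the
    value vectors, where the weights say how much each token 'attends to'
    every other token."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query to every key
    weights = softmax(scores, axis=-1)   # per-token importance weights
    return weights @ V, weights

# Toy example: 4 tokens, each embedded in 8 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
# In a real transformer, Q, K, and V come from learned projections of X;
# here we reuse X directly just to show the mechanics.
output, weights = scaled_dot_product_attention(X, X, X)
print(weights.round(2))  # each row sums to 1: how much that token attends to the others
```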

Yeah. So, learning is usually breaking down how we as humans figure out a problem, especially even, like, a statistics problem, or even, you know, like, how many apples in this bowl are red and how many apples are green.

It's being able to take that and put it into a formula that a machine can crunch through and be able to come out with a reasonable answer.

And you start off with the very, very basics that have been around forever, and that's like logistic regression.

You know, those models where you can determine how many dots on this side of a plot graph are red and how many dots on that side are blue. That means that anything that's red is on this side, anything blue is on that side, however it goes about it.

But then the formulas and the finding out of what information, or the training, I'm sorry, is more what we're getting at. That takes deeper layers and more complex formulas that start figuring out where the correlations are. And so you start pumping through a bunch of results where, you know, it says, hey, here's ten dots on this side. Here's ten dots on that side.

This dividing line is where we divide the two. Red's over here. Blue's over here. And the more examples, positive examples, we pump through a model like that, the more it recognizes, when I see this type of pattern, this is what I'm seeing.

And so it's that positive reinforcement where, over time, it starts recognizing it. And as it pushes through the various layers, it's able to condense that down and come out with a proper answer.

Which sounds very similar to reinforcement learning in machine learning. Right?

Pretty much. Pretty much. Yeah. And, yeah, everything I just described is basically how machine learning, and then getting into deep learning, is set up. That's the foundation of it. With machine learning, it goes a little bit above that. It gets into a few more complex layers, but it's not too deep.

Getting into deep learning, you're starting to talk ten, twenty, thirty extra layers, if not more, to push the data through to really get into it. And then, obviously, with LLMs and getting into the much higher level stuff we see now, it's a much larger, complex workflow.

Alright. Well, you know what? Let's pause and demystify what deep learning is, and I'm gonna do that by actually explaining what neural networks are. So if you could picture in your mind's eye a neural network, where you have an input layer. Right? So you start from the left with an input layer. You input some data, some whatever, and that gets fed into its first layer, and we call all the layers in between the input and output hidden layers.

Each hidden layer is gonna have multiple nodes or neurons, at which point it's doing some sort of discrete mathematical computation.

And so as you feed that, result from each layer forward, we have what we call a feed forward network.

So at each point in that hidden layer, you're doing things at individual nodes like Right.

You know, like regression, or, like you mentioned, linear regression, Ryan. Or maybe it's a classification model that you're applying, like a decision tree or k-nearest neighbors.

I'm using ML terms here.

Maybe a clustering algorithm, like like k means clustering.

I can't think of all of the examples right now, but you can imagine that that's all happening. Yeah. Individual points within the layers of the neural network feeding forward to the next layer. So the next layer's input is the output of the previous layer, ultimately leading to the output of the neural network altogether being some sort of probability or some sort of classification or whatever it happens to be.

Now there is kind of a debate in the literature about how many layers you need to have deep learning, but that's all deep learning really is. It's a very deep, multilayered neural network, and there is inherent complexity in that, and computational intensity as far as the resources required and some of the algorithms that are being applied.

But it is it is in a forward direction, ultimately leading us to our answer.
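As a rough illustration of that feed-forward flow, here is a tiny network in NumPy with an input layer, two hidden layers, and an output layer. The layer sizes and random weights are arbitrary stand-ins; in a real model they would be learned.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(0, x)

# A tiny feed-forward network: input layer -> two hidden layers -> output.
layer_sizes = [4, 8, 8, 3]   # 4 inputs, two hidden layers of 8 nodes, 3 outputs
weights = [rng.normal(scale=0.5, size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x):
    """Feed the input forward: each layer's output is the next layer's input."""
    for i, (W, b) in enumerate(zip(weights, biases)):
        x = x @ W + b
        if i < len(weights) - 1:
            x = relu(x)          # nonlinearity at each hidden layer
    # a softmax on the final layer turns raw scores into probability-like outputs
    e = np.exp(x - x.max())
    return e / e.sum()

print(forward(np.array([0.2, -1.0, 0.5, 3.0])).round(3))
```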

Yep. Yep. And all of those layers can be relatively small, simple stuff. They don't have to be deeply complex. So some good examples: normalization is a key tool inside of any type of ML, AI, or deep learning. Normalization is basically for when your data is scattered and spread out in all sorts of different directions, because the more spread out and the more diverse it is, the harder it is to draw correlation.

And so what a normalization layer does is it takes whatever the output numbers from the previous layer, or something else, were, and it reduces them down to a number between zero and one. Okay. And that's just one example of using it, but I think that's the most common, or at least the one I've seen the most. And so you get everything down into a zero to one range, and then you can start seeing closer correlation.

And so doing that computation is not difficult. It's just a very simple math formula to reduce everything down to a value between zero and one, but that is a whole entire layer in this whole thing itself. And usually you combine that one with, like, three other layers, and that's a common component that you would use inside a model like this.
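A minimal sketch of the normalization idea Ryan describes, squashing a previous layer's outputs into the zero-to-one range. (Production frameworks more often use batch or layer normalization, which standardizes around a mean, but the intent is the same: keep the numbers flowing between layers on a comparable scale.)

```python
import numpy as np

def min_max_normalize(x):
    """Squash a set of values into the 0-to-1 range."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Pretend these are wildly scaled outputs from a previous layer.
raw_outputs = [3.0, 250.0, -17.0, 42.0, 0.5]
print(min_max_normalize(raw_outputs).round(3))
# [0.075 1.    0.    0.221 0.066]
```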

Yeah. So putting it in the context of large language models, what we have here is a very complex workflow where, you know, data is inputted. It has to be tokenized. So there's a layer where it's doing that kind of activity. The data itself is embedded into vector space, so there's a numerical value there that has to be normalized.

So that way we can do what a subsequent layer might do, like clustering or regression, to find patterns and to find Right. The probability of what the next word might be. Ultimately leading to, you know, more layers, and then the result.

And so therein we get this idea of an entire neural network and, because of the many layers, of deep learning. Now why did that happen, though? That's what I wanna know: what happened that caused the entire industry to shift, in academia and then in industry, from probabilistic models to transformers, attention, and doing generative AI this way?

The best I can gather, what really lit a fire under this was that we finally had a platform, and the AI and ML industry finally had oxygen to breathe where it had always been limited in the past, and it just took this long to get here.

The two key things that I hear people say time and time again on what helped trigger this: access to data is the key one. Just like you said a little bit ago, if data is king, it is absolutely king, and if you do not have enough data, you can't do something in this space reliably.

And the the twenty first century is what opened up the entire world to gobs of data, the Internet, YouTube, Reddit, Wikipedia, all the data you can get is out there and readily available now. So that's one.

The second one was that it is very computationally heavy to do some of these more advanced workloads.

And so we needed to be able to do it somewhere. We needed hardware that could handle larger workloads.

And GPUs are what it was. Once we figured out that tensors were processable very efficiently through GPUs, and then the crypto bubble hit and busted, now everybody has a bunch of GPUs laying around. Well, what are we gonna throw at it? Let's start throwing AI at it. That last part's kinda made up by me, and that's how I like to think about it. But, you know, at some point in time, the GPUs kinda outperformed the CPUs, and we could really start pushing the hardware to its limits where we couldn't in the past. And I really think that's what set it off.

Yeah. That's a that's a good point. The availability of just so much data with the advent of, you know, the Internet.

I mean, that's just a vast database that we didn't really have available, or at least easily available, prior. I mean, I guess you could say that in academia there were digital copies of data that people could use, but not to the extent and the scope that we have today. So that makes sense to me. And then, of course, also the hardware point that you made. Yeah. Thinking about how, you know, just in the silicon world, going from seven nanometers down to four nanometers, and the amount of computational power that we can cram into a very, very tiny little space and therefore scale bigger and bigger and bigger.

And in the network space as well, speeds and feeds have just grown dramatically, very quickly, over the past ten and twenty years. All of that has kind of conspired together at the same time in history to allow us to take what we've been building on for the past seventy or eighty years and do much more interesting stuff with it. So that's really cool. But it's also really interesting to me to think about how a lot of those individual neurons or nodes in the neural network are actually doing a lot of that stuff, like statistical probability calculations and, you know, some of the stuff that we say, oh, we don't do that anymore in neural networks.

No. No. The neural network is kinda like the workflow of all of that activity. And sure, there are more advanced things, we've talked about attention and some other algorithms that are used today that maybe weren't used thirty years ago.

But a lot of the stuff that we did decades ago is still being incorporated. I mean, just for example, when I input a question or request into a prompt like ChatGPT, it still needs to abide by some of the rules of grammar and syntax in order to produce an intelligible response.

But I would like to go back for a second to transformers, and specifically the encoder decoder, because when I was getting into this, it just seemed so simplistic to me that I assumed I was missing something. Is that all that we're talking about here with transformers? The encoder decoder is just an input output function?

Pretty much.

Okay.

Yeah. It literally is. You basically gotta take the data.

You gotta ingest it into a system that can then process it. And so if you're thinking about a model as a set of layers, the inputs, or the neurons that come into it, are basically all the points of data that are taken into consideration and considered against each other.

And so when you build your model out, you say it's gonna have this many inputs, it's gonna have this many outputs that feed into these layers, yada yada yada. You've got to structure your data in a way that can be fed into that model and pushed over to it. You can't just hand it a PDF and say go to town and have fun. The PDF has got to be moved into that model.

So usually with a PDF, it gets broken down into chunks, and then those chunks get broken down into vectors that can then be in relation to other vectors, and yada yada. So that's literally what it is: it's taking the data and transforming it into a form the model can actually use and push through itself to get a result.

Decoding is basically the complete opposite. It's taking the output that comes from the model, those same results, okay, but with a prediction attached to it, and then outputting that in a way that is natural for us to read, or however the architecture of the application that's using it wants it to come out. It can either come out as a JSON payload to then get pushed off to some other pipeline, or, how we like to see it now with NLP, it can respond in natural human speak.
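As a rough sketch of that ingestion step, here is what chunking a document and turning the chunks into vectors might look like. The hashing-based toy_embed function is only a stand-in for a real, learned embedding model, and the chunk sizes are arbitrary assumptions.

```python
import hashlib
import numpy as np

def chunk_text(text, chunk_size=40, overlap=10):
    """Split a long document into overlapping character chunks,
    roughly what an ingestion pipeline does before embedding."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

def toy_embed(chunk, dim=16):
    """Stand-in for a real embedding model: hash each word into a
    fixed-size vector. Real embeddings are learned, not hashed."""
    vec = np.zeros(dim)
    for word in chunk.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

doc = ("BGP session to the upstream provider flapped twice overnight, "
       "and interface errors on the core switch increased at the same time.")
vectors = [toy_embed(c) for c in chunk_text(doc)]
print(len(vectors), "chunks, each embedded as a", vectors[0].shape, "vector")
```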

Yeah. So what we're talking about here is something that we've touched on several times without getting really into it, so maybe we should really quick. What you're talking about is tokenization.

Correct. That's the process of taking something more complex and breaking it down into its parts. And if you can think of it in terms of language, that might be taking a sentence, to keep it simple. I mean, obviously, it's an entire text or an entire huge dataset of texts. Right? But it could be taking a sentence and breaking it down into individual words, or more likely into individual characters, including punctuation as well.

Then those individual words are transformed into embeddings. Embeddings are a numerical value. And those numerical values can be represented in a three-D vector space. And so if you can picture a three-D model, all these words appear in relation to each other.

Now, of course, they exist in tables and in math and in algorithms, but if you were to picture it in your mind's eye, that's how it would look. And so now the model can use this information to determine the relationship among words, and it can use that as part of its overall computation to determine probability and things like that. But I wanna ask you about the data, because this all presupposes that we have, number one, enough data, but we have the Internet, so fine. But do we have good quality data?
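A minimal sketch of tokenization and embeddings as described above. The tiny vocabulary, the three-dimensional embedding size, and the random embedding table are all illustrative assumptions; real tokenizers work on sub-word pieces and real embeddings are learned so that related words end up close together.

```python
import numpy as np

# A toy vocabulary and tokenizer: text becomes integer IDs.
vocab = {"<unk>": 0, "the": 1, "router": 2, "switch": 3, "is": 4, "down": 5, ".": 6}

def tokenize(sentence):
    return [vocab.get(tok, vocab["<unk>"])
            for tok in sentence.lower().replace(".", " .").split()]

# An embedding table maps each token ID to a vector. Here it is random;
# in a trained model these vectors are learned.
rng = np.random.default_rng(42)
embedding_table = rng.normal(size=(len(vocab), 3))   # 3-D just so you could picture it

def cosine(a, b):
    """How 'close' two word vectors are in the embedding space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ids = tokenize("The switch is down.")
vectors = embedding_table[ids]
print("token IDs:", ids)
print("similarity of 'router' and 'switch':",
      round(cosine(embedding_table[vocab["router"]], embedding_table[vocab["switch"]]), 3))
```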

And you said I think at some point in our conversation, you mentioned Wikipedia and Reddit, and I have to admit that internally, I sort of recoiled at that. I'm like, you can't be serious, man. Reddit and and Wikipedia. But but are you serious?

Is that data acceptable to use here?

Absolutely. And they're a critical source right now, because Wikipedia is just a massive collection of historical data and historical information that's been somewhat vetted, with references applied, you know, so it's solid data to work off of.

Reddit, on the other hand, it's all conversational. That's the value of Reddit in in in an AI workflow.

It's not necessarily about whether it's right or wrong, but it's a great example to show how to have a human conversation, because that's how you're interacting on Reddit: conversational threads back and forth. And those threaded conversations really do a good job of helping AI learn how to converse with another human.

And so that's that's why I bring up Reddit.

And I think Reddit, you know, this is just my personal opinion, hopefully I don't get sued for this, but I think that's a lot of why Reddit made the recent decisions that they have. Their data is very valuable to this industry right now. And with others being able to utilize that data for free, it's like, oh, there's a little bit of fear of missing out there, and they need to get their gold out of this as well. And so that's kinda why I think they went in that direction.

So what we're talking about here is kind of a multistage process of training a model. We have the initial model training. Right? And that's gonna be the bulk of the computational power and activity and resources and time, given to ingesting that initial dataset, presupposing that it's sufficient in size and quality, and then observing its results going through that.

And that can take weeks and months, because it is an iterative process of observing the result and then fine tuning the model in order to achieve the result that we're looking for. And when we finally get there, well, then the model is fit for service. It's ready to be put into production. After that, of course, you may need to continually update your model based on new input, updated and fresh data.

But that's not as as computationally expensive as, as that initial model training.

So this is what's gonna cost the most money. This is what's gonna take those tens of thousands of GPUs and the billion dollars' worth of infrastructure to actually do this in the first place.

But what happens when the model produces something that's just wrong? Like, it's incorrect, factually incorrect, or it just makes no sense.

Yeah. So, you know, what you just described as being wrong or being off is, in general terms, called hallucination.

And that's basically when the LLM did not pick the best answer, or it picked one from way out of nowhere.

And it's, you know, one of those things the industry is still working through, figuring out how best to handle it and how best to mitigate it. There are models out there that actually allow you to recognize and detect that something's been AI generated, so we see all the tools out there to check if anybody's cheating on their homework.

But also, you know, in the same vein, those types of tools and other models out there can can give you a score to say, hey. This this answer is is most likely not right.

It doesn't line up with what the answer should be or the keywords aren't there, so it has a really low probability score. So let's ignore it.

But, ultimately, you know, LLMs hallucinate because of the way that they're trained.

So that means that there's a limitation in the dataset. There's a limitation in how the parameters and hyperparameters are adjusted in the training and in the fine tuning, so in the pretraining and in the fine tuning periods.

So we wanna solve for those things so that it does generate a reliable, accurate response. So it starts with ensuring that you're using a sufficient quantity of high quality data, and that it's being refreshed periodically so it's up to date and therefore producing relevant, accurate responses. There's also a process of data cleaning, of compression and consolidation, that makes the model run more efficiently.

And then there's also the concept of model overfitting and underfitting, which has been well understood in statistical analysis for years and years now, and in machine learning in general.

But basically, you know, if you have a data set, the model can operate very, very accurately in the confines of that data set. But as soon as you step, like, one inch outside of that data set, the model becomes highly inaccurate.

So that is, you know, that's a problem. And then we also have this concept of, of back propagation.

So for the audience's sake, from a very, very high level, that's the process of basically observing the result of the model, of the neural network. And then based on that result, the accuracy of the result, how far it differs from what we expect, we can go back and adjust those parameters within the layers of the neural network. So with backpropagation, we can go back to individual neurons in the layers of the network to maybe increase the weights or biases at different nodes in the neural network. And what we're doing, and this is the training process, right, it's iterative.

What we're trying to do is figure out what weights, what biases, what specific values minimize what's called a cost function, and that's what we're using to determine how far away our result is from what we expect or what we want. Now that's a gross oversimplification of gradient descent, which is the entire process of figuring out which values we need to adjust and by how much. And then backpropagation is the algorithm that we use to do that.
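To ground that, here is gradient descent on a single weight and bias, with the gradients of the cost function worked out by hand. Backpropagation is what computes these same gradients automatically, layer by layer, in a deep network; the toy data and learning rate here are made up for illustration.

```python
import numpy as np

# One "neuron" with a weight and a bias, trained by gradient descent
# to fit y = 2x + 1 from noisy samples.
rng = np.random.default_rng(7)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.05, size=100)   # noisy target

w, b = 0.0, 0.0          # parameters the model is allowed to adjust
learning_rate = 0.1

for step in range(200):
    pred = w * x + b
    error = pred - y
    cost = (error ** 2).mean()          # the cost function we want to minimize
    grad_w = 2 * (error * x).mean()     # d(cost)/d(w)
    grad_b = 2 * error.mean()           # d(cost)/d(b)
    w -= learning_rate * grad_w         # step downhill: gradient descent
    b -= learning_rate * grad_b
    if step % 50 == 0:
        print(f"step {step:3d}  cost {cost:.4f}  w {w:.3f}  b {b:.3f}")

print(f"learned w={w:.2f}, b={b:.2f} (target was 2 and 1)")
```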

Yeah. Well, you're right in what you're saying, but backpropagation actually has a much more precise functionality in all of this.

Okay. Yeah. Yeah. No. Please explain.

So backpropagation is used in any type of model that produces any type of result: ML, deep learning, AI. They all use backpropagation.

And back back propagation is basically what is termed learning.

When when a model learns something, it is due to back propagation.

I don't know if anybody can remember way back in the day, well, in the past, when we started seeing results out of some of these deep learning models. You see the ones where it shows a computer simulation where a robot tries to ride a bicycle and it falls over after one step. But after about fifty steps, it makes it about ten yards. After a hundred steps, it makes it a hundred yards, and then after, like, five thousand, it makes it all the way. Well, that's learning. That's it going through the model. So it basically takes its inputs and pushes them through the model with what it thinks are good weights, good biases, good hyperparameters to make it learn.

When it gets to the end, it takes those hyperparameters and those weights and biases, okay, and it compares them with the results it got, as a good or bad result.

And if it comes back as a good result, or however you want to use it strategically, you then take those weights and you start from there and say, I wanna go to these weights. I'm assuming I'm gonna correct by going from these weights to these weights. And then you work backwards through the model, all the way back to the beginning, and you know where to start.

Okay.

And then that allows you to start the process. And so this is a way to go forward and then backwards, like you said, and learn how best to get to the right hyperparameters, the right tuning, and that's how it goes through.

Alright. Yeah. Yeah. And none of this stuff is actually gonna happen when the model is in production. I mean, we're not training the model per se, as far as that initial pretraining goes, and we're not even doing any significant fine tuning when the model is in production, other than some of the things that we're gonna talk about later, like prompt engineering and instruction tuning, things like that.

But as far as this process of pretraining and backpropagation and the iterations of model training, that's happening before the model goes into production.

And that's why we can run some of these models, like, on your laptop, because you're not doing all of that on your laptop. That happened already in some very expensive, gigantic data center. And the parameters have been identified and set, and now you're just running through the model with your subset of data.

Yep. Now the the assumption or the the difference there too is if you get into fine tuning.

Okay? Fine tuning, with older models, like ML models or even deep learning models like XGBoost and stuff, you could just strip the last couple layers off, add new layers, train those layers, and ship it out the door, and that's what we consider fine tuning.

But with ML models, or, I'm sorry, AI models or LLMs, you basically add a couple more layers on with all that information, and then you let it run through the whole process, including that backpropagation all the way through the model, to let the whole model sink in together.
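Here is a minimal PyTorch sketch of the older "strip the last layers and train a new head" style of fine tuning Ryan contrasts with full LLM fine tuning. The "pretrained" body is a stand-in built from scratch; in practice you would load an existing model.

```python
import torch
import torch.nn as nn

# A stand-in "pretrained" body; in practice this would be a model you load,
# not something built from scratch like this.
pretrained_body = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
)

# Classic fine tuning: freeze what was already learned, bolt a new head on,
# and train only the head.
for p in pretrained_body.parameters():
    p.requires_grad = False            # frozen: backprop will not update these

new_head = nn.Linear(64, 5)            # new task with 5 output classes
model = nn.Sequential(pretrained_body, new_head)

optimizer = torch.optim.Adam(new_head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One dummy training step just to show the mechanics.
x = torch.randn(8, 32)
labels = torch.randint(0, 5, (8,))
loss = loss_fn(model(x), labels)
loss.backward()
optimizer.step()
print("trainable parameters:", sum(p.numel() for p in model.parameters() if p.requires_grad))
```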

Okay. Well, you mentioned two different terms, though. What is the difference between a parameter and hyperparameter?

So parameters, that's another good one too, and I could be way off here, because I think it kinda gets muddied in the waters as well, because you can very easily see weight and bias inside your hyperparameters too. But, usually, your core parameters are weights and biases, and those are the numbers you work off of to train your model.

Hyperparameters are more things like when you introduce new tools. So I spoke about one earlier on: normalization.

What are the parameters for normalization? What range are we normalizing to?

You have mechanisms and tools where you can do, like, an early stop on your training. Instead of going all the way through the whole training exercise and training, you know, ten thousand times, if you hit a certain performance, stop. But what is that performance? Where is that number?

There are also things you can set during your training and your evaluation.

You can skip levels of the model. You can skip layers, okay, if you want to. So if you hit certain parameters or you see certain results, go ahead and skip the next three of these layers because we don't need to go through that. Get to the end, get to the softmax layer that outputs to us.

Those are the types of things that you can tweak and then see how they come out. And then you tweak those hyperparameters, run it through the training job, see how your test set does, and go from there.
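A plain-Python sketch of one of those hyperparameters, early stopping. The validation losses are canned numbers, and patience and min_delta are the knobs a human sets ahead of time.

```python
# Early stopping as a hyperparameter. The "training" here is faked with a
# canned list of validation losses; the point is the control logic.
validation_loss = [0.90, 0.72, 0.61, 0.55, 0.53, 0.54, 0.52, 0.53, 0.55, 0.56]

patience = 2          # hyperparameter: how many non-improving epochs we tolerate
min_delta = 0.01      # hyperparameter: how much improvement counts as improvement

best = float("inf")
bad_epochs = 0
for epoch, loss in enumerate(validation_loss):
    if best - loss > min_delta:
        best = loss
        bad_epochs = 0
    else:
        bad_epochs += 1
    print(f"epoch {epoch}: val loss {loss:.2f} (best {best:.2f})")
    if bad_epochs >= patience:
        print(f"stopping early at epoch {epoch}: no improvement for {patience} epochs")
        break
```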

Okay. So the parameters then are things that the model can adjust based on that whole process of back propagation and everything. And the hyperparameters are gonna be more static configurations that an engineer or a human being is gonna set, like you said. Right? I see.

But it's one of those where, you know, people also explore using AI and ML to actually adjust the hyperparameters for you.

Okay.

So it's turtles all the way down. Yeah. There you go.

So then, going back to the original point that we were trying to make, or that you were trying to make, how can you solve for hallucinations?

You know, we talked about the quality of the data, introducing more and fresh data from time to time, going through the iterations of model training and fine tuning using this process of backpropagation, which ultimately is the process of adjusting parameters and hyperparameters to get a better result at the very end of the model training. Right?

Yeah.

You know, it's, it's actually a funny story I wanted to tell you.

Last year, I was at AutoCon Zero, and, one of the speakers was presenting on AI and specifically large language models.

And somebody from the floor in the question and answer period asked about hallucinations, and the speaker started to give a response. Now, a few weeks or a month prior, I had read about how you can adjust temperature, how you can adjust the temperature within the model, within the neural network, as one of the hyperparameters to, you know, basically solve for hallucinations.

And so I put that in the Network Automation Forum Slack channel right there, with, like, you know, hundreds and hundreds of people in the room, thinking that I knew what I was talking about. And then somebody gently and professionally and kindly corrected me, saying, well, yeah, that's a parameter you can adjust, but that's certainly not how you solve for hallucinations. It's much more complex than that.

So I looked up this guy and I realized, oh, he's an experienced network engineer and data scientist who wrote a book on this. So I went out and bought his book and read it, and that's actually what started me getting much deeper into learning about this field. So let me throw it back at you and ask, what are we missing in this discussion? What are some things we're missing as far as how we can accommodate for hallucinations? In your opinion, what's the best way moving forward to accommodate for hallucinations in an LLM?

Yeah.

And so, getting into my opinion and what I've found through researching, the key way that we can reduce hallucinations right now is to reduce the amount of data the LLM has to use to generate language.

So there's a lot there.

I wasn't expecting you to say that.

I was gonna say, no, I'm serious. I was expecting you to say that we wanna increase our data set so we can increase the probability of good data and more examples from which the machine can learn. You just said the opposite, though.

Right. Right. Right. So, you know, if you think about an LLM, it's basically this massive ball of knowledge on everything, from, you know, why are octopuses purple, all the way to chocolate ice cream is made with cocoa beans, to anything we can think of in networking.

So it's all over the place, and so it's very easy for two sets of data inside that model, or two vectors, if you wanna say that, to be close to each other but have no relation to each other at all. Okay. It's relatively easy for that to happen, and that's where a lot of hallucination comes from: it recognizes that the wrong tensor, or, I'm sorry, the wrong vector, lines up. Okay?

And so by reducing the dataset, or controlling what the model can use, we focus it in on the areas we want it to. Now, there are multiple approaches to go about that, but right now, really, the biggest approach that you see everybody working off of is what's called a RAG pipeline: RAG, or retrieval-augmented generation. Okay. That was new to me.

Yep.

So, basically, we can already do search very, very well right now, as an industry and as humans. Everybody knows how to jump on Google and really fine tune how to search and ask a question.

What do you mean? You mean fine tune in the sense of, like, the prompt engineering concept?

No. No. No. I'm sorry.

So, not fine tuning in that sense.

You go to a search engine Yeah. And you type in a question, whatever that question is. Okay? With our algorithms and our search, we have the ability to filter down that list of results very efficiently.

Each of us, you know, even grandma, knows you jump on Google and you ask a question. If you don't get good responses in the top fifteen results or the first page, you change your question.

Okay.

We've gotten good at this. Okay? And so why not leverage search there? And that's kind of the premise is let's add a step to the LLM.

LLMs are very, very good at producing natural language. That's what they're good at right now.

If we take and search a database of our information, so we filter the information down to exactly what we want, search that database of information, find the results that we want, push that into an LLM, and say, answer a question based off this data alone.

That is one way we are manually forcing the the data into a smaller bucket to focus on.

And better ensure a more accurate or logical, or at least not nonsensical, result. But, I mean, that also kinda flies in the face of what we were talking about before with model overfitting. Right?

So we have a data set that's just kinda too small. It's limited. And so the model gets some sort of question or request and then it sort of tries to generalize an answer based on a limited amount of data. So I don't get this.

I don't get that.

It's built outside of the model. It's built on top of the model.

That's why it's usually called a pipeline, because it's an additional step that you add into it. So if you understand adding context to a prompt or to your query, being able to add the context, you just add those search results in there, and that's what returns it.

Okay. So the RAG pipeline doesn't limit the underlying dataset used to train the model initially.

It's an additional layer Yep.

To add context so we can help focus the answer that it ultimately generates. Alright. I understand now.
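A minimal sketch of that retrieval-augmented generation flow: embed a question, search a small set of our own documents for the closest matches, and build a prompt that tells the model to answer from that context alone. The hashing-based toy_embed is a stand-in for a real embedding model, and the final LLM call is left as a placeholder.

```python
import hashlib
import numpy as np

def toy_embed(text, dim=32):
    """Stand-in for a real embedding model (hashing words into a vector)."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    n = np.linalg.norm(vec)
    return vec / n if n else vec

# A tiny "knowledge base" of our own documents, the curated data Ryan
# describes filtering the model down to. Entirely made-up examples.
snippets = [
    "BGP neighbor 10.0.0.2 has been flapping since the maintenance window.",
    "The core switch shows CRC errors on interface Ethernet1/7.",
    "DNS resolution for the branch sites goes through the Anycast resolvers.",
]
index = np.stack([toy_embed(s) for s in snippets])

def retrieve(question, k=2):
    """Step 1 of the RAG pipeline: search our own data for relevant context."""
    scores = index @ toy_embed(question)
    return [snippets[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(question):
    """Step 2: hand the LLM only that context, with instructions to stick to it."""
    context = "\n".join(retrieve(question))
    return (f"Answer the question using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

# Step 3 would send this prompt to whatever LLM you use; here we just print it.
print(build_prompt("Why is the BGP session to 10.0.0.2 unstable?"))
```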

So I do wanna refocus our attention, though, on our industry, our field. So networking. Well, not just networking, but IT in general, more broadly. What are your thoughts about how large language models are changing IT operations today, how we do operations, and how they could potentially change how we do operations in the future?

Yeah.

Kinda like what I just said. LLMs right now, their bread and butter is being able to produce natural language.

And the first thing I feel they're gonna open up to a lot of applications and a lot of people is a way to naturally interact with the infrastructure, with the tools they use, with the applications, instead of having to know, like, CLI syntax, or, you know, how do I configure this on a Cisco box versus a Juniper box versus an Arista box, or, you know, how do I configure an F5 load balancer?

You don't have to know that syntactical information. You just need to talk with the model, ask it to build and generate or do, and then it returns output to you.

Now is it gonna be one of those where, you know, in five years, no one's gonna use the CLI anymore? No. We've proven that wrong decade after decade now.

But it's it's gonna help augment how we do that stuff.

And that's the first thing right out of the gate that I see. That and then also, like, telemetry data, being able to process telemetry data, understand what we're doing with it, boil it down, and come out with, you know, human relatable and readable results.

I mean, considering the name of the podcast, Telemetry Now, I have a bias here, and I tend to agree with what you said about telemetry in particular, considering that it is just a vast dataset of very diverse, high volume data from cloud providers and your network and from application sources and servers and switches and routers and firewalls and all these different things.

And all at vastly different scales and in different types and formats, so it requires this entire intelligent workflow to normalize it and then do something meaningful with it. And so I think the interrogation of data at the scale that we do IT operations today, right, with all of these components that we didn't have twenty years ago, I think that's gonna be one of the more compelling uses of large language models, specifically in IT operations.

The interrogation, analysis, and understanding of data, much deeper and with greater insight than we've been able to do in the past. And I understand that that is my bias, because that is what I focus on at my company every single day.

But looking at the landscape of what we're doing with large language models, specifically in IT, sure, I see, like, the summarization of text and configuration guides, and creating syntax for code, for routers and switches, and, you know, your Python scripts. That's all meaningful.

But I really do think that data analysis and being able to interact with data very, very naturally, which is what NLP is all about, is one of the more compelling things that we're gonna see occur over the next few years.

Yep. I I I do. I agree. I think it's gonna be the most impactful.

Yeah. Absolutely.

Standing up, you know, automation of deployment, standing up new infrastructure, that's easy. You know? We've been doing it forever. It's it's relatively easy.

I know what you mean. Yeah.

Yeah. So it's not too big of a deal. The problem is when it transitions from day zero into day one, and now we gotta maintain it.

You don't know what happens. You don't know what's going on. The janitor could smack her broom into one of the switch ASICs, or not switch ASICs, but an SFP or whatever. You know?

Who knows? And you gotta be able to put these logs together. And as an industry, even when the big data trends came out, we we got all this cool stuff. But how do you interact with big data through an SQL query?

Mhmm.

How do your frontline support techs do that? You know? They don't. But if if they can ask a question to a bot or to a service that can then pull all that together for them, understanding what they want, that's gonna be our biggest impact.

Yeah. And I think that we're gonna see large language, well, we already are seeing large language models trained for specific knowledge areas, domain areas. So specifically for IT, for example, we have OWL, O-W-L. That's a large language model that was trained on IT information. And so, you know, it was designed to be more relevant for IT ops. Now, whether it's that effective or not, that's a debate. And we're gonna see that for other areas as well, other disciplines, not just IT.

And that, I think, is gonna allow large language models in general to become more useful and more meaningful in what they can produce over time. Now, as far as software defined networking and intent based networking, Ryan, you and I have had these conversations for years, and we've always had these high hopes that we were gonna have these self healing networks and automated remediation and super programmatic, intelligent networking, and that never happened. Do you think that what we're seeing now with generative and even general AI is a big step in that direction? Maybe what's gonna push us over the edge?

It already isn't.

It already what?

It's not.

Oh, okay. Well, then please explain.

It already is not the mechanism that's gonna do that by itself. The RAG pipeline is the best example.

You know, all of the components and all of the ML, AI, and LLMs that we put out there and can use, they are very good in certain areas but weak in others, and they can do different things. What we're finding, and the RAG pipeline is one of the first that kind of proved this out and has shown a lot of momentum, is you chain operations together. So if you think about how you would do a CI/CD pipeline for any type of cloud deployment, or even your network automation or even software development, it's not just one tool that handles the entire workflow. It's usually about ten or twelve tools chained together. Each one of them does its function, produces, and pushes the output to the next layer. It's gonna be very similar to that.

And there are concepts and, you know, you've got tools out there. There's LangChain and LlamaIndex, which are two tools that are very popular right now, because, if you're familiar with something like Ansible or Terraform, they allow you to very easily chain multiple steps together in a single workflow.

And so being able to do that, or being able to take an entire system and say, hey, do this for me, and then you see an output come out the other side. There are ten or twelve steps in there that help massage that down the pipe.

And how I like to think about this now for network engineers and how we do our day to day is think about how you would do it. Okay? If if you're sitting down and you're having to troubleshoot an issue, where are you gonna start?

And a lot of us as network people went through these same steps when we went through automation.

Okay. We we outline every single step you take to get from start to finish of any given task.

And then the goal is to automate every single one of those steps.

The same is true for AI and ML: you figure out what step can best be achieved by what LLM or what tool set, and you chain them all together. And then the beginning and the end is your natural language processing, where it takes the input and does the understanding of what you're saying, and then at the end, it outputs it in normal language for you.
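A rough sketch of that chaining idea with plain Python functions, no particular framework assumed (LangChain and LlamaIndex wrap this pattern up for you). Every step name and canned result here is hypothetical; the point is that each step's output feeds the next, with natural language at the start and the end.

```python
# Each step takes the previous step's output and passes its own output down the chain.
def parse_intent(user_text):
    # In real life, an NLP step would figure out what the user is asking for.
    return {"intent": "troubleshoot", "target": user_text}

def gather_telemetry(state):
    # In real life this would query your telemetry platform; faked here.
    state["telemetry"] = ["interface errors rising on core-sw1 Ethernet1/7"]
    return state

def summarize_findings(state):
    # In real life this step would call an LLM to produce natural language.
    findings = "; ".join(state["telemetry"])
    state["answer"] = f"For '{state['target']}': {findings}."
    return state

def run_chain(user_text, steps):
    """Push the input through each step in order, like a CI/CD pipeline."""
    state = steps[0](user_text)
    for step in steps[1:]:
        state = step(state)
    return state["answer"]

print(run_chain("users report slowness to the data center",
                [parse_intent, gather_telemetry, summarize_findings]))
```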

Yeah. That's a great way to put it, actually. So large language models are really like a layer through which human beings can interact with computers, or really the data that sits behind computers, in a much better way than we ever have before. So in that sense, the large language model isn't that hyperintelligent AI that's gonna take over the world. It's a service layer between human beings and data.

And really, you know, I guess I'll address that. This idea in popular culture, and even in the media among the talking heads, that AI today is, you know, on the verge of taking over the world as some form of hyperintelligent computer being, is false. I mean, that's not the level of intelligence that we're seeing in any form of general or generative AI today.

And as humans, with everything we've consumed over the past couple years, we have this vision that AI is this monolithic tool or application that just does all this stuff, and it's not. It's a combination of multiple tools.

Right. Right. Multiple tools in a workflow that is, you know, admittedly very complex and sophisticated. So we're not trying to downplay how useful and sophisticated AI is as far as the technology goes.

But, certainly, you know, just to go back to neural networks, the idea that a neural network mimics how a human mind works is not true. That's a misnomer in the name neural network.

It doesn't even come close to it. In fact, I would venture to say that human beings generally, scientists, neuroscientists, still aren't quite sure how the human mind works anyway. So it's odd for us to say the neural network behaves like something we don't understand in the first place.

What's interesting to me, though, is that there is stuff that happens within the bowels of neural networks and the hidden layers where even computer scientists today are not quite sure how it happens and how it works. So there is some interesting stuff going on for sure, but I don't think that it's about trying to mimic human intelligence necessarily. I mean, yeah, there's the aspect of responding to a human prompt with a response that feels very natural, very human. And then, you know, the artificial intelligence workflow and model being able to determine context, understand nuance, and that sort of thing.

Okay. But other than that, really, we're trying to use these models to do what humans do but better. Right? More efficiently, faster, see deeper insight, find correlation that humans can't see, and, and do that sort of analysis that's beyond human ability.

So not not mimicking human intelligence, but enhancing the performance of how humans think. Right?

And you mentioned getting to the performance of humans and reaching how humans do it.

And that's one of those things, you know, I had to learn this lesson when going through network automation: the most optimal or most stable way to program an application is not necessarily to do it the way a human does.

AI and ML are the same way. We try to mimic them somewhat off of what we understand of the human brain. Mhmm. But, ultimately, how the brain operates, from what we know, and how ML and AI models work, they're two separate areas. They work differently.

Right. And so machines don't necessarily need to go through the same type of processes that we as humans do. Yeah. They can go through a separate set of layers. So it's understanding and being open minded to the idea that a workflow through a machine might be different than that of a human.

And if it proves to be better, well, you got your results. And if it doesn't, okay, then how do we tune it to get it as good as a human? And maybe that's the best we can get it. And so you gotta take that stuff in consideration and be open minded with it.

So, Ryan, I think this has been a really good technical overview of large language models. We even got into, like, how they are relevant to IT operations, which was good. But I know that any one of the topics that we covered, backpropagation, transformers, the attention mechanism, n-gram models, what else, linear regression, anything that we touched upon today could easily be a podcast or a series of podcasts unto itself for sure. So I do encourage listeners to check out the white paper that I'll link in the show notes. And, Ryan, for folks that have a question for you or want to reach out with a comment, how can they find you online?

Yeah. The best place to find me right now is probably gonna be LinkedIn. I used to be pretty big on Twitter. I'm still there, but I just don't use it anymore.

On LinkedIn, you know, if you're in the networking community, you'll most likely find me under that1guy15, in various forums, in various Slack channels, Discord groups. Anywhere you can find me, hit me up. And if you can't get a hold of me, find me on another platform. Hit me up there. You're welcome to hunt me down.

Great. And you can still find me on Twitter. I'm pretty active there still, at network_phil. You can search my name on LinkedIn, Philip Gervasi. I'm very active there.

And my personal blog is networkphil.com, although I do a ton of writing on the Kentik blog as well. Now, if you have an idea for an episode or if you'd like to be a guest on Telemetry Now, I'd love to hear from you. Please reach out at telemetrynow@kentik.com. So for now, thanks very much for listening.

Bye bye.

About Telemetry Now

Do you dread forgetting to use the “add” command on a trunk port? Do you grit your teeth when the coffee maker isn't working, and everyone says, “It’s the network’s fault?” Do you like to blame DNS for everything because you know deep down, in the bottom of your heart, it probably is DNS? Well, you're in the right place! Telemetry Now is the podcast for you! Tune in and let the packets wash over you as host Phil Gervasi and his expert guests talk networking, network engineering and related careers, emerging technologies, and more.