View on GitHub
GitHub
Neural networks
Transformers, the tech behind LLMs | Deep Learning Chapter 5
Loading player
Notes
Transcript
443 segments
0:00
The initials GPT stand for Generative Pretrained Transformer.
0:05
So that first word is straightforward enough, these are bots that generate new text.
0:09
Pretrained refers to how the model went through a process of learning
0:13
from a massive amount of data, and the prefix insinuates that there's
0:16
more room to fine-tune it on specific tasks with additional training.
0:20
But the last word, that's the real key piece.
0:23
A transformer is a specific kind of neural network, a machine learning model,
0:27
and it's the core invention underlying the current boom in AI.
0:31
What I want to do with this video and the following chapters is go through a
0:35
visually-driven explanation for what actually happens inside a transformer.
0:39
We're going to follow the data that flows through it and go step by step.
0:43
There are many different kinds of models that you can build using transformers.
0:47
Some models take in audio and produce a transcript.
0:51
This sentence comes from a model going the other way around,
0:54
producing synthetic speech just from text.
0:56
All those tools that took the world by storm in 2022 like DALL-E and Midjourney
1:01
that take in a text description and produce an image are based on transformers.
1:06
Even if I can't quite get it to understand what a pi creature is supposed to be,
1:09
I'm still blown away that this kind of thing is even remotely possible.
1:13
And the original transformer introduced in 2017 by Google was invented for
1:18
the specific use case of translating text from one language into another.
1:22
But the variant that you and I will focus on, which is the type that
1:26
underlies tools like ChatGPT, will be a model that's trained to take in a piece of text,
1:31
maybe even with some surrounding images or sound accompanying it,
1:34
and produce a prediction for what comes next in the passage.
1:38
That prediction takes the form of a probability distribution
1:41
over many different chunks of text that might follow.
1:45
At first glance, you might think that predicting the next word
1:47
feels like a very different goal from generating new text.
1:50
But once you have a prediction model like this,
1:52
a simple thing you could try to make it generate, a longer piece of text,
1:56
is to give it an initial snippet to work with,
1:58
have it take a random sample from the distribution it just generated,
2:02
append that sample to the text, and then run the whole process again to make
2:05
a new prediction based on all the new text, including what it just added.
2:10
I don't know about you, but it really doesn't feel like this should actually work.
2:13
In this animation, for example, I'm running GPT-2 on my laptop and having it repeatedly
2:17
predict and sample the next chunk of text to generate a story based on the seed text.
2:22
The story just doesn't actually really make that much sense.
2:26
But if I swap it out for API calls to GPT-3 instead, which is the same basic model,
2:31
just much bigger, suddenly almost magically we do get a sensible story,
2:35
one that even seems to infer that a pi creature would live in a land of math and
2:40
computation.
2:41
This process here of repeated prediction and sampling is essentially
2:44
what's happening when you interact with ChatGPT,
2:47
or any of these other large language models, and you see them producing
2:50
one word at a time.
2:52
In fact, one feature that I would very much enjoy is the ability to
2:55
see the underlying distribution for each new word that it chooses.
3:03
Let's kick things off with a very high level preview
3:06
of how data flows through a transformer.
3:08
We will spend much more time motivating and interpreting and expanding
3:11
on the details of each step, but in broad strokes,
3:14
when one of these chatbots generates a given word, here's what's going on under the hood.
3:19
First, the input is broken up into a bunch of little pieces.
3:22
These pieces are called tokens, and in the case of text these tend to be
3:26
words or little pieces of words or other common character combinations.
3:30
If images or sound are involved, then tokens could be little
3:34
patches of that image or little chunks of that sound.
3:37
Each one of these tokens is then associated with a vector, meaning some list of numbers,
3:42
which is meant to somehow encode the meaning of that piece.
3:45
If you think of these vectors as giving coordinates in some very high dimensional space,
3:50
words with similar meanings tend to land on vectors that are
3:53
close to each other in that space.
3:55
This sequence of vectors then passes through an operation that's
3:58
known as an attention block, and this allows the vectors to talk to
4:01
each other and pass information back and forth to update their values.
4:04
For example, the meaning of the word model in the phrase "a machine learning
4:08
model" is different from its meaning in the phrase "a fashion model".
4:12
The attention block is what's responsible for figuring out which
4:15
words in context are relevant to updating the meanings of which other words,
4:19
and how exactly those meanings should be updated.
4:22
And again, whenever I use the word meaning, this is
4:25
somehow entirely encoded in the entries of those vectors.
4:29
After that, these vectors pass through a different kind of operation,
4:32
and depending on the source that you're reading this will be referred
4:35
to as a multi-layer perceptron or maybe a feed-forward layer.
4:38
And here the vectors don't talk to each other,
4:40
they all go through the same operation in parallel.
4:43
And while this block is a little bit harder to interpret,
4:45
later on we'll talk about how the step is a little bit like asking a long list
4:49
of questions about each vector, and then updating them based on the answers
4:53
to those questions.
4:54
All of the operations in both of these blocks look like a
4:58
giant pile of matrix multiplications, and our primary job is
5:01
going to be to understand how to read the underlying matrices.
5:06
I'm glossing over some details about some normalization steps that happen in between,
5:10
but this is after all a high-level preview.
5:13
After that, the process essentially repeats, you go back and forth
5:17
between attention blocks and multi-layer perceptron blocks,
5:20
until at the very end the hope is that all of the essential meaning
5:24
of the passage has somehow been baked into the very last vector in the sequence.
5:28
We then perform a certain operation on that last vector that produces a probability
5:33
distribution over all possible tokens, all possible little chunks of text that might
5:37
come next.
5:38
And like I said, once you have a tool that predicts what comes next
5:42
given a snippet of text, you can feed it a little bit of seed text and
5:45
have it repeatedly play this game of predicting what comes next,
5:49
sampling from the distribution, appending it, and then repeating over and over.
5:53
Some of you in the know may remember how long before ChatGPT came into the scene,
5:57
this is what early demos of GPT-3 looked like,
6:00
you would have it autocomplete stories and essays based on an initial snippet.
6:05
To make a tool like this into a chatbot, the easiest starting point is to have a
6:09
little bit of text that establishes the setting of a user interacting with a
6:13
helpful AI assistant, what you would call the system prompt,
6:17
and then you would use the user's initial question or prompt as the first bit of
6:21
dialogue, and then you have it start predicting what such a helpful AI assistant
6:25
would say in response.
6:27
There is more to say about an added step of training that's required
6:30
to make this work well, but at a high level this is the idea.
6:35
In this chapter, you and I are going to expand on the details of what happens at the very
6:39
beginning of the network, at the very end of the network,
6:42
and I also want to spend a lot of time reviewing some important bits of background
6:46
knowledge, things that would have been second nature to any machine learning engineer by
6:50
the time transformers came around.
6:53
If you're comfortable with that background knowledge and a little impatient,
6:56
you could probably feel free to skip to the next chapter,
6:58
which is going to focus on the attention blocks,
7:00
generally considered the heart of the transformer.
7:03
After that, I want to talk more about these multi-layer perceptron blocks,
7:06
how training works, and a number of other details that will have been skipped up to
7:11
that point.
7:12
For broader context, these videos are additions to a mini-series about deep learning,
7:16
and it's okay if you haven't watched the previous ones,
7:18
I think you can do it out of order, but before diving into transformers specifically,
7:22
I do think it's worth making sure that we're on the same page about the basic premise
7:27
and structure of deep learning.
7:29
At the risk of stating the obvious, this is one approach to machine learning,
7:33
which describes any model where you're using data to somehow determine how a model
7:37
behaves.
7:39
What I mean by that is, let's say you want a function that takes in
7:42
an image and it produces a label describing it,
7:44
or our example of predicting the next word given a passage of text,
7:48
or any other task that seems to require some element of intuition
7:51
and pattern recognition.
7:53
We almost take this for granted these days, but the idea with machine learning is that
7:57
rather than trying to explicitly define a procedure for how to do that task in code,
8:02
which is what people would have done in the earliest days of AI,
8:05
instead you set up a very flexible structure with tunable parameters,
8:09
like a bunch of knobs and dials, and then, somehow,
8:11
you use many examples of what the output should look like for a given input to tweak
8:16
and tune the values of those parameters to mimic this behavior.
8:19
For example, maybe the simplest form of machine learning is linear regression,
8:24
where your inputs and outputs are each single numbers,
8:27
something like the square footage of a house and its price,
8:30
and what you want is to find a line of best fit through this data, you know,
8:34
to predict future house prices.
8:37
That line is described by two continuous parameters,
8:40
say the slope and the y-intercept, and the goal of linear
8:43
regression is to determine those parameters to closely match the data.
8:48
Needless to say, deep learning models get much more complicated.
8:52
GPT-3, for example, has not two, but 175 billion parameters.
8:58
But here's the thing, it's not a given that you can create some giant
9:01
model with a huge number of parameters without it either grossly
9:05
overfitting the training data or being completely intractable to train.
9:10
Deep learning describes a class of models that in the
9:13
last couple decades have proven to scale remarkably well.
9:16
What unifies them is that they all use the same training algorithm,
9:19
it's called backpropagation, we talked about it in previous chapters,
9:22
and the context that I want you to have as we go in is that in order for this
9:26
training algorithm to work well at scale, these models have to follow a certain
9:30
specific format.
9:31
And if you know this format going in, it helps to explain many of the choices for how a
9:36
transformer processes language, which otherwise run the risk of feeling kinda arbitrary.
9:41
First, whatever kind of model you're making, the
9:43
input has to be formatted as an array of real numbers.
9:46
This could simply mean a list of numbers, it could be a two-dimensional array,
9:50
or very often you deal with higher dimensional arrays,
9:53
where the general term used is tensor.
9:56
You often think of that input data as being progressively transformed into many
10:00
distinct layers, where again, each layer is always structured as some kind of
10:04
array of real numbers, until you get to a final layer which you consider the output.
10:09
For example, the final layer in our text processing model is a list of numbers
10:13
representing the probability distribution for all possible next tokens.
10:17
In deep learning, these model parameters are almost always referred to as weights,
10:22
and this is because a key feature of these models is that the only way these
10:25
parameters interact with the data being processed is through weighted sums.
10:30
You also sprinkle some non-linear functions throughout,
10:32
but they won't depend on parameters.
10:35
Typically, though, instead of seeing the weighted sums all naked
10:38
and written out explicitly like this, you'll instead find them
10:41
packaged together as various components in a matrix vector product.
10:46
It amounts to saying the same thing, if you think back to how matrix vector
10:50
multiplication works, each component in the output looks like a weighted sum.
10:54
It's just often conceptually cleaner for you and me to think
10:58
about matrices that are filled with tunable parameters that
11:01
transform vectors that are drawn from the data being processed.
11:06
For example, those 175 billion weights in GPT-3 are
11:10
organized into just under 28,000 distinct matrices.
11:14
Those matrices in turn fall into eight different categories,
11:17
and what you and I are going to do is step through each one of those categories to
11:21
understand what that type does.
11:23
As we go through, I think it's kind of fun to reference the specific
11:27
numbers from GPT-3 to count up exactly where those 175 billion come from.
11:31
Even if nowadays there are bigger and better models,
11:34
this one has a certain charm as the first large-language
11:37
model to really capture the world's attention outside of ML communities.
11:41
Also, practically speaking, companies tend to keep much tighter
11:44
lips around the specific numbers for more modern networks.
11:47
I just want to set the scene going in, that as you peek under the
11:50
hood to see what happens inside a tool like ChatGPT,
11:53
almost all of the actual computation looks like matrix vector multiplication.
11:57
There's a little bit of a risk getting lost in the sea of billions of numbers,
12:01
but you should draw a very sharp distinction in your mind between
12:05
the weights of the model, which I'll always color in blue or red,
12:08
and the data being processed, which I'll always color in gray.
12:12
The weights are the actual brains, they are the things learned during training,
12:16
and they determine how it behaves.
12:18
The data being processed simply encodes whatever specific input is
12:22
fed into the model for a given run, like an example snippet of text.
12:27
With all of that as foundation, let's dig into the first step of this text processing
12:31
example, which is to break up the input into little chunks and turn those chunks into
12:35
vectors.
12:37
I mentioned how those chunks are called tokens,
12:39
which might be pieces of words or punctuation,
12:41
but every now and then in this chapter and especially in the next one,
12:44
I'd like to just pretend that it's broken more cleanly into words.
12:48
Because we humans think in words, this will just make it much
12:51
easier to reference little examples and clarify each step.
12:55
The model has a predefined vocabulary, some list of all possible words,
12:59
say 50,000 of them, and the first matrix that we'll encounter,
13:03
known as the embedding matrix, has a single column for each one of these words.
13:08
These columns are what determines what vector each word turns into in that first step.
13:15
We label it W_E, and like all the matrices we see,
13:18
its values begin random, but they're going to be learned based on data.
13:23
Turning words into vectors was common practice in machine learning long before
13:27
transformers, but it's a little weird if you've never seen it before,
13:30
and it sets the foundation for everything that follows,
13:33
so let's take a moment to get familiar with it.
13:36
We often call this embedding a word, which invites you to think of these
13:39
vectors very geometrically as points in some high dimensional space.
13:44
Visualizing a list of three numbers as coordinates for points in 3D space would
13:48
be no problem, but word embeddings tend to be much much higher dimensional.
13:52
In GPT-3 they have 12,288 dimensions, and as you'll see,
13:55
it matters to work in a space that has a lot of distinct directions.
14:01
In the same way that you could take a two-dimensional slice through a 3D space
14:05
and project all the points onto that slice, for the sake of animating word
14:08
embeddings that a simple model is giving me, I'm going to do an analogous
14:12
thing by choosing a three-dimensional slice through this very high dimensional space,
14:16
and projecting the word vectors down onto that and displaying the results.
14:21
The big idea here is that as a model tweaks and tunes its weights to determine
14:25
how exactly words get embedded as vectors during training,
14:28
it tends to settle on a set of embeddings where directions in the space have a
14:33
kind of semantic meaning.
14:34
For the simple word-to-vector model I'm running here,
14:37
if I run a search for all the words whose embeddings are closest to that of tower,
14:42
you'll notice how they all seem to give very similar tower-ish vibes.
14:46
And if you want to pull up some Python and play along at home,
14:48
this is the specific model that I'm using to make the animations.
14:51
It's not a transformer, but it's enough to illustrate the
14:54
idea that directions in the space can carry semantic meaning.
14:58
A very classic example of this is how if you take the difference between
15:02
the vectors for woman and man, something you would visualize as a
15:05
little vector in the space connecting the tip of one to the tip of the other,
15:09
it's very similar to the difference between king and queen.
15:15
So let's say you didn't know the word for a female monarch,
15:18
you could find it by taking king, adding this woman minus man direction,
15:22
and searching for the embedding closest to that point.
15:27
At least, kind of.
15:28
Despite this being a classic example for the model I'm playing with,
15:31
the true embedding of queen is actually a little farther off than this would suggest,
15:35
presumably because the way queen is used in training data is not merely a feminine
15:39
version of king.
15:41
When I played around, family relations seemed to illustrate the idea much better.
15:46
The point is, it looks like during training the model found it advantageous to
15:50
choose embeddings such that one direction in this space encodes gender information.
15:56
Another example is that if you take the embedding of Italy,
16:00
and you subtract the embedding of Germany, and add that to the embedding of Hitler,
16:04
you get something very close to the embedding of Mussolini.
16:08
It's as if the model learned to associate some directions with Italian-ness,
16:13
and others with WWII axis leaders.
16:16
Maybe my favorite example in this vein is how in some models,
16:19
if you take the difference between Germany and Japan, and add it to sushi,
16:24
you end up very close to bratwurst.
16:27
Also in playing this game of finding nearest neighbors,
16:30
I was very pleased to see how close cat was to both beast and monster.
16:34
One bit of mathematical intuition that's helpful to have in mind,
16:37
especially for the next chapter, is how the dot product of two
16:40
vectors can be thought of as a way to measure how well they align.
16:44
Computationally, dot products involve multiplying all the
16:47
corresponding components and then adding the results, which is good,
16:51
since so much of our computation has to look like weighted sums.
16:55
Geometrically, the dot product is positive when vectors point in similar directions,
16:59
it's zero if they're perpendicular, and it's negative whenever
17:03
they point in opposite directions.
17:06
For example, let's say you were playing with this model,
17:09
and you hypothesize that the embedding of cats minus cat might represent a sort of
17:14
plurality direction in this space.
17:17
To test this, I'm going to take this vector and compute its dot
17:20
product against the embeddings of certain singular nouns,
17:23
and compare it to the dot products with the corresponding plural nouns.
17:27
If you play around with this, you'll notice that the plural ones
17:30
do indeed seem to consistently give higher values than the singular ones,
17:33
indicating that they align more with this direction.
17:37
It's also fun how if you take this dot product with the embeddings of the words one,
17:41
two, three, and so on, they give increasing values,
17:44
so it's as if we can quantitatively measure how plural the model finds a given word.
17:50
Again, the specifics for how words get embedded is learned using data.
17:54
This embedding matrix, whose columns tell us what happens to each word,
17:57
is the first pile of weights in our model.
18:00
Using the GPT-3 numbers, the vocabulary size specifically is 50,257,
18:04
and again, technically this consists not of words per se, but of tokens.
18:10
The embedding dimension is 12,288, and multiplying those
18:14
tells us this consists of about 617 million weights.
18:18
Let's go ahead and add this to a running tally,
18:20
remembering that by the end we should count up to 175 billion.
18:25
In the case of transformers, you really want to think of the vectors
18:28
in this embedding space as not merely representing individual words.
18:32
For one thing, they also encode information about the position of that word,
18:36
which we'll talk about later, but more importantly,
18:39
you should think of them as having the capacity to soak in context.
18:43
A vector that started its life as the embedding of the word king, for example,
18:47
might progressively get tugged and pulled by various blocks in this network,
18:51
so that by the end it points in a much more specific and nuanced direction that
18:55
somehow encodes that it was a king who lived in Scotland,
18:58
and who had achieved his post after murdering the previous king,
19:01
and who's being described in Shakespearean language.
19:05
Think about your own understanding of a given word.
19:08
The meaning of that word is clearly informed by the surroundings,
19:11
and sometimes this includes context from a long distance away,
19:15
so in putting together a model that has the ability to predict what word comes next,
19:19
the goal is to somehow empower it to incorporate context efficiently.
19:24
To be clear, in that very first step, when you create the array of
19:27
vectors based on the input text, each one of those is simply plucked
19:30
out of the embedding matrix, so initially each one can only encode
19:33
the meaning of a single word without any input from its surroundings.
19:37
But you should think of the primary goal of this network that it flows through
19:41
as being to enable each one of those vectors to soak up a meaning that's much
19:45
more rich and specific than what mere individual words could represent.
19:49
The network can only process a fixed number of vectors at a time,
19:52
known as its context size.
19:54
For GPT-3 it was trained with a context size of 2048,
19:57
so the data flowing through the network always looks like this array of 2048 columns,
20:02
each of which has 12,000 dimensions.
20:05
This context size limits how much text the transformer can
20:08
incorporate when it's making a prediction of the next word.
20:12
This is why long conversations with certain chatbots,
20:15
like the early versions of ChatGPT, often gave the feeling of
20:18
the bot kind of losing the thread of conversation as you continued too long.
20:23
We'll go into the details of attention in due time,
20:25
but skipping ahead I want to talk for a minute about what happens at the very end.
20:29
Remember, the desired output is a probability
20:31
distribution over all tokens that might come next.
20:35
For example, if the very last word is Professor,
20:37
and the context includes words like Harry Potter,
20:40
and immediately preceding we see least favorite teacher,
20:43
and also if you give me some leeway by letting me pretend that tokens simply
20:47
look like full words, then a well-trained network that had built up knowledge
20:51
of Harry Potter would presumably assign a high number to the word Snape.
20:56
This involves two different steps.
20:58
The first one is to use another matrix that maps the very last vector in that
21:03
context to a list of 50,000 values, one for each token in the vocabulary.
21:08
Then there's a function that normalizes this into a probability distribution,
21:12
it's called softmax and we'll talk more about it in just a second,
21:15
but before that it might seem a little bit weird to only use this last embedding
21:19
to make a prediction, when after all in that last step there are thousands of
21:23
other vectors in the layer just sitting there with their own context-rich meanings.
21:28
This has to do with the fact that in the training process it turns out to be
21:32
much more efficient if you use each one of those vectors in the final layer
21:36
to simultaneously make a prediction for what would come immediately after it.
21:40
There's a lot more to be said about training later on,
21:43
but I just want to call that out right now.
21:45
This matrix is called the Unembedding matrix and we give it the label WU.
21:50
Again, like all the weight matrices we see, its entries begin at random,
21:53
but they are learned during the training process.
21:56
Keeping score on our total parameter count, this Unembedding
21:59
matrix has one row for each word in the vocabulary,
22:02
and each row has the same number of elements as the embedding dimension.
22:06
It's very similar to the embedding matrix, just with the order swapped,
22:10
so it adds another 617 million parameters to the network,
22:13
meaning our count so far is a little over a billion,
22:16
a small but not wholly insignificant fraction of the 175 billion
22:20
we'll end up with in total.
22:22
As the very last mini-lesson for this chapter,
22:24
I want to talk more about this softmax function,
22:26
since it makes another appearance for us once we dive into the attention blocks.
22:31
The idea is that if you want a sequence of numbers to act as a probability distribution,
22:36
say a distribution over all possible next words,
22:39
then each value has to be between 0 and 1, and you also need all of them to add up to 1.
22:45
However, if you're playing the deep learning game where everything you do looks like
22:49
matrix-vector multiplication, the outputs you get by default don't abide by this at all.
22:55
The values are often negative, or much bigger than 1,
22:57
and they almost certainly don't add up to 1.
23:00
Softmax is the standard way to turn an arbitrary list of numbers
23:04
into a valid distribution in such a way that the largest values end up closest to 1,
23:08
and the smaller values end up very close to 0.
23:11
That's all you really need to know.
23:13
But if you're curious, the way it works is to first raise e to the power
23:17
of each of the numbers, which means you now have a list of positive values,
23:21
and then you can take the sum of all those positive values and divide each
23:25
term by that sum, which normalizes it into a list that adds up to 1.
23:30
You'll notice that if one of the numbers in the input is meaningfully bigger than the
23:34
rest, then in the output the corresponding term dominates the distribution,
23:37
so if you were sampling from it you'd almost certainly just be picking the maximizing
23:42
input.
23:42
But it's softer than just picking the max in the sense that when other values
23:46
are similarly large, they also get meaningful weight in the distribution,
23:50
and everything changes continuously as you continuously vary the inputs.
23:55
In some situations, like when ChatGPT is using this distribution to create a next word,
23:59
there's room for a little bit of extra fun by adding a little extra spice into this
24:04
function, with a constant T thrown into the denominator of those exponents.
24:09
We call it the temperature, since it vaguely resembles the role of temperature in
24:14
certain thermodynamics equations, and the effect is that when T is larger,
24:18
you give more weight to the lower values, meaning the distribution is a little bit
24:22
more uniform, and if T is smaller, then the bigger values will dominate more
24:26
aggressively, where in the extreme, setting T equal to zero means all of the weight
24:31
goes to maximum value.
24:33
For example, I'll have GPT-3 generate a story with the seed text,
24:37
"once upon a time there was A", but I'll use different temperatures in each case.
24:43
Temperature zero means that it always goes with the most predictable word,
24:48
and what you get ends up being a trite derivative of Goldilocks.
24:53
A higher temperature gives it a chance to choose less likely words,
24:56
but it comes with a risk.
24:58
In this case, the story starts out more originally,
25:01
about a young web artist from South Korea, but it quickly degenerates into nonsense.
25:06
Technically speaking, the API doesn't actually let you pick a temperature bigger than 2.
25:11
There's no mathematical reason for this, it's just an arbitrary constraint imposed
25:15
to keep their tool from being seen generating things that are too nonsensical.
25:19
So if you're curious, the way this animation is actually working is I'm taking the
25:24
20 most probable next tokens that GPT-3 generates,
25:27
which seems to be the maximum they'll give me,
25:29
and then I tweak the probabilities based on an exponent of 1/5.
25:33
As another bit of jargon, in the same way that you might call the components of
25:37
the output of this function probabilities, people often refer to the inputs as logits,
25:42
or some people say logits, some people say logits, I'm gonna say logits.
25:46
So for instance, when you feed in some text, you have all these word embeddings
25:50
flow through the network, and you do this final multiplication with the
25:53
unembedding matrix, machine learning people would refer to the components in that raw,
25:58
unnormalized output as the logits for the next word prediction.
26:03
A lot of the goal with this chapter was to lay the foundations for
26:06
understanding the attention mechanism, Karate Kid wax-on-wax-off style.
26:10
You see, if you have a strong intuition for word embeddings, for softmax,
26:14
for how dot products measure similarity, and also the underlying premise that
26:19
most of the calculations have to look like matrix multiplication with matrices
26:23
full of tunable parameters, then understanding the attention mechanism,
26:27
this cornerstone piece in the whole modern boom in AI, should be relatively smooth.
26:32
For that, come join me in the next chapter.
26:36
As I'm publishing this, a draft of that next chapter
26:38
is available for review by Patreon supporters.
26:41
A final version should be up in public in a week or two,
26:44
it usually depends on how much I end up changing based on that review.
26:47
In the meantime, if you want to dive into attention,
26:49
and if you want to help the channel out a little bit, it's there waiting.