Neural Networks: Zero to Hero
Let's build GPT: from scratch, in code, spelled out.
Transcript
0:00
Hi everyone. By now you have probably heard of ChatGPT. It has taken the world and the AI community by storm, and it is a system that allows you to interact with an AI and give it text-based tasks. For example, we can ask ChatGPT to write us a small haiku about how important it is that people understand AI, so that they can use it to improve the world and make it more prosperous. When we run this, we get: "AI knowledge brings prosperity for all to see. Embrace its power." Okay, not bad. You could see that ChatGPT went from left to right and generated all these words sequentially. Now, I already asked it the exact same prompt a little bit earlier, and it generated a slightly different outcome: "AI's power to grow, ignorance holds us back. Learn, prosperity waits." So, pretty good in both cases, and slightly different. You can see that ChatGPT is a probabilistic system, and for any one prompt it can give us multiple answers.
0:57
Now, this is just one example of a prompt. People have come up with many, many examples, and there are entire websites that index interactions with ChatGPT. Many of them are quite humorous: "Explain HTML to me like I'm a dog", "Write release notes for chess 2", "Write a note about Elon Musk buying Twitter", and so on. As an example: "Please write a breaking news article about a leaf falling from a tree." And the response: "In a shocking turn of events, a leaf has fallen from a tree in the local park. Witnesses report that the leaf, which was previously attached to a branch of a tree, detached itself and fell to the ground." Very dramatic.
1:36
So you can see that this is a pretty remarkable system, and it is what we call a language model, because it models the sequence of words, or characters, or tokens more generally, and it knows how words follow each other in the English language. From its perspective, what it is doing is completing the sequence: I give it the start of a sequence, and it completes the sequence with the outcome. So it's a language model in that sense.
2:05
Now I would like to focus on the under-the-hood components of what makes ChatGPT work. What is the neural network under the hood that models the sequence of these words? That comes from a paper called "Attention Is All You Need", from 2017, a landmark paper in AI that proposed the Transformer architecture. GPT is short for Generative Pre-trained Transformer, so the Transformer is the neural net that actually does all the heavy lifting under the hood. Now, if you read this paper, it reads like a pretty random machine translation paper, and that's because I think the authors didn't fully anticipate the impact that the Transformer would have on the field. This architecture, which they produced in the context of machine translation, actually ended up taking over the rest of AI in the next five years. With minor changes, it was copy-pasted into a huge number of applications in AI in more recent years, and that includes the core of ChatGPT.
3:13
What I'd like to do now is build out something like ChatGPT, but of course we're not going to be able to reproduce ChatGPT. This is a very serious, production-grade system. It is trained on a good chunk of the internet, and then there are a lot of pre-training and fine-tuning stages to it, so it's very complicated. What I'd like to focus on is just training a Transformer-based language model, and in our case it's going to be a character-level language model. I still think that is very educational with respect to how these systems work. I don't want to train on a chunk of the internet; we need a smaller data set. In this case I propose that we work with my favorite toy data set, called Tiny Shakespeare. It's basically a concatenation of all of the works of Shakespeare, in my understanding, so this is all of Shakespeare in a single file. This file is about 1 megabyte, and it's just all of Shakespeare.
4:10
What we are going to do now is model how these characters follow each other. For example, given a chunk of these characters like this, given some context of characters in the past, the Transformer neural network will look at the characters that I've highlighted and predict that "g" is likely to come next in the sequence. It's going to do that because we're going to train that Transformer on Shakespeare, and it's just going to try to produce character sequences that look like this. In that process it's going to model all the patterns inside this data.
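As a rough illustration of "given some context, predict the next character", here is a minimal sketch that uses simple follower counts in place of a Transformer. This is a deliberate stand-in, not the model built in the lecture; the sample text, the context length of 3, and the helper name `next_char_probs` are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in corpus; the lecture trains on the Tiny Shakespeare file.
text = "the king is going, the king is coming, the king is singing"

# Assumed context size: look at the previous 3 characters only.
context_len = 3

# For every 3-character context, count which character follows it in the text.
counts = defaultdict(Counter)
for i in range(len(text) - context_len):
    context = text[i:i + context_len]
    counts[context][text[i + context_len]] += 1

def next_char_probs(context):
    """Return a {char: probability} dict for the character following `context`."""
    followers = counts[context[-context_len:]]
    total = sum(followers.values())
    return {ch: n / total for ch, n in followers.items()}

probs = next_char_probs("kin")
most_likely = max(probs, key=probs.get)
print(most_likely, probs)
```

In this toy corpus the context "kin" is always followed by "g", so the counts put all the probability there. A trained Transformer plays the same role, mapping a context to a distribution over the next character, but it learns that mapping with a neural network rather than a lookup of counts.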
4:43
Once we've trained the system, and I'd just like to give you a preview, we can generate infinite Shakespeare. Of course, it's a fake thing that looks kind of like Shakespeare. Apologies, there's some jank that I'm not able to resolve in here, but you can see how this is going character by character, and it's kind of predicting Shakespeare-like language: "Verily, my lord, the sites have left the again the king coming with my curses with precious pale", and then Tranio says something else, etc. This is just coming out of the Transformer in a very similar manner as it would come out in ChatGPT. In our case it's character by character; in ChatGPT it's coming out on a token-by-token level, and tokens are these little sub-word pieces, so they're not word level, they're kind of word-chunk level.
5:43
Now, I've already written this entire code to train these Transformers, and it is in a GitHub repository that you can find, called nanoGPT. nanoGPT is a repository on my GitHub for training Transformers on any given text. There are many ways to train Transformers, but what I think is interesting about it is that it's a very simple implementation: it's just two files of 300 lines of code each. One file defines the GPT model, the Transformer, and one file trains it on some given text data set. Here I'm showing that if you train it on the OpenWebText data set, which is a fairly large data set of web pages, then I reproduce the performance of GPT-2. GPT-2 is an early version of OpenAI's GPT, from 2019, and I've only so far reproduced the smallest, 124-million-parameter model. But basically this just proves that the codebase is correctly arranged, and I'm able to load the neural network weights that OpenAI later released.
6:48
So you can take a look at the finished code here in nanoGPT, but what I would like to do in this lecture is write this repository from scratch. We're going to begin with an empty file and define a Transformer piece by piece. We're going to train it on the Tiny Shakespeare data set, and we'll see how we can then generate infinite Shakespeare. Of course, this can be copy-pasted to any arbitrary text data set that you like, but my goal here is really just to make you understand and appreciate how ChatGPT works under the hood. Really all that's required is proficiency in Python and some basic understanding of calculus and statistics. It would help if you also watched my previous videos on the same YouTube channel, in particular my makemore series, where I define smaller and simpler neural network language models, so multilayer perceptrons and so on. It really introduces the language modeling framework, and then here in this video we're going to focus on the Transformer neural network itself.
7:53
Okay, so I created a new Google Colab Jupyter notebook here. This will allow me to later easily share the code that we're going to develop together with you, so you can follow along; it will be in the video description later. Here I've just done some preliminaries. I downloaded the Tiny Shakespeare data set at this URL, and you can see that it's about a 1 megabyte file. Then here I open the input.txt file and just read in all the text as a string, and we see that we are working with roughly 1 million characters. The first 1,000 characters, if we just print them out, are basically what you would expect: this is the first 1,000 characters of the Tiny Shakespeare data set, roughly up to here. So far so good.
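The preliminaries described here could look roughly like the sketch below. The download URL is not given in the transcript, so the sketch writes a short hypothetical sample to a temporary `input.txt` instead of fetching the real file; only the file name `input.txt` and the read-then-print steps come from the lecture.

```python
import os
import tempfile

# Stand-in for the downloaded data set: a short hypothetical sample written to a
# temporary input.txt so the sketch runs offline. The real file is about 1 MB.
sample = "First Citizen:\nBefore we proceed any further, hear me speak.\n"
path = os.path.join(tempfile.mkdtemp(), "input.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Read the whole file into a single Python string.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

print("length of data set in characters:", len(text))
print(text[:1000])  # first 1,000 characters (here, the whole sample)
```

With the real file in place of the sample, the length printed would be roughly 1 million characters, as stated in the lecture.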
8:37
Next, we're going to take this text. The text is a sequence of characters in Python, so when I call the set constructor on it, I just get the set of all the characters that occur in this text. Then I call list on that to create a list of those characters instead of just a set, so that I have an ordering, an arbitrary ordering, and then I sort that. So basically we get all the characters that occur in the entire data set, and they're sorted. The number of them is going to be our vocabulary size; these are the possible elements of our sequences. When I print the characters here, we see there are 65 of them in total: there's a space character, then all kinds of special characters, and then capital and lowercase letters. So that's our vocabulary, the possible characters that the model can see or emit.
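The set, list, and sort steps just described can be sketched as follows. The variable names `chars` and `vocab_size` are illustrative, and the short `text` is a hypothetical sample standing in for the full data set, so the vocabulary it produces is much smaller than the 65 characters mentioned in the lecture.

```python
# Hypothetical short sample standing in for the full Tiny Shakespeare text.
text = "First Citizen:\nBefore we proceed any further, hear me speak.\n"

# set(text) keeps each distinct character once (in no particular order),
# list(...) turns the set into a list, and sorted(...) fixes a stable ordering.
chars = sorted(list(set(text)))
vocab_size = len(chars)

print(repr("".join(chars)))  # repr makes the newline character visible
print(vocab_size)
```

Sorting matters because the position of each character in `chars` is what later becomes its integer ID; without a fixed order, the IDs could change from run to run.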
9:29
model can see or emit okay so next we will would like to develop some strategy
9:30
will would like to develop some strategy
9:31
will would like to develop some strategy to tokenize the input text now when
9:34
to tokenize the input text now when
9:35
to tokenize the input text now when people say tokenize they mean convert
9:36
people say tokenize they mean convert
9:36
people say tokenize they mean convert the raw text as a string to some
9:39
the raw text as a string to some
9:39
the raw text as a string to some sequence of integers According to some
9:41
sequence of integers According to some
9:41
sequence of integers According to some uh notebook According to some vocabulary
9:43
uh notebook According to some vocabulary
9:43
uh notebook According to some vocabulary of possible elements so as an example
9:46
of possible elements so as an example
9:46
of possible elements so as an example here we are going to be building a
9:48
here we are going to be building a
9:48
here we are going to be building a character level language model so we're
9:49
character level language model so we're
9:49
character level language model so we're simply going to be translating
9:50
simply going to be translating
9:50
simply going to be translating individual characters into integers so
9:53
individual characters into integers so
9:53
individual characters into integers so let me show you uh a chunk of code that
9:55
let me show you uh a chunk of code that
9:55
let me show you uh a chunk of code that sort of does that for us so we're
9:57
sort of does that for us so we're
9:57
sort of does that for us so we're building both the encoder and the
9:58
building both the encoder and the
9:58
building both the encoder and the decoder
10:01
and let me just talk through what's happening here. When we encode an arbitrary text like "hi there", we're going to receive a list of integers that represents that string, so for example 46, 47, etc. And then we also have the reverse mapping, so we can take this list and decode it to get back the exact same string. So it's really just a translation to integers and back for an arbitrary string, and for us it is done on a character level.
10:31
Now, the way this was achieved is we just iterate over all the characters here and create a lookup table from the character to the integer and vice versa. Then, to encode some string, we simply translate all the characters individually, and to decode it back we use the reverse mapping and concatenate all of it.
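The lookup-table idea described here can be sketched in a few lines of plain Python. This is a minimal illustration, not necessarily the exact code shown on screen: the short `text` string is a stand-in for the real dataset, and the names `stoi` and `itos` are assumed for the two tables.

```python
# Minimal sketch of a character-level tokenizer.
# `text` is a short stand-in string, not the real Tiny Shakespeare file.
text = "hi there, this is a tiny stand-in corpus"

# The vocabulary is every distinct character that occurs, in sorted order.
chars = sorted(set(text))
vocab_size = len(chars)

# Lookup tables: character -> integer and integer -> character.
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Translate each character of `s` into its integer id."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map integer ids back to characters and concatenate them."""
    return "".join(itos[i] for i in ids)

ids = encode("hi there")
print(ids)
print(decode(ids))
```

Because the integers depend on which characters appear in the corpus, the ids printed for this stand-in text will differ from the 46, 47 mentioned for the Shakespeare vocabulary. Note also that `encode` raises a `KeyError` for any character that never occurred in `text`.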
10:49
Now, this is only one of many possible encodings, or many possible tokenizers, and it's a very simple one. But there are many other schemes that people have come up with in practice. For example, Google uses SentencePiece. SentencePiece will also encode text into integers, but in a different scheme and using a different vocabulary. SentencePiece is a subword tokenizer, and what that means is that you're not encoding entire words, but you're also not encoding individual characters; it's a subword unit level, and that's usually what's adopted in practice. For example, OpenAI also has this library called tiktoken that uses a byte pair encoding tokenizer, and that's what GPT uses.
11:35
And you can also just encode words like "hello world" into a list of integers. So as an example, I'm using the tiktoken library here. I'm getting the encoding for GPT-2, or that was used for GPT-2. Instead of just having 65 possible characters or tokens, they have 50,000 tokens, and so when they encode the exact same string, "hi there", we only get a list of three integers. But those integers are not between 0 and 64; they are between 0 and 50,256.
12:05
So basically you can trade off the codebook size and the sequence lengths. You can have very long sequences of integers with very small vocabularies, or you can have short sequences of integers with very large vocabularies.
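To make the trade-off between vocabulary size and sequence length concrete, here is a toy sketch in plain Python. It is not tiktoken or SentencePiece; it is a simplified byte-pair-style merge loop over characters, and the sample `text` and the helper name `merge_most_frequent_pair` are made up for illustration. Each merge adds at most one entry to the vocabulary and shortens the token sequence.

```python
from collections import Counter

def merge_most_frequent_pair(tokens):
    """Merge every occurrence of the most frequent adjacent pair.

    Returns the new token list and the merged token that was created.
    """
    pairs = Counter(zip(tokens, tokens[1:]))
    (a, b), _count = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)   # the pair becomes a single token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged, a + b

text = "the thin thing then thanked them"
tokens = list(text)   # start at the character level: long sequence
vocab = set(tokens)   # ... but a very small vocabulary
history = [(len(vocab), len(tokens))]

for _ in range(5):
    tokens, new_token = merge_most_frequent_pair(tokens)
    vocab.add(new_token)   # at most one new vocabulary entry per merge
    history.append((len(vocab), len(tokens)))

for vocab_size, seq_len in history:
    print(f"vocab size {vocab_size:2d} -> sequence length {seq_len}")
```

Running more merges keeps pushing in the same direction: a larger codebook and a shorter sequence for the same text, which is the regime that production subword tokenizers operate in.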
12:25
And so typically people use these subword encodings in practice, but I'd like to keep our tokenizer very simple, so we're using a character-level tokenizer. That means we have very small codebooks and very simple encode and decode functions, but we do get very long sequences as a result. That's the level we're going to stick with in this lecture, because it's the simplest thing.
12:46
Okay, so now that we have an encoder and a decoder, effectively a tokenizer, we can tokenize the entire training set of Shakespeare. Here's a chunk of code that does that. I'm going to start to use the PyTorch library, and specifically torch.tensor from the PyTorch library. We're going to take all of the text in Tiny Shakespeare, encode it, and then wrap it into a torch.tensor to get the data tensor.
13:08
Here's what the data tensor looks like when I look at just the first 1,000 characters, or the first 1,000 elements of it. We see that we have a massive sequence of integers, and this sequence of integers here is basically an identical translation of the first 1,000 characters here. I believe, for example, that 0 is a newline character and maybe 1 is a space, not 100% sure. But from now on the entire dataset of text is re-represented as a single, very large sequence of integers.
13:41
Let me do one more thing before we move on here. I'd like to separate out our dataset into a train and a validation split. In particular, we're going to take the first 90% of the dataset and consider that to be the training data for the Transformer, and we're going to withhold the last 10% at the end of it to be the validation data. This will help us understand to what extent our model is overfitting. We're going to basically hide and keep the validation data on the side, because we don't want just a perfect memorization of this exact Shakespeare; we want a neural network that creates Shakespeare-like text. And so it should be fairly likely for it to produce the actual, stowed-away, true Shakespeare text. So we're going to use this to get a sense of the overfitting.
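The encoding and the 90/10 split can be sketched as follows. This is a plain-Python stand-in under stated assumptions: the lecture wraps the encoded data in a torch.tensor, whereas this sketch uses an ordinary list so it runs without PyTorch, and the repeated two-line `text` is a placeholder for the real Tiny Shakespeare file. The names `train_data` and `val_data` are assumed.

```python
# Stand-in corpus; the lecture uses the Tiny Shakespeare text instead.
text = "First Citizen:\nBefore we proceed any further, hear me speak.\n" * 20

# Character-level vocabulary and lookup table, as built earlier.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}

# The whole corpus becomes one long sequence of integers
# (a torch.tensor in the lecture, a plain list here).
data = [stoi[c] for c in text]

# The first 90% is training data; the last 10% is held out for validation.
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

print(len(data), len(train_data), len(val_data))
```

The split is by position rather than by random shuffling, so the validation text is a contiguous stretch at the end that the model never trains on.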
14:28
Okay, so now we would like to start plugging these text sequences, or integer sequences, into the Transformer so that it can train and learn those patterns. Now, the important thing to realize is we're never going to actually feed the entire text into a Transformer all at once; that would be computationally very expensive and prohibitive. So when we actually train a Transformer on a lot of these datasets, we only work with chunks of the dataset, and when we train the Transformer we basically sample random little chunks out of the training set and train on just chunks at a time.
14:58
These chunks have some maximum length. The maximum length, at least in the code I usually write, is called block size. You can find it under different names, like context length or something like that. Let's start with a block size of just eight, and let me look at the first train data characters, the first block size plus one characters. I'll explain why plus one in a second.
15:24
So this is the first nine characters in the sequence in the training set. Now, what I'd like to point out is that when you sample a chunk of data like this, say these nine characters out of the training set, this actually has multiple examples packed into it, and that's because all of these characters follow each other. So what this thing is going to say when we plug it into a Transformer is we're going to actually simultaneously train it to make a prediction at every one of these positions.
15:53
Now, in a chunk of nine characters, there are actually eight individual examples packed in there. There's the example that in the context of 18, 47 likely comes next; in the context of 18 and 47, 56 comes next; in the context of 18, 47, 56, 57 can come next; and so on. So that's the eight individual examples. Let me actually spell it out with code.
16:24
So here's a chunk of code to illustrate. x are the inputs to the Transformer; it will just be the first block size characters. y will be the next block size characters, so it's offset by one, and that's because y are the targets for each position in the input. Then here I'm iterating over all of the block size of eight. The context is always all the characters in x up to and including t, and the target is always the t-th character, but in the targets array y. Let me just run this, and basically it spells out what I said in words. These are the eight examples hidden in a chunk of nine characters that we sampled from the training set.
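The loop described here can be sketched in plain Python. The first four token ids below match the values read out earlier (18, 47, 56, 57); the remaining five are illustrative placeholders, and a plain list stands in for the tensor used in the lecture.

```python
block_size = 8

# Nine token ids: the first four match the values read out in the transcript,
# the remaining five are illustrative placeholders.
train_data = [18, 47, 56, 57, 58, 1, 15, 47, 58]

x = train_data[:block_size]        # inputs: the first block_size tokens
y = train_data[1:block_size + 1]   # targets: the same window shifted by one

examples = []
for t in range(block_size):
    context = x[:t + 1]   # everything in x up to and including position t
    target = y[t]         # the token that follows that context
    examples.append((context, target))
    print(f"when input is {context} the target: {target}")
```

The extra element in the nine-token chunk is what supplies a target for the final, full-length context, which is why the chunk is block size plus one long.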
17:11
I want to mention one more thing. We train on all eight examples here, with context between one all the way up to a context of block size, and we train on that not just for computational reasons, because we happen to have the sequence already or something like that. It's not just done for efficiency; it's also done to make the Transformer network be used to seeing contexts all the way from as little as one all the way to block size, and we'd like the Transformer to be used to seeing everything in between.
17:39
That's going to be useful later during inference, because while we're sampling we can start the sampling generation with as little as one character of context, and the Transformer knows how to predict the next character with a context of just one. And so then it can predict everything up to block size, and after block size we have to start truncating, because the Transformer will never receive more than block size inputs when it's predicting the next character.
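The truncation step at inference time can be sketched like this. It is only an illustration of the context handling: `predict_next` is a hypothetical stand-in that returns a dummy token id, not a real model, and the modulus 65 simply mirrors the 65-character vocabulary mentioned earlier.

```python
block_size = 8

def predict_next(context):
    """Hypothetical stand-in for the model: returns a dummy next token id."""
    # The model is never handed more than block_size tokens of context.
    assert 1 <= len(context) <= block_size
    return (sum(context) + 1) % 65

tokens = [18]        # generation starts from a single token of context
seen_lengths = []
for _ in range(12):
    # Keep only the most recent block_size tokens as context.
    context = tokens[-block_size:]
    seen_lengths.append(len(context))
    tokens.append(predict_next(context))

print(seen_lengths)
```

The context grows from one token up to block size and then stays there, with the oldest tokens dropped as new ones are generated. Training on every context length from one to block size is what prepares the model for the short contexts at the start of this loop.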
18:06
Okay, so we've looked at the time dimension of the tensors that are going to be feeding into the Transformer. There's one more dimension to care about, and that is the batch dimension. As we're sampling these chunks of text, every time we feed them into a Transformer we're going to have batches of multiple chunks of text that are all stacked up in a single tensor. That's just done for efficiency, just so that we can keep the GPUs busy, because they are very good at parallel processing of data. So we just want to process multiple chunks all at the same time, but those chunks are processed completely independently; they don't talk to each other, and so on.
18:42
So let me basically just generalize this and introduce a batch dimension. Here's a chunk of code. Let me just run it, and then I'm going to explain what it does.
18:52
So here, because we're going to start sampling random locations in the dataset to pull chunks from, I am setting the seed in the random number generator, so that the numbers I see here are going to be the same numbers you see later if you try to reproduce this. Now, the batch size here is how many independent sequences we are processing every forward-backward pass of the Transformer. The block size, as I explained, is the maximum context length to make those predictions. So let's say batch size four, block size eight, and then here's how we get a batch for any arbitrary split. If the split is a training split, then we're going to look at train data, otherwise at val data. That gives us the data array.
19:33
that gives us the data array and then
19:33
that gives us the data array and then when I Generate random positions to grab
19:35
when I Generate random positions to grab
19:35
when I Generate random positions to grab a chunk out of I actually grab I
19:38
a chunk out of I actually grab I
19:38
a chunk out of I actually grab I actually generate batch size number of
19:41
actually generate batch size number of
19:41
actually generate batch size number of Random offsets so because this is four
19:44
Random offsets so because this is four
19:44
Random offsets so because this is four we are ex is going to be a uh four
19:47
we are ex is going to be a uh four
19:47
we are ex is going to be a uh four numbers that are randomly generated
19:49
numbers that are randomly generated
19:49
numbers that are randomly generated between zero and Len of data minus block
19:51
between zero and Len of data minus block
19:51
between zero and Len of data minus block size so it's just random offsets into
19:53
size so it's just random offsets into
19:53
size so it's just random offsets into the training
19:54
the training
19:54
the training set and then X's as I explained are the
19:58
set and then X's as I explained are the
19:58
set and then X's as I explained are the first first block size characters
20:00
first first block size characters
20:00
first first block size characters starting at I the Y's are the offset by
20:04
starting at I the Y's are the offset by
20:05
starting at I the Y's are the offset by one of that so just add plus one and
20:08
one of that so just add plus one and
20:08
one of that so just add plus one and then we're going to get those chunks for
20:10
then we're going to get those chunks for
20:10
then we're going to get those chunks for every one of integers I INX and use a
20:14
every one of integers I INX and use a
20:14
every one of integers I INX and use a torch. stack to take all those uh uh
20:17
torch. stack to take all those uh uh
20:17
torch. stack to take all those uh uh one-dimensional tensors as we saw here
20:20
one-dimensional tensors as we saw here
20:20
one-dimensional tensors as we saw here and we're going to um stack them up at
20:24
and we're going to um stack them up at
20:24
and we're going to um stack them up at rows and so they all become a row in a
20:27
rows and so they all become a row in a
20:27
rows and so they all become a row in a 4x8 tensor
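A minimal sketch of the batching step described here, not necessarily the lecture's exact code. The toy `data` tensor, the 90/10 split, and the names `train_data` and `val_data` are assumptions made so the example runs on its own; the lecture uses the encoded text dataset instead.

```python
import torch

torch.manual_seed(1337)

# Assumption: a toy 1-D tensor of 1000 token ids stands in for the encoded text.
data = torch.randint(0, 65, (1000,))
n = int(0.9 * len(data))  # assumed 90/10 train/validation split
train_data, val_data = data[:n], data[n:]

batch_size = 4  # independent sequences processed in parallel
block_size = 8  # maximum context length for predictions

def get_batch(split):
    # pick the data array for the requested split
    d = train_data if split == "train" else val_data
    # batch_size random offsets in [0, len(d) - block_size)
    ix = torch.randint(len(d) - block_size, (batch_size,))
    # inputs: block_size tokens starting at each offset, stacked as rows
    x = torch.stack([d[i:i + block_size] for i in ix])
    # targets: the same chunks shifted right by one position
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
    return x, y

xb, yb = get_batch("train")
print(xb.shape, yb.shape)  # both are (batch_size, block_size)
```

Each row of `yb` is the matching row of `xb` shifted by one token, which is what lets every position in the batch serve as its own training example.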
20:32
So here's where I'm printing them. When I sample a batch xb and yb, the input to the Transformer now is x, the 4x8 tensor: four rows of eight columns, and each one of these is a chunk of the training set. And then the targets here are in the associated array y, and they will come in to the Transformer all the way at the end, to create the loss function. So they will give us the correct answer for every single position inside x. And then these are the four independent rows.
21:11
So spelled out as we did before, this 4x8 array contains a total of 32 examples, and they're completely independent as far as the Transformer is concerned. So when the input is 24, the target is 43, or rather 43 here in the y array. When the input is 24, 43, the target is 58. When the input is 24, 43, 58, the target is 5, etc. Or like when it is 52, 58, 1, the target is 58. Right, so you can sort of see this spelled out: these are the 32 independent examples packed into a single batch of the input x, and then the desired targets are in y. And so now this integer tensor x is going to feed into the Transformer, and that Transformer is going to simultaneously process all these examples and then look up the correct integers to predict in every one of these positions in the tensor y.
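To make the "one row holds many examples" idea concrete, here is a small sketch in plain Python. It uses only the first few token ids read out in the lecture (24, 43, 58 with targets 43, 58, 5); the shortened three-token row and the variable names are assumptions for illustration.

```python
# One shortened row of the batch, using the token ids read out in the lecture.
x_row = [24, 43, 58]  # inputs
y_row = [43, 58, 5]   # targets: the inputs shifted by one position

examples = []
for t in range(len(x_row)):
    context = x_row[:t + 1]  # everything up to and including position t
    target = y_row[t]        # the token that follows that context
    examples.append((context, target))
    print(f"when input is {context} the target is {target}")

# With 4 rows of 8 positions, the same loop yields 4 * 8 = 32 examples.
```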
22:11
one of these positions in the tensor y
22:11
one of these positions in the tensor y okay so now that we have our batch of
22:13
okay so now that we have our batch of
22:13
okay so now that we have our batch of input that we'd like to feed into a
22:15
input that we'd like to feed into a
22:15
input that we'd like to feed into a Transformer let's start basically
22:16
Transformer let's start basically
22:16
Transformer let's start basically feeding this into neural networks now
22:19
feeding this into neural networks now
22:19
feeding this into neural networks now we're going to start off with the
22:20
we're going to start off with the
22:20
we're going to start off with the simplest possible neural network which
22:22
simplest possible neural network which
22:22
simplest possible neural network which in the case of language modeling in my
22:23
in the case of language modeling in my
22:23
in the case of language modeling in my opinion is the Byram language model and
22:25
opinion is the Byram language model and
22:25
opinion is the Byram language model and we've covered the Byram language model
22:26
we've covered the Byram language model
22:26
we've covered the Byram language model in my make more series in a lot of depth
22:29
in my make more series in a lot of depth
22:29
in my make more series in a lot of depth and so here I'm going to sort of go
22:31
and so here I'm going to sort of go
22:31
and so here I'm going to sort of go faster and let's just Implement pytorch
22:33
faster and let's just Implement pytorch
22:33
faster and let's just Implement pytorch module directly that implements the byr
22:36
module directly that implements the byr
22:36
module directly that implements the byr language
22:36
language
22:36
language model so I'm importing the pytorch um NN
22:41
model so I'm importing the pytorch um NN
22:41
model so I'm importing the pytorch um NN module uh for
22:43
module uh for
22:43
module uh for reproducibility and then here I'm
22:44
reproducibility and then here I'm
22:44
reproducibility and then here I'm constructing a Byram language model
22:46
constructing a Byram language model
22:46
constructing a Byram language model which is a subass of NN
22:48
which is a subass of NN
22:48
which is a subass of NN module and then I'm calling it and I'm
22:51
module and then I'm calling it and I'm
22:51
module and then I'm calling it and I'm passing it the inputs and the targets
22:53
passing it the inputs and the targets
22:53
passing it the inputs and the targets and I'm just printing now when the
22:55
and I'm just printing now when the
22:55
and I'm just printing now when the inputs on targets come here you see that
22:57
inputs on targets come here you see that
22:57
inputs on targets come here you see that I'm just taking the index uh the inputs
23:00
I'm just taking the index uh the inputs
23:00
I'm just taking the index uh the inputs X here which I rename to idx and I'm
23:03
X here which I rename to idx and I'm
23:03
X here which I rename to idx and I'm just passing them into this token
23:04
just passing them into this token
23:04
just passing them into this token embedding table so it's going on here is
23:07
So what's going on here is that here in the constructor we are creating a token embedding table, and it is of size vocab size by vocab size. And we're using nn.Embedding, which is a very thin wrapper around basically a tensor of shape vocab size by vocab size. And what's happening here is that when we pass idx here, every single integer in our input is going to refer to this embedding table and is going to pluck out a row of that embedding table corresponding to its index. So 24 here will go into the embedding table and will pluck out the 24th row, and then 43 will go here and pluck out the 43rd row, etc. And then PyTorch is going to arrange all of this into a batch by time by channel tensor. In this case batch is four, time is eight, and C, which is the channels, is vocab size, or 65. And so we're just going to pluck out all those rows, arrange them in a B by T by C, and now we're going to interpret this as the logits, which are basically the scores for the next character in the sequence.
24:14
And so what's happening here is we are predicting what comes next based on just the individual identity of a single token. And you can do that because, I mean, currently the tokens are not talking to each other and they're not seeing any context, except for they're just seeing themselves. So I'm token number five, and then I can actually make pretty decent predictions about what comes next just by knowing that I'm token five, because some characters, you know, follow other characters in typical scenarios. So we saw a lot of this in a lot more depth in the makemore series. And here if I just run this, then we currently get the predictions, the scores, the logits, for every one of the 4x8 positions.
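A minimal sketch of the bigram model at this stage, where the forward pass only looks up logits. The random `xb` batch is an assumption standing in for real inputs, and this version takes only `idx` to keep the lookup in focus.

```python
import torch
import torch.nn as nn

torch.manual_seed(1337)
vocab_size = 65  # number of distinct characters, as stated in the lecture

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token reads the scores for the next token from its own row
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx):
        # idx is a (B, T) tensor of integer token ids
        logits = self.token_embedding_table(idx)  # (B, T, C)
        return logits

m = BigramLanguageModel(vocab_size)
xb = torch.randint(0, vocab_size, (4, 8))  # assumed stand-in batch of inputs
logits = m(xb)
print(logits.shape)  # (4, 8, 65): batch by time by channel
```

The logits at position `(b, t)` are exactly row `xb[b, t]` of the embedding weight, so the prediction depends only on the identity of that single token.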
24:55
Now that we've made predictions about what comes next, we'd like to evaluate the loss function. And so in the makemore series we saw that a good way to measure the loss, or like the quality of the predictions, is to use the negative log likelihood loss, which is also implemented in PyTorch under the name cross entropy. So what we'd like to do here is: loss is the cross entropy on the predictions and the targets. And so this measures the quality of the logits with respect to the targets. In other words, we have the identity of the next character, so how well are we predicting the next character based on the logits? And intuitively, the correct dimension of the logits, depending on whatever the target is, should have a very high number, and all the other dimensions should have a very low number. Now, this is what we want: we want to basically output the logits and the loss. But unfortunately this won't actually run; we get an error message. But intuitively, we want to measure this.
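The intuition that the target's logit should be high and the others low can be checked with a hand-rolled cross entropy in plain Python. This is a sketch for illustration only; the four-class logits and the helper name `cross_entropy` are assumptions, not the PyTorch function.

```python
import math

def cross_entropy(logits, target):
    # negative log-probability of the target under softmax(logits)
    top = max(logits)  # subtract the max for numerical stability
    log_sum = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_sum - logits[target]

uniform = [0.0, 0.0, 0.0, 0.0]     # no preference among 4 classes
confident = [0.0, 0.0, 10.0, 0.0]  # high score on class 2

print(cross_entropy(uniform, 2))    # ln(4), about 1.386
print(cross_entropy(confident, 2))  # close to 0: confidently right
print(cross_entropy(confident, 0))  # about 10: confidently wrong
```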
26:01
Now when we go to the PyTorch cross entropy documentation here, we're trying to call the cross entropy in its functional form, so that means we don't have to create a module for it. But here when we go to the documentation, you have to look into the details of how PyTorch expects these inputs. And basically the issue here is that PyTorch expects, if you have multi-dimensional input, which we do because we have a B by T by C tensor, then it actually really wants the channels to be the second dimension here. So basically it wants a B by C by T instead of a B by T by C. And so it's just the details of how PyTorch treats these kinds of inputs, and we don't actually want to deal with that. So what we're going to do instead is we need to basically reshape our logits.
26:54
So here's what I like to do: I like to basically give names to the dimensions. So logits.shape is B by T by C, and unpack those numbers. And then let's say that logits equals logits.view, and we want it to be B * T by C, so just a two-dimensional array. Right, so we're going to take all of these positions here and we're going to stretch them out in a one-dimensional sequence and preserve the channel dimension as the second dimension. So we're just kind of like stretching out the array so it's two-dimensional, and in that case it's going to better conform to what PyTorch sort of expects in its dimensions. Now we have to do the same to targets, because currently targets are of shape B by T, and we want it to be just B * T, so one-dimensional. Now alternatively, you could always still just do minus one, because PyTorch will guess what this should be if you want to lay it out, but let me just be explicit and say B * T. Once we've reshaped this, it will match the cross entropy case, and then we should be able to evaluate our loss.
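A short sketch of the reshape, assuming random stand-in tensors for the logits and targets. The transpose variant at the end is an added comparison, not something the lecture does; it shows that the channels-second layout gives the same loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 65
logits = torch.randn(B, T, C)          # assumed stand-in for the model output
targets = torch.randint(0, C, (B, T))  # assumed stand-in for the targets

# F.cross_entropy accepts (N, C) logits with (N,) targets, or
# (B, C, T) logits with (B, T) targets, so (B, T, C) does not fit as is.
logits_2d = logits.view(B * T, C)  # stretch all positions into one axis
targets_1d = targets.view(B * T)   # targets.view(-1) is equivalent
loss = F.cross_entropy(logits_2d, targets_1d)

# Alternative that keeps three dimensions: move channels to dimension 1.
loss_alt = F.cross_entropy(logits.transpose(1, 2), targets)
print(loss.item(), loss_alt.item())
```

Flattening tends to be the simpler habit because the same two-dimensional shape works regardless of how many leading dimensions the model produces.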
28:10
loss okay so that R now and we can do
28:10
loss okay so that R now and we can do loss and So currently we see that the
28:11
loss and So currently we see that the
28:12
loss and So currently we see that the loss is
28:13
loss is
28:13
loss is 4.87 now because our uh we have 65
28:17
4.87 now because our uh we have 65
28:17
4.87 now because our uh we have 65 possible vocabulary elements we can
28:19
possible vocabulary elements we can
28:19
possible vocabulary elements we can actually guess at what the loss should
28:20
actually guess at what the loss should
28:20
actually guess at what the loss should be and in
28:22
be and in
28:22
be and in particular we covered negative log
28:24
particular we covered negative log
28:24
particular we covered negative log likelihood in a lot of detail we are
28:26
likelihood in a lot of detail we are
28:26
likelihood in a lot of detail we are expecting log or lawn of um 1 over 65
28:32
expecting log or lawn of um 1 over 65
28:32
expecting log or lawn of um 1 over 65 and negative of that so we're expecting
28:34
and negative of that so we're expecting
28:34
and negative of that so we're expecting the loss to be about 4.1 17 but we're
28:37
the loss to be about 4.1 17 but we're
28:37
the loss to be about 4.1 17 but we're getting 4.87 and so that's telling us
28:40
getting 4.87 and so that's telling us
28:40
getting 4.87 and so that's telling us that the initial predictions are not uh
28:41
that the initial predictions are not uh
28:42
that the initial predictions are not uh super diffuse they've got a little bit
28:43
super diffuse they've got a little bit
28:43
super diffuse they've got a little bit of entropy and so we're guessing wrong
28:47
of entropy and so we're guessing wrong
28:47
of entropy and so we're guessing wrong uh so uh yes but actually we're I a we
28:50
uh so uh yes but actually we're I a we
28:50
uh so uh yes but actually we're I a we are able to evaluate the loss okay so
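The expected starting loss quoted here can be reproduced in a couple of lines. This assumes a perfectly uniform prediction over the 65 characters, which is the baseline being compared against.

```python
import math

vocab_size = 65
# A model that spreads probability evenly gives each character p = 1/65,
# so the negative log likelihood of any target is -ln(1/65) = ln(65).
expected_loss = -math.log(1 / vocab_size)
print(round(expected_loss, 4))  # 4.1744
```

A measured loss above this value, such as the 4.87 mentioned, means the untrained model is putting less probability on the true next characters than a uniform guess would.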
28:53
Okay, so now that we can evaluate the quality of the model on some data, we'd like to also be able to generate from the model, so let's do the generation. Now I'm going to go again a little bit faster here because I covered all this already in previous videos. So here's a generate function for the model. So we take the same kind of input idx here, and basically this is the current context of some characters in some batch, so it's also B by T. And the job of generate is to basically take this B by T and extend it to be B by T+1, plus 2, plus 3, and so it just basically continues the generation in all the batch dimensions, in the time dimension. So that's its job, and it will do that for max new tokens. So you can see here on the bottom, there's going to be some stuff here, but on the bottom whatever is predicted is concatenated on top of the previous idx along the first dimension, which is the time dimension, to create a B by T+1. So that becomes a new idx. So the job of generate is to take a B by T and make it a B by T plus 1, plus 2, plus 3, as many as we want, max new tokens. So this is the generation from the model.
30:08
so this is the generation from the model
30:08
so this is the generation from the model now inside the generation what what are
30:10
now inside the generation what what are
30:10
now inside the generation what what are we doing we're taking the current
30:11
we doing we're taking the current
30:11
we doing we're taking the current indices we're getting the predictions so
30:15
indices we're getting the predictions so
30:15
indices we're getting the predictions so we get uh those are in the low jits and
30:18
we get uh those are in the low jits and
30:18
we get uh those are in the low jits and then the loss here is going to be
30:19
then the loss here is going to be
30:19
then the loss here is going to be ignored because um we're not we're not
30:21
ignored because um we're not we're not
30:21
ignored because um we're not we're not using that and we have no targets that
30:23
using that and we have no targets that
30:23
using that and we have no targets that are sort of ground truth targets that
30:25
are sort of ground truth targets that
30:25
are sort of ground truth targets that we're going to be comparing with
30:28
we're going to be comparing with
30:28
we're going to be comparing with then once we get the logits we are only
30:30
then once we get the logits we are only
30:30
then once we get the logits we are only focusing on the last step so instead of
30:33
focusing on the last step so instead of
30:33
focusing on the last step so instead of a b by T by C we're going to pluck out
30:36
a b by T by C we're going to pluck out
30:36
a b by T by C we're going to pluck out the negative-1 the last element in the
30:38
the negative-1 the last element in the
30:38
the negative-1 the last element in the time Dimension because those are the
30:40
time Dimension because those are the
30:40
time Dimension because those are the predictions for what comes next so that
30:42
That gives us the logits, now B by C, which we convert to probabilities via softmax. Then we use torch.multinomial to sample from those probabilities, and we ask PyTorch to give us one sample. So idx_next becomes B by 1, because for each element of the batch we have a single prediction for what comes next; num_samples=1 is what makes that last dimension a one.
31:08
Then we take the integers that come out of the sampling process, drawn according to the probability distribution given here, and concatenate them onto the current running stream of integers. This gives us B by T+1, and then we can return that.
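The sampling loop described above can be sketched roughly as follows. This is a minimal sketch under assumptions, not the lecture's exact code: it is written as a free function rather than a method on the model, and `toy_model` is a hypothetical stand-in that returns uniform logits so the snippet runs on its own.

```python
import torch
import torch.nn.functional as F

def generate(model, idx, max_new_tokens):
    # idx: (B, T) tensor of token indices in the current context
    for _ in range(max_new_tokens):
        logits, _ = model(idx)                 # (B, T, C); the loss is ignored
        logits = logits[:, -1, :]              # last time step only -> (B, C)
        probs = F.softmax(logits, dim=-1)      # (B, C) probabilities
        idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
        idx = torch.cat((idx, idx_next), dim=1)             # (B, T+1)
    return idx

def toy_model(idx, vocab_size=5):
    # Hypothetical stand-in for the real model: uniform logits, no loss.
    B, T = idx.shape
    return torch.zeros(B, T, vocab_size), None

out = generate(toy_model, torch.zeros((2, 3), dtype=torch.long), max_new_tokens=4)
print(out.shape)  # torch.Size([2, 7])
```

Each pass through the loop grows the time dimension by one, so a (2, 3) context with 4 new tokens comes back as (2, 7).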
31:24
Now, one thing here: you see how I'm calling self(idx), which ends up going to the forward function. I'm not providing any targets, so currently this would give an error, because targets is not given. So targets has to be optional: targets is None by default, and if targets is None there's no loss to create, so loss is just None. Otherwise all of this happens and we can create a loss. So if we have the targets, we provide them and get a loss; if we have no targets, we just get the logits.
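A forward pass with optional targets might look like the sketch below. The class and attribute names, and the vocabulary size of 65, are assumptions for illustration; unlike some versions of this code, the sketch returns the logits in their original (B, T, C) shape in both branches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, C)
        if targets is None:
            loss = None  # nothing to compare against, e.g. during generation
        else:
            B, T, C = logits.shape
            # cross_entropy expects (N, C) logits and (N,) targets
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

model = BigramLanguageModel(vocab_size=65)  # 65 is an assumed vocabulary size
xb = torch.randint(65, (4, 8))
yb = torch.randint(65, (4, 8))
logits, loss = model(xb)          # no targets: loss is None
logits2, loss2 = model(xb, yb)    # with targets: loss is a scalar tensor
```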
32:00
So this here will generate from the model. Let's take that for a ride.
32:11
I have another code chunk here that generates from the model, and okay, this is kind of crazy, so let me break it down. This is the idx: I'm creating a batch of just one and a time of just one, so a little 1 by 1 tensor, and it's holding a zero, and the dtype, the data type, is integer. Zero is how we kick off the generation. Remember that zero is the element standing for the newline character, so it's a reasonable thing to feed in as the very first character in a sequence.
32:51
So that's the idx we feed in here. Then we ask for 100 tokens, and .generate will continue that.
33:04
Now, because generate works on the level of batches, we then have to index into the zeroth row to pluck out the single batch dimension that exists. That gives us the time steps, just a one-dimensional array of all the indices, which we convert from a PyTorch tensor to a simple Python list so that it can feed into our decode function and convert those integers into text.
33:34
So let me bring this back. We're generating 100 tokens; let's run. Here's the generation that we achieved. Obviously it's garbage, and the reason it's garbage is that this is a totally random model. So next up, we're going to want to train this model.
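To make the kick-off and decoding steps concrete, here is a small self-contained sketch. The toy vocabulary, the bare `nn.Embedding` standing in for the untrained model, and the choice of 20 new tokens are all assumptions made so the snippet runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
chars = sorted(set("\nhello world"))          # toy vocabulary; index 0 is "\n"
itos = {i: ch for i, ch in enumerate(chars)}
decode = lambda ids: "".join(itos[i] for i in ids)

table = nn.Embedding(len(chars), len(chars))  # untrained, so output is random

idx = torch.zeros((1, 1), dtype=torch.long)   # batch of 1, time of 1, holding 0
for _ in range(20):
    logits = table(idx)[:, -1, :]             # logits for the last position
    probs = F.softmax(logits, dim=-1)
    idx = torch.cat((idx, torch.multinomial(probs, num_samples=1)), dim=1)

tokens = idx[0].tolist()   # drop the batch dimension, then tensor -> Python list
print(repr(decode(tokens)))
```

Because the lookup table is randomly initialized, the decoded string is noise, which matches what an untrained model should produce.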
33:50
Now, one more thing I wanted to point out here: this function is written to be general, but it's kind of ridiculous right now. We're building out this context, concatenating it all, and always feeding it all into the model. That's ridiculous because this is just a simple bigram model. To make, for example, this prediction about K, we only needed this W. But what we actually fed into the model was the entire sequence, and then we only looked at the very last piece and predicted K.
34:20
The only reason I'm writing it this way is that right now this is a bigram model, but I'd like to keep this function fixed and have it work later, when our characters actually look further into the history. Right now the history is not used, so this looks silly, but eventually the history will be used, and that's why we want to do it this way. So just a quick comment on that.
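The point that a bigram model ignores its history can be checked directly. In this sketch (the vocabulary size of 65 and the variable names are assumptions), the logits for the last position are identical whether we feed the whole context or only the final token.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 65                                  # assumed vocabulary size
table = nn.Embedding(vocab_size, vocab_size)     # the whole bigram model

idx = torch.randint(vocab_size, (1, 8))          # a context of 8 tokens
from_full_context = table(idx)[:, -1, :]         # feed everything, keep the last step
from_last_token = table(idx[:, -1:])[:, -1, :]   # feed only the last token

same = torch.equal(from_full_context, from_last_token)
print(same)  # True
```

Once the model mixes information across positions, this equality will no longer hold, which is why the general form of generate is worth keeping.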
34:49
So now we see that this is random. Let's train the model so it becomes a bit less random.
34:53
Okay, let's now train the model. First, I'm going to create a PyTorch optimizer object. Here we are using the optimizer AdamW. In the makemore series we've only ever used stochastic gradient descent, the simplest possible optimizer, which you can get by using SGD instead. But I want to use Adam, which is a much more advanced and popular optimizer, and it works extremely well. A typical good setting for the learning rate is roughly 3e-4, but for very, very small networks, as is the case here, you can get away with much higher learning rates, 1e-3 or even higher probably. So let me create the optimizer object, which will take the gradients and update the parameters using the gradients.
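Creating the optimizer is a one-liner; the sketch below also takes a single step to show the parameters actually moving. The `nn.Embedding` stand-in model, the vocabulary size of 65, and the toy loss are assumptions for illustration only.

```python
import torch
import torch.nn as nn

model = nn.Embedding(65, 65)   # stand-in for the bigram model; 65 is assumed

# AdamW with a fairly high learning rate, which a tiny network tolerates
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# The plain SGD alternative mentioned above would look like this instead:
sgd = torch.optim.SGD(model.parameters(), lr=1e-3)

before = model.weight.detach().clone()
loss = model(torch.tensor([0, 1, 2])).pow(2).mean()  # arbitrary toy loss
loss.backward()                # compute gradients
optimizer.step()               # parameters move using the gradients
changed = not torch.equal(before, model.weight.detach())
```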
35:36
Our batch size up above was only 4, so let me use something bigger, let's say 32. Then, for some number of steps, we sample a new batch of data, evaluate the loss, zero out all the gradients from the previous step, get the gradients for all the parameters, and then use those gradients to update our parameters. So, a typical training loop, as we saw in the makemore series.
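The loop just described might be sketched like this. The random `data` tensor is a placeholder for the encoded text, `get_batch` is a simplified stand-in for the batching helper, and the sizes are assumed values; only the order of the steps is the point.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
vocab_size, block_size, batch_size = 65, 8, 32    # assumed values
data = torch.randint(vocab_size, (1000,))         # placeholder for the encoded text

def get_batch():
    # sample random offsets, then stack inputs x and next-token targets y
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

model = nn.Embedding(vocab_size, vocab_size)      # bigram logits lookup
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

losses = []
for step in range(100):
    xb, yb = get_batch()                          # sample a new batch
    logits = model(xb)                            # (B, T, C)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)         # clear old gradients
    loss.backward()                               # gradients for all parameters
    optimizer.step()                              # update the parameters
    losses.append(loss.item())
```

Printing `losses[0]` and `losses[-1]` after the loop gives a rough sense of whether the optimization is making progress, though single-batch losses are noisy.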
36:04
So let me now run this for, say, 100 iterations, and let's see what kind of losses we get. We started around 4.7, and now we're getting down to 4.6, 4.5, etc. So the optimization is definitely happening, but let's increase the number of iterations and only print at the end, because we probably want to train for longer.
36:29
Okay, so we're down to 3.6, roughly. Roughly down to 3. This is the most janky optimization. Okay, it's working; let's just do 10,000.
36:53
Then from here we want to copy this, and hopefully we're going to get something reasonable. Of course it's not going to be Shakespeare from a bigram model, but at least we see that the loss is improving, and hopefully we can expect something a bit more reasonable.
37:08
Okay, so we're down at about 2.5. Let's see what we get. Okay, dramatic improvement, certainly, on what we had here. Let me just increase the number of tokens. So we see that we're starting to get something at least reasonable-ish. Certainly not Shakespeare, but the model is making progress. So that is the simplest possible model.
37:36
So now, what I'd like to do: obviously this is a very simple model, because the tokens are not talking to each other. Given the previous context of whatever was generated, we're only looking at the very last character to make the prediction about what comes next. So now these tokens have to start talking to each other and figuring out what is in the context, so that they can make better predictions for what comes next. And this is how we're going to kick off the Transformer.
38:02
Okay, so next I took the code that we developed in this Jupyter notebook and converted it to a script. I'm doing this because I just want to simplify our intermediate work into just the final product that we have at this point.
38:15
At the top here I put all the hyperparameters that we've defined. I introduced a few, and I'm going to speak to that in a little bit. Otherwise a lot of this should be recognizable: reproducibility, read the data, get the encoder and the decoder, create the train and test splits, and use the kind of data loader that gets a batch of the inputs and targets. This is new, and I'll talk about it in a second.
38:39
Now, this is the bigram language model that we developed; it can forward and give us the logits and loss, and it can generate. Then here we are creating the optimizer, and this is the training loop. So everything here should look pretty familiar.
38:55
Now, some of the small things that I added. Number one, I added the ability to run on a GPU if you have one. If you have a GPU, this will use CUDA instead of just the CPU, and everything will be a lot faster. When device becomes cuda, we need to make sure that when we load the data, we move it to the device. When we create the model, we want to move the model parameters to the device. As an example, here we have the nn.Embedding table, and it's got a .weight inside it, which stores the lookup table. That would be moved to the GPU, so that all the calculations here happen on the GPU and can be a lot faster. And finally, when I'm creating the context that feeds into generate, I have to make sure that I create it on the device.
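The three places where the device matters can be sketched as below. The `nn.Embedding` stand-in model and the vocabulary size of 65 are assumptions; the snippet falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Embedding(65, 65)        # stand-in model; 65 is an assumed vocab size
model = model.to(device)            # moves .weight (the lookup table) to the device

x = torch.randint(65, (4, 8))
x = x.to(device)                    # data must live on the same device as the model

# the context that kicks off generation is created directly on the device
context = torch.zeros((1, 1), dtype=torch.long, device=device)

logits = model(x)                   # computed on the chosen device
```

If the data or the context were left on the CPU while the model sat on the GPU, the forward call would raise a device-mismatch error.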
39:43
Number two, what I introduced is this: in the training loop, I was just printing loss.item(), but that is a very noisy measurement of the current loss, because every batch will be more or less lucky. So what I usually do is have an estimate_loss function, which goes up here, and it averages the loss over multiple batches. In particular, we iterate eval_iters times, get our loss each time, and then take the average loss for both splits. This will be a lot less noisy. So here, when we call estimate_loss, we report a pretty accurate train and validation loss.
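A sketch of the averaging idea follows. It shows only the averaging over `eval_iters` batches per split; the random placeholder data, the split sizes, the simplified `get_batch`, and the bare `nn.Embedding` model are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, block_size, batch_size, eval_iters = 65, 8, 32, 200   # assumed values
data = torch.randint(vocab_size, (2000,))          # placeholder for the encoded text
splits = {"train": data[:1800], "val": data[1800:]}

def get_batch(split):
    d = splits[split]
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
    return x, y

model = nn.Embedding(vocab_size, vocab_size)

def estimate_loss():
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            logits = model(xb)
            loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
            losses[k] = loss.item()
        out[split] = losses.mean().item()          # average over many batches
    return out

est = estimate_loss()
```

Averaging over 200 batches shrinks the batch-to-batch noise considerably compared with reading a single `loss.item()`.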
40:33
Now, when we come back up, you'll notice a few things. Here I'm setting the model to evaluation phase, and down here I'm resetting it back to training phase. Right now, for our model as is, this doesn't actually do anything, because the only thing inside this model is this nn.Embedding, and this network would behave the same in both evaluation mode and training mode. We have no dropout layers, no batch norm layers, etc. But it is good practice to think through what mode your neural network is in, because some layers have different behavior at inference time and at training time.
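As an illustration of why the mode matters, the sketch below contrasts a dropout layer, which behaves differently in the two modes, with an embedding, which does not. The layer sizes and inputs are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(1000)

layer.train()                 # training mode: random units are zeroed, rest rescaled
y_train = layer(x)

layer.eval()                  # evaluation mode: dropout is a no-op
y_eval = layer(x)

emb = nn.Embedding(10, 4)     # an embedding behaves the same in both modes
idx = torch.tensor([1, 2, 3])
emb.train()
a = emb(idx)
emb.eval()
b = emb(idx)
```

Here `y_eval` equals the input exactly, `y_train` contains zeros, and `a` and `b` are identical.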
41:11
at inference time or training time and
41:11
at inference time or training time and there's also this context manager torch
41:12
there's also this context manager torch
41:12
there's also this context manager torch up nograd and this is just telling
41:14
up nograd and this is just telling
41:14
up nograd and this is just telling pytorch that everything that happens
41:16
pytorch that everything that happens
41:16
pytorch that everything that happens inside this function we will not call do
41:18
inside this function we will not call do
41:18
inside this function we will not call do backward on and so pytorch can be a lot
41:21
backward on and so pytorch can be a lot
41:21
backward on and so pytorch can be a lot more efficient with its memory use
41:23
more efficient with its memory use
41:23
more efficient with its memory use because it doesn't have to store all the
41:25
because it doesn't have to store all the
41:25
because it doesn't have to store all the intermediate variables uh because we're
41:26
intermediate variables uh because we're
41:27
intermediate variables uh because we're never going to call backward and so it
41:29
never going to call backward and so it
41:29
never going to call backward and so it can it can be a lot more memory
41:30
can it can be a lot more memory
41:30
can it can be a lot more memory efficient in that way so also a good
41:32
efficient in that way so also a good
41:32
efficient in that way so also a good practice to tpy torch when we don't
41:35
practice to tpy torch when we don't
41:35
practice to tpy torch when we don't intend to do back
41:36
intend to do back
41:36
intend to do back propagation so right now this script is
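As a minimal sketch of the two practices just described, switching modes and disabling gradient tracking could look roughly like this. The stand-in model, the random batches, the vocabulary size and the name estimate_loss_sketch are placeholders of mine, not the script's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 65  # assumed vocabulary size, for illustration only
model = nn.Embedding(vocab_size, vocab_size)  # stand-in for the bigram model

@torch.no_grad()  # nothing inside is tracked for a later .backward() call
def estimate_loss_sketch(model, n_batches=10):
    model.eval()  # evaluation mode (matters once dropout or batch norm exist)
    losses = torch.zeros(n_batches)
    for k in range(n_batches):
        idx = torch.randint(vocab_size, (4, 8))      # random inputs (placeholder data)
        targets = torch.randint(vocab_size, (4, 8))  # random targets (placeholder data)
        logits = model(idx)                          # shape (4, 8, vocab_size)
        loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
        losses[k] = loss.item()
    model.train()  # switch back to training mode before returning
    return losses.mean()

avg = estimate_loss_sketch(model)
print(avg.requires_grad)  # False: no graph was built
print(model.training)     # True: the model is back in training mode
```

With only an embedding table inside, eval and train behave identically here; the calls are there so the habit is already in place when mode-dependent layers get added.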
41:39
So right now this script is about 120 lines of code, and that's kind of our starter code. I'm calling it bigram.py and I'm going to release it later. Running this script gives us output in the terminal, and it looks something like this: as I ran this code, it was giving me the train loss and val loss, and we see that we converge to somewhere around 2.5 with the bigram model. And then here's the sample that we produced at the end. So we have everything packaged up in the script and we're in a good position now to iterate on this.
42:13
Okay, so we are almost ready to start writing our very first self-attention block for processing these tokens. Before we actually get there, I want to get you used to a mathematical trick that is used in the self-attention inside a Transformer and is really at the heart of an efficient implementation of self-attention. So I want to work with this toy example to just get you used to this operation, and then it's going to make it much more clear once we actually get to it in the script again.
42:47
So let's create a B by T by C, where B, T and C are just 4, 8 and 2 in the toy example. These are basically channels, and we have batches, and we have the time component, and we have information at each point in the sequence, so C.
43:03
Now what we would like to do is this: we have up to eight tokens here in a batch, and these eight tokens are currently not talking to each other, and we would like them to talk to each other. We'd like to couple them, and in particular we want to couple them in a very specific way. The token at the fifth location, for example, should not communicate with tokens in the sixth, seventh and eighth locations, because those are future tokens in the sequence. The token at the fifth location should only talk to the ones in the fourth, third, second and first. So information only flows from previous context to the current time step, and we cannot get any information from the future, because we are about to try to predict the future.
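For reference, the toy tensor described above could be set up along these lines. The seed value and the use of torch.randn are assumptions on my part; any B by T by C tensor would serve.

```python
import torch

torch.manual_seed(1337)   # fixed seed, assumed here only for reproducibility
B, T, C = 4, 8, 2         # batch, time, channels
x = torch.randn(B, T, C)  # random values standing in for per-token information
print(x.shape)            # torch.Size([4, 8, 2])
```

Each of the 4 batch elements is an independent sequence of 8 tokens, and each token carries 2 channels of information.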
43:49
So what is the easiest way for tokens to communicate? The easiest way, I would say, is this: if I'm the fifth token and I'd like to communicate with my past, the simplest way we can do that is to just do an average of all the preceding elements. So for example, if I'm the fifth token, I would like to take the channels that make up the information at my step, but then also the channels from the fourth step, third step, second step and the first step. I'd like to average those up, and that would become sort of like a feature vector that summarizes me in the context of my history.
44:26
Now of course, just doing a sum or an average is an extremely weak form of interaction. This communication is extremely lossy; we've lost a ton of information about the spatial arrangements of all those tokens. But that's okay for now, we'll see how we can bring that information back later. For now, what we would like to do is, for every single batch element independently, for every t-th token in that sequence, calculate the average of all the vectors in all the previous tokens and also at this token. So let's write that out. I have a small snippet here, and instead of just fumbling around, let me just copy paste it and talk to it.
45:07
So in other words, we're going to create xbow, where "bow" is short for bag of words, because bag of words is kind of a term that people use when you are just averaging up things. So this is just a bag of words: basically there's a word stored at every one of these eight locations, and we're doing a bag of words, we're just averaging.
45:28
In the beginning we're going to say that it's just initialized at zero, and then I'm doing a for loop here, so we're not being efficient yet; that's coming. For now we're just iterating over all the batch dimensions independently, iterating over time, and then the previous tokens are at this batch dimension and then everything up to and including the t-th token. When we slice out x in this way, xprev becomes of shape (how many t elements there were in the past, C), so all the two-dimensional information from these little tokens. That's the previous chunk of tokens from my current sequence. Then I'm just doing the average, or the mean, over the zeroth dimension, so I'm averaging out the time here, and I'm going to get a little one-dimensional vector of size C, which I'm going to store in xbow.
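The loop just described can be sketched as follows. The variable names xbow and xprev follow the spoken description; the seed and the random input are assumptions for illustration.

```python
import torch

torch.manual_seed(1337)  # assumed seed, for reproducibility only
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Goal: xbow[b, t] = mean of x[b, i] over all i <= t
xbow = torch.zeros((B, T, C))
for b in range(B):              # every batch element independently
    for t in range(T):          # every position in the sequence
        xprev = x[b, :t + 1]    # (t + 1, C): tokens up to and including t
        xbow[b, t] = torch.mean(xprev, 0)  # average over time, giving (C,)

print(x[0])
print(xbow[0])
```

The first row of xbow[0] equals the first row of x[0], since it averages a single token, and the last row is the mean over all eight tokens.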
46:27
of words so I can run this and and uh
46:27
of words so I can run this and and uh this is not going to be very informative
46:30
this is not going to be very informative
46:30
this is not going to be very informative because let's see so this is X of Zer so
46:32
because let's see so this is X of Zer so
46:32
because let's see so this is X of Zer so this is the zeroth batch element and
46:35
this is the zeroth batch element and
46:35
this is the zeroth batch element and then expo at zero now you see how the at
46:40
then expo at zero now you see how the at
46:40
then expo at zero now you see how the at the first location here you see that the
46:42
the first location here you see that the
46:42
the first location here you see that the two are equal and that's because it's
46:44
two are equal and that's because it's
46:45
two are equal and that's because it's we're just doing an average of this one
46:46
we're just doing an average of this one
46:46
we're just doing an average of this one token but here this one is now an
46:49
token but here this one is now an
46:49
token but here this one is now an average of these two and now this one is
46:53
average of these two and now this one is
46:53
average of these two and now this one is an average of these
46:54
an average of these
46:54
an average of these three and so on
46:57
three and so on
46:57
three and so on so uh and this last one is the average
47:01
so uh and this last one is the average
47:01
so uh and this last one is the average of all of these elements so vertical
47:03
of all of these elements so vertical
47:03
of all of these elements so vertical average just averaging up all the tokens
47:05
average just averaging up all the tokens
47:05
average just averaging up all the tokens now gives this outcome
47:07
now gives this outcome
47:07
now gives this outcome here so this is all well and good uh but
47:10
here so this is all well and good uh but
47:10
here so this is all well and good uh but this is very inefficient now the trick
47:12
this is very inefficient now the trick
47:12
this is very inefficient now the trick is that we can be very very efficient
47:14
is that we can be very very efficient
47:14
is that we can be very very efficient about doing this using matrix
47:16
about doing this using matrix
47:16
about doing this using matrix multiplication so that's the
47:18
multiplication so that's the
47:18
multiplication so that's the mathematical trick and let me show you
47:19
mathematical trick and let me show you
47:19
mathematical trick and let me show you what I mean let's work with the toy
47:21
what I mean let's work with the toy
47:21
what I mean let's work with the toy example here let me run it and I'll
47:24
example here let me run it and I'll
47:24
example here let me run it and I'll explain I have a simple Matrix here that
47:27
explain I have a simple Matrix here that
47:27
explain I have a simple Matrix here that is a 3X3 of all ones a matrix B of just
47:31
is a 3X3 of all ones a matrix B of just
47:31
is a 3X3 of all ones a matrix B of just random numbers and it's a 3x2 and a
47:33
random numbers and it's a 3x2 and a
47:33
random numbers and it's a 3x2 and a matrix C which will be 3x3 multip 3x2
47:36
matrix C which will be 3x3 multip 3x2
47:36
matrix C which will be 3x3 multip 3x2 which will give out a 3x2 so here we're
47:39
which will give out a 3x2 so here we're
47:39
which will give out a 3x2 so here we're just using um matrix multiplication so a
47:43
just using um matrix multiplication so a
47:43
just using um matrix multiplication so a multiply B gives us
47:46
multiply B gives us
47:46
multiply B gives us C okay so how are these numbers in C um
47:51
C okay so how are these numbers in C um
47:51
C okay so how are these numbers in C um achieved right so this number in the top
47:54
achieved right so this number in the top
47:54
achieved right so this number in the top left is the first row of a dot product
47:57
left is the first row of a dot product
47:57
left is the first row of a dot product with the First Column of B and since all
48:00
with the First Column of B and since all
48:00
with the First Column of B and since all the the row of a right now is all just
48:02
the the row of a right now is all just
48:02
the the row of a right now is all just ones then the do product here with with
48:05
ones then the do product here with with
48:05
ones then the do product here with with this column of B is just going to do a
48:07
this column of B is just going to do a
48:07
this column of B is just going to do a sum of these of this column so 2 + 6 + 6
48:11
sum of these of this column so 2 + 6 + 6
48:11
sum of these of this column so 2 + 6 + 6 is
48:12
is
48:12
is 14 the element here in the output of C
48:15
14 the element here in the output of C
48:15
14 the element here in the output of C is also the first column here the first
48:17
is also the first column here the first
48:17
is also the first column here the first row of a multiplied now with the second
48:20
row of a multiplied now with the second
48:20
row of a multiplied now with the second column of B so 7 + 4 + 5 is 16 now you
48:25
column of B so 7 + 4 + 5 is 16 now you
48:25
column of B so 7 + 4 + 5 is 16 now you see that there's repeating elements here
48:26
see that there's repeating elements here
48:26
see that there's repeating elements here so this 14 again is because this row is
48:28
so this 14 again is because this row is
48:28
so this 14 again is because this row is again all ones and it's multiplying the
48:30
again all ones and it's multiplying the
48:30
again all ones and it's multiplying the First Column of B so we get 14 and this
48:33
First Column of B so we get 14 and this
48:33
First Column of B so we get 14 and this one is and so on so this last number
48:35
one is and so on so this last number
48:35
one is and so on so this last number here is the last row do product last
48:39
here is the last row do product last
48:39
here is the last row do product last column now the trick here is uh the
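To reproduce the arithmetic above without depending on a random seed, here is a small sketch in which b is hard-coded to the values read out in the walkthrough (in the original run it was filled with random numbers).

```python
import torch

a = torch.ones(3, 3)                # 3x3 of all ones
b = torch.tensor([[2., 7.],
                  [6., 4.],
                  [6., 5.]])        # values taken from the sums read out above
c = a @ b                           # (3, 3) @ (3, 2) gives (3, 2)
print(c)
# tensor([[14., 16.],
#         [14., 16.],
#         [14., 16.]])
```

Every row of a is all ones, so every row of c is the same pair of column sums of b.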
48:42
column now the trick here is uh the
48:42
column now the trick here is uh the following this is just a boring number
48:44
following this is just a boring number
48:44
following this is just a boring number of um it's just a boring array of all
48:48
of um it's just a boring array of all
48:48
of um it's just a boring array of all ones but torch has this function called
48:50
ones but torch has this function called
48:50
ones but torch has this function called Trail which is short for a
48:54
Trail which is short for a
48:54
Trail which is short for a triangular uh something like that and
48:56
triangular uh something like that and
48:56
triangular uh something like that and you can wrap it in torch up once and it
48:58
you can wrap it in torch up once and it
48:58
you can wrap it in torch up once and it will just return the lower triangular
49:00
will just return the lower triangular
49:00
will just return the lower triangular portion of this
49:03
portion of this
49:03
portion of this okay so now it will basically zero out
49:06
okay so now it will basically zero out
49:06
okay so now it will basically zero out uh these guys here so we just get the
49:08
uh these guys here so we just get the
49:08
uh these guys here so we just get the lower triangular part well what happens
49:10
lower triangular part well what happens
49:10
lower triangular part well what happens if we do
49:14
that so now we'll have a like this and B
49:17
that so now we'll have a like this and B
49:17
that so now we'll have a like this and B like this and now what are we getting
49:18
like this and now what are we getting
49:18
like this and now what are we getting here in C well what is this number well
49:22
here in C well what is this number well
49:22
here in C well what is this number well this is the first row times the First
49:24
this is the first row times the First
49:24
this is the first row times the First Column and because this is zeros
49:28
Column and because this is zeros
49:28
Column and because this is zeros uh these elements here are now ignored
49:30
uh these elements here are now ignored
49:30
uh these elements here are now ignored so we just get a two and then this
49:32
so we just get a two and then this
49:32
so we just get a two and then this number here is the first row times the
49:35
number here is the first row times the
49:35
number here is the first row times the second column and because these are
49:37
second column and because these are
49:37
second column and because these are zeros they get ignored and it's just
49:39
zeros they get ignored and it's just
49:39
zeros they get ignored and it's just seven this seven multiplies this one but
49:42
seven this seven multiplies this one but
49:42
seven this seven multiplies this one but look what happened here because this is
49:43
look what happened here because this is
49:43
look what happened here because this is one and then zeros we what ended up
49:46
one and then zeros we what ended up
49:46
one and then zeros we what ended up happening is we're just plucking out the
49:48
happening is we're just plucking out the
49:48
happening is we're just plucking out the row of this row of B and that's what we
49:51
row of this row of B and that's what we
49:51
row of this row of B and that's what we got now here we have one 1 Z so here 110
49:57
got now here we have one 1 Z so here 110
49:57
got now here we have one 1 Z so here 110 do product with these two columns will
49:58
do product with these two columns will
49:59
do product with these two columns will now give us 2 + 6 which is 8 and 7 + 4
50:01
now give us 2 + 6 which is 8 and 7 + 4
50:02
now give us 2 + 6 which is 8 and 7 + 4 which is 11 and because this is 111 we
50:05
which is 11 and because this is 111 we
50:05
which is 11 and because this is 111 we ended up with the addition of all of
50:07
ended up with the addition of all of
50:07
ended up with the addition of all of them and so basically depending on how
50:10
them and so basically depending on how
50:10
them and so basically depending on how many ones and zeros we have here we are
50:12
many ones and zeros we have here we are
50:12
many ones and zeros we have here we are basically doing a sum currently of a
50:16
basically doing a sum currently of a
50:16
basically doing a sum currently of a variable number of these rows and that
50:18
variable number of these rows and that
50:18
variable number of these rows and that gets deposited into
50:20
gets deposited into
50:20
gets deposited into C So currently we're doing sums because
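The lower-triangular version can be checked the same way. As before, b is hard-coded to the values quoted in the walkthrough rather than drawn at random.

```python
import torch

a = torch.tril(torch.ones(3, 3))    # ones on and below the diagonal, zeros above
b = torch.tensor([[2., 7.],
                  [6., 4.],
                  [6., 5.]])
c = a @ b                           # row i of c = sum of rows 0..i of b
print(a)
# tensor([[1., 0., 0.],
#         [1., 1., 0.],
#         [1., 1., 1.]])
print(c)
# tensor([[ 2.,  7.],
#         [ 8., 11.],
#         [14., 16.]])
```

Row i of c is a running sum over the first i + 1 rows of b, which is exactly the "only look at the past" pattern wanted for the tokens.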
50:23
So currently we're doing sums, because these are ones, but we can also do an average, right? You can start to see how we could do an average of the rows of b in an incremental fashion, because we can basically normalize these rows so that they sum to one, and then we're going to get an average. So if we took a and then did a = a / torch.sum(a, 1, keepdim=True), summing a along dimension one and setting keepdim to True so that the broadcasting will work out, then if I rerun this you see that these rows now sum to one. This row is 1, 0, 0, this row is 0.5, 0.5, 0, and here we get 1/3 each.
51:09
And now when we do a multiplied by b, what are we getting? Here we are just getting the first row. Here we are getting the average of the first two rows: 2 and 6 average to 4, and 4 and 7 average to 5.5. And on the bottom here we are getting the average of all three rows, so the column-wise average of all the rows of b is deposited here. So you can see that by manipulating the elements of this multiplying matrix and then multiplying it with any given matrix, we can do these averages in this incremental fashion, and we can manipulate that based on the elements of a. Okay, so that's very convenient.
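A short sketch of the normalization step, again with b fixed to the values quoted in the walkthrough so the numbers can be checked by hand:

```python
import torch

a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)   # divide each row by its sum; rows now sum to 1
b = torch.tensor([[2., 7.],
                  [6., 4.],
                  [6., 5.]])
c = a @ b                               # row i of c = mean of rows 0..i of b
print(a)   # rows: [1, 0, 0], [0.5, 0.5, 0], [1/3, 1/3, 1/3]
print(c)   # rows: [2, 7], [4, 5.5], [14/3, 16/3]
```

keepdim=True leaves the row sums with shape (3, 1), so the division broadcasts across each row instead of across columns.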
51:57
So let's swing back up here and see how we can vectorize this and make it much more efficient using what we've learned. In particular, we are going to produce an array a, but here I'm going to call it wei, short for weights. This is our a, and it says how much of every row we want to average up. It's going to be an average because you can see that these rows sum to one. So this is our a, and our b in this example is, of course, x.
52:29
So what's going to happen here is that we are going to have an xbow2, and this xbow2 is going to be wei multiplying our x. Let's think this through: wei is T by T, and this is matrix multiplying, in PyTorch, a B by T by C. What shape does that give us? PyTorch will come here and see that these shapes are not the same, so it will create a batch dimension here, and this becomes a batched matrix multiply. It will apply this matrix multiplication to all the batch elements in parallel and individually, and for each batch element there will be a T by T multiplying a T by C, exactly as we had below.
53:20
So this will now create a B by T by C, and xbow2 will now become identical to xbow.
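To make the shape reasoning concrete, here is a small sketch, assuming PyTorch and illustrative sizes B=4, T=8, C=2 (the sizes and the per_batch check are assumptions added for illustration). It shows that multiplying a (T, T) matrix with a (B, T, C) tensor gives the same result as applying the (T, T) matrix to each batch element separately.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2          # batch, time, channels (sizes assumed for illustration)
x = torch.randn(B, T, C)

# Averaging weights: lower-triangular, each row sums to 1
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)

# (T, T) @ (B, T, C): wei is broadcast across the batch dimension,
# so this behaves like (B, T, T) @ (B, T, C) -> (B, T, C)
xbow2 = wei @ x
print(wei.shape, x.shape, xbow2.shape)

# The same result, computed one batch element at a time
per_batch = torch.stack([wei @ x[b] for b in range(B)])
print(torch.allclose(xbow2, per_batch, atol=1e-6))
```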
53:28
So we can see that torch.allclose of xbow and xbow2 should be True now. That kind of convinces us that these are in fact the same. If I just print xbow and xbow2, we're not going to be able to just stare it down, so let me try xbow at the zeroth element and xbow2 at the zeroth element, so just the first batch element. We should see that this and that are identical, which they are.
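The comparison can be reproduced with a sketch like the following, assuming PyTorch. The loop-based version is a reconstruction of the explicit averaging described earlier in the video; the seed and the explicit atol are assumptions (a small absolute tolerance avoids spurious mismatches from float32 rounding).

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Version 1: explicit loops, averaging x[b, :t+1] for every position t
xbow = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xprev = x[b, : t + 1]           # (t+1, C): tokens up to and including t
        xbow[b, t] = xprev.mean(dim=0)

# Version 2: one batched matrix multiply with averaging weights
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)
xbow2 = wei @ x

# Compare the two versions, then inspect the first batch element of each
print(torch.allclose(xbow, xbow2, atol=1e-6))
print(xbow[0])
print(xbow2[0])
```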
54:07
Right, so what happened here? The trick is that we were able to use a batched matrix multiply to do this aggregation, and it's a weighted aggregation. The weights are specified in this T by T array, and we're basically doing weighted sums. Because the weights take on this triangular form, a token at the t-th position will only get information from the tokens preceding it (and from itself). So that's exactly what we want.
54:43
And finally, I would like to rewrite it in one more way, and we're going to see why that's useful. So this is the third version, and it's also identical to the first and second, but let me talk through it. It uses softmax.
55:00
tril here is this matrix of lower-triangular ones. wei begins as all zeros, so if I just print wei in the beginning, it's all zero. Then I use masked_fill. What wei.masked_fill is doing is this: wei is all zeros, and I'm saying that for all the elements where tril == 0, make them negative infinity. So all the elements where tril is zero will become negative infinity now, and this is what we get.
55:36
And then the final line here is softmax. If I take a softmax with dim=-1, so along every single row, what is that going to do? Well, softmax is also a normalization operation, and so, spoiler alert, you get the exact same matrix.
56:00
Let me bring back the softmax. Recall that in softmax we're going to exponentiate every single one of these and then divide by the sum. If we exponentiate every element in the first row, we get a one here and basically zeros everywhere else, because the exponential of negative infinity is zero, and when we normalize we just get one. In the second row we get one, one, and then zeros; softmax will again divide, and this will give us 0.5, 0.5, and so on. So this is another way to produce the same mask.
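The softmax version can be sketched as follows, assuming PyTorch. T=4 is chosen here only to keep the printed matrices small, and wei_avg is a hypothetical name used for the comparison against the direct row normalization.

```python
import torch
import torch.nn.functional as F

T = 4  # small T so the matrices are easy to read (assumed size)

tril = torch.tril(torch.ones(T, T))              # lower-triangular ones
wei = torch.zeros(T, T)                          # affinities start at zero
wei = wei.masked_fill(tril == 0, float("-inf"))  # block future positions
print(wei)

wei = F.softmax(wei, dim=-1)                     # exponentiate, then normalize each row
print(wei)

# Same matrix as normalizing the rows of tril directly
wei_avg = tril / tril.sum(dim=1, keepdim=True)
print(torch.allclose(wei, wei_avg))
```

Every masked entry becomes exactly zero after the softmax, and the remaining entries in each row share the weight equally because they all started from the same value.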
56:33
Now, the reason that this is a bit more interesting, and the reason we're going to end up using it in self-attention, is that these weights here begin with zero, and you can think of this as an interaction strength, or an affinity. Basically, it's telling us how much of each token from the past we want to aggregate and average up.
56:59
And then this line is saying that tokens from the future cannot communicate: by setting them to negative infinity, we're saying that we will not aggregate anything from those tokens.
57:09
So basically this then goes through softmax, and then through the weighted aggregation via matrix multiplication. You can think of it this way: these zeros are currently just set by us to be zero, but a quick preview is that these affinities between the tokens are not going to be constant at zero. They're going to be data dependent. These tokens are going to start looking at each other, and some tokens will find other tokens more or less interesting; depending on what their values are, they're going to find each other interesting to different amounts, and I'm going to call those affinities, I think.
57:47
And then here we are saying that the future cannot communicate with the past; we're going to clamp those. Then, when we normalize and sum, we're going to aggregate their values depending on how interesting they find each other. So that's the preview for self-attention.
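As a rough preview of that idea, the sketch below replaces the all-zero starting weights with non-zero affinities. It assumes PyTorch, and the affinities here are plain random numbers used as a stand-in: how the data-dependent affinities are actually computed is not covered at this point in the transcript. The code only shows the pipeline of masking, softmax, and aggregation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 2, 4, 3  # illustrative sizes
x = torch.randn(B, T, C)

# Placeholder affinities: random numbers stand in for data-dependent scores
affinities = torch.randn(T, T)

tril = torch.tril(torch.ones(T, T))
wei = affinities.masked_fill(tril == 0, float("-inf"))  # no looking at the future
wei = F.softmax(wei, dim=-1)   # rows still sum to 1, but are no longer uniform
out = wei @ x                  # (T, T) @ (B, T, C) -> (B, T, C)

print(wei)
print(out.shape)
```

Each row of wei is still a valid set of averaging weights over the current and past positions, but larger affinities now receive a larger share.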
58:03
Long story short from this entire section: you can do weighted aggregations of your past elements by using matrix multiplication with a lower-triangular matrix, and the elements in the lower-triangular part tell you how much of each element fuses into this position. So we're going to use this trick now to develop the self-attention block.
58:27
So first, let's get some quick preliminaries out of the way. The thing I'm kind of bothered by is that you see how we're passing vocab_size into the constructor. There's no need to do that, because vocab_size is already defined up top as a global variable, so there's no need to pass this stuff around.
58:44
Next, I want to create a level of indirection here, where we don't go directly from the embedding to the logits but instead go through this intermediate phase, because we're going to start making that bigger. So let me introduce a new variable, n_embd, short for number of embedding dimensions. n_embd here will be, say, 32. That was a suggestion from GitHub Copilot, by the way; it also suggested 32, which is a good number. So this is an embedding table with only 32-dimensional embeddings.
59:21
So then here, this is not going to give us logits directly. Instead, this is going to give us token embeddings; that's what I'm going to call it.
59:28
And then to go from the token embeddings to the logits, we're going to need a linear layer. So self.lm_head, let's call it, short for language modeling head, is an nn.Linear from n_embd up to vocab_size. And then when we swing over here, we're actually going to get the logits by exactly what Copilot says.
59:46
Now, we have to be careful here, because this C and this C are not equal: this one is n_embd and this one is vocab_size. So let's just say that n_embd is equal to C. This just creates one spurious layer of interaction through a linear layer, but this should basically run.
1:00:11
So we see that this runs. It currently looks kind of spurious, but we're going to build on top of this now.
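The change described here might look roughly like the sketch below, assuming PyTorch. The class name BigramLanguageModel, the value vocab_size = 65, and the dummy batch are assumptions made so the example runs on its own; the transcript only names vocab_size, n_embd, lm_head, and the token embeddings.

```python
import torch
import torch.nn as nn

vocab_size = 65   # assumed value; in the video it is a global derived from the dataset
n_embd = 32       # number of embedding dimensions

class BigramLanguageModel(nn.Module):  # class name assumed for illustration
    def __init__(self):
        super().__init__()
        # token ids -> n_embd-dimensional token embeddings
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        # token embeddings -> logits over the vocabulary
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx):
        tok_emb = self.token_embedding_table(idx)  # (B, T, n_embd)
        logits = self.lm_head(tok_emb)             # (B, T, vocab_size)
        return logits

model = BigramLanguageModel()
idx = torch.randint(0, vocab_size, (4, 8))  # dummy batch of token ids
logits = model(idx)
print(logits.shape)
```

Note that the last dimension changes from n_embd inside the model to vocab_size at the output, which is the distinction between the two C's mentioned above.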
1:00:22
Next up: so far we've taken these indices and encoded them based on the identity of the tokens inside idx. The next thing that people very often do is encode not just the identity of these tokens but also their position. So we're going to have a second embedding table here, a position embedding table: self.position_embedding_table is an nn.Embedding of block_size by n_embd, so each position from 0 to block_size - 1 will also get its own embedding vector.
1:00:50
And then here, first let me decode B and T from idx.shape. Then we're also going to have a pos_emb, which is the positional embedding. The argument is torch.arange, so this will basically be just the integers from 0 to T - 1, and all of those integers get embedded through the table to create a T by C.
1:01:14
And then this gets renamed to just x, and x will be the addition of the token embeddings and the positional embeddings. The broadcasting here will work out: with B by T by C plus T by C, the second shape gets right-aligned, a new dimension of one gets added, and it gets broadcast across the batch.
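Putting the positional embedding into the forward pass might look like the following sketch, assuming PyTorch. The class name and the concrete values of vocab_size, n_embd, and block_size are assumptions for a self-contained example.

```python
import torch
import torch.nn as nn

vocab_size, n_embd, block_size = 65, 32, 8  # assumed hyperparameters

class BigramLanguageModel(nn.Module):  # class name assumed for illustration
    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)     # (B, T, C)
        pos = torch.arange(T, device=idx.device)      # integers 0 .. T-1
        pos_emb = self.position_embedding_table(pos)  # (T, C)
        x = tok_emb + pos_emb     # (B, T, C) + (T, C) broadcasts over the batch
        logits = self.lm_head(x)  # (B, T, vocab_size)
        return logits

model = BigramLanguageModel()
idx = torch.randint(0, vocab_size, (4, block_size))  # dummy batch of token ids
logits = model(idx)
print(logits.shape)
```

Because the position table has only block_size rows, T must not exceed block_size in this sketch.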
1:01:34
So at this point x holds not just the token identities but also the positions at which these tokens occur. This is currently not that useful because, of course, we just have a simple bigram model, so it doesn't matter if you're in the fifth position, the second position, or wherever; it's all translation invariant at this stage. So this information currently wouldn't help, but as we work on the self-attention block, we'll see that this starts to matter.
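One way to see the translation invariance of the plain bigram setup is the small check below. It is a sketch assuming PyTorch, with made-up token ids and sizes, and it deliberately leaves out the position embedding: when the logits depend only on the token identity, the same token produces the same prediction wherever it appears.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, n_embd = 65, 32  # assumed sizes

token_embedding_table = nn.Embedding(vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size)

# The same token id (7) appears at positions 1 and 4 (ids chosen arbitrarily)
idx = torch.tensor([[3, 7, 12, 40, 7, 1]])
logits = lm_head(token_embedding_table(idx))   # (1, 6, vocab_size)

# Without positional information, identical tokens give identical predictions
print(torch.allclose(logits[0, 1], logits[0, 4]))
```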
1:01:59
Okay, so now we get to the crux of self-attention. This is probably the most important part of this video to understand. We're going to implement a small self-attention for a single individual head, as they're called. We start off with where we were, so all of this code is familiar. Right now I'm working with an example where I changed the number of channels from 2 to 32, so we have a 4 by 8 arrangement of tokens, and the information at each token is currently 32-dimensional, but we're just working with random numbers.
1:02:34
Now, we saw here that the code as we had it before does a simple average of all the past tokens and the current token. The previous information and the current information are just being mixed together in an average, and that's what this code currently achieves. It does so by creating this lower-triangular structure, which allows us to mask out this wei matrix that we create. So we mask it out and then we normalize it.
1:03:03
Currently, when we initialize the affinities between all the different tokens, or nodes (I'm going to use those terms interchangeably), to be zero, we see that wei gives us this structure where every single row has these uniform numbers. That is what makes this matrix multiply do a simple average.
1:03:32
average now we don't actually want this
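The uniform averaging described here can be sketched in plain Python. This is a framework-free illustration, not the notebook code from the video; the helper `softmax`, the toy vectors in `x`, and the sequence length of 4 are assumptions chosen for brevity.

```python
import math

def softmax(row):
    # Exponentiate and normalize one row; -inf entries come out as exactly 0.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

T = 4  # sequence length (the lecture uses a block size of 8)
NEG_INF = float("-inf")

# All affinities start at zero; future positions (j > i) are masked to -inf.
wei = [[0.0 if j <= i else NEG_INF for j in range(T)] for i in range(T)]
wei = [softmax(row) for row in wei]

# Toy token vectors with two channels each (illustrative values).
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# out[i] is the weighted sum of x[0..i], which here is a plain average.
out = [[sum(wei[i][j] * x[j][c] for j in range(T)) for c in range(2)]
       for i in range(T)]

for row in wei:
    print([round(w, 3) for w in row])
```

Row i of `wei` holds 1/(i+1) for positions 0 through i and 0 afterwards, so `out[i]` is the mean of tokens 0 through i.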
1:03:36
Now, we don't actually want this to be all uniform, because different tokens will find different other tokens more or less interesting, and we want that to be data dependent. So for example, if I'm a vowel, then maybe I'm looking for consonants in my past, and maybe I want to know what those consonants are, and I want that information to flow to me. So I want to gather information from the past, but I want to do it in a data-dependent way, and this is the problem that self-attention solves.
1:04:00
Now, the way self-attention solves this is the following. Every single node, or every single token at each position, will emit two vectors: it will emit a query and it will emit a key. The query vector, roughly speaking, is "what am I looking for?", and the key vector, roughly speaking, is "what do I contain?". Then the way we get affinities between these tokens in a sequence is we basically just do a dot product between the keys and the queries. So my query dot-products with all the keys of all the other tokens, and that dot product now becomes wei. If the key and the query are sort of aligned, they will interact to a very high amount, and then I will get to learn more about that specific token as opposed to any other token in the sequence.
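A tiny numeric illustration of "aligned key and query give a high affinity", in plain Python. The three-channel vectors and the meaning attached to each channel are made up for this example.

```python
def dot(a, b):
    # Plain dot product between two equal-length vectors.
    return sum(p * q for p, q in zip(a, b))

# Hypothetical 3-channel vectors; channel 0 might stand for "consonant".
query = [1.0, 0.0, 0.5]   # what the current token is looking for

keys = [
    [2.0, 0.1, 0.0],    # strongly aligned with the query on channel 0
    [-1.0, 0.3, 0.2],   # points away from the query
    [0.0, 1.0, 0.0],    # orthogonal to the query
]

# One raw affinity per key; the largest marks the most interesting token.
affinities = [dot(query, k) for k in keys]
best = max(range(len(keys)), key=lambda j: affinities[j])
```

Here the first key gets the largest affinity, so `best` is 0; the raw affinities are only turned into weights after the masking and normalization steps.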
1:04:56
So let's implement this.
1:05:00
Now we're going to implement a single, what's called, head of self-attention. So this is just one head. There's a hyperparameter involved with these heads, which is the head size. Then here I'm initializing linear modules, and I'm using bias=False, so these are just going to apply a matrix multiply with some fixed weights. Now let me produce a key and a query, k and q, by forwarding these modules on x. The size of this will now become B by T by 16, because that is the head size, and the same here: B by T by 16, this being the head size. So you see here that when I forward this linear on top of my x, all the tokens in all the positions in the B by T arrangement, all of them, in parallel and independently, produce a key and a query. So no communication has happened yet.
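The projection step can be sketched without any framework. The code below is an assumption-laden stand-in for the bias-free linear modules in the lecture: `make_weight`, `project` and `project_all` are hypothetical helpers, and the sizes are shrunk from 4, 8, 32, 16 to keep it readable.

```python
import random

random.seed(0)
B, T, C, head_size = 2, 3, 4, 2  # small stand-ins for the lecture's 4, 8, 32, 16

def make_weight(n_in, n_out):
    # Fixed random weights for a bias-free linear map.
    return [[random.gauss(0.0, 1.0) for _ in range(n_out)] for _ in range(n_in)]

def project(token, weight):
    # Map one token vector of length C to a vector of length head_size.
    return [sum(token[c] * weight[c][h] for c in range(len(token)))
            for h in range(len(weight[0]))]

def project_all(batch, weight):
    # Apply the same map to every token at every position independently.
    return [[project(token, weight) for token in seq] for seq in batch]

x = [[[random.gauss(0.0, 1.0) for _ in range(C)] for _ in range(T)] for _ in range(B)]
w_key, w_query = make_weight(C, head_size), make_weight(C, head_size)

k = project_all(x, w_key)    # shape (B, T, head_size)
q = project_all(x, w_query)  # shape (B, T, head_size)

# Overwrite one token: only that token's key changes, so no mixing has happened.
x_changed = [[list(token) for token in seq] for seq in x]
x_changed[0][1] = [0.0] * C
k_changed = project_all(x_changed, w_key)
```

Because each key depends only on its own token, `k_changed` differs from `k` at position `[0][1]` and nowhere else.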
1:06:07
But the communication comes now: all the queries will dot-product with all the keys. So basically what we want is for wei, the affinities between these, to be query multiplying key. But we have to be careful: we can't matrix multiply this directly, we actually need to transpose k. And we also have to be careful because of the batch dimension. So in particular we want to transpose the last two dimensions, dimension -1 and dimension -2, so -2, -1. This matrix multiply will now basically do the following: B by T by 16 matrix-multiplies B by 16 by T to give us B by T by T. Right?
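The shape bookkeeping can be checked with a small plain-Python sketch. `transpose_last_two` and `bmm` are hypothetical helpers written for this example (they mimic a transpose of the last two dimensions followed by a batched matrix multiply), and the sizes are shrunk from the lecture's 4, 8, 16.

```python
import random

random.seed(1)
B, T, head_size = 2, 3, 4  # small stand-ins for the lecture's 4, 8, 16

def rand_batch(b, t, h):
    return [[[random.gauss(0.0, 1.0) for _ in range(h)] for _ in range(t)]
            for _ in range(b)]

def transpose_last_two(batch):
    # (B, T, H) -> (B, H, T); the batch dimension is left untouched.
    return [[list(col) for col in zip(*mat)] for mat in batch]

def bmm(a, b):
    # Batched matrix multiply: (B, n, m) @ (B, m, p) -> (B, n, p).
    return [[[sum(row[i] * mb[i][j] for i in range(len(row)))
              for j in range(len(mb[0]))]
             for row in ma]
            for ma, mb in zip(a, b)]

q = rand_batch(B, T, head_size)
k = rand_batch(B, T, head_size)

# (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
wei = bmm(q, transpose_last_two(k))
print(len(wei), len(wei[0]), len(wei[0][0]))  # 2 3 3
```

Entry `wei[b][i][j]` is the dot product of query i with key j inside batch element b, so each batch element gets its own T by T table of affinities.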
1:06:58
So for every row of B, we're now going to have a T by T square matrix giving us the affinities, and these are now the wei. So they're not zeros; they are now coming from this dot product between the keys and the queries. This can now run, I can run this, and the weighted aggregation is now a function, in a data-dependent manner, of the keys and queries of these nodes.
1:07:20
So, just inspecting what happened here, wei takes on this form. You see that before, wei was just a constant, so it was applied in the same way to all the batch elements. But now every single batch element will have a different wei, because every single batch element contains different tokens at different positions, and so this is now data dependent. So when we look at just the zeroth row, for example, in the input, these are the weights that came out, and you can see that they're not just exactly uniform.
In particular, as an example, here for the last row: this was the eighth token, and the eighth token knows what content it has and it knows what position it's in. Based on that, the eighth token creates a query: "Hey, I'm looking for this kind of stuff. I'm a vowel, I'm in the eighth position, I'm looking for any consonants at positions up to four." Then all the nodes get to emit keys, and maybe one of the channels could be "I am a consonant and I am in a position up to four", and that key would have a high number in that specific channel. That's how the query and the key, when they dot-product, can find each other and create a high affinity. And when they have a high affinity (say this token was pretty interesting to this eighth token), then through the softmax I will end up aggregating a lot of its information into my position, and so I'll get to learn a lot about it.
1:08:55
Now, we're looking at wei after this has already happened. Let me erase this operation as well; let me erase the masking and the softmax, just to show you the under-the-hood internals and how that works. Without the masking and the softmax, wei comes out like this. These are the outputs of the dot products, the raw outputs, and they take on values from negative two to positive two, et cetera. So those are the raw interactions and raw affinities between all the nodes. But now, if I'm the fifth node, I will not want to aggregate anything from the sixth node, the seventh node and the eighth node, so we use the upper triangular masking, and those are not allowed to communicate. And now we actually want to have a nice distribution, so we don't want to aggregate negative 0.11 of this node, that's crazy. Instead we exponentiate and normalize, and now we get a nice distribution that sums to one. This is telling us, in a data-dependent manner, how much information to aggregate from any of these tokens in the past. So that's wei, and it's not zeros anymore, but it's calculated in this way.
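The two steps applied to the raw affinities, masking and then normalizing, can be sketched as follows. The 3 by 3 table of raw scores is invented for illustration, and `softmax` is a hand-written helper rather than a library call.

```python
import math

# Hypothetical raw dot-product scores for three tokens
# (rows are queries, columns are keys).
raw = [
    [0.50, -1.20, 2.00],
    [-0.11, 1.00, 0.30],
    [1.50, -0.70, 0.20],
]
T = len(raw)

# Mask the upper triangle so token i cannot aggregate from tokens j > i.
masked = [[raw[i][j] if j <= i else float("-inf") for j in range(T)]
          for i in range(T)]

def softmax(row):
    # Exponentiate and normalize; masked (-inf) entries become exactly 0.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

wei = [softmax(row) for row in masked]

for row in wei:
    print([round(w, 3) for w in row])
```

Every row of `wei` is non-negative and sums to one, and a negative raw score such as -0.11 simply turns into a small positive weight instead of a negative contribution.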
1:10:10
Now there's one more part to a single self-attention head, and that is that when we do the aggregation, we don't actually aggregate the tokens exactly. We produce one more value here, and we call that the value. So in the same way that we produced key and query, we're also going to create a value. And then here we don't aggregate x; we calculate a v, which is just achieved by propagating this linear on top of x again, and then we output wei multiplied by v. So v is the elements that we aggregate, or the vectors that we aggregate, instead of the raw x. And now, of course, this will make it so that the output of this single head will be 16-dimensional, because that is the head size.
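Putting the pieces together, a single head for one batch element might look like the sketch below. It is written in plain Python under several assumptions: the helper names are hypothetical, the sizes are shrunk from 8, 32, 16, the weights are random, and no scaling factor is applied because none has been described up to this point.

```python
import math
import random

random.seed(42)
T, C, head_size = 4, 6, 3  # small stand-ins for the lecture's 8, 32, 16

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    # (n, m) @ (m, p) -> (n, p)
    return [[sum(a[i][n] * b[n][j] for n in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

x = rand_matrix(T, C)  # one batch element, shape (T, C)
w_q, w_k, w_v = (rand_matrix(C, head_size) for _ in range(3))

# Bias-free projections, each of shape (T, head_size).
q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)

# Raw affinities (T, T), then causal mask and normalization.
scores = matmul(q, [list(col) for col in zip(*k)])
wei = [softmax([s if j <= i else float("-inf") for j, s in enumerate(row)])
       for i, row in enumerate(scores)]

# Aggregate the values, not the raw x: shape (T, head_size).
out = matmul(wei, v)
```

The output has one `head_size`-dimensional vector per position, and the first position can only attend to itself, so `out[0]` equals `v[0]`.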
1:11:01
So you can think of x as kind of like private information to this token, if you think about it that way. So x is kind of private to this token: I'm a fifth token and I have some identity, and my information is kept in vector x. And now, for the purposes of this single head, here's what I'm interested in, here's what I have, and if you find me interesting, here's what I will communicate to you, and that's stored in v. So v is the thing that gets aggregated for the purposes of this single head between the different nodes. And that's basically the self-attention mechanism; this is what it does.
1:11:39
There are a few notes that I would like to make about attention. Number one: attention is a communication mechanism. You can really think about it as a communication mechanism where you have a number of nodes in a directed graph, where basically you have edges pointing between nodes like this. What happens is every node has some vector of information, and it gets to aggregate information via a weighted sum from all of the nodes that point to it. And this is done in a data-dependent manner, so depending on whatever data is actually stored at each node at any point in time.
Now, our graph doesn't look like this. Our graph has a different structure: we have eight nodes, because the block size is eight and there are always eight tokens. The first node is only pointed to by itself; the second node is pointed to by the first node and itself; all the way up to the eighth node, which is pointed to by all the previous nodes and itself. So that's the structure that our directed graph happens to have in an autoregressive scenario like language modeling. But in principle attention can be applied to any arbitrary directed graph, and it's just a communication mechanism between the nodes.
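The graph view can be made concrete with a small sketch in which the incoming edges of each node are given explicitly. `aggregate` is a hypothetical helper, and the affinities are all set to zero so that the arithmetic stays visible; with learned keys and queries they would differ per edge.

```python
import math

def aggregate(values, scores, sources):
    # values[j]: vector stored at node j
    # scores[i][j]: raw affinity of node i for node j
    # sources[i]: nodes with an edge pointing into node i
    out = []
    for i, incoming in enumerate(sources):
        m = max(scores[i][j] for j in incoming)
        exps = {j: math.exp(scores[i][j] - m) for j in incoming}
        total = sum(exps.values())
        out.append([sum(exps[j] / total * values[j][c] for j in incoming)
                    for c in range(len(values[0]))])
    return out

T = 4
values = [[1.0], [2.0], [3.0], [4.0]]
scores = [[0.0] * T for _ in range(T)]  # equal affinities for illustration

# Autoregressive structure: node i is pointed to by nodes 0..i.
causal = [list(range(i + 1)) for i in range(T)]
# A different directed graph: every node points to every node.
fully_connected = [list(range(T)) for _ in range(T)]

causal_out = aggregate(values, scores, causal)
full_out = aggregate(values, scores, fully_connected)
```

Only the edge lists change between the two calls: the causal graph yields running averages, while the fully connected graph gives every node the average over all nodes.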
1:12:51
The second note is that there is no notion of space. Attention simply acts over a set of vectors in this graph, and so by default these nodes have no idea where they are positioned in space. That's why we need to encode them positionally and give them some information that is anchored to a specific position, so that they sort of know where they are. This is different from, for example, convolution, because if you run a convolution operation over some input, there's a very specific layout of the information in space, and the convolutional filters act in space. So it's not like attention: in attention there is just a set of vectors out there in space, they communicate, and if you want them to have a notion of space, you need to specifically add it, which is what we've done when we calculated the positional encodings and added that information to the vectors.
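One way to see that attention by itself has no notion of position is to reorder the inputs and watch the outputs reorder in exactly the same way. The sketch below makes strong simplifying assumptions: there is no mask, and query, key and value are all the token vector itself (identity projections). The position vectors in `pos` are invented for the example.

```python
import math

def attend(xs):
    # Unmasked self-attention with identity projections (a simplification):
    # query, key and value are all the token vector itself.
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) for k in xs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[c] for e, v in zip(exps, xs))
                    for c in range(len(q))])
    return out

def add(u, v):
    return [a + b for a, b in zip(u, v)]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]  # a reordering of the same tokens
shuffled = [tokens[p] for p in perm]

out = attend(tokens)
out_shuffled = attend(shuffled)  # same vectors, just reordered

# Hypothetical position vectors, one per slot in the sequence.
pos = [[0.0, 0.0], [0.5, -0.5], [1.0, -1.0]]
with_pos = attend([add(t, p) for t, p in zip(tokens, pos)])
with_pos_shuffled = attend([add(t, p) for t, p in zip(shuffled, pos)])
```

Without position information `out_shuffled[i]` matches `out[perm[i]]`, so order is invisible to the mechanism; once a position vector is added to each slot, the same reordering changes the results.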
1:13:41
The next thing that I hope is very clear is that the elements across the batch dimension, which are independent examples, never talk to each other; they're always processed independently. This is a batched matrix multiply that applies a matrix multiplication in parallel across the batch dimension. So maybe it would be more accurate to say that, in this analogy of a directed graph, because the batch size is four, we really have four separate pools of eight nodes, and those eight nodes only talk to each other. In total there are 32 nodes being processed, but there are four separate pools of eight
1:14:13
um sort of four separate pools of eight
1:14:13
um sort of four separate pools of eight you can look at it that way the next
1:14:15
you can look at it that way the next
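The batch independence can be checked directly. This is a sketch under assumed sizes (C = 16 is arbitrary) with identity projections in place of the key, query, and value layers: overwriting one example in the batch leaves the outputs of the other three untouched.

```python
import torch

torch.manual_seed(0)
B, T, C = 4, 8, 16  # 4 independent examples, 8 tokens each; C is an arbitrary choice

def attend(x):
    # x: (B, T, C). On 3-D tensors, @ is a batched matrix multiply:
    # one (T, C) @ (C, T) product per batch element.
    wei = x @ x.transpose(-2, -1) / C**0.5           # (B, T, T)
    tril = torch.tril(torch.ones(T, T))
    wei = wei.masked_fill(tril == 0, float("-inf"))  # causal mask, broadcast over B
    wei = torch.softmax(wei, dim=-1)
    return wei @ x                                   # (B, T, C)

x = torch.randn(B, T, C)
out = attend(x)

# Replace example 0 entirely; examples 1..3 should not notice.
x_changed = x.clone()
x_changed[0] = torch.randn(T, C)
out_changed = attend(x_changed)

others_unchanged = torch.allclose(out[1:], out_changed[1:])
first_changed = not torch.allclose(out[0], out_changed[0])
print(others_unchanged, first_changed)
```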
1:14:15
The next note is that here, in the case of language modeling, we have this specific structure of directed graph where the future tokens do not communicate to the past tokens. But this doesn't necessarily have to be the constraint in the general case, and in fact in many cases you may want all of the nodes to talk to each other fully. As an example, if you're doing sentiment analysis or something like that with a Transformer, you might have a number of tokens and you may want them all to talk to each other fully, because later you are predicting, for example, the sentiment of the sentence, and so it's okay for these nodes to talk to each other. In those cases you will use an encoder block of self-attention, and all it means that it's an encoder block is that you delete this line of code, the one that applies the mask, allowing all the nodes to completely talk to each other.
1:15:04
What we're implementing here is sometimes called a decoder block. It's called a decoder because it is sort of like decoding language, and it has this autoregressive format where you have to mask with the triangular matrix so that nodes from the future never talk to the past, because they would give away the answer. So basically, in encoder blocks you would delete this and allow all the nodes to talk; in decoder blocks this will always be present, so that you have this triangular structure. But both are allowed, and attention doesn't care: attention supports arbitrary connectivity between nodes.
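The encoder/decoder distinction comes down to whether the masking step runs. A small sketch, assuming random numbers as stand-ins for the raw query-key affinities and a hypothetical `causal` flag that is not part of the lecture's code:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 4
wei_raw = torch.randn(T, T)  # stand-in for raw query-key affinities

def attention_weights(wei, causal):
    if causal:
        # Decoder block: hide the future with a lower-triangular mask.
        tril = torch.tril(torch.ones(T, T))
        wei = wei.masked_fill(tril == 0, float("-inf"))
    # Encoder block: identical, except the masking step above is skipped.
    return F.softmax(wei, dim=-1)

decoder_w = attention_weights(wei_raw, causal=True)
encoder_w = attention_weights(wei_raw, causal=False)

print(decoder_w)  # zeros above the diagonal: no token attends to the future
print(encoder_w)  # every entry positive: all tokens attend to all tokens
```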
1:15:40
connectivity between nodes the next thing I wanted to comment on is you keep
1:15:41
thing I wanted to comment on is you keep
1:15:41
thing I wanted to comment on is you keep me you keep hearing me say attention
1:15:43
me you keep hearing me say attention
1:15:43
me you keep hearing me say attention self attention Etc there's actually also
1:15:45
self attention Etc there's actually also
1:15:45
self attention Etc there's actually also something called cross attention what is
1:15:47
something called cross attention what is
1:15:47
something called cross attention what is the
1:15:47
the
1:15:47
the difference
1:15:49
difference
1:15:49
difference so basically the reason this attention
1:15:52
so basically the reason this attention
1:15:52
so basically the reason this attention is self attention is because because the
1:15:55
is self attention is because because the
1:15:55
is self attention is because because the keys queries and the values are all
1:15:57
keys queries and the values are all
1:15:57
keys queries and the values are all coming from the same Source from X so
1:16:01
coming from the same Source from X so
1:16:01
coming from the same Source from X so the same Source X produces Keys queries
1:16:03
the same Source X produces Keys queries
1:16:03
the same Source X produces Keys queries and values so these nodes are self
1:16:05
and values so these nodes are self
1:16:05
and values so these nodes are self attending but in principle attention is
1:16:08
attending but in principle attention is
1:16:08
attending but in principle attention is much more General than that so for
1:16:10
much more General than that so for
1:16:10
much more General than that so for example an encoder decoder Transformers
1:16:12
example an encoder decoder Transformers
1:16:12
example an encoder decoder Transformers uh you can have a case where the queries
1:16:14
uh you can have a case where the queries
1:16:15
uh you can have a case where the queries are produced from X but the keys and the
1:16:17
are produced from X but the keys and the
1:16:17
are produced from X but the keys and the values come from a whole separate
1:16:18
values come from a whole separate
1:16:18
values come from a whole separate external source and sometimes from uh
1:16:21
external source and sometimes from uh
1:16:21
external source and sometimes from uh encoder blocks that encode some context
1:16:23
encoder blocks that encode some context
1:16:23
encoder blocks that encode some context that we'd like to condition on
1:16:25
that we'd like to condition on
1:16:25
that we'd like to condition on and so the keys and the values will
1:16:26
and so the keys and the values will
1:16:26
and so the keys and the values will actually come from a whole separate
1:16:28
actually come from a whole separate
1:16:28
actually come from a whole separate Source those are nodes on the side and
1:16:30
Source those are nodes on the side and
1:16:31
Source those are nodes on the side and here we're just producing queries and
1:16:32
here we're just producing queries and
1:16:32
here we're just producing queries and we're reading off information from the
1:16:34
we're reading off information from the
1:16:34
we're reading off information from the side so cross attention is used when
1:16:37
side so cross attention is used when
1:16:37
side so cross attention is used when there's a separate source of nodes we'd
1:16:40
there's a separate source of nodes we'd
1:16:40
there's a separate source of nodes we'd like to pull information from into our
1:16:42
like to pull information from into our
1:16:42
like to pull information from into our nodes and it's self attention if we just
1:16:45
nodes and it's self attention if we just
1:16:45
nodes and it's self attention if we just have nodes that would like to look at
1:16:46
have nodes that would like to look at
1:16:46
have nodes that would like to look at each other and talk to each other so
1:16:48
each other and talk to each other so
1:16:48
each other and talk to each other so this attention here happens to be self
1:16:51
this attention here happens to be self
1:16:51
this attention here happens to be self attention but in principle um attention
1:16:55
attention but in principle um attention
1:16:55
attention but in principle um attention is a lot more General okay and the last
1:16:57
is a lot more General okay and the last
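To make the self versus cross distinction concrete, here is a hedged sketch in which the same attention routine is called twice, changing only where the keys and values come from. The context tensor `ctx`, its length `T_ctx`, and the helper name `attend` are assumptions for illustration; no mask is applied because the example is only about the source of k and v.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, T_ctx, C, head_size = 4, 8, 5, 32, 16  # T_ctx: length of the external context (assumed)

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

x = torch.randn(B, T, C)        # our nodes
ctx = torch.randn(B, T_ctx, C)  # separate source, e.g. the output of encoder blocks

def attend(q_src, kv_src):
    q = query(q_src)                                 # (B, Tq, head_size)
    k, v = key(kv_src), value(kv_src)                # (B, Tk, head_size)
    wei = q @ k.transpose(-2, -1) * head_size**-0.5  # (B, Tq, Tk)
    wei = torch.softmax(wei, dim=-1)
    return wei @ v                                   # (B, Tq, head_size)

self_out = attend(x, x)     # self-attention: q, k, v all come from x
cross_out = attend(x, ctx)  # cross-attention: q from x, k and v from ctx

print(self_out.shape, cross_out.shape)
```

Both calls return one vector per query node; only the pool of nodes being read from differs.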
1:16:57
is a lot more General okay and the last note at this stage is if we come to the
1:16:59
note at this stage is if we come to the
1:16:59
note at this stage is if we come to the attention is all need paper here we've
1:17:01
attention is all need paper here we've
1:17:01
attention is all need paper here we've already implemented attention so given
1:17:03
already implemented attention so given
1:17:03
already implemented attention so given query key and value we've U multiplied
1:17:06
query key and value we've U multiplied
1:17:06
query key and value we've U multiplied the query and a key we've soft maxed it
1:17:09
the query and a key we've soft maxed it
1:17:09
the query and a key we've soft maxed it and then we are aggregating the values
1:17:11
and then we are aggregating the values
1:17:11
and then we are aggregating the values there's one more thing that we're
1:17:12
there's one more thing that we're
1:17:12
there's one more thing that we're missing here which is the dividing by
1:17:13
missing here which is the dividing by
1:17:13
missing here which is the dividing by one / square root of the head size the
1:17:16
one / square root of the head size the
1:17:16
one / square root of the head size the DK here is the head size why are they
1:17:18
DK here is the head size why are they
1:17:18
DK here is the head size why are they doing this finds this important so they
1:17:21
doing this finds this important so they
1:17:21
doing this finds this important so they call it the scaled attention and it's
1:17:24
call it the scaled attention and it's
1:17:24
call it the scaled attention and it's kind of like an important normalization
1:17:25
kind of like an important normalization
1:17:25
kind of like an important normalization to basically
1:17:26
to basically
1:17:26
to basically have the problem is if you have unit gsh
1:17:29
have the problem is if you have unit gsh
1:17:29
have the problem is if you have unit gsh and inputs so zero mean unit variance K
1:17:32
and inputs so zero mean unit variance K
1:17:32
and inputs so zero mean unit variance K and Q are unit gashin then if you just
1:17:34
and Q are unit gashin then if you just
1:17:34
and Q are unit gashin then if you just do we naively then you see that your we
1:17:36
do we naively then you see that your we
1:17:37
do we naively then you see that your we actually will be uh the variance will be
1:17:38
actually will be uh the variance will be
1:17:38
actually will be uh the variance will be on the order of head size which in our
1:17:40
on the order of head size which in our
1:17:40
on the order of head size which in our case is 16 but if you multiply by one
1:17:43
case is 16 but if you multiply by one
1:17:43
case is 16 but if you multiply by one over head size square root so this is
1:17:45
over head size square root so this is
1:17:45
over head size square root so this is square root and this is one
1:17:47
square root and this is one
1:17:47
square root and this is one over then the variance of we will be one
1:17:50
over then the variance of we will be one
1:17:50
over then the variance of we will be one so it will be
1:17:51
so it will be
1:17:52
so it will be preserved now why is this important
1:17:54
preserved now why is this important
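The variance claim can be checked numerically. Each entry of wei is a sum of head_size products of independent unit Gaussians, so its variance is about head_size; multiplying by head_size**-0.5 brings it back to about one. The sketch below uses a larger B and T than the lecture's batch so that the estimates are stable; those sizes and the seed are assumptions.

```python
import torch

torch.manual_seed(1337)
B, T, head_size = 64, 32, 16  # B and T enlarged only to get stable variance estimates

k = torch.randn(B, T, head_size)  # unit Gaussian keys
q = torch.randn(B, T, head_size)  # unit Gaussian queries

wei_naive = q @ k.transpose(-2, -1)       # (B, T, T), no scaling
wei_scaled = wei_naive * head_size**-0.5  # scaled attention scores

print(k.var().item(), q.var().item())  # both close to 1
print(wei_naive.var().item())          # on the order of head_size
print(wei_scaled.var().item())         # close to 1
```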
1:17:54
Now why is this important? You'll notice that wei here will feed into softmax, and so it's really important, especially at initialization, that wei be fairly diffuse. In our case here we sort of lucked out and had fairly diffuse numbers. The problem is that, because of softmax, if wei takes on very positive and very negative numbers inside it, softmax will actually converge towards one-hot vectors. I can illustrate that here. Say we are applying softmax to a tensor of values that are very close to zero; then we're going to get a diffuse thing out of softmax. But the moment I take the exact same thing and start sharpening it, making it bigger by multiplying these numbers by eight, for example, you'll see that the softmax will start to sharpen, and in fact it will sharpen towards the max, towards whatever number here is the highest.
1:18:51
So basically we don't want these values to be too extreme, especially at initialization; otherwise softmax will be way too peaky and you're basically aggregating information from a single node: every node just aggregates information from a single other node. That's not what we want, especially at initialization, and so the scaling is used just to control the variance at initialization.
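A sketch of the softmax sharpening effect. The five input values are illustrative numbers near zero chosen for this example; the factor of eight follows the description above.

```python
import torch

logits = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])  # illustrative values close to zero

diffuse = torch.softmax(logits, dim=-1)    # small inputs: weights stay spread out
peaky = torch.softmax(logits * 8, dim=-1)  # same inputs scaled up: mass piles onto the max

print(diffuse)
print(peaky)
print(peaky.argmax().item())  # index of the largest input
```

With inputs this small, no output weight dominates; after scaling by eight, most of the probability mass sits on the largest entry, which is the behaviour the scaling factor is meant to avoid at initialization.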
1:19:13
initialization okay so having said all that let's now take our self attention
1:19:15
that let's now take our self attention
1:19:15
that let's now take our self attention knowledge and let's uh take it for a
1:19:17
knowledge and let's uh take it for a
1:19:17
knowledge and let's uh take it for a spin so here in the code I created this
1:19:19
spin so here in the code I created this
1:19:19
spin so here in the code I created this head module and it implements a single
1:19:22
head module and it implements a single
1:19:22
head module and it implements a single head of self attention so you give it a
1:19:24
head of self attention so you give it a
1:19:24
head of self attention so you give it a head size and then here it creates the
1:19:26
head size and then here it creates the
1:19:26
head size and then here it creates the key query and the value linear layers
1:19:29
key query and the value linear layers
1:19:29
key query and the value linear layers typically people don't use biases in
1:19:31
typically people don't use biases in
1:19:31
typically people don't use biases in these uh so those are the linear
1:19:33
these uh so those are the linear
1:19:33
these uh so those are the linear projections that we're going to apply to
1:19:34
projections that we're going to apply to
1:19:34
projections that we're going to apply to all of our nodes now here I'm creating
1:19:37
all of our nodes now here I'm creating
1:19:37
all of our nodes now here I'm creating this Trill variable Trill is not a
1:19:39
this Trill variable Trill is not a
1:19:40
this Trill variable Trill is not a parameter of the module so in sort of
1:19:41
parameter of the module so in sort of
1:19:41
parameter of the module so in sort of pytorch naming conventions uh this is
1:19:43
pytorch naming conventions uh this is
1:19:43
pytorch naming conventions uh this is called a buffer it's not a parameter and
1:19:46
called a buffer it's not a parameter and
1:19:46
called a buffer it's not a parameter and you have to call it you have to assign
1:19:47
you have to call it you have to assign
1:19:47
you have to call it you have to assign it to the module using a register buffer
1:19:49
it to the module using a register buffer
1:19:49
it to the module using a register buffer so that creates the trill uh the triang
1:19:52
so that creates the trill uh the triang
1:19:52
so that creates the trill uh the triang lower triangular Matrix and we're given
1:19:55
lower triangular Matrix and we're given
1:19:55
lower triangular Matrix and we're given the input X this should look very
1:19:56
the input X this should look very
1:19:56
the input X this should look very familiar now we calculate the keys the
1:19:58
familiar now we calculate the keys the
1:19:58
familiar now we calculate the keys the queries we C calculate the attention
1:20:00
queries we C calculate the attention
1:20:00
queries we C calculate the attention scores inside way uh we normalize it so
1:20:03
scores inside way uh we normalize it so
1:20:03
scores inside way uh we normalize it so we're using scaled attention here then
1:20:06
we're using scaled attention here then
1:20:06
we're using scaled attention here then we make sure that uh future doesn't
1:20:08
we make sure that uh future doesn't
1:20:08
we make sure that uh future doesn't communicate with the past so this makes
1:20:09
communicate with the past so this makes
1:20:10
communicate with the past so this makes it a decoder block and then softmax and
1:20:13
it a decoder block and then softmax and
1:20:13
it a decoder block and then softmax and then aggregate the value and
1:20:15
then aggregate the value and
1:20:15
then aggregate the value and output then here in the language model
1:20:17
output then here in the language model
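A sketch of what such a Head module could look like, following the steps described above. The hyperparameter values are assumptions, and the scores here are scaled by the head size as in the paper's formula; the code shown in the video may differ in details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, block_size = 32, 8  # assumed hyperparameters for this sketch

class Head(nn.Module):
    """One head of masked (decoder-style) self-attention."""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # A buffer lives on the module (and follows .to(device)),
        # but it is not a trainable parameter.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # Scaled attention scores: (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        # Decoder-style mask: the future never communicates with the past.
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)  # (B, T, head_size)
        return wei @ v     # (B, T, head_size)

torch.manual_seed(0)
head = Head(head_size=16)
out = head(torch.randn(4, block_size, n_embd))
print(out.shape)
```

Slicing the buffer as tril[:T, :T] lets the same module handle inputs shorter than block_size, which matters during generation when the context starts out short.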
1:20:17
output then here in the language model I'm creating a head in the Constructor
1:20:20
I'm creating a head in the Constructor
1:20:20
I'm creating a head in the Constructor and I'm calling it self attention head
1:20:22
and I'm calling it self attention head
1:20:22
and I'm calling it self attention head and the head size I'm going to keep as
1:20:24
and the head size I'm going to keep as
1:20:24
and the head size I'm going to keep as the same and embed just for
1:20:27
the same and embed just for
1:20:27
the same and embed just for now and then here once we've encoded the
1:20:31
now and then here once we've encoded the
1:20:31
now and then here once we've encoded the information with the token embeddings
1:20:32
information with the token embeddings
1:20:32
information with the token embeddings and the position embeddings we're simply
1:20:34
and the position embeddings we're simply
1:20:34
and the position embeddings we're simply going to feed it into the self attention
1:20:36
going to feed it into the self attention
1:20:36
going to feed it into the self attention head and then the output of that is
1:20:38
head and then the output of that is
1:20:38
head and then the output of that is going to go into uh the decoder language
1:20:42
going to go into uh the decoder language
1:20:42
going to go into uh the decoder language modeling head and create the logits so
1:20:44
modeling head and create the logits so
1:20:44
modeling head and create the logits so this the sort of the simplest way to
1:20:46
this the sort of the simplest way to
1:20:46
this the sort of the simplest way to plug in a self attention component uh
1:20:48
plug in a self attention component uh
1:20:49
plug in a self attention component uh into our Network right now I had to make
1:20:51
into our Network right now I had to make
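The wiring described above might look roughly like the following. The class name TinyLanguageModel, the attribute names, and the sizes (including vocab_size = 65) are assumptions for this sketch, and the Head class is repeated in compact form so the block runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, n_embd, block_size = 65, 32, 8  # assumed sizes for this sketch

class Head(nn.Module):
    """Compact single head of masked self-attention."""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        T = x.shape[1]
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        return F.softmax(wei, dim=-1) @ v

class TinyLanguageModel(nn.Module):  # hypothetical name
    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.sa_head = Head(n_embd)  # head size kept equal to n_embd for now
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)  # (B, T, n_embd)
        pos = torch.arange(T, device=idx.device)
        pos_emb = self.position_embedding_table(pos)  # (T, n_embd)
        x = tok_emb + pos_emb    # broadcast over the batch
        x = self.sa_head(x)      # one round of communication: (B, T, n_embd)
        return self.lm_head(x)   # logits: (B, T, vocab_size)

torch.manual_seed(0)
model = TinyLanguageModel()
idx = torch.randint(vocab_size, (4, block_size))
logits = model(idx)
print(logits.shape)
```

Keeping the head size equal to n_embd means the language modeling head can stay a Linear(n_embd, vocab_size) layer without any extra projection.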
1:20:51
into our Network right now I had to make one more change which is that here in
1:20:55
one more change which is that here in
1:20:55
one more change which is that here in the generate uh we have to make sure
1:20:57
the generate uh we have to make sure
1:20:57
the generate uh we have to make sure that our idx that we feed into the model
1:21:00
that our idx that we feed into the model
1:21:01
that our idx that we feed into the model because now we're using positional
1:21:02
because now we're using positional
1:21:02
because now we're using positional embeddings we can never have more than
1:21:04
embeddings we can never have more than
1:21:04
embeddings we can never have more than block size coming in because if idx is
1:21:07
block size coming in because if idx is
1:21:07
block size coming in because if idx is more than block size then our position
1:21:09
more than block size then our position
1:21:09
more than block size then our position embedding table is going to run out of
1:21:11
embedding table is going to run out of
1:21:11
embedding table is going to run out of scope because it only has embeddings for
1:21:12
scope because it only has embeddings for
1:21:12
scope because it only has embeddings for up to block size and so therefore I
1:21:15
up to block size and so therefore I
1:21:15
up to block size and so therefore I added some uh code here to crop the
1:21:17
added some uh code here to crop the
1:21:17
added some uh code here to crop the context that we're going to feed into
1:21:20
context that we're going to feed into
1:21:20
context that we're going to feed into self um so that uh we never pass in more
1:21:23
self um so that uh we never pass in more
1:21:23
self um so that uh we never pass in more than block siiz elements
1:21:25
than block siiz elements
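A sketch of the cropping step inside generate. TinyModel is a hypothetical stand-in with only the pieces needed to show the issue (token and position embeddings plus an output layer), and the sizes are assumptions; the relevant line is the slice idx[:, -block_size:].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, n_embd, block_size = 65, 32, 8  # assumed sizes

class TinyModel(nn.Module):  # hypothetical stand-in for the lecture's model
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, n_embd)
        self.pos = nn.Embedding(block_size, n_embd)  # only block_size positions exist
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx):
        T = idx.shape[1]
        x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        return self.lm_head(x)  # (B, T, vocab_size)

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]    # crop to the last block_size tokens
            logits = self(idx_cond)[:, -1, :]  # logits at the final position
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            idx = torch.cat((idx, idx_next), dim=1)             # (B, T + 1)
        return idx

torch.manual_seed(0)
model = TinyModel()
start = torch.zeros((1, 1), dtype=torch.long)
out = model.generate(start, max_new_tokens=20)
print(out.shape)
```

The full generated sequence keeps growing past block_size; only the window passed to the model is cropped. Calling the model directly on a sequence longer than block_size would index past the end of the position embedding table and fail.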
1:21:25
So those are the changes, and let's now train the network. I also came up to the script here and decreased the learning rate, because self-attention can't tolerate very high learning rates, and then I also increased the number of iterations because the learning rate is lower. Then I trained it. Previously we were only able to get down to a loss of about 2.5, and now we are down to 2.4, so we definitely see a little bit of an improvement, from 2.5 to 2.4 roughly, but the text is still not amazing. Clearly the self-attention head is doing some useful communication, but we still have a long way to go.
1:21:59
Okay, so now we've implemented scaled dot-product attention. Next up in the "Attention Is All You Need" paper there's something called multi-head attention. What is multi-head attention? It's just applying multiple attentions in parallel and concatenating their results. They have a little diagram here; I don't know if this is super clear. It's really just multiple attentions in parallel, so let's implement that. It's fairly straightforward.
1:22:25
If we want multi-head attention, then we want multiple heads of self-attention running in parallel. In PyTorch we can do this by simply creating multiple heads: however many heads you want, and then what the head size of each is. Then we run all of them in parallel into a list and simply concatenate all of the outputs, and we're concatenating over the channel dimension.
1:22:53
The way this looks now is that we don't have just a single attention head with a head size of 32 (remember, n_embd is 32). Instead of having one communication channel, we now have four communication channels in parallel, and each of these communication channels will typically be correspondingly smaller. Because we have four communication channels, we want 8-dimensional self-attention. From each communication channel we gather 8-dimensional vectors, and we have four of them, which concatenate to give us 32, the original n_embd.
1:23:30
embed and so this is kind of similar to um if you're familiar with convolutions
1:23:32
um if you're familiar with convolutions
1:23:32
um if you're familiar with convolutions this is kind of like a group convolution
1:23:34
this is kind of like a group convolution
1:23:34
this is kind of like a group convolution uh because basically instead of having
1:23:36
uh because basically instead of having
1:23:36
uh because basically instead of having one large convolution we do convolution
1:23:37
one large convolution we do convolution
1:23:38
one large convolution we do convolution in groups and uh that's multi-headed
1:23:40
in groups and uh that's multi-headed
1:23:40
in groups and uh that's multi-headed self
1:23:41
self
1:23:41
self attention and so then here we just use
1:23:44
attention and so then here we just use
1:23:44
attention and so then here we just use essay heads self attention heads instead
1:23:47
essay heads self attention heads instead
1:23:47
essay heads self attention heads instead now I actually ran it and uh scrolling
1:23:51
now I actually ran it and uh scrolling
1:23:51
now I actually ran it and uh scrolling down I ran the same thing and then we
1:23:53
down I ran the same thing and then we
1:23:53
down I ran the same thing and then we now get this down to 2.28 roughly and
1:23:57
now get this down to 2.28 roughly and
1:23:57
now get this down to 2.28 roughly and the output is still the generation is
1:23:58
the output is still the generation is
1:23:58
the output is still the generation is still not amazing but clearly the
1:24:00
still not amazing but clearly the
1:24:00
still not amazing but clearly the validation loss is improving because we
1:24:02
validation loss is improving because we
1:24:02
validation loss is improving because we were at 2.4 just now and so it helps to
1:24:05
were at 2.4 just now and so it helps to
1:24:05
were at 2.4 just now and so it helps to have multiple communication channels
1:24:07
have multiple communication channels
1:24:07
have multiple communication channels because obviously these tokens have a
1:24:09
because obviously these tokens have a
1:24:09
because obviously these tokens have a lot to talk about they want to find the
1:24:11
lot to talk about they want to find the
1:24:11
lot to talk about they want to find the consonants the vowels they want to find
1:24:13
consonants the vowels they want to find
1:24:13
consonants the vowels they want to find the vowels just from certain positions
1:24:15
the vowels just from certain positions
1:24:15
the vowels just from certain positions uh they want to find any kinds of
1:24:17
uh they want to find any kinds of
1:24:17
uh they want to find any kinds of different things and so it helps to
1:24:19
different things and so it helps to
1:24:19
different things and so it helps to create multiple independent channels of
1:24:20
create multiple independent channels of
1:24:20
create multiple independent channels of communication gather lots of different
1:24:22
communication gather lots of different
1:24:22
communication gather lots of different types of data and then uh decode the
1:24:25
types of data and then uh decode the
1:24:25
types of data and then uh decode the output now going back to the paper for a
1:24:27
output now going back to the paper for a
1:24:27
output now going back to the paper for a second of course I didn't explain this
1:24:28
second of course I didn't explain this
1:24:28
second of course I didn't explain this figure in full detail but we are
1:24:30
figure in full detail but we are
1:24:30
figure in full detail but we are starting to see some components of what
1:24:32
starting to see some components of what
1:24:32
starting to see some components of what we've already implemented we have the
1:24:33
we've already implemented we have the
1:24:33
we've already implemented we have the positional encodings the token encodings
1:24:35
positional encodings the token encodings
1:24:35
positional encodings the token encodings that add we have the masked multi-headed
1:24:37
that add we have the masked multi-headed
1:24:37
that add we have the masked multi-headed attention implemented now here's another
1:24:41
attention implemented now here's another
1:24:41
attention implemented now here's another multi-headed attention which is a cross
1:24:42
multi-headed attention which is a cross
1:24:42
multi-headed attention which is a cross attention to an encoder which we haven't
1:24:45
attention to an encoder which we haven't
1:24:45
attention to an encoder which we haven't we're not going to implement in this
1:24:46
we're not going to implement in this
1:24:46
we're not going to implement in this case I'm going to come back to that
1:24:48
case I'm going to come back to that
1:24:48
case I'm going to come back to that later but I want you to notice that
1:24:50
later but I want you to notice that
1:24:50
later but I want you to notice that there's a feed forward part here and
1:24:52
there's a feed forward part here and
1:24:52
there's a feed forward part here and then this is grouped into a block that
1:24:53
then this is grouped into a block that
1:24:53
then this is grouped into a block that gets repeat it again and again now the
1:24:56
gets repeat it again and again now the
1:24:56
gets repeat it again and again now the feedforward part here is just a simple
1:24:57
feedforward part here is just a simple
1:24:57
feedforward part here is just a simple uh multi-layer perceptron
1:25:00
uh multi-layer perceptron
1:25:00
uh multi-layer perceptron um so the multi-headed so here position
1:25:04
um so the multi-headed so here position
1:25:04
um so the multi-headed so here position wise feed forward networks is just a
1:25:06
wise feed forward networks is just a
1:25:06
wise feed forward networks is just a simple little MLP so I want to start
1:25:08
simple little MLP so I want to start
1:25:08
simple little MLP so I want to start basically in a similar fashion also
1:25:10
basically in a similar fashion also
1:25:10
basically in a similar fashion also adding computation into the network and
1:25:13
adding computation into the network and
1:25:13
adding computation into the network and this computation is on a per node level
1:25:16
this computation is on a per node level
1:25:16
this computation is on a per node level so I've already implemented it and you
1:25:18
so I've already implemented it and you
1:25:18
so I've already implemented it and you can see the diff highlighted on the left
1:25:20
can see the diff highlighted on the left
1:25:20
can see the diff highlighted on the left here when I've added or changed things
1:25:22
here when I've added or changed things
1:25:22
here when I've added or changed things now before we had the self multi-headed
1:25:25
now before we had the self multi-headed
1:25:25
now before we had the self multi-headed self attention that did the
1:25:26
self attention that did the
1:25:26
self attention that did the communication but we went way too fast
1:25:28
communication but we went way too fast
1:25:28
communication but we went way too fast to calculate the logits so the tokens
1:25:31
to calculate the logits so the tokens
1:25:31
to calculate the logits so the tokens looked at each other but didn't really
1:25:32
looked at each other but didn't really
1:25:32
looked at each other but didn't really have a lot of time to think on what they
1:25:35
have a lot of time to think on what they
1:25:35
have a lot of time to think on what they found from the other tokens and so what
1:25:38
found from the other tokens and so what
1:25:38
found from the other tokens and so what I've implemented here is a little feet
1:25:40
I've implemented here is a little feet
1:25:40
I've implemented here is a little feet forward single layer and this little
1:25:42
forward single layer and this little
1:25:42
forward single layer and this little layer is just a linear followed by a Rel
1:25:45
layer is just a linear followed by a Rel
1:25:45
layer is just a linear followed by a Rel nonlinearity and that's that's it so
1:25:48
nonlinearity and that's that's it so
1:25:48
nonlinearity and that's that's it so it's just a little layer and then I call
1:25:50
it's just a little layer and then I call
1:25:50
it's just a little layer and then I call it feed
1:25:52
it feed
1:25:52
it feed forward um and embed
1:25:54
forward um and embed
1:25:54
forward um and embed and then this feed forward is just
1:25:56
and then this feed forward is just
1:25:56
and then this feed forward is just called sequentially right after the self
1:25:57
called sequentially right after the self
1:25:58
called sequentially right after the self attention so we self attend then we feed
1:26:00
attention so we self attend then we feed
1:26:01
attention so we self attend then we feed forward and you'll notice that the feet
1:26:02
forward and you'll notice that the feet
1:26:02
forward and you'll notice that the feet forward here when it's applying linear
1:26:04
forward here when it's applying linear
1:26:04
forward here when it's applying linear this is on a per token level all the
1:26:06
this is on a per token level all the
1:26:06
this is on a per token level all the tokens do this independently so the self
1:26:09
tokens do this independently so the self
1:26:09
tokens do this independently so the self attention is the communication and then
1:26:11
attention is the communication and then
1:26:11
attention is the communication and then once they've gathered all the data now
1:26:13
once they've gathered all the data now
1:26:13
once they've gathered all the data now they need to think on that data
1:26:15
they need to think on that data
1:26:15
they need to think on that data individually and so that's what feed
1:26:16
individually and so that's what feed
1:26:16
individually and so that's what feed forward is doing and that's why I've
1:26:18
forward is doing and that's why I've
1:26:18
forward is doing and that's why I've added it here now when I train this the
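As a rough illustration of the single-layer feed-forward described here (a linear layer followed by a ReLU, applied to every token independently), here is a small PyTorch sketch. The class name `FeedForward` and the `net` attribute follow the spoken description but are assumptions about the exact code; the tensor sizes are chosen only for the demo.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """A linear layer followed by a ReLU, applied position-wise."""

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
        )

    def forward(self, x):
        # nn.Linear acts on the last dimension only, so every (batch, time)
        # position is transformed with the same weights, independently.
        return self.net(x)

torch.manual_seed(0)
ffwd = FeedForward(32)
x = torch.randn(2, 8, 32)        # (B, T, C)
out = ffwd(x)                    # (B, T, C)

# Feeding a single time step on its own gives the same values for that step,
# which is what "per-token" computation means here.
single = ffwd(x[:, 3:4, :])
print(torch.allclose(out[:, 3:4, :], single, atol=1e-5))
```

No information moves between positions in this layer; only the attention layer before it mixes tokens.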
1:26:21
added it here now when I train this the
1:26:21
added it here now when I train this the validation LW actually continues to go
1:26:23
validation LW actually continues to go
1:26:23
validation LW actually continues to go down now to 2. 24 which is down from
1:26:26
down now to 2. 24 which is down from
1:26:26
down now to 2. 24 which is down from 2.28 uh the output still look kind of
1:26:28
2.28 uh the output still look kind of
1:26:28
2.28 uh the output still look kind of terrible but at least we've improved the
1:26:31
terrible but at least we've improved the
1:26:31
terrible but at least we've improved the situation and so as a preview we're
1:26:34
situation and so as a preview we're
1:26:34
situation and so as a preview we're going to now start to intersperse the
1:26:37
going to now start to intersperse the
1:26:37
going to now start to intersperse the communication with the computation and
1:26:39
communication with the computation and
1:26:39
communication with the computation and that's also what the Transformer does
1:26:42
that's also what the Transformer does
1:26:42
that's also what the Transformer does when it has blocks that communicate and
1:26:44
when it has blocks that communicate and
1:26:44
when it has blocks that communicate and then compute and it groups them and
1:26:46
then compute and it groups them and
1:26:46
then compute and it groups them and replicates them okay so let me show you
1:26:49
replicates them okay so let me show you
1:26:49
replicates them okay so let me show you what we'd like to do we'd like to do
1:26:51
what we'd like to do we'd like to do
1:26:51
what we'd like to do we'd like to do something like this we have a block and
1:26:53
something like this we have a block and
1:26:53
something like this we have a block and this block is is basically this part
1:26:55
this block is is basically this part
1:26:55
this block is is basically this part here except for the cross
1:26:57
here except for the cross
1:26:57
here except for the cross attention now the block basically
1:26:59
attention now the block basically
1:26:59
attention now the block basically intersperses communication and then
1:27:01
intersperses communication and then
1:27:01
intersperses communication and then computation the computation the
1:27:03
computation the computation the
1:27:03
computation the computation the communication is done using multi-headed
1:27:05
communication is done using multi-headed
1:27:05
communication is done using multi-headed selfelf attention and then the
1:27:07
selfelf attention and then the
1:27:07
selfelf attention and then the computation is done using a feed forward
1:27:08
computation is done using a feed forward
1:27:08
computation is done using a feed forward Network on all the tokens
1:27:11
Network on all the tokens
1:27:11
Network on all the tokens independently now what I've added here
1:27:14
independently now what I've added here
1:27:14
independently now what I've added here also is you'll
1:27:16
also is you'll
1:27:16
also is you'll notice this takes the number of
1:27:18
notice this takes the number of
1:27:18
notice this takes the number of embeddings in the embedding Dimension
1:27:19
embeddings in the embedding Dimension
1:27:19
embeddings in the embedding Dimension and number of heads that we would like
1:27:21
and number of heads that we would like
1:27:21
and number of heads that we would like which is kind of like group size in
1:27:22
which is kind of like group size in
1:27:22
which is kind of like group size in group convolution and and I'm saying
1:27:24
group convolution and and I'm saying
1:27:24
group convolution and and I'm saying that number of heads we'd like is four
1:27:26
that number of heads we'd like is four
1:27:26
that number of heads we'd like is four and so because this is 32 we calculate
1:27:29
and so because this is 32 we calculate
1:27:29
and so because this is 32 we calculate that because this is 32 the number of
1:27:31
that because this is 32 the number of
1:27:31
that because this is 32 the number of heads should be four um the head size
1:27:34
heads should be four um the head size
1:27:34
heads should be four um the head size should be eight so that everything sort
1:27:36
should be eight so that everything sort
1:27:36
should be eight so that everything sort of works out Channel wise um so this is
1:27:39
of works out Channel wise um so this is
1:27:39
of works out Channel wise um so this is how the Transformer structures uh sort
1:27:41
how the Transformer structures uh sort
1:27:41
how the Transformer structures uh sort of the uh the sizes typically so the
1:27:44
of the uh the sizes typically so the
1:27:44
of the uh the sizes typically so the head size will become eight and then
1:27:45
head size will become eight and then
1:27:45
head size will become eight and then this is how we want to intersperse them
1:27:47
this is how we want to intersperse them
1:27:47
this is how we want to intersperse them and then here I'm trying to create
1:27:49
and then here I'm trying to create
1:27:49
and then here I'm trying to create blocks which is just a sequential
1:27:51
blocks which is just a sequential
1:27:51
blocks which is just a sequential application of block block block so that
1:27:53
application of block block block so that
1:27:53
application of block block block so that we're interspersing communication feed
1:27:55
we're interspersing communication feed
1:27:55
we're interspersing communication feed forward many many times and then finally
1:27:57
forward many many times and then finally
1:27:57
forward many many times and then finally we decode now I actually tried to run
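The block structure described above could be sketched roughly as follows, assuming PyTorch. This is the version without residual connections, matching this point in the lecture. The constructor arguments, the inlined list of heads, and `block_size = 8` are assumptions made to keep the example self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal self-attention."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        T = x.shape[1]
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        return F.softmax(wei, dim=-1) @ v

class Block(nn.Module):
    """Communication (multi-head self-attention) followed by computation (MLP)."""

    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0, "heads must evenly divide the embedding width"
        head_size = n_embd // n_head          # 32 // 4 = 8 in the lecture's setting
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size) for _ in range(n_head)]
        )
        self.ffwd = nn.Sequential(nn.Linear(n_embd, n_embd), nn.ReLU())

    def forward(self, x):
        x = torch.cat([h(x) for h in self.heads], dim=-1)  # communicate
        x = self.ffwd(x)                                     # compute, per token
        return x

n_embd, n_head, block_size = 32, 4, 8
# Block, Block, Block applied one after another.
blocks = nn.Sequential(*[Block(n_embd, n_head, block_size) for _ in range(3)])

torch.manual_seed(0)
x = torch.randn(2, block_size, n_embd)
out = blocks(x)
print(out.shape)  # torch.Size([2, 8, 32])
```

Because each block maps (B, T, n_embd) back to (B, T, n_embd), any number of blocks can be chained before the final decoding layer.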
1:28:01
we decode now I actually tried to run
1:28:01
we decode now I actually tried to run this and the problem is this doesn't
1:28:02
this and the problem is this doesn't
1:28:02
this and the problem is this doesn't actually give a very good uh answer and
1:28:05
actually give a very good uh answer and
1:28:05
actually give a very good uh answer and very good result and the reason for that
1:28:07
very good result and the reason for that
1:28:07
very good result and the reason for that is we're start starting to actually get
1:28:09
is we're start starting to actually get
1:28:09
is we're start starting to actually get like a pretty deep neural net and deep
1:28:11
like a pretty deep neural net and deep
1:28:11
like a pretty deep neural net and deep neural Nets uh suffer from optimization
1:28:13
neural Nets uh suffer from optimization
1:28:13
neural Nets uh suffer from optimization issues and I think that's what we're
1:28:14
issues and I think that's what we're
1:28:14
issues and I think that's what we're kind of like slightly starting to run
1:28:16
kind of like slightly starting to run
1:28:16
kind of like slightly starting to run into so we need one more idea that we
1:28:18
into so we need one more idea that we
1:28:18
into so we need one more idea that we can borrow from the um Transformer paper
1:28:21
can borrow from the um Transformer paper
1:28:21
can borrow from the um Transformer paper to resolve those difficulties now there
1:28:23
to resolve those difficulties now there
1:28:23
to resolve those difficulties now there are two optimizations that dramatically
1:28:25
are two optimizations that dramatically
1:28:25
are two optimizations that dramatically help with the depth of these networks
1:28:27
help with the depth of these networks
1:28:27
help with the depth of these networks and make sure that the networks remain
1:28:29
and make sure that the networks remain
1:28:29
and make sure that the networks remain optimizable let's talk about the first
1:28:31
optimizable let's talk about the first
1:28:31
optimizable let's talk about the first one the first one in this diagram is you
1:28:33
one the first one in this diagram is you
1:28:33
one the first one in this diagram is you see this Arrow here and then this arrow
1:28:36
see this Arrow here and then this arrow
1:28:36
see this Arrow here and then this arrow and this Arrow those are skip
1:28:38
and this Arrow those are skip
1:28:38
and this Arrow those are skip connections or sometimes called residual
1:28:40
connections or sometimes called residual
1:28:40
connections or sometimes called residual connections they come from this paper uh
1:28:43
connections they come from this paper uh
1:28:43
connections they come from this paper uh the presidual learning for image
1:28:44
the presidual learning for image
1:28:44
the presidual learning for image recognition from about
1:28:46
recognition from about
1:28:46
recognition from about 2015 uh that introduced the concept now
1:28:50
2015 uh that introduced the concept now
1:28:51
2015 uh that introduced the concept now these are basically what it means is you
1:28:53
these are basically what it means is you
1:28:53
these are basically what it means is you transform data but then you have a skip
1:28:55
transform data but then you have a skip
1:28:55
transform data but then you have a skip connection with addition from the
1:28:57
connection with addition from the
1:28:57
connection with addition from the previous features now the way I like to
1:29:00
previous features now the way I like to
1:29:00
previous features now the way I like to visualize it uh that I prefer is the
1:29:03
visualize it uh that I prefer is the
1:29:03
visualize it uh that I prefer is the following here the computation happens
1:29:05
following here the computation happens
1:29:05
following here the computation happens from the top to bottom and basically you
1:29:08
from the top to bottom and basically you
1:29:08
from the top to bottom and basically you have this uh residual pathway and you
1:29:11
have this uh residual pathway and you
1:29:11
have this uh residual pathway and you are free to Fork off from the residual
1:29:13
are free to Fork off from the residual
1:29:13
are free to Fork off from the residual pathway perform some computation and
1:29:14
pathway perform some computation and
1:29:15
pathway perform some computation and then project back to the residual
1:29:16
then project back to the residual
1:29:16
then project back to the residual pathway via addition and so you go from
1:29:19
pathway via addition and so you go from
1:29:19
pathway via addition and so you go from the the uh inputs to the targets only
1:29:22
the the uh inputs to the targets only
1:29:22
the the uh inputs to the targets only via plus and plus plus and the reason
1:29:25
via plus and plus plus and the reason
1:29:25
via plus and plus plus and the reason this is useful is because during back
1:29:27
this is useful is because during back
1:29:27
this is useful is because during back propagation remember from our microG
1:29:29
propagation remember from our microG
1:29:29
propagation remember from our microG grad video earlier addition distributes
1:29:32
grad video earlier addition distributes
1:29:32
grad video earlier addition distributes gradients equally to both of its
1:29:34
gradients equally to both of its
1:29:34
gradients equally to both of its branches that that fed as the input and
1:29:37
branches that that fed as the input and
1:29:37
branches that that fed as the input and so the supervision or the gradients from
1:29:40
so the supervision or the gradients from
1:29:40
so the supervision or the gradients from the loss basically hop through every
1:29:43
the loss basically hop through every
1:29:43
the loss basically hop through every addition node all the way to the input
1:29:46
addition node all the way to the input
1:29:46
addition node all the way to the input and then also Fork off into the residual
1:29:50
and then also Fork off into the residual
1:29:50
and then also Fork off into the residual blocks but basically you have this
1:29:52
blocks but basically you have this
1:29:52
blocks but basically you have this gradient Super Highway that goes
1:29:53
gradient Super Highway that goes
1:29:53
gradient Super Highway that goes directly from the supervision all the
1:29:55
directly from the supervision all the
1:29:55
directly from the supervision all the way to the input unimpeded and then
1:29:58
way to the input unimpeded and then
1:29:58
way to the input unimpeded and then these viral blocks are usually
1:29:59
these viral blocks are usually
1:29:59
these viral blocks are usually initialized in the beginning so they
1:30:01
initialized in the beginning so they
1:30:01
initialized in the beginning so they contribute very very little if anything
1:30:03
contribute very very little if anything
1:30:03
contribute very very little if anything to the residual pathway they they are
1:30:05
to the residual pathway they they are
1:30:05
to the residual pathway they they are initialized that way so in the beginning
1:30:07
initialized that way so in the beginning
1:30:07
initialized that way so in the beginning they are sort of almost kind of like not
1:30:09
they are sort of almost kind of like not
1:30:09
they are sort of almost kind of like not there but then during the optimization
1:30:11
there but then during the optimization
1:30:11
there but then during the optimization they come online over time and they uh
1:30:14
they come online over time and they uh
1:30:14
they come online over time and they uh start to contribute but at least at the
1:30:17
start to contribute but at least at the
1:30:17
start to contribute but at least at the initialization you can go from directly
1:30:18
initialization you can go from directly
1:30:19
initialization you can go from directly supervision to the input gradient is
1:30:21
supervision to the input gradient is
1:30:21
supervision to the input gradient is unimpeded and just flows and then the
1:30:23
unimpeded and just flows and then the
1:30:23
unimpeded and just flows and then the blocks over time
1:30:24
blocks over time
1:30:24
blocks over time kick in and so that dramatically helps
1:30:27
kick in and so that dramatically helps
1:30:27
kick in and so that dramatically helps with the optimization so let's implement
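To make the "gradient superhighway" picture a bit more concrete, here is a small PyTorch experiment. It is a toy setup invented for this note, not something from the lecture: 20 tiny tanh layers with deliberately small weights, run once as a plain stack and once with an additive skip connection around each layer, comparing how much gradient reaches the input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, depth = 16, 20   # toy sizes, chosen only for illustration

def make_layers():
    layers = []
    for _ in range(depth):
        lin = nn.Linear(dim, dim)
        # Small weights: each branch contributes very little at initialization.
        nn.init.normal_(lin.weight, std=0.01)
        nn.init.zeros_(lin.bias)
        layers.append(lin)
    return layers

def input_grad_norm(layers, residual):
    """Norm of d(sum of outputs)/d(input) for a plain or residual stack."""
    x = torch.randn(1, dim, requires_grad=True)
    h = x
    for lin in layers:
        branch = torch.tanh(lin(h))
        # With a skip connection the addition passes the gradient straight
        # through to h, in addition to the path through the branch.
        h = h + branch if residual else branch
    h.sum().backward()
    return x.grad.norm().item()

layers = make_layers()
plain = input_grad_norm(layers, residual=False)
skip = input_grad_norm(layers, residual=True)
print(f"plain stack: {plain:.3e}   residual stack: {skip:.3e}")
```

In the plain stack the gradient is multiplied by a small Jacobian at every layer and all but vanishes by the time it reaches the input; with the skip connections each layer's Jacobian is close to the identity, so the gradient arrives at roughly its original scale.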
1:30:29
with the optimization so let's implement
1:30:29
with the optimization so let's implement this so coming back to our block here
1:30:31
this so coming back to our block here
1:30:31
this so coming back to our block here basically what we want to do is we want
1:30:33
basically what we want to do is we want
1:30:33
basically what we want to do is we want to do xal
1:30:35
to do xal
1:30:35
to do xal X+ self attention and xal X+ self. feed
1:30:39
X+ self attention and xal X+ self. feed
1:30:39
X+ self attention and xal X+ self. feed forward so this is X and then we Fork
1:30:43
forward so this is X and then we Fork
1:30:43
forward so this is X and then we Fork off and do some communication and come
1:30:45
off and do some communication and come
1:30:45
off and do some communication and come back and we Fork off and we do some
1:30:46
back and we Fork off and we do some
1:30:46
back and we Fork off and we do some computation and come back so those are
1:30:49
computation and come back so those are
1:30:49
computation and come back so those are residual connections and then swinging
1:30:51
residual connections and then swinging
1:30:51
residual connections and then swinging back up here we also have to introd use
1:30:54
back up here we also have to introd use
1:30:54
back up here we also have to introd use this projection so nn.
1:30:57
this projection so nn.
1:30:57
this projection so nn. linear and uh this is going to be
1:31:00
linear and uh this is going to be
1:31:00
linear and uh this is going to be from after we concatenate this this is
1:31:03
from after we concatenate this this is
1:31:03
from after we concatenate this this is the prze and embed so this is the output
1:31:05
the prze and embed so this is the output
1:31:05
the prze and embed so this is the output of the self tension itself but then we
1:31:08
of the self tension itself but then we
1:31:08
of the self tension itself but then we actually want the uh to apply the
1:31:11
actually want the uh to apply the
1:31:11
actually want the uh to apply the projection and that's the
1:31:13
projection and that's the
1:31:13
projection and that's the result so the projection is just a
1:31:15
result so the projection is just a
1:31:15
result so the projection is just a linear transformation of the outcome of
1:31:16
linear transformation of the outcome of
1:31:16
linear transformation of the outcome of this
1:31:17
this
1:31:17
this layer so that's the projection back into
1:31:20
layer so that's the projection back into
1:31:20
layer so that's the projection back into the virual pathway and then here in a
1:31:22
the virual pathway and then here in a
1:31:22
the virual pathway and then here in a feet forward it's going to be the same
1:31:23
feet forward it's going to be the same
1:31:23
feet forward it's going to be the same same thing I could have a a self doot
1:31:26
same thing I could have a a self doot
1:31:26
same thing I could have a a self doot projection here as well but let me just
1:31:28
projection here as well but let me just
1:31:28
projection here as well but let me just simplify it and let me uh couple it
1:31:32
simplify it and let me uh couple it
1:31:32
simplify it and let me uh couple it inside the same sequential container and
1:31:34
inside the same sequential container and
1:31:34
inside the same sequential container and so this is the projection layer going
1:31:36
so this is the projection layer going
1:31:36
so this is the projection layer going back into the residual
1:31:38
back into the residual
1:31:38
back into the residual pathway and
1:31:40
pathway and
1:31:40
pathway and so that's uh well that's it so now we
1:31:42
so that's uh well that's it so now we
1:31:43
so that's uh well that's it so now we can train this so I implemented one more
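Putting the residual connections and the two projections together might look roughly like this in PyTorch. The attribute names `sa`, `ffwd`, `proj`, and `net`, the constructor arguments, and `block_size = 8` are assumptions for this sketch; the feed-forward here keeps the inner width at n_embd, as at this point in the lecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal self-attention."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        T = x.shape[1]
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        return F.softmax(wei, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size) for _ in range(n_head)]
        )
        # Linear projection of the concatenated heads back into the residual pathway.
        self.proj = nn.Linear(n_head * head_size, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.proj(out)

class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
            nn.Linear(n_embd, n_embd),   # projection layer, kept in the same container
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.sa = MultiHeadAttention(n_embd, n_head, block_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = x + self.sa(x)     # fork off, communicate, add back
        x = x + self.ffwd(x)   # fork off, compute, add back
        return x

torch.manual_seed(0)
block = Block(n_embd=32, n_head=4, block_size=8)
x = torch.randn(2, 8, 32)
out = block(x)
print(out.shape)  # torch.Size([2, 8, 32])
```

One way to read the design: if both projection layers output zero, the block reduces to the identity function, which is the "almost not there at initialization" behavior described earlier.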
1:31:44
can train this so I implemented one more
1:31:44
can train this so I implemented one more small change when you look into the
1:31:47
small change when you look into the
1:31:47
small change when you look into the paper again you see that the
1:31:49
paper again you see that the
1:31:49
paper again you see that the dimensionality of input and output is
1:31:51
dimensionality of input and output is
1:31:51
dimensionality of input and output is 512 for them and they're saying that the
1:31:53
512 for them and they're saying that the
1:31:53
512 for them and they're saying that the inner layer here in the feet forward has
1:31:55
inner layer here in the feet forward has
1:31:55
inner layer here in the feet forward has dimensionality of 248 so there's a
1:31:57
dimensionality of 248 so there's a
1:31:57
dimensionality of 248 so there's a multiplier of four and so the inner
1:32:00
multiplier of four and so the inner
1:32:00
multiplier of four and so the inner layer of the feet forward Network should
1:32:02
layer of the feet forward Network should
1:32:02
layer of the feet forward Network should be multiplied by four in terms of
1:32:04
be multiplied by four in terms of
1:32:04
be multiplied by four in terms of Channel sizes so I came here and I
1:32:05
Channel sizes so I came here and I
1:32:06
Channel sizes so I came here and I multiplied four times embed here for the
1:32:08
multiplied four times embed here for the
1:32:08
multiplied four times embed here for the feed forward and then from four times
1:32:10
feed forward and then from four times
1:32:10
feed forward and then from four times nmed coming back down to nmed when we go
1:32:13
nmed coming back down to nmed when we go
1:32:13
nmed coming back down to nmed when we go back to the pro uh to the projection so
1:32:15
back to the pro uh to the projection so
1:32:15
back to the pro uh to the projection so adding a bit of computation here and
1:32:17
adding a bit of computation here and
1:32:17
adding a bit of computation here and growing that layer that is in the
1:32:19
growing that layer that is in the
1:32:19
growing that layer that is in the residual block on the side of the
1:32:21
residual block on the side of the
1:32:21
residual block on the side of the residual
1:32:22
residual
1:32:22
residual pathway and then I train this and we
1:32:24
pathway and then I train this and we
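A minimal sketch of the feed-forward change just described, assuming PyTorch. The class name `FeedForward`, the ReLU nonlinearity, and the toy shapes are illustrative assumptions; only the 4x inner width and the projection back down to n_embd come from the description above.

```python
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    """Per-token MLP: widen to 4 * n_embd, apply a nonlinearity, project back."""

    def __init__(self, n_embd: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # inner layer is 4x wider (512 -> 2048 in the paper)
            nn.ReLU(),                      # assumed nonlinearity
            nn.Linear(4 * n_embd, n_embd),  # projection back down to n_embd
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


n_embd = 32
ffwd = FeedForward(n_embd)
x = torch.randn(4, 8, n_embd)  # (batch, time, channels), toy sizes
out = x + ffwd(x)              # the MLP sits on the side of the residual pathway
n_params = sum(p.numel() for p in ffwd.parameters())
print(out.shape, n_params)
```

With n_embd = 32 the two linear layers hold 32*128 + 128 + 128*32 + 32 = 8352 parameters, which is where the extra computation comes from.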
1:32:24
And then I train this, and we actually get down all the way to 2.08 validation loss. We also see that the network is starting to get big enough that our train loss is getting ahead of validation loss, so we're starting to see a little bit of overfitting. Our generations here are still not amazing, but at least you can see things like "is here this now grief syn", so this starts to almost look like English. So yeah, we're starting to really get there.
1:32:52
Okay, and the second innovation that is very helpful for optimizing very deep neural networks is right here. We have this addition now, that's the residual part, but this norm is referring to something called layer norm. Layer norm is implemented in PyTorch, and it comes from a paper that came out a while back, here. Layer norm is very, very similar to batch norm. Remember back to our makemore series, part three: we implemented batch normalization, and batch normalization basically just made sure that, across the batch dimension, any individual neuron had a unit Gaussian distribution, so it was zero mean and unit standard deviation output.
1:33:35
So what I did here is I'm copy-pasting the BatchNorm1d that we developed in our makemore series. See here, we can initialize this module, for example, and we can have a batch of 32 100-dimensional vectors feeding through the batch norm layer. What this does is it guarantees that when we look at just the zeroth column, it's zero mean, one standard deviation. So it's normalizing every single column of this input. Now, the rows are not going to be normalized by default, because we're just normalizing columns.
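The column-wise behaviour described above can be checked with a few tensor operations. This is only a sketch of the training-time normalization step, assuming PyTorch; it is not the BatchNorm1d class from the makemore series, and it leaves out gamma, beta, and the running buffers.

```python
import torch

torch.manual_seed(0)
x = torch.randn(32, 100)  # batch of 32 vectors, 100 features each

eps = 1e-5
# Batch-norm style statistics: reduce over dim 0 (the batch),
# giving one mean and one variance per column (feature).
mean = x.mean(0, keepdim=True)  # shape (1, 100)
var = x.var(0, keepdim=True)    # shape (1, 100)
out = (x - mean) / torch.sqrt(var + eps)

col_mean, col_std = out[:, 0].mean().item(), out[:, 0].std().item()
row_mean, row_std = out[0, :].mean().item(), out[0, :].std().item()
print(f"column 0: mean={col_mean:.4f} std={col_std:.4f}")  # normalized
print(f"row 0:    mean={row_mean:.4f} std={row_std:.4f}")  # generally not normalized
```

Column 0 comes out with mean close to 0 and standard deviation close to 1, while row 0 has whatever statistics it happens to have.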
1:34:10
So let's now implement layer norm. It's very complicated. Look: we come here, we change this from zero to one, so we don't normalize the columns, we normalize the rows, and now we've implemented layer norm.
1:34:25
So now the columns are not going to be normalized, but the rows are going to be normalized: for every individual example, its 100-dimensional vector is normalized in this way. And because our computation now does not span across examples, we can delete all of this buffers stuff, because we can always apply this operation and don't need to maintain any running buffers. So we don't need the buffers, and there's no distinction between training and test time. We do keep gamma and beta. We don't need the momentum, we don't care if it's training or not, and this is now a layer norm. It normalizes the rows instead of the columns, and this here is identical to basically this here.
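A sketch of what the class looks like after those deletions, assuming PyTorch. The name `LayerNorm1d` and the attribute names are illustrative. One labeled assumption: the variance here is the biased estimate (`unbiased=False`), chosen so the result can be compared against `torch.nn.LayerNorm`; a version that keeps the default `x.var(...)` would differ very slightly.

```python
import torch


class LayerNorm1d:
    """Normalizes each row (example) over its features; no running buffers."""

    def __init__(self, dim, eps=1e-5):
        self.eps = eps
        self.gamma = torch.ones(dim)  # trainable scale, kept from batch norm
        self.beta = torch.zeros(dim)  # trainable shift, kept from batch norm

    def __call__(self, x):
        # The only real change from batch norm: reduce over dim 1, not dim 0.
        xmean = x.mean(1, keepdim=True)
        xvar = x.var(1, keepdim=True, unbiased=False)  # assumption: biased variance
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
        self.out = self.gamma * xhat + self.beta
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]


torch.manual_seed(0)
x = torch.randn(32, 100)
out = LayerNorm1d(100)(x)
reference = torch.nn.LayerNorm(100)(x).detach()
print(out[0].mean().item(), out[0].std(unbiased=False).item())
print(torch.allclose(out, reference, atol=1e-5))
```

There is no training flag and no momentum because the statistics never span more than one example, so the same computation applies at training and test time.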
1:35:19
So let's now implement layer norm in our Transformer. Before I incorporate the layer norm, I just wanted to note that, as I said, very few details about the Transformer have changed in the last five years, but this is actually something that slightly departs from the original paper. You see that the add and norm is applied after the transformation, but now it is a bit more common to apply the layer norm before the transformation, so there's a reshuffling of the layer norms. This is called the pre-norm formulation, and that's the one that we're going to implement as well. So, a slight deviation from the original paper.
1:35:53
Basically, we need two layer norms. Layer norm one is nn.LayerNorm, and we tell it what the embedding dimension is, and then we need the second layer norm. Then here, the layer norms are applied immediately on x: the first layer norm is applied on x, and the second layer norm is applied on x, before it goes into self-attention and feed-forward.
1:36:18
The size of the layer norm here is n_embd, so 32. So when the layer norm is normalizing our features, the mean and the variance are taken over 32 numbers, and the batch and the time both act as batch dimensions. So this is kind of like a per-token transformation that just normalizes the features and makes them unit Gaussian at initialization. But of course, because these layer norms have these trainable gamma and beta parameters inside them, the layer norm will eventually create outputs that might not be unit Gaussian, and the optimization will determine that. So for now, this is incorporating the layer norms, and let's train them.
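The pre-norm arrangement described above can be sketched as follows, assuming PyTorch. The attribute names (`ln1`, `ln2`, `sa`, `ffwd`), the ReLU, and the compact `CausalSelfAttention` class are illustrative stand-ins rather than the exact modules built earlier in the lecture; the point of the sketch is only where the two layer norms sit relative to the residual additions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Compact multi-head causal self-attention (hypothetical stand-in)."""

    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)
        self.proj = nn.Linear(n_embd, n_embd)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        hs = C // self.n_head
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, C) -> (B, n_head, T, head_size)
        q, k, v = (t.view(B, T, self.n_head, hs).transpose(1, 2) for t in (q, k, v))
        wei = (q @ k.transpose(-2, -1)) * hs ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        out = (wei @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)


class Block(nn.Module):
    """Pre-norm Transformer block: layer norm is applied before each sub-layer."""

    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.sa = CausalSelfAttention(n_embd, n_head, block_size)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(), nn.Linear(4 * n_embd, n_embd)
        )
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # norm, then attention, then add
        x = x + self.ffwd(self.ln2(x))  # norm, then feed-forward, then add
        return x


torch.manual_seed(0)
block = Block(n_embd=32, n_head=4, block_size=8)
x = torch.randn(4, 8, 32)  # (batch, time, channels)
y = block(x)
h = block.ln1(x)           # per-token normalization over the 32 features
print(y.shape, h.mean(-1).abs().max().item())
```

In the original post-norm layout the norm would instead wrap the sum, roughly `x = ln(x + sublayer(x))`. In the pre-norm layout the residual pathway itself is never normalized.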
1:37:09
Okay, so I let it run, and we see that we get down to 2.06, which is better than the previous 2.08. So that's a slight improvement from adding the layer norms, and I'd expect that they would help even more if we had a bigger and deeper network. One more thing I forgot to add is that there should typically be a layer norm here also, at the end of the Transformer and right before the final linear layer that decodes into vocabulary, so I added that as well.
1:37:35
So at this stage we actually have a pretty complete Transformer according to the original paper, and it's a decoder-only Transformer. I'll talk about that in a second, but at this stage the major pieces are in place, so we can try to scale this up and see how well we can push this number.
1:37:50
Now, in order to scale up the model, I had to perform some cosmetic changes here to make it nicer. I introduced this variable called n_layer, which just specifies how many layers of the blocks we're going to have. I created a bunch of blocks, and we have a new variable, the number of heads, as well. I pulled out the layer norm here, and so this is identical.
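A rough sketch of that refactor, assuming PyTorch. To keep it short, `PlaceholderBlock` is a hypothetical stand-in that omits attention entirely, and `TinyStack`, `tok_emb`, `ln_f`, `lm_head`, and the vocabulary size of 65 are illustrative assumptions. What it shows is n_layer blocks built in a loop, followed by the final layer norm and the linear layer that decodes into vocabulary.

```python
import torch
import torch.nn as nn


class PlaceholderBlock(nn.Module):
    """Stand-in for the real block (attention omitted to keep the sketch short)."""

    def __init__(self, n_embd, n_head):
        super().__init__()
        self.n_head = n_head  # unused here; the real block hands it to attention
        self.ln = nn.LayerNorm(n_embd)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):
        return x + self.ffwd(self.ln(x))


class TinyStack(nn.Module):
    def __init__(self, vocab_size, n_embd, n_head, n_layer):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        # n_layer controls how many blocks are stacked.
        self.blocks = nn.Sequential(
            *[PlaceholderBlock(n_embd, n_head) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(n_embd)              # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)  # decodes into vocabulary

    def forward(self, idx):
        x = self.tok_emb(idx)  # (B, T, n_embd); position embeddings omitted
        x = self.blocks(x)
        x = self.ln_f(x)
        return self.lm_head(x)  # (B, T, vocab_size)


torch.manual_seed(0)
model = TinyStack(vocab_size=65, n_embd=32, n_head=4, n_layer=3)
idx = torch.randint(0, 65, (2, 8))
logits = model(idx)
print(logits.shape, len(model.blocks))
```

Making depth and head count plain variables means scaling the model up later is a matter of changing numbers at the top of the file.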
1:38:10
Now, one thing that I did briefly change is I added dropout. Dropout is something that you can add right before the connection back into the residual pathway. So we can drop out that as the last layer here, we can drop out here at the end of the multi-headed attention as well, and we can also drop out here when we calculate the affinities: after the softmax, we can drop out some of those, so we can randomly prevent some of the nodes from communicating.
1:38:43
Dropout comes from this paper from 2014 or so, and basically it takes your neural net and, randomly, every forward-backward pass, shuts off some subset of neurons. So it randomly drops them to zero and trains without them. What this does effectively is, because the mask of what's being dropped out is changed every single forward-backward pass, it ends up kind of training an ensemble of sub-networks. Then at test time everything is fully enabled, and kind of all of those sub-networks are merged into a single ensemble, if you want to think about it that way.
1:39:20
I would read the paper to get the full detail. For now we're just going to stay on the level of: this is a regularization technique, and I added it because I'm about to scale up the model quite a bit and I was concerned about overfitting.
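The train versus test behaviour described above can be seen directly with `torch.nn.Dropout`. This is a small demonstration on a vector of ones, assuming PyTorch; the tensor size and seed are arbitrary. One detail not mentioned above: PyTorch rescales the surviving activations by 1 / (1 - p) during training so that the expected value stays the same.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.2
drop = nn.Dropout(p)
x = torch.ones(10000)

drop.train()         # training mode: a fresh random mask on every call
y_train = drop(x)
kept_fraction = (y_train != 0).float().mean().item()
survivors = y_train[y_train != 0]

drop.eval()          # evaluation mode: everything is fully enabled
y_eval = drop(x)

print(f"kept about {kept_fraction:.2f} of the activations during training")
print(f"surviving values are scaled to {survivors[0].item():.2f}")
print(f"eval output equals input: {torch.equal(y_eval, x)}")
```

Calling `model.train()` or `model.eval()` on the whole model flips this mode for every dropout layer inside it, so it matters to switch to evaluation mode before measuring validation loss or generating text.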
1:39:34
So now when we scroll up to the top, we'll see that I changed a number of hyperparameters here about our neural net. I made the batch size much larger, so now it's 64. I changed the block size to be 256, so previously it was just eight, eight characters of context, and now it is 256 characters of context to predict the 257th. I brought down the learning rate a little bit, because the neural net is now much bigger. The embedding dimension is now 384 and there are six heads, so 384 divided by 6 means that every head is 64-dimensional, as is standard. Then there's going to be six layers of that, and the dropout will be at 0.2, so every forward-backward pass, 20% of all of these intermediate calculations are disabled and dropped to zero.
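The settings listed above, written out as plain Python so the arithmetic can be checked. The variable names are assumptions in the style of the lecture code, and the learning rate is left out because its new value is not stated here.

```python
# Hyperparameters as described above (variable names are illustrative).
batch_size = 64   # independent sequences processed in parallel
block_size = 256  # characters of context used to predict the next one
n_embd = 384      # embedding dimension
n_head = 6        # attention heads per block
n_layer = 6       # number of Transformer blocks
dropout = 0.2     # fraction of intermediate activations zeroed in training

# Each head works on an equal slice of the embedding.
assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
head_size = n_embd // n_head

# Characters seen per training step.
tokens_per_batch = batch_size * block_size

print(head_size)         # 64
print(tokens_per_batch)  # 16384
```

Compared with the earlier setting of eight characters of context, each training example now spans 32 times more text.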
1:40:24
And then I already trained this and I ran it, so, drum roll, how well does it perform? Let me just scroll up here. We get a validation loss of 1.48, which is actually quite a bit of an improvement on what we had before, which I think was 2.07. So it went from 2.07 all the way down to 1.48, just by scaling up this neural net with the code that we have.
1:40:45
And this of course ran for a lot longer: this maybe trained for, I want to say, about 15 minutes on my A100 GPU, so that's a pretty good GPU. If you don't have a GPU, you're not going to be able to reproduce this. I would not run this on a CPU or a MacBook or something like that; you'll have to bring down the number of layers and the embedding dimension and so on. But in about 15 minutes we can get this kind of a result. I'm printing some of the Shakespeare here, but what I did also is I printed 10,000 characters, so a lot more, and I wrote them to a file, and so here we see some of the outputs.
1:41:24
So it's a lot more recognizable as the input text file. The input text file, just for reference, looked like this: there's always someone speaking in this manner, and our predictions now take on that form, except of course they're nonsensical when you actually read them. So it is: "every crimp tap be a house", "Oh, those prepation we give heed", you know, "Oho sent me, you mighty lord".
1:41:59
Anyway, so you can read through this. It's nonsensical of course, but this is just a Transformer trained on a character level for 1 million characters that come from Shakespeare. So it sort of blabbers on in a Shakespeare-like manner, but it doesn't of course make sense at this scale. But I think it's still a pretty good demonstration of what's possible.
1:42:24
possible so now I think uh that kind of like concludes
1:42:26
I think uh that kind of like concludes
1:42:26
I think uh that kind of like concludes the programming section of this video we
1:42:28
the programming section of this video we
1:42:28
the programming section of this video we basically kind of uh did a pretty good
1:42:30
basically kind of uh did a pretty good
1:42:30
basically kind of uh did a pretty good job and um of implementing this
1:42:32
job and um of implementing this
1:42:32
job and um of implementing this Transformer uh but the picture doesn't
1:42:35
Transformer uh but the picture doesn't
1:42:35
Transformer uh but the picture doesn't exactly match up to what we've done so
1:42:37
exactly match up to what we've done so
1:42:37
exactly match up to what we've done so what's going on with all these digital
1:42:38
what's going on with all these digital
1:42:38
what's going on with all these digital Parts here so let me finish explaining
1:42:41
Parts here so let me finish explaining
1:42:41
Parts here so let me finish explaining this architecture and why it looks so
1:42:43
this architecture and why it looks so
1:42:43
this architecture and why it looks so funky basically what's happening here is
1:42:45
Basically, what we implemented here is a decoder-only Transformer. So there's no component here: this part is called the encoder, and there's no cross-attention block here. Our block only has self-attention and the feed-forward, so it is missing this third in-between piece here. This piece does cross-attention. So we don't have it, and we don't have the encoder; we just have the decoder. And the reason we have a decoder only is because we are just generating text, and it's unconditioned on anything. We're just blabbering on according to a given dataset.
1:43:19
What makes it a decoder is that we are using the triangular mask in our Transformer, so it has this autoregressive property where we can just go and sample from it. So the fact that it's using the triangular mask to mask out the attention makes it a decoder, and it can be used for language modeling.
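As a minimal sketch of the triangular mask just described: each position may attend only to itself and to earlier positions, which is what gives the decoder its autoregressive property. NumPy stands in for PyTorch here to keep the example dependency-light, and the function name and sizes are illustrative assumptions, not code from the video.

```python
import numpy as np

def causal_attention_weights(scores: np.ndarray) -> np.ndarray:
    """Apply a lower-triangular mask, then a row-wise softmax.

    scores: (T, T) raw query-key affinities for a single head.
    """
    T = scores.shape[0]
    tril = np.tril(np.ones((T, T), dtype=bool))   # True where key position <= query position
    masked = np.where(tril, scores, -np.inf)      # future positions are set to -inf
    masked = masked - masked.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(masked)                      # exp(-inf) == 0, so the future gets zero weight
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
w = causal_attention_weights(rng.normal(size=(4, 4)))
print(np.round(w, 2))  # upper triangle is all zeros, each row sums to 1
```

Because row t only mixes information from positions 0..t, the output at position t can be trained to predict token t+1 without ever seeing it.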
1:43:37
Now, the reason that the original paper had an encoder-decoder architecture is because it is a machine translation paper, so it is concerned with a different setting. In particular, it expects some tokens that encode, say, French, and then it is expected to decode the translation in English. So typically these here are special tokens. You are expected to read in this and condition on it, and then you start off the generation with a special token called start. This is a special new token that you introduce and always place at the beginning. And then the network is expected to output "neural networks are awesome" and then a special end token to finish the generation.
1:44:23
So this part here will be decoded exactly as we've done it; "neural networks are awesome" will be identical to what we did. But unlike what we did, they want to condition the generation on some additional information, and in that case this additional information is the French sentence that they should be translating.
1:44:42
So what they do now is they bring in the encoder. The encoder reads this part here, so we're only going to take the part in French, and we're going to create tokens from it exactly as we've seen in our video, and we're going to put a Transformer on it. But there's going to be no triangular mask, and so all the tokens are allowed to talk to each other as much as they want, and they're just encoding whatever the content of this French sentence is. Once they've encoded it, they basically come out at the top here.
1:45:14
And then what happens is that in our decoder, which does the language modeling, there's an additional connection here to the outputs of the encoder, and that is brought in through cross-attention. So the queries are still generated from x, but now the keys and the values are coming from the side: they are generated by the nodes that came out of the top of the encoder. And those keys and values feed in on the side into every single block of the decoder. That's why there's an additional cross-attention, and really what it's doing is conditioning the decoding not just on the past of this current decoding, but also on having seen the full, fully encoded French prompt, sort of. And so it's an encoder-decoder model, which is why we have those two Transformers, an additional block, and so on.
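The cross-attention described above can be sketched for a single head as follows. This is an illustration under stated assumptions rather than the paper's implementation: the names `cross_attention`, `enc_out`, `Wq`, `Wk`, `Wv` and all sizes are hypothetical, and NumPy stands in for PyTorch.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(x, enc_out, Wq, Wk, Wv):
    """x: (T_dec, C) decoder states; enc_out: (T_enc, C) encoder outputs."""
    q = x @ Wq          # queries are still produced from the decoder's x
    k = enc_out @ Wk    # keys come from the encoder output
    v = enc_out @ Wv    # values come from the encoder output
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (T_dec, T_enc)
    weights = softmax(scores)  # no triangular mask: every encoder token is visible
    return weights @ v         # (T_dec, head_size)

rng = np.random.default_rng(0)
C, head_size, T_dec, T_enc = 8, 4, 3, 5
Wq, Wk, Wv = (rng.normal(size=(C, head_size)) for _ in range(3))
out = cross_attention(rng.normal(size=(T_dec, C)),
                      rng.normal(size=(T_enc, C)), Wq, Wk, Wv)
print(out.shape)  # (3, 4)
```

The only change from self-attention is where the keys and values come from, and that the mask is dropped because the whole source sentence is already known.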
1:46:11
on so we did not do this because we have
1:46:12
on so we did not do this because we have no we have nothing to encode there's no
1:46:13
no we have nothing to encode there's no
1:46:13
no we have nothing to encode there's no conditioning we just have a text file
1:46:15
conditioning we just have a text file
1:46:15
conditioning we just have a text file and we just want to imitate it and
1:46:16
and we just want to imitate it and
1:46:16
and we just want to imitate it and that's why we are using a decoder only
1:46:18
that's why we are using a decoder only
1:46:19
that's why we are using a decoder only Transformer exactly as done in
1:46:21
Transformer exactly as done in
1:46:21
Transformer exactly as done in GPT okay okay so now I wanted to do a
1:46:24
Okay, so now I wanted to do a very brief walkthrough of nanoGPT, which you can find on my GitHub. nanoGPT is basically two files of interest: there's train.py and model.py. train.py is all the boilerplate code for training the network. It is basically all the stuff that we had here, the training loop; it's just a lot more complicated because we're saving and loading checkpoints and pretrained weights, and we are decaying the learning rate, compiling the model, and using distributed training across multiple nodes or GPUs. So train.py gets a little bit more hairy and complicated, there are more options, etc.
1:47:01
but the model.py should look very very
1:47:01
but the model.py should look very very um similar to what we've done here in
1:47:04
um similar to what we've done here in
1:47:04
um similar to what we've done here in fact the model is is almost identical so
1:47:08
fact the model is is almost identical so
1:47:08
fact the model is is almost identical so first here we have the causal self
1:47:09
first here we have the causal self
1:47:09
first here we have the causal self attention block and all of this should
1:47:11
attention block and all of this should
1:47:11
attention block and all of this should look very very recognizable to you we're
1:47:13
look very very recognizable to you we're
1:47:13
look very very recognizable to you we're producing queries Keys values we're
1:47:16
producing queries Keys values we're
1:47:16
producing queries Keys values we're doing Dot products we're masking
1:47:18
doing Dot products we're masking
1:47:18
doing Dot products we're masking applying soft Maxs optionally dropping
1:47:20
applying soft Maxs optionally dropping
1:47:20
applying soft Maxs optionally dropping out and here we are pulling the wi the
1:47:23
out and here we are pulling the wi the
1:47:23
out and here we are pulling the wi the values what is different here is that in
1:47:25
values what is different here is that in
1:47:25
values what is different here is that in our code I have separated out the
1:47:30
our code I have separated out the
1:47:30
our code I have separated out the multi-headed detention into just a
1:47:31
multi-headed detention into just a
1:47:31
multi-headed detention into just a single individual head and then here I
1:47:34
single individual head and then here I
1:47:34
single individual head and then here I have multiple heads and I explicitly
1:47:36
have multiple heads and I explicitly
1:47:36
have multiple heads and I explicitly concatenate them whereas here uh all of
1:47:39
concatenate them whereas here uh all of
1:47:39
concatenate them whereas here uh all of it is implemented in a batched manner
1:47:41
it is implemented in a batched manner
1:47:41
it is implemented in a batched manner inside a single causal self attention
1:47:43
inside a single causal self attention
1:47:43
inside a single causal self attention and so we don't just have a b and a T
1:47:45
and so we don't just have a b and a T
1:47:45
and so we don't just have a b and a T and A C Dimension we also end up with a
1:47:47
and A C Dimension we also end up with a
1:47:47
and A C Dimension we also end up with a fourth dimension which is the heads and
1:47:50
fourth dimension which is the heads and
1:47:50
fourth dimension which is the heads and so it just gets a lot more sort of hairy
1:47:52
so it just gets a lot more sort of hairy
1:47:52
so it just gets a lot more sort of hairy because we have four dimensional array
1:47:54
because we have four dimensional array
1:47:54
because we have four dimensional array um tensors now but it is um equivalent
1:47:57
um tensors now but it is um equivalent
1:47:57
um tensors now but it is um equivalent mathematically so the exact same thing
1:47:59
mathematically so the exact same thing
1:47:59
mathematically so the exact same thing is happening as what we have it's just
1:48:01
is happening as what we have it's just
1:48:01
is happening as what we have it's just it's a bit more efficient because all
1:48:02
it's a bit more efficient because all
1:48:02
it's a bit more efficient because all the heads are now treated as a batch
1:48:04
the heads are now treated as a batch
1:48:04
the heads are now treated as a batch Dimension as
1:48:05
Dimension as
1:48:05
Dimension as well then we have the multier perceptron
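To make the "equivalent mathematically" claim concrete, here is a small sketch that computes causal multi-head attention both ways, head by head with an explicit concatenation and in one batched call over a four-dimensional (B, n_head, T, head_size) array, and checks that the results agree. It is written in NumPy rather than PyTorch, omits dropout and the output projection, and every name and size in it is an illustrative assumption.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attend(q, k, v):
    """q, k, v: (..., T, hs). Any leading dimensions act as batch dimensions."""
    T, hs = q.shape[-2], q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(hs)
    mask = np.tril(np.ones((T, T), dtype=bool))
    return softmax(np.where(mask, scores, -np.inf)) @ v

rng = np.random.default_rng(0)
B, T, C, n_head = 2, 5, 12, 3
hs = C // n_head
x = rng.normal(size=(B, T, C))
Wq, Wk, Wv = (rng.normal(size=(C, C)) for _ in range(3))

# Version 1: one head at a time, then concatenate along the channel dimension.
heads = []
for h in range(n_head):
    cols = slice(h * hs, (h + 1) * hs)   # this head's slice of each projection
    heads.append(causal_attend(x @ Wq[:, cols], x @ Wk[:, cols], x @ Wv[:, cols]))
per_head = np.concatenate(heads, axis=-1)            # (B, T, C)

# Version 2: project once, then treat the heads as an extra batch dimension.
def split_heads(t):
    return t.reshape(B, T, n_head, hs).transpose(0, 2, 1, 3)  # (B, n_head, T, hs)

y = causal_attend(split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv))
batched = y.transpose(0, 2, 1, 3).reshape(B, T, C)    # back to (B, T, C)

print(np.allclose(per_head, batched))
```

The batched form does the same arithmetic with a few large matrix multiplies instead of many small ones, which is where the efficiency gain comes from.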
1:48:08
Then we have the multi-layer perceptron. It's using the GELU nonlinearity, which is defined here, instead of ReLU, and this is done just because OpenAI used it and I want to be able to load their checkpoints.
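For reference, a small sketch of GELU next to ReLU, using only the standard library. It shows the exact form, x times the standard normal CDF, alongside the widely used tanh approximation; whether the definition shown on screen is the exact or the approximate form is not stated in this passage, so treat the choice between them as an assumption.

```python
import math

def relu(x: float) -> float:
    return max(0.0, x)

def gelu_exact(x: float) -> float:
    """GELU(x) = x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  "
          f"gelu={gelu_exact(x):.4f}  gelu_tanh={gelu_tanh(x):.4f}")
```

Unlike ReLU, GELU is smooth and lets small negative inputs through with a small negative output, but the two behave almost identically for large positive or large negative inputs.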
1:48:18
The blocks of the Transformer are identical, the communicate and the compute phase as we saw, and then the GPT will be identical: we have the position encodings, token encodings, the blocks, the layer norm at the end, and the final linear layer, and this should all look very recognizable. There's a bit more here because I'm loading checkpoints and stuff like that, and I'm separating out the parameters into those that should be weight decayed and those that shouldn't. But the generate function should also be very, very similar. So a few details are different, but you should definitely be able to look at this file and understand a lot of the pieces now.
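The split into weight-decayed and non-decayed parameters mentioned above could look roughly like the sketch below. The rule used here, decay anything with two or more dimensions and leave one-dimensional tensors such as biases and layer norm gains alone, is one common rule of thumb and an assumption on my part; the passage does not say which rule model.py applies. The parameter names, shapes, and the 0.1 decay value are hypothetical.

```python
def split_for_weight_decay(param_shapes: dict[str, tuple[int, ...]]):
    """Return (decay, no_decay) lists of parameter names.

    Assumed rule: tensors with 2+ dimensions (matmul weights, embeddings)
    are decayed; 1-D tensors (biases, layer norm gains) are not.
    """
    decay = [name for name, shape in param_shapes.items() if len(shape) >= 2]
    no_decay = [name for name, shape in param_shapes.items() if len(shape) < 2]
    return decay, no_decay

# Hypothetical parameter names and shapes, for illustration only.
shapes = {
    "wte.weight": (65, 384),
    "attn.proj.weight": (384, 384),
    "attn.proj.bias": (384,),
    "ln_f.weight": (384,),
    "ln_f.bias": (384,),
}
decay, no_decay = split_for_weight_decay(shapes)
optim_groups = [
    {"params": decay, "weight_decay": 0.1},     # 0.1 is an arbitrary example value
    {"params": no_decay, "weight_decay": 0.0},
]
print(optim_groups)
```

With the actual tensors in place of the names, a list of groups in this shape is what PyTorch optimizers accept when different parameters need different settings.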
1:48:55
So let's now bring things back to ChatGPT. What would it look like if we wanted to train ChatGPT ourselves, and how does it relate to what we learned today? Well, to train ChatGPT there are roughly two stages: first is the pre-training stage, and then the fine-tuning stage. In the pre-training stage, we are training on a large chunk of the internet and just trying to get a first decoder-only Transformer to babble text. So it's very, very similar to what we've done ourselves, except we've done like a tiny little baby pre-training step.
1:49:28
And so in our case, this is how you print the number of parameters. I printed it and it's about 10 million. So this Transformer that I created here to create little Shakespeare was about 10 million parameters. Our dataset is roughly 1 million characters, so roughly 1 million tokens. But you have to remember that OpenAI uses a different vocabulary. They're not on the character level; they use these subword chunks of words, and so they have a vocabulary of roughly 50,000 elements, and so their sequences are a bit more condensed. So our dataset, the Shakespeare dataset, would be probably around 300,000 tokens in the OpenAI vocabulary, roughly. So we trained about a 10 million parameter model on roughly 300,000 tokens.
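In PyTorch the usual way to print the count is `sum(p.numel() for p in model.parameters())`. The sketch below instead derives the "about 10 million" figure by hand, so you can see where the parameters live. The hyperparameter values (vocabulary of 65, context of 256, embedding width 384, 6 heads, 6 layers) and the layout (bias-free key/query/value projections, a 4x feed-forward expansion, biases elsewhere) are assumptions that do not appear in this passage.

```python
def count_params(vocab_size, block_size, n_embd, n_head, n_layer):
    head_size = n_embd // n_head
    tok_emb = vocab_size * n_embd                 # token embedding table
    pos_emb = block_size * n_embd                 # position embedding table
    qkv = n_head * 3 * n_embd * head_size         # key, query, value per head, no bias
    proj = n_embd * n_embd + n_embd               # attention output projection
    ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    layer_norms = 2 * (2 * n_embd)                # two layer norms, gain and bias each
    block = qkv + proj + ffwd + layer_norms
    final_ln = 2 * n_embd
    lm_head = n_embd * vocab_size + vocab_size    # final linear layer to the vocabulary
    return tok_emb + pos_emb + n_layer * block + final_ln + lm_head

# Assumed hyperparameters, not stated in this part of the transcript.
n = count_params(vocab_size=65, block_size=256, n_embd=384, n_head=6, n_layer=6)
print(f"{n / 1e6:.2f}M parameters")
```

Almost all of the total sits in the attention and feed-forward matrices of the blocks; with a 65-symbol vocabulary the embedding tables contribute very little.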
1:50:14
Now, when you go to the GPT-3 paper and you look at the Transformers that they trained, they trained a number of Transformers of different sizes, but the biggest Transformer here has 175 billion parameters. So ours is, again, 10 million. They used this number of layers in the Transformer, this is the n_embd, this is the number of heads, and this is the head size, and then this is the batch size, so ours was 65, and the learning rate is similar.
1:50:46
Now, when they train this Transformer, they trained on 300 billion tokens. So again, remember ours is about 300,000, so this is about a millionfold increase. And this number would not even be that large by today's standards; you'd be going up to 1 trillion and above.
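The scale comparison above is simple arithmetic, spelled out here using only the round numbers quoted in the transcript (the variable names are just for illustration):

```python
ours = {"params": 10_000_000, "tokens": 300_000}
gpt3 = {"params": 175_000_000_000, "tokens": 300_000_000_000}

param_ratio = gpt3["params"] / ours["params"]
token_ratio = gpt3["tokens"] / ours["tokens"]
print(f"parameters: {param_ratio:,.0f}x larger")
print(f"tokens:     {token_ratio:,.0f}x more")

# Rough compression from characters to subword tokens implied by the estimates above.
chars_per_token = 1_000_000 / ours["tokens"]
print(f"about {chars_per_token:.1f} characters per token")
```

So the training data grows by a factor of a million while the model grows by a factor of 17,500.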
1:51:04
So they are training a significantly larger model on a good chunk of the internet, and that is the pre-training stage. But otherwise these hyperparameters should be fairly recognizable to you, and the architecture is actually nearly identical to what we implemented ourselves. Of course, it's a massive infrastructure challenge to train this: you're talking about typically thousands of GPUs having to talk to each other to train models of this size. So that's just the pre-training stage.
1:51:31
Now, after you complete the pre-training stage, you don't get something that responds to your questions with answers, and it's not helpful, etc. You get a document completer. So it babbles, but it doesn't babble Shakespeare, it babbles internet. It will create arbitrary news articles and documents, and it will try to complete documents, because that's what it's trained for: it's trying to complete the sequence. So when you give it a question, it would potentially just give you more questions; it would follow with more questions. It will do whatever it looks like some close document would do in the training data on the internet, and so who knows, you're getting kind of undefined behavior. It might answer questions with other questions, it might ignore your question, it might just try to complete some news article. It's totally unaligned, as we say.
1:52:20
totally unineed as we say so the second
1:52:20
totally unineed as we say so the second fine-tuning stage is to actually align
1:52:22
fine-tuning stage is to actually align
1:52:22
fine-tuning stage is to actually align it to be an assistant and uh this is the
1:52:25
it to be an assistant and uh this is the
1:52:25
it to be an assistant and uh this is the second stage and so this chat GPT block
1:52:28
second stage and so this chat GPT block
1:52:28
second stage and so this chat GPT block post from openi talks a little bit about
1:52:30
post from openi talks a little bit about
1:52:30
post from openi talks a little bit about how the stage is achieved we basically
1:52:34
how the stage is achieved we basically
1:52:34
how the stage is achieved we basically um there's roughly three steps to to
1:52:36
There are roughly three steps to this stage. What they do here is start to collect training data that looks specifically like what an assistant would do: these are documents that have the format where the question is on top and then an answer is below. They have a large number of these, but probably not on the order of the internet; this is probably on the order of maybe thousands of examples.
1:53:00
They then fine-tune the model to basically only focus on documents that look like that, so you're starting to slowly align it: it's going to expect a question at the top, and it's going to expect to complete the answer.
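The question-on-top, answer-below document format described here can be sketched in a few lines of Python. This is only an illustration: the "Question:" / "Answer:" template and the helper names `format_example` and `split_prompt_and_target` are assumptions made for this example, not the format or code OpenAI actually uses.

```python
# Minimal sketch of assistant-style fine-tuning documents.
# The "Question:/Answer:" template is an assumed, illustrative layout.

def format_example(question: str, answer: str) -> str:
    """Lay out one fine-tuning document: question on top, answer below."""
    return f"Question: {question.strip()}\nAnswer: {answer.strip()}\n"


def split_prompt_and_target(document: str) -> tuple[str, str]:
    """Split a document into the part the model reads and the part it should complete."""
    marker = "Answer:"
    head, _, tail = document.partition(marker)
    return head + marker, tail


# Hypothetical examples; a real dataset would hold thousands of these.
examples = [
    ("What is a haiku?", "A three-line poem with a 5-7-5 syllable pattern."),
    ("What corpus did we train on?", "Tiny Shakespeare."),
]

documents = [format_example(q, a) for q, a in examples]
prompt, target = split_prompt_and_target(documents[0])
print(prompt)
print(target)
```

Fine-tuning on documents like these uses the same next-token objective as pre-training; only the data changes, so the model learns to expect a question at the top and to complete the answer.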
1:53:13
These very, very large models are very sample-efficient during their fine-tuning, so this actually somehow works. But that's just step one; that's just fine-tuning. Then they actually have more steps. The second step is you let the model respond, and then different raters look at the different responses and rank them for their preference as to which one is better than the other. They use that to train a reward model, so they can predict, basically using a different network, how desirable any candidate response would be.
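One way to picture how rankings from raters become a training signal for a reward model is a pairwise loss: every ranking is expanded into (better, worse) pairs, and the reward model is penalized when it scores the worse response higher. The sketch below assumes the loss -log(sigmoid(r_better - r_worse)), a common choice for learning from preferences; the lecture does not say which loss is used, and the `toy_scores` values stand in for the output of a real reward network.

```python
import math
from itertools import combinations


def pairs_from_ranking(ranked_responses):
    """Turn a ranking (best first) into (better, worse) pairs."""
    return list(combinations(ranked_responses, 2))


def pairwise_ranking_loss(reward_better: float, reward_worse: float) -> float:
    """-log(sigmoid(reward_better - reward_worse)), computed in a numerically stable way."""
    margin = reward_better - reward_worse
    # -log(sigmoid(m)) == softplus(-m) == max(-m, 0) + log(1 + exp(-|m|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical rater ranking, best response first.
ranking = ["response_a", "response_b", "response_c"]

# Hypothetical scalar scores; a real reward model would compute these.
toy_scores = {"response_a": 2.0, "response_b": 0.5, "response_c": -1.0}

pairs = pairs_from_ranking(ranking)
losses = [pairwise_ranking_loss(toy_scores[b], toy_scores[w]) for b, w in pairs]
mean_loss = sum(losses) / len(losses)
print(pairs)
print(mean_loss)
```

The loss shrinks as the reward model's scores agree more strongly with the raters' ordering, which is what training the reward model would push toward.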
1:53:45
Once they have a reward model, they run PPO, which is a form of policy gradient reinforcement learning optimizer, to fine-tune this sampling policy so that the answers that ChatGPT now generates are expected to score a high reward according to the reward model.
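To give a rough feel for what PPO (proximal policy optimization) optimizes, here is a sketch of its clipped surrogate objective for a single sampled action. This is the textbook form of the objective, not a description of OpenAI's training code: the function name, the `clip_eps` default of 0.2, and the idea that the advantage comes from the reward model's score minus some baseline are all assumptions made for illustration.

```python
import math


def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, clip_eps: float = 0.2) -> float:
    """Clipped surrogate objective for one sample (to be maximized).

    logp_new / logp_old: log-probability of the sampled output under the
    current policy and under the policy that generated the sample.
    advantage: how much better than expected the sample was (assumed here
    to be derived from the reward model's score).
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to move the policy far from
    # the one that produced the sample.
    return min(ratio * advantage, clipped_ratio * advantage)


# The new policy is twice as likely to produce a well-rewarded answer,
# but the objective only credits it up to the clipping bound.
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0))
```

In a real setup this quantity would be averaged over many sampled responses and maximized by gradient ascent on the language model's parameters.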
1:54:07
So basically there's a whole aligning stage here, or fine-tuning stage. It's got multiple steps in between there as well, and it takes the model from being a document completer to a question answerer. That's like a whole separate stage. A lot of this data is not available publicly; it is internal to OpenAI, and it's much harder to replicate this stage. So that's roughly what would give you a ChatGPT, and nanoGPT focuses on the pre-training stage.
1:54:32
Okay, and that's everything that I wanted to cover today.
1:54:38
So, to summarize, we trained a decoder-only Transformer following this famous paper, "Attention Is All You Need," from 2017, and so that's basically a GPT. We trained it on Tiny Shakespeare and got sensible results.
1:54:54
All of the training code is roughly 200 lines of code. I will be releasing this code base, and it also comes with all the git log commits along the way as we built it up. In addition to this code, I'm going to release the notebook, of course, the Google Colab. I hope that gave you a sense for how you can train these models, like, say, GPT-3, which will be architecturally basically identical to what we have, but they are somewhere between 10,000 and 1 million times bigger, depending on how you count.
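The "10,000 to 1 million times bigger, depending on how you count" range can be checked with quick arithmetic. The figures below are approximate and are assumptions for this sketch, recalled from earlier in the lecture (a model of roughly 10 million parameters trained on roughly 300,000 tokens) and from the published GPT-3 numbers (175 billion parameters, 300 billion training tokens).

```python
# Approximate figures; treat all four as rough assumptions, not exact measurements.
OUR_PARAMS = 10_000_000          # model trained in the lecture
OUR_TOKENS = 300_000             # Tiny Shakespeare, counted in subword tokens
GPT3_PARAMS = 175_000_000_000    # GPT-3 parameter count
GPT3_TOKENS = 300_000_000_000    # GPT-3 training tokens

param_ratio = GPT3_PARAMS // OUR_PARAMS
token_ratio = GPT3_TOKENS // OUR_TOKENS

print(f"parameters: {param_ratio:,}x bigger")
print(f"training tokens: {token_ratio:,}x bigger")
```

Counting by parameters lands near the low end of the quoted range, and counting by training tokens lands at the high end.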
1:55:30
That's all I have for now. We did not talk about any of the fine-tuning stages that would typically go on top of this. So if you're interested in something that's not just language modeling, but you actually want to, say, perform tasks, or you want them to be aligned in a specific way, or you want to detect sentiment or anything like that, basically any time you don't want something that's just a document completer, you have to complete further stages of fine-tuning, which we did not cover. That could be simple supervised fine-tuning, or it can be something more fancy like we see in ChatGPT, where we actually train a reward model and then do rounds of PPO to align it with respect to the reward model. So there's a lot more that can be done on top of it.
1:56:08
I think for now we're starting to get to about the two-hour mark, so I'm going to kind of finish here. I hope you enjoyed the lecture, and yeah, go forth and transform. See you later.