View on GitHub
GitHub
Neural Networks: Zero to Hero
Building makemore Part 2: MLP
Loading player
Notes
Transcript
4202 segments
0:00
hi everyone
0:01
hi everyone
0:01
hi everyone today we are continuing our
0:03
today we are continuing our
0:03
today we are continuing our implementation of makemore
0:05
implementation of makemore
0:05
implementation of makemore now in the last lecture we implemented
0:06
now in the last lecture we implemented
0:06
now in the last lecture we implemented the bigram language model and we
0:08
the bigram language model and we
0:08
the bigram language model and we implemented it both using counts and
0:10
implemented it both using counts and
0:10
implemented it both using counts and also using a super simple neural network
0:12
also using a super simple neural network
0:12
also using a super simple neural network that had a single linear layer
0:15
that had a single linear layer
0:15
that had a single linear layer now this is the
0:17
now this is the
0:17
now this is the jupyter notebook that we built out last
0:19
jupyter notebook that we built out last
0:19
jupyter notebook that we built out last lecture
0:20
lecture
0:20
lecture and we saw that the way we approached
0:21
and we saw that the way we approached
0:21
and we saw that the way we approached this is that we looked at only the
0:23
this is that we looked at only the
0:23
this is that we looked at only the single previous character and we
0:24
single previous character and we
0:24
single previous character and we predicted the distribution for the
0:26
predicted the distribution for the
0:26
predicted the distribution for the character that would go next in the
0:28
character that would go next in the
0:28
character that would go next in the sequence and we did that by taking
0:30
sequence and we did that by taking
0:30
sequence and we did that by taking counts and normalizing them into
0:32
counts and normalizing them into
0:32
counts and normalizing them into probabilities
0:33
probabilities
0:33
probabilities so that each row here sums to one
0:36
so that each row here sums to one
0:36
so that each row here sums to one now this is all well and good if you
0:38
now this is all well and good if you
0:38
now this is all well and good if you only have one character of previous
0:40
only have one character of previous
0:40
only have one character of previous context
0:41
context
0:41
context and this works and it's approachable the
0:43
and this works and it's approachable the
0:43
and this works and it's approachable the problem with this model of course is
0:45
problem with this model of course is
0:45
problem with this model of course is that
0:46
that
0:46
that the predictions from this model are not
0:48
the predictions from this model are not
0:48
the predictions from this model are not very good because you only take one
0:50
very good because you only take one
0:50
very good because you only take one character of context so the model didn't
0:52
character of context so the model didn't
0:52
character of context so the model didn't produce very name like sounding things
0:56
produce very name like sounding things
0:56
produce very name like sounding things now the problem with this approach
0:57
now the problem with this approach
0:57
now the problem with this approach though is that if we are to take more
0:59
though is that if we are to take more
0:59
though is that if we are to take more context into account when predicting the
1:01
context into account when predicting the
1:01
context into account when predicting the next character in a sequence things
1:03
next character in a sequence things
1:03
next character in a sequence things quickly blow up and this table the size
1:06
quickly blow up and this table the size
1:06
quickly blow up and this table the size of this table grows and in fact it grows
1:08
of this table grows and in fact it grows
1:08
of this table grows and in fact it grows exponentially with the length of the
1:09
exponentially with the length of the
1:10
exponentially with the length of the context
1:11
context
1:11
context because if we only take a single
1:12
because if we only take a single
1:12
because if we only take a single character at a time that's 27
1:13
character at a time that's 27
1:13
character at a time that's 27 possibilities of context
1:15
possibilities of context
1:15
possibilities of context but if we take two characters in the
1:17
but if we take two characters in the
1:17
but if we take two characters in the past and try to predict the third one
1:19
past and try to predict the third one
1:19
past and try to predict the third one suddenly the number of rows in this
1:21
suddenly the number of rows in this
1:21
suddenly the number of rows in this matrix you can look at it that way
1:23
matrix you can look at it that way
1:23
matrix you can look at it that way is 27 times 27 so there's 729
1:26
is 27 times 27 so there's 729
1:26
is 27 times 27 so there's 729 possibilities for what could have come
1:28
possibilities for what could have come
1:28
possibilities for what could have come in the context
1:30
in the context
1:30
in the context if we take three characters as the
1:31
if we take three characters as the
1:31
if we take three characters as the context suddenly we have
1:34
context suddenly we have
1:34
context suddenly we have 20 000 possibilities of context
1:37
20 000 possibilities of context
1:37
20 000 possibilities of context and so there's just way too many rows of
1:40
and so there's just way too many rows of
1:40
and so there's just way too many rows of this matrix it's way too few counts
1:43
this matrix it's way too few counts
1:43
this matrix it's way too few counts for each possibility and the whole thing
1:45
for each possibility and the whole thing
1:45
for each possibility and the whole thing just kind of explodes and doesn't work
1:47
just kind of explodes and doesn't work
1:47
just kind of explodes and doesn't work very well
1:48
very well
1:48
very well so that's why today we're going to move
1:50
so that's why today we're going to move
1:50
so that's why today we're going to move on to this bullet point here and we're
1:52
on to this bullet point here and we're
1:52
on to this bullet point here and we're going to implement a multi-layer
1:53
going to implement a multi-layer
1:53
going to implement a multi-layer perceptron model to predict the next uh
1:56
perceptron model to predict the next uh
1:56
perceptron model to predict the next uh character in a sequence
1:58
character in a sequence
1:58
character in a sequence and this modeling approach that we're
2:00
and this modeling approach that we're
2:00
and this modeling approach that we're going to adopt follows this paper
2:01
going to adopt follows this paper
2:01
going to adopt follows this paper benguetal 2003
2:04
benguetal 2003
2:04
benguetal 2003 so i have the paper pulled up here
2:06
so i have the paper pulled up here
2:06
so i have the paper pulled up here now this isn't the very first paper that
2:08
now this isn't the very first paper that
2:08
now this isn't the very first paper that proposed the use of multiglio
2:09
proposed the use of multiglio
2:10
proposed the use of multiglio perceptrons or neural networks to
2:11
perceptrons or neural networks to
2:11
perceptrons or neural networks to predict the next character or token in a
2:13
predict the next character or token in a
2:13
predict the next character or token in a sequence but it's definitely one that is
2:15
sequence but it's definitely one that is
2:15
sequence but it's definitely one that is uh was very influential around that time
2:17
uh was very influential around that time
2:18
uh was very influential around that time it is very often cited to stand in for
2:19
it is very often cited to stand in for
2:20
it is very often cited to stand in for this idea and i think it's a very nice
2:21
this idea and i think it's a very nice
2:22
this idea and i think it's a very nice write-up and so this is the paper that
2:23
write-up and so this is the paper that
2:23
write-up and so this is the paper that we're going to first look at and then
2:25
we're going to first look at and then
2:25
we're going to first look at and then implement now this paper has 19 pages so
2:28
implement now this paper has 19 pages so
2:28
implement now this paper has 19 pages so we don't have time to go into
2:30
we don't have time to go into
2:30
we don't have time to go into the full detail of this paper but i
2:31
the full detail of this paper but i
2:31
the full detail of this paper but i invite you to read it
2:33
invite you to read it
2:33
invite you to read it it's very readable interesting and has a
2:34
it's very readable interesting and has a
2:34
it's very readable interesting and has a lot of interesting ideas in it as well
2:37
lot of interesting ideas in it as well
2:37
lot of interesting ideas in it as well in the introduction they describe the
2:38
in the introduction they describe the
2:38
in the introduction they describe the exact same problem i just described and
2:40
exact same problem i just described and
2:40
exact same problem i just described and then to address it they propose the
2:42
then to address it they propose the
2:42
then to address it they propose the following model
2:43
following model
2:43
following model now keep in mind that we are building a
2:46
now keep in mind that we are building a
2:46
now keep in mind that we are building a character level language model so we're
2:48
character level language model so we're
2:48
character level language model so we're working on the level of characters in
2:50
working on the level of characters in
2:50
working on the level of characters in this paper they have a vocabulary of 17
2:52
this paper they have a vocabulary of 17
2:52
this paper they have a vocabulary of 17 000 possible words and they instead
2:54
000 possible words and they instead
2:54
000 possible words and they instead build a word level language model but
2:56
build a word level language model but
2:56
build a word level language model but we're going to still stick with the
2:58
we're going to still stick with the
2:58
we're going to still stick with the characters but we'll take the same
2:59
characters but we'll take the same
2:59
characters but we'll take the same modeling approach
3:01
modeling approach
3:01
modeling approach now what they do is basically they
3:03
now what they do is basically they
3:03
now what they do is basically they propose to take every one of these words
3:05
propose to take every one of these words
3:05
propose to take every one of these words seventeen thousand words and they're
3:07
seventeen thousand words and they're
3:07
seventeen thousand words and they're going to associate to each word a say
3:10
going to associate to each word a say
3:10
going to associate to each word a say thirty dimensional feature vector
3:12
thirty dimensional feature vector
3:12
thirty dimensional feature vector so every word is now
3:15
so every word is now
3:15
so every word is now embedded into a thirty dimensional space
3:17
embedded into a thirty dimensional space
3:17
embedded into a thirty dimensional space you can think of it that way so we have
3:19
you can think of it that way so we have
3:19
you can think of it that way so we have 17 000 points or vectors in a 30
3:22
17 000 points or vectors in a 30
3:22
17 000 points or vectors in a 30 dimensional space
3:23
dimensional space
3:23
dimensional space and that's um you might imagine that's
3:25
and that's um you might imagine that's
3:25
and that's um you might imagine that's very crowded that's a lot of points for
3:27
very crowded that's a lot of points for
3:27
very crowded that's a lot of points for a very small space
3:28
a very small space
3:28
a very small space now
3:30
now
3:30
now in the beginning these words are
3:31
in the beginning these words are
3:31
in the beginning these words are initialized completely randomly so
3:32
initialized completely randomly so
3:32
initialized completely randomly so they're spread out at random
3:34
they're spread out at random
3:34
they're spread out at random but then we're going to tune these
3:36
but then we're going to tune these
3:36
but then we're going to tune these embeddings of these words using back
3:38
embeddings of these words using back
3:38
embeddings of these words using back propagation
3:39
propagation
3:39
propagation so during the course of training of this
3:41
so during the course of training of this
3:41
so during the course of training of this neural network these points or vectors
3:43
neural network these points or vectors
3:43
neural network these points or vectors are going to basically move around in
3:44
are going to basically move around in
3:44
are going to basically move around in this space and you might imagine that
3:46
this space and you might imagine that
3:46
this space and you might imagine that for example words that have very similar
3:48
for example words that have very similar
3:48
for example words that have very similar meanings or that are indeed synonyms of
3:50
meanings or that are indeed synonyms of
3:50
meanings or that are indeed synonyms of each other might end up in a very
3:52
each other might end up in a very
3:52
each other might end up in a very similar part of the space and conversely
3:54
similar part of the space and conversely
3:54
similar part of the space and conversely words that mean very different things
3:56
words that mean very different things
3:56
words that mean very different things would go somewhere else in a space
3:59
would go somewhere else in a space
3:59
would go somewhere else in a space now their modeling approach otherwise is
4:01
now their modeling approach otherwise is
4:01
now their modeling approach otherwise is identical to ours they are using a
4:03
identical to ours they are using a
4:03
identical to ours they are using a multi-layer neural network to predict
4:04
multi-layer neural network to predict
4:04
multi-layer neural network to predict the next word given the previous words
4:07
the next word given the previous words
4:07
the next word given the previous words and to train the neural network they are
4:08
and to train the neural network they are
4:08
and to train the neural network they are maximizing the log likelihood of the
4:10
maximizing the log likelihood of the
4:10
maximizing the log likelihood of the training data just like we did
4:12
training data just like we did
4:12
training data just like we did so the modeling approach itself is
4:14
so the modeling approach itself is
4:14
so the modeling approach itself is identical now here they have a concrete
4:16
identical now here they have a concrete
4:16
identical now here they have a concrete example of this intuition
4:18
example of this intuition
4:18
example of this intuition why does it work
4:20
why does it work
4:20
why does it work basically suppose that for example you
4:21
basically suppose that for example you
4:21
basically suppose that for example you are trying to predict a dog was running
4:23
are trying to predict a dog was running
4:23
are trying to predict a dog was running in a blank
4:25
in a blank
4:25
in a blank now suppose that the exact phrase a dog
4:28
now suppose that the exact phrase a dog
4:28
now suppose that the exact phrase a dog was running in a
4:29
was running in a
4:29
was running in a has never occurred in a training data
4:31
has never occurred in a training data
4:31
has never occurred in a training data and here you are at sort of test time
4:33
and here you are at sort of test time
4:33
and here you are at sort of test time later when the model is deployed
4:35
later when the model is deployed
4:35
later when the model is deployed somewhere
4:36
somewhere
4:36
somewhere and it's trying to make a sentence and
4:38
and it's trying to make a sentence and
4:38
and it's trying to make a sentence and it's saying a dog was running in a blank
4:41
it's saying a dog was running in a blank
4:41
it's saying a dog was running in a blank and because it's never encountered this
4:43
and because it's never encountered this
4:43
and because it's never encountered this exact phrase in the training set you're
4:45
exact phrase in the training set you're
4:45
exact phrase in the training set you're out of distribution as we say like you
4:47
out of distribution as we say like you
4:47
out of distribution as we say like you don't have fundamentally any
4:49
don't have fundamentally any
4:49
don't have fundamentally any reason to suspect
4:52
reason to suspect
4:52
reason to suspect what might come next
4:53
what might come next
4:54
what might come next but this approach actually allows you to
4:55
but this approach actually allows you to
4:55
but this approach actually allows you to get around that because maybe you didn't
4:57
get around that because maybe you didn't
4:57
get around that because maybe you didn't see the exact phrase a dog was running
4:59
see the exact phrase a dog was running
4:59
see the exact phrase a dog was running in a something but maybe you've seen
5:01
in a something but maybe you've seen
5:01
in a something but maybe you've seen similar phrases maybe you've seen the
5:03
similar phrases maybe you've seen the
5:03
similar phrases maybe you've seen the phrase the dog was running in a blank
5:06
phrase the dog was running in a blank
5:06
phrase the dog was running in a blank and maybe your network has learned that
5:07
and maybe your network has learned that
5:07
and maybe your network has learned that a and the
5:08
a and the
5:08
a and the are like frequently are interchangeable
5:10
are like frequently are interchangeable
5:10
are like frequently are interchangeable with each other and so maybe it took the
5:13
with each other and so maybe it took the
5:13
with each other and so maybe it took the embedding for a and the embedding for
5:15
embedding for a and the embedding for
5:15
embedding for a and the embedding for the and it actually put them like nearby
5:17
the and it actually put them like nearby
5:17
the and it actually put them like nearby each other in the space and so you can
5:18
each other in the space and so you can
5:18
each other in the space and so you can transfer knowledge through that
5:20
transfer knowledge through that
5:20
transfer knowledge through that embedding and you can generalize in that
5:22
embedding and you can generalize in that
5:22
embedding and you can generalize in that way
5:23
way
5:23
way similarly the network could know that
5:25
similarly the network could know that
5:25
similarly the network could know that cats and dogs are animals and they
5:26
cats and dogs are animals and they
5:26
cats and dogs are animals and they co-occur in lots of very similar
5:28
co-occur in lots of very similar
5:28
co-occur in lots of very similar contexts and so even though you haven't
5:30
contexts and so even though you haven't
5:30
contexts and so even though you haven't seen this exact phrase
5:32
seen this exact phrase
5:32
seen this exact phrase or if you haven't seen exactly walking
5:34
or if you haven't seen exactly walking
5:34
or if you haven't seen exactly walking or running
5:35
or running
5:35
or running you can through the embedding space
5:37
you can through the embedding space
5:37
you can through the embedding space transfer knowledge and you can
5:39
transfer knowledge and you can
5:39
transfer knowledge and you can generalize to novel scenarios
5:42
generalize to novel scenarios
5:42
generalize to novel scenarios so let's now scroll down to the diagram
5:43
so let's now scroll down to the diagram
5:43
so let's now scroll down to the diagram of the neural network
5:44
of the neural network
5:44
of the neural network they have a nice diagram here
5:47
they have a nice diagram here
5:47
they have a nice diagram here and in this example we are taking three
5:49
and in this example we are taking three
5:49
and in this example we are taking three previous words
5:51
previous words
5:51
previous words and we are trying to predict the fourth
5:53
and we are trying to predict the fourth
5:53
and we are trying to predict the fourth word
5:53
word
5:54
word in a sequence
5:55
in a sequence
5:55
in a sequence now these three previous words as i
5:57
now these three previous words as i
5:57
now these three previous words as i mentioned uh we have a vocabulary of 17
6:00
mentioned uh we have a vocabulary of 17
6:00
mentioned uh we have a vocabulary of 17 000 um possible words
6:02
000 um possible words
6:02
000 um possible words so every one of these
6:04
so every one of these
6:04
so every one of these basically basically
6:05
basically basically
6:05
basically basically are the index of the incoming word
6:09
are the index of the incoming word
6:09
are the index of the incoming word and because there are 17 000 words this
6:11
and because there are 17 000 words this
6:11
and because there are 17 000 words this is an integer between 0 and 16999
6:17
now there's also a lookup table that
6:19
now there's also a lookup table that
6:19
now there's also a lookup table that they call c
6:20
they call c
6:20
they call c this lookup table is a matrix that is 17
6:23
this lookup table is a matrix that is 17
6:23
this lookup table is a matrix that is 17 000 by say 30.
6:25
000 by say 30.
6:25
000 by say 30. and basically what we're doing here is
6:27
and basically what we're doing here is
6:27
and basically what we're doing here is we're treating this as a lookup table
6:29
we're treating this as a lookup table
6:29
we're treating this as a lookup table and so every index is
6:31
and so every index is
6:31
and so every index is plucking out a row of this embedding
6:34
plucking out a row of this embedding
6:34
plucking out a row of this embedding matrix
6:35
matrix
6:35
matrix so that each index is converted to the
6:37
so that each index is converted to the
6:37
so that each index is converted to the 30 dimensional vector that corresponds
6:39
30 dimensional vector that corresponds
6:39
30 dimensional vector that corresponds to the embedding vector for that word
6:42
to the embedding vector for that word
6:42
to the embedding vector for that word so here we have the input layer of 30
6:45
so here we have the input layer of 30
6:45
so here we have the input layer of 30 neurons for three words making up 90
6:48
neurons for three words making up 90
6:48
neurons for three words making up 90 neurons in total
6:50
neurons in total
6:50
neurons in total and here they're saying that this matrix
6:52
and here they're saying that this matrix
6:52
and here they're saying that this matrix c is shared across all the words so
6:54
c is shared across all the words so
6:54
c is shared across all the words so we're always indexing into the same
6:56
we're always indexing into the same
6:56
we're always indexing into the same matrix c over and over um
6:59
matrix c over and over um
6:59
matrix c over and over um for each one of these
7:00
for each one of these
7:00
for each one of these words next up is the hidden layer of
7:03
words next up is the hidden layer of
7:03
words next up is the hidden layer of this neural network the size of this
7:05
this neural network the size of this
7:05
this neural network the size of this hidden neural layer of this neural net
7:07
hidden neural layer of this neural net
7:07
hidden neural layer of this neural net is a hoppy parameter so we use the word
7:09
is a hoppy parameter so we use the word
7:09
is a hoppy parameter so we use the word hyperparameter when it's kind of like a
7:11
hyperparameter when it's kind of like a
7:11
hyperparameter when it's kind of like a design choice up to the designer of the
7:12
design choice up to the designer of the
7:12
design choice up to the designer of the neural net and this can be as large as
7:15
neural net and this can be as large as
7:15
neural net and this can be as large as you'd like or as small as you'd like so
7:16
you'd like or as small as you'd like so
7:16
you'd like or as small as you'd like so for example the size could be a hundred
7:19
for example the size could be a hundred
7:19
for example the size could be a hundred and we are going to go over multiple
7:20
and we are going to go over multiple
7:20
and we are going to go over multiple choices of the size of this hidden layer
7:23
choices of the size of this hidden layer
7:23
choices of the size of this hidden layer and we're going to evaluate how well
7:24
and we're going to evaluate how well
7:24
and we're going to evaluate how well they work
7:25
they work
7:26
they work so say there were 100 neurons here all
7:28
so say there were 100 neurons here all
7:28
so say there were 100 neurons here all of them would be fully connected to the
7:29
of them would be fully connected to the
7:30
of them would be fully connected to the 90 words or 90 um
7:32
90 words or 90 um
7:32
90 words or 90 um numbers that make up these three words
7:35
numbers that make up these three words
7:35
numbers that make up these three words so this is a fully connected layer
7:37
so this is a fully connected layer
7:37
so this is a fully connected layer then there's a 10 inch long linearity
7:40
then there's a 10 inch long linearity
7:40
then there's a 10 inch long linearity and then there's this output layer and
7:42
and then there's this output layer and
7:42
and then there's this output layer and because there are 17 000 possible words
7:44
because there are 17 000 possible words
7:44
because there are 17 000 possible words that could come next
7:45
that could come next
7:45
that could come next this layer has 17 000 neurons
7:48
this layer has 17 000 neurons
7:48
this layer has 17 000 neurons and all of them are fully connected to
7:51
and all of them are fully connected to
7:51
and all of them are fully connected to all of these neurons in the hidden layer
7:54
all of these neurons in the hidden layer
7:54
all of these neurons in the hidden layer so there's a lot of parameters here
7:56
so there's a lot of parameters here
7:56
so there's a lot of parameters here because there's a lot of words so most
7:58
because there's a lot of words so most
7:58
because there's a lot of words so most computation is here this is the
7:59
computation is here this is the
7:59
computation is here this is the expensive layer
8:01
expensive layer
8:01
expensive layer now there are 17 000 logits here so on
8:04
now there are 17 000 logits here so on
8:04
now there are 17 000 logits here so on top of there we have the softmax layer
8:06
top of there we have the softmax layer
8:06
top of there we have the softmax layer which we've seen in our previous video
8:08
which we've seen in our previous video
8:08
which we've seen in our previous video as well so every one of these logits is
8:10
as well so every one of these logits is
8:10
as well so every one of these logits is exponentiated and then everything is
8:12
exponentiated and then everything is
8:12
exponentiated and then everything is normalized to sum to 1 so that we have a
8:15
normalized to sum to 1 so that we have a
8:15
normalized to sum to 1 so that we have a nice probability distribution for the
8:17
nice probability distribution for the
8:17
nice probability distribution for the next word in the sequence
8:19
next word in the sequence
8:19
next word in the sequence now of course during training we
8:21
now of course during training we
8:21
now of course during training we actually have the label we have the
8:22
actually have the label we have the
8:22
actually have the label we have the identity of the next word in a sequence
8:25
identity of the next word in a sequence
8:25
identity of the next word in a sequence that word
8:26
that word
8:26
that word or its index is used to pluck out the
8:30
or its index is used to pluck out the
8:30
or its index is used to pluck out the probability of that word
8:32
probability of that word
8:32
probability of that word and then we are maximizing the
8:34
and then we are maximizing the
8:34
and then we are maximizing the probability of that word
8:37
probability of that word
8:37
probability of that word with respect to the parameters of this
8:38
with respect to the parameters of this
8:38
with respect to the parameters of this neural net
8:39
neural net
8:39
neural net so the parameters are the weights and
8:41
so the parameters are the weights and
8:41
so the parameters are the weights and biases of this output layer
8:44
biases of this output layer
8:44
biases of this output layer the weights and biases of the hidden
8:45
the weights and biases of the hidden
8:45
the weights and biases of the hidden layer and the embedding lookup table c
8:49
layer and the embedding lookup table c
8:49
layer and the embedding lookup table c and all of that is optimized using back
8:50
and all of that is optimized using back
8:50
and all of that is optimized using back propagation
8:52
propagation
8:52
propagation and these uh dashed arrows ignore those
8:55
and these uh dashed arrows ignore those
8:55
and these uh dashed arrows ignore those uh that represents a variation of a
8:57
uh that represents a variation of a
8:57
uh that represents a variation of a neural nut that we are not going to
8:58
neural nut that we are not going to
8:58
neural nut that we are not going to explore in this video
8:59
explore in this video
8:59
explore in this video so that's the setup and now let's
9:01
so that's the setup and now let's
9:01
so that's the setup and now let's implement it
9:02
implement it
9:02
implement it okay so i started a brand new notebook
9:04
okay so i started a brand new notebook
9:04
okay so i started a brand new notebook for this lecture
9:05
for this lecture
9:05
for this lecture we are importing pytorch and we are
9:07
we are importing pytorch and we are
9:07
we are importing pytorch and we are importing matplotlib so we can create
9:09
importing matplotlib so we can create
9:09
importing matplotlib so we can create figures
9:10
figures
9:10
figures then i am reading all the names into a
9:13
then i am reading all the names into a
9:13
then i am reading all the names into a list of words like i did before and i'm
9:15
list of words like i did before and i'm
9:15
list of words like i did before and i'm showing the first eight right here
9:18
showing the first eight right here
9:18
showing the first eight right here keep in mind that we have a 32 000 in
9:20
keep in mind that we have a 32 000 in
9:20
keep in mind that we have a 32 000 in total these are just the first eight
9:22
total these are just the first eight
9:22
total these are just the first eight and then here i'm building out the
9:23
and then here i'm building out the
9:24
and then here i'm building out the vocabulary of characters and all the
9:25
vocabulary of characters and all the
9:25
vocabulary of characters and all the mappings from the characters as strings
9:28
mappings from the characters as strings
9:28
mappings from the characters as strings to integers and vice versa
9:31
to integers and vice versa
9:31
to integers and vice versa now the first thing we want to do is we
9:32
now the first thing we want to do is we
9:32
now the first thing we want to do is we want to compile the data set for the
9:34
want to compile the data set for the
9:34
want to compile the data set for the neural network
9:35
neural network
9:35
neural network and i had to rewrite this code um i'll
9:37
and i had to rewrite this code um i'll
9:37
and i had to rewrite this code um i'll show you in a second what it looks like
9:41
show you in a second what it looks like
9:41
show you in a second what it looks like so this is the code that i created for
9:43
so this is the code that i created for
9:43
so this is the code that i created for the dataset creation so let me first run
9:44
the dataset creation so let me first run
9:44
the dataset creation so let me first run it and then i'll briefly explain how
9:46
it and then i'll briefly explain how
9:46
it and then i'll briefly explain how this works
9:48
this works
9:48
this works so first we're going to define something
9:50
so first we're going to define something
9:50
so first we're going to define something called block size and this is basically
9:52
called block size and this is basically
9:52
called block size and this is basically the context length of how many
9:54
the context length of how many
9:54
the context length of how many characters do we take to predict the
9:55
characters do we take to predict the
9:56
characters do we take to predict the next one so here in this example we're
9:57
next one so here in this example we're
9:58
next one so here in this example we're taking three characters to predict the
9:59
taking three characters to predict the
9:59
taking three characters to predict the fourth one so we have a block size of
10:01
fourth one so we have a block size of
10:02
fourth one so we have a block size of three that's the size of the block that
10:04
three that's the size of the block that
10:04
three that's the size of the block that supports the prediction
10:06
supports the prediction
10:06
supports the prediction then here i'm building out the x and y
10:09
then here i'm building out the x and y
10:09
then here i'm building out the x and y the x are the input to the neural net
10:12
the x are the input to the neural net
10:12
the x are the input to the neural net and the y are the labels for each
10:15
and the y are the labels for each
10:15
and the y are the labels for each example inside x
10:17
example inside x
10:17
example inside x then i'm airing over the first five
10:19
then i'm airing over the first five
10:19
then i'm airing over the first five words i'm doing first five just for
10:21
words i'm doing first five just for
10:21
words i'm doing first five just for efficiency while we are developing all
10:23
efficiency while we are developing all
10:23
efficiency while we are developing all the code but then later we're going to
10:24
the code but then later we're going to
10:24
the code but then later we're going to come here and erase this so that we use
10:26
come here and erase this so that we use
10:26
come here and erase this so that we use the entire training set
10:29
the entire training set
10:29
the entire training set so here i'm printing the word
10:30
so here i'm printing the word
10:30
so here i'm printing the word emma and here i'm basically showing the
10:33
emma and here i'm basically showing the
10:33
emma and here i'm basically showing the examples that we can generate the five
10:35
examples that we can generate the five
10:36
examples that we can generate the five examples that we can generate out of the
10:37
examples that we can generate out of the
10:37
examples that we can generate out of the single
10:38
single
10:38
single um sort of word emma
10:41
um sort of word emma
10:41
um sort of word emma so
10:42
so
10:42
so when we are given the context of just uh
10:44
when we are given the context of just uh
10:44
when we are given the context of just uh dot dot the first character in a
10:45
dot dot the first character in a
10:45
dot dot the first character in a sequence is e
10:47
sequence is e
10:47
sequence is e in this context the label is m
10:50
in this context the label is m
10:50
in this context the label is m when the context is this the label is m
10:53
when the context is this the label is m
10:53
when the context is this the label is m and so forth and so the way i build this
10:55
and so forth and so the way i build this
10:55
and so forth and so the way i build this out is first i start with a padded
10:56
out is first i start with a padded
10:56
out is first i start with a padded context of just zero tokens
10:59
context of just zero tokens
10:59
context of just zero tokens then i iterate over all the characters i
11:01
then i iterate over all the characters i
11:01
then i iterate over all the characters i get the character in the sequence and i
11:04
get the character in the sequence and i
11:04
get the character in the sequence and i basically build out the array y of this
11:07
basically build out the array y of this
11:07
basically build out the array y of this current character and the array x which
11:09
current character and the array x which
11:09
current character and the array x which stores the current running context
11:11
stores the current running context
11:11
stores the current running context and then here see i print everything and
11:14
and then here see i print everything and
11:14
and then here see i print everything and here i um crop the context and enter the
11:17
here i um crop the context and enter the
11:17
here i um crop the context and enter the new character in a sequence so this is
11:18
new character in a sequence so this is
11:18
new character in a sequence so this is kind of like a rolling window of context
11:22
kind of like a rolling window of context
11:22
kind of like a rolling window of context now we can change the block size here to
11:24
now we can change the block size here to
11:24
now we can change the block size here to for example four
11:25
for example four
11:25
for example four and in that case we'll be predicting the
11:27
and in that case we'll be predicting the
11:27
and in that case we'll be predicting the fifth character given the previous four
11:30
fifth character given the previous four
11:30
fifth character given the previous four or it can be five
11:31
or it can be five
11:31
or it can be five and then it would look like this
11:33
and then it would look like this
11:33
and then it would look like this or it can be say ten
11:36
or it can be say ten
11:36
or it can be say ten and then it would look something like
11:37
and then it would look something like
11:37
and then it would look something like this we're taking ten characters to
11:39
this we're taking ten characters to
11:39
this we're taking ten characters to predict the eleventh one
11:40
predict the eleventh one
11:40
predict the eleventh one and we're always padding with dots
11:43
and we're always padding with dots
11:43
and we're always padding with dots so let me bring this back to three
11:45
so let me bring this back to three
11:45
so let me bring this back to three just so that we have what we have here
11:47
just so that we have what we have here
11:47
just so that we have what we have here in the paper
11:49
in the paper
11:50
in the paper and finally the data set right now looks
11:51
and finally the data set right now looks
11:51
and finally the data set right now looks as follows
11:53
as follows
11:53
as follows from these five words we have created a
11:55
from these five words we have created a
11:55
from these five words we have created a data set of 32 examples
11:57
data set of 32 examples
11:57
data set of 32 examples and each input of the neural net is
11:59
and each input of the neural net is
11:59
and each input of the neural net is three integers and we have a label that
12:01
three integers and we have a label that
12:02
three integers and we have a label that is also an integer
12:03
is also an integer
12:03
is also an integer y
12:04
y
12:04
y so x looks like this
12:06
so x looks like this
12:06
so x looks like this these are the individual examples
12:08
these are the individual examples
12:08
these are the individual examples and then y are the labels
12:12
and then y are the labels
12:12
and then y are the labels so
12:13
so
12:13
so given this
12:15
given this
12:15
given this let's now write a neural network that
12:17
let's now write a neural network that
12:17
let's now write a neural network that takes these axes and predicts the y's
12:19
takes these axes and predicts the y's
12:19
takes these axes and predicts the y's first let's build the embedding lookup
12:21
first let's build the embedding lookup
12:21
first let's build the embedding lookup table c
12:23
table c
12:23
table c so we have 27 possible characters and
12:25
so we have 27 possible characters and
12:25
so we have 27 possible characters and we're going to embed them in a lower
12:26
we're going to embed them in a lower
12:26
we're going to embed them in a lower dimensional space
12:28
dimensional space
12:28
dimensional space in the paper they have 17 000 words and
12:31
in the paper they have 17 000 words and
12:31
in the paper they have 17 000 words and they bet them in uh spaces as small
12:33
they bet them in uh spaces as small
12:33
they bet them in uh spaces as small dimensional as 30. so they cram 17 000
12:38
dimensional as 30. so they cram 17 000
12:38
dimensional as 30. so they cram 17 000 words into 30 dimensional space in our
12:40
words into 30 dimensional space in our
12:40
words into 30 dimensional space in our case we have only 27 possible characters
12:43
case we have only 27 possible characters
12:43
case we have only 27 possible characters so let's grab them in something as small
12:45
so let's grab them in something as small
12:45
so let's grab them in something as small as to start with for example a
12:46
as to start with for example a
12:46
as to start with for example a two-dimensional space
12:48
two-dimensional space
12:48
two-dimensional space so this lookup table will be random
12:50
so this lookup table will be random
12:50
so this lookup table will be random numbers
12:51
numbers
12:51
numbers and we'll have 27 rows and we'll have
12:54
and we'll have 27 rows and we'll have
12:54
and we'll have 27 rows and we'll have two columns
12:56
two columns
12:56
two columns right so each 20 each one of 27
12:58
right so each 20 each one of 27
12:58
right so each 20 each one of 27 characters will have a two-dimensional
13:00
characters will have a two-dimensional
13:00
characters will have a two-dimensional embedding
13:01
embedding
13:02
embedding so that's our matrix c of embeddings in
13:05
so that's our matrix c of embeddings in
13:05
so that's our matrix c of embeddings in the beginning initialized randomly
13:07
the beginning initialized randomly
13:07
the beginning initialized randomly now before we embed all of the integers
13:10
now before we embed all of the integers
13:10
now before we embed all of the integers inside the input x using this lookup
13:12
inside the input x using this lookup
13:12
inside the input x using this lookup table c
13:14
table c
13:14
table c let me actually just try to embed a
13:15
let me actually just try to embed a
13:15
let me actually just try to embed a single individual integer like say five
13:19
single individual integer like say five
13:19
single individual integer like say five so we get a sense of how this works
13:21
so we get a sense of how this works
13:21
so we get a sense of how this works now
13:22
now
13:22
now one way this works of course is we can
13:24
one way this works of course is we can
13:24
one way this works of course is we can just take the c and we can index into
13:26
just take the c and we can index into
13:26
just take the c and we can index into row five
13:27
row five
13:27
row five and that gives us a vector the fifth row
13:30
and that gives us a vector the fifth row
13:30
and that gives us a vector the fifth row of c
13:31
of c
13:31
of c and um
13:33
and um
13:33
and um this is one way to do it
13:34
this is one way to do it
13:34
this is one way to do it the other way that i presented in the
13:36
the other way that i presented in the
13:36
the other way that i presented in the previous lecture is actually seemingly
13:38
previous lecture is actually seemingly
13:38
previous lecture is actually seemingly different but actually identical so in
13:40
different but actually identical so in
13:40
different but actually identical so in the previous lecture what we did is we
13:42
the previous lecture what we did is we
13:42
the previous lecture what we did is we took these integers and we used the one
13:44
took these integers and we used the one
13:44
took these integers and we used the one hot encoding to first encode them so f.1
13:47
hot encoding to first encode them so f.1
13:47
hot encoding to first encode them so f.1 hot
13:48
hot
13:48
hot we want to encode integer 5 and we want
13:51
we want to encode integer 5 and we want
13:51
we want to encode integer 5 and we want to tell it that the number of classes is
13:52
to tell it that the number of classes is
13:52
to tell it that the number of classes is 27 so that's the 26 dimensional vector
13:55
27 so that's the 26 dimensional vector
13:55
27 so that's the 26 dimensional vector of all zeros except the fifth bit is
13:57
of all zeros except the fifth bit is
13:57
of all zeros except the fifth bit is turned on
13:59
turned on
13:59
turned on now this actually doesn't work
14:03
now this actually doesn't work
14:03
now this actually doesn't work the reason is that
14:04
the reason is that
14:04
the reason is that this input actually must be a doorstop
14:05
this input actually must be a doorstop
14:06
this input actually must be a doorstop tensor
14:07
tensor
14:07
tensor and i'm making some of these errors
14:09
and i'm making some of these errors
14:09
and i'm making some of these errors intentionally just so you get to see
14:10
intentionally just so you get to see
14:10
intentionally just so you get to see some errors and how to fix them
14:12
some errors and how to fix them
14:12
some errors and how to fix them so this must be a tester not an int
14:14
so this must be a tester not an int
14:14
so this must be a tester not an int fairly straightforward to fix we get a
14:16
fairly straightforward to fix we get a
14:16
fairly straightforward to fix we get a one hot vector the fifth dimension is
14:18
one hot vector the fifth dimension is
14:18
one hot vector the fifth dimension is one and the shape of this is 27.
14:22
one and the shape of this is 27.
14:22
one and the shape of this is 27. and now notice that just as i briefly
14:24
and now notice that just as i briefly
14:24
and now notice that just as i briefly alluded to in the previous video if we
14:26
alluded to in the previous video if we
14:26
alluded to in the previous video if we take this one hot vector and we multiply
14:29
take this one hot vector and we multiply
14:29
take this one hot vector and we multiply it by c
14:33
then
14:35
then
14:35
then what would you expect
14:37
what would you expect
14:37
what would you expect well number one
14:39
well number one
14:39
well number one first you'd expect an error
14:41
first you'd expect an error
14:41
first you'd expect an error because
14:43
because
14:43
because expected scalar type long but found
14:45
expected scalar type long but found
14:45
expected scalar type long but found float
14:46
float
14:46
float so a little bit confusing but
14:48
so a little bit confusing but
14:48
so a little bit confusing but the problem here is that one hot the
14:50
the problem here is that one hot the
14:50
the problem here is that one hot the data type of it
14:52
data type of it
14:52
data type of it is
14:53
is
14:53
is long it's a 64-bit integer but this is a
14:56
long it's a 64-bit integer but this is a
14:56
long it's a 64-bit integer but this is a float tensor and so pytorch doesn't know
14:59
float tensor and so pytorch doesn't know
14:59
float tensor and so pytorch doesn't know how to multiply an int with a float and
15:01
how to multiply an int with a float and
15:01
how to multiply an int with a float and that's why we had to explicitly cast
15:03
that's why we had to explicitly cast
15:03
that's why we had to explicitly cast this to a float so that we can multiply
15:06
this to a float so that we can multiply
15:06
this to a float so that we can multiply now the output actually here
15:09
now the output actually here
15:09
now the output actually here is identical
15:11
is identical
15:11
is identical and that it's identical because of the
15:12
and that it's identical because of the
15:12
and that it's identical because of the way the matrix multiplication here works
15:15
way the matrix multiplication here works
15:15
way the matrix multiplication here works we have the one hot um vector
15:17
we have the one hot um vector
15:17
we have the one hot um vector multiplying columns of c
15:19
multiplying columns of c
15:19
multiplying columns of c and because of all the zeros they
15:21
and because of all the zeros they
15:21
and because of all the zeros they actually end up masking out everything
15:23
actually end up masking out everything
15:23
actually end up masking out everything in c except for the fifth row which is
15:25
in c except for the fifth row which is
15:25
in c except for the fifth row which is plucked out
15:27
plucked out
15:27
plucked out and so we actually arrive at the same
15:28
and so we actually arrive at the same
15:28
and so we actually arrive at the same result
15:29
result
15:29
result and that tells you that
15:31
and that tells you that
15:31
and that tells you that here we can interpret this first
15:33
here we can interpret this first
15:33
here we can interpret this first piece here this embedding of the integer
15:35
piece here this embedding of the integer
15:35
piece here this embedding of the integer we can either think of it as the integer
15:37
we can either think of it as the integer
15:37
we can either think of it as the integer indexing into a lookup table c but
15:40
indexing into a lookup table c but
15:40
indexing into a lookup table c but equivalently we can also think of this
15:42
equivalently we can also think of this
15:42
equivalently we can also think of this little piece here as a first layer of
15:44
little piece here as a first layer of
15:44
little piece here as a first layer of this bigger neural net
15:46
this bigger neural net
15:46
this bigger neural net this layer here has neurons that have no
15:48
this layer here has neurons that have no
15:48
this layer here has neurons that have no non-linearity there's no 10h they're
15:50
non-linearity there's no 10h they're
15:50
non-linearity there's no 10h they're just linear neurons and their weight
15:52
just linear neurons and their weight
15:52
just linear neurons and their weight matrix is c
15:55
matrix is c
15:55
matrix is c and then we are encoding integers into
15:57
and then we are encoding integers into
15:57
and then we are encoding integers into one hot and feeding those into a neural
15:59
one hot and feeding those into a neural
15:59
one hot and feeding those into a neural net and this first layer basically
16:01
net and this first layer basically
16:01
net and this first layer basically embeds them
16:02
embeds them
16:02
embeds them so those are two equivalent ways of
16:04
so those are two equivalent ways of
16:04
so those are two equivalent ways of doing the same thing we're just going to
16:06
doing the same thing we're just going to
16:06
doing the same thing we're just going to index because it's much much faster and
16:08
index because it's much much faster and
16:08
index because it's much much faster and we're going to discard this
16:09
we're going to discard this
16:09
we're going to discard this interpretation of one hot inputs into
16:12
interpretation of one hot inputs into
16:12
interpretation of one hot inputs into neural nets and we're just going to
16:14
neural nets and we're just going to
16:14
neural nets and we're just going to index integers and create and use
16:16
index integers and create and use
16:16
index integers and create and use embedding tables now embedding a single
16:18
embedding tables now embedding a single
16:18
embedding tables now embedding a single integer like 5 is easy enough we can
16:20
integer like 5 is easy enough we can
16:20
integer like 5 is easy enough we can simply ask pytorch to retrieve the fifth
16:23
simply ask pytorch to retrieve the fifth
16:23
simply ask pytorch to retrieve the fifth row of c
16:24
row of c
16:24
row of c or the row index five of c
16:27
or the row index five of c
16:27
or the row index five of c but how do we simultaneously embed all
16:30
but how do we simultaneously embed all
16:30
but how do we simultaneously embed all of these 32 by three integers stored in
16:32
of these 32 by three integers stored in
16:32
of these 32 by three integers stored in array x
16:34
array x
16:34
array x luckily pytorch indexing is fairly
16:36
luckily pytorch indexing is fairly
16:36
luckily pytorch indexing is fairly flexible and quite powerful
16:38
flexible and quite powerful
16:38
flexible and quite powerful so it doesn't just work to
16:41
so it doesn't just work to
16:41
so it doesn't just work to ask for a single element five like this
16:44
ask for a single element five like this
16:44
ask for a single element five like this you can actually index using lists so
16:46
you can actually index using lists so
16:46
you can actually index using lists so for example we can get the rows five six
16:48
for example we can get the rows five six
16:48
for example we can get the rows five six and seven
16:49
and seven
16:49
and seven and this will just work like this we can
16:51
and this will just work like this we can
16:51
and this will just work like this we can index with a list
16:53
index with a list
16:53
index with a list it doesn't just have to be a list it can
16:55
it doesn't just have to be a list it can
16:55
it doesn't just have to be a list it can also be a actually a tensor of integers
16:58
also be a actually a tensor of integers
16:58
also be a actually a tensor of integers and we can index with that
17:00
and we can index with that
17:00
and we can index with that so this is a integer tensor 567 and this
17:03
so this is a integer tensor 567 and this
17:03
so this is a integer tensor 567 and this will just work as well
17:05
will just work as well
17:05
will just work as well in fact we can also for example repeat
17:08
in fact we can also for example repeat
17:08
in fact we can also for example repeat row 7 and retrieve it multiple times
17:10
row 7 and retrieve it multiple times
17:10
row 7 and retrieve it multiple times and
17:12
and
17:12
and that same index will just get embedded
17:14
that same index will just get embedded
17:14
that same index will just get embedded multiple times here
17:16
multiple times here
17:16
multiple times here so here we are indexing with a
17:17
so here we are indexing with a
17:17
so here we are indexing with a one-dimensional
17:18
one-dimensional
17:18
one-dimensional tensor of integers but it turns out that
17:21
tensor of integers but it turns out that
17:21
tensor of integers but it turns out that you can also index with
17:22
you can also index with
17:22
you can also index with multi-dimensional tensors of integers
17:25
multi-dimensional tensors of integers
17:25
multi-dimensional tensors of integers here we have a two-dimensional in tensor
17:27
here we have a two-dimensional in tensor
17:27
here we have a two-dimensional in tensor of integers so we can simply just do c
17:30
of integers so we can simply just do c
17:30
of integers so we can simply just do c at x and this just works
17:34
at x and this just works
17:34
at x and this just works and the shape of this
17:36
and the shape of this
17:36
and the shape of this is
17:37
is
17:37
is 32 by 3 which is the original shape and
17:40
32 by 3 which is the original shape and
17:40
32 by 3 which is the original shape and now for every one of those 32 by 3
17:41
now for every one of those 32 by 3
17:41
now for every one of those 32 by 3 integers we've retrieved the embedding
17:43
integers we've retrieved the embedding
17:43
integers we've retrieved the embedding vector
17:44
vector
17:44
vector here so basically we have that as an
17:48
here so basically we have that as an
17:48
here so basically we have that as an example
17:49
example
17:49
example the 13th or example index 13
17:54
the 13th or example index 13
17:54
the 13th or example index 13 the second dimension is the integer 1 as
17:57
the second dimension is the integer 1 as
17:57
the second dimension is the integer 1 as an example
17:58
an example
17:58
an example and so
17:59
and so
17:59
and so here if we do c of x which gives us that
18:02
here if we do c of x which gives us that
18:02
here if we do c of x which gives us that array and then we index into 13 by two
18:05
array and then we index into 13 by two
18:05
array and then we index into 13 by two of that array
18:07
of that array
18:07
of that array then we we get the embedding
18:09
then we we get the embedding
18:09
then we we get the embedding here
18:10
here
18:10
here and you can verify that
18:12
and you can verify that
18:12
and you can verify that c
18:12
c
18:12
c at one which is the integer at that
18:15
at one which is the integer at that
18:15
at one which is the integer at that location is indeed equal to this
18:19
location is indeed equal to this
18:19
location is indeed equal to this you see they're equal
18:21
you see they're equal
18:21
you see they're equal so basically long story short pytorch
18:23
so basically long story short pytorch
18:23
so basically long story short pytorch indexing is awesome and to embed
18:26
indexing is awesome and to embed
18:26
indexing is awesome and to embed simultaneously all of the integers in x
18:29
simultaneously all of the integers in x
18:29
simultaneously all of the integers in x we can simply do c of x
18:31
we can simply do c of x
18:31
we can simply do c of x and that is our embedding
18:33
and that is our embedding
18:33
and that is our embedding and that just works
18:35
and that just works
18:35
and that just works now let's construct this layer here the
18:37
now let's construct this layer here the
18:37
now let's construct this layer here the hidden layer
18:38
hidden layer
18:38
hidden layer so we have that w1 as i'll call it are
18:42
so we have that w1 as i'll call it are
18:42
so we have that w1 as i'll call it are these weights which we will initialize
18:44
these weights which we will initialize
18:44
these weights which we will initialize randomly
18:45
randomly
18:46
randomly now the number of inputs to this layer
18:48
now the number of inputs to this layer
18:48
now the number of inputs to this layer is going to be
18:49
is going to be
18:49
is going to be three times two right because we have
18:51
three times two right because we have
18:51
three times two right because we have two dimensional embeddings and we have
18:52
two dimensional embeddings and we have
18:52
two dimensional embeddings and we have three of them
18:53
three of them
18:53
three of them so the number of inputs is 6
18:56
so the number of inputs is 6
18:56
so the number of inputs is 6 and the number of neurons in this layer
18:58
and the number of neurons in this layer
18:58
and the number of neurons in this layer is a variable up to us
19:00
is a variable up to us
19:00
is a variable up to us let's use 100 neurons as an example
19:02
let's use 100 neurons as an example
19:02
let's use 100 neurons as an example and then biases
19:04
and then biases
19:04
and then biases will be also initialized randomly as an
19:06
will be also initialized randomly as an
19:06
will be also initialized randomly as an example
19:07
example
19:07
example and let's and we just need 100 of them
19:11
and let's and we just need 100 of them
19:11
and let's and we just need 100 of them now the problem with this is we can't
19:13
now the problem with this is we can't
19:13
now the problem with this is we can't simply normally we would take the input
19:15
simply normally we would take the input
19:15
simply normally we would take the input in this case that's embedding and we'd
19:17
in this case that's embedding and we'd
19:17
in this case that's embedding and we'd like to multiply it with these weights
19:20
like to multiply it with these weights
19:20
like to multiply it with these weights and then we would like to add the bias
19:22
and then we would like to add the bias
19:22
and then we would like to add the bias this is roughly what we want to do
19:24
this is roughly what we want to do
19:24
this is roughly what we want to do but the problem here is that these
19:25
but the problem here is that these
19:25
but the problem here is that these embeddings are stacked up in the
19:27
embeddings are stacked up in the
19:27
embeddings are stacked up in the dimensions of this input tensor
19:29
dimensions of this input tensor
19:29
dimensions of this input tensor so this will not work this matrix
19:31
so this will not work this matrix
19:31
so this will not work this matrix multiplication because this is a shape
19:32
multiplication because this is a shape
19:32
multiplication because this is a shape 32 by 3 by 2 and i can't multiply that
19:35
32 by 3 by 2 and i can't multiply that
19:35
32 by 3 by 2 and i can't multiply that by 6 by 100
19:37
by 6 by 100
19:37
by 6 by 100 so somehow we need to concatenate these
19:39
so somehow we need to concatenate these
19:40
so somehow we need to concatenate these inputs here together so that we can do
19:41
inputs here together so that we can do
19:41
inputs here together so that we can do something along these lines which
19:43
something along these lines which
19:43
something along these lines which currently does not work
19:45
currently does not work
19:45
currently does not work so how do we transform this 32 by 3 by 2
19:47
so how do we transform this 32 by 3 by 2
19:47
so how do we transform this 32 by 3 by 2 into a 32 by 6 so that we can actually
19:50
into a 32 by 6 so that we can actually
19:50
into a 32 by 6 so that we can actually perform
19:51
perform
19:51
perform this
19:52
this
19:52
this multiplication over here i'd like to
19:54
multiplication over here i'd like to
19:54
multiplication over here i'd like to show you that there are usually many
19:56
show you that there are usually many
19:56
show you that there are usually many ways of
19:57
ways of
19:57
ways of implementing what you'd like to do in
19:59
implementing what you'd like to do in
19:59
implementing what you'd like to do in torch and some of them will be faster
20:01
torch and some of them will be faster
20:01
torch and some of them will be faster better shorter etc
20:03
better shorter etc
20:03
better shorter etc and that's because torch is a very large
20:06
and that's because torch is a very large
20:06
and that's because torch is a very large library and it's got lots and lots of
20:07
library and it's got lots and lots of
20:07
library and it's got lots and lots of functions so if you just go to the
20:09
functions so if you just go to the
20:09
functions so if you just go to the documentation and click on torch you'll
20:11
documentation and click on torch you'll
20:11
documentation and click on torch you'll see that my slider here is very tiny and
20:14
see that my slider here is very tiny and
20:14
see that my slider here is very tiny and that's because there are so many
20:15
that's because there are so many
20:15
that's because there are so many functions that you can call on these
20:16
functions that you can call on these
20:16
functions that you can call on these tensors
20:17
tensors
20:17
tensors to transform them create them multiply
20:20
to transform them create them multiply
20:20
to transform them create them multiply them add them perform all kinds of
20:21
them add them perform all kinds of
20:22
them add them perform all kinds of different operations on them
20:24
different operations on them
20:24
different operations on them and so this is kind of like
20:28
and so this is kind of like
20:28
and so this is kind of like the space of possibility if you will
20:31
the space of possibility if you will
20:31
the space of possibility if you will now one of the things that you can do is
20:32
now one of the things that you can do is
20:32
now one of the things that you can do is if we can control here ctrl f for
20:34
if we can control here ctrl f for
20:34
if we can control here ctrl f for concatenate and we see that there's a
20:36
concatenate and we see that there's a
20:36
concatenate and we see that there's a function
20:37
function
20:37
function torque.cat short for concatenate
20:40
torque.cat short for concatenate
20:40
torque.cat short for concatenate and this concatenates the given sequence
20:41
and this concatenates the given sequence
20:42
and this concatenates the given sequence of tensors in a given dimension
20:45
of tensors in a given dimension
20:45
of tensors in a given dimension and these sensors must have the same
20:46
and these sensors must have the same
20:46
and these sensors must have the same shape etc so we can use the concatenate
20:49
shape etc so we can use the concatenate
20:49
shape etc so we can use the concatenate operation to in a naive way concatenate
20:52
operation to in a naive way concatenate
20:52
operation to in a naive way concatenate these three embeddings for each input
20:55
these three embeddings for each input
20:56
these three embeddings for each input so in this case we have m of
20:58
so in this case we have m of
20:58
so in this case we have m of amp of the shape and really what we want
21:01
amp of the shape and really what we want
21:01
amp of the shape and really what we want to do is we want to retrieve these three
21:03
to do is we want to retrieve these three
21:03
to do is we want to retrieve these three parts and concatenate them
21:04
parts and concatenate them
21:04
parts and concatenate them so we want to grab all the examples
21:08
so we want to grab all the examples
21:08
so we want to grab all the examples we want to grab
21:10
we want to grab
21:10
we want to grab first the zeroth
21:13
first the zeroth
21:13
first the zeroth index
21:14
index
21:14
index and then all of
21:16
and then all of
21:16
and then all of this
21:17
this
21:17
this so this plucks out
21:20
so this plucks out
21:20
so this plucks out the 32 by 2 embeddings of just the first
21:24
the 32 by 2 embeddings of just the first
21:24
the 32 by 2 embeddings of just the first word here
21:26
word here
21:26
word here and so basically we want this guy
21:28
and so basically we want this guy
21:28
and so basically we want this guy we want the first dimension and we want
21:30
we want the first dimension and we want
21:30
we want the first dimension and we want the second dimension
21:32
the second dimension
21:32
the second dimension and these are the three pieces
21:34
and these are the three pieces
21:34
and these are the three pieces individually
21:36
individually
21:36
individually and then we want to treat this as a
21:37
and then we want to treat this as a
21:38
and then we want to treat this as a sequence and we want to torch that cat
21:40
sequence and we want to torch that cat
21:40
sequence and we want to torch that cat on that sequence so this is the list
21:43
on that sequence so this is the list
21:43
on that sequence so this is the list tor.cat takes a
21:45
tor.cat takes a
21:45
tor.cat takes a sequence
21:46
sequence
21:46
sequence of tensors and then we have to tell it
21:48
of tensors and then we have to tell it
21:48
of tensors and then we have to tell it along which dimension to concatenate
21:51
along which dimension to concatenate
21:51
along which dimension to concatenate so in this case all these are 32 by 2
21:53
so in this case all these are 32 by 2
21:53
so in this case all these are 32 by 2 and we want to concatenate not across
21:55
and we want to concatenate not across
21:55
and we want to concatenate not across dimension 0 by the cross dimension one
21:58
dimension 0 by the cross dimension one
21:58
dimension 0 by the cross dimension one so passing in one
21:59
so passing in one
22:00
so passing in one gives us a result
22:01
gives us a result
22:01
gives us a result the shape of this is 32 by 6 exactly as
22:04
the shape of this is 32 by 6 exactly as
22:04
the shape of this is 32 by 6 exactly as we'd like
22:05
we'd like
22:05
we'd like so that basically took 32 and squashed
22:08
so that basically took 32 and squashed
22:08
so that basically took 32 and squashed these by concatenating them into 32 by
22:10
these by concatenating them into 32 by
22:10
these by concatenating them into 32 by 6.
22:11
6.
22:11
6. now this is kind of ugly because this
22:13
now this is kind of ugly because this
22:13
now this is kind of ugly because this code would not generalize if we want to
22:15
code would not generalize if we want to
22:15
code would not generalize if we want to later change the block size right now we
22:17
later change the block size right now we
22:17
later change the block size right now we have three inputs
22:19
have three inputs
22:19
have three inputs three words but what if we had five
22:22
three words but what if we had five
22:22
three words but what if we had five then here we would have to change the
22:23
then here we would have to change the
22:23
then here we would have to change the code because i'm indexing directly well
22:26
code because i'm indexing directly well
22:26
code because i'm indexing directly well torch comes to rescue again because that
22:28
torch comes to rescue again because that
22:28
torch comes to rescue again because that turns out to be a function called unbind
22:31
turns out to be a function called unbind
22:31
turns out to be a function called unbind and it removes a tensor dimension
22:35
so it removes the tensor dimension
22:36
so it removes the tensor dimension
22:36
so it removes the tensor dimension returns a tuple of all slices along a
22:39
returns a tuple of all slices along a
22:39
returns a tuple of all slices along a given dimension
22:40
given dimension
22:40
given dimension without it
22:41
without it
22:41
without it so this is exactly what we need
22:43
so this is exactly what we need
22:43
so this is exactly what we need and basically when we call torch dot
22:45
and basically when we call torch dot
22:45
and basically when we call torch dot unbind
22:47
unbind
22:48
unbind torch dot unbind
22:50
torch dot unbind
22:50
torch dot unbind of m
22:51
of m
22:51
of m and pass in dimension
22:54
and pass in dimension
22:54
and pass in dimension 1 index 1
22:56
1 index 1
22:56
1 index 1 this gives us a list of
22:59
this gives us a list of
22:59
this gives us a list of a list of tensors exactly equivalent to
23:01
a list of tensors exactly equivalent to
23:01
a list of tensors exactly equivalent to this
23:02
this
23:02
this so running this
23:04
so running this
23:04
so running this gives us a line
23:06
gives us a line
23:06
gives us a line 3
23:07
3
23:07
3 and it's exactly this list and so we can
23:09
and it's exactly this list and so we can
23:09
and it's exactly this list and so we can call torch.cat on it
23:12
call torch.cat on it
23:12
call torch.cat on it and along the first dimension
23:14
and along the first dimension
23:14
and along the first dimension and this works
23:16
and this works
23:16
and this works and this shape is the same
23:19
and this shape is the same
23:19
and this shape is the same but now this is uh it doesn't matter if
23:21
but now this is uh it doesn't matter if
23:21
but now this is uh it doesn't matter if we have block size 3 or 5 or 10 this
23:23
we have block size 3 or 5 or 10 this
23:23
we have block size 3 or 5 or 10 this will just work
23:24
will just work
23:24
will just work so this is one way to do it but it turns
23:26
so this is one way to do it but it turns
23:26
so this is one way to do it but it turns out that in this case there's actually a
23:28
out that in this case there's actually a
23:28
out that in this case there's actually a significantly better and more efficient
23:30
significantly better and more efficient
23:30
significantly better and more efficient way and this gives me an opportunity to
23:32
way and this gives me an opportunity to
23:32
way and this gives me an opportunity to hint at some of the internals of
23:34
hint at some of the internals of
23:34
hint at some of the internals of torch.tensor
23:36
torch.tensor
23:36
torch.tensor so let's create
23:37
so let's create
23:37
so let's create an array here
23:40
an array here
23:40
an array here of elements from 0 to 17
23:42
of elements from 0 to 17
23:42
of elements from 0 to 17 and the shape of this
23:44
and the shape of this
23:44
and the shape of this is just 18. it's a single picture of 18
23:46
is just 18. it's a single picture of 18
23:46
is just 18. it's a single picture of 18 numbers
23:47
numbers
23:47
numbers it turns out that we can very quickly
23:49
it turns out that we can very quickly
23:49
it turns out that we can very quickly re-represent this as different sized and
23:52
re-represent this as different sized and
23:52
re-represent this as different sized and dimensional tensors
23:54
dimensional tensors
23:54
dimensional tensors we do this by calling a view
23:57
we do this by calling a view
23:57
we do this by calling a view and we can say that actually this is not
23:59
and we can say that actually this is not
23:59
and we can say that actually this is not a single vector of 18 this is a two by
24:02
a single vector of 18 this is a two by
24:02
a single vector of 18 this is a two by nine tensor or alternatively this is a
24:05
nine tensor or alternatively this is a
24:05
nine tensor or alternatively this is a nine by two tensor
24:07
nine by two tensor
24:08
nine by two tensor or this is actually a three by three by
24:10
or this is actually a three by three by
24:10
or this is actually a three by three by two tensor
24:11
two tensor
24:11
two tensor as long as the total number of elements
24:13
as long as the total number of elements
24:13
as long as the total number of elements here multiply to be the same
24:16
here multiply to be the same
24:16
here multiply to be the same this will just work and
24:18
this will just work and
24:18
this will just work and in pytorch this operation calling that
24:21
in pytorch this operation calling that
24:21
in pytorch this operation calling that view is extremely efficient
24:23
view is extremely efficient
24:24
view is extremely efficient and the reason for that is that
24:25
and the reason for that is that
24:26
and the reason for that is that in each tensor there's something called
24:27
in each tensor there's something called
24:27
in each tensor there's something called the underlying storage
24:30
the underlying storage
24:30
the underlying storage and the storage is just the numbers
24:32
and the storage is just the numbers
24:32
and the storage is just the numbers always as a one-dimensional vector and
24:34
always as a one-dimensional vector and
24:34
always as a one-dimensional vector and this is how this tensor is represented
24:37
this is how this tensor is represented
24:37
this is how this tensor is represented in the computer memory it's always a
24:38
in the computer memory it's always a
24:38
in the computer memory it's always a one-dimensional vector
24:41
one-dimensional vector
24:41
one-dimensional vector but when we call that view we are
24:43
but when we call that view we are
24:44
but when we call that view we are manipulating some of attributes of that
24:46
manipulating some of attributes of that
24:46
manipulating some of attributes of that tensor that dictate how this
24:48
tensor that dictate how this
24:48
tensor that dictate how this one-dimensional sequence is interpreted
24:50
one-dimensional sequence is interpreted
24:50
one-dimensional sequence is interpreted to be an n-dimensional tensor
24:53
to be an n-dimensional tensor
24:53
to be an n-dimensional tensor and so what's happening here is that no
24:55
and so what's happening here is that no
24:55
and so what's happening here is that no memory is being changed copied moved or
24:57
memory is being changed copied moved or
24:57
memory is being changed copied moved or created when we call that view the
24:59
created when we call that view the
24:59
created when we call that view the storage
25:00
storage
25:00
storage is identical but when you call that view
25:03
is identical but when you call that view
25:03
is identical but when you call that view some of the internal
25:05
some of the internal
25:05
some of the internal attributes of the view of the sensor are
25:07
attributes of the view of the sensor are
25:07
attributes of the view of the sensor are being manipulated and changed in
25:09
being manipulated and changed in
25:09
being manipulated and changed in particular that's something there's
25:10
particular that's something there's
25:10
particular that's something there's something called a storage offset
25:12
something called a storage offset
25:12
something called a storage offset strides and shapes and those are
25:14
strides and shapes and those are
25:14
strides and shapes and those are manipulated so that this one-dimensional
25:15
manipulated so that this one-dimensional
25:16
manipulated so that this one-dimensional sequence of bytes is seen as different
25:18
sequence of bytes is seen as different
25:18
sequence of bytes is seen as different and dimensional arrays
25:20
and dimensional arrays
25:20
and dimensional arrays there's a blog post here from eric
25:22
there's a blog post here from eric
25:22
there's a blog post here from eric called pi torch internals where he goes
25:25
called pi torch internals where he goes
25:25
called pi torch internals where he goes into some of this with respect to tensor
25:27
into some of this with respect to tensor
25:27
into some of this with respect to tensor and how the view of the tensor is
25:29
and how the view of the tensor is
25:29
and how the view of the tensor is represented
25:30
represented
25:30
represented and this is really just like a logical
25:32
and this is really just like a logical
25:32
and this is really just like a logical construct of representing the physical
25:34
construct of representing the physical
25:34
construct of representing the physical memory
25:35
memory
25:35
memory and so this is a pretty good um blog
25:38
and so this is a pretty good um blog
25:38
and so this is a pretty good um blog post that you can go into i might also
25:39
post that you can go into i might also
25:40
post that you can go into i might also create an entire video on the internals
25:41
create an entire video on the internals
25:41
create an entire video on the internals of torch tensor and how this works
25:44
of torch tensor and how this works
25:44
of torch tensor and how this works for here we just note that this is an
25:46
for here we just note that this is an
25:46
for here we just note that this is an extremely efficient operation
25:48
extremely efficient operation
25:48
extremely efficient operation and if i delete this and come back to
25:50
and if i delete this and come back to
25:50
and if i delete this and come back to our end
25:53
our end
25:53
our end we see that the shape of our end is 32
25:55
we see that the shape of our end is 32
25:55
we see that the shape of our end is 32 by three by two but we can simply
25:58
by three by two but we can simply
25:58
by three by two but we can simply ask for pytorch to view this instead as
26:00
ask for pytorch to view this instead as
26:00
ask for pytorch to view this instead as a 32 by six
26:02
a 32 by six
26:02
a 32 by six and the way this gets flattened into a
26:05
and the way this gets flattened into a
26:05
and the way this gets flattened into a 32 by six array
26:07
32 by six array
26:07
32 by six array just happens that
26:09
just happens that
26:09
just happens that these two
26:10
these two
26:10
these two get stacked up
26:12
get stacked up
26:12
get stacked up in a single row and so that's basically
26:14
in a single row and so that's basically
26:14
in a single row and so that's basically the concatenation operation that we're
26:15
the concatenation operation that we're
26:15
the concatenation operation that we're after
26:17
after
26:17
after and you can verify that this actually
26:18
and you can verify that this actually
26:18
and you can verify that this actually gives the exact same result as what we
26:20
gives the exact same result as what we
26:20
gives the exact same result as what we had before
26:22
had before
26:22
had before so this is an element y equals and you
26:23
so this is an element y equals and you
26:23
so this is an element y equals and you can see that all the elements of these
26:25
can see that all the elements of these
26:25
can see that all the elements of these two tensors are the same
26:27
two tensors are the same
26:27
two tensors are the same and so we get the exact same result
26:30
and so we get the exact same result
26:30
and so we get the exact same result so long story short we can actually just
26:33
so long story short we can actually just
26:33
so long story short we can actually just come here
26:34
come here
26:34
come here and if we just view this as a 32x6
26:38
instead then this multiplication will
26:40
instead then this multiplication will
26:40
instead then this multiplication will work and give us the hidden states that
26:43
work and give us the hidden states that
26:43
work and give us the hidden states that we're after
26:44
we're after
26:44
we're after so if this is h
26:45
so if this is h
26:45
so if this is h then h shape is now
26:48
then h shape is now
26:48
then h shape is now the 100 dimensional activations for
26:51
the 100 dimensional activations for
26:51
the 100 dimensional activations for every one of our 32 examples
26:53
every one of our 32 examples
26:53
every one of our 32 examples and this gives the desired result let me
26:55
and this gives the desired result let me
26:55
and this gives the desired result let me do two things here number one let's not
26:57
do two things here number one let's not
26:57
do two things here number one let's not use 32 we can for example do something
27:00
use 32 we can for example do something
27:00
use 32 we can for example do something like
27:03
m.shape at 0
27:05
m.shape at 0
27:05
m.shape at 0 so that we don't hard code these numbers
27:07
so that we don't hard code these numbers
27:07
so that we don't hard code these numbers and this would work for any size of this
27:09
and this would work for any size of this
27:09
and this would work for any size of this amp
27:10
amp
27:10
amp or alternatively we can also do negative
27:11
or alternatively we can also do negative
27:11
or alternatively we can also do negative one when we do negative one pi torch
27:14
one when we do negative one pi torch
27:14
one when we do negative one pi torch will infer what this should be
27:16
will infer what this should be
27:16
will infer what this should be because the number of elements must be
27:17
because the number of elements must be
27:17
because the number of elements must be the same and we're saying that this is 6
27:20
the same and we're saying that this is 6
27:20
the same and we're saying that this is 6 by church will derive that this must be
27:21
by church will derive that this must be
27:21
by church will derive that this must be 32 or whatever else it is if m is of
27:24
32 or whatever else it is if m is of
27:24
32 or whatever else it is if m is of different size
27:26
different size
27:26
different size the other thing is here um
27:29
the other thing is here um
27:29
the other thing is here um one more thing i'd like to point out is
27:33
one more thing i'd like to point out is
27:33
one more thing i'd like to point out is here when we do the concatenation
27:35
here when we do the concatenation
27:35
here when we do the concatenation this actually is much less efficient
27:37
this actually is much less efficient
27:37
this actually is much less efficient because um this concatenation would
27:39
because um this concatenation would
27:39
because um this concatenation would create a whole new tensor with a whole
27:41
create a whole new tensor with a whole
27:41
create a whole new tensor with a whole new storage so new memory is being
27:43
new storage so new memory is being
27:43
new storage so new memory is being created because there's no way to
27:44
created because there's no way to
27:44
created because there's no way to concatenate tensors just by manipulating
27:46
concatenate tensors just by manipulating
27:46
concatenate tensors just by manipulating the view attributes
27:48
the view attributes
27:48
the view attributes so this is inefficient and creates all
27:50
so this is inefficient and creates all
27:50
so this is inefficient and creates all kinds of new memory
27:52
kinds of new memory
27:52
kinds of new memory uh so let me delete this now
27:55
uh so let me delete this now
27:55
uh so let me delete this now we don't need this
27:57
we don't need this
27:57
we don't need this and here to calculate h we want to also
27:59
and here to calculate h we want to also
27:59
and here to calculate h we want to also dot 10h
28:01
dot 10h
28:01
dot 10h of this
28:02
of this
28:02
of this to get our
28:04
to get our
28:04
to get our oops to get our h
28:07
oops to get our h
28:07
oops to get our h so these are now numbers between
28:08
so these are now numbers between
28:08
so these are now numbers between negative one and one because of the 10h
28:10
negative one and one because of the 10h
28:10
negative one and one because of the 10h and we have
28:11
and we have
28:11
and we have that the shape is 32 by 100
28:13
that the shape is 32 by 100
28:14
that the shape is 32 by 100 and that is basically this hidden layer
28:15
and that is basically this hidden layer
28:15
and that is basically this hidden layer of activations here
28:17
of activations here
28:17
of activations here for every one of our 32 examples
28:20
for every one of our 32 examples
28:20
for every one of our 32 examples now there's one more thing i've lost
28:21
now there's one more thing i've lost
28:21
now there's one more thing i've lost over that we have to be very careful
28:23
over that we have to be very careful
28:23
over that we have to be very careful with and that this
28:24
with and that this
28:24
with and that this and that's this plus here
28:26
and that's this plus here
28:26
and that's this plus here in particular we want to make sure that
28:27
in particular we want to make sure that
28:27
in particular we want to make sure that the broadcasting will do what we like
28:30
the broadcasting will do what we like
28:30
the broadcasting will do what we like the shape of this is 32 by 100 and the
28:33
the shape of this is 32 by 100 and the
28:33
the shape of this is 32 by 100 and the ones shape is 100.
28:35
ones shape is 100.
28:35
ones shape is 100. so we see that the addition here will
28:37
so we see that the addition here will
28:37
so we see that the addition here will broadcast these two and in particular we
28:39
broadcast these two and in particular we
28:40
broadcast these two and in particular we have 32 by 100 broadcasting to 100.
28:44
have 32 by 100 broadcasting to 100.
28:44
have 32 by 100 broadcasting to 100. so broadcasting will align on the right
28:47
so broadcasting will align on the right
28:47
so broadcasting will align on the right create a fake dimension here so this
28:49
create a fake dimension here so this
28:49
create a fake dimension here so this will become a 1 by 100 row vector and
28:51
will become a 1 by 100 row vector and
28:51
will become a 1 by 100 row vector and then it will copy vertically
28:54
then it will copy vertically
28:54
then it will copy vertically for every one of these rows of 32 and do
28:57
for every one of these rows of 32 and do
28:57
for every one of these rows of 32 and do an element wise addition
28:58
an element wise addition
28:58
an element wise addition so in this case the correct thing will
29:00
so in this case the correct thing will
29:00
so in this case the correct thing will be happening because the same bias
29:02
be happening because the same bias
29:02
be happening because the same bias vector will be added to all the rows
29:05
vector will be added to all the rows
29:05
vector will be added to all the rows of
29:06
of
29:06
of this matrix so that is correct that's
29:08
this matrix so that is correct that's
29:08
this matrix so that is correct that's what we'd like and it's always good
29:11
what we'd like and it's always good
29:11
what we'd like and it's always good practice you just make sure
29:12
practice you just make sure
29:12
practice you just make sure so that you don't shoot yourself in the
29:14
so that you don't shoot yourself in the
29:14
so that you don't shoot yourself in the foot and finally let's create the final
29:16
foot and finally let's create the final
29:16
foot and finally let's create the final layer here
29:17
layer here
29:17
layer here so let's create
29:19
so let's create
29:19
so let's create w2 and v2
29:22
w2 and v2
29:22
w2 and v2 the input now is 100
29:24
the input now is 100
29:24
the input now is 100 and the output number of neurons will be
29:26
and the output number of neurons will be
29:26
and the output number of neurons will be for us 27 because we have 27 possible
29:29
for us 27 because we have 27 possible
29:29
for us 27 because we have 27 possible characters that come next
29:31
characters that come next
29:31
characters that come next so the biases will be 27 as well
29:34
so the biases will be 27 as well
29:34
so the biases will be 27 as well so therefore the logits which are the
29:36
so therefore the logits which are the
29:36
so therefore the logits which are the outputs of this neural net
29:38
outputs of this neural net
29:38
outputs of this neural net are going to be um
29:41
are going to be um
29:41
are going to be um h
29:42
h
29:42
h multiplied by w2 plus b2
29:47
multiplied by w2 plus b2
29:47
multiplied by w2 plus b2 logistic shape is 32 by 27
29:50
logistic shape is 32 by 27
29:50
logistic shape is 32 by 27 and the logits look
29:52
and the logits look
29:52
and the logits look good now exactly as we saw in the
29:54
good now exactly as we saw in the
29:54
good now exactly as we saw in the previous video we want to take these
29:55
previous video we want to take these
29:55
previous video we want to take these logits and we want to first exponentiate
29:57
logits and we want to first exponentiate
29:58
logits and we want to first exponentiate them to get our fake counts
29:59
them to get our fake counts
30:00
them to get our fake counts and then we want to normalize them into
30:01
and then we want to normalize them into
30:01
and then we want to normalize them into a probability
30:02
a probability
30:02
a probability so prob is counts divide
30:05
so prob is counts divide
30:05
so prob is counts divide and now
30:06
and now
30:06
and now counts dot sum along the first dimension
30:10
counts dot sum along the first dimension
30:10
counts dot sum along the first dimension and keep them as true exactly as in the
30:12
and keep them as true exactly as in the
30:12
and keep them as true exactly as in the previous video
30:14
previous video
30:14
previous video and so
30:15
and so
30:16
and so prob that shape now is 32 by 27
30:20
prob that shape now is 32 by 27
30:20
prob that shape now is 32 by 27 and you'll see that every row of prob
30:23
and you'll see that every row of prob
30:23
and you'll see that every row of prob sums to one so it's normalized
30:26
sums to one so it's normalized
30:26
sums to one so it's normalized so that gives us the probabilities now
30:28
so that gives us the probabilities now
30:28
so that gives us the probabilities now of course we have the actual letter that
30:30
of course we have the actual letter that
30:30
of course we have the actual letter that comes next and that comes from this
30:32
comes next and that comes from this
30:32
comes next and that comes from this array y
30:34
array y
30:34
array y which we which we created during the
30:36
which we which we created during the
30:36
which we which we created during the dataset creation so why is this last
30:39
dataset creation so why is this last
30:39
dataset creation so why is this last piece here which is the identity of the
30:40
piece here which is the identity of the
30:40
piece here which is the identity of the next character in the sequence that we'd
30:42
next character in the sequence that we'd
30:42
next character in the sequence that we'd like to now predict
30:44
like to now predict
30:44
like to now predict so what we'd like to do now is just as
30:46
so what we'd like to do now is just as
30:46
so what we'd like to do now is just as in the previous video we'd like to index
30:48
in the previous video we'd like to index
30:48
in the previous video we'd like to index into the rows of prob and in each row
30:51
into the rows of prob and in each row
30:51
into the rows of prob and in each row we'd like to pluck out the probability
30:52
we'd like to pluck out the probability
30:52
we'd like to pluck out the probability assigned to the correct character
30:55
assigned to the correct character
30:55
assigned to the correct character as given here
30:56
as given here
30:56
as given here so first we have torch.range of 32 which
31:00
so first we have torch.range of 32 which
31:00
so first we have torch.range of 32 which is kind of like a iterator over
31:03
is kind of like a iterator over
31:03
is kind of like a iterator over numbers from 0 to 31
31:05
numbers from 0 to 31
31:05
numbers from 0 to 31 and then we can index into prob in the
31:07
and then we can index into prob in the
31:07
and then we can index into prob in the following way
31:09
following way
31:09
following way prop in
31:10
prop in
31:10
prop in torch.range of 32 which iterates the
31:12
torch.range of 32 which iterates the
31:12
torch.range of 32 which iterates the roads and in each row we'd like to grab
31:15
roads and in each row we'd like to grab
31:15
roads and in each row we'd like to grab this column as given by y
31:19
this column as given by y
31:19
this column as given by y so this gives the current probabilities
31:21
so this gives the current probabilities
31:21
so this gives the current probabilities as assigned by this neural network with
31:23
as assigned by this neural network with
31:23
as assigned by this neural network with this setting of its weights
31:24
this setting of its weights
31:24
this setting of its weights to the correct character in the sequence
31:27
to the correct character in the sequence
31:27
to the correct character in the sequence and you can see here that this looks
31:29
and you can see here that this looks
31:29
and you can see here that this looks okay for some of these characters like
31:30
okay for some of these characters like
31:30
okay for some of these characters like this is basically 0.2
31:32
this is basically 0.2
31:32
this is basically 0.2 but it doesn't look very good at all for
31:34
but it doesn't look very good at all for
31:34
but it doesn't look very good at all for many other characters like this is
31:37
many other characters like this is
31:37
many other characters like this is 0.0701 probability and so the network
31:40
0.0701 probability and so the network
31:40
0.0701 probability and so the network thinks that some of these are extremely
31:42
thinks that some of these are extremely
31:42
thinks that some of these are extremely unlikely but of course we haven't
31:43
unlikely but of course we haven't
31:43
unlikely but of course we haven't trained the neural network yet so
31:47
trained the neural network yet so
31:47
trained the neural network yet so this will improve and ideally all of
31:48
this will improve and ideally all of
31:48
this will improve and ideally all of these numbers here of course are one
31:50
these numbers here of course are one
31:50
these numbers here of course are one because then we are correctly predicting
31:52
because then we are correctly predicting
31:52
because then we are correctly predicting the next character
31:53
the next character
31:53
the next character now just as in the previous video we
31:55
now just as in the previous video we
31:55
now just as in the previous video we want to take these probabilities we want
31:57
want to take these probabilities we want
31:57
want to take these probabilities we want to look at the lock probability
31:59
to look at the lock probability
31:59
to look at the lock probability and then we want to look at the average
32:00
and then we want to look at the average
32:00
and then we want to look at the average probability
32:02
probability
32:02
probability and the negative of it to create the
32:04
and the negative of it to create the
32:04
and the negative of it to create the negative log likelihood loss
32:07
negative log likelihood loss
32:07
negative log likelihood loss so the loss here is 17
32:10
so the loss here is 17
32:10
so the loss here is 17 and this is the loss that we'd like to
32:11
and this is the loss that we'd like to
32:11
and this is the loss that we'd like to minimize to get the network to predict
32:14
minimize to get the network to predict
32:14
minimize to get the network to predict the correct character in the sequence
32:16
the correct character in the sequence
32:16
the correct character in the sequence okay so i rewrote everything here and
32:18
okay so i rewrote everything here and
32:18
okay so i rewrote everything here and made it a bit more respectable so here's
32:20
made it a bit more respectable so here's
32:20
made it a bit more respectable so here's our data set here's all the parameters
32:22
our data set here's all the parameters
32:22
our data set here's all the parameters that we defined
32:24
that we defined
32:24
that we defined i'm now using a generator to make it
32:26
i'm now using a generator to make it
32:26
i'm now using a generator to make it reproducible
32:27
reproducible
32:27
reproducible i clustered all the parameters into a
32:29
i clustered all the parameters into a
32:29
i clustered all the parameters into a single list of parameters so that for
32:31
single list of parameters so that for
32:31
single list of parameters so that for example it's easy to count them and see
32:33
example it's easy to count them and see
32:33
example it's easy to count them and see that in total we currently have about
32:34
that in total we currently have about
32:34
that in total we currently have about 3400 parameters
32:37
3400 parameters
32:37
3400 parameters and this is the forward pass as we
32:38
and this is the forward pass as we
32:38
and this is the forward pass as we developed it
32:39
developed it
32:39
developed it and we arrive at a single number here
32:41
and we arrive at a single number here
32:41
and we arrive at a single number here the loss that is currently expressing
32:43
the loss that is currently expressing
32:44
the loss that is currently expressing how well
32:45
how well
32:45
how well this neural network works with the
32:46
this neural network works with the
32:46
this neural network works with the current setting of parameters
32:48
current setting of parameters
32:48
current setting of parameters now i would like to make it even more
32:50
now i would like to make it even more
32:50
now i would like to make it even more respectable so in particular see these
32:52
respectable so in particular see these
32:52
respectable so in particular see these lines here where we take the logits and
32:54
lines here where we take the logits and
32:54
lines here where we take the logits and we calculate the loss
32:57
we calculate the loss
32:57
we calculate the loss we're not actually reinventing the wheel
32:59
we're not actually reinventing the wheel
32:59
we're not actually reinventing the wheel here this is just um
33:01
here this is just um
33:01
here this is just um classification and many people use
33:03
classification and many people use
33:03
classification and many people use classification and that's why there is a
33:05
classification and that's why there is a
33:05
classification and that's why there is a functional.cross entropy function in
33:07
functional.cross entropy function in
33:07
functional.cross entropy function in pytorch to calculate this much more
33:09
pytorch to calculate this much more
33:09
pytorch to calculate this much more efficiently
33:10
efficiently
33:10
efficiently so we can just simply call f.cross
33:12
so we can just simply call f.cross
33:12
so we can just simply call f.cross entropy
33:13
entropy
33:13
entropy and we can pass in the logits and we can
33:14
and we can pass in the logits and we can
33:14
and we can pass in the logits and we can pass in the
33:16
pass in the
33:16
pass in the array of targets y
33:18
array of targets y
33:18
array of targets y and this calculates the exact same loss
33:22
so in fact we can simply put this here
33:25
so in fact we can simply put this here
33:25
so in fact we can simply put this here and erase these three lines and we're
33:27
and erase these three lines and we're
33:27
and erase these three lines and we're going to get the exact same result now
33:29
going to get the exact same result now
33:29
going to get the exact same result now there are actually many good reasons to
33:31
there are actually many good reasons to
33:31
there are actually many good reasons to prefer f.cross entropy over rolling your
33:34
prefer f.cross entropy over rolling your
33:34
prefer f.cross entropy over rolling your own implementation like this i did this
33:36
own implementation like this i did this
33:36
own implementation like this i did this for educational reasons but you'd never
33:37
for educational reasons but you'd never
33:37
for educational reasons but you'd never use this in practice why is that
33:40
use this in practice why is that
33:40
use this in practice why is that number one when you use f.cross entropy
33:42
number one when you use f.cross entropy
33:42
number one when you use f.cross entropy by torch will not actually create all
33:44
by torch will not actually create all
33:44
by torch will not actually create all these intermediate tensors because these
33:46
these intermediate tensors because these
33:46
these intermediate tensors because these are all new tensors in memory and all
33:49
are all new tensors in memory and all
33:49
are all new tensors in memory and all this is fairly inefficient to run like
33:51
this is fairly inefficient to run like
33:51
this is fairly inefficient to run like this instead pytorch will cluster up all
33:53
this instead pytorch will cluster up all
33:54
this instead pytorch will cluster up all these operations and very often create
33:56
these operations and very often create
33:56
these operations and very often create have fused kernels that very efficiently
33:59
have fused kernels that very efficiently
33:59
have fused kernels that very efficiently evaluate these expressions that are sort
34:01
evaluate these expressions that are sort
34:01
evaluate these expressions that are sort of like clustered mathematical
34:03
of like clustered mathematical
34:03
of like clustered mathematical operations
34:04
operations
34:04
operations number two the backward pass can be made
34:06
number two the backward pass can be made
34:06
number two the backward pass can be made much more efficient and not just because
34:08
much more efficient and not just because
34:08
much more efficient and not just because it's a fused kernel but also
34:10
it's a fused kernel but also
34:10
it's a fused kernel but also analytically and mathematically it's
34:12
analytically and mathematically it's
34:12
analytically and mathematically it's much it's often a very much simpler
34:15
much it's often a very much simpler
34:15
much it's often a very much simpler backward pass to implement
34:17
backward pass to implement
34:17
backward pass to implement we actually sell this with micrograd
34:19
we actually sell this with micrograd
34:19
we actually sell this with micrograd you see here when we implemented 10h the
34:21
you see here when we implemented 10h the
34:22
you see here when we implemented 10h the forward pass of this operation to
34:23
forward pass of this operation to
34:23
forward pass of this operation to calculate the 10h was actually a fairly
34:25
calculate the 10h was actually a fairly
34:25
calculate the 10h was actually a fairly complicated mathematical expression
34:28
complicated mathematical expression
34:28
complicated mathematical expression but because it's a clustered
34:29
but because it's a clustered
34:29
but because it's a clustered mathematical expression when we did the
34:31
mathematical expression when we did the
34:31
mathematical expression when we did the backward pass we didn't individually
34:33
backward pass we didn't individually
34:33
backward pass we didn't individually backward through the x and the two times
34:35
backward through the x and the two times
34:35
backward through the x and the two times and the minus one in division etc we
34:38
and the minus one in division etc we
34:38
and the minus one in division etc we just said it's one minus t squared and
34:40
just said it's one minus t squared and
34:40
just said it's one minus t squared and that's a much simpler mathematical
34:42
that's a much simpler mathematical
34:42
that's a much simpler mathematical expression
34:43
expression
34:43
expression and we were able to do this because
34:44
and we were able to do this because
34:44
and we were able to do this because we're able to reuse calculations and
34:46
we're able to reuse calculations and
34:46
we're able to reuse calculations and because we are able to mathematically
34:48
because we are able to mathematically
34:48
because we are able to mathematically and analytically derive the derivative
34:50
and analytically derive the derivative
34:50
and analytically derive the derivative and often that expression simplifies
34:52
and often that expression simplifies
34:52
and often that expression simplifies mathematically and so there's much less
34:54
mathematically and so there's much less
34:54
mathematically and so there's much less to implement
34:56
to implement
34:56
to implement so not only can can it be made more
34:57
so not only can can it be made more
34:57
so not only can can it be made more efficient because it runs in a fused
34:59
efficient because it runs in a fused
34:59
efficient because it runs in a fused kernel but also because the expressions
35:01
kernel but also because the expressions
35:01
kernel but also because the expressions can take a much simpler form
35:03
can take a much simpler form
35:03
can take a much simpler form mathematically
35:05
mathematically
35:05
mathematically so that's number one number two
35:08
so that's number one number two
35:08
so that's number one number two under the hood f that cross entropy can
35:10
under the hood f that cross entropy can
35:10
under the hood f that cross entropy can also be significantly more um
35:13
also be significantly more um
35:13
also be significantly more um numerically well behaved let me show you
35:15
numerically well behaved let me show you
35:15
numerically well behaved let me show you an example of how this works
35:19
suppose we have a logits of negative 2 3
35:21
suppose we have a logits of negative 2 3
35:21
suppose we have a logits of negative 2 3 negative 3 0 and 5
35:23
negative 3 0 and 5
35:24
negative 3 0 and 5 and then we are taking the exponent of
35:25
and then we are taking the exponent of
35:25
and then we are taking the exponent of it and normalizing it to sum to 1.
35:27
it and normalizing it to sum to 1.
35:27
it and normalizing it to sum to 1. so
35:28
so
35:28
so when logits take on this values
35:30
when logits take on this values
35:30
when logits take on this values everything is well and good and we get a
35:31
everything is well and good and we get a
35:31
everything is well and good and we get a nice probability distribution
35:33
nice probability distribution
35:33
nice probability distribution now consider what happens when some of
35:35
now consider what happens when some of
35:35
now consider what happens when some of these logits take on more extreme values
35:37
these logits take on more extreme values
35:37
these logits take on more extreme values and that can happen during optimization
35:38
and that can happen during optimization
35:38
and that can happen during optimization of the neural network
35:40
of the neural network
35:40
of the neural network suppose that some of these numbers grow
35:42
suppose that some of these numbers grow
35:42
suppose that some of these numbers grow very negative like say negative 100
35:45
very negative like say negative 100
35:45
very negative like say negative 100 then actually everything will come out
35:47
then actually everything will come out
35:47
then actually everything will come out fine we still get the probabilities that
35:49
fine we still get the probabilities that
35:49
fine we still get the probabilities that um
35:50
um
35:50
um you know are well behaved and they sum
35:52
you know are well behaved and they sum
35:52
you know are well behaved and they sum to one and everything is great
35:54
to one and everything is great
35:54
to one and everything is great but because of the way the x works if
35:56
but because of the way the x works if
35:56
but because of the way the x works if you have very positive logits let's say
35:58
you have very positive logits let's say
35:58
you have very positive logits let's say positive 100 in here
36:00
positive 100 in here
36:00
positive 100 in here you actually start to run into trouble
36:02
you actually start to run into trouble
36:02
you actually start to run into trouble and we get not a number here
36:04
and we get not a number here
36:04
and we get not a number here and the reason for that is that these
36:06
and the reason for that is that these
36:06
and the reason for that is that these counts
36:08
have an if here
36:10
have an if here
36:10
have an if here so if you pass in a very negative number
36:12
so if you pass in a very negative number
36:12
so if you pass in a very negative number to x you just get a very negative sorry
36:15
to x you just get a very negative sorry
36:15
to x you just get a very negative sorry not negative but very small number very
36:17
not negative but very small number very
36:17
not negative but very small number very very near zero and that's fine
36:19
very near zero and that's fine
36:19
very near zero and that's fine but if you pass in a very positive
36:21
but if you pass in a very positive
36:21
but if you pass in a very positive number suddenly we run out of range in
36:23
number suddenly we run out of range in
36:23
number suddenly we run out of range in our floating point number that
36:25
our floating point number that
36:25
our floating point number that represents these counts
36:28
represents these counts
36:28
represents these counts so basically we're taking e and we're
36:29
so basically we're taking e and we're
36:29
so basically we're taking e and we're raising it to the power of 100 and that
36:31
raising it to the power of 100 and that
36:31
raising it to the power of 100 and that gives us if because we run out of
36:33
gives us if because we run out of
36:34
gives us if because we run out of dynamic range on this floating point
36:35
dynamic range on this floating point
36:35
dynamic range on this floating point number that is count
36:37
number that is count
36:37
number that is count and so we cannot pass very large logits
36:41
and so we cannot pass very large logits
36:41
and so we cannot pass very large logits through this expression
36:43
through this expression
36:43
through this expression now let me reset these numbers to
36:45
now let me reset these numbers to
36:45
now let me reset these numbers to something reasonable
36:47
something reasonable
36:47
something reasonable the way pi torch solved this
36:49
the way pi torch solved this
36:49
the way pi torch solved this is that
36:50
is that
36:50
is that you see how we have a well-behaved
36:52
you see how we have a well-behaved
36:52
you see how we have a well-behaved result here
36:53
result here
36:53
result here it turns out that because of the
36:54
it turns out that because of the
36:54
it turns out that because of the normalization here you can actually
36:56
normalization here you can actually
36:56
normalization here you can actually offset logits by any arbitrary constant
36:59
offset logits by any arbitrary constant
36:59
offset logits by any arbitrary constant value that you want so if i add 1 here
37:02
value that you want so if i add 1 here
37:02
value that you want so if i add 1 here you actually get the exact same result
37:04
you actually get the exact same result
37:04
you actually get the exact same result or if i add 2
37:06
or if i add 2
37:06
or if i add 2 or if i subtract three
37:08
or if i subtract three
37:08
or if i subtract three any offset will produce the exact same
37:10
any offset will produce the exact same
37:10
any offset will produce the exact same probabilities
37:12
probabilities
37:12
probabilities so because negative numbers are okay but
37:15
so because negative numbers are okay but
37:15
so because negative numbers are okay but positive numbers can actually overflow
37:16
positive numbers can actually overflow
37:16
positive numbers can actually overflow this x what patrick does is it
37:19
this x what patrick does is it
37:19
this x what patrick does is it internally calculates the maximum value
37:21
internally calculates the maximum value
37:21
internally calculates the maximum value that occurs in the logits and it
37:23
that occurs in the logits and it
37:23
that occurs in the logits and it subtracts it so in this case it would
37:25
subtracts it so in this case it would
37:25
subtracts it so in this case it would subtract five
37:26
subtract five
37:26
subtract five and so therefore the greatest number in
37:28
and so therefore the greatest number in
37:28
and so therefore the greatest number in logits will become zero and all the
37:30
logits will become zero and all the
37:30
logits will become zero and all the other numbers will become some negative
37:32
other numbers will become some negative
37:32
other numbers will become some negative numbers
37:33
numbers
37:33
numbers and then the result of this is always
37:35
and then the result of this is always
37:35
and then the result of this is always well behaved so even if we have 100 here
37:37
well behaved so even if we have 100 here
37:37
well behaved so even if we have 100 here previously
37:39
previously
37:39
previously not good but because pytorch will
37:41
not good but because pytorch will
37:41
not good but because pytorch will subtract 100 this will work
37:44
subtract 100 this will work
37:44
subtract 100 this will work and so there's many good reasons to call
37:46
and so there's many good reasons to call
37:46
and so there's many good reasons to call cross-entropy number one the forward
37:49
cross-entropy number one the forward
37:49
cross-entropy number one the forward pass can be much more efficient the
37:50
pass can be much more efficient the
37:50
pass can be much more efficient the backward pass can be much more efficient
37:53
backward pass can be much more efficient
37:53
backward pass can be much more efficient and also things can be much more
37:54
and also things can be much more
37:54
and also things can be much more numerically well behaved okay so let's
37:57
numerically well behaved okay so let's
37:57
numerically well behaved okay so let's now set up the training of this neural
37:58
now set up the training of this neural
37:58
now set up the training of this neural net
37:59
net
37:59
net we have the forward pass
38:02
uh we don't need these
38:04
uh we don't need these
38:04
uh we don't need these is that we have the losses equal to the
38:06
is that we have the losses equal to the
38:06
is that we have the losses equal to the f.cross entropy that's the forward pass
38:09
f.cross entropy that's the forward pass
38:09
f.cross entropy that's the forward pass then we need the backward pass first we
38:12
then we need the backward pass first we
38:12
then we need the backward pass first we want to set the gradients to be zero so
38:14
want to set the gradients to be zero so
38:14
want to set the gradients to be zero so for p in parameters
38:16
for p in parameters
38:16
for p in parameters we want to make sure that p dot grad is
38:18
we want to make sure that p dot grad is
38:18
we want to make sure that p dot grad is none which is the same as setting it to
38:19
none which is the same as setting it to
38:19
none which is the same as setting it to zero in pi torch
38:20
zero in pi torch
38:20
zero in pi torch and then lost that backward to populate
38:23
and then lost that backward to populate
38:23
and then lost that backward to populate those gradients
38:24
those gradients
38:24
those gradients once we have the gradients we can do the
38:25
once we have the gradients we can do the
38:25
once we have the gradients we can do the parameter update so for p in parameters
38:28
parameter update so for p in parameters
38:28
parameter update so for p in parameters we want to take all the
38:29
we want to take all the
38:30
we want to take all the data and we want to nudge it
38:32
data and we want to nudge it
38:32
data and we want to nudge it learning rate times p dot grad
38:36
and then we want to repeat this
38:39
and then we want to repeat this
38:39
and then we want to repeat this a few times
38:43
and let's print the loss here as well
38:48
now this won't suffice and it will
38:50
now this won't suffice and it will
38:50
now this won't suffice and it will create an error because we also have to
38:51
create an error because we also have to
38:51
create an error because we also have to go for pn parameters
38:53
go for pn parameters
38:53
go for pn parameters and we have to make sure that p dot
38:55
and we have to make sure that p dot
38:55
and we have to make sure that p dot requires grad is set to true in pi torch
38:59
requires grad is set to true in pi torch
38:59
requires grad is set to true in pi torch and this should just work
39:03
and this should just work
39:03
and this should just work okay so we started off with loss of 17
39:05
okay so we started off with loss of 17
39:05
okay so we started off with loss of 17 and we're decreasing it
39:08
and we're decreasing it
39:08
and we're decreasing it let's run longer
39:10
let's run longer
39:10
let's run longer and you see how the loss decreases
39:12
and you see how the loss decreases
39:12
and you see how the loss decreases a lot here so
39:17
if we just run for a thousand times
39:19
if we just run for a thousand times
39:19
if we just run for a thousand times we get a very very low loss and that
39:21
we get a very very low loss and that
39:21
we get a very very low loss and that means that we're making very good
39:22
means that we're making very good
39:22
means that we're making very good predictions now the reason that this is
39:25
predictions now the reason that this is
39:25
predictions now the reason that this is so straightforward right now
39:27
so straightforward right now
39:27
so straightforward right now is because we're only um
39:29
is because we're only um
39:29
is because we're only um overfitting 32 examples
39:32
overfitting 32 examples
39:32
overfitting 32 examples so we only have 32 examples uh of the
39:34
so we only have 32 examples uh of the
39:34
so we only have 32 examples uh of the first five words
39:36
first five words
39:36
first five words and therefore it's very easy to make
39:37
and therefore it's very easy to make
39:37
and therefore it's very easy to make this neural net fit only these two 32
39:40
this neural net fit only these two 32
39:40
this neural net fit only these two 32 examples because we have 3 400
39:43
examples because we have 3 400
39:43
examples because we have 3 400 parameters and only 32 examples so we're
39:46
parameters and only 32 examples so we're
39:46
parameters and only 32 examples so we're doing what's called overfitting a single
39:48
doing what's called overfitting a single
39:48
doing what's called overfitting a single batch of the data
39:50
batch of the data
39:50
batch of the data and getting a very low loss and good
39:52
and getting a very low loss and good
39:52
and getting a very low loss and good predictions
39:53
predictions
39:53
predictions um but that's just because we have so
39:55
um but that's just because we have so
39:55
um but that's just because we have so many parameters for so few examples so
39:56
many parameters for so few examples so
39:56
many parameters for so few examples so it's easy to
39:57
it's easy to
39:58
it's easy to uh make this be very low
40:00
uh make this be very low
40:00
uh make this be very low now we're not able to achieve exactly
40:01
now we're not able to achieve exactly
40:01
now we're not able to achieve exactly zero
40:02
zero
40:02
zero and the reason for that is we can for
40:04
and the reason for that is we can for
40:04
and the reason for that is we can for example look at logits which are being
40:06
example look at logits which are being
40:06
example look at logits which are being predicted
40:08
predicted
40:08
predicted and we can look at the max along the
40:11
and we can look at the max along the
40:11
and we can look at the max along the first dimension
40:13
first dimension
40:13
first dimension and in pi torch
40:15
and in pi torch
40:15
and in pi torch max reports both the actual values that
40:17
max reports both the actual values that
40:17
max reports both the actual values that take on the maximum number but also the
40:20
take on the maximum number but also the
40:20
take on the maximum number but also the indices of piece
40:22
indices of piece
40:22
indices of piece and you'll see that the indices are very
40:23
and you'll see that the indices are very
40:23
and you'll see that the indices are very close to the labels
40:26
close to the labels
40:26
close to the labels but in some cases they differ
40:28
but in some cases they differ
40:28
but in some cases they differ for example in this very first example
40:31
for example in this very first example
40:31
for example in this very first example the predicted index is 19 but the label
40:33
the predicted index is 19 but the label
40:33
the predicted index is 19 but the label is five
40:35
is five
40:35
is five and we're not able to make loss be zero
40:36
and we're not able to make loss be zero
40:36
and we're not able to make loss be zero and fundamentally that's because here
40:40
and fundamentally that's because here
40:40
and fundamentally that's because here the very first or the zeroth index is
40:42
the very first or the zeroth index is
40:42
the very first or the zeroth index is the example where dot dot dot is
40:44
the example where dot dot dot is
40:44
the example where dot dot dot is supposed to predict e but you see how
40:46
supposed to predict e but you see how
40:46
supposed to predict e but you see how dot dot dot is also supposed to predict
40:48
dot dot dot is also supposed to predict
40:48
dot dot dot is also supposed to predict an o and dot dot is also supposed to
40:50
an o and dot dot is also supposed to
40:50
an o and dot dot is also supposed to predict an i and then s as well and so
40:53
predict an i and then s as well and so
40:53
predict an i and then s as well and so basically e o a or s are all possible
40:57
basically e o a or s are all possible
40:57
basically e o a or s are all possible outcomes in a training set for the exact
40:59
outcomes in a training set for the exact
40:59
outcomes in a training set for the exact same input so we're not able to
41:01
same input so we're not able to
41:01
same input so we're not able to completely over fit and um
41:03
completely over fit and um
41:03
completely over fit and um and make the loss be exactly zero so but
41:06
and make the loss be exactly zero so but
41:06
and make the loss be exactly zero so but we're getting very close in the cases
41:08
we're getting very close in the cases
41:08
we're getting very close in the cases where
41:09
where
41:09
where there's a unique input for a unique
41:11
there's a unique input for a unique
41:11
there's a unique input for a unique output in those cases we do what's
41:13
output in those cases we do what's
41:13
output in those cases we do what's called overfit and we basically get the
41:15
called overfit and we basically get the
41:15
called overfit and we basically get the exact same and the exact correct result
41:19
exact same and the exact correct result
41:19
exact same and the exact correct result so now all we have to do
41:21
so now all we have to do
41:21
so now all we have to do is we just need to make sure that we
41:22
is we just need to make sure that we
41:22
is we just need to make sure that we read in the full data set and optimize
41:24
read in the full data set and optimize
41:24
read in the full data set and optimize the neural net
41:25
the neural net
41:25
the neural net okay so let's swing back up
41:27
okay so let's swing back up
41:27
okay so let's swing back up where we created the dataset
41:29
where we created the dataset
41:29
where we created the dataset and we see that here we only use the
41:30
and we see that here we only use the
41:30
and we see that here we only use the first five words so let me now erase
41:32
first five words so let me now erase
41:32
first five words so let me now erase this
41:33
this
41:33
this and let me erase the print statements
41:35
and let me erase the print statements
41:35
and let me erase the print statements otherwise we'd be printing way too much
41:37
otherwise we'd be printing way too much
41:38
otherwise we'd be printing way too much and so when we processed the full data
41:39
and so when we processed the full data
41:39
and so when we processed the full data set of all the words we now had 228 000
41:42
set of all the words we now had 228 000
41:42
set of all the words we now had 228 000 examples instead of just 32.
41:45
examples instead of just 32.
41:45
examples instead of just 32. so let's now scroll back down
41:47
so let's now scroll back down
41:47
so let's now scroll back down to this is much larger reinitialize the
41:49
to this is much larger reinitialize the
41:50
to this is much larger reinitialize the weights the same number of parameters
41:52
weights the same number of parameters
41:52
weights the same number of parameters they all require gradients
41:54
they all require gradients
41:54
they all require gradients and then let's push this print out
41:56
and then let's push this print out
41:56
and then let's push this print out lost.item to be here
41:58
lost.item to be here
41:58
lost.item to be here and let's just see how the optimization
41:59
and let's just see how the optimization
41:59
and let's just see how the optimization goes if we run this
42:04
okay so we started with a fairly high
42:05
okay so we started with a fairly high
42:05
okay so we started with a fairly high loss and then as we're optimizing the
42:07
loss and then as we're optimizing the
42:07
loss and then as we're optimizing the loss is coming down
42:12
but you'll notice that it takes quite a
42:13
but you'll notice that it takes quite a
42:13
but you'll notice that it takes quite a bit of time for every single iteration
42:15
bit of time for every single iteration
42:15
bit of time for every single iteration so let's actually address that because
42:17
so let's actually address that because
42:17
so let's actually address that because we're doing way too much work forwarding
42:19
we're doing way too much work forwarding
42:19
we're doing way too much work forwarding and backwarding 220 000 examples
42:22
and backwarding 220 000 examples
42:22
and backwarding 220 000 examples in practice what people usually do is
42:24
in practice what people usually do is
42:24
in practice what people usually do is they perform forward and backward pass
42:26
they perform forward and backward pass
42:26
they perform forward and backward pass and update on many batches of the data
42:29
and update on many batches of the data
42:29
and update on many batches of the data so what we will want to do is we want to
42:31
so what we will want to do is we want to
42:31
so what we will want to do is we want to randomly select some portion of the data
42:33
randomly select some portion of the data
42:33
randomly select some portion of the data set and that's a mini batch and then
42:35
set and that's a mini batch and then
42:35
set and that's a mini batch and then only forward backward and update on that
42:37
only forward backward and update on that
42:37
only forward backward and update on that little mini batch and then
42:40
little mini batch and then
42:40
little mini batch and then we iterate on those many batches
42:41
we iterate on those many batches
42:42
we iterate on those many batches so in pytorch we can for example use
42:43
so in pytorch we can for example use
42:43
so in pytorch we can for example use storage.randint
42:45
storage.randint
42:45
storage.randint we can generate numbers between 0 and 5
42:47
we can generate numbers between 0 and 5
42:47
we can generate numbers between 0 and 5 and make 32 of them
42:52
i believe the size has to be a
42:54
i believe the size has to be a
42:54
i believe the size has to be a tuple
42:56
tuple
42:56
tuple in my torch
42:57
in my torch
42:57
in my torch so we can have a tuple 32 of numbers
43:00
so we can have a tuple 32 of numbers
43:00
so we can have a tuple 32 of numbers between zero and five but actually we
43:02
between zero and five but actually we
43:02
between zero and five but actually we want x dot shape of zero here
43:05
want x dot shape of zero here
43:05
want x dot shape of zero here and so this creates uh integers that
43:08
and so this creates uh integers that
43:08
and so this creates uh integers that index into our data set and there's 32
43:10
index into our data set and there's 32
43:10
index into our data set and there's 32 of them so if our mini batch size is 32
43:13
of them so if our mini batch size is 32
43:13
of them so if our mini batch size is 32 then we can come here and we can first
43:14
then we can come here and we can first
43:14
then we can come here and we can first do a mini batch
43:17
do a mini batch
43:18
do a mini batch construct
43:20
construct
43:20
construct so in the integers that we want to
43:21
so in the integers that we want to
43:22
so in the integers that we want to optimize in this
43:24
optimize in this
43:24
optimize in this single iteration
43:25
single iteration
43:25
single iteration are in the ix
43:27
are in the ix
43:27
are in the ix and then we want to index into
43:29
and then we want to index into
43:29
and then we want to index into x
43:30
x
43:30
x with ix to only grab those rows
43:34
with ix to only grab those rows
43:34
with ix to only grab those rows so we're only getting 32 rows of x
43:36
so we're only getting 32 rows of x
43:36
so we're only getting 32 rows of x and therefore embeddings will again be
43:38
and therefore embeddings will again be
43:38
and therefore embeddings will again be 32 by three by two not two hundred
43:40
32 by three by two not two hundred
43:40
32 by three by two not two hundred thousand by three by two
43:43
thousand by three by two
43:43
thousand by three by two and then this ix has to be used not just
43:45
and then this ix has to be used not just
43:45
and then this ix has to be used not just to index into x
43:46
to index into x
43:46
to index into x but also to index into y
43:50
but also to index into y
43:50
but also to index into y and now this should be many batches and
43:52
and now this should be many batches and
43:52
and now this should be many batches and this should be much much faster so
43:55
this should be much much faster so
43:55
this should be much much faster so okay so it's instant almost
43:57
okay so it's instant almost
43:57
okay so it's instant almost so this way we can run many many
44:00
so this way we can run many many
44:00
so this way we can run many many examples
44:01
examples
44:01
examples nearly instantly and decrease the loss
44:03
nearly instantly and decrease the loss
44:03
nearly instantly and decrease the loss much much faster
44:05
much much faster
44:05
much much faster now because we're only dealing with mini
44:07
now because we're only dealing with mini
44:07
now because we're only dealing with mini batches the quality of our gradient is
44:09
batches the quality of our gradient is
44:09
batches the quality of our gradient is lower so the direction is not as
44:11
lower so the direction is not as
44:11
lower so the direction is not as reliable it's not the actual gradient
44:13
reliable it's not the actual gradient
44:13
reliable it's not the actual gradient direction
44:14
direction
44:14
direction but the gradient direction is good
44:16
but the gradient direction is good
44:16
but the gradient direction is good enough even when it's estimating on only
44:18
enough even when it's estimating on only
44:18
enough even when it's estimating on only 32 examples that it is useful and so
44:22
32 examples that it is useful and so
44:22
32 examples that it is useful and so it's much better to have an approximate
44:24
it's much better to have an approximate
44:24
it's much better to have an approximate gradient and just make more steps than
44:26
gradient and just make more steps than
44:26
gradient and just make more steps than it is to evaluate the exact gradient and
44:29
it is to evaluate the exact gradient and
44:29
it is to evaluate the exact gradient and take fewer steps so that's why in
44:31
take fewer steps so that's why in
44:31
take fewer steps so that's why in practice uh this works quite well
44:34
practice uh this works quite well
44:34
practice uh this works quite well so let's now continue the optimization
44:38
let me take out this lost item from here
44:41
let me take out this lost item from here
44:41
let me take out this lost item from here and uh
44:42
and uh
44:42
and uh place it over here at the end
44:45
place it over here at the end
44:45
place it over here at the end okay so we're hovering around 2.5 or so
44:50
okay so we're hovering around 2.5 or so
44:50
okay so we're hovering around 2.5 or so however this is only the loss for that
44:51
however this is only the loss for that
44:51
however this is only the loss for that mini batch so let's actually evaluate
44:53
mini batch so let's actually evaluate
44:53
mini batch so let's actually evaluate the loss
44:54
the loss
44:54
the loss here
44:56
here
44:56
here for all of x
44:58
for all of x
44:58
for all of x and for all of y just so we have a
45:00
and for all of y just so we have a
45:00
and for all of y just so we have a full sense of exactly how all the model
45:03
full sense of exactly how all the model
45:03
full sense of exactly how all the model is doing right now
45:05
is doing right now
45:05
is doing right now so right now we're at about 2.7 on the
45:07
so right now we're at about 2.7 on the
45:07
so right now we're at about 2.7 on the entire training set
45:09
entire training set
45:09
entire training set so let's
45:10
so let's
45:10
so let's run the optimization for a while
45:12
run the optimization for a while
45:12
run the optimization for a while okay right 2.6
45:15
okay right 2.6
45:15
okay right 2.6 2.57
45:17
2.57
45:17
2.57 2.53
45:21
okay
45:22
okay
45:22
okay so one issue of course is we don't know
45:25
so one issue of course is we don't know
45:25
so one issue of course is we don't know if we're stepping too slow or too fast
45:28
if we're stepping too slow or too fast
45:28
if we're stepping too slow or too fast so this point one i just guessed it
45:30
so this point one i just guessed it
45:30
so this point one i just guessed it so one question is how do you determine
45:33
so one question is how do you determine
45:33
so one question is how do you determine this learning rate
45:34
this learning rate
45:34
this learning rate and how do we gain confidence that we're
45:37
and how do we gain confidence that we're
45:37
and how do we gain confidence that we're stepping in the right
45:39
stepping in the right
45:39
stepping in the right sort of speed so i'll show you one way
45:41
sort of speed so i'll show you one way
45:41
sort of speed so i'll show you one way to determine a reasonable learning rate
45:43
to determine a reasonable learning rate
45:43
to determine a reasonable learning rate it works as follows let's reset our
45:46
it works as follows let's reset our
45:46
it works as follows let's reset our parameters
45:47
parameters
45:47
parameters to the initial
45:49
to the initial
45:49
to the initial settings
45:51
settings
45:51
settings and now let's
45:52
and now let's
45:52
and now let's print in every step
45:54
print in every step
45:54
print in every step but let's only do 10 steps or so
45:58
but let's only do 10 steps or so
45:58
but let's only do 10 steps or so or maybe maybe 100 steps
46:00
or maybe maybe 100 steps
46:00
or maybe maybe 100 steps we want to find like a very reasonable
46:02
we want to find like a very reasonable
46:02
we want to find like a very reasonable set
46:03
set
46:03
set search range if you will so for example
46:05
search range if you will so for example
46:05
search range if you will so for example if this is like very low
46:07
if this is like very low
46:07
if this is like very low then
46:09
then
46:09
then we see that the loss is barely
46:11
we see that the loss is barely
46:11
we see that the loss is barely decreasing so that's not
46:13
decreasing so that's not
46:13
decreasing so that's not that's like too low basically so let's
46:15
that's like too low basically so let's
46:15
that's like too low basically so let's try
46:16
try
46:16
try this one
46:18
this one
46:18
this one okay so we're decreasing the loss but
46:20
okay so we're decreasing the loss but
46:20
okay so we're decreasing the loss but like not very quickly so that's a pretty
46:21
like not very quickly so that's a pretty
46:21
like not very quickly so that's a pretty good low range
46:23
good low range
46:23
good low range now let's reset it again
46:25
now let's reset it again
46:25
now let's reset it again and now let's try to find the place at
46:27
and now let's try to find the place at
46:27
and now let's try to find the place at which the loss kind of explodes
46:29
which the loss kind of explodes
46:29
which the loss kind of explodes uh so maybe at negative one
46:33
okay we see that we're minimizing the
46:34
okay we see that we're minimizing the
46:34
okay we see that we're minimizing the loss but you see how uh it's kind of
46:36
loss but you see how uh it's kind of
46:36
loss but you see how uh it's kind of unstable it goes up and down quite a bit
46:39
unstable it goes up and down quite a bit
46:39
unstable it goes up and down quite a bit um so negative one is probably like a
46:41
um so negative one is probably like a
46:41
um so negative one is probably like a fast learning rate let's try negative
46:44
fast learning rate let's try negative
46:44
fast learning rate let's try negative 10.
46:45
10.
46:45
10. okay so this
46:46
okay so this
46:46
okay so this isn't optimizing this is not working
46:48
isn't optimizing this is not working
46:48
isn't optimizing this is not working very well so negative 10 is way too big
46:50
very well so negative 10 is way too big
46:50
very well so negative 10 is way too big negative one was already kind of big
46:53
negative one was already kind of big
46:53
negative one was already kind of big um
46:54
um
46:54
um so therefore
46:56
so therefore
46:56
so therefore negative one was like somewhat
46:57
negative one was like somewhat
46:57
negative one was like somewhat reasonable if i reset
47:00
reasonable if i reset
47:00
reasonable if i reset so i'm thinking that the right learning
47:01
so i'm thinking that the right learning
47:01
so i'm thinking that the right learning rate is somewhere between
47:03
rate is somewhere between
47:03
rate is somewhere between uh negative zero point zero zero one and
47:05
uh negative zero point zero zero one and
47:05
uh negative zero point zero zero one and um
47:06
um
47:06
um negative one
47:08
negative one
47:08
negative one so the way we can do this here is we can
47:09
so the way we can do this here is we can
47:10
so the way we can do this here is we can use uh torch shot lens space
47:13
use uh torch shot lens space
47:13
use uh torch shot lens space and we want to basically do something
47:14
and we want to basically do something
47:14
and we want to basically do something like this between zero and one but
47:17
like this between zero and one but
47:17
like this between zero and one but um
47:19
um
47:19
um those number of steps is one more
47:20
those number of steps is one more
47:20
those number of steps is one more parameter that's required let's do a
47:22
parameter that's required let's do a
47:22
parameter that's required let's do a thousand steps this creates 1000
47:26
thousand steps this creates 1000
47:26
thousand steps this creates 1000 numbers between 0.01 and 1
47:29
numbers between 0.01 and 1
47:29
numbers between 0.01 and 1 but it doesn't really make sense to step
47:31
but it doesn't really make sense to step
47:31
but it doesn't really make sense to step between these linearly so instead let me
47:33
between these linearly so instead let me
47:33
between these linearly so instead let me create learning rate exponent
47:36
create learning rate exponent
47:36
create learning rate exponent and instead of 0.001 this will be a
47:39
and instead of 0.001 this will be a
47:39
and instead of 0.001 this will be a negative 3 and this will be a zero
47:41
negative 3 and this will be a zero
47:41
negative 3 and this will be a zero and then the actual lrs that we want to
47:43
and then the actual lrs that we want to
47:43
and then the actual lrs that we want to search over are going to be 10 to the
47:45
search over are going to be 10 to the
47:45
search over are going to be 10 to the power of lre
47:48
power of lre
47:48
power of lre so now what we're doing is we're
47:49
so now what we're doing is we're
47:49
so now what we're doing is we're stepping linearly between the exponents
47:51
stepping linearly between the exponents
47:51
stepping linearly between the exponents of these learning rates this is 0.001
47:54
of these learning rates this is 0.001
47:54
of these learning rates this is 0.001 and this is 1 because 10 to the power of
47:57
and this is 1 because 10 to the power of
47:57
and this is 1 because 10 to the power of 0 is 1.
47:58
0 is 1.
47:58
0 is 1. and therefore we are spaced
48:00
and therefore we are spaced
48:00
and therefore we are spaced exponentially in this interval
48:02
exponentially in this interval
48:02
exponentially in this interval so these are the candidate learning
48:03
so these are the candidate learning
48:03
so these are the candidate learning rates
48:04
rates
48:04
rates that we want to sort of like search over
48:06
that we want to sort of like search over
48:06
that we want to sort of like search over roughly
48:07
roughly
48:07
roughly so now what we're going to do is
48:10
so now what we're going to do is
48:10
so now what we're going to do is here we are going to run the
48:12
here we are going to run the
48:12
here we are going to run the optimization for 1000 steps
48:14
optimization for 1000 steps
48:14
optimization for 1000 steps and instead of using a fixed number
48:16
and instead of using a fixed number
48:16
and instead of using a fixed number we are going to use learning rate
48:19
we are going to use learning rate
48:19
we are going to use learning rate indexing into here lrs of i
48:22
indexing into here lrs of i
48:22
indexing into here lrs of i and make this i
48:25
so basically let me reset this to be
48:28
so basically let me reset this to be
48:28
so basically let me reset this to be again starting from random
48:30
again starting from random
48:30
again starting from random creating these learning rates between
48:32
creating these learning rates between
48:32
creating these learning rates between negative
48:33
negative
48:33
negative zero points between 0.001 and um
48:36
zero points between 0.001 and um
48:36
zero points between 0.001 and um one but exponentially stopped
48:39
one but exponentially stopped
48:39
one but exponentially stopped and here what we're doing is we're
48:41
and here what we're doing is we're
48:41
and here what we're doing is we're iterating a thousand times
48:43
iterating a thousand times
48:43
iterating a thousand times we're going to use the learning rate
48:45
we're going to use the learning rate
48:45
we're going to use the learning rate um that's in the beginning very very low
48:48
um that's in the beginning very very low
48:48
um that's in the beginning very very low in the beginning is going to be 0.001
48:50
in the beginning is going to be 0.001
48:50
in the beginning is going to be 0.001 but by the end it's going to be
48:52
but by the end it's going to be
48:52
but by the end it's going to be 1.
48:53
1.
48:53
1. and then we're going to step with that
48:55
and then we're going to step with that
48:55
and then we're going to step with that learning rate
48:57
learning rate
48:57
learning rate and now what we want to do is we want to
48:58
and now what we want to do is we want to
48:58
and now what we want to do is we want to keep track of the uh
49:04
learning rates that we used and we want
49:05
learning rates that we used and we want
49:05
learning rates that we used and we want to look at the losses
49:07
to look at the losses
49:07
to look at the losses that resulted
49:09
that resulted
49:09
that resulted and so here let me
49:12
and so here let me
49:12
and so here let me track stats
49:14
track stats
49:14
track stats so lri.append lr
49:16
so lri.append lr
49:16
so lri.append lr and um lost side that append
49:20
and um lost side that append
49:20
and um lost side that append loss that item
49:22
loss that item
49:22
loss that item okay
49:23
okay
49:23
okay so again reset everything
49:27
so again reset everything
49:27
so again reset everything and then run
49:30
and then run
49:30
and then run and so basically we started with a very
49:31
and so basically we started with a very
49:31
and so basically we started with a very low learning rate and we went all the
49:33
low learning rate and we went all the
49:33
low learning rate and we went all the way up to a learning rate of negative
49:35
way up to a learning rate of negative
49:35
way up to a learning rate of negative one
49:35
one
49:35
one and now what we can do is we can plt
49:37
and now what we can do is we can plt
49:37
and now what we can do is we can plt that plot
49:38
that plot
49:38
that plot and we can plot the two so we can plot
49:41
and we can plot the two so we can plot
49:41
and we can plot the two so we can plot the learning rates on the x-axis and the
49:43
the learning rates on the x-axis and the
49:43
the learning rates on the x-axis and the losses we saw on the y-axis
49:46
losses we saw on the y-axis
49:46
losses we saw on the y-axis and often you're going to find that your
49:47
and often you're going to find that your
49:48
and often you're going to find that your plot looks something like this
49:50
plot looks something like this
49:50
plot looks something like this where in the beginning
49:51
where in the beginning
49:52
where in the beginning you had very low learning rates so
49:53
you had very low learning rates so
49:53
you had very low learning rates so basically anything
49:54
basically anything
49:54
basically anything barely anything happened
49:57
barely anything happened
49:57
barely anything happened then we got to like a nice spot here
50:00
then we got to like a nice spot here
50:00
then we got to like a nice spot here and then as we increase the learning
50:01
and then as we increase the learning
50:01
and then as we increase the learning rate enough
50:02
rate enough
50:02
rate enough we basically started to be kind of
50:04
we basically started to be kind of
50:04
we basically started to be kind of unstable here
50:05
unstable here
50:05
unstable here so a good learning rate turns out to be
50:07
so a good learning rate turns out to be
50:07
so a good learning rate turns out to be somewhere around here
50:10
somewhere around here
50:10
somewhere around here um and because we have lri here
50:13
um and because we have lri here
50:13
um and because we have lri here um
50:14
um
50:14
um we actually may want to
50:16
we actually may want to
50:16
we actually may want to um
50:19
do not lr
50:20
do not lr
50:20
do not lr not the learning rate but the exponent
50:22
not the learning rate but the exponent
50:22
not the learning rate but the exponent so that would be the lre at i is maybe
50:25
so that would be the lre at i is maybe
50:25
so that would be the lre at i is maybe what we want to log so let me reset this
50:27
what we want to log so let me reset this
50:27
what we want to log so let me reset this and redo that calculation
50:30
and redo that calculation
50:30
and redo that calculation but now on the x axis we have the
50:32
but now on the x axis we have the
50:32
but now on the x axis we have the [Music]
50:34
[Music]
50:34
[Music] exponent of the learning rate and so we
50:36
exponent of the learning rate and so we
50:36
exponent of the learning rate and so we can see the exponent of the learning
50:37
can see the exponent of the learning
50:37
can see the exponent of the learning rate that is good to use it would be
50:38
rate that is good to use it would be
50:38
rate that is good to use it would be sort of like roughly in the valley here
50:41
sort of like roughly in the valley here
50:41
sort of like roughly in the valley here because here the learning rates are just
50:42
because here the learning rates are just
50:42
because here the learning rates are just way too low and then here where we
50:44
way too low and then here where we
50:44
way too low and then here where we expect relatively good learning rates
50:46
expect relatively good learning rates
50:46
expect relatively good learning rates somewhere here and then here things are
50:47
somewhere here and then here things are
50:47
somewhere here and then here things are starting to explode so somewhere around
50:50
starting to explode so somewhere around
50:50
starting to explode so somewhere around negative one x the exponent of the
50:51
negative one x the exponent of the
50:51
negative one x the exponent of the learning rate is a pretty good setting
50:54
learning rate is a pretty good setting
50:54
learning rate is a pretty good setting and 10 to the negative one is 0.1 so 0.1
50:58
and 10 to the negative one is 0.1 so 0.1
50:58
and 10 to the negative one is 0.1 so 0.1 is actually 0.1 was actually a fairly
50:59
is actually 0.1 was actually a fairly
50:59
is actually 0.1 was actually a fairly good learning rate around here
51:02
good learning rate around here
51:02
good learning rate around here and that's what we had in the initial
51:03
and that's what we had in the initial
51:03
and that's what we had in the initial setting
51:05
setting
51:05
setting but that's roughly how you would
51:06
but that's roughly how you would
51:06
but that's roughly how you would determine it and so here now we can take
51:09
determine it and so here now we can take
51:09
determine it and so here now we can take out the tracking of these
51:12
out the tracking of these
51:12
out the tracking of these and we can just simply set lr to be 10
51:15
and we can just simply set lr to be 10
51:15
and we can just simply set lr to be 10 to the negative one or
51:17
to the negative one or
51:18
to the negative one or basically otherwise 0.1 as it was before
51:20
basically otherwise 0.1 as it was before
51:20
basically otherwise 0.1 as it was before and now we have some confidence that
51:21
and now we have some confidence that
51:21
and now we have some confidence that this is actually a fairly good learning
51:23
this is actually a fairly good learning
51:23
this is actually a fairly good learning rate
51:24
rate
51:24
rate and so now we can do is we can crank up
51:26
and so now we can do is we can crank up
51:26
and so now we can do is we can crank up the iterations
51:27
the iterations
51:27
the iterations we can reset our optimization
51:30
we can reset our optimization
51:30
we can reset our optimization and
51:32
and
51:32
and we can run for a pretty long time using
51:34
we can run for a pretty long time using
51:34
we can run for a pretty long time using this learning rate
51:36
this learning rate
51:36
this learning rate oops and we don't want to print that's
51:38
oops and we don't want to print that's
51:38
oops and we don't want to print that's way too much printing
51:40
way too much printing
51:40
way too much printing so let me again reset
51:42
so let me again reset
51:42
so let me again reset and run ten thousand stops
51:48
okay so we're 0.2 2.48 roughly let's run
51:52
okay so we're 0.2 2.48 roughly let's run
51:52
okay so we're 0.2 2.48 roughly let's run another 10 000 steps
51:58
2.46
52:00
2.46
52:00
2.46 and now let's do one learning rate decay
52:02
and now let's do one learning rate decay
52:02
and now let's do one learning rate decay what this means is we're going to take
52:03
what this means is we're going to take
52:03
what this means is we're going to take our learning rate and we're going to 10x
52:05
our learning rate and we're going to 10x
52:05
our learning rate and we're going to 10x lower it and so we're at the late stages
52:08
lower it and so we're at the late stages
52:08
lower it and so we're at the late stages of training potentially and we may want
52:10
of training potentially and we may want
52:10
of training potentially and we may want to go a bit slower let's do one more
52:12
to go a bit slower let's do one more
52:12
to go a bit slower let's do one more actually at 0.1 just to see if
52:16
we're making a dent here
52:18
we're making a dent here
52:18
we're making a dent here okay we're still making dent and by the
52:20
okay we're still making dent and by the
52:20
okay we're still making dent and by the way the
52:21
way the
52:21
way the bi-gram loss that we achieved last video
52:24
bi-gram loss that we achieved last video
52:24
bi-gram loss that we achieved last video was 2.45 so we've already surpassed the
52:27
was 2.45 so we've already surpassed the
52:27
was 2.45 so we've already surpassed the bi-gram model
52:29
bi-gram model
52:29
bi-gram model and once i get a sense that this is
52:30
and once i get a sense that this is
52:30
and once i get a sense that this is actually kind of starting to plateau off
52:32
actually kind of starting to plateau off
52:32
actually kind of starting to plateau off people like to do as i mentioned this
52:34
people like to do as i mentioned this
52:34
people like to do as i mentioned this learning rate decay so let's try to
52:36
learning rate decay so let's try to
52:36
learning rate decay so let's try to decay the loss
52:37
decay the loss
52:37
decay the loss the learning rate i mean
52:42
and we achieve it about 2.3 now
52:46
and we achieve it about 2.3 now
52:46
and we achieve it about 2.3 now obviously this is janky and not exactly
52:48
obviously this is janky and not exactly
52:48
obviously this is janky and not exactly how you would train it in production but
52:49
how you would train it in production but
52:50
how you would train it in production but this is roughly what you're going
52:51
this is roughly what you're going
52:51
this is roughly what you're going through you first find a decent learning
52:53
through you first find a decent learning
52:53
through you first find a decent learning rate using the approach that i showed
52:54
rate using the approach that i showed
52:54
rate using the approach that i showed you
52:55
you
52:55
you then you start with that learning rate
52:57
then you start with that learning rate
52:57
then you start with that learning rate and you train for a while
52:58
and you train for a while
52:58
and you train for a while and then at the end people like to do a
53:00
and then at the end people like to do a
53:00
and then at the end people like to do a learning rate decay where you decay the
53:02
learning rate decay where you decay the
53:02
learning rate decay where you decay the learning rate by say a factor of 10 and
53:03
learning rate by say a factor of 10 and
53:03
learning rate by say a factor of 10 and you do a few more steps and then you get
53:06
you do a few more steps and then you get
53:06
you do a few more steps and then you get a trained network roughly speaking
53:08
a trained network roughly speaking
53:08
a trained network roughly speaking so we've achieved 2.3 and dramatically
53:11
so we've achieved 2.3 and dramatically
53:11
so we've achieved 2.3 and dramatically improved on the bi-gram language model
53:13
improved on the bi-gram language model
53:13
improved on the bi-gram language model using this simple neural net as
53:15
using this simple neural net as
53:15
using this simple neural net as described here
53:17
described here
53:17
described here using these 3 400 parameters now there's
53:20
using these 3 400 parameters now there's
53:20
using these 3 400 parameters now there's something we have to be careful with
53:22
something we have to be careful with
53:22
something we have to be careful with i said that we have a better model
53:24
i said that we have a better model
53:24
i said that we have a better model because we are achieving a lower loss
53:26
because we are achieving a lower loss
53:26
because we are achieving a lower loss 2.3 much lower than 2.45 with the
53:28
2.3 much lower than 2.45 with the
53:28
2.3 much lower than 2.45 with the bi-gram model previously
53:30
bi-gram model previously
53:30
bi-gram model previously now that's not exactly true and the
53:32
now that's not exactly true and the
53:32
now that's not exactly true and the reason that's not true is that
53:37
this is actually fairly small model but
53:39
this is actually fairly small model but
53:39
this is actually fairly small model but these models can get larger and larger
53:41
these models can get larger and larger
53:41
these models can get larger and larger if you keep adding neurons and
53:42
if you keep adding neurons and
53:42
if you keep adding neurons and parameters so you can imagine that we
53:44
parameters so you can imagine that we
53:44
parameters so you can imagine that we don't potentially have a thousand
53:45
don't potentially have a thousand
53:45
don't potentially have a thousand parameters we could have 10 000 or 100
53:47
parameters we could have 10 000 or 100
53:47
parameters we could have 10 000 or 100 000 or millions of parameters
53:49
000 or millions of parameters
53:49
000 or millions of parameters and as the capacity of the neural
53:51
and as the capacity of the neural
53:51
and as the capacity of the neural network grows
53:52
network grows
53:52
network grows it becomes more and more capable of
53:54
it becomes more and more capable of
53:54
it becomes more and more capable of overfitting your training set
53:56
overfitting your training set
53:56
overfitting your training set what that means is that the loss on the
53:59
what that means is that the loss on the
53:59
what that means is that the loss on the training set on the data that you're
54:00
training set on the data that you're
54:00
training set on the data that you're training on will become very very low as
54:03
training on will become very very low as
54:03
training on will become very very low as low as zero
54:04
low as zero
54:04
low as zero but all that the model is doing is
54:06
but all that the model is doing is
54:06
but all that the model is doing is memorizing your training set verbatim so
54:09
memorizing your training set verbatim so
54:09
memorizing your training set verbatim so if you take that model and it looks like
54:10
if you take that model and it looks like
54:10
if you take that model and it looks like it's working really well but you try to
54:12
it's working really well but you try to
54:12
it's working really well but you try to sample from it you will basically only
54:14
sample from it you will basically only
54:14
sample from it you will basically only get examples exactly as they are in the
54:16
get examples exactly as they are in the
54:16
get examples exactly as they are in the training set you won't get any new data
54:19
training set you won't get any new data
54:19
training set you won't get any new data in addition to that if you try to
54:20
in addition to that if you try to
54:20
in addition to that if you try to evaluate the loss on some withheld names
54:23
evaluate the loss on some withheld names
54:23
evaluate the loss on some withheld names or other words
54:24
or other words
54:24
or other words you will actually see that the loss on
54:26
you will actually see that the loss on
54:26
you will actually see that the loss on those can be very high and so basically
54:29
those can be very high and so basically
54:29
those can be very high and so basically it's not a good model
54:30
it's not a good model
54:30
it's not a good model so the standard in the field is to split
54:32
so the standard in the field is to split
54:32
so the standard in the field is to split up your data set into three splits as we
54:35
up your data set into three splits as we
54:35
up your data set into three splits as we call them we have the training split the
54:37
call them we have the training split the
54:37
call them we have the training split the dev split or the validation split
54:40
dev split or the validation split
54:40
dev split or the validation split and the test split
54:42
and the test split
54:42
and the test split so
54:43
so
54:43
so training split
54:45
training split
54:45
training split test or um sorry dev or validation split
54:49
test or um sorry dev or validation split
54:49
test or um sorry dev or validation split and test split and typically this would
54:53
and test split and typically this would
54:53
and test split and typically this would be say eighty percent of your data set
54:54
be say eighty percent of your data set
54:54
be say eighty percent of your data set this could be ten percent and this ten
54:56
this could be ten percent and this ten
54:56
this could be ten percent and this ten percent roughly
54:58
percent roughly
54:58
percent roughly so you have these three splits of the
54:59
so you have these three splits of the
55:00
so you have these three splits of the data
55:01
data
55:01
data now these eighty percent of your
55:02
now these eighty percent of your
55:02
now these eighty percent of your trainings of the data set the training
55:04
trainings of the data set the training
55:04
trainings of the data set the training set is used to optimize the parameters
55:06
set is used to optimize the parameters
55:06
set is used to optimize the parameters of the model just like we're doing here
55:08
of the model just like we're doing here
55:08
of the model just like we're doing here using gradient descent
55:10
using gradient descent
55:10
using gradient descent these 10 percent of the
55:12
these 10 percent of the
55:12
these 10 percent of the examples the dev or validation split
55:14
examples the dev or validation split
55:14
examples the dev or validation split they're used for development over all
55:16
they're used for development over all
55:16
they're used for development over all the hyper parameters of your model so
55:19
the hyper parameters of your model so
55:19
the hyper parameters of your model so hyper parameters are for example the
55:21
hyper parameters are for example the
55:21
hyper parameters are for example the size of this hidden layer
55:22
size of this hidden layer
55:22
size of this hidden layer the size of the embedding so this is a
55:24
the size of the embedding so this is a
55:24
the size of the embedding so this is a hundred or a two for us but we could try
55:26
hundred or a two for us but we could try
55:26
hundred or a two for us but we could try different things
55:27
different things
55:28
different things the strength of the regularization which
55:29
the strength of the regularization which
55:29
the strength of the regularization which we aren't using yet so far
55:31
we aren't using yet so far
55:31
we aren't using yet so far so there's lots of different hybrid
55:33
so there's lots of different hybrid
55:33
so there's lots of different hybrid parameters and settings that go into
55:34
parameters and settings that go into
55:34
parameters and settings that go into defining your neural net and you can try
55:36
defining your neural net and you can try
55:36
defining your neural net and you can try many different variations of them and
55:38
many different variations of them and
55:38
many different variations of them and see whichever one works best on your
55:41
see whichever one works best on your
55:41
see whichever one works best on your validation split
55:43
validation split
55:43
validation split so this is used to train the parameters
55:45
so this is used to train the parameters
55:45
so this is used to train the parameters this is used to train the hyperprimers
55:48
this is used to train the hyperprimers
55:48
this is used to train the hyperprimers and test split is used to evaluate
55:51
and test split is used to evaluate
55:51
and test split is used to evaluate basically the performance of the model
55:53
basically the performance of the model
55:53
basically the performance of the model at the end
55:54
at the end
55:54
at the end so we're only evaluating the loss on the
55:55
so we're only evaluating the loss on the
55:55
so we're only evaluating the loss on the test plate very very sparingly and very
55:57
test plate very very sparingly and very
55:57
test plate very very sparingly and very few times because every single time you
56:00
few times because every single time you
56:00
few times because every single time you evaluate your test loss and you learn
56:02
evaluate your test loss and you learn
56:02
evaluate your test loss and you learn something from it
56:03
something from it
56:03
something from it you are basically starting to also train
56:06
you are basically starting to also train
56:06
you are basically starting to also train on the test split
56:07
on the test split
56:08
on the test split so you are only allowed to test the loss
56:10
so you are only allowed to test the loss
56:10
so you are only allowed to test the loss on a test
56:11
on a test
56:11
on a test set
56:13
set
56:13
set very very few times otherwise you risk
56:15
very very few times otherwise you risk
56:15
very very few times otherwise you risk overfitting to it as well as you
56:17
overfitting to it as well as you
56:17
overfitting to it as well as you experiment on your model
56:19
experiment on your model
56:19
experiment on your model so let's also split up our training data
56:21
so let's also split up our training data
56:22
so let's also split up our training data into train dev and test and then we are
56:25
into train dev and test and then we are
56:25
into train dev and test and then we are going to train on train
56:26
going to train on train
56:26
going to train on train and only evaluate on tests very very
56:28
and only evaluate on tests very very
56:28
and only evaluate on tests very very sparingly okay so here we go
56:31
sparingly okay so here we go
56:31
sparingly okay so here we go here is where we took all the words and
56:33
here is where we took all the words and
56:33
here is where we took all the words and put them into x and y tensors
56:36
put them into x and y tensors
56:36
put them into x and y tensors so instead let me create a new cell here
56:38
so instead let me create a new cell here
56:38
so instead let me create a new cell here and let me just copy paste some code
56:39
and let me just copy paste some code
56:39
and let me just copy paste some code here
56:40
here
56:40
here because i don't think it's that
56:43
because i don't think it's that
56:43
because i don't think it's that complex but
56:45
complex but
56:45
complex but we're going to try to save a little bit
56:46
we're going to try to save a little bit
56:46
we're going to try to save a little bit of time
56:47
of time
56:47
of time i'm converting this to be a function now
56:49
i'm converting this to be a function now
56:49
i'm converting this to be a function now and this function takes some list of
56:51
and this function takes some list of
56:51
and this function takes some list of words and builds the arrays x and y for
56:54
words and builds the arrays x and y for
56:54
words and builds the arrays x and y for those words only
56:56
those words only
56:56
those words only and then here i am shuffling up all the
56:59
and then here i am shuffling up all the
56:59
and then here i am shuffling up all the words so these are the input words that
57:01
words so these are the input words that
57:01
words so these are the input words that we get
57:02
we get
57:02
we get we are randomly shuffling them all up
57:04
we are randomly shuffling them all up
57:04
we are randomly shuffling them all up and then um
57:06
and then um
57:06
and then um we're going to
57:08
we're going to
57:08
we're going to set n1 to be
57:09
set n1 to be
57:09
set n1 to be the number of examples that there's 80
57:11
the number of examples that there's 80
57:11
the number of examples that there's 80 of the words and n2 to be
57:13
of the words and n2 to be
57:13
of the words and n2 to be 90
57:14
90
57:14
90 of the way of the words so basically if
57:16
of the way of the words so basically if
57:16
of the way of the words so basically if len of words is 32 000 n1 is
57:21
len of words is 32 000 n1 is
57:21
len of words is 32 000 n1 is well sorry i should probably run this
57:24
well sorry i should probably run this
57:24
well sorry i should probably run this n1 is 25 000 and n2 is 28 000.
57:28
n1 is 25 000 and n2 is 28 000.
57:28
n1 is 25 000 and n2 is 28 000. and so here we see that
57:29
and so here we see that
57:29
and so here we see that i'm calling build data set to build the
57:31
i'm calling build data set to build the
57:32
i'm calling build data set to build the training set x and y
57:33
training set x and y
57:33
training set x and y by indexing into up to and one so we're
57:36
by indexing into up to and one so we're
57:36
by indexing into up to and one so we're going to have only 25 000 training words
57:39
going to have only 25 000 training words
57:39
going to have only 25 000 training words and then we're going to have
57:42
and then we're going to have
57:42
and then we're going to have roughly
57:44
roughly
57:44
roughly n2 minus n1
57:46
n2 minus n1
57:46
n2 minus n1 3 3 000 validation examples or dev
57:49
3 3 000 validation examples or dev
57:49
3 3 000 validation examples or dev examples and we're going to have
57:53
when of words basically minus and two
57:57
when of words basically minus and two
57:57
when of words basically minus and two or
57:58
or
57:58
or 3 204 examples
58:00
3 204 examples
58:00
3 204 examples here for the test set
58:03
here for the test set
58:03
here for the test set so
58:04
so
58:04
so now we have x's and y's
58:07
now we have x's and y's
58:07
now we have x's and y's for all those three splits
58:13
oh yeah i'm printing their size here
58:14
oh yeah i'm printing their size here
58:14
oh yeah i'm printing their size here inside the function as well
58:18
but here we don't have words but these
58:20
but here we don't have words but these
58:20
but here we don't have words but these are already the individual examples made
58:22
are already the individual examples made
58:22
are already the individual examples made from those words
58:25
from those words
58:25
from those words so let's now scroll down here
58:27
so let's now scroll down here
58:27
so let's now scroll down here and the data set now for training is
58:31
and the data set now for training is
58:31
and the data set now for training is more like this
58:33
more like this
58:33
more like this and then when we reset the network
58:38
when we're training we're only going to
58:40
when we're training we're only going to
58:40
when we're training we're only going to be training using x train
58:43
be training using x train
58:43
be training using x train x train and y train
58:47
x train and y train
58:47
x train and y train so that's the only thing we're training
58:49
so that's the only thing we're training
58:49
so that's the only thing we're training on
58:57
let's see where we are on the
59:00
let's see where we are on the
59:00
let's see where we are on the single batch
59:02
single batch
59:02
single batch let's now train maybe a few more steps
59:08
training neural networks can take a
59:09
training neural networks can take a
59:09
training neural networks can take a while usually you don't do it inline you
59:11
while usually you don't do it inline you
59:11
while usually you don't do it inline you launch a bunch of jobs and you wait for
59:12
launch a bunch of jobs and you wait for
59:12
launch a bunch of jobs and you wait for them to finish um can take in multiple
59:15
them to finish um can take in multiple
59:15
them to finish um can take in multiple days and so on
59:16
days and so on
59:16
days and so on luckily this is a very small network
59:20
okay so the loss is pretty good
59:23
okay so the loss is pretty good
59:23
okay so the loss is pretty good oh we accidentally used a learning rate
59:25
oh we accidentally used a learning rate
59:25
oh we accidentally used a learning rate that is way too low
59:27
that is way too low
59:27
that is way too low so let me actually come back
59:29
so let me actually come back
59:29
so let me actually come back we use the decay learning rate of 0.01
59:35
so this will train much faster
59:37
so this will train much faster
59:37
so this will train much faster and then here when we evaluate
59:39
and then here when we evaluate
59:39
and then here when we evaluate let's use the dep set here
59:42
let's use the dep set here
59:42
let's use the dep set here xdev
59:43
xdev
59:43
xdev and ydev to evaluate the loss
59:47
and ydev to evaluate the loss
59:47
and ydev to evaluate the loss okay
59:48
okay
59:48
okay and let's now decay the learning rate
59:50
and let's now decay the learning rate
59:50
and let's now decay the learning rate and only do say 10 000 examples
59:55
and let's evaluate the dev loss
59:57
and let's evaluate the dev loss
59:57
and let's evaluate the dev loss ones here
59:59
ones here
59:59
ones here okay so we're getting about 2.3 on dev
1:00:01
okay so we're getting about 2.3 on dev
1:00:01
okay so we're getting about 2.3 on dev and so the neural network when it was
1:00:02
and so the neural network when it was
1:00:02
and so the neural network when it was training did not see these dev examples
1:00:05
training did not see these dev examples
1:00:05
training did not see these dev examples it hasn't optimized on them and yet
1:00:07
it hasn't optimized on them and yet
1:00:08
it hasn't optimized on them and yet when we evaluate the loss on these dev
1:00:10
when we evaluate the loss on these dev
1:00:10
when we evaluate the loss on these dev we actually get a pretty decent loss
1:00:12
we actually get a pretty decent loss
1:00:12
we actually get a pretty decent loss and so we can also look at what the
1:00:16
and so we can also look at what the
1:00:16
and so we can also look at what the loss is on all of training set
1:00:18
loss is on all of training set
1:00:18
loss is on all of training set oops
1:00:20
oops
1:00:20
oops and so we see that the training and the
1:00:22
and so we see that the training and the
1:00:22
and so we see that the training and the dev loss are about equal so we're not
1:00:24
dev loss are about equal so we're not
1:00:24
dev loss are about equal so we're not over fitting
1:00:25
over fitting
1:00:25
over fitting um this model is not powerful enough to
1:00:28
um this model is not powerful enough to
1:00:28
um this model is not powerful enough to just be purely memorizing the data and
1:00:31
just be purely memorizing the data and
1:00:31
just be purely memorizing the data and so far we are what's called underfitting
1:00:33
so far we are what's called underfitting
1:00:33
so far we are what's called underfitting because the training loss and the dev or
1:00:35
because the training loss and the dev or
1:00:35
because the training loss and the dev or test losses are roughly equal so what
1:00:38
test losses are roughly equal so what
1:00:38
test losses are roughly equal so what that typically means is that our network
1:00:40
that typically means is that our network
1:00:40
that typically means is that our network is very tiny very small and we expect to
1:00:43
is very tiny very small and we expect to
1:00:43
is very tiny very small and we expect to make performance improvements by scaling
1:00:46
make performance improvements by scaling
1:00:46
make performance improvements by scaling up the size of this neural net so let's
1:00:47
up the size of this neural net so let's
1:00:47
up the size of this neural net so let's do that now so let's come over here
1:00:50
do that now so let's come over here
1:00:50
do that now so let's come over here and let's increase the size of the
1:00:51
and let's increase the size of the
1:00:51
and let's increase the size of the neural net the easiest way to do this is
1:00:53
neural net the easiest way to do this is
1:00:53
neural net the easiest way to do this is we can come here to the hidden layer
1:00:55
we can come here to the hidden layer
1:00:55
we can come here to the hidden layer which currently has 100 neurons and
1:00:56
which currently has 100 neurons and
1:00:56
which currently has 100 neurons and let's just bump this up so let's do 300
1:00:58
let's just bump this up so let's do 300
1:00:58
let's just bump this up so let's do 300 neurons
1:01:00
neurons
1:01:00
neurons and then this is also 300 biases and
1:01:03
and then this is also 300 biases and
1:01:03
and then this is also 300 biases and here we have 300 inputs into the final
1:01:05
here we have 300 inputs into the final
1:01:05
here we have 300 inputs into the final layer
1:01:07
layer
1:01:07
layer so
1:01:08
so
1:01:08
so let's initialize our neural net we now
1:01:10
let's initialize our neural net we now
1:01:10
let's initialize our neural net we now have ten thousand ex ten thousand
1:01:12
have ten thousand ex ten thousand
1:01:12
have ten thousand ex ten thousand parameters instead of three thousand
1:01:13
parameters instead of three thousand
1:01:13
parameters instead of three thousand parameters
1:01:15
parameters
1:01:15
parameters and then we're not using this
1:01:18
and then we're not using this
1:01:18
and then we're not using this and then here what i'd like to do is i'd
1:01:19
and then here what i'd like to do is i'd
1:01:19
and then here what i'd like to do is i'd like to actually uh keep track of uh
1:01:23
like to actually uh keep track of uh
1:01:23
like to actually uh keep track of uh tap
1:01:24
tap
1:01:24
tap um
1:01:27
okay let's just do this let's keep stats
1:01:29
okay let's just do this let's keep stats
1:01:29
okay let's just do this let's keep stats again
1:01:30
again
1:01:30
again and here when we're keeping track of the
1:01:34
loss let's just also keep track of the
1:01:37
loss let's just also keep track of the
1:01:37
loss let's just also keep track of the steps and let's just have i here
1:01:40
steps and let's just have i here
1:01:40
steps and let's just have i here and let's train on thirty thousand
1:01:44
and let's train on thirty thousand
1:01:44
and let's train on thirty thousand or rather say
1:01:46
or rather say
1:01:46
or rather say okay let's try thirty thousand
1:01:48
okay let's try thirty thousand
1:01:48
okay let's try thirty thousand and we are at point one
1:01:51
and we are at point one
1:01:51
and we are at point one and
1:01:52
and
1:01:52
and we should be able to run this
1:01:54
we should be able to run this
1:01:54
we should be able to run this and optimize the neural net
1:01:57
and optimize the neural net
1:01:57
and optimize the neural net and then here basically i want to
1:01:59
and then here basically i want to
1:01:59
and then here basically i want to plt.plot
1:02:00
plt.plot
1:02:00
plt.plot the steps
1:02:02
the steps
1:02:02
the steps against the loss
1:02:09
so these are the x's and y's
1:02:11
so these are the x's and y's
1:02:11
so these are the x's and y's and this is
1:02:13
and this is
1:02:13
and this is the loss function and how it's being
1:02:15
the loss function and how it's being
1:02:15
the loss function and how it's being optimized
1:02:16
optimized
1:02:16
optimized now you see that there's quite a bit of
1:02:18
now you see that there's quite a bit of
1:02:18
now you see that there's quite a bit of thickness to this and that's because we
1:02:19
thickness to this and that's because we
1:02:19
thickness to this and that's because we are optimizing over these mini batches
1:02:21
are optimizing over these mini batches
1:02:21
are optimizing over these mini batches and the mini batches create a little bit
1:02:23
and the mini batches create a little bit
1:02:23
and the mini batches create a little bit of noise
1:02:24
of noise
1:02:24
of noise in this
1:02:25
in this
1:02:25
in this uh where are we in the def set we are at
1:02:28
uh where are we in the def set we are at
1:02:28
uh where are we in the def set we are at 2.5 so we still haven't optimized this
1:02:30
2.5 so we still haven't optimized this
1:02:30
2.5 so we still haven't optimized this neural net very well
1:02:32
neural net very well
1:02:32
neural net very well and that's probably because we made it
1:02:33
and that's probably because we made it
1:02:33
and that's probably because we made it bigger it might take longer for this
1:02:34
bigger it might take longer for this
1:02:34
bigger it might take longer for this neural net to converge
1:02:36
neural net to converge
1:02:36
neural net to converge um
1:02:37
um
1:02:37
um and so let's continue training
1:02:40
and so let's continue training
1:02:40
and so let's continue training um
1:02:42
um
1:02:42
um yeah let's just continue training
1:02:46
one possibility is that the batch size
1:02:47
one possibility is that the batch size
1:02:48
one possibility is that the batch size is so low
1:02:49
is so low
1:02:49
is so low that uh we just have way too much noise
1:02:51
that uh we just have way too much noise
1:02:51
that uh we just have way too much noise in the training and we may want to
1:02:53
in the training and we may want to
1:02:53
in the training and we may want to increase the batch size so that we have
1:02:54
increase the batch size so that we have
1:02:54
increase the batch size so that we have a bit more um correct gradient and we're
1:02:57
a bit more um correct gradient and we're
1:02:57
a bit more um correct gradient and we're not thrashing too much and we can
1:02:59
not thrashing too much and we can
1:02:59
not thrashing too much and we can actually like optimize more properly
1:03:07
okay
1:03:08
okay
1:03:08
okay this will now become meaningless because
1:03:10
this will now become meaningless because
1:03:10
this will now become meaningless because we've reinitialized these so
1:03:13
we've reinitialized these so
1:03:13
we've reinitialized these so yeah this looks not
1:03:14
yeah this looks not
1:03:14
yeah this looks not pleasing right now but there probably is
1:03:16
pleasing right now but there probably is
1:03:16
pleasing right now but there probably is like a tiny improvement but it's so hard
1:03:18
like a tiny improvement but it's so hard
1:03:18
like a tiny improvement but it's so hard to tell
1:03:20
to tell
1:03:20
to tell let's go again
1:03:22
let's go again
1:03:22
let's go again 2.52
1:03:25
let's try to decrease the learning rate
1:03:27
let's try to decrease the learning rate
1:03:27
let's try to decrease the learning rate by factor two
1:03:50
okay we're at 2.32
1:03:52
okay we're at 2.32
1:03:52
okay we're at 2.32 let's continue training
1:04:05
we basically expect to see a lower loss
1:04:07
we basically expect to see a lower loss
1:04:07
we basically expect to see a lower loss than what we had before because now we
1:04:08
than what we had before because now we
1:04:08
than what we had before because now we have a much much bigger model and we
1:04:10
have a much much bigger model and we
1:04:10
have a much much bigger model and we were under fitting so we'd expect that
1:04:12
were under fitting so we'd expect that
1:04:12
were under fitting so we'd expect that increasing the size of the model should
1:04:14
increasing the size of the model should
1:04:14
increasing the size of the model should help the neural net
1:04:16
help the neural net
1:04:16
help the neural net 2.32 okay so that's not happening too
1:04:18
2.32 okay so that's not happening too
1:04:18
2.32 okay so that's not happening too well
1:04:19
well
1:04:19
well now one other concern is that even
1:04:21
now one other concern is that even
1:04:21
now one other concern is that even though we've made the 10h layer here or
1:04:23
though we've made the 10h layer here or
1:04:23
though we've made the 10h layer here or the hidden layer much much bigger it
1:04:25
the hidden layer much much bigger it
1:04:25
the hidden layer much much bigger it could be that the bottleneck of the
1:04:26
could be that the bottleneck of the
1:04:26
could be that the bottleneck of the network right now are these embeddings
1:04:28
network right now are these embeddings
1:04:28
network right now are these embeddings that are two dimensional it can be that
1:04:30
that are two dimensional it can be that
1:04:30
that are two dimensional it can be that we're just cramming way too many
1:04:32
we're just cramming way too many
1:04:32
we're just cramming way too many characters into just two dimensions and
1:04:34
characters into just two dimensions and
1:04:34
characters into just two dimensions and the neural net is not able to really use
1:04:36
the neural net is not able to really use
1:04:36
the neural net is not able to really use that space effectively and that that is
1:04:38
that space effectively and that that is
1:04:38
that space effectively and that that is sort of like the bottleneck to our
1:04:39
sort of like the bottleneck to our
1:04:39
sort of like the bottleneck to our network's performance
1:04:42
network's performance
1:04:42
network's performance okay 2.23 so just by decreasing the
1:04:45
okay 2.23 so just by decreasing the
1:04:45
okay 2.23 so just by decreasing the learning rate i was able to make quite a
1:04:46
learning rate i was able to make quite a
1:04:46
learning rate i was able to make quite a bit of progress let's run this one more
1:04:48
bit of progress let's run this one more
1:04:48
bit of progress let's run this one more time
1:04:51
and then evaluate the training and the
1:04:53
and then evaluate the training and the
1:04:53
and then evaluate the training and the dev loss
1:04:56
now one more thing after training that
1:04:58
now one more thing after training that
1:04:58
now one more thing after training that i'd like to do is i'd like to visualize
1:05:00
i'd like to do is i'd like to visualize
1:05:00
i'd like to do is i'd like to visualize the um
1:05:02
the um
1:05:02
the um embedding vectors for these
1:05:05
embedding vectors for these
1:05:05
embedding vectors for these characters before we scale up the
1:05:07
characters before we scale up the
1:05:07
characters before we scale up the embedding size from two
1:05:09
embedding size from two
1:05:09
embedding size from two because we'd like to make uh this
1:05:11
because we'd like to make uh this
1:05:11
because we'd like to make uh this bottleneck potentially go away
1:05:13
bottleneck potentially go away
1:05:13
bottleneck potentially go away but once i make this greater than two we
1:05:15
but once i make this greater than two we
1:05:15
but once i make this greater than two we won't be able to visualize them
1:05:17
won't be able to visualize them
1:05:17
won't be able to visualize them so here
1:05:18
so here
1:05:18
so here okay we're at 2.23 and 2.24
1:05:21
okay we're at 2.23 and 2.24
1:05:21
okay we're at 2.23 and 2.24 so um we're not improving much more and
1:05:24
so um we're not improving much more and
1:05:24
so um we're not improving much more and maybe the bottleneck now is the
1:05:25
maybe the bottleneck now is the
1:05:25
maybe the bottleneck now is the character embedding size which is two
1:05:28
character embedding size which is two
1:05:28
character embedding size which is two so here i have a bunch of code that will
1:05:29
so here i have a bunch of code that will
1:05:29
so here i have a bunch of code that will create a figure
1:05:31
create a figure
1:05:31
create a figure and then we're going to visualize
1:05:33
and then we're going to visualize
1:05:34
and then we're going to visualize the embeddings that were trained by the
1:05:35
the embeddings that were trained by the
1:05:35
the embeddings that were trained by the neural net
1:05:36
neural net
1:05:36
neural net on these characters because right now
1:05:38
on these characters because right now
1:05:38
on these characters because right now the embedding has just two so we can
1:05:40
the embedding has just two so we can
1:05:40
the embedding has just two so we can visualize all the characters with the x
1:05:42
visualize all the characters with the x
1:05:42
visualize all the characters with the x and the y coordinates as the two
1:05:43
and the y coordinates as the two
1:05:43
and the y coordinates as the two embedding locations for each of these
1:05:46
embedding locations for each of these
1:05:46
embedding locations for each of these characters
1:05:47
characters
1:05:47
characters and so here are the x coordinates and
1:05:49
and so here are the x coordinates and
1:05:50
and so here are the x coordinates and the y coordinates which are the columns
1:05:51
the y coordinates which are the columns
1:05:51
the y coordinates which are the columns of c
1:05:52
of c
1:05:52
of c and then for each one i also include the
1:05:55
and then for each one i also include the
1:05:55
and then for each one i also include the text of the little character
1:05:58
text of the little character
1:05:58
text of the little character so here what we see is actually kind of
1:05:59
so here what we see is actually kind of
1:05:59
so here what we see is actually kind of interesting
1:06:02
the network has basically learned to
1:06:04
the network has basically learned to
1:06:04
the network has basically learned to separate out the characters and cluster
1:06:06
separate out the characters and cluster
1:06:06
separate out the characters and cluster them a little bit uh so for example you
1:06:08
them a little bit uh so for example you
1:06:08
them a little bit uh so for example you see how the vowels
1:06:09
see how the vowels
1:06:09
see how the vowels a e i o u are clustered up here
1:06:12
a e i o u are clustered up here
1:06:12
a e i o u are clustered up here so that's telling us that is that the
1:06:14
so that's telling us that is that the
1:06:14
so that's telling us that is that the neural net treats these is very similar
1:06:16
neural net treats these is very similar
1:06:16
neural net treats these is very similar right because when they feed into the
1:06:17
right because when they feed into the
1:06:17
right because when they feed into the neural net
1:06:18
neural net
1:06:18
neural net the embedding uh for all these
1:06:20
the embedding uh for all these
1:06:20
the embedding uh for all these characters is very similar and so the
1:06:22
characters is very similar and so the
1:06:22
characters is very similar and so the neural net thinks that they're very
1:06:23
neural net thinks that they're very
1:06:23
neural net thinks that they're very similar and kind of like interchangeable
1:06:25
similar and kind of like interchangeable
1:06:25
similar and kind of like interchangeable if that makes sense
1:06:27
if that makes sense
1:06:27
if that makes sense um
1:06:29
um
1:06:29
um then the the points that are like really
1:06:31
then the the points that are like really
1:06:31
then the the points that are like really far away are for example q q is kind of
1:06:33
far away are for example q q is kind of
1:06:33
far away are for example q q is kind of treated as an exception and q has a very
1:06:35
treated as an exception and q has a very
1:06:35
treated as an exception and q has a very special
1:06:36
special
1:06:36
special embedding vector so to speak
1:06:38
embedding vector so to speak
1:06:38
embedding vector so to speak similarly dot which is a special
1:06:40
similarly dot which is a special
1:06:40
similarly dot which is a special character is all the way out here
1:06:42
character is all the way out here
1:06:42
character is all the way out here and a lot of the other letters are sort
1:06:44
and a lot of the other letters are sort
1:06:44
and a lot of the other letters are sort of like clustered up here and so it's
1:06:46
of like clustered up here and so it's
1:06:46
of like clustered up here and so it's kind of interesting that there's a
1:06:47
kind of interesting that there's a
1:06:48
kind of interesting that there's a little bit of structure here
1:06:50
little bit of structure here
1:06:50
little bit of structure here after the training
1:06:51
after the training
1:06:51
after the training and it's not definitely not random and
1:06:53
and it's not definitely not random and
1:06:53
and it's not definitely not random and these embeddings make sense
1:06:55
these embeddings make sense
1:06:55
these embeddings make sense so we're now going to scale up the
1:06:57
so we're now going to scale up the
1:06:57
so we're now going to scale up the embedding size and won't be able to
1:06:59
embedding size and won't be able to
1:06:59
embedding size and won't be able to visualize it directly but we expect that
1:07:01
visualize it directly but we expect that
1:07:01
visualize it directly but we expect that because we're under fitting
1:07:03
because we're under fitting
1:07:03
because we're under fitting and we made this layer much bigger and
1:07:06
and we made this layer much bigger and
1:07:06
and we made this layer much bigger and did not sufficiently improve the loss
1:07:08
did not sufficiently improve the loss
1:07:08
did not sufficiently improve the loss we're thinking that the um
1:07:10
we're thinking that the um
1:07:10
we're thinking that the um constraint to better performance right
1:07:12
constraint to better performance right
1:07:12
constraint to better performance right now could be these embedding pictures so
1:07:15
now could be these embedding pictures so
1:07:15
now could be these embedding pictures so let's make them bigger okay so let's
1:07:16
let's make them bigger okay so let's
1:07:16
let's make them bigger okay so let's scroll up here
1:07:17
scroll up here
1:07:17
scroll up here and now we don't have two dimensional
1:07:19
and now we don't have two dimensional
1:07:19
and now we don't have two dimensional embeddings we are going to have
1:07:21
embeddings we are going to have
1:07:21
embeddings we are going to have say 10 dimensional embeddings for each
1:07:23
say 10 dimensional embeddings for each
1:07:23
say 10 dimensional embeddings for each word
1:07:25
word
1:07:25
word then
1:07:26
then
1:07:26
then this layer will receive 3 times 10 so 30
1:07:30
this layer will receive 3 times 10 so 30
1:07:30
this layer will receive 3 times 10 so 30 inputs
1:07:31
inputs
1:07:31
inputs will go into
1:07:33
will go into
1:07:33
will go into the hidden layer
1:07:35
the hidden layer
1:07:35
the hidden layer let's also make the hidden layer a bit
1:07:37
let's also make the hidden layer a bit
1:07:37
let's also make the hidden layer a bit smaller so instead of 300 let's just do
1:07:38
smaller so instead of 300 let's just do
1:07:38
smaller so instead of 300 let's just do 200 neurons in that hidden layer
1:07:41
200 neurons in that hidden layer
1:07:41
200 neurons in that hidden layer so now the total number of elements will
1:07:43
so now the total number of elements will
1:07:43
so now the total number of elements will be slightly bigger at 11 000
1:07:47
be slightly bigger at 11 000
1:07:47
be slightly bigger at 11 000 and then here we have to be a bit
1:07:48
and then here we have to be a bit
1:07:48
and then here we have to be a bit careful because um
1:07:50
careful because um
1:07:50
careful because um okay the learning rate we set to 0.1
1:07:53
okay the learning rate we set to 0.1
1:07:53
okay the learning rate we set to 0.1 here we are hardcoded in six and
1:07:55
here we are hardcoded in six and
1:07:56
here we are hardcoded in six and obviously if you're working in
1:07:56
obviously if you're working in
1:07:56
obviously if you're working in production you don't wanna be
1:07:57
production you don't wanna be
1:07:57
production you don't wanna be hard-coding magic numbers but instead of
1:07:59
hard-coding magic numbers but instead of
1:08:00
hard-coding magic numbers but instead of six this should now be thirty
1:08:02
six this should now be thirty
1:08:02
six this should now be thirty um
1:08:04
um
1:08:04
um and let's run for fifty thousand
1:08:05
and let's run for fifty thousand
1:08:05
and let's run for fifty thousand iterations and let me split out the
1:08:07
iterations and let me split out the
1:08:07
iterations and let me split out the initialization here outside
1:08:10
initialization here outside
1:08:10
initialization here outside so that when we run this cell multiple
1:08:12
so that when we run this cell multiple
1:08:12
so that when we run this cell multiple times it's not going to wipe out
1:08:14
times it's not going to wipe out
1:08:14
times it's not going to wipe out our loss
1:08:17
in addition to that
1:08:19
in addition to that
1:08:19
in addition to that here
1:08:20
here
1:08:20
here let's instead of logging lost.item let's
1:08:22
let's instead of logging lost.item let's
1:08:22
let's instead of logging lost.item let's actually
1:08:23
actually
1:08:23
actually log the
1:08:25
log the
1:08:25
log the let's
1:08:26
let's
1:08:26
let's do log 10
1:08:28
do log 10
1:08:28
do log 10 i believe that's a function of the loss
1:08:32
i believe that's a function of the loss
1:08:32
i believe that's a function of the loss and i'll show you why in a second let's
1:08:34
and i'll show you why in a second let's
1:08:34
and i'll show you why in a second let's optimize this
1:08:37
basically i'd like to plot the log loss
1:08:39
basically i'd like to plot the log loss
1:08:39
basically i'd like to plot the log loss instead of the loss because when you
1:08:40
instead of the loss because when you
1:08:40
instead of the loss because when you plot the loss many times it can have
1:08:42
plot the loss many times it can have
1:08:42
plot the loss many times it can have this hockey stick appearance and log
1:08:44
this hockey stick appearance and log
1:08:44
this hockey stick appearance and log squashes it in
1:08:46
squashes it in
1:08:46
squashes it in uh so it just kind of like looks nicer
1:08:48
uh so it just kind of like looks nicer
1:08:48
uh so it just kind of like looks nicer so the x-axis is step i
1:08:51
so the x-axis is step i
1:08:51
so the x-axis is step i and the y-axis will be the loss i
1:09:00
and then here this is 30.
1:09:03
and then here this is 30.
1:09:03
and then here this is 30. ideally we wouldn't be hard-coding these
1:09:08
okay so let's look at the loss
1:09:11
okay so let's look at the loss
1:09:11
okay so let's look at the loss okay it's again very thick because the
1:09:13
okay it's again very thick because the
1:09:13
okay it's again very thick because the mini batch size is very small but the
1:09:15
mini batch size is very small but the
1:09:15
mini batch size is very small but the total loss over the training set is 2.3
1:09:17
total loss over the training set is 2.3
1:09:18
total loss over the training set is 2.3 and the the tests or the def set is 2.38
1:09:20
and the the tests or the def set is 2.38
1:09:20
and the the tests or the def set is 2.38 as well
1:09:21
as well
1:09:21
as well so so far so good uh let's try to now
1:09:24
so so far so good uh let's try to now
1:09:24
so so far so good uh let's try to now decrease the learning rate
1:09:25
decrease the learning rate
1:09:25
decrease the learning rate by a factor of 10
1:09:29
and train for another 50 000 iterations
1:09:35
we'd hope that we would be able to beat
1:09:37
we'd hope that we would be able to beat
1:09:37
we'd hope that we would be able to beat uh 2.32
1:09:43
but again we're just kind of like doing
1:09:44
but again we're just kind of like doing
1:09:44
but again we're just kind of like doing this very haphazardly so i don't
1:09:46
this very haphazardly so i don't
1:09:46
this very haphazardly so i don't actually have confidence that our
1:09:48
actually have confidence that our
1:09:48
actually have confidence that our learning rate is set very well that our
1:09:50
learning rate is set very well that our
1:09:50
learning rate is set very well that our learning rate decay which we just do
1:09:52
learning rate decay which we just do
1:09:52
learning rate decay which we just do at random is set very well
1:09:55
at random is set very well
1:09:55
at random is set very well and um so the optimization here is kind
1:09:57
and um so the optimization here is kind
1:09:57
and um so the optimization here is kind of suspect to be honest and this is not
1:09:59
of suspect to be honest and this is not
1:09:59
of suspect to be honest and this is not how you would do it typically in
1:10:00
how you would do it typically in
1:10:00
how you would do it typically in production in production you would
1:10:01
production in production you would
1:10:02
production in production you would create parameters or hyper parameters
1:10:04
create parameters or hyper parameters
1:10:04
create parameters or hyper parameters out of all these settings and then you
1:10:05
out of all these settings and then you
1:10:05
out of all these settings and then you would run lots of experiments and see
1:10:07
would run lots of experiments and see
1:10:07
would run lots of experiments and see whichever ones are working well for you
1:10:11
whichever ones are working well for you
1:10:11
whichever ones are working well for you okay
1:10:12
okay
1:10:12
okay so we have 2.17 now and 2.2 okay so you
1:10:16
so we have 2.17 now and 2.2 okay so you
1:10:16
so we have 2.17 now and 2.2 okay so you see how the training and the validation
1:10:19
see how the training and the validation
1:10:19
see how the training and the validation performance are starting to slightly
1:10:21
performance are starting to slightly
1:10:21
performance are starting to slightly slowly depart
1:10:23
slowly depart
1:10:23
slowly depart so maybe we're getting the sense that
1:10:24
so maybe we're getting the sense that
1:10:24
so maybe we're getting the sense that the neural net
1:10:26
the neural net
1:10:26
the neural net is getting good enough or
1:10:28
is getting good enough or
1:10:28
is getting good enough or that number of parameters is large
1:10:30
that number of parameters is large
1:10:30
that number of parameters is large enough that we are slowly starting to
1:10:32
enough that we are slowly starting to
1:10:32
enough that we are slowly starting to overfit
1:10:34
overfit
1:10:34
overfit let's maybe run one more iteration of
1:10:36
let's maybe run one more iteration of
1:10:36
let's maybe run one more iteration of this
1:10:37
this
1:10:37
this and see where we get
1:10:41
but yeah basically you would be running
1:10:43
but yeah basically you would be running
1:10:43
but yeah basically you would be running lots of experiments and then you are
1:10:44
lots of experiments and then you are
1:10:44
lots of experiments and then you are slowly scrutinizing whichever ones give
1:10:46
slowly scrutinizing whichever ones give
1:10:46
slowly scrutinizing whichever ones give you the best depth performance and then
1:10:48
you the best depth performance and then
1:10:48
you the best depth performance and then once you find all the
1:10:49
once you find all the
1:10:50
once you find all the hyper parameters that make your dev
1:10:51
hyper parameters that make your dev
1:10:51
hyper parameters that make your dev performance good you take that model and
1:10:53
performance good you take that model and
1:10:54
performance good you take that model and you evaluate the test set performance a
1:10:55
you evaluate the test set performance a
1:10:55
you evaluate the test set performance a single time and that's the number that
1:10:57
single time and that's the number that
1:10:57
single time and that's the number that you report in your paper or wherever
1:10:59
you report in your paper or wherever
1:10:59
you report in your paper or wherever else you want to talk about and brag
1:11:01
else you want to talk about and brag
1:11:01
else you want to talk about and brag about your model
1:11:05
so let's then rerun the plot and rerun
1:11:08
so let's then rerun the plot and rerun
1:11:08
so let's then rerun the plot and rerun the train and death
1:11:11
and because we're getting lower loss now
1:11:12
and because we're getting lower loss now
1:11:12
and because we're getting lower loss now it is the case that the embedding size
1:11:15
it is the case that the embedding size
1:11:15
it is the case that the embedding size of these was holding us back very likely
1:11:20
okay so 2.162.19 is what we're roughly
1:11:22
okay so 2.162.19 is what we're roughly
1:11:22
okay so 2.162.19 is what we're roughly getting
1:11:24
getting
1:11:24
getting so there's many ways to go from many
1:11:26
so there's many ways to go from many
1:11:26
so there's many ways to go from many ways to go from here we can continue
1:11:28
ways to go from here we can continue
1:11:28
ways to go from here we can continue tuning the optimization
1:11:30
tuning the optimization
1:11:30
tuning the optimization we can continue for example playing with
1:11:32
we can continue for example playing with
1:11:32
we can continue for example playing with the sizes of the neural net or we can
1:11:34
the sizes of the neural net or we can
1:11:34
the sizes of the neural net or we can increase the number of uh
1:11:36
increase the number of uh
1:11:36
increase the number of uh words or characters in our case that we
1:11:38
words or characters in our case that we
1:11:38
words or characters in our case that we are taking as an input so instead of
1:11:39
are taking as an input so instead of
1:11:39
are taking as an input so instead of just three characters we could be taking
1:11:41
just three characters we could be taking
1:11:41
just three characters we could be taking more characters as an input and that
1:11:43
more characters as an input and that
1:11:43
more characters as an input and that could further improve the loss
1:11:46
could further improve the loss
1:11:46
could further improve the loss okay so i changed the code slightly so
1:11:48
okay so i changed the code slightly so
1:11:48
okay so i changed the code slightly so we have here 200 000 steps of the
1:11:50
we have here 200 000 steps of the
1:11:50
we have here 200 000 steps of the optimization and in the first 100 000
1:11:52
optimization and in the first 100 000
1:11:52
optimization and in the first 100 000 we're using a learning rate of 0.1 and
1:11:54
we're using a learning rate of 0.1 and
1:11:54
we're using a learning rate of 0.1 and then in the next 100 000 we're using a
1:11:56
then in the next 100 000 we're using a
1:11:56
then in the next 100 000 we're using a learning rate of 0.01
1:11:58
learning rate of 0.01
1:11:58
learning rate of 0.01 this is the loss that i achieve
1:12:00
this is the loss that i achieve
1:12:00
this is the loss that i achieve and these are the performance on the
1:12:01
and these are the performance on the
1:12:01
and these are the performance on the training and validation loss
1:12:03
training and validation loss
1:12:03
training and validation loss and in particular the best validation
1:12:05
and in particular the best validation
1:12:05
and in particular the best validation loss i've been able to obtain in the
1:12:07
loss i've been able to obtain in the
1:12:07
loss i've been able to obtain in the last 30 minutes or so is 2.17
1:12:10
last 30 minutes or so is 2.17
1:12:10
last 30 minutes or so is 2.17 so now i invite you to beat this number
1:12:12
so now i invite you to beat this number
1:12:12
so now i invite you to beat this number and you have quite a few knobs available
1:12:14
and you have quite a few knobs available
1:12:14
and you have quite a few knobs available to you to i think surpass this number
1:12:17
to you to i think surpass this number
1:12:17
to you to i think surpass this number so number one you can of course change
1:12:18
so number one you can of course change
1:12:18
so number one you can of course change the number of neurons in the hidden
1:12:20
the number of neurons in the hidden
1:12:20
the number of neurons in the hidden layer of this model you can change the
1:12:22
layer of this model you can change the
1:12:22
layer of this model you can change the dimensionality of the embedding
1:12:24
dimensionality of the embedding
1:12:24
dimensionality of the embedding lookup table
1:12:25
lookup table
1:12:25
lookup table you can change the number of characters
1:12:26
you can change the number of characters
1:12:26
you can change the number of characters that are feeding in as an input
1:12:29
that are feeding in as an input
1:12:29
that are feeding in as an input as the context into this model
1:12:32
as the context into this model
1:12:32
as the context into this model and then of course you can change the
1:12:33
and then of course you can change the
1:12:33
and then of course you can change the details of the optimization how long are
1:12:35
details of the optimization how long are
1:12:35
details of the optimization how long are we running what is the learning rate how
1:12:37
we running what is the learning rate how
1:12:37
we running what is the learning rate how does it change over time
1:12:39
does it change over time
1:12:39
does it change over time how does it decay
1:12:41
how does it decay
1:12:41
how does it decay you can change the batch size and you
1:12:42
you can change the batch size and you
1:12:42
you can change the batch size and you may be able to actually achieve a much
1:12:44
may be able to actually achieve a much
1:12:44
may be able to actually achieve a much better convergence speed
1:12:46
better convergence speed
1:12:46
better convergence speed in terms of
1:12:47
in terms of
1:12:47
in terms of how many seconds or minutes it takes to
1:12:49
how many seconds or minutes it takes to
1:12:49
how many seconds or minutes it takes to train the model and get
1:12:51
train the model and get
1:12:51
train the model and get your result in terms of really good
1:12:53
your result in terms of really good
1:12:54
your result in terms of really good loss
1:12:55
loss
1:12:55
loss and then of course i actually invite you
1:12:57
and then of course i actually invite you
1:12:57
and then of course i actually invite you to read this paper it is 19 pages but at
1:12:59
to read this paper it is 19 pages but at
1:12:59
to read this paper it is 19 pages but at this point you should actually be able
1:13:00
this point you should actually be able
1:13:00
this point you should actually be able to read a good chunk of this paper and
1:13:03
to read a good chunk of this paper and
1:13:03
to read a good chunk of this paper and understand
1:13:04
understand
1:13:04
understand pretty good chunks of it
1:13:06
pretty good chunks of it
1:13:06
pretty good chunks of it and this paper also has quite a few
1:13:08
and this paper also has quite a few
1:13:08
and this paper also has quite a few ideas for improvements that you can play
1:13:09
ideas for improvements that you can play
1:13:09
ideas for improvements that you can play with
1:13:11
with
1:13:11
with so all of those are not available to you
1:13:13
so all of those are not available to you
1:13:13
so all of those are not available to you and you should be able to beat this
1:13:14
and you should be able to beat this
1:13:14
and you should be able to beat this number i'm leaving that as an exercise
1:13:16
number i'm leaving that as an exercise
1:13:16
number i'm leaving that as an exercise to the reader and that's it for now and
1:13:18
to the reader and that's it for now and
1:13:18
to the reader and that's it for now and i'll see you next time
1:13:24
before we wrap up i also wanted to show
1:13:25
before we wrap up i also wanted to show
1:13:25
before we wrap up i also wanted to show how you would sample from the model
1:13:28
how you would sample from the model
1:13:28
how you would sample from the model so we're going to generate 20 samples
1:13:31
so we're going to generate 20 samples
1:13:31
so we're going to generate 20 samples at first we begin with all dots so
1:13:33
at first we begin with all dots so
1:13:33
at first we begin with all dots so that's the context
1:13:35
that's the context
1:13:35
that's the context and then until we generate
1:13:37
and then until we generate
1:13:37
and then until we generate the zeroth character again
1:13:39
the zeroth character again
1:13:40
the zeroth character again we're going to embed the current context
1:13:43
we're going to embed the current context
1:13:43
we're going to embed the current context using the embedding table c now usually
1:13:47
using the embedding table c now usually
1:13:47
using the embedding table c now usually uh here the first dimension was the size
1:13:49
uh here the first dimension was the size
1:13:49
uh here the first dimension was the size of the training set but here we're only
1:13:51
of the training set but here we're only
1:13:51
of the training set but here we're only working with a single example that we're
1:13:52
working with a single example that we're
1:13:52
working with a single example that we're generating so this is just the mission
1:13:55
generating so this is just the mission
1:13:55
generating so this is just the mission one just for simplicity
1:13:58
and so this embedding then gets
1:14:00
and so this embedding then gets
1:14:00
and so this embedding then gets projected into the end state you get the
1:14:02
projected into the end state you get the
1:14:02
projected into the end state you get the logits
1:14:03
logits
1:14:03
logits now we calculate the probabilities for
1:14:05
now we calculate the probabilities for
1:14:05
now we calculate the probabilities for that you can use f.softmax
1:14:09
that you can use f.softmax
1:14:09
that you can use f.softmax of logits and that just basically
1:14:10
of logits and that just basically
1:14:10
of logits and that just basically exponentiates the logits and makes them
1:14:12
exponentiates the logits and makes them
1:14:12
exponentiates the logits and makes them sum to one and similar to cross entropy
1:14:15
sum to one and similar to cross entropy
1:14:15
sum to one and similar to cross entropy it is careful that there's no overflows
1:14:18
it is careful that there's no overflows
1:14:18
it is careful that there's no overflows once we have the probabilities we sample
1:14:20
once we have the probabilities we sample
1:14:20
once we have the probabilities we sample from them using torture multinomial to
1:14:21
from them using torture multinomial to
1:14:22
from them using torture multinomial to get our next index and then we shift the
1:14:24
get our next index and then we shift the
1:14:24
get our next index and then we shift the context window to append the index and
1:14:26
context window to append the index and
1:14:26
context window to append the index and record it and then we can just
1:14:29
record it and then we can just
1:14:29
record it and then we can just decode all the integers to strings
1:14:31
decode all the integers to strings
1:14:31
decode all the integers to strings and print them out
1:14:33
and print them out
1:14:33
and print them out and so these are some example samples
1:14:34
and so these are some example samples
1:14:34
and so these are some example samples and you can see that the model now works
1:14:36
and you can see that the model now works
1:14:36
and you can see that the model now works much better so the words here are much
1:14:38
much better so the words here are much
1:14:38
much better so the words here are much more word like or name like so we have
1:14:41
more word like or name like so we have
1:14:41
more word like or name like so we have things like ham
1:14:44
things like ham
1:14:44
things like ham joes
1:14:48
you know it's starting to sound a little
1:14:49
you know it's starting to sound a little
1:14:49
you know it's starting to sound a little bit more name-like so we're definitely
1:14:51
bit more name-like so we're definitely
1:14:51
bit more name-like so we're definitely making progress but we can still improve
1:14:53
making progress but we can still improve
1:14:53
making progress but we can still improve on this model quite a lot
1:14:55
on this model quite a lot
1:14:55
on this model quite a lot okay sorry there's some bonus content i
1:14:57
okay sorry there's some bonus content i
1:14:57
okay sorry there's some bonus content i wanted to mention that i want to make
1:14:59
wanted to mention that i want to make
1:14:59
wanted to mention that i want to make these notebooks more accessible and so i
1:15:01
these notebooks more accessible and so i
1:15:01
these notebooks more accessible and so i don't want you to have to like install
1:15:03
don't want you to have to like install
1:15:03
don't want you to have to like install jupyter notebooks and torch and
1:15:04
jupyter notebooks and torch and
1:15:04
jupyter notebooks and torch and everything else so i will be sharing a
1:15:06
everything else so i will be sharing a
1:15:06
everything else so i will be sharing a link to a google colab
1:15:08
link to a google colab
1:15:08
link to a google colab and google collab will look like a
1:15:10
and google collab will look like a
1:15:10
and google collab will look like a notebook in your browser and you can
1:15:13
notebook in your browser and you can
1:15:13
notebook in your browser and you can just go to the url and you'll be able to
1:15:15
just go to the url and you'll be able to
1:15:15
just go to the url and you'll be able to execute all of the code that you saw in
1:15:17
execute all of the code that you saw in
1:15:17
execute all of the code that you saw in the google collab and so this is me
1:15:19
the google collab and so this is me
1:15:20
the google collab and so this is me executing the code in this lecture and i
1:15:22
executing the code in this lecture and i
1:15:22
executing the code in this lecture and i shortened it a little bit but basically
1:15:24
shortened it a little bit but basically
1:15:24
shortened it a little bit but basically you're able to train the exact same
1:15:25
you're able to train the exact same
1:15:25
you're able to train the exact same network and then plot and sample from
1:15:28
network and then plot and sample from
1:15:28
network and then plot and sample from the model and everything is ready for
1:15:29
the model and everything is ready for
1:15:30
the model and everything is ready for you to like tinker with the numbers
1:15:31
you to like tinker with the numbers
1:15:31
you to like tinker with the numbers right there in your browser no
1:15:33
right there in your browser no
1:15:33
right there in your browser no installation necessary
1:15:35
installation necessary
1:15:35
installation necessary so i just wanted to point that out and
1:15:36
so i just wanted to point that out and
1:15:36
so i just wanted to point that out and the link to this will be in the video
1:15:38
the link to this will be in the video
1:15:38
the link to this will be in the video description