Neural Networks: Zero to Hero
Building makemore Part 3: Activations & Gradients, BatchNorm
Transcript
0:00
Hi everyone. Today we are continuing our implementation of makemore. In the last lecture we implemented the multilayer perceptron along the lines of Bengio et al. 2003 for character-level language modeling. We followed this paper, took in a few characters in the past, and used an MLP to predict the next character in a sequence. What we'd like to do now is move on to more complex and larger neural networks, like recurrent neural networks and their variations such as the GRU, LSTM, and so on.
0:26
Before we do that, though, we have to stick around at the level of the multilayer perceptron for a bit longer. I'd like to do this because I would like us to have a very good intuitive understanding of the activations in the neural net during training, and especially the gradients that are flowing backwards: how they behave and what they look like.
0:43
This is going to be very important for understanding the history of the development of these architectures. We'll see that recurrent neural networks, while they are very expressive in that they are a universal approximator and can in principle implement all the algorithms, are not very easily optimizable with the first-order gradient-based techniques that we have available to us and that we use all the time. The key to understanding why they are not easily optimizable is to understand the activations and the gradients and how they behave during training. We'll see that a lot of the variants since recurrent neural networks have tried to improve that situation. So that's the path that we have to take, and let's get started.
1:23
The starting code for this lecture is largely the code from before, but I've cleaned it up a little bit. You'll see that we are importing all the torch and matplotlib utilities and reading in the words just like before. These are eight example words, and there's a total of 32,000 of them. Here's a vocabulary of all the lowercase letters and the special dot token. Here we are reading the dataset, processing it, and creating three splits: the train, dev, and test split.
1:55
Now, the MLP. This is the identical MLP, except you see that I removed a bunch of magic numbers that we had here. Instead we have the dimensionality of the embedding space of the characters and the number of hidden units in the hidden layer, and I've pulled them outside here so that we don't have to go and change all these magic numbers all the time. We have the same neural net with 11,000 parameters, which we now optimize over 200,000 steps with a batch size of 32. You'll see that I refactored the code here a little bit, but there are no functional changes. I just created a few extra variables, added a few more comments, and removed all the magic numbers. Otherwise it's the exact same thing.
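As a sanity check on the parameter count mentioned above, here is a small sketch of the arithmetic. The specific sizes (a context of 3 characters, 10-dimensional embeddings, 200 hidden units) are assumptions carried over from the earlier makemore lecture; this part of the transcript does not state them, and the names below are illustrative only.

```python
# Assumed hyperparameters (not stated in this part of the transcript).
vocab_size = 27   # 26 lowercase letters plus the special '.' token
block_size = 3    # assumed context length: previous characters fed to the MLP
n_embd = 10       # assumed dimensionality of the character embedding vectors
n_hidden = 200    # assumed number of neurons in the hidden layer

# Shape of each parameter tensor in the MLP.
shapes = {
    "C": (vocab_size, n_embd),              # embedding table
    "W1": (n_embd * block_size, n_hidden),  # hidden layer weights
    "b1": (n_hidden,),                      # hidden layer bias
    "W2": (n_hidden, vocab_size),           # output layer weights
    "b2": (vocab_size,),                    # output layer bias
}

def numel(shape):
    """Number of scalar entries in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n

total = sum(numel(s) for s in shapes.values())
print(total)  # 11897
```

With these assumed sizes the total is 11,897, which is in line with the roughly 11,000 parameters quoted in the lecture.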
2:33
Then when we optimize, we saw that our loss looked something like this. We saw that the train and val loss were about 2.16 and so on. Here I refactored the code a little bit for the evaluation of arbitrary splits. You pass in a string for which split you'd like to evaluate, and then, depending on train, val, or test, I index in and get the correct split. Then this is the forward pass of the network, the evaluation of the loss, and printing it. So that's just making it nicer.
3:04
One thing that you'll notice here is that I'm using a decorator, torch.no_grad, which you can also look up and read the documentation of. Basically, what this decorator does on top of a function is that whatever happens in this function is assumed by torch to never require any gradients. So it will not do any of the bookkeeping that it does to keep track of all the gradients in anticipation of an eventual backward pass. It's almost as if all the tensors that get created here have requires_grad set to False. It makes everything much more efficient, because you're telling torch that you will not call .backward() on any of this computation and it doesn't need to maintain the graph under the hood. So that's what this does. You can also use a context manager, with torch.no_grad(), and you can look those up.
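To make the two forms concrete, here is a minimal sketch, assuming PyTorch is installed. The tensor W, the input x, and the function evaluate are hypothetical stand-ins, not the lecture's actual model code.

```python
import torch

# Toy parameter and input standing in for the model (hypothetical names and shapes).
W = torch.randn(3, 2, requires_grad=True)
x = torch.randn(4, 3)

@torch.no_grad()  # decorator form: nothing inside this function is tracked by autograd
def evaluate(inputs):
    return (inputs @ W).sum()

tracked = (x @ W).sum()    # normal computation: autograd records the graph
untracked = evaluate(x)    # same math, but no graph is built

with torch.no_grad():      # context-manager form, same effect
    also_untracked = (x @ W).sum()

print(tracked.requires_grad, untracked.requires_grad, also_untracked.requires_grad)
# True False False
```

Because no graph is kept for the untracked results, calling .backward() on them would raise an error, which is the trade-off for the saved bookkeeping.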
3:54
Then here we have the sampling from the model, just as before: a forward pass of the neural net, getting the distribution, sampling from it, adjusting the context window, and repeating until we get the special end token. We see that we are starting to get much nicer looking words sampled from the model. It's still not amazing, and they're still not fully name-like, but it's much better than what we had with the bigram model. So that's our starting point.
4:19
Now, the first thing I would like to scrutinize is the initialization. I can tell that our network is very improperly configured at initialization. There are multiple things wrong with it, but let's just start with the first one. Look here: on the zeroth iteration, the very first iteration, we are recording a loss of 27, and this rapidly comes down to roughly one or two or so. I can tell that the initialization is all messed up because this is way too high.
4:42
In training of neural nets, it is almost always the case that you will have a rough idea of what loss to expect at initialization, and that just depends on the loss function and the problem setup. In this case I do not expect 27. I expect a much lower number, and we can calculate it together.
5:00
Basically, at initialization, there are 27 characters that could come next for any one training example. At initialization we have no reason to believe any character to be much more likely than the others, so we'd expect that the probability distribution that comes out initially is a uniform distribution, assigning about equal probability to all 27 characters. So what we'd like is for the probability of any character to be roughly 1/27. That is the probability we should record, and the loss is the negative log probability. So let's wrap this in a tensor, take the log of it, and then the negative log probability is the loss we would expect, which is 3.29, much, much lower than 27.
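The calculation just described can be reproduced in a few lines. This sketch uses Python's standard math module instead of wrapping the value in a torch tensor as the lecture does; the variable names are illustrative.

```python
import math

vocab_size = 27                       # 26 letters plus the '.' token
p_uniform = 1.0 / vocab_size          # probability of each character under a uniform guess
expected_loss = -math.log(p_uniform)  # negative log-likelihood of the correct character

print(f"{expected_loss:.4f}")  # 3.2958
```

This is simply log(27), the value the lecture quotes as 3.29, and it is the loss a network should record if it has no preference among the 27 characters.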
5:51
What's happening right now is that at initialization the neural net is creating probability distributions that are all messed up. Some characters are very confident and some characters are very not confident. Basically, the network is very confidently wrong, and that's what makes it record a very high loss.
6:11
Here's a smaller, four-dimensional example of the issue. Let's say we only have four characters, and we have logits that come out of the neural net that are very, very close to zero. Then when we take the softmax of all zeros, we get probabilities that are a diffuse distribution: it sums to one and is exactly uniform. In this case, if the label is, say, two, it doesn't actually matter whether the label is two or three or one or zero, because it's a uniform distribution. We're recording the exact same loss, in this case 1.38. So this is the loss we would expect for a four-dimensional example.
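Here is a small sketch of that four-class example, written in plain Python rather than with the torch calls used in the lecture. The helper names softmax and nll are my own.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to one."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])

logits = [0.0, 0.0, 0.0, 0.0]
probs = softmax(logits)
losses = [nll(logits, label) for label in range(4)]

print(probs)   # [0.25, 0.25, 0.25, 0.25]
print(losses)  # every entry is log(4), about 1.386
```

Whichever label is the correct one, the loss is the same, because the uniform distribution treats all four classes identically.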
6:46
Now you can see, of course, that as we start to manipulate these logits, we're going to be changing the loss here. It could be that we luck out and by chance this could be a very high number, like five or something like that. In that case we'll record a very low loss, because we're assigning a high probability at initialization, by chance, to the correct label. Much more likely is that some other dimension will have a high logit, and then we start to record a much higher loss. What can happen is that the logits come out something like this, they take on extreme values, and we record a really high loss.
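The lucky and unlucky cases can be compared directly. In this sketch the logit values are made up for illustration (a single logit of 5 placed on either the correct class or a wrong one), and nll is a hypothetical helper, not a function from the lecture.

```python
import math

def nll(logits, label):
    """Negative log-probability that softmax(logits) assigns to `label`."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

label = 2
uniform_loss = nll([0.0, 0.0, 0.0, 0.0], label)  # log(4), about 1.386
lucky_loss = nll([0.0, 0.0, 5.0, 0.0], label)    # large logit on the correct class
unlucky_loss = nll([0.0, 5.0, 0.0, 0.0], label)  # large logit on a wrong class

print(f"uniform: {uniform_loss:.3f}")  # uniform: 1.386
print(f"lucky:   {lucky_loss:.3f}")    # lucky:   0.020
print(f"unlucky: {unlucky_loss:.3f}")  # unlucky: 5.020
```

With four classes, a randomly placed large logit lands on a wrong class three times out of four, so on average confident logits at initialization cost far more than they gain.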
7:30
For example, if we have torch.randn(4), these are four normally distributed numbers. Here we can also print the logits, the probabilities that come out of it, and the loss. Because these logits are near zero for the most part, the loss that comes out is okay. But suppose this is multiplied by 10. Now you see how, because these are more extreme values, it's very unlikely that you're going to be guessing the correct bucket, and then you're confidently wrong and recording a very high loss. If your logits are coming out even more extreme, you might get extremely insane losses, like infinity, even at initialization.
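To see the effect of the scale on average rather than for a single draw, the sketch below repeats the experiment many times in plain Python. It is an illustration under my own assumptions (2,000 trials, a fixed seed, label 2), not the lecture's code. It also uses a numerically stable log-sum-exp, so it reports large finite losses where a naive softmax followed by a log could underflow to a probability of zero and an infinite loss.

```python
import math
import random

def nll(logits, label):
    """Negative log-probability that softmax(logits) assigns to `label`."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def mean_loss(scale, trials=2000, n_classes=4, label=2, seed=0):
    """Average loss when the logits are standard normal draws multiplied by `scale`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        logits = [rng.gauss(0.0, 1.0) * scale for _ in range(n_classes)]
        total += nll(logits, label)
    return total / trials

for scale in (1, 10, 100):
    print(scale, round(mean_loss(scale), 2))
```

The average loss grows roughly in proportion to the scale, which is why large logits at initialization produce a loss far above the uniform baseline.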
8:21
So basically this is not good, and we want the logits to be roughly zero when the network is initialized. In fact, the logits don't have to be exactly zero, they just have to be equal. For example, if all the logits are one, then because of the normalization inside the softmax this will actually come out okay. But by symmetry we don't want it to be any arbitrary positive or negative number. We just want it to be all zeros and to record the loss that we expect at initialization.
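The point about equal logits follows from the softmax being unchanged when the same constant is added to every logit. A quick check in plain Python, with example values I made up:

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to one."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Equal logits give the uniform distribution, whatever the shared value is.
all_zeros = softmax([0.0, 0.0, 0.0, 0.0])
all_ones = softmax([1.0, 1.0, 1.0, 1.0])
print(all_zeros, all_ones)

# More generally, adding a constant to every logit leaves the probabilities unchanged.
base = [2.0, -1.0, 0.5, 0.0]
shifted = [z + 3.0 for z in base]
print(softmax(base))
print(softmax(shifted))
```

So all ones would give the same loss as all zeros; zero is just the natural symmetric choice.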
8:47
Let's now concretely see where things go wrong in our example. Here we have the initialization. Let me reinitialize the neural net, and here let me break after the very first iteration so we only see the initial loss, which is 27. That's way too high, and intuitively we can now inspect the variables involved. If we just print the first row of the logits, we see that they take on quite extreme values. That's what's creating the fake confidence in incorrect answers and makes the loss get very, very high. These logits should be much, much closer to zero.
9:25
be much much closer to zero so now let's
9:25
So now let's think through how we can get the logits coming out of this neural net to be closer to zero. You see here that the logits are calculated as the hidden states multiplied by W2, plus b2. So first of all, currently we're initializing b2 as random values of the right size. But because we want roughly zero, we don't actually want to be adding a bias of random numbers. So in fact I'm going to add a times zero here to make sure that b2 is basically zero at initialization.
9:58
And second, this is h multiplied by W2. So if we want the logits to be very, very small, then we should scale W2 down and make it smaller. For example, if we scale down all the elements of W2 by 0.1 and again run just the very first iteration, you see that we are getting much closer to what we expect. Roughly what we want is about 3.29; this is 4.2. I can make this maybe even smaller: 3.32. Okay, so we're getting closer and closer.
10:30
Now you're probably wondering: can we just set this to zero? Then we get, of course, exactly what we're looking for at initialization. The reason I don't usually do this is that I'm very nervous, and I'll show you in a second why you don't want to be setting the weights of a neural net exactly to zero. You usually want them to be small numbers instead of exactly zero. For this output layer, in this specific case, I think it would be fine, but I'll show you in a second where things go wrong very quickly if you do that.
11:00
So let's just go with 0.01. In that case our loss is close enough, but the weights keep some entropy: they're not exactly zero, they've got a little entropy, and that's used for symmetry breaking, as we'll see in a second. The logits are now coming out much closer to zero, and everything is well and good.
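The two initialization changes described here (multiplying b2 by zero and W2 by 0.01) can be imitated in plain Python. This is a rough sketch rather than the lecture's PyTorch code: the hidden vector `h` is a random stand-in, the 27 output classes and the target index are assumptions, and `output_logits` is a hypothetical helper.

```python
import math
import random

random.seed(0)
n_hidden, vocab_size = 200, 27  # 200 hidden units as in the lecture; 27 classes is an assumption

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits)) - logits[target]

def output_logits(h, w_scale, b_scale):
    """logits = h @ W2 + b2, with W2 and b2 drawn from N(0, 1) and then scaled."""
    W2 = [[random.gauss(0, 1) * w_scale for _ in range(vocab_size)] for _ in range(n_hidden)]
    b2 = [random.gauss(0, 1) * b_scale for _ in range(vocab_size)]
    return [sum(h[i] * W2[i][j] for i in range(n_hidden)) + b2[j] for j in range(vocab_size)]

h = [random.uniform(-1, 1) for _ in range(n_hidden)]  # stand-in for tanh hidden activations
target = 3                                            # arbitrary "correct" class

unscaled = output_logits(h, w_scale=1.0, b_scale=1.0)  # original initialization
scaled = output_logits(h, w_scale=0.01, b_scale=0.0)   # W2 * 0.01, b2 * 0

for name, logits in (("unscaled", unscaled), ("scaled", scaled)):
    spread = max(abs(z) for z in logits)
    print(f"{name:>8}: max |logit| = {spread:.3f}, loss = {cross_entropy(logits, target):.3f}")
```

The scaled version keeps small random weights instead of exact zeros, so the logits are near zero without being identical.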
11:18
So if I just erase these, and I now take away the break statement, we can run the optimization with this new initialization and just see what losses we record. Okay, so I let it run, and you see that we started off good and then we came down a bit.
11:40
The plot of the loss now doesn't have this hockey-stick appearance. Basically, what's happening in the hockey stick, in the very first few iterations, is that the optimization is just squashing down the logits, and then it's rearranging the logits. So basically we took away this easy part of the loss function where the weights were just being shrunk down. Therefore we don't get these easy gains in the beginning; we're just getting some of the hard gains of training the actual neural net, and so there's no hockey-stick appearance.
12:12
So good things are happening. Number one, the loss at initialization is what we expect, and the loss doesn't look like a hockey stick. This is true for any neural net that you might train, and it's something to look out for.
12:26
And second, the loss that came out is actually quite a bit improved. Unfortunately I erased what we had here before. I believe this was 2.12 and this was 2.16, so we get a slightly improved result. The reason for that is that we're spending more cycles, more time, actually optimizing the neural net, instead of spending the first several thousand iterations probably just squashing down the weights, because they are way too high at initialization. So that's something to look out for, and that's number one.
13:00
Now let's look at the second problem. Let me reinitialize our neural net, and let me reintroduce the break statement, so we have a reasonable initial loss. Even though everything is looking good on the level of the loss, and we get something that we expect, there's still a deeper problem lurking inside this neural net and its initialization. The logits are now okay. The problem now is with the values of h, the activations of the hidden states. If we just visualize this tensor h, it's kind of hard to see, but the problem here, roughly speaking, is that you see how many of the elements are 1 or -1.
13:36
Now recall that torch.tanh, the tanh function, is a squashing function. It takes arbitrary numbers and squashes them into the range between -1 and 1, and it does so smoothly.
13:46
So let's look at the histogram of h to get a better idea of the distribution of the values inside this tensor. We can do this: first, we can see that h is 32 examples and 200 activations in each example. We can view it as -1 to stretch it out into one large vector, and we can then call tolist() to convert this into one large Python list of floats. Then we can pass this into plt.hist for a histogram, and we say we want 50 bins, and a semicolon to suppress a bunch of output we don't want.
14:25
So we see this histogram, and we see that most of the values by far take on a value of -1 or 1, so this tanh is very, very active. We can also look at why that is: we can look at the pre-activations that feed into the tanh, and we can see that the distribution of the pre-activations is very, very broad. These take numbers between -5 and 15, and that's why in torch.tanh everything is being squashed and capped to be in the range of -1 and 1, and lots of numbers here take on very extreme values.
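To see how a broad pre-activation distribution piles tanh outputs up at -1 and 1, here is a small stdlib-only sketch. It does not reproduce the lecture's tensors: the pre-activations are drawn from a Gaussian whose standard deviations (5.0 and 0.5) are assumed values chosen for contrast, and the 50-bin count is a plain-Python stand-in for the plt.hist call.

```python
import math
import random

random.seed(0)
N = 32 * 200  # same number of values as a 32 x 200 activation tensor

def tanh_activations(pre_std, n=N):
    """tanh of Gaussian pre-activations with standard deviation `pre_std`."""
    return [math.tanh(random.gauss(0, pre_std)) for _ in range(n)]

def histogram(values, bins=50):
    """Counts per equal-width bin over [-1, 1]."""
    counts = [0] * bins
    for v in values:
        idx = min(int((v + 1) / 2 * bins), bins - 1)  # clamp v == 1.0 into the last bin
        counts[idx] += 1
    return counts

def edge_share(values):
    """Share of values that fall into the two outermost histogram bins."""
    counts = histogram(values)
    return (counts[0] + counts[-1]) / len(values)

broad = tanh_activations(pre_std=5.0)   # assumed wide spread of pre-activations
narrow = tanh_activations(pre_std=0.5)  # assumed narrow spread

print(f"broad pre-activations:  {edge_share(broad):.1%} of outputs in the edge bins")
print(f"narrow pre-activations: {edge_share(narrow):.1%} of outputs in the edge bins")
```

The narrower the pre-activations, the more of the tanh outputs stay in the middle of the range instead of at the two ends.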
15:00
Now if you are new to neural networks, you might not actually see this as an issue. But if you're well versed in the dark arts of backpropagation and have an intuitive sense of how these gradients flow through a neural net, you are looking at your distribution of tanh activations here and you are sweating. So let me show you why.
15:17
We have to keep in mind that during backpropagation, just like we saw in micrograd, we are doing a backward pass starting at the loss and flowing through the network backwards. In particular, we're going to backpropagate through this torch.tanh. This layer here is made up of 200 neurons for each one of these examples, and it implements an elementwise tanh.
15:37
So let's look at what happens in tanh in the backward pass. We can actually go back to our previous micrograd code from the very first lecture and see how we implemented tanh. We saw that the input here was x, and then we calculate t, which is the tanh of x. So that's t, and t is between -1 and 1; it's the output of the tanh. Then in the backward pass, how do we backpropagate through a tanh? We take out.grad and then we multiply it, this is the chain rule, by the local gradient, which took the form of 1 - t^2.
16:09
So what happens if the outputs of your tanh are very close to -1 or 1? If you plug in t = 1 here, you're going to get a zero multiplying out.grad. No matter what out.grad is, we are killing the gradient and effectively stopping the backpropagation through this tanh unit. Similarly, when t is -1, this will again become zero, and out.grad just stops.
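The backward rule quoted above, out.grad multiplied by 1 - t^2, can be written as a tiny standalone function. This is a sketch in the spirit of the micrograd code rather than a copy of it; `tanh_backward` is a hypothetical helper name and the sample inputs are arbitrary.

```python
import math

def tanh_backward(x, out_grad):
    """Return (t, grad wrt x) for t = tanh(x), given the upstream gradient out_grad."""
    t = math.tanh(x)
    local_grad = 1.0 - t ** 2      # derivative of tanh, expressed through its output
    return t, local_grad * out_grad  # chain rule

# Arbitrary inputs, from unsaturated to deeply saturated, on both sides.
for x in (0.0, 1.0, 3.0, 10.0, -10.0):
    t, grad = tanh_backward(x, out_grad=1.0)
    print(f"x = {x:6.1f}  t = {t:+.6f}  grad wrt x = {grad:.2e}")
```

As t approaches 1 or -1 the factor 1 - t^2 shrinks toward zero, so whatever gradient arrives from above is multiplied away.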
16:33
And intuitively this makes sense, because this is a tanh neuron, and what's happening is that if its output is very close to 1, then we are in the tail of this tanh. So changing the input is not going to impact the output of the tanh too much, because it's in a flat region of the tanh, and therefore there's no impact on the loss. So indeed, the weights and the biases feeding this tanh neuron do not impact the loss, because the output of the tanh unit is in the flat region of the tanh and there's no influence. We can change them however we want and the loss is not impacted. So that's another way to justify that indeed the gradient would be basically zero: it vanishes.
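The "flat region" argument can be checked numerically: nudge the input a little and see how much the output moves. The sketch below compares a central finite difference against the analytic slope 1 - tanh(x)^2; the step size and the sample points are arbitrary choices, not values from the lecture.

```python
import math

def numerical_slope(x, eps=1e-4):
    """Central finite-difference estimate of d tanh(x) / dx."""
    return (math.tanh(x + eps) - math.tanh(x - eps)) / (2 * eps)

def analytic_slope(x):
    """Closed-form derivative of tanh, written through its output."""
    return 1.0 - math.tanh(x) ** 2

# x = 0 is the steepest point of tanh; x = 5 is far out in the flat tail.
for x in (0.0, 2.0, 5.0):
    print(f"x = {x:3.1f}  numerical = {numerical_slope(x):.6f}  analytic = {analytic_slope(x):.6f}")
```

In the tail, a change in the input produces almost no change in the output, which is the same statement as the local gradient being nearly zero.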
17:24
Indeed, when t equals zero, we get 1 times out.grad. So when the tanh takes on exactly the value of zero, out.grad is just passed through. Basically what this is doing is: if t is equal to zero, then the tanh unit is sort of inactive and the gradient just passes through, but the more you are in the flat tails, the more the gradient is squashed. In fact, you'll see that the gradient flowing through tanh can only ever decrease, and the amount that it decreases is proportional, through a square here, to how far you are in the flat tails of this tanh. So that's kind of what's happening here.
18:09
The concern here is that if all of these outputs h are in the flat regions of -1 and 1, then the gradients that are flowing through the network will just get destroyed at this layer.
18:23
Now there is some redeeming quality here, and we can actually get a sense of the problem as follows. I wrote some code here, and basically what we want to do is take a look at h, take the absolute value, and see how often it is in a flat region, say greater than 0.99.
18:44
What you get is the following, and this is a Boolean tensor. In the Boolean tensor you get white if this is true and black if this is false. So basically what we have here is the 32 examples and 200 hidden neurons, and we see that a lot of this is white. What that's telling us is that all these tanh neurons were very, very active and they're in a flat tail, and so in all these cases the backward gradient would get destroyed.
19:18
Now we would be in a lot of trouble if, for any one of these 200 neurons, it was the case that the entire column is white, because in that case we have what's called a dead neuron. This could be a tanh neuron where the initialization of the weights and the biases is such that no single example ever activates this tanh in the active part of the tanh. If all the examples land in the tail, then this neuron will never learn; it is a dead neuron.
19:48
a dead neuron and so just scrutinizing this and looking for Columns of
19:50
this and looking for Columns of
19:50
this and looking for Columns of completely white uh we see that this is
19:53
completely white uh we see that this is
19:53
completely white uh we see that this is not the case so uh I don't see a single
19:56
not the case so uh I don't see a single
19:56
not the case so uh I don't see a single neuron that is all of uh you know white
19:59
neuron that is all of uh you know white
19:59
neuron that is all of uh you know white and so therefore it is the case that for
20:01
and so therefore it is the case that for
20:01
and so therefore it is the case that for every one of these 10h neurons uh we do
20:04
every one of these 10h neurons uh we do
20:04
every one of these 10h neurons uh we do have some examples that activate them in
20:06
have some examples that activate them in
20:07
have some examples that activate them in the uh active part of the 10 and so some
20:09
the uh active part of the 10 and so some
20:09
the uh active part of the 10 and so some gradients will flow through and this
20:10
gradients will flow through and this
20:10
gradients will flow through and this neuron will learn and the neuron will
20:13
neuron will learn and the neuron will
20:13
neuron will learn and the neuron will change and it will move and it will do
20:15
change and it will move and it will do
20:15
change and it will move and it will do something but you can sometimes get get
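The visual check described here (looking for a column that is completely white) can also be done numerically. The sketch below is only an illustration, not the notebook's code: it assumes the hidden activations are available as a plain list of rows, `h[example][neuron]`, treats |h| > 0.99 as saturated, and uses a hypothetical helper name, `dead_neuron_indices`.

```python
import math

def dead_neuron_indices(h, threshold=0.99):
    """Return the indices of neurons (columns) that are saturated for every example (row)."""
    n_neurons = len(h[0])
    return [
        j for j in range(n_neurons)
        if all(abs(row[j]) > threshold for row in h)
    ]

# Toy batch: 3 examples x 3 tanh neurons, with made-up preactivations.
preacts = [
    [0.2, 5.0, -4.0],
    [3.5, 6.0, 0.1],
    [-0.7, -7.0, 4.5],
]
h = [[math.tanh(v) for v in row] for row in preacts]

# Neuron 1 is saturated for every example (an all-white column), so it is reported as dead.
print(dead_neuron_indices(h))
```

An empty list corresponds to the situation in the lecture: many saturated activations, but no neuron that is saturated for all examples.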
20:17
But you can sometimes get yourself into cases where you have dead neurons. For a tanh neuron, the way this manifests is that, no matter what inputs you plug in from your dataset, the neuron always fires completely at one or completely at negative one. Then it will just not learn, because all the gradients will be zeroed out.
20:37
This is true not just for tanh but for a lot of other nonlinearities that people use in neural networks. We certainly use tanh a lot, but sigmoid will have the exact same issue, because it is also a squashing neuron. The same will also apply to ReLU.
21:01
ReLU has a completely flat region here, below zero. If you have a ReLU neuron, it is a pass-through if the preactivation is positive, and if the preactivation is negative it will just shut it off. Since the region here is completely flat, during backpropagation this would zero out the gradient exactly: all of the gradient would be set exactly to zero, instead of just a very, very small number depending on how positive or negative t is.
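To make the difference between "very small" and "exactly zero" concrete, here is a minimal sketch of the local gradients of the three nonlinearities mentioned so far. The function names are illustrative; the formulas are the standard derivatives (1 - t^2 for tanh with output t, s(1 - s) for sigmoid with output s, and 0 or 1 for ReLU).

```python
import math

def tanh_grad(x):
    t = math.tanh(x)
    return 1.0 - t * t          # local gradient of tanh, in terms of its output t

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)        # local gradient of sigmoid, in terms of its output s

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # pass-through when positive, exactly zero otherwise

for x in (-10.0, -3.0, 0.5, 3.0, 10.0):
    print(f"x={x:6.1f}  tanh'={tanh_grad(x):.2e}  "
          f"sigmoid'={sigmoid_grad(x):.2e}  relu'={relu_grad(x):.1f}")
```

For strongly negative or positive inputs, the tanh and sigmoid gradients shrink toward zero but stay positive, while the ReLU gradient is exactly zero for any negative preactivation.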
21:29
And so you can get, for example, a dead ReLU neuron. A dead ReLU neuron is a neuron with a ReLU nonlinearity that never activates: for any example that you plug in from the dataset, it never turns on; it is always in this flat region. Its weights and bias will never learn. They will never get a gradient, because the neuron never activated.
21:56
This can sometimes happen at initialization, because the weights and the biases just make it so that, by chance, some neurons are forever dead.
22:04
But it can also happen during optimization. If you have too high a learning rate, for example, sometimes these neurons get too much of a gradient and they get knocked off the data manifold. What happens is that, from then on, no example ever activates this neuron, so the neuron remains dead forever. It's kind of like permanent brain damage in the mind of a network.
22:25
So sometimes what can happen is that, if your learning rate is very high, for example, and you have a neural net with ReLU neurons, you train the neural net and you get some final loss. But then, if you go through the entire training set and forward your examples, you can find neurons that never activate. They are dead neurons in your network, and those neurons will never turn on.
22:46
Usually what happens is that during training these ReLU neurons are changing, moving, and so on, and then, because of a high gradient somewhere, by chance they get knocked off, and nothing ever activates them. From then on they are just dead. So that's a kind of permanent brain damage that can happen to some of these neurons.
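The procedure described here (forward the whole training set and look for neurons that never activate) might be sketched as follows. This is an assumption-laden toy, not the lecture's code: the layer is a single dense ReLU layer stored as plain Python lists, the data is random, and `find_dead_relu_neurons` is a hypothetical helper name. One neuron is given a very negative bias to simulate having been knocked off the data.

```python
import random

def relu(v):
    return v if v > 0.0 else 0.0

def find_dead_relu_neurons(X, W, b):
    """Forward every example and return the neurons whose ReLU output is never positive.

    X: list of examples (each a list of inputs)
    W: W[i][j] is the weight from input i to neuron j
    b: one bias per neuron
    """
    n_neurons = len(b)
    ever_active = [False] * n_neurons
    for x in X:
        for j in range(n_neurons):
            pre = b[j] + sum(x[i] * W[i][j] for i in range(len(x)))
            if relu(pre) > 0.0:
                ever_active[j] = True
    return [j for j, active in enumerate(ever_active) if not active]

random.seed(0)
n_in, n_hidden = 4, 6
X = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(500)]
W = [[random.gauss(0, 1) for _ in range(n_hidden)] for _ in range(n_in)]
b = [0.0] * n_hidden
b[2] = -100.0  # simulate a neuron that was pushed far away from the data

print(find_dead_relu_neurons(X, W, b))
```

Because a dead neuron receives zero gradient on every example, nothing in ordinary gradient descent will move it back, which is why the check has to be run over the data rather than inferred from the loss.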
23:04
Other nonlinearities like leaky ReLU will not suffer from this issue as much, because you can see that it doesn't have flat tails; you'll almost always get gradients. ELU is also fairly frequently used, and it also might suffer from this issue because it has flat parts. So that's just something to be aware of and something to be concerned about.
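A small sketch of why leaky ReLU and ELU behave differently in the negative region. The negative slope of 0.01 and alpha of 1.0 are assumed values chosen for illustration (they are common defaults, but the lecture does not specify them), and the function names are hypothetical.

```python
import math

def leaky_relu_grad(x, negative_slope=0.01):
    # Leaky ReLU keeps a small constant slope for negative inputs, so the tail is not flat.
    return 1.0 if x > 0 else negative_slope

def elu_grad(x, alpha=1.0):
    # ELU(x) = x for x > 0 and alpha * (exp(x) - 1) otherwise,
    # so its gradient decays toward zero as x becomes very negative.
    return 1.0 if x > 0 else alpha * math.exp(x)

for x in (-20.0, -5.0, -1.0, 2.0):
    print(f"x={x:6.1f}  leaky_relu'={leaky_relu_grad(x):.3f}  elu'={elu_grad(x):.2e}")
```

The leaky ReLU gradient never drops below its negative slope, whereas the ELU gradient becomes vanishingly small far to the left, which is the flat part referred to above.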
23:26
In this case, we have way too many activations h that take on extreme values, but because there's no column of white, I think we will be okay. Indeed, the network optimizes and gives us a pretty decent loss; it's just not optimal, and this is not something you want, especially during initialization.
23:43
Basically, what's happening is that this hpreact, the preactivation that's flowing into tanh, is too extreme; it's too large. It's creating a distribution that is too saturated on both sides of the tanh, and that's not something you want, because it means there's less training for these neurons: they update less frequently.
24:07
So how do we fix this? Well, hpreact is embcat, which comes from C, so these are unit Gaussian, but then it's multiplied by W1, plus b1, and hpreact is too far off from zero; that's causing the issue. So we want this preactivation to be closer to zero, very similar to what we had with the logits.
24:28
So here we actually want something very, very similar. It's okay to set the biases to a very small number: we can multiply by 0.01 to get a little bit of entropy. I sometimes like to do that, just so that there's a little bit of variation and diversity in the original initialization of these tanh neurons, and I find in practice that it can help optimization a little bit.
24:55
And then the weights we can also just squash, so let's multiply everything by 0.1.
25:01
Let's rerun the first batch and look at this. Well, first let's look here: because we multiplied W1 by 0.1, we have a much better histogram, and that's because the preactivations are now between -1.5 and 1.5. With this we expect much, much less white. Okay, there's no white, and that's because there are no neurons that saturated above 0.99 in either direction.
25:28
So this is actually a pretty decent place to be. Maybe we can go up a little bit. Sorry, am I changing W1 here? So maybe we can go to 0.2. Okay, so maybe something like this is a nice distribution, and maybe this is what our initialization should be.
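The effect of squashing W1 can be reproduced on synthetic data. The following is a rough sketch under stated assumptions, not the notebook's code: unit-Gaussian random inputs stand in for embcat, the sizes (30 inputs, 50 hidden neurons, 200 examples) are chosen for speed, the bias scale of 0.01 follows the text, and `saturated_fraction` is a hypothetical helper that reports the share of tanh outputs with |h| > 0.99.

```python
import math
import random

def saturated_fraction(w_scale, b_scale=0.01, n_examples=200, n_in=30,
                       n_hidden=50, threshold=0.99, seed=0):
    """Fraction of tanh activations with |h| > threshold for one random batch."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_examples)]
    W1 = [[rng.gauss(0, 1) * w_scale for _ in range(n_hidden)] for _ in range(n_in)]
    b1 = [rng.gauss(0, 1) * b_scale for _ in range(n_hidden)]
    cols = list(zip(*W1))  # one weight vector per hidden neuron
    saturated = 0
    for x in X:
        for j in range(n_hidden):
            hpreact = b1[j] + sum(xi * wi for xi, wi in zip(x, cols[j]))
            if abs(math.tanh(hpreact)) > threshold:
                saturated += 1
    return saturated / (n_examples * n_hidden)

for scale in (1.0, 0.2, 0.1):
    print(f"W1 scale {scale}: {saturated_fraction(scale):.1%} of activations saturated")
```

With unscaled weights most activations land in the tails; scaling W1 down pulls the preactivations toward zero and the saturated share drops sharply.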
25:49
So let me now erase these and, starting with the initialization, let me run the full optimization without the break, and let's see what we get.
26:04
Okay, so the optimization finished and I reran the loss evaluation, and this is the result that we get. Just as a reminder, I put down all the losses that we saw previously in this lecture, and we see that we actually do get an improvement here. We started off with a validation loss of 2.17. By fixing the softmax being confidently wrong, we came down to 2.13, and by fixing the tanh layer being way too saturated, we came down to 2.10.
26:30
The reason this is happening, of course, is that our initialization is better, so we're spending more time doing productive training instead of not very productive training, where our gradients are set to zero, we have to learn very simple things like the overconfidence of the softmax in the beginning, and we're spending cycles just squashing down the weight matrix.
26:52
So this is illustrating initialization and its impact on performance, just by being aware of the internals of these neural nets, their activations, and their gradients.
27:01
Now, we're working with a very small network; this is just a one-layer multilayer perceptron. Because the network is so shallow, the optimization problem is actually quite easy and very forgiving. So even though our initialization was terrible, the network still learned eventually; it just got a slightly worse result.
27:16
This is not the case in general, though. Once we start working with much deeper networks that have, say, 50 layers, things can get much more complicated and these problems stack up. You can actually get into a place where the network is basically not training at all if your initialization is bad enough. The deeper and more complex your network is, the less forgiving it is of some of these errors. So this is something to definitely be aware of, something to scrutinize, something to plot, and something to be careful with.
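One way to get a feel for how such problems stack up with depth is to push unit-Gaussian inputs through a stack of badly scaled layers and watch the activation scale. This is a simplified sketch, not something shown at this point in the lecture: it uses purely linear layers (no bias, no nonlinearity), a small width and depth chosen for speed, and a hypothetical helper name, `std_per_layer`.

```python
import random
import statistics

def std_per_layer(depth=5, width=20, n_examples=200, w_scale=1.0, seed=0):
    """Push unit-Gaussian inputs through `depth` linear layers and record
    the standard deviation of the activations after each layer."""
    rng = random.Random(seed)
    acts = [[rng.gauss(0, 1) for _ in range(width)] for _ in range(n_examples)]
    stds = []
    for _ in range(depth):
        W = [[rng.gauss(0, 1) * w_scale for _ in range(width)] for _ in range(width)]
        cols = list(zip(*W))  # one weight vector per output unit
        acts = [[sum(a * w for a, w in zip(row, col)) for col in cols] for row in acts]
        stds.append(statistics.pstdev([v for row in acts for v in row]))
    return stds

print([round(s, 1) for s in std_per_layer()])
```

Each unscaled layer multiplies the spread by a similar factor, so the mismatch compounds: a scale error that a one-layer network tolerates becomes enormous after only a few layers.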
27:54
Okay, so that's great that that worked for us. But what we have here now is all these magic numbers, like 0.2. Where do I come up with this, and how am I supposed to set these if I have a large neural net with lots and lots of layers? Obviously no one does this by hand. There are actually some relatively principled ways of setting these scales that I would like to introduce to you now.
28:15
So let me paste some code here that I prepared, just to motivate the discussion of this.
28:21
What I'm doing here is this: we have some random input x that is drawn from a Gaussian, and there are 1,000 examples that are 10-dimensional. Then we have a weight layer w that is also initialized using a Gaussian, just like we did here. The neurons in the hidden layer look at 10 inputs, and there are 200 neurons in this hidden layer. And then, just like here, we have the multiplication, x multiplied by w, to get the preactivations of these neurons.
28:52
Basically, the analysis here asks: suppose these inputs are unit Gaussian and these weights are unit Gaussian. If I do x times w, and we forget for now the bias and the nonlinearity, then what is the mean and the standard deviation of these Gaussians?
29:09
In the beginning, the input is just a normal Gaussian distribution: the mean is zero and the standard deviation is one. The standard deviation, again, is just a measure of the spread of the Gaussian. But once we multiply here and look at the histogram of y, we see that the mean, of course, stays the same, about zero, because this is a symmetric operation, but the standard deviation has expanded to three. So the input standard deviation was one, but now we've grown to three.
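The experiment described here can be reproduced without any libraries. The version below is a plain-Python stand-in for the code pasted in the lecture (which is not shown in the transcript), so the variable names and the use of the `random` and `statistics` modules are assumptions; the shapes follow the text: 1,000 examples of dimension 10 and 200 neurons.

```python
import random
import statistics

random.seed(0)
n_examples, fan_in, n_hidden = 1000, 10, 200

# x: 1,000 examples, 10-dimensional; w: 10 inputs feeding 200 neurons; both unit Gaussian.
x = [[random.gauss(0, 1) for _ in range(fan_in)] for _ in range(n_examples)]
w = [[random.gauss(0, 1) for _ in range(n_hidden)] for _ in range(fan_in)]

# y = x @ w written out as plain dot products (no bias, no nonlinearity).
w_cols = list(zip(*w))
y = [[sum(xi * wi for xi, wi in zip(row, col)) for col in w_cols] for row in x]

x_flat = [v for row in x for v in row]
y_flat = [v for row in y for v in row]
print(f"x: mean {statistics.fmean(x_flat):+.2f}, std {statistics.pstdev(x_flat):.2f}")
print(f"y: mean {statistics.fmean(y_flat):+.2f}, std {statistics.pstdev(y_flat):.2f}")
```

Each output is a sum of 10 independent products of zero-mean, unit-variance numbers, so its variance is about 10 and its standard deviation is about 3.16, which is the "three" seen in the histogram.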
29:37
So what you're seeing in the histogram is that this Gaussian is expanding. We're expanding this Gaussian from the input, and we don't want that. We want most of the neural net to have relatively similar activations, so roughly unit Gaussian throughout the neural net. So the question is: how do we scale these w's to preserve this distribution, so that it remains a unit Gaussian?
30:05
Intuitively, if I multiply these elements of w by a larger number, let's say by five, then this Gaussian grows and grows in standard deviation, so now we're at 15. Basically, these numbers in the output y take on more and more extreme values. But if we scale it down, say by 0.2, then conversely this Gaussian gets smaller and smaller; it's shrinking, and you can see that the standard deviation is 0.6.
30:34
deviation is 6 and so the question is
30:34
deviation is 6 and so the question is what do I multiply by here to exactly
30:37
what do I multiply by here to exactly
30:37
what do I multiply by here to exactly preserve the standard deviation to be
30:39
preserve the standard deviation to be
30:39
preserve the standard deviation to be one and it turns out that the correct
30:41
It turns out that the correct answer mathematically, when you work through the variance of this multiplication, is that you are supposed to divide by the square root of the fan-in. The fan-in is basically the number of input elements, here 10, so we are supposed to divide by the square root of 10.
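The variance argument behind this rule can be sketched as follows. This is a simplified derivation, assuming the inputs x_i and the weights w_i are independent and zero-mean; the symbol n for the fan-in is introduced here only for illustration.

```latex
y = \sum_{i=1}^{n} x_i w_i
\quad\Longrightarrow\quad
\operatorname{Var}(y) = \sum_{i=1}^{n} \operatorname{Var}(x_i)\,\operatorname{Var}(w_i)
                      = n \,\operatorname{Var}(x)\,\operatorname{Var}(w)

\operatorname{Var}(x) = 1,\ \operatorname{Var}(w) = 1
\quad\Longrightarrow\quad
\operatorname{std}(y) = \sqrt{n} \qquad (\sqrt{10} \approx 3.16)

\operatorname{Var}(x) = 1,\ \operatorname{Var}(w) = \tfrac{1}{n}
\quad\Longrightarrow\quad
\operatorname{std}(y) = 1
```

Under these assumptions, dividing unit-variance weights by the square root of n gives Var(w) = 1/n, which is what keeps the output at unit standard deviation.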
31:01
One way to take the square root is to raise the number to the power 0.5; that's the same as doing a square root. When you divide by the square root of 10, we see that the output Gaussian has a standard deviation of almost exactly one.
31:18
Now, unsurprisingly, a number of papers have looked into how best to initialize neural networks. In the case of multilayer perceptrons we can have fairly deep networks with these nonlinearities in between, and we want to make sure that the activations are well behaved: they should not expand to infinity or shrink all the way to zero. The question is how to initialize the weights so that these activations take on reasonable values throughout the network.
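To see why depth makes this matter, here is a rough back-of-the-envelope sketch. It assumes a stack of purely linear layers that all share the same fan-in, with independent zero-mean weights and unit-variance inputs, so each layer multiplies the standard deviation by scale times the square root of the fan-in. The function name `std_after` and the depth of 10 are hypothetical choices for this illustration.

```python
import math

fan_in = 10  # inputs per neuron, as in the toy example
depth = 10   # number of stacked linear layers (hypothetical choice)

def std_after(scale, fan_in, depth):
    """Theoretical output std after `depth` linear layers with weights ~ N(0, scale^2)."""
    per_layer_factor = scale * math.sqrt(fan_in)
    return per_layer_factor ** depth

results = {
    "scale=1.0": std_after(1.0, fan_in, depth),
    "scale=0.2": std_after(0.2, fan_in, depth),
    "scale=1/sqrt(fan_in)": std_after(1.0 / math.sqrt(fan_in), fan_in, depth),
}
for name, std in results.items():
    print(f"{name}: {std:.4g}")
```

A per-layer factor above one compounds into very large activations, a factor below one compounds toward zero, and only a factor of one leaves the scale unchanged. Real networks have nonlinearities in between, so this is only the linear part of the story.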
31:41
One paper that has studied this in quite a bit of detail, and that is often referenced, is the paper by Kaiming He et al. called "Delving Deep into Rectifiers". In this case they actually study convolutional neural networks, and they study especially the ReLU nonlinearity and the PReLU nonlinearity instead of a tanh nonlinearity, but the analysis is very similar. Basically, the ReLU nonlinearity that they care about quite a bit here is a squashing function where all the negative numbers are simply clamped to zero: the positive numbers pass through, but everything negative is just set to zero. Because you are basically throwing away half of the distribution, they find in their analysis of the forward activations in the neural net that you have to compensate for that with a gain.
32:34
So here they find that when they initialize their weights, they have to do it with a zero-mean Gaussian whose standard deviation is the square root of 2 over the fan-in. What we have here is a Gaussian scaled using the square root of the fan-in (this n_l here is the fan-in), so what we have is the square root of 1 over the fan-in, because we have the division here. They have to add this factor of two because of the ReLU, which basically discards half of the distribution and clamps it at zero, and so that's where you get an additional factor.
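The "half of the distribution" intuition can be checked with a small Monte Carlo sketch. This uses only the Python standard library and illustrates the intuition, not the paper's full derivation: for unit-Gaussian pre-activations, the mean square of the ReLU output is about one half, and scaling by the square root of two brings it back to about one.

```python
import random

random.seed(0)
n = 100_000

# Pre-activations drawn from a unit Gaussian.
z = [random.gauss(0.0, 1.0) for _ in range(n)]

# Mean square of the ReLU outputs: about half of the pre-activation mean square.
second_moment = sum(max(0.0, v) ** 2 for v in z) / n

# Scaling the pre-activations by sqrt(2) (the ReLU gain) restores it to about 1.
gain = 2 ** 0.5
second_moment_gain = sum(max(0.0, gain * v) ** 2 for v in z) / n

print(f"{second_moment:.3f} {second_moment_gain:.3f}")
```

Because the factor of two applies to the variance, the corresponding factor on the standard deviation is the square root of two.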
33:08
In addition to that, this paper studies not just the behavior of the activations in the forward pass of the neural net, but also the backpropagation. We have to make sure that the gradients are well behaved too, because ultimately they end up updating our parameters. What they find, through a lot of analysis that I invite you to read through (though it's not exactly approachable), is that if you properly initialize the forward pass, the backward pass is also approximately initialized, up to a constant factor that has to do with the number of hidden neurons in an early and a late layer. They find empirically that this is not a choice that matters too much.
33:54
This Kaiming initialization is also implemented in PyTorch. If you go to the torch.nn.init documentation you'll find kaiming_normal_, and in my opinion this is probably the most common way of initializing neural networks now. It takes a few keyword arguments. Number one, it wants to know the mode: would you like to normalize the activations, or would you like to normalize the gradients, to be always Gaussian with zero mean and unit standard deviation? Because they find in the paper that this doesn't matter too much, most people just leave it as the default, which is fan_in. Second, you pass in the nonlinearity that you are using, because depending on the nonlinearity we need to calculate a slightly different gain.
34:36
If your nonlinearity is just linear, so there's no nonlinearity, then the gain here will be one and we have exactly the same kind of formula that we came up with here. But if the nonlinearity is something else, we're going to get a slightly different gain. If we come up here to the top, we see that, for example, in the case of ReLU this gain is the square root of two. The reason it's a square root is that in this paper you see how the two is inside of the square root, so the gain is the square root of two. In the case of linear or identity we just get a gain of one. In the case of tanh, which is what we're using here, the advised gain is 5/3.
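As a small check on the "two inside the square root" remark, the sketch below compares the paper's form of the ReLU standard deviation with the gain form. The dictionary `GAINS` is a hypothetical name for this illustration that simply records the three gains quoted above; it is not a PyTorch API, and the fan-in of 10 is just the toy value from earlier.

```python
import math

# Gains as quoted in the lecture; GAINS is a hypothetical name for this sketch.
GAINS = {"linear": 1.0, "relu": math.sqrt(2.0), "tanh": 5.0 / 3.0}

fan_in = 10  # example fan-in, matching the toy example earlier in the lecture

# The paper writes the ReLU standard deviation as sqrt(2 / fan_in) ...
std_paper = math.sqrt(2.0 / fan_in)
# ... which is the same number as gain / sqrt(fan_in) with gain = sqrt(2).
std_gain_form = GAINS["relu"] / math.sqrt(fan_in)

print(std_paper, std_gain_form)
```

Both expressions give the same value, which is why the gain for ReLU is reported as the square root of two rather than two.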
35:18
Intuitively, why do we need a gain on top of the initialization? It's because tanh, just like ReLU, is a contractive transformation. That means you're taking the output distribution from this matrix multiplication and squashing it in some way. ReLU squashes it by taking everything below zero and clamping it to zero. Tanh also squashes it, because it's a contractive operation: it takes the tails and squeezes them in. So in order to fight the squeezing-in, we need to boost the weights a little bit, so that we renormalize everything back to unit standard deviation. That's why there's a little bit of a gain that comes out.
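The squashing can be seen numerically. The following is a minimal standard-library sketch, not the lecture's code: it passes unit-Gaussian samples through tanh with and without the gain of 5/3 and compares the spread. Note that for a single tanh the gain only pushes back against the contraction; the output cannot reach a standard deviation of one, since tanh is bounded by one in magnitude.

```python
import math
import random
import statistics

random.seed(0)

z = [random.gauss(0.0, 1.0) for _ in range(100_000)]

std_in = statistics.pstdev(z)                                     # about 1
std_out = statistics.pstdev([math.tanh(v) for v in z])            # squashed, below 1
std_out_gain = statistics.pstdev([math.tanh(5 / 3 * v) for v in z])  # less squashed

print(f"input: {std_in:.2f}, tanh: {std_out:.2f}, tanh with gain: {std_out_gain:.2f}")
```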
35:57
Now, I'm skipping through this section a little bit quickly, and I'm doing that intentionally. The reason is that about seven years ago, when this paper was written, you had to be extremely careful with the activations and the gradients, their ranges and their histograms, and you had to be very careful with the precise setting of gains and the scrutinizing of the nonlinearities used, and so on. Everything was very finicky and very fragile, and had to be very properly arranged for the neural net to train, especially if your neural net was very deep. But there are a number of modern innovations that have made everything significantly more stable and more well behaved, and it has become less important to initialize these networks exactly right.
36:34
Some of those modern innovations are, for example: residual connections, which we will cover in the future; the use of a number of normalization layers, such as batch normalization, layer normalization, and group normalization, many of which we're going to go into as well; and, number three, much better optimizers, not just stochastic gradient descent, the simple optimizer we're basically using here, but slightly more complex optimizers like RMSprop and especially Adam. All of these modern innovations make it less important for you to precisely calibrate the initialization of the neural net.
37:08
All that being said, what should we do in practice? When I initialize these neural nets, I basically just normalize my weights by the square root of the fan-in, so roughly what we did here is what I do. Now, if we want to be exactly accurate here and go by the init of kaiming_normal_, this is how it would be implemented: we want to set the standard deviation to be the gain over the square root of fan_in.
37:37
So, to set the standard deviation of our weights, we proceed as follows. When we have a torch.randn, and let's say I just create a thousand numbers, we can look at the standard deviation of this, and of course that's one; that's the amount of spread. Let's make this a bit bigger so it's closer to one. So that's the spread of a Gaussian with zero mean and unit standard deviation. Now, when you take these and multiply by, say, 0.2, that scales down the Gaussian and makes its standard deviation 0.2. So the number that you multiply by here ends up being the standard deviation of this Gaussian.
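The same observation in a form that runs without PyTorch: a minimal sketch in which `random.gauss` from the standard library stands in for `torch.randn`, and the sample count of 10,000 is an arbitrary choice for this illustration.

```python
import random
import statistics

random.seed(0)

# Unit-Gaussian samples (a stdlib stand-in for torch.randn).
samples = [random.gauss(0.0, 1.0) for _ in range(10_000)]
scaled = [0.2 * v for v in samples]

std_unit = statistics.pstdev(samples)   # close to 1
std_scaled = statistics.pstdev(scaled)  # close to 0.2

print(f"{std_unit:.3f} {std_scaled:.3f}")
```

Multiplying every sample by a constant multiplies the standard deviation by that constant, which is why the multiplier can be read directly as the standard deviation of the resulting weights.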
38:12
So here, this is a standard deviation 0.2 Gaussian when we sample our W1. But we want to set the standard deviation to the gain over the square root of fan_mode, which is fan_in. In other words, we want to multiply by the gain, which for tanh is 5/3, and then divide by the square root of the fan-in.
38:53
in this example here the fan in was 10
38:53
in this example here the fan in was 10 and I just noticed that actually here
38:55
and I just noticed that actually here
38:55
and I just noticed that actually here the fan in for W1 is is actually an
38:58
the fan in for W1 is is actually an
38:58
the fan in for W1 is is actually an embed times block size which as you all
39:00
embed times block size which as you all
39:00
embed times block size which as you all recall is actually 30 and that's because
39:02
recall is actually 30 and that's because
39:02
recall is actually 30 and that's because each character is 10 dimensional but
39:04
each character is 10 dimensional but
39:04
each character is 10 dimensional but then we have three of them and we can
39:05
then we have three of them and we can
39:05
then we have three of them and we can catenate them so actually the fan in
39:06
catenate them so actually the fan in
39:06
catenate them so actually the fan in here was 30 and I should have used 30
39:09
here was 30 and I should have used 30
39:09
here was 30 and I should have used 30 here probably but basically we want 30
39:11
here probably but basically we want 30
39:11
here probably but basically we want 30 uh square root so this is the number
39:14
uh square root so this is the number
39:14
uh square root so this is the number this is what our standard deviation we
39:16
this is what our standard deviation we
39:16
this is what our standard deviation we want to be and this number turns out to
39:18
want to be and this number turns out to
39:18
want to be and this number turns out to be3 whereas here just by fiddling with
39:21
be3 whereas here just by fiddling with
39:21
be3 whereas here just by fiddling with it and looking at the distribution and
39:22
it and looking at the distribution and
39:22
it and looking at the distribution and making sure it looks okay uh we came up
39:24
making sure it looks okay uh we came up
39:24
making sure it looks okay uh we came up with 02 and so instead what we want to
39:27
So instead, what we want to do here is make the standard deviation 5/3, which is our gain, divided by this amount [n_embd times block_size] raised to the power 0.5, that is, its square root. These brackets here are not strictly necessary, but I'll just put them here for clarity. This is basically what we want: this is the Kaiming init in our case, for a tanh nonlinearity, and this is how we would initialize the neural net. So we're multiplying by about 0.3 instead of multiplying by 0.2. We can initialize this way, and then we can train the neural net and see what we get.
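The arithmetic for this particular layer can be written out as a short sketch. The variable names follow the ones mentioned in the lecture (`n_embd`, `block_size`); the values 10 and 3 are the ones stated above, and nothing here samples actual weights.

```python
n_embd = 10      # dimensionality of each character embedding
block_size = 3   # number of context characters concatenated together
fan_in = n_embd * block_size  # 30 inputs feed each hidden neuron

gain = 5 / 3     # gain quoted for tanh
std = gain / fan_in ** 0.5    # target standard deviation for W1

print(round(std, 4))  # 0.3043
```

This is the principled replacement for the hand-tuned 0.2: the unit-Gaussian samples for W1 are multiplied by this value instead.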
40:08
Okay, so I trained the neural net, and we end up in roughly the same spot. Looking at the validation loss, we now get 2.10, and previously we also had 2.10. There's a little bit of a difference, but I suspect that's just the randomness of the process. The big deal, of course, is that we get to the same spot, but we did not have to introduce any magic numbers that we got from just looking at histograms and guessing and checking. We have something that is semi-principled, that will scale to much bigger networks, and that we can use as a guide.
40:41
So I mentioned that the precise setting of these initializations is not as important today due to some modern innovations, and I think now is a pretty good time to introduce one of those modern innovations: batch normalization. Batch normalization came out in 2015 from a team at Google, and it was an extremely impactful paper because it made it possible to train very deep neural nets quite reliably, and it basically just worked. So here's what batch normalization does, and let's implement it.
41:08
Basically, we have these hidden states, hpreact, right? And we were talking about how we don't want these preactivation states to be way too small, because then the tanh is not doing anything, but we don't want them to be too large, because then the tanh is saturated. In fact, we want them to be roughly Gaussian, so zero mean and unit (one) standard deviation, at least at initialization.
41:37
So the insight from the batch normalization paper is: okay, you have these hidden states and you'd like them to be roughly Gaussian, so why not take the hidden states and just normalize them to be Gaussian? It sounds kind of crazy, but you can just do that, because standardizing hidden states so that they're unit Gaussian is a perfectly differentiable operation, as we'll soon see. That was kind of the big insight in this paper, and when I first read it my mind was blown, because you can just normalize these hidden states: if you'd like unit Gaussian states in your network, at least at initialization, you can just normalize them to be unit Gaussian. So let's see how that works.
42:16
So we're going to scroll to our preactivations here, just before they enter into the tanh. Now the idea again is, remember, we're trying to make these roughly Gaussian, and that's because if these are way too small numbers then the tanh here is kind of inactive, but if these are very large numbers then the tanh is way too saturated and the gradient doesn't flow. So we'd like this to be roughly Gaussian. The insight in batch normalization, again, is that we can just standardize these activations so they are exactly Gaussian.
42:47
So here hpreact has a shape of 32 by 200: 32 examples by 200 neurons in the hidden layer. Basically what we can do is take hpreact and just calculate the mean. The mean we want to calculate across the zeroth dimension, and we also want keepdim as True so that we can easily broadcast this. The shape of this is 1 by 200; in other words, we are taking the mean over all the elements in the batch. Similarly, we can calculate the standard deviation of these activations, and that will also be 1 by 200.
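The shapes mentioned above can be checked with a small sketch. The tensor here is random stand-in data rather than real preactivations, and the names `bnmean` and `bnstd` are illustrative choices, not names taken from this passage.

```python
import torch

g = torch.Generator().manual_seed(0)                        # arbitrary seed
hpreact = torch.randn((32, 200), generator=g) * 3.0 + 1.5   # stand-in preactivations

# Reduce over dimension 0 (the batch); keepdim=True keeps a leading axis of size 1.
bnmean = hpreact.mean(0, keepdim=True)
bnstd = hpreact.std(0, keepdim=True)

print(bnmean.shape, bnstd.shape)   # both torch.Size([1, 200])

# Because the statistics are 1 x 200, they broadcast against the 32 x 200 batch.
centered = hpreact - bnmean
print(centered.shape)              # torch.Size([32, 200])
```

One detail worth keeping in mind: `Tensor.std` applies Bessel's correction by default (it divides by 31 here, not 32), so it differs slightly from a plain average of squared deviations.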
43:31
Now in this paper they have the prescription here. See, here we are calculating the mean, which is just taking the average value of any neuron's activation, and then the standard deviation is basically the measure of the spread that we've been using, which is the distance of every one of these values away from the mean, squared and averaged. That's the variance, and then if you want the standard deviation you would take the square root of the variance.
44:08
So these are the two that we're calculating, and now we're going to normalize, or standardize, these x's by subtracting the mean and dividing by the standard deviation. So basically we're taking hpreact, we subtract the mean, and then we divide by the standard deviation. This is exactly what these two, std and mean, are calculating. Oops, sorry: this is the mean and this is the variance. You see how sigma is usually the standard deviation, so this is sigma squared, and the variance is the square of the standard deviation.
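For reference, the quantities just described can be written out for a mini-batch of m values x_1, ..., x_m of a single neuron. The notation follows the usual statement of the batch normalization transform. The small constant ε inside the square root, which guards against division by zero, is not discussed at this point in the transcript and is included here as an assumption for completeness.

```latex
\begin{align}
\mu_B &= \frac{1}{m}\sum_{i=1}^{m} x_i \\
\sigma_B^2 &= \frac{1}{m}\sum_{i=1}^{m} \left(x_i - \mu_B\right)^2 \\
\hat{x}_i &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
\end{align}
```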
44:52
So this is how you standardize these values, and what this will do is that every single neuron and its firing rate will now be exactly unit Gaussian on these 32 examples, at least for this batch. That's why it's called batch normalization: we are normalizing these batches.
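A minimal sketch of this standardization step, using random stand-in data that is deliberately off-scale. The name `hnorm` is an illustrative choice. Note that standardizing fixes each neuron's mean at 0 and standard deviation at 1 over the batch; it does not change the shape of the distribution, so "unit Gaussian" is best read as shorthand for those two statistics.

```python
import torch

g = torch.Generator().manual_seed(0)                        # arbitrary seed
hpreact = torch.randn((32, 200), generator=g) * 4.0 - 2.0   # deliberately off-scale

# Standardize each of the 200 neurons using statistics of this batch of 32.
hnorm = (hpreact - hpreact.mean(0, keepdim=True)) / hpreact.std(0, keepdim=True)

print(hnorm.mean(0).abs().max().item())        # close to 0 for every neuron
print((hnorm.std(0) - 1).abs().max().item())   # close to 0 for every neuron
```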
45:08
And then we could, in principle, train this. Notice that calculating the mean and the standard deviation, these are just mathematical formulas, and they're perfectly differentiable. All of this is perfectly differentiable and we can just train this. The problem is you actually won't achieve a very good result with this, and the reason is that we want these to be roughly Gaussian, but only at initialization. We don't want these to be forced to be Gaussian always. We'd like to allow the neural net to move this around, to potentially make it more diffuse, to make it more sharp, to make some tanh neurons maybe more trigger-happy or less trigger-happy. So we'd like this distribution to move around, and we'd like the backpropagation to tell us how the distribution should move around.
45:52
And so, in addition to this idea of standardizing the activations at any point in the network, we also have to introduce this additional component, described in the paper here as scale and shift. Basically what we're doing is taking these normalized inputs and additionally scaling them by some gain and offsetting them by some bias to get our final output from this layer. What that amounts to is the following: we are going to allow a batch normalization gain, bngain, to be initialized at just ones, and the ones will be in the shape of 1 by n_hidden. Then we will also have a bnbias, which will be torch.zeros, and it will also be of the shape 1 by n_hidden. And then here the bngain will multiply this, and the bnbias will offset it here.
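The scale and shift can be sketched like this. The input is random stand-in data, `n_hidden = 200` is assumed from the 32 by 200 shape mentioned earlier, and `hnorm`, `hout` and `hshift` are illustrative names.

```python
import torch

n_hidden = 200
g = torch.Generator().manual_seed(0)                     # arbitrary seed
hpreact = torch.randn((32, n_hidden), generator=g) * 4.0 - 2.0

bngain = torch.ones((1, n_hidden))    # scale, starts at 1
bnbias = torch.zeros((1, n_hidden))   # shift, starts at 0

# Standardize over the batch, then scale by the gain and offset by the bias.
hnorm = (hpreact - hpreact.mean(0, keepdim=True)) / hpreact.std(0, keepdim=True)
hout = bngain * hnorm + bnbias

print(torch.equal(hout, hnorm))   # True: gain 1 and bias 0 leave the values unchanged

# A different gain and bias move the distribution of every neuron.
hshift = 2.0 * hnorm + 0.5
print(hshift.mean().item(), hshift.std().item())
```

Because the gain and bias have shape 1 by `n_hidden`, each neuron gets its own scale and offset, broadcast across the batch.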
46:52
So because this is initialized to one and this to zero, at initialization each neuron's firing values in this batch will be exactly unit Gaussian and will have nice numbers. No matter what the distribution of hpreact is coming in, coming out it will be unit Gaussian for each neuron, and that's roughly what we want, at least at initialization. Then during optimization we'll be able to backpropagate into bngain and bnbias and change them, so the network is given the full ability to do with this whatever it wants internally. Here we just have to make sure that we include these in the parameters of the neural net, because they will be trained with backpropagation.
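A minimal sketch of registering the gain and bias as parameters and checking that gradients reach them. The input `x`, the layer sizes and the loss are placeholders chosen only to make the example self-contained; they are not the network or loss used in the lecture.

```python
import torch

n_hidden = 200
g = torch.Generator().manual_seed(0)              # arbitrary seed
x = torch.randn((32, 30), generator=g)            # stand-in inputs to the hidden layer
W1 = torch.randn((30, n_hidden), generator=g) * (5 / 3) / 30 ** 0.5
bngain = torch.ones((1, n_hidden))
bnbias = torch.zeros((1, n_hidden))

parameters = [W1, bngain, bnbias]                 # gain and bias are registered too
for p in parameters:
    p.requires_grad = True

hpreact = x @ W1
hpreact = bngain * (hpreact - hpreact.mean(0, keepdim=True)) / hpreact.std(0, keepdim=True) + bnbias
loss = torch.tanh(hpreact).pow(2).mean()          # placeholder loss, only to get gradients
loss.backward()

print(bngain.grad.shape, bnbias.grad.shape)       # both torch.Size([1, 200])
```

If `bngain` and `bnbias` were left out of `parameters`, they would stay at one and zero forever and the layer would be forced to output standardized values throughout training.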
47:37
So let's initialize this, and then we should be able to train. And then we're also going to copy this line, which is the batch normalization layer here on a single line of code, and we're going to swing down here and do the exact same thing at test time. So, similar to train time, we're going to normalize and then scale, and that's going to give us our train and validation loss. We'll see in a second that we're actually going to change this a little bit, but for now I'm going to keep it this way. So I'm just going to wait for this to converge.
48:18
Okay, so I allowed the neural net to converge here, and when we scroll down we see that our validation loss is roughly 2.10, which I wrote down here. We see that this is actually kind of comparable to some of the results that we've achieved previously. Now, I'm not actually expecting an improvement in this case, and that's because we are dealing with a very simple neural net that has just a single hidden layer. In fact, in this very simple case of just one hidden layer, we were able to actually calculate what the scale of W should be to make these preactivations already have a roughly Gaussian shape, so the batch normalization is not doing much here.
48:53
But you might imagine that once you have a much deeper neural net that has lots of different types of operations, and there are also, for example, residual connections, which we'll cover, and so on, it will become very, very difficult to tune the scales of your weight matrices such that all the activations throughout the neural net are roughly Gaussian. That's going to become intractable very quickly. Compared to that, it's going to be much, much easier to sprinkle batch normalization layers throughout the neural net.
49:24
So in particular, it's common to look at every single linear layer like this one (this is a linear layer, multiplying by a weight matrix and adding a bias), or, for example, convolutions, which we'll cover later and which also basically perform a multiplication with a weight matrix but in a more spatially structured format. It's customary to take this linear layer or convolutional layer and append a batch normalization layer right after it, to control the scale of these activations at every point in the neural net. So we'd be adding these batch norm layers throughout the neural net, and this controls the scale of these activations throughout the neural net. It doesn't require us to do perfect mathematics and care about the activation distributions for all these different types of neural network Lego building blocks that you might want to introduce into your neural net. It significantly stabilizes the training, and that's why these layers are quite popular.
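The pattern of appending a batch normalization layer after each linear layer can be sketched with PyTorch's built-in modules. This is an illustration under assumed sizes (30 inputs, 200 hidden units, 27 outputs), not the network built by hand in this lecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed

# Each hidden linear layer is followed by batch normalization, then the nonlinearity.
model = nn.Sequential(
    nn.Linear(30, 200), nn.BatchNorm1d(200), nn.Tanh(),
    nn.Linear(200, 200), nn.BatchNorm1d(200), nn.Tanh(),
    nn.Linear(200, 27),
)

x = torch.randn(32, 30) * 10.0        # badly scaled inputs on purpose
h = model[1](model[0](x))             # preactivations after the first batch norm

print(h.mean().item(), h.std().item())   # near 0 and near 1 despite the input scale
print(model(x).shape)                    # torch.Size([32, 27])
```

A freshly constructed module is in training mode, so `nn.BatchNorm1d` normalizes with the statistics of the current batch here. Its learnable `weight` and `bias` play the role of the gain and bias described above.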
50:15
Now, the stability offered by batch normalization actually comes at a terrible cost, and that cost is that, if you think about what's happening here, something terribly strange and unnatural is happening. It used to be that we had a single example feeding into a neural net, and then we calculate its activations and its logits, and this is a deterministic sort of process, so you arrive at some logits for this example. Then, because of efficiency of training, we suddenly started to use batches of examples, but those batches of examples were processed independently and it was just an efficiency thing. But now suddenly, in batch normalization, because of the normalization through the batch, we are coupling these examples mathematically, in the forward pass and the backward pass of the neural net.
50:59
and the backward pass of a neural l so
50:59
and the backward pass of a neural l so now the hidden State activations H pract
51:02
now the hidden State activations H pract
51:02
now the hidden State activations H pract in your log jits for any one input
51:04
in your log jits for any one input
51:04
in your log jits for any one input example are not just a function of that
51:06
example are not just a function of that
51:06
example are not just a function of that example and its input but they're also a
51:09
example and its input but they're also a
51:09
example and its input but they're also a function of all the other examples that
51:11
function of all the other examples that
51:11
function of all the other examples that happen to come for a ride in that batch
51:14
happen to come for a ride in that batch
51:14
happen to come for a ride in that batch and these examples are sampled randomly
51:16
and these examples are sampled randomly
51:16
And so what's happening is, for example, when you look at hpreact, that's going to feed into h, the hidden state activations. For any one of these input examples, h is going to actually change slightly depending on what other examples there are in the batch. Depending on what other examples happen to come for a ride, h is going to change subtly, and it's going to jitter if you imagine sampling different examples, because the statistics of the mean and the standard deviation are going to be impacted. And so you'll get a jitter for h and you'll get a jitter for the logits.
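The coupling described above can be sketched in a few lines. This is a plain-Python illustration rather than the lecture's PyTorch code; the helper name batch_normalize, the tracked value 0.5, and the batch size of 32 are assumptions chosen only for the demonstration.

```python
import random
import statistics

def batch_normalize(batch, eps=1e-5):
    """Standardize a list of scalar pre-activations using the batch's own statistics."""
    mean = statistics.fmean(batch)
    std = statistics.stdev(batch)  # sample standard deviation of this batch
    return [(x - mean) / (std + eps) for x in batch]

random.seed(0)
tracked = 0.5  # hypothetical pre-activation of the one example we follow

normalized_values = []
for _ in range(5):
    # 31 other examples that happen to share the batch with the tracked one
    others = [random.gauss(0.0, 1.0) for _ in range(31)]
    normalized_values.append(batch_normalize([tracked] + others)[0])

# Same input every time, but a different normalized value in each batch.
jitter = max(normalized_values) - min(normalized_values)
print(normalized_values)
print("spread across batches:", jitter)
```

The tracked example never changes, yet its normalized value moves with whichever other examples were sampled, which is the jitter the lecture refers to.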
51:49
And you'd think that this would be a bug or something undesirable, but in a very strange way this actually turns out to be good in neural network training, as a side effect. The reason for that is that you can think of this as kind of like a regularizer, because what's happening is you have your input and you get your h, and then depending on the other examples this is jittering a bit. What that does is that it's effectively padding out any one of these input examples, and it's introducing a little bit of entropy. Because of the padding out, it's actually kind of like a form of data augmentation, which we'll cover in the future: it's kind of like augmenting the input a little bit and jittering it, and that makes it harder for the neural net to overfit to these concrete specific examples. So by introducing all this noise, it actually pads out the examples and it regularizes the neural net.
52:38
And that's one of the reasons why, deceivingly, as a second-order effect, this is actually a regularizer, and that has made it harder for us to remove the use of batch normalization. Basically, no one likes this property that the examples in the batch are coupled mathematically in the forward pass. It leads to all kinds of strange results (we'll go into some of that in a second as well), and it leads to a lot of bugs and so on. So no one likes this property, and people have tried to deprecate the use of batch normalization and move to other normalization techniques that do not couple the examples of a batch. Examples are layer normalization, instance normalization, group normalization, and so on, and we'll cover some of these later.
53:25
But basically, long story short, batch normalization was the first kind of normalization layer to be introduced. It worked extremely well, it happened to have this regularizing effect, and it stabilized training. People have been trying to remove it and move to some of the other normalization techniques, but it's been hard because it just works quite well. Some of the reason that it works quite well is, again, because of this regularizing effect and because it is quite effective at controlling the activations and their distributions.
53:55
So that's kind of like the brief story of batch normalization, and I'd like to show you one of the other weird sort of outcomes of this coupling.
54:04
So here's one of the strange outcomes that I only glossed over previously, when I was evaluating the loss on the validation set. Basically, once we've trained a neural net, we'd like to deploy it in some kind of a setting, and we'd like to be able to feed in a single individual example and get a prediction out from our neural net. But how do we do that when our neural net now, in a forward pass, estimates the statistics of the mean and standard deviation of a batch? The neural net expects batches as an input now, so how do we feed in a single example and get sensible results out?
54:35
And so the proposal in the batch normalization paper is the following. What we would like to do here is basically have a step after training that calculates and sets the batch norm mean and standard deviation a single time over the training set. I wrote this code here in the interest of time, and we're going to call it "calibrate the batch norm statistics". Basically, what we do is use torch.no_grad, telling PyTorch that we will not call .backward on any of this, so it's going to be a bit more efficient. We're going to take the training set, get the preactivations for every single training example, and then one single time estimate the mean and standard deviation over the entire training set. Then we're going to get bnmean and bnstd, and now these are fixed numbers estimated over the entire training set.
55:25
And here, instead of estimating the mean dynamically, we are going to use bnmean, and here we're just going to use bnstd. So at test time we are going to fix these, clamp them, and use them during inference. And now you see that we get basically an identical result, but the benefit that we've gained is that we can now also forward a single example, because the mean and standard deviation are now fixed tensors.
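A minimal sketch of this calibration step, written in plain Python instead of PyTorch so it has no dependencies. The list train_hpreact is a made-up stand-in for the preactivations of one hidden unit over the whole training set, and forward_single is a hypothetical helper; only the names bnmean and bnstd follow the lecture.

```python
import random
import statistics

random.seed(0)
# Hypothetical preactivations of one hidden unit, one value per training example
train_hpreact = [random.gauss(2.0, 3.0) for _ in range(10_000)]

# Calibration: measure the statistics once, after training, over the whole set
bnmean = statistics.fmean(train_hpreact)
bnstd = statistics.stdev(train_hpreact)

def forward_single(x):
    """Normalize one example with the fixed statistics; no batch is needed."""
    return (x - bnmean) / bnstd

first = forward_single(5.0)
second = forward_single(5.0)
print(bnmean, bnstd, first)
```

Because bnmean and bnstd no longer depend on the batch, the same input always maps to the same output, and a single example can be forwarded on its own.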
55:57
That said, nobody actually wants to estimate this mean and standard deviation as a second stage after neural network training, because everyone is lazy. So the batch normalization paper actually introduced one more idea, which is that we can estimate the mean and standard deviation in a running manner during training of the neural net. Then we can simply have a single stage of training, and on the side of that training we are estimating the running mean and standard deviation.
56:24
So let's see what that would look like. Let me basically take the mean here that we are estimating on the batch, and let me call this bnmeani, the mean on the i-th iteration, and then here this is bnstdi. Okay, and the mean comes here and the std comes here. So far I've done nothing: I've just moved things around and created these extra variables for the mean and standard deviation, and I've put them here. So far nothing has changed.
57:02
But what we're going to do now is keep a running mean of both of these values during training. So let me swing up here and create a bnmean_running, which I'm going to initialize at zeros, and then bnstd_running, which I'll initialize at ones, because in the beginning, because of the way we initialized W1 and b1, hpreact will be roughly unit Gaussian, so the mean will be roughly zero and the standard deviation roughly one. So I'm going to initialize these that way.
57:37
But then here I'm going to update these. In PyTorch, these running mean and standard deviation are not actually part of the gradient-based optimization; we're never going to derive gradients with respect to them. They're updated on the side of training. So what we're going to do here is say "with torch.no_grad", telling PyTorch that the update here is not supposed to be building out a graph, because there will be no .backward. This running mean is basically going to be 0.999 times the current value plus 0.001 times this new mean. In the same way, bnstd_running will be mostly what it used to be, but it will receive a small update in the direction of what the current standard deviation is.
58:36
And as you're seeing here, this update is outside of, and on the side of, the gradient-based optimization. It's simply being updated not using gradient descent; it's just being updated in a janky, smooth sort of running-mean manner. So while the network is training and these preactivations are changing and shifting around during backpropagation, we are keeping track of the typical mean and standard deviation, and we're estimating them once.
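The running update can be sketched as follows, again in plain Python rather than PyTorch. The 0.999 and 0.001 weights and the zero/one initialization come from the lecture; the synthetic batches (true mean 1.5, true standard deviation 2.0), the step count, and the helper mean_and_std are assumptions made for the illustration.

```python
import math
import random

def mean_and_std(values):
    """Return the mean and sample standard deviation of a list of floats."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var)

random.seed(0)
bnmean_running = 0.0  # initialized at zero
bnstd_running = 1.0   # initialized at one

for step in range(10_000):
    # Hypothetical mini-batch of preactivations for one hidden unit
    batch = [random.gauss(1.5, 2.0) for _ in range(32)]
    bnmeani, bnstdi = mean_and_std(batch)
    # Plain arithmetic on the side of training; no gradients are involved.
    bnmean_running = 0.999 * bnmean_running + 0.001 * bnmeani
    bnstd_running = 0.999 * bnstd_running + 0.001 * bnstdi

print(bnmean_running, bnstd_running)
```

Each step nudges the running values a little toward the current batch statistics, so after enough steps they settle near the typical mean and standard deviation of the preactivations.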
59:08
And when I run this, now I'm keeping track of this in the running manner. What we're hoping for, of course, is that bnmean_running and bnstd_running are going to be very similar to the ones that we calculated here before. That way we don't need a second stage, because we've sort of combined the two stages and put them on the side of each other, if you want to look at it that way.
59:31
And this is how this is also implemented in the batch normalization layer in PyTorch. During training the exact same thing will happen, and then later, when you're using inference, it will use the estimated running mean of both the mean and standard deviation of those hidden states.
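To make the training-versus-inference switch concrete, here is a toy single-feature layer in plain Python. It is a hypothetical sketch, not PyTorch's implementation: the class name TinyBatchNorm, its training flag, and the choice to track a running variance (with a small eps under the square root) are assumptions for illustration.

```python
import math

class TinyBatchNorm:
    """Toy one-feature batch norm with separate training and inference behavior."""

    def __init__(self, momentum=0.001, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.training = True
        self.running_mean = 0.0
        self.running_var = 1.0

    def __call__(self, batch):
        if self.training:
            # Training: normalize with this batch's statistics...
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # ...and update the running estimates on the side.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference: use the fixed running estimates instead.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]

bn = TinyBatchNorm(momentum=0.1)
for _ in range(200):
    bn([1.0, 2.0, 3.0, 4.0])  # training mode: batch statistics are used

bn.training = False
out = bn([2.5])               # inference mode: a single example is fine
print(bn.running_mean, bn.running_var, out)
```

With the flag switched off, the layer no longer needs a batch to compute statistics, which is what allows a single example to be forwarded at inference time.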
59:48
So let's wait for the optimization to converge, and hopefully the running mean and standard deviation are roughly equal to these two. Then we can simply use it here, and we don't need this stage of explicit calibration at the end.
1:00:00
Okay, so the optimization finished. I'll rerun the explicit estimation, and then the bnmean from the explicit estimation is here, and the bnmean_running from the running estimation during the optimization, you can see, is very, very similar. It's not identical, but it's pretty close. In the same way, bnstd is this and bnstd_running is this, and you can see that once again they are fairly similar values: not identical, but pretty close.
1:00:33
close and so then here instead of being mean we can use the BN mean running
1:00:35
mean we can use the BN mean running
1:00:35
mean we can use the BN mean running instead of bnsd we can use bnsd
1:00:38
instead of bnsd we can use bnsd
1:00:38
instead of bnsd we can use bnsd running and uh hopefully the validation
1:00:41
running and uh hopefully the validation
1:00:41
running and uh hopefully the validation loss will not be impacted too
1:00:43
loss will not be impacted too
1:00:43
loss will not be impacted too much okay so it's basically identical
1:00:46
much okay so it's basically identical
1:00:46
much okay so it's basically identical and this way we've eliminated the need
1:00:49
and this way we've eliminated the need
1:00:49
and this way we've eliminated the need for this explicit stage of calibration
1:00:51
for this explicit stage of calibration
1:00:51
for this explicit stage of calibration because we are doing it in line over
1:00:53
because we are doing it in line over
1:00:53
because we are doing it in line over here okay so we're almost done with
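A minimal sketch of the comparison just described, on synthetic data rather than the lecture's network. The toy "pre-activations", the zero/one initialization of the running buffers, and the 0.999 / 0.001 update factors are assumptions made for illustration; only the names bnmean, bnstd, bnmean_running and bnstd_running come from the transcript.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for hidden pre-activations: 10,000 examples, 5 neurons,
# each neuron with its own mean and standard deviation.
true_mean = torch.tensor([0.5, -1.0, 2.0, 0.0, 3.0])
true_std = torch.tensor([1.0, 2.0, 0.5, 1.5, 1.0])
hpreact_all = true_mean + true_std * torch.randn(10_000, 5)

# Running estimates, started at zeros and ones (an assumption for this sketch).
bnmean_running = torch.zeros(1, 5)
bnstd_running = torch.ones(1, 5)
momentum = 0.001  # small update factor (assumed)

for _ in range(5_000):
    # Sample a minibatch of 32 rows, as a training step would.
    ix = torch.randint(0, hpreact_all.shape[0], (32,))
    batch = hpreact_all[ix]
    bnmeani = batch.mean(0, keepdim=True)
    bnstdi = batch.std(0, keepdim=True)
    # The running update is not part of backprop, hence no_grad.
    with torch.no_grad():
        bnmean_running = (1 - momentum) * bnmean_running + momentum * bnmeani
        bnstd_running = (1 - momentum) * bnstd_running + momentum * bnstdi

# Explicit calibration: one pass over the whole set after "training".
bnmean = hpreact_all.mean(0, keepdim=True)
bnstd = hpreact_all.std(0, keepdim=True)

print("explicit mean:", bnmean)
print("running mean: ", bnmean_running)
print("explicit std: ", bnstd)
print("running std:  ", bnstd_running)
```

The two pairs of printed tensors should be close but not identical, which is the behavior the transcript reports.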
1:00:55
Okay, so we're almost done with batch normalization. There are only two more notes that I'd like to make. Number one, I've skipped a discussion of what this plus epsilon is doing here. This epsilon is usually some small fixed number, for example 1e-5 by default, and what it's doing is basically preventing a division by zero in the case that the variance over your batch is exactly zero. In that case we would normally have a division by zero here, but because of the plus epsilon this becomes a small number in the denominator instead, and things will be more well behaved.
1:01:26
So feel free to also add a plus epsilon here of a very small number. It doesn't actually substantially change the result. I'm going to skip it in our case just because this is unlikely to happen in our very simple example here.
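A small sketch of the division-by-zero case, using a deliberately degenerate batch. The tensor shapes and values are made up for illustration; eps = 1e-5 matches the default mentioned in the transcript.

```python
import torch

# A degenerate batch: every example has the same value for each neuron,
# so the per-neuron variance over the batch is exactly zero.
x = torch.full((4, 3), 2.0)
mean = x.mean(0, keepdim=True)
var = x.var(0, keepdim=True, unbiased=False)

eps = 1e-5

# Without epsilon the normalization computes 0 / 0, which is nan.
without_eps = (x - mean) / torch.sqrt(var)
# With epsilon the denominator is a small positive number, so the result is 0.
with_eps = (x - mean) / torch.sqrt(var + eps)

print(without_eps)
print(with_eps)
```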
1:01:37
The second thing I want you to notice is that we're being wasteful here, and it's very subtle. Right here, where we are adding the bias into hpreact, these biases are now actually useless, because we're adding them to hpreact but then we are calculating the mean for every one of these neurons and subtracting it. So whatever bias you add here is going to get subtracted right here, and so these biases are not doing anything. In fact, they're being subtracted out and they don't impact the rest of the calculation. So if you look at b1.grad, it's actually going to be zero, because it's being subtracted out and doesn't actually have any effect.
1:02:14
And so whenever you're using batch normalization layers, if you have any weight layers before them, like a linear or a conv or something like that, you're better off coming here and just not using a bias. So you don't want to use a bias, and then here you don't want to add it, because it's spurious. Instead we have this batch normalization bias here, and that batch normalization bias is now in charge of the biasing of this distribution, instead of this b1 that we had here originally.
1:02:43
So basically the batch normalization layer has its own bias, and there's no need to have a bias in the layer before it, because that bias is going to be subtracted out anyway. So that's the other small detail to be careful with. It's not going to do anything catastrophic: this b1 will just be useless. It will never get any gradient, it will not learn, it will stay constant, and it's just wasteful, but it doesn't really impact anything otherwise.
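The claim about b1.grad can be checked on a toy layer. This is a sketch under assumptions: the layer sizes, the random inputs, and the stand-in loss are hypothetical; the names hpreact, b1, bngain and bnbias follow the lecture's naming.

```python
import torch

torch.manual_seed(0)

x = torch.randn(32, 10)                          # hypothetical batch of inputs
W1 = torch.randn(10, 20, requires_grad=True)
b1 = torch.randn(20, requires_grad=True)         # bias of the layer BEFORE batch norm
bngain = torch.ones(1, 20, requires_grad=True)
bnbias = torch.zeros(1, 20, requires_grad=True)  # batch norm's own bias

hpreact = x @ W1 + b1
bnmeani = hpreact.mean(0, keepdim=True)          # this mean contains b1 ...
bnstdi = hpreact.std(0, keepdim=True)
hpreact = bngain * (hpreact - bnmeani) / bnstdi + bnbias  # ... so b1 cancels here
h = torch.tanh(hpreact)

loss = (h ** 2).mean()                           # stand-in loss, just to get gradients
loss.backward()

print("max |b1.grad|:    ", b1.grad.abs().max().item())
print("max |bnbias.grad|:", bnbias.grad.abs().max().item())
```

In floating point the gradient on b1 comes out as zero up to rounding error, while bnbias receives a real gradient.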
1:03:07
Okay, so I rearranged the code a little bit with comments, and I just wanted to give a very quick summary of the batch normalization layer. We are using batch normalization to control the statistics of activations in the neural net. It is common to sprinkle batch normalization layers across the neural net, and usually we will place them after layers that have multiplications, like for example a linear layer or a convolutional layer, which we may cover in the future.
1:03:34
Now, the batch normalization internally has parameters for the gain and the bias, and these are trained using backpropagation. It also has two buffers. The buffers are the mean and the standard deviation: the running mean and the running mean of the standard deviation. These are not trained using backpropagation. They are trained using this janky update of kind of like a running mean update. So these are sort of the parameters and the buffers of the batch norm layer.
1:04:05
And then really what it's doing is it's calculating the mean and the standard deviation of the activations that are feeding into the batch norm layer over that batch. Then it's centering that batch to be unit Gaussian, and then it's offsetting and scaling it by the learned bias and gain. On top of that, it's keeping track of the mean and standard deviation of the inputs, and it's maintaining this running mean and standard deviation. This will later be used at inference so that we don't have to re-estimate the mean and standard deviation all the time, and in addition that allows us to basically forward individual examples at test time. So that's the batch normalization layer. It's a fairly complicated layer, but this is what it's doing internally.
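The summary above can be collected into a small class. This is a sketch, not the code from the lecture: the class and attribute names, the momentum of 0.1, and the choice to track a running variance (as PyTorch's own layer does) instead of a running standard deviation are all assumptions.

```python
import torch

class SimpleBatchNorm1d:
    """Hypothetical minimal batch norm over a (batch, features) input."""

    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.training = True
        # Parameters: these would be trained with backpropagation.
        self.gain = torch.ones(dim)
        self.bias = torch.zeros(dim)
        # Buffers: updated with a running average, not with backpropagation.
        self.running_mean = torch.zeros(dim)
        self.running_var = torch.ones(dim)

    def __call__(self, x):
        if self.training:
            xmean = x.mean(0, keepdim=True)   # statistics of the current batch
            xvar = x.var(0, keepdim=True)
        else:
            xmean = self.running_mean         # fixed statistics at inference
            xvar = self.running_var
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)  # normalize
        out = self.gain * xhat + self.bias                # scale and offset
        if self.training:
            with torch.no_grad():
                m = self.momentum
                self.running_mean = (1 - m) * self.running_mean + m * xmean
                self.running_var = (1 - m) * self.running_var + m * xvar
        return out

torch.manual_seed(0)
bn = SimpleBatchNorm1d(4)

# Training mode: normalize with the batch's own statistics.
out_train = bn(torch.randn(64, 4) * 3 + 5)

# Inference mode: a single example can be forwarded using the buffers.
bn.training = False
out_single = bn(torch.randn(1, 4))
```

With the gain at one and the bias at zero, the training-mode output has roughly zero mean and unit standard deviation per feature, and the inference-mode call works on a batch of one because it no longer needs batch statistics.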
1:04:50
Now I wanted to show you a little bit of a real example. You can search for ResNet, which is a residual neural network, and these are common types of neural networks used for image classification. Of course we haven't covered ResNets in detail, so I'm not going to explain all the pieces of it, but for now just note that the image feeds into a ResNet at the top here, and there are many, many layers with repeating structure all the way to predictions of what's inside that image. This repeating structure is made up of these blocks, and these blocks are just sequentially stacked up in this deep neural network.
1:05:26
Now the code for this, the block that's used and repeated sequentially in series, is called this Bottleneck block. There's a lot here, this is all PyTorch, and of course we haven't covered all of it, but I want to point out some small pieces of it. Here in the init is where we initialize the neural net, so this block of code here is basically the kind of stuff we're doing here: we're initializing all the layers. And in the forward we are specifying how the neural net acts once you actually have the input, so this code here is along the lines of what we're doing here.
1:06:03
And now these blocks are replicated and stacked up serially, and that's what a residual network would be. So notice what's happening here. conv1: these are convolution layers, and these convolution layers are basically the same thing as a linear layer, except that convolutional layers are used for images, so they have spatial structure, and basically this linear multiplication and bias offset are done on patches instead of on the full input. So because these images have spatial structure, convolutions just basically do Wx + b, but they do it on overlapping patches of the input. Otherwise it's Wx + b.
1:06:48
Then we have the norm layer, which by default here is initialized to be a BatchNorm2d, so a two-dimensional batch normalization layer, and then we have a nonlinearity like ReLU. So here they use ReLU and we are using tanh in this case, but both are just nonlinearities and you can use them relatively interchangeably. For very deep networks, ReLUs typically empirically work a bit better.
1:07:13
So see the motif that's being repeated here: we have convolution, batch normalization, ReLU, convolution, batch normalization, ReLU, et cetera. And then here this is a residual connection that we haven't covered yet. But basically that's the exact same pattern we have here: we have a weight layer, like a convolution or a linear layer, then batch normalization, and then tanh, which is a nonlinearity. So basically a weight layer, a normalization layer, and a nonlinearity, and that's the motif that you would be stacking up when you create these deep neural networks, exactly as it's done here.
1:07:46
One more thing I'd like you to notice is that here, when they are initializing the conv layers, like conv1x1, the definition for that is right here, and so it's initializing an nn.Conv2d, which is a convolution layer in PyTorch. There's a bunch of keyword arguments here that I'm not going to explain yet, but you see how there's bias=False. The bias=False is there for exactly the same reason that bias is not used in our case. You see how I erased the use of bias; the use of bias is spurious because after this weight layer there's a batch normalization, and the batch normalization subtracts that bias and then has its own bias. So there's no need to introduce these spurious parameters. It wouldn't hurt performance, it's just useless. And so because they have this motif of conv, batch norm, ReLU, they don't need a bias here, because there's a bias inside here.
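A minimal sketch of the conv, batch norm, ReLU motif with the convolution's bias disabled. This is not the torchvision Bottleneck block; the channel counts, kernel size, and input shape are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Weight layer -> normalization layer -> nonlinearity.
block = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False),  # no bias: batch norm would cancel it
    nn.BatchNorm2d(8),                                      # carries its own learnable bias
    nn.ReLU(),
)

x = torch.randn(4, 3, 16, 16)  # hypothetical batch of 4 RGB images, 16x16 pixels
out = block(x)

print(block[0].bias)         # None, because bias=False
print(block[1].bias.shape)   # one bias per channel inside the batch norm
print(out.shape)
```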
1:08:35
inside here so by the way this example
1:08:35
inside here so by the way this example here is very easy to find just do
1:08:37
here is very easy to find just do
1:08:37
here is very easy to find just do resonet pie
1:08:38
resonet pie
1:08:38
resonet pie torch and uh it's this example here so
1:08:41
torch and uh it's this example here so
1:08:41
torch and uh it's this example here so this is kind of like the stock
1:08:42
this is kind of like the stock
1:08:42
this is kind of like the stock implementation of a residual neural
1:08:44
implementation of a residual neural
1:08:44
implementation of a residual neural network in pytorch and uh you can find
1:08:47
network in pytorch and uh you can find
1:08:47
network in pytorch and uh you can find that here but of course I haven't
1:08:48
that here but of course I haven't
1:08:48
that here but of course I haven't covered many of these parts yet and I
1:08:50
covered many of these parts yet and I
1:08:50
covered many of these parts yet and I would also like to briefly descend into
1:08:52
would also like to briefly descend into
1:08:52
would also like to briefly descend into the definitions of these pytorch layers
1:08:55
the definitions of these pytorch layers
1:08:55
the definitions of these pytorch layers and the the parameters that they take
1:08:56
and the the parameters that they take
1:08:57
and the the parameters that they take now instead of a convolutional layer
1:08:58
now instead of a convolutional layer
1:08:58
now instead of a convolutional layer we're going to look at a linear layer uh
1:09:01
we're going to look at a linear layer uh
1:09:01
we're going to look at a linear layer uh because that's the one that we're using
1:09:02
because that's the one that we're using
1:09:02
because that's the one that we're using here this is a linear layer and I
1:09:04
here this is a linear layer and I
1:09:04
here this is a linear layer and I haven't cover covered convolutions yet
1:09:06
haven't cover covered convolutions yet
1:09:06
haven't cover covered convolutions yet but as I mentioned convolutions are
1:09:07
but as I mentioned convolutions are
1:09:07
but as I mentioned convolutions are basically linear layers except on
1:09:10
basically linear layers except on
1:09:10
basically linear layers except on patches so a linear layer performs a WX
1:09:13
patches so a linear layer performs a WX
1:09:13
patches. So a linear layer performs Wx + b, except here they're calling the W "A", transposed. So it calculates Wx + b, very much like we did here. To initialize this layer you need to know the fan-in and the fan-out, so that they can initialize this W. This is the fan-in and the fan-out, so they know how big the weight matrix should be. You also need to pass in whether or not you want a bias. If you set it to False, then there will be no bias inside this layer, and you may want to do that exactly like in our case, if your layer is followed by a normalization layer such as batch norm. So this allows you to basically disable the bias.
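A minimal sketch of why the bias is redundant in front of batch norm, written in plain Python rather than PyTorch so it runs without dependencies. The helper `batch_normalize` and the numbers are made up for illustration; the gain and bias of the normalization layer are left out.

```python
import math

def batch_normalize(column, eps=1e-5):
    """Standardize one feature across the batch: (x - mean) / sqrt(var + eps)."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    return [(v - mean) / math.sqrt(var + eps) for v in column]

# Pre-activations of one hidden unit for a batch of four examples (made-up numbers).
pre_activations = [0.5, -1.2, 2.0, 0.3]
bias = 3.7  # any constant a linear layer's bias would add to this unit

without_bias = batch_normalize(pre_activations)
with_bias = batch_normalize([v + bias for v in pre_activations])

# The mean subtraction removes the constant, so both results agree
# up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(without_bias, with_bias))
print(max_diff)
```

Because the constant is subtracted out again, the bias of the linear layer would receive no useful gradient, which is the reason for switching it off.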
1:09:55
Now, in terms of the initialization, if we swing down here, this is reporting the variables used inside this linear layer. Our linear layer here has two parameters, the weight and the bias; in the same way, they have a weight and a bias, and they're talking about how they initialize it by default. By default PyTorch will initialize your weights by taking the fan-in and then doing one over the square root of fan-in, and then instead of a normal distribution they are using a uniform distribution. So it's very much the same thing, but they are using a 1 instead of 5/3, so there's no gain being calculated here; the gain is just one. But otherwise it's exactly one over the square root of fan-in, exactly as we have here.
1:10:42
So the square root of k, with k equal to one over fan-in, is the scale of the weights. But when they are drawing the numbers they're not using a Gaussian by default, they're using a uniform distribution, and so they draw uniformly from negative square root of k to square root of k. But it's the same idea and the same motivation as what we've seen in this lecture. The reason they're doing this is that if you have a roughly Gaussian input, this will ensure that out of this layer you will have a roughly Gaussian output, and you basically achieve that by scaling the weights by one over the square root of fan-in. So that's what this is doing.
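The two draws can be compared with a small sketch in plain Python (no PyTorch needed). The layer width `fan_in = 200` and the sample count are assumptions picked for illustration. Both draws scale with one over the square root of fan-in, but a uniform distribution on (-a, a) has standard deviation a / sqrt(3), so the uniform weights come out somewhat narrower than Gaussian weights with standard deviation 1 / sqrt(fan_in).

```python
import math
import random

random.seed(0)
fan_in = 200  # hypothetical layer width for this sketch

# Uniform draw as described above: U(-sqrt(k), sqrt(k)) with k = 1 / fan_in.
bound = math.sqrt(1.0 / fan_in)
uniform_w = [random.uniform(-bound, bound) for _ in range(100_000)]

# Gaussian draw with standard deviation 1 / sqrt(fan_in), i.e. gain 1.
gauss_w = [random.gauss(0.0, 1.0 / math.sqrt(fan_in)) for _ in range(100_000)]

def std(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

uniform_std = std(uniform_w)
gauss_std = std(gauss_w)
print(uniform_std, bound / math.sqrt(3))    # measured vs. theoretical uniform std
print(gauss_std, 1.0 / math.sqrt(fan_in))   # measured vs. theoretical Gaussian std
```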
1:11:21
And then the second thing is the batch normalization layer, so let's look at what that looks like in PyTorch. Here we have a one-dimensional batch normalization layer, exactly as we are using here, and there are a number of keyword arguments going into it as well. We need to know the number of features; for us that is 200. That is needed so that we can initialize these parameters here: the gain, the bias, and the buffers for the running mean and standard deviation. Then they need to know the value of epsilon here, and by default this is 1e-5. You don't typically change this too much.
1:11:54
Then they need to know the momentum, and the momentum here, as they explain, is basically used for these running mean and running standard deviation. By default the momentum here is 0.1; the momentum we are using here in this example is 0.001. You may want to change this sometimes. Roughly speaking, if you have a very large batch size, then typically what you'll see is that when you estimate the mean and the standard deviation for every single batch, if it's large enough, you're going to get roughly the same result, and so you can use a slightly higher momentum like 0.1. But for a batch size as small as 32, the mean and standard deviation here might take on slightly different numbers, because there are only 32 examples we are using to estimate the mean and standard deviation, so the value is changing around a lot. If your momentum is 0.1, that might not be good enough for this value to settle and converge to the actual mean and standard deviation over the entire training set. So basically, if your batch size is very small, a momentum of 0.1 is potentially dangerous: it might make it so that the running mean and standard deviation are thrashing too much during training and not actually converging properly.
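The thrashing can be illustrated with a small simulation in plain Python. This is only a sketch under assumptions: the data are synthetic Gaussians with a true mean of 2.0, the batch size is 32, and only the running mean is tracked. The update rule is the usual exponential moving average, where the momentum is the weight given to the newest batch statistic.

```python
import random

random.seed(0)
true_mean, batch_size, steps = 2.0, 32, 10_000

def ema_update(running, batch_stat, momentum):
    # new = (1 - momentum) * old + momentum * batch statistic
    return (1.0 - momentum) * running + momentum * batch_stat

running_fast, running_slow = 0.0, 0.0   # momentum 0.1 vs. momentum 0.001
tail_fast, tail_slow = [], []
for step in range(steps):
    batch = [random.gauss(true_mean, 1.0) for _ in range(batch_size)]
    batch_mean = sum(batch) / batch_size
    running_fast = ema_update(running_fast, batch_mean, 0.1)
    running_slow = ema_update(running_slow, batch_mean, 0.001)
    if step >= steps - 1000:   # record only the last 1000 steps
        tail_fast.append(running_fast)
        tail_slow.append(running_slow)

def spread(values):
    return max(values) - min(values)

# The estimate with momentum 0.1 keeps moving around far more than the
# estimate with momentum 0.001 once both have settled near the true mean.
print(spread(tail_fast), spread(tail_slow))
```

The trade-off is that the small momentum needs many more steps before the running estimate has moved away from its initial value.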
1:13:11
affine=True determines whether this batch normalization layer has these learnable affine parameters, the gain and the bias, and this is almost always kept at True. I'm not actually sure why you would want to change this to False. Then track_running_stats determines whether or not the batch normalization layer of PyTorch will be doing this. One reason you may want to skip the running stats is that you may want to, for example, estimate them at the end as a stage two, like this, and in that case you don't want the batch normalization layer to be doing all this extra compute that you're not going to use. And finally, we need to know which device we're going to run this batch normalization on, a CPU or a GPU, and what the data type should be: half precision, single precision, double precision, and so on. So that's the batch normalization layer. Otherwise they link to the paper; it's the same formula we've implemented, and everything is the same, exactly as we've done here.
1:14:11
Okay, so that's everything that I wanted to cover for this lecture. Really what I wanted to talk about is the importance of understanding the activations and the gradients and their statistics in neural networks, and this becomes increasingly important especially as you make your neural networks bigger, larger, and deeper. We looked at the distributions basically at the output layer, and we saw that if you have too-confident mispredictions, because the activations are too messed up at the last layer, you can end up with these hockey-stick losses. If you fix this, you get a better loss at the end of training, because your training is not doing wasteful work.
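A rough illustration of where the hockey stick comes from, in plain Python. The class count of 27 is an assumption (26 letters plus one special token, as in the makemore character vocabulary), and the logit value 20.0 is made up. With uniform logits the loss at initialization is log(27); with one large logit on a wrong class it is far higher, and the first stretch of training is spent just shrinking those logits.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]

num_classes = 27  # assumption: 26 letters plus one special token

uniform_logits = [0.0] * num_classes
uniform_loss = cross_entropy(uniform_logits, target=0)

# A network that is confidently wrong: one large logit on the wrong class.
confident_wrong = [0.0] * num_classes
confident_wrong[5] = 20.0
wrong_loss = cross_entropy(confident_wrong, target=0)

print(uniform_loss)  # equals log(27)
print(wrong_loss)    # much larger than the uniform loss
```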
1:14:44
Then we also saw that we need to control the activations. We don't want them to squash to zero or explode to infinity, because then you can run into a lot of trouble with all of these nonlinearities in these neural nets. Basically you want everything to be fairly homogeneous throughout the neural net; you want roughly Gaussian activations throughout the neural net. Then I talked about, okay, if we want roughly Gaussian activations, how do we scale these weight matrices and biases during initialization of the neural net so that everything is as controlled as possible? That gave us a large boost in improvement.
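The effect of the weight scale on a tanh layer can be sketched in plain Python. The width `fan_in = 200`, the sample count, and the 0.99 threshold are assumptions chosen for illustration. The local gradient of tanh is 1 - t^2, so outputs with |t| close to 1 pass almost no gradient back.

```python
import math
import random

random.seed(0)
fan_in, num_samples = 200, 1000

def saturated_fraction(weight_std):
    """Fraction of tanh outputs with |t| > 0.99 for unit-Gaussian inputs."""
    count = 0
    for _ in range(num_samples):
        x = [random.gauss(0.0, 1.0) for _ in range(fan_in)]
        w = [random.gauss(0.0, weight_std) for _ in range(fan_in)]
        pre = sum(xi * wi for xi, wi in zip(x, w))  # one pre-activation
        if abs(math.tanh(pre)) > 0.99:
            count += 1
    return count / num_samples

naive = saturated_fraction(1.0)                        # unscaled weights
scaled = saturated_fraction(1.0 / math.sqrt(fan_in))   # 1 / sqrt(fan_in) scaling
print(naive, scaled)
```

With unscaled weights most units land in the flat tails of tanh, while the scaled weights leave only a small fraction saturated.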
1:15:20
And then I talked about how that strategy is not actually possible for much, much deeper neural nets, because when you have much deeper neural nets with lots of different types of layers, it becomes really, really hard to precisely set the weights and the biases in such a way that the activations are roughly uniform throughout the neural net. So then I introduced the notion of a normalization layer. Now, there are many normalization layers that people use in practice: batch normalization, layer normalization, instance normalization, group normalization. We haven't covered most of them, but I've introduced the first one, and also the one that I believe came out first, and that's called batch normalization.
1:16:01
And we saw how batch normalization works. This is a layer that you can sprinkle throughout your deep neural net, and the basic idea is: if you want roughly Gaussian activations, well then take your activations, take the mean and the standard deviation, and center your data. You can do that because the centering operation is differentiable. But on top of that we actually had to add a lot of bells and whistles, and that gave you a sense of the complexities of the batch normalization layer. Now we're centering the data, that's great, but suddenly we need the gain and the bias, and now those are trainable. And then, because we are coupling all of the training examples, suddenly the question is how do you do the inference. To do the inference, we need to estimate these mean and standard deviation once over the entire training set and then use those at inference. But then no one likes to do stage two, so instead we fold everything into the batch normalization layer during training and try to estimate these in a running manner, so that everything is a bit simpler. And that gives us the batch normalization layer.
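The pieces listed above (centering and scaling, trainable gain and bias, running buffers, and a separate inference path) can be put together in a compact sketch. This is a hypothetical plain-Python class, `BatchNorm1dSketch`, not the lecture's code and not PyTorch's implementation; as a simplification it tracks the biased batch variance in the running buffer.

```python
import math

class BatchNorm1dSketch:
    """Hypothetical minimal batch norm over a list of rows (batch x features)."""

    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps, self.momentum = eps, momentum
        self.training = True
        self.gamma = [1.0] * dim          # trainable gain
        self.beta = [0.0] * dim           # trainable bias
        self.running_mean = [0.0] * dim   # buffers, not trained by backprop
        self.running_var = [1.0] * dim

    def __call__(self, rows):
        n, out = len(rows), [list(r) for r in rows]
        for j in range(len(self.gamma)):
            col = [r[j] for r in rows]
            if self.training:
                # Training: use batch statistics and update the running buffers.
                mean = sum(col) / n
                var = sum((v - mean) ** 2 for v in col) / n
                m = self.momentum
                self.running_mean[j] = (1 - m) * self.running_mean[j] + m * mean
                self.running_var[j] = (1 - m) * self.running_var[j] + m * var
            else:
                # Inference: use the running buffers, so one example is enough.
                mean, var = self.running_mean[j], self.running_var[j]
            for i in range(n):
                xhat = (rows[i][j] - mean) / math.sqrt(var + self.eps)
                out[i][j] = self.gamma[j] * xhat + self.beta[j]
        return out

bn = BatchNorm1dSketch(dim=2)
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
train_out = bn(batch)            # batch statistics, running buffers updated
bn.training = False
single_out = bn([[2.5, 25.0]])   # a single example works at inference time
```

Forgetting to flip the `training` flag before evaluation is one of the ways this kind of layer produces silent bugs.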
1:17:07
And as I mentioned, no one likes this layer. It causes a huge amount of bugs, and intuitively it's because it is coupling examples in the forward pass of a neural net. I've shot myself in the foot with this layer over and over again in my life, and I don't want you to suffer the same, so basically try to avoid it as much as possible. Some of the other alternatives to these layers are, for example, group normalization or layer normalization, and those have become more common in more recent deep learning, but we haven't covered those yet. But definitely batch normalization was very influential at the time when it came out, in roughly 2015, because it was kind of the first time that you could reliably train much deeper neural nets. Fundamentally, the reason for that is that this layer was very effective at controlling the statistics of the activations in the neural net.
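The coupling can be made concrete with a small comparison in plain Python. The helper names and numbers are made up, and the gain and bias are omitted. The same example is placed in two different batches: its batch norm output changes with its batch mates, while its layer norm output, which only uses the example's own features, does not.

```python
import math

def normalize(values, eps=1e-5):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(rows):
    # Normalize each feature (column) across the examples in the batch.
    cols = [normalize(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def layer_norm(rows):
    # Normalize each example (row) across its own features.
    return [normalize(row) for row in rows]

example = [1.0, 2.0, 4.0]
batch_a = [example, [0.0, 0.0, 1.0], [2.0, 5.0, 3.0]]
batch_b = [example, [9.0, -3.0, 7.0], [4.0, 8.0, -6.0]]

bn_a, bn_b = batch_norm(batch_a)[0], batch_norm(batch_b)[0]
ln_a, ln_b = layer_norm(batch_a)[0], layer_norm(batch_b)[0]

print(bn_a != bn_b)  # batch norm output for `example` depends on its batch mates
print(ln_a == ln_b)  # layer norm output for `example` does not
```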
1:18:03
So that's the story so far, and that's all I wanted to cover. In the future lectures, hopefully we can start going into recurrent neural nets. Recurrent neural nets, as we'll see, are just very, very deep networks, because you unroll the loop when you actually optimize these neural nets, and that's where a lot of this analysis around the activation statistics and all these normalization layers will become very, very important for good performance. So we'll see that next time. Bye.
1:18:36
Okay, so I lied. I would like us to do one more summary here as a bonus, and I think it's useful to have one more summary of everything I've presented in this lecture. But also I would like us to start PyTorch-ifying our code a little bit, so it looks much more like what you would encounter in PyTorch. You'll see that I will structure our code into these modules, like a Linear module and a BatchNorm module, and I'm putting the code inside these modules so that we can construct neural networks very much like we would construct them in PyTorch. I will go through this in detail. So we'll create our neural net, then we will do the optimization loop as we did before, and then the one more thing that I want to do here is look at the activation statistics, both in the forward pass and in the backward pass. And then here we have the evaluation and sampling, just like before.
1:19:24
So let me rewind all the way up here and go a little bit slower. So here I'm creating a linear layer. You'll notice that torch.nn has lots of different types of layers, and one of those layers is the linear layer. torch.nn.Linear takes a number of input features, output features, whether or not we should have a bias, and then the device that we want to place this layer on, and the data type. So I will omit these two, but otherwise we have the exact same thing: we have the fan_in, which is the number of inputs, fan_out, the number of outputs, and whether or not we want to use a bias.
1:19:57
And internally, inside this layer, there's a weight and a bias, if you'd like it. It is typical to initialize the weight using, say, random numbers drawn from a Gaussian, and then here's the Kaiming initialization that we discussed already in this lecture, and that's a good default, and also the default that I believe PyTorch chooses. And by default, the bias is usually initialized to zeros.
1:20:18
Now, when you call this module, this will basically calculate W times x plus b, if you have a b. And then when you also call .parameters() on this module, it will return the tensors that are the parameters of this layer.
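The linear layer described above can be sketched roughly as follows. This is a reconstruction from the spoken description, not the lecture's exact code: the class name, the optional `generator` argument, and the usage lines are assumptions for illustration. Note also that `torch.nn.Linear` itself draws both weight and bias from a uniform distribution bounded by 1/sqrt(fan_in), so the Gaussian weights and zero bias here follow the description in the talk rather than PyTorch's exact default.

```python
import torch


class Linear:
    """Minimal linear layer, loosely mirroring torch.nn.Linear (device and dtype omitted)."""

    def __init__(self, fan_in, fan_out, bias=True, generator=None):
        # Gaussian weights scaled by 1/sqrt(fan_in), as described in the lecture.
        self.weight = torch.randn((fan_in, fan_out), generator=generator) / fan_in**0.5
        # Bias starts at zero; None when the layer is built without a bias.
        self.bias = torch.zeros(fan_out) if bias else None

    def __call__(self, x):
        # x has shape (batch, fan_in); the result has shape (batch, fan_out).
        out = x @ self.weight
        if self.bias is not None:
            out = out + self.bias
        return out

    def parameters(self):
        # Return the tensors that are the parameters of this layer.
        return [self.weight] + ([] if self.bias is None else [self.bias])


layer = Linear(10, 4)
y = layer(torch.randn(32, 10))
print(y.shape, len(layer.parameters()))
```

Keeping `parameters()` as a plain method that returns a list is what later makes it easy to gather every trainable tensor across a stack of layers.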
1:20:32
Now, next we have the batch normalization layer. So I've written that here, and this is very similar to PyTorch's nn.BatchNorm1d layer, as shown here. So I'm kind of taking these three parameters here: the dimensionality, the epsilon that we will use in the division, and the momentum that we will use in keeping track of these running stats, the running mean and the running variance.
1:20:56
Now, PyTorch actually takes quite a few more things, but I'm assuming some of their settings. So for us, affine will be True; that means that we will be using a gamma and beta after the normalization. track_running_stats will be True, so we will be keeping track of the running mean and the running variance in the BatchNorm. Our device by default is the CPU, and the data type by default is float32. So those are the defaults; otherwise, we are taking all the same parameters in this BatchNorm layer. So first I'm just saving them.
1:21:31
Now here's something new: there's a .training, which by default is True. And PyTorch nn.Modules also have this attribute, .training, and that's because many modules, and BatchNorm is included in that, have a different behavior whether you are training your neural net or whether you are running it in an evaluation mode and calculating your evaluation loss, or using it for inference on some test examples.
1:21:53
And BatchNorm is an example of this, because when we are training, we are going to be using the mean and the variance estimated from the current batch, but during inference we are using the running mean and running variance. And so also, if we are training, we are updating the running mean and variance, but if we are testing, then these are not being updated; they're kept fixed. And so this flag is necessary, and by default True, just like in PyTorch.
1:22:18
Now, the parameters of BatchNorm1d are the gamma and the beta here. And then the running mean and running variance are called buffers in PyTorch nomenclature, and these buffers are trained using an exponential moving average here, explicitly, and they are not part of the backpropagation and stochastic gradient descent. So they are not sort of like parameters of this layer, and that's why, when we have parameters() here, we only return gamma and beta; we do not return the mean and the variance. This is trained sort of internally here, every forward pass, using the exponential moving average. So that's the initialization.
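The split between parameters and buffers, and the role of the training flag, can be checked directly on PyTorch's own layer. The snippet below is a small illustration using `torch.nn.BatchNorm1d`; the feature count of 100 and the batch size of 8 are arbitrary choices, not values from the lecture.

```python
import torch

bn = torch.nn.BatchNorm1d(100)  # 100 features; eps, momentum and affine keep their defaults

# Trainable parameters: 'weight' plays the role of gamma, 'bias' the role of beta.
print([name for name, _ in bn.named_parameters()])

# Buffers: updated during forward passes in training mode, never by the optimizer.
print([name for name, _ in bn.named_buffers()])

print(bn.training)  # modules start out in training mode
bn.eval()           # evaluation mode: running statistics are used and kept fixed
print(bn.training)

before = bn.running_mean.clone()
bn(torch.randn(8, 100))                      # forward pass in evaluation mode
print(torch.equal(before, bn.running_mean))  # the buffer has not moved
```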
1:22:58
Now, in a forward pass, if we are training, then we use the mean and the variance estimated by the batch. Let me pull up the paper here. We calculate the mean and the variance. Now, up above I was estimating the standard deviation and keeping track of the standard deviation here, in the running standard deviation instead of running variance. But let's follow the paper exactly: here they calculate the variance, which is the standard deviation squared, and that's what's kept track of in a running variance instead of a running standard deviation. But those two would be very, very similar, I believe.
1:23:34
If we are not training, then we use the running mean and variance. We normalize, and then here I am calculating the output of this layer, and I'm also assigning it to an attribute called out. Now, .out is something that I'm using in our modules here. This is not what you would find in PyTorch; we are slightly deviating from it. I'm creating a .out because I would like to very easily maintain all those variables so that we can create statistics of them and plot them. But PyTorch nn.Modules will not have a .out attribute.
1:24:05
And finally, here we are updating the buffers using, again, as I mentioned, an exponential moving average, given the provided momentum. And importantly, you'll notice that I'm using the torch.no_grad context manager, and I'm doing this because if we don't use this, then PyTorch will start building out an entire computational graph out of these tensors, because it is expecting that we will eventually call .backward(). But we are never going to be calling .backward() on anything that includes the running mean and running variance, so that's why we need to use this context manager, so that we are not sort of maintaining them using all this additional memory. So this will make it more efficient, and it's just telling PyTorch that there will be no backward; we just have a bunch of tensors, we want to update them, that's it. And then we return.
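Putting the pieces of the batch normalization layer together gives something like the sketch below. It is reconstructed from the spoken walkthrough, so treat the details as assumptions: the default values `eps=1e-5` and `momentum=0.1` are borrowed from PyTorch's defaults, and `x.var` uses PyTorch's default (unbiased) estimate.

```python
import torch


class BatchNorm1d:
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.training = True
        # Parameters, trained with backpropagation.
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)
        # Buffers, maintained with an exponential moving average.
        self.running_mean = torch.zeros(dim)
        self.running_var = torch.ones(dim)

    def __call__(self, x):
        if self.training:
            xmean = x.mean(0, keepdim=True)  # batch mean
            xvar = x.var(0, keepdim=True)    # batch variance
        else:
            xmean = self.running_mean
            xvar = self.running_var
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)  # normalize
        self.out = self.gamma * xhat + self.beta          # kept for later statistics
        if self.training:
            # No graph is built for the buffer updates.
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
        return self.out

    def parameters(self):
        # Only gamma and beta are parameters; the running statistics are buffers.
        return [self.gamma, self.beta]


bn = BatchNorm1d(5)
x = torch.randn(64, 5) * 3 + 2
y = bn(x)                  # training mode: normalized with batch statistics
print(y.mean(0), y.std(0))
bn.training = False
y_eval = bn(x)             # evaluation mode: normalized with the running statistics
```

In training mode each output feature should come out with mean close to zero and standard deviation close to one, while the evaluation-mode output depends on how far the running statistics have moved toward the data.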
1:24:51
Okay, now scrolling down, we have the Tanh layer. This is very, very similar to torch.tanh, and it doesn't do too much; it just calculates tanh, as you might expect. So that's torch.tanh, and there's no parameters in this layer. But because these are layers, it now becomes very easy to sort of stack them up into basically just a list, and we can do all the initializations that we're used to. So we have the initial sort of embedding matrix, we have our layers, and we can call them sequentially.
1:25:24
And then again, with torch.no_grad, there's some initializations here. So we want to make the output softmax a bit less confident, like we saw, and in addition to that, because we are using a six-layer multilayer perceptron here, so you see how I'm stacking Linear, Tanh, Linear, Tanh, etc., I'm going to be using the gain here. And I'm going to play with this in a second, so you'll see, when we change this, what happens to the statistics.
1:25:48
Finally, the parameters are basically the embedding matrix and all the parameters in all the layers. And notice here I'm using a double list comprehension, if you want to call it that, but for every layer in layers, and for every parameter in each of those layers, we are just stacking up all those p's, all those parameters. Now, in total we have 46,000 parameters, and I'm telling PyTorch that all of them require gradient.
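The stacking, the initialization tweaks, and the double list comprehension can be sketched as below. The sizes (`vocab_size=27`, `block_size=3`, `n_embd=10`, `n_hidden=100`) and the 0.1 scaling of the last layer are assumptions not stated in this passage; they were chosen because they give a parameter count consistent with the roughly 46,000 quoted above.

```python
import torch


class Linear:
    def __init__(self, fan_in, fan_out, bias=True):
        self.weight = torch.randn((fan_in, fan_out)) / fan_in**0.5
        self.bias = torch.zeros(fan_out) if bias else None

    def __call__(self, x):
        self.out = x @ self.weight
        if self.bias is not None:
            self.out = self.out + self.bias
        return self.out

    def parameters(self):
        return [self.weight] + ([] if self.bias is None else [self.bias])


class Tanh:
    def __call__(self, x):
        self.out = torch.tanh(x)
        return self.out

    def parameters(self):
        return []  # no trainable tensors


# Assumed sizes (hypothetical, see the note above).
vocab_size, block_size, n_embd, n_hidden = 27, 3, 10, 100

C = torch.randn((vocab_size, n_embd))  # embedding matrix
layers = [
    Linear(n_embd * block_size, n_hidden), Tanh(),
    Linear(n_hidden, n_hidden), Tanh(),
    Linear(n_hidden, n_hidden), Tanh(),
    Linear(n_hidden, n_hidden), Tanh(),
    Linear(n_hidden, n_hidden), Tanh(),
    Linear(n_hidden, vocab_size),
]

with torch.no_grad():
    layers[-1].weight *= 0.1       # make the output softmax less confident
    for layer in layers[:-1]:
        if isinstance(layer, Linear):
            layer.weight *= 5 / 3  # gain applied to every hidden linear layer

# Double list comprehension: every parameter of every layer, plus the embedding.
parameters = [C] + [p for layer in layers for p in layer.parameters()]
for p in parameters:
    p.requires_grad = True

n_params = sum(p.nelement() for p in parameters)
print(n_params)
```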
1:26:18
Then here, we have everything we are actually mostly used to. We are sampling a batch, we are doing a forward pass; the forward pass now is just the linear application of all the layers in order, followed by the cross entropy. And then in the backward pass, you'll notice that for every single layer, I now iterate over all the outputs and I'm telling PyTorch to retain the gradient of them. And then here, we are already used to: all the gradients set to None, do the backward to fill in the gradients, do an update using stochastic gradient descent, and then track some statistics. And then I am going to break after a single iteration.
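A single pass through that loop might look like the sketch below. To keep it self-contained it trains on random stand-in data rather than the names dataset, uses a shallower network than the six-layer one in the lecture, and picks a batch size of 32 and a learning rate of 0.1; all of these are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


class Linear:
    def __init__(self, fan_in, fan_out):
        self.weight = torch.randn((fan_in, fan_out)) / fan_in**0.5
        self.bias = torch.zeros(fan_out)

    def __call__(self, x):
        self.out = x @ self.weight + self.bias
        return self.out

    def parameters(self):
        return [self.weight, self.bias]


class Tanh:
    def __call__(self, x):
        self.out = torch.tanh(x)
        return self.out

    def parameters(self):
        return []


torch.manual_seed(0)
vocab_size, block_size, n_embd, n_hidden = 27, 3, 10, 100
# Random stand-in data (hypothetical); the lecture uses real character contexts.
Xtr = torch.randint(0, vocab_size, (1000, block_size))
Ytr = torch.randint(0, vocab_size, (1000,))

C = torch.randn((vocab_size, n_embd))
layers = [Linear(n_embd * block_size, n_hidden), Tanh(),
          Linear(n_hidden, n_hidden), Tanh(),
          Linear(n_hidden, vocab_size)]
parameters = [C] + [p for layer in layers for p in layer.parameters()]
for p in parameters:
    p.requires_grad = True

batch_size, lr = 32, 0.1
lossi = []
for i in range(1000):
    # Sample a minibatch.
    ix = torch.randint(0, Xtr.shape[0], (batch_size,))
    Xb, Yb = Xtr[ix], Ytr[ix]

    # Forward pass: apply the layers in order, then the cross entropy.
    emb = C[Xb]
    x = emb.view(emb.shape[0], -1)
    for layer in layers:
        x = layer(x)
    loss = F.cross_entropy(x, Yb)

    # Backward pass: keep gradients of the intermediate outputs for inspection.
    for layer in layers:
        layer.out.retain_grad()
    for p in parameters:
        p.grad = None
    loss.backward()

    # Stochastic gradient descent update.
    for p in parameters:
        p.data += -lr * p.grad

    lossi.append(loss.log10().item())
    break  # stop after a single iteration to inspect the statistics
```

Without `retain_grad()`, PyTorch would discard the gradients of non-leaf tensors such as `layer.out` after the backward pass, so there would be nothing to plot for the backward statistics.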
1:26:52
Now, here in this cell, in this diagram, I'm visualizing the histograms of the forward pass activations, and I'm specifically doing it at the tanh layers. So, iterating over all the layers except for the very last one, which is basically just the softmax layer, if it is a Tanh layer, and I'm using a Tanh layer just because they have a finite output, negative one to one, and so it's very easy to visualize here. So you see negative one to one, and it's a finite range and easy to work with.
1:27:20
I take the out tensor from that layer into t, and then I'm calculating the mean, the standard deviation, and the percent saturation of t. And the way I define the percent saturation is that t.abs() is greater than 0.97, so that means we are here at the tails of the tanh. And remember that when we are in the tails of the tanh, that will actually stop gradients, so we don't want this to be too high.
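The three statistics mentioned here are one-liners on a tensor. In the sketch below, `t` is a random stand-in for one tanh layer's stored output; the input scale of 2.0 is an arbitrary choice that pushes some units into the tails.

```python
import torch

torch.manual_seed(0)
# Stand-in for a tanh layer's output (layer.out in the lecture's modules).
t = torch.tanh(torch.randn(32, 100) * 2.0)

mean = t.mean().item()
std = t.std().item()
# Percent of activations in the flat tails of tanh, where the local gradient is close to zero.
saturated = (t.abs() > 0.97).float().mean().item() * 100

print(f"mean {mean:+.2f}, std {std:.2f}, saturated: {saturated:.2f}%")
```

The threshold works because the local derivative of tanh is 1 - t**2, which is about 0.06 at |t| = 0.97, so gradients flowing through such units are scaled down heavily.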
1:27:48
Now, here I'm calling torch.histogram, and then I am plotting this histogram. So basically what this is doing is that for every different layer, and they each have a different color, we are looking at how many values in these tensors take on any of the values below on this axis here. So the first layer is fairly saturated here at 20%, so you can see that it's got tails here, but then everything sort of stabilizes. And if we had more layers here, it would actually just stabilize at around a standard deviation of about 0.65, and the saturation would be roughly 5%. And the reason that this stabilizes and gives us a nice distribution here is because the gain is set to 5/3.
1:28:31
over3 now here this gain you see that by default we initialize with 1 /un of fan
1:28:34
default we initialize with 1 /un of fan
1:28:34
default we initialize with 1 /un of fan in but then here during initialization I
1:28:37
in but then here during initialization I
1:28:37
in but then here during initialization I come in and I erator all the layers and
1:28:38
come in and I erator all the layers and
1:28:38
come in and I erator all the layers and if it's a linear layer I boost that by
1:28:40
if it's a linear layer I boost that by
1:28:40
if it's a linear layer I boost that by the gain now we saw that one so
1:28:44
the gain now we saw that one so
1:28:44
the gain now we saw that one so basically if we just do not use a gain
1:28:47
basically if we just do not use a gain
1:28:47
basically if we just do not use a gain then what happens if I redraw this you
1:28:50
then what happens if I redraw this you
1:28:50
then what happens if I redraw this you will see that the standard deviation is
1:28:53
will see that the standard deviation is
1:28:53
will see that the standard deviation is shrinking and the saturation is coming
1:28:55
shrinking and the saturation is coming
1:28:55
shrinking and the saturation is coming to zero and basically what's happening
1:28:57
to zero and basically what's happening
1:28:58
to zero and basically what's happening is the first layer is you know pretty
1:29:00
is the first layer is you know pretty
1:29:00
is the first layer is you know pretty decent but then further layers are just
1:29:02
decent but then further layers are just
1:29:02
decent but then further layers are just kind of like shrinking down to zero and
1:29:05
kind of like shrinking down to zero and
1:29:05
kind of like shrinking down to zero and it's happening slowly but it's shrinking
1:29:06
it's happening slowly but it's shrinking
1:29:06
it's happening slowly but it's shrinking to zero and the reason for that is when
1:29:09
to zero and the reason for that is when
1:29:09
to zero and the reason for that is when you just have a sandwich of linear
1:29:11
you just have a sandwich of linear
1:29:11
you just have a sandwich of linear layers alone then a then initializing
1:29:15
layers alone then a then initializing
1:29:15
layers alone then a then initializing our weights in this manner we saw
1:29:18
our weights in this manner we saw
1:29:18
our weights in this manner we saw previously would have conserved the
1:29:20
previously would have conserved the
1:29:20
previously would have conserved the standard deviation of one but because we
1:29:22
standard deviation of one but because we
1:29:22
standard deviation of one but because we have this interspersed 10 in layers in
1:29:25
have this interspersed 10 in layers in
1:29:25
have this interspersed 10 in layers in there these 10h layers are squashing
1:29:28
there these 10h layers are squashing
1:29:28
there these 10h layers are squashing functions and so they take your
1:29:30
functions and so they take your
1:29:30
functions and so they take your distribution and they slightly squash it
1:29:32
distribution and they slightly squash it
1:29:32
distribution and they slightly squash it and so some gain is necessary to keep
1:29:35
and so some gain is necessary to keep
1:29:35
and so some gain is necessary to keep expanding it to fight the
1:29:39
expanding it to fight the
1:29:39
expanding it to fight the squashing so it just turns out that 5
1:29:41
squashing so it just turns out that 5
1:29:41
squashing so it just turns out that 5 over3 is a good value so if we have
1:29:44
over3 is a good value so if we have
1:29:44
over3 is a good value so if we have something too small like one we saw that
1:29:46
something too small like one we saw that
1:29:46
something too small like one we saw that things will come toward zero but if it's
1:29:49
things will come toward zero but if it's
1:29:49
things will come toward zero but if it's something too high let's do
1:29:51
something too high let's do
1:29:51
something too high let's do two then here we see that um
1:29:56
well let me do something a bit more
1:29:58
well let me do something a bit more
1:29:58
well let me do something a bit more extreme because so it's a bit more
1:29:59
extreme because so it's a bit more
1:30:00
extreme because so it's a bit more visible let's try
1:30:01
visible let's try
1:30:01
visible let's try three okay so we see here that the
1:30:03
three okay so we see here that the
1:30:03
three okay so we see here that the saturations are going to be way too
1:30:05
saturations are going to be way too
1:30:05
saturations are going to be way too large okay so three would create way too
1:30:08
large okay so three would create way too
1:30:08
large okay so three would create way too saturated activations so 5 over3 is a
1:30:12
saturated activations so 5 over3 is a
1:30:12
saturated activations so 5 over3 is a good setting for a sandwich of linear
1:30:15
good setting for a sandwich of linear
1:30:15
good setting for a sandwich of linear layers with 10h activations and it
1:30:18
layers with 10h activations and it
1:30:18
layers with 10h activations and it roughly stabilizes the standard
1:30:19
roughly stabilizes the standard
1:30:19
roughly stabilizes the standard deviation at a reasonable point now
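The effect of the gain on the forward pass can be reproduced in a small experiment. The following is a minimal sketch in plain Python (standard library only, so it is not the lecture's PyTorch code). The function name `tanh_stack_stats`, the layer width, the batch size, and the 0.97 saturation threshold are all assumptions chosen for illustration.

```python
import math
import random


def tanh_stack_stats(gain, n_layers=5, width=50, batch=64, seed=0):
    """Return a (std, saturated_fraction) pair for each tanh layer's output."""
    rng = random.Random(seed)
    # Unit-Gaussian inputs, one row per example.
    x = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    scale = gain / math.sqrt(width)  # gain / sqrt(fan_in)
    stats = []
    for _ in range(n_layers):
        # One weight row per output unit, each entry drawn from N(0, scale^2).
        w = [[rng.gauss(0.0, scale) for _ in range(width)] for _ in range(width)]
        x = [[math.tanh(sum(wi * xi for wi, xi in zip(w_row, x_row))) for w_row in w]
             for x_row in x]
        flat = [v for row in x for v in row]
        mean = sum(flat) / len(flat)
        std = math.sqrt(sum((v - mean) ** 2 for v in flat) / len(flat))
        # Assumed saturation criterion: |tanh output| above 0.97.
        saturated = sum(abs(v) > 0.97 for v in flat) / len(flat)
        stats.append((std, saturated))
    return stats


if __name__ == "__main__":
    for gain in (1.0, 5 / 3, 3.0):
        report = ", ".join(f"std {s:.2f} sat {100 * f:.0f}%"
                           for s, f in tanh_stack_stats(gain))
        print(f"gain {gain:.2f}: {report}")
```

In this toy setup, a gain of 1 should show the standard deviation shrinking layer by layer, while a gain of 3 should show a large saturated fraction, in line with what the plots in the lecture show. The exact numbers depend on the width, batch size, and seed.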
1:30:23
deviation at a reasonable point now
1:30:23
deviation at a reasonable point now honestly I have no idea where 5 over3
1:30:24
honestly I have no idea where 5 over3
1:30:25
honestly I have no idea where 5 over3 came from in pytorch um when we were
1:30:27
came from in pytorch um when we were
1:30:27
came from in pytorch um when we were looking at the coming initialization um
1:30:30
looking at the coming initialization um
1:30:30
looking at the coming initialization um I see empirically that it stabilizes
1:30:32
I see empirically that it stabilizes
1:30:32
I see empirically that it stabilizes this sandwich of linear an 10age and
1:30:34
this sandwich of linear an 10age and
1:30:34
this sandwich of linear an 10age and that the saturation is in a good range
1:30:36
that the saturation is in a good range
1:30:36
that the saturation is in a good range um but I don't actually know if this
1:30:37
um but I don't actually know if this
1:30:37
um but I don't actually know if this came out of some math formula I tried
1:30:39
came out of some math formula I tried
1:30:39
came out of some math formula I tried searching briefly for where this comes
1:30:41
searching briefly for where this comes
1:30:41
searching briefly for where this comes from uh but I wasn't able to find
1:30:43
from uh but I wasn't able to find
1:30:43
from uh but I wasn't able to find anything uh but certainly we see that
1:30:45
anything uh but certainly we see that
1:30:45
anything uh but certainly we see that empirically these are very nice ranges
1:30:47
empirically these are very nice ranges
1:30:47
empirically these are very nice ranges our saturation is roughly 5% which is a
1:30:49
our saturation is roughly 5% which is a
1:30:49
our saturation is roughly 5% which is a pretty good number and uh this is a good
1:30:52
pretty good number and uh this is a good
1:30:52
pretty good number and uh this is a good setting of The gain in this context
1:30:55
setting of The gain in this context
1:30:55
setting of The gain in this context similarly we can do the exact same thing
1:30:57
similarly we can do the exact same thing
1:30:57
similarly we can do the exact same thing with the gradients so here is a very
1:30:59
with the gradients so here is a very
1:30:59
with the gradients so here is a very same Loop if it's a 10h but instead of
1:31:01
same Loop if it's a 10h but instead of
1:31:01
same Loop if it's a 10h but instead of taking a layer do out I'm taking the
1:31:03
taking a layer do out I'm taking the
1:31:03
taking a layer do out I'm taking the grad and then I'm also showing the mean
1:31:05
grad and then I'm also showing the mean
1:31:05
grad and then I'm also showing the mean and the standard deviation and I'm
1:31:07
and the standard deviation and I'm
1:31:07
and the standard deviation and I'm plotting the histogram of these values
1:31:09
plotting the histogram of these values
1:31:09
plotting the histogram of these values and so you'll see that the gradient
1:31:10
and so you'll see that the gradient
1:31:11
and so you'll see that the gradient distribution is uh fairly reasonable and
1:31:13
distribution is uh fairly reasonable and
1:31:13
distribution is uh fairly reasonable and in particular what we're looking for is
1:31:14
in particular what we're looking for is
1:31:15
in particular what we're looking for is that all the different layers in this
1:31:16
that all the different layers in this
1:31:16
that all the different layers in this sandwich has roughly the same gradient
1:31:19
sandwich has roughly the same gradient
1:31:19
sandwich has roughly the same gradient things are not shrinking or exploding so
1:31:22
things are not shrinking or exploding so
1:31:22
things are not shrinking or exploding so uh we can for example come here and we
1:31:24
uh we can for example come here and we
1:31:24
uh we can for example come here and we can take a look at what happens if this
1:31:25
can take a look at what happens if this
1:31:25
can take a look at what happens if this gain was way too small so this was
1:31:29
gain was way too small so this was
1:31:29
gain was way too small so this was 0.5 then you see the first of all the
1:31:32
0.5 then you see the first of all the
1:31:32
0.5 then you see the first of all the activations are shrinking to zero but
1:31:34
activations are shrinking to zero but
1:31:34
activations are shrinking to zero but also the gradients are doing something
1:31:35
also the gradients are doing something
1:31:35
also the gradients are doing something weird the gradients started out here and
1:31:38
weird the gradients started out here and
1:31:38
weird the gradients started out here and then now they're like expanding
1:31:40
then now they're like expanding
1:31:40
then now they're like expanding out and similarly if we for example have
1:31:43
out and similarly if we for example have
1:31:43
out and similarly if we for example have a too high of a gain so like
1:31:45
a too high of a gain so like
1:31:45
a too high of a gain so like three then we see that also the
1:31:47
three then we see that also the
1:31:47
three then we see that also the gradients have there's some asymmetry
1:31:49
gradients have there's some asymmetry
1:31:49
gradients have there's some asymmetry going on where as you go into deeper and
1:31:51
going on where as you go into deeper and
1:31:51
going on where as you go into deeper and deeper layers the activation CS are
1:31:53
deeper layers the activation CS are
1:31:53
deeper layers the activation CS are changing and so that's not what we want
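The gradient statistics can be checked the same way. The sketch below backpropagates by hand through a linear-tanh stack in plain Python. It is an illustration under assumptions, not the lecture's code: the name `grad_std_per_layer` is hypothetical, and a random unit-Gaussian tensor stands in for the gradient that a real loss would send into the last tanh output.

```python
import math
import random


def std(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def grad_std_per_layer(gain, n_layers=5, width=30, batch=32, seed=0):
    """Std of the gradient reaching each tanh output, first layer first."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    scale = gain / math.sqrt(width)
    weights, outs = [], []
    for _ in range(n_layers):  # forward pass, keeping every tanh output
        w = [[rng.gauss(0.0, scale) for _ in range(width)] for _ in range(width)]
        x = [[math.tanh(sum(wi * xi for wi, xi in zip(w_row, x_row))) for w_row in w]
             for x_row in x]
        weights.append(w)
        outs.append(x)
    # Assumption: a random stand-in for the loss gradient at the last tanh output.
    g = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    stds = [0.0] * n_layers
    for i in reversed(range(n_layers)):
        stds[i] = std([v for row in g for v in row])
        if i == 0:
            break
        # Back through tanh: multiply by the local derivative 1 - out^2.
        d = [[gv * (1.0 - ov * ov) for gv, ov in zip(g_row, o_row)]
             for g_row, o_row in zip(g, outs[i])]
        # Back through the linear layer: multiply by the transposed weights.
        w = weights[i]
        g = [[sum(d_row[j] * w[j][k] for j in range(width)) for k in range(width)]
             for d_row in d]
    return stds


if __name__ == "__main__":
    for gain in (0.5, 5 / 3):
        print(gain, [f"{s:.3f}" for s in grad_std_per_layer(gain)])
```

With a gain of 0.5, the gradient in this sketch should get noticeably smaller on its way back toward the first layer, so the layers no longer share a common gradient scale. With 5/3 the per-layer values should stay much closer together.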
1:31:55
changing and so that's not what we want
1:31:55
changing and so that's not what we want and in this case we saw that without the
1:31:57
and in this case we saw that without the
1:31:57
and in this case we saw that without the use of batro as we are going through
1:31:59
use of batro as we are going through
1:31:59
use of batro as we are going through right now we had to very carefully set
1:32:02
right now we had to very carefully set
1:32:02
right now we had to very carefully set those gains to get nice activations in
1:32:04
those gains to get nice activations in
1:32:04
those gains to get nice activations in both the forward pass and the backward
1:32:06
both the forward pass and the backward
1:32:07
both the forward pass and the backward pass now before we move on to bat
1:32:09
pass now before we move on to bat
1:32:09
pass now before we move on to bat normalization I would also like to take
1:32:11
normalization I would also like to take
1:32:11
normalization I would also like to take a look at what happens when we have no
1:32:12
a look at what happens when we have no
1:32:12
a look at what happens when we have no 10h units here so erasing all the 10
1:32:15
10h units here so erasing all the 10
1:32:15
10h units here so erasing all the 10 nonlinearities but keeping the gain at 5
1:32:18
nonlinearities but keeping the gain at 5
1:32:18
nonlinearities but keeping the gain at 5 over3 we now have just a giant linear
1:32:21
over3 we now have just a giant linear
1:32:21
over3 we now have just a giant linear sandwich so let's see what happens to
1:32:22
sandwich so let's see what happens to
1:32:22
sandwich so let's see what happens to the activations
1:32:24
the activations
1:32:24
the activations as we saw before the correct gain here
1:32:26
as we saw before the correct gain here
1:32:26
as we saw before the correct gain here is one that is the standard deviation
1:32:28
is one that is the standard deviation
1:32:28
is one that is the standard deviation preserving gain so 1.66 7 is too high
1:32:33
preserving gain so 1.66 7 is too high
1:32:33
preserving gain so 1.66 7 is too high and so what's going to happen now is the
1:32:36
and so what's going to happen now is the
1:32:36
and so what's going to happen now is the following uh I have to change this to be
1:32:38
following uh I have to change this to be
1:32:38
following uh I have to change this to be linear so we are because there's no more
1:32:40
linear so we are because there's no more
1:32:40
linear so we are because there's no more 10h layers and let me change this to
1:32:43
10h layers and let me change this to
1:32:43
10h layers and let me change this to linear as
1:32:44
linear as
1:32:45
linear as well so what we're seeing is um the
1:32:48
well so what we're seeing is um the
1:32:48
well so what we're seeing is um the activations started out on the blue and
1:32:51
activations started out on the blue and
1:32:51
activations started out on the blue and have by layer four become very diffuse
1:32:55
have by layer four become very diffuse
1:32:55
have by layer four become very diffuse so what's happening to the activations
1:32:56
so what's happening to the activations
1:32:56
so what's happening to the activations is this and with the gradients on the
1:32:59
is this and with the gradients on the
1:32:59
is this and with the gradients on the top layer the activation the gradient
1:33:02
top layer the activation the gradient
1:33:02
top layer the activation the gradient statistics are the purple and then they
1:33:05
statistics are the purple and then they
1:33:05
statistics are the purple and then they diminish as you go down deeper in the
1:33:06
diminish as you go down deeper in the
1:33:06
diminish as you go down deeper in the layers and so basically you have an
1:33:08
layers and so basically you have an
1:33:08
layers and so basically you have an asymmetry like in the neuron net and you
1:33:11
asymmetry like in the neuron net and you
1:33:11
asymmetry like in the neuron net and you might imagine that if you have very deep
1:33:12
might imagine that if you have very deep
1:33:12
might imagine that if you have very deep neural networks say like 50 layers or
1:33:14
neural networks say like 50 layers or
1:33:14
neural networks say like 50 layers or something like that this just uh this is
1:33:16
something like that this just uh this is
1:33:16
something like that this just uh this is not a good place to be uh so that's why
1:33:19
not a good place to be uh so that's why
1:33:19
not a good place to be uh so that's why before bash normalization this was
1:33:21
before bash normalization this was
1:33:21
before bash normalization this was incredibly tricky to to set in
1:33:24
incredibly tricky to to set in
1:33:24
incredibly tricky to to set in particular if this is too large of a
1:33:26
particular if this is too large of a
1:33:26
particular if this is too large of a gain this happens and if it's too little
1:33:27
gain this happens and if it's too little
1:33:27
gain this happens and if it's too little of a
1:33:28
of a
1:33:28
of a gain then this happens so the opposite
1:33:31
gain then this happens so the opposite
1:33:32
gain then this happens so the opposite of that basically happens here we have a
1:33:34
of that basically happens here we have a
1:33:34
of that basically happens here we have a um shrinking and a uh diffusion
1:33:39
um shrinking and a uh diffusion
1:33:39
um shrinking and a uh diffusion depending on which direction you look at
1:33:40
depending on which direction you look at
1:33:40
depending on which direction you look at it from and so certainly this is not
1:33:43
it from and so certainly this is not
1:33:43
it from and so certainly this is not what you want and in this case the
1:33:44
what you want and in this case the
1:33:44
what you want and in this case the correct setting of The gain is exactly
1:33:47
correct setting of The gain is exactly
1:33:47
correct setting of The gain is exactly one just like we're doing at
1:33:49
one just like we're doing at
1:33:49
one just like we're doing at initialization and then we see that the
1:33:52
initialization and then we see that the
1:33:52
initialization and then we see that the uh statistics for the forward and a
1:33:54
uh statistics for the forward and a
1:33:54
uh statistics for the forward and a backward pass are well behaved and so
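For a purely linear stack, the reason a gain of one is the right setting is that each layer multiplies the activation variance by roughly gain squared, so any other gain compounds with depth. A minimal plain-Python sketch of this, assuming unit-Gaussian inputs and a hypothetical helper `linear_stack_stds`:

```python
import math
import random


def linear_stack_stds(gain, n_layers=5, width=50, batch=32, seed=0):
    """Std of the activations after each linear layer (no nonlinearity)."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    scale = gain / math.sqrt(width)  # gain / sqrt(fan_in)
    stds = []
    for _ in range(n_layers):
        w = [[rng.gauss(0.0, scale) for _ in range(width)] for _ in range(width)]
        x = [[sum(wi * xi for wi, xi in zip(w_row, x_row)) for w_row in w]
             for x_row in x]
        flat = [v for row in x for v in row]
        mean = sum(flat) / len(flat)
        stds.append(math.sqrt(sum((v - mean) ** 2 for v in flat) / len(flat)))
    return stds


if __name__ == "__main__":
    print("gain 1.00:", [f"{s:.2f}" for s in linear_stack_stds(1.0)])
    print("gain 1.67:", [f"{s:.2f}" for s in linear_stack_stds(5 / 3)])
```

With a gain of one the standard deviation should stay near one at every layer. With 5/3 it should grow by about a factor of 5/3 per layer, which after five layers is roughly (5/3)^5, or about 13.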
1:33:57
backward pass are well behaved and so
1:33:57
backward pass are well behaved and so the reason I want to show you this is
1:33:59
the reason I want to show you this is
1:33:59
the reason I want to show you this is that basically like getting neural nness
1:34:02
that basically like getting neural nness
1:34:02
that basically like getting neural nness to train before these normalization
1:34:03
to train before these normalization
1:34:03
to train before these normalization layers and before the use of advanced
1:34:05
layers and before the use of advanced
1:34:05
layers and before the use of advanced optimizers like adom which we still have
1:34:07
optimizers like adom which we still have
1:34:07
optimizers like adom which we still have to cover and residual connections and so
1:34:09
to cover and residual connections and so
1:34:09
to cover and residual connections and so on uh training neurs basically looked
1:34:12
on uh training neurs basically looked
1:34:12
on uh training neurs basically looked like this it's like a total Balancing
1:34:14
like this it's like a total Balancing
1:34:14
like this it's like a total Balancing Act you have to make sure that
1:34:15
Act you have to make sure that
1:34:15
Act you have to make sure that everything is precisely orchestrated and
1:34:18
everything is precisely orchestrated and
1:34:18
everything is precisely orchestrated and you have to care about the activations
1:34:19
you have to care about the activations
1:34:19
you have to care about the activations and the gradients and their statistics
1:34:21
and the gradients and their statistics
1:34:21
and the gradients and their statistics and then maybe you can train something
1:34:23
and then maybe you can train something
1:34:23
and then maybe you can train something uh but it was it was basically
1:34:24
uh but it was it was basically
1:34:24
uh but it was it was basically impossible to train very deep networks
1:34:25
impossible to train very deep networks
1:34:25
impossible to train very deep networks and this is fundamentally the the reason
1:34:27
and this is fundamentally the the reason
1:34:27
and this is fundamentally the the reason for that you'd have to be very very
1:34:29
for that you'd have to be very very
1:34:29
for that you'd have to be very very careful with your
1:34:30
careful with your
1:34:30
careful with your initialization um the other point here
1:34:33
initialization um the other point here
1:34:33
initialization um the other point here is you might be asking yourself by the
1:34:35
is you might be asking yourself by the
1:34:35
is you might be asking yourself by the way I'm not sure if I covered this why
1:34:37
way I'm not sure if I covered this why
1:34:37
way I'm not sure if I covered this why do we need these 10h layers at all uh
1:34:40
do we need these 10h layers at all uh
1:34:40
do we need these 10h layers at all uh why do we include them and then have to
1:34:42
why do we include them and then have to
1:34:42
why do we include them and then have to worry about the gain and uh the reason
1:34:44
worry about the gain and uh the reason
1:34:44
worry about the gain and uh the reason for that of course is that if you just
1:34:45
for that of course is that if you just
1:34:45
for that of course is that if you just have a stack of linear layers then
1:34:48
have a stack of linear layers then
1:34:48
have a stack of linear layers then certainly we're getting very easily nice
1:34:50
certainly we're getting very easily nice
1:34:50
certainly we're getting very easily nice activations and so on uh but this is
1:34:53
activations and so on uh but this is
1:34:53
activations and so on uh but this is just massive linear sandwich and it
1:34:54
just massive linear sandwich and it
1:34:54
just massive linear sandwich and it turns out that it collapses to a single
1:34:56
turns out that it collapses to a single
1:34:56
turns out that it collapses to a single linear layer in terms of its uh
1:34:58
linear layer in terms of its uh
1:34:58
linear layer in terms of its uh representation power so if you were to
1:35:00
representation power so if you were to
1:35:00
representation power so if you were to plot the output as a function of the
1:35:02
plot the output as a function of the
1:35:02
plot the output as a function of the input you're just getting a linear
1:35:03
input you're just getting a linear
1:35:03
input you're just getting a linear function no matter how many linear
1:35:05
function no matter how many linear
1:35:05
function no matter how many linear layers you stack up you still just end
1:35:07
layers you stack up you still just end
1:35:07
layers you stack up you still just end up with a linear transformation all the
1:35:09
up with a linear transformation all the
1:35:09
up with a linear transformation all the WX plus BS just collapse into a large WX
1:35:13
WX plus BS just collapse into a large WX
1:35:13
WX plus BS just collapse into a large WX plus b with slightly different W's and
1:35:15
plus b with slightly different W's and
1:35:15
plus b with slightly different W's and slightly different B um but
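The collapse can be verified numerically for two layers: W2 (W1 x + b1) + b2 equals W x + b with W = W2 W1 and b = W2 b1 + b2. A small plain-Python sketch, with arbitrary layer sizes and hypothetical helper names (`matvec`, `matmul`, `add`):

```python
import random


def matvec(w, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def matmul(a, b):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*b))
    return [[sum(ai * bi for ai, bi in zip(row, col)) for col in cols] for row in a]


def add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]


rng = random.Random(0)
n_in, n_hidden, n_out = 4, 6, 3
w1 = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [rng.gauss(0, 1) for _ in range(n_hidden)]
w2 = [[rng.gauss(0, 1) for _ in range(n_hidden)] for _ in range(n_out)]
b2 = [rng.gauss(0, 1) for _ in range(n_out)]
x = [rng.gauss(0, 1) for _ in range(n_in)]

# Two stacked linear layers: W2 (W1 x + b1) + b2.
stacked = add(matvec(w2, add(matvec(w1, x), b1)), b2)

# The single equivalent layer: W = W2 W1 and b = W2 b1 + b2.
w = matmul(w2, w1)
b = add(matvec(w2, b1), b2)
collapsed = add(matvec(w, x), b)

max_diff = max(abs(s - c) for s, c in zip(stacked, collapsed))
print(max_diff)  # expected to be at floating-point rounding level
```

The same argument applies recursively, so any number of stacked linear layers reduces to one.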
1:35:17
slightly different B um but
1:35:17
slightly different B um but interestingly even though the forward
1:35:19
interestingly even though the forward
1:35:19
interestingly even though the forward pass collapses to just a linear layer
1:35:21
pass collapses to just a linear layer
1:35:21
pass collapses to just a linear layer because of back propagation and uh the
1:35:23
because of back propagation and uh the
1:35:23
because of back propagation and uh the dynamics of the backward pass the
1:35:26
dynamics of the backward pass the
1:35:26
dynamics of the backward pass the optimization natur is not identical you
1:35:28
optimization natur is not identical you
1:35:28
optimization natur is not identical you actually end up with uh all kinds of
1:35:30
actually end up with uh all kinds of
1:35:30
actually end up with uh all kinds of interesting um Dynamics in the backward
1:35:33
interesting um Dynamics in the backward
1:35:33
interesting um Dynamics in the backward pass uh because of the uh the way the
1:35:35
pass uh because of the uh the way the
1:35:36
pass uh because of the uh the way the chain Ru is calculating it and so
1:35:38
chain Ru is calculating it and so
1:35:38
chain Ru is calculating it and so optimizing a linear layer by itself and
1:35:41
optimizing a linear layer by itself and
1:35:41
optimizing a linear layer by itself and optimizing a sandwich of 10 linear
1:35:42
optimizing a sandwich of 10 linear
1:35:43
optimizing a sandwich of 10 linear layers in both cases those are just a
1:35:44
layers in both cases those are just a
1:35:44
layers in both cases those are just a linear transformation in the forward
1:35:46
linear transformation in the forward
1:35:46
linear transformation in the forward pass but the training Dynamics would be
1:35:47
pass but the training Dynamics would be
1:35:47
pass but the training Dynamics would be different and there's entire papers that
1:35:49
different and there's entire papers that
1:35:49
different and there's entire papers that analyze in fact like infinitely layered
1:35:52
analyze in fact like infinitely layered
1:35:52
analyze in fact like infinitely layered uh linear layers and and so on and so
1:35:55
uh linear layers and and so on and so
1:35:55
uh linear layers and and so on and so there's a lot of things to that you can
1:35:56
there's a lot of things to that you can
1:35:56
there's a lot of things to that you can play with
1:35:57
play with
1:35:57
play with there uh but basically the tal
1:35:59
there uh but basically the tal
1:35:59
there uh but basically the tal linearities allow us to
1:36:02
linearities allow us to
1:36:02
linearities allow us to um turn this sandwich from just a
1:36:07
um turn this sandwich from just a
1:36:07
um turn this sandwich from just a linear uh function into uh a neural
1:36:10
linear uh function into uh a neural
1:36:10
linear uh function into uh a neural network that can in principle um
1:36:13
network that can in principle um
1:36:13
network that can in principle um approximate any arbitrary function okay
1:36:15
approximate any arbitrary function okay
1:36:15
approximate any arbitrary function okay so now I've reset the code to use the
1:36:17
so now I've reset the code to use the
1:36:17
so now I've reset the code to use the linear tanh sandwich like before and I
1:36:20
linear tanh sandwich like before and I
1:36:20
linear tanh sandwich like before and I reset everything so the gain is 5 over
1:36:23
reset everything so the gain is 5 over
1:36:23
reset everything so the gain is 5 over three uh we can run a single step of
1:36:25
three uh we can run a single step of
1:36:25
three uh we can run a single step of optimization and we can look at the
1:36:27
optimization and we can look at the
1:36:27
optimization and we can look at the activation statistics of the forward
1:36:28
activation statistics of the forward
1:36:28
activation statistics of the forward pass and the backward pass but I've
1:36:30
pass and the backward pass but I've
1:36:30
pass and the backward pass but I've added one more plot here that I think is
1:36:32
added one more plot here that I think is
1:36:32
added one more plot here that I think is really important to look at when you're
1:36:33
really important to look at when you're
1:36:33
really important to look at when you're training your neural nuts and to
1:36:35
training your neural nuts and to
1:36:35
training your neural nuts and to consider and ultimately what we're doing
1:36:37
consider and ultimately what we're doing
1:36:37
consider and ultimately what we're doing is we're updating the parameters of the
1:36:39
is we're updating the parameters of the
1:36:39
is we're updating the parameters of the neural nut so we care about the
1:36:40
neural nut so we care about the
1:36:40
neural nut so we care about the parameters and their values and their
1:36:43
parameters and their values and their
1:36:43
parameters and their values and their gradients so here what I'm doing is I'm
1:36:45
gradients so here what I'm doing is I'm
1:36:45
gradients so here what I'm doing is I'm actually iterating over all the
1:36:46
actually iterating over all the
1:36:46
actually iterating over all the parameters available and then I'm only
1:36:49
parameters available and then I'm only
1:36:49
parameters available and then I'm only um restricting it to the two-dimensional
1:36:51
um restricting it to the two-dimensional
1:36:51
um restricting it to the two-dimensional parameters which are basically the
1:36:52
parameters which are basically the
1:36:52
parameters which are basically the weights of the linear layers and I'm
1:36:54
weights of the linear layers and I'm
1:36:54
weights of the linear layers and I'm skipping the biases and I'm skipping the
1:36:57
skipping the biases and I'm skipping the
1:36:57
skipping the biases and I'm skipping the um gamas and the betas in the bom just
1:37:00
um gamas and the betas in the bom just
1:37:00
um gamas and the betas in the bom just for Simplicity but you can also take a
1:37:03
for Simplicity but you can also take a
1:37:03
for Simplicity but you can also take a look at those as well but what's
1:37:04
look at those as well but what's
1:37:04
look at those as well but what's happening with the weights is um
1:37:06
happening with the weights is um
1:37:06
happening with the weights is um instructive by
1:37:07
instructive by
1:37:07
instructive by itself so here we have all the different
1:37:10
itself so here we have all the different
1:37:10
itself so here we have all the different weights their shapes uh so this is the
1:37:13
weights their shapes uh so this is the
1:37:13
weights their shapes uh so this is the embedding layer the first linear layer
1:37:15
embedding layer the first linear layer
1:37:15
embedding layer the first linear layer all the way to the very last linear
1:37:16
all the way to the very last linear
1:37:16
all the way to the very last linear layer and then we have the mean the
1:37:18
layer and then we have the mean the
1:37:18
layer and then we have the mean the standard deviation of all these
1:37:20
standard deviation of all these
1:37:20
standard deviation of all these parameters the histogram and you can see
1:37:23
parameters the histogram and you can see
1:37:23
parameters the histogram and you can see that actually doesn't look that amazing
1:37:24
that actually doesn't look that amazing
1:37:24
that actually doesn't look that amazing so there's some trouble in Paradise even
1:37:26
so there's some trouble in Paradise even
1:37:26
so there's some trouble in Paradise even though these gradients looked okay
1:37:28
though these gradients looked okay
1:37:28
though these gradients looked okay there's something weird going on here
1:37:30
there's something weird going on here
1:37:30
there's something weird going on here I'll get to that in a second and the
1:37:32
I'll get to that in a second and the
1:37:32
I'll get to that in a second and the last thing here is the gradient to data
1:37:34
last thing here is the gradient to data
1:37:34
last thing here is the gradient to data ratio so sometimes I like to visualize
1:37:37
ratio so sometimes I like to visualize
1:37:37
ratio so sometimes I like to visualize this as well because what this gives you
1:37:39
this as well because what this gives you
1:37:39
this as well because what this gives you a sense of is what is the scale of the
1:37:41
a sense of is what is the scale of the
1:37:41
a sense of is what is the scale of the gradient compared to the scale of the
1:37:44
gradient compared to the scale of the
1:37:44
gradient compared to the scale of the actual values and this is important
1:37:46
actual values and this is important
1:37:46
actual values and this is important because we're going to end up taking a
1:37:48
because we're going to end up taking a
1:37:48
because we're going to end up taking a step update um that is the learning rate
1:37:51
step update um that is the learning rate
1:37:51
step update um that is the learning rate times the gradient onto the data
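One way to put a number on "the scale of the gradient compared to the scale of the actual values" is the ratio of standard deviations, computed only for two-dimensional parameters. The sketch below uses plain Python lists with made-up values. The function name `grad_to_data_ratios`, the parameter names, and the choice of standard deviation as the measure of scale are assumptions for illustration.

```python
import math
import random


def std(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def flatten(matrix):
    """Flatten a 2-D nested list into a flat list."""
    return [v for row in matrix for v in row]


def grad_to_data_ratios(params, grads):
    """Map each 2-D parameter name to std(grad) / std(data)."""
    ratios = {}
    for name, data in params.items():
        if not isinstance(data[0], list):
            continue  # skip 1-D parameters such as biases
        ratios[name] = std(flatten(grads[name])) / std(flatten(data))
    return ratios


rng = random.Random(0)
# Hypothetical toy values: weights near unit scale, gradients about 1000x smaller.
params = {
    "w1": [[rng.gauss(0, 1.0) for _ in range(20)] for _ in range(20)],
    "b1": [0.0] * 20,
}
grads = {
    "w1": [[rng.gauss(0, 1e-3) for _ in range(20)] for _ in range(20)],
    "b1": [rng.gauss(0, 1e-3) for _ in range(20)],
}
ratios = grad_to_data_ratios(params, grads)
print(ratios)
```

Only `w1` appears in the result because the bias is one-dimensional and is skipped, mirroring the restriction to weights described above.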
1:37:54
times the gradient onto the data
1:37:54
times the gradient onto the data and so if the gradient has too large of
1:37:55
and so if the gradient has too large of
1:37:55
and so if the gradient has too large of magnitude if the numbers in there are
1:37:57
magnitude if the numbers in there are
1:37:57
magnitude if the numbers in there are too large compared to the numbers in
1:37:59
too large compared to the numbers in
1:37:59
too large compared to the numbers in data then you'd be in trouble but in
1:38:01
data then you'd be in trouble but in
1:38:02
data then you'd be in trouble but in this case the gradient to data is our
1:38:04
this case the gradient to data is our
1:38:04
this case the gradient to data is our low numbers so the values inside grad
1:38:07
low numbers so the values inside grad
1:38:07
low numbers so the values inside grad are 1,000 times smaller than the values
1:38:09
are 1,000 times smaller than the values
1:38:09
are 1,000 times smaller than the values inside data in these weights most of
1:38:12
inside data in these weights most of
1:38:13
inside data in these weights most of them now notably that is not true about
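A minimal sketch of the gradient-to-data ratio described here. The lecture works with PyTorch tensors; the plain Python lists, the helper name `grad_to_data_ratio`, and the sample numbers are assumptions made only for illustration.

```python
import statistics

def grad_to_data_ratio(data, grad):
    """Spread of the gradient divided by the spread of the values it belongs to."""
    return statistics.pstdev(grad) / statistics.pstdev(data)

# Hypothetical weights, with gradients that are 1,000 times smaller.
weights = [0.5, -0.3, 0.8, -0.6, 0.1, -0.4]
grads = [w * 1e-3 for w in weights]

ratio = grad_to_data_ratio(weights, grads)
print(f"grad:data ratio = {ratio:.1e}")
```

A ratio near 1e-3 corresponds to the "1,000 times smaller" reading quoted for most of the weights.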
1:38:15
them now notably that is not true about
1:38:15
them now notably that is not true about the last layer and so the last layer
1:38:18
the last layer and so the last layer
1:38:18
the last layer and so the last layer actually here the output layer is a bit
1:38:19
actually here the output layer is a bit
1:38:19
actually here the output layer is a bit of a troublemaker in the way that this
1:38:21
of a troublemaker in the way that this
1:38:21
of a troublemaker in the way that this is currently arranged because you can
1:38:22
is currently arranged because you can
1:38:23
is currently arranged because you can see that the um last layer here in pink
1:38:28
see that the um last layer here in pink
1:38:28
see that the um last layer here in pink takes on values that are much larger
1:38:30
takes on values that are much larger
1:38:30
takes on values that are much larger than some of the values inside um inside
1:38:34
than some of the values inside um inside
1:38:34
than some of the values inside um inside the neural nut so the standard
1:38:36
the neural nut so the standard
1:38:36
the neural nut so the standard deviations are roughly 1 and3 throughout
1:38:39
deviations are roughly 1 and3 throughout
1:38:39
deviations are roughly 1 and3 throughout except for the last last uh layer which
1:38:41
except for the last last uh layer which
1:38:41
except for the last last uh layer which actually has roughly one -2 standard
1:38:44
actually has roughly one -2 standard
1:38:44
actually has roughly one -2 standard deviation of gradients and so the
1:38:46
deviation of gradients and so the
1:38:46
deviation of gradients and so the gradients on the last layer are
1:38:47
gradients on the last layer are
1:38:47
gradients on the last layer are currently about 100 times greater sorry
1:38:51
currently about 100 times greater sorry
1:38:51
currently about 100 times greater sorry 10 times greater than all the other
1:38:53
10 times greater than all the other
1:38:53
10 times greater than all the other weights inside the neural net and so
1:38:56
weights inside the neural net and so
1:38:56
weights inside the neural net and so that's problematic because in the simple
1:38:58
that's problematic because in the simple
1:38:58
that's problematic because in the simple stochastic rting theend setup you would
1:39:00
stochastic rting theend setup you would
1:39:00
stochastic rting theend setup you would be training this last layer about 10
1:39:02
be training this last layer about 10
1:39:02
be training this last layer about 10 times faster than you would be training
1:39:04
times faster than you would be training
1:39:04
times faster than you would be training the other layers at
1:39:06
the other layers at
1:39:06
the other layers at initialization now this actually like
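To make the "10 times faster" point concrete, here is a small sketch of one plain SGD step applied to two weights whose gradients differ by a factor of 10. The learning rate and starting values are hypothetical; only the 1e-3 and 1e-2 scales come from the transcript.

```python
lr = 0.1  # hypothetical learning rate

def sgd_step(value, grad, lr):
    """One plain SGD update: move against the gradient, scaled by lr."""
    return value - lr * grad

hidden_w, output_w = 0.5, 0.5          # same starting value, for comparison
hidden_grad, output_grad = 1e-3, 1e-2  # gradient scales quoted in the transcript

hidden_change = abs(sgd_step(hidden_w, hidden_grad, lr) - hidden_w)
output_change = abs(sgd_step(output_w, output_grad, lr) - output_w)
print(output_change / hidden_change)  # close to 10
```

Because plain SGD scales the step linearly with the gradient, the layer with the larger gradients moves proportionally further on every step.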
1:39:08
Now this actually kind of fixes itself a little bit if you train for a bit longer. So for example, if i is greater than 1,000, only then do a break. Let me reinitialize, and then let me do 1,000 steps. After 1,000 steps we can look at the forward pass.
1:39:24
Okay, so you see how the neurons are saturating a bit. We can also look at the backward pass, but otherwise they look good: they're about equal, and there's no shrinking to zero or exploding to infinity. And you can see that here in the weights things are also stabilizing a little bit, so the tails of the last pink layer are actually coming in during the optimization.
1:39:45
But certainly this is a little bit troubling, especially if you are using a very simple update rule like stochastic gradient descent instead of a modern optimizer like Adam.
1:39:55
Now I'd like to show you one more plot that I usually look at when I train neural networks. Basically, the gradient-to-data ratio is not actually that informative, because what matters at the end is not the gradient-to-data ratio but the update-to-data ratio, because that is the amount by which we will actually change the data in these tensors.
1:40:14
So coming up here, what I'd like to do is introduce a new update-to-data ratio. It's going to be a list, and we're going to build it out every single iteration. Here I'd like to keep track of basically the ratio every single iteration.
1:40:33
So, without tracking any gradients, I'm comparing the update, which is the learning rate times the gradient; that is the update that we're going to apply to every parameter. See, I'm iterating over all the parameters, and then I'm taking the standard deviation of the update we're going to apply and dividing it by the standard deviation of the actual content, the data, of that parameter. So this is the ratio of basically how great the updates are relative to the values in these tensors.
1:41:00
Then we're going to take a log of it, and actually I'd like to take a log10 just so it's a nicer visualization, so we're going to be basically looking at the exponents of this division here, and then .item() to pop out the float. We're going to be keeping track of this for all the parameters and adding it to ud.
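The bookkeeping described here can be sketched without any dependencies. The lecture does this on PyTorch tensors inside the training loop; below, the parameters are hypothetical dictionaries of plain lists, `update_to_data_log10` is an illustrative helper name, and the random values stand in for real weights and gradients.

```python
import math
import random
import statistics

random.seed(0)
lr = 0.1  # hypothetical learning rate

# Two hypothetical parameters, each holding its values and current gradients.
parameters = [
    {"data": [random.gauss(0, 1.0) for _ in range(200)],
     "grad": [random.gauss(0, 1e-2) for _ in range(200)]},
    {"data": [random.gauss(0, 0.1) for _ in range(200)],
     "grad": [random.gauss(0, 1e-2) for _ in range(200)]},
]

def update_to_data_log10(p, lr):
    """log10 of std(update) / std(data), where update = lr * grad."""
    update = [lr * g for g in p["grad"]]
    return math.log10(statistics.stdev(update) / statistics.stdev(p["data"]))

ud = []  # one entry per training iteration, one number per parameter
for step in range(3):
    # ... forward pass, backward pass, and parameter update would go here ...
    ud.append([update_to_data_log10(p, lr) for p in parameters])

print(ud[-1])
```

With these made-up scales the first parameter lands near -3 and the second, whose values are ten times smaller, lands near -2.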
1:41:24
So now let me reinitialize and run a thousand iterations. We can look at the activations, the gradients, and the parameter gradients as we did before, but now I have one more plot here to introduce.
1:41:36
What's happening here is we're iterating over all the parameters, and I'm constraining it again, like I did here, to just the weights, so the number of dimensions in these tensors is two. Then I'm basically plotting all of these update ratios over time.
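A sketch of the selection step behind that plot: keep only the two-dimensional parameters (the weight matrices) and pull out one time series per weight. The logged values and the shapes below are hypothetical, and the actual drawing of the lines is left out so the example stays dependency-free.

```python
# Hypothetical log: ud[i][j] is the log10 update ratio of parameter j at step i.
ud = [
    [-2.0, -1.5, -1.0, -1.2],
    [-2.5, -2.0, -1.8, -1.9],
    [-2.9, -2.6, -2.4, -2.5],
]
# Hypothetical parameter shapes: two weight matrices and two bias vectors.
shapes = [(30, 100), (100,), (100, 27), (27,)]

# Keep only 2-D parameters and collect one series over time for each of them.
series = {
    j: [row[j] for row in ud]
    for j, shape in enumerate(shapes)
    if len(shape) == 2
}
for j, values in series.items():
    print(f"param {j} {shapes[j]}: {values}")
```

Each entry of `series` is what would be drawn as one line, with the iteration number on the horizontal axis.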
1:41:57
So when I plot this, I plot those ratios, and you can see that they evolve over time. During initialization they take on certain values, and then these updates sort of start stabilizing, usually, during training.
1:42:04
The other thing I'm plotting here is an approximate value that is a rough guide for what it should be, and it should be roughly 1e-3. That means that there are some values in the tensor, and the updates to them at every iteration are no more than roughly one thousandth of the actual magnitude in those tensors.
1:42:30
If this was much larger, for example if the log of this was, say, -1, this is actually updating those values quite a lot; they're undergoing a lot of change.
1:42:42
But the reason that the final layer here is an outlier is because this layer was artificially shrunk down to keep the softmax unconfident.
1:42:57
So here you see how we multiplied the weight by 0.1 in the initialization to make the last layer's prediction less confident. That artificially made the values inside that tensor way too low, and that's why we're temporarily getting a very high ratio. But you see that it stabilizes over time once that weight starts to learn.
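A small worked example of why shrinking a weight tensor by 0.1 pushes its update ratio up. It holds the gradients fixed while the weights are scaled, which is a simplification (in a real network the gradients change too); the sample values and the helper name `log_ratio` are assumptions for illustration.

```python
import math
import random
import statistics

random.seed(1)
lr = 0.1  # hypothetical learning rate
weights = [random.gauss(0, 1.0) for _ in range(500)]
grads = [random.gauss(0, 1e-2) for _ in range(500)]  # hypothetical gradients

def log_ratio(data, grad, lr):
    """log10 of std(lr * grad) / std(data)."""
    update = [lr * g for g in grad]
    return math.log10(statistics.stdev(update) / statistics.stdev(data))

base = log_ratio(weights, grads, lr)
shrunk = log_ratio([0.1 * w for w in weights], grads, lr)
print(f"{base:.2f} -> {shrunk:.2f}")  # the shrunk tensor sits about 1.0 higher
```

Dividing the denominator by 10 adds 1 on the log10 scale, which is consistent with the last layer starting out well above the others on the plot.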
1:43:19
But basically I like to look at the evolution of this update ratio for all my parameters, usually, and I like to make sure that it's not too much above 1e-3, roughly, so around -3 on this log plot. If it's below -3, usually that means that the parameters are not training fast enough.
1:43:36
So if our learning rate was very low, let's do that experiment. Let's initialize, and then let's actually do a learning rate of, say, 1e-3 here, so 0.001.
1:43:53
If your learning rate is way too low, this plot will typically reveal it. You see how all of these updates are way too small: the size of the update is basically 10,000 times smaller in magnitude than the size of the numbers in that tensor in the first place. So this is a symptom of training way too slowly.
1:44:13
So this is another way to sometimes set the learning rate and to get a sense of what that learning rate should be, and ultimately this is something that you would keep track of.
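One way to read "set the learning rate" from this plot: since the update is the learning rate times the gradient, the ratio scales linearly with the learning rate if the weights and gradients are held fixed. The helper below is a hypothetical heuristic built on that assumption, not something shown in the lecture, and the -4.5 reading is a made-up example.

```python
def rescale_lr(lr, observed_log_ratio, target_log_ratio=-3.0):
    """Scale lr so the log10 update ratio would land on the target,
    assuming gradients and weights stay the same (a simplification)."""
    return lr * 10 ** (target_log_ratio - observed_log_ratio)

# Hypothetical reading: with lr = 0.001 the plot sits near -4.5.
new_lr = rescale_lr(0.001, observed_log_ratio=-4.5)
print(new_lr)
```

In practice the gradients change as training proceeds, so a number like this is only a starting point for the next experiment.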
1:44:25
If anything, the learning rate here is a little bit on the higher side, because you see that we're above the black line of -3; we're somewhere around -2.5. It's like, okay. But everything is somewhat stabilizing, and so this looks like a pretty decent setting of learning rates and so on. But this is something to look at, and when things are miscalibrated you will see it very quickly.
1:44:49
So, for example, everything looks pretty well behaved, right? But just as a comparison, when things are not properly calibrated, what does that look like? Let me come up here, and let's say that we forgot to apply this fan-in normalization, so the weights inside the linear layers are just sampled from a Gaussian in all the stages. How do we notice that something's off?
1:45:14
Well, the activation plot will tell you, whoa, your neurons are way too saturated. The gradients are going to be all messed up. The histogram for these weights is going to be all messed up as well, and there's a lot of asymmetry. And then if we look here, I suspect it's all going to be pretty messed up too. You see there's a lot of discrepancy in how fast these layers are learning, and some of them are learning way too fast: -1, -1.5, those are very large numbers in terms of this ratio. Again, you should be somewhere around -3 and not much above that.
1:45:46
So this is how miscalibrations of your neural nets are going to manifest, and these kinds of plots are a good way of bringing those miscalibrations to your attention so you can address them.
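A dependency-free sketch of what forgetting the fan-in normalization does to a single tanh neuron. The fan-in of 100, the sample count, and the 0.97 saturation threshold are assumptions chosen for illustration; inputs are drawn from a unit Gaussian.

```python
import math
import random
import statistics

random.seed(42)
fan_in, n_samples = 100, 1000

def preactivations(scale):
    """Pre-activations of one neuron: dot(x, w) with w ~ N(0, scale^2)."""
    w = [random.gauss(0, scale) for _ in range(fan_in)]
    outs = []
    for _ in range(n_samples):
        x = [random.gauss(0, 1) for _ in range(fan_in)]
        outs.append(sum(xi * wi for xi, wi in zip(x, w)))
    return outs

def saturated_fraction(pre, threshold=0.97):
    """Fraction of tanh outputs whose magnitude exceeds the threshold."""
    return sum(abs(math.tanh(p)) > threshold for p in pre) / len(pre)

raw = preactivations(scale=1.0)                          # no fan-in normalization
scaled = preactivations(scale=1.0 / math.sqrt(fan_in))   # divide by sqrt(fan_in)

print(statistics.stdev(raw), saturated_fraction(raw))
print(statistics.stdev(scaled), saturated_fraction(scaled))
```

Without the scaling, the pre-activations have a spread on the order of the square root of the fan-in, so most tanh outputs end up pinned near -1 or 1; with it, the spread stays near 1 and only a small fraction saturate.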
1:46:05
Okay, so far we've seen that when we have this linear-tanh sandwich, we can actually precisely calibrate the gains and make the activations, the gradients, the parameters, and the updates all look pretty decent. But it definitely feels a little bit like balancing a pencil on your finger, and that's because this gain has to be very precisely calibrated.
1:46:26
So now let's introduce batch normalization layers into the mix, and let's see how that helps fix the problem.
1:46:34
So here I'm going to take the BatchNorm1d class and I'm going to start placing it inside. As I mentioned before, the standard, typical place you would place it is right after the linear layer but before the nonlinearity. But people have definitely played with that, and in fact you can get very similar results even if you place it after the nonlinearity.
1:46:55
The other thing I wanted to mention is that it's totally fine to also place it at the end, after the last linear layer and before the loss function. So this is potentially fine as well, and in this case the output would be vocab size.
1:47:12
Now, because the last layer is a batch norm, we would not be changing the weight to make the softmax less confident; we'd be changing the gamma, because gamma, remember, in the batch norm is the variable that multiplicatively interacts with the output of that normalization.
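A minimal sketch of how gamma scales the output of a batch normalization, for a single feature across a batch. This is plain Python rather than the lecture's BatchNorm1d class; the function name, the logits, and the gamma value of 0.1 are assumptions for illustration.

```python
import statistics

def batchnorm_1feature(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across the batch, then scale by gamma and shift by beta."""
    mean = statistics.fmean(xs)
    var = statistics.pvariance(xs, mu=mean)  # biased variance over the batch
    return [gamma * (x - mean) / (var + eps) ** 0.5 + beta for x in xs]

# Hypothetical logits for one output unit across a batch of 6 examples.
logits = [4.0, -2.5, 7.0, 0.5, -6.0, 3.0]

plain = batchnorm_1feature(logits)             # gamma = 1
quiet = batchnorm_1feature(logits, gamma=0.1)  # smaller gamma, smaller logits

print(statistics.pstdev(plain), statistics.pstdev(quiet))
```

Because the normalized values have roughly unit spread, gamma directly sets the spread of the logits, which is why it takes over the role the 0.1 weight scaling played before.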
1:47:34
So we can initialize this sandwich now, we can train, and we can see that the activations are of course going to look very good. They are necessarily going to look good, because now before every single tanh layer there is a normalization in the batch norm. So, unsurprisingly, this all looks pretty good: it's going to be a standard deviation of roughly 0.65, 2%, and roughly equal standard deviation throughout all the layers. So everything looks very homogeneous.
1:48:01
The gradients look good, the weights and their distributions look good, and then the updates also look pretty reasonable. We are going above -3 a little bit, but not by too much, so all the parameters are training at roughly the same rate here.
1:48:24
But now what we've gained is that we are going to be slightly less brittle with respect to the gain of these. So, for example, I can make the gain be, say, 0.2 here, which is much, much lower than what we had with the tanh. But as we'll see, the activations will actually be exactly unaffected, and that's because of, again, this explicit normalization. The gradients are going to look okay, the weight gradients are going to look okay, but actually the updates will change. And so even though the forward and backward pass, to a very large extent, look okay, because of the backward pass of the batch norm and how the scale of the incoming activations interacts in the batch norm and its backward pass, this is actually changing the scale of the updates on these parameters. So the gradients of these weights are affected. So we still don't get a completely free pass to pass in arbitrary weights here, but everything else is significantly more robust in terms of the forward, backward, and the weight gradients. It's just that you may have to retune your learning rate if you are changing sufficiently the scale of the activations that are coming into the batch norms. So here, for example, we changed the gains of these linear layers to be greater, and we're seeing that the updates are coming out lower as a result.
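The effect described here can be checked numerically on a toy setup. The sketch below is not the lecture's network: it assumes a single linear layer followed by a hand-written batch norm and a tanh, random data, and a squared-error loss against a made-up target. Under those assumptions, multiplying the weights by a gain leaves the activations essentially unchanged, shrinks the weight gradient by roughly that gain, and therefore shrinks the update-to-data ratio by roughly the gain squared.

```python
import torch

torch.manual_seed(0)
x = torch.randn(64, 30, dtype=torch.float64)               # stand-in input batch
W0 = torch.randn(30, 100, dtype=torch.float64) / 30**0.5   # fan-in scaled weights
target = torch.randn(64, 100, dtype=torch.float64)         # hypothetical target

def batchnorm(z, eps=1e-5):
    """Plain batch normalization over the batch dimension (gamma=1, beta=0)."""
    mean = z.mean(dim=0, keepdim=True)
    var = z.var(dim=0, keepdim=True, unbiased=False)
    return (z - mean) / torch.sqrt(var + eps)

def run(gain, lr=0.1):
    """Forward and backward pass with the weights scaled by `gain`."""
    W = (gain * W0).clone().requires_grad_(True)
    h = torch.tanh(batchnorm(x @ W))
    loss = ((h - target) ** 2).mean()
    loss.backward()
    ratio = (lr * W.grad.std() / W.data.std()).item()  # update-to-data ratio
    return h.detach(), W.grad.std().item(), ratio

h1, g1, r1 = run(gain=1.0)
h5, g5, r5 = run(gain=5.0)

print("max activation difference:", (h1 - h5).abs().max().item())
print("grad std ratio (gain 5 / gain 1):", g5 / g1)        # close to 1/5
print("update/data ratio (gain 5 / gain 1):", r5 / r1)     # close to 1/25
```

The small residual difference in the activations comes from the `eps` term inside the normalization. In a deep network trained for many steps the numbers will not follow this scaling exactly, but the direction matches the observation above: larger incoming scale, smaller relative updates.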
1:49:53
And then finally, if we are using batch norms (let me reset this to one, so there's no gain), we don't necessarily even have to normalize by fan-in sometimes. So if I take out the fan-in, so these are just now random Gaussians, we'll see that because of batch norm this will actually be relatively well behaved. The statistics in the forward pass of course look good, the gradients look good, the weight updates look okay, a little bit of fat tails on some of the layers, and this looks okay as well. But as you can see, we're significantly below -3, so we'd have to bump up the learning rate of this batch norm so that we are training more properly. In particular, looking at this, it roughly looks like we have to 10x the learning rate to get to about 1e-3. So we'd come here and we would change this to be an update of 1.0, and if I reinitialize, then we'll see that everything still, of course, looks good, and now we are roughly here, and we expect this to be an okay training run.
1:51:05
So, long story short, we are significantly more robust to the gain of these linear layers, whether or not we have to apply the fan-in, and we can change the gain, but we actually do have to worry a little bit about the update scales and making sure that the learning rate is properly calibrated here. But the activations of the forward and backward pass and the updates are looking significantly more well behaved, except for the global scale that is potentially being adjusted here.
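The 10x estimate can be read off the definition of the plotted quantity. As a rough sketch, assume the plot shows the base-10 log of the update-to-data ratio and that the gradients are held fixed at the step being inspected. The symbols below (learning rate eta, gradient g, parameter w, factor k) are introduced here for illustration only.

```latex
r(\eta) = \log_{10}\!\left(\frac{\eta \,\operatorname{std}(g)}{\operatorname{std}(w)}\right),
\qquad
r(k\eta) = r(\eta) + \log_{10} k .
```

So multiplying the learning rate by k = 10 shifts every curve up by one unit on the log scale. Once training actually proceeds with the new rate, the gradients and weights themselves change, so this is only a first estimate that the plot then confirms or corrects.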
1:51:34
Okay, so now let me summarize. There are three things I was hoping to achieve with this section. Number one, I wanted to introduce you to batch normalization, which is one of the first modern innovations that we're looking into that helped stabilize very deep neural networks and their training. I hope you understand how batch normalization works and how it would be used in a neural network.
1:51:56
Number two, I was hoping to PyTorch-ify some of our code and wrap it up into these modules, like Linear, BatchNorm1d, Tanh, etc. These are layers or modules, and they can be stacked up into neural nets like Lego building blocks. These layers actually exist in PyTorch, and if you import torch.nn, then, the way I've constructed it, you can simply just use PyTorch by prepending nn. to all these different layers, and actually everything will just work, because the API that I've developed here is identical to the API that PyTorch uses. The implementation also is basically, as far as I'm aware, identical to the one in PyTorch.
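As an illustration of the Lego-block idea, here is a small sketch that stacks the built-in `torch.nn` layers directly. The sizes are hypothetical, and `bias=False` on the linear layers is a choice made here because the batch norm that follows has its own shift. Note that `nn.BatchNorm1d` names its gain and shift `weight` and `bias` rather than gamma and beta.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes for illustration only.
n_in, n_hidden, vocab_size = 30, 100, 27

model = nn.Sequential(
    nn.Linear(n_in, n_hidden, bias=False), nn.BatchNorm1d(n_hidden), nn.Tanh(),
    nn.Linear(n_hidden, n_hidden, bias=False), nn.BatchNorm1d(n_hidden), nn.Tanh(),
    nn.Linear(n_hidden, vocab_size, bias=False), nn.BatchNorm1d(vocab_size),
)

x = torch.randn(32, n_in)      # a batch of 32 flattened embeddings
logits = model(x)              # training mode: batch statistics are used
print(logits.shape)            # torch.Size([32, 27])

model.eval()                   # switch batch norm to its running statistics
single = model(torch.randn(1, n_in))
print(single.shape)            # torch.Size([1, 27])
```

Calling `model.eval()` before feeding a single example matters: in training mode a batch norm layer cannot compute a batch variance from one row.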
1:52:38
And number three, I tried to introduce you to the diagnostic tools that you would use to understand whether your neural network is in a good state dynamically. So we are looking at the statistics and histograms of the forward pass activations and the backward pass gradients, and then also we're looking at the weights that are going to be updated as part of stochastic gradient descent, and we're looking at their means, standard deviations, and also the ratio of gradients to data, or even better, the updates to data. And we saw that typically we don't actually look at it as a single snapshot frozen in time at some particular iteration. Typically people look at this over time, just like I've done here, and they look at these update-to-data ratios and they make sure everything looks okay. In particular, I said that 1e-3, or basically -3 on the log scale, is a good rough heuristic for what you want this ratio to be. If it's way too high, then probably the learning rate or the updates are a little too big, and if it's way too small, then the learning rate is probably too small. So that's just some of the things that you may want to play with when you try to get your neural network to work very well.
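A minimal sketch of how such an update-to-data ratio could be tracked over time, assuming plain SGD so that the update is the learning rate times the gradient. The helper name `update_to_data_log_ratio`, the single weight matrix, and the toy loss are all hypothetical and exist only to make the example runnable.

```python
import torch

def update_to_data_log_ratio(param, lr):
    """log10 of std(lr * grad) / std(data) for one parameter tensor."""
    with torch.no_grad():
        return (lr * param.grad.std() / param.data.std()).log10().item()

torch.manual_seed(0)
W = (torch.randn(100, 100) / 100**0.5).requires_grad_(True)  # fan-in scaled
x = torch.randn(32, 100)
lr = 0.1

ud = []  # one entry per training step
for step in range(5):
    loss = torch.tanh(x @ W).pow(2).mean()   # toy loss, for illustration only
    W.grad = None
    loss.backward()
    ud.append(update_to_data_log_ratio(W, lr))
    with torch.no_grad():
        W -= lr * W.grad                      # plain SGD step

print([round(v, 2) for v in ud])
```

In a real run one such list per parameter would be plotted against the step count, with a horizontal guide at -3 to compare against the heuristic.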
1:53:47
Now, there's a number of things I did not try to achieve. I did not try to beat our previous performance, as an example, by introducing the batch norm layer. Actually, I did try: I used the learning rate finding mechanism that I've described before, I tried to train a batch norm neural net, and I actually ended up with results that are very, very similar to what we've obtained before. That's because our performance now is not bottlenecked by the optimization, which is what batch norm is helping with. The performance at this stage is bottlenecked by what I suspect is the context length. Currently we are taking three characters to predict the fourth one, and I think we need to go beyond that, and we need to look at more powerful architectures, like recurrent neural networks and Transformers, in order to further push the log probabilities that we're achieving on this data set.
1:54:39
And I also did not try to have a full explanation of all of these activations, the gradients and the backward pass, and the statistics of all these gradients. So you may have found some of the parts here unintuitive, and maybe you're slightly confused about: okay, if I change the gain here, how come we need a different learning rate? I didn't go into the full detail, because you'd have to actually look at the backward pass of all these different layers and get an intuitive understanding of how that works, and I did not go into that in this lecture. The purpose really was just to introduce you to the diagnostic tools and what they look like. But there's still a lot of work remaining on the intuitive level to understand the initialization, the backward pass, and how all of that interacts.
1:55:15
But you shouldn't feel too bad, because honestly we are getting to the cutting edge of where the field is. We certainly haven't, I would say, solved initialization, and we haven't solved backpropagation, and these are still very much an active area of research. People are still trying to figure out what is the best way to initialize these networks, what is the best update rule to use, and so on. So none of this is really solved, and we don't really have all the answers to all these cases, but at least we're making progress, and at least we have some tools to tell us whether or not things are on the right track for now. So I think we've made positive progress in this lecture, and I hope you enjoyed that, and I will see you next time.