Neural Networks: Zero to Hero
The spelled-out intro to neural networks and backpropagation: building micrograd
Transcript
0:00
Hello, my name is Andrej, and I've been training deep neural networks for a bit more than a decade. In this lecture I'd like to show you what neural network training looks like under the hood.
0:10
In particular, we are going to start with a blank Jupyter notebook, and by the end of this lecture we will define and train a neural net, and you'll get to see everything that goes on under the hood and exactly how that works on an intuitive level.
0:24
Now, specifically, what I would like to do is take you through the building of micrograd. Micrograd is a library that I released on GitHub about two years ago, but at the time I only uploaded the source code, and you'd have to go in by yourself and really figure out how it works.
0:39
So in this lecture I will take you through it step by step and comment on all the pieces of it. So what is micrograd, and why is it interesting?
0:51
Micrograd is basically an autograd engine. Autograd is short for automatic gradient, and really what it does is implement backpropagation. Backpropagation is the algorithm that allows you to efficiently evaluate the gradient of some kind of loss function with respect to the weights of a neural network.
1:07
What that allows us to do is iteratively tune the weights of that neural network to minimize the loss function and therefore improve the accuracy of the network. So backpropagation would be at the mathematical core of any modern deep neural network library, like, say, PyTorch or JAX.
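To make the idea of iteratively tuning weights concrete, here is a minimal sketch in plain Python. It assumes a made-up one-weight model `w * x` with a squared-error loss; the names `loss`, `grad_loss`, and `learning_rate`, and all the numbers, are illustrative and not taken from micrograd. The gradient is worked out by hand here; backpropagation is what computes it automatically for larger expressions.

```python
# Hypothetical one-weight "network": prediction = w * x.
# Squared-error loss against a target y.
def loss(w, x, y):
    return (w * x - y) ** 2

# Gradient of the loss with respect to w, derived by hand via the chain rule:
# d/dw (w*x - y)^2 = 2 * (w*x - y) * x
def grad_loss(w, x, y):
    return 2.0 * (w * x - y) * x

x, y = 2.0, 6.0        # one training example (assumed values)
w = 0.0                # initial weight
learning_rate = 0.05   # step size (assumed value)

losses = []
for step in range(50):
    losses.append(loss(w, x, y))
    # Step against the gradient to reduce the loss.
    w -= learning_rate * grad_loss(w, x, y)

print(w, losses[0], losses[-1])
```

The weight moves toward 3.0, the value at which `w * x` equals the target, and the recorded loss shrinks along the way.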
1:25
The functionality of micrograd is, I think, best illustrated by an example. If we just scroll down here, you'll see that micrograd basically allows you to build out mathematical expressions.
1:34
Here we have an expression that we're building out with two inputs, a and b. You'll see that a and b are negative four and two, but we are wrapping those values into this Value object that we are going to build out as part of micrograd. This Value object will wrap the numbers themselves.
1:56
Then we are going to build out a mathematical expression here where a and b are transformed into c, d, and eventually e, f, and g. I'm showing some of the functionality of micrograd and the operations that it supports: you can add two Value objects, you can multiply them, you can raise them to a constant power, you can offset by one, negate, squash at zero, square, divide by a constant, divide by it, etc.
2:24
So we're building out an expression graph with these two inputs, a and b, and we're creating an output value g. Micrograd will, in the background, build out this entire mathematical expression. It will, for example, know that c is also a Value, that c was the result of an addition operation, and that the child nodes of c are a and b, because it will maintain pointers to the a and b Value objects. So we'll basically know exactly how all of this is laid out.
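The bookkeeping described here can be sketched roughly as follows. This is a stripped-down, assumed version of a Value-like class: it only records the data, the operation, and the child nodes, with no gradient logic. The attribute names `_prev` and `_op` are illustrative choices for this sketch rather than something the lecture has stated at this point.

```python
class Value:
    """Wraps a single scalar and remembers how it was produced."""

    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._prev = set(_children)  # child nodes this value was computed from
        self._op = _op               # operation that produced this value

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')


a = Value(-4.0)
b = Value(2.0)
c = a + b

print(c)                  # Value(data=-2.0)
print(c._op)              # +
print(c._prev == {a, b})  # True
```

Because every result keeps pointers to its inputs, walking `_prev` from the output recovers the whole expression graph.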
2:55
We can do what we call the forward pass, where we actually look at the value of g. Of course, that's pretty straightforward: we access it using the .data attribute, and the output of the forward pass, the value of g, turns out to be 24.7. But the big deal is that we can also take this g Value object and call .backward() on it. This will initialize backpropagation at the node g.
3:19
What backpropagation is going to do is start at g, go backwards through that expression graph, and recursively apply the chain rule from calculus. What that allows us to do is evaluate the derivative of g with respect to all the internal nodes, like e, d, and c, but also with respect to the inputs a and b.
3:45
Then we can actually query this derivative of g with respect to a, for example; that's a.grad, and in this case it happens to be 138. The derivative of g with respect to b happens to be 645 here.
3:57
This derivative, we'll see soon, is very important information, because it's telling us how a and b are affecting g through this mathematical expression. In particular, a.grad is 138, so if we slightly nudge a and make it slightly larger, 138 is telling us that g will grow, and the slope of that growth is going to be 138. The slope of growth with respect to b is going to be 645. So that tells us how g will respond if a and b get tweaked a tiny amount in a positive direction.
4:33
Now, you might be confused about what this expression is that we built out here. This expression, by the way, is completely meaningless. I just made it up; I'm just flexing about the kinds of operations that are supported by micrograd. What we actually really care about are neural networks, but it turns out that neural networks are just mathematical expressions, just like this one, but actually a slight bit less crazy, even.
4:54
Neural networks are just a mathematical expression. They take the input data as an input, and they take the weights of the neural network as an input, and the output is the predictions of your neural net, or the loss function. We'll see this in a bit, but basically neural networks just happen to be a certain class of mathematical expressions.
5:15
But backpropagation is actually significantly more general. It doesn't actually care about neural networks at all; it only tells us about arbitrary mathematical expressions, and then we happen to use that machinery for training neural networks.
5:26
Now, one more note I would like to make at this stage is that, as you see here, micrograd is a scalar-valued autograd engine. It works on the level of individual scalars, like negative four and two, and we're taking neural nets and breaking them down all the way to these atoms of individual scalars and all the little pluses and times, and it's just excessive.
5:45
Obviously you would never do any of this in production. It's really just done for pedagogical reasons, because it allows us to not have to deal with the n-dimensional tensors that you would use in a modern deep neural network library.
5:56
So this is really done so that you understand and refactor out backpropagation, the chain rule, and an understanding of neural network training. Then, if you actually want to train bigger networks, you have to use these tensors, but none of the math changes. This is done purely for efficiency.
6:11
We are basically taking all the scalar values and packaging them up into tensors, which are just arrays of these scalars. Then, because we have these large arrays, we make operations on those large arrays that allow us to take advantage of the parallelism in a computer: all those operations can be done in parallel, and the whole thing runs faster. But really none of the math changes, and that's done purely for efficiency.
6:33
So I don't think it's pedagogically useful to be dealing with tensors from scratch, and that's fundamentally why I wrote micrograd: you can understand how things work at the fundamental level, and then you can speed it up later.
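As a rough illustration of the "none of the math changes" point, the sketch below computes the same weighted sum twice: once scalar by scalar in a Python loop, and once as a single NumPy array operation. The weights and inputs are made-up values, and NumPy is used here only as a stand-in for the tensor types of a deep learning library.

```python
import numpy as np

w = [0.5, -1.0, 2.0]   # made-up weights
x = [1.0, 3.0, -2.0]   # made-up inputs

# Scalar version: one multiply and one add at a time.
total = 0.0
for wi, xi in zip(w, x):
    total += wi * xi

# Array version: the same scalars packed into arrays, one vectorized operation.
total_vec = float(np.dot(np.array(w), np.array(x)))

print(total, total_vec)
```

Both paths compute the same number; the array version simply hands the whole batch of multiplies and adds to optimized code in one call.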
6:48
Okay, so here's the fun part. My claim is that micrograd is what you need to train your networks, and everything else is just efficiency. So you'd think that micrograd would be a very complex piece of code, and that turns out not to be the case.
7:01
If we just go to micrograd, you'll see that there are only two files here. This is the actual engine; it doesn't know anything about neural nets. And this is the entire neural nets library on top of micrograd. So: engine.py and nn.py.
7:19
The actual backpropagation autograd engine that gives you the power of neural networks is literally 100 lines of very simple Python, which we'll understand by the end of this lecture.
7:32
And then nn.py, this neural network library built on top of the autograd engine, is like a joke. We have to define what a neuron is, then we have to define what a layer of neurons is, and then we define what a multi-layer perceptron is, which is just a sequence of layers of neurons. So it's just a total joke.
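To give a feel for how little structure is involved, here is a forward-only sketch of those three pieces using plain floats. It is not the nn.py code: the class layout, the random initialization in [-1, 1], and the tanh nonlinearity are assumptions made for this illustration, and there is no gradient tracking.

```python
import math
import random

class Neuron:
    def __init__(self, n_inputs):
        # One weight per input plus a bias, randomly initialized (assumed scheme).
        self.w = [random.uniform(-1, 1) for _ in range(n_inputs)]
        self.b = random.uniform(-1, 1)

    def __call__(self, x):
        # Weighted sum of inputs plus bias, passed through a squashing function.
        act = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return math.tanh(act)

class Layer:
    def __init__(self, n_inputs, n_outputs):
        # A layer is a list of neurons that all see the same inputs.
        self.neurons = [Neuron(n_inputs) for _ in range(n_outputs)]

    def __call__(self, x):
        return [n(x) for n in self.neurons]

class MLP:
    def __init__(self, n_inputs, layer_sizes):
        # A multi-layer perceptron is a sequence of layers, each feeding the next.
        sizes = [n_inputs] + layer_sizes
        self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(layer_sizes))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

random.seed(0)
model = MLP(3, [4, 4, 1])
out = model([2.0, 3.0, -1.0])
print(out)  # a list with one float between -1 and 1; the exact value depends on the seed
```

Swapping the plain floats for Value objects is what would let gradients flow back through every weight.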
7:53
So basically, there's a lot of power that comes from only 150 lines of code, and that's all you need to understand to understand neural network training; everything else is just efficiency. Of course, there's a lot to efficiency, but fundamentally that's all that's happening.
8:09
Okay, so now let's dive right in and implement micrograd step by step. The first thing I'd like to do is make sure that you have a very good intuitive understanding of what a derivative is and exactly what information it gives you.
8:20
Let's start with some basic imports that I copy-paste into every Jupyter notebook, always, and let's define a scalar-valued function f(x) as follows. I just made this up randomly; I just wanted a scalar-valued function that takes a single scalar x and returns a single scalar y. We can call this function, of course: we can pass in, say, 3.0 and get 20 back.
8:45
Now, we can also plot this function to get a sense of its shape. You can tell from the mathematical expression that this is probably a parabola; it's a quadratic. So we create a set of scalar values that we can feed in, using, for example, a range from negative 5 to 5 in steps of 0.25. So xs goes from negative 5 to 5, not including 5, in steps of 0.25.
9:11
We can actually call this function on this NumPy array as well, so we get a set of ys if we call f on xs; these ys are basically the result of applying the function to every one of these elements independently. We can plot this using matplotlib: plt.plot of xs and ys, and we get a nice parabola. Previously we fed in 3.0, somewhere here, and we received 20 back, which is the y coordinate here.
9:39
back which is here the y coordinate so
9:39
back which is here the y coordinate so now i'd like to think through
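The plotting step described here can be sketched roughly as follows. This is a minimal sketch, not the exact notebook cell: the names `xs`, `ys` and `f` follow what is spoken in the lecture, while `plot_parabola` is a hypothetical wrapper added only so that the numeric part runs without a display.

```python
import numpy as np

def f(x):
    # The quadratic from the lecture: 3x^2 - 4x + 5.
    return 3 * x**2 - 4 * x + 5

# Inputs from -5 up to (but not including) 5, in steps of 0.25.
xs = np.arange(-5, 5, 0.25)

# NumPy applies f to every element of the array independently.
ys = f(xs)

def plot_parabola():
    # Hypothetical helper: matplotlib is imported lazily so that the
    # numeric part above also runs on machines without a display.
    import matplotlib.pyplot as plt
    plt.plot(xs, ys)
    plt.show()

print(f(3.0))  # 20.0
```

Calling `plot_parabola()` in a notebook should draw the parabola; the point (3.0, 20.0) sits on its right-hand branch.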
9:40
Now I'd like to think through what the derivative of this function is at any single input point x. So what is the derivative at different points x of this function?
9:49
Now, if you remember back to your calculus class, you've probably derived derivatives. So we take this mathematical expression, 3x^2 - 4x + 5, and you would write it out on a piece of paper, apply the product rule and all the other rules, and derive the mathematical expression of the derivative of the original function. Then you could plug in different x's and see what the derivative is.
10:11
We're not going to actually do that, because no one in neural networks actually writes out the expression for the neural net. It would be a massive expression, with thousands or tens of thousands of terms. No one actually derives the derivative, of course, and so we're not going to take this kind of symbolic approach.
10:27
Instead, what I'd like to do is look at the definition of the derivative and just make sure that we really understand what the derivative is measuring, what it's telling you about the function. And so if we just look up derivative,
10:42
we see that, okay, this is not a very good definition of derivative. This is a definition of what it means to be differentiable. But if you remember from your calculus, it is the limit as h goes to zero of (f(x + h) - f(x)) / h.
10:58
So basically what it's saying is: if you're at some point x that you're interested in, or a, and you slightly bump it up, you slightly increase it by a small number h, how does the function respond? With what sensitivity does it respond? What is the slope at that point? Does the function go up or does it go down, and by how much? And that's the slope of that function, the slope of that response at that point.
11:23
And so we can basically evaluate the derivative here numerically by taking a very small h. Of course the definition would ask us to take h to zero; we're just going to pick a very small h, 0.001. And let's say we're interested in the point 3.0. So we can look at f(x), which of course is 20, and now f(x + h).
11:42
So if we slightly nudge x in a positive direction, how is the function going to respond? Just looking at this, do you expect f(x + h) to be slightly greater than 20, or slightly lower than 20? Since 3 is here and this is 20, if we go slightly in the positive direction the function will respond positively, so you'd expect this to be slightly greater than 20.
12:05
And by how much is telling you the strength of that slope, the size of the slope. So f(x + h) - f(x) is how much the function responded in the positive direction, and we have to normalize by the run, so we have the rise over run to get the slope.
12:24
So this of course is just a numerical approximation of the slope, because we have to make h very, very small to converge to the exact amount.
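The rise-over-run estimate described here can be written out as a short sketch. The function and the values h = 0.001 and x = 3.0 are the ones spoken in the lecture; the variable names `rise` and `slope` are only illustrative.

```python
def f(x):
    # The quadratic used throughout this part of the lecture.
    return 3 * x**2 - 4 * x + 5

h = 0.001   # small step; the true derivative is the limit as h -> 0
x = 3.0

rise = f(x + h) - f(x)   # how much the function responded to the nudge
slope = rise / h         # rise over run

print(f(x))      # 20.0
print(f(x + h))  # slightly greater than 20
print(slope)     # close to 14
```

Because h is finite, the printed slope is an approximation and lands a little above the exact value.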
12:35
Now if I'm doing too many zeros, at some point I'm going to get an incorrect answer, because we're using floating point arithmetic and the representations of all these numbers in computer memory are finite, and at some point we get into trouble. So we can converge towards the right answer with this approach.
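To see the floating point trouble concretely, one can shrink h step by step. This is only an illustrative sketch: `numerical_slope` is a hypothetical helper name and the list of step sizes is an arbitrary choice, not something taken from the lecture.

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

def numerical_slope(x, h):
    # Forward difference: rise over run for a step of size h.
    return (f(x + h) - f(x)) / h

x = 3.0
for h in (1e-1, 1e-3, 1e-6, 1e-8, 1e-12, 1e-16, 1e-20):
    print(f"h={h:.0e}  slope={numerical_slope(x, h)}")
```

For moderately small h the estimate approaches 14. Once h is so small that `x + h` rounds back to `x` in double precision, the rise is exactly zero and the estimate collapses to 0.0.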
12:54
But basically, at 3 the slope is 14, and you can see that by taking 3x^2 - 4x + 5 and differentiating it in our head. So the 3x^2 term would be 6x, then minus 4, and then we plug in x = 3, so that's 18 - 4, which is 14. So this is correct.
13:12
So that's at 3. Now how about the slope at, say, negative 3? What would you expect for the slope? Telling the exact value is really hard, but what is the sign of that slope? At negative 3, if we go slightly in the positive direction in x, the function would actually go down, and so that tells you that the slope would be negative. So f(x + h) comes out slightly below f(x), which at x = -3 is 44, and if we take the slope we expect something negative: negative 22. Okay.
13:45
And at some point here, of course, the slope would be zero. Now for this specific function I looked it up previously, and it's at the point 2/3. So at roughly 2/3, that's somewhere here, this derivative would be zero.
14:03
So basically at that precise point, if we nudge in a positive direction, the function doesn't respond; it stays almost the same, and so that's why the slope is zero.
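The three points discussed (3, negative 3 and 2/3) can be checked against the hand-derived derivative 6x - 4. A small sketch, where `df` and `numerical_slope` are hypothetical helper names and the default step 1e-6 is an assumed choice:

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

def df(x):
    # Analytic derivative of 3x^2 - 4x + 5, derived by hand.
    return 6 * x - 4

def numerical_slope(x, h=1e-6):
    # Forward-difference estimate of the slope at x.
    return (f(x + h) - f(x)) / h

for x in (3.0, -3.0, 2 / 3):
    print(f"x={x:.4f}  f(x)={f(x):.4f}  analytic={df(x):.4f}  numerical={numerical_slope(x):.4f}")
```

The analytic and numerical columns should agree to several decimal places: positive at 3, negative at negative 3, and essentially zero at 2/3, where the parabola bottoms out.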
14:14
Okay, now let's look at a bit more complex case, so we're going to start complexifying a bit. Now we have a function here with output variable d that is a function of three scalar inputs a, b and c. So a, b and c are some specific values, three inputs into our expression graph, and a single output d. And so if we just print d, we get 4.
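In code, this setup is roughly the following, using the input values and the expression a times b plus c that are spoken aloud in the lecture:

```python
# Three scalar inputs (values as stated in the lecture).
a = 2.0
b = -3.0
c = 10.0

# A single output computed from them.
d = a * b + c
print(d)  # 4.0
```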
14:38
And now what I'd like to do is again look at the derivatives of d with respect to a, b and c, and think through, again, just the intuition of what this derivative is telling us.
14:49
So in order to evaluate this derivative we're going to get a bit hacky here. We're going to again have a very small value of h, and then we're going to fix the inputs at some values that we're interested in. So this is the point (a, b, c) at which we're going to be evaluating the derivative of d with respect to all of a, b and c.
15:11
So these are the inputs, and now we have d1, which is that expression. Then, for example, we're going to look at the derivative of d with respect to a, so we'll take a and bump it by h, and then we'll get d2 to be the exact same function. And now we're going to print d1, d2, and the slope.
15:37
So the derivative, or slope, here will of course be d2 minus d1, divided by h. So d2 - d1 is how much the function increased when we bumped the specific input that we're interested in by a tiny amount, and this is then normalized by h to get the slope.
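The "hacky" procedure described here might look like the sketch below. The names d1, d2 and the printed labels follow the spoken description; the step size h = 0.0001 is an assumption, since the transcript only says "a very small value of h".

```python
h = 0.0001  # assumed step size; the transcript only says "very small"

# The point (a, b, c) at which the derivative is evaluated.
a = 2.0
b = -3.0
c = 10.0

d1 = a * b + c   # value before the nudge
a += h           # bump only the input we are interested in
d2 = a * b + c   # same expression, re-evaluated after the bump

print("d1", d1)
print("d2", d2)
print("slope", (d2 - d1) / h)
```

To probe a different input, one would bump b or c in place of a and leave everything else unchanged.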
16:02
So if I just run this, we're going to print d1, which we know is 4. Now for d2, a will be bumped by h. So let's just think through a little bit what d2 will be when it's printed out here. In particular, d1 will be 4. Will d2 be a number slightly greater than 4 or slightly lower than 4? That's going to tell us the sign of the derivative.
16:42
So we're bumping a by h, b is minus 3, c is 10. You can just intuitively think through this derivative and what it's doing. a will be slightly more positive, but b is a negative number. So if a is slightly more positive, because b is negative 3, we're actually going to be adding less to d. So you'd actually expect that the value of the function will go down.
17:16
So let's just see this. Yeah, and so we went from 4 to 3.9996, and that tells you that the slope will be negative, a negative number, because we went down. And the exact amount of the slope is negative 3.
17:35
And you can also convince yourself that negative 3 is the right answer mathematically and analytically, because if you have a times b plus c and you know your calculus, then differentiating a times b plus c with respect to a gives you just b. And indeed the value of b is negative 3, which is the derivative that we have, so you can tell that that's correct.
17:57
So now if we do this with b: if we bump b by a little bit in a positive direction, we'd get a different slope. So what is the influence of b on the output d? If we bump b by a tiny amount in a positive direction, then because a is positive, we'll be adding more to d, right?
18:17
And now what is the sensitivity, what is the slope of that addition? It might not surprise you that this should be 2. And why is it 2? Because dd/db, differentiating with respect to b, would give us a, and the value of a is 2. So that's also working well.
18:37
And then if c gets bumped by a tiny amount h, then of course a times b is unaffected, and now c becomes slightly higher. What does that do to the function? It makes it slightly higher, because we're simply adding c, and it makes it slightly higher by the exact same amount that we added to c. And so that tells you that the slope is 1. That will be the rate at which d will increase as we scale c.
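All three sensitivities can be estimated in one go by bumping each input in turn. A minimal sketch, assuming the same inputs as above; `d_of` and `partials` are hypothetical helper names and h = 1e-4 is an assumed step size:

```python
def d_of(a, b, c):
    # The expression from the lecture: d = a*b + c.
    return a * b + c

def partials(a, b, c, h=1e-4):
    # Bump one input at a time and measure rise over run.
    base = d_of(a, b, c)
    return {
        "a": (d_of(a + h, b, c) - base) / h,
        "b": (d_of(a, b + h, c) - base) / h,
        "c": (d_of(a, b, c + h) - base) / h,
    }

slopes = partials(2.0, -3.0, 10.0)
print(slopes)  # approximately a: -3, b: 2, c: 1
```

These match the analytic results: the derivative with respect to a is b, with respect to b is a, and with respect to c is 1.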
19:06
Okay, so we now have some intuitive sense of what this derivative is telling you about the function, and we'd like to move to neural networks. Now, as I mentioned, neural networks will be pretty massive mathematical expressions, so we need some data structures that maintain these expressions, and that's what we're going to start to build out now.
19:22
So we're going to build out this Value object that I showed you in the README page of micrograd. Let me copy-paste a skeleton of the first very simple Value object. So class Value takes a single scalar value that it wraps and keeps track of, and that's it.
19:43
So we can, for example, do Value(2.0), and then we can look at its content, and Python will internally use the __repr__ function to return this string, like that. So this is a Value object with data equal to 2 that we're creating here.
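A skeleton along the lines described might look like this. It is a sketch rather than the exact pasted code: the attribute name `data` comes from the spoken description, while the precise string format returned by `__repr__` is an assumption.

```python
class Value:
    """Wraps a single scalar value and keeps track of it."""

    def __init__(self, data):
        self.data = data

    def __repr__(self):
        # Python calls this to produce the string shown when the
        # object is displayed; the exact format here is assumed.
        return f"Value(data={self.data})"

a = Value(2.0)
print(a)  # Value(data=2.0)
```

Without `__repr__`, Python would fall back to a generic representation showing the class name and a memory address, which is much less useful when inspecting expressions.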
20:04
Now what we'd like to do is have not just two values, but we'd like to do a + b, right? We'd like to add them. Currently you would get an error, because Python doesn't know how to add two Value objects, so we have to tell it. So here's addition.
20:26
So you have to basically use these special double-underscore methods in Python to define these operators for these objects. So if we use this plus operator, Python will internally call a.__add__(b). That's what will happen internally, and so b will be other and self will be a. And so we see that what we're going to return is a new Value object, and it's just going to be wrapping the plus of their data. But remember, because data is the actual Python number, this operator here is just the typical floating point addition; it's not an addition of Value objects.
21:11
not an addition of value objects and will return a new value so now a
21:14
and will return a new value so now a
21:14
and will return a new value so now a plus b should work and it should print
21:15
plus b should work and it should print
21:16
plus b should work and it should print value of
21:17
value of
21:17
value of negative one
21:18
negative one
21:18
negative one because that's two plus minus three
21:20
because that's two plus minus three
21:20
because that's two plus minus three there we go
21:21
there we go
21:21
there we go okay let's now implement multiply
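The addition step can be sketched as follows. This assumes b was created as Value(-3.0), which is inferred from the "two plus minus three" remark rather than shown in this part of the transcript.

```python
class Value:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        # self.data and other.data are plain Python floats, so this "+" is
        # ordinary float addition; the result is wrapped in a new Value.
        return Value(self.data + other.data)


a = Value(2.0)
b = Value(-3.0)
total = a + b  # Python evaluates this as a.__add__(b)
print(total)
```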
21:24
Okay, let's now implement multiply, just so we can recreate this expression here. Multiply, I think it won't surprise you, will be fairly similar: instead of __add__ we're going to be using __mul__, and then here of course we want to do times.
21:36
And so now we can create a c Value object, which will be 10.0, and now we should be able to do a times b. Well, let's just do a times b first. That's Value of negative six now.
21:51
And by the way, I skipped over this a little bit. Suppose that I didn't have the __repr__ function here; then you'd just get some kind of an ugly expression. So what __repr__ is doing is providing us a way to print out a nicer-looking expression in Python, so we don't just have something cryptic; we actually see Value of negative six.
22:13
So this gives us a times b, and then we should now be able to add c to it, because we've defined and told Python how to do mul and add. This will basically be equivalent to a.__mul__(b), and then this new Value object will be .__add__(c). Let's see if that worked. Yep, that worked well: that gave us four, which is what we expect from before. And I believe we can just call them manually as well. There we go.
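A short sketch of the multiply step and of the equivalence between the operator form and the explicit method calls. The values of a, b, and c follow the numbers mentioned in the lecture; the name d_manual is made up for this illustration.

```python
class Value:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data)

    def __mul__(self, other):
        # Same shape as __add__, but multiplying the wrapped floats.
        return Value(self.data * other.data)


a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)

d = a * b + c                        # operator form
d_manual = a.__mul__(b).__add__(c)   # the same calls written out by hand
print(d, d_manual)
```

Without __repr__, printing d would fall back to Python's default object representation, which shows the class name and a memory address rather than the wrapped number.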
22:46
Okay, so now what we are missing is the connective tissue of this expression. As I mentioned, we want to keep these expression graphs, so we need to know and keep pointers about what values produce what other values.
22:56
So here, for example, we are going to introduce a new variable which we'll call _children, and by default it will be an empty tuple. And then we're actually going to keep a slightly different variable in the class, which we'll call _prev, which will be the set of children. This is how I did it in the original micrograd. Looking at my code here, I can't remember exactly the reason; I believe it was efficiency. But this _children will be a tuple for convenience, and when we actually maintain it in the class it will be just this set. Yeah, I believe for efficiency.
23:28
So now, when we are creating a Value like this with the constructor, _children will be empty and _prev will be the empty set. But when we're creating a Value through addition or multiplication, we're going to feed in the children of this Value, which in this case are self and other. So those are the children here.
23:50
So now we can do d._prev, and we'll see that the children of d, we now know, are this Value of negative 6 and Value of 10. This of course is the value resulting from a times b, and the c value, which is 10.
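The bookkeeping described above might look like the sketch below: the constructor takes a _children tuple and stores it as the set _prev, and each operator passes (self, other). This follows the spoken description and is not a copy of the lecture's code.

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        # Tuple on the way in for convenience, stored as a set.
        self._prev = set(_children)

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data, (self, other))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other))


a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a * b + c

print(d._prev)  # the two Values that produced d (set order is not guaranteed)
print(a._prev)  # a leaf created directly has no children
```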
24:08
Now, the last piece of information we don't know: we know the children of every single Value, but we don't know what operation created this Value. So we need one more element here; let's call it _op. By default this is the empty string for leaves, and then we'll just maintain it here. The operation will be just a simple string: in the case of addition it's plus, in the case of multiplication it's times.
24:33
So now we not just have d._prev, we also have d._op, and we know that d was produced by an addition of those two values. And so now we have the full mathematical expression, we're building out this data structure, and we know exactly how each value came to be, by what expression and from what other values.
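Adding the operation tag could be sketched like this, assuming _op defaults to an empty string for leaves and that the strings used are "+" and "*".

```python
class Value:
    def __init__(self, data, _children=(), _op=""):
        self.data = data
        self._prev = set(_children)
        self._op = _op  # "" for leaves, otherwise the operation that made this Value

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), "+")

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), "*")


a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a * b + c

print(d._op)  # the operation that produced d
print(a._op)  # empty for a leaf
```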
24:54
Now, because these expressions are about to get quite a bit larger, we'd like a way to nicely visualize these expressions that we're building out. So for that I'm going to copy-paste a bunch of slightly scary code that's going to visualize these expression graphs for us. Here's the code, and I'll explain it in a bit, but first let me just show you what this code does.
25:14
Basically, it creates a new function draw_dot that we can call on some root node, and then it's going to visualize it. So if we call draw_dot on d, which is this final value here, that is a times b plus c, it creates something like this. So this is d, and you see that this is a times b creating an intermediate value, and plus c gives us this output node d. So that's draw_dot of d.
25:44
And I'm not going to go through this in complete detail. You can take a look at Graphviz and its API; Graphviz is an open-source graph visualization software, and what we're doing here is building out this graph in the Graphviz API.
25:56
You can basically see that trace is this helper function that enumerates all of the nodes and edges in the graph, so that just builds a set of all the nodes and edges. Then we iterate over all the nodes and create special node objects for them using dot.node, and then we also create edges using dot.edge.
26:16
The only thing that's slightly tricky here is that, you'll notice, I basically add these fake nodes, which are these operation nodes. So for example this node here is just a plus node, and I create these special op nodes here and connect them accordingly. These nodes of course are not actual nodes in the original graph; they're not actually Value objects. The only Value objects here are the things in squares: those are actual Value objects, or representations thereof, and these op nodes are just created in this draw_dot routine so that it looks nice.
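The lecture's draw_dot uses the Graphviz Python API (dot.node and dot.edge). The sketch below shows the same idea without that dependency: trace walks the graph, and a hypothetical helper to_dot (a name invented here) writes DOT text by hand, including the extra operation nodes. The node naming scheme and label format are assumptions.

```python
class Value:
    def __init__(self, data, _children=(), _op=""):
        self.data = data
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), "+")

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), "*")


def trace(root):
    """Collect every node and every (child, parent) edge reachable from root."""
    nodes, edges = set(), set()

    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)

    build(root)
    return nodes, edges


def to_dot(root):
    """Hypothetical helper: render the graph as DOT text, left to right."""
    nodes, edges = trace(root)
    lines = ["digraph {", "  rankdir=LR;"]
    for n in nodes:
        uid = str(id(n))
        # One rectangular record node per Value.
        lines.append(f'  "{uid}" [shape=record, label="{{ data {n.data:.4f} }}"];')
        if n._op:
            # Fake op node: exists only in the drawing, not in the Value graph.
            lines.append(f'  "{uid}{n._op}" [label="{n._op}"];')
            lines.append(f'  "{uid}{n._op}" -> "{uid}";')
    for child, parent in edges:
        # Children point at the parent's op node rather than at the parent itself.
        lines.append(f'  "{id(child)}" -> "{id(parent)}{parent._op}";')
    lines.append("}")
    return "\n".join(lines)


a, b, c = Value(2.0), Value(-3.0), Value(10.0)
d = a * b + c
print(to_dot(d))
```

The printed text can be pasted into any Graphviz renderer; routing edges through the op nodes is what makes the picture read as "a and b go into *, the result and c go into +".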
26:57
Let's also add labels to these graphs, just so we know what variables are where. So let's create a special _label, or let's just do label, equal to empty by default, and save it in each node.
27:11
And then here we're going to do label a, label b, label c. And then let's create a special e equals a times b, and e.label will be 'e'. It's kind of naughty. And d will be e plus c, and d.label will be 'd'. Okay, so nothing really changes; I just added this new e variable.
27:48
And then here, when we are printing this, I'm going to print the label here, so this will be a %s, a bar, and this will be n.label. And so now we have the label on the left here, so it says a and b creating e, and then e plus c creates d, just like we have it here.
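A sketch of the labelling step, assuming label is a keyword argument that defaults to an empty string. The helper node_label is a made-up name, and its format string is a guess at the "%s, bar, data" layout described above.

```python
class Value:
    def __init__(self, data, _children=(), _op="", label=""):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), "+")

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), "*")


def node_label(n):
    # Hypothetical helper: the text placed inside each record node of the drawing.
    return "{ %s | data %.4f }" % (n.label, n.data)


a = Value(2.0, label="a")
b = Value(-3.0, label="b")
c = Value(10.0, label="c")
e = a * b
e.label = "e"   # results of operations get their label assigned afterwards
d = e + c
d.label = "d"

print(node_label(e))
print(node_label(d))
```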
28:12
And finally, let's make this expression just one layer deeper. So d will not be the final output node. Instead, after d we are going to create a new Value object called f (we're going to start running out of variables soon). f will be negative 2.0, and its label will of course just be 'f'. And then L, capital L, will be the output of our graph, and L will be d times f. Okay, so L will be negative eight; that is the output.
28:42
So now we don't just draw d, we draw L. Okay, and somehow the label of L was undefined. Oops, the label has to be explicitly given to it. There we go. So L is the output.
29:03
So let's quickly recap what we've done so far. We are able to build out mathematical expressions using only plus and times so far. They are scalar-valued along the way, and we can do this forward pass and build out a mathematical expression. So we have multiple inputs here, a, b, c, and f, going into a mathematical expression that produces a single output L. And this here is visualizing the forward pass, so the output of the forward pass is negative eight. That's the value.
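The recap can be reproduced end to end with the sketch below. The class mirrors the pieces described so far; the helper leaves is an invented name used only to list the inputs of the graph.

```python
class Value:
    def __init__(self, data, _children=(), _op="", label=""):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), "+")

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), "*")


def leaves(root):
    """Hypothetical helper: labels of all nodes that no operation produced."""
    found, seen, stack = [], set(), [root]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        if not v._prev:
            found.append(v.label)
        stack.extend(v._prev)
    return sorted(found)


a = Value(2.0, label="a")
b = Value(-3.0, label="b")
c = Value(10.0, label="c")
e = a * b; e.label = "e"
d = e + c; d.label = "d"
f = Value(-2.0, label="f")
L = d * f; L.label = "L"   # the label has to be given explicitly

print(L)           # output of the forward pass
print(leaves(L))   # the inputs feeding the expression
```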
29:33
Now what we'd like to do next is run backpropagation. In backpropagation we are going to start here at the end, and we're going to go in reverse and calculate the gradient along all these intermediate values. And really, what we're computing for every single value here is the derivative of L with respect to that node.
29:55
So the derivative of L with respect to L is just one, and then we're going to derive what is the derivative of L with respect to f, with respect to d, with respect to c, with respect to e, with respect to b, and with respect to a.
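To make "the derivative of L with respect to a node" concrete, here is a rough numerical check on plain floats: nudge one input by a small h and see how much L moves. This is only a sanity check under assumed names (forward, h, estimates); it is not the backpropagation procedure itself.

```python
def forward(a, b, c, f):
    """Forward pass on plain floats: L = (a*b + c) * f."""
    e = a * b
    d = e + c
    return d * f


inputs = {"a": 2.0, "b": -3.0, "c": 10.0, "f": -2.0}
h = 1e-4  # nudge size; an arbitrary small number chosen for this sketch

base = forward(**inputs)
estimates = {}
for name in inputs:
    nudged = dict(inputs)
    nudged[name] += h
    # Slope of L when only this one input changes.
    estimates[name] = (forward(**nudged) - base) / h

for name, slope in estimates.items():
    print(f"dL/d{name} is roughly {slope:.4f}")
```

Because L is linear in each input taken one at a time, these estimates land very close to the exact derivatives, for example f times b for the input a.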
30:12
And in the neural network setting, you'd be very interested in the derivative of basically this loss function L with respect to the weights of a neural network. Here, of course, we have just these variables a, b, c, and f, but some of these will eventually represent the weights of a neural net, and so we'll need to know how those weights are impacting the loss function.
30:29
So we'll be interested basically in the derivative of the output with respect to some of its leaf nodes, and those leaf nodes will be the weights of the neural net. The other leaf nodes of course will be the data itself, but usually we will not want or use the derivative of the loss function with respect to data, because the data is fixed, but the weights will be iterated on using the gradient information.
30:52
using the gradient information so next
30:52
using the gradient information so next we are going to create a variable inside
30:54
we are going to create a variable inside
30:54
we are going to create a variable inside the value class that maintains the
30:57
the value class that maintains the
30:57
the value class that maintains the derivative of l with respect to that
30:59
derivative of l with respect to that
30:59
derivative of l with respect to that value
31:00
value
31:00
value and we will call this variable grad
31:03
and we will call this variable grad
31:03
and we will call this variable grad so there's a data and there's a
31:05
so there's a data and there's a
31:05
so there's a data and there's a self.grad
31:07
self.grad
31:07
self.grad and initially it will be zero and
31:09
and initially it will be zero and
31:09
and initially it will be zero and remember that zero is basically means no
31:12
remember that zero is basically means no
31:12
remember that zero is basically means no effect so at initialization we're
31:14
effect so at initialization we're
31:14
effect so at initialization we're assuming that every value does not
31:16
assuming that every value does not
31:16
assuming that every value does not impact does not affect the out the
31:18
impact does not affect the out the
31:18
impact does not affect the out the output
31:19
output
31:19
output right because if the gradient is zero
31:21
right because if the gradient is zero
31:21
right because if the gradient is zero that means that changing this variable
31:23
that means that changing this variable
31:23
that means that changing this variable is not changing the loss function
31:25
is not changing the loss function
31:25
is not changing the loss function so by default we assume that the
31:27
so by default we assume that the
31:27
so by default we assume that the gradient is zero
31:28
gradient is zero
31:28
and then
31:31
now that we have grad and it's 0.0
31:36
we are going to be able to visualize it
31:38
here after data so here grad is %.4f
31:42
and this will be in that graph
31:45
and now we are going to be showing both
31:47
the data and the grad
31:50
initialized at zero
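The idea described above can be sketched roughly as follows. This is a stripped-down `Value` class, not the full micrograd code; the `_children`, `_op` and `label` arguments are assumed to match the earlier part of the lecture, and the `node_text` line is only a hypothetical illustration of the `%.4f` formatting, not the actual drawing code.

```python
class Value:
    """Stripped-down scalar node: stores data plus a slot for dL/d(this value)."""

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0  # assume "no effect on L" until backpropagation fills this in
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')


# the expression used in the lecture: L = (a*b + c) * f
a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a * b; e.label = 'e'
d = e + c; d.label = 'd'
f = Value(-2.0, label='f')
L = d * f; L.label = 'L'

# hypothetical text for one graph node, showing data and grad with %.4f
node_text = "{ %s | data %.4f | grad %.4f }" % (L.label, L.data, L.grad)
print(node_text)
```

Every node starts with `grad` equal to zero, so the graph shows `0.0000` everywhere until the gradients are filled in.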
31:53
and we are just about getting ready to
31:55
calculate the backpropagation
31:57
and of course this grad again as i
31:58
mentioned is representing
32:00
the derivative of the output in this
32:02
case l with respect to this value so
32:05
with respect to so this is the
32:06
derivative of l with respect to f with
32:08
respect to d and so on so let's now fill
32:11
in those gradients and actually do
32:12
backpropagation manually so let's start
32:14
filling in these gradients and start all
32:16
the way at the end as i mentioned here
32:18
first we are interested to fill in this
32:20
gradient here so what is the derivative
32:22
of l with respect to l
32:25
in other words if i change l by a tiny
32:27
amount h
32:29
how much does
32:30
l change
32:32
it changes by h so it's proportional and
32:35
therefore the derivative will be one
32:37
we can of course measure these or
32:39
estimate these gradients
32:40
numerically just like we've seen before
32:43
so if i take this expression
32:45
and i create a def lol function here
32:49
and put this here now the reason i'm
32:51
creating a gating function lol here is
32:53
because i don't want to pollute or mess
32:55
up the global scope here this is just
32:57
kind of like a little staging area and
32:58
as you know in python all of these will
33:00
be local variables to this function so
33:02
i'm not changing any of the global scope
33:04
here
33:05
so here l1 will be l
33:10
and then copy pasting this expression
33:13
we're going to add a small amount h
33:17
in for example a
33:20
right and this would be measuring the
33:22
derivative of l with respect to a
33:25
so here this will be l2
33:28
and then we want to print this
33:29
derivative so print
33:31
l2 minus l1 which is how much l changed
33:35
and then normalize it by h so this is
33:37
the rise over run
33:39
and we have to be careful because l is a
33:41
value node so we actually want its data
33:45
um
33:46
so that these are floats dividing by h
33:48
and this should print the derivative of
33:50
l with respect to a because a is the one
33:53
that we bumped a little bit by h
33:55
so what is the
33:57
derivative of l with respect to a
33:59
it's six
34:01
okay and obviously
34:03
if we change l by h
34:06
then that would be
34:09
here effectively
34:12
this looks really awkward but changing l
34:14
by h
34:16
you see the derivative here is 1 um
34:20
that's kind of like the base case of
34:23
what we are doing here
34:24
so basically we can now come up here and
34:26
we can manually set l.grad to one this
34:29
is our manual backpropagation
34:31
l.grad is one and let's redraw
34:35
and we'll see that we filled in grad as
34:37
1 for l
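The numerical check described above might look something like the sketch below. As a simplifying assumption it works on plain floats instead of `Value` objects (so there is no `.data` to unwrap), and the `forward` helper is a name introduced here for illustration only.

```python
def forward(a=2.0, b=-3.0, c=10.0, f=-2.0):
    """Forward pass of the lecture's expression L = (a*b + c) * f, on plain floats."""
    e = a * b
    d = e + c
    return d * f


def lol(h=0.001):
    """Staging area: everything in here is local, so the global scope stays untouched."""
    L1 = forward()
    L2 = forward(a=2.0 + h)        # bump a by a small amount h
    dL_da = (L2 - L1) / h          # rise over run
    dL_dL = ((L1 + h) - L1) / h    # base case: bump L itself by h
    return dL_da, dL_dL


dL_da, dL_dL = lol()
print(dL_da, dL_dL)  # approximately 6 and 1, up to floating-point error
```

The second number is the base case of backpropagation: bumping `L` by `h` changes `L` by exactly `h`, so `L.grad` can be set to 1 by hand.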
34:39
we're now going to continue the
34:40
backpropagation so let's here look at the
34:42
derivatives of l with respect to d and f
34:45
let's do d first
34:47
so what we are interested in if i create
34:49
a markdown node here is we'd like to know
34:51
basically we have that l is d times f
34:54
and we'd like to know what is uh
34:57
dl by dd
35:00
what is that
35:01
and if you know your calculus uh l is d
35:03
times f so what is dl by dd it would
35:06
be f
35:08
and if you don't believe me we can also
35:10
just derive it because the proof would
35:11
be fairly straightforward uh we go to
35:14
the
35:15
definition of the derivative which is f
35:18
of x plus h minus f of x divide h
35:22
as a limit of h goes to zero of
35:24
this kind of expression so when we have
35:26
l is d times f
35:28
then increasing d by h
35:31
would give us the output of d plus h
35:33
times f
35:35
that's basically f of x plus h right
35:38
minus d times f
35:42
and then divide h and symbolically
35:44
expanding out here we would have
35:46
basically d times f plus h times f minus
35:50
d times f divide h
35:52
and then you see how the df minus df
35:54
cancels so you're left with h times f
35:57
divide h
35:58
which is f
35:59
so in the limit as h goes to zero of
36:03
you know
36:04
derivative
36:06
definition we just get f in the case of
36:09
d times f
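Written out, the spoken derivation above amounts to the following, with L = d · f and f held fixed (the notation is added here for readability; the steps are the ones described in the passage):

```latex
\frac{dL}{dd}
  = \lim_{h \to 0} \frac{(d + h)\,f - d\,f}{h}
  = \lim_{h \to 0} \frac{d f + h f - d f}{h}
  = \lim_{h \to 0} \frac{h f}{h}
  = f
```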
36:12
so
36:13
symmetrically
36:14
dl by
36:15
df will just be d
36:18
so what we have is that f.grad
36:21
we see now is just the value of d
36:24
which is 4
36:28
and we see that
36:30
d.grad
36:31
is just uh the value of f
36:36
and so the value of f is negative two
36:41
so we'll set those manually
36:45
let me erase this markdown node and then
36:47
let's redraw what we have
36:50
okay
36:51
and let's just make sure that these were
36:53
correct so we seem to think that dl by
36:56
dd is negative two so let's double check
36:59
um let me erase this plus h from before
37:02
and now we want the derivative with
37:03
respect to f
37:05
so let's just come here when i create f
37:06
and let's do a plus h here and this
37:08
should print the derivative of l with
37:10
respect to f so we expect to see four
37:14
yeah and this is four up to floating
37:16
point
37:17
funkiness
37:18
and then dl by dd
37:21
should be f which is negative two
37:25
grad is negative two
37:26
so if we again come here and we change d
37:31
d.data += h right here
37:35
so we expect so we've added a little h
37:37
and then we see how l changed and we
37:40
expect to print
37:42
uh negative two
37:44
there we go
37:47
so we've numerically verified what we're
37:49
doing here is kind of like an
37:50
inline gradient check gradient check is
37:53
when we
37:54
are deriving this like backpropagation
37:56
and getting the derivative with respect
37:57
to all the intermediate results and then
38:00
numerical gradient is just you know
38:03
estimating it using small step size
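A minimal sketch of such an inline gradient check for the last node, L = d · f, is shown below. It compares the manually derived gradients with finite-difference estimates; the `loss` helper and the dictionary names are introduced here for illustration and are not part of the lecture's notebook.

```python
def loss(d, f):
    """The last operation of the lecture's graph: L = d * f."""
    return d * f


d, f = 4.0, -2.0
h = 0.001  # small step size

# gradients derived by hand: dL/dd = f and dL/df = d
manual = {"d": f, "f": d}

# numerical estimates: bump one input at a time and measure rise over run
numeric = {
    "d": (loss(d + h, f) - loss(d, f)) / h,
    "f": (loss(d, f + h) - loss(d, f)) / h,
}

for name in manual:
    print(name, manual[name], numeric[name])

# largest disagreement between the two, expected to be tiny
max_err = max(abs(manual[k] - numeric[k]) for k in manual)
```

The two columns should agree up to floating-point error, which is what the lecture observes when it prints 4 and negative two.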
38:06
now we're getting to the crux of
38:08
backpropagation so this will be the most
38:10
important node to understand because if
38:12
you understand the gradient for this
38:14
node you understand all of
38:16
backpropagation and all of training of
38:17
neural nets basically
38:19
so we need to derive dl by dc
38:23
in other words the derivative of l with
38:24
respect to c
38:26
because we've computed all these other
38:27
gradients already
38:29
now we're coming here and we're
38:30
continuing the backpropagation manually
38:33
so we want dl by dc and then we'll also
38:36
derive dl by de
38:38
now here's the problem
38:40
how do we derive dl
38:41
by dc
38:44
we actually know the derivative of l with
38:46
respect to d so we know how l is sensitive
38:48
to d
38:50
but how is l sensitive to c so if we
38:53
wiggle c how does that impact l
38:55
through d
38:58
so we know dl by dd
39:01
and we also here know how c impacts d
39:04
and so just very intuitively if you know
39:06
the impact that c is having on d and the
39:09
impact that d is having on l
39:11
then you should be able to somehow put
39:12
that information together to figure out
39:14
how c impacts l
39:16
and indeed this is what we can actually
39:18
do so in particular we know just
39:20
concentrating on d first let's look at
39:22
what is the derivative basically of
39:24
d with respect to c so in other words
39:27
what is dd by dc
39:31
so here we know that d is c plus
39:34
e
39:35
that's what we know and now we're
39:37
interested in dd by dc
39:39
if you just know your calculus again and
39:41
you remember that differentiating c plus
39:43
e with respect to c you know that that
39:45
gives you
39:46
1.0
39:47
and we can also go back to the basics
39:49
and derive this because again we can go
39:51
to our f of x plus h minus f of x
39:54
divide by h
39:56
that's the definition of a derivative as
39:58
h goes to zero
40:00
and so here
40:01
focusing on c and its effect on d
40:04
we can basically do the f of x plus h
40:06
will be
40:07
c is incremented by h plus e
40:10
that's the first evaluation of our
40:12
function minus
40:14
c plus e
40:16
and then divide h
40:18
and so what is this
40:19
uh just expanding this out this will be
40:21
c plus h plus e minus c minus e
40:25
divide h and then you see here how c
40:27
minus c cancels e minus e cancels we're
40:30
left with h over h which is 1.0
40:33
and so
40:35
by symmetry also dd by
40:38
de
40:39
will be 1.0 as well
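In symbols, the derivation above for the plus node d = c + e reads as follows (notation added here; the steps follow the spoken expansion):

```latex
\frac{dd}{dc}
  = \lim_{h \to 0} \frac{\big((c + h) + e\big) - (c + e)}{h}
  = \lim_{h \to 0} \frac{c + h + e - c - e}{h}
  = \lim_{h \to 0} \frac{h}{h}
  = 1,
\qquad
\frac{dd}{de} = 1 \quad \text{(by symmetry)}
```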
40:42
so basically the derivative of a sum
40:44
expression is very simple and this
40:46
is the local derivative so i call this
40:49
the local derivative because we have the
40:51
final output value all the way at the
40:52
end of this graph and we're now like a
40:54
small node here
40:55
and this is a little plus node
40:58
and the little plus node doesn't know
41:00
anything about the rest of the graph
41:02
that it's embedded in all it knows is
41:04
that it did a plus it took a c and an e
41:07
added them and created d
41:09
and this plus node also knows the local
41:11
influence of c on d or rather the
41:14
derivative of d with respect to c and it
41:16
also
41:17
knows the derivative of d with respect
41:18
to e but that's not what we want that's
41:21
just a local derivative what we actually
41:23
want is dl by dc and l is here
41:27
just one step away but in a general case
41:30
this little plus node could be
41:32
embedded in like a massive graph
41:34
so
41:35
again we know how d impacts l and now we
41:38
know how c and e impact d how do we put
41:41
that information together to write dl by
41:43
dc and the answer of course is the chain
41:46
rule in calculus
41:47
and so um
41:50
i pulled up a chain rule here from
41:51
wikipedia
41:52
and
41:53
i'm going to go through this very
41:54
briefly so chain rule
41:57
wikipedia sometimes can be very
41:58
confusing and calculus
42:00
can be very confusing like this is the
42:02
way i
42:03
learned
42:05
chain rule and it was very confusing
42:06
like what is happening it's just
42:08
complicated so i like this expression
42:10
much better
42:12
if a variable z depends on a variable y
42:15
which itself depends on the variable x
42:18
then z depends on x as well obviously
42:20
through the intermediate variable y
42:22
in this case the chain rule is expressed
42:24
as
42:25
if you want dz by dx
42:28
then you take the dz by dy and you
42:30
multiply it by dy by dx
42:33
so the chain rule fundamentally is
42:34
telling you
42:36
how
42:37
we chain these
42:39
uh derivatives together
42:41
correctly so to differentiate through a
42:44
function composition
42:46
we have to apply a multiplication
42:48
of
42:49
those derivatives
42:51
so that's really what chain rule is
42:53
telling us
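The expression read out above, for z depending on y and y depending on x, is the chain rule in Leibniz notation:

```latex
\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}
```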
42:54
and there's a nice little intuitive
42:56
explanation here which i also think is
42:58
kind of cute the chain rule says that
42:59
knowing the instantaneous rate of change
43:01
of z with respect to y and y relative to
43:03
x allows one to calculate the
43:04
instantaneous rate of change of z
43:06
relative to x
43:07
as a product of those two rates of
43:09
change
43:10
simply the product of those two
43:12
so here's a good one
43:14
if a car travels twice as fast as a
43:16
bicycle and the bicycle is four times as
43:18
fast as a walking man
43:19
then the car travels two times four
43:22
eight times as fast as the man
43:25
and so this makes it very clear that the
43:27
correct thing to do sort of
43:29
is to multiply
43:30
so
43:31
car is twice as fast as bicycle and
43:33
bicycle is four times as fast as man
43:36
so the car will be eight times as fast
43:38
as the man and so we can take these
43:42
intermediate rates of change if you will
43:44
and multiply them together
43:46
and multiply them together and that justifies the
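As a quick sanity check of the speed analogy, here is a minimal Python sketch. It is not from the lecture: the function names, the `slope` helper, and the walking speed of 5.0 are made up for illustration. It estimates each rate of change numerically and compares the product of the two intermediate rates with the end-to-end rate.

```python
# Chain rule check using the speeds analogy:
# car = 2 * bicycle, bicycle = 4 * walking man.

def bicycle_speed(man_speed):
    # y as a function of x: the bicycle is four times as fast as the man
    return 4.0 * man_speed

def car_speed(bike_speed):
    # z as a function of y: the car is twice as fast as the bicycle
    return 2.0 * bike_speed

def slope(fn, x, h=1e-6):
    # finite-difference estimate of the rate of change of fn at x
    return (fn(x + h) - fn(x)) / h

x = 5.0                  # walking speed (arbitrary illustrative value)
y = bicycle_speed(x)

dy_dx = slope(bicycle_speed, x)                           # about 4
dz_dy = slope(car_speed, y)                               # about 2
dz_dx = slope(lambda v: car_speed(bicycle_speed(v)), x)   # about 8

print(dz_dy * dy_dx, dz_dx)  # the two numbers should agree closely
```

Both printed values should come out close to 8, which is the "multiply the intermediate rates" rule in action.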
43:48
And that justifies the chain rule intuitively. So have a look at the chain rule here. Really what it means for us is that there's a very simple recipe for deriving what we want, which is dL/dc. What we have so far is that we know what we want, and we know what the impact of d on L is. So we know dL/dd, the derivative of L with respect to d: we know that that's negative two.
44:19
And now, because of this local reasoning that we've done here, we know dd/dc. So how does c impact d? In particular, this is a plus node, so the local derivative is simply 1.0. It's very simple. And so the chain rule tells us that dL/dc, going through this intermediate variable, will just be dL/dd times dd/dc. That's the chain rule.
44:55
So this is identical to what's happening here, except z is our L, y is our d, and x is our c. So we literally just have to multiply these. And because these local derivatives, like dd/dc, are just one, we basically just copy over dL/dd, because this is just times one. So what does it do? Because dL/dd is negative two, what is dL/dc? Well, it's the local gradient, 1.0, times dL/dd, which is negative two. So literally what a plus node does, you can look at it that way, is it just routes the gradient, because the plus node's local derivatives are just one. And so in the chain rule, one times dL/dd is just dL/dd, and so that derivative just gets routed to both c and e in this case.
45:59
So basically we have that c.grad, and let's start with c since that's the one we looked at, is negative two times one, so negative two. And in the same way, by symmetry, e.grad will be negative two. That's the claim. So we can set those, we can redraw, and you see how we just assigned negative two and negative two. So this backpropagating signal, which is carrying the information of what the derivative of L is with respect to all the intermediate nodes, we can imagine it almost like flowing backwards through the graph, and a plus node will simply distribute the derivative to all the children nodes of it.
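The routing behaviour of the plus node can be written out as plain arithmetic. This is only a sketch using ordinary Python floats rather than the lecture's node objects; the variable names are illustrative, and the one number taken from the walkthrough is dL/dd = -2.

```python
# d = e + c is a plus node, and the walkthrough found dL/dd = -2.
dL_dd = -2.0

# Local derivatives of a plus node with respect to each input are 1.0.
dd_dc = 1.0
dd_de = 1.0

# Chain rule: upstream derivative times local derivative.
c_grad = dL_dd * dd_dc
e_grad = dL_dd * dd_de

print(c_grad, e_grad)  # -2.0 -2.0
```

Because both local derivatives are 1.0, the incoming gradient is passed through unchanged to each input.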
46:40
So this is the claim, and now let's verify it. Let me remove the plus h here from before, and now instead what we're going to do is increment c, so c.data will be incremented by h. And when I run this, we expect to see negative 2. Negative 2. And then of course for e: e.data += h, and we expect to see negative 2. Simple.
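The numerical check described here can be sketched with plain floats. The values a = 2.0 and b = -3.0 and the derivative dL/dd = -2 appear in the walkthrough; c = 10.0, f = -2.0, the final step L = d * f, and h = 0.001 are assumed to match the setup from earlier in the lecture. The `forward` helper and its `e_bump` argument are hypothetical, added only so that the intermediate node e can be nudged directly.

```python
def forward(a=2.0, b=-3.0, c=10.0, f=-2.0, e_bump=0.0):
    # Same expression as the graph: e = a*b, d = e + c, L = d*f.
    # e_bump nudges the intermediate node e after it is computed.
    e = a * b + e_bump
    d = e + c
    return d * f

h = 0.001            # assumed nudge size
L1 = forward()       # baseline value of L

# Nudge c by h, then nudge e by h, and measure the change in L per unit h.
dL_dc = (forward(c=10.0 + h) - L1) / h
dL_de = (forward(e_bump=h) - L1) / h

print(dL_dc, dL_de)  # both should be close to -2
```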
47:07
So those are the derivatives of these internal nodes, and now we're going to recurse our way backwards again, and we're again going to apply the chain rule. So here we go, our second application of the chain rule, and we will apply it all the way through the graph; we just happen to only have one more node remaining. We have that dL/de, as we have just calculated, is negative two. So we know that: we know the derivative of L with respect to e.
47:36
And now we want dL/da, right? And the chain rule is telling us that that's just dL/de, negative 2, times the local gradient. So what is the local gradient? Basically de/da. We have to look at that.
48:02
So I'm a little times node inside a massive graph, and I only know that I did a times b and I produced an e. So now what is de/da and de/db? That's the only thing that I know about; that's my local gradient. So because we have that e is a times b, we're asking what is de/da. And of course we just did that here, where we had a times, so I'm not going to rederive it. But if you want to differentiate this with respect to a, you'll just get b, right? The value of b, which in this case is negative 3.0.
48:41
So basically we have that dL/da... well, let me just do it right here. We have that a.grad, and we are applying the chain rule here, is dL/de, which we see here is negative two, times de/da, which is the value of b, negative 3. That's it.
49:07
And then we have that b.grad is again dL/de, which is negative 2, just the same way, times de/db, which is the value of a, 2.0. So these are our claimed derivatives. Let's redraw. And we see here that a.grad turns out to be 6, because that is negative 2 times negative 3, and b.grad is negative 2 times 2, which is negative 4.
49:47
So those are our claims. Let's delete this and let's verify them. We have a here: a.data += h. So the claim is that a.grad is six. Let's verify: six. And we have b.data += h, so nudging b by h and looking at what happens. We claim it's negative four, and indeed it's negative four, plus or minus, again, float oddness.
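The two claims for the times node, and their numerical check, can be sketched as follows. The values a = 2.0, b = -3.0 and dL/de = -2 come from the walkthrough; c = 10.0, f = -2.0, the final step L = d * f, and h = 0.001 are assumptions about the earlier setup, and the `forward` helper is hypothetical.

```python
def forward(a, b, c=10.0, f=-2.0):
    # e = a*b, d = e + c, L = d*f
    return (a * b + c) * f

a, b = 2.0, -3.0
e_grad = -2.0              # dL/de, found in the previous step

# Local derivatives of e = a*b: de/da = b and de/db = a.
a_grad = e_grad * b        # -2 * -3 = 6
b_grad = e_grad * a        # -2 *  2 = -4

# Numerical check: nudge one input by h and measure the change in L.
h = 0.001                  # assumed nudge size
num_a = (forward(a + h, b) - forward(a, b)) / h
num_b = (forward(a, b + h) - forward(a, b)) / h

print(a_grad, num_a)       # analytic 6.0 next to its numerical estimate
print(b_grad, num_b)       # analytic -4.0 next to its numerical estimate
```

The numerical estimates will not be exactly 6 and -4 because of floating-point rounding, which is the "float oddness" mentioned above.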
50:24
That's it. That was the manual backpropagation, all the way from here to all the leaf nodes, and we've done it piece by piece. And really all we've done is, as you saw, we iterated through all the nodes one by one and locally applied the chain rule. We always know what the derivative of L is with respect to this little output, and then we look at how this output was produced. This output was produced through some operation, and we have the pointers to the children nodes of this operation. And so in this little operation we know what the local derivatives are, and we just multiply them onto the derivative, always. So we just go through and recursively multiply on the local derivatives, and that's what backpropagation is: just a recursive application of the chain rule backwards through the computation graph.
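Here is a small sketch of that recursive idea. The `Node` class and the `add`, `mul` and `backprop` helpers are hypothetical and deliberately simpler than micrograd's own code; c = 10.0 and f = -2.0 are assumed from the earlier setup. Each node remembers its children and the local derivative with respect to each child, and `backprop` multiplies the upstream derivative by the local one on the way down.

```python
class Node:
    """Hypothetical minimal graph node; not micrograd's Value class."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.children = children        # nodes this one was computed from
        self.local_grads = local_grads  # d(self)/d(child), one per child
        self.grad = 0.0

def add(x, y):
    # plus node: local derivative is 1.0 for both inputs
    return Node(x.data + y.data, (x, y), (1.0, 1.0))

def mul(x, y):
    # times node: local derivative is the other input's value
    return Node(x.data * y.data, (x, y), (y.data, x.data))

def backprop(node, upstream=1.0):
    # Accumulate dL/d(node), then pass upstream * local to each child.
    node.grad += upstream
    for child, local in zip(node.children, node.local_grads):
        backprop(child, upstream * local)

a, b, c, f = Node(2.0), Node(-3.0), Node(10.0), Node(-2.0)
e = mul(a, b)
d = add(e, c)
L = mul(d, f)

backprop(L)
print(a.grad, b.grad, c.grad, e.grad, d.grad, f.grad)
```

This plain recursion walks every path from the output to each node, which is fine for a tiny graph like this one. A practical implementation would more likely visit each node once in reverse topological order.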
51:12
Let's see this power in action, just very briefly. What we're going to do is nudge our inputs to try to make L go up. So in particular, we're going to change a.data, and if we want L to go up, that means we just have to go in the direction of the gradient. So a should increase in the direction of the gradient by some small step amount; this is the step size. And we don't just want this for a, but also for b, also for c, also for f. Those are leaf nodes, which we usually have control over. And if we nudge in the direction of the gradient, we expect a positive influence on L. So we expect L to go up, so it should become less negative. It should go up to, say, negative six or something like that. It's hard to tell exactly, and we'd have to rewrite the forward pass, so let me just do that here.
52:16
This would be the forward pass. f would be unchanged. This is effectively the forward pass, and now if we print L.data, we expect, because we nudged all the inputs in the direction of the gradient, a less negative L. We expect it to go up, so maybe it's negative six or so. Let's see what happens. Okay, negative seven. And this is basically one step of an optimization that we'll end up running. And really, this gradient just gives us some power, because we know how to influence the final outcome, and this will be extremely useful for training neural nets as well, as you'll see.
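That single optimization step can be sketched with plain floats. The gradients for a, b and c are the ones derived in the walkthrough; the starting values c = 10.0 and f = -2.0, the gradient of f (4.0, the value of d), and the step size of 0.01 are assumptions about the surrounding lecture code.

```python
# Leaf values and their gradients (dL/d_leaf).
a, b, c, f = 2.0, -3.0, 10.0, -2.0
a_grad, b_grad, c_grad, f_grad = 6.0, -4.0, -2.0, 4.0

L_before = (a * b + c) * f   # -8.0

step = 0.01                  # assumed step size

# Move each leaf a small amount in the direction of its gradient.
a += step * a_grad
b += step * b_grad
c += step * c_grad
f += step * f_grad

# Rerun the forward pass with the nudged leaves.
e = a * b
d = e + c
L = d * f

print(L_before, L)           # L rises from -8.0 to roughly -7.29
```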
52:55
So now I would like to do one more example of manual backpropagation, using a bit more complex and useful example. We are going to backpropagate through a neuron. So we want to eventually build up neural networks, and in the simplest case these are multi-layer perceptrons, as they're called. So this is a two-layer neural net, and it's got these hidden layers made up of neurons, and these neurons are fully connected to each other.
53:21
Now, biologically, neurons are very complicated devices, but we have very simple mathematical models of them. And so this is a very simple mathematical model of a neuron. You have some inputs, the x's, and then you have these synapses that have weights on them, so the w's are weights. And then the synapse interacts with the input to this neuron multiplicatively, so what flows to the cell body of this neuron is w times x. But there are multiple inputs, so there are many w times x's flowing into the cell body. The cell body then also has some bias. So this is kind of like the innate trigger-happiness of this neuron: this bias can make it a bit more trigger-happy or a bit less trigger-happy, regardless of the input. But basically we're taking all the w times x of all the inputs, adding the bias, and then we take it through an activation function.
54:18
And this activation function is usually some kind of a squashing function, like a sigmoid or tanh or something like that. So as an example, we're going to use tanh in this example. NumPy has np.tanh, so we can call it on a range and we can plot it. This is the tanh function, and you see that the inputs, as they come in, get squashed on the y coordinate here. So right at zero we're going to get exactly zero, and then as you go more positive in the input, you'll see that the function will only go up to one and then plateau out. And so if you pass in very positive inputs, we're going to cap it smoothly at one, and on the negative side we're going to cap it smoothly at negative one. So that's tanh, and that's the squashing function, or activation function. And what comes out of this neuron is just the activation function applied to the dot product of the weights and the inputs.
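To see the squashing without a plot, here is a small sketch that tabulates tanh at a few points. It uses the standard library's `math.tanh` in place of the np.tanh call mentioned above, purely to keep the example dependency-free; the sample points are arbitrary.

```python
import math

# A few sample inputs, from very negative to very positive.
xs = [-5.0, -2.0, -0.5, 0.0, 0.5, 2.0, 5.0]
ys = [math.tanh(x) for x in xs]

for x, y in zip(xs, ys):
    # tanh(0) is exactly 0; large |x| is squashed toward -1 or +1
    print(f"{x:5.1f} -> {y: .4f}")
```

The outputs stay strictly between -1 and 1 and flatten out at both ends, which is the plateau behaviour described above.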
55:19
so let's
55:19
so let's write one out
55:21
write one out
55:21
write one out um
55:22
um
55:22
um i'm going to copy paste because
55:27
i don't want to type too much
55:28
i don't want to type too much
55:28
i don't want to type too much but okay so here we have the inputs
55:31
but okay so here we have the inputs
55:31
but okay so here we have the inputs x1 x2 so this is a two-dimensional
55:33
x1 x2 so this is a two-dimensional
55:33
x1 x2 so this is a two-dimensional neuron so two inputs are going to come
55:34
neuron so two inputs are going to come
55:34
neuron so two inputs are going to come in
55:35
in
55:35
in these are thought out as the weights of
55:37
these are thought out as the weights of
55:37
these are thought out as the weights of this neuron
55:38
this neuron
55:38
this neuron weights w1 w2 and these weights again
55:41
weights w1 w2 and these weights again
55:41
weights w1 w2 and these weights again are the synaptic strengths for each
55:43
are the synaptic strengths for each
55:43
are the synaptic strengths for each input
55:45
input
55:45
input and this is the bias of the neuron
55:47
and this is the bias of the neuron
55:47
and this is the bias of the neuron b
55:49
Now what we want to do, according to this model, is multiply x1 times w1 and x2 times w2, and then add the bias on top of it.
56:01
It gets a little messy here, but all we are trying to do is x1*w1 + x2*w2 + b. These are multiplied here, except I'm doing it in small steps so that we actually have pointers to all these intermediate nodes. So we have an x1w1 variable and an x2w2 variable, and I'm also labeling them.
56:21
So n is now the cell body's raw activation, without the activation function for now. This should be enough to basically plot it: draw_dot(n) gives us x1 times w1 and x2 times w2 being added, then the bias gets added on top of this, and this n is the sum.
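The small-steps construction can be sketched roughly as follows. The Value class here is a stripped-down stand-in written for this note, not the lecture's full class. The input and weight values are inferred from the gradients quoted later in the lecture, and the full-precision bias is an assumption; the bias on screen at this point may differ.

```python
class Value:
    """Minimal stand-in for the lecture's Value: a number plus graph bookkeeping."""

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)  # nodes this value was computed from
        self._op = _op               # operation that produced this value
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')


# Inputs, weights and bias (assumed values, see the note above).
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
b = Value(6.8813735870195432, label='b')

# Build x1*w1 + x2*w2 + b in small steps so every intermediate node has a name.
x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'

print(n)
```

Keeping a handle on each intermediate node is what later lets a gradient be written onto every step of the expression.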
56:49
So we're now going to take it through an activation function, and let's say we use tanh so that we produce the output. What we'd like to do here is compute the output, and I'll call it o: o is n.tanh().
57:05
Okay, but we haven't yet written tanh. The reason we need to implement another tanh function here is that tanh is a hyperbolic function, and so far we've only implemented plus and times, and you can't make a tanh out of just pluses and times. You also need exponentiation.
57:22
So tanh is this kind of formula here. You can use either one of these, and you see that there's exponentiation involved, which we have not implemented yet for our Value node here. So we're not going to be able to produce tanh yet, and we have to go back up and implement something like it.
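The formulas on screen are presumably the standard exponential definitions of tanh. A minimal sketch, with function names made up for this note, that checks both forms against math.tanh:

```python
import math

def tanh_two_exps(x):
    # tanh(x) = (e^x - e^-x) / (e^x + e^-x)
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_one_exp(x):
    # tanh(x) = (e^(2x) - 1) / (e^(2x) + 1)
    e2x = math.exp(2 * x)
    return (e2x - 1) / (e2x + 1)

# Illustrative test points (an assumption, not values from the lecture).
points = [-2.0, 0.0, 0.5, 2.0]

for x in points:
    print(x, tanh_two_exps(x), tanh_one_exp(x), math.tanh(x))
```

Both forms need exponentiation, which is why plus and times alone are not enough.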
57:39
Now, one option here is that we could actually implement exponentiation, and return the exp of a value instead of the tanh of a value, because if we had exp then we'd have everything else that we need: we know how to add and we know how to multiply, so we'd be able to create tanh if we knew how to exp.
58:06
But for the purposes of this example I specifically wanted to show you that we don't necessarily need to have the most atomic pieces in this Value object. We can actually create functions at arbitrary points of abstraction. They can be complicated functions, but they can also be very simple functions like a plus, and it's totally up to us.
58:30
The only thing that matters is that we know how to differentiate through any one function. So we take some inputs and we make an output, and it can be an arbitrarily complex function as long as you know how to create the local derivative. If you know the local derivative of how the inputs impact the output, then that's all you need.
58:46
So we're going to cluster up all of this expression, and we're not going to break it down to its atomic pieces. We're just going to directly implement tanh.
58:55
So let's do that: def tanh(self), and then out will be a Value of... and we need this expression here, so let me actually copy-paste.
59:14
Let's grab n, which is self.data, and then this, I believe, is the tanh: math.exp(2*n) minus one, over math.exp(2*n) plus one. Maybe I can call this x, just so that it matches exactly.
59:35
Okay, and now this will be t. As for the children of this node, there's just one child, and I'm wrapping it in a tuple, so this is a tuple of one object, just self. And here the name of this operation will be 'tanh', and we're going to return that.
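Put together, the method described above might look roughly like this. It is a sketch reconstructed from the spoken description, with a stripped-down Value class written for this note. The input value for n is an assumption.

```python
import math

class Value:
    """Minimal stand-in for the lecture's Value, with only what tanh needs."""

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data})"

    def tanh(self):
        x = self.data
        # tanh written with a single exponential, as dictated in the lecture.
        # Note: math.exp can overflow for very large x.
        t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)
        # One child (self), wrapped in a one-element tuple; the op name is 'tanh'.
        out = Value(t, (self,), 'tanh')
        return out


# Assumed raw activation, used only to exercise the method.
n = Value(0.8813735870195432, label='n')
o = n.tanh(); o.label = 'o'
print(o)
```

The whole tanh is a single node in the graph, so backpropagation only needs its one local derivative.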
59:56
Okay, so now Value should be implementing tanh, and we can scroll all the way down here and actually do n.tanh(), and that's going to return the tanh'd output of n. And now we should be able to draw_dot of o, not of n. So let's see how that worked.
1:00:18
There we go: n went through tanh to produce this output. So now tanh is one of our little micrograd-supported nodes here, as an operation, and as long as we know the derivative of tanh, we'll be able to backpropagate through it.
1:00:37
Now let's see this tanh in action. Currently it's not squashing too much, because the input to it is pretty low. If the bias was increased to, say, 8, then we'll see that what's flowing into the tanh now is 2, and tanh is squashing it to 0.96. So we're already hitting the tail of this tanh, and it will smoothly go up to 1 and then plateau out over there.
1:01:03
Okay, so now I'm going to do something slightly strange: I'm going to change this bias from 8 to this number, 6.88 etc. I'm doing this for a specific reason, because we're about to start backpropagation, and I want to make sure that our numbers come out nice. They're not crazy numbers; they're nice numbers that we can understand in our head.
1:01:24
Let me also add o's label; o is short for output here. Okay, so 0.88 flows into tanh and 0.7 comes out, and so on. So now we're going to do backpropagation and fill in all the gradients.
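A quick numeric check of the two bias settings, written with plain floats rather than Value objects. The input and weight values (x1 = 2, x2 = 0, w1 = -3, w2 = 1) are inferred from the gradients quoted later in the lecture, and the full-precision bias is an assumption standing in for "6.88 etc".

```python
import math

def neuron(x1, x2, w1, w2, b):
    """Return the raw activation n and the squashed output o = tanh(n)."""
    n = x1 * w1 + x2 * w2 + b
    return n, math.tanh(n)

# Bias of 8: the raw activation is 2 and tanh is already near its tail.
n8, o8 = neuron(2.0, 0.0, -3.0, 1.0, 8.0)

# Assumed "nice" bias: the raw activation is about 0.88 and the output about 0.707.
n_nice, o_nice = neuron(2.0, 0.0, -3.0, 1.0, 6.8813735870195432)

print(n8, o8)
print(n_nice, o_nice, 1 - o_nice ** 2)
```

With the second bias, 1 - o**2 comes out at almost exactly 0.5, which is what makes the hand-computed gradients easy to follow.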
1:01:40
So what is the derivative of o with respect to all the inputs here? Of course, in a typical neural network setting, what we really care about the most is the derivative of these neurons with respect to the weights, specifically w2 and w1, because those are the weights that we're going to be changing as part of the optimization.
1:01:59
The other thing we have to remember is that here we have only a single neuron, but neural nets typically have many neurons, and they're connected. So this is only one small neuron, a piece of a much bigger puzzle, and eventually there's a loss function that measures the accuracy of the neural net, and we're backpropagating with respect to that accuracy and trying to increase it.
1:02:19
So let's start off backpropagation here at the end. What is the derivative of o with respect to o? The base case, as always, is that the gradient is just 1.0, so let me fill it in.
1:02:32
And then let me split out the drawing function into its own cell here, and clear this output. Okay, so now when we draw o we'll see that o.grad is one.
1:02:53
So now we're going to backpropagate through the tanh. To backpropagate through tanh, we need to know the local derivative of tanh. So if we have that o = tanh(n), then what is do/dn?
1:03:12
Now, what you could do is come here, take this expression, and do your calculus derivative-taking, and that would work. But we can also just scroll down Wikipedia here, into a section that hopefully tells us that derivative: d/dx of tanh(x) is any of these. I like this one: 1 minus tanh squared of x, that is, 1 - tanh(x)^2.
1:03:39
So basically what this is saying is that do/dn is 1 - tanh(n)^2, and we already have tanh(n): that's just o. So it's 1 - o^2.
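One way to sanity-check the formula 1 - tanh(n)^2 is to compare it with a finite-difference estimate. This is a sketch, not part of the lecture code; the helper names and the value of n are assumptions.

```python
import math

def dtanh(x):
    # Analytic local derivative: 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

def numeric_derivative(f, x, h=1e-6):
    # Central finite difference: (f(x + h) - f(x - h)) / (2h)
    return (f(x + h) - f(x - h)) / (2 * h)

# Assumed raw activation (the "0.88" that flows into tanh).
n = 0.8813735870195432

analytic = dtanh(n)
numeric = numeric_derivative(math.tanh, n)

print(analytic, numeric)
```

The two numbers should agree to several decimal places, which is a cheap check before writing a derivative into a backward pass.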
1:03:56
So o is the output here, and o.data is this number. What this is saying is that do/dn is 1 minus this squared, so 1 - o.data**2 is 0.5, conveniently. So the local derivative of this tanh operation here is 0.5, and that would be do/dn.
1:04:28
So we can fill in that n.grad is 0.5. We'll just fill it in. So this is exactly 0.5, one half.
1:04:45
So now we're going to continue the backpropagation. This is 0.5, and this is a plus node. So what is backprop going to do here? If you remember our previous example, a plus is just a distributor of gradient, so this gradient will simply flow to both of these equally. That's because the local derivative of this operation is one for every one of its input nodes, so 1 times 0.5 is 0.5.
1:05:12
So therefore we know that this node here, the sum of x1w1 and x2w2, has a grad of just 0.5, and we know that b.grad is also 0.5. So let's set those and let's draw.
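The plus-node step can be written out as a tiny chain-rule calculation. This sketch uses plain floats; the name s for the sum of the two products is hypothetical.

```python
# Gradient already known at the output of the plus node: do/dn.
n_grad = 0.5

# n = s + b, where s stands for x1*w1 + x2*w2 (s is a made-up name for this note).
# The local derivative of a sum with respect to each input is 1.
dn_ds = 1.0
dn_db = 1.0

# Chain rule: local derivative times the gradient flowing in from above.
s_grad = dn_ds * n_grad
b_grad = dn_db * n_grad

print(s_grad, b_grad)
```

Because both local derivatives are 1, the incoming gradient is copied unchanged to each input of the plus.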
1:05:28
So, 0.5. Continuing, we have another plus, and 0.5 again will just be distributed, so 0.5 will flow to both of these. So we can set theirs; x2w2.grad, as well, is 0.5. And let's redraw.
1:05:47
Pluses are my favorite operations to backpropagate through because it's very simple. So now what's flowing into these expressions is 0.5. And again, keep in mind what the derivative is telling us at every point along here: if we want the output of this neuron to increase, then the influence of these expressions on the output is positive. Both of them are a positive contribution to the output.
1:06:20
So now, backpropagating to x2 and w2 first. This is a times node, so we know that the local derivative is the other term. So if we want to calculate x2.grad, can you think through what it's going to be?
1:06:40
So x2.grad will be w2.data times x2w2.grad, and w2.grad will be x2.data times x2w2.grad. That's the local piece of the chain rule.
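The same two assignments, sketched with plain floats instead of Value objects. The forward values x2 = 0.0 and w2 = 1.0 are the ones the lecture reads off the graph; the variable names are chosen for this note.

```python
# Forward values of the two inputs to the times node.
x2_data = 0.0
w2_data = 1.0

# Gradient that has already flowed into the product node x2*w2.
x2w2_grad = 0.5

# For a product, the local derivative with respect to one factor is the other factor.
x2_grad = w2_data * x2w2_grad
w2_grad = x2_data * x2w2_grad

print(x2_grad, w2_grad)
```

Each factor's gradient is the other factor's value, scaled by whatever gradient arrives at the product.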
1:07:07
Let's set them and redraw. Here we see that the gradient on our weight w2 is 0, because x2.data was 0, but x2 will have the gradient 0.5, because the data here was 1.
1:07:20
What's interesting here is that because the input x2 was 0, then because of the way the times works, of course this gradient will be zero. Think about intuitively why that is. The derivative always tells us the influence of this on the final output. If I wiggle w2, how is the output changing? It's not changing, because we're multiplying by zero. So because it's not changing, there's no derivative, and zero is the correct answer, because we're squashing it at zero.
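The wiggle argument can be tried numerically. This sketch uses plain floats; x1 and w1 are inferred from the gradients quoted in the lecture, the full-precision bias is an assumption, and the step size h is an arbitrary choice.

```python
import math

def neuron(x1, x2, w1, w2, b):
    # Forward pass of the single neuron: tanh(x1*w1 + x2*w2 + b)
    return math.tanh(x1 * w1 + x2 * w2 + b)

x1, x2, w1, w2 = 2.0, 0.0, -3.0, 1.0
b = 6.8813735870195432  # assumed full-precision value of "6.88 etc"
h = 1e-3                # small wiggle

base = neuron(x1, x2, w1, w2, b)

# Wiggle w2: it is multiplied by x2 = 0, so the output cannot move.
dw2 = (neuron(x1, x2, w1, w2 + h, b) - base) / h

# Wiggle x2 instead: it is multiplied by w2 = 1, so the output does move.
dx2 = (neuron(x1, x2 + h, w1, w2, b) - base) / h

print(dw2, dx2)
```

The first slope is zero and the second is close to 0.5, matching the gradients filled in by hand.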
1:07:52
squashing it at zero
1:07:52
squashing it at zero and let's do it here point five should
1:07:54
and let's do it here point five should
1:07:54
and let's do it here point five should come here and flow through this times
1:07:57
come here and flow through this times
1:07:57
come here and flow through this times and so we'll have that x1.grad is
1:08:01
and so we'll have that x1.grad is
1:08:01
and so we'll have that x1.grad is can you think through a little bit what
1:08:03
can you think through a little bit what
1:08:03
can you think through a little bit what what
1:08:04
what
1:08:04
what this should be
1:08:07
The local derivative of times with respect to x1 is going to be w1, so x1.grad is w1.data times x1w1.grad, and w1.grad will be x1.data times x1w1.grad. Let's see what those came out to be. So this is 0.5, so this would be negative 1.5, and this would be 1.
1:08:36
And we've backpropagated through this expression. These are the actual final derivatives. So if we want this neuron's output to increase, we know what's necessary. For w2 we have no gradient: w2 doesn't actually matter to this neuron right now. But this weight, w1, should go up. If this weight goes up, then this neuron's output would have gone up, and proportionally, because the gradient is one.
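The hand computation above can be reproduced with a few lines of plain Python. This is only a sketch: the values x1 = 2.0, w1 = -3.0, x2 = 0.0, w2 = 1.0 are inferred from the gradients quoted in the transcript, the upstream gradient of 0.5 is the one flowing into both products, and the variable names are illustrative.

```python
# Gradient arriving at both products x1*w1 and x2*w2 (0.5 in the lecture).
upstream = 0.5

# Inputs and weights, inferred from the gradients quoted in the transcript.
x1, w1 = 2.0, -3.0
x2, w2 = 0.0, 1.0

# For a product a*b: d(a*b)/da = b and d(a*b)/db = a.
# The chain rule multiplies that local derivative by the upstream gradient.
x1_grad = w1 * upstream
w1_grad = x1 * upstream
x2_grad = w2 * upstream
w2_grad = x2 * upstream   # x2 is 0, so w2 cannot influence the output

print(x1_grad, w1_grad, x2_grad, w2_grad)  # -1.5 1.0 0.5 0.0
```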
1:09:03
Okay, so doing the backpropagation manually is obviously ridiculous. We are now going to put an end to this suffering, and we're going to see how we can implement the backward pass a bit more automatically. We're not going to be doing all of it manually out here. It's now pretty obvious to us, by example, how these pluses and times are backpropagating gradients.
1:09:20
So let's go up to the Value object, and we're going to start codifying what we've seen in the examples below. We're going to do this by storing a special self._backward, with an underscore, and this will be a function which is going to do that little piece of chain rule at each little node that took inputs and produced an output. We're going to store how we are going to chain the output's gradient into the inputs' gradients.
1:09:53
By default, this will be a function that doesn't do anything, and you can also see that here in the Value in micrograd. So this _backward function by default doesn't do anything. This is an empty function, and that would be the case, for example, for a leaf node. For a leaf node there's nothing to do.
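As a rough sketch of the default described here, a node can carry a no-op function in its _backward field. The constructor arguments and the _prev field below are assumptions about how the Value class looked earlier in the lecture, not a copy of it.

```python
class Value:
    """Minimal sketch of a scalar node; only the fields needed for this point."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0                  # gradients start at zero
        self._backward = lambda: None    # default: an empty function
        self._prev = set(_children)      # nodes this value was computed from


b = Value(6.0)     # a leaf node: it was not produced by any operation
b._backward()      # harmless, there is nothing to propagate
print(b.grad)      # 0.0
```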
1:10:18
But now, when we're creating these out values, and these out values are an addition of self and other, we will want to set out's _backward to be the function that propagates the gradient.
1:10:40
We're going to store it in a closure. Let's define what should happen when we call it.
1:10:47
For an addition, our job is to take out's grad and propagate it into self's grad and other's grad. So basically we want to set self.grad to something, and we want to set other.grad to something.
1:11:08
And the way we saw below how chain rule works, we want to take the local derivative times the, sort of, global derivative I should call it, which is the derivative of the final output of the expression with respect to out's data, with respect to out. The local derivative of self in an addition is 1.0, so it's just 1.0 times out.grad. That's the chain rule. And other.grad will be 1.0 times out.grad.
1:11:40
What you're basically seeing here is that out's grad will simply be copied onto self's grad and other's grad, as we saw happens for an addition operation. We're going to later call this function to propagate the gradient, having done an addition.
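The addition case just described might look like the sketch below. The class layout is assumed rather than copied from the lecture, and it uses plain assignment to the grad fields as described at this point, which is fine as long as each node feeds into only one operation.

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # Local derivative of a sum with respect to each input is 1.0,
            # so out.grad is simply copied to both inputs.
            self.grad = 1.0 * out.grad
            other.grad = 1.0 * out.grad

        out._backward = _backward   # store the closure, to be called later
        return out


a = Value(2.0)
b = Value(-3.0)
c = a + b
c.grad = 0.5        # pretend this gradient arrived from further downstream
c._backward()
print(a.grad, b.grad)  # 0.5 0.5
```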
1:11:57
Let's now do multiplication. We're going to also define a _backward, and we're going to set out's _backward to be _backward. We want to chain out.grad into self.grad and other.grad, and this will be a little piece of chain rule for multiplication. So what should this be? Can you think it through?
1:12:28
So what is the local derivative? Here the local derivative was other.data, and then, oops, other.data times out.grad. That's chain rule. And here we have self.data times out.grad. That's what we've been doing.
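A comparable sketch for multiplication, under the same assumptions about the class layout. The numbers are the x1 and w1 values inferred from the gradients quoted in the transcript, with 0.5 as the incoming gradient.

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # Local derivative of a product with respect to one input
            # is the other input's data; chain it with out.grad.
            self.grad = other.data * out.grad
            other.grad = self.data * out.grad

        out._backward = _backward
        return out


x1 = Value(2.0)
w1 = Value(-3.0)
x1w1 = x1 * w1
x1w1.grad = 0.5
x1w1._backward()
print(x1.grad, w1.grad)  # -1.5 1.0
```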
1:12:49
And finally, here for tanh, we define _backward, and then we want to set out's _backward to be just _backward. Here we need to backpropagate: we have out.grad and we want to chain it into self.grad.
1:13:09
self.grad will be the local derivative of this operation that we've done here, which is tanh. We saw that the local gradient is 1 minus tanh of x, squared, and tanh of x here is t. That's the local derivative, because t is the output of this tanh, so 1 minus t squared is the local derivative. And then the gradient has to be multiplied in because of the chain rule. So out.grad is chained through the local gradient into self.grad.
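The tanh case can be sketched the same way. This version calls math.tanh from the standard library for brevity, which is a choice made here and may differ from how the forward pass was written in the lecture; the input value is picked so that the output is about 0.7071, matching the numbers quoted in the transcript.

```python
import math


class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def _backward():
            # d/dx tanh(x) = 1 - tanh(x)**2, and tanh(x) is t here.
            self.grad = (1 - t**2) * out.grad

        out._backward = _backward
        return out


n = Value(0.8813735870195432)
o = n.tanh()
o.grad = 1.0
o._backward()
print(round(o.data, 4), round(n.grad, 4))  # 0.7071 0.5
```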
1:13:41
And that should be basically it. So we're going to redefine our Value node, we're going to swing all the way down here, and we're going to redefine our expression. Make sure that all the grads are zero. Okay, but now we don't have to do this manually anymore. We are going to basically be calling the ._backward in the right order.
1:14:05
So first we want to call o's _backward. o was the outcome of tanh, right, so calling o's _backward will be this function. This is what it will do. Now we have to be careful, because there's a times out.grad, and out.grad, remember, is initialized to zero.
1:14:38
So here we see grad zero. So as a base case we need to set o.grad to 1.0, to initialize this with 1.
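A small numeric illustration of why the base case matters, using plain floats instead of Value objects. The input 0.8813735870195432 is an assumption chosen so that tanh gives roughly 0.7071, the output value mentioned in the transcript.

```python
import math

n = 0.8813735870195432
t = math.tanh(n)
local = 1 - t**2            # local derivative of tanh at n, about 0.5

grad_if_zero = local * 0.0  # o.grad left at its initial 0.0: nothing flows back
grad_if_one = local * 1.0   # o.grad set to 1.0 as the base case

print(grad_if_zero, round(grad_if_one, 4))  # 0.0 0.5
```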
1:14:53
And then once this is 1, we can call o._backward, and what that should do is it should propagate this grad through tanh. So the local derivative times the global derivative, which is initialized at one. So this should... um...
1:15:15
Ah, d'oh. So I thought about redoing it, but I figured I should just leave the error in here because it's pretty funny. Why is a NoneType object not callable? It's because I screwed up. We're trying to save these functions, so this is correct. This here: we don't want to call the function, because that returns None. These functions return None. We just want to store the function.
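The mistake can be reproduced in isolation. In this sketch the function and variable names are hypothetical; the point is only the difference between storing the result of a call and storing the function itself.

```python
def _backward():
    # Stands in for one of the closures above; like them, it returns None.
    pass


stored_wrong = _backward()   # parentheses call it right away, so None is stored
stored_right = _backward     # no parentheses: the function object is stored

try:
    stored_wrong()           # calling None raises a TypeError
except TypeError as exc:
    message = str(exc)

print(message)                  # 'NoneType' object is not callable
print(callable(stored_right))   # True
```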
1:15:42
So let me redefine the Value object, and then we're going to come back in, redefine the expression, draw_dot. Everything is great, o.grad is one, and now this should work, of course. Okay, so after o._backward, this grad should now be 0.5 if we redraw, and if everything went correctly... 0.5, yay. Okay, so now we need to call n's _backward. So that seems to have worked.
1:16:17
So n's _backward routed the gradient to both of these, so this is looking great.
1:16:26
Now we could of course call b's _backward. What's going to happen? Well, b doesn't have a backward: because b is a leaf node, b's _backward is by initialization the empty function. So nothing would happen, but we can call it on it.
1:16:48
But when we call this one's _backward, then we expect this 0.5 to get further routed. Right, so there we go: 0.5, 0.5.
1:17:02
And then finally we want to call it here on x2w2 and on x1w1. Do both of those, and there we go. So we get 0, 0.5, negative 1.5 and 1, exactly as we did before, but now we've done it through calling ._backward, sort of manually.
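Putting the pieces together, the manual sequence of calls could be sketched as follows. The class is a compact reconstruction under assumptions (including the use of math.tanh), and the bias value 6.8813735870195432 is assumed from earlier in the lecture; the remaining inputs are the ones inferred from the gradients quoted above.

```python
import math


class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad = 1.0 * out.grad
            other.grad = 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad = other.data * out.grad
            other.grad = self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad = (1 - t**2) * out.grad
        out._backward = _backward
        return out


x1, x2 = Value(2.0), Value(0.0)
w1, w2 = Value(-3.0), Value(1.0)
b = Value(6.8813735870195432)
x1w1 = x1 * w1
x2w2 = x2 * w2
x1w1x2w2 = x1w1 + x2w2
n = x1w1x2w2 + b
o = n.tanh()

o.grad = 1.0            # base case
o._backward()           # through tanh
n._backward()           # through the last addition
b._backward()           # leaf node: does nothing
x1w1x2w2._backward()    # through the first addition
x2w2._backward()        # through the products
x1w1._backward()

print(round(w2.grad, 4), round(x2.grad, 4), round(x1.grad, 4), round(w1.grad, 4))
```

The printed gradients should agree with the 0, 0.5, negative 1.5 and 1 read off in the transcript.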
1:17:34
So we have one last piece to get rid of, which is us calling _backward manually. So let's think through what we are actually doing. We've laid out a mathematical expression, and now we're trying to go backwards through that expression. Going backwards through the expression just means that we never want to call a ._backward for any node before we've done everything after it. So we have to do everything after it before we're ever going to call ._backward on any one node. We have to get all of its full dependencies: everything that its gradient depends on has to propagate to it before we can continue backpropagation.
1:18:14
So this ordering of graphs can be achieved using something called topological sort. Topological sort is basically a laying out of a graph such that all the edges go only from left to right. So here we have a graph, a directed acyclic graph, a DAG, and these are two different topological orders of it, I believe, where basically you'll see that it's a laying out of the nodes such that all the edges go only one way, from left to right.
1:18:44
Implementing topological sort, you can look it up on Wikipedia and so on; I'm not going to go through it in detail. But basically, this is what builds the topological order. We maintain a set of visited nodes, and then we are going through, starting at some root node, which for us is o. That's where we want to start the topological sort. Starting at o, we go through all of its children, and we need to lay them out from left to right.
1:19:12
Basically, this starts at o. If it's not visited, then it marks it as visited, and then it iterates through all of its children and calls build_topo on them. Then, after it's gone through all the children, it adds itself. So this node that we're going to call it on, like say o, is only going to add itself to the topo list after all of the children have been processed. That's how this function is guaranteeing that you're only going to be in the list once all your children are in the list, and that's the invariant that is being maintained.
1:19:49
So if we build_topo on o and then inspect this list, we're going to see that it ordered our Value objects. The last one is the Value of 0.707, which is the output. So this is o, and then this is n, and then all the other nodes get laid out before it.
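The procedure described here can be sketched on a small stand-in graph. The Node class and the topological_order wrapper are hypothetical helpers for illustration; only build_topo, the visited set and the topo list follow the description in the transcript.

```python
class Node:
    """Hypothetical stand-in for Value: a label plus the nodes it was built from."""

    def __init__(self, label, children=()):
        self.label = label
        self._prev = set(children)


def topological_order(root):
    topo = []
    visited = set()

    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)   # appended only after all of its children

    build_topo(root)
    return topo


# Same shape as the neuron: two products, two sums, then tanh.
x1w1 = Node('x1w1')
x2w2 = Node('x2w2')
s = Node('x1w1x2w2', (x1w1, x2w2))
b = Node('b')
n = Node('n', (s, b))
o = Node('o', (n,))

topo = topological_order(o)
order = [v.label for v in topo]
print(order)   # the order of the early entries can vary, but 'n' then 'o' come last
```

Because the children are kept in a set, several valid orders are possible; what is guaranteed is that every node appears after all of its children.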
1:20:11
so that builds the topological graph and
1:20:12
so that builds the topological graph and really what we're doing now is we're
1:20:13
really what we're doing now is we're
1:20:13
really what we're doing now is we're just calling dot underscore backward on
1:20:16
just calling dot underscore backward on
1:20:16
just calling dot underscore backward on all of the nodes in a topological order
1:20:19
all of the nodes in a topological order
1:20:19
all of the nodes in a topological order so if we just reset the gradients
1:20:21
so if we just reset the gradients
1:20:22
so if we just reset the gradients they're all zero
1:20:23
they're all zero
1:20:23
they're all zero what did we do
1:20:24
what did we do
1:20:24
what did we do we started by
1:20:27
we started by
1:20:27
we started by setting o dot grad
1:20:29
setting o dot grad
1:20:29
setting o dot grad to b1
1:20:31
to b1
1:20:31
to b1 that's the base case
1:20:33
that's the base case
1:20:33
that's the base case then we built the topological order
1:20:38
and then we went for node in reversed(topo).
1:20:47
Now, in the reverse order, because this list goes from... you know, we need to go through it in reversed order.
1:20:53
So starting at o: node._backward(). And this should be it. There we go. Those are the correct derivatives.
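The manual procedure just described (seed the output gradient, build the topological order, then call _backward in reverse) can be sketched as follows. This is a minimal sketch rather than the lecture's exact notebook code: the stripped-down Value class supports only + and *, and the expression and its input values are illustrative assumptions.

```python
class Value:
    """Stripped-down stand-in for the lecture's Value class (only + and *)."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # leaf nodes have nothing to propagate
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # Plain assignment, as at this point in the lecture.
            self.grad = 1.0 * out.grad
            other.grad = 1.0 * out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad = other.data * out.grad
            other.grad = self.data * out.grad

        out._backward = _backward
        return out


# Illustrative expression (assumed values): o = a * b + c
a, b, c = Value(2.0), Value(-3.0), Value(10.0)
o = a * b + c

# Depth-first topological sort: a node is appended only after its children.
topo = []
visited = set()


def build_topo(v):
    if v not in visited:
        visited.add(v)
        for child in v._prev:
            build_topo(child)
        topo.append(v)


build_topo(o)

o.grad = 1.0                     # base case: do/do = 1
for node in reversed(topo):      # output first, inputs last
    node._backward()

print(a.grad, b.grad, c.grad)    # -3.0 2.0 1.0
```

Because every node is appended after its children, the output is always the last element of topo, which is why the loop walks the list in reverse.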
1:21:08
Finally, we are going to hide this functionality. So I'm going to copy this, and we're going to hide it inside the Value class, because we don't want to have all that code lying around.
1:21:18
So instead of an _backward, we're now going to define an actual backward, so that's backward without the underscore, and that's going to do all the stuff that we just derived.
1:21:29
So let me just clean this up a little bit. So we're first going to
1:21:37
build a topological graph starting at self. So build_topo(self) will populate the topological order into the topo list, which is a local variable.
1:21:49
Then we set self.grad to be one, and then for each node in the reversed list, so starting at us and going to all the children, we call ._backward.
1:22:03
And that should be it. So save, come down here, redefine. Okay, all the grads are zero, and now what we can do is o.backward(), without the underscore, and
1:22:21
there we go. And that's backpropagation for one neuron.
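A minimal sketch of what that backward method might look like inside the class is shown here. It is an approximation under stated assumptions: the class is reduced to +, * and tanh, the _backward closures still use plain assignment as at this stage of the lecture, and the neuron's input values are illustrative numbers chosen so that the output comes out near the 0.707 mentioned above.

```python
import math


class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad = 1.0 * out.grad      # plain assignment at this stage
            other.grad = 1.0 * out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad = other.data * out.grad
            other.grad = self.data * out.grad

        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def _backward():
            self.grad = (1 - t ** 2) * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Build the topological order of everything that feeds into self.
        topo = []
        visited = set()

        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)

        build_topo(self)

        # Seed the output gradient, then propagate from output to inputs.
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()


# One neuron with two inputs (illustrative values, assumed here).
x1, x2 = Value(2.0), Value(0.0)
w1, w2 = Value(-3.0), Value(1.0)
b = Value(6.8813735870195432)
n = x1 * w1 + x2 * w2 + b
o = n.tanh()
o.backward()

print(round(o.data, 4))                                  # 0.7071
print(round(x1.grad, 4), round(w1.grad, 4))              # -1.5 1.0
```

Keeping topo and visited local to backward means each call starts from a fresh traversal instead of relying on notebook-level state.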
1:22:28
Now, we shouldn't be too happy with ourselves, actually, because we have a bad bug, and we have not surfaced the bug because of some specific conditions that we have to think about right now.
1:22:39
So here's the simplest case that shows the bug. Say I create a single node a,
1:22:48
and then I create a b that is a + a, and then I call backward.
1:22:54
So what's going to happen is a is 3, and then b is a + a, so there's two arrows on top of each other here.
1:23:03
Then we can see that, of course, the forward pass works: b is just a + a, which is six.
1:23:10
But the gradient here that we calculate automatically is not actually correct. And that's because, of course, just doing calculus in your head, the derivative of b with respect to a should be two: one plus one. It's not one.
1:23:30
Intuitively, what's happening here? b is the result of a + a, and then we call backward on it. So let's go up and see what that does.
1:23:42
b is a result of addition, so out is b. And then when we called backward, what happened is self.grad was set to one, and then other.grad was set to one.
1:23:57
But because we're doing a + a, self and other are actually the exact same object. So we are overwriting the gradient: we are setting it to one, and then we are setting it again to one, and that's why it stays at one. So that's a problem.
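The overwrite can be reproduced in a few lines. This is a deliberately buggy sketch, assuming an __add__ whose _backward uses plain assignment as described above; the class is cut down to the bare minimum needed to show the effect.

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # BUG (intentional here): assignment overwrites earlier contributions.
            self.grad = 1.0 * out.grad
            other.grad = 1.0 * out.grad   # same object as self when doing a + a

        out._backward = _backward
        return out


a = Value(3.0)
b = a + a

b.grad = 1.0      # seed the output gradient
b._backward()     # one step of backprop is enough to show the problem

print(b.data)     # 6.0, the forward pass is fine
print(a.grad)     # 1.0, but db/da should be 2.0
```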
1:24:14
There's another way to see this in a little bit more complicated expression.
1:24:21
So here we have a and b, and then d will be the multiplication of the two, and e will be the addition of the two, and then we multiply e times d to get f, and then we call f.backward(). And these gradients, if you check, will be incorrect.
1:24:40
So fundamentally what's happening here, again, is that we're going to see an issue anytime we use a variable more than once. Until now, in these expressions above, every variable is used exactly once, so we didn't see the issue.
1:24:54
But here, if a variable is used more than once, what's going to happen during the backward pass? We're backpropagating from f to e to d, so far so good. But now e calls its _backward and it deposits its gradients to a and b, but then we come back to d and call its _backward, and it overwrites those gradients at a and b. So that's obviously a problem.
1:25:17
And the solution here, if you look at the multivariate case of the chain rule and its generalization there, is basically that we have to accumulate these gradients. These gradients add.
1:25:30
And so instead of setting those gradients, we can simply do +=. We need to accumulate those gradients: +=, +=, +=, +=.
1:25:46
And this will be okay, remember, because we are initializing them at zero. So they start at zero, and then any contribution that flows backwards will simply add.
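The accumulation rule is what the multivariable chain rule prescribes. As a short worked sketch (the symbols L for the final output and u_1, ..., u_k for the intermediate nodes that consume a are notation introduced here, not the lecture's):

```latex
% If a feeds into several intermediate nodes u_1, ..., u_k and L is the final output:
\frac{\partial L}{\partial a} \;=\; \sum_{i=1}^{k} \frac{\partial L}{\partial u_i}\,\frac{\partial u_i}{\partial a}

% Worked case b = a + a: view it as b = x + y with x = a and y = a, so
\frac{db}{da} \;=\; \frac{\partial b}{\partial x}\frac{dx}{da} + \frac{\partial b}{\partial y}\frac{dy}{da} \;=\; 1 \cdot 1 + 1 \cdot 1 \;=\; 2
```

Each term of the sum is one "deposit" made by a _backward call, which is why += on a gradient that starts at zero reproduces the sum.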
1:25:58
So now if we redefine this one, because of the +=, this now works. a.grad started at zero, and when we called b.backward() we deposit one, and then we deposit one again, and now this is two, which is correct.
1:26:14
And here this will also work, and we'll get correct gradients, because when we call e._backward we will deposit the gradients from this branch, and then when we get to d._backward it will deposit its own gradients, and then those gradients simply add on top of each other. And so we just accumulate those gradients, and that fixes the issue.
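The fix can be checked on both examples with a sketch like the following. It assumes a Value class reduced to + and * plus the backward method described earlier, with the only change being += in each _backward; the input values of the second example (-2.0 and 3.0) are assumptions made for illustration.

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0                      # gradients start at zero
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += 1.0 * out.grad      # accumulate instead of overwrite
            other.grad += 1.0 * out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()

        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)

        build_topo(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()


# Simplest case: b = a + a
a = Value(3.0)
b = a + a
b.backward()
grad_simple = a.grad
print(grad_simple)        # 2.0

# Second case: a and b are each used twice (values assumed for illustration)
a = Value(-2.0)
b = Value(3.0)
d = a * b
e = a + b
f = d * e
f.backward()
print(a.grad, b.grad)     # -3.0 -8.0
```

The second result agrees with the calculus: f = ab(a + b), so df/da = 2ab + b^2 = -3 and df/db = a^2 + 2ab = -8 at a = -2, b = 3.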
1:26:34
Okay, now before we move on, let me actually do a bit of cleanup here and delete some of this intermediate work. So we're not going to need any of this now that we've derived all of it.
1:26:45
We are going to keep this, because I want to come back to it. Delete the tanh, delete our morning example, delete the step, delete this, keep the code that draws, and then delete this example, and leave behind only the definition of Value.
1:27:05
And now let's come back to this non-linearity here that we implemented, the tanh. Now, I told you that we could have broken down tanh into its explicit atoms in terms of other expressions if we had the exp function.
1:27:16
So if you remember, tanh is defined like this, and we chose to develop tanh as a single function, and we can do that because we know its derivative and we can backpropagate through it.
1:27:26
But we can also break down tanh and express it as a function of exp. And I would like to do that now because I want to prove to you that you get all the same results and all the same gradients, but also because it forces us to implement a few more expressions. It forces us to do exponentiation, addition, subtraction, division, and things like that, and I think it's a good exercise to go through a few more of these.
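For reference, the definition being referred to can be written in a standard form as below. Which exact form is displayed on screen is not stated in the transcript, so treat the choice of the second expression as an assumption; it follows from the first by multiplying numerator and denominator by e^x.

```latex
\tanh(x) \;=\; \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \;=\; \frac{e^{2x} - 1}{e^{2x} + 1},
\qquad
\frac{d}{dx}\tanh(x) \;=\; 1 - \tanh^{2}(x)
```

The second form needs only exponentiation, addition or subtraction of a constant, and division, which matches the list of operations named above.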
1:27:48
Okay, so let's scroll up to the definition of Value. And here, one thing that we currently can't do is: we can do like a Value of, say, 2.0, but we can't do, you know, here for example, we want to add the constant one, and we can't do something like this.
1:28:05
And we can't do it because it says int object has no attribute data. That's because a + 1 comes right here to __add__, and then other is the integer 1, and then here Python is trying to access 1.data, and that's not a thing. And that's because basically 1 is not a Value object, and we only have addition for Value objects.
1:28:22
So as a matter of convenience, so that we can create expressions like this and make them make sense, we can simply do something like this. Basically, we leave other alone if other is an instance of Value, but if it's not an instance of Value, we're going to assume that it's a number, like an integer or a float, and we're going to simply wrap it in Value. And then other will just become Value(other), and then other will have a data attribute, and this should work.
1:28:49
So if I just redefine Value, then this should work.
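The wrapping step described above can be sketched like this. It is a minimal illustration, assuming a Value class trimmed down to __add__ only; the __repr__ is added here just so the result prints readably.

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        # Leave Value operands alone; wrap anything else, assuming it is an int or float.
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad

        out._backward = _backward
        return out


a = Value(2.0)
print(a + 1)    # Value(data=3.0)
```

Without the isinstance line, a + 1 would reach other.data with other being the integer 1 and raise an AttributeError.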
1:28:53
There we go. Okay, now let's do the exact same thing for multiply, because we can't do something like this, again for the exact same reason.
1:29:01
So we just have to go to __mul__, and if other is not a Value, then let's wrap it in Value. Let's redefine Value, and now this works.
1:29:10
Now here's a kind of unfortunate and not obvious part. a * 2 works, we saw that, but 2 * a, is that going to work? You'd expect it to, right? But actually it will not.
1:29:22
And the reason it won't is because Python doesn't know... like, when you do a * 2, Python will go and basically do something like a.__mul__(2). That's basically what it will call. But to it, 2 * a is the same as 2.__mul__(a), and 2 can't multiply a Value, and so it's really confused about that.
1:29:47
So instead, the way this works in Python is that you are free to define something called __rmul__, and __rmul__ is kind of like a fallback. So if Python can't do 2 * a, it will check if, by any chance, a knows how to multiply 2, and that will be called into __rmul__.
1:30:08
So because Python can't do 2 * a, it will check: is there an __rmul__ in Value? And because there is, it will now call that. And what we'll do here is we will swap the order of the operands. So basically 2 * a will redirect to __rmul__, and __rmul__ will basically call a * 2, and that's how that will work.
1:30:28
So, redefining now with __rmul__, 2 * a becomes four.
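The fallback just described can be sketched as follows, assuming a Value class trimmed down to multiplication only. __rmul__ is Python's standard reflected-multiplication hook: it is tried when the left operand's own __mul__ cannot handle the right operand.

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __mul__(self, other):
        # Same wrapping trick as for addition.
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def __rmul__(self, other):
        # Reached for `other * self` when `other` (e.g. an int) does not know
        # how to multiply a Value. Swap the operands and reuse __mul__.
        return self * other


a = Value(2.0)
print((a * 2).data)   # 4.0
print((2 * a).data)   # 4.0
```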
1:30:33
Okay, now looking at the other elements that we still need: we need to know how to exponentiate and how to divide. So let's first do the exponentiation part.
1:30:40
We're going to introduce a single function exp here, and exp is going to mirror tanh in the sense that it's a simple single function that transforms a single scalar value and outputs a single scalar value.
1:30:53
So we pop out the Python number, we use math.exp to exponentiate it, create a new Value object, everything that we've seen before. The tricky part, of course, is how do you backpropagate through e to the x. And so here you can potentially pause the video and think about what should go here.
1:31:13
Okay, so basically we need to know what the local derivative of e to the x is. d/dx of e to the x is famously just e to the x, and we've already just calculated e to the x; it's inside out.data. So we can do out.data times out.grad, and that's the chain rule. We're just chaining on to the current running grad, and this is what the expression looks like. It looks a little confusing, but this is what it is, and that's the exponentiation. So, redefining, we should now be able to call a.exp(), and hopefully the backward pass works as well.
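A minimal sketch of the exponentiation step described above. The stripped-down `Value` class here is an assumption for illustration (it keeps only what `exp` needs), and the `+=` accumulation and the manual seeding of `out.grad` are illustrative choices rather than a quote of the lecture's notebook.

```python
import math


class Value:
    """Stripped-down scalar node: just enough to show exp and its backward."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def exp(self):
        out = Value(math.exp(self.data), (self,))

        def _backward():
            # d/dx e^x = e^x, which is already stored in out.data;
            # multiplying by out.grad chains onto the running gradient.
            self.grad += out.data * out.grad

        out._backward = _backward
        return out


a = Value(2.0)
out = a.exp()
out.grad = 1.0   # seed d(out)/d(out) = 1
out._backward()  # one manual backward step
print(out.data, a.grad)
```

Because the local derivative equals the forward result, the two printed numbers should be the same here.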
1:31:49
Okay, and the last thing we'd like to do, of course, is to be able to divide. Now, I will actually implement something slightly more powerful than division, because division is just a special case of something a bit more powerful. In particular, just by rearranging: if we have some kind of b = Value(4.0) here, we'd like to be able to do a / b, and we'd like this to give us 0.5. Now, division can actually be reshuffled as follows: a / b is actually the same as a multiplying 1 over b, and that's the same as a multiplying b to the power of negative one.
1:32:25
So what I'd like to do instead is implement the operation x to the k for some constant k, so an integer or a float, and we would like to be able to differentiate this. Then, as a special case, k equal to negative one will be division. I'm doing that just because it's more general, and you might as well do it that way. So basically what I'm saying is that we can redefine division, which we will put here somewhere: self divided by other can actually be rewritten as self times other to the power of negative one. And now, a Value raised to the power of negative one: we now have to define that.
1:33:15
So we need to implement the pow function. Where am I going to put the power function? Maybe here somewhere. This is the skeleton for it. This function will be called when we try to raise a Value to some power, and other will be that power. Now, I'd like to make sure that other is only an int or a float. Usually other is some kind of a different Value object, but here other will be forced to be an int or a float; otherwise the math won't work for what we're trying to achieve in this specific case. It would be a different derivative expression if we wanted other to be a Value. So here we create the output Value, which is just this data raised to the power of other, and other here could be, for example, negative one, which is what we are hoping to achieve. And then this is the backward stub, and this is the fun part: what is the chain rule expression here for backpropagating through the power function, where the power is some kind of a constant? So this is the exercise. Maybe pause the video here and see if you can figure out yourself what we should put here.
1:34:26
Okay, so you can actually go here and look at derivative rules as an example, and we see lots of derivatives that you can hopefully recall from calculus. In particular, what we're looking for is the power rule, because that's telling us that if we're trying to take d/dx of x to the n, which is what we're doing here, then that is just n times x to the n minus 1. Okay, so that's telling us about the local derivative of this power operation.
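In symbols, the power rule quoted above, together with the way a local derivative gets chained to an upstream gradient, can be written as follows. The symbol L is a hypothetical name for whatever final output the gradients are taken with respect to, and out stands for the result of the power operation; neither name comes from the lecture.

```latex
\[
\frac{d}{dx}\,x^{n} = n\,x^{\,n-1},
\qquad
\frac{\partial L}{\partial x}
  = n\,x^{\,n-1}\cdot\frac{\partial L}{\partial\,\mathrm{out}},
\quad \mathrm{out} = x^{n}.
\]
% Special case used for division, n = -1:
\[
\frac{d}{dx}\,x^{-1} = -\,x^{-2}.
\]
```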
1:34:58
So all we want here: basically n is now other, and self.data is x. And so this now becomes other, which is n, times self.data, which is now a Python int or a float (it's not a Value object, we're accessing the data attribute), raised to the power of other minus one, or n minus one. I can put brackets around this, but it doesn't matter, because power takes precedence over multiply in Python, so that would have been okay. And that's the local derivative only, but now we have to chain it, and we chain it simply by multiplying by out.grad; that's the chain rule. And this should technically work, and we're going to find out soon. But now if we do this, this should now work, and we get 0.5, so the forward pass works.
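The power operation and the division built on top of it can be sketched roughly as below. This is a cut-down `Value` class written for illustration, not the lecture's exact code; the choice a = 2.0 is an assumption inferred from b = 4.0 and the stated result 0.5, and the intermediate name `inv` is hypothetical.

```python
class Value:
    """Minimal scalar node with just mul, pow and truediv (a sketch)."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def __pow__(self, other):
        # The exponent must be a plain constant; a Value exponent would
        # need a different derivative expression.
        assert isinstance(other, (int, float)), "only int/float powers supported"
        out = Value(self.data ** other, (self,))

        def _backward():
            # Power rule n * x**(n-1), chained with out.grad.
            self.grad += other * self.data ** (other - 1) * out.grad

        out._backward = _backward
        return out

    def __truediv__(self, other):
        # a / b rewritten as a * b**-1
        return self * other ** -1


a = Value(2.0)  # assumed value, consistent with a / b = 0.5
b = Value(4.0)
print((a / b).data)  # forward pass only

# Same expression with the intermediate node exposed, stepped backward by hand.
inv = b ** -1
c = a * inv
c.grad = 1.0
c._backward()
inv._backward()
print(a.grad, b.grad)
```

Stepping `_backward` by hand in reverse order of construction stands in for a full topological-sort `backward()`, which this sketch leaves out to stay short.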
1:35:51
But does the backward pass work? And I realize that we actually also have to know how to subtract, so right now a minus b will not work. To make it work we need one more piece of code here, and basically this is the subtraction. The way we're going to implement subtraction is by addition of a negation, and then to implement negation we're going to multiply by negative one. So we're just again using the stuff we've already built and expressing it in terms of what we have, and a minus b is now working.
1:36:22
Okay, so now let's scroll again to this expression here for this neuron, and let's just compute the backward pass here once we've defined o, and let's draw it. So here are the gradients for all these leaf nodes for this two-dimensional neuron that has a tanh that we've seen before. Now what I'd like to do is break up this tanh into this expression here. So let me copy-paste this here, and now we'll preserve the label, and we will change how we define o. In particular, we're going to implement this formula here: we need e to the 2x minus 1, over e to the 2x plus 1. For e to the 2x, we need to take 2 times n and exponentiate it; that's e to the 2x. And then, because we're using it twice, let's create an intermediate variable e, and then define o as e minus one over e plus one. And that should be it, and then we should be able to draw_dot of o.
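For reference, the identity being implemented here is the standard exponential form of tanh. In the block below, n is the neuron's pre-activation and e is the intermediate variable named in the transcript; the layout is only a restatement, not something shown on screen.

```latex
\[
\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1},
\qquad\text{so with } e = \exp(2n):\quad
o = \frac{e - 1}{e + 1}.
\]
```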
1:37:29
So now, before I run this, what do we expect to see? Number one, we're expecting to see a much longer graph here, because we've broken up tanh into a bunch of other operations. But those operations are mathematically equivalent, so what we're expecting to see is, number one, the same result here, so the forward pass works; and number two, because of that mathematical equivalence, we expect to see the same backward pass and the same gradients on these leaf nodes. So these gradients should be identical. So let's run this.
1:38:00
So number one, let's verify that instead of a single tanh node we now have exp, and we have plus, we have times negative one, this is the division, and we end up with the same forward pass here. And then the gradients: we have to be careful because they're in a slightly different order, potentially. The gradients for w2 and x2 should be 0 and 0.5, and w2 and x2 are 0 and 0.5. And w1 and x1 are 1 and negative 1.5; 1 and negative 1.5. So that means that both our forward pass and backward pass were correct, because this turned out to be equivalent to tanh before.
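The equivalence check described above can be reproduced with a compact, self-contained sketch. The `Value` class below is a condensed approximation of the micrograd design, not the lecture's exact code. The input values (x1 = 2, x2 = 0, w1 = -3, w2 = 1 and the bias) are assumed; they do not appear in this part of the transcript, but they are consistent with the gradients quoted here. The helper `neuron_grads` is a hypothetical name.

```python
import math


class Value:
    """Compact micrograd-style scalar (a sketch under the assumptions above)."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, k):
        assert isinstance(k, (int, float)), "constant powers only"
        out = Value(self.data ** k, (self,))
        def _backward():
            self.grad += k * self.data ** (k - 1) * out.grad
        out._backward = _backward
        return out

    def exp(self):
        out = Value(math.exp(self.data), (self,))
        def _backward():
            self.grad += out.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t ** 2) * out.grad
        out._backward = _backward
        return out

    def __neg__(self):
        return self * -1            # negation as multiplication by -1

    def __sub__(self, other):
        return self + (-other)      # subtraction as addition of a negation

    def __truediv__(self, other):
        return self * other ** -1   # division as multiplication by a power

    def backward(self):
        # Topological order so each node runs after everything that uses it.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()


def neuron_grads(decompose):
    """Return (output, [x1, w1, x2, w2] grads) for the two-input neuron."""
    x1, x2 = Value(2.0), Value(0.0)      # assumed inputs
    w1, w2 = Value(-3.0), Value(1.0)     # assumed weights
    b = Value(6.8813735870195432)        # assumed bias
    n = x1 * w1 + x2 * w2 + b
    if decompose:
        e = (n * 2).exp()
        o = (e - 1) / (e + 1)
    else:
        o = n.tanh()
    o.backward()
    return o.data, [x1.grad, w1.grad, x2.grad, w2.grad]


print(neuron_grads(decompose=False))
print(neuron_grads(decompose=True))
```

Both calls should agree up to floating-point rounding, which is the point of the exercise: a single `tanh` node with its own backward and the longer graph of atomic operations describe the same function. The sketch writes `n * 2` rather than `2 * n` only because it omits `__rmul__`.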
1:38:37
And so the reason I wanted to go through this exercise is, number one, we got to practice a few more operations and writing more backward passes; and number two, I wanted to illustrate the point that the level at which you implement your operations is totally up to you. You can implement backward passes for tiny expressions like a single individual plus or a single times, or you can implement them for, say, tanh, which you can potentially see as a composite operation because it's made up of all these more atomic operations. But really all of this is kind of like a fake concept. All that matters is that we have some kind of inputs and some kind of an output, and this output is a function of the inputs in some way. As long as you can do the forward pass and the backward pass of that little operation, it doesn't matter what that operation is and how composite it is. If you can write the local gradients, you can chain the gradient and you can continue backpropagation. So the design of what those functions are is completely up to you.
1:39:33
So now I would like to show you how you can do the exact same thing using a modern deep neural network library, like for example PyTorch, which I've roughly modeled micrograd on. PyTorch is something you would use in production, and I'll show you how you can do the exact same thing but in the PyTorch API. So I'm just going to copy-paste it in and walk you through it a little bit. This is what it looks like. We're going to import PyTorch, and then we need to define these value objects like we have here. Now, micrograd is a scalar-valued engine, so we only have scalar values like 2.0, but in PyTorch everything is based around tensors, and like I mentioned, tensors are just n-dimensional arrays of scalars.
1:40:17
So that's why things get a little bit more complicated here. I just need a scalar-valued tensor, a tensor with just a single element, but by default when you work with PyTorch you would use more complicated tensors like this. So if I import PyTorch, then I can create tensors like this, and this tensor, for example, is a two-by-three array of scalars in a single compact representation. We can check its shape; we see that it's a two-by-three array, and so on.
1:40:50
So this is usually what you would work with in the actual libraries. Here I'm creating a tensor that has only a single element, 2.0, and then I'm casting it to be double, because Python by default uses double precision for its floating point numbers, so I'd like everything to be identical. By default the data type of these tensors will be float32, so it's only using a single-precision float, so I'm casting it to double so that we have float64, just like in Python. So I'm casting to double, and then we get something similar to Value of 2.
1:41:29
next thing i have to do is because these
1:41:29
next thing i have to do is because these are leaf nodes by default pytorch
1:41:31
are leaf nodes by default pytorch
1:41:31
are leaf nodes by default pytorch assumes that they do not require
1:41:32
assumes that they do not require
1:41:32
assumes that they do not require gradients so i need to explicitly say
1:41:35
gradients so i need to explicitly say
1:41:35
gradients so i need to explicitly say that all of these nodes require
1:41:36
that all of these nodes require
1:41:36
that all of these nodes require gradients
1:41:37
gradients
1:41:37
gradients okay so this is going to construct
1:41:39
okay so this is going to construct
1:41:39
okay so this is going to construct scalar valued one element tensors
1:41:43
scalar valued one element tensors
1:41:43
scalar valued one element tensors and make sure that pytorch knows that they
1:41:44
and make sure that pytorch knows that they
1:41:44
and make sure that pytorch knows that they require gradients now by default these
1:41:47
require gradients now by default these
1:41:47
require gradients now by default these are set to false by the way because of
1:41:48
are set to false by the way because of
1:41:48
are set to false by the way because of efficiency reasons because usually you
1:41:50
efficiency reasons because usually you
1:41:50
efficiency reasons because usually you would not want gradients for leaf nodes
1:41:53
would not want gradients for leaf nodes
1:41:53
would not want gradients for leaf nodes like the inputs to the network and this
1:41:55
like the inputs to the network and this
1:41:55
like the inputs to the network and this is just trying to be efficient in the
1:41:57
is just trying to be efficient in the
1:41:57
is just trying to be efficient in the most common cases
1:41:59
most common cases
1:41:59
most common cases so once we've defined all of our values
1:42:01
so once we've defined all of our values
1:42:01
so once we've defined all of our values in python we can perform arithmetic just
1:42:03
in python we can perform arithmetic just
1:42:03
in python we can perform arithmetic just like we can here in micrograd land so
1:42:05
like we can here in micrograd land so
1:42:06
like we can here in micrograd land so this will just work and then there's a
1:42:07
this will just work and then there's a
1:42:07
this will just work and then there's a torch.tanh also
1:42:09
torch.tanh also
1:42:09
torch.tanh also and what we get back is a tensor again
1:42:12
and what we get back is a tensor again
1:42:12
and what we get back is a tensor again and we can
1:42:13
and we can
1:42:13
and we can just like in micrograd it's got a data
1:42:15
just like in micrograd it's got a data
1:42:15
just like in micrograd it's got a data attribute and it's got a grad attribute
1:42:18
attribute and it's got a grad attribute
1:42:18
attribute and it's got a grad attribute so these tensor objects just like in
1:42:19
so these tensor objects just like in
1:42:19
so these tensor objects just like in micrograd have a dot data and a dot grad
1:42:22
micrograd have a dot data and a dot grad
1:42:22
micrograd have a dot data and a dot grad and
1:42:23
and
1:42:23
and the only difference here is that we need
1:42:25
the only difference here is that we need
1:42:25
the only difference here is that we need to call .item() because otherwise
1:42:28
to call .item() because otherwise
1:42:28
to call .item() because otherwise um pytorch
1:42:30
um pytorch
1:42:30
um pytorch .item() basically takes
1:42:32
.item() basically takes
1:42:32
.item() basically takes a single tensor of one element and it
1:42:34
a single tensor of one element and it
1:42:34
a single tensor of one element and it just returns that element stripping out
1:42:36
just returns that element stripping out
1:42:36
just returns that element stripping out the tensor
1:42:37
the tensor
1:42:37
the tensor so let me just run this and hopefully we
1:42:39
so let me just run this and hopefully we
1:42:39
so let me just run this and hopefully we are going to get this is going to print
1:42:41
are going to get this is going to print
1:42:41
are going to get this is going to print the forward pass
1:42:42
the forward pass
1:42:42
the forward pass which is 0.707
1:42:44
which is 0.707
1:42:44
which is 0.707 and this will be the gradients which
1:42:46
and this will be the gradients which
1:42:46
and this will be the gradients which hopefully are
1:42:48
hopefully are
1:42:48
hopefully are 0.5 0 negative 1.5 and 1.
1:42:51
0.5 0 negative 1.5 and 1.
1:42:51
0.5 0 negative 1.5 and 1. so if we just run this
1:42:53
so if we just run this
1:42:53
so if we just run this there we go
1:42:54
there we go
1:42:54
there we go 0.7 so the forward pass agrees and then
1:42:57
0.7 so the forward pass agrees and then
1:42:57
0.7 so the forward pass agrees and then point five zero negative one point five
1:42:59
point five zero negative one point five
1:42:59
point five zero negative one point five and one
1:43:00
and one
1:43:00
and one so pytorch agrees with us
1:43:02
so pytorch agrees with us
1:43:02
so pytorch agrees with us and just to show you here basically o
1:43:05
and just to show you here basically o
1:43:05
and just to show you here basically o here's a tensor with a single element
1:43:08
here's a tensor with a single element
1:43:08
here's a tensor with a single element and it's a double
1:43:09
and it's a double
1:43:09
and it's a double and we can call .item() on it to just
1:43:11
and we can call .item() on it to just
1:43:12
and we can call .item() on it to just get the single number out
1:43:14
get the single number out
1:43:14
get the single number out so that's what item does and o is a
1:43:16
so that's what item does and o is a
1:43:16
so that's what item does and o is a tensor object like i mentioned and it's
1:43:18
tensor object like i mentioned and it's
1:43:18
tensor object like i mentioned and it's got a backward function just like we've
1:43:20
got a backward function just like we've
1:43:20
got a backward function just like we've implemented
1:43:22
implemented
1:43:22
implemented and then all of these also have a dot
1:43:23
and then all of these also have a dot
1:43:23
and then all of these also have a dot grad so like x2 for example has a grad
1:43:26
grad so like x2 for example has a grad
1:43:26
grad so like x2 for example has a grad and it's a tensor and we can pop out the
1:43:28
and it's a tensor and we can pop out the
1:43:28
and it's a tensor and we can pop out the individual number with .item()
1:43:31
individual number with .item()
1:43:31
individual number with .item() so basically
1:43:32
so basically
1:43:32
so basically torch can do what we did in
1:43:35
torch can do what we did in
1:43:35
torch can do what we did in micrograd as a special case when your
1:43:37
micrograd as a special case when your
1:43:37
micrograd as a special case when your tensors are all single element tensors
1:43:40
tensors are all single element tensors
1:43:40
tensors are all single element tensors but the big deal with pytorch is that
1:43:42
but the big deal with pytorch is that
1:43:42
but the big deal with pytorch is that everything is significantly more
1:43:43
everything is significantly more
1:43:43
everything is significantly more efficient because we are working with
1:43:45
efficient because we are working with
1:43:45
efficient because we are working with these tensor objects and we can do lots
1:43:47
these tensor objects and we can do lots
1:43:47
these tensor objects and we can do lots of operations in parallel on all of
1:43:49
of operations in parallel on all of
1:43:49
of operations in parallel on all of these tensors
1:43:51
these tensors
1:43:51
these tensors but otherwise what we've built very much
1:43:53
but otherwise what we've built very much
1:43:53
but otherwise what we've built very much agrees with the api of pytorch
1:43:55
agrees with the api of pytorch
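The pytorch comparison described above could look roughly like the sketch below. Only the value 2.0 and the expected results (about 0.707 forward, gradients 0.5, 0, -1.5 and 1) appear in this part of the transcript; the other inputs, the weights and the bias are assumed here so that the numbers come out that way.

```python
import torch

# single-element leaf tensors, cast to float64 to match python floats;
# leaf tensors do not track gradients unless we ask for it
x1 = torch.tensor([2.0]).double();  x1.requires_grad = True
x2 = torch.tensor([0.0]).double();  x2.requires_grad = True   # assumed value
w1 = torch.tensor([-3.0]).double(); w1.requires_grad = True   # assumed value
w2 = torch.tensor([1.0]).double();  w2.requires_grad = True   # assumed value
b = torch.tensor([6.8813735870195432]).double(); b.requires_grad = True  # assumed value

n = x1 * w1 + x2 * w2 + b
o = torch.tanh(n)

print(o.data.item())        # forward pass, approximately 0.7071
o.backward()                # fills in .grad on every leaf tensor

print("x2", x2.grad.item())  # approximately 0.5
print("w2", w2.grad.item())  # 0.0
print("x1", x1.grad.item())  # approximately -1.5
print("w1", w1.grad.item())  # approximately 1.0
```

Here .item() is what strips the tensor wrapper and hands back a plain python float.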
1:43:55
agrees with the api of pytorch okay so now that we have some machinery
1:43:57
okay so now that we have some machinery
1:43:57
okay so now that we have some machinery to build out pretty complicated
1:43:58
to build out pretty complicated
1:43:58
to build out pretty complicated mathematical expressions we can also
1:44:00
mathematical expressions we can also
1:44:00
mathematical expressions we can also start building out neural nets and as i
1:44:02
start building out neural nets and as i
1:44:02
start building out neural nets and as i mentioned neural nets are just a
1:44:03
mentioned neural nets are just a
1:44:03
mentioned neural nets are just a specific class of mathematical
1:44:05
specific class of mathematical
1:44:05
specific class of mathematical expressions
1:44:07
expressions
1:44:07
expressions so we're going to start building out a
1:44:08
so we're going to start building out a
1:44:08
so we're going to start building out a neural net piece by piece and eventually
1:44:09
neural net piece by piece and eventually
1:44:09
neural net piece by piece and eventually we'll build out a two-layer multi-layer
1:44:12
we'll build out a two-layer multi-layer
1:44:12
we'll build out a two-layer multi-layer perceptron as it's called and i'll
1:44:14
perceptron as it's called and i'll
1:44:14
perceptron as it's called and i'll show you exactly what that means
1:44:15
show you exactly what that means
1:44:15
show you exactly what that means let's start with a single individual
1:44:17
let's start with a single individual
1:44:17
let's start with a single individual neuron we've implemented one here but
1:44:19
neuron we've implemented one here but
1:44:19
neuron we've implemented one here but here i'm going to implement one that
1:44:21
here i'm going to implement one that
1:44:21
here i'm going to implement one that also subscribes to the pytorch api in
1:44:24
also subscribes to the pytorch api in
1:44:24
also subscribes to the pytorch api in how it designs its neural network
1:44:26
how it designs its neural network
1:44:26
how it designs its neural network modules
1:44:27
modules
1:44:27
modules so just like we saw that we can like
1:44:28
so just like we saw that we can like
1:44:28
so just like we saw that we can like match the api of pytorch
1:44:31
match the api of pytorch
1:44:31
match the api of pytorch on the auto grad side we're going to try
1:44:33
on the auto grad side we're going to try
1:44:33
on the auto grad side we're going to try to do that on the neural network modules
1:44:35
to do that on the neural network modules
1:44:35
to do that on the neural network modules so here's class neuron
1:44:38
so here's class neuron
1:44:38
so here's class neuron and just for the sake of efficiency i'm
1:44:40
and just for the sake of efficiency i'm
1:44:40
and just for the sake of efficiency i'm going to copy paste some sections that
1:44:42
going to copy paste some sections that
1:44:42
going to copy paste some sections that are relatively straightforward
1:44:45
are relatively straightforward
1:44:45
are relatively straightforward so the constructor will take
1:44:47
so the constructor will take
1:44:47
so the constructor will take number of inputs to this neuron which is
1:44:49
number of inputs to this neuron which is
1:44:49
number of inputs to this neuron which is how many inputs come to a neuron so this
1:44:52
how many inputs come to a neuron so this
1:44:52
how many inputs come to a neuron so this one for example has three inputs
1:44:55
one for example has three inputs
1:44:55
one for example has three inputs and then it's going to create a weight
1:44:57
and then it's going to create a weight
1:44:57
and then it's going to create a weight that is some random number between
1:44:58
that is some random number between
1:44:58
that is some random number between negative one and one for every one of
1:45:00
negative one and one for every one of
1:45:00
negative one and one for every one of those inputs
1:45:01
those inputs
1:45:01
those inputs and a bias that controls the overall
1:45:03
and a bias that controls the overall
1:45:03
and a bias that controls the overall trigger happiness of this neuron
1:45:06
trigger happiness of this neuron
1:45:06
trigger happiness of this neuron and then we're going to implement a def
1:45:08
and then we're going to implement a def
1:45:08
and then we're going to implement a def underscore underscore call
1:45:11
underscore underscore call
1:45:11
underscore underscore call of self and x some input x
1:45:13
of self and x some input x
1:45:14
of self and x some input x and really what we want to do here is w
1:45:15
and really what we want to do here is w
1:45:15
and really what we want to do here is w times x plus b
1:45:17
times x plus b
1:45:17
times x plus b where w times x here is a dot product
1:45:19
where w times x here is a dot product
1:45:19
where w times x here is a dot product specifically
1:45:21
specifically
1:45:21
specifically now if you haven't seen
1:45:22
now if you haven't seen
1:45:22
now if you haven't seen call
1:45:23
call
1:45:24
call let me just return 0.0 here for now the
1:45:26
let me just return 0.0 here for now the
1:45:26
let me just return 0.0 here for now the way this works now is we can have an x
1:45:28
way this works now is we can have an x
1:45:28
way this works now is we can have an x which is say like 2.0 3.0 then we can
1:45:31
which is say like 2.0 3.0 then we can
1:45:31
which is say like 2.0 3.0 then we can initialize a neuron that is
1:45:32
initialize a neuron that is
1:45:32
initialize a neuron that is two-dimensional
1:45:33
two-dimensional
1:45:33
two-dimensional because these are two numbers and then
1:45:35
because these are two numbers and then
1:45:35
because these are two numbers and then we can feed those two numbers into that
1:45:37
we can feed those two numbers into that
1:45:37
we can feed those two numbers into that neuron to get an output
1:45:39
neuron to get an output
1:45:39
neuron to get an output and so when you use this notation n of x
1:45:42
and so when you use this notation n of x
1:45:42
and so when you use this notation n of x python will use call
1:45:45
python will use call
1:45:45
python will use call so currently call just return 0.0
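A minimal sketch of the call mechanism just described, with the forward pass still stubbed out to return 0.0. The weights are left out on purpose at this stage; the attribute name nin is an assumption.

```python
class Neuron:
    def __init__(self, nin):
        self.nin = nin  # number of inputs (weights are omitted in this stub)

    def __call__(self, x):
        # placeholder forward pass, to be replaced by tanh(w . x + b)
        return 0.0

x = [2.0, 3.0]
n = Neuron(2)       # a two-dimensional neuron, because x has two numbers
out = n(x)          # python turns n(x) into n.__call__(x)
print(out)          # 0.0
```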
1:45:50
now we'd like to actually do the forward
1:45:52
now we'd like to actually do the forward
1:45:52
now we'd like to actually do the forward pass of this neuron instead
1:45:54
pass of this neuron instead
1:45:54
pass of this neuron instead so what we're going to do here first is we
1:45:57
so what we're going to do here first is we
1:45:57
so what we're going to do here first is we need to basically multiply all of the
1:45:58
need to basically multiply all of the
1:45:58
need to basically multiply all of the elements of w with all of the elements
1:46:01
elements of w with all of the elements
1:46:01
elements of w with all of the elements of x pairwise we need to multiply them
1:46:04
of x pairwise we need to multiply them
1:46:04
of x pairwise we need to multiply them so the first thing we're going to do is
1:46:05
so the first thing we're going to do is
1:46:05
so the first thing we're going to do is we're going to zip up
1:46:07
we're going to zip up
1:46:07
we're going to zip up self.w and x
1:46:09
self.w and x
1:46:09
self.w and x and in python zip takes two iterators
1:46:12
and in python zip takes two iterators
1:46:12
and in python zip takes two iterators and it creates a new iterator that
1:46:14
and it creates a new iterator that
1:46:14
and it creates a new iterator that iterates over the tuples of the
1:46:16
iterates over the tuples of the
1:46:16
iterates over the tuples of the corresponding entries
1:46:17
corresponding entries
1:46:17
corresponding entries so for example just to show you we can
1:46:19
so for example just to show you we can
1:46:20
so for example just to show you we can print this list
1:46:21
print this list
1:46:22
print this list and still return 0.0 here
1:46:30
sorry
1:46:34
so we see that these w's are paired up
1:46:36
so we see that these w's are paired up
1:46:36
so we see that these w's are paired up with the x's w with x
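To make the pairing concrete, here is a tiny sketch of zip on two lists. The weight values are made up for illustration; in the lecture they are random.

```python
w = [0.5, -0.25]   # hypothetical weights
x = [2.0, 3.0]     # the two inputs

# zip yields tuples of corresponding entries: (w[0], x[0]), (w[1], x[1])
pairs = list(zip(w, x))
print(pairs)       # [(0.5, 2.0), (-0.25, 3.0)]

# pairwise products, the ingredients of the dot product
products = [wi * xi for wi, xi in pairs]
print(products)    # [1.0, -0.75]
```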
1:46:41
and now what we want to do is
1:46:47
for wi xi in
1:46:50
for wi xi in
1:46:50
for wi xi in we want to multiply
1:46:52
we want to multiply
1:46:52
we want to multiply wi times xi
1:46:54
wi times xi
1:46:54
wi times xi and then we want to sum all of that
1:46:56
and then we want to sum all of that
1:46:56
and then we want to sum all of that together
1:46:57
together
1:46:57
together to come up with an activation
1:46:59
to come up with an activation
1:46:59
to come up with an activation and also add self.b on top
1:47:02
and also add self.b on top
1:47:02
and also add self.b on top so that's the raw activation and then of
1:47:04
so that's the raw activation and then of
1:47:04
so that's the raw activation and then of course we need to pass that through a
1:47:05
course we need to pass that through a
1:47:05
course we need to pass that through a non-linearity so what we're going to be
1:47:07
non-linearity so what we're going to be
1:47:07
non-linearity so what we're going to be returning is act.tanh
1:47:09
returning is act.tanh
1:47:09
returning is act.tanh and here's out
1:47:12
and here's out
1:47:12
and here's out so
1:47:13
so
1:47:13
so now we see that we are getting some
1:47:14
now we see that we are getting some
1:47:14
now we see that we are getting some outputs and we get a different output
1:47:16
outputs and we get a different output
1:47:16
outputs and we get a different output from a neuron each time because we are
1:47:17
from a neuron each time because we are
1:47:17
from a neuron each time because we are initializing different weights
1:47:19
initializing different weights
1:47:19
initializing different weights and biases
1:47:21
and biases
1:47:21
and biases and then to be a bit more efficient here
1:47:22
and then to be a bit more efficient here
1:47:22
and then to be a bit more efficient here actually sum by the way takes a second
1:47:25
actually sum by the way takes a second
1:47:25
actually sum by the way takes a second optional parameter which is the start
1:47:28
optional parameter which is the start
1:47:28
optional parameter which is the start and by default the start is zero so
1:47:31
and by default the start is zero so
1:47:31
and by default the start is zero so these elements of this sum will be added
1:47:34
these elements of this sum will be added
1:47:34
these elements of this sum will be added on top of zero to begin with but
1:47:35
on top of zero to begin with but
1:47:35
on top of zero to begin with but actually we can just start with
1:47:37
actually we can just start with
1:47:37
actually we can just start with self.b
1:47:38
self.b
1:47:38
self.b and then we just have an expression like
1:47:39
and then we just have an expression like
1:47:39
and then we just have an expression like this
1:47:45
and then the generator expression here
1:47:47
and then the generator expression here
1:47:47
and then the generator expression here must be parenthesized in python
1:47:49
must be parenthesized in python
1:47:49
must be parenthesized in python there we go
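The forward pass of a single neuron can be sketched as follows. This version is an assumption-laden stand-in: it uses plain python floats and math.tanh instead of the micrograd Value objects from the lecture, so it shows the arithmetic but builds no graph for backpropagation.

```python
import math
import random

class Neuron:
    def __init__(self, nin):
        # one random weight in [-1, 1] per input, plus a random bias
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)

    def __call__(self, x):
        # sum(iterable, start): start the sum at self.b instead of 0.
        # the generator expression needs its own parentheses because it is
        # not the only argument passed to sum
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return math.tanh(act)

x = [2.0, 3.0]
n = Neuron(2)
out = n(x)
print(out)  # differs from run to run because the weights are random
```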
1:47:53
yep so now we can forward a single
1:47:55
yep so now we can forward a single
1:47:55
yep so now we can forward a single neuron next up we're going to define a
1:47:57
neuron next up we're going to define a
1:47:57
neuron next up we're going to define a layer of neurons so here we have a
1:47:59
layer of neurons so here we have a
1:47:59
layer of neurons so here we have a schematic for an mlp
1:48:02
schematic for an mlp
1:48:02
schematic for an mlp so we see that these mlps each layer
1:48:05
so we see that these mlps each layer
1:48:05
so we see that these mlps each layer this is one layer has actually a number
1:48:07
this is one layer has actually a number
1:48:07
this is one layer has actually a number of neurons and they're not connected to
1:48:08
of neurons and they're not connected to
1:48:08
of neurons and they're not connected to each other but all of them are fully
1:48:09
each other but all of them are fully
1:48:09
each other but all of them are fully connected to the input
1:48:11
connected to the input
1:48:11
connected to the input so what is a layer of neurons it's just
1:48:13
so what is a layer of neurons it's just
1:48:13
so what is a layer of neurons it's just it's just a set of neurons evaluated
1:48:15
it's just a set of neurons evaluated
1:48:15
it's just a set of neurons evaluated independently
1:48:16
independently
1:48:16
independently so
1:48:17
so
1:48:17
so in the interest of time i'm going to do
1:48:19
in the interest of time i'm going to do
1:48:20
in the interest of time i'm going to do something fairly straightforward here
1:48:23
something fairly straightforward here
1:48:23
something fairly straightforward here it's um
1:48:25
it's um
1:48:25
it's um literally a layer is just a list of
1:48:27
literally a layer is just a list of
1:48:27
literally a layer is just a list of neurons
1:48:28
neurons
1:48:28
neurons and then how many neurons do we have we
1:48:30
and then how many neurons do we have we
1:48:30
and then how many neurons do we have we take that as an input argument here how
1:48:32
take that as an input argument here how
1:48:32
take that as an input argument here how many neurons do you want in your layer
1:48:34
many neurons do you want in your layer
1:48:34
many neurons do you want in your layer number of outputs in this layer
1:48:36
number of outputs in this layer
1:48:36
number of outputs in this layer and so we just initialize completely
1:48:38
and so we just initialize completely
1:48:38
and so we just initialize completely independent neurons with this given
1:48:40
independent neurons with this given
1:48:40
independent neurons with this given dimensionality and when we call on it we
1:48:43
dimensionality and when we call on it we
1:48:43
dimensionality and when we call on it we just independently
1:48:44
just independently
1:48:44
just independently evaluate them so now instead of a neuron
1:48:47
evaluate them so now instead of a neuron
1:48:47
evaluate them so now instead of a neuron we can make a layer of neurons they are
1:48:49
we can make a layer of neurons they are
1:48:49
we can make a layer of neurons they are two-dimensional neurons and let's have
1:48:51
two-dimensional neurons and let's have
1:48:51
two-dimensional neurons and let's have three of them
1:48:52
three of them
1:48:52
three of them and now we see that we have three
1:48:53
and now we see that we have three
1:48:53
and now we see that we have three independent evaluations of three
1:48:55
independent evaluations of three
1:48:55
independent evaluations of three different neurons
1:48:57
different neurons
1:48:57
different neurons right
1:48:58
right
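A layer as described here, just a list of independently evaluated neurons, might be sketched like this. As an assumption for self-containment, the neuron uses plain floats and math.tanh rather than micrograd Value objects.

```python
import math
import random

class Neuron:
    def __init__(self, nin):
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)

    def __call__(self, x):
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return math.tanh(act)

class Layer:
    def __init__(self, nin, nout):
        # nout independent neurons, each seeing all nin inputs
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        # evaluate every neuron on the same input, independently
        return [n(x) for n in self.neurons]

x = [2.0, 3.0]
n = Layer(2, 3)   # three two-dimensional neurons
outs = n(x)
print(outs)       # three independent outputs
```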
1:48:58
right okay finally let's complete this picture
1:49:00
okay finally let's complete this picture
1:49:00
okay finally let's complete this picture and define an entire multi-layer
1:49:02
and define an entire multi-layer
1:49:02
and define an entire multi-layer perceptron or mlp
1:49:04
perceptron or mlp
1:49:04
perceptron or mlp and as we can see here in an mlp these
1:49:06
and as we can see here in an mlp these
1:49:06
and as we can see here in an mlp these layers just feed into each other
1:49:07
layers just feed into each other
1:49:07
layers just feed into each other sequentially
1:49:09
sequentially
1:49:09
sequentially so let's come here and i'm just going to
1:49:11
so let's come here and i'm just going to
1:49:11
so let's come here and i'm just going to copy the code here in interest of time
1:49:14
copy the code here in interest of time
1:49:14
copy the code here in interest of time so an mlp is very similar
1:49:16
so an mlp is very similar
1:49:16
so an mlp is very similar we're taking the number of inputs
1:49:18
we're taking the number of inputs
1:49:18
we're taking the number of inputs as before but now instead of taking a
1:49:20
as before but now instead of taking a
1:49:20
as before but now instead of taking a single nout which is number of neurons
1:49:22
single nout which is number of neurons
1:49:22
single nout which is number of neurons in a single layer we're going to take a
1:49:24
in a single layer we're going to take a
1:49:24
in a single layer we're going to take a list of nouts and this list defines
1:49:26
list of nouts and this list defines
1:49:26
list of nouts and this list defines the sizes of all the layers that we want
1:49:28
the sizes of all the layers that we want
1:49:28
the sizes of all the layers that we want in our mlp
1:49:30
in our mlp
1:49:30
in our mlp so here we just put them all together
1:49:31
so here we just put them all together
1:49:31
so here we just put them all together and then iterate over consecutive pairs
1:49:34
and then iterate over consecutive pairs
1:49:34
and then iterate over consecutive pairs of these sizes and create layer objects
1:49:36
of these sizes and create layer objects
1:49:36
of these sizes and create layer objects for them
1:49:37
for them
1:49:37
for them and then in the call function we are
1:49:39
and then in the call function we are
1:49:39
and then in the call function we are just calling them sequentially so that's
1:49:41
just calling them sequentially so that's
1:49:41
just calling them sequentially so that's an mlp really
1:49:42
an mlp really
1:49:42
an mlp really and let's actually re-implement this
1:49:44
and let's actually re-implement this
1:49:44
and let's actually re-implement this picture so we want three input neurons
1:49:46
picture so we want three input neurons
1:49:46
picture so we want three input neurons and then two layers of four and an
1:49:47
and then two layers of four and an
1:49:48
and then two layers of four and an output unit
1:49:49
output unit
1:49:49
output unit so
1:49:50
so
1:49:50
so we want
1:49:52
we want
1:49:52
we want a three-dimensional input say this is an
1:49:54
a three-dimensional input say this is an
1:49:54
a three-dimensional input say this is an example input we want three inputs into
1:49:57
example input we want three inputs into
1:49:57
example input we want three inputs into two layers of four and one output
1:50:00
two layers of four and one output
1:50:00
two layers of four and one output and this of course is an mlp
1:50:03
and this of course is an mlp
1:50:03
and this of course is an mlp and there we go that's a forward pass of
1:50:05
and there we go that's a forward pass of
1:50:05
and there we go that's a forward pass of an mlp
1:50:06
an mlp
1:50:06
an mlp to make this a little bit nicer you see
1:50:08
to make this a little bit nicer you see
1:50:08
to make this a little bit nicer you see how we have just a single element but
1:50:09
how we have just a single element but
1:50:09
how we have just a single element but it's wrapped in a list because layer
1:50:11
it's wrapped in a list because layer
1:50:11
it's wrapped in a list because layer always returns lists
1:50:13
always returns lists
1:50:13
always returns lists so for convenience
1:50:15
so for convenience
1:50:15
so for convenience return outs at zero if len of outs is
1:50:18
return outs at zero if len of outs is
1:50:18
return outs at zero if len of outs is exactly a single element
1:50:20
exactly a single element
1:50:20
exactly a single element else return the full list
1:50:22
else return the full list
1:50:22
else return the full list and this will allow us to just get a
1:50:23
and this will allow us to just get a
1:50:23
and this will allow us to just get a single value out at the last layer that
1:50:25
single value out at the last layer that
1:50:25
single value out at the last layer that only has a single neuron
1:50:28
only has a single neuron
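Putting the pieces together, the MLP with three inputs, two layers of four and one output could be sketched as below. Two assumptions: plain floats stand in for micrograd Value objects, and the example input values are made up.

```python
import math
import random

class Neuron:
    def __init__(self, nin):
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)

    def __call__(self, x):
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return math.tanh(act)

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        # unwrap the list when the layer has exactly one neuron
        return outs[0] if len(outs) == 1 else outs

class MLP:
    def __init__(self, nin, nouts):
        sz = [nin] + nouts  # e.g. [3, 4, 4, 1]
        # one Layer per consecutive pair of sizes
        self.layers = [Layer(sz[i], sz[i + 1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)  # each layer feeds the next one
        return x

x = [2.0, 3.0, -1.0]     # hypothetical three-dimensional input
n = MLP(3, [4, 4, 1])
out = n(x)
print(out)               # a single number, not a one-element list
```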
1:50:28
only has a single neuron and finally we should be able to draw
1:50:29
and finally we should be able to draw
1:50:29
and finally we should be able to draw dot of n of x
1:50:31
dot of n of x
1:50:31
dot of n of x and
1:50:32
and
1:50:32
and as you might imagine
1:50:34
as you might imagine
1:50:34
as you might imagine these expressions are now getting
1:50:36
these expressions are now getting
1:50:36
these expressions are now getting relatively involved
1:50:38
relatively involved
1:50:38
relatively involved so this is an entire mlp that we're
1:50:40
so this is an entire mlp that we're
1:50:40
so this is an entire mlp that we're defining now
1:50:45
all the way until a single output
1:50:48
all the way until a single output
1:50:48
all the way until a single output okay
1:50:49
okay
1:50:49
okay and so obviously you would never
1:50:50
and so obviously you would never
1:50:50
and so obviously you would never differentiate on pen and paper these
1:50:52
differentiate on pen and paper these
1:50:52
differentiate on pen and paper these expressions but with micrograd we will
1:50:55
expressions but with micrograd we will
1:50:55
expressions but with micrograd we will be able to back propagate all the way
1:50:56
be able to back propagate all the way
1:50:56
be able to back propagate all the way through this
1:50:58
through this
1:50:58
through this and back propagate
1:50:59
and back propagate
1:50:59
and back propagate into
1:51:00
into
1:51:00
into these weights of all these neurons so
1:51:02
these weights of all these neurons so
1:51:02
these weights of all these neurons so let's see how that works okay so let's
1:51:04
let's see how that works okay so let's
1:51:04
let's see how that works okay so let's create ourselves a very simple
1:51:06
create ourselves a very simple
1:51:06
create ourselves a very simple example data set here
1:51:08
example data set here
1:51:08
example data set here so this data set has four examples
1:51:11
so this data set has four examples
1:51:11
so this data set has four examples and so we have four possible
1:51:13
and so we have four possible
1:51:13
and so we have four possible inputs into the neural net
1:51:15
inputs into the neural net
1:51:15
inputs into the neural net and we have four desired targets so we'd
1:51:17
and we have four desired targets so we'd
1:51:17
and we have four desired targets so we'd like the neural net to assign
1:51:21
like the neural net to assign
1:51:21
like the neural net to assign or output 1.0 when it's fed this example
1:51:24
or output 1.0 when it's fed this example
1:51:24
or output 1.0 when it's fed this example negative one when it's fed these
1:51:25
negative one when it's fed these
1:51:25
negative one when it's fed these examples and one when it's fed this
1:51:26
examples and one when it's fed this
1:51:26
examples and one when it's fed this example so it's a very simple binary
1:51:28
example so it's a very simple binary
1:51:28
example so it's a very simple binary classifier neural net basically that we
1:51:30
classifier neural net basically that we
1:51:30
classifier neural net basically that we would like here
1:51:32
would like here
1:51:32
would like here now let's think what the neural net
1:51:33
now let's think what the neural net
1:51:33
now let's think what the neural net currently thinks about these four
1:51:34
currently thinks about these four
1:51:34
currently thinks about these four examples we can just get their
1:51:36
examples we can just get their
1:51:36
examples we can just get their predictions
1:51:37
predictions
1:51:37
predictions um basically we can just call n of x for
1:51:40
um basically we can just call n of x for
1:51:40
um basically we can just call n of x for x in xs
1:51:41
x in xs
1:51:42
x in xs and then we can
1:51:43
and then we can
1:51:43
and then we can print
1:51:45
print
1:51:45
print so these are the outputs of the neural
1:51:46
so these are the outputs of the neural
1:51:46
so these are the outputs of the neural net on those four examples
1:51:48
net on those four examples
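A sketch of the data set and the prediction step. The targets 1, -1, -1, 1 follow the description above; the input values in xs are assumed for illustration, and the network again uses plain floats instead of micrograd Value objects.

```python
import math
import random

class Neuron:
    def __init__(self, nin):
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)

    def __call__(self, x):
        return math.tanh(sum((wi * xi for wi, xi in zip(self.w, x)), self.b))

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs

class MLP:
    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i + 1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# four examples with three inputs each (values assumed for illustration)
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]  # desired targets

n = MLP(3, [4, 4, 1])
ypred = [n(x) for x in xs]   # what the untrained network currently says

for target, pred in zip(ys, ypred):
    print(f"target {target:+.1f}  prediction {pred:+.4f}")
```

With random initial weights the predictions will not match the targets yet; closing that gap is what the loss is for.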
1:51:48
net on those four examples so
1:51:50
So the first one is 0.91, but we'd like it to be one, so we should push this one higher. This one says 0.88, and we want this to be negative one. This is 0.8, and we want it to be negative one. And this one is 0.8, and we want it to be one.
1:52:10
So how do we tune the weights of the neural net to better predict the desired targets? The trick used in deep learning to achieve this is to calculate a single number that somehow measures the total performance of your neural net, and we call this single number the loss.
1:52:29
So the loss is a single number that we're going to define that basically measures how well the neural net is performing. Right now we have the intuitive sense that it's not performing very well, because the predictions are not very close to the targets. So the loss will be high, and we'll want to minimize the loss.
1:52:46
So in particular, in this case, we're going to implement the mean squared error loss. What this is doing is we're going to iterate for y ground truth and y output in zip of ys and ypred. So we're going to pair up the ground truths with the predictions, and this zip iterates over tuples of them. For each y ground truth and y output, we're going to subtract them and square them.
1:53:18
So let's first see what these losses are. These are the individual loss components: for each one of the four, we are taking the prediction and the ground truth, subtracting them, and squaring them.
1:53:33
Because this one is so close to its target (0.91 is almost one), subtracting them gives a very small number. So here we would get something like negative 0.1, and then squaring it just makes sure that, regardless of whether we are more negative or more positive, we never get a negative number. Instead of squaring, we could also take, for example, the absolute value; we need to discard the sign.
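As a rough illustration of the point about discarding the sign, the sketch below computes both the squared and the absolute error for each example. The predictions are the rounded values read out in the lecture, not the exact notebook outputs, so treat the numbers as approximate.

```python
# Targets from the lecture and predictions rounded as spoken (approximate).
ys = [1.0, -1.0, -1.0, 1.0]
ypred = [0.91, 0.88, 0.8, 0.8]

# zip pairs each ground truth with its prediction.
squared = [(yout - ygt) ** 2 for ygt, yout in zip(ys, ypred)]
absolute = [abs(yout - ygt) for ygt, yout in zip(ys, ypred)]

# Both variants drop the sign of the difference, so no entry is negative.
for ygt, yout, sq, ab in zip(ys, ypred, squared, absolute):
    print(f"target {ygt:+.1f}  prediction {yout:+.2f}  squared {sq:.4f}  absolute {ab:.2f}")
```

Either choice is zero only when the prediction equals the target; squaring simply penalizes large misses more heavily than small ones.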
1:54:00
And so you see that the expression is arranged so that you only get zero exactly when y out is equal to y ground truth. When those two are equal, so your prediction is exactly the target, you are going to get zero. And if your prediction is not the target, you are going to get some other number. So here, for example, we are way off, and that's why the loss is quite high. The more off we are, the greater the loss will be. So we don't want high loss; we want low loss.
1:54:30
And so the final loss here will be just the sum of all of these numbers. You see that this should be roughly zero, plus roughly zero, but plus about seven. So the loss should be about seven here.
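A minimal sketch of the total loss described here, using plain floats rather than micrograd Value objects. The helper name total_squared_error is made up for this example, and the predictions are the rounded values spoken in the lecture, so the result only approximates the notebook's number. Note that the loss is a sum of squared errors; dividing by the number of examples would give the mean.

```python
def total_squared_error(targets, predictions):
    """Sum of squared differences between targets and predictions."""
    return sum((yout - ygt) ** 2 for ygt, yout in zip(targets, predictions))

ys = [1.0, -1.0, -1.0, 1.0]
ypred = [0.91, 0.88, 0.8, 0.8]   # rounded values as spoken (approximate)

loss = total_squared_error(ys, ypred)
print(loss)   # close to 7: two terms are near zero, two are large

# The lowest possible value is reached when every prediction hits its target.
perfect = total_squared_error(ys, ys)
print(perfect)
```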
1:54:47
And now we want to minimize the loss. We want the loss to be low, because if the loss is low, then every one of the predictions is close to its target. The lowest the loss can be is zero, and the greater it is, the worse the neural net is predicting.
1:55:05
So now, of course, if we do loss.backward(), something magical happened when I hit enter. And the magical thing that happened is that we can look at n.layers[0].neurons[0], that is, the first neuron of the first layer. Remember that the MLP has layers, which is a list, and each layer has neurons, which is a list, and that gives us an individual neuron.
1:55:30
And then it's got some weights, so we can, for example, look at the weight at index zero (oops, it's not called weights, it's called w). That's a Value, but now this Value also has a grad because of the backward pass.
1:55:52
And so we see that because the gradient on this particular weight of this particular neuron of this particular layer is negative, its influence on the loss is also negative. So slightly increasing this particular weight of this neuron of this layer would make the loss go down.
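To make the sign argument concrete, here is a small sketch with a hypothetical one-weight model (not the lecture's MLP). It estimates the gradient numerically and checks that, when the gradient is negative, nudging the weight up lowers the loss. The function loss_fn and the values of x, y and w are assumptions chosen for illustration.

```python
def loss_fn(w, x=2.0, y=1.0):
    """Squared error of a one-weight model w * x against target y."""
    return (w * x - y) ** 2

w = 0.1
h = 1e-6

# Central-difference estimate of d(loss)/dw at the current weight.
grad = (loss_fn(w + h) - loss_fn(w - h)) / (2 * h)
print(grad)   # negative: the prediction 0.2 is below the target 1.0

# Because the gradient is negative, a small increase in w lowers the loss.
lower_after_increase = loss_fn(w + 0.01) < loss_fn(w)
print(lower_after_increase)
```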
1:56:10
And we actually have this information for every single one of our neurons and all their parameters. Actually, it's worth looking at draw_dot of the loss as well, by the way. Previously we looked at the draw_dot of a single neuron's forward pass, and that was already a large expression. But what is this expression? We actually forwarded every one of those four examples, and then we have the loss on top of them with the mean squared error.
1:56:32
And so this is a really massive graph. This graph that we've built up now (oh my gosh) is kind of excessive. It's excessive because it has four forward passes of a neural net, one for every one of the examples, and then it has the loss on top, and it ends with the value of the loss, which was 7.12.
1:56:56
And this loss will now backpropagate through all four forward passes, all the way through every single intermediate value of the neural net, all the way back to, of course, the weight parameters, which are inputs. So these weight parameters here are inputs to this neural net, and these numbers here, these scalars, are the input data to the neural net. If we went around here, we'd probably find some of these examples, this 1.0, potentially maybe this 1.0, or some of the others, and you'll see that they all have gradients as well.
1:57:28
The thing is, these gradients on the input data are not that useful to us. That's because the input data is not changeable: it's a given to the problem, so it's a fixed input. We're not going to be changing it or messing with it, even though we do have gradients for it. But some of these gradients here will be for the neural network parameters, the ws and the bs, and those, of course, we want to change.
1:57:58
Okay, so now we're going to want some convenience code to gather up all of the parameters of the neural net so that we can operate on all of them simultaneously. Every one of them we will nudge a tiny amount based on the gradient information. So let's collect the parameters of the neural net all in one array.
1:58:17
So let's create a parameters of self method that just returns self.w, which is a list, concatenated with a list containing self.b. This will just return a list; list plus list just gives you a list. So that's parameters of Neuron. I'm calling it this way because PyTorch also has a parameters method on every single nn.Module, and it does essentially what we're doing here: it returns the parameter tensors. For us, it's the parameter scalars.
1:58:50
Now, Layer is also a module, so it will have parameters itself. Basically, what we want to do here is something like this: params is an empty list, and then for neuron in self.neurons, we want to get neuron.parameters() and we want to params.extend. So ps are the parameters of this neuron, and then we want to put them on top of params, so params.extend(ps), and then we want to return params.
1:59:25
This is way too much code, so actually there's a way to simplify this, which is: return p for neuron in self.neurons for p in neuron.parameters(). It's a single list comprehension; in Python you can sort of nest them like this and then create the desired array. So these are identical, and we can take the longer version out. And then let's do the same here.
2:00:04
def parameters of self, and return p for layer in self.layers for p in layer.parameters(). And that should be good.
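The claim that the explicit loop and the nested list comprehension are identical can be checked with a small sketch. The nested lists of strings below are hypothetical stand-ins for what each neuron's parameters() call would return.

```python
# Hypothetical per-neuron parameter lists, standing in for neuron.parameters().
neuron_params = [["w00", "w01", "b0"], ["w10", "w11", "b1"]]

# Explicit version: extend a running list with each neuron's parameters.
params = []
for ps in neuron_params:
    params.extend(ps)

# Nested comprehension: the outer loop is written first, the inner loop second.
flat = [p for ps in neuron_params for p in ps]

print(flat == params)
```

The order of the two for clauses in the comprehension matches the order they would have as nested for statements, which is why the flattened result preserves the original ordering.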
2:00:26
Now let me pop this out so we don't re-initialize our network. Okay, so unfortunately we will probably have to re-initialize the network, because we just added functionality to this class. I want to get all of n.parameters(), but that's not going to work, because n is an instance of the old class. So unfortunately we do have to re-initialize the network, which will change some of the numbers.
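The reason the network has to be re-created can be sketched in plain Python: redefining a class (as happens when a notebook cell is re-run) binds the name to a new class object, while existing instances still point at the old one. The class Net here is a hypothetical stand-in, not the lecture's MLP.

```python
class Net:
    def __init__(self):
        self.w = [0.1, 0.2]

old_n = Net()   # created from the first version of the class

class Net:      # re-running the cell rebinds the name Net to a new class
    def __init__(self):
        self.w = [0.1, 0.2]

    def parameters(self):
        return self.w

new_n = Net()   # created after the redefinition

print(hasattr(old_n, "parameters"))   # the old instance lacks the new method
print(hasattr(new_n, "parameters"))   # the fresh instance has it
```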
2:00:57
But let me do that so that we pick up the new API. We can now do n.parameters(), and these are all the weights and biases inside the entire neural net. So in total, this MLP has 41 parameters.
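A minimal sketch of how the three parameters methods fit together, using plain floats in place of micrograd Value objects and omitting the forward pass. The network shape MLP(3, [4, 4, 1]) is an assumption here (it is not stated in this part of the transcript), chosen because it is consistent with the count of 41.

```python
import random

class Neuron:
    def __init__(self, nin):
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)

    def parameters(self):
        # list + list gives one flat list of this neuron's weights and bias
        return self.w + [self.b]

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]

class MLP:
    def __init__(self, nin, nouts):
        sizes = [nin] + nouts
        self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(nouts))]

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

n = MLP(3, [4, 4, 1])   # assumed shape: 3 inputs, two layers of 4, 1 output
count = len(n.parameters())
print(count)            # 4*(3+1) + 4*(4+1) + 1*(4+1) = 41
```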
2:01:11
And now we'll be able to change them. If we recalculate the loss here, we see that unfortunately we have slightly different predictions and a slightly different loss, but that's okay.
2:01:28
Okay, so we see that this neuron's gradient is slightly negative. We can also look at its data right now, which is 0.85. So this is the current value of this weight, and this is its gradient on the loss.
2:01:45
So what we want to do now is iterate for every p in n.parameters(), so for all the 41 parameters in this neural net, and change p.data slightly according to the gradient information.
2:02:00
Okay, so there's something to fill in here, but this will basically be a tiny update in this gradient descent scheme. In gradient descent, we are thinking of the gradient as a vector pointing in the direction of increased loss. And so in gradient descent, we are modifying p.data by a small step size in the direction of the gradient. The step size, as an example, could be a very small number, like 0.01, times p.grad. Right?
2:02:37
But we have to think through some of the signs here. In particular, working with this specific example, we see that if we just left it like this, then this weight's value would currently be increased by a tiny amount of the gradient. The gradient is negative, so the value would go slightly down; it would become something like 0.84. But if this value goes lower, that would actually increase the loss. That's because the derivative of the loss with respect to this weight is negative, so increasing it makes the loss go down. So increasing it is what we want to do, instead of decreasing it. Basically, what we're missing here is a negative sign. And that's because we want to minimize the loss; we don't want to maximize the loss, we want to decrease it.
2:03:34
and the other interpretation as i
2:03:34
and the other interpretation as i mentioned is you can think of the
2:03:36
mentioned is you can think of the
2:03:36
mentioned is you can think of the gradient vector
2:03:37
gradient vector
2:03:37
gradient vector so basically just the vector of all the
2:03:39
so basically just the vector of all the
2:03:39
so basically just the vector of all the gradients
2:03:40
gradients
2:03:40
gradients as pointing in the direction of
2:03:42
as pointing in the direction of
2:03:42
as pointing in the direction of increasing
2:03:44
increasing
2:03:44
increasing the loss but then we want to decrease it
2:03:46
the loss but then we want to decrease it
2:03:46
the loss but then we want to decrease it so we actually want to go in the
2:03:47
so we actually want to go in the
2:03:47
so we actually want to go in the opposite direction
2:03:49
opposite direction
2:03:49
opposite direction and so you can convince yourself that
2:03:50
and so you can convince yourself that
2:03:50
and so you can convince yourself that this sort of plug does the right thing
2:03:51
this sort of plug does the right thing
2:03:51
this sort of plug does the right thing here with the negative because we want
2:03:53
here with the negative because we want
2:03:53
here with the negative because we want to minimize the loss
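The sign argument can be checked numerically with a minimal sketch. This is not the notebook code from the lecture: `loss_fn`, `grad_fn`, and the numbers are hypothetical, chosen so that the gradient is negative, as in the example being discussed.

```python
# Hypothetical one-parameter model: prediction = w * x, loss = (w * x - y) ** 2
def loss_fn(w, x=2.0, y=3.0):
    return (w * x - y) ** 2

def grad_fn(w, x=2.0, y=3.0):
    # d/dw (w * x - y) ** 2 = 2 * (w * x - y) * x
    return 2.0 * (w * x - y) * x

w = 0.5
lr = 0.01                 # step size
g = grad_fn(w)            # negative here, so increasing w lowers the loss

w_wrong = w + lr * g      # steps along the gradient: w decreases, loss rises
w_right = w - lr * g      # steps against the gradient: w increases, loss falls

print(loss_fn(w), loss_fn(w_wrong), loss_fn(w_right))
```

With a negative gradient, the update without the minus sign moves the parameter down and raises the loss, while the update with the minus sign moves it up and lowers the loss.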
2:03:55
to minimize the loss
2:03:55
to minimize the loss so if we nudge all the parameters by
2:03:57
so if we nudge all the parameters by
2:03:57
so if we nudge all the parameters by tiny amount
2:04:00
then we'll see that
2:04:02
then we'll see that
2:04:02
then we'll see that this data will have changed a little bit
2:04:04
this data will have changed a little bit
2:04:04
this data will have changed a little bit so now this neuron
2:04:06
so now this neuron
2:04:06
so now this neuron is a tiny amount greater
2:04:08
is a tiny amount greater
2:04:08
is a tiny amount greater value so 0.854 went to 0.857
2:04:13
value so 0.854 went to 0.857
2:04:13
value so 0.854 went to 0.857 and that's a good thing because slightly
2:04:16
and that's a good thing because slightly
2:04:16
and that's a good thing because slightly increasing this neuron
2:04:18
increasing this neuron
2:04:18
increasing this neuron uh
2:04:18
uh
2:04:18
uh data makes the loss go down according to
2:04:21
data makes the loss go down according to
2:04:21
data makes the loss go down according to the gradient and so the correct thing
2:04:23
the gradient and so the correct thing
2:04:23
the gradient and so the correct thing has happened sign wise
2:04:26
has happened sign wise
2:04:26
has happened sign wise and so now what we would expect of
2:04:27
and so now what we would expect of
2:04:27
and so now what we would expect of course is that
2:04:29
course is that
2:04:29
course is that because we've changed all these
2:04:30
because we've changed all these
2:04:30
because we've changed all these parameters we expect that the loss
2:04:32
parameters we expect that the loss
2:04:32
parameters we expect that the loss should have gone down a bit
2:04:35
should have gone down a bit
2:04:35
should have gone down a bit so we want to re-evaluate the loss let
2:04:37
so we want to re-evaluate the loss let
2:04:37
so we want to re-evaluate the loss let me basically
2:04:39
this is just a data definition that
2:04:41
this is just a data definition that
2:04:41
this is just a data definition that hasn't changed but the forward pass here
2:04:44
hasn't changed but the forward pass here
2:04:44
hasn't changed but the forward pass here of the network we can recalculate
2:04:49
and actually let me do it outside here
2:04:51
and actually let me do it outside here
2:04:51
and actually let me do it outside here so that we can compare the two loss
2:04:52
so that we can compare the two loss
2:04:52
so that we can compare the two loss values
2:04:54
values
2:04:54
values so here if i recalculate the loss
2:04:57
so here if i recalculate the loss
2:04:57
so here if i recalculate the loss we'd expect the new loss now to be
2:04:59
we'd expect the new loss now to be
2:04:59
we'd expect the new loss now to be slightly lower than this number so
2:05:01
slightly lower than this number so
2:05:01
slightly lower than this number so hopefully what we're getting now is a
2:05:03
hopefully what we're getting now is a
2:05:03
hopefully what we're getting now is a tiny bit lower than 4.84
2:05:06
tiny bit lower than 4.84
2:05:06
tiny bit lower than 4.84 4.36
2:05:08
4.36
2:05:08
4.36 okay and remember the way we've arranged
2:05:10
okay and remember the way we've arranged
2:05:10
okay and remember the way we've arranged this is that low loss means that our
2:05:12
this is that low loss means that our
2:05:12
this is that low loss means that our predictions are matching the targets so
2:05:15
predictions are matching the targets so
2:05:15
predictions are matching the targets so our predictions now are probably
2:05:16
our predictions now are probably
2:05:16
our predictions now are probably slightly closer to the
2:05:18
slightly closer to the
2:05:18
slightly closer to the targets and now all we have to do is we
2:05:22
targets and now all we have to do is we
2:05:22
targets and now all we have to do is we have to iterate this process
2:05:24
So again, we've done the forward pass and this is the loss. Now we can do loss.backward(), let me take these out, and we can do a step. And now we should have a slightly lower loss: 4.36 goes to 3.9. And okay, so we've done the forward pass, here's the backward pass, nudge, and now the loss is 3.66, 3.47, and you get the idea. We just continue doing this, and this is gradient descent: we're just iteratively doing forward pass, backward pass, update, forward pass, backward pass, update, and the neural net is improving its predictions.
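The forward pass, backward pass, update cycle can be sketched in plain Python. This is only an illustration under stated assumptions: a hypothetical one-weight linear model with a hand-derived gradient stands in for the lecture's MLP and its loss.backward() call, and the data, learning rate, and step count are made up.

```python
# Hypothetical data: targets follow y = 2 * x, so the ideal weight is 2.0
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0        # single trainable weight
lr = 0.05      # step size
losses = []

for step in range(20):
    # forward pass: predictions and sum-of-squares loss
    preds = [w * x for x in xs]
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys))
    losses.append(loss)

    # backward pass: analytic d(loss)/dw, standing in for backpropagation
    grad = sum(2.0 * (p - y) * x for p, y, x in zip(preds, ys, xs))

    # update: step against the gradient
    w -= lr * grad

print(w, losses[0], losses[-1])
```

Each iteration re-runs the forward pass with the updated weight, which is why the recorded loss falls from one step to the next.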
2:06:05
So here, if we look at ypred now, we see that this value should be getting closer to one, so this value should be getting more positive, these should be getting more negative, and this one should also be getting more positive. So if we just iterate this a few more times. Actually, we may be able to afford to go a bit faster. Let's try a slightly higher learning rate. Oops, okay, there we go. So now we're at 0.31.
2:06:39
If you go too fast, by the way, if you try to make too big of a step, you may actually overstep. It's overconfidence, because again, remember, we don't actually know exactly about the loss function. The loss function has all kinds of structure, and we only know about the very local dependence of the loss on all these parameters. But if we step too far, we may step into a part of the loss that is completely different, and that can destabilize training and make your loss actually blow up even.
2:07:08
So the loss is now 0.04, so actually the predictions should be really quite close. Let's take a look. So you see how this is almost one, almost negative one, almost one. We can continue going. So, yep, backward, update. Oops, there we go. So we went way too fast and we actually overstepped; we got too eager. Where are we now? Oops. Okay, 7e-9, so this is very, very low loss, and the predictions are basically perfect. So somehow we were doing way too big updates and we briefly exploded, but then somehow we ended up getting into a really good spot.
2:07:51
So usually this learning rate and the tuning of it is a subtle art. You want to set your learning rate carefully: if it's too low, you're going to take way too long to converge, but if it's too high, the whole thing gets unstable and you might actually even explode the loss, depending on your loss function. So finding the step size to be just right is a pretty subtle art sometimes when you're using sort of vanilla gradient descent.
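The effect of the step size can be seen on a toy problem. The sketch below assumes a hypothetical loss w ** 2, not the lecture's network, and runs vanilla gradient descent with three illustrative learning rates.

```python
def run(lr, steps=20, w=1.0):
    """Run vanilla gradient descent on loss(w) = w ** 2 and return the final loss."""
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w ** 2
        w -= lr * grad
    return w * w

too_small = run(0.001)   # barely moves in 20 steps
reasonable = run(0.1)    # shrinks the loss steadily
too_large = run(1.1)     # every step overshoots further, so the loss grows

print(too_small, reasonable, too_large)
```

On this loss each update multiplies w by (1 - 2 * lr), so any learning rate above 1.0 makes the iterates grow in magnitude instead of shrinking. A real network's loss has no such simple threshold, which is part of why tuning is described as an art.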
2:08:15
But we happened to get into a good spot. We can look at n.parameters(). So this is the setting of weights and biases that makes our network predict the desired targets very, very closely. And basically, we've successfully trained a neural net.
2:08:38
Okay, let's make this a tiny bit more respectable and implement an actual training loop and see what that looks like. So this is the data definition, which stays. This is the forward pass. So for k in range, you know, we're going to take a bunch of steps. First you do the forward pass and we evaluate the loss. Let's re-initialize the neural net from scratch, and here's the data. We first do the forward pass, then we do the backward pass, and then we do an update. That's gradient descent.
2:09:26
And then we should be able to iterate this, and we should be able to print the current step and the current loss. Let's just print the number of the loss, and that should be it. And then the learning rate: 0.01 is a little too small, 0.1 we saw is a little bit dangerously too high, so let's go somewhere in between. And we'll optimize this for not 10 steps, but let's go for, say, 20 steps. Let me erase all of this junk, and let's run the optimization.
2:10:03
And you see how we've actually converged slower, in a more controlled manner, and got to a loss that is very low. So I expect ypred to be quite good. There we go. And that's it.
2:10:24
Okay, so this is kind of embarrassing, but we actually have a really terrible bug in here. It's a subtle bug, and it's a very common bug, and I can't believe I've done it for the 20th time in my life, especially on camera. I could have reshot the whole thing, but I think it's pretty funny, and you get to appreciate a bit what working with neural nets is maybe like sometimes. We are guilty of a common bug. I've actually tweeted the most common neural net mistakes a long time ago now, and I'm not really going to explain any of these, except that we are guilty of number three: you forgot to zero grad before .backward(). What is that?
2:11:09
Basically what's happening, and it's a subtle bug and I'm not sure if you saw it, is that all of these weights here have a .data and a .grad, and that grad starts at zero. Then we do backward and we fill in the gradients, and then we do an update on the data, but we don't flush the grad. It stays there. So when we do the second forward pass and we do backward again, remember that all the backward operations do a plus-equals on the grad, and so these gradients just add up and they never get reset to zero.
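The accumulation can be reproduced in a few lines. The sketch assumes a hypothetical `Param` class and a hand-written `backward` function in place of the lecture's Value class; the only property it borrows is that backward writes into the grad with a plus-equals.

```python
class Param:
    """Hypothetical stand-in for a trainable value with .data and .grad."""
    def __init__(self, data):
        self.data = data
        self.grad = 0.0          # starts at zero, as in the constructor

def backward(w, x, y):
    # loss = (w.data * x - y) ** 2; accumulate d(loss)/dw with +=
    w.grad += 2.0 * (w.data * x - y) * x

w = Param(1.0)

backward(w, x=2.0, y=3.0)
first = w.grad               # gradient from one backward pass

backward(w, x=2.0, y=3.0)    # no reset in between
stale = w.grad               # the second gradient was added on top of the first

w.grad = 0.0                 # the missing zero-grad step
backward(w, x=2.0, y=3.0)
fresh = w.grad               # matches the single-pass gradient again

print(first, stale, fresh)
```

Because w.data is unchanged between the calls, the stale value is exactly twice the true gradient. In a training loop the data does change each step, so the stale grad becomes a running sum of every earlier gradient.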
2:11:44
add up and they never get reset to zero
2:11:44
add up and they never get reset to zero so basically we didn't zero grad so
2:11:47
so basically we didn't zero grad so
2:11:47
so basically we didn't zero grad so here's how we zero grad before
2:11:50
here's how we zero grad before
2:11:50
here's how we zero grad before backward
2:11:51
backward
2:11:51
backward we need to iterate over all the
2:11:52
we need to iterate over all the
2:11:52
we need to iterate over all the parameters
2:11:54
parameters
2:11:54
parameters and we need to make sure that p dot grad
2:11:56
and we need to make sure that p dot grad
2:11:56
and we need to make sure that p dot grad is set to zero
2:11:58
is set to zero
2:11:58
is set to zero we need to reset it to zero just like it
2:12:00
we need to reset it to zero just like it
2:12:00
we need to reset it to zero just like it is in the constructor
2:12:02
is in the constructor
2:12:02
is in the constructor so remember all the way here for all
2:12:04
so remember all the way here for all
2:12:04
so remember all the way here for all these value nodes grad is reset to zero
2:12:07
these value nodes grad is reset to zero
2:12:07
these value nodes grad is reset to zero and then all these backward passes do a
2:12:09
and then all these backward passes do a
2:12:09
and then all these backward passes do a plus equals from that grad
2:12:11
plus equals from that grad
2:12:11
plus equals from that grad but we need to make sure that
2:12:13
but we need to make sure that
2:12:13
but we need to make sure that we reset these graphs to zero so that
2:12:15
we reset these graphs to zero so that
2:12:15
we reset these graphs to zero so that when we do backward
2:12:17
when we do backward
2:12:17
when we do backward all of them start at zero and the actual
2:12:18
all of them start at zero and the actual
2:12:18
all of them start at zero and the actual backward pass accumulates um
2:12:21
backward pass accumulates um
2:12:21
backward pass accumulates um the loss derivatives into the grads
2:12:25
the loss derivatives into the grads
2:12:25
the loss derivatives into the grads so this is zero grad in pytorch
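A training loop with the zero-grad step in place might look like the sketch below. To stay self-contained it uses a cut-down `Value` class (only `+`, `*`, and `tanh`) and a single tanh neuron rather than the lecture's MLP; the dataset, the seed, the learning rate of 0.05, and the 50 steps are all illustrative assumptions.

```python
import math
import random

class Value:
    """Cut-down scalar autograd node supporting +, * and tanh."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1.0 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topological order, then apply the chain rule from the output back
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

random.seed(0)
xs = [[2.0, 3.0, -1.0], [3.0, -1.0, 0.5], [0.5, 1.0, 1.0], [1.0, 1.0, -1.0]]
ys = [1.0, -1.0, -1.0, 1.0]
params = [Value(random.uniform(-1, 1)) for _ in range(4)]  # 3 weights + 1 bias

def predict(x):
    act = params[3]                      # bias
    for w, xi in zip(params, x):         # zip stops after the 3 weights
        act = act + w * Value(xi)
    return act.tanh()

losses = []
for k in range(50):
    # forward pass: sum of squared errors
    ypred = [predict(x) for x in xs]
    loss = Value(0.0)
    for yout, ygt in zip(ypred, ys):
        diff = yout + Value(-ygt)
        loss = loss + diff * diff

    # zero grad, then backward pass
    for p in params:
        p.grad = 0.0
    loss.backward()

    # update: step against the gradient
    for p in params:
        p.data += -0.05 * p.grad

    losses.append(loss.data)

print(losses[0], losses[-1])
```

The reset sits directly before loss.backward(), so every backward pass starts accumulating from zero. Only the parameters need resetting, because the intermediate nodes are rebuilt on each forward pass.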
2:12:28
so this is zero grad in pytorch
2:12:28
so this is zero grad in pytorch and uh
2:12:30
and uh
2:12:30
and uh we will slightly get we'll get a
2:12:31
we will slightly get we'll get a
2:12:31
we will slightly get we'll get a slightly different optimization let's
2:12:33
slightly different optimization let's
2:12:33
slightly different optimization let's reset the neural net
2:12:34
reset the neural net
2:12:34
reset the neural net the data is the same this is now i think
2:12:37
the data is the same this is now i think
2:12:37
the data is the same this is now i think correct
2:12:38
correct
2:12:38
correct and we get a much more
2:12:39
and we get a much more
2:12:40
and we get a much more you know we get a much more
2:12:42
you know we get a much more
2:12:42
you know we get a much more slower descent
2:12:44
slower descent
2:12:44
slower descent we still end up with pretty good results
2:12:46
we still end up with pretty good results
2:12:46
we still end up with pretty good results and we can continue this a bit more
2:12:48
and we can continue this a bit more
2:12:48
and we can continue this a bit more to get down lower
2:12:50
to get down lower
2:12:50
to get down lower and lower
2:12:51
and lower
2:12:51
and lower and lower
2:12:54
yeah
2:12:56
yeah
2:12:56
yeah so the only reason that the previous
2:12:57
so the only reason that the previous
2:12:57
so the only reason that the previous thing worked it's extremely buggy um the
2:12:59
thing worked it's extremely buggy um the
2:12:59
thing worked it's extremely buggy um the only reason that worked is that
2:13:03
only reason that worked is that
2:13:03
only reason that worked is that this is a very very simple problem
2:13:05
this is a very very simple problem
2:13:05
this is a very very simple problem and it's very easy for this neural net
2:13:07
and it's very easy for this neural net
2:13:07
and it's very easy for this neural net to fit this data
2:13:09
to fit this data
2:13:09
to fit this data and so the grads ended up accumulating
2:13:12
and so the grads ended up accumulating
2:13:12
and so the grads ended up accumulating and it effectively gave us a massive
2:13:13
and it effectively gave us a massive
2:13:13
and it effectively gave us a massive step size and it made us converge
2:13:16
step size and it made us converge
2:13:16
step size and it made us converge extremely fast
2:13:19
but basically now we have to do more
2:13:20
but basically now we have to do more
2:13:20
but basically now we have to do more steps to get to very low values of loss
2:13:24
steps to get to very low values of loss
2:13:24
steps to get to very low values of loss and get wipe red to be really good we
2:13:26
and get wipe red to be really good we
2:13:26
and get wipe red to be really good we can try to
2:13:27
can try to
2:13:27
can try to step a bit greater
2:13:34
yeah we're gonna get closer and closer
2:13:36
yeah we're gonna get closer and closer
2:13:36
yeah we're gonna get closer and closer to one minus one and one
2:13:38
to one minus one and one
2:13:38
to one minus one and one so
2:13:39
so
2:13:39
so working with neural nets is sometimes
2:13:41
working with neural nets is sometimes
2:13:41
working with neural nets is sometimes tricky because
2:13:43
tricky because
2:13:43
tricky because uh
2:13:44
uh
2:13:44
uh you may have lots of bugs in the code
2:13:47
you may have lots of bugs in the code
2:13:47
you may have lots of bugs in the code and uh your network might actually work
2:13:49
and uh your network might actually work
2:13:49
and uh your network might actually work just like ours worked
2:13:51
just like ours worked
2:13:51
just like ours worked but chances are is that if we had a more
2:13:53
but chances are is that if we had a more
2:13:53
but chances are is that if we had a more complex problem then actually this bug
2:13:55
complex problem then actually this bug
2:13:55
complex problem then actually this bug would have made us not optimize the loss
2:13:57
would have made us not optimize the loss
2:13:57
would have made us not optimize the loss very well and we were only able to get
2:13:59
very well and we were only able to get
2:13:59
very well and we were only able to get away with it because
2:14:01
away with it because
2:14:01
away with it because the problem is very simple
2:14:03
the problem is very simple
2:14:03
the problem is very simple so let's now bring everything together
2:14:04
So let's now bring everything together and summarize what we learned.
2:14:06
What are neural nets? Neural nets are these mathematical expressions, fairly simple mathematical expressions in the case of a multi-layer perceptron, that take as input the data and the weights, the parameters of the neural net.
2:14:20
There's a mathematical expression for the forward pass, followed by a loss function, and the loss function tries to measure the accuracy of the predictions.
2:14:29
Usually the loss will be low when your predictions are matching your targets, or where the network is basically behaving well. So we manipulate the loss function so that when the loss is low, the network is doing what you want it to do on your problem.
2:14:44
And then we backward the loss: we use backpropagation to get the gradient, and then we know how to tune all the parameters to decrease the loss locally.
2:14:52
But then we have to iterate that process many times, in what's called gradient descent. So we simply follow the gradient information, and that minimizes the loss, and the loss is arranged so that when the loss is minimized, the network is doing what you want it to do.
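The loop summarized above (forward pass, loss, gradient, update, repeat) can be sketched on a deliberately tiny problem. This is a minimal illustration, not micrograd itself: the one-parameter model, the toy data, the learning rate, and the name `loss_and_grad` are all assumptions made for the example, and the gradient is written out by hand instead of coming from backpropagation.

```python
# Toy data: targets follow y = 2 * x (an assumption made for this sketch).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def loss_and_grad(w):
    """Mean squared error of the model y_hat = w * x, and d(loss)/dw."""
    n = len(xs)
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return loss, grad

w = 0.0               # the single parameter, arbitrary starting point
learning_rate = 0.05  # hypothetical step size
history = []
for step in range(50):
    loss, grad = loss_and_grad(w)   # forward pass and gradient
    history.append(loss)
    w -= learning_rate * grad       # small step against the gradient

final_loss, _ = loss_and_grad(w)
```

Each step only decreases the loss locally, which is why the update is repeated many times; here `w` ends up very close to 2.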
2:15:06
And yeah, so we just have a blob of neural stuff and we can make it do arbitrary things, and that's what gives neural nets their power.
2:15:15
This is a very tiny network with 41 parameters, but you can build significantly more complicated neural nets with billions, at this point almost trillions, of parameters. It's a massive blob of neural tissue, simulated neural tissue, roughly speaking.
2:15:32
You can make it solve extremely complex problems, and these neurons then have all kinds of very fascinating emergent properties when you try to make them solve significantly hard problems.
2:15:45
In the case of GPT, for example, we have massive amounts of text from the internet, and we're trying to get a neural net to take a few words and predict the next word in a sequence. That's the learning problem.
2:15:57
It turns out that when you train this on all of the internet, the neural net actually has really remarkable emergent properties. But that neural net would have hundreds of billions of parameters.
2:16:07
But it works on fundamentally the exact same principles. The neural net of course will be a bit more complex, but otherwise evaluating the gradient is there and would be identical, and the gradient descent would be there and would be basically identical.
2:16:21
But people usually use slightly different updates; this is a very simple stochastic gradient descent update. And the loss function would not be mean squared error: they would be using something called the cross-entropy loss for predicting the next token.
2:16:36
So there's a few more details, but fundamentally the neural network setup and neural network training is identical and pervasive, and now you understand intuitively how that works under the hood.
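To make the cross-entropy loss mentioned above a little more concrete, here is a minimal sketch for a single next-token prediction. The four-word vocabulary, the scores, and the function names are invented for illustration; a real language model would produce the scores itself and average this loss over many positions.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct next token."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical 4-token vocabulary; the scores below are made up.
vocab = ["the", "cat", "sat", "mat"]
logits = [0.5, 2.0, 0.1, -1.0]

loss_if_cat = cross_entropy(logits, vocab.index("cat"))  # token the model favors
loss_if_mat = cross_entropy(logits, vocab.index("mat"))  # token the model disfavors
```

The loss is low when the model puts high probability on the token that actually comes next, and high otherwise, which is the same "low loss means good behavior" arrangement as with mean squared error.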
2:16:46
In the beginning of this video I told you that by the end of it you would understand everything in micrograd, and that we'd slowly build it up. Let me briefly prove that to you.
2:16:54
So I'm going to step through all the code that is in micrograd as of today. Potentially some of the code will change by the time you watch this video, because I intend to continue developing micrograd, but let's look at what we have so far at least.
2:17:05
__init__.py is empty. When you go to engine.py, that has the Value. Everything here you should mostly recognize: we have the .data and .grad attributes, we have the backward function, we have the previous set of children and the operation that produced this value.
2:17:20
We have addition, multiplication, and raising to a scalar power. We have the relu non-linearity, which is a slightly different type of non-linearity than the tanh that we used in this video. Both of them are non-linearities, and notably tanh is not actually present in micrograd as of right now, but I intend to add it later.
2:17:38
Then there's the backward, which is identical, and then all of these other operations, which are built up on top of the operations here. So Value should be very recognizable, except for the non-linearity used in this video.
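As a reminder of what that walkthrough covers, here is a condensed sketch of a Value-like class with the pieces just listed: data and grad, the set of children, the producing operation, addition, multiplication, a scalar power, relu, and backward. It is written from the description above and is not a copy of engine.py; details such as attribute names and the exact relu rule are assumptions.

```python
class Value:
    """Scalar that remembers how it was produced so gradients can flow back."""

    def __init__(self, data, _children=(), _op=""):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # filled in by the op that creates this node
        self._prev = set(_children)     # the values this one was computed from
        self._op = _op                  # the operation that produced this value

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), "+")
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), "*")
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, k):
        assert isinstance(k, (int, float)), "only scalar powers in this sketch"
        out = Value(self.data ** k, (self,), f"**{k}")
        def _backward():
            self.grad += k * self.data ** (k - 1) * out.grad
        out._backward = _backward
        return out

    def relu(self):
        out = Value(self.data if self.data > 0 else 0.0, (self,), "relu")
        def _backward():
            self.grad += (1.0 if out.data > 0 else 0.0) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Visit nodes in reverse topological order so each grad is complete before use.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()


a, b = Value(2.0), Value(-3.0)
e = (a * b + b ** 2).relu()   # relu(2 * (-3) + (-3) ** 2) = relu(3)
e.backward()
```

Note that `b` is used twice in the expression, so its gradient accumulates contributions from both the product and the power, which is why every `_backward` uses `+=`.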
2:17:50
There's no massive difference between relu and tanh and sigmoid and these other non-linearities; they're all roughly equivalent and can be used in MLPs. I used tanh because it's a bit smoother, and because it's a little bit more complicated than relu and therefore it stressed a little bit more the local gradients and working with those derivatives, which I thought would be useful.
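The difference in "smoothness" is easy to see by putting the two local derivatives side by side. This is a small illustrative sketch with hypothetical helper names; the sample inputs are arbitrary.

```python
import math

def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    # Local derivative is a step: 0 for negative inputs, 1 for positive ones.
    return 1.0 if x > 0 else 0.0

def tanh_grad(x):
    # Local derivative varies smoothly and is expressed through the output t.
    t = math.tanh(x)
    return 1.0 - t * t

for x in (-2.0, -0.5, 0.5, 2.0):
    print(f"x={x:+.1f}  relu'={relu_grad(x):.1f}  tanh'={tanh_grad(x):.4f}")
```

The relu derivative only ever takes two values, while the tanh derivative changes continuously with the input, so deriving and implementing it exercises the chain rule a bit more.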
2:18:10
And then nn.py is the neural networks library, as I mentioned. You should recognize the identical implementation of Neuron, Layer, and MLP.
2:18:18
Notably, or not so much, we have a class Module here, which is a parent class of all these modules. I did that because there's an nn.Module class in PyTorch, and so this exactly matches that API. nn.Module in PyTorch also has a zero_grad, which I've refactored out here.
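The idea of a shared parent class that owns zero_grad can be sketched as follows. This is an illustration of the pattern, not the contents of nn.py: `Param` is a hypothetical stand-in for Value, the forward pass is omitted, the initial weights are arbitrary, and the `MLP(3, [4, 4, 1])` shape is an assumption chosen because it yields the 41 parameters mentioned earlier.

```python
class Param:
    """Hypothetical stand-in for a Value: just data and grad."""
    def __init__(self, data):
        self.data = data
        self.grad = 0.0

class Module:
    """Parent class: every module can list its parameters and reset their grads."""
    def zero_grad(self):
        for p in self.parameters():
            p.grad = 0.0

    def parameters(self):
        return []

class Neuron(Module):
    def __init__(self, nin):
        self.w = [Param(0.1 * (i + 1)) for i in range(nin)]  # arbitrary init
        self.b = Param(0.0)

    def parameters(self):
        return self.w + [self.b]

class Layer(Module):
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]

class MLP(Module):
    def __init__(self, nin, nouts):
        sizes = [nin] + nouts
        self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(nouts))]

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

model = MLP(3, [4, 4, 1])
for p in model.parameters():
    p.grad = 1.0          # pretend a backward pass filled in gradients
model.zero_grad()         # one call resets every parameter in the network
num_params = len(model.parameters())
```

Because zero_grad lives on the parent class, Neuron, Layer, and MLP only need to say which parameters they own.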
2:18:36
So that's the end of micrograd, really. Then there's a test, which you'll see basically creates two chunks of code, one in micrograd and one in PyTorch, and we make sure that the forward and the backward pass agree identically.
2:18:50
For a slightly less complicated expression and a slightly more complicated expression, everything agrees, so we agree with PyTorch on all of these operations.
2:18:58
And finally there's a demo.ipynb here, and it's a bit more complicated binary classification demo than the one I covered in this lecture. We only had a tiny dataset of four examples; here we have a bit more complicated example with lots of blue points and lots of red points, and we're trying to again build a binary classifier to distinguish two-dimensional points as red or blue.
2:19:20
It's a bit more complicated MLP here, it's a bigger MLP. The loss is a bit more complicated because it supports batches.
2:19:29
Because our dataset was so tiny, we always did a forward pass on the entire dataset of four examples. But when your dataset is like a million examples, what we usually do in practice is pick out some random subset, which we call a batch, and then we only process the batch: forward, backward, and update. So we don't have to forward the entire training set.
2:19:49
So this supports batching because there's a lot more examples here.
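Picking out a random batch can be sketched in a few lines. The dataset below is synthetic and the helper name `sample_batch`, the batch size, and the seed are assumptions for illustration; the demo notebook may select its batches differently.

```python
import random

def sample_batch(xs, ys, batch_size, rng):
    """Return a random subset of (inputs, labels), keeping pairs aligned."""
    idx = rng.sample(range(len(xs)), batch_size)   # distinct random indices
    return [xs[i] for i in idx], [ys[i] for i in idx]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic dataset: 1000 two-dimensional points with labels -1 or +1.
xs = [[float(i), float(i % 7)] for i in range(1000)]
ys = [1 if i % 2 == 0 else -1 for i in range(1000)]

xb, yb = sample_batch(xs, ys, 32, rng)
# A training step would now run forward, backward, and update on xb, yb only.
```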
2:19:53
We do a forward pass. The loss is slightly different: this is a max-margin loss that I implement here. The one that we used was the mean squared error loss, because it's the simplest one. There's also the binary cross-entropy loss. All of them can be used for binary classification and don't make too much of a difference in the simple examples that we looked at so far.
2:20:13
There's something called L2 regularization used here. This has to do with generalization of the neural net and controls overfitting in a machine learning setting, but I did not cover these concepts in this video; potentially later.
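For readers who have not seen these two terms, here is a minimal sketch of a max-margin (hinge) loss with an L2 penalty added on top. It uses plain floats instead of Value objects, labels of -1 and +1, and a hypothetical regularization strength `alpha`; all the numbers are made up.

```python
def max_margin_loss(scores, labels):
    """Mean hinge loss max(0, 1 - y * score), with labels of -1 or +1."""
    losses = [max(0.0, 1.0 - y * s) for s, y in zip(scores, labels)]
    return sum(losses) / len(losses)

def l2_penalty(params, alpha=1e-4):
    """Sum of squared parameters, scaled by a small (hypothetical) strength."""
    return alpha * sum(p * p for p in params)

scores = [2.0, -0.5, 0.3]    # raw model outputs for three examples
labels = [1, -1, -1]         # their targets
params = [0.5, -1.0, 2.0]    # stand-in for the network's weights

data_loss = max_margin_loss(scores, labels)
reg_loss = l2_penalty(params)
total_loss = data_loss + reg_loss
```

The first example is classified with enough margin and contributes nothing, the second is correct but inside the margin, and the third is on the wrong side, so it contributes the most. The penalty term nudges all weights toward zero regardless of the data.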
2:20:26
And the training loop you should recognize: forward, backward with zero_grad, and update, and so on. You'll notice that in the update here the learning rate is scaled as a function of the number of iterations, and it shrinks.
2:20:41
This is something called learning rate decay. In the beginning you have a high learning rate, and as the network sort of stabilizes near the end you bring down the learning rate to get some of the fine details in the end.
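One simple way to realize learning rate decay is a linear schedule. The sketch below is an assumption about the shape of the schedule, with made-up start and end values; the transcript only says that the rate shrinks with the iteration count.

```python
def learning_rate(step, total_steps, start=1.0, end=0.1):
    """Linearly shrink the learning rate from `start` toward `end`."""
    frac = step / total_steps
    return start - (start - end) * frac

total_steps = 100
schedule = [learning_rate(k, total_steps) for k in range(total_steps)]

# Inside a training loop the update would then look like:
#     p.data -= learning_rate(k, total_steps) * p.grad
```

Early steps are large so the parameters move quickly; later steps are small so the parameters can settle instead of bouncing around the minimum.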
2:20:53
And in the end we see the decision surface of the neural net, and we see that it learns to separate out the red and the blue area based on the data points. So that's the slightly more complicated example in the demo.ipynb, which you're free to go over.
2:21:07
But yeah, as of today, that is micrograd. I also wanted to show you a little bit of real stuff, so that you get to see how this is actually implemented in a production-grade library like PyTorch.
2:21:16
In particular, I wanted to find and show you the backward pass for tanh in PyTorch. Here in micrograd we see that the backward pass for tanh is one minus t squared, where t is the output of the tanh of x, times out.grad, which is the chain rule. So we're looking for something that looks like this.
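The formula just described, (1 - t²) times the upstream gradient, can be checked numerically. The sketch below uses hypothetical function names and arbitrary input values, and compares the analytic result against a central finite difference.

```python
import math

def tanh_forward(x):
    return math.tanh(x)

def tanh_backward(t, out_grad):
    """Chain rule: local derivative (1 - t^2) times the upstream gradient."""
    return (1.0 - t * t) * out_grad

x = 0.7          # arbitrary input
out_grad = 2.5   # arbitrary gradient flowing in from later in the graph

t = tanh_forward(x)
analytic = tanh_backward(t, out_grad)

# Central finite difference of L(x) = out_grad * tanh(x) as an independent check.
h = 1e-6
numeric = out_grad * (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)
```

Note that the backward pass only needs the output `t`, not the input `x`, which is why an implementation can reuse the value computed in the forward pass.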
2:21:38
Now, I went to PyTorch, which has an open-source GitHub codebase, and I looked through a lot of its code. Honestly, I spent about 15 minutes and I couldn't find tanh.
2:21:51
That's because these libraries unfortunately grow in size and entropy. If you just search for tanh you get apparently 2,800 results in 406 files. I don't know what these files are doing, honestly, and why there are so many mentions of tanh, but unfortunately these libraries are quite complex. They're meant to be used, not really inspected.
2:22:15
Eventually I did stumble on someone who tries to change the tanh backward code for some reason, and someone here pointed to the CPU kernel and the CUDA kernel for tanh backward.
2:22:27
So this basically depends on whether you're using PyTorch on a CPU device or on a GPU, which are different devices, and I haven't covered this. But this is the tanh backward kernel for CPU.
2:22:40
The reason it's so large is that, number one, this branch is for if you're using a complex type, which we haven't even talked about; this one is for if you're using a specific data type of bfloat16, which we haven't talked about; and then if you're not, then this is the kernel.
2:22:54
And deep here we see something that resembles our backward pass. They have a times one minus b squared, so this b here must be the output of the tanh, and this is the out.grad. So here we found it, deep inside PyTorch, at this location, for some reason inside the BinaryOps kernel when tanh is not actually a binary op.
2:23:21
And then this is the GPU kernel. We're not complex, so we're here, and here we go with one line of code.
2:23:30
So we did find it, but unfortunately these pieces of code are very large, and micrograd is very, very simple. If you actually want to use real stuff, you'll find that finding the code for it is difficult.
2:23:43
i also wanted to show you a little example here where pytorch is showing you how you can register a new type of function that you want to add to pytorch as a lego building block
2:23:53
so here if you want to, for example, add a legendre polynomial 3, here's how you could do it: you will register it as a class that subclasses torch.autograd.Function
2:24:06
and then you have to tell pytorch how to forward your new function and how to backward through it
2:24:12
so as long as you can do the forward pass of this little function piece that you want to add, and as long as you know the local derivative, the local gradients, which are implemented in the backward, pytorch will be able to backpropagate through your function, and then you can use this as a lego block in a larger lego castle of all the different lego blocks that pytorch already has
2:24:31
and so that's the only thing you have to tell pytorch and everything would just work, and you can register new types of functions in this way, following this example
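A minimal sketch of what registering such a function can look like, assuming pytorch is installed. The class follows the pattern described above: a subclass of `torch.autograd.Function` with a static `forward` and `backward`. The polynomial is the degree-3 Legendre polynomial, P3(x) = 0.5 * (5x^3 - 3x); the input values are made up for illustration.

```python
import torch

class LegendrePolynomial3(torch.autograd.Function):
    """P3(x) = 0.5 * (5x^3 - 3x) with a hand-written backward pass."""

    @staticmethod
    def forward(ctx, x):
        # save the input; backward needs it for the local derivative
        ctx.save_for_backward(x)
        return 0.5 * (5 * x ** 3 - 3 * x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # chain rule: incoming gradient times local derivative 1.5 * (5x^2 - 1)
        return grad_output * 1.5 * (5 * x ** 2 - 1)

# illustrative inputs (assumed values, float64 for a tight comparison)
x = torch.tensor([0.5, -1.0, 2.0], dtype=torch.float64, requires_grad=True)

# custom functions are called through .apply, not by instantiating the class
y = LegendrePolynomial3.apply(x)
y.sum().backward()

# compare the gradient pytorch propagated with the derivative written by hand
expected = 1.5 * (5 * x.detach() ** 2 - 1)
print(torch.allclose(x.grad, expected))
```

Once defined, `LegendrePolynomial3.apply` can be mixed freely with built-in operations in a larger expression, since autograd only needs the forward value and the local gradient from each block.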
2:24:38
and that is everything that i wanted to cover in this lecture, so i hope you enjoyed building out micrograd with me, i hope you find it interesting, insightful
2:24:46
and yeah, i will post a lot of the links that are related to this video in the video description below, i will also probably post a link to a discussion forum or discussion group where you can ask questions related to this video, and then i can answer or someone else can answer your questions, and i may also do a follow-up video that answers some of the most common questions
2:25:08
but for now that's it, i hope you enjoyed it, if you did then please like and subscribe so that youtube knows to feature this video to more people
2:25:15
and that's it for now, i'll see you later
2:25:22
now here's the problem, we know dl by, wait, what is the problem
2:25:31
and that's everything i wanted to cover in this lecture, so i hope you enjoyed us building up microcraft, micro crab
2:25:42
okay now let's do the exact same thing for multiply because we can't do something like a times two, oops
2:25:50
i know what happened there