Neural Networks: Zero to Hero
The spelled-out intro to language modeling: building makemore
Transcript
0:00
Hi everyone, hope you're well. Next up, I'd like to build out makemore. Like micrograd before it, makemore is a repository that I have on my GitHub page, and you can look at it. But just like with micrograd, I'm going to build it out step by step and spell everything out, so we're going to build it out slowly and together.
0:20
Now, what is makemore? As the name suggests, makemore makes more of things that you give it. Here's an example: names.txt is an example dataset for makemore. When you look at names.txt, you'll find that it's a very large dataset of names, with lots of different types of names. In fact, I believe there are 32,000 names that I found, more or less at random, on a government website.
0:47
If you train makemore on this dataset, it will learn to make more of things like this. In this case, that means more things that sound name-like but are actually unique names. Maybe if you have a baby and you're trying to pick a name, and you're looking for a cool, new-sounding, unique name, makemore might help you. Here are some example generations from the neural network once we train it on our dataset, some unique names that it will generate: dontel, irot, zhendi, and so on. All of these sound name-like, but of course they're not real names.
1:30
Under the hood, makemore is a character-level language model. That means it treats every single line here as an example, and within each example it treats the text as a sequence of individual characters. So r, e, e, s, e is one example, and that's its sequence of characters. That's the level on which we're building out makemore. What it means to be a character-level language model is that it models those sequences of characters and knows how to predict the next character in the sequence.
2:03
Now, we're actually going to implement a large number of character-level language models, in terms of the neural networks involved in predicting the next character in a sequence: very simple bigram and bag-of-words models, multilayer perceptrons, recurrent neural networks, all the way to modern transformers. In fact, the transformer we will build will be basically equivalent to GPT-2, if you have heard of GPT. So that's kind of a big deal. It's a modern network, and by the end of the series you will actually understand how it works at the level of characters.
2:34
To give you a sense of the extensions here: after characters, we will probably spend some time on the word level, so that we can generate documents of words, not just little segments of characters but entire, much larger documents. Then we're probably going to go into images and image-text networks such as DALL-E, Stable Diffusion, and so on. But for now we have to start here: character-level language modeling. Let's go.
3:03
Like before, we're starting with a completely blank Jupyter notebook page. The first thing I'd like to do is load up the dataset, names.txt. We're going to open names.txt for reading and read everything into one massive string. Because it's one massive string and we want the individual words in a list, let's call splitlines on that string to get all of our words as a Python list of strings. We can look at, for example, the first 10 words, and we get a list of emma, olivia, ava, and so on. If we look at the top of the page here, that is indeed what we see, so that's good. This list actually makes me think that it's probably sorted by frequency.
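A minimal sketch of the loading step just described. The real names.txt is not bundled here, so the snippet first writes a tiny stand-in file; the five sample names and the temporary path are assumptions for illustration only.

```python
import os
import tempfile

# Stand-in for names.txt (assumption: one lowercase name per line).
sample = "emma\nolivia\nava\nisabella\nsophia\n"
path = os.path.join(tempfile.mkdtemp(), "names.txt")
with open(path, "w") as f:
    f.write(sample)

# Read the whole file into one big string, then split it into lines.
with open(path, "r") as f:
    words = f.read().splitlines()

# Peek at the first few entries, as is done with the first 10 in the video.
print(words[:3])
```

With the real file in the working directory, only the last two steps are needed, pointed at names.txt.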
3:55
But okay, these are the words. Now we'd like to learn a little more about this dataset. Let's look at the total number of words; we expect this to be roughly 32,000. Then what is, for example, the shortest word? Taking min of len(w) for w in words, the shortest word has length 2. And taking max of len(w) for w in words, the longest word has 15 characters.
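The three statistics above can be sketched as follows. The short words list is a stand-in for the full dataset (an assumption), so the numbers it produces are not the 32,000, 2, and 15 quoted for names.txt.

```python
# Stand-in for the full list loaded from names.txt (assumption).
words = ["emma", "olivia", "ava", "isabella", "sophia"]

num_words = len(words)                 # total number of words
shortest = min(len(w) for w in words)  # length of the shortest word
longest = max(len(w) for w in words)   # length of the longest word

print(num_words, shortest, longest)
```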
4:24
Let's now think through our very first language model. As I mentioned, a character-level language model predicts the next character in a sequence, given some concrete sequence of characters before it.
4:36
What we have to realize here is that every single word, like isabella, is actually quite a few examples packed into that single word. What is the existence of a word like isabella in the dataset really telling us? It's saying that the character i is a very likely character to come first in the sequence of a name. The character s is likely to come after i. The character a is likely to come after is. The character b is very likely to come after isa, and so on, all the way to a following isabell. And then there's one more example packed in here: after isabella, the word is very likely to end. That's one more explicit piece of information that we have here, and we have to be careful with it. So there's a lot packed into a single word in terms of the statistical structure of what's likely to follow in these character sequences. And of course we don't have just one word; we have 32,000 of these, so there's a lot of structure here to model.
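To make the "many examples packed into one word" idea concrete, here is a small sketch that lists every (context, next character) pair hidden in isabella, including the final "the word ends here" example. The END marker string is a hypothetical placeholder, not something named in the transcript.

```python
word = "isabella"
END = "<end>"  # hypothetical marker meaning "the word stops here"

examples = []
for i in range(len(word) + 1):
    context = word[:i]                           # everything seen so far
    target = word[i] if i < len(word) else END   # what comes next
    examples.append((context, target))

for context, target in examples:
    print(repr(context), "->", target)
```

An eight-letter word yields nine examples: one per character, plus one for the ending.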
5:44
In the beginning, I'd like to start by building a bigram language model. In a bigram language model we're always working with just two characters at a time. We only look at the one character we are given, and we try to predict the next character in the sequence: what characters are likely to follow r, what characters are likely to follow a, and so on. We're just modeling that little bit of local structure, and we're ignoring the fact that we may have a lot more information; we always look only at the previous character to predict the next one. So it's a very simple and weak language model, but I think it's a great place to start.
6:24
Let's begin by looking at these bigrams in our dataset and what they look like. These bigrams, again, are just two characters in a row. So, for w in words, each w here is an individual word, a string. We want to iterate over this word with consecutive characters, two characters at a time, sliding through the word. A cute way to do this in Python, by the way, is something like this: for ch1, ch2 in zip(w, w[1:]), then print(ch1, ch2). Let's not do all the words; let's just do the first three. I'm going to show you in a second how this works.
7:09
For now, as an example, let's do just the very first word alone, emma. You see how we have emma, and this will just print e m, m m, m a. The reason this works is that w is the string emma and w[1:] is the string mma, and zip takes two iterables, pairs them up, and creates an iterator over the tuples of their consecutive entries. If either one is shorter than the other, zip just halts and returns. That's why we get (e, m), (m, m), (m, a), but then, because the second one runs out of elements, zip just ends, and that's why we only get these tuples. So, pretty cute.
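The sliding-pair trick can be sketched like this for the single word emma. The variable names ch1 and ch2 stand for "character one" and "character two" and are just a naming choice.

```python
w = "emma"

# zip pairs "emma" with "mma" and stops when the shorter one runs out.
pairs = list(zip(w, w[1:]))

for ch1, ch2 in pairs:
    print(ch1, ch2)
# e m
# m m
# m a
```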
7:59
So these are the consecutive elements in the first word. Now, we have to be careful, because we actually have more information here than just these three examples. As I mentioned, we know that e is very likely to come first, and we know that a, in this case, is coming last.
8:15
One way to handle this is to create a special list here of all the characters, and hallucinate a special start token. I'm going to call it something like a special start. So this is a list of one element, plus list(w), plus a special end character. The reason I'm wrapping w in list() is that w is a string, emma, and list(w) will just have the individual characters in a list. Then doing this again, iterating not over w but over the characters, gives us something like this: this is a bigram of the start character and e, and this is a bigram of a and the special end character. Now we can look at what this looks like for olivia or ava, for example. We could potentially do this for the entire dataset, but we won't print that; it's going to be too much. But these are the individual character bigrams, and we can print them.
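A sketch of the start/end padding described above. The transcript only speaks of a "special start" and a "special end" character, so the exact strings used here, <S> and <E>, are assumptions; any two strings that cannot appear inside a name would do.

```python
words = ["emma", "olivia", "ava"]  # stand-in for the loaded dataset (assumption)
START, END = "<S>", "<E>"          # assumed spellings of the special tokens

bigrams = []
for w in words[:1]:                      # just the first word for now
    chs = [START] + list(w) + [END]      # list(w) splits the string into characters
    for ch1, ch2 in zip(chs, chs[1:]):
        bigrams.append((ch1, ch2))
        print(ch1, ch2)
# <S> e
# e m
# m m
# m a
# a <E>
```

Changing words[:1] to words[:3] shows the same pattern for olivia and ava.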
9:24
Now, in order to learn the statistics about which characters are likely to follow other characters, the simplest way in bigram language models is to do it by counting. We're basically just going to count how often each of these combinations occurs in the training set, in these words.
9:41
in these words so we're going to need some kind of a
9:43
so we're going to need some kind of a
9:43
so we're going to need some kind of a dictionary that's going to maintain some
9:44
dictionary that's going to maintain some
9:44
dictionary that's going to maintain some counts for every one of these diagrams
9:47
counts for every one of these diagrams
9:47
counts for every one of these diagrams so let's use a dictionary b
9:49
so let's use a dictionary b
9:49
so let's use a dictionary b and this will map these bi-grams so
9:52
and this will map these bi-grams so
9:52
And this will map these bigrams. So a bigram is a tuple of character one, character two,
9:56
and then b at bigram will be b.get of bigram, which is basically the same as b at bigram, but in the case that bigram is not in the dictionary b, we would like to return zero by default; plus one.
10:13
So this will basically add up all the bigrams and count how often they occur. Let's get rid of the printing, or rather, let's keep the printing and just inspect what b is in this case.
10:27
We see that many bigrams occur just a single time. This one allegedly occurred three times: a was an ending character three times, and that's true for all of these words. All of emma, olivia and ava end with a, so that's why this occurred three times.
10:46
Now let's do it for all the words.
10:51
Oops, I should not have printed. I'm going to erase that. Let's kill this and just run.
11:00
And now b will have the statistics of the entire dataset. So these are the counts, across all the words, of the individual bigrams.
11:08
We could, for example, look at some of the most common ones and the least common ones. This gets kind of gross in Python, but the simplest way I like is to just use b.items(). b.items() returns the tuples of (key, value); in this case the keys are the character bigrams and the values are the counts.
11:30
So then what we want to do is sorted of this. But by default the sort is on the first item of a tuple, and we want to sort by the values, which are the second element of each (key, value) tuple.
11:50
So we want to use key equals lambda that takes the key-value tuple and returns the element at index one, not at zero but at one, which is the count. So we want to sort by the count of these elements.
12:10
And actually we want it to go backwards. So here what we have is: the bigram q r occurs only a single time, d z occurred only a single time.
12:20
And when we sort this the other way around, we're going to see the most likely bigrams. We see that n was very often an ending character, many many times, and apparently n almost always follows an a, and that's a very likely combination as well.
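As a small illustration of sorting a dictionary's items by value, here is a sketch with made-up counts (the numbers below are hypothetical and are not the statistics of names.txt):

```python
# Hypothetical toy counts, only to illustrate the sorting step.
b = {("a", "n"): 5, ("q", "r"): 1, ("n", "<E>"): 7, ("d", "z"): 1}

# kv is a (key, value) tuple; kv[1] is the count.
ascending = sorted(b.items(), key=lambda kv: kv[1])

# Negating the count sorts from most common to least common.
descending = sorted(b.items(), key=lambda kv: -kv[1])

print(descending[0])
```

Passing `reverse=True` to `sorted` would be an equivalent way to flip the order.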
12:39
So this is kind of the individual counts that we get over the entire dataset.
12:44
Now it's actually going to be significantly more convenient for us to keep this information in a two-dimensional array instead of a Python dictionary. So we're going to store this information in a 2D array: the rows are going to be the first character of the bigram and the columns are going to be the second character, and each entry in this two-dimensional array will tell us how often that second character follows the first character in the dataset.
13:12
In particular, the array representation, or the library, that we're going to use is that of PyTorch. PyTorch is a deep learning neural network framework, but part of it is also this torch.tensor, which allows us to create multi-dimensional arrays and manipulate them very efficiently.
13:30
So let's import PyTorch, which you can do by import torch, and then we can create arrays. Let's create an array of zeros and give it a size; let's create a three by five array as an example. This is a three by five array of zeros,
13:51
and by default you'll notice a.dtype, which is short for data type, is float32, so these are single-precision floating point numbers. Because we are going to represent counts, let's actually use dtype as torch.int32, so these are 32-bit integers. Now you see that we have integer data inside this tensor.
14:14
Now, tensors allow us to manipulate all the individual entries and do it very efficiently. For example, if we want to change this entry, we have to index into the tensor. In particular, because it's zero-indexed, this is row index one and column index zero, one, two, three. So a at one comma three, we can set that to one, and then a will have a 1 over there.
14:47
We can of course also do things like this, so now a will have a 2 over there, or 3. And also we can, for example, say a at 0, 0 is 5, and then a will have a 5 over here.
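The tensor creation and indexing steps above might look roughly like this. It is a sketch using only `torch.zeros`, the `dtype` argument and plain integer indexing; the variable name `a_float` is introduced here purely for illustration.

```python
import torch

# Default dtype of torch.zeros is float32 (unless the global default was changed).
a_float = torch.zeros((3, 5))
print(a_float.dtype)

# For counts, ask for 32-bit integers instead.
a = torch.zeros((3, 5), dtype=torch.int32)

a[1, 3] = 1    # row index 1, column index 3 (zero-indexed)
a[1, 3] += 1   # in-place increment, the entry is now 2
a[0, 0] = 5
print(a)
```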
15:02
So that's how we can index into the arrays. Now of course the array that we are interested in is much, much bigger. For our purposes we have 26 letters of the alphabet and then two special characters, S and E, so we want a 26 plus 2, or 28 by 28, array. Let's call it capital N, because it's going to represent the counts.
15:24
Let me erase this stuff. So that's the array that starts at zeros, 28 by 28. And now let's copy-paste this here, but instead of having a dictionary b, which we're going to erase, we now have an N.
15:40
Now the problem here is that we have these characters, which are strings, but we now have to index into an array, and we have to index using integers. So we need some kind of lookup table from characters to integers. Let's construct such a character array.
15:58
The way we're going to do this is: we take all the words, which is a list of strings, and concatenate all of it into a massive string, so this is simply the entire dataset as a single string. We pass this to the set constructor, which takes this massive string and throws out duplicates, because sets do not allow duplicates. So set of this will just be the set of all the lowercase characters, and there should be a total of 26 of them.
16:28
Now we actually don't want a set, we want a list. But we don't want a list sorted in some weird arbitrary way; we want it sorted from a to z. So, sorted list. Those are our characters.
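A minimal sketch of building the sorted character list, using a handful of example names rather than the full dataset (so it yields fewer than the 26 characters mentioned above):

```python
# A few example names stand in for the full list of words.
words = ["emma", "olivia", "ava", "isabella", "sophia"]

# Join everything into one string, deduplicate with set, then sort a to z.
chars = sorted(list(set("".join(words))))

print(chars)
print(len(chars))
```

Sets have no guaranteed order, which is why the explicit `sorted` call matters before assigning integer indices.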
16:45
Now what we want is this lookup table, as I mentioned. So let's create a special s2i, I will call it; s is string, or character, and this will be an s-to-i mapping, for i, s in enumerate of these characters.
17:04
Enumerate basically gives us an iterator over the integer index and the actual element of the list, and then we are mapping the character to the integer.
17:15
So s2i is a mapping from a to 0, b to 1, etc., all the way to z at 25. That's going to be useful here, but we actually also have to specifically set that S will be 26, and s2i at E will be 27, right, because z was 25.
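The lookup table described above can be sketched as a dictionary comprehension. The name `s2i` follows the transcript's spelling, and the token strings `<S>` and `<E>` are assumed spellings for the two special characters.

```python
import string

# Stand-in for the sorted character list built from the dataset.
chars = list(string.ascii_lowercase)

# enumerate yields (index, element); map each character to its index.
s2i = {s: i for i, s in enumerate(chars)}

# The two special tokens go after "z", which is at index 25.
s2i["<S>"] = 26
s2i["<E>"] = 27

print(s2i["a"], s2i["z"], s2i["<S>"], s2i["<E>"])
```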
17:35
So those are the lookups, and now we can come here and map both character 1 and character 2 to their integers. So ix1 will be s2i at character 1, and ix2 will be s2i at character 2.
17:49
Now we should be able to do this line, but using our array: N at ix1, ix2. This is the two-dimensional array indexing I've shown you before, and honestly just plus equals one, because everything starts at zero.
18:06
So this should work and give us a large 28 by 28 array of all these counts.
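Putting the lookup table and the tensor together, the counting loop might look like the sketch below. It uses three example names instead of the whole dataset, and `<S>` / `<E>` are assumed token spellings.

```python
import string
import torch

chars = list(string.ascii_lowercase)
s2i = {s: i for i, s in enumerate(chars)}
s2i["<S>"] = 26
s2i["<E>"] = 27

words = ["emma", "olivia", "ava"]  # small stand-in for the full dataset

# 28 x 28 integer array of counts, starting at zero.
N = torch.zeros((28, 28), dtype=torch.int32)
for w in words:
    chs = ["<S>"] + list(w) + ["<E>"]
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = s2i[ch1]  # row: first character of the bigram
        ix2 = s2i[ch2]  # column: second character of the bigram
        N[ix1, ix2] += 1

print(N[s2i["a"], s2i["<E>"]])
```

Because every entry starts at zero, there is no need for a `get`-style default as there was with the dictionary.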
18:15
If we print N, this is the array, but of course it looks ugly. So let's erase this ugly mess and try to visualize it a bit more nicely.
18:24
For that we're going to use a library called matplotlib. Matplotlib allows us to create figures, so we can do things like plt.imshow of the count array. So this is the 28 by 28 array, and there is structure, but even this I would say is still pretty ugly.
18:43
So we're going to try to create a much nicer visualization of it, and I wrote a bunch of code for that. The first thing we're going to need is to invert this dictionary here. So s2i is a mapping from s to i, and in i2s we're going to reverse this dictionary: iterate over all the items and just reverse each pair.
19:06
So i2s maps inversely from 0 to a, 1 to b, etc. So we'll need that.
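Inverting the mapping is a one-line comprehension over the items. A tiny sketch with a three-letter table (the real table has one entry per character and special token):

```python
# Small stand-in for the character-to-integer table.
s2i = {"a": 0, "b": 1, "c": 2}

# Swap each (character, integer) pair to get integer -> character.
i2s = {i: s for s, i in s2i.items()}

print(i2s)
```

This only works cleanly because every character maps to a distinct integer, so no entries collide when the pairs are swapped.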
19:14
And then here's the code that I came up with to try to make this a little bit nicer. We create a figure, we plot N, and then we visualize a bunch of things on top of it. Let me just run it so you get a sense of what this is.
19:31
Okay, so you see here that we have the array spaced out, and every one of these is basically like: b is followed by g zero times, b is followed by h 41 times, a is followed by j 175 times.
19:47
What you can see I'm doing here is: first I show that entire array, and then I iterate over all the individual little cells. I create a character string, which is the inverse mapping i2s of the integer i and the integer j, so those are the bigrams in a character representation.
20:08
Then I plot just the bigram text, and then I plot the number of times that this bigram occurs.
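The plotting steps described above could be sketched as follows. This is an illustrative reconstruction, not the code from the video: the 3 by 3 counts are made up, a plain nested list stands in for the tensor, and the figure size, colormap and text colors are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Hypothetical 3 x 3 count table and its integer-to-character mapping.
i2s = {0: "a", 1: "b", 2: "c"}
N = [[0, 2, 1],
     [3, 0, 0],
     [1, 4, 0]]

fig = plt.figure(figsize=(4, 4))
plt.imshow(N, cmap="Blues")  # show the whole array as an image
for i in range(3):
    for j in range(3):
        chstr = i2s[i] + i2s[j]  # bigram as text, row char then column char
        # Bigram label in the upper half of the cell, count in the lower half.
        plt.text(j, i, chstr, ha="center", va="bottom", color="gray")
        plt.text(j, i, str(N[i][j]), ha="center", va="top", color="gray")
plt.axis("off")
```

Note that `plt.text` takes x (the column index `j`) first and y (the row index `i`) second. With a torch tensor in place of the nested list, each count would be read with `N[i, j].item()`.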
20:16
Now, the reason that there's a .item() here is because when you index into these arrays, which are torch tensors, you see that we still get a tensor back. So the type of this thing: you'd think it would be just an integer, 149, but it's actually a torch.Tensor.
20:31
And so if you do .item(), then it will pop out that individual integer, so it will just be 149.
20:40
So that's what's happening there, and these are just some options to make it look nice.
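A short sketch of the difference between indexing a tensor and calling `.item()` on the result. The value 149 is the one mentioned above; the position it is stored at here is arbitrary.

```python
import torch

N = torch.zeros((3, 3), dtype=torch.int32)
N[1, 2] = 149  # arbitrary position, just to have a non-zero entry

entry = N[1, 2]       # still a (zero-dimensional) tensor
value = entry.item()  # a plain Python int

print(type(entry), type(value))
```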
20:45
So what is the structure of this array? We have all these counts, and we see that some of them occur often and some of them do not occur often.
20:53
Now if you scrutinize this carefully, you will notice that we're not actually being very clever. When you come over here, you'll notice that, for example, we have an entire row of completely zeros. That's because the end character is never possibly going to be the first character of a bigram, because we're always placing these end tokens at the end of the bigram.
21:14
Similarly, we have an entire column of zeros here, because the S character will never possibly be the second element of a bigram: we always start with S and we end with E, and we only have the words in between.
21:27
So we have an entire column of zeros, an entire row of zeros, and in this little two by two matrix here as well, the only one that can possibly happen is E directly following S. That can be non-zero if we have a word that has no letters: in that case there are no letters in the word, it's an empty word, and we just have S followed by E. But the other ones are just not possible.
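The all-zero row and column can be checked directly. In this sketch the empty string in `words` is a hypothetical empty word added only to show the one special-token bigram that can occur; `<S>` and `<E>` are assumed token spellings.

```python
import string
import torch

chars = list(string.ascii_lowercase)
s2i = {s: i for i, s in enumerate(chars)}
s2i["<S>"], s2i["<E>"] = 26, 27

words = ["emma", "olivia", "ava", ""]  # "" is a hypothetical empty word

N = torch.zeros((28, 28), dtype=torch.int32)
for w in words:
    chs = ["<S>"] + list(w) + ["<E>"]
    for ch1, ch2 in zip(chs, chs[1:]):
        N[s2i[ch1], s2i[ch2]] += 1

# Row of <E>: bigrams that start with the end token (never happens).
end_row_total = N[s2i["<E>"], :].sum().item()
# Column of <S>: bigrams whose second element is the start token (never happens).
start_col_total = N[:, s2i["<S>"]].sum().item()
# <S> directly followed by <E>: only the empty word produces this.
start_then_end = N[s2i["<S>"], s2i["<E>"]].item()

print(end_row_total, start_col_total, start_then_end)
```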
21:50
but the other ones are just not possible
21:50
but the other ones are just not possible and so we're basically wasting space and
21:51
and so we're basically wasting space and
21:51
and so we're basically wasting space and not only that but the s and the e are
21:53
not only that but the s and the e are
21:53
not only that but the s and the e are getting very crowded here
21:55
getting very crowded here
21:55
getting very crowded here i was using these brackets because
21:57
i was using these brackets because
21:57
i was using these brackets because there's convention and natural language
21:58
there's convention and natural language
21:58
there's convention and natural language processing to use these kinds of
22:00
processing to use these kinds of
22:00
processing to use these kinds of brackets to denote special tokens
22:03
brackets to denote special tokens
22:03
brackets to denote special tokens but we're going to use something else
22:05
but we're going to use something else
22:05
but we're going to use something else so let's fix all this and make it
22:06
so let's fix all this and make it
22:06
so let's fix all this and make it prettier
22:08
prettier
22:08
prettier we're not actually going to have two
22:09
we're not actually going to have two
22:11
special tokens we're only going to have
22:12
one special token
22:13
so
22:15
we're going to have an n by n
22:18
array of 27 by 27 instead
22:20
instead of having two
22:22
we will just have one and i will call it
22:24
a dot
22:27
okay
22:30
let me swing this over here
22:31
now one more thing that i would like to
22:33
do is i would actually like to make this
22:36
special character have position zero
22:37
and i would like to offset all the other
22:39
letters off i find that a little bit
22:40
more
22:42
pleasing
22:44
so
22:46
we need a plus one here so that the
22:48
first character which is a will start at
22:49
one
22:51
so s2i
22:55
will now be a starts at one and dot is 0
22:56
and
22:58
i2s of course we're not changing this
23:00
because i2s just creates a reverse
23:02
mapping and this will work fine so 1 is
23:04
a 2 is b
23:06
0 is dot
23:09
so we've reversed that here
23:10
we have
23:12
a dot and a dot
23:14
this should work fine
23:17
make sure i start at zeros
23:18
count
23:20
and then here we don't go up to 28 we go
23:22
up to 27 and this should just work
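The remapping just described can be sketched as follows. This is a minimal sketch, not the code on screen: the names `stoi` and `itos` are an assumed spelling of the spoken "s2i" and "i2s", the word list is a made-up stand-in for names.txt, and the alphabet is assumed to be the 26 lowercase letters.

```python
import string
import torch

# Made-up stand-in for names.txt (assumption, not the real dataset).
words = ["emma", "olivia", "ava", "isabella", "sophia"]

# Assumed alphabet: 26 lowercase letters, offset by one so '.' can take index 0.
chars = list(string.ascii_lowercase)
stoi = {s: i + 1 for i, s in enumerate(chars)}  # 'a' -> 1, 'b' -> 2, ...
stoi["."] = 0                                   # the single start/end token
itos = {i: s for s, i in stoi.items()}          # reverse mapping

# 27 x 27 table of bigram counts: row = first character, column = second.
N = torch.zeros((27, 27), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

print(itos[0], itos[1], itos[2])  # . a b
print(N[0, 0].item())             # 0, since there are no empty words
```

With one shared token, row 0 holds the counts of first letters and column 0 holds the counts of last letters.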
23:30
okay
23:33
so we see that dot never happened it's
23:35
at zero because we don't have empty
23:36
words
23:38
then this row here now is just uh very
23:40
simply the um
23:44
counts for all the first letters so
23:47
uh j starts a word h starts a word i
23:50
starts a word etc and then these are all
23:51
the ending
23:52
characters
23:54
and in between we have the structure of
23:56
what characters follow each other
23:59
so this is the counts array of our
24:00
entire
24:03
data set so this array actually has all
24:04
the information necessary for us to
24:07
actually sample from this bigram
24:09
uh character level language model
24:11
and um roughly speaking what we're going
24:13
to do is we're just going to start
24:14
following these probabilities and these
24:16
counts and we're going to start sampling
24:18
from the model
24:20
so in the beginning of course
24:23
we start with the dot the start token
24:24
dot
24:27
so to sample the first character of a
24:30
name we're looking at this row here
24:32
so we see that we have the counts and
24:34
those counts fundamentally are telling us
24:37
how often any one of these characters is
24:39
to start a word
24:41
so if we take this n
24:44
and we grab the first row
24:47
we can do that by using just indexing at
24:48
zero
24:51
and then using this notation colon for
24:53
the rest of that row
24:56
so N[0, :]
24:58
is indexing into the zeroth
25:00
row and then it's grabbing all the
25:01
columns
25:03
and so this will give us a
25:05
one-dimensional array
25:08
of the first row so 0, 4410
25:10
you know 0, 4410, 1306,
25:13
1542 etc it's just the
25:15
first row the shape of this
25:19
is 27 it's just the row of 27
25:21
and the other way that you can do this
25:22
also is you just you don't need to
25:23
actually give this
25:26
you just grab the zeroth row like this
25:28
this is equivalent
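To make the two indexing forms concrete, here is a small sketch on a hypothetical 3 x 4 table of counts standing in for the 27 x 27 array; the numbers are invented for illustration.

```python
import torch

# Hypothetical 3 x 4 counts table standing in for the 27 x 27 array N.
N = torch.tensor([[0, 4, 1, 2],
                  [3, 0, 5, 1],
                  [2, 2, 0, 6]], dtype=torch.int32)

row_a = N[0, :]  # row 0, all columns
row_b = N[0]     # shorthand for the same row

print(row_a)                      # tensor([0, 4, 1, 2], dtype=torch.int32)
print(row_a.shape)                # torch.Size([4])
print(torch.equal(row_a, row_b))  # True
```

Both forms return a one-dimensional tensor whose length is the number of columns, which would be 27 for the real array.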
25:29
now these are the counts
25:31
and now what we'd like to do is we'd
25:34
like to basically um sample from this
25:36
since these are the raw counts we
25:37
actually have to convert this to
25:39
probabilities
25:42
so we create a probability vector
25:44
so we'll take n of zero
25:48
and we'll actually convert this to float
25:49
first
25:51
okay so these integers are converted to
25:52
float
25:54
floating point numbers and the reason
25:56
we're creating floats is because we're
25:58
about to normalize these counts
26:00
so to create a probability distribution
26:03
here we want to divide
26:05
we basically want to do p = p / p.sum()
26:09
and now we get a vector of smaller
26:13
numbers and these are now probabilities
26:15
so of course because we divided by the
26:18
sum the sum of p now is 1
26:20
so this is a nice proper probability
26:22
distribution it sums to 1 and this is
26:24
giving us the probability for any single
26:25
character to be the first
26:27
character of a word
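The conversion from counts to probabilities can be sketched like this. The counts below are hypothetical; in the walkthrough the input would be the first row of the counts array.

```python
import torch

# Hypothetical raw counts for one row (integers), chosen to sum to 10.
counts = torch.tensor([0, 4, 1, 2, 3], dtype=torch.int32)

p = counts.float()  # convert to floating point before normalizing
p = p / p.sum()     # divide by the total so the entries sum to 1

print(p)                # roughly 0.0, 0.4, 0.1, 0.2, 0.3
print(p.sum().item())   # 1.0 up to float rounding
```

An entry with a count of zero keeps probability zero, so that character can never be drawn from this row.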
26:29
so now we can try to sample from this
26:31
distribution to sample from these
26:32
distributions we're going to use
26:34
torch.multinomial which i've pulled up
26:36
here so torch.multinomial returns uh
26:40
samples from the multinomial probability
26:44
distribution which is a complicated way
26:46
of saying you give me probabilities and
26:48
i will give you integers which are
26:49
sampled
26:51
according to the probability distribution
26:53
so this is the signature of the method
26:54
and to make everything deterministic
26:57
we're going to use a generator object in
26:59
pytorch
27:00
so this makes everything deterministic
27:02
so when you run this on your computer
27:04
you're going to get the exact
27:05
same results that i'm getting here on my
27:07
computer so let me show you how this works
27:12
here's the deterministic way of creating
27:18
a torch generator object
27:19
seeding it with some number that we can
27:21
agree on
27:23
so that seeds a generator and gives us
27:24
an object g
27:26
and then we can pass that g
27:28
to a function
27:30
that creates um
27:32
here random numbers torch.rand creates
27:35
random numbers three of them
27:37
and it's using this generator object
27:40
as a source of randomness
27:41
so
27:44
without normalizing it
27:46
i can just print
27:48
this is sort of like numbers between 0
27:50
and 1 that are random according to this
27:53
thing and whenever i run it again
27:54
i'm always going to get the same result
27:56
because i keep using the same generator
27:58
object which i'm seeding here
28:01
and then if i divide
28:03
to normalize i'm going to get a nice
28:05
probability distribution of just three
28:07
elements
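A minimal sketch of the seeded generator follows. The seed value is an assumption, since the transcript only says "some number that we can agree on", and the helper `seeded_probs` is a hypothetical name introduced here to show that re-seeding reproduces the same numbers.

```python
import torch

def seeded_probs(seed: int) -> torch.Tensor:
    """Draw three random numbers from a freshly seeded generator and normalize them."""
    g = torch.Generator().manual_seed(seed)  # manual_seed returns the generator
    p = torch.rand(3, generator=g)           # three numbers in [0, 1)
    return p / p.sum()                       # normalize so they sum to 1

# Assumed seed; any fixed integer gives the same reproducibility.
p1 = seeded_probs(2147483647)
p2 = seeded_probs(2147483647)

print(p1)
print(torch.equal(p1, p2))  # True: same seed, same numbers
```

Reusing one generator object without re-seeding would instead continue its random stream, so consecutive calls would return different values.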
28:09
and then we can use torch.multinomial
28:11
to draw samples from it so this is what
28:13
that looks like
28:16
torch.multinomial will take the
28:18
torch tensor
28:20
of probability distributions
28:22
then we can ask for a number of samples
28:24
let's say 20
28:26
replacement equals true means that when
28:28
we draw an element
28:30
we will uh we can draw it and then we
28:32
can put it back into the list of
28:35
eligible indices to draw again
28:37
and we have to specify replacement as
28:39
true because by default uh for some
28:41
reason it's false
28:42
and i think
28:44
you know it's just something to be
28:45
careful with
28:47
and the generator is passed in here so
28:49
we're going to always get deterministic
28:51
results the same results so if i run
28:53
these two
28:55
we're going to get a bunch of samples
28:57
from this distribution
28:58
now you'll notice here that the
29:00
probability for the
29:04
first element in this tensor is 60 percent
29:08
so in these 20 samples we'd expect 60 percent of
29:10
them to be zero
29:12
we'd expect thirty percent of them to be
29:14
one
29:17
and because the element at index two
29:19
has only ten percent probability very
29:22
few of these samples should be two and
29:24
indeed we only have a small number of
29:25
twos
29:28
and we can sample as many as we'd like
29:31
and the more we sample the more
29:33
these numbers should um roughly have the
29:35
distribution here
29:38
so we should have lots of zeros
29:41
half as many um
29:44
ones and we should have um three times
29:46
as few
29:49
oh sorry as few ones and three times as
29:50
few uh
29:51
twos
29:53
so you see that we have very few twos we
29:55
have some ones and most of them are zero
29:57
so that's what torch.multinomial is
29:58
doing
30:01
for us here
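The sampling behaviour can be checked with a short sketch. The probabilities are rounded to 0.6, 0.3, 0.1 to match the percentages quoted above, and the seed is an assumed value; the exact samples printed are not claimed to match the video.

```python
import torch

g = torch.Generator().manual_seed(2147483647)  # assumed seed
p = torch.tensor([0.6, 0.3, 0.1])              # rounded three-element distribution

# 20 draws with replacement: each draw is an index 0, 1 or 2.
samples = torch.multinomial(p, num_samples=20, replacement=True, generator=g)
print(samples)

# With many more draws, the empirical frequencies approach p.
many = torch.multinomial(p, num_samples=100_000, replacement=True, generator=g)
freq = torch.bincount(many, minlength=3).float() / many.numel()
print(freq)  # close to 0.6, 0.3, 0.1
```

Leaving `replacement` at its default of `False` would fail here, because 20 distinct indices cannot be drawn from only three categories.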
30:02
we are interested in this row we've created this
30:05
p here
30:09
and now we can sample from it
30:11
so if we use the same
30:12
seed
30:14
and then we sample from this
30:18
distribution let's just get one sample
30:22
then we see that the sample is say 13
30:25
so this will be the index
30:27
and you see how it's a tensor that
30:30
wraps 13 we again have to use .item()
30:32
to pop out that integer
30:35
and now index would be just the number
30:37
13
30:40
and of course the um we can do
30:43
we can map the i2s of ix to figure out
30:45
exactly which character
30:47
we're sampling here we're sampling m
30:50
so we're saying that the first character
30:51
is
30:53
in our generation
30:55
and just looking at the row here
30:57
m was drawn and we can see that m
30:59
actually starts a large number of words
31:01
uh m
31:04
started 2,500 words out of 32,000 words
31:06
so almost
31:07
a bit less than 10 percent of the words
31:09
start with m so this was actually a
31:13
fairly likely character to draw
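Drawing the first character can be sketched end to end on a toy dataset. Everything here is an assumption made for illustration: the five-word list replaces names.txt, `stoi` and `itos` are an assumed spelling of "s2i" and "i2s", and the seed is arbitrary, so the sampled index will not necessarily be 13.

```python
import string
import torch

# Made-up stand-in for names.txt.
words = ["emma", "mia", "maya", "olivia", "ava"]

stoi = {s: i + 1 for i, s in enumerate(string.ascii_lowercase)}
stoi["."] = 0
itos = {i: s for s, i in stoi.items()}

# Bigram counts, with '.' marking both the start and the end of a word.
N = torch.zeros((27, 27), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

# Row 0 holds the counts of first letters; normalize it to probabilities.
p = N[0].float()
p = p / p.sum()

g = torch.Generator().manual_seed(2147483647)  # assumed seed
ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
print(ix, itos[ix])  # .item() unwraps the one-element tensor into a plain int
```

Only letters that actually start a word in the toy list have nonzero probability, so the result is one of e, m, o or a.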
31:15
um
31:16
so that would be the first character of
31:18
our word and now we can continue to
31:20
sample more characters because now we
31:22
know that m started
31:24
m is already sampled
31:26
so now to draw the next character we
31:29
will come back here and we will look for
31:30
the row
31:32
that starts with m
31:34
so you see m
31:36
and we have a row here
31:39
so we see that m dot is
31:43
516 m a is this many m b is this many
31:44
etc so these are the counts for the next
31:46
row and that's the next character that
31:48
we are going to now generate
31:49
so i think we are ready to actually just
31:51
write out the loop because i think
31:52
you're starting to get a sense of how
31:54
this is going to go
31:56
the um
31:57
we always begin at
32:02
index 0 because that's the start token
32:04
and then while true
32:06
we're going to grab the row
32:08
corresponding to index
32:11
that we're currently on so that's p
32:14
so that's n array at ix converted to float is our p
32:19
then we normalize this p to sum to one
32:25
i accidentally ran the infinite loop we
32:30
normalize p to sum to one
32:33
then we need this generator object
32:35
now we're going to initialize up here
32:36
and we're going to draw a single sample from this distribution
32:40
and then this is going to tell us what
32:46
index is going to be next
32:48
if the index sampled is
32:49
0
32:52
then that's now the end token
32:55
so we will break
32:57
otherwise we are going to print s2i of ix
33:02
i2s
33:05
and uh that's pretty much it we're just
33:12
uh this should work okay mor
33:13
so that's that's the name that we've
33:16
sampled we started with m the next step was o then r and then dot
33:21
and this dot we print here as well
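The loop described above might look like the following sketch. The toy word list, the `stoi`/`itos` names and the seed are assumptions, so the printed characters will differ from the "mor." seen in the video.

```python
import string
import torch

# Made-up stand-in for names.txt.
words = ["emma", "mia", "maya", "olivia", "ava"]

stoi = {s: i + 1 for i, s in enumerate(string.ascii_lowercase)}
stoi["."] = 0
itos = {i: s for s, i in stoi.items()}

N = torch.zeros((27, 27), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

g = torch.Generator().manual_seed(2147483647)  # assumed seed

ix = 0  # always begin at the start token '.'
while True:
    p = N[ix].float()  # counts for whatever follows the current character
    p = p / p.sum()    # normalize to a probability distribution
    ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
    print(itos[ix])    # prints each character, including the final '.'
    if ix == 0:        # index 0 is the end token
        break
```

Every character that occurs in a word is followed by something in the counts, at least the end token, so each row the loop visits has a nonzero sum.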
33:26
so let's now do this a few times
33:29
so let's actually create an out list here
33:37
and instead of printing we're going to
33:39
append so out.append this character
33:44
and then here let's just print it at the
33:49
end so let's just join up all the outs
33:51
and we're just going to print mor okay
33:53
now we're always getting the same result
33:55
because of the generator
33:57
so if we want to do this a few times we
34:00
can go for i in range
34:02
10 we can sample 10 names
34:05
and we can just do that 10 times
34:06
and these are the names that we're
34:08
getting out let's do 20
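Collecting characters in a list and sampling several names can be sketched as below. As before, the toy word list, the `stoi`/`itos` names and the seed are assumptions, and the extra `names` list is added here only so the results can be inspected afterwards.

```python
import string
import torch

# Made-up stand-in for names.txt.
words = ["emma", "mia", "maya", "olivia", "ava"]

stoi = {s: i + 1 for i, s in enumerate(string.ascii_lowercase)}
stoi["."] = 0
itos = {i: s for s, i in stoi.items()}

N = torch.zeros((27, 27), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

g = torch.Generator().manual_seed(2147483647)  # seeded once, outside the loop

names = []  # hypothetical extra list, kept for inspection
for _ in range(10):
    out = []
    ix = 0  # start token
    while True:
        p = N[ix].float()
        p = p / p.sum()
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])  # append instead of printing
        if ix == 0:
            break
    names.append("".join(out))
    print(names[-1])
```

Because the generator is created once before the outer loop, the ten names differ from each other while the whole run stays reproducible.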
34:14
i'll be honest with you this doesn't
34:16
look right
34:18
so i stared at it for a few minutes to convince
34:20
myself that it actually is right
34:22
the reason these samples are so terrible
34:24
is that the bigram language model
34:26
is actually just like really
34:27
terrible
34:29
we can generate a few more here
34:30
and you can see that they're kind of
34:33
like they're name-like a little bit like
34:36
yanu o'reilly etc but they're just like
34:38
totally messed up um
34:40
and i mean the reason that this is so
34:42
bad like we're generating h as a name
34:44
but you have to think through
34:47
it from the model's eyes it doesn't know
34:49
that this h is the very first h all it
34:52
knows is that h was previously and now
34:55
how likely is h the last character well
34:57
it's somewhat
34:58
likely and so it just makes it the last
34:59
character it doesn't know that there
35:01
were other things before it or there
35:04
were not other things before it and so
35:05
that's why it's generating all these
35:06
like
35:08
nonsense names another way to do this is
35:11
to convince yourself that this is
35:12
to convince yourself that this is
35:12
to convince yourself that this is actually doing something reasonable even
35:14
actually doing something reasonable even
35:14
actually doing something reasonable even though it's so terrible is
35:17
though it's so terrible is
35:17
though it's so terrible is these little piece here are 27 right
35:20
these little piece here are 27 right
35:20
these little piece here are 27 right like 27.
35:23
like 27.
35:23
like 27. so how about if we did something like
35:23
So how about if we did something like this: instead of p having any structure whatsoever, how about if p was just torch.ones(27) (by default this is float32, so this is fine) divided by 27?
35:42
So what I'm doing here is the uniform distribution, which will make everything equally likely, and we can sample from that. So let's see if that does any better.
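A minimal sketch of this uniform baseline, assuming PyTorch is installed. The `itos` lookup table, the `sample_name` helper, the length cap, and the seed are illustrative assumptions, not the lecture's exact notebook code.

```python
import string
import torch

# Hypothetical vocabulary: index 0 is the special '.' start/end token,
# indices 1..26 are the letters a..z (27 symbols in total).
itos = {0: "."}
itos.update({i + 1: ch for i, ch in enumerate(string.ascii_lowercase)})

# Uniform distribution over the 27 symbols; torch.ones defaults to float32.
p = torch.ones(27) / 27.0

g = torch.Generator().manual_seed(2147483647)  # arbitrary seed for repeatability


def sample_name(max_len=50):
    """Draw characters from the uniform p until the '.' token comes up."""
    out = []
    while len(out) < max_len:  # cap added so the sketch always terminates
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        if ix == 0:
            break
        out.append(itos[ix])
    return "".join(out)


samples = [sample_name() for _ in range(5)]
print(samples)
```

Because every symbol is equally likely, the output has no name-like structure at all, which is the point of the comparison.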
35:54
Okay, so this is what you get from a model that is completely untrained, where everything is equally likely, so it's obviously garbage. And then if we have a trained model, which is trained on just bigrams, this is what we get. So you can see that it is more name-like. It is actually working; it's just that the bigram model is so terrible, and we have to do better.
36:15
Now, next I would like to fix an inefficiency that we have going on here. What we're doing is always fetching a row of N from the counts matrix up ahead, and then we're always doing the same things: we're converting to float and we're dividing, and we're doing this on every single iteration of this loop. We just keep renormalizing these rows over and over again, which is extremely inefficient and wasteful.
36:37
So what I'd like to do is prepare a matrix, capital P, that will just have the probabilities in it. In other words, it's going to be the same as the capital N matrix of counts, but every single row will hold probabilities normalized to 1, indicating the probability distribution for the next character given the character before it, as defined by which row we're in.
37:01
So basically we'd like to just do that up front here, and then use that row here. So here we would like to just do p = P[ix] instead. Okay.
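A rough sketch of the restructured loop, assuming PyTorch. The random count matrix is a made-up stand-in for the real N built from names.txt, and the rows are normalized one at a time here so the sketch does not rely on broadcasting; the length cap and seed are also assumptions.

```python
import torch

g = torch.Generator().manual_seed(0)  # arbitrary seed

# Toy stand-in for the 27x27 bigram count matrix N (the real one comes from names.txt).
N = torch.randint(1, 100, (27, 27), generator=g, dtype=torch.int32)

# Done once, up front: a float copy with every row normalized to sum to 1.
P = torch.stack([row / row.sum() for row in N.float()])

ix = 0  # start at the special '.' token
indices = []
for _ in range(20):  # cap on length so the sketch always terminates
    p = P[ix]  # just a row lookup; no per-step float conversion or division
    ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
    if ix == 0:
        break
    indices.append(ix)
print(indices)
```

The sampling logic is unchanged; only the normalization has moved out of the loop.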
37:14
The other reason I want to do this is not just efficiency: I would also like us to practice these n-dimensional tensors and their manipulation, especially something called broadcasting, which we'll go into in a second. We're actually going to have to become very good at these tensor manipulations, because if we're going to build all the way out to transformers, we're going to be doing some pretty complicated array operations for efficiency, and we need to really understand that and be very good at it.
37:42
So intuitively, what we want to do is first grab the floating-point copy of N (I'm basically mimicking the line here), and then we want to divide all the rows so that they sum to 1. So we'd like to do something like this: P = N.float(), then P / P.sum().
38:00
But now we have to be careful, because P.sum() sums up all of the counts of this entire matrix N and gives us a single number, just the summation of everything. That's not the way we want to divide. We want to simultaneously and in parallel divide all the rows by their respective sums.
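To make the pitfall concrete, here is a small sketch assuming PyTorch, with a made-up 3 by 3 count matrix standing in for the 27 by 27 N.

```python
import torch

# Toy 3x3 count matrix standing in for the 27x27 N (values are made up).
N = torch.tensor([[1, 1, 2],
                  [0, 3, 1],
                  [4, 4, 2]])
P = N.float()

total = P.sum()       # no dim argument: one number for the whole matrix
print(total.shape)    # a 0-dimensional tensor
print(total.item())   # 18.0

wrong = P / total     # the whole matrix now sums to 1, but individual rows do not
print(wrong.sum().item())
print(wrong.sum(1))
```

Dividing by the grand total yields a joint distribution over all bigrams rather than one conditional distribution per row, which is why a per-row sum is needed.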
38:30
So what we have to do now is go into the documentation for torch.sum, and we can scroll down here to the definition that is relevant to us, which is where we don't only provide an input array that we want to sum, but also the dimension along which we want to sum. And in particular, we want to sum up over rows, right?
38:52
Now, one more argument that I want you to pay attention to here is keepdim, which defaults to False. If keepdim is True, then the output tensor is of the same size as the input, except of course in the dimension along which it is summed, which will become just one. But if you pass in keepdim as False, then this dimension is squeezed out. So torch.sum not only does the sum and collapses that dimension to be of size one, but in addition it does what's called a squeeze, where it squeezes out that dimension.
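A quick way to see the effect of keepdim, sketched with PyTorch on a small made-up 2 by 3 tensor rather than the lecture's matrix:

```python
import torch

x = torch.arange(6, dtype=torch.float32).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

kept = x.sum(0, keepdim=True)  # the summed dimension stays, with size 1
dropped = x.sum(0)             # keepdim defaults to False: the dimension is removed

print(kept.shape)     # torch.Size([1, 3])
print(dropped.shape)  # torch.Size([3])
print(torch.equal(kept.squeeze(0), dropped))  # same numbers, different shape
```

The values are identical in both cases; only the number of dimensions differs, and that difference is what matters once broadcasting gets involved.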
39:24
So basically what we want here is to instead do P.sum along some axis. In particular, notice that P.shape is 27 by 27. So when we sum across axis zero, we would be taking the zeroth dimension and summing across it.
39:42
So when keepdim is True, this thing will not only give us the counts along the columns, but notice that the shape of this is 1 by 27: we just get a row vector. And the reason we get a row vector here, again, is that we passed in dimension zero, so this zeroth dimension becomes one. We've done the sum this way, vertically, and arrived at just a single 1 by 27 vector of counts.
40:15
What happens when you take out keepdim is that we just get 27. So it squeezes out that dimension and we just get a one-dimensional vector of size 27.
40:28
Now, we don't actually want a 1 by 27 row vector, because that gives us the counts, or the sums, across the columns. We actually want to sum the other way, along dimension one, and you'll see that the shape of this is 27 by 1, so it's a column vector: a 27 by 1 vector of counts.
40:53
And that's because what's happened here is that we're going horizontally, and this 27 by 27 matrix becomes a 27 by 1 array.
41:03
Now, you'll notice, by the way, that the actual numbers of these counts are identical. That's because this special array of counts comes from bigram statistics, and it just so happens, by chance or rather because of the way this array is constructed, that the sums along the columns and along the rows, horizontally or vertically, are identical.
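The coincidence can be reproduced on a tiny example. In the sketch below (assuming PyTorch), the three-word list is a made-up stand-in for names.txt and the `stoi` mapping is built ad hoc. Every character occurrence is the second element of one bigram and the first element of the next, and the '.' token opens and closes each word once, so each symbol's row sum and column sum agree.

```python
import torch

words = ["emma", "olivia", "ava"]  # tiny stand-in for names.txt
chars = sorted(set("".join(words)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0  # special start/end token

N = torch.zeros((len(stoi), len(stoi)), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

row_sums = N.sum(1, keepdim=True)  # shape (V, 1): how often each symbol comes first
col_sums = N.sum(0, keepdim=True)  # shape (1, V): how often each symbol comes second
print(row_sums.shape, col_sums.shape)
print(torch.equal(row_sums.flatten(), col_sums.flatten()))
```

The numbers match, but the shapes do not: one is a column vector and the other a row vector, and only the column vector lines up with the rows that need normalizing.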
41:26
But what we actually want to do in this case is sum across the rows, horizontally. So what we want here is P.sum(1, keepdim=True), a 27 by 1 column vector, and now what we want to do is divide by that.
41:44
Now we have to be careful here again. P.shape, you see here, is 27 by 27. Is it possible to take a 27 by 27 array and divide it by what is a 27 by 1 array? Is that an operation that you can do?
42:03
Whether or not you can perform this operation is determined by what are called broadcasting rules. If you just search for broadcasting semantics in torch, you'll find that there's a special definition of broadcasting that determines whether or not these two arrays can be combined in a binary operation like division.
42:23
So the first condition is that each tensor has at least one dimension, which is the case for us. And then, when iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them is one, or one of them does not exist. Okay.
42:38
So let's do that. We need to align the two arrays and their shapes, which is very easy because both of these shapes have two elements, so they're aligned. Then we iterate from the right, going to the left. Each dimension must be either equal, one of them is a one, or one of them does not exist. So in this case they're not equal, but one of them is a one, so this is fine. And then in this dimension they're both equal, so this is fine.
43:05
So all the dimensions are fine, and therefore this operation is broadcastable. That means that this operation is allowed.
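The rule just walked through can be written down as a short pure-Python check. The helper name `is_broadcastable` is hypothetical (it is not a PyTorch function), and it assumes both shapes have at least one dimension, as the first condition requires.

```python
def is_broadcastable(shape_a, shape_b):
    """Check two shapes against the rule quoted above (both assumed to have >= 1 dim)."""
    # Walk both shapes from the trailing (rightmost) dimension to the left.
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        if a != b and a != 1 and b != 1:
            return False
    # Dimensions that exist in only one of the shapes are always fine.
    return True


print(is_broadcastable((27, 27), (27, 1)))  # True
print(is_broadcastable((27, 27), (27,)))    # True
print(is_broadcastable((27, 27), (26, 1)))  # False
```

Note that the check only says whether an operation is allowed to run, not whether it computes what was intended.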
43:14
And what is it that these arrays do when you divide 27 by 27 by 27 by 1? It takes this dimension of size one and stretches it out: it copies it to match 27 here. So in our case it takes this column vector, which is 27 by 1, and copies it 27 times to make both of these 27 by 27 internally (you can think of it that way). So it copies those counts, and then it does an element-wise division, which is what we want, because we want to divide by these counts on every single one of these columns in this matrix.
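The "stretching" can be spelled out explicitly. This sketch assumes PyTorch and uses a made-up 3 by 3 matrix in place of the 27 by 27 one; `expand` is used only to make visible the copy that broadcasting performs conceptually.

```python
import torch

P = torch.tensor([[1., 1., 2.],
                  [0., 3., 1.],
                  [4., 4., 2.]])     # toy 3x3 stand-in for the 27x27 matrix
row_sums = P.sum(1, keepdim=True)    # shape (3, 1)

stretched = row_sums.expand(3, 3)    # the column repeated across all 3 columns
print(stretched)

implicit = P / row_sums              # broadcasting does the stretch for us
explicit = P / stretched             # the same division with the copy spelled out
print(torch.equal(implicit, explicit))
```

Each row of `stretched` holds that row's own sum in every position, so the element-wise division normalizes row by row.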
43:54
So we expect that this will actually normalize every single row, and we can check that this is true by taking the first row, for example, and taking its sum. We expect this to be 1, because it's now normalized.
44:10
And then, if we actually correctly normalized all the rows, we expect to get the exact same result here. So let's run this: it's the exact same result.
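The same sanity check, sketched with PyTorch on a random count matrix (a made-up stand-in for the real N; the seed is arbitrary). It also compares one precomputed row against the old per-row normalization, which is why sampling gives the same result as before.

```python
import torch

g = torch.Generator().manual_seed(0)
N = torch.randint(1, 100, (27, 27), generator=g)  # toy counts

P = N.float()
P = P / P.sum(1, keepdim=True)  # (27, 27) divided by (27, 1)

print(P[0].sum())                                # approximately 1
print(torch.allclose(P.sum(1), torch.ones(27)))  # every row is normalized

# The old approach: normalize a single row on the fly.
p0 = N[0].float()
p0 = p0 / p0.sum()
print(torch.allclose(P[0], p0))                  # same distribution as the precomputed row
```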
44:21
This is correct. So now I would like to scare you a little bit. I encourage you very strongly to read through the broadcasting semantics, and I encourage you to treat this with respect. It's not something to play fast and loose with. It's something to really respect, really understand, and maybe look up some tutorials on broadcasting, practice it, and be careful with it, because you can very quickly run into bugs. Let me show you what I mean.
44:47
You see how here we have P.sum(1, keepdim=True)? The shape of this is 27 by 1. Let me take out this line, just so we have the N, and then we can see the counts. We can see that this is all the counts across all the rows, and it's a 27 by 1 column vector, right?
45:07
Now suppose that I tried to do the following, but I erase keepdim=True here. What does that do? If keepdim is not True, it's False, and then, remember, according to the documentation it gets rid of this dimension of size one: it squeezes it out. So basically we get all the same counts, the same result, except the shape of it is not 27 by 1, it is just 27. The one disappears, but all the counts are the same.
45:37
So you'd think that this divide would work. First of all, can we even write this? Is it even expected to run? Is it broadcastable? Let's determine if this result is broadcastable.
45:51
The shape of P.sum(1) is 27, and this is 27 by 27. So it's 27 by 27 broadcasting into 27.
46:01
So now, the rules of broadcasting. Number one: align all the dimensions on the right. Done. Now iterate over all the dimensions, starting from the right and going to the left. Each pair of dimensions must either be equal, or one of them must be 1, or one of them must not exist. So here they are equal, and here the dimension does not exist.
46:19
So internally, what broadcasting will do is create a 1 here, and then we see that one of them is a 1, so this will get copied and this will run. This will broadcast.
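The rule just described can be sketched in plain Python. The function name `broadcast_shape` is hypothetical and this is only an illustration of the rule, not PyTorch's actual implementation.

```python
def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes, or raise ValueError."""
    result = []
    # Walk both shapes from the right; a missing dimension is treated as 1.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            raise ValueError(f"cannot broadcast {a} with {b}")
    return tuple(reversed(result))

print(broadcast_shape((27, 27), (27,)))    # (27, 27): (27,) is treated as (1, 27)
print(broadcast_shape((27, 27), (27, 1)))  # (27, 27)
```

Both calls succeed with the same output shape, so the shapes alone cannot tell you whether the division runs along the direction you intended.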
46:34
Okay, so you'd expect this to work, because this broadcasts and we can divide.
46:45
Now if I run this, you'd expect it to work, but it doesn't. You actually get garbage, you get a wrong result, because this is actually a bug. keepdim=True is what makes it work.
47:00
This is a bug. In both cases we are doing the correct counts, we are summing up across the rows, but keepdim is saving us and making it work. So in this case I'd like to encourage you to pause the video at this point and try to think about why this is buggy and why keepdim was necessary here.
47:22
Okay, so the reason for this is what I was trying to hint at earlier, when I was giving you a bit of a hint on how this works. This 27 vector, internally inside the broadcasting, becomes a 1 by 27, and 1 by 27 is a row vector, right?
47:39
And now we are dividing 27 by 27 by 1 by 27, and torch will replicate this dimension. So basically it will take this row vector and copy it vertically 27 times, so that it aligns with the 27 by 27 exactly, and it divides element-wise.
48:00
And so what's happening here is that we're actually normalizing the columns instead of normalizing the rows. You can check this: P[0], which is the first row of P, .sum() is not 1, it's 7. It is the first column, as an example, that sums to 1.
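The effect can be reproduced on a small example. This is a sketch assuming PyTorch; `N_small` and `P_bad` are hypothetical names. The toy matrix is chosen to be symmetric so that each row total equals the matching column total, which is the property that makes the columns come out summing to exactly one here; for an arbitrary matrix the columns need not sum to one.

```python
import torch

# Hypothetical 3x3 stand-in for the counts; symmetric, so row totals equal
# column totals (an assumption of this sketch).
N_small = torch.tensor([[0., 3., 1.],
                        [3., 2., 1.],
                        [1., 1., 4.]])

row_sums = N_small.sum(1)      # shape (3,), treated as (1, 3) when broadcasting
P_bad = N_small / row_sums     # entry [i, j] is divided by row_sums[j], not row_sums[i]

print(P_bad.sum(1))  # the rows do not all sum to 1
print(P_bad.sum(0))  # the columns sum to 1 for this matrix
```

The division runs without error, which is why this kind of bug is easy to miss.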
48:24
So to summarize, where does the issue come from? The issue comes from the silent adding of a dimension here, because in the broadcasting rules you align on the right and go from right to left, and if a dimension doesn't exist, you create it. So that's where the problem happens.
48:37
We still did the counts correctly: we did the counts across the rows and we got the counts on the right here as a column vector. But because keepdim was not True, this dimension was discarded and now we just have a vector of 27. And because of the way broadcasting works, this vector of 27 suddenly becomes a row vector, and then this row vector gets replicated vertically, and at every single point we are dividing by the count in the opposite direction.
49:08
So this just doesn't work. This needs to be keepdim=True in this case. Then we have that P[0] is normalized, and conversely the first column you'd expect to potentially not be normalized. And this is what makes it work.
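A minimal sketch of the working version, again assuming PyTorch and a hypothetical 3 by 3 stand-in for the counts. The final assertion is one cheap way to check that the normalization went along the rows.

```python
import torch

# Hypothetical 3x3 stand-in for the count matrix.
N_small = torch.tensor([[0., 3., 1.],
                        [3., 2., 1.],
                        [1., 1., 4.]])

# keepdim=True keeps the sums as a (3, 1) column, so row i is divided by its own sum.
P_good = N_small / N_small.sum(1, keepdim=True)

print(P_good[0].sum())  # the first row now sums to 1
print(P_good.sum(0))    # the columns are no longer normalized

# Guard against the broadcasting bug: every row must sum to 1.
assert torch.allclose(P_good.sum(1), torch.ones(3))
```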
49:31
So, pretty subtle, and hopefully this helps to scare you: you should have respect for broadcasting. Be careful, check your work, understand how it works under the hood, and make sure that it's broadcasting in the direction that you like. Otherwise you're going to introduce very subtle bugs, very hard to find bugs. Just be careful.
49:46
One more note on efficiency: we don't want to be doing this here, because this creates a completely new tensor that we store into P.
49:56
We prefer to use in-place operations if possible. So this would be an in-place operation. It has the potential to be faster, and it doesn't create new memory under the hood.
50:03
And then let's erase this, we don't need it, and let's also just do fewer, just so I'm not wasting space.
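As a rough illustration of the in-place form, assuming PyTorch: the sketch below uses a hypothetical 2 by 2 tensor `P_demo` and compares the address of its storage before and after the division. The temporary holding the row sums is still allocated; only the result reuses the existing memory.

```python
import torch

# Hypothetical small tensor standing in for P.
P_demo = torch.tensor([[1., 3.],
                       [2., 2.]])
ptr_before = P_demo.data_ptr()          # address of the underlying storage

P_demo /= P_demo.sum(1, keepdim=True)   # in-place division, writes into P_demo's memory

print(P_demo.data_ptr() == ptr_before)  # True: no new tensor was created for the result
print(P_demo.sum(1))                    # each row sums to 1
```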
50:15
Okay, so we're actually in a pretty good spot now. We trained a bigram language model, and we trained it really just by counting how frequently any pairing occurs and then normalizing so that we get a nice probability distribution.
50:27
So really, the elements of this array P are the parameters of our bigram language model, giving us and summarizing the statistics of these bigrams.
50:36
So we trained the model, and we know how to sample from the model: we just iteratively sample the next character, feed it in each time, and get the next character.
50:48
Now what I'd like to do is somehow evaluate the quality of this model. We'd like to summarize the quality of this model into a single number: how good is it at predicting the training set?
50:58
As an example, on the training set we can now evaluate the training loss, and this training loss tells us about the quality of this model in a single number, just like we saw in micrograd.
51:13
So let's try to think through the quality of the model and how we would evaluate it. Basically, what we're going to do is copy-paste this code that we previously used for counting.
51:24
And let me just print these bigrams first. We're going to use f-strings, and I'm going to print character one followed by character two. These are the bigrams. And I don't want to do it for all the words, just the first three words. So here we have the emma, olivia and ava bigrams.
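A small sketch of that printing loop, in plain Python. The three names come from the transcript; using "." as a start and end marker, and the variable names, are assumptions of this sketch.

```python
# First three names mentioned in the transcript.
words = ["emma", "olivia", "ava"]

bigrams = []
for w in words:
    # "." marks the start and the end of a word (an assumption of this sketch).
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        bigrams.append((ch1, ch2))
        print(f"{ch1}{ch2}")
```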
51:41
Now what we'd like to do is look at the probability that the model assigns to every one of these bigrams. In other words, we can look at the probability, which is summarized in the matrix P at ix1, ix2, and then we can print it here as the probability.
52:00
And because these probabilities have way too many digits, let me format them with :.4f to truncate them a bit.
52:10
So what do we have here? We're looking at the probabilities that the model assigns to every one of these bigrams in the dataset, and we can see that some of them are 4 percent, 3 percent, etc.
52:18
Just to have a measuring stick in our mind, by the way: we have 27 possible characters, or tokens, and if everything was equally likely then you'd expect all these probabilities to be roughly 4 percent.
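The measuring stick is just one divided by the number of possible characters. A quick check in plain Python (the variable names are illustrative):

```python
vocab_size = 27                 # number of possible characters, from the transcript
uniform_prob = 1 / vocab_size   # probability of each character if all were equally likely

print(f"{uniform_prob:.4f}")    # 0.0370, i.e. roughly 4 percent
```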
52:34
So anything above 4 percent means that we've learned something useful from these bigram statistics. And you see that some of these are roughly 4 percent, but some of them are as high as 40 percent, 35 percent, and so on. So the model actually assigned a pretty high probability to whatever's in the training set, and that's a good thing.
52:50
Basically, if you have a very good model, you'd expect these probabilities to be near 1, because that means your model is correctly predicting what's going to come next, especially on the training set where you trained your model.
53:03
So now we'd like to think about how we can summarize these probabilities into a single number that measures the quality of this model.
53:11
When you look at the literature on maximum likelihood estimation, statistical modeling, and so on, you'll see that what's typically used here is something called the likelihood. The likelihood is the product of all of these probabilities, and it's really telling us about the probability of the entire dataset as assigned by the model that we've trained. That is a measure of quality.
53:39
So the product of these should be as high as possible. When you are training the model, and when you have a good model, the product of these probabilities should be very high.
53:50
Now, the product of these probabilities is an unwieldy thing to work with: you can see that all of them are between zero and one, so the product of these probabilities will be a very tiny number.
54:01
So for convenience, what people usually work with is not the likelihood but what's called the log likelihood.
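To see how tiny the product gets, here is a sketch in plain Python with made-up data: 300 bigrams that each receive probability 0.04. In double precision the product underflows all the way to zero, while the sum of the logs stays an ordinary finite number.

```python
import math

# Hypothetical data: 300 bigrams, each assigned probability 0.04 by the model.
probs = [0.04] * 300

likelihood = math.prod(probs)                      # underflows to 0.0 in double precision
log_likelihood = sum(math.log(p) for p in probs)   # stays a finite negative number

print(likelihood)
print(log_likelihood)
```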
54:08
So the product of these is the likelihood. To get the log likelihood, we just have to take the log of the probability.
54:14
Here I have the log of x from zero to one. The log is, as you see here, a monotonic transformation of the probability, where if you pass in 1 you get 0. So probability 1 gets you a log probability of 0, and then as you go to lower and lower probability, the log grows more and more negative, all the way to negative infinity at zero.
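A few sample points show the shape of the curve. This sketch uses Python's standard `math.log` with arbitrarily chosen probabilities. Note that `math.log(0.0)` raises a `ValueError` instead of returning negative infinity, so zero is left out here.

```python
import math

sample_probs = [1.0, 0.5, 0.04, 1e-6]          # arbitrary illustrative probabilities
logs = [math.log(p) for p in sample_probs]

for p, lp in zip(sample_probs, logs):
    print(f"p={p:<8} log(p)={lp:.4f}")
```

The printed values start at zero for probability 1 and become more negative as the probability shrinks.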
54:41
So here we have logprob, which is really just torch.log of the probability. Let's print it out to get a sense of what that looks like: logprob, also with :.4f.
54:54
Okay, so as you can see, when we plug in some of our higher numbers we get closer and closer to zero, and if we plug in very bad probabilities we get a more and more negative number. That's bad.
55:10
And the reason we work with this is, to a large extent, convenience. We have mathematically that if you have some product a times b times c of all these probabilities (the likelihood is the product of all these probabilities), then the log of it is just log of a plus log of b plus log of c, if you remember your logs from high school or undergrad and so on.
55:39
So basically, the likelihood is the product of the probabilities, and the log likelihood is just the sum of the logs of the individual probabilities.
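The identity is easy to confirm numerically. A small check in plain Python, with three arbitrary probabilities chosen only for illustration:

```python
import math

a, b, c = 0.4, 0.05, 0.3   # arbitrary illustrative probabilities

lhs = math.log(a * b * c)                        # log of the product (the likelihood)
rhs = math.log(a) + math.log(b) + math.log(c)    # sum of the individual logs

print(math.isclose(lhs, rhs))   # True, up to floating-point rounding
```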
55:49
So the log likelihood starts at zero, and then here we can simply accumulate it. And in the end we can print this: print the log likelihood, with f-strings, maybe you're familiar with this.
56:13
So the log likelihood is negative 38.
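The accumulation can be sketched end to end in plain Python. This is a toy version: it builds the bigram probabilities from only three names using dictionaries, so its result will not match the negative 38 from the video, which comes from a model fit on the full dataset. The "." marker and all variable names are assumptions of this sketch.

```python
import math
from collections import Counter

words = ["emma", "olivia", "ava"]   # toy training set for this sketch

# Count bigrams, using "." as an assumed start/end marker.
counts = Counter()
for w in words:
    chs = ["."] + list(w) + ["."]
    counts.update(zip(chs, chs[1:]))

# Total count for each first character, used to normalize each row.
row_totals = Counter()
for (ch1, _), n in counts.items():
    row_totals[ch1] += n

# Start at zero and accumulate the log probability of every bigram.
log_likelihood = 0.0
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        prob = counts[(ch1, ch2)] / row_totals[ch1]
        log_likelihood += math.log(prob)

print(f"{log_likelihood=:.4f}")
```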
56:19
Okay. Now, how high can the log likelihood get? It can go to zero: when all the probabilities are 1, the log likelihood will be zero, and when the probabilities are lower, it will grow more and more negative.
56:37
Now, we don't actually like this, because what we'd like is a loss function, and a loss function has the semantics that low is good, because we're trying to minimize the loss. So we actually need to invert this, and that's what gives us something called the negative log likelihood.
56:55
negative log likelihood is just negative
56:58
negative log likelihood is just negative
56:58
negative log likelihood is just negative of the log likelihood
57:03
These are f-strings, by the way, if you'd like to look this up.
57:09
So the negative log likelihood is now just the negative of it, and it is a very nice loss function because the lowest it can get is zero, and the higher it is, the worse the predictions you're making.
57:24
One more modification that people sometimes make, for convenience, is to normalize it: they make it an average instead of a sum. So here let's keep a count as well: n starts at zero, and we do n += 1 for each bigram.
57:43
If we normalize the negative log likelihood by the count, we get the average negative log likelihood, and this is what we would usually use as our loss function.
58:02
So the loss the model gets on the training set is 2.4; that's the quality of this model. The lower it is, the better off we are, and the higher it is, the worse off we are.
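As a rough illustration of the sign flip and the averaging step, here is a small sketch under the same kind of assumptions as before: a hypothetical three-word list instead of names.txt, and dictionaries instead of the tensor P. The names nll and avg_nll are chosen here for readability and are not taken from the video.

```python
import math

words = ["emma", "olivia", "ava"]  # toy stand-in for names.txt (assumption)

def bigrams(word):
    """Return (ch1, ch2) pairs, with '.' marking the start and end of the word."""
    chs = ["."] + list(word) + ["."]
    return zip(chs, chs[1:])

counts, row_totals = {}, {}
for w in words:
    for ch1, ch2 in bigrams(w):
        counts[(ch1, ch2)] = counts.get((ch1, ch2), 0) + 1
        row_totals[ch1] = row_totals.get(ch1, 0) + 1

log_likelihood = 0.0
n = 0  # number of bigrams seen so far
for w in words:
    for ch1, ch2 in bigrams(w):
        log_likelihood += math.log(counts[(ch1, ch2)] / row_totals[ch1])
        n += 1

nll = -log_likelihood  # flip the sign so that lower is better
avg_nll = nll / n      # average per bigram instead of a sum

print(f"{nll=}")
print(f"{avg_nll=}")
```

Dividing by n makes the loss comparable across datasets of different sizes, which is one reason the average is the number usually reported.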
58:13
The job of training is to find the parameters that minimize the negative log likelihood loss, and that would give us a high-quality model.
58:24
Okay, so to summarize, I actually wrote it out here. Our goal is to maximize the likelihood, which is the product of all the probabilities assigned by the model, and we want to maximize it with respect to the model parameters.
58:39
In our case the model parameters are defined in the table: these numbers, the probabilities, are the model parameters in our bigram language model so far. But keep in mind that here we are storing everything in table format. As a brief preview of what's coming up, these numbers will not be kept explicitly; they will be calculated by a neural network.
59:04
We want to tune the parameters of that neural network to maximize the likelihood, the product of the probabilities.
59:13
Now, maximizing the likelihood is equivalent to maximizing the log likelihood, because log is a monotonic function. Here's the graph of log. You can look at it as just a monotonic rescaling of the objective, so these two optimization problems are equivalent.
59:41
Maximizing the log likelihood is equivalent to minimizing the negative log likelihood, and in practice people actually minimize the average negative log likelihood, to get numbers like 2.4.
59:52
This summarizes the quality of your model, and we'd like to make it as small as possible. The lowest it can get is zero.
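Written out as equations, the chain of equivalences in this summary looks as follows. The notation is an assumption introduced here, not taken from the video: theta denotes the model parameters, p_theta(y_i | x_i) is the probability the model assigns to the second character y_i of the i-th bigram given its first character x_i, and n is the number of bigrams.

```latex
\begin{aligned}
\hat{\theta}
  &= \arg\max_{\theta} \prod_{i=1}^{n} p_{\theta}(y_i \mid x_i)
     && \text{(maximize the likelihood)} \\
  &= \arg\max_{\theta} \sum_{i=1}^{n} \log p_{\theta}(y_i \mid x_i)
     && \text{(log is monotonic)} \\
  &= \arg\min_{\theta} \; -\sum_{i=1}^{n} \log p_{\theta}(y_i \mid x_i)
     && \text{(negative log likelihood)} \\
  &= \arg\min_{\theta} \; -\frac{1}{n} \sum_{i=1}^{n} \log p_{\theta}(y_i \mid x_i)
     && \text{(average negative log likelihood)}
\end{aligned}
```

Each step changes the value of the objective but not the parameters at which the optimum is reached. Because every probability is at most one, every term inside the sum is at most zero, which is why the loss cannot go below zero.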
1:00:02
The lower it is, the better off your model is, because it's assigning high probabilities to your data.
1:00:09
Now let's evaluate this over the entire training set, just to make sure we get something around 2.4. Let's run it over all the words, and take out the print statement as well. Okay: 2.45 over the entire training set.
1:00:24
What I'd like to show you is that you can actually evaluate the probability of any word you want. For example, if we test just the single word "andrej" and bring back the print statement, you see that "andrej" is actually a fairly unlikely word: on average it costs about three in negative log probability per bigram, and roughly that's because "ej" is apparently very uncommon.
1:00:51
Now think through this: when I take "andrej", append a "q", and test the probability of "andrejq", we actually get infinity. That's because "jq" has a zero percent probability according to our model, so the log of zero is negative infinity and we get infinite loss.
1:01:14
This is kind of undesirable, because we plugged in a string that could be a somewhat reasonable name. What this is saying is that the model is exactly zero percent likely to predict this name, and our loss is infinity on this example.
1:01:29
The reason is that "j" is followed by "q" zero times in the counts, so "jq" is zero percent likely.
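The same failure can be reproduced on a tiny example. This is only a sketch under stated assumptions: the word list is a hypothetical three-name stand-in for names.txt, the helper avg_nll is introduced here for illustration, and "eva" plays the role that "andrejq" plays in the video because it contains a bigram ("ev") that never occurs in the toy data. Plain Python's math.log raises an error on zero, so the sketch maps a zero probability to negative infinity by hand to mirror the behaviour described above.

```python
import math

words = ["emma", "olivia", "ava"]  # toy stand-in for names.txt (assumption)

def bigrams(word):
    """Return (ch1, ch2) pairs, with '.' marking the start and end of the word."""
    chs = ["."] + list(word) + ["."]
    return zip(chs, chs[1:])

counts, row_totals = {}, {}
for w in words:
    for ch1, ch2 in bigrams(w):
        counts[(ch1, ch2)] = counts.get((ch1, ch2), 0) + 1
        row_totals[ch1] = row_totals.get(ch1, 0) + 1

def avg_nll(word):
    """Average negative log likelihood of one word under the raw count model."""
    total, n = 0.0, 0
    for ch1, ch2 in bigrams(word):
        row = row_totals.get(ch1, 0)
        prob = counts.get((ch1, ch2), 0) / row if row else 0.0
        # math.log(0) raises ValueError, so treat log(0) as -inf explicitly.
        total += math.log(prob) if prob > 0 else float("-inf")
        n += 1
    return -total / n

print(avg_nll("emma"))  # finite: every bigram was seen in the toy data
print(avg_nll("eva"))   # inf: the bigram "ev" has a count of zero
```

A single unseen bigram is enough to make the whole word infinitely unlikely, however plausible the rest of it is.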
1:01:42
So it's actually kind of gross, and people don't like this too much. There's a very simple fix that people like to use to smooth out the model a little bit, and it's called model smoothing. Roughly, we add some fake counts.
1:01:57
Imagine adding a count of one to everything. We add a count of one like this, and then we recalculate the probabilities. That's model smoothing. You can add as much as you like: you can add five and it will give you a smoother model. The more you add, the more uniform the model becomes, and the less you add, the more peaked it is.
1:02:22
One is a pretty decent count to add, and it ensures that there are no zeros in our probability matrix P.
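Adding fake counts can be sketched as follows. As with the other examples, this is an illustrative sketch and not the notebook code: the word list is a hypothetical stand-in for names.txt, and the parameter name fake and the helpers prob and avg_nll are introduced here. Adding the same fake count to every cell of a row means the row total grows by that count times the vocabulary size.

```python
import math

words = ["emma", "olivia", "ava"]  # toy stand-in for names.txt (assumption)

def bigrams(word):
    """Return (ch1, ch2) pairs, with '.' marking the start and end of the word."""
    chs = ["."] + list(word) + ["."]
    return zip(chs, chs[1:])

counts, row_totals = {}, {}
for w in words:
    for ch1, ch2 in bigrams(w):
        counts[(ch1, ch2)] = counts.get((ch1, ch2), 0) + 1
        row_totals[ch1] = row_totals.get(ch1, 0) + 1

# Every character that can appear, including the '.' boundary marker.
vocab = sorted(set("".join(words)) | {"."})

def prob(ch1, ch2, fake=1):
    """P(ch2 | ch1) after adding `fake` (assumed > 0) to every count."""
    smoothed = counts.get((ch1, ch2), 0) + fake
    return smoothed / (row_totals.get(ch1, 0) + fake * len(vocab))

def avg_nll(word, fake=1):
    logs = [math.log(prob(c1, c2, fake)) for c1, c2 in bigrams(word)]
    return -sum(logs) / len(logs)

print(avg_nll("eva"))             # finite now: "ev" is rare but not impossible
print(prob("e", "m", fake=1))     # still clearly above uniform
print(prob("e", "m", fake=100))   # pulled close to uniform, 1 / len(vocab)
```

The larger the fake count, the more it drowns out the observed counts, which is the "more uniform" behaviour described above.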
1:02:30
This will of course change the generations a little bit. In this case it didn't, but in principle it could. What it does ensure is that nothing is infinitely unlikely. Now our model predicts some other probability, and we see that "jq" has a very small probability. The model still finds it very surprising that this was a word, or a bigram, but we don't get negative infinity. It's a nice fix that people like to apply sometimes, and it's called model smoothing.
1:02:56
Okay, so we've now trained a respectable bigram character-level language model. We trained the model by looking at the counts of all the bigrams and normalizing the rows to get probability distributions.
1:03:11
We saw that we can then use the parameters of this model to sample new words: we sample new names according to those distributions.
1:03:20
We also saw that we can evaluate the quality of this model, which is summarized in a single number, the negative log likelihood. The lower this number is, the better the model, because it is giving high probabilities to the actual next characters in all the bigrams in our training set.
1:03:40
That's all well and good, but we've arrived at this model explicitly by doing something that felt sensible: we were just performing counts and then normalizing those counts.
1:03:50
Now I would like to take an alternative approach. We will end up in a very similar position, but the approach will look very different, because I would like to cast the problem of bigram character-level language modeling into the neural network framework.
1:04:04
In the neural network framework we're going to approach things slightly differently, but again end up in a very similar spot. I'll go into that later.
1:04:12
Our neural network is still going to be a bigram character-level language model. It receives a single character as an input, then there's a neural network with some weights, or parameters, W, and it outputs the probability distribution over the next character in the sequence. It makes guesses as to what is likely to follow the character that was input to the model.
1:04:35
In addition, we're going to be able to evaluate any setting of the parameters of the neural net, because we have the loss function, the negative log likelihood. We're going to look at its probability distributions and use the labels, which are just the identity of the next character in the bigram, the second character.
1:04:54
Knowing which character actually comes next in the bigram lets us look at how high a probability the model assigns to that character. We of course want that probability to be very high, which is another way of saying that the loss is low.
1:05:10
So we're going to use gradient-based optimization to tune the parameters of this network: we have the loss function and we're going to minimize it. We're going to tune the weights so that the neural net correctly predicts the probabilities for the next character.
1:05:24
next character so let's get started the first thing i
1:05:26
so let's get started the first thing i
1:05:26
so let's get started the first thing i want to do is i want to compile the
1:05:27
want to do is i want to compile the
1:05:27
want to do is i want to compile the training set of this neural network
1:05:29
training set of this neural network
1:05:29
training set of this neural network right so
1:05:30
right so
1:05:30
right so create
1:05:31
create
1:05:31
create the training set
1:05:33
the training set
1:05:33
the training set of all the bigrams
1:05:36
of all the bigrams
1:05:36
of all the bigrams okay
1:05:37
okay
1:05:37
okay and
1:05:39
and
1:05:39
and here
1:05:40
here
1:05:40
here i'm going to copy paste this code
1:05:43
i'm going to copy paste this code
1:05:43
i'm going to copy paste this code because this code iterates over all the
1:05:45
because this code iterates over all the
1:05:45
because this code iterates over all the programs
1:05:47
programs
1:05:47
programs so here we start with the words we
1:05:48
so here we start with the words we
1:05:48
so here we start with the words we iterate over all the bygrams and
1:05:50
iterate over all the bygrams and
1:05:50
iterate over all the bygrams and previously as you recall we did the
1:05:52
previously as you recall we did the
1:05:52
previously as you recall we did the counts but now we're not going to do
1:05:53
counts but now we're not going to do
1:05:54
counts but now we're not going to do counts we're just creating a training
1:05:55
counts we're just creating a training
1:05:55
counts we're just creating a training set
1:05:56
set
1:05:56
set now this training set will be made up of
1:05:58
now this training set will be made up of
1:05:58
now this training set will be made up of two lists
1:06:02
We have the inputs and the targets, the labels. For each bigram we'll denote the pair as x, y; those are the characters, right? So we're given the first character of the bigram and we're trying to predict the next one. Both of these are going to be integers, so here we'll do xs.append(ix1) and ys.append(ix2). And then we actually don't want lists of integers; we will create tensors out of these, so xs is torch.tensor(xs) and ys is torch.tensor(ys).
1:06:41
And then we don't actually want to take all the words just yet, because I want everything to be manageable. So let's just do the first word, which is emma, and then it's clear what these xs and ys would be. Here, let me print character 1 and character 2, just so you see what's going on.
1:07:01
So the bigrams of these characters are (., e), (e, m), (m, m), (m, a), (a, .). This single word, as I mentioned, has one, two, three, four, five examples for our neural network. There are five separate examples in emma, and those examples are summarized here. When the input to the neural network is integer 0, the desired label is integer 5, which corresponds to e. When the input to the neural network is 5, we want its weights to be arranged so that 13 gets a very high probability. When 13 is put in, we want 13 to have a high probability. When 13 is put in, we also want 1 to have a high probability. When 1 is input, we want 0 to have a very high probability. So there are five separate input examples to a neural net in this dataset.
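The steps above can be sketched roughly as follows. This is a minimal sketch, not the exact notebook code: it assumes a one-word list standing in for names.txt, and it assumes the character-to-index mapping from earlier in the video ('.' at index 0, then a through z at 1 through 26). The names words, stoi, xs, and ys follow the transcript; chars is an illustrative helper.

```python
import string
import torch

# Assumption: a one-word stand-in for the names.txt dataset.
words = ["emma"]

# Assumed mapping: '.' is index 0, 'a'..'z' are indices 1..26.
chars = string.ascii_lowercase
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0

xs, ys = [], []
for w in words[:1]:
    chs = ["."] + list(w) + ["."]      # wrap the word in start/end markers
    for ch1, ch2 in zip(chs, chs[1:]):  # consecutive character pairs (bigrams)
        print(ch1, ch2)
        xs.append(stoi[ch1])            # input: index of the first character
        ys.append(stoi[ch2])            # target: index of the next character

xs = torch.tensor(xs)
ys = torch.tensor(ys)
print(xs.tolist())  # [0, 5, 13, 13, 1]
print(ys.tolist())  # [5, 13, 13, 1, 0]
```

Reading the two lists side by side gives the five examples: input 0 with label 5, input 5 with label 13, and so on.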
1:07:54
I wanted to add a tangent, a note of caution, to be careful with a lot of the APIs of some of these frameworks. You saw me silently use torch.tensor with a lowercase t, and the output looked right. But you should be aware that there are actually two ways of constructing a tensor: there's torch.tensor, lowercase, and there's also a torch.Tensor class, with a capital T, which you can also construct. So you can actually call both. You can also do torch.Tensor, and you get xs and ys as well. So that's not confusing at all.
1:08:27
There are threads on what the difference between these two is, and unfortunately the docs are just not clear on the difference. When you look at the docs of lowercase tensor, it says "constructs a tensor with no autograd history by copying data", and it just doesn't make sense. So the actual difference, as far as I can tell, is explained eventually in this random thread that you can google. Really it comes down to, I believe, this: torch.tensor infers the dtype, the data type, automatically, while torch.Tensor just returns a float tensor. I would recommend sticking to lowercase torch.tensor.
1:09:07
So indeed, we see that when I construct this with a capital T, the data type of xs here is float32. But with lowercase torch.tensor, you see how xs.dtype is now integer.
1:09:26
So it's advised that you use the lowercase t, and you can read more about it if you like in some of these threads. Basically, I'm pointing out some of these things because I want to caution you, and I want you to get used to reading a lot of documentation and reading through a lot of Q&As and threads like this. Some of this stuff is unfortunately not easy and not very well documented, and you have to be careful out there. What we want here is integers, because that's what makes sense, and so lowercase tensor is what we are using.
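A quick way to see the difference described in this tangent is to build the same data both ways and compare dtypes. This is a small sketch that assumes PyTorch's default floating-point dtype has not been changed; the variable names data, a, and b are illustrative.

```python
import torch

data = [0, 5, 13, 13, 1]

a = torch.tensor(data)  # lowercase: dtype is inferred from the data
b = torch.Tensor(data)  # capital T: builds a tensor of the default float type

print(a.dtype)  # torch.int64
print(b.dtype)  # torch.float32
```

Since the inputs here are character indices, the integer result from torch.tensor is the one that matches the intent.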
1:10:01
Okay, now we want to think through how we're going to feed these examples into a neural network. It's not quite as straightforward as plugging them in, because these examples right now are integers: there's a 0, 5, or 13, which gives us the index of the character, and you can't just plug an integer index into a neural net. These neural nets are made up of neurons, and these neurons have weights. As you saw in micrograd, these weights act multiplicatively on the inputs: w x + b, there are tanh's, and so on. So it doesn't really make sense to make an input neuron take on integer values that you feed in and then multiply with weights.
1:10:41
So instead, a common way of encoding integers is what's called one-hot encoding. In one-hot encoding, we take an integer like 13 and we create a vector that is all zeros except for the 13th dimension, which we turn to a one, and then that vector can feed into a neural net.
1:11:01
Now conveniently, PyTorch actually has a function called one_hot inside torch.nn.functional. It takes a tensor made up of integers (long is an integer type), and it also takes a number of classes, num_classes, which is how large you want your vector to be.
1:11:27
So here let's import torch.nn.functional as F; this is a common way of importing it. Then let's do F.one_hot, and we feed in the integers that we want to encode, so we can actually feed in the entire array of xs. And we can tell it that num_classes is 27, so it doesn't have to try to guess it. Otherwise it would infer the size from the largest value it sees, which is 13 here, and give us an incorrect result. So this is the one-hot. Let's call this xenc, for x encoded.
1:12:02
And then we see that xenc.shape is 5 by 27. We can also visualize it with plt.imshow(xenc) to make it a little bit more clear, because this is a little messy. So we see that we've encoded all five examples into vectors. We have five examples, so we have five rows, and each row here is now an example into a neural net. We see that the appropriate bit is turned on as a one and everything else is zero. Here, for example, the zeroth bit is turned on, the fifth bit is turned on, the 13th bit is turned on for both of these examples, and then the first bit here is turned on.
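The encoding step can be sketched like this. The sketch hard-codes the five indices for emma rather than rebuilding them, and the plt.imshow visualization is left out so the block runs without a display. The last line shows what happens when num_classes is left out, which is the guessing behavior mentioned above.

```python
import torch
import torch.nn.functional as F

# The five input indices for "emma": '.', 'e', 'm', 'm', 'a'
xs = torch.tensor([0, 5, 13, 13, 1])

xenc = F.one_hot(xs, num_classes=27)
print(xenc.shape)                   # torch.Size([5, 27])
print(xenc.argmax(dim=1).tolist())  # [0, 5, 13, 13, 1]  (position of the 1 in each row)
print(xenc.sum(dim=1).tolist())     # [1, 1, 1, 1, 1]    (exactly one bit on per row)

# Without num_classes, the width is inferred from the largest index present.
print(F.one_hot(xs).shape)          # torch.Size([5, 14])
```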
1:12:44
So that's how we can encode integers into vectors, and then these vectors can feed into neural nets. One more issue to be careful with here, by the way: let's look at the data type of the encoding. We always want to be careful with data types. What would you expect xenc's data type to be? When we're plugging numbers into neural nets, we don't want them to be integers; we want them to be floating-point numbers that can take on various values. But the dtype here is actually 64-bit integer.
1:13:14
And the reason for that, I suspect, is that one_hot received a 64-bit integer here and it returned the same data type. When you look at the signature of one_hot, it doesn't even take a dtype, a desired data type for the output tensor. In a lot of functions in torch we'd be able to do something like dtype=torch.float32, which is what we want, but one_hot does not support that. So instead we're going to want to cast this to float, like this, so that everything looks the same but the dtype is float32, and floats can feed into neural nets.
1:13:52
So now let's construct our first neuron.
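Before moving on to the neuron, the cast to float described a moment ago can be sketched as follows. The name xenc_int is illustrative and only there to show the dtype before and after the cast.

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])

# one_hot returns the same integer dtype it was given
xenc_int = F.one_hot(xs, num_classes=27)
print(xenc_int.dtype)  # torch.int64

# Cast to float so the encoding can be multiplied with float weights
xenc = xenc_int.float()
print(xenc.dtype)      # torch.float32

# Same zeros and ones, only the dtype changed
print(torch.equal(xenc, xenc_int.to(torch.float32)))  # True
```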
1:13:56
This neuron will look at these input vectors. As you remember from micrograd, these neurons basically perform a very simple function, w x + b, where w x is a dot product, right? So we can achieve the same thing here. Let's first define the weights of this neuron: basically, what are the initial weights at initialization for this neuron?
1:14:18
Let's initialize them with torch.randn. torch.randn fills a tensor with random numbers drawn from a normal distribution, and a normal distribution has a probability density function like this. So most of the numbers drawn from this distribution will be around 0, but some of them will be as high as almost three and so on, and very few numbers will be above three in magnitude.
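To get a feel for that claim, one can draw a large sample and look at a few summary numbers. This is only an illustrative check: the sample size and the seed are arbitrary choices, not something from the video.

```python
import torch

g = torch.Generator().manual_seed(0)  # arbitrary seed, for repeatability only
samples = torch.randn(100_000, generator=g)

mean = samples.mean().item()
std = samples.std().item()
frac_above_3 = (samples.abs() > 3).float().mean().item()

print(mean)          # close to 0
print(std)           # close to 1
print(frac_above_3)  # a small fraction; the theoretical value is about 0.27%
```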
1:14:46
It takes a size as an input here, and I'm going to use a size of 27 by 1. Then let's visualize W: W is a column vector of 27 numbers, and these weights are then multiplied by the inputs.
1:15:08
So now, to perform this multiplication, we can take xenc and multiply it with W. This @ is the matrix multiplication operator in PyTorch, and the output of this operation is 5 by 1. The reason it's 5 by 1 is the following: we took xenc, which is 5 by 27, and we multiplied it by 27 by 1. In matrix multiplication, you see that the output will become 5 by 1, because these 27 will multiply and add.
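The shape bookkeeping above can be checked with a short sketch. The seeded generator is an assumption added here so the run is repeatable; the last check is an observation that follows from the rows being one-hot, not a step taken in the transcript at this point.

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])
xenc = F.one_hot(xs, num_classes=27).float()  # shape (5, 27)

g = torch.Generator().manual_seed(0)          # arbitrary seed (assumption)
W = torch.randn((27, 1), generator=g)         # one neuron with 27 weights

out = xenc @ W                                # (5, 27) @ (27, 1) -> (5, 1)
print(out.shape)  # torch.Size([5, 1])

# Each row of xenc has a single 1, so multiply-and-add just selects one weight.
print(torch.allclose(out[:, 0], W[xs, 0]))  # True
```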
1:15:44
So basically what we're seeing here, out of this operation, is the five activations of this neuron on these five inputs, and we've evaluated all of them in parallel. We didn't feed in just a single input to the single neuron; we fed all five inputs into the same neuron simultaneously, and in parallel PyTorch has evaluated the w x + b. Here it's just w x; there's no bias. It has evaluated W times x for all of them independently.
1:16:21
Now, instead of a single neuron, though, I would like to have 27 neurons, and I'll show you in a second why I want 27 neurons. So instead of having just a 1 here, which indicates the presence of one single neuron, we can use 27. Then, when W is 27 by 27, this will evaluate in parallel all 27 neurons on all 5 inputs, giving us a much, much bigger result. So now what we've done is 5 by 27 multiplied by 27 by 27, and the output of this is now 5 by 27. We can see that the shape of this is 5 by 27.
1:17:03
So what is every element here telling us? It's telling us, for every one of the 27 neurons that we created, what the firing rate of that neuron is on every one of those five examples. So the element (3, 13), for example, is giving us the firing rate of the 13th neuron looking at the third input. The way this was achieved is by a dot product between the third input and the 13th column of this W matrix here.
1:17:44
of this w matrix here okay
1:17:45
okay
1:17:45
okay so
1:17:46
so
1:17:46
so using matrix multiplication we can very
1:17:48
using matrix multiplication we can very
1:17:48
using matrix multiplication we can very efficiently evaluate
1:17:50
efficiently evaluate
1:17:50
efficiently evaluate the dot product between lots of input
1:17:52
the dot product between lots of input
1:17:52
the dot product between lots of input examples in a batch
1:17:54
examples in a batch
1:17:54
examples in a batch and lots of neurons where all those
1:17:57
and lots of neurons where all those
1:17:57
and lots of neurons where all those neurons have weights in the columns of
1:17:59
neurons have weights in the columns of
1:17:59
neurons have weights in the columns of those w's
1:18:01
those w's
1:18:01
those w's and in matrix multiplication we're just
1:18:02
and in matrix multiplication we're just
1:18:02
and in matrix multiplication we're just doing those dot products and
1:18:04
doing those dot products and
1:18:04
doing those dot products and in parallel just to show you that this
1:18:06
in parallel just to show you that this
1:18:06
in parallel just to show you that this is the case we can take x and we can
1:18:08
is the case we can take x and we can
1:18:08
is the case we can take x and we can take the third
1:18:10
take the third
1:18:10
take the third row
1:18:12
row
1:18:12
row and we can take the w and take its 13th
1:18:14
and we can take the w and take its 13th
1:18:14
and we can take the w and take its 13th column
1:18:17
and then we can do
1:18:18
and then we can do
1:18:18
and then we can do xenc at 3
1:18:21
xenc at 3
1:18:21
xenc at 3 elementwise multiply with w at 13.
1:18:26
elementwise multiply with w at 13.
1:18:26
elementwise multiply with w at 13. and sum that up that's wx plus b
1:18:29
and sum that up that's wx plus b
1:18:29
and sum that up that's wx plus b well there's no plus b it's just wx dot
1:18:31
well there's no plus b it's just wx dot
1:18:31
well there's no plus b it's just wx dot product
1:18:32
product
1:18:32
product and that's
1:18:34
and that's
1:18:34
and that's this number
1:18:35
this number
1:18:35
this number so you see that this is just being done
1:18:36
so you see that this is just being done
1:18:36
so you see that this is just being done efficiently by the matrix multiplication
1:18:39
efficiently by the matrix multiplication
1:18:39
efficiently by the matrix multiplication operation
1:18:40
operation
1:18:40
operation for all the input examples and for all
1:18:42
for all the input examples and for all
1:18:42
for all the input examples and for all the output neurons of this first layer
1:18:45
the output neurons of this first layer
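The check described above (one entry of the matrix product equals one dot product) can be reproduced as follows. This is a plain-Python sketch under assumed inputs, not the notebook code from the video; the seed and the indices are hypothetical.

```python
import math
import random

rng = random.Random(0)  # arbitrary seed (assumption)

# Stand-ins for the lecture's tensors: 5 one-hot rows and a 27 x 27 weight matrix.
indices = [0, 5, 13, 13, 1]  # hypothetical character indices
xenc = [[1.0 if j == i else 0.0 for j in range(27)] for i in indices]
W = [[rng.gauss(0.0, 1.0) for _ in range(27)] for _ in range(27)]

# Full matrix product: out[i][j] = sum over t of xenc[i][t] * W[t][j]
out = [[sum(xenc[i][t] * W[t][j] for t in range(27)) for j in range(27)]
       for i in range(5)]

# The same number by hand: row 3 of the input times column 13 of W, summed.
# In PyTorch this would presumably read (xenc[3] * W[:, 13]).sum()
row3 = xenc[3]
col13 = [W[t][13] for t in range(27)]
manual = sum(x * w for x, w in zip(row3, col13))

print(math.isclose(out[3][13], manual))
```

Every one of the 5 x 27 output entries is such a dot product; the matrix multiply just evaluates them all at once.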
1:18:46
the output neurons of this first layer okay so we fed our 27-dimensional inputs
1:18:49
okay so we fed our 27-dimensional inputs
1:18:49
okay so we fed our 27-dimensional inputs into a first layer of a neural net that
1:18:51
into a first layer of a neural net that
1:18:51
into a first layer of a neural net that has 27 neurons right so we have 27
1:18:54
has 27 neurons right so we have 27
1:18:54
has 27 neurons right so we have 27 inputs and now we have 27 neurons these
1:18:57
inputs and now we have 27 neurons these
1:18:57
inputs and now we have 27 neurons these neurons perform w times x they don't
1:18:59
neurons perform w times x they don't
1:19:00
neurons perform w times x they don't have a bias and they don't have a
1:19:01
have a bias and they don't have a
1:19:01
have a bias and they don't have a non-linearity like tanh we're going to
1:19:03
non-linearity like tanh we're going to
1:19:03
non-linearity like tanh we're going to leave them to be a linear layer
1:19:06
leave them to be a linear layer
1:19:06
leave them to be a linear layer in addition to that we're not going to
1:19:07
in addition to that we're not going to
1:19:07
in addition to that we're not going to have any other layers this is going to
1:19:09
have any other layers this is going to
1:19:09
have any other layers this is going to be it it's just going to be
1:19:11
be it it's just going to be
1:19:11
be it it's just going to be the dumbest smallest simplest neural net
1:19:13
the dumbest smallest simplest neural net
1:19:13
the dumbest smallest simplest neural net which is just a single linear layer
1:19:16
which is just a single linear layer
1:19:16
which is just a single linear layer and now i'd like to explain what i want
1:19:18
and now i'd like to explain what i want
1:19:18
and now i'd like to explain what i want those 27 outputs to be
1:19:21
those 27 outputs to be
1:19:21
those 27 outputs to be intuitively what we're trying to produce
1:19:22
intuitively what we're trying to produce
1:19:22
intuitively what we're trying to produce here for every single input example is
1:19:24
here for every single input example is
1:19:24
here for every single input example is we're trying to produce some kind of a
1:19:25
we're trying to produce some kind of a
1:19:26
we're trying to produce some kind of a probability distribution for the next
1:19:27
probability distribution for the next
1:19:27
probability distribution for the next character in a sequence
1:19:29
character in a sequence
1:19:29
character in a sequence and there's 27 of them
1:19:31
and there's 27 of them
1:19:31
and there's 27 of them but we have to come up with like precise
1:19:33
but we have to come up with like precise
1:19:33
but we have to come up with like precise semantics for exactly how we're going to
1:19:34
semantics for exactly how we're going to
1:19:34
semantics for exactly how we're going to interpret these 27 numbers that these
1:19:37
interpret these 27 numbers that these
1:19:37
interpret these 27 numbers that these neurons take on
1:19:39
neurons take on
1:19:39
neurons take on now intuitively
1:19:41
now intuitively
1:19:41
now intuitively you see here that these numbers are
1:19:42
you see here that these numbers are
1:19:42
you see here that these numbers are negative and some of them are positive
1:19:43
negative and some of them are positive
1:19:44
negative and some of them are positive etc
1:19:45
etc
1:19:45
etc and that's because these are coming out
1:19:46
and that's because these are coming out
1:19:46
and that's because these are coming out of a neural net layer initialized with
1:19:48
of a neural net layer initialized with
1:19:48
of a neural net layer initialized with these
1:19:51
normal distribution
1:19:52
normal distribution
1:19:52
normal distribution parameters
1:19:54
parameters
1:19:54
parameters but what we want is we want something
1:19:55
but what we want is we want something
1:19:55
but what we want is we want something like we had here
1:19:57
like we had here
1:19:57
like we had here like each row here
1:19:58
like each row here
1:19:58
like each row here told us the counts and then we
1:20:01
told us the counts and then we
1:20:01
told us the counts and then we normalized the counts to get
1:20:02
normalized the counts to get
1:20:02
normalized the counts to get probabilities and we want something
1:20:04
probabilities and we want something
1:20:04
probabilities and we want something similar to come out of the neural net
1:20:06
similar to come out of the neural net
1:20:06
similar to come out of the neural net but what we just have right now is just
1:20:07
but what we just have right now is just
1:20:07
but what we just have right now is just some negative and positive numbers
1:20:10
some negative and positive numbers
1:20:10
some negative and positive numbers now we want those numbers to somehow
1:20:12
now we want those numbers to somehow
1:20:12
now we want those numbers to somehow represent the probabilities for the next
1:20:13
represent the probabilities for the next
1:20:14
represent the probabilities for the next character
1:20:15
character
1:20:15
character but you see that probabilities they they
1:20:17
but you see that probabilities they they
1:20:17
but you see that probabilities they they have a special structure they um
1:20:19
have a special structure they um
1:20:19
have a special structure they um they're positive numbers and they sum to
1:20:21
they're positive numbers and they sum to
1:20:21
they're positive numbers and they sum to one
1:20:22
one
1:20:22
one and so that doesn't just come out of a
1:20:24
and so that doesn't just come out of a
1:20:24
and so that doesn't just come out of a neural net
1:20:25
neural net
1:20:25
neural net and then they can't be counts
1:20:27
and then they can't be counts
1:20:27
and then they can't be counts because these counts are positive and
1:20:31
because these counts are positive and
1:20:31
because these counts are positive and counts are integers
1:20:32
counts are integers
1:20:32
counts are integers so counts are also not really a good
1:20:34
so counts are also not really a good
1:20:34
so counts are also not really a good thing to output from a neural net
1:20:36
thing to output from a neural net
1:20:36
thing to output from a neural net so instead what the neural net is going
1:20:38
so instead what the neural net is going
1:20:38
so instead what the neural net is going to output and how we are going to
1:20:39
to output and how we are going to
1:20:39
to output and how we are going to interpret the um
1:20:42
interpret the um
1:20:42
interpret the um the 27 numbers is that these 27 numbers
1:20:45
the 27 numbers is that these 27 numbers
1:20:45
the 27 numbers is that these 27 numbers are giving us log counts
1:20:48
are giving us log counts
1:20:48
are giving us log counts basically
1:20:49
basically
1:20:49
basically um
1:20:50
um
1:20:50
um so instead of giving us counts directly
1:20:52
so instead of giving us counts directly
1:20:52
so instead of giving us counts directly like in this table they're giving us log
1:20:54
like in this table they're giving us log
1:20:54
like in this table they're giving us log counts
1:20:55
counts
1:20:56
counts and to get the counts we're going to
1:20:57
and to get the counts we're going to
1:20:57
and to get the counts we're going to take the log counts and we're going to
1:20:59
take the log counts and we're going to
1:20:59
take the log counts and we're going to exponentiate them
1:21:01
exponentiate them
1:21:01
exponentiate them now
1:21:02
now
1:21:02
now exponentiation
1:21:04
exponentiation
1:21:04
exponentiation takes the following form
1:21:07
takes the following form
1:21:07
takes the following form it takes numbers
1:21:08
it takes numbers
1:21:08
it takes numbers that are negative or they are positive
1:21:10
that are negative or they are positive
1:21:10
that are negative or they are positive it takes the entire real line
1:21:12
it takes the entire real line
1:21:12
it takes the entire real line and then if you plug in negative numbers
1:21:14
and then if you plug in negative numbers
1:21:14
and then if you plug in negative numbers you're going to get e to the x
1:21:17
you're going to get e to the x
1:21:17
you're going to get e to the x which is uh always below one
1:21:20
which is uh always below one
1:21:20
which is uh always below one so you're getting numbers lower than one
1:21:23
so you're getting numbers lower than one
1:21:23
so you're getting numbers lower than one and if you plug in numbers greater than
1:21:25
and if you plug in numbers greater than
1:21:25
and if you plug in numbers greater than zero you're getting numbers greater than
1:21:27
zero you're getting numbers greater than
1:21:27
zero you're getting numbers greater than one all the way growing to the infinity
1:21:30
one all the way growing to the infinity
1:21:30
one all the way growing to the infinity and this here goes to zero
1:21:33
and this here goes to zero
1:21:33
and this here goes to zero so basically we're going to
1:21:35
so basically we're going to
1:21:35
so basically we're going to take these numbers
1:21:37
take these numbers
1:21:37
take these numbers here
1:21:40
and
1:21:43
and
1:21:43
and instead of them being positive and
1:21:44
instead of them being positive and
1:21:44
instead of them being positive and negative and all over the place we're
1:21:46
negative and all over the place we're
1:21:46
negative and all over the place we're going to interpret them as log counts
1:21:48
going to interpret them as log counts
1:21:48
going to interpret them as log counts and then we're going to element wise
1:21:50
and then we're going to element wise
1:21:50
and then we're going to element wise exponentiate these numbers
1:21:52
exponentiate these numbers
1:21:52
exponentiate these numbers exponentiating them now gives us
1:21:54
exponentiating them now gives us
1:21:54
exponentiating them now gives us something like this
1:21:56
something like this
1:21:56
something like this and you see that these numbers now
1:21:57
and you see that these numbers now
1:21:57
and you see that these numbers now because they went through an exponent
1:21:59
because they went through an exponent
1:21:59
because they went through an exponent all the negative numbers turned into
1:22:00
all the negative numbers turned into
1:22:00
all the negative numbers turned into numbers below 1 like 0.338 and all the
1:22:04
numbers below 1 like 0.338 and all the
1:22:04
numbers below 1 like 0.338 and all the positive numbers originally turned into
1:22:07
positive numbers originally turned into
1:22:07
positive numbers originally turned into even more positive numbers sort of
1:22:08
even more positive numbers sort of
1:22:08
even more positive numbers sort of greater than one
1:22:10
greater than one
1:22:10
greater than one so like for example
1:22:12
so like for example
1:22:12
so like for example seven
1:22:14
seven
1:22:14
seven is some positive number over here
1:22:18
is some positive number over here
1:22:18
is some positive number over here that is greater than zero
1:22:21
that is greater than zero
1:22:21
that is greater than zero but exponentiated outputs here
1:22:24
but exponentiated outputs here
1:22:24
but exponentiated outputs here basically give us something that we can
1:22:26
basically give us something that we can
1:22:26
basically give us something that we can use and interpret as the equivalent of
1:22:29
use and interpret as the equivalent of
1:22:29
use and interpret as the equivalent of counts originally so you see these
1:22:31
counts originally so you see these
1:22:31
counts originally so you see these counts here 112 7 51 1 etc
1:22:36
counts here 112 7 51 1 etc
1:22:36
counts here 112 7 51 1 etc the neural net is kind of now predicting
1:22:38
the neural net is kind of now predicting
1:22:38
the neural net is kind of now predicting uh
1:22:40
uh
1:22:40
uh counts
1:22:41
counts
1:22:41
counts and these counts are positive numbers
1:22:43
and these counts are positive numbers
1:22:43
and these counts are positive numbers they can never be below zero so that
1:22:45
they can never be below zero so that
1:22:45
they can never be below zero so that makes sense
1:22:46
makes sense
1:22:46
makes sense and uh they can now take on various
1:22:48
and uh they can now take on various
1:22:48
and uh they can now take on various values
1:22:49
values
1:22:49
values depending on the settings of w
1:22:54
depending on the settings of w
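A small illustration of the exponentiation step described above: any real number, negative or positive, maps to a strictly positive value that can be read as a count. The logit values are made up for the example.

```python
import math

# Made-up "log counts" (logits): some negative, some positive.
logits = [-2.0, -0.5, 0.0, 0.5, 2.0]

# Elementwise exponentiation; in PyTorch this would be logits.exp().
counts = [math.exp(z) for z in logits]

for z, c in zip(logits, counts):
    print(f"{z:+.1f} -> {c:.3f}")
```

Negative inputs land between 0 and 1, zero maps to exactly 1, and positive inputs land above 1, so the outputs are always valid (fractional) counts.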
1:22:54
depending on the settings of w so let me break this down
1:22:56
so let me break this down
1:22:56
so let me break this down we're going to interpret these to be the
1:22:58
we're going to interpret these to be the
1:22:58
we're going to interpret these to be the log counts
1:23:01
another word for this that is often
1:23:02
another word for this that is often
1:23:02
another word for this that is often used is so-called logits
1:23:05
used is so-called logits
1:23:05
used is so-called logits these are logits log counts
1:23:08
these are logits log counts
1:23:08
these are logits log counts then these will be sort of the counts
1:23:11
then these will be sort of the counts
1:23:11
then these will be sort of the counts logits exponentiated
1:23:13
logits exponentiated
1:23:13
logits exponentiated and this is equivalent to the n matrix
1:23:16
and this is equivalent to the n matrix
1:23:16
and this is equivalent to the n matrix sort of the n
1:23:17
sort of the n
1:23:18
sort of the n array that we used previously remember
1:23:20
array that we used previously remember
1:23:20
array that we used previously remember this was the n
1:23:21
this was the n
1:23:21
this was the n this is the the array of counts
1:23:24
this is the the array of counts
1:23:24
this is the the array of counts and each row here are the counts for the
1:23:27
and each row here are the counts for the
1:23:27
and each row here are the counts for the for the um
1:23:28
for the um
1:23:28
for the um next character sort of
1:23:32
so those are the counts and now the
1:23:34
so those are the counts and now the
1:23:34
so those are the counts and now the probabilities are just the counts um
1:23:37
probabilities are just the counts um
1:23:37
probabilities are just the counts um normalized
1:23:39
normalized
1:23:39
normalized and so um
1:23:41
and so um
1:23:41
and so um i'm not going to find the same but
1:23:43
i'm not going to find the same but
1:23:43
i'm not going to find the same but basically i'm not going to scroll all
1:23:44
basically i'm not going to scroll all
1:23:44
basically i'm not going to scroll all over the place
1:23:45
over the place
1:23:46
over the place we've already done this we want to
1:23:48
we've already done this we want to
1:23:48
we've already done this we want to take counts.sum
1:23:50
take counts.sum
1:23:50
take counts.sum along the first dimension and we want to
1:23:52
along the first dimension and we want to
1:23:52
along the first dimension and we want to set keepdims to true
1:23:54
set keepdims to true
1:23:54
set keepdims to true we've gone over this and this is how we
1:23:56
we've gone over this and this is how we
1:23:56
we've gone over this and this is how we normalize the rows of our counts matrix
1:23:59
normalize the rows of our counts matrix
1:23:59
normalize the rows of our counts matrix to get our probabilities
1:24:03
to get our probabilities
1:24:03
to get our probabilities probs
1:24:04
probs
1:24:04
probs so now these are the probabilities
1:24:07
so now these are the probabilities
1:24:07
so now these are the probabilities and
1:24:08
and
1:24:08
and these are the counts that we have
1:24:10
these are the counts that we have
1:24:10
these are the counts that we have currently and now when i show the
1:24:11
currently and now when i show the
1:24:11
currently and now when i show the probabilities
1:24:13
probabilities
1:24:13
probabilities you see that um
1:24:15
you see that um
1:24:15
you see that um every row here
1:24:17
every row here
1:24:17
every row here of course
1:24:19
of course
1:24:19
of course will sum to 1
1:24:21
will sum to 1
1:24:21
will sum to 1 because they're normalized
1:24:23
because they're normalized
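The row normalization just described can be sketched like this. It is a plain-Python stand-in for the tensor expression used in the video; the 2 x 3 logits are made-up values chosen to keep the output readable.

```python
import math

# Made-up logits for two examples and three classes.
logits = [[0.5, -1.2, 2.0],
          [-0.3, 0.0, 0.3]]

# Exponentiate to get positive "counts".
counts = [[math.exp(z) for z in row] for row in logits]

# Divide each row by its own sum.
# In PyTorch: probs = counts / counts.sum(1, keepdim=True)
probs = [[c / sum(row) for c in row] for row in counts]

for row in probs:
    print([round(p, 3) for p in row], "sum =", round(sum(row), 6))
```

Keeping the summed dimension (a column of row totals) is what lets the division broadcast across each row rather than across columns.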
1:24:23
because they're normalized and the shape of this
1:24:25
and the shape of this
1:24:25
and the shape of this is 5 by 27
1:24:27
is 5 by 27
1:24:27
is 5 by 27 and so really what we've achieved is for
1:24:29
and so really what we've achieved is for
1:24:29
and so really what we've achieved is for every one of our five examples
1:24:31
every one of our five examples
1:24:31
every one of our five examples we now have a row that came out of a
1:24:33
we now have a row that came out of a
1:24:33
we now have a row that came out of a neural net
1:24:35
neural net
1:24:35
neural net and because of the transformations here
1:24:37
and because of the transformations here
1:24:37
and because of the transformations here we made sure that this output of this
1:24:39
we made sure that this output of this
1:24:39
we made sure that this output of this neural net now are probabilities or we
1:24:41
neural net now are probabilities or we
1:24:41
neural net now are probabilities or we can interpret to be probabilities
1:24:43
can interpret to be probabilities
1:24:44
can interpret to be probabilities so
1:24:45
so
1:24:45
so our wx here gave us logits
1:24:47
our wx here gave us logits
1:24:48
our wx here gave us logits and then we interpret those to be log
1:24:49
and then we interpret those to be log
1:24:49
and then we interpret those to be log counts
1:24:50
counts
1:24:50
counts we exponentiate to get something that
1:24:52
we exponentiate to get something that
1:24:52
we exponentiate to get something that looks like counts
1:24:53
looks like counts
1:24:53
looks like counts and then we normalize those counts to
1:24:55
and then we normalize those counts to
1:24:55
and then we normalize those counts to get a probability distribution
1:24:57
get a probability distribution
1:24:57
get a probability distribution and all of these are differentiable
1:24:59
and all of these are differentiable
1:24:59
and all of these are differentiable operations
1:25:00
operations
1:25:00
operations so what we've done now is we're taking
1:25:02
so what we've done now is we're taking
1:25:02
so what we've done now is we're taking inputs we have differentiable operations
1:25:04
inputs we have differentiable operations
1:25:04
inputs we have differentiable operations that we can back propagate through
1:25:06
that we can back propagate through
1:25:06
that we can back propagate through and we're getting out probability
1:25:08
and we're getting out probability
1:25:08
and we're getting out probability distributions
1:25:09
distributions
1:25:09
distributions so
1:25:11
so
1:25:11
so for example for the zeroth example that
1:25:13
for example for the zeroth example that
1:25:13
for example for the zeroth example that fed in
1:25:15
fed in
1:25:15
fed in right which was um
1:25:17
right which was um
1:25:17
right which was um the zeroth example here was a one-hot
1:25:18
the zeroth example here was a one-hot
1:25:18
the zeroth example here was a one-hot vector of zero
1:25:20
vector of zero
1:25:20
vector of zero and um
1:25:22
and um
1:25:22
and um it basically corresponded to feeding in
1:25:26
it basically corresponded to feeding in
1:25:26
it basically corresponded to feeding in this example here so we're feeding in a
1:25:28
this example here so we're feeding in a
1:25:28
this example here so we're feeding in a dot into a neural net and the way we fed
1:25:31
dot into a neural net and the way we fed
1:25:31
dot into a neural net and the way we fed the dot into a neural net is that we
1:25:32
the dot into a neural net is that we
1:25:32
the dot into a neural net is that we first got its index
1:25:34
first got its index
1:25:34
first got its index then we one hot encoded it
1:25:36
then we one hot encoded it
1:25:36
then we one hot encoded it then it went into the neural net and out
1:25:39
then it went into the neural net and out
1:25:39
then it went into the neural net and out came
1:25:40
came
1:25:40
came this distribution of probabilities
1:25:43
this distribution of probabilities
1:25:43
this distribution of probabilities and its shape
1:25:46
and its shape
1:25:46
and its shape is 27 there's 27 numbers and we're going
1:25:49
is 27 there's 27 numbers and we're going
1:25:49
is 27 there's 27 numbers and we're going to interpret this as the neural net's
1:25:51
to interpret this as the neural net's
1:25:51
to interpret this as the neural net's assignment for how likely every one of
1:25:54
assignment for how likely every one of
1:25:54
assignment for how likely every one of these characters um
1:25:56
these characters um
1:25:56
these characters um the 27 characters are to come next
1:25:59
the 27 characters are to come next
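One way to see what happens to the zeroth example: because the input is one-hot at index 0, the matrix multiply simply picks out row 0 of W, and the probabilities are that row exponentiated and normalized. The sketch below uses plain Python with an assumed seed and random weights, so the actual numbers will not match the video.

```python
import math
import random

rng = random.Random(0)  # arbitrary seed (assumption)
W = [[rng.gauss(0.0, 1.0) for _ in range(27)] for _ in range(27)]

# One-hot vector for index 0, which the lecture uses for the '.' token.
x0 = [1.0] + [0.0] * 26

# Forward pass for this single example.
logits = [sum(x0[t] * W[t][j] for t in range(27)) for j in range(27)]
counts = [math.exp(z) for z in logits]
total = sum(counts)
probs = [c / total for c in counts]

print(len(probs), round(sum(probs), 6))
# The logits are just row 0 of W.
print(all(math.isclose(a, b) for a, b in zip(logits, W[0])))
```

Tuning W therefore amounts to tuning, for each input character, the row of log counts that decides the next-character distribution.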
1:25:59
the 27 characters are to come next and as we tune the weights w
1:26:02
and as we tune the weights w
1:26:02
and as we tune the weights w we're going to be of course getting
1:26:03
we're going to be of course getting
1:26:03
we're going to be of course getting different probabilities out for any
1:26:05
different probabilities out for any
1:26:05
different probabilities out for any character that you input
1:26:07
character that you input
1:26:07
character that you input and so now the question is just can we
1:26:08
and so now the question is just can we
1:26:08
and so now the question is just can we optimize and find a good w
1:26:10
optimize and find a good w
1:26:10
optimize and find a good w such that the probabilities coming out
1:26:12
such that the probabilities coming out
1:26:12
such that the probabilities coming out are pretty good and the way we measure
1:26:15
are pretty good and the way we measure
1:26:15
are pretty good and the way we measure pretty good is by the loss function okay
1:26:17
pretty good is by the loss function okay
1:26:17
pretty good is by the loss function okay so i organized everything into a single
1:26:18
so i organized everything into a single
1:26:18
so i organized everything into a single summary so that hopefully it's a bit
1:26:20
summary so that hopefully it's a bit
1:26:20
summary so that hopefully it's a bit more clear so it starts here
1:26:22
more clear so it starts here
1:26:22
more clear so it starts here with an input data set
1:26:24
with an input data set
1:26:24
with an input data set we have some inputs to the neural net
1:26:26
we have some inputs to the neural net
1:26:26
we have some inputs to the neural net and we have some labels for the correct
1:26:28
and we have some labels for the correct
1:26:28
and we have some labels for the correct next character in a sequence these are
1:26:30
next character in a sequence these are
1:26:30
next character in a sequence these are integers
1:26:32
integers
1:26:32
integers here i'm using uh torch generators now
1:26:35
here i'm using uh torch generators now
1:26:35
here i'm using uh torch generators now so that you see the same numbers that i
1:26:37
so that you see the same numbers that i
1:26:37
so that you see the same numbers that i see
1:26:38
see
1:26:38
see and i'm generating um
1:26:40
and i'm generating um
1:26:40
and i'm generating um 27 neurons weights
1:26:42
27 neurons weights
1:26:42
27 neurons weights and each neuron here receives 27 inputs
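The point of seeding a generator is reproducibility: the same seed gives the same initial weights on every run. Below is a sketch of that idea using Python's standard `random` module instead of a torch generator; `init_weights` and the seed values are hypothetical names and numbers chosen for the example.

```python
import random


def init_weights(seed, n_in=27, n_out=27):
    """Draw an n_in x n_out matrix of standard-normal weights from a seeded generator."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_out)] for _ in range(n_in)]


# In PyTorch the equivalent would presumably be:
#   g = torch.Generator().manual_seed(seed)
#   W = torch.randn((27, 27), generator=g)
W_a = init_weights(42)
W_b = init_weights(42)  # same seed, same numbers
W_c = init_weights(7)   # different seed, different numbers

print(W_a == W_b, W_a == W_c)
```

Each column of the matrix holds the 27 incoming weights of one neuron, matching the layout described in the transcript.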
1:26:48
then here we're going to plug in all the
1:26:50
then here we're going to plug in all the
1:26:50
then here we're going to plug in all the input examples x's into a neural net so
1:26:52
input examples x's into a neural net so
1:26:52
input examples x's into a neural net so here this is a forward pass
1:26:55
here this is a forward pass
1:26:55
here this is a forward pass first we have to encode all of the
1:26:57
first we have to encode all of the
1:26:57
first we have to encode all of the inputs into one hot representations
1:27:00
inputs into one hot representations
1:27:00
inputs into one hot representations so we have 27 classes we pass in these
1:27:02
so we have 27 classes we pass in these
1:27:02
so we have 27 classes we pass in these integers and
1:27:04
integers and
1:27:04
integers and xenc becomes an array that is 5 by 27
1:27:09
xenc becomes an array that is 5 by 27
1:27:09
xenc becomes an array that is 5 by 27 zeros except for a few ones
1:27:12
zeros except for a few ones
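A sketch of the encoding step: map characters to integer indices, then turn each index into a 27-wide one-hot row. The character ordering ('.' at index 0 followed by a to z) and the `one_hot` helper are assumptions for this example; the video uses a library one-hot function on a tensor of integers.

```python
def one_hot(indices, num_classes):
    """Return one row per index, all zeros except a 1.0 at that index."""
    return [[1.0 if j == i else 0.0 for j in range(num_classes)]
            for i in indices]


# Assumed vocabulary: '.' marks start/end, then the 26 lowercase letters.
chars = "." + "abcdefghijklmnopqrstuvwxyz"
stoi = {ch: i for i, ch in enumerate(chars)}

# Inputs for the five bigrams of "emma": '.', 'e', 'm', 'm', 'a'.
xs = [stoi[ch] for ch in ".emma"]
xenc = one_hot(xs, 27)

print(xs)
print(len(xenc), len(xenc[0]))
```

In PyTorch the one-hot output is an integer tensor, so it typically needs a cast to float before it can be multiplied with floating-point weights.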
1:27:12
zeros except for a few ones we then multiply this in the first layer
1:27:14
we then multiply this in the first layer
1:27:14
we then multiply this in the first layer of a neural net to get logits
1:27:16
of a neural net to get logits
1:27:16
of a neural net to get logits exponentiate the logits to get fake
1:27:18
exponentiate the logits to get fake
1:27:18
exponentiate the logits to get fake counts sort of
1:27:20
counts sort of
1:27:20
counts sort of and normalize these counts to get
1:27:22
and normalize these counts to get
1:27:22
and normalize these counts to get probabilities
1:27:24
probabilities
1:27:24
probabilities so these last two lines by the
1:27:26
so these last two lines by the
1:27:26
so these last two lines by the way here are called the softmax
1:27:29
way here are called the softmax
1:27:30
way here are called the softmax which i pulled up here softmax is a
1:27:33
which i pulled up here softmax is a
1:27:33
which i pulled up here softmax is a very often used layer in a neural net
1:27:35
very often used layer in a neural net
1:27:35
very often used layer in a neural net that takes these z's which are logits
1:27:38
that takes these z's which are logits
1:27:38
that takes these z's which are logits exponentiates them
1:27:40
exponentiates them
1:27:40
exponentiates them and divides and normalizes it's a way of
1:27:43
and divides and normalizes it's a way of
1:27:43
and divides and normalizes it's a way of taking
1:27:44
taking
1:27:44
taking outputs of a neural net layer and these
1:27:47
outputs of a neural net layer and these
1:27:47
outputs of a neural net layer and these these outputs can be positive or
1:27:48
these outputs can be positive or
1:27:48
these outputs can be positive or negative
1:27:49
negative
1:27:49
negative and it outputs probability distributions
1:27:52
and it outputs probability distributions
1:27:52
and it outputs probability distributions it outputs something that always
1:27:55
it outputs something that always
1:27:55
it outputs something that always sums to one and is made of positive numbers
1:27:56
sums to one and is made of positive numbers
1:27:56
sums to one and is made of positive numbers just like probabilities
1:27:58
just like probabilities
1:27:58
just like probabilities um so it's kind of like a normalization
1:28:00
um so it's kind of like a normalization
1:28:00
um so it's kind of like a normalization function if you want to think of it that
1:28:01
function if you want to think of it that
1:28:01
function if you want to think of it that way and you can put it on top of any
1:28:03
way and you can put it on top of any
1:28:03
way and you can put it on top of any other linear layer inside a neural net
1:28:05
other linear layer inside a neural net
1:28:05
other linear layer inside a neural net and it basically makes a neural net
1:28:07
and it basically makes a neural net
1:28:07
and it basically makes a neural net output probabilities that's very often
1:28:09
output probabilities that's very often
1:28:09
output probabilities that's very often used and we used it as well here
1:28:13
used and we used it as well here
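The two steps named here (exponentiate, then normalize) can be wrapped into a single softmax function. The version below is a sketch in plain Python; subtracting the maximum logit before exponentiating is a common numerical-stability trick that is not part of what the video shows at this point, and it does not change the result.

```python
import math


def softmax(logits):
    """Exponentiate and normalize a list of logits into probabilities."""
    shift = max(logits)  # stability trick (assumption, not from the lecture)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.5, -0.7, 0.0, 3.2]  # made-up values

# The "naive" two-line form described in the transcript.
naive_counts = [math.exp(z) for z in logits]
naive = [c / sum(naive_counts) for c in naive_counts]

probs = softmax(logits)
print([round(p, 4) for p in probs], round(sum(probs), 6))

# Large logits would overflow math.exp without the shift.
big = softmax([1000.0, 1000.0])
print(big)
```

Both forms agree on ordinary inputs; the shifted form simply keeps working when the logits become large.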
1:28:13
used and we used it as well here so this is the forward pass and that's
1:28:14
so this is the forward pass and that's
1:28:14
so this is the forward pass and that's how we made a neural net output
1:28:16
how we made a neural net output
1:28:16
how we made a neural net output probability
1:28:17
probability
1:28:17
probability now
1:28:19
now
1:28:19
now you'll notice that
1:28:20
you'll notice that
1:28:20
you'll notice that um
1:28:22
all of these
1:28:24
all of these
1:28:24
all of these this entire forward pass is made up of
1:28:25
this entire forward pass is made up of
1:28:25
this entire forward pass is made up of differentiable
1:28:27
differentiable
1:28:27
differentiable layers everything here we can back
1:28:29
layers everything here we can back
1:28:29
layers everything here we can back propagate through and we saw some of the
1:28:30
propagate through and we saw some of the
1:28:30
propagate through and we saw some of the back propagation in micrograd
1:28:33
back propagation in micrograd
1:28:33
back propagation in micrograd this is just
1:28:34
this is just
1:28:34
this is just multiplication and addition all that's
1:28:36
multiplication and addition all that's
1:28:36
multiplication and addition all that's happening here is just multiply and then
1:28:38
happening here is just multiply and then
1:28:38
happening here is just multiply and then add and we know how to backpropagate
1:28:39
add and we know how to backpropagate
1:28:39
add and we know how to backpropagate through them
1:28:40
through them
1:28:40
through them exponentiation we know how to
1:28:41
exponentiation we know how to
1:28:42
exponentiation we know how to backpropagate through
1:28:43
backpropagate through
1:28:43
backpropagate through and then here we are summing
1:28:46
and then here we are summing
1:28:46
and then here we are summing and sum is easily backpropagable as
1:28:49
and sum is easily backpropagable as
1:28:49
and sum is easily backpropagable as well
1:28:50
well
1:28:50
well and division as well so everything here
1:28:52
and division as well so everything here
1:28:52
and division as well so everything here is a differentiable operation
1:28:54
is a differentiable operation
1:28:54
is a differentiable operation and we can back propagate through
1:28:57
and we can back propagate through
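A minimal sketch of the forward pass described here, assuming the character indices mentioned in this section (dot is 0, a is 1, e is 5, m is 13). The seed and the variable names `xenc`, `logits` and `counts` are illustrative assumptions. Every step is a multiply, add, exp, sum or divide, so all of it can be backpropagated through.

```python
import torch
import torch.nn.functional as F

# Inputs for the five bigrams of ".emma." (assumed indices: "."=0, a=1, e=5, m=13)
xs = torch.tensor([0, 5, 13, 13, 1])

# Random 27x27 weights; the seed is arbitrary and only makes the run repeatable
g = torch.Generator().manual_seed(0)
W = torch.randn((27, 27), generator=g)

xenc = F.one_hot(xs, num_classes=27).float()      # 5x27 one-hot inputs
logits = xenc @ W                                 # multiply and add
counts = logits.exp()                             # exponentiation
probs = counts / counts.sum(dim=1, keepdim=True)  # sum, then division

print(probs.shape)       # one row of 27 probabilities per example
print(probs.sum(dim=1))  # each row sums to one
```

The last three lines of the computation are the softmax: exponentiate, then normalize each row.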
1:28:57
and we can back propagate through now we achieve these probabilities which
1:28:59
now we achieve these probabilities which
1:28:59
now we achieve these probabilities which are 5 by 27
1:29:01
are 5 by 27
1:29:01
are 5 by 27 for every single example we have a
1:29:03
for every single example we have a
1:29:03
for every single example we have a vector of probabilities that sums to one
1:29:06
vector of probabilities that sums to one
1:29:06
vector of probabilities that sums to one and then here i wrote a bunch of stuff
1:29:08
and then here i wrote a bunch of stuff
1:29:08
and then here i wrote a bunch of stuff to sort of like break down uh the
1:29:10
to sort of like break down uh the
1:29:10
to sort of like break down uh the examples
1:29:11
examples
1:29:11
examples so we have five examples making up emma
1:29:14
so we have five examples making up emma
1:29:14
so we have five examples making up emma right
1:29:16
right
1:29:16
right and there are five bigrams inside emma
1:29:19
and there are five bigrams inside emma
1:29:20
and there are five bigrams inside emma so bigram example one is
1:29:23
so bigram example one is
1:29:23
so bigram example one is that e is the beginning character right
1:29:26
that e is the beginning character right
1:29:26
that e is the beginning character right after dot
1:29:28
after dot
1:29:28
after dot and the indexes for these are zero and
1:29:30
and the indexes for these are zero and
1:29:30
and the indexes for these are zero and five
1:29:31
five
1:29:31
five so then we feed in a zero
1:29:34
so then we feed in a zero
1:29:34
so then we feed in a zero that's the input of the neural net
1:29:35
that's the input of the neural net
1:29:35
that's the input of the neural net we get probabilities from the neural net
1:29:38
we get probabilities from the neural net
1:29:38
we get probabilities from the neural net that are 27 numbers
1:29:41
that are 27 numbers
1:29:41
that are 27 numbers and then the label is 5 because e
1:29:44
and then the label is 5 because e
1:29:44
and then the label is 5 because e actually comes after dot
1:29:45
actually comes after dot
1:29:45
actually comes after dot so that's the label
1:29:47
so that's the label
1:29:47
so that's the label and then
1:29:49
and then
1:29:49
and then we use this label 5 to index into the
1:29:52
we use this label 5 to index into the
1:29:52
we use this label 5 to index into the probability distribution here
1:29:54
probability distribution here
1:29:54
probability distribution here so
1:29:55
so
1:29:55
so this
1:29:55
this
1:29:56
this index 5 here is 0 1 2 3 4 5. it's this
1:29:59
index 5 here is 0 1 2 3 4 5. it's this
1:30:00
index 5 here is 0 1 2 3 4 5. it's this number here
1:30:01
number here
1:30:01
number here which is here
1:30:03
which is here
1:30:04
which is here so that's basically the probability
1:30:05
so that's basically the probability
1:30:05
so that's basically the probability assigned by the neural net to the actual
1:30:07
assigned by the neural net to the actual
1:30:07
assigned by the neural net to the actual correct character
1:30:08
correct character
1:30:08
correct character you see that the network currently
1:30:10
you see that the network currently
1:30:10
you see that the network currently thinks that this next character that e
1:30:12
thinks that this next character that e
1:30:12
thinks that this next character that e following dot is only one percent likely
1:30:15
following dot is only one percent likely
1:30:15
following dot is only one percent likely which is of course not very good right
1:30:16
which is of course not very good right
1:30:16
which is of course not very good right because this actually is a training
1:30:18
because this actually is a training
1:30:18
because this actually is a training example and the network thinks this is
1:30:20
example and the network thinks this is
1:30:20
example and the network thinks this is currently very very unlikely but that's
1:30:22
currently very very unlikely but that's
1:30:22
currently very very unlikely but that's just because we didn't get very lucky in
1:30:24
just because we didn't get very lucky in
1:30:24
just because we didn't get very lucky in generating a good setting of w so right
1:30:27
generating a good setting of w so right
1:30:27
generating a good setting of w so right now this network thinks this is unlikely
1:30:29
now this network thinks this is unlikely
1:30:29
now this network thinks this is unlikely and 0.01 is not a good outcome
1:30:31
and 0.01 is not a good outcome
1:30:31
and 0.01 is not a good outcome so the log likelihood then is very
1:30:34
so the log likelihood then is very
1:30:34
so the log likelihood then is very negative
1:30:35
negative
1:30:35
negative and the negative log likelihood is very
1:30:38
and the negative log likelihood is very
1:30:38
and the negative log likelihood is very positive
1:30:39
positive
1:30:39
positive and so four is a very high negative log
1:30:42
and so four is a very high negative log
1:30:42
and so four is a very high negative log likelihood and that means we're going to
1:30:43
likelihood and that means we're going to
1:30:44
likelihood and that means we're going to have a high loss
1:30:45
have a high loss
1:30:45
have a high loss because what is the loss the loss is
1:30:47
because what is the loss the loss is
1:30:47
because what is the loss the loss is just the average negative log likelihood
1:30:51
just the average negative log likelihood
1:30:51
just the average negative log likelihood so the second character is m
1:30:53
so the second character is m
1:30:53
so the second character is m and you see here that also the network
1:30:55
and you see here that also the network
1:30:55
and you see here that also the network thought that m following e is very
1:30:57
thought that m following e is very
1:30:57
thought that m following e is very unlikely one percent
1:31:00
then for m following m it thought it was
1:31:03
then for m following m it thought it was
1:31:03
then for m following m it thought it was two percent
1:31:04
two percent
1:31:04
two percent and for a following m it actually
1:31:06
and for a following m it actually
1:31:06
and for a following m it actually thought it was seven percent likely so
1:31:08
thought it was seven percent likely so
1:31:08
thought it was seven percent likely so just by chance this one actually has a
1:31:10
just by chance this one actually has a
1:31:10
just by chance this one actually has a pretty good probability and therefore
1:31:12
pretty good probability and therefore
1:31:12
pretty good probability and therefore pretty low negative log likelihood
1:31:15
pretty low negative log likelihood
1:31:15
pretty low negative log likelihood and finally here it thought this was one
1:31:17
and finally here it thought this was one
1:31:17
and finally here it thought this was one percent likely
1:31:18
percent likely
1:31:18
percent likely so overall our average negative log
1:31:20
so overall our average negative log
1:31:20
so overall our average negative log likelihood which is the loss the total
1:31:22
likelihood which is the loss the total
1:31:22
likelihood which is the loss the total loss that summarizes
1:31:24
loss that summarizes
1:31:24
loss that summarizes basically how well this network
1:31:26
basically how well this network
1:31:26
basically how well this network currently works at least on this one
1:31:28
currently works at least on this one
1:31:28
currently works at least on this one word not on the full dataset just the one
1:31:30
word not on the full dataset just the one
1:31:30
word not on the full dataset just the one word is 3.76 which is actually a
1:31:33
word is 3.76 which is actually a
1:31:33
word is 3.76 which is actually a fairly high loss this is not a very good
1:31:35
fairly high loss this is not a very good
1:31:35
fairly high loss this is not a very good setting of w's
1:31:36
setting of w's
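To make the arithmetic concrete, here is a small sketch that uses the rounded percentages quoted above (1%, 1%, 2%, 7%, 1%) as the probabilities of the correct characters. Because these are rounded, the average comes out near 4.08 rather than the 3.76 from the notebook. The list name `probs_of_correct` is just an illustrative choice.

```python
import math

# Rounded probabilities the network assigned to the correct next character
# for the five bigrams of ".emma." (values as quoted in the walkthrough)
probs_of_correct = [0.01, 0.01, 0.02, 0.07, 0.01]

# Negative log likelihood per example: small probability -> large positive number
nlls = [-math.log(p) for p in probs_of_correct]

# The loss is the average negative log likelihood
loss = sum(nlls) / len(nlls)

print([round(n, 2) for n in nlls])
print(round(loss, 2))
```

The 7% example has the lowest negative log likelihood of the five, which matches the remark that it came out fairly well just by chance.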
1:31:36
setting of w's now here's what we can do
1:31:38
now here's what we can do
1:31:38
now here's what we can do we're currently getting 3.76
1:31:41
we're currently getting 3.76
1:31:41
we're currently getting 3.76 we can actually come here and we can
1:31:42
we can actually come here and we can
1:31:42
we can actually come here and we can change our w we can resample it so let
1:31:45
change our w we can resample it so let
1:31:45
change our w we can resample it so let me just add one to have a different seed
1:31:48
me just add one to have a different seed
1:31:48
me just add one to have a different seed and then we get a different w
1:31:50
and then we get a different w
1:31:50
and then we get a different w and then we can rerun this
1:31:52
and then we can rerun this
1:31:52
and then we can rerun this and with this different seed with this
1:31:54
and with this different seed with this
1:31:54
and with this different seed with this different setting of w's we now get 3.37
1:31:58
different setting of w's we now get 3.37
1:31:58
different setting of w's we now get 3.37 so this is a much better w right and
1:32:00
so this is a much better w right and
1:32:00
so this is a much better w right and it's better because the
1:32:02
that and it's better because the
1:32:02
that and it's better because the probabilities just happen to come out
1:32:04
probabilities just happen to come out
1:32:04
probabilities just happen to come out higher for the characters that
1:32:07
higher for the characters that
1:32:07
higher for the characters that actually are next
1:32:08
actually are next
1:32:08
actually are next and so you can imagine actually just
1:32:09
and so you can imagine actually just
1:32:10
and so you can imagine actually just resampling this you know we can try two
1:32:14
resampling this you know we can try two
1:32:14
resampling this you know we can try two so
1:32:15
so
1:32:15
so okay this was not very good
1:32:17
okay this was not very good
1:32:17
okay this was not very good let's try one more
1:32:18
let's try one more
1:32:18
let's try one more we can try three
1:32:20
we can try three
1:32:20
we can try three okay this was a terrible setting because
1:32:22
okay this was a terrible setting because
1:32:22
okay this was a terrible setting because we have a very high loss
1:32:24
we have a very high loss
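The guess-and-check experiment can be sketched like this. The helper `loss_for_seed` and the seeds 0 to 3 are assumptions for illustration, so the printed losses will not match the 3.76 and 3.37 from the notebook. The point is only that each resampled W gives a different loss, with no direction telling us how to improve.

```python
import torch
import torch.nn.functional as F

# Bigrams of ".emma." as (input, label) index pairs (assumed: "."=0, a=1, e=5, m=13)
xs = torch.tensor([0, 5, 13, 13, 1])
ys = torch.tensor([5, 13, 13, 1, 0])

def loss_for_seed(seed):
    """Sample a fresh W from the given seed and return the average negative log likelihood."""
    g = torch.Generator().manual_seed(seed)
    W = torch.randn((27, 27), generator=g)
    logits = F.one_hot(xs, num_classes=27).float() @ W
    probs = logits.softmax(dim=1)
    return -probs[torch.arange(5), ys].log().mean().item()

# Guess and check: resample W a few times and look at the loss each time
losses = {seed: loss_for_seed(seed) for seed in range(4)}
for seed, loss in losses.items():
    print(seed, round(loss, 4))
```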
1:32:24
we have a very high loss so anyway i'm going to erase this
1:32:29
what i'm doing here which is just guess
1:32:31
what i'm doing here which is just guess
1:32:31
what i'm doing here which is just guess and check of randomly assigning
1:32:32
and check of randomly assigning
1:32:32
and check of randomly assigning parameters and seeing if the network is
1:32:34
parameters and seeing if the network is
1:32:34
parameters and seeing if the network is good that is uh amateur hour that's not
1:32:37
good that is uh amateur hour that's not
1:32:37
good that is uh amateur hour that's not how you optimize a neural net the way
1:32:39
how you optimize a neural net the way
1:32:39
how you optimize a neural net the way you optimize your neural net is you
1:32:40
you optimize your neural net is you
1:32:40
you optimize your neural net is you start with some random guess and we're
1:32:42
start with some random guess and we're
1:32:42
start with some random guess and we're going to commit to this one even though
1:32:43
going to commit to this one even though
1:32:43
going to commit to this one even though it's not very good
1:32:45
it's not very good
1:32:45
it's not very good but now the big deal is we have a loss
1:32:46
but now the big deal is we have a loss
1:32:46
but now the big deal is we have a loss function
1:32:48
function
1:32:48
function so this loss
1:32:49
so this loss
1:32:49
so this loss is made up only of differentiable
1:32:52
is made up only of differentiable
1:32:52
is made up only of differentiable operations and we can minimize the loss
1:32:56
operations and we can minimize the loss
1:32:56
operations and we can minimize the loss by tuning
1:32:57
by tuning
1:32:57
by tuning ws
1:32:58
ws
1:32:58
ws by computing the gradients of the loss
1:33:01
by computing the gradients of the loss
1:33:01
by computing the gradients of the loss with respect to
1:33:02
with respect to
1:33:02
with respect to these w matrices
1:33:04
these w matrices
1:33:04
these w matrices and so then we can tune w to minimize
1:33:07
and so then we can tune w to minimize
1:33:07
and so then we can tune w to minimize the loss and find a good setting of w
1:33:09
the loss and find a good setting of w
1:33:09
the loss and find a good setting of w using gradient based optimization so
1:33:11
using gradient based optimization so
1:33:11
using gradient based optimization so let's see how that will work now things
1:33:13
let's see how that will work now things
1:33:13
let's see how that will work now things are actually going to look almost
1:33:14
are actually going to look almost
1:33:14
are actually going to look almost identical to what we had with micrograd
1:33:17
identical to what we had with micrograd
1:33:17
identical to what we had with micrograd so here
1:33:18
so here
1:33:18
so here i pulled up the lecture from micrograd
1:33:20
i pulled up the lecture from micrograd
1:33:20
i pulled up the lecture from micrograd the notebook it's from this repository
1:33:23
the notebook it's from this repository
1:33:23
the notebook it's from this repository and when i scroll all the way to the end
1:33:24
and when i scroll all the way to the end
1:33:24
and when i scroll all the way to the end where we left off with micrograd we had
1:33:26
where we left off with micrograd we had
1:33:26
where we left off with micrograd we had something very very similar
1:33:28
something very very similar
1:33:28
something very very similar we had
1:33:29
we had
1:33:29
we had a number of input examples in this case
1:33:31
a number of input examples in this case
1:33:31
a number of input examples in this case we had four input examples inside xs
1:33:34
we had four input examples inside xs
1:33:34
we had four input examples inside xs and we had their targets these are
1:33:36
and we had their targets these are
1:33:36
and we had their targets these are targets
1:33:37
targets
1:33:37
targets just like here we have our xs now but
1:33:39
just like here we have our xs now but
1:33:39
just like here we have our xs now but we have five of them and they're now
1:33:41
we have five of them and they're now
1:33:41
we have five of them and they're now integers instead of vectors
1:33:44
integers instead of vectors
1:33:44
integers instead of vectors but we're going to convert our integers
1:33:45
but we're going to convert our integers
1:33:45
but we're going to convert our integers to vectors except our vectors will be 27
1:33:48
to vectors except our vectors will be 27
1:33:48
to vectors except our vectors will be 27 large instead of three large
1:33:51
large instead of three large
1:33:51
large instead of three large and then here what we did is first we
1:33:53
and then here what we did is first we
1:33:53
and then here what we did is first we did a forward pass where we ran a neural
1:33:55
did a forward pass where we ran a neural
1:33:55
did a forward pass where we ran a neural net on all of the inputs
1:33:58
net on all of the inputs
1:33:58
net on all of the inputs to get predictions
1:34:00
to get predictions
1:34:00
to get predictions our neural net at the time this n(x) was
1:34:02
our neural net at the time this n(x) was
1:34:02
our neural net at the time this n(x) was a multi-layer perceptron
1:34:05
a multi-layer perceptron
1:34:05
a multi-layer perceptron our neural net is going to look
1:34:06
our neural net is going to look
1:34:06
our neural net is going to look different because our neural net is just
1:34:08
different because our neural net is just
1:34:08
different because our neural net is just a single layer
1:34:10
a single layer
1:34:10
a single layer single linear layer followed by a soft
1:34:12
single linear layer followed by a soft
1:34:12
single linear layer followed by a soft max
1:34:13
max
1:34:13
max so that's our neural net
1:34:15
so that's our neural net
1:34:15
so that's our neural net and the loss here was the mean squared
1:34:17
and the loss here was the mean squared
1:34:17
and the loss here was the mean squared error so we simply subtracted the
1:34:19
error so we simply subtracted the
1:34:19
error so we simply subtracted the prediction from the ground truth and
1:34:21
prediction from the ground truth and
1:34:21
prediction from the ground truth and squared it and summed it all up and that
1:34:23
squared it and summed it all up and that
1:34:23
squared it and summed it all up and that was the loss and loss was the single
1:34:25
was the loss and loss was the single
1:34:25
was the loss and loss was the single number that summarized the quality of
1:34:27
number that summarized the quality of
1:34:27
number that summarized the quality of the neural net and when loss is low like
1:34:30
the neural net and when loss is low like
1:34:30
the neural net and when loss is low like almost zero that means the neural net is
1:34:33
almost zero that means the neural net is
1:34:33
almost zero that means the neural net is predicting correctly
1:34:36
predicting correctly
1:34:36
predicting correctly so we had a single number that uh that
1:34:38
so we had a single number that uh that
1:34:38
so we had a single number that uh that summarized the uh the performance of the
1:34:40
summarized the uh the performance of the
1:34:40
summarized the uh the performance of the neural net and everything here was
1:34:42
neural net and everything here was
1:34:42
neural net and everything here was differentiable and was stored in a massive
1:34:44
differentiable and was stored in a massive
1:34:44
differentiable and was stored in a massive compute graph
1:34:46
compute graph
1:34:46
compute graph and then we iterated over all the
1:34:48
and then we iterated over all the
1:34:48
and then we iterated over all the parameters we made sure that the
1:34:50
parameters we made sure that the
1:34:50
parameters we made sure that the gradients are set to zero and we called
1:34:52
gradients are set to zero and we called
1:34:52
gradients are set to zero and we called loss.backward
1:34:54
loss.backward
1:34:54
loss.backward and loss.backward initiated back
1:34:56
and loss.backward initiated back
1:34:56
and loss.backward initiated back propagation at the final output node of
1:34:58
propagation at the final output node of
1:34:58
propagation at the final output node of loss
1:34:59
loss
1:34:59
loss right so
1:35:00
right so
1:35:00
right so yeah remember these expressions we had
1:35:02
yeah remember these expressions we had
1:35:02
yeah remember these expressions we had loss all the way at the end we start
1:35:03
loss all the way at the end we start
1:35:03
loss all the way at the end we start back propagation and we went all the way
1:35:05
back propagation and we went all the way
1:35:05
back propagation and we went all the way back
1:35:06
back
1:35:06
back and we made sure that we populated all
1:35:08
and we made sure that we populated all
1:35:08
and we made sure that we populated all the parameters' .grad
1:35:10
the parameters' .grad
1:35:10
the parameters' .grad so that .grad started at zero but back
1:35:12
so that .grad started at zero but back
1:35:12
so that .grad started at zero but back propagation filled it in
1:35:14
propagation filled it in
1:35:14
propagation filled it in and then in the update we iterated over
1:35:16
and then in the update we iterated over
1:35:16
and then in the update we iterated over all the parameters and we simply did a
1:35:18
all the parameters and we simply did a
1:35:18
all the parameters and we simply did a parameter update where every single
1:35:21
parameter update where every single
1:35:21
parameter update where every single element of our parameters was nudged in
1:35:24
element of our parameters was nudged in
1:35:24
element of our parameters was nudged in the opposite direction of the gradient
1:35:27
the opposite direction of the gradient
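Here is a stripped-down sketch of that loop: forward pass, reset the gradient, backward pass, then nudge the parameter against the gradient. It is not micrograd. It is a hypothetical one-parameter model in plain Python with the gradient of the squared error written out by hand, and the data, learning rate and step count are made up for illustration.

```python
# Toy data: the target is exactly 2 * x, so the best weight is 2.0
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0    # the single parameter, starting from a poor guess
lr = 0.01  # learning rate (step size)
history = []

for step in range(200):
    # forward pass: predictions and summed squared error
    preds = [w * x for x in xs]
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys))
    history.append(loss)

    # backward pass: reset the gradient, then accumulate d(loss)/dw by hand
    grad = 0.0
    for p, y, x in zip(preds, ys, xs):
        grad += 2 * (p - y) * x

    # update: nudge the parameter in the opposite direction of the gradient
    w -= lr * grad

print(w, history[0], history[-1])
```

The loss here shrinks on every step and w settles at 2. The bigram network follows the same three phases, just with a different model and a different loss.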
1:35:27
the opposite direction of the gradient and so we're going to do the exact same
1:35:30
and so we're going to do the exact same
1:35:30
and so we're going to do the exact same thing here
1:35:31
thing here
1:35:31
thing here so i'm going to pull this up
1:35:34
so i'm going to pull this up
1:35:34
so i'm going to pull this up on the side here
1:35:38
so that we have it available and we're
1:35:40
so that we have it available and we're
1:35:40
so that we have it available and we're actually going to do the exact same
1:35:41
actually going to do the exact same
1:35:41
actually going to do the exact same thing so this was the forward pass so
1:35:44
thing so this was the forward pass so
1:35:44
thing so this was the forward pass so where we did this
1:35:46
where we did this
1:35:46
where we did this and probs is our ypred so now we have
1:35:49
and probs is our ypred so now we have
1:35:49
and probs is our ypred so now we have to evaluate the loss but we're not using
1:35:51
to evaluate the loss but we're not using
1:35:51
to evaluate the loss but we're not using the mean squared error we're using the
1:35:52
the mean squared error we're using the
1:35:52
the mean squared error we're using the negative log likelihood because we are
1:35:54
negative log likelihood because we are
1:35:54
negative log likelihood because we are doing classification we're not doing
1:35:56
doing classification we're not doing
1:35:56
doing classification we're not doing regression as it's called
1:35:58
regression as it's called
1:35:58
regression as it's called so here we want to calculate loss
1:36:02
so here we want to calculate loss
1:36:02
so here we want to calculate loss now the way we calculate it is it's just
1:36:04
now the way we calculate it is it's just
1:36:04
now the way we calculate it is it's just this average negative log likelihood
1:36:07
this average negative log likelihood
1:36:07
this average negative log likelihood now this probs here
1:36:10
now this probs here
1:36:10
now this probs here has a shape of 5 by 27
1:36:13
has a shape of 5 by 27
1:36:13
has a shape of 5 by 27 and so we basically want
1:36:15
and so we basically want
1:36:15
and so we basically want to pluck out the probabilities at the
1:36:18
to pluck out the probabilities at the
1:36:18
to pluck out the probabilities at the correct indices here
1:36:19
correct indices here
1:36:20
correct indices here so in particular because the labels are
1:36:22
so in particular because the labels are
1:36:22
so in particular because the labels are stored here in the array ys
1:36:24
stored here in the array ys
1:36:24
stored here in the array ys basically what we're after is for the
1:36:25
basically what we're after is for the
1:36:26
basically what we're after is for the first example we're looking at
1:36:27
first example we're looking at
1:36:27
first example we're looking at probability of five right at index five
1:36:30
probability of five right at index five
1:36:30
probability of five right at index five for the second example
1:36:32
for the second example
1:36:32
for the second example at the second row or row index one
1:36:36
at the second row or row index one
1:36:36
at the second row or row index one we are interested in the probability
1:36:37
we are interested in the probability
1:36:37
we are interested in the probability assigned to index 13.
1:36:40
assigned to index 13.
1:36:40
assigned to index 13. at row index two we also have 13.
1:36:43
at row index two we also have 13.
1:36:43
at row index two we also have 13. at row index three we want one
1:36:47
at row index three we want one
1:36:47
at row index three we want one and then the last row which is four we
1:36:49
and then the last row which is four we
1:36:49
and then the last row which is four we want zero so these are the probabilities
1:36:52
want zero so these are the probabilities
1:36:52
want zero so these are the probabilities we're interested in right
1:36:53
we're interested in right
1:36:54
we're interested in right and you can see that they're not amazing
1:36:55
and you can see that they're not amazing
1:36:55
and you can see that they're not amazing as we saw above
1:36:58
as we saw above
1:36:58
as we saw above so these are the probabilities we want
1:37:00
so these are the probabilities we want
1:37:00
so these are the probabilities we want but we want like a more efficient way to
1:37:02
but we want like a more efficient way to
1:37:02
but we want like a more efficient way to access these probabilities
1:37:04
access these probabilities
1:37:04
access these probabilities not just listing them out in a tuple
1:37:06
not just listing them out in a tuple
1:37:06
not just listing them out in a tuple like this so it turns out that the way
1:37:07
like this so it turns out that the way
1:37:07
like this so it turns out that the way to do this in pytorch uh one of the ways
1:37:09
to do this in pytorch uh one of the ways
1:37:09
to do this in pytorch uh one of the ways at least is we can basically pass in
1:37:12
at least is we can basically pass in
1:37:12
at least is we can basically pass in all of these
1:37:16
sorry about that all of these um
1:37:19
sorry about that all of these um
1:37:19
sorry about that all of these um integers in the vectors
1:37:22
integers in the vectors
1:37:22
integers in the vectors so
1:37:22
so
1:37:22
so the
1:37:23
the
1:37:23
the these ones you see how they're just 0 1
1:37:25
these ones you see how they're just 0 1
1:37:25
these ones you see how they're just 0 1 2 3 4
1:37:27
2 3 4
1:37:27
2 3 4 we can actually create that using np
1:37:29
we can actually create that using np
1:37:29
we can actually create that using np not np sorry torch.arange of 5
1:37:32
not np sorry torch.arange of 5
1:37:32
not np sorry torch.arange of 5 0 1 2 3 4.
1:37:34
0 1 2 3 4.
1:37:34
0 1 2 3 4. so we can index here with torch.arange of
1:37:37
so we can index here with torch.arange of
1:37:37
so we can index here with torch.arange of 5
1:37:38
5
1:37:38
5 and here we index with ys
1:37:41
and here we index with ys
1:37:41
and here we index with ys and you see that that gives us
1:37:43
and you see that that gives us
1:37:43
and you see that that gives us exactly these numbers
1:37:48
so that plucks out the probabilities
1:37:51
so that plucks out the probabilities
1:37:51
so that plucks out the probabilities that the neural network assigns to the
1:37:53
that the neural network assigns to the
1:37:53
that the neural network assigns to the correct next character
1:37:56
correct next character
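A short sketch of that indexing trick, using a stand-in `probs` matrix (random rows normalized to sum to one, an assumption made only so the snippet runs on its own). Passing two index tensors of the same length picks one element per row: row i, column ys[i].

```python
import torch

# Stand-in for the 5x27 probability matrix from the forward pass
torch.manual_seed(0)
probs = torch.rand(5, 27)
probs = probs / probs.sum(dim=1, keepdim=True)

# Labels for the bigrams of ".emma." (e=5, m=13, m=13, a=1, "."=0)
ys = torch.tensor([5, 13, 13, 1, 0])

# Listing the entries out by hand, one per row
by_hand = torch.stack([probs[0, 5], probs[1, 13], probs[2, 13], probs[3, 1], probs[4, 0]])

# The same entries with one indexing expression: rows 0..4 paired with ys
picked = probs[torch.arange(5), ys]

print(torch.equal(by_hand, picked))  # True
```

Note that `torch.arange(5)` produces 0 through 4. The older `torch.range` includes its endpoint and is deprecated, so `torch.arange` is the one to use here.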
1:37:56
correct next character now we take those probabilities and we
1:37:58
now we take those probabilities and we
1:37:58
now we take those probabilities and we don't we actually look at the log
1:37:59
don't we actually look at the log
1:37:59
don't we actually look at the log probability so we want to call .log
1:38:03
probability so we want to call .log
1:38:03
probability so we want to call .log and then we want to just
1:38:05
and then we want to just
1:38:05
and then we want to just average that up so take the mean of all
1:38:07
average that up so take the mean of all
1:38:07
average that up so take the mean of all of that
1:38:08
of that
1:38:08
of that and then it's the negative
1:38:10
and then it's the negative
1:38:10
and then it's the negative average log likelihood that is the loss
1:38:14
average log likelihood that is the loss
1:38:14
average log likelihood that is the loss so the loss here is 3.7 something and
1:38:18
so the loss here is 3.7 something and
1:38:18
so the loss here is 3.7 something and you see that this loss 3.76 is
1:38:21
you see that this loss 3.76 is
1:38:21
you see that this loss 3.76 is exactly as we've obtained before but
1:38:23
exactly as we've obtained before but
1:38:23
exactly as we've obtained before but this is a vectorized form of that
1:38:25
this is a vectorized form of that
1:38:25
this is a vectorized form of that expression
1:38:26
expression
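As a sanity check, the vectorized loss can be compared against the spelled-out per-example loop. This sketch again uses a stand-in `probs` matrix (random, normalized rows), so the number it prints is not the 3.76 from the notebook. The names `loss` and `loop_loss` are illustrative.

```python
import math
import torch

# Stand-in for the 5x27 probability matrix from the forward pass
torch.manual_seed(0)
probs = torch.rand(5, 27)
probs = probs / probs.sum(dim=1, keepdim=True)
ys = torch.tensor([5, 13, 13, 1, 0])

# Vectorized: pluck out the correct probabilities, log, mean, negate
loss = -probs[torch.arange(5), ys].log().mean()

# Spelled out: accumulate the negative log likelihood one example at a time
loop_loss = 0.0
for i in range(5):
    p = probs[i, ys[i]].item()
    loop_loss += -math.log(p)
loop_loss /= 5

print(loss.item(), loop_loss)  # the two agree up to floating point error
```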
1:38:26
expression so
1:38:27
so
1:38:27
so we get the same loss
1:38:29
we get the same loss
1:38:29
we get the same loss and the same loss we can consider
1:38:31
and the same loss we can consider
1:38:31
and the same loss we can consider as part of this forward pass
1:38:33
as part of this forward pass
1:38:34
as part of this forward pass and we've achieved here now loss
1:38:36
and we've achieved here now loss
1:38:36
and we've achieved here now loss okay so we made our way all the way to
1:38:37
okay so we made our way all the way to
1:38:37
okay so we made our way all the way to loss we've defined the forward pass
1:38:40
loss we've defined the forward pass
1:38:40
loss we've defined the forward pass we forwarded the network and the loss
1:38:41
we forwarded the network and the loss
1:38:42
we forwarded the network and the loss now we're ready to do the backward pass
1:38:44
now we're ready to do the backward pass
1:38:44
now we're ready to do the backward pass so backward pass
1:38:48
we want to first make sure that all the
1:38:49
we want to first make sure that all the
1:38:49
we want to first make sure that all the gradients are reset so they're at zero
1:38:52
gradients are reset so they're at zero
1:38:52
gradients are reset so they're at zero now in pytorch you can set the gradients
1:38:55
now in pytorch you can set the gradients
1:38:55
now in pytorch you can set the gradients to be zero but you can also just set it
1:38:57
to be zero but you can also just set it
1:38:57
to be zero but you can also just set it to none and setting it to none is more
1:38:59
to none and setting it to none is more
1:38:59
to none and setting it to none is more efficient and pytorch will interpret
1:39:01
efficient and pytorch will interpret
1:39:01
efficient and pytorch will interpret none as like a lack of a gradient and is
1:39:04
none as like a lack of a gradient and is
1:39:04
none as like a lack of a gradient and is the same as zeros
1:39:05
the same as zeros
1:39:05
the same as zeros so this is a way to set to zero the
1:39:07
so this is a way to set to zero the
1:39:07
so this is a way to set to zero the gradient
1:39:10
and now we do loss.backward
1:39:14
before we do loss.backward we need
1:39:16
before we do loss.backward we need
1:39:16
before we do loss.backward we need one more thing if you remember from
1:39:17
one more thing if you remember from
1:39:17
one more thing if you remember from micrograd
1:39:18
micrograd
1:39:18
micrograd pytorch actually requires
1:39:21
pytorch actually requires
1:39:21
pytorch actually requires that we pass in requires_grad=True
1:39:25
that we pass in requires_grad=True
1:39:25
that we pass in requires_grad=True so that we tell
1:39:26
so that we tell
1:39:26
so that we tell pytorch that we are interested in
1:39:28
pytorch that we are interested in
1:39:28
pytorch that we are interested in calculating gradients for this leaf
1:39:30
calculating gradients for this leaf
1:39:30
calculating gradients for this leaf tensor by default this is false
1:39:33
tensor by default this is false
1:39:33
tensor by default this is false so let me recalculate with that
1:39:35
so let me recalculate with that
1:39:35
so let me recalculate with that and then set to none and call
1:39:37
and then set to none and call
1:39:37
and then set to none and call loss.backward()
1:39:40
loss.backward()
1:39:40
loss.backward() now something magical happened when
1:39:42
now something magical happened when
1:39:42
now something magical happened when loss.backward() was run
1:39:44
loss.backward() was run
1:39:44
loss.backward() was run because pytorch just like micrograd when
1:39:47
because pytorch just like micrograd when
1:39:47
because pytorch just like micrograd when we did the forward pass here
1:39:49
we did the forward pass here
1:39:49
we did the forward pass here it keeps track of all the operations
1:39:51
it keeps track of all the operations
1:39:51
it keeps track of all the operations under the hood it builds a full
1:39:53
under the hood it builds a full
1:39:53
under the hood it builds a full computational graph just like the graphs
1:39:55
computational graph just like the graphs
1:39:55
computational graph just like the graphs we've
1:39:56
we've
1:39:56
we've produced in micrograd those graphs exist
1:39:58
produced in micrograd those graphs exist
1:39:58
produced in micrograd those graphs exist inside pi torch
1:40:00
inside pi torch
1:40:00
inside pi torch and so it knows all the dependencies and
1:40:02
and so it knows all the dependencies and
1:40:02
and so it knows all the dependencies and all the mathematical operations of
1:40:03
all the mathematical operations of
1:40:04
all the mathematical operations of everything
1:40:04
everything
1:40:04
everything and when you then calculate the loss
1:40:07
and when you then calculate the loss
1:40:07
and when you then calculate the loss we can call .backward() on it
1:40:09
we can call .backward() on it
1:40:09
we can call .backward() on it and that backward then fills in the
1:40:11
and that backward then fills in the
1:40:11
and that backward then fills in the gradients of
1:40:13
gradients of
1:40:13
gradients of all the intermediates
1:40:15
all the intermediates
1:40:15
all the intermediates all the way back to w's which are the
1:40:18
all the way back to w's which are the
1:40:18
all the way back to w's which are the parameters of our neural net so now we
1:40:20
parameters of our neural net so now we
1:40:20
parameters of our neural net so now we can do w.grad and we see that it has
1:40:23
can do w.grad and we see that it has
1:40:23
can do w.grad and we see that it has structure there's stuff inside it
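The steps described so far can be sketched as follows. This is only an illustration, assuming the five bigrams of "emma" and an index mapping of '.' = 0, 'a' = 1, ..., 'z' = 26; the seed and variable names are assumptions, not necessarily what is on screen.

```python
import torch
import torch.nn.functional as F

# Bigrams of ".emma.": inputs xs and targets ys as character indices
# (assumed mapping: '.' = 0, 'a' = 1, ..., 'z' = 26).
xs = torch.tensor([0, 5, 13, 13, 1])
ys = torch.tensor([5, 13, 13, 1, 0])

g = torch.Generator().manual_seed(2147483647)  # arbitrary seed for repeatability
# requires_grad=True marks W as a leaf tensor that autograd should track.
W = torch.randn((27, 27), generator=g, requires_grad=True)

# Forward pass: one-hot encode, compute logits, softmax, negative log likelihood.
xenc = F.one_hot(xs, num_classes=27).float()
logits = xenc @ W
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
loss = -probs[torch.arange(5), ys].log().mean()

# Backward pass: reset the gradient, then backpropagate from the loss.
W.grad = None
loss.backward()

print(W.shape, W.grad.shape)  # the gradient has the same shape as W
```

Setting `W.grad = None` before `loss.backward()` matters once this runs in a loop, because otherwise gradients from successive backward calls accumulate.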
1:40:29
and these gradients
1:40:30
and these gradients
1:40:30
and these gradients every single element here
1:40:33
every single element here
1:40:33
every single element here so w.shape is 27 by 27
1:40:36
so w.shape is 27 by 27
1:40:36
so w.shape is 27 by 27 w.grad.shape is the same 27 by 27
1:40:40
w.grad.shape is the same 27 by 27
1:40:40
w.grad.shape is the same 27 by 27 and every element of w.grad
1:40:43
and every element of w.grad
1:40:43
and every element of w.grad is telling us
1:40:44
is telling us
1:40:44
is telling us the influence of that weight on the loss
1:40:47
the influence of that weight on the loss
1:40:47
the influence of that weight on the loss function
1:40:48
function
1:40:48
function so for example this number all the way
1:40:50
so for example this number all the way
1:40:50
so for example this number all the way here
1:40:51
here
1:40:51
here if this element the zero zero element of
1:40:54
if this element the zero zero element of
1:40:54
if this element the zero zero element of w
1:40:55
w
1:40:55
w because the gradient is positive is
1:40:57
because the gradient is positive is
1:40:57
because the gradient is positive is telling us that this has a positive
1:40:59
telling us that this has a positive
1:41:00
telling us that this has a positive influence on the loss slightly nudging
1:41:00
influence on the loss slightly nudging
1:41:03
influence on the loss slightly nudging w
1:41:04
w
1:41:04
w slightly taking w[0, 0]
1:41:06
slightly taking w[0, 0]
1:41:06
slightly taking w[0, 0] and
1:41:07
and
1:41:07
and adding a small h to it
1:41:10
adding a small h to it
1:41:10
adding a small h to it would increase the loss
1:41:12
would increase the loss
1:41:12
would increase the loss mildly because this gradient is positive
1:41:15
mildly because this gradient is positive
1:41:15
mildly because this gradient is positive some of these gradients are also
1:41:16
some of these gradients are also
1:41:16
some of these gradients are also negative
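One way to check this reading of the gradient is a finite-difference nudge. The sketch below assumes the same five "emma" bigrams and index mapping ('.' = 0); the helper `nll` and the step size `h` are hypothetical names chosen for illustration. For this data the gradient at `W[0, 0]` is positive for any initialization, because row 0 is only used for the input '.', whose target is 'e' rather than '.'.

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])   # ".emma" as inputs
ys = torch.tensor([5, 13, 13, 1, 0])   # "emma." as targets

def nll(W):
    """Average negative log likelihood of the five bigrams under weights W."""
    logits = F.one_hot(xs, num_classes=27).double() @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    return -probs[torch.arange(5), ys].log().mean()

# float64 keeps the tiny loss difference below well above rounding error
W = torch.randn((27, 27), dtype=torch.float64, requires_grad=True)
loss = nll(W)
loss.backward()

# Nudge W[0, 0] by a small h, outside of the autograd graph.
h = 1e-4
with torch.no_grad():
    W_nudged = W.clone()
    W_nudged[0, 0] += h
    slope = (nll(W_nudged) - loss) / h

# The measured slope should nearly agree with the autograd gradient.
print(W.grad[0, 0].item(), slope.item())
```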
1:41:18
negative
1:41:18
negative so that's telling us about the gradient
1:41:20
so that's telling us about the gradient
1:41:20
so that's telling us about the gradient information and we can use this gradient
1:41:22
information and we can use this gradient
1:41:22
information and we can use this gradient information to update the weights of
1:41:25
information to update the weights of
1:41:25
information to update the weights of this neural network so let's now do the
1:41:27
this neural network so let's now do the
1:41:27
this neural network so let's now do the update it's going to be very similar to
1:41:29
update it's going to be very similar to
1:41:29
update it's going to be very similar to what we had in micrograd we need no loop
1:41:32
what we had in micrograd we need no loop
1:41:32
what we had in micrograd we need no loop over all the parameters because we only
1:41:33
over all the parameters because we only
1:41:33
over all the parameters because we only have one parameter uh tensor and that is
1:41:36
have one parameter uh tensor and that is
1:41:36
have one parameter uh tensor and that is w
1:41:37
w
1:41:37
w so we simply do w.data +=
1:41:40
so we simply do w.data +=
1:41:40
so we simply do w.data += uh the
1:41:42
uh the
1:41:42
uh the we can actually copy this almost exactly
1:41:43
we can actually copy this almost exactly
1:41:43
we can actually copy this almost exactly -0.1 * w.grad
1:41:49
and that would be the update to the
1:41:52
and that would be the update to the
1:41:52
and that would be the update to the tensor
1:41:54
tensor
1:41:54
tensor so that updates
1:41:55
so that updates
1:41:55
so that updates the tensor
1:41:58
and
1:41:59
and
1:41:59
and because the tensor is updated we would
1:42:01
because the tensor is updated we would
1:42:01
because the tensor is updated we would expect that now the loss should decrease
1:42:04
expect that now the loss should decrease
1:42:04
expect that now the loss should decrease so
1:42:05
so
1:42:05
so here if i print loss
1:42:09
that item
1:42:11
that item
1:42:11
that item it was 3.76 right
1:42:13
it was 3.76 right
1:42:13
it was 3.76 right so we've updated the w here so if i
1:42:16
so we've updated the w here so if i
1:42:16
so we've updated the w here so if i recalculate forward pass
1:42:18
recalculate forward pass
1:42:18
recalculate forward pass loss now should be slightly lower so
1:42:21
loss now should be slightly lower so
1:42:21
loss now should be slightly lower so 3.76 goes to
1:42:23
3.76 goes to
1:42:23
3.76 goes to 3.74
1:42:25
3.74
1:42:25
3.74 and then
1:42:26
and then
1:42:26
and then we can again set to set grad to none and
1:42:29
we can again set to set grad to none and
1:42:29
we can again set to set grad to none and backward
1:42:30
backward
1:42:30
backward update
1:42:32
update
1:42:32
update and now the parameters changed again
1:42:34
and now the parameters changed again
1:42:34
and now the parameters changed again so if we recalculate the forward pass we
1:42:37
so if we recalculate the forward pass we
1:42:37
so if we recalculate the forward pass we expect a lower loss again 3.72
1:42:42
okay and we're
1:42:44
okay and we're
1:42:44
okay and we're now doing gradient descent
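Putting the forward pass, backward pass and update into a loop gives a small gradient descent sketch. It assumes the same "emma" bigrams and index mapping; the seed, the list `losses`, and the iteration count are illustrative choices.

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])
ys = torch.tensor([5, 13, 13, 1, 0])
num = xs.nelement()                      # number of examples (5 here)
xenc = F.one_hot(xs, num_classes=27).float()

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)

losses = []
for k in range(10):
    # forward pass
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    loss = -probs[torch.arange(num), ys].log().mean()
    losses.append(loss.item())

    # backward pass
    W.grad = None          # reset the gradient
    loss.backward()

    # update: step against the gradient with learning rate 0.1
    W.data += -0.1 * W.grad

print(losses)  # the loss should go down a little on every iteration
```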
1:42:48
and when we achieve a low loss that will
1:42:50
and when we achieve a low loss that will
1:42:50
and when we achieve a low loss that will mean that the network is assigning high
1:42:52
mean that the network is assigning high
1:42:52
mean that the network is assigning high probabilities to the correct next
1:42:54
probabilities to the correct next
1:42:54
probabilities to the correct next characters okay so i rearranged
1:42:56
characters okay so i rearranged
1:42:56
characters okay so i rearranged everything and i put it all together
1:42:57
everything and i put it all together
1:42:57
everything and i put it all together from scratch
1:42:59
from scratch
1:42:59
from scratch so here is where we construct our data
1:43:01
so here is where we construct our data
1:43:01
so here is where we construct our data set of bigrams
1:43:03
set of bigrams
1:43:03
set of bigrams you see that we are still iterating only
1:43:04
you see that we are still iterating only
1:43:04
you see that we are still iterating only on the first word emma
1:43:06
on the first word emma
1:43:06
on the first word emma i'm going to change that in a second i
1:43:09
i'm going to change that in a second i
1:43:09
i'm going to change that in a second i added a number that counts the number of
1:43:11
added a number that counts the number of
1:43:11
added a number that counts the number of elements in x's so that we explicitly
1:43:14
elements in x's so that we explicitly
1:43:14
elements in x's so that we explicitly see that number of examples is five
1:43:16
see that number of examples is five
1:43:16
see that number of examples is five because currently we're just working
1:43:17
because currently we're just working
1:43:18
because currently we're just working with emma and there's five bigrams
1:43:19
with emma and there's five bigrams
1:43:19
with emma and there's five bigrams there
1:43:20
there
1:43:20
there and here i added a loop of exactly what
1:43:22
and here i added a loop of exactly what
1:43:22
and here i added a loop of exactly what we had before so we had 10 iterations of
1:43:25
we had before so we had 10 iterations of
1:43:25
we had before so we had 10 iterations of gradient descent of forward pass backward
1:43:27
gradient descent of forward pass backward
1:43:27
gradient descent of forward pass backward pass and an update
1:43:28
pass and an update
1:43:28
pass and an update and so running these two cells
1:43:30
and so running these two cells
1:43:30
and so running these two cells initialization and gradient descent
1:43:32
initialization and gradient descent
1:43:32
initialization and gradient descent gives us some improvement
1:43:35
gives us some improvement
1:43:35
gives us some improvement on
1:43:36
on
1:43:36
on the loss function
1:43:38
the loss function
1:43:38
the loss function but now i want to use all the words
1:43:41
but now i want to use all the words
1:43:41
but now i want to use all the words and there's not 5 but 228,000 bigrams
1:43:45
and there's not 5 but 228,000 bigrams
1:43:45
and there's not 5 but 228,000 bigrams now
1:43:46
now
1:43:46
now however this should require no
1:43:48
however this should require no
1:43:48
however this should require no modification whatsoever everything
1:43:49
modification whatsoever everything
1:43:49
modification whatsoever everything should just run because all the code we
1:43:51
should just run because all the code we
1:43:51
should just run because all the code we wrote doesn't care if there's five
1:43:53
wrote doesn't care if there's five
1:43:53
wrote doesn't care if there's five bigrams or 228,000 bigrams and
1:43:56
bigrams or 228,000 bigrams and
1:43:56
bigrams or 228,000 bigrams and everything should just work so
1:43:58
everything should just work so
1:43:58
everything should just work so you see that this will just run
1:44:00
you see that this will just run
1:44:00
you see that this will just run but now we are optimizing over the
1:44:01
but now we are optimizing over the
1:44:01
but now we are optimizing over the entire training set of all the bigrams
1:44:04
entire training set of all the bigrams
1:44:04
entire training set of all the bigrams and you see now that we are decreasing
1:44:06
and you see now that we are decreasing
1:44:06
and you see now that we are decreasing very slightly so actually we can
1:44:08
very slightly so actually we can
1:44:08
very slightly so actually we can probably afford a larger learning rate
1:44:12
and probably afford an even larger learning
1:44:13
and probably afford an even larger learning
1:44:14
and probably afford an even larger learning rate
1:44:20
even 50 seems to work on this very very
1:44:22
even 50 seems to work on this very very
1:44:22
even 50 seems to work on this very very simple example right so let me
1:44:24
simple example right so let me
1:44:24
simple example right so let me re-initialize and let's run 100
1:44:26
re-initialize and let's run 100
1:44:26
re-initialize and let's run 100 iterations
1:44:29
iterations
1:44:29
iterations see what happens
1:44:32
okay
1:44:36
we seem to be
1:44:39
we seem to be
1:44:39
we seem to be coming up to some pretty good losses
1:44:40
coming up to some pretty good losses
1:44:40
coming up to some pretty good losses here 2.47
1:44:42
here 2.47
1:44:42
here 2.47 let me run 100 more
1:44:44
let me run 100 more
1:44:44
let me run 100 more what is the number that we expect by the
1:44:46
what is the number that we expect by the
1:44:46
what is the number that we expect by the way in the loss we expect to get
1:44:47
way in the loss we expect to get
1:44:48
way in the loss we expect to get something around what we had originally
1:44:50
something around what we had originally
1:44:50
something around what we had originally actually
1:44:51
actually
1:44:52
actually so all the way back if you remember in
1:44:53
so all the way back if you remember in
1:44:53
so all the way back if you remember in the beginning of this video when we
1:44:55
the beginning of this video when we
1:44:55
the beginning of this video when we optimized uh just by counting
1:44:58
optimized uh just by counting
1:44:58
optimized uh just by counting our loss was roughly 2.47
1:45:01
our loss was roughly 2.47
1:45:01
our loss was roughly 2.47 after we added smoothing
1:45:03
after we added smoothing
1:45:03
after we added smoothing but before smoothing we had roughly 2.45
1:45:06
but before smoothing we had roughly 2.45
1:45:06
but before smoothing we had roughly 2.45 likelihood
1:45:08
likelihood
1:45:08
likelihood sorry loss
1:45:09
sorry loss
1:45:09
sorry loss and so that's actually roughly the
1:45:10
and so that's actually roughly the
1:45:10
and so that's actually roughly the vicinity of what we expect to achieve
1:45:13
vicinity of what we expect to achieve
1:45:13
vicinity of what we expect to achieve but before we achieved it by counting
1:45:15
but before we achieved it by counting
1:45:15
but before we achieved it by counting and here we are achieving roughly
1:45:17
and here we are achieving roughly
1:45:17
and here we are achieving roughly the same result but with gradient based
1:45:19
the same result but with gradient based
1:45:19
the same result but with gradient based optimization
1:45:20
optimization
1:45:20
optimization so we come to about
1:45:23
so we come to about
1:45:23
so we come to about 2.46 2.45 etc
1:45:26
2.46 2.45 etc
1:45:26
2.46 2.45 etc and that makes sense because
1:45:27
and that makes sense because
1:45:27
and that makes sense because fundamentally we're not taking any
1:45:28
fundamentally we're not taking any
1:45:28
fundamentally we're not taking any additional information we're still just
1:45:30
additional information we're still just
1:45:30
additional information we're still just taking in the previous character and
1:45:31
taking in the previous character and
1:45:31
taking in the previous character and trying to predict the next one but
1:45:33
trying to predict the next one but
1:45:33
trying to predict the next one but instead of doing it explicitly by
1:45:35
instead of doing it explicitly by
1:45:35
instead of doing it explicitly by counting and normalizing
1:45:38
counting and normalizing
1:45:38
counting and normalizing we are doing it with gradient-based
1:45:39
we are doing it with gradient-based
1:45:39
we are doing it with gradient-based learning and it just so happens that the
1:45:41
learning and it just so happens that the
1:45:41
learning and it just so happens that the explicit approach happens to very well
1:45:43
explicit approach happens to very well
1:45:44
explicit approach happens to very well optimize the loss function without any
1:45:46
optimize the loss function without any
1:45:46
optimize the loss function without any need for a gradient based optimization
1:45:48
need for a gradient based optimization
1:45:48
need for a gradient based optimization because the setup for bigram language
1:45:50
because the setup for bigram language
1:45:50
because the setup for bigram language models is so straightforward it's
1:45:51
models is so straightforward it's
1:45:52
models is so straightforward it's so simple we can just afford to estimate
1:45:54
so simple we can just afford to estimate
1:45:54
so simple we can just afford to estimate those probabilities directly and
1:45:55
those probabilities directly and
1:45:56
those probabilities directly and maintain them
1:45:57
maintain them
1:45:57
maintain them in a table
1:45:58
in a table
1:45:58
in a table but the gradient-based approach is
1:46:00
but the gradient-based approach is
1:46:00
but the gradient-based approach is significantly more flexible
1:46:02
significantly more flexible
1:46:02
significantly more flexible so we've actually gained a lot
1:46:04
so we've actually gained a lot
1:46:04
so we've actually gained a lot because
1:46:06
because
1:46:06
because what we can do now is
1:46:09
what we can do now is
1:46:09
what we can do now is we can expand this approach and
1:46:11
we can expand this approach and
1:46:11
we can expand this approach and complexify the neural net so currently
1:46:13
complexify the neural net so currently
1:46:13
complexify the neural net so currently we're just taking a single character and
1:46:14
we're just taking a single character and
1:46:14
we're just taking a single character and feeding it into a neural net and the neural
1:46:16
feeding it into a neural net and the neural
1:46:16
feeding it into a neural net and the neural net is extremely simple but we're about
1:46:18
net is extremely simple but we're about
1:46:18
net is extremely simple but we're about to iterate on this substantially we're
1:46:20
to iterate on this substantially we're
1:46:20
to iterate on this substantially we're going to be taking multiple previous
1:46:22
going to be taking multiple previous
1:46:22
going to be taking multiple previous characters and we're going to be feeding
1:46:24
characters and we're going to be feeding
1:46:24
characters and we're going to be feeding them into increasingly more
1:46:26
them into increasingly more
1:46:26
them into increasingly more complex neural nets but fundamentally
1:46:28
complex neural nets but fundamentally
1:46:28
complex neural nets but fundamentally the output of the neural net will
1:46:30
the output of the neural net will
1:46:30
the output of the neural net will always just be logits
1:46:32
always just be logits
1:46:32
always just be logits and those logits will go through the
1:46:33
and those logits will go through the
1:46:34
and those logits will go through the exact same transformation we are going
1:46:35
exact same transformation we are going
1:46:36
exact same transformation we are going to take them through a softmax
1:46:37
to take them through a softmax
1:46:37
to take them through a softmax calculate the loss function as the
1:46:39
calculate the loss function as the
1:46:39
calculate the loss function as the negative log likelihood and do gradient
1:46:42
negative log likelihood and do gradient
1:46:42
negative log likelihood and do gradient based optimization and so actually
1:46:44
based optimization and so actually
1:46:44
based optimization and so actually as we complexify the neural nets and
1:46:47
as we complexify the neural nets and
1:46:47
as we complexify the neural nets and work all the way up to transformers
1:46:49
work all the way up to transformers
1:46:49
work all the way up to transformers none of this will really fundamentally
1:46:51
none of this will really fundamentally
1:46:51
none of this will really fundamentally change none of this will fundamentally
1:46:52
change none of this will fundamentally
1:46:52
change none of this will fundamentally change the only thing that will change
1:46:54
change the only thing that will change
1:46:54
change the only thing that will change is
1:46:55
is
1:46:55
is the way we do the forward pass where we
1:46:57
the way we do the forward pass where we
1:46:57
the way we do the forward pass where we take in some previous characters and
1:46:59
take in some previous characters and
1:46:59
take in some previous characters and calculate logits for the next character
1:47:01
calculate logits for the next character
1:47:01
calculate logits for the next character in the sequence that will become more
1:47:03
in the sequence that will become more
1:47:03
in the sequence that will become more complex
1:47:05
complex
1:47:05
complex and uh but we'll use the same machinery
1:47:07
and uh but we'll use the same machinery
1:47:07
and uh but we'll use the same machinery to optimize it
1:47:08
to optimize it
1:47:08
to optimize it and um
1:47:10
and um
1:47:10
and um it's not obvious how we would have
1:47:12
it's not obvious how we would have
1:47:12
it's not obvious how we would have extended
1:47:13
extended
1:47:13
extended this bigram approach
1:47:14
this bigram approach
1:47:14
this bigram approach into the case where there are many more
1:47:17
into the case where there are many more
1:47:17
into the case where there are many more characters at the input because
1:47:19
characters at the input because
1:47:19
characters at the input because eventually these tables would get way
1:47:21
eventually these tables would get way
1:47:21
eventually these tables would get way too large because there's way too many
1:47:23
too large because there's way too many
1:47:23
too large because there's way too many combinations of what previous characters
1:47:26
combinations of what previous characters
1:47:26
combinations of what previous characters could be
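To get a feel for how quickly such a count table grows, here is a bit of arithmetic. It assumes a vocabulary of 27 symbols (26 letters plus the '.' token); the function name `table_rows` is made up for this example.

```python
VOCAB = 27  # 26 letters plus the '.' boundary token

def table_rows(context_len: int) -> int:
    """Number of distinct contexts, i.e. rows a count table would need."""
    return VOCAB ** context_len

for n in (1, 2, 3, 10):
    rows = table_rows(n)
    # each row holds one count per possible next character
    print(n, rows, rows * VOCAB)
```

With one previous character the table has 27 rows and 729 entries; with ten previous characters the number of rows is 27 to the tenth power, which is far more than the number of bigrams in the dataset.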
1:47:27
could be
1:47:27
could be if you only have one previous character
1:47:29
if you only have one previous character
1:47:29
if you only have one previous character we can just keep everything in a table
1:47:31
we can just keep everything in a table
1:47:31
we can just keep everything in a table that counts but if you have the last 10
1:47:33
that counts but if you have the last 10
1:47:33
that counts but if you have the last 10 characters that are input we can't
1:47:35
characters that are input we can't
1:47:35
characters that are input we can't actually keep everything in the table
1:47:36
actually keep everything in the table
1:47:36
actually keep everything in the table anymore so this is fundamentally an
1:47:38
anymore so this is fundamentally an
1:47:38
anymore so this is fundamentally an unscalable approach and the neural
1:47:40
unscalable approach and the neural
1:47:40
unscalable approach and the neural network approach is significantly more
1:47:42
network approach is significantly more
1:47:42
network approach is significantly more scalable and it's something that
1:47:43
scalable and it's something that
1:47:44
scalable and it's something that actually we can improve on over time so
1:47:46
actually we can improve on over time so
1:47:46
actually we can improve on over time so that's where we will be digging next i
1:47:48
that's where we will be digging next i
1:47:48
that's where we will be digging next i wanted to point out two more things
1:47:51
wanted to point out two more things
1:47:51
wanted to point out two more things number one
1:47:52
number one
1:47:52
number one i want you to notice that
1:47:53
i want you to notice that
1:47:54
i want you to notice that this
1:47:55
this
1:47:55
this xenc here
1:47:56
xenc here
1:47:56
xenc here this is made up of one hot vectors and
1:47:59
this is made up of one hot vectors and
1:47:59
this is made up of one hot vectors and then those one hot vectors are
1:48:00
then those one hot vectors are
1:48:00
then those one hot vectors are multiplied by this w matrix
1:48:03
multiplied by this w matrix
1:48:03
multiplied by this w matrix and we think of this as multiple neurons
1:48:05
and we think of this as multiple neurons
1:48:05
and we think of this as multiple neurons being forwarded in a fully connected
1:48:07
being forwarded in a fully connected
1:48:07
being forwarded in a fully connected manner
1:48:08
manner
1:48:08
manner but actually what's happening here is
1:48:09
but actually what's happening here is
1:48:10
but actually what's happening here is that for example
1:48:11
that for example
1:48:11
that for example if you have a one hot vector here that
1:48:14
if you have a one hot vector here that
1:48:14
if you have a one hot vector here that has a one at say the fifth dimension
1:48:17
has a one at say the fifth dimension
1:48:17
has a one at say the fifth dimension then because of the way the matrix
1:48:19
then because of the way the matrix
1:48:19
then because of the way the matrix multiplication works
1:48:21
multiplication works
1:48:21
multiplication works multiplying that one hot vector with w
1:48:23
multiplying that one hot vector with w
1:48:23
multiplying that one hot vector with w actually ends up plucking out the fifth
1:48:25
actually ends up plucking out the fifth
1:48:25
actually ends up plucking out the fifth row of w
1:48:27
row of w
1:48:27
row of w logits would become just the fifth
1:48:29
logits would become just the fifth
1:48:29
logits would become just the fifth row of w
1:48:31
row of w
1:48:31
row of w and that's because of the way the matrix
1:48:32
and that's because of the way the matrix
1:48:32
and that's because of the way the matrix multiplication works
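The row-plucking effect is easy to confirm. In this sketch the index `ix = 5` is an arbitrary choice (zero-based, so it may not be the exact "fifth dimension" meant above), and `W` is just a random matrix.

```python
import torch
import torch.nn.functional as F

W = torch.randn((27, 27))
ix = 5  # a hypothetical character index

xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()  # shape (1, 27)
logits = xenc @ W                                             # shape (1, 27)

# Multiplying by a one-hot vector selects the matching row of W.
print(torch.allclose(logits[0], W[ix]))
```

So `xenc @ W` gives the same values as indexing `W[ix]` directly; the matrix multiply is a lookup in disguise.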
1:48:35
multiplication works
1:48:35
multiplication works um
1:48:36
um
1:48:36
um so
1:48:37
so
1:48:37
so that's actually what ends up happening
1:48:39
that's actually what ends up happening
1:48:40
that's actually what ends up happening so but that's actually exactly what
1:48:41
so but that's actually exactly what
1:48:42
so but that's actually exactly what happened before
1:48:43
happened before
1:48:43
happened before because remember all the way up here
1:48:46
because remember all the way up here
1:48:46
because remember all the way up here we have a bigram we took the first
1:48:48
we have a bigram we took the first
1:48:48
we have a bigram we took the first character and then that first character
1:48:50
character and then that first character
1:48:50
character and then that first character indexed into a row of this array here
1:48:54
indexed into a row of this array here
1:48:54
indexed into a row of this array here and that row gave us the probability
1:48:56
and that row gave us the probability
1:48:56
and that row gave us the probability distribution for the next character so
1:48:58
distribution for the next character so
1:48:58
distribution for the next character so the first character was used as a lookup
1:49:01
the first character was used as a lookup
1:49:01
the first character was used as a lookup into a
1:49:03
into a
1:49:03
into a matrix here to get the probability
1:49:05
matrix here to get the probability
1:49:05
matrix here to get the probability distribution
1:49:06
distribution
1:49:06
distribution well that's actually exactly what's
1:49:07
well that's actually exactly what's
1:49:07
well that's actually exactly what's happening here because we're taking the
1:49:09
happening here because we're taking the
1:49:09
happening here because we're taking the index we're encoding it as one hot and
1:49:11
index we're encoding it as one hot and
1:49:11
index we're encoding it as one hot and multiplying it by w
1:49:13
multiplying it by w
1:49:13
multiplying it by w so logits literally becomes
1:49:15
so logits literally becomes
1:49:15
so logits literally becomes the
1:49:18
the appropriate row of w
1:49:20
the appropriate row of w
1:49:20
the appropriate row of w and that gets just as before
1:49:22
and that gets just as before
1:49:22
and that gets just as before exponentiated to create the counts
1:49:24
exponentiated to create the counts
1:49:24
exponentiated to create the counts and then normalized and becomes
1:49:26
and then normalized and becomes
1:49:26
and then normalized and becomes probability
1:49:27
probability
1:49:27
probability so this w here
1:49:29
so this w here
1:49:29
so this w here is literally
1:49:31
is literally
1:49:31
is literally the same as this array here
1:49:35
the same as this array here
1:49:35
the same as this array here but w remember is the log counts not the
1:49:38
but w remember is the log counts not the
1:49:38
but w remember is the log counts not the counts so it's more precise to say that
1:49:40
counts so it's more precise to say that
1:49:40
counts so it's more precise to say that w exponentiated
1:49:42
w exponentiated
1:49:42
w exponentiated w.exp() is this array
1:49:46
w.exp() is this array
1:49:46
w.exp() is this array but this array was filled in by counting
1:49:49
but this array was filled in by counting
1:49:49
but this array was filled in by counting and by
1:49:50
and by
1:49:50
and by basically
1:49:51
basically
1:49:51
basically populating the counts of bigrams
1:49:53
populating the counts of bigrams
1:49:53
populating the counts of bigrams whereas in the gradient-based framework
1:49:55
whereas in the gradient-based framework
1:49:55
whereas in the gradient-based framework we initialize it randomly and then we
1:49:57
we initialize it randomly and then we
1:49:57
we initialize it randomly and then we let the loss
1:49:59
let the loss
1:49:59
let the loss guide us
1:50:00
guide us
1:50:00
guide us to arrive at the exact same array
1:50:03
to arrive at the exact same array
1:50:03
to arrive at the exact same array so this array exactly here
1:50:05
so this array exactly here
1:50:05
so this array exactly here is
1:50:06
is
1:50:06
is basically the array w at the end of
1:50:09
basically the array w at the end of
1:50:09
basically the array w at the end of optimization except we arrived at it
1:50:12
optimization except we arrived at it
1:50:12
optimization except we arrived at it piece by piece by following the loss
1:50:14
piece by piece by following the loss
1:50:14
piece by piece by following the loss and that's why we also obtain the same
1:50:16
and that's why we also obtain the same
1:50:16
and that's why we also obtain the same loss function at the end and the second
1:50:18
The second note is this. If I come here, remember the smoothing, where we added fake counts to our counts in order to smooth out and make more uniform the distributions of these probabilities? That prevented us from assigning zero probability to any one bigram.
1:50:40
Now, if I increase the count here, what's happening to the probability? As I increase the count, the probability becomes more and more uniform, because these counts go only up to something like 900. So if I'm adding a million to every single number here, you can see how each row, once we divide to get its probabilities, is going to get closer and closer to an exactly even, uniform distribution.
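The effect of the fake counts can be sketched on a single row. The counts below are made up (the largest is 900 only to echo the remark above), and smoothed_probs is a hypothetical helper name, not code from the lecture.

```python
# One row of made-up bigram counts, including a bigram that never occurred.
row = [0, 3, 120, 900, 27]

def smoothed_probs(counts, k):
    """Add k fake counts to every entry, then normalize the row."""
    bumped = [c + k for c in counts]
    total = sum(bumped)
    return [b / total for b in bumped]

for k in (0, 1, 1000, 1_000_000):
    probs = smoothed_probs(row, k)
    spread = max(probs) - min(probs)  # 0 would mean exactly uniform
    print(k, [round(p, 4) for p in probs], round(spread, 6))

p0 = smoothed_probs(row, 0)            # no smoothing: unseen bigram gets probability 0
p1 = smoothed_probs(row, 1)            # add-one smoothing: every bigram gets some mass
p_big = smoothed_probs(row, 1_000_000) # huge fake counts: nearly uniform
```

With k = 0 the unseen bigram has probability exactly zero; any positive k removes that zero, and as k grows the spread between the largest and smallest probability shrinks toward zero.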
1:51:06
It turns out that the gradient-based framework has an equivalent to smoothing. In particular, think through these W's here, which we initialized randomly. We could also think about initializing the W's to be zero. If all the entries of W are zero, then you'll see that the logits become all zero, exponentiating those logits gives all ones, and the probabilities turn out to be exactly uniform.
1:51:37
So basically, when the W's are all equal to each other, or especially all zero, the probabilities come out completely uniform. So trying to incentivize W to be near zero is basically equivalent to label smoothing, and the more you incentivize that in the loss function, the smoother the distribution you're going to achieve.
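A small check of the claim about zero (or all-equal) weights, assuming a row of 27 logits to match the lecture's vocabulary of 26 letters plus the '.' token. The softmax helper is written out by hand here as an illustration.

```python
import math

def softmax(logits):
    """Exponentiate the logits to get 'counts', then normalize."""
    counts = [math.exp(z) for z in logits]
    total = sum(counts)
    return [c / total for c in counts]

V = 27  # assumed vocabulary size: 26 letters plus the '.' token

zero_row = [0.0] * V    # logits when every weight is zero
const_row = [1.5] * V   # logits when every weight equals the same constant

p_zero = softmax(zero_row)    # exp(0) = 1 everywhere, so each entry is 1/V
p_const = softmax(const_row)  # equal logits also give 1/V each

print(p_zero[:3], p_const[:3])
```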
1:52:00
So this brings us to something that's called regularization, where we can actually augment the loss function to have a small component that we call a regularization loss.
1:52:10
In particular, what we're going to do is take W and, for example, square all of its entries, and then we can take all the entries of W and sum them. Because we're squaring, there will be no signs anymore: negatives and positives all get squashed to be positive numbers. The way this works is that you achieve zero loss if W is exactly zero, but if W has non-zero numbers, you accumulate loss.
1:52:42
And so we can actually take this and add it on here. So we can do something like loss plus (W**2).sum(), or, instead of sum, let's take a mean, because otherwise the sum gets too large. So mean is a little bit more manageable. And then we have a regularization loss here, say 0.01 times that, or something like that; you can choose the regularization strength.
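The regularization term just described, written out in plain Python on a tiny made-up weight matrix. In the lecture's PyTorch code the same idea is the expression 0.01 * (W**2).mean() added to the loss. The helper name l2_penalty and the placeholder data_loss value are assumptions for illustration only.

```python
def l2_penalty(W, reduce="mean"):
    """Square every entry of W, then take the mean (or the sum) of the squares."""
    squares = [w * w for row in W for w in row]
    total = sum(squares)
    return total / len(squares) if reduce == "mean" else total

W_zero = [[0.0, 0.0], [0.0, 0.0]]   # all-zero weights: no penalty
W_some = [[1.0, -2.0], [0.5, 3.0]]  # signs disappear once the entries are squared

data_loss = 2.5      # placeholder for the negative log likelihood term
reg_strength = 0.01  # the regularization strength mentioned above

total_loss = data_loss + reg_strength * l2_penalty(W_some)

print(l2_penalty(W_zero), l2_penalty(W_some), l2_penalty(W_some, "sum"))
print(total_loss)
```

The sum grows with the number of entries in W, while the mean stays on the scale of a single squared weight, which is why the mean is easier to balance against the data loss.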
1:53:12
And then we can just optimize this. Now this optimization actually has two components. Not only is it trying to make all the probabilities work out, but in addition to that there's a component that simultaneously tries to make all the W's be zero, because if the W's are non-zero, you feel a loss. And so to minimize this term, the only way to achieve that is for W to be zero.
1:53:32
And so you can think of this as adding a spring force, or a gravity force, that pushes W to be zero. So W wants to be zero and the probabilities want to be uniform, but they also simultaneously want to match up the probabilities as indicated by the data.
1:53:49
And so the strength of this regularization is exactly controlling the amount of counts that you add here. Adding a lot more counts here corresponds to increasing this number, because the more you increase it, the more this part of the loss function dominates this part, and the more these weights will be unable to grow, because as they grow they accumulate way too much loss.
1:54:21
And so if this is strong enough, then we are not able to overcome the force of this loss, and basically everything will be uniform predictions. So I thought that's kind of cool.
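To see the tug of war between the data term and the regularization term, here is a rough sketch that fits a single row of logits by plain gradient descent for a few regularization strengths. Everything in it is an assumption made for illustration: the three made-up counts, the fit_row helper, the learning rate and the step count. It shows the qualitative trend only (stronger regularization gives flatter predictions), not an exact mapping from a strength to a number of fake counts.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fit_row(counts, reg_strength, steps=20000, lr=0.01):
    """Minimize mean NLL of the observed next characters
    plus reg_strength * mean(w**2), for one row of logits w."""
    n = len(counts)
    total = sum(counts)
    target = [c / total for c in counts]  # empirical next-character distribution
    w = [0.0] * n
    for _ in range(steps):
        p = softmax(w)
        # Gradient of the NLL term is p - target;
        # gradient of the penalty term is reg_strength * 2 * w / n.
        grad = [p[j] - target[j] + reg_strength * 2.0 * w[j] / n for j in range(n)]
        w = [w[j] - lr * grad[j] for j in range(n)]
    return softmax(w)

counts = [1, 2, 7]  # made-up counts for one context

p_none = fit_row(counts, 0.0)      # no regularization: matches the data
p_mid = fit_row(counts, 1.0)       # moderate pull toward zero weights
p_strong = fit_row(counts, 100.0)  # strong pull: close to uniform

for name, p in (("0.0", p_none), ("1.0", p_mid), ("100.0", p_strong)):
    print(name, [round(v, 3) for v in p])
```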
1:54:33
And lastly, before we wrap up, I wanted to show you how you would sample from this neural net model. I copy-pasted the sampling code from before, where, remember, we sampled five times. All we did was start at zero, grab the current ix row of P, and that was our probability row, from which we sampled the next index, accumulated that, and broke when we hit zero. Running this gave us these results. I still have the P in memory, so this is fine.
1:55:09
Now, this p doesn't come from the row of P; instead, it comes from this neural net. First we take ix and we encode it into a one-hot row, xenc. This xenc multiplies our W, which really just plucks out the row of W corresponding to ix; really, that's what's happening. That gets our logits, and then we normalize those logits: exponentiate to get counts, and then normalize to get the distribution. And then we can sample from the distribution. So if I run this...
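The sampling loop described above can be sketched in plain Python. This is not the lecture's PyTorch code: the three-character vocabulary, the small weight matrix W, the helper names and the max_len safety cap are all hypothetical stand-ins, and random.choices plays the role of the multinomial sampling step.

```python
import math
import random

chars = ['.', 'a', 'b']  # toy vocabulary; index 0 is the start/end token

# Hypothetical 3x3 weights standing in for the trained 27x27 W.
W = [[-5.0, 1.0, 0.5],
     [0.5, -1.0, 1.0],
     [1.0, 0.5, -1.0]]

def one_hot(ix, n):
    """Row vector of length n with a 1.0 at position ix."""
    return [1.0 if j == ix else 0.0 for j in range(n)]

def vec_mat(x, M):
    """Row vector times matrix."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

def softmax(logits):
    m = max(logits)
    counts = [math.exp(z - m) for z in logits]  # "counts"
    total = sum(counts)
    return [c / total for c in counts]          # probabilities

def sample_name(rng, max_len=50):
    """Sample characters until the end token; max_len is an added safety cap."""
    out = []
    ix = 0  # start at the '.' token
    while len(out) < max_len:
        xenc = one_hot(ix, len(W))
        logits = vec_mat(xenc, W)  # same values as row ix of W
        p = softmax(logits)
        ix = rng.choices(range(len(p)), weights=p, k=1)[0]
        if ix == 0:
            break
        out.append(chars[ix])
    return ''.join(out)

rng = random.Random(42)
names = [sample_name(rng) for _ in range(5)]
print(names)
```

Because the one-hot product only selects a row, indexing W[ix] directly would give the same logits; the matrix form is kept here to mirror the description.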
1:55:45
Kind of anticlimactic or climactic, depending on how you look at it, but we get the exact same result. And that's because this is an identical model. Not only does it achieve the same loss, but, as I mentioned, these are identical models, and this W is the log counts of what we estimated before. We came to this answer in a very different way, and it's got a very different interpretation, but fundamentally this is basically the same model and gives the same samples here. And so that's kind of cool.
1:56:16
Okay, so we've actually covered a lot of ground. We introduced the bigram character-level language model. We saw how we can train the model, how we can sample from the model, and how we can evaluate the quality of the model using the negative log likelihood loss. And then we actually trained the model in two completely different ways that get the same result and the same model.
1:56:38
In the first way, we just counted up the frequency of all the bigrams and normalized. In the second way, we used the negative log likelihood loss as a guide to optimizing the counts matrix, or the counts array, so that the loss is minimized in a gradient-based framework. And we saw that both of them give the same result, and that's it.
1:57:02
Now, the second one of these, the gradient-based framework, is much more flexible. Right now our neural network is super simple: we're taking a single previous character and we're taking it through a single linear layer to calculate the logits.
1:57:16
This is about to complexify. In the follow-up videos we're going to be taking more and more of these characters and feeding them into a neural net, but this neural net will still output the exact same thing: the neural net will output logits. These logits will still be normalized in the exact same way, and all the loss and everything else in the gradient-based framework stays identical. It's just that this neural net will now complexify all the way to transformers. So that's going to be pretty awesome, and I'm looking forward to it. For now, bye.