View on GitHub
GitHub
Neural Networks: Zero to Hero
Let's build the GPT Tokenizer
Loading player
Notes
Transcript
6831 segments
0:00
hi everyone so in this video I'd like us
0:02
hi everyone so in this video I'd like us
0:02
hi everyone so in this video I'd like us to cover the process of tokenization in
0:04
to cover the process of tokenization in
0:04
to cover the process of tokenization in large language models now you see here
0:06
large language models now you see here
0:06
large language models now you see here that I have a set face and that's
0:08
that I have a set face and that's
0:08
that I have a set face and that's because uh tokenization is my least
0:10
because uh tokenization is my least
0:10
because uh tokenization is my least favorite part of working with large
0:11
favorite part of working with large
0:11
favorite part of working with large language models but unfortunately it is
0:13
language models but unfortunately it is
0:13
language models but unfortunately it is necessary to understand in some detail
0:15
necessary to understand in some detail
0:15
necessary to understand in some detail because it it is fairly hairy gnarly and
0:17
because it it is fairly hairy gnarly and
0:17
because it it is fairly hairy gnarly and there's a lot of hidden foot guns to be
0:19
there's a lot of hidden foot guns to be
0:19
there's a lot of hidden foot guns to be aware of and a lot of oddness with large
0:21
aware of and a lot of oddness with large
0:21
aware of and a lot of oddness with large language models typically traces back to
0:24
language models typically traces back to
0:24
language models typically traces back to tokenization so what is
0:26
tokenization so what is
0:26
tokenization so what is tokenization now in my previous video
0:28
tokenization now in my previous video
0:28
tokenization now in my previous video Let's Build GPT from scratch uh we
0:31
Let's Build GPT from scratch uh we
0:31
Let's Build GPT from scratch uh we actually already did tokenization but we
0:33
actually already did tokenization but we
0:33
actually already did tokenization but we did a very naive simple version of
0:35
did a very naive simple version of
0:35
did a very naive simple version of tokenization so when you go to the
0:37
tokenization so when you go to the
0:37
tokenization so when you go to the Google colab for that video uh you see
0:40
Google colab for that video uh you see
0:40
Google colab for that video uh you see here that we loaded our training set and
0:43
here that we loaded our training set and
0:43
here that we loaded our training set and our training set was this uh Shakespeare
0:45
our training set was this uh Shakespeare
0:45
our training set was this uh Shakespeare uh data set now in the beginning the
0:48
uh data set now in the beginning the
0:48
uh data set now in the beginning the Shakespeare data set is just a large
0:49
Shakespeare data set is just a large
0:49
Shakespeare data set is just a large string in Python it's just text and so
0:52
string in Python it's just text and so
0:52
string in Python it's just text and so the question is how do we plug text into
0:54
the question is how do we plug text into
0:54
the question is how do we plug text into large language models and in this case
0:58
large language models and in this case
0:58
large language models and in this case here we created a vocabulary of 65
1:01
here we created a vocabulary of 65
1:01
here we created a vocabulary of 65 possible characters that we saw occur in
1:03
possible characters that we saw occur in
1:03
possible characters that we saw occur in this string these were the possible
1:05
this string these were the possible
1:05
this string these were the possible characters and we saw that there are 65
1:07
characters and we saw that there are 65
1:07
characters and we saw that there are 65 of them and then we created a a lookup
1:10
of them and then we created a a lookup
1:10
of them and then we created a a lookup table for converting from every possible
1:13
table for converting from every possible
1:13
table for converting from every possible character a little string piece into a
1:16
character a little string piece into a
1:16
character a little string piece into a token an
1:17
token an
1:17
token an integer so here for example we tokenized
1:20
integer so here for example we tokenized
1:20
integer so here for example we tokenized the string High there and we received
1:23
the string High there and we received
1:23
the string High there and we received this sequence of
1:24
this sequence of
1:24
this sequence of tokens and here we took the first 1,000
1:27
tokens and here we took the first 1,000
1:27
tokens and here we took the first 1,000 characters of our data set and we
1:29
characters of our data set and we
1:29
characters of our data set and we encoded it into tokens and because it is
1:32
encoded it into tokens and because it is
1:32
encoded it into tokens and because it is this is character level we received
1:34
this is character level we received
1:34
this is character level we received 1,000 tokens in a sequence so token 18
1:38
1,000 tokens in a sequence so token 18
1:38
1,000 tokens in a sequence so token 18 47
1:40
47
1:40
47 Etc now later we saw that the way we
1:43
Etc now later we saw that the way we
1:43
Etc now later we saw that the way we plug these tokens into the language
1:45
plug these tokens into the language
1:45
plug these tokens into the language model is by using an embedding
1:48
model is by using an embedding
1:48
model is by using an embedding table and so basically if we have 65
1:51
table and so basically if we have 65
1:51
table and so basically if we have 65 possible tokens then this embedding
1:53
possible tokens then this embedding
1:53
possible tokens then this embedding table is going to have 65 rows and
1:56
table is going to have 65 rows and
1:56
table is going to have 65 rows and roughly speaking we're taking the
1:58
roughly speaking we're taking the
1:58
roughly speaking we're taking the integer associated with every single
1:59
integer associated with every single
1:59
integer associated with every single sing Le token we're using that as a
2:01
sing Le token we're using that as a
2:01
sing Le token we're using that as a lookup into this table and we're
2:04
lookup into this table and we're
2:04
lookup into this table and we're plucking out the corresponding row and
2:06
plucking out the corresponding row and
2:06
plucking out the corresponding row and this row is a uh is trainable parameters
2:09
this row is a uh is trainable parameters
2:09
this row is a uh is trainable parameters that we're going to train using back
2:10
that we're going to train using back
2:10
that we're going to train using back propagation and this is the vector that
2:12
propagation and this is the vector that
2:12
propagation and this is the vector that then feeds into the Transformer um and
2:15
then feeds into the Transformer um and
2:15
then feeds into the Transformer um and that's how the Transformer Ser of
2:16
that's how the Transformer Ser of
2:16
that's how the Transformer Ser of perceives every single
2:18
perceives every single
2:18
perceives every single token so here we had a very naive
2:21
token so here we had a very naive
2:21
token so here we had a very naive tokenization process that was a
2:23
tokenization process that was a
2:23
tokenization process that was a character level tokenizer but in
2:25
character level tokenizer but in
2:25
character level tokenizer but in practice in state-ofthe-art uh language
2:27
practice in state-ofthe-art uh language
2:27
practice in state-ofthe-art uh language models people use a lot more complicated
2:28
models people use a lot more complicated
2:28
models people use a lot more complicated schemes unfortunately
2:30
schemes unfortunately
2:30
schemes unfortunately uh for constructing these uh token
2:34
uh for constructing these uh token
2:34
uh for constructing these uh token vocabularies so we're not dealing on the
2:36
vocabularies so we're not dealing on the
2:36
vocabularies so we're not dealing on the Character level we're dealing on chunk
2:38
Character level we're dealing on chunk
2:38
Character level we're dealing on chunk level and the way these um character
2:41
level and the way these um character
2:41
level and the way these um character chunks are constructed is using
2:43
chunks are constructed is using
2:43
chunks are constructed is using algorithms such as for example the bik
2:45
algorithms such as for example the bik
2:45
algorithms such as for example the bik pair in coding algorithm which we're
2:46
pair in coding algorithm which we're
2:46
pair in coding algorithm which we're going to go into in detail um and cover
2:50
going to go into in detail um and cover
2:51
going to go into in detail um and cover in this video I'd like to briefly show
2:52
in this video I'd like to briefly show
2:52
in this video I'd like to briefly show you the paper that introduced a bite
2:54
you the paper that introduced a bite
2:54
you the paper that introduced a bite level encoding as a mechanism for
2:56
level encoding as a mechanism for
2:56
level encoding as a mechanism for tokenization in the context of large
2:58
tokenization in the context of large
2:58
tokenization in the context of large language models and I would say that
3:00
language models and I would say that
3:00
language models and I would say that that's probably the gpt2 paper and if
3:02
that's probably the gpt2 paper and if
3:02
that's probably the gpt2 paper and if you scroll down here to the section
3:05
you scroll down here to the section
3:05
you scroll down here to the section input representation this is where they
3:07
input representation this is where they
3:07
input representation this is where they cover tokenization the kinds of
3:09
cover tokenization the kinds of
3:09
cover tokenization the kinds of properties that you'd like the
3:10
properties that you'd like the
3:10
properties that you'd like the tokenization to have and they conclude
3:12
tokenization to have and they conclude
3:13
tokenization to have and they conclude here that they're going to have a
3:14
here that they're going to have a
3:14
here that they're going to have a tokenizer where you have a vocabulary of
3:17
tokenizer where you have a vocabulary of
3:17
tokenizer where you have a vocabulary of 50,2 57 possible
3:20
50,2 57 possible
3:20
50,2 57 possible tokens and the context size is going to
3:24
tokens and the context size is going to
3:24
tokens and the context size is going to be 1,24 tokens so in the in in the
3:27
be 1,24 tokens so in the in in the
3:27
be 1,24 tokens so in the in in the attention layer of the Transformer
3:29
attention layer of the Transformer
3:29
attention layer of the Transformer neural network
3:30
neural network
3:30
neural network every single token is attending to the
3:32
every single token is attending to the
3:32
every single token is attending to the previous tokens in the sequence and it's
3:34
previous tokens in the sequence and it's
3:34
previous tokens in the sequence and it's going to see up to 1,24 tokens so tokens
3:37
going to see up to 1,24 tokens so tokens
3:37
going to see up to 1,24 tokens so tokens are this like fundamental unit um the
3:40
are this like fundamental unit um the
3:40
are this like fundamental unit um the atom of uh large language models if you
3:43
atom of uh large language models if you
3:43
atom of uh large language models if you will and everything is in units of
3:44
will and everything is in units of
3:44
will and everything is in units of tokens everything is about tokens and
3:47
tokens everything is about tokens and
3:47
tokens everything is about tokens and tokenization is the process for
3:48
tokenization is the process for
3:48
tokenization is the process for translating strings or text into
3:51
translating strings or text into
3:51
translating strings or text into sequences of tokens and uh vice versa
3:54
sequences of tokens and uh vice versa
3:54
sequences of tokens and uh vice versa when you go into the Llama 2 paper as
3:56
when you go into the Llama 2 paper as
3:56
when you go into the Llama 2 paper as well I can show you that when you search
3:58
well I can show you that when you search
3:58
well I can show you that when you search token you're going to get get 63 hits um
4:01
token you're going to get get 63 hits um
4:01
token you're going to get get 63 hits um and that's because tokens are again
4:03
and that's because tokens are again
4:03
and that's because tokens are again pervasive so here they mentioned that
4:05
pervasive so here they mentioned that
4:05
pervasive so here they mentioned that they trained on two trillion tokens of
4:06
they trained on two trillion tokens of
4:06
they trained on two trillion tokens of data and so
4:08
data and so
4:08
data and so on so we're going to build our own
4:11
on so we're going to build our own
4:11
on so we're going to build our own tokenizer luckily the bite be encoding
4:13
tokenizer luckily the bite be encoding
4:13
tokenizer luckily the bite be encoding algorithm is not uh that super
4:15
algorithm is not uh that super
4:15
algorithm is not uh that super complicated and we can build it from
4:16
complicated and we can build it from
4:16
complicated and we can build it from scratch ourselves and we'll see exactly
4:18
scratch ourselves and we'll see exactly
4:18
scratch ourselves and we'll see exactly how this works before we dive into code
4:20
how this works before we dive into code
4:20
how this works before we dive into code I'd like to give you a brief Taste of
4:22
I'd like to give you a brief Taste of
4:22
I'd like to give you a brief Taste of some of the complexities that come from
4:24
some of the complexities that come from
4:24
some of the complexities that come from the tokenization because I just want to
4:26
the tokenization because I just want to
4:26
the tokenization because I just want to make sure that we motivate it
4:27
make sure that we motivate it
4:27
make sure that we motivate it sufficiently for why we are doing all
4:29
sufficiently for why we are doing all
4:29
sufficiently for why we are doing all this and why this is so gross so
4:32
this and why this is so gross so
4:32
this and why this is so gross so tokenization is at the heart of a lot of
4:34
tokenization is at the heart of a lot of
4:34
tokenization is at the heart of a lot of weirdness in large language models and I
4:36
weirdness in large language models and I
4:36
weirdness in large language models and I would advise that you do not brush it
4:37
would advise that you do not brush it
4:37
would advise that you do not brush it off a lot of the issues that may look
4:40
off a lot of the issues that may look
4:40
off a lot of the issues that may look like just issues with the new network
4:42
like just issues with the new network
4:42
like just issues with the new network architecture or the large language model
4:44
architecture or the large language model
4:44
architecture or the large language model itself are actually issues with the
4:46
itself are actually issues with the
4:46
itself are actually issues with the tokenization and fundamentally Trace uh
4:49
tokenization and fundamentally Trace uh
4:49
tokenization and fundamentally Trace uh back to it so if you've noticed any
4:51
back to it so if you've noticed any
4:51
back to it so if you've noticed any issues with large language models can't
4:54
issues with large language models can't
4:54
issues with large language models can't you know not able to do spelling tasks
4:56
you know not able to do spelling tasks
4:56
you know not able to do spelling tasks very easily that's usually due to
4:57
very easily that's usually due to
4:57
very easily that's usually due to tokenization simple string processing
5:00
tokenization simple string processing
5:00
tokenization simple string processing can be difficult for the large language
5:02
can be difficult for the large language
5:02
can be difficult for the large language model to perform
5:03
model to perform
5:03
model to perform natively uh non-english languages can
5:06
natively uh non-english languages can
5:06
natively uh non-english languages can work much worse and to a large extent
5:08
work much worse and to a large extent
5:08
work much worse and to a large extent this is due to
5:09
this is due to
5:09
this is due to tokenization sometimes llms are bad at
5:11
tokenization sometimes llms are bad at
5:11
tokenization sometimes llms are bad at simple arithmetic also can trace be
5:14
simple arithmetic also can trace be
5:14
simple arithmetic also can trace be traced to
5:15
traced to
5:15
traced to tokenization uh gbt2 specifically would
5:17
tokenization uh gbt2 specifically would
5:17
tokenization uh gbt2 specifically would have had quite a bit more issues with
5:19
have had quite a bit more issues with
5:19
have had quite a bit more issues with python than uh future versions of it due
5:22
python than uh future versions of it due
5:22
python than uh future versions of it due to tokenization there's a lot of other
5:24
to tokenization there's a lot of other
5:24
to tokenization there's a lot of other issues maybe you've seen weird warnings
5:25
issues maybe you've seen weird warnings
5:25
issues maybe you've seen weird warnings about a trailing whites space this is a
5:27
about a trailing whites space this is a
5:27
about a trailing whites space this is a tokenization issue um
5:30
tokenization issue um
5:30
tokenization issue um if you had asked GPT earlier about solid
5:33
if you had asked GPT earlier about solid
5:33
if you had asked GPT earlier about solid gold Magikarp and what it is you would
5:35
gold Magikarp and what it is you would
5:35
gold Magikarp and what it is you would see the llm go totally crazy and it
5:37
see the llm go totally crazy and it
5:37
see the llm go totally crazy and it would start going off about a completely
5:39
would start going off about a completely
5:39
would start going off about a completely unrelated tangent topic maybe you've
5:41
unrelated tangent topic maybe you've
5:41
unrelated tangent topic maybe you've been told to use yl over Json in
5:43
been told to use yl over Json in
5:43
been told to use yl over Json in structure data all of that has to do
5:45
structure data all of that has to do
5:45
structure data all of that has to do with tokenization so basically
5:47
with tokenization so basically
5:47
with tokenization so basically tokenization is at the heart of many
5:49
tokenization is at the heart of many
5:49
tokenization is at the heart of many issues I will look back around to these
5:51
issues I will look back around to these
5:51
issues I will look back around to these at the end of the video but for now let
5:54
at the end of the video but for now let
5:54
at the end of the video but for now let me just um skip over it a little bit and
5:56
me just um skip over it a little bit and
5:56
me just um skip over it a little bit and let's go to this web app um the Tik
5:59
let's go to this web app um the Tik
5:59
let's go to this web app um the Tik tokenizer bell.app so I have it loaded
6:02
tokenizer bell.app so I have it loaded
6:02
tokenizer bell.app so I have it loaded here and what I like about this web app
6:04
here and what I like about this web app
6:04
here and what I like about this web app is that tokenization is running a sort
6:06
is that tokenization is running a sort
6:06
is that tokenization is running a sort of live in your browser in JavaScript so
6:09
of live in your browser in JavaScript so
6:09
of live in your browser in JavaScript so you can just type here stuff hello world
6:11
you can just type here stuff hello world
6:11
you can just type here stuff hello world and the whole string
6:14
and the whole string
6:14
and the whole string rokenes so here what we see on uh the
6:18
rokenes so here what we see on uh the
6:18
rokenes so here what we see on uh the left is a string that you put in on the
6:20
left is a string that you put in on the
6:20
left is a string that you put in on the right we're currently using the gpt2
6:22
right we're currently using the gpt2
6:22
right we're currently using the gpt2 tokenizer we see that this string that I
6:24
tokenizer we see that this string that I
6:24
tokenizer we see that this string that I pasted here is currently tokenizing into
6:27
pasted here is currently tokenizing into
6:27
pasted here is currently tokenizing into 300 tokens and here they are sort of uh
6:30
300 tokens and here they are sort of uh
6:30
300 tokens and here they are sort of uh shown explicitly in different colors for
6:32
shown explicitly in different colors for
6:32
shown explicitly in different colors for every single token so for example uh
6:35
every single token so for example uh
6:35
every single token so for example uh this word tokenization became two tokens
6:38
this word tokenization became two tokens
6:38
this word tokenization became two tokens the token
6:40
the token
6:40
the token 3,642 and
6:44
1,634 the token um space is is token 318
6:50
1,634 the token um space is is token 318
6:50
1,634 the token um space is is token 318 so be careful on the bottom you can show
6:51
so be careful on the bottom you can show
6:51
so be careful on the bottom you can show white space and keep in mind that there
6:54
white space and keep in mind that there
6:54
white space and keep in mind that there are spaces and uh sln new line
6:57
are spaces and uh sln new line
6:57
are spaces and uh sln new line characters in here but you can hide them
6:59
characters in here but you can hide them
6:59
characters in here but you can hide them for
7:01
for
7:01
for clarity the token space at is token 379
7:05
clarity the token space at is token 379
7:06
clarity the token space at is token 379 the to the Token space the is 262 Etc so
7:11
the to the Token space the is 262 Etc so
7:11
the to the Token space the is 262 Etc so you notice here that the space is part
7:12
you notice here that the space is part
7:12
you notice here that the space is part of that uh token
7:15
of that uh token
7:15
of that uh token chunk now so this is kind of like how
7:18
chunk now so this is kind of like how
7:18
chunk now so this is kind of like how our English sentence broke up and that
7:21
our English sentence broke up and that
7:21
our English sentence broke up and that seems all well and good now now here I
7:24
seems all well and good now now here I
7:24
seems all well and good now now here I put in some arithmetic so we see that uh
7:26
put in some arithmetic so we see that uh
7:26
put in some arithmetic so we see that uh the token 127 Plus and then token six
7:31
the token 127 Plus and then token six
7:31
the token 127 Plus and then token six space 6 followed by 77 so what's
7:34
space 6 followed by 77 so what's
7:34
space 6 followed by 77 so what's happening here is that 127 is feeding in
7:36
happening here is that 127 is feeding in
7:36
happening here is that 127 is feeding in as a single token into the large
7:38
as a single token into the large
7:38
as a single token into the large language model but the um number 677
7:42
language model but the um number 677
7:42
language model but the um number 677 will actually feed in as two separate
7:44
will actually feed in as two separate
7:44
will actually feed in as two separate tokens and so the large language model
7:46
tokens and so the large language model
7:47
tokens and so the large language model has to sort of um take account of that
7:50
has to sort of um take account of that
7:50
has to sort of um take account of that and process it correctly in its Network
7:53
and process it correctly in its Network
7:53
and process it correctly in its Network and see here 804 will be broken up into
7:56
and see here 804 will be broken up into
7:56
and see here 804 will be broken up into two tokens and it's is all completely
7:57
two tokens and it's is all completely
7:57
two tokens and it's is all completely arbitrary and here I have another
7:59
arbitrary and here I have another
7:59
arbitrary and here I have another example of four-digit numbers and they
8:02
example of four-digit numbers and they
8:02
example of four-digit numbers and they break up in a way that they break up and
8:03
break up in a way that they break up and
8:03
break up in a way that they break up and it's totally arbitrary sometimes you
8:05
it's totally arbitrary sometimes you
8:05
it's totally arbitrary sometimes you have um multiple digits single token
8:08
have um multiple digits single token
8:08
have um multiple digits single token sometimes you have individual digits as
8:10
sometimes you have individual digits as
8:10
sometimes you have individual digits as many tokens and it's all kind of pretty
8:12
many tokens and it's all kind of pretty
8:12
many tokens and it's all kind of pretty arbitrary and coming out of the
8:14
arbitrary and coming out of the
8:14
arbitrary and coming out of the tokenizer here's another example we have
8:17
tokenizer here's another example we have
8:17
tokenizer here's another example we have the string egg and you see here that
8:21
the string egg and you see here that
8:21
the string egg and you see here that this became two
8:22
this became two
8:22
this became two tokens but for some reason when I say I
8:24
tokens but for some reason when I say I
8:24
tokens but for some reason when I say I have an egg you see when it's a space
8:27
have an egg you see when it's a space
8:27
have an egg you see when it's a space egg it's two token it's sorry it's a
8:30
egg it's two token it's sorry it's a
8:30
egg it's two token it's sorry it's a single token so just egg by itself in
8:33
single token so just egg by itself in
8:33
single token so just egg by itself in the beginning of a sentence is two
8:34
the beginning of a sentence is two
8:34
the beginning of a sentence is two tokens but here as a space egg is
8:37
tokens but here as a space egg is
8:37
tokens but here as a space egg is suddenly a single token uh for the exact
8:40
suddenly a single token uh for the exact
8:40
suddenly a single token uh for the exact same string okay here lowercase egg
8:44
same string okay here lowercase egg
8:44
same string okay here lowercase egg turns out to be a single token and in
8:46
turns out to be a single token and in
8:46
turns out to be a single token and in particular notice that the color is
8:47
particular notice that the color is
8:47
particular notice that the color is different so this is a different token
8:49
different so this is a different token
8:49
different so this is a different token so this is case sensitive and of course
8:51
so this is case sensitive and of course
8:51
so this is case sensitive and of course a capital egg would also be different
8:54
a capital egg would also be different
8:54
a capital egg would also be different tokens and again um this would be two
8:57
tokens and again um this would be two
8:57
tokens and again um this would be two tokens arbitrarily so so for the same
9:00
tokens arbitrarily so so for the same
9:00
tokens arbitrarily so so for the same concept egg depending on if it's in the
9:02
concept egg depending on if it's in the
9:02
concept egg depending on if it's in the beginning of a sentence at the end of a
9:03
beginning of a sentence at the end of a
9:03
beginning of a sentence at the end of a sentence lowercase uppercase or mixed
9:06
sentence lowercase uppercase or mixed
9:06
sentence lowercase uppercase or mixed all this will be uh basically very
9:08
all this will be uh basically very
9:08
all this will be uh basically very different tokens and different IDs and
9:10
different tokens and different IDs and
9:10
different tokens and different IDs and the language model has to learn from raw
9:12
the language model has to learn from raw
9:12
the language model has to learn from raw data from all the internet text that
9:13
data from all the internet text that
9:13
data from all the internet text that it's going to be training on that these
9:15
it's going to be training on that these
9:15
it's going to be training on that these are actually all the exact same concept
9:17
are actually all the exact same concept
9:17
are actually all the exact same concept and it has to sort of group them in the
9:19
and it has to sort of group them in the
9:19
and it has to sort of group them in the parameters of the neural network and
9:21
parameters of the neural network and
9:21
parameters of the neural network and understand just based on the data
9:22
understand just based on the data
9:22
understand just based on the data patterns that these are all very similar
9:24
patterns that these are all very similar
9:24
patterns that these are all very similar but maybe not almost exactly similar but
9:27
but maybe not almost exactly similar but
9:27
but maybe not almost exactly similar but but very very similar
9:30
but very very similar
9:30
but very very similar um after the EG demonstration here I
9:32
um after the EG demonstration here I
9:32
um after the EG demonstration here I have um an introduction from open a eyes
9:35
have um an introduction from open a eyes
9:35
have um an introduction from open a eyes chbt in Korean so manaso Pang uh Etc uh
9:41
chbt in Korean so manaso Pang uh Etc uh
9:41
chbt in Korean so manaso Pang uh Etc uh so this is in Korean and the reason I
9:44
so this is in Korean and the reason I
9:44
so this is in Korean and the reason I put this here is because you'll notice
9:47
put this here is because you'll notice
9:47
put this here is because you'll notice that um non-english languages work
9:50
that um non-english languages work
9:51
that um non-english languages work slightly worse in Chachi part of this is
9:54
slightly worse in Chachi part of this is
9:54
slightly worse in Chachi part of this is because of course the training data set
9:55
because of course the training data set
9:55
because of course the training data set for Chachi is much larger for English
9:58
for Chachi is much larger for English
9:58
for Chachi is much larger for English and for everything else but the same is
9:59
and for everything else but the same is
9:59
and for everything else but the same is true not just for the large language
10:01
true not just for the large language
10:01
true not just for the large language model itself but also for the tokenizer
10:04
model itself but also for the tokenizer
10:04
model itself but also for the tokenizer so when we train the tokenizer we're
10:05
so when we train the tokenizer we're
10:05
so when we train the tokenizer we're going to see that there's a training set
10:07
going to see that there's a training set
10:07
going to see that there's a training set as well and there's a lot more English
10:09
as well and there's a lot more English
10:09
as well and there's a lot more English than non-english and what ends up
10:11
than non-english and what ends up
10:11
than non-english and what ends up happening is that we're going to have a
10:13
happening is that we're going to have a
10:13
happening is that we're going to have a lot more longer tokens for
10:16
lot more longer tokens for
10:16
lot more longer tokens for English so how do I put this if you have
10:19
English so how do I put this if you have
10:19
English so how do I put this if you have a single sentence in English and you
10:21
a single sentence in English and you
10:21
a single sentence in English and you tokenize it you might see that it's 10
10:23
tokenize it you might see that it's 10
10:23
tokenize it you might see that it's 10 tokens or something like that but if you
10:25
tokens or something like that but if you
10:25
tokens or something like that but if you translate that sentence into say Korean
10:27
translate that sentence into say Korean
10:27
translate that sentence into say Korean or Japanese or something else you'll
10:29
or Japanese or something else you'll
10:29
or Japanese or something else you'll typically see that the number of tokens
10:30
typically see that the number of tokens
10:30
typically see that the number of tokens used is much larger and that's because
10:33
used is much larger and that's because
10:33
used is much larger and that's because the chunks here are a lot more broken up
10:36
the chunks here are a lot more broken up
10:36
the chunks here are a lot more broken up so we're using a lot more tokens for the
10:38
so we're using a lot more tokens for the
10:38
so we're using a lot more tokens for the exact same thing and what this does is
10:41
exact same thing and what this does is
10:41
exact same thing and what this does is it bloats up the sequence length of all
10:43
it bloats up the sequence length of all
10:43
it bloats up the sequence length of all the documents so you're using up more
10:46
the documents so you're using up more
10:46
the documents so you're using up more tokens and then in the attention of the
10:48
tokens and then in the attention of the
10:48
tokens and then in the attention of the Transformer when these tokens try to
10:49
Transformer when these tokens try to
10:49
Transformer when these tokens try to attend each other you are running out of
10:51
attend each other you are running out of
10:51
attend each other you are running out of context um in the maximum context length
10:55
context um in the maximum context length
10:55
context um in the maximum context length of that Transformer and so basically all
10:57
of that Transformer and so basically all
10:57
of that Transformer and so basically all the non-english text is stretched out
11:01
the non-english text is stretched out
11:01
the non-english text is stretched out from the perspective of the Transformer
11:03
from the perspective of the Transformer
11:03
from the perspective of the Transformer and this just has to do with the um
11:05
and this just has to do with the um
11:05
and this just has to do with the um trainings that used for the tokenizer
11:07
trainings that used for the tokenizer
11:07
trainings that used for the tokenizer and the tokenization itself so it will
11:10
and the tokenization itself so it will
11:10
and the tokenization itself so it will create a lot bigger tokens and a lot
11:12
create a lot bigger tokens and a lot
11:12
create a lot bigger tokens and a lot larger groups in English and it will
11:14
larger groups in English and it will
11:14
larger groups in English and it will have a lot of little boundaries for all
11:16
have a lot of little boundaries for all
11:16
have a lot of little boundaries for all the other non-english text um so if we
11:19
the other non-english text um so if we
11:19
the other non-english text um so if we translated this into English it would be
11:21
translated this into English it would be
11:21
translated this into English it would be significantly fewer
11:23
significantly fewer
11:23
significantly fewer tokens the final example I have here is
11:25
tokens the final example I have here is
11:25
tokens the final example I have here is a little snippet of python for doing FS
11:28
a little snippet of python for doing FS
11:28
a little snippet of python for doing FS buuz and what I'd like you to notice is
11:30
buuz and what I'd like you to notice is
11:31
buuz and what I'd like you to notice is look all these individual spaces are all
11:34
look all these individual spaces are all
11:34
look all these individual spaces are all separate tokens they are token
11:36
separate tokens they are token
11:37
separate tokens they are token 220 so uh 220 220 220 220 and then space
11:42
220 so uh 220 220 220 220 and then space
11:42
220 so uh 220 220 220 220 and then space if is a single token and so what's going
11:45
if is a single token and so what's going
11:45
if is a single token and so what's going on here is that when the Transformer is
11:46
on here is that when the Transformer is
11:46
on here is that when the Transformer is going to consume or try to uh create
11:49
going to consume or try to uh create
11:49
going to consume or try to uh create this text it needs to um handle all
11:52
this text it needs to um handle all
11:52
this text it needs to um handle all these spaces individually they all feed
11:54
these spaces individually they all feed
11:54
these spaces individually they all feed in one by one into the entire
11:56
in one by one into the entire
11:56
in one by one into the entire Transformer in the sequence and so this
11:59
Transformer in the sequence and so this
11:59
Transformer in the sequence and so this is being extremely wasteful tokenizing
12:01
is being extremely wasteful tokenizing
12:01
is being extremely wasteful tokenizing it in this way and so as a result of
12:04
it in this way and so as a result of
12:04
it in this way and so as a result of that gpt2 is not very good with python
12:07
that gpt2 is not very good with python
12:07
that gpt2 is not very good with python and it's not anything to do with coding
12:08
and it's not anything to do with coding
12:08
and it's not anything to do with coding or the language model itself it's just
12:10
or the language model itself it's just
12:10
or the language model itself it's just that if he use a lot of indentation
12:12
that if he use a lot of indentation
12:12
that if he use a lot of indentation using space in Python like we usually do
12:15
using space in Python like we usually do
12:15
using space in Python like we usually do uh you just end up bloating out all the
12:17
uh you just end up bloating out all the
12:17
uh you just end up bloating out all the text and it's separated across way too
12:19
text and it's separated across way too
12:19
text and it's separated across way too much of the sequence and we are running
12:21
much of the sequence and we are running
12:21
much of the sequence and we are running out of the context length in the
12:22
out of the context length in the
12:22
out of the context length in the sequence uh that's roughly speaking
12:24
sequence uh that's roughly speaking
12:24
sequence uh that's roughly speaking what's what's happening we're being way
12:25
what's what's happening we're being way
12:25
what's what's happening we're being way too wasteful we're taking up way too
12:27
too wasteful we're taking up way too
12:27
too wasteful we're taking up way too much token space now we can also scroll
12:29
much token space now we can also scroll
12:29
much token space now we can also scroll up here and we can change the tokenizer
12:31
up here and we can change the tokenizer
12:31
up here and we can change the tokenizer so note here that gpt2 tokenizer creates
12:34
so note here that gpt2 tokenizer creates
12:34
so note here that gpt2 tokenizer creates a token count of 300 for this string
12:36
a token count of 300 for this string
12:36
a token count of 300 for this string here we can change it to CL 100K base
12:39
here we can change it to CL 100K base
12:39
here we can change it to CL 100K base which is the GPT for tokenizer and we
12:41
which is the GPT for tokenizer and we
12:41
which is the GPT for tokenizer and we see that the token count drops to 185 so
12:44
see that the token count drops to 185 so
12:44
see that the token count drops to 185 so for the exact same string we are now
12:46
for the exact same string we are now
12:46
for the exact same string we are now roughly having the number of tokens and
12:49
roughly having the number of tokens and
12:49
roughly having the number of tokens and roughly speaking this is because uh the
12:51
roughly speaking this is because uh the
12:51
roughly speaking this is because uh the number of tokens in the GPT 4 tokenizer
12:54
number of tokens in the GPT 4 tokenizer
12:54
number of tokens in the GPT 4 tokenizer is roughly double that of the number of
12:56
is roughly double that of the number of
12:56
is roughly double that of the number of tokens in the gpt2 tokenizer so we went
12:58
tokens in the gpt2 tokenizer so we went
12:58
tokens in the gpt2 tokenizer so we went went from roughly 50k to roughly 100K
13:01
went from roughly 50k to roughly 100K
13:01
went from roughly 50k to roughly 100K now you can imagine that this is a good
13:02
now you can imagine that this is a good
13:03
now you can imagine that this is a good thing because the same text is now
13:05
thing because the same text is now
13:06
thing because the same text is now squished into half as many tokens so uh
13:10
squished into half as many tokens so uh
13:10
squished into half as many tokens so uh this is a lot denser input to the
13:12
this is a lot denser input to the
13:12
this is a lot denser input to the Transformer and in the Transformer every
13:15
Transformer and in the Transformer every
13:15
Transformer and in the Transformer every single token has a finite number of
13:17
single token has a finite number of
13:17
single token has a finite number of tokens before it that it's going to pay
13:18
tokens before it that it's going to pay
13:18
tokens before it that it's going to pay attention to and so what this is doing
13:20
attention to and so what this is doing
13:20
attention to and so what this is doing is we're roughly able to see twice as
13:23
is we're roughly able to see twice as
13:23
is we're roughly able to see twice as much text as a context for what token to
13:26
much text as a context for what token to
13:26
much text as a context for what token to predict next uh because of this change
13:29
predict next uh because of this change
13:29
predict next uh because of this change but of course just increasing the number
13:30
but of course just increasing the number
13:30
but of course just increasing the number of tokens is uh not strictly better
13:33
of tokens is uh not strictly better
13:33
of tokens is uh not strictly better infinitely uh because as you increase
13:35
infinitely uh because as you increase
13:35
infinitely uh because as you increase the number of tokens now your embedding
13:36
the number of tokens now your embedding
13:36
the number of tokens now your embedding table is um sort of getting a lot larger
13:39
table is um sort of getting a lot larger
13:39
table is um sort of getting a lot larger and also at the output we are trying to
13:41
and also at the output we are trying to
13:41
and also at the output we are trying to predict the next token and there's the
13:42
predict the next token and there's the
13:42
predict the next token and there's the soft Max there and that grows as well
13:45
soft Max there and that grows as well
13:45
soft Max there and that grows as well we're going to go into more detail later
13:46
we're going to go into more detail later
13:46
we're going to go into more detail later on this but there's some kind of a Sweet
13:48
on this but there's some kind of a Sweet
13:48
on this but there's some kind of a Sweet Spot somewhere where you have a just
13:50
Spot somewhere where you have a just
13:51
Spot somewhere where you have a just right number of tokens in your
13:52
right number of tokens in your
13:52
right number of tokens in your vocabulary where everything is
13:53
vocabulary where everything is
13:53
vocabulary where everything is appropriately dense and still fairly
13:56
appropriately dense and still fairly
13:56
appropriately dense and still fairly efficient now one thing I would like you
13:58
efficient now one thing I would like you
13:58
efficient now one thing I would like you to note specifically for the gp4
14:00
to note specifically for the gp4
14:00
to note specifically for the gp4 tokenizer is that the handling of the
14:03
tokenizer is that the handling of the
14:03
tokenizer is that the handling of the white space for python has improved a
14:05
white space for python has improved a
14:05
white space for python has improved a lot you see that here these four spaces
14:08
lot you see that here these four spaces
14:08
lot you see that here these four spaces are represented as one single token for
14:10
are represented as one single token for
14:10
are represented as one single token for the three spaces here and then the token
14:13
the three spaces here and then the token
14:13
the three spaces here and then the token SPF and here seven spaces were all
14:16
SPF and here seven spaces were all
14:16
SPF and here seven spaces were all grouped into a single token so we're
14:18
grouped into a single token so we're
14:18
grouped into a single token so we're being a lot more efficient in how we
14:20
being a lot more efficient in how we
14:20
being a lot more efficient in how we represent Python and this was a
14:21
represent Python and this was a
14:21
represent Python and this was a deliberate Choice made by open aai when
14:23
deliberate Choice made by open aai when
14:23
deliberate Choice made by open aai when they designed the gp4 tokenizer and they
14:27
they designed the gp4 tokenizer and they
14:27
they designed the gp4 tokenizer and they group a lot more space into a single
14:29
group a lot more space into a single
14:29
group a lot more space into a single character what this does is this
14:32
character what this does is this
14:32
character what this does is this densifies Python and therefore we can
14:35
densifies Python and therefore we can
14:35
densifies Python and therefore we can attend to more code before it when we're
14:38
attend to more code before it when we're
14:38
attend to more code before it when we're trying to predict the next token in the
14:39
trying to predict the next token in the
14:39
trying to predict the next token in the sequence and so the Improvement in the
14:42
sequence and so the Improvement in the
14:42
sequence and so the Improvement in the python coding ability from gbt2 to gp4
14:45
python coding ability from gbt2 to gp4
14:45
python coding ability from gbt2 to gp4 is not just a matter of the language
14:47
is not just a matter of the language
14:47
is not just a matter of the language model and the architecture and the
14:48
model and the architecture and the
14:48
model and the architecture and the details of the optimization but a lot of
14:50
details of the optimization but a lot of
14:50
details of the optimization but a lot of the Improvement here is also coming from
14:52
the Improvement here is also coming from
14:52
the Improvement here is also coming from the design of the tokenizer and how it
14:54
the design of the tokenizer and how it
14:54
the design of the tokenizer and how it groups characters into tokens okay so
14:56
groups characters into tokens okay so
14:56
groups characters into tokens okay so let's now start writing some code
14:59
let's now start writing some code
14:59
let's now start writing some code so remember what we want to do we want
15:01
so remember what we want to do we want
15:01
so remember what we want to do we want to take strings and feed them into
15:03
to take strings and feed them into
15:03
to take strings and feed them into language models for that we need to
15:05
language models for that we need to
15:05
language models for that we need to somehow tokenize strings into some
15:08
somehow tokenize strings into some
15:08
somehow tokenize strings into some integers in some fixed vocabulary and
15:12
integers in some fixed vocabulary and
15:12
integers in some fixed vocabulary and then we will use those integers to make
15:14
then we will use those integers to make
15:14
then we will use those integers to make a look up into a lookup table of vectors
15:16
a look up into a lookup table of vectors
15:16
a look up into a lookup table of vectors and feed those vectors into the
15:17
and feed those vectors into the
15:18
and feed those vectors into the Transformer as an input now the reason
15:21
Transformer as an input now the reason
15:21
Transformer as an input now the reason this gets a little bit tricky of course
15:22
this gets a little bit tricky of course
15:22
this gets a little bit tricky of course is that we don't just want to support
15:23
is that we don't just want to support
15:24
is that we don't just want to support the simple English alphabet we want to
15:26
the simple English alphabet we want to
15:26
the simple English alphabet we want to support different kinds of languages so
15:28
support different kinds of languages so
15:28
support different kinds of languages so this is anango in Korean which is hello
15:31
this is anango in Korean which is hello
15:31
this is anango in Korean which is hello and we also want to support many kinds
15:32
and we also want to support many kinds
15:33
and we also want to support many kinds of special characters that we might find
15:34
of special characters that we might find
15:34
of special characters that we might find on the internet for example
15:37
on the internet for example
15:37
on the internet for example Emoji so how do we feed this text into
15:41
Emoji so how do we feed this text into
15:41
Emoji so how do we feed this text into uh
15:42
uh
15:42
uh Transformers well how's the what is this
15:44
Transformers well how's the what is this
15:44
Transformers well how's the what is this text anyway in Python so if you go to
15:46
text anyway in Python so if you go to
15:46
text anyway in Python so if you go to the documentation of a string in Python
15:49
the documentation of a string in Python
15:49
the documentation of a string in Python you can see that strings are immutable
15:51
you can see that strings are immutable
15:51
you can see that strings are immutable sequences of Unicode code
15:54
sequences of Unicode code
15:54
sequences of Unicode code points okay what are Unicode code points
15:57
points okay what are Unicode code points
15:57
points okay what are Unicode code points we can go to PDF so Unicode code points
16:01
we can go to PDF so Unicode code points
16:01
we can go to PDF so Unicode code points are defined by the Unicode Consortium as
16:04
are defined by the Unicode Consortium as
16:04
are defined by the Unicode Consortium as part of the Unicode standard and what
16:07
part of the Unicode standard and what
16:07
part of the Unicode standard and what this is really is that it's just a
16:08
this is really is that it's just a
16:09
this is really is that it's just a definition of roughly 150,000 characters
16:11
definition of roughly 150,000 characters
16:11
definition of roughly 150,000 characters right now and roughly speaking what they
16:14
right now and roughly speaking what they
16:14
right now and roughly speaking what they look like and what integers um represent
16:17
look like and what integers um represent
16:17
look like and what integers um represent those characters so it says 150,000
16:19
those characters so it says 150,000
16:19
those characters so it says 150,000 characters across 161 scripts as of
16:22
characters across 161 scripts as of
16:22
characters across 161 scripts as of right now so if you scroll down here you
16:24
right now so if you scroll down here you
16:24
right now so if you scroll down here you can see that the standard is very much
16:26
can see that the standard is very much
16:26
can see that the standard is very much alive the latest standard 15.1 in
16:28
alive the latest standard 15.1 in
16:28
alive the latest standard 15.1 in September
16:30
September
16:30
September 2023 and basically this is just a way to
16:33
2023 and basically this is just a way to
16:33
2023 and basically this is just a way to define lots of types of
16:36
define lots of types of
16:36
define lots of types of characters like for example all these
16:39
characters like for example all these
16:39
characters like for example all these characters across different scripts so
16:41
characters across different scripts so
16:41
characters across different scripts so the way we can access the unic code code
16:44
the way we can access the unic code code
16:44
the way we can access the unic code code Point given Single Character is by using
16:45
Point given Single Character is by using
16:45
Point given Single Character is by using the or function in Python so for example
16:48
the or function in Python so for example
16:48
the or function in Python so for example I can pass in Ord of H and I can see
16:51
I can pass in Ord of H and I can see
16:51
I can pass in Ord of H and I can see that for the Single Character H the unic
16:54
that for the Single Character H the unic
16:54
that for the Single Character H the unic code code point is
16:56
code code point is
16:56
code code point is 104 okay um but this can be arbitr
17:00
104 okay um but this can be arbitr
17:00
104 okay um but this can be arbitr complicated so we can take for example
17:02
complicated so we can take for example
17:02
complicated so we can take for example our Emoji here and we can see that the
17:04
our Emoji here and we can see that the
17:04
our Emoji here and we can see that the code point for this one is
17:06
code point for this one is
17:06
code point for this one is 128,000 or we can take
17:10
128,000 or we can take
17:10
128,000 or we can take un and this is 50,000 now keep in mind
17:13
un and this is 50,000 now keep in mind
17:13
un and this is 50,000 now keep in mind you can't plug in strings here because
17:16
you can't plug in strings here because
17:16
you can't plug in strings here because you uh this doesn't have a single code
17:18
you uh this doesn't have a single code
17:18
you uh this doesn't have a single code point it only takes a single uni code
17:20
point it only takes a single uni code
17:20
point it only takes a single uni code code Point character and tells you its
17:23
code Point character and tells you its
17:23
code Point character and tells you its integer so in this way we can look
17:26
integer so in this way we can look
17:26
integer so in this way we can look up all the um characters of this
17:30
up all the um characters of this
17:30
up all the um characters of this specific string and their code points so
17:32
specific string and their code points so
17:32
specific string and their code points so or of X forx in this string and we get
17:36
or of X forx in this string and we get
17:36
or of X forx in this string and we get this encoding here now see here we've
17:40
this encoding here now see here we've
17:40
this encoding here now see here we've already turned the raw code points
17:42
already turned the raw code points
17:42
already turned the raw code points already have integers so why can't we
17:44
already have integers so why can't we
17:44
already have integers so why can't we simply just use these integers and not
17:46
simply just use these integers and not
17:46
simply just use these integers and not have any tokenization at all why can't
17:48
have any tokenization at all why can't
17:48
have any tokenization at all why can't we just use this natively as is and just
17:50
we just use this natively as is and just
17:50
we just use this natively as is and just use the code Point well one reason for
17:52
use the code Point well one reason for
17:52
use the code Point well one reason for that of course is that the vocabulary in
17:54
that of course is that the vocabulary in
17:54
that of course is that the vocabulary in that case would be quite long so in this
17:56
that case would be quite long so in this
17:56
that case would be quite long so in this case for Unicode the this is a
17:58
case for Unicode the this is a
17:58
case for Unicode the this is a vocabulary of
17:59
vocabulary of
17:59
vocabulary of 150,000 different code points but more
18:02
150,000 different code points but more
18:02
150,000 different code points but more worryingly than that I think the Unicode
18:05
worryingly than that I think the Unicode
18:05
worryingly than that I think the Unicode standard is very much alive and it keeps
18:07
standard is very much alive and it keeps
18:07
standard is very much alive and it keeps changing and so it's not kind of a
18:09
changing and so it's not kind of a
18:09
changing and so it's not kind of a stable representation necessarily that
18:11
stable representation necessarily that
18:11
stable representation necessarily that we may want to use directly so for those
18:13
we may want to use directly so for those
18:13
we may want to use directly so for those reasons we need something a bit better
18:15
reasons we need something a bit better
18:15
reasons we need something a bit better so to find something better we turn to
18:17
so to find something better we turn to
18:17
so to find something better we turn to encodings so if we go to the Wikipedia
18:19
encodings so if we go to the Wikipedia
18:19
encodings so if we go to the Wikipedia page here we see that the Unicode
18:21
page here we see that the Unicode
18:21
page here we see that the Unicode consortion defines three types of
18:23
consortion defines three types of
18:23
consortion defines three types of encodings utf8 UTF 16 and UTF 32 these
18:27
encodings utf8 UTF 16 and UTF 32 these
18:27
encodings utf8 UTF 16 and UTF 32 these encoding are the way by which we can
18:30
encoding are the way by which we can
18:30
encoding are the way by which we can take Unicode text and translate it into
18:33
take Unicode text and translate it into
18:33
take Unicode text and translate it into binary data or by streams utf8 is by far
18:37
binary data or by streams utf8 is by far
18:37
binary data or by streams utf8 is by far the most common uh so this is the utf8
18:39
the most common uh so this is the utf8
18:39
the most common uh so this is the utf8 page now this Wikipedia page is actually
18:41
page now this Wikipedia page is actually
18:42
page now this Wikipedia page is actually quite long but what's important for our
18:44
quite long but what's important for our
18:44
quite long but what's important for our purposes is that utf8 takes every single
18:46
purposes is that utf8 takes every single
18:46
purposes is that utf8 takes every single Cod point and it translates it to a by
18:49
Cod point and it translates it to a by
18:49
Cod point and it translates it to a by stream and this by stream is between one
18:52
stream and this by stream is between one
18:52
stream and this by stream is between one to four bytes so it's a variable length
18:54
to four bytes so it's a variable length
18:54
to four bytes so it's a variable length encoding so depending on the Unicode
18:56
encoding so depending on the Unicode
18:56
encoding so depending on the Unicode Point according to the schema you're
18:58
Point according to the schema you're
18:58
Point according to the schema you're going to end up with between 1 to four
18:59
going to end up with between 1 to four
18:59
going to end up with between 1 to four bytes for each code point on top of that
19:02
bytes for each code point on top of that
19:03
bytes for each code point on top of that there's utf8 uh
19:05
there's utf8 uh
19:05
there's utf8 uh utf16 and UTF 32 UTF 32 is nice because
19:08
utf16 and UTF 32 UTF 32 is nice because
19:08
utf16 and UTF 32 UTF 32 is nice because it is fixed length instead of variable
19:10
it is fixed length instead of variable
19:10
it is fixed length instead of variable length but it has many other downsides
19:12
length but it has many other downsides
19:12
length but it has many other downsides as well so the full kind of spectrum of
19:16
as well so the full kind of spectrum of
19:17
as well so the full kind of spectrum of pros and cons of all these different
19:18
pros and cons of all these different
19:18
pros and cons of all these different three encodings are beyond the scope of
19:20
three encodings are beyond the scope of
19:20
three encodings are beyond the scope of this video I just like to point out that
19:22
this video I just like to point out that
19:22
this video I just like to point out that I enjoyed this block post and this block
19:25
I enjoyed this block post and this block
19:25
I enjoyed this block post and this block post at the end of it also has a number
19:27
post at the end of it also has a number
19:27
post at the end of it also has a number of references that can be quite useful
19:29
of references that can be quite useful
19:29
of references that can be quite useful uh one of them is uh utf8 everywhere
19:32
uh one of them is uh utf8 everywhere
19:32
uh one of them is uh utf8 everywhere Manifesto um and this Manifesto
19:34
Manifesto um and this Manifesto
19:34
Manifesto um and this Manifesto describes the reason why utf8 is
19:36
describes the reason why utf8 is
19:36
describes the reason why utf8 is significantly preferred and a lot nicer
19:39
significantly preferred and a lot nicer
19:39
significantly preferred and a lot nicer than the other encodings and why it is
19:41
than the other encodings and why it is
19:41
than the other encodings and why it is used a lot more prominently um on the
19:45
used a lot more prominently um on the
19:45
used a lot more prominently um on the internet one of the major advantages
19:48
internet one of the major advantages
19:48
internet one of the major advantages just just to give you a sense is that
19:49
just just to give you a sense is that
19:49
just just to give you a sense is that utf8 is the only one of these that is
19:51
utf8 is the only one of these that is
19:52
utf8 is the only one of these that is backwards compatible to the much simpler
19:54
backwards compatible to the much simpler
19:54
backwards compatible to the much simpler asky encoding of text um but I'm not
19:57
asky encoding of text um but I'm not
19:57
asky encoding of text um but I'm not going to go into the full detail in this
19:58
going to go into the full detail in this
19:58
going to go into the full detail in this video so suffice to say that we like the
20:00
video so suffice to say that we like the
20:01
video so suffice to say that we like the utf8 encoding and uh let's try to take
20:03
utf8 encoding and uh let's try to take
20:03
utf8 encoding and uh let's try to take the string and see what we get if we
20:06
the string and see what we get if we
20:06
the string and see what we get if we encoded into
20:07
encoded into
20:08
encoded into utf8 the string class in Python actually
20:10
utf8 the string class in Python actually
20:10
utf8 the string class in Python actually has do encode and you can give it the
20:12
has do encode and you can give it the
20:12
has do encode and you can give it the encoding which is say utf8 now we get
20:15
encoding which is say utf8 now we get
20:15
encoding which is say utf8 now we get out of this is not very nice because
20:17
out of this is not very nice because
20:17
out of this is not very nice because this is the bytes is a bytes object and
20:20
this is the bytes is a bytes object and
20:20
this is the bytes is a bytes object and it's not very nice in the way that it's
20:22
it's not very nice in the way that it's
20:22
it's not very nice in the way that it's printed so I personally like to take it
20:25
printed so I personally like to take it
20:25
printed so I personally like to take it through list because then we actually
20:26
through list because then we actually
20:26
through list because then we actually get the raw B
20:28
get the raw B
20:28
get the raw B of this uh encoding so this is the raw
20:32
of this uh encoding so this is the raw
20:32
of this uh encoding so this is the raw byes that represent this string
20:35
byes that represent this string
20:35
byes that represent this string according to the utf8 en coding we can
20:38
according to the utf8 en coding we can
20:38
according to the utf8 en coding we can also look at utf16 we get a slightly
20:40
also look at utf16 we get a slightly
20:40
also look at utf16 we get a slightly different by stream and we here we start
20:43
different by stream and we here we start
20:43
different by stream and we here we start to see one of the disadvantages of utf16
20:45
to see one of the disadvantages of utf16
20:45
to see one of the disadvantages of utf16 you see how we have zero Z something Z
20:47
you see how we have zero Z something Z
20:47
you see how we have zero Z something Z something Z something we're starting to
20:49
something Z something we're starting to
20:49
something Z something we're starting to get a sense that this is a bit of a
20:50
get a sense that this is a bit of a
20:50
get a sense that this is a bit of a wasteful encoding and indeed for simple
20:53
wasteful encoding and indeed for simple
20:53
wasteful encoding and indeed for simple asky characters or English characters
20:56
asky characters or English characters
20:56
asky characters or English characters here uh we just have the structure of 0
20:58
here uh we just have the structure of 0
20:58
here uh we just have the structure of 0 something Z something and it's not
21:00
something Z something and it's not
21:00
something Z something and it's not exactly nice same for UTF 32 when we
21:04
exactly nice same for UTF 32 when we
21:04
exactly nice same for UTF 32 when we expand this we can start to get a sense
21:06
expand this we can start to get a sense
21:06
expand this we can start to get a sense of the wastefulness of this encoding for
21:07
of the wastefulness of this encoding for
21:08
of the wastefulness of this encoding for our purposes you see a lot of zeros
21:10
our purposes you see a lot of zeros
21:10
our purposes you see a lot of zeros followed by
21:11
followed by
21:11
followed by something and so uh this is not
21:14
something and so uh this is not
21:14
something and so uh this is not desirable so suffice it to say that we
21:17
desirable so suffice it to say that we
21:17
desirable so suffice it to say that we would like to stick with utf8 for our
21:20
would like to stick with utf8 for our
21:20
would like to stick with utf8 for our purposes however if we just use utf8
21:23
purposes however if we just use utf8
21:23
purposes however if we just use utf8 naively these are by streams so that
21:26
naively these are by streams so that
21:26
naively these are by streams so that would imply a vocabulary length of only
21:29
would imply a vocabulary length of only
21:29
would imply a vocabulary length of only 256 possible tokens uh but this this
21:33
256 possible tokens uh but this this
21:33
256 possible tokens uh but this this vocabulary size is very very small what
21:35
vocabulary size is very very small what
21:35
vocabulary size is very very small what this is going to do if we just were to
21:36
this is going to do if we just were to
21:36
this is going to do if we just were to use it naively is that all of our text
21:39
use it naively is that all of our text
21:39
use it naively is that all of our text would be stretched out over very very
21:41
would be stretched out over very very
21:41
would be stretched out over very very long sequences of bytes and so
21:46
long sequences of bytes and so
21:46
long sequences of bytes and so um what what this does is that certainly
21:49
um what what this does is that certainly
21:49
um what what this does is that certainly the embeding table is going to be tiny
21:50
the embeding table is going to be tiny
21:51
the embeding table is going to be tiny and the prediction at the top at the
21:52
and the prediction at the top at the
21:52
and the prediction at the top at the final layer is going to be very tiny but
21:54
final layer is going to be very tiny but
21:54
final layer is going to be very tiny but our sequences are very long and remember
21:56
our sequences are very long and remember
21:56
our sequences are very long and remember that we have pretty finite um context
21:59
that we have pretty finite um context
21:59
that we have pretty finite um context length and the attention that we can
22:00
length and the attention that we can
22:01
length and the attention that we can support in a transformer for
22:02
support in a transformer for
22:02
support in a transformer for computational reasons and so we only
22:05
computational reasons and so we only
22:05
computational reasons and so we only have as much context length but now we
22:07
have as much context length but now we
22:07
have as much context length but now we have very very long sequences and this
22:09
have very very long sequences and this
22:09
have very very long sequences and this is just inefficient and it's not going
22:10
is just inefficient and it's not going
22:10
is just inefficient and it's not going to allow us to attend to sufficiently
22:12
to allow us to attend to sufficiently
22:12
to allow us to attend to sufficiently long text uh before us for the purposes
22:15
long text uh before us for the purposes
22:15
long text uh before us for the purposes of the next token prediction task so we
22:18
of the next token prediction task so we
22:18
of the next token prediction task so we don't want to use the raw bytes of the
22:21
don't want to use the raw bytes of the
22:21
don't want to use the raw bytes of the utf8 encoding we want to be able to
22:24
utf8 encoding we want to be able to
22:24
utf8 encoding we want to be able to support larger vocabulary size that we
22:26
support larger vocabulary size that we
22:26
support larger vocabulary size that we can tune as a hyper
22:28
can tune as a hyper
22:28
can tune as a hyper but we want to stick with the utf8
22:30
but we want to stick with the utf8
22:30
but we want to stick with the utf8 encoding of these strings so what do we
22:33
encoding of these strings so what do we
22:33
encoding of these strings so what do we do well the answer of course is we turn
22:35
do well the answer of course is we turn
22:35
do well the answer of course is we turn to the bite pair encoding algorithm
22:37
to the bite pair encoding algorithm
22:37
to the bite pair encoding algorithm which will allow us to compress these
22:39
which will allow us to compress these
22:39
which will allow us to compress these bite sequences um to a variable amount
22:42
bite sequences um to a variable amount
22:42
bite sequences um to a variable amount so we'll get to that in a bit but I just
22:44
so we'll get to that in a bit but I just
22:44
so we'll get to that in a bit but I just want to briefly speak to the fact that I
22:47
want to briefly speak to the fact that I
22:47
want to briefly speak to the fact that I would love nothing more than to be able
22:49
would love nothing more than to be able
22:49
would love nothing more than to be able to feed raw bite sequences into uh
22:52
to feed raw bite sequences into uh
22:52
to feed raw bite sequences into uh language models in fact there's a paper
22:54
language models in fact there's a paper
22:54
language models in fact there's a paper about how this could potentially be done
22:57
about how this could potentially be done
22:57
about how this could potentially be done uh from Summer last last year now the
22:59
uh from Summer last last year now the
22:59
uh from Summer last last year now the problem is you actually have to go in
23:00
problem is you actually have to go in
23:00
problem is you actually have to go in and you have to modify the Transformer
23:02
and you have to modify the Transformer
23:02
and you have to modify the Transformer architecture because as I mentioned
23:04
architecture because as I mentioned
23:04
architecture because as I mentioned you're going to have a problem where the
23:06
you're going to have a problem where the
23:06
you're going to have a problem where the attention will start to become extremely
23:08
attention will start to become extremely
23:08
attention will start to become extremely expensive because the sequences are so
23:10
expensive because the sequences are so
23:10
expensive because the sequences are so long and so in this paper they propose
23:13
long and so in this paper they propose
23:13
long and so in this paper they propose kind of a hierarchical structuring of
23:15
kind of a hierarchical structuring of
23:15
kind of a hierarchical structuring of the Transformer that could allow you to
23:17
the Transformer that could allow you to
23:17
the Transformer that could allow you to just feed in raw bites and so at the end
23:20
just feed in raw bites and so at the end
23:20
just feed in raw bites and so at the end they say together these results
23:21
they say together these results
23:21
they say together these results establish the viability of tokenization
23:23
establish the viability of tokenization
23:23
establish the viability of tokenization free autor regressive sequence modeling
23:25
free autor regressive sequence modeling
23:25
free autor regressive sequence modeling at scale so tokenization free would
23:27
at scale so tokenization free would
23:27
at scale so tokenization free would indeed be amazing we would just feed B
23:30
indeed be amazing we would just feed B
23:30
indeed be amazing we would just feed B streams directly into our models but
23:32
streams directly into our models but
23:32
streams directly into our models but unfortunately I don't know that this has
23:34
unfortunately I don't know that this has
23:34
unfortunately I don't know that this has really been proven out yet by
23:36
really been proven out yet by
23:36
really been proven out yet by sufficiently many groups and a
23:37
sufficiently many groups and a
23:37
sufficiently many groups and a sufficient scale uh but something like
23:39
sufficient scale uh but something like
23:39
sufficient scale uh but something like this at one point would be amazing and I
23:40
this at one point would be amazing and I
23:40
this at one point would be amazing and I hope someone comes up with it but for
23:42
hope someone comes up with it but for
23:42
hope someone comes up with it but for now we have to come back and we can't
23:44
now we have to come back and we can't
23:44
now we have to come back and we can't feed this directly into language models
23:46
feed this directly into language models
23:46
feed this directly into language models and we have to compress it using the B
23:48
and we have to compress it using the B
23:48
and we have to compress it using the B paare encoding algorithm so let's see
23:49
paare encoding algorithm so let's see
23:49
paare encoding algorithm so let's see how that works so as I mentioned the B
23:51
how that works so as I mentioned the B
23:51
how that works so as I mentioned the B paare encoding algorithm is not all that
23:53
paare encoding algorithm is not all that
23:53
paare encoding algorithm is not all that complicated and the Wikipedia page is
23:55
complicated and the Wikipedia page is
23:55
complicated and the Wikipedia page is actually quite instructive as far as the
23:57
actually quite instructive as far as the
23:57
actually quite instructive as far as the basic idea goes go what we're doing is
23:59
basic idea goes go what we're doing is
23:59
basic idea goes go what we're doing is we have some kind of a input sequence uh
24:01
we have some kind of a input sequence uh
24:01
we have some kind of a input sequence uh like for example here we have only four
24:03
like for example here we have only four
24:03
like for example here we have only four elements in our vocabulary a b c and d
24:06
elements in our vocabulary a b c and d
24:06
elements in our vocabulary a b c and d and we have a sequence of them so
24:07
and we have a sequence of them so
24:08
and we have a sequence of them so instead of bytes let's say we just have
24:09
instead of bytes let's say we just have
24:09
instead of bytes let's say we just have four a vocab size of
24:12
four a vocab size of
24:12
four a vocab size of four the sequence is too long and we'd
24:14
four the sequence is too long and we'd
24:14
four the sequence is too long and we'd like to compress it so what we do is
24:16
like to compress it so what we do is
24:16
like to compress it so what we do is that we iteratively find the pair of uh
24:20
that we iteratively find the pair of uh
24:20
that we iteratively find the pair of uh tokens that occur the most
24:23
tokens that occur the most
24:23
tokens that occur the most frequently and then once we've
24:25
frequently and then once we've
24:25
frequently and then once we've identified that pair we repl replace
24:28
identified that pair we repl replace
24:28
identified that pair we repl replace that pair with just a single new token
24:30
that pair with just a single new token
24:30
that pair with just a single new token that we append to our vocabulary so for
24:33
that we append to our vocabulary so for
24:33
that we append to our vocabulary so for example here the bite pair AA occurs
24:36
example here the bite pair AA occurs
24:36
example here the bite pair AA occurs most often so we mint a new token let's
24:38
most often so we mint a new token let's
24:38
most often so we mint a new token let's call it capital Z and we replace every
24:41
call it capital Z and we replace every
24:41
call it capital Z and we replace every single occurrence of AA by Z so now we
24:45
single occurrence of AA by Z so now we
24:46
single occurrence of AA by Z so now we have two Z's here so here we took a
24:48
have two Z's here so here we took a
24:48
have two Z's here so here we took a sequence of 11 characters with
24:51
sequence of 11 characters with
24:51
sequence of 11 characters with vocabulary size four and we've converted
24:54
vocabulary size four and we've converted
24:54
vocabulary size four and we've converted it to a um sequence of only nine tokens
24:58
it to a um sequence of only nine tokens
24:58
it to a um sequence of only nine tokens but now with a vocabulary of five
25:00
but now with a vocabulary of five
25:00
but now with a vocabulary of five because we have a fifth vocabulary
25:02
because we have a fifth vocabulary
25:02
because we have a fifth vocabulary element that we just created and it's Z
25:04
element that we just created and it's Z
25:04
element that we just created and it's Z standing for concatination of AA and we
25:07
standing for concatination of AA and we
25:07
standing for concatination of AA and we can again repeat this process so we
25:10
can again repeat this process so we
25:10
can again repeat this process so we again look at the sequence and identify
25:12
again look at the sequence and identify
25:12
again look at the sequence and identify the pair of tokens that are most
25:15
the pair of tokens that are most
25:15
the pair of tokens that are most frequent let's say that that is now AB
25:19
frequent let's say that that is now AB
25:19
frequent let's say that that is now AB well we are going to replace AB with a
25:20
well we are going to replace AB with a
25:20
well we are going to replace AB with a new token that we meant call Y so y
25:23
new token that we meant call Y so y
25:23
new token that we meant call Y so y becomes ab and then every single
25:25
becomes ab and then every single
25:25
becomes ab and then every single occurrence of ab is now replaced with y
25:28
occurrence of ab is now replaced with y
25:28
occurrence of ab is now replaced with y so we end up with this so now we only
25:31
so we end up with this so now we only
25:31
so we end up with this so now we only have 1 2 3 4 5 6 seven characters in our
25:35
have 1 2 3 4 5 6 seven characters in our
25:35
have 1 2 3 4 5 6 seven characters in our sequence but we have not just um four
25:40
sequence but we have not just um four
25:40
sequence but we have not just um four vocabulary elements or five but now we
25:42
vocabulary elements or five but now we
25:42
vocabulary elements or five but now we have six and for the final round we
25:45
have six and for the final round we
25:45
have six and for the final round we again look through the sequence find
25:47
again look through the sequence find
25:47
again look through the sequence find that the phrase zy or the pair zy is
25:50
that the phrase zy or the pair zy is
25:50
that the phrase zy or the pair zy is most common and replace it one more time
25:53
most common and replace it one more time
25:53
most common and replace it one more time with another um character let's say x so
25:56
with another um character let's say x so
25:56
with another um character let's say x so X is z y and we replace all curses of zy
25:59
X is z y and we replace all curses of zy
25:59
X is z y and we replace all curses of zy and we get this following sequence so
26:02
and we get this following sequence so
26:02
and we get this following sequence so basically after we have gone through
26:03
basically after we have gone through
26:03
basically after we have gone through this process instead of having a um
26:08
this process instead of having a um
26:08
this process instead of having a um sequence of
26:09
sequence of
26:09
sequence of 11 uh tokens with a vocabulary length of
26:13
11 uh tokens with a vocabulary length of
26:13
11 uh tokens with a vocabulary length of four we now have a sequence of 1 2 3
26:18
four we now have a sequence of 1 2 3
26:18
four we now have a sequence of 1 2 3 four five tokens but our vocabulary
26:21
four five tokens but our vocabulary
26:21
four five tokens but our vocabulary length now is seven and so in this way
26:25
length now is seven and so in this way
26:25
length now is seven and so in this way we can iteratively compress our sequence
26:27
we can iteratively compress our sequence
26:27
we can iteratively compress our sequence I we Mint new tokens so in the in the
26:30
I we Mint new tokens so in the in the
26:30
I we Mint new tokens so in the in the exact same way we start we start out
26:32
exact same way we start we start out
26:32
exact same way we start we start out with bite sequences so we have 256
26:36
with bite sequences so we have 256
26:36
with bite sequences so we have 256 vocabulary size but we're now going to
26:38
vocabulary size but we're now going to
26:38
vocabulary size but we're now going to go through these and find the bite pairs
26:40
go through these and find the bite pairs
26:40
go through these and find the bite pairs that occur the most and we're going to
26:42
that occur the most and we're going to
26:42
that occur the most and we're going to iteratively start minting new tokens
26:44
iteratively start minting new tokens
26:44
iteratively start minting new tokens appending them to our vocabulary and
26:46
appending them to our vocabulary and
26:46
appending them to our vocabulary and replacing things and in this way we're
26:48
replacing things and in this way we're
26:48
replacing things and in this way we're going to end up with a compressed
26:50
going to end up with a compressed
26:50
going to end up with a compressed training data set and also an algorithm
26:52
training data set and also an algorithm
26:52
training data set and also an algorithm for taking any arbitrary sequence and
26:55
for taking any arbitrary sequence and
26:55
for taking any arbitrary sequence and encoding it using this uh vocabul
26:58
encoding it using this uh vocabul
26:58
encoding it using this uh vocabul and also decoding it back to Strings so
27:00
and also decoding it back to Strings so
27:01
and also decoding it back to Strings so let's now Implement all that so here's
27:03
let's now Implement all that so here's
27:03
let's now Implement all that so here's what I did I went to this block post
27:05
what I did I went to this block post
27:05
what I did I went to this block post that I enjoyed and I took the first
27:07
that I enjoyed and I took the first
27:07
that I enjoyed and I took the first paragraph and I copy pasted it here into
27:09
paragraph and I copy pasted it here into
27:10
paragraph and I copy pasted it here into text so this is one very long line
27:13
text so this is one very long line
27:13
text so this is one very long line here now to get the tokens as I
27:15
here now to get the tokens as I
27:15
here now to get the tokens as I mentioned we just take our text and we
27:17
mentioned we just take our text and we
27:17
mentioned we just take our text and we encode it into utf8 the tokens here at
27:20
encode it into utf8 the tokens here at
27:20
encode it into utf8 the tokens here at this point will be a raw bites single
27:22
this point will be a raw bites single
27:22
this point will be a raw bites single stream of bytes and just so that it's
27:25
stream of bytes and just so that it's
27:25
stream of bytes and just so that it's easier to work with instead of just a
27:27
easier to work with instead of just a
27:27
easier to work with instead of just a bytes object I'm going to convert all
27:29
bytes object I'm going to convert all
27:29
bytes object I'm going to convert all those bytes to integers and then create
27:32
those bytes to integers and then create
27:32
those bytes to integers and then create a list of it just so it's easier for us
27:34
a list of it just so it's easier for us
27:34
a list of it just so it's easier for us to manipulate and work with in Python
27:35
to manipulate and work with in Python
27:35
to manipulate and work with in Python and visualize and here I'm printing all
27:37
and visualize and here I'm printing all
27:38
and visualize and here I'm printing all of that so this is the original um this
27:42
of that so this is the original um this
27:42
of that so this is the original um this is the original paragraph and its length
27:44
is the original paragraph and its length
27:45
is the original paragraph and its length is
27:45
is
27:45
is 533 uh code points and then here are the
27:49
533 uh code points and then here are the
27:49
533 uh code points and then here are the bytes encoded in ut utf8 and we see that
27:53
bytes encoded in ut utf8 and we see that
27:53
bytes encoded in ut utf8 and we see that this has a length of 616 bytes at this
27:56
this has a length of 616 bytes at this
27:56
this has a length of 616 bytes at this point or 616 tokens and the reason this
27:59
point or 616 tokens and the reason this
27:59
point or 616 tokens and the reason this is more is because a lot of these simple
28:01
is more is because a lot of these simple
28:01
is more is because a lot of these simple asky characters or simple characters
28:04
asky characters or simple characters
28:04
asky characters or simple characters they just become a single bite but a lot
28:06
they just become a single bite but a lot
28:06
they just become a single bite but a lot of these Unicode more complex characters
28:08
of these Unicode more complex characters
28:08
of these Unicode more complex characters become multiple bytes up to four and so
28:11
become multiple bytes up to four and so
28:11
become multiple bytes up to four and so we are expanding that
28:12
we are expanding that
28:12
we are expanding that size so now what we'd like to do as a
28:14
size so now what we'd like to do as a
28:14
size so now what we'd like to do as a first step of the algorithm is we'd like
28:16
first step of the algorithm is we'd like
28:16
first step of the algorithm is we'd like to iterate over here and find the pair
28:18
to iterate over here and find the pair
28:18
to iterate over here and find the pair of bites that occur most frequently
28:21
of bites that occur most frequently
28:22
of bites that occur most frequently because we're then going to merge it so
28:24
because we're then going to merge it so
28:24
because we're then going to merge it so if you are working long on a notebook on
28:25
if you are working long on a notebook on
28:25
if you are working long on a notebook on a side then I encourage you to basically
28:27
a side then I encourage you to basically
28:27
a side then I encourage you to basically click on the link find this notebook and
28:29
click on the link find this notebook and
28:29
click on the link find this notebook and try to write that function yourself
28:31
try to write that function yourself
28:31
try to write that function yourself otherwise I'm going to come here and
28:32
otherwise I'm going to come here and
28:32
otherwise I'm going to come here and Implement first the function that finds
28:34
Implement first the function that finds
28:34
Implement first the function that finds the most common pair okay so here's what
28:36
the most common pair okay so here's what
28:36
the most common pair okay so here's what I came up with there are many different
28:38
I came up with there are many different
28:38
I came up with there are many different ways to implement this but I'm calling
28:40
ways to implement this but I'm calling
28:40
ways to implement this but I'm calling the function get stats it expects a list
28:42
the function get stats it expects a list
28:42
the function get stats it expects a list of integers I'm using a dictionary to
28:44
of integers I'm using a dictionary to
28:44
of integers I'm using a dictionary to keep track of basically the counts and
28:46
keep track of basically the counts and
28:46
keep track of basically the counts and then this is a pythonic way to iterate
28:48
then this is a pythonic way to iterate
28:48
then this is a pythonic way to iterate consecutive elements of this list uh
28:51
consecutive elements of this list uh
28:51
consecutive elements of this list uh which we covered in the previous video
28:53
which we covered in the previous video
28:53
which we covered in the previous video and then here I'm just keeping track of
28:55
and then here I'm just keeping track of
28:55
and then here I'm just keeping track of just incrementing by one um for all the
28:58
just incrementing by one um for all the
28:58
just incrementing by one um for all the pairs so if I call this on all the
29:00
pairs so if I call this on all the
29:00
pairs so if I call this on all the tokens here then the stats comes out
29:03
tokens here then the stats comes out
29:03
tokens here then the stats comes out here so this is the dictionary the keys
29:06
here so this is the dictionary the keys
29:06
here so this is the dictionary the keys are these topples of consecutive
29:08
are these topples of consecutive
29:08
are these topples of consecutive elements and this is the count so just
29:11
elements and this is the count so just
29:11
elements and this is the count so just to uh print it in a slightly better way
29:14
to uh print it in a slightly better way
29:14
to uh print it in a slightly better way this is one way that I like to do that
29:17
this is one way that I like to do that
29:17
this is one way that I like to do that where you it's a little bit compound
29:20
where you it's a little bit compound
29:20
where you it's a little bit compound here so you can pause if you like but we
29:22
here so you can pause if you like but we
29:22
here so you can pause if you like but we iterate all all the items the items
29:25
iterate all all the items the items
29:25
iterate all all the items the items called on dictionary returns pairs of
29:27
called on dictionary returns pairs of
29:27
called on dictionary returns pairs of key value and instead I create a list
29:31
key value and instead I create a list
29:31
key value and instead I create a list here of value key because if it's a
29:35
here of value key because if it's a
29:35
here of value key because if it's a value key list then I can call sort on
29:37
value key list then I can call sort on
29:37
value key list then I can call sort on it and by default python will uh use the
29:41
it and by default python will uh use the
29:41
it and by default python will uh use the first element which in this case will be
29:43
first element which in this case will be
29:43
first element which in this case will be value to sort by if it's given tles and
29:46
value to sort by if it's given tles and
29:46
value to sort by if it's given tles and then reverse so it's descending and
29:48
then reverse so it's descending and
29:48
then reverse so it's descending and print that so basically it looks like
29:50
print that so basically it looks like
29:50
print that so basically it looks like 101 comma 32 was the most commonly
29:53
101 comma 32 was the most commonly
29:53
101 comma 32 was the most commonly occurring consecutive pair and it
29:55
occurring consecutive pair and it
29:55
occurring consecutive pair and it occurred 20 times we can double check
29:58
occurred 20 times we can double check
29:58
occurred 20 times we can double check that that makes reasonable sense so if I
30:00
that that makes reasonable sense so if I
30:00
that that makes reasonable sense so if I just search
30:02
just search
30:02
just search 10132 then you see that these are the 20
30:05
10132 then you see that these are the 20
30:05
10132 then you see that these are the 20 occurrences of that um pair and if we'd
30:10
occurrences of that um pair and if we'd
30:10
occurrences of that um pair and if we'd like to take a look at what exactly that
30:11
like to take a look at what exactly that
30:11
like to take a look at what exactly that pair is we can use Char which is the
30:14
pair is we can use Char which is the
30:14
pair is we can use Char which is the opposite of or in Python so we give it a
30:17
opposite of or in Python so we give it a
30:17
opposite of or in Python so we give it a um unic code Cod point so 101 and of 32
30:22
um unic code Cod point so 101 and of 32
30:22
um unic code Cod point so 101 and of 32 and we see that this is e and space so
30:24
and we see that this is e and space so
30:25
and we see that this is e and space so basically there's a lot of E space here
30:28
basically there's a lot of E space here
30:28
basically there's a lot of E space here meaning that a lot of these words seem
30:29
meaning that a lot of these words seem
30:29
meaning that a lot of these words seem to end with e so here's eace as an
30:32
to end with e so here's eace as an
30:32
to end with e so here's eace as an example so there's a lot of that going
30:34
example so there's a lot of that going
30:34
example so there's a lot of that going on here and this is the most common pair
30:36
on here and this is the most common pair
30:36
on here and this is the most common pair so now that we've identified the most
30:38
so now that we've identified the most
30:38
so now that we've identified the most common pair we would like to iterate
30:40
common pair we would like to iterate
30:40
common pair we would like to iterate over this sequence we're going to Mint a
30:42
over this sequence we're going to Mint a
30:42
over this sequence we're going to Mint a new token with the ID of
30:44
new token with the ID of
30:44
new token with the ID of 256 right because these tokens currently
30:47
256 right because these tokens currently
30:47
256 right because these tokens currently go from Z to 255 so when we create a new
30:50
go from Z to 255 so when we create a new
30:50
go from Z to 255 so when we create a new token it will have an ID of
30:52
token it will have an ID of
30:52
token it will have an ID of 256 and we're going to iterate over this
30:55
256 and we're going to iterate over this
30:56
256 and we're going to iterate over this entire um list and every every time we
30:59
entire um list and every every time we
30:59
entire um list and every every time we see 101 comma 32 we're going to swap
31:02
see 101 comma 32 we're going to swap
31:02
see 101 comma 32 we're going to swap that out for
31:03
that out for
31:03
that out for 256 so let's Implement that now and feel
31:07
256 so let's Implement that now and feel
31:07
256 so let's Implement that now and feel free to uh do that yourself as well so
31:09
free to uh do that yourself as well so
31:09
free to uh do that yourself as well so first I commented uh this just so we
31:11
first I commented uh this just so we
31:11
first I commented uh this just so we don't pollute uh the notebook too much
31:14
don't pollute uh the notebook too much
31:14
don't pollute uh the notebook too much this is a nice way of in Python
31:17
this is a nice way of in Python
31:17
this is a nice way of in Python obtaining the highest ranking pair so
31:20
obtaining the highest ranking pair so
31:20
obtaining the highest ranking pair so we're basically calling the Max on this
31:23
we're basically calling the Max on this
31:23
we're basically calling the Max on this dictionary stats and this will return
31:26
dictionary stats and this will return
31:26
dictionary stats and this will return the maximum
31:27
the maximum
31:27
the maximum key and then the question is how does it
31:30
key and then the question is how does it
31:30
key and then the question is how does it rank keys so you can provide it with a
31:32
rank keys so you can provide it with a
31:32
rank keys so you can provide it with a function that ranks keys and that
31:35
function that ranks keys and that
31:35
function that ranks keys and that function is just stats. getet uh stats.
31:38
function is just stats. getet uh stats.
31:38
function is just stats. getet uh stats. getet would basically return the value
31:41
getet would basically return the value
31:41
getet would basically return the value and so we're ranking by the value and
31:42
and so we're ranking by the value and
31:42
and so we're ranking by the value and getting the maximum key so it's 101
31:45
getting the maximum key so it's 101
31:45
getting the maximum key so it's 101 comma 32 as we saw now to actually merge
31:49
comma 32 as we saw now to actually merge
31:49
comma 32 as we saw now to actually merge 10132 um this is the function that I
31:51
10132 um this is the function that I
31:51
10132 um this is the function that I wrote but again there are many different
31:53
wrote but again there are many different
31:53
wrote but again there are many different versions of it so we're going to take a
31:55
versions of it so we're going to take a
31:55
versions of it so we're going to take a list of IDs and the the pair that we
31:57
list of IDs and the the pair that we
31:57
list of IDs and the the pair that we want to replace and that pair will be
31:59
want to replace and that pair will be
31:59
want to replace and that pair will be replaced with the new index
32:02
replaced with the new index
32:02
replaced with the new index idx so iterating through IDs if we find
32:05
idx so iterating through IDs if we find
32:05
idx so iterating through IDs if we find the pair swap it out for idx so we
32:08
the pair swap it out for idx so we
32:08
the pair swap it out for idx so we create this new list and then we start
32:10
create this new list and then we start
32:10
create this new list and then we start at zero and then we go through this
32:12
at zero and then we go through this
32:12
at zero and then we go through this entire list sequentially from left to
32:14
entire list sequentially from left to
32:14
entire list sequentially from left to right and here we are checking for
32:17
right and here we are checking for
32:17
right and here we are checking for equality at the current position with
32:19
equality at the current position with
32:19
equality at the current position with the
32:20
the
32:20
the pair um so here we are checking that the
32:23
pair um so here we are checking that the
32:23
pair um so here we are checking that the pair matches now here is a bit of a
32:25
pair matches now here is a bit of a
32:25
pair matches now here is a bit of a tricky condition that you have to append
32:27
tricky condition that you have to append
32:27
tricky condition that you have to append if you're trying to be careful and that
32:29
if you're trying to be careful and that
32:29
if you're trying to be careful and that is that um you don't want this here to
32:31
is that um you don't want this here to
32:31
is that um you don't want this here to be out of Bounds at the very last
32:33
be out of Bounds at the very last
32:33
be out of Bounds at the very last position when you're on the rightmost
32:35
position when you're on the rightmost
32:35
position when you're on the rightmost element of this list otherwise this
32:37
element of this list otherwise this
32:37
element of this list otherwise this would uh give you an autof bounds error
32:39
would uh give you an autof bounds error
32:39
would uh give you an autof bounds error so we have to make sure that we're not
32:40
so we have to make sure that we're not
32:40
so we have to make sure that we're not at the very very last element so uh this
32:44
at the very very last element so uh this
32:44
at the very very last element so uh this would be false for that so if we find a
32:46
would be false for that so if we find a
32:46
would be false for that so if we find a match we append to this new list that
32:51
match we append to this new list that
32:51
match we append to this new list that replacement index and we increment the
32:53
replacement index and we increment the
32:53
replacement index and we increment the position by two so we skip over that
32:54
position by two so we skip over that
32:54
position by two so we skip over that entire pair but otherwise if we we
32:57
entire pair but otherwise if we we
32:57
entire pair but otherwise if we we haven't found a matching pair we just
32:59
haven't found a matching pair we just
32:59
haven't found a matching pair we just sort of copy over the um element at that
33:02
sort of copy over the um element at that
33:02
sort of copy over the um element at that position and increment by one then
33:05
position and increment by one then
33:05
position and increment by one then return this so here's a very small toy
33:07
return this so here's a very small toy
33:07
return this so here's a very small toy example if we have a list 566 791 and we
33:10
example if we have a list 566 791 and we
33:10
example if we have a list 566 791 and we want to replace the occurrences of 67
33:12
want to replace the occurrences of 67
33:12
want to replace the occurrences of 67 with 99 then calling this on that will
33:16
with 99 then calling this on that will
33:16
with 99 then calling this on that will give us what we're asking for so here
33:18
give us what we're asking for so here
33:18
give us what we're asking for so here the 67 is replaced with
33:21
the 67 is replaced with
33:21
the 67 is replaced with 99 so now I'm going to uncomment this
33:23
99 so now I'm going to uncomment this
33:23
99 so now I'm going to uncomment this for our actual use case where we want to
33:27
for our actual use case where we want to
33:27
for our actual use case where we want to take our tokens we want to take the top
33:29
take our tokens we want to take the top
33:29
take our tokens we want to take the top pair here and replace it with 256 to get
33:33
pair here and replace it with 256 to get
33:33
pair here and replace it with 256 to get tokens to if we run this we get the
33:37
tokens to if we run this we get the
33:37
tokens to if we run this we get the following so recall that previously we
33:40
following so recall that previously we
33:40
following so recall that previously we had a length 616 in this list and now we
33:45
had a length 616 in this list and now we
33:45
had a length 616 in this list and now we have a length 596 right so this
33:48
have a length 596 right so this
33:48
have a length 596 right so this decreased by 20 which makes sense
33:50
decreased by 20 which makes sense
33:50
decreased by 20 which makes sense because there are 20 occurrences
33:52
because there are 20 occurrences
33:52
because there are 20 occurrences moreover we can try to find 256 here and
33:55
moreover we can try to find 256 here and
33:55
moreover we can try to find 256 here and we see plenty of occurrences on off it
33:58
we see plenty of occurrences on off it
33:58
we see plenty of occurrences on off it and moreover just double check there
33:59
and moreover just double check there
33:59
and moreover just double check there should be no occurrence of 10132 so this
34:02
should be no occurrence of 10132 so this
34:02
should be no occurrence of 10132 so this is the original array plenty of them and
34:04
is the original array plenty of them and
34:05
is the original array plenty of them and in the second array there are no
34:06
in the second array there are no
34:06
in the second array there are no occurrences of 1032 so we've
34:08
occurrences of 1032 so we've
34:08
occurrences of 1032 so we've successfully merged this single pair and
34:11
successfully merged this single pair and
34:11
successfully merged this single pair and now we just uh iterate this so we are
34:13
now we just uh iterate this so we are
34:13
now we just uh iterate this so we are going to go over the sequence again find
34:15
going to go over the sequence again find
34:15
going to go over the sequence again find the most common pair and replace it so
34:17
the most common pair and replace it so
34:17
the most common pair and replace it so let me now write a y Loop that uses
34:19
let me now write a y Loop that uses
34:19
let me now write a y Loop that uses these functions to do this um sort of
34:21
these functions to do this um sort of
34:21
these functions to do this um sort of iteratively and how many times do we do
34:24
iteratively and how many times do we do
34:24
iteratively and how many times do we do it four well that's totally up to us as
34:26
it four well that's totally up to us as
34:26
it four well that's totally up to us as a hyper parameter
34:27
a hyper parameter
34:27
a hyper parameter the more um steps we take the larger
34:30
the more um steps we take the larger
34:30
the more um steps we take the larger will be our vocabulary and the shorter
34:33
will be our vocabulary and the shorter
34:33
will be our vocabulary and the shorter will be our sequence and there is some
34:35
will be our sequence and there is some
34:35
will be our sequence and there is some sweet spot that we usually find works
34:37
sweet spot that we usually find works
34:37
sweet spot that we usually find works the best in practice and so this is kind
34:39
the best in practice and so this is kind
34:39
the best in practice and so this is kind of a hyperparameter and we tune it and
34:41
of a hyperparameter and we tune it and
34:41
of a hyperparameter and we tune it and we find good vocabulary sizes as an
34:44
we find good vocabulary sizes as an
34:44
we find good vocabulary sizes as an example gp4 currently uses roughly
34:45
example gp4 currently uses roughly
34:46
example gp4 currently uses roughly 100,000 tokens and um bpark that those
34:49
100,000 tokens and um bpark that those
34:49
100,000 tokens and um bpark that those are reasonable numbers currently instead
34:51
are reasonable numbers currently instead
34:51
are reasonable numbers currently instead the are large language models so let me
34:53
the are large language models so let me
34:53
the are large language models so let me now write uh putting putting it all
34:55
now write uh putting putting it all
34:55
now write uh putting putting it all together and uh iterating these steps
34:58
together and uh iterating these steps
34:58
together and uh iterating these steps okay now before we dive into the Y loop
35:00
okay now before we dive into the Y loop
35:00
okay now before we dive into the Y loop I wanted to add one more cell here where
35:03
I wanted to add one more cell here where
35:03
I wanted to add one more cell here where I went to the block post and instead of
35:04
I went to the block post and instead of
35:04
I went to the block post and instead of grabbing just the first paragraph or two
35:06
grabbing just the first paragraph or two
35:07
grabbing just the first paragraph or two I took the entire block post and I
35:08
I took the entire block post and I
35:08
I took the entire block post and I stretched it out in a single line and
35:10
stretched it out in a single line and
35:10
stretched it out in a single line and basically just using longer text will
35:12
basically just using longer text will
35:12
basically just using longer text will allow us to have more representative
35:13
allow us to have more representative
35:13
allow us to have more representative statistics for the bite Pairs and we'll
35:16
statistics for the bite Pairs and we'll
35:16
statistics for the bite Pairs and we'll just get a more sensible results out of
35:18
just get a more sensible results out of
35:18
just get a more sensible results out of it because it's longer text um so here
35:21
it because it's longer text um so here
35:21
it because it's longer text um so here we have the raw text we encode it into
35:24
we have the raw text we encode it into
35:24
we have the raw text we encode it into bytes using the utf8 encoding
35:27
bytes using the utf8 encoding
35:27
bytes using the utf8 encoding and then here as before we are just
35:30
and then here as before we are just
35:30
and then here as before we are just changing it into a list of integers in
35:31
changing it into a list of integers in
35:31
changing it into a list of integers in Python just so it's easier to work with
35:33
Python just so it's easier to work with
35:33
Python just so it's easier to work with instead of the raw byes objects and then
35:36
instead of the raw byes objects and then
35:36
instead of the raw byes objects and then this is the code that I came up with uh
35:40
this is the code that I came up with uh
35:40
this is the code that I came up with uh to actually do the merging in Loop these
35:43
to actually do the merging in Loop these
35:44
to actually do the merging in Loop these two functions here are identical to what
35:45
two functions here are identical to what
35:45
two functions here are identical to what we had above I only included them here
35:48
we had above I only included them here
35:48
we had above I only included them here just so that you have the point of
35:49
just so that you have the point of
35:49
just so that you have the point of reference here so uh these two are
35:53
reference here so uh these two are
35:53
reference here so uh these two are identical and then this is the new code
35:54
identical and then this is the new code
35:55
identical and then this is the new code that I added so the first first thing we
35:57
that I added so the first first thing we
35:57
that I added so the first first thing we want to do is we want to decide on the
35:58
want to do is we want to decide on the
35:58
want to do is we want to decide on the final vocabulary size that we want our
36:01
final vocabulary size that we want our
36:01
final vocabulary size that we want our tokenizer to have and as I mentioned
36:02
tokenizer to have and as I mentioned
36:02
tokenizer to have and as I mentioned this is a hyper parameter and you set it
36:04
this is a hyper parameter and you set it
36:04
this is a hyper parameter and you set it in some way depending on your best
36:06
in some way depending on your best
36:06
in some way depending on your best performance so let's say for us we're
36:08
performance so let's say for us we're
36:08
performance so let's say for us we're going to use 276 because that way we're
36:10
going to use 276 because that way we're
36:10
going to use 276 because that way we're going to be doing exactly 20
36:13
going to be doing exactly 20
36:13
going to be doing exactly 20 merges and uh 20 merges because we
36:15
merges and uh 20 merges because we
36:15
merges and uh 20 merges because we already have
36:16
already have
36:16
already have 256 tokens for the raw bytes and to
36:20
256 tokens for the raw bytes and to
36:20
256 tokens for the raw bytes and to reach 276 we have to do 20 merges uh to
36:23
reach 276 we have to do 20 merges uh to
36:23
reach 276 we have to do 20 merges uh to add 20 new
36:25
add 20 new
36:25
add 20 new tokens here uh this is uh one way in
36:28
tokens here uh this is uh one way in
36:28
tokens here uh this is uh one way in Python to just create a copy of a list
36:31
Python to just create a copy of a list
36:31
Python to just create a copy of a list so I'm taking the tokens list and by
36:33
so I'm taking the tokens list and by
36:33
so I'm taking the tokens list and by wrapping it in a list python will
36:35
wrapping it in a list python will
36:35
wrapping it in a list python will construct a new list of all the
36:37
construct a new list of all the
36:37
construct a new list of all the individual elements so this is just a
36:38
individual elements so this is just a
36:38
individual elements so this is just a copy
36:39
copy
36:39
copy operation then here I'm creating a
36:42
operation then here I'm creating a
36:42
operation then here I'm creating a merges uh dictionary so this merges
36:44
merges uh dictionary so this merges
36:44
merges uh dictionary so this merges dictionary is going to maintain
36:46
dictionary is going to maintain
36:46
dictionary is going to maintain basically the child one child two
36:49
basically the child one child two
36:49
basically the child one child two mapping to a new uh token and so what
36:52
mapping to a new uh token and so what
36:52
mapping to a new uh token and so what we're going to be building up here is a
36:53
we're going to be building up here is a
36:53
we're going to be building up here is a binary tree of merges but actually it's
36:56
binary tree of merges but actually it's
36:56
binary tree of merges but actually it's not exactly a tree because a tree would
36:59
not exactly a tree because a tree would
36:59
not exactly a tree because a tree would have a single root node with a bunch of
37:01
have a single root node with a bunch of
37:01
have a single root node with a bunch of leaves for us we're starting with the
37:03
leaves for us we're starting with the
37:03
leaves for us we're starting with the leaves on the bottom which are the
37:04
leaves on the bottom which are the
37:05
leaves on the bottom which are the individual bites those are the starting
37:06
individual bites those are the starting
37:06
individual bites those are the starting 256 tokens and then we're starting to
37:09
256 tokens and then we're starting to
37:09
256 tokens and then we're starting to like merge two of them at a time and so
37:11
like merge two of them at a time and so
37:11
like merge two of them at a time and so it's not a tree it's more like a forest
37:14
it's not a tree it's more like a forest
37:14
it's not a tree it's more like a forest um uh as we merge these elements
37:18
um uh as we merge these elements
37:18
um uh as we merge these elements so for 20 merges we're going to find the
37:22
so for 20 merges we're going to find the
37:22
so for 20 merges we're going to find the most commonly occurring pair we're going
37:25
most commonly occurring pair we're going
37:25
most commonly occurring pair we're going to Mint a new token integer for it so I
37:28
to Mint a new token integer for it so I
37:28
to Mint a new token integer for it so I here will start at zero so we'll going
37:30
here will start at zero so we'll going
37:30
here will start at zero so we'll going to start at 256 we're going to print
37:32
to start at 256 we're going to print
37:32
to start at 256 we're going to print that we're merging it and we're going to
37:34
that we're merging it and we're going to
37:34
that we're merging it and we're going to replace all of the occurrences of that
37:36
replace all of the occurrences of that
37:36
replace all of the occurrences of that pair with the new new lied token and
37:39
pair with the new new lied token and
37:39
pair with the new new lied token and we're going to record that this pair of
37:42
we're going to record that this pair of
37:42
we're going to record that this pair of integers merged into this new
37:45
integers merged into this new
37:45
integers merged into this new integer so running this gives us the
37:49
integer so running this gives us the
37:49
integer so running this gives us the following
37:51
following
37:51
following output so we did 20 merges and for
37:54
output so we did 20 merges and for
37:54
output so we did 20 merges and for example the first merge was exactly as
37:56
example the first merge was exactly as
37:56
example the first merge was exactly as before the
37:58
before the
37:58
before the 10132 um tokens merging into a new token
38:01
10132 um tokens merging into a new token
38:01
10132 um tokens merging into a new token 2556 now keep in mind that the
38:03
2556 now keep in mind that the
38:04
2556 now keep in mind that the individual uh tokens 101 and 32 can
38:06
individual uh tokens 101 and 32 can
38:06
individual uh tokens 101 and 32 can still occur in the sequence after
38:08
still occur in the sequence after
38:08
still occur in the sequence after merging it's only when they occur
38:10
merging it's only when they occur
38:10
merging it's only when they occur exactly consecutively that that becomes
38:12
exactly consecutively that that becomes
38:12
exactly consecutively that that becomes 256
38:13
256
38:13
256 now um and in particular the other thing
38:16
now um and in particular the other thing
38:16
now um and in particular the other thing to notice here is that the token 256
38:19
to notice here is that the token 256
38:19
to notice here is that the token 256 which is the newly minted token is also
38:21
which is the newly minted token is also
38:21
which is the newly minted token is also eligible for merging so here on the
38:23
eligible for merging so here on the
38:23
eligible for merging so here on the bottom the 20th merge was a merge of 25
38:26
bottom the 20th merge was a merge of 25
38:26
bottom the 20th merge was a merge of 25 and 259 becoming
38:28
and 259 becoming
38:28
and 259 becoming 275 so every time we replace these
38:31
275 so every time we replace these
38:31
275 so every time we replace these tokens they become eligible for merging
38:33
tokens they become eligible for merging
38:33
tokens they become eligible for merging in the next round of data ration so
38:35
in the next round of data ration so
38:35
in the next round of data ration so that's why we're building up a small
38:37
that's why we're building up a small
38:37
that's why we're building up a small sort of binary Forest instead of a
38:38
sort of binary Forest instead of a
38:38
sort of binary Forest instead of a single individual
38:40
single individual
38:40
single individual tree one thing we can take a look at as
38:42
tree one thing we can take a look at as
38:42
tree one thing we can take a look at as well is we can take a look at the
38:43
well is we can take a look at the
38:44
well is we can take a look at the compression ratio that we've achieved so
38:46
compression ratio that we've achieved so
38:46
compression ratio that we've achieved so in particular we started off with this
38:48
in particular we started off with this
38:48
in particular we started off with this tokens list um so we started off with
38:51
tokens list um so we started off with
38:51
tokens list um so we started off with 24,000 bytes and after merging 20 times
38:56
24,000 bytes and after merging 20 times
38:56
24,000 bytes and after merging 20 times uh we now have only
38:58
uh we now have only
38:58
uh we now have only 19,000 um tokens and so therefore the
39:01
19,000 um tokens and so therefore the
39:01
19,000 um tokens and so therefore the compression ratio simply just dividing
39:03
compression ratio simply just dividing
39:03
compression ratio simply just dividing the two is roughly 1.27 so that's the
39:06
the two is roughly 1.27 so that's the
39:06
the two is roughly 1.27 so that's the amount of compression we were able to
39:07
amount of compression we were able to
39:07
amount of compression we were able to achieve of this text with only 20
39:10
achieve of this text with only 20
39:10
achieve of this text with only 20 merges um and of course the more
39:13
merges um and of course the more
39:13
merges um and of course the more vocabulary elements you add uh the
39:15
vocabulary elements you add uh the
39:15
vocabulary elements you add uh the greater the compression ratio here would
39:19
greater the compression ratio here would
39:19
greater the compression ratio here would be finally so that's kind of like um the
39:23
be finally so that's kind of like um the
39:23
be finally so that's kind of like um the training of the tokenizer if you will
39:25
training of the tokenizer if you will
39:25
training of the tokenizer if you will now 1 Point I wanted to make is that and
39:28
now 1 Point I wanted to make is that and
39:28
now 1 Point I wanted to make is that and maybe this is a diagram that can help um
39:31
maybe this is a diagram that can help um
39:31
maybe this is a diagram that can help um kind of illustrate is that tokenizer is
39:33
kind of illustrate is that tokenizer is
39:33
kind of illustrate is that tokenizer is a completely separate object from the
39:34
a completely separate object from the
39:34
a completely separate object from the large language model itself so
39:36
large language model itself so
39:37
large language model itself so everything in this lecture we're not
39:38
everything in this lecture we're not
39:38
everything in this lecture we're not really touching the llm itself uh we're
39:40
really touching the llm itself uh we're
39:40
really touching the llm itself uh we're just training the tokenizer this is a
39:41
just training the tokenizer this is a
39:41
just training the tokenizer this is a completely separate pre-processing stage
39:43
completely separate pre-processing stage
39:43
completely separate pre-processing stage usually so the tokenizer will have its
39:46
usually so the tokenizer will have its
39:46
usually so the tokenizer will have its own training set just like a large
39:47
own training set just like a large
39:47
own training set just like a large language model has a potentially
39:49
language model has a potentially
39:49
language model has a potentially different training set so the tokenizer
39:52
different training set so the tokenizer
39:52
different training set so the tokenizer has a training set of documents on which
39:53
has a training set of documents on which
39:53
has a training set of documents on which you're going to train the
39:54
you're going to train the
39:54
you're going to train the tokenizer and then and um we're
39:57
tokenizer and then and um we're
39:57
tokenizer and then and um we're performing The Bite pair encoding
39:58
performing The Bite pair encoding
39:58
performing The Bite pair encoding algorithm as we saw above to train the
40:01
algorithm as we saw above to train the
40:01
algorithm as we saw above to train the vocabulary of this
40:02
vocabulary of this
40:02
vocabulary of this tokenizer so it has its own training set
40:04
tokenizer so it has its own training set
40:04
tokenizer so it has its own training set it is a pre-processing stage that you
40:06
it is a pre-processing stage that you
40:06
it is a pre-processing stage that you would run a single time in the beginning
40:09
would run a single time in the beginning
40:09
would run a single time in the beginning um and the tokenizer is trained using
40:11
um and the tokenizer is trained using
40:11
um and the tokenizer is trained using bipar coding algorithm once you have the
40:14
bipar coding algorithm once you have the
40:14
bipar coding algorithm once you have the tokenizer once it's trained and you have
40:16
tokenizer once it's trained and you have
40:16
tokenizer once it's trained and you have the vocabulary and you have the merges
40:19
the vocabulary and you have the merges
40:19
the vocabulary and you have the merges uh we can do both encoding and decoding
40:22
uh we can do both encoding and decoding
40:22
uh we can do both encoding and decoding so these two arrows here so the
40:24
so these two arrows here so the
40:24
so these two arrows here so the tokenizer is a translation layer between
40:26
tokenizer is a translation layer between
40:27
tokenizer is a translation layer between raw text which is as we saw the sequence
40:30
raw text which is as we saw the sequence
40:30
raw text which is as we saw the sequence of Unicode code points it can take raw
40:32
of Unicode code points it can take raw
40:32
of Unicode code points it can take raw text and turn it into a token sequence
40:35
text and turn it into a token sequence
40:35
text and turn it into a token sequence and vice versa it can take a token
40:36
and vice versa it can take a token
40:37
and vice versa it can take a token sequence and translate it back into raw
40:40
sequence and translate it back into raw
40:40
sequence and translate it back into raw text so now that we have trained uh
40:43
text so now that we have trained uh
40:43
text so now that we have trained uh tokenizer and we have these merges we
40:45
tokenizer and we have these merges we
40:45
tokenizer and we have these merges we are going to turn to how we can do the
40:47
are going to turn to how we can do the
40:47
are going to turn to how we can do the encoding and the decoding step if you
40:49
encoding and the decoding step if you
40:49
encoding and the decoding step if you give me text here are the tokens and
40:51
give me text here are the tokens and
40:51
give me text here are the tokens and vice versa if you give me tokens here's
40:52
vice versa if you give me tokens here's
40:53
vice versa if you give me tokens here's the text once we have that we can
40:55
the text once we have that we can
40:55
the text once we have that we can translate between these two Realms and
40:57
translate between these two Realms and
40:57
translate between these two Realms and then the language model is going to be
40:58
then the language model is going to be
40:58
then the language model is going to be trained as a step two afterwards and
41:01
trained as a step two afterwards and
41:01
trained as a step two afterwards and typically in a in a sort of a
41:03
typically in a in a sort of a
41:03
typically in a in a sort of a state-of-the-art application you might
41:05
state-of-the-art application you might
41:05
state-of-the-art application you might take all of your training data for the
41:06
take all of your training data for the
41:06
take all of your training data for the language model and you might run it
41:08
language model and you might run it
41:08
language model and you might run it through the tokenizer and sort of
41:10
through the tokenizer and sort of
41:10
through the tokenizer and sort of translate everything into a massive
41:11
translate everything into a massive
41:11
translate everything into a massive token sequence and then you can throw
41:13
token sequence and then you can throw
41:13
token sequence and then you can throw away the raw text you're just left with
41:15
away the raw text you're just left with
41:15
away the raw text you're just left with the tokens themselves and those are
41:17
the tokens themselves and those are
41:17
the tokens themselves and those are stored on disk and that is what the
41:19
stored on disk and that is what the
41:19
stored on disk and that is what the large language model is actually reading
41:21
large language model is actually reading
41:21
large language model is actually reading when it's training on them so this one
41:23
when it's training on them so this one
41:23
when it's training on them so this one approach that you can take as a single
41:24
approach that you can take as a single
41:24
approach that you can take as a single massive pre-processing step a
41:26
massive pre-processing step a
41:26
massive pre-processing step a stage um so yeah basically I think the
41:30
stage um so yeah basically I think the
41:30
stage um so yeah basically I think the most important thing I want to get
41:31
most important thing I want to get
41:31
most important thing I want to get across is that this is completely
41:32
across is that this is completely
41:32
across is that this is completely separate stage it usually has its own
41:34
separate stage it usually has its own
41:34
separate stage it usually has its own entire uh training set you may want to
41:36
entire uh training set you may want to
41:36
entire uh training set you may want to have those training sets be different
41:38
have those training sets be different
41:38
have those training sets be different between the tokenizer and the logge
41:39
between the tokenizer and the logge
41:39
between the tokenizer and the logge language model so for example when
41:41
language model so for example when
41:41
language model so for example when you're training the tokenizer as I
41:43
you're training the tokenizer as I
41:43
you're training the tokenizer as I mentioned we don't just care about the
41:45
mentioned we don't just care about the
41:45
mentioned we don't just care about the performance of English text we care
41:46
performance of English text we care
41:46
performance of English text we care about uh multi many different languages
41:49
about uh multi many different languages
41:49
about uh multi many different languages and we also care about code or not code
41:51
and we also care about code or not code
41:51
and we also care about code or not code so you may want to look into different
41:53
so you may want to look into different
41:53
so you may want to look into different kinds of mixtures of different kinds of
41:55
kinds of mixtures of different kinds of
41:55
kinds of mixtures of different kinds of languages and different amounts of code
41:57
languages and different amounts of code
41:57
languages and different amounts of code and things like that because the amount
42:00
and things like that because the amount
42:00
and things like that because the amount of different language that you have in
42:01
of different language that you have in
42:01
of different language that you have in your tokenizer training set will
42:03
your tokenizer training set will
42:03
your tokenizer training set will determine how many merges of it there
42:06
determine how many merges of it there
42:06
determine how many merges of it there will be and therefore that determines
42:08
will be and therefore that determines
42:08
will be and therefore that determines the density with which uh this type of
42:11
the density with which uh this type of
42:11
the density with which uh this type of data is um sort of has in the token
42:15
data is um sort of has in the token
42:15
data is um sort of has in the token space and so roughly speaking
42:17
space and so roughly speaking
42:17
space and so roughly speaking intuitively if you add some amount of
42:19
intuitively if you add some amount of
42:19
intuitively if you add some amount of data like say you have a ton of Japanese
42:21
data like say you have a ton of Japanese
42:21
data like say you have a ton of Japanese data in your uh tokenizer training set
42:24
data in your uh tokenizer training set
42:24
data in your uh tokenizer training set then that means that more Japanese
42:25
then that means that more Japanese
42:25
then that means that more Japanese tokens will get merged
42:26
tokens will get merged
42:26
tokens will get merged and therefore Japanese will have shorter
42:28
and therefore Japanese will have shorter
42:28
and therefore Japanese will have shorter sequences uh and that's going to be
42:30
sequences uh and that's going to be
42:30
sequences uh and that's going to be beneficial for the large language model
42:32
beneficial for the large language model
42:32
beneficial for the large language model which has a finite context length on
42:34
which has a finite context length on
42:34
which has a finite context length on which it can work on in in the token
42:36
which it can work on in in the token
42:36
which it can work on in in the token space uh so hopefully that makes sense
42:39
space uh so hopefully that makes sense
42:39
space uh so hopefully that makes sense so we're now going to turn to encoding
42:41
so we're now going to turn to encoding
42:41
so we're now going to turn to encoding and decoding now that we have trained a
42:43
and decoding now that we have trained a
42:43
and decoding now that we have trained a tokenizer so we have our merges and now
42:46
tokenizer so we have our merges and now
42:46
tokenizer so we have our merges and now how do we do encoding and decoding okay
42:48
how do we do encoding and decoding okay
42:48
how do we do encoding and decoding okay so let's begin with decoding which is
42:50
so let's begin with decoding which is
42:50
so let's begin with decoding which is this Arrow over here so given a token
42:52
this Arrow over here so given a token
42:52
this Arrow over here so given a token sequence let's go through the tokenizer
42:54
sequence let's go through the tokenizer
42:54
sequence let's go through the tokenizer to get back a python string object so
42:57
to get back a python string object so
42:57
to get back a python string object so the raw text so this is the function
42:59
the raw text so this is the function
42:59
the raw text so this is the function that we' like to implement um we're
43:01
that we' like to implement um we're
43:01
that we' like to implement um we're given the list of integers and we want
43:03
given the list of integers and we want
43:03
given the list of integers and we want to return a python string if you'd like
43:05
to return a python string if you'd like
43:05
to return a python string if you'd like uh try to implement this function
43:06
uh try to implement this function
43:06
uh try to implement this function yourself it's a fun exercise otherwise
43:08
yourself it's a fun exercise otherwise
43:08
yourself it's a fun exercise otherwise I'm going to start uh pasting in my own
43:11
I'm going to start uh pasting in my own
43:11
I'm going to start uh pasting in my own solution so there are many different
43:13
solution so there are many different
43:13
solution so there are many different ways to do it um here's one way I will
43:16
ways to do it um here's one way I will
43:16
ways to do it um here's one way I will create an uh kind of pre-processing
43:18
create an uh kind of pre-processing
43:18
create an uh kind of pre-processing variable that I will call
43:21
variable that I will call
43:21
variable that I will call vocab and vocab is a mapping or a
43:24
vocab and vocab is a mapping or a
43:24
vocab and vocab is a mapping or a dictionary in Python for from the token
43:27
dictionary in Python for from the token
43:27
dictionary in Python for from the token uh ID to the bytes object for that token
43:31
uh ID to the bytes object for that token
43:31
uh ID to the bytes object for that token so we begin with the raw bytes for
43:33
so we begin with the raw bytes for
43:33
so we begin with the raw bytes for tokens from 0 to 255 and then we go in
43:36
tokens from 0 to 255 and then we go in
43:36
tokens from 0 to 255 and then we go in order of all the merges and we sort of
43:39
order of all the merges and we sort of
43:39
order of all the merges and we sort of uh populate this vocab list by doing an
43:42
uh populate this vocab list by doing an
43:42
uh populate this vocab list by doing an addition here so this is the basically
43:45
addition here so this is the basically
43:45
addition here so this is the basically the bytes representation of the first
43:47
the bytes representation of the first
43:47
the bytes representation of the first child followed by the second one and
43:50
child followed by the second one and
43:50
child followed by the second one and remember these are bytes objects so this
43:52
remember these are bytes objects so this
43:52
remember these are bytes objects so this addition here is an addition of two
43:54
addition here is an addition of two
43:54
addition here is an addition of two bytes objects just concatenation
43:57
bytes objects just concatenation
43:57
bytes objects just concatenation so that's what we get
43:58
so that's what we get
43:58
so that's what we get here one tricky thing to be careful with
44:01
here one tricky thing to be careful with
44:01
here one tricky thing to be careful with by the way is that I'm iterating a
44:02
by the way is that I'm iterating a
44:02
by the way is that I'm iterating a dictionary in Python using a DOT items
44:05
dictionary in Python using a DOT items
44:06
dictionary in Python using a DOT items and uh it really matters that this runs
44:08
and uh it really matters that this runs
44:08
and uh it really matters that this runs in the order in which we inserted items
44:11
in the order in which we inserted items
44:11
in the order in which we inserted items into the merous dictionary luckily
44:13
into the merous dictionary luckily
44:13
into the merous dictionary luckily starting with python 3.7 this is
44:15
starting with python 3.7 this is
44:15
starting with python 3.7 this is guaranteed to be the case but before
44:17
guaranteed to be the case but before
44:17
guaranteed to be the case but before python 3.7 this iteration may have been
44:19
python 3.7 this iteration may have been
44:19
python 3.7 this iteration may have been out of order with respect to how we
44:20
out of order with respect to how we
44:20
out of order with respect to how we inserted elements into merges and this
44:23
inserted elements into merges and this
44:23
inserted elements into merges and this may not have worked but we are using an
44:25
may not have worked but we are using an
44:25
may not have worked but we are using an um modern python so we're okay and then
44:28
um modern python so we're okay and then
44:28
um modern python so we're okay and then here uh given the IDS the first thing
44:31
here uh given the IDS the first thing
44:31
here uh given the IDS the first thing we're going to do is get the
44:35
we're going to do is get the
44:35
we're going to do is get the tokens so the way I implemented this
44:37
tokens so the way I implemented this
44:37
tokens so the way I implemented this here is I'm taking I'm iterating over
44:39
here is I'm taking I'm iterating over
44:39
here is I'm taking I'm iterating over all the IDS I'm using vocap to look up
44:41
all the IDS I'm using vocap to look up
44:41
all the IDS I'm using vocap to look up their bytes and then here this is one
44:44
their bytes and then here this is one
44:44
their bytes and then here this is one way in Python to concatenate all these
44:46
way in Python to concatenate all these
44:46
way in Python to concatenate all these bytes together to create our tokens and
44:49
bytes together to create our tokens and
44:49
bytes together to create our tokens and then these tokens here at this point are
44:51
then these tokens here at this point are
44:51
then these tokens here at this point are raw bytes so I have to decode using UTF
44:55
raw bytes so I have to decode using UTF
44:56
raw bytes so I have to decode using UTF F now back into python strings so
44:59
F now back into python strings so
44:59
F now back into python strings so previously we called that encode on a
45:01
previously we called that encode on a
45:01
previously we called that encode on a string object to get the bytes and now
45:03
string object to get the bytes and now
45:03
string object to get the bytes and now we're doing it Opposite we're taking the
45:05
we're doing it Opposite we're taking the
45:05
we're doing it Opposite we're taking the bytes and calling a decode on the bytes
45:07
bytes and calling a decode on the bytes
45:07
bytes and calling a decode on the bytes object to get a string in Python and
45:10
object to get a string in Python and
45:11
object to get a string in Python and then we can return
45:13
then we can return
45:13
then we can return text so um this is how we can do it now
45:16
text so um this is how we can do it now
45:16
text so um this is how we can do it now this actually has a um issue um in the
45:20
this actually has a um issue um in the
45:20
this actually has a um issue um in the way I implemented it and this could
45:22
way I implemented it and this could
45:22
way I implemented it and this could actually throw an error so try to think
45:24
actually throw an error so try to think
45:24
actually throw an error so try to think figure out why this code could actually
45:26
figure out why this code could actually
45:26
figure out why this code could actually result in an error if we plug in um uh
45:30
result in an error if we plug in um uh
45:30
result in an error if we plug in um uh some sequence of IDs that is
45:32
some sequence of IDs that is
45:32
some sequence of IDs that is unlucky so let me demonstrate the issue
45:35
unlucky so let me demonstrate the issue
45:35
unlucky so let me demonstrate the issue when I try to decode just something like
45:37
when I try to decode just something like
45:37
when I try to decode just something like 97 I am going to get letter A here back
45:41
97 I am going to get letter A here back
45:41
97 I am going to get letter A here back so nothing too crazy happening but when
45:44
so nothing too crazy happening but when
45:44
so nothing too crazy happening but when I try to decode 128 as a single element
45:48
I try to decode 128 as a single element
45:48
I try to decode 128 as a single element the token 128 is what in string or in
45:51
the token 128 is what in string or in
45:51
the token 128 is what in string or in Python object uni Cod decoder utfa can't
45:55
Python object uni Cod decoder utfa can't
45:55
Python object uni Cod decoder utfa can't Decode by um 0x8 which is this in HEX in
46:00
Decode by um 0x8 which is this in HEX in
46:00
Decode by um 0x8 which is this in HEX in position zero invalid start bite what
46:01
position zero invalid start bite what
46:01
position zero invalid start bite what does that mean well to understand what
46:03
does that mean well to understand what
46:03
does that mean well to understand what this means we have to go back to our
46:04
this means we have to go back to our
46:04
this means we have to go back to our utf8 page uh that I briefly showed
46:07
utf8 page uh that I briefly showed
46:07
utf8 page uh that I briefly showed earlier and this is Wikipedia utf8 and
46:10
earlier and this is Wikipedia utf8 and
46:10
earlier and this is Wikipedia utf8 and basically there's a specific schema that
46:13
basically there's a specific schema that
46:13
basically there's a specific schema that utfa bytes take so in particular if you
46:16
utfa bytes take so in particular if you
46:16
utfa bytes take so in particular if you have a multi-te object for some of the
46:19
have a multi-te object for some of the
46:19
have a multi-te object for some of the Unicode characters they have to have
46:21
Unicode characters they have to have
46:21
Unicode characters they have to have this special sort of envelope in how the
46:24
this special sort of envelope in how the
46:24
this special sort of envelope in how the encoding works and so what's happening
46:26
encoding works and so what's happening
46:26
encoding works and so what's happening here is that invalid start pite that's
46:29
here is that invalid start pite that's
46:30
here is that invalid start pite that's because
46:30
because
46:31
because 128 the binary representation of it is
46:33
128 the binary representation of it is
46:33
128 the binary representation of it is one followed by all zeros so we have one
46:37
one followed by all zeros so we have one
46:37
one followed by all zeros so we have one and then all zero and we see here that
46:39
and then all zero and we see here that
46:39
and then all zero and we see here that that doesn't conform to the format
46:41
that doesn't conform to the format
46:41
that doesn't conform to the format because one followed by all zero just
46:42
because one followed by all zero just
46:42
because one followed by all zero just doesn't fit any of these rules so to
46:44
doesn't fit any of these rules so to
46:44
doesn't fit any of these rules so to speak so it's an invalid start bite
46:47
speak so it's an invalid start bite
46:47
speak so it's an invalid start bite which is byte one this one must have a
46:50
which is byte one this one must have a
46:50
which is byte one this one must have a one following it and then a zero
46:52
one following it and then a zero
46:52
one following it and then a zero following it and then the content of
46:54
following it and then the content of
46:54
following it and then the content of your uni codee in x here so basically we
46:57
your uni codee in x here so basically we
46:57
your uni codee in x here so basically we don't um exactly follow the utf8
46:59
don't um exactly follow the utf8
46:59
don't um exactly follow the utf8 standard and this cannot be decoded and
47:02
standard and this cannot be decoded and
47:02
standard and this cannot be decoded and so the way to fix this um is to
47:06
so the way to fix this um is to
47:06
so the way to fix this um is to use this errors equals in bytes. decode
47:11
use this errors equals in bytes. decode
47:11
use this errors equals in bytes. decode function of python and by default errors
47:13
function of python and by default errors
47:13
function of python and by default errors is strict so we will throw an error if
47:17
is strict so we will throw an error if
47:17
is strict so we will throw an error if um it's not valid utf8 bytes encoding
47:20
um it's not valid utf8 bytes encoding
47:20
um it's not valid utf8 bytes encoding but there are many different things that
47:21
but there are many different things that
47:21
but there are many different things that you could put here on error handling
47:23
you could put here on error handling
47:23
you could put here on error handling this is the full list of all the errors
47:25
this is the full list of all the errors
47:25
this is the full list of all the errors that you can use and in particular
47:27
that you can use and in particular
47:27
that you can use and in particular instead of strict let's change it to
47:29
instead of strict let's change it to
47:29
instead of strict let's change it to replace and that will replace uh with
47:32
replace and that will replace uh with
47:32
replace and that will replace uh with this special marker this replacement
47:35
this special marker this replacement
47:35
this special marker this replacement character so errors equals replace and
47:40
character so errors equals replace and
47:40
character so errors equals replace and now we just get that character
47:43
now we just get that character
47:43
now we just get that character back so basically not every single by
47:46
back so basically not every single by
47:46
back so basically not every single by sequence is valid
47:48
sequence is valid
47:48
sequence is valid utf8 and if it happens that your large
47:51
utf8 and if it happens that your large
47:51
utf8 and if it happens that your large language model for example predicts your
47:53
language model for example predicts your
47:53
language model for example predicts your tokens in a bad manner then they might
47:56
tokens in a bad manner then they might
47:56
tokens in a bad manner then they might not fall into valid utf8 and then we
48:00
not fall into valid utf8 and then we
48:00
not fall into valid utf8 and then we won't be able to decode them so the
48:02
won't be able to decode them so the
48:02
won't be able to decode them so the standard practice is to basically uh use
48:05
standard practice is to basically uh use
48:05
standard practice is to basically uh use errors equals replace and this is what
48:07
errors equals replace and this is what
48:07
errors equals replace and this is what you will also find in the openai um code
48:10
you will also find in the openai um code
48:10
you will also find in the openai um code that they released as well but basically
48:12
that they released as well but basically
48:12
that they released as well but basically whenever you see um this kind of a
48:14
whenever you see um this kind of a
48:14
whenever you see um this kind of a character in your output in that case uh
48:15
character in your output in that case uh
48:16
character in your output in that case uh something went wrong and the LM output
48:18
something went wrong and the LM output
48:18
something went wrong and the LM output not was not valid uh sort of sequence of
48:21
not was not valid uh sort of sequence of
48:21
not was not valid uh sort of sequence of tokens okay and now we're going to go
48:23
tokens okay and now we're going to go
48:23
tokens okay and now we're going to go the other way so we are going to
48:25
the other way so we are going to
48:25
the other way so we are going to implement
48:26
implement
48:26
implement this Arrow right here where we are going
48:27
this Arrow right here where we are going
48:27
this Arrow right here where we are going to be given a string and we want to
48:29
to be given a string and we want to
48:29
to be given a string and we want to encode it into
48:31
encode it into
48:31
encode it into tokens so this is the signature of the
48:33
tokens so this is the signature of the
48:33
tokens so this is the signature of the function that we're interested in and um
48:36
function that we're interested in and um
48:36
function that we're interested in and um this should basically print a list of
48:38
this should basically print a list of
48:38
this should basically print a list of integers of the tokens so again uh try
48:41
integers of the tokens so again uh try
48:41
integers of the tokens so again uh try to maybe implement this yourself if
48:43
to maybe implement this yourself if
48:43
to maybe implement this yourself if you'd like a fun exercise uh and pause
48:45
you'd like a fun exercise uh and pause
48:45
you'd like a fun exercise uh and pause here otherwise I'm going to start
48:46
here otherwise I'm going to start
48:46
here otherwise I'm going to start putting in my
48:47
putting in my
48:47
putting in my solution so again there are many ways to
48:50
solution so again there are many ways to
48:50
solution so again there are many ways to do this so um this is one of the ways
48:53
do this so um this is one of the ways
48:53
do this so um this is one of the ways that sort of I came came up with so the
48:57
that sort of I came came up with so the
48:57
that sort of I came came up with so the first thing we're going to do is we are
48:59
first thing we're going to do is we are
48:59
first thing we're going to do is we are going
49:00
going
49:00
going to uh take our text encode it into utf8
49:03
to uh take our text encode it into utf8
49:03
to uh take our text encode it into utf8 to get the raw bytes and then as before
49:05
to get the raw bytes and then as before
49:05
to get the raw bytes and then as before we're going to call list on the bytes
49:07
we're going to call list on the bytes
49:07
we're going to call list on the bytes object to get a list of integers of
49:10
object to get a list of integers of
49:10
object to get a list of integers of those bytes so those are the starting
49:12
those bytes so those are the starting
49:12
those bytes so those are the starting tokens those are the raw bytes of our
49:14
tokens those are the raw bytes of our
49:14
tokens those are the raw bytes of our sequence but now of course according to
49:16
sequence but now of course according to
49:16
sequence but now of course according to the merges dictionary above and recall
49:19
the merges dictionary above and recall
49:19
the merges dictionary above and recall this was the
49:21
this was the
49:21
this was the merges some of the bytes may be merged
49:23
merges some of the bytes may be merged
49:23
merges some of the bytes may be merged according to this lookup in addition to
49:26
according to this lookup in addition to
49:26
according to this lookup in addition to that remember that the merges was built
49:28
that remember that the merges was built
49:28
that remember that the merges was built from top to bottom and this is sort of
49:29
from top to bottom and this is sort of
49:29
from top to bottom and this is sort of the order in which we inserted stuff
49:31
the order in which we inserted stuff
49:31
the order in which we inserted stuff into merges and so we prefer to do all
49:34
into merges and so we prefer to do all
49:34
into merges and so we prefer to do all these merges in the beginning before we
49:36
these merges in the beginning before we
49:36
these merges in the beginning before we do these merges later because um for
49:39
do these merges later because um for
49:39
do these merges later because um for example this merge over here relies on
49:40
example this merge over here relies on
49:40
example this merge over here relies on the 256 which got merged here so we have
49:44
the 256 which got merged here so we have
49:44
the 256 which got merged here so we have to go in the order from top to bottom
49:46
to go in the order from top to bottom
49:46
to go in the order from top to bottom sort of if we are going to be merging
49:48
sort of if we are going to be merging
49:48
sort of if we are going to be merging anything now we expect to be doing a few
49:51
anything now we expect to be doing a few
49:51
anything now we expect to be doing a few merges so we're going to be doing W
49:54
merges so we're going to be doing W
49:54
merges so we're going to be doing W true um and now we want to find a pair
49:58
true um and now we want to find a pair
49:58
true um and now we want to find a pair of byes that is consecutive that we are
50:00
of byes that is consecutive that we are
50:00
of byes that is consecutive that we are allowed to merge according to this in
50:03
allowed to merge according to this in
50:03
allowed to merge according to this in order to reuse some of the functionality
50:04
order to reuse some of the functionality
50:05
order to reuse some of the functionality that we've already written I'm going to
50:06
that we've already written I'm going to
50:06
that we've already written I'm going to reuse the function uh get
50:09
reuse the function uh get
50:09
reuse the function uh get stats so recall that get stats uh will
50:12
stats so recall that get stats uh will
50:12
stats so recall that get stats uh will give us the we'll basically count up how
50:14
give us the we'll basically count up how
50:14
give us the we'll basically count up how many times every single pair occurs in
50:16
many times every single pair occurs in
50:16
many times every single pair occurs in our sequence of tokens and return that
50:18
our sequence of tokens and return that
50:18
our sequence of tokens and return that as a dictionary and the dictionary was a
50:22
as a dictionary and the dictionary was a
50:22
as a dictionary and the dictionary was a mapping from all the different uh by
50:25
mapping from all the different uh by
50:25
mapping from all the different uh by pairs to the number of times that they
50:27
pairs to the number of times that they
50:27
pairs to the number of times that they occur right um at this point we don't
50:30
occur right um at this point we don't
50:30
occur right um at this point we don't actually care how many times they occur
50:32
actually care how many times they occur
50:32
actually care how many times they occur in the sequence we only care what the
50:34
in the sequence we only care what the
50:34
in the sequence we only care what the raw pairs are in that sequence and so
50:36
raw pairs are in that sequence and so
50:36
raw pairs are in that sequence and so I'm only going to be using basically the
50:38
I'm only going to be using basically the
50:38
I'm only going to be using basically the keys of the dictionary I only care about
50:40
keys of the dictionary I only care about
50:40
keys of the dictionary I only care about the set of possible merge candidates if
50:42
the set of possible merge candidates if
50:42
the set of possible merge candidates if that makes
50:43
that makes
50:43
that makes sense now we want to identify the pair
50:46
sense now we want to identify the pair
50:46
sense now we want to identify the pair that we're going to be merging at this
50:47
that we're going to be merging at this
50:47
that we're going to be merging at this stage of the loop so what do we want we
50:50
stage of the loop so what do we want we
50:50
stage of the loop so what do we want we want to find the pair or like the a key
50:53
want to find the pair or like the a key
50:53
want to find the pair or like the a key inside stats that has the lowest index
50:57
inside stats that has the lowest index
50:57
inside stats that has the lowest index in the merges uh dictionary because we
50:59
in the merges uh dictionary because we
50:59
in the merges uh dictionary because we want to do all the early merges before
51:01
want to do all the early merges before
51:01
want to do all the early merges before we work our way to the late
51:03
we work our way to the late
51:03
we work our way to the late merges so again there are many different
51:05
merges so again there are many different
51:05
merges so again there are many different ways to implement this but I'm going to
51:07
ways to implement this but I'm going to
51:07
ways to implement this but I'm going to do something a little bit fancy
51:11
do something a little bit fancy
51:11
do something a little bit fancy here so I'm going to be using the Min
51:14
here so I'm going to be using the Min
51:14
here so I'm going to be using the Min over an iterator in Python when you call
51:16
over an iterator in Python when you call
51:16
over an iterator in Python when you call Min on an iterator and stats here as a
51:18
Min on an iterator and stats here as a
51:18
Min on an iterator and stats here as a dictionary we're going to be iterating
51:20
dictionary we're going to be iterating
51:20
dictionary we're going to be iterating the keys of this dictionary in Python so
51:24
the keys of this dictionary in Python so
51:24
the keys of this dictionary in Python so we're looking at all the pairs inside
51:27
we're looking at all the pairs inside
51:27
we're looking at all the pairs inside stats um which are all the consecutive
51:29
stats um which are all the consecutive
51:29
stats um which are all the consecutive Pairs and we're going to be taking the
51:32
Pairs and we're going to be taking the
51:32
Pairs and we're going to be taking the consecutive pair inside tokens that has
51:34
consecutive pair inside tokens that has
51:34
consecutive pair inside tokens that has the minimum what the Min takes a key
51:38
the minimum what the Min takes a key
51:38
the minimum what the Min takes a key which gives us the function that is
51:40
which gives us the function that is
51:40
which gives us the function that is going to return a value over which we're
51:42
going to return a value over which we're
51:42
going to return a value over which we're going to do the Min and the one we care
51:44
going to do the Min and the one we care
51:44
going to do the Min and the one we care about is we're we care about taking
51:46
about is we're we care about taking
51:46
about is we're we care about taking merges and basically getting um that
51:50
merges and basically getting um that
51:50
merges and basically getting um that pairs
51:52
pairs
51:52
pairs index so basically for any pair inside
51:57
index so basically for any pair inside
51:57
index so basically for any pair inside stats we are going to be looking into
51:59
stats we are going to be looking into
51:59
stats we are going to be looking into merges at what index it has and we want
52:03
merges at what index it has and we want
52:03
merges at what index it has and we want to get the pair with the Min number so
52:05
to get the pair with the Min number so
52:05
to get the pair with the Min number so as an example if there's a pair 101 and
52:07
as an example if there's a pair 101 and
52:07
as an example if there's a pair 101 and 32 we definitely want to get that pair
52:10
32 we definitely want to get that pair
52:10
32 we definitely want to get that pair uh we want to identify it here and
52:11
uh we want to identify it here and
52:11
uh we want to identify it here and return it and pair would become 10132 if
52:15
return it and pair would become 10132 if
52:15
return it and pair would become 10132 if it
52:15
it
52:15
it occurs and the reason that I'm putting a
52:17
occurs and the reason that I'm putting a
52:17
occurs and the reason that I'm putting a float INF here as a fall back is that in
52:21
float INF here as a fall back is that in
52:21
float INF here as a fall back is that in the get function when we call uh when we
52:24
the get function when we call uh when we
52:24
the get function when we call uh when we basically consider a pair that doesn't
52:26
basically consider a pair that doesn't
52:26
basically consider a pair that doesn't occur in the merges then that pair is
52:28
occur in the merges then that pair is
52:29
occur in the merges then that pair is not eligible to be merged right so if in
52:31
not eligible to be merged right so if in
52:31
not eligible to be merged right so if in the token sequence there's some pair
52:33
the token sequence there's some pair
52:33
the token sequence there's some pair that is not a merging pair it cannot be
52:35
that is not a merging pair it cannot be
52:35
that is not a merging pair it cannot be merged then uh it doesn't actually occur
52:38
merged then uh it doesn't actually occur
52:38
merged then uh it doesn't actually occur here and it doesn't have an index and uh
52:40
here and it doesn't have an index and uh
52:40
here and it doesn't have an index and uh it cannot be merged which we will denote
52:42
it cannot be merged which we will denote
52:42
it cannot be merged which we will denote as float INF and the reason Infinity is
52:45
as float INF and the reason Infinity is
52:45
as float INF and the reason Infinity is nice here is because for sure we're
52:46
nice here is because for sure we're
52:46
nice here is because for sure we're guaranteed that it's not going to
52:48
guaranteed that it's not going to
52:48
guaranteed that it's not going to participate in the list of candidates
52:50
participate in the list of candidates
52:50
participate in the list of candidates when we do the men so uh so this is one
52:53
when we do the men so uh so this is one
52:53
when we do the men so uh so this is one way to do it so B basically long story
52:55
way to do it so B basically long story
52:55
way to do it so B basically long story short this Returns the most eligible
52:58
short this Returns the most eligible
52:58
short this Returns the most eligible merging candidate pair uh that occurs in
53:01
merging candidate pair uh that occurs in
53:01
merging candidate pair uh that occurs in the tokens now one thing to be careful
53:04
the tokens now one thing to be careful
53:04
the tokens now one thing to be careful with here is this uh function here might
53:07
with here is this uh function here might
53:07
with here is this uh function here might fail in the following way if there's
53:09
fail in the following way if there's
53:09
fail in the following way if there's nothing to merge then uh uh then there's
53:13
nothing to merge then uh uh then there's
53:13
nothing to merge then uh uh then there's nothing in merges um that satisfi that
53:16
nothing in merges um that satisfi that
53:16
nothing in merges um that satisfi that is satisfied anymore there's nothing to
53:18
is satisfied anymore there's nothing to
53:18
is satisfied anymore there's nothing to merge everything just returns float imps
53:21
merge everything just returns float imps
53:21
merge everything just returns float imps and then the pair I think will just
53:23
and then the pair I think will just
53:23
and then the pair I think will just become the very first element of stats
53:26
become the very first element of stats
53:26
become the very first element of stats um but this pair is not actually a
53:28
um but this pair is not actually a
53:28
um but this pair is not actually a mergeable pair it just becomes the first
53:31
mergeable pair it just becomes the first
53:31
mergeable pair it just becomes the first pair inside stats arbitrarily because
53:33
pair inside stats arbitrarily because
53:33
pair inside stats arbitrarily because all of these pairs evaluate to float in
53:36
all of these pairs evaluate to float in
53:36
all of these pairs evaluate to float in for the merging Criterion so basically
53:38
for the merging Criterion so basically
53:38
for the merging Criterion so basically it could be that this this doesn't look
53:40
it could be that this this doesn't look
53:40
it could be that this this doesn't look succeed because there's no more merging
53:41
succeed because there's no more merging
53:41
succeed because there's no more merging pairs so if this pair is not in merges
53:44
pairs so if this pair is not in merges
53:44
pairs so if this pair is not in merges that was returned then this is a signal
53:46
that was returned then this is a signal
53:46
that was returned then this is a signal for us that actually there was nothing
53:48
for us that actually there was nothing
53:48
for us that actually there was nothing to merge no single pair can be merged
53:50
to merge no single pair can be merged
53:50
to merge no single pair can be merged anymore in that case we will break
53:53
anymore in that case we will break
53:53
anymore in that case we will break out um nothing else can be
53:57
merged you may come up with a different
53:59
merged you may come up with a different
53:59
merged you may come up with a different implementation by the way this is kind
54:01
implementation by the way this is kind
54:01
implementation by the way this is kind of like really trying hard in
54:03
of like really trying hard in
54:03
of like really trying hard in Python um but really we're just trying
54:05
Python um but really we're just trying
54:05
Python um but really we're just trying to find a pair that can be merged with
54:07
to find a pair that can be merged with
54:07
to find a pair that can be merged with the lowest index
54:09
the lowest index
54:09
the lowest index here now if we did find a pair that is
54:13
here now if we did find a pair that is
54:13
here now if we did find a pair that is inside merges with the lowest index then
54:16
inside merges with the lowest index then
54:16
inside merges with the lowest index then we can merge it
54:19
so we're going to look into the merger
54:22
so we're going to look into the merger
54:22
so we're going to look into the merger dictionary for that pair to look up the
54:24
dictionary for that pair to look up the
54:24
dictionary for that pair to look up the index and we're going to now merge that
54:27
index and we're going to now merge that
54:27
index and we're going to now merge that into that index so we're going to do
54:29
into that index so we're going to do
54:29
into that index so we're going to do tokens equals and we're going to
54:32
tokens equals and we're going to
54:32
tokens equals and we're going to replace the original tokens we're going
54:34
replace the original tokens we're going
54:34
replace the original tokens we're going to be replacing the pair pair and we're
54:36
to be replacing the pair pair and we're
54:36
to be replacing the pair pair and we're going to be replacing it with index idx
54:38
going to be replacing it with index idx
54:38
going to be replacing it with index idx and this returns a new list of tokens
54:41
and this returns a new list of tokens
54:41
and this returns a new list of tokens where every occurrence of pair is
54:43
where every occurrence of pair is
54:43
where every occurrence of pair is replaced with idx so we're doing a merge
54:46
replaced with idx so we're doing a merge
54:46
replaced with idx so we're doing a merge and we're going to be continuing this
54:47
and we're going to be continuing this
54:47
and we're going to be continuing this until eventually nothing can be merged
54:49
until eventually nothing can be merged
54:49
until eventually nothing can be merged we'll come out here and we'll break out
54:51
we'll come out here and we'll break out
54:51
we'll come out here and we'll break out and here we just return
54:53
and here we just return
54:53
and here we just return tokens and so that that's the
54:55
tokens and so that that's the
54:55
tokens and so that that's the implementation I think so hopefully this
54:57
implementation I think so hopefully this
54:57
implementation I think so hopefully this runs okay cool um yeah and this looks uh
55:02
runs okay cool um yeah and this looks uh
55:02
runs okay cool um yeah and this looks uh reasonable so for example 32 is a space
55:04
reasonable so for example 32 is a space
55:04
reasonable so for example 32 is a space in asky so that's here um so this looks
55:09
in asky so that's here um so this looks
55:09
in asky so that's here um so this looks like it worked great okay so let's wrap
55:11
like it worked great okay so let's wrap
55:11
like it worked great okay so let's wrap up this section of the video at least I
55:13
up this section of the video at least I
55:13
up this section of the video at least I wanted to point out that this is not
55:14
wanted to point out that this is not
55:14
wanted to point out that this is not quite the right implementation just yet
55:16
quite the right implementation just yet
55:16
quite the right implementation just yet because we are leaving out a special
55:17
because we are leaving out a special
55:17
because we are leaving out a special case so in particular if uh we try to do
55:20
case so in particular if uh we try to do
55:20
case so in particular if uh we try to do this this would give us an error and the
55:23
this this would give us an error and the
55:23
this this would give us an error and the issue is that um if we only have a
55:25
issue is that um if we only have a
55:25
issue is that um if we only have a single character or an empty string then
55:28
single character or an empty string then
55:28
single character or an empty string then stats is empty and that causes an issue
55:29
stats is empty and that causes an issue
55:29
stats is empty and that causes an issue inside Min so one way to fight this is
55:32
inside Min so one way to fight this is
55:32
inside Min so one way to fight this is if L of tokens is at least two because
55:36
if L of tokens is at least two because
55:36
if L of tokens is at least two because if it's less than two it's just a single
55:37
if it's less than two it's just a single
55:37
if it's less than two it's just a single token or no tokens then let's just uh
55:40
token or no tokens then let's just uh
55:40
token or no tokens then let's just uh there's nothing to merge so we just
55:41
there's nothing to merge so we just
55:41
there's nothing to merge so we just return so that would fix uh that
55:44
return so that would fix uh that
55:44
return so that would fix uh that case Okay and then second I have a few
55:48
case Okay and then second I have a few
55:48
case Okay and then second I have a few test cases here for us as well so first
55:50
test cases here for us as well so first
55:50
test cases here for us as well so first let's make sure uh about or let's note
55:53
let's make sure uh about or let's note
55:53
let's make sure uh about or let's note the following if we take a string and we
55:56
the following if we take a string and we
55:56
the following if we take a string and we try to encode it and then decode it back
55:58
try to encode it and then decode it back
55:58
try to encode it and then decode it back you'd expect to get the same string back
56:00
you'd expect to get the same string back
56:00
you'd expect to get the same string back right is that true for all
56:04
strings so I think uh so here it is the
56:07
strings so I think uh so here it is the
56:07
strings so I think uh so here it is the case and I think in general this is
56:08
case and I think in general this is
56:08
case and I think in general this is probably the case um but notice that
56:12
probably the case um but notice that
56:12
probably the case um but notice that going backwards is not is not you're not
56:14
going backwards is not is not you're not
56:14
going backwards is not is not you're not going to have an identity going
56:15
going to have an identity going
56:15
going to have an identity going backwards because as I mentioned us not
56:19
backwards because as I mentioned us not
56:19
backwards because as I mentioned us not all token sequences are valid utf8 uh
56:22
all token sequences are valid utf8 uh
56:22
all token sequences are valid utf8 uh sort of by streams and so so therefore
56:25
sort of by streams and so so therefore
56:25
sort of by streams and so so therefore you're some of them can't even be
56:27
you're some of them can't even be
56:27
you're some of them can't even be decodable um so this only goes in One
56:30
decodable um so this only goes in One
56:30
decodable um so this only goes in One Direction but for that one direction we
56:32
Direction but for that one direction we
56:32
Direction but for that one direction we can check uh here if we take the
56:34
can check uh here if we take the
56:34
can check uh here if we take the training text which is the text that we
56:36
training text which is the text that we
56:36
training text which is the text that we train to tokenizer around we can make
56:37
train to tokenizer around we can make
56:38
train to tokenizer around we can make sure that when we encode and decode we
56:39
sure that when we encode and decode we
56:39
sure that when we encode and decode we get the same thing back which is true
56:41
get the same thing back which is true
56:41
get the same thing back which is true and here I took some validation data so
56:43
and here I took some validation data so
56:43
and here I took some validation data so I went to I think this web page and I
56:45
I went to I think this web page and I
56:45
I went to I think this web page and I grabbed some text so this is text that
56:47
grabbed some text so this is text that
56:47
grabbed some text so this is text that the tokenizer has not seen and we can
56:49
the tokenizer has not seen and we can
56:49
the tokenizer has not seen and we can make sure that this also works um okay
56:52
make sure that this also works um okay
56:52
make sure that this also works um okay so that gives us some confidence that
56:53
so that gives us some confidence that
56:53
so that gives us some confidence that this was correctly implemented
56:55
this was correctly implemented
56:56
this was correctly implemented so those are the basics of the bite pair
56:58
so those are the basics of the bite pair
56:58
so those are the basics of the bite pair encoding algorithm we saw how we can uh
57:00
encoding algorithm we saw how we can uh
57:00
encoding algorithm we saw how we can uh take some training set train a tokenizer
57:03
take some training set train a tokenizer
57:03
take some training set train a tokenizer the parameters of this tokenizer really
57:05
the parameters of this tokenizer really
57:05
the parameters of this tokenizer really are just this dictionary of merges and
57:08
are just this dictionary of merges and
57:08
are just this dictionary of merges and that basically creates the little binary
57:09
that basically creates the little binary
57:09
that basically creates the little binary Forest on top of raw
57:11
Forest on top of raw
57:11
Forest on top of raw bites once we have this the merges table
57:14
bites once we have this the merges table
57:14
bites once we have this the merges table we can both encode and decode between
57:16
we can both encode and decode between
57:16
we can both encode and decode between raw text and token sequences so that's
57:19
raw text and token sequences so that's
57:19
raw text and token sequences so that's the the simplest setting of The
57:21
the the simplest setting of The
57:21
the the simplest setting of The tokenizer what we're going to do now
57:23
tokenizer what we're going to do now
57:23
tokenizer what we're going to do now though is we're going to look at some of
57:24
though is we're going to look at some of
57:24
though is we're going to look at some of the St the art lar language models and
57:26
the St the art lar language models and
57:26
the St the art lar language models and the kinds of tokenizers that they use
57:28
the kinds of tokenizers that they use
57:28
the kinds of tokenizers that they use and we're going to see that this picture
57:29
and we're going to see that this picture
57:29
and we're going to see that this picture complexifies very quickly so we're going
57:31
complexifies very quickly so we're going
57:31
complexifies very quickly so we're going to go through the details of this comp
57:34
to go through the details of this comp
57:34
to go through the details of this comp complexification one at a time so let's
57:37
complexification one at a time so let's
57:37
complexification one at a time so let's kick things off by looking at the GPD
57:39
kick things off by looking at the GPD
57:39
kick things off by looking at the GPD Series so in particular I have the gpt2
57:41
Series so in particular I have the gpt2
57:41
Series so in particular I have the gpt2 paper here um and this paper is from
57:44
paper here um and this paper is from
57:44
paper here um and this paper is from 2019 or so so 5 years ago and let's
57:48
2019 or so so 5 years ago and let's
57:48
2019 or so so 5 years ago and let's scroll down to input representation this
57:51
scroll down to input representation this
57:51
scroll down to input representation this is where they talk about the tokenizer
57:52
is where they talk about the tokenizer
57:52
is where they talk about the tokenizer that they're using for gpd2 now this is
57:55
that they're using for gpd2 now this is
57:55
that they're using for gpd2 now this is all fairly readable so I encourage you
57:57
all fairly readable so I encourage you
57:57
all fairly readable so I encourage you to pause and um read this yourself but
58:00
to pause and um read this yourself but
58:00
to pause and um read this yourself but this is where they motivate the use of
58:01
this is where they motivate the use of
58:02
this is where they motivate the use of the bite pair encoding algorithm on the
58:04
the bite pair encoding algorithm on the
58:04
the bite pair encoding algorithm on the bite level representation of utf8
58:07
bite level representation of utf8
58:07
bite level representation of utf8 encoding so this is where they motivate
58:09
encoding so this is where they motivate
58:09
encoding so this is where they motivate it and they talk about the vocabulary
58:11
it and they talk about the vocabulary
58:11
it and they talk about the vocabulary sizes and everything now everything here
58:13
sizes and everything now everything here
58:13
sizes and everything now everything here is exactly as we've covered it so far
58:15
is exactly as we've covered it so far
58:15
is exactly as we've covered it so far but things start to depart around here
58:18
but things start to depart around here
58:18
but things start to depart around here so what they mention is that they don't
58:20
so what they mention is that they don't
58:20
so what they mention is that they don't just apply the naive algorithm as we
58:22
just apply the naive algorithm as we
58:22
just apply the naive algorithm as we have done it and in particular here's a
58:25
have done it and in particular here's a
58:25
have done it and in particular here's a example suppose that you have common
58:26
example suppose that you have common
58:27
example suppose that you have common words like dog what will happen is that
58:29
words like dog what will happen is that
58:29
words like dog what will happen is that dog of course occurs very frequently in
58:31
dog of course occurs very frequently in
58:31
dog of course occurs very frequently in the text and it occurs right next to all
58:34
the text and it occurs right next to all
58:34
the text and it occurs right next to all kinds of punctuation as an example so
58:36
kinds of punctuation as an example so
58:36
kinds of punctuation as an example so doc dot dog exclamation mark dog
58:39
doc dot dog exclamation mark dog
58:39
doc dot dog exclamation mark dog question mark Etc and naively you might
58:42
question mark Etc and naively you might
58:42
question mark Etc and naively you might imagine that the BP algorithm could
58:43
imagine that the BP algorithm could
58:43
imagine that the BP algorithm could merge these to be single tokens and then
58:45
merge these to be single tokens and then
58:45
merge these to be single tokens and then you end up with lots of tokens that are
58:47
you end up with lots of tokens that are
58:47
you end up with lots of tokens that are just like dog with a slightly different
58:48
just like dog with a slightly different
58:49
just like dog with a slightly different punctuation and so it feels like you're
58:50
punctuation and so it feels like you're
58:50
punctuation and so it feels like you're clustering things that shouldn't be
58:52
clustering things that shouldn't be
58:52
clustering things that shouldn't be clustered you're combining kind of
58:53
clustered you're combining kind of
58:53
clustered you're combining kind of semantics with
58:55
semantics with
58:55
semantics with uation and this uh feels suboptimal and
58:58
uation and this uh feels suboptimal and
58:58
uation and this uh feels suboptimal and indeed they also say that this is
59:00
indeed they also say that this is
59:00
indeed they also say that this is suboptimal according to some of the
59:02
suboptimal according to some of the
59:02
suboptimal according to some of the experiments so what they want to do is
59:04
experiments so what they want to do is
59:04
experiments so what they want to do is they want to top down in a manual way
59:06
they want to top down in a manual way
59:06
they want to top down in a manual way enforce that some types of um characters
59:09
enforce that some types of um characters
59:09
enforce that some types of um characters should never be merged together um so
59:12
should never be merged together um so
59:12
should never be merged together um so they want to enforce these merging rules
59:14
they want to enforce these merging rules
59:14
they want to enforce these merging rules on top of the bite PA encoding algorithm
59:17
on top of the bite PA encoding algorithm
59:17
on top of the bite PA encoding algorithm so let's take a look um at their code
59:19
so let's take a look um at their code
59:19
so let's take a look um at their code and see how they actually enforce this
59:21
and see how they actually enforce this
59:21
and see how they actually enforce this and what kinds of mergy they actually do
59:23
and what kinds of mergy they actually do
59:23
and what kinds of mergy they actually do perform so I have to to tab open here
59:25
perform so I have to to tab open here
59:25
perform so I have to to tab open here for gpt2 under open AI on GitHub and
59:29
for gpt2 under open AI on GitHub and
59:29
for gpt2 under open AI on GitHub and when we go to
59:30
when we go to
59:30
when we go to Source there is an encoder thatp now I
59:34
Source there is an encoder thatp now I
59:34
Source there is an encoder thatp now I don't personally love that they call it
59:35
don't personally love that they call it
59:35
don't personally love that they call it encoder dopy because this is the
59:37
encoder dopy because this is the
59:37
encoder dopy because this is the tokenizer and the tokenizer can do both
59:39
tokenizer and the tokenizer can do both
59:39
tokenizer and the tokenizer can do both encode and decode uh so it feels kind of
59:41
encode and decode uh so it feels kind of
59:41
encode and decode uh so it feels kind of awkward to me that it's called encoder
59:43
awkward to me that it's called encoder
59:43
awkward to me that it's called encoder but that is the tokenizer and there's a
59:45
but that is the tokenizer and there's a
59:45
but that is the tokenizer and there's a lot going on here and we're going to
59:46
lot going on here and we're going to
59:47
lot going on here and we're going to step through it in detail at one point
59:49
step through it in detail at one point
59:49
step through it in detail at one point for now I just want to focus on this
59:51
for now I just want to focus on this
59:51
for now I just want to focus on this part here the create a rigix pattern
59:54
part here the create a rigix pattern
59:54
part here the create a rigix pattern here that looks very complicated and
59:56
here that looks very complicated and
59:56
here that looks very complicated and we're going to go through it in a bit uh
59:58
we're going to go through it in a bit uh
59:58
we're going to go through it in a bit uh but this is the core part that allows
1:00:00
but this is the core part that allows
1:00:00
but this is the core part that allows them to enforce rules uh for what parts
1:00:03
them to enforce rules uh for what parts
1:00:04
them to enforce rules uh for what parts of the text Will Never Be merged for
1:00:05
of the text Will Never Be merged for
1:00:05
of the text Will Never Be merged for sure now notice that re. compile here is
1:00:08
sure now notice that re. compile here is
1:00:08
sure now notice that re. compile here is a little bit misleading because we're
1:00:10
a little bit misleading because we're
1:00:10
a little bit misleading because we're not just doing import re which is the
1:00:12
not just doing import re which is the
1:00:12
not just doing import re which is the python re module we're doing import reex
1:00:14
python re module we're doing import reex
1:00:14
python re module we're doing import reex as re and reex is a python package that
1:00:17
as re and reex is a python package that
1:00:17
as re and reex is a python package that you can install P install r x and it's
1:00:20
you can install P install r x and it's
1:00:20
you can install P install r x and it's basically an extension of re so it's a
1:00:22
basically an extension of re so it's a
1:00:22
basically an extension of re so it's a bit more powerful
1:00:23
bit more powerful
1:00:23
bit more powerful re um
1:00:25
re um
1:00:26
re um so let's take a look at this pattern and
1:00:28
so let's take a look at this pattern and
1:00:28
so let's take a look at this pattern and what it's doing and why this is actually
1:00:30
what it's doing and why this is actually
1:00:30
what it's doing and why this is actually doing the separation that they are
1:00:32
doing the separation that they are
1:00:32
doing the separation that they are looking for okay so I've copy pasted the
1:00:34
looking for okay so I've copy pasted the
1:00:34
looking for okay so I've copy pasted the pattern here to our jupit notebook where
1:00:37
pattern here to our jupit notebook where
1:00:37
pattern here to our jupit notebook where we left off and let's take this pattern
1:00:39
we left off and let's take this pattern
1:00:39
we left off and let's take this pattern for a spin so in the exact same way that
1:00:42
for a spin so in the exact same way that
1:00:42
for a spin so in the exact same way that their code does we're going to call an
1:00:44
their code does we're going to call an
1:00:44
their code does we're going to call an re. findall for this pattern on any
1:00:47
re. findall for this pattern on any
1:00:47
re. findall for this pattern on any arbitrary string that we are interested
1:00:49
arbitrary string that we are interested
1:00:49
arbitrary string that we are interested so this is the string that we want to
1:00:50
so this is the string that we want to
1:00:50
so this is the string that we want to encode into tokens um to feed into n llm
1:00:55
encode into tokens um to feed into n llm
1:00:55
encode into tokens um to feed into n llm like gpt2 so what exactly is this doing
1:00:59
like gpt2 so what exactly is this doing
1:00:59
like gpt2 so what exactly is this doing well re. findall will take this pattern
1:01:01
well re. findall will take this pattern
1:01:01
well re. findall will take this pattern and try to match it against a
1:01:02
and try to match it against a
1:01:02
and try to match it against a string um the way this works is that you
1:01:06
string um the way this works is that you
1:01:06
string um the way this works is that you are going from left to right in the
1:01:07
are going from left to right in the
1:01:07
are going from left to right in the string and you're trying to match the
1:01:10
string and you're trying to match the
1:01:10
string and you're trying to match the pattern and R.F find all will get all
1:01:13
pattern and R.F find all will get all
1:01:13
pattern and R.F find all will get all the occurrences and organize them into a
1:01:16
the occurrences and organize them into a
1:01:16
the occurrences and organize them into a list now when you look at the um when
1:01:19
list now when you look at the um when
1:01:19
list now when you look at the um when you look at this pattern first of all
1:01:20
you look at this pattern first of all
1:01:20
you look at this pattern first of all notice that this is a raw string um and
1:01:23
notice that this is a raw string um and
1:01:23
notice that this is a raw string um and then these are three double quotes just
1:01:26
then these are three double quotes just
1:01:26
then these are three double quotes just to start the string so really the string
1:01:28
to start the string so really the string
1:01:28
to start the string so really the string itself this is the pattern itself
1:01:31
itself this is the pattern itself
1:01:31
itself this is the pattern itself right and notice that it's made up of a
1:01:34
right and notice that it's made up of a
1:01:34
right and notice that it's made up of a lot of ores so see these vertical bars
1:01:36
lot of ores so see these vertical bars
1:01:36
lot of ores so see these vertical bars those are ores in reg X and so you go
1:01:40
those are ores in reg X and so you go
1:01:40
those are ores in reg X and so you go from left to right in this pattern and
1:01:41
from left to right in this pattern and
1:01:41
from left to right in this pattern and try to match it against the string
1:01:43
try to match it against the string
1:01:43
try to match it against the string wherever you are so we have hello and
1:01:46
wherever you are so we have hello and
1:01:46
wherever you are so we have hello and we're going to try to match it well it's
1:01:48
we're going to try to match it well it's
1:01:48
we're going to try to match it well it's not apostrophe s it's not apostrophe t
1:01:50
not apostrophe s it's not apostrophe t
1:01:50
not apostrophe s it's not apostrophe t or any of these but it is an optional
1:01:53
or any of these but it is an optional
1:01:53
or any of these but it is an optional space followed by- P of uh sorry SL P of
1:01:58
space followed by- P of uh sorry SL P of
1:01:58
space followed by- P of uh sorry SL P of L one or more times what is/ P of L it
1:02:02
L one or more times what is/ P of L it
1:02:02
L one or more times what is/ P of L it is coming to some documentation that I
1:02:04
is coming to some documentation that I
1:02:04
is coming to some documentation that I found um there might be other sources as
1:02:07
found um there might be other sources as
1:02:08
found um there might be other sources as well uh SLP is a letter any kind of
1:02:11
well uh SLP is a letter any kind of
1:02:11
well uh SLP is a letter any kind of letter from any language and hello is
1:02:15
letter from any language and hello is
1:02:15
letter from any language and hello is made up of letters h e l Etc so optional
1:02:19
made up of letters h e l Etc so optional
1:02:19
made up of letters h e l Etc so optional space followed by a bunch of letters one
1:02:21
space followed by a bunch of letters one
1:02:21
space followed by a bunch of letters one or more letters is going to match hello
1:02:24
or more letters is going to match hello
1:02:24
or more letters is going to match hello but then the match ends because a white
1:02:27
but then the match ends because a white
1:02:27
but then the match ends because a white space is not a letter so from there on
1:02:31
space is not a letter so from there on
1:02:31
space is not a letter so from there on begins a new sort of attempt to match
1:02:33
begins a new sort of attempt to match
1:02:33
begins a new sort of attempt to match against the string again and starting in
1:02:36
against the string again and starting in
1:02:36
against the string again and starting in here we're going to skip over all of
1:02:38
here we're going to skip over all of
1:02:38
here we're going to skip over all of these again until we get to the exact
1:02:40
these again until we get to the exact
1:02:40
these again until we get to the exact same Point again and we see that there's
1:02:42
same Point again and we see that there's
1:02:42
same Point again and we see that there's an optional space this is the optional
1:02:44
an optional space this is the optional
1:02:44
an optional space this is the optional space followed by a bunch of letters one
1:02:46
space followed by a bunch of letters one
1:02:46
space followed by a bunch of letters one or more of them and so that matches so
1:02:48
or more of them and so that matches so
1:02:48
or more of them and so that matches so when we run this we get a list of two
1:02:51
when we run this we get a list of two
1:02:52
when we run this we get a list of two elements hello and then space world
1:02:55
elements hello and then space world
1:02:55
elements hello and then space world so how are you if we add more letters we
1:02:58
so how are you if we add more letters we
1:02:58
so how are you if we add more letters we would just get them like this now what
1:03:01
would just get them like this now what
1:03:01
would just get them like this now what is this doing and why is this important
1:03:03
is this doing and why is this important
1:03:03
is this doing and why is this important we are taking our string and instead of
1:03:05
we are taking our string and instead of
1:03:05
we are taking our string and instead of directly encoding it um for
1:03:08
directly encoding it um for
1:03:09
directly encoding it um for tokenization we are first splitting it
1:03:11
tokenization we are first splitting it
1:03:11
tokenization we are first splitting it up and when you actually step through
1:03:13
up and when you actually step through
1:03:13
up and when you actually step through the code and we'll do that in a bit more
1:03:15
the code and we'll do that in a bit more
1:03:15
the code and we'll do that in a bit more detail what really is doing on a high
1:03:17
detail what really is doing on a high
1:03:17
detail what really is doing on a high level is that it first splits your text
1:03:20
level is that it first splits your text
1:03:20
level is that it first splits your text into a list of texts just like this one
1:03:24
into a list of texts just like this one
1:03:24
into a list of texts just like this one and all these elements of this list are
1:03:26
and all these elements of this list are
1:03:26
and all these elements of this list are processed independently by the tokenizer
1:03:29
processed independently by the tokenizer
1:03:29
processed independently by the tokenizer and all of the results of that
1:03:30
and all of the results of that
1:03:30
and all of the results of that processing are simply
1:03:32
processing are simply
1:03:32
processing are simply concatenated so hello world oh I I
1:03:35
concatenated so hello world oh I I
1:03:35
concatenated so hello world oh I I missed how hello world how are you we
1:03:39
missed how hello world how are you we
1:03:39
missed how hello world how are you we have five elements of list all of these
1:03:41
have five elements of list all of these
1:03:41
have five elements of list all of these will independent
1:03:44
will independent
1:03:44
will independent independently go from text to a token
1:03:46
independently go from text to a token
1:03:47
independently go from text to a token sequence and then that token sequence is
1:03:49
sequence and then that token sequence is
1:03:49
sequence and then that token sequence is going to be concatenated it's all going
1:03:50
going to be concatenated it's all going
1:03:50
going to be concatenated it's all going to be joined up and roughly speaking
1:03:54
to be joined up and roughly speaking
1:03:54
to be joined up and roughly speaking what that does is you're only ever
1:03:56
what that does is you're only ever
1:03:56
what that does is you're only ever finding merges between the elements of
1:03:58
finding merges between the elements of
1:03:58
finding merges between the elements of this list so you can only ever consider
1:04:00
this list so you can only ever consider
1:04:00
this list so you can only ever consider merges within every one of these
1:04:01
merges within every one of these
1:04:01
merges within every one of these elements in
1:04:03
elements in
1:04:03
elements in individually and um after you've done
1:04:06
individually and um after you've done
1:04:06
individually and um after you've done all the possible merging for all of
1:04:07
all the possible merging for all of
1:04:07
all the possible merging for all of these elements individually the results
1:04:09
these elements individually the results
1:04:09
these elements individually the results of all that will be joined um by
1:04:13
of all that will be joined um by
1:04:13
of all that will be joined um by concatenation and so you are basically
1:04:16
concatenation and so you are basically
1:04:16
concatenation and so you are basically what what you're doing effectively is
1:04:18
what what you're doing effectively is
1:04:18
what what you're doing effectively is you are never going to be merging this e
1:04:20
you are never going to be merging this e
1:04:21
you are never going to be merging this e with this space because they are now
1:04:23
with this space because they are now
1:04:23
with this space because they are now parts of the separate elements of this
1:04:25
parts of the separate elements of this
1:04:25
parts of the separate elements of this list and so you are saying we are never
1:04:27
list and so you are saying we are never
1:04:27
list and so you are saying we are never going to merge
1:04:28
going to merge
1:04:28
going to merge eace um because we're breaking it up in
1:04:32
eace um because we're breaking it up in
1:04:32
eace um because we're breaking it up in this way so basically using this regx
1:04:35
this way so basically using this regx
1:04:35
this way so basically using this regx pattern to Chunk Up the text is just one
1:04:37
pattern to Chunk Up the text is just one
1:04:37
pattern to Chunk Up the text is just one way of enforcing that some merges are
1:04:41
way of enforcing that some merges are
1:04:41
way of enforcing that some merges are not to happen and we're going to go into
1:04:43
not to happen and we're going to go into
1:04:43
not to happen and we're going to go into more of this text and we'll see that
1:04:45
more of this text and we'll see that
1:04:45
more of this text and we'll see that what this is trying to do on a high
1:04:46
what this is trying to do on a high
1:04:46
what this is trying to do on a high level is we're trying to not merge
1:04:47
level is we're trying to not merge
1:04:48
level is we're trying to not merge across letters across numbers across
1:04:50
across letters across numbers across
1:04:50
across letters across numbers across punctuation and so on so let's see in
1:04:53
punctuation and so on so let's see in
1:04:53
punctuation and so on so let's see in more detail how that works so let's
1:04:54
more detail how that works so let's
1:04:54
more detail how that works so let's continue now we have/ P ofn if you go to
1:04:57
continue now we have/ P ofn if you go to
1:04:58
continue now we have/ P ofn if you go to the documentation SLP of n is any kind
1:05:01
the documentation SLP of n is any kind
1:05:01
the documentation SLP of n is any kind of numeric character in any script so
1:05:04
of numeric character in any script so
1:05:04
of numeric character in any script so it's numbers so we have an optional
1:05:06
it's numbers so we have an optional
1:05:06
it's numbers so we have an optional space followed by numbers and those
1:05:08
space followed by numbers and those
1:05:08
space followed by numbers and those would be separated out so letters and
1:05:10
would be separated out so letters and
1:05:10
would be separated out so letters and numbers are being separated so if I do
1:05:12
numbers are being separated so if I do
1:05:12
numbers are being separated so if I do Hello World 123 how are you then world
1:05:15
Hello World 123 how are you then world
1:05:15
Hello World 123 how are you then world will stop matching here because one is
1:05:17
will stop matching here because one is
1:05:17
will stop matching here because one is not a letter anymore but one is a number
1:05:20
not a letter anymore but one is a number
1:05:20
not a letter anymore but one is a number so this group will match for that and
1:05:22
so this group will match for that and
1:05:22
so this group will match for that and we'll get it as a separate entity
1:05:26
uh let's see how these apostrophes work
1:05:28
uh let's see how these apostrophes work
1:05:28
uh let's see how these apostrophes work so here if we have
1:05:30
so here if we have
1:05:31
so here if we have um uh Slash V or I mean apostrophe V as
1:05:35
um uh Slash V or I mean apostrophe V as
1:05:35
um uh Slash V or I mean apostrophe V as an example then apostrophe here is not a
1:05:38
an example then apostrophe here is not a
1:05:38
an example then apostrophe here is not a letter or a
1:05:39
letter or a
1:05:39
letter or a number so hello will stop matching and
1:05:42
number so hello will stop matching and
1:05:42
number so hello will stop matching and then we will exactly match this with
1:05:44
then we will exactly match this with
1:05:44
then we will exactly match this with that so that will come out as a separate
1:05:48
that so that will come out as a separate
1:05:48
that so that will come out as a separate thing so why are they doing the
1:05:50
thing so why are they doing the
1:05:50
thing so why are they doing the apostrophes here honestly I think that
1:05:52
apostrophes here honestly I think that
1:05:52
apostrophes here honestly I think that these are just like very common
1:05:53
these are just like very common
1:05:53
these are just like very common apostrophes p uh that are used um
1:05:56
apostrophes p uh that are used um
1:05:56
apostrophes p uh that are used um typically I don't love that they've done
1:05:59
typically I don't love that they've done
1:05:59
typically I don't love that they've done this
1:06:00
this
1:06:00
this because uh let me show you what happens
1:06:03
because uh let me show you what happens
1:06:03
because uh let me show you what happens when you have uh some Unicode
1:06:05
when you have uh some Unicode
1:06:05
when you have uh some Unicode apostrophes like for example you can
1:06:07
apostrophes like for example you can
1:06:07
apostrophes like for example you can have if you have house then this will be
1:06:10
have if you have house then this will be
1:06:10
have if you have house then this will be separated out because of this matching
1:06:13
separated out because of this matching
1:06:13
separated out because of this matching but if you use the Unicode apostrophe
1:06:15
but if you use the Unicode apostrophe
1:06:15
but if you use the Unicode apostrophe like
1:06:16
like
1:06:16
like this then suddenly this does not work
1:06:19
this then suddenly this does not work
1:06:19
this then suddenly this does not work and so this apostrophe will actually
1:06:21
and so this apostrophe will actually
1:06:21
and so this apostrophe will actually become its own thing now and so so um
1:06:24
become its own thing now and so so um
1:06:24
become its own thing now and so so um it's basically hardcoded for this
1:06:26
it's basically hardcoded for this
1:06:26
it's basically hardcoded for this specific kind of apostrophe and uh
1:06:29
specific kind of apostrophe and uh
1:06:29
specific kind of apostrophe and uh otherwise they become completely
1:06:31
otherwise they become completely
1:06:31
otherwise they become completely separate tokens in addition to this you
1:06:34
separate tokens in addition to this you
1:06:34
separate tokens in addition to this you can go to the gpt2 docs and here when
1:06:38
can go to the gpt2 docs and here when
1:06:38
can go to the gpt2 docs and here when they Define the pattern they say should
1:06:40
they Define the pattern they say should
1:06:40
they Define the pattern they say should have added re. ignore case so BP merges
1:06:42
have added re. ignore case so BP merges
1:06:43
have added re. ignore case so BP merges can happen for capitalized versions of
1:06:44
can happen for capitalized versions of
1:06:44
can happen for capitalized versions of contractions so what they're pointing
1:06:46
contractions so what they're pointing
1:06:46
contractions so what they're pointing out is that you see how this is
1:06:47
out is that you see how this is
1:06:47
out is that you see how this is apostrophe and then lowercase letters
1:06:50
apostrophe and then lowercase letters
1:06:50
apostrophe and then lowercase letters well because they didn't do re. ignore
1:06:52
well because they didn't do re. ignore
1:06:52
well because they didn't do re. ignore case then then um these rules will not
1:06:56
case then then um these rules will not
1:06:56
case then then um these rules will not separate out the apostrophes if it's
1:06:58
separate out the apostrophes if it's
1:06:58
separate out the apostrophes if it's uppercase so
1:07:01
uppercase so
1:07:01
uppercase so house would be like this but if I did
1:07:06
house would be like this but if I did
1:07:06
house would be like this but if I did house if I'm uppercase then notice
1:07:10
house if I'm uppercase then notice
1:07:10
house if I'm uppercase then notice suddenly the apostrophe comes by
1:07:12
suddenly the apostrophe comes by
1:07:12
suddenly the apostrophe comes by itself so the tokenization will work
1:07:15
itself so the tokenization will work
1:07:15
itself so the tokenization will work differently in uppercase and lower case
1:07:17
differently in uppercase and lower case
1:07:17
differently in uppercase and lower case inconsistently separating out these
1:07:19
inconsistently separating out these
1:07:19
inconsistently separating out these apostrophes so it feels extremely gnarly
1:07:21
apostrophes so it feels extremely gnarly
1:07:21
apostrophes so it feels extremely gnarly and slightly gross um but that's that's
1:07:24
and slightly gross um but that's that's
1:07:24
and slightly gross um but that's that's how that works okay so let's come back
1:07:27
how that works okay so let's come back
1:07:27
how that works okay so let's come back after trying to match a bunch of
1:07:28
after trying to match a bunch of
1:07:28
after trying to match a bunch of apostrophe Expressions by the way the
1:07:30
apostrophe Expressions by the way the
1:07:30
apostrophe Expressions by the way the other issue here is that these are quite
1:07:32
other issue here is that these are quite
1:07:32
other issue here is that these are quite language specific probably so I don't
1:07:34
language specific probably so I don't
1:07:34
language specific probably so I don't know that all the languages for example
1:07:35
know that all the languages for example
1:07:35
know that all the languages for example use or don't use apostrophes but that
1:07:37
use or don't use apostrophes but that
1:07:37
use or don't use apostrophes but that would be inconsistently tokenized as a
1:07:39
would be inconsistently tokenized as a
1:07:39
would be inconsistently tokenized as a result then we try to match letters then
1:07:42
result then we try to match letters then
1:07:42
result then we try to match letters then we try to match numbers and then if that
1:07:44
we try to match numbers and then if that
1:07:44
we try to match numbers and then if that doesn't work we fall back to here and
1:07:47
doesn't work we fall back to here and
1:07:47
doesn't work we fall back to here and what this is saying is again optional
1:07:49
what this is saying is again optional
1:07:49
what this is saying is again optional space followed by something that is not
1:07:50
space followed by something that is not
1:07:50
space followed by something that is not a letter number or a space in one or
1:07:53
a letter number or a space in one or
1:07:53
a letter number or a space in one or more of that so what this is doing
1:07:55
more of that so what this is doing
1:07:55
more of that so what this is doing effectively is this is trying to match
1:07:57
effectively is this is trying to match
1:07:57
effectively is this is trying to match punctuation roughly speaking not letters
1:07:59
punctuation roughly speaking not letters
1:07:59
punctuation roughly speaking not letters and not numbers so this group will try
1:08:02
and not numbers so this group will try
1:08:02
and not numbers so this group will try to trigger for that so if I do something
1:08:04
to trigger for that so if I do something
1:08:04
to trigger for that so if I do something like this then these parts here are not
1:08:08
like this then these parts here are not
1:08:08
like this then these parts here are not letters or numbers but they will
1:08:09
letters or numbers but they will
1:08:09
letters or numbers but they will actually they are uh they will actually
1:08:12
actually they are uh they will actually
1:08:12
actually they are uh they will actually get caught here and so they become its
1:08:14
get caught here and so they become its
1:08:14
get caught here and so they become its own group so we've separated out the
1:08:17
own group so we've separated out the
1:08:17
own group so we've separated out the punctuation and finally this um this is
1:08:20
punctuation and finally this um this is
1:08:20
punctuation and finally this um this is also a little bit confusing so this is
1:08:22
also a little bit confusing so this is
1:08:22
also a little bit confusing so this is matching white space but this is using a
1:08:25
matching white space but this is using a
1:08:25
matching white space but this is using a negative look ahead assertion in regex
1:08:29
negative look ahead assertion in regex
1:08:29
negative look ahead assertion in regex so what this is doing is it's matching
1:08:30
so what this is doing is it's matching
1:08:30
so what this is doing is it's matching wh space up to but not including the
1:08:33
wh space up to but not including the
1:08:33
wh space up to but not including the last Whit space
1:08:34
last Whit space
1:08:35
last Whit space character why is this important um this
1:08:37
character why is this important um this
1:08:37
character why is this important um this is pretty subtle I think so you see how
1:08:40
is pretty subtle I think so you see how
1:08:40
is pretty subtle I think so you see how the white space is always included at
1:08:41
the white space is always included at
1:08:41
the white space is always included at the beginning of the word so um space r
1:08:45
the beginning of the word so um space r
1:08:45
the beginning of the word so um space r space u Etc suppose we have a lot of
1:08:48
space u Etc suppose we have a lot of
1:08:48
space u Etc suppose we have a lot of spaces
1:08:49
spaces
1:08:49
spaces here what's going to happen here is that
1:08:52
here what's going to happen here is that
1:08:52
here what's going to happen here is that these spaces up to not including the
1:08:54
these spaces up to not including the
1:08:54
these spaces up to not including the last character will get caught by this
1:08:57
last character will get caught by this
1:08:57
last character will get caught by this and what that will do is it will
1:08:59
and what that will do is it will
1:08:59
and what that will do is it will separate out the spaces up to but not
1:09:01
separate out the spaces up to but not
1:09:01
separate out the spaces up to but not including the last character so that the
1:09:03
including the last character so that the
1:09:03
including the last character so that the last character can come here and join
1:09:05
last character can come here and join
1:09:05
last character can come here and join with the um space you and the reason
1:09:09
with the um space you and the reason
1:09:09
with the um space you and the reason that's nice is because space you is the
1:09:11
that's nice is because space you is the
1:09:11
that's nice is because space you is the common token so if I didn't have these
1:09:13
common token so if I didn't have these
1:09:13
common token so if I didn't have these Extra Spaces here you would just have
1:09:15
Extra Spaces here you would just have
1:09:15
Extra Spaces here you would just have space you and if I add tokens if I add
1:09:18
space you and if I add tokens if I add
1:09:18
space you and if I add tokens if I add spaces we still have a space view but
1:09:20
spaces we still have a space view but
1:09:20
spaces we still have a space view but now we have all this extra white space
1:09:22
now we have all this extra white space
1:09:22
now we have all this extra white space so basically the GB to tokenizer really
1:09:24
so basically the GB to tokenizer really
1:09:24
so basically the GB to tokenizer really likes to have a space letters or numbers
1:09:27
likes to have a space letters or numbers
1:09:27
likes to have a space letters or numbers um and it it preens these spaces and
1:09:30
um and it it preens these spaces and
1:09:30
um and it it preens these spaces and this is just something that it is
1:09:31
this is just something that it is
1:09:31
this is just something that it is consistent about so that's what that is
1:09:33
consistent about so that's what that is
1:09:33
consistent about so that's what that is for and then finally we have all the the
1:09:36
for and then finally we have all the the
1:09:36
for and then finally we have all the the last fallback is um whites space
1:09:38
last fallback is um whites space
1:09:38
last fallback is um whites space characters uh so um that would be
1:09:42
characters uh so um that would be
1:09:42
characters uh so um that would be just um if that doesn't get caught then
1:09:46
just um if that doesn't get caught then
1:09:46
just um if that doesn't get caught then this thing will catch any trailing
1:09:48
this thing will catch any trailing
1:09:48
this thing will catch any trailing spaces and so on I wanted to show one
1:09:50
spaces and so on I wanted to show one
1:09:50
spaces and so on I wanted to show one more real world example here so if we
1:09:53
more real world example here so if we
1:09:53
more real world example here so if we have this string which is a piece of
1:09:54
have this string which is a piece of
1:09:54
have this string which is a piece of python code and then we try to split it
1:09:56
python code and then we try to split it
1:09:56
python code and then we try to split it up then this is the kind of output we
1:09:58
up then this is the kind of output we
1:09:58
up then this is the kind of output we get so you'll notice that the list has
1:10:00
get so you'll notice that the list has
1:10:00
get so you'll notice that the list has many elements here and that's because we
1:10:02
many elements here and that's because we
1:10:02
many elements here and that's because we are splitting up fairly often uh every
1:10:05
are splitting up fairly often uh every
1:10:05
are splitting up fairly often uh every time sort of a category
1:10:07
time sort of a category
1:10:07
time sort of a category changes um so there will never be any
1:10:09
changes um so there will never be any
1:10:09
changes um so there will never be any merges Within These
1:10:10
merges Within These
1:10:10
merges Within These elements and um that's what you are
1:10:13
elements and um that's what you are
1:10:13
elements and um that's what you are seeing here now you might think that in
1:10:16
seeing here now you might think that in
1:10:16
seeing here now you might think that in order to train the
1:10:17
order to train the
1:10:17
order to train the tokenizer uh open AI has used this to
1:10:21
tokenizer uh open AI has used this to
1:10:21
tokenizer uh open AI has used this to split up text into chunks and then run
1:10:23
split up text into chunks and then run
1:10:23
split up text into chunks and then run just a BP algorithm within all the
1:10:25
just a BP algorithm within all the
1:10:25
just a BP algorithm within all the chunks but that is not exactly what
1:10:27
chunks but that is not exactly what
1:10:27
chunks but that is not exactly what happened and the reason is the following
1:10:30
happened and the reason is the following
1:10:30
happened and the reason is the following notice that we have the spaces here uh
1:10:33
notice that we have the spaces here uh
1:10:33
notice that we have the spaces here uh those Spaces end up being entire
1:10:35
those Spaces end up being entire
1:10:35
those Spaces end up being entire elements but these spaces never actually
1:10:38
elements but these spaces never actually
1:10:38
elements but these spaces never actually end up being merged by by open Ai and
1:10:40
end up being merged by by open Ai and
1:10:40
end up being merged by by open Ai and the way you can tell is that if you copy
1:10:42
the way you can tell is that if you copy
1:10:42
the way you can tell is that if you copy paste the exact same chunk here into Tik
1:10:44
paste the exact same chunk here into Tik
1:10:44
paste the exact same chunk here into Tik token U Tik tokenizer you see that all
1:10:47
token U Tik tokenizer you see that all
1:10:47
token U Tik tokenizer you see that all the spaces are kept independent and
1:10:49
the spaces are kept independent and
1:10:49
the spaces are kept independent and they're all token
1:10:50
they're all token
1:10:51
they're all token 220 so I think opena at some point Point
1:10:53
220 so I think opena at some point Point
1:10:53
220 so I think opena at some point Point en Force some rule that these spaces
1:10:56
en Force some rule that these spaces
1:10:56
en Force some rule that these spaces would never be merged and so um there's
1:10:59
would never be merged and so um there's
1:10:59
would never be merged and so um there's some additional rules on top of just
1:11:01
some additional rules on top of just
1:11:01
some additional rules on top of just chunking and bpe that open ey is not uh
1:11:04
chunking and bpe that open ey is not uh
1:11:04
chunking and bpe that open ey is not uh clear about now the training code for
1:11:06
clear about now the training code for
1:11:06
clear about now the training code for the gpt2 tokenizer was never released so
1:11:08
the gpt2 tokenizer was never released so
1:11:08
the gpt2 tokenizer was never released so all we have is uh the code that I've
1:11:10
all we have is uh the code that I've
1:11:10
all we have is uh the code that I've already shown you but this code here
1:11:13
already shown you but this code here
1:11:13
already shown you but this code here that they've released is only the
1:11:14
that they've released is only the
1:11:14
that they've released is only the inference code for the tokens so this is
1:11:17
inference code for the tokens so this is
1:11:17
inference code for the tokens so this is not the training code you can't give it
1:11:19
not the training code you can't give it
1:11:19
not the training code you can't give it a piece of text and training tokenizer
1:11:21
a piece of text and training tokenizer
1:11:21
a piece of text and training tokenizer this is just the inference code which
1:11:23
this is just the inference code which
1:11:23
this is just the inference code which Tak takes the merges that we have up
1:11:25
Tak takes the merges that we have up
1:11:25
Tak takes the merges that we have up above and applies them to a new piece of
1:11:28
above and applies them to a new piece of
1:11:28
above and applies them to a new piece of text and so we don't know exactly how
1:11:30
text and so we don't know exactly how
1:11:30
text and so we don't know exactly how opening ey trained um train the
1:11:32
opening ey trained um train the
1:11:32
opening ey trained um train the tokenizer but it wasn't as simple as
1:11:34
tokenizer but it wasn't as simple as
1:11:34
tokenizer but it wasn't as simple as chunk it up and BP it uh whatever it was
1:11:38
chunk it up and BP it uh whatever it was
1:11:38
chunk it up and BP it uh whatever it was next I wanted to introduce you to the
1:11:40
next I wanted to introduce you to the
1:11:40
next I wanted to introduce you to the Tik token library from openai which is
1:11:42
Tik token library from openai which is
1:11:42
Tik token library from openai which is the official library for tokenization
1:11:44
the official library for tokenization
1:11:44
the official library for tokenization from openai so this is Tik token bip
1:11:48
from openai so this is Tik token bip
1:11:48
from openai so this is Tik token bip install P to Tik token and then um you
1:11:51
install P to Tik token and then um you
1:11:51
install P to Tik token and then um you can do the tokenization in inference
1:11:54
can do the tokenization in inference
1:11:54
can do the tokenization in inference this is again not training code this is
1:11:55
this is again not training code this is
1:11:55
this is again not training code this is only inference code for
1:11:57
only inference code for
1:11:57
only inference code for tokenization um I wanted to show you how
1:12:00
tokenization um I wanted to show you how
1:12:00
tokenization um I wanted to show you how you would use it quite simple and
1:12:02
you would use it quite simple and
1:12:02
you would use it quite simple and running this just gives us the gpt2
1:12:04
running this just gives us the gpt2
1:12:04
running this just gives us the gpt2 tokens or the GPT 4 tokens so this is
1:12:06
tokens or the GPT 4 tokens so this is
1:12:06
tokens or the GPT 4 tokens so this is the tokenizer use for GPT 4 and so in
1:12:09
the tokenizer use for GPT 4 and so in
1:12:09
the tokenizer use for GPT 4 and so in particular we see that the Whit space in
1:12:11
particular we see that the Whit space in
1:12:11
particular we see that the Whit space in gpt2 remains unmerged but in GPT 4 uh
1:12:14
gpt2 remains unmerged but in GPT 4 uh
1:12:14
gpt2 remains unmerged but in GPT 4 uh these Whit spaces merge as we also saw
1:12:17
these Whit spaces merge as we also saw
1:12:17
these Whit spaces merge as we also saw in this one where here they're all
1:12:19
in this one where here they're all
1:12:19
in this one where here they're all unmerged but if we go down to GPT 4 uh
1:12:22
unmerged but if we go down to GPT 4 uh
1:12:22
unmerged but if we go down to GPT 4 uh they become merged
1:12:25
they become merged
1:12:25
they become merged um now in the
1:12:27
um now in the
1:12:27
um now in the gp4 uh tokenizer they changed the
1:12:31
gp4 uh tokenizer they changed the
1:12:31
gp4 uh tokenizer they changed the regular expression that they use to
1:12:33
regular expression that they use to
1:12:33
regular expression that they use to Chunk Up text so the way to see this is
1:12:35
Chunk Up text so the way to see this is
1:12:35
Chunk Up text so the way to see this is that if you come to your the Tik token
1:12:37
that if you come to your the Tik token
1:12:38
that if you come to your the Tik token uh library and then you go to this file
1:12:41
uh library and then you go to this file
1:12:41
uh library and then you go to this file Tik token X openi public this is where
1:12:44
Tik token X openi public this is where
1:12:44
Tik token X openi public this is where sort of like the definition of all these
1:12:45
sort of like the definition of all these
1:12:45
sort of like the definition of all these different tokenizers that openi
1:12:46
different tokenizers that openi
1:12:46
different tokenizers that openi maintains is and so uh necessarily to do
1:12:50
maintains is and so uh necessarily to do
1:12:50
maintains is and so uh necessarily to do the inference they had to publish some
1:12:51
the inference they had to publish some
1:12:51
the inference they had to publish some of the details about the strings
1:12:53
of the details about the strings
1:12:53
of the details about the strings so this is the string that we already
1:12:55
so this is the string that we already
1:12:55
so this is the string that we already saw for gpt2 it is slightly different
1:12:58
saw for gpt2 it is slightly different
1:12:58
saw for gpt2 it is slightly different but it is actually equivalent uh to what
1:13:00
but it is actually equivalent uh to what
1:13:00
but it is actually equivalent uh to what we discussed here so this pattern that
1:13:02
we discussed here so this pattern that
1:13:02
we discussed here so this pattern that we discussed is equivalent to this
1:13:04
we discussed is equivalent to this
1:13:04
we discussed is equivalent to this pattern this one just executes a little
1:13:06
pattern this one just executes a little
1:13:07
pattern this one just executes a little bit faster so here you see a little bit
1:13:09
bit faster so here you see a little bit
1:13:09
bit faster so here you see a little bit of a slightly different definition but
1:13:10
of a slightly different definition but
1:13:10
of a slightly different definition but otherwise it's the same we're going to
1:13:12
otherwise it's the same we're going to
1:13:12
otherwise it's the same we're going to go into special tokens in a bit and then
1:13:15
go into special tokens in a bit and then
1:13:15
go into special tokens in a bit and then if you scroll down to CL 100k this is
1:13:18
if you scroll down to CL 100k this is
1:13:18
if you scroll down to CL 100k this is the GPT 4 tokenizer you see that the
1:13:20
the GPT 4 tokenizer you see that the
1:13:20
the GPT 4 tokenizer you see that the pattern has changed um and this is kind
1:13:23
pattern has changed um and this is kind
1:13:23
pattern has changed um and this is kind of like the main the major change in
1:13:26
of like the main the major change in
1:13:26
of like the main the major change in addition to a bunch of other special
1:13:27
addition to a bunch of other special
1:13:27
addition to a bunch of other special tokens which I'll go into in a bit again
1:13:30
tokens which I'll go into in a bit again
1:13:30
tokens which I'll go into in a bit again now some I'm not going to actually go
1:13:31
now some I'm not going to actually go
1:13:31
now some I'm not going to actually go into the full detail of the pattern
1:13:33
into the full detail of the pattern
1:13:33
into the full detail of the pattern change because honestly this is my
1:13:35
change because honestly this is my
1:13:35
change because honestly this is my numbing uh I would just advise that you
1:13:37
numbing uh I would just advise that you
1:13:37
numbing uh I would just advise that you pull out chat GPT and the regex
1:13:39
pull out chat GPT and the regex
1:13:39
pull out chat GPT and the regex documentation and just step through it
1:13:42
documentation and just step through it
1:13:42
documentation and just step through it but really the major changes are number
1:13:44
but really the major changes are number
1:13:44
but really the major changes are number one you see this eye here that means
1:13:48
one you see this eye here that means
1:13:48
one you see this eye here that means that the um case sensitivity this is
1:13:51
that the um case sensitivity this is
1:13:51
that the um case sensitivity this is case insensitive match and so the
1:13:53
case insensitive match and so the
1:13:53
case insensitive match and so the comment that we saw earlier on oh we
1:13:56
comment that we saw earlier on oh we
1:13:56
comment that we saw earlier on oh we should have used re. uppercase uh
1:13:58
should have used re. uppercase uh
1:13:58
should have used re. uppercase uh basically we're now going to be matching
1:14:01
basically we're now going to be matching
1:14:01
basically we're now going to be matching these apostrophe s apostrophe D
1:14:04
these apostrophe s apostrophe D
1:14:04
these apostrophe s apostrophe D apostrophe M Etc uh we're going to be
1:14:06
apostrophe M Etc uh we're going to be
1:14:06
apostrophe M Etc uh we're going to be matching them both in lowercase and in
1:14:08
matching them both in lowercase and in
1:14:08
matching them both in lowercase and in uppercase so that's fixed there's a
1:14:11
uppercase so that's fixed there's a
1:14:11
uppercase so that's fixed there's a bunch of different like handling of the
1:14:12
bunch of different like handling of the
1:14:12
bunch of different like handling of the whites space that I'm not going to go
1:14:14
whites space that I'm not going to go
1:14:14
whites space that I'm not going to go into the full details of and then one
1:14:16
into the full details of and then one
1:14:16
into the full details of and then one more thing here is you will notice that
1:14:18
more thing here is you will notice that
1:14:18
more thing here is you will notice that when they match the numbers they only
1:14:20
when they match the numbers they only
1:14:20
when they match the numbers they only match one to three numbers so so they
1:14:23
match one to three numbers so so they
1:14:23
match one to three numbers so so they will never merge
1:14:26
will never merge
1:14:26
will never merge numbers that are in low in more than
1:14:28
numbers that are in low in more than
1:14:28
numbers that are in low in more than three digits only up to three digits of
1:14:31
three digits only up to three digits of
1:14:31
three digits only up to three digits of numbers will ever be merged and uh
1:14:34
numbers will ever be merged and uh
1:14:34
numbers will ever be merged and uh that's one change that they made as well
1:14:36
that's one change that they made as well
1:14:36
that's one change that they made as well to prevent uh tokens that are very very
1:14:38
to prevent uh tokens that are very very
1:14:38
to prevent uh tokens that are very very long number
1:14:39
long number
1:14:40
long number sequences uh but again we don't really
1:14:42
sequences uh but again we don't really
1:14:42
sequences uh but again we don't really know why they do any of this stuff uh
1:14:44
know why they do any of this stuff uh
1:14:44
know why they do any of this stuff uh because none of this is documented and
1:14:46
because none of this is documented and
1:14:46
because none of this is documented and uh it's just we just get the pattern so
1:14:49
uh it's just we just get the pattern so
1:14:49
uh it's just we just get the pattern so um yeah it is what it is but those are
1:14:51
um yeah it is what it is but those are
1:14:51
um yeah it is what it is but those are some of the changes that gp4 has made
1:14:54
some of the changes that gp4 has made
1:14:54
some of the changes that gp4 has made and of course the vocabulary size went
1:14:56
and of course the vocabulary size went
1:14:56
and of course the vocabulary size went from roughly 50k to roughly
1:14:58
from roughly 50k to roughly
1:14:58
from roughly 50k to roughly 100K the next thing I would like to do
1:15:00
100K the next thing I would like to do
1:15:00
100K the next thing I would like to do very briefly is to take you through the
1:15:02
very briefly is to take you through the
1:15:02
very briefly is to take you through the gpt2 encoder dopy that openi has
1:15:05
gpt2 encoder dopy that openi has
1:15:05
gpt2 encoder dopy that openi has released uh this is the file that I
1:15:07
released uh this is the file that I
1:15:07
released uh this is the file that I already mentioned to you briefly now
1:15:09
already mentioned to you briefly now
1:15:09
already mentioned to you briefly now this file is uh fairly short and should
1:15:12
this file is uh fairly short and should
1:15:12
this file is uh fairly short and should be relatively understandable to you at
1:15:14
be relatively understandable to you at
1:15:14
be relatively understandable to you at this point um starting at the bottom
1:15:17
this point um starting at the bottom
1:15:17
this point um starting at the bottom here they are loading two files encoder
1:15:21
here they are loading two files encoder
1:15:21
here they are loading two files encoder Json and vocab bpe and they do some
1:15:24
Json and vocab bpe and they do some
1:15:24
Json and vocab bpe and they do some light processing on it and then they
1:15:25
light processing on it and then they
1:15:25
light processing on it and then they call this encoder object which is the
1:15:27
call this encoder object which is the
1:15:27
call this encoder object which is the tokenizer now if you'd like to inspect
1:15:30
tokenizer now if you'd like to inspect
1:15:30
tokenizer now if you'd like to inspect these two files which together
1:15:31
these two files which together
1:15:31
these two files which together constitute their saved tokenizer then
1:15:34
constitute their saved tokenizer then
1:15:34
constitute their saved tokenizer then you can do that with a piece of code
1:15:36
you can do that with a piece of code
1:15:36
you can do that with a piece of code like
1:15:36
like
1:15:36
like this um this is where you can download
1:15:39
this um this is where you can download
1:15:39
this um this is where you can download these two files and you can inspect them
1:15:40
these two files and you can inspect them
1:15:40
these two files and you can inspect them if you'd like and what you will find is
1:15:42
if you'd like and what you will find is
1:15:42
if you'd like and what you will find is that this encoder as they call it in
1:15:45
that this encoder as they call it in
1:15:45
that this encoder as they call it in their code is exactly equivalent to our
1:15:47
their code is exactly equivalent to our
1:15:47
their code is exactly equivalent to our vocab so remember here where we have
1:15:51
vocab so remember here where we have
1:15:51
vocab so remember here where we have this vocab object which allowed us us to
1:15:53
this vocab object which allowed us us to
1:15:53
this vocab object which allowed us us to decode very efficiently and basically it
1:15:55
decode very efficiently and basically it
1:15:56
decode very efficiently and basically it took us from the integer to the byes uh
1:16:00
took us from the integer to the byes uh
1:16:00
took us from the integer to the byes uh for that integer so our vocab is exactly
1:16:03
for that integer so our vocab is exactly
1:16:03
for that integer so our vocab is exactly their encoder and then their vocab bpe
1:16:07
their encoder and then their vocab bpe
1:16:07
their encoder and then their vocab bpe confusingly is actually are merges so
1:16:11
confusingly is actually are merges so
1:16:11
confusingly is actually are merges so their BP merges which is based on the
1:16:13
their BP merges which is based on the
1:16:14
their BP merges which is based on the data inside vocab bpe ends up being
1:16:16
data inside vocab bpe ends up being
1:16:16
data inside vocab bpe ends up being equivalent to our merges so uh basically
1:16:20
equivalent to our merges so uh basically
1:16:20
equivalent to our merges so uh basically they are saving and loading the two uh
1:16:24
they are saving and loading the two uh
1:16:24
they are saving and loading the two uh variables that for us are also critical
1:16:26
variables that for us are also critical
1:16:26
variables that for us are also critical the merges variable and the vocab
1:16:28
the merges variable and the vocab
1:16:28
the merges variable and the vocab variable using just these two variables
1:16:31
variable using just these two variables
1:16:31
variable using just these two variables you can represent a tokenizer and you
1:16:32
you can represent a tokenizer and you
1:16:32
you can represent a tokenizer and you can both do encoding and decoding once
1:16:34
can both do encoding and decoding once
1:16:34
can both do encoding and decoding once you've trained this
1:16:35
you've trained this
1:16:36
you've trained this tokenizer now the only thing that um is
1:16:39
tokenizer now the only thing that um is
1:16:40
tokenizer now the only thing that um is actually slightly confusing inside what
1:16:42
actually slightly confusing inside what
1:16:42
actually slightly confusing inside what opening ey does here is that in addition
1:16:44
opening ey does here is that in addition
1:16:44
opening ey does here is that in addition to this encoder and a decoder they also
1:16:46
to this encoder and a decoder they also
1:16:46
to this encoder and a decoder they also have something called a bite encoder and
1:16:48
have something called a bite encoder and
1:16:48
have something called a bite encoder and a bite decoder and this is actually
1:16:51
a bite decoder and this is actually
1:16:51
a bite decoder and this is actually unfortunately just
1:16:53
unfortunately just
1:16:53
unfortunately just kind of a spirous implementation detail
1:16:55
kind of a spirous implementation detail
1:16:55
kind of a spirous implementation detail and isn't actually deep or interesting
1:16:57
and isn't actually deep or interesting
1:16:57
and isn't actually deep or interesting in any way so I'm going to skip the
1:16:59
in any way so I'm going to skip the
1:16:59
in any way so I'm going to skip the discussion of it but what opening ey
1:17:01
discussion of it but what opening ey
1:17:01
discussion of it but what opening ey does here for reasons that I don't fully
1:17:02
does here for reasons that I don't fully
1:17:02
does here for reasons that I don't fully understand is that not only have they
1:17:04
understand is that not only have they
1:17:05
understand is that not only have they this tokenizer which can encode and
1:17:06
this tokenizer which can encode and
1:17:06
this tokenizer which can encode and decode but they have a whole separate
1:17:08
decode but they have a whole separate
1:17:08
decode but they have a whole separate layer here in addition that is used
1:17:09
layer here in addition that is used
1:17:10
layer here in addition that is used serially with the tokenizer and so you
1:17:12
serially with the tokenizer and so you
1:17:12
serially with the tokenizer and so you first do um bite encode and then encode
1:17:16
first do um bite encode and then encode
1:17:16
first do um bite encode and then encode and then you do decode and then bite
1:17:17
and then you do decode and then bite
1:17:17
and then you do decode and then bite decode so that's the loop and they are
1:17:20
decode so that's the loop and they are
1:17:20
decode so that's the loop and they are just stacked serial on top of each other
1:17:22
just stacked serial on top of each other
1:17:22
just stacked serial on top of each other and and it's not that interesting so I
1:17:24
and and it's not that interesting so I
1:17:24
and and it's not that interesting so I won't cover it and you can step through
1:17:25
won't cover it and you can step through
1:17:25
won't cover it and you can step through it if you'd like otherwise this file if
1:17:28
it if you'd like otherwise this file if
1:17:28
it if you'd like otherwise this file if you ignore the bite encoder and the bite
1:17:30
you ignore the bite encoder and the bite
1:17:30
you ignore the bite encoder and the bite decoder will be algorithmically very
1:17:31
decoder will be algorithmically very
1:17:31
decoder will be algorithmically very familiar with you and the meat of it
1:17:33
familiar with you and the meat of it
1:17:33
familiar with you and the meat of it here is the what they call bpe function
1:17:37
here is the what they call bpe function
1:17:37
here is the what they call bpe function and you should recognize this Loop here
1:17:39
and you should recognize this Loop here
1:17:39
and you should recognize this Loop here which is very similar to our own y Loop
1:17:41
which is very similar to our own y Loop
1:17:41
which is very similar to our own y Loop where they're trying to identify the
1:17:43
where they're trying to identify the
1:17:43
where they're trying to identify the Byram uh a pair that they should be
1:17:46
Byram uh a pair that they should be
1:17:46
Byram uh a pair that they should be merging next and then here just like we
1:17:49
merging next and then here just like we
1:17:49
merging next and then here just like we had they have a for Loop trying to merge
1:17:50
had they have a for Loop trying to merge
1:17:50
had they have a for Loop trying to merge this pair uh so they will go over all of
1:17:53
this pair uh so they will go over all of
1:17:53
this pair uh so they will go over all of the sequence and they will merge the
1:17:55
the sequence and they will merge the
1:17:55
the sequence and they will merge the pair whenever they find it and they keep
1:17:57
pair whenever they find it and they keep
1:17:57
pair whenever they find it and they keep repeating that until they run out of
1:17:59
repeating that until they run out of
1:17:59
repeating that until they run out of possible merges in the in the text so
1:18:02
possible merges in the in the text so
1:18:02
possible merges in the in the text so that's the meat of this file and uh
1:18:04
that's the meat of this file and uh
1:18:04
that's the meat of this file and uh there's an encode and a decode function
1:18:06
there's an encode and a decode function
1:18:06
there's an encode and a decode function just like we have implemented it so long
1:18:08
just like we have implemented it so long
1:18:08
just like we have implemented it so long story short what I want you to take away
1:18:09
story short what I want you to take away
1:18:09
story short what I want you to take away at this point is that unfortunately it's
1:18:11
at this point is that unfortunately it's
1:18:11
at this point is that unfortunately it's a little bit of a messy code that they
1:18:12
a little bit of a messy code that they
1:18:13
a little bit of a messy code that they have but algorithmically it is identical
1:18:15
have but algorithmically it is identical
1:18:15
have but algorithmically it is identical to what we've built up above and what
1:18:17
to what we've built up above and what
1:18:17
to what we've built up above and what we've built up above if you understand
1:18:19
we've built up above if you understand
1:18:19
we've built up above if you understand it is algorithmically what is necessary
1:18:21
it is algorithmically what is necessary
1:18:21
it is algorithmically what is necessary to actually build a BP to organizer
1:18:23
to actually build a BP to organizer
1:18:23
to actually build a BP to organizer train it and then both encode and decode
1:18:26
train it and then both encode and decode
1:18:26
train it and then both encode and decode the next topic I would like to turn to
1:18:28
the next topic I would like to turn to
1:18:28
the next topic I would like to turn to is that of special tokens so in addition
1:18:30
is that of special tokens so in addition
1:18:30
is that of special tokens so in addition to tokens that are coming from you know
1:18:32
to tokens that are coming from you know
1:18:32
to tokens that are coming from you know raw bytes and the BP merges we can
1:18:35
raw bytes and the BP merges we can
1:18:35
raw bytes and the BP merges we can insert all kinds of tokens that we are
1:18:36
insert all kinds of tokens that we are
1:18:36
insert all kinds of tokens that we are going to use to delimit different parts
1:18:38
going to use to delimit different parts
1:18:38
going to use to delimit different parts of the data or introduced to create a
1:18:41
of the data or introduced to create a
1:18:41
of the data or introduced to create a special structure of the token streams
1:18:44
special structure of the token streams
1:18:44
special structure of the token streams so in uh if you look at this encoder
1:18:47
so in uh if you look at this encoder
1:18:47
so in uh if you look at this encoder object from open AIS gpd2 right here we
1:18:50
object from open AIS gpd2 right here we
1:18:50
object from open AIS gpd2 right here we mentioned this is very similar to our
1:18:52
mentioned this is very similar to our
1:18:52
mentioned this is very similar to our vocab you'll notice that the length of
1:18:54
vocab you'll notice that the length of
1:18:54
vocab you'll notice that the length of this is
1:18:58
50257 and as I mentioned it's mapping uh
1:19:01
50257 and as I mentioned it's mapping uh
1:19:01
50257 and as I mentioned it's mapping uh and it's inverted from the mapping of
1:19:03
and it's inverted from the mapping of
1:19:03
and it's inverted from the mapping of our vocab our vocab goes from integer to
1:19:06
our vocab our vocab goes from integer to
1:19:06
our vocab our vocab goes from integer to string and they go the other way around
1:19:08
string and they go the other way around
1:19:08
string and they go the other way around for no amazing reason um but the thing
1:19:11
for no amazing reason um but the thing
1:19:11
for no amazing reason um but the thing to note here is that this the mapping
1:19:13
to note here is that this the mapping
1:19:13
to note here is that this the mapping table here is
1:19:14
table here is
1:19:15
table here is 50257 where does that number come from
1:19:18
50257 where does that number come from
1:19:18
50257 where does that number come from where what are the tokens as I mentioned
1:19:20
where what are the tokens as I mentioned
1:19:20
where what are the tokens as I mentioned there are 256 raw bite token
1:19:24
there are 256 raw bite token
1:19:24
there are 256 raw bite token tokens and then opena actually did
1:19:27
tokens and then opena actually did
1:19:27
tokens and then opena actually did 50,000
1:19:28
50,000
1:19:28
50,000 merges so those become the other tokens
1:19:31
merges so those become the other tokens
1:19:32
merges so those become the other tokens but this would have been
1:19:34
but this would have been
1:19:34
but this would have been 50256 so what is the 57th token and
1:19:37
50256 so what is the 57th token and
1:19:37
50256 so what is the 57th token and there is basically one special
1:19:40
there is basically one special
1:19:40
there is basically one special token and that one special token you can
1:19:43
token and that one special token you can
1:19:43
token and that one special token you can see is called end of text so this is a
1:19:47
see is called end of text so this is a
1:19:47
see is called end of text so this is a special token and it's the very last
1:19:49
special token and it's the very last
1:19:49
special token and it's the very last token and this token is used to delimit
1:19:52
token and this token is used to delimit
1:19:52
token and this token is used to delimit documents ments in the training set so
1:19:55
documents ments in the training set so
1:19:55
documents ments in the training set so when we're creating the training data we
1:19:57
when we're creating the training data we
1:19:57
when we're creating the training data we have all these documents and we tokenize
1:19:59
have all these documents and we tokenize
1:19:59
have all these documents and we tokenize them and we get a stream of tokens those
1:20:01
them and we get a stream of tokens those
1:20:01
them and we get a stream of tokens those tokens only range from Z to
1:20:05
tokens only range from Z to
1:20:05
tokens only range from Z to 50256 and then in between those
1:20:07
50256 and then in between those
1:20:07
50256 and then in between those documents we put special end of text
1:20:10
documents we put special end of text
1:20:10
documents we put special end of text token and we insert that token in
1:20:12
token and we insert that token in
1:20:12
token and we insert that token in between documents and we are using this
1:20:15
between documents and we are using this
1:20:15
between documents and we are using this as a signal to the language model that
1:20:18
as a signal to the language model that
1:20:18
as a signal to the language model that the document has ended and what follows
1:20:20
the document has ended and what follows
1:20:20
the document has ended and what follows is going to be unrelated to the document
1:20:23
is going to be unrelated to the document
1:20:23
is going to be unrelated to the document previously that said the language model
1:20:25
previously that said the language model
1:20:25
previously that said the language model has to learn this from data it it needs
1:20:27
has to learn this from data it it needs
1:20:27
has to learn this from data it it needs to learn that this token usually means
1:20:29
to learn that this token usually means
1:20:29
to learn that this token usually means that it should wipe its sort of memory
1:20:31
that it should wipe its sort of memory
1:20:31
that it should wipe its sort of memory of what came before and what came before
1:20:34
of what came before and what came before
1:20:34
of what came before and what came before this token is not actually informative
1:20:35
this token is not actually informative
1:20:35
this token is not actually informative to what comes next but we are expecting
1:20:37
to what comes next but we are expecting
1:20:37
to what comes next but we are expecting the language model to just like learn
1:20:38
the language model to just like learn
1:20:39
the language model to just like learn this but we're giving it the Special
1:20:40
this but we're giving it the Special
1:20:40
this but we're giving it the Special sort of the limiter of these documents
1:20:44
sort of the limiter of these documents
1:20:44
sort of the limiter of these documents we can go here to Tech tokenizer and um
1:20:46
we can go here to Tech tokenizer and um
1:20:46
we can go here to Tech tokenizer and um this the gpt2 tokenizer uh our code that
1:20:49
this the gpt2 tokenizer uh our code that
1:20:49
this the gpt2 tokenizer uh our code that we've been playing with before so we can
1:20:51
we've been playing with before so we can
1:20:51
we've been playing with before so we can add here right hello world world how are
1:20:53
add here right hello world world how are
1:20:53
add here right hello world world how are you and we're getting different tokens
1:20:55
you and we're getting different tokens
1:20:55
you and we're getting different tokens but now you can see what if what happens
1:20:58
but now you can see what if what happens
1:20:58
but now you can see what if what happens if I put end of text you see how until I
1:21:02
if I put end of text you see how until I
1:21:02
if I put end of text you see how until I finished it these are all different
1:21:03
finished it these are all different
1:21:03
finished it these are all different tokens end of
1:21:06
tokens end of
1:21:06
tokens end of text still set different tokens and now
1:21:08
text still set different tokens and now
1:21:08
text still set different tokens and now when I finish it suddenly we get token
1:21:13
when I finish it suddenly we get token
1:21:13
when I finish it suddenly we get token 50256 and the reason this works is
1:21:15
50256 and the reason this works is
1:21:15
50256 and the reason this works is because this didn't actually go through
1:21:18
because this didn't actually go through
1:21:18
because this didn't actually go through the bpe merges instead the code that
1:21:21
the bpe merges instead the code that
1:21:21
the bpe merges instead the code that actually outposted tokens has special
1:21:24
actually outposted tokens has special
1:21:25
actually outposted tokens has special case instructions for handling special
1:21:28
case instructions for handling special
1:21:28
case instructions for handling special tokens um we did not see these special
1:21:30
tokens um we did not see these special
1:21:30
tokens um we did not see these special instructions for handling special tokens
1:21:32
instructions for handling special tokens
1:21:32
instructions for handling special tokens in the encoder dopy it's absent there
1:21:36
in the encoder dopy it's absent there
1:21:36
in the encoder dopy it's absent there but if you go to Tech token Library
1:21:37
but if you go to Tech token Library
1:21:38
but if you go to Tech token Library which is uh implemented in Rust you will
1:21:40
which is uh implemented in Rust you will
1:21:40
which is uh implemented in Rust you will find all kinds of special case handling
1:21:42
find all kinds of special case handling
1:21:42
find all kinds of special case handling for these special tokens that you can
1:21:44
for these special tokens that you can
1:21:44
for these special tokens that you can register uh create adds to the
1:21:47
register uh create adds to the
1:21:47
register uh create adds to the vocabulary and then it looks for them
1:21:48
vocabulary and then it looks for them
1:21:49
vocabulary and then it looks for them and it uh whenever it sees these special
1:21:50
and it uh whenever it sees these special
1:21:50
and it uh whenever it sees these special tokens like this it will actually come
1:21:53
tokens like this it will actually come
1:21:53
tokens like this it will actually come in and swap in that special token so
1:21:56
in and swap in that special token so
1:21:56
in and swap in that special token so these things are outside of the typical
1:21:58
these things are outside of the typical
1:21:58
these things are outside of the typical algorithm of uh B PA en
1:22:00
algorithm of uh B PA en
1:22:00
algorithm of uh B PA en coding so these special tokens are used
1:22:02
coding so these special tokens are used
1:22:02
coding so these special tokens are used pervasively uh not just in uh basically
1:22:05
pervasively uh not just in uh basically
1:22:05
pervasively uh not just in uh basically base language modeling of predicting the
1:22:07
base language modeling of predicting the
1:22:07
base language modeling of predicting the next token in the sequence but
1:22:09
next token in the sequence but
1:22:09
next token in the sequence but especially when it gets to later to the
1:22:10
especially when it gets to later to the
1:22:10
especially when it gets to later to the fine tuning stage and all of the chat uh
1:22:13
fine tuning stage and all of the chat uh
1:22:13
fine tuning stage and all of the chat uh gbt sort of aspects of it uh because we
1:22:15
gbt sort of aspects of it uh because we
1:22:15
gbt sort of aspects of it uh because we don't just want to Del limit documents
1:22:16
don't just want to Del limit documents
1:22:16
don't just want to Del limit documents we want to delimit entire conversations
1:22:18
we want to delimit entire conversations
1:22:18
we want to delimit entire conversations between an assistant and a user so if I
1:22:21
between an assistant and a user so if I
1:22:21
between an assistant and a user so if I refresh this sck tokenizer page the
1:22:24
refresh this sck tokenizer page the
1:22:24
refresh this sck tokenizer page the default example that they have here is
1:22:26
default example that they have here is
1:22:26
default example that they have here is using not sort of base model encoders
1:22:30
using not sort of base model encoders
1:22:30
using not sort of base model encoders but ftuned model uh sort of tokenizers
1:22:33
but ftuned model uh sort of tokenizers
1:22:33
but ftuned model uh sort of tokenizers um so for example using the GPT 3.5
1:22:35
um so for example using the GPT 3.5
1:22:35
um so for example using the GPT 3.5 turbo scheme these here are all special
1:22:38
turbo scheme these here are all special
1:22:38
turbo scheme these here are all special tokens I am start I end Etc uh this is
1:22:43
tokens I am start I end Etc uh this is
1:22:43
tokens I am start I end Etc uh this is short for Imaginary mcore start by the
1:22:46
short for Imaginary mcore start by the
1:22:46
short for Imaginary mcore start by the way but you can see here that there's a
1:22:49
way but you can see here that there's a
1:22:49
way but you can see here that there's a sort of start and end of every single
1:22:51
sort of start and end of every single
1:22:51
sort of start and end of every single message and there can be many other
1:22:52
message and there can be many other
1:22:52
message and there can be many other other tokens lots of tokens um in use to
1:22:56
other tokens lots of tokens um in use to
1:22:56
other tokens lots of tokens um in use to delimit these conversations and kind of
1:22:58
delimit these conversations and kind of
1:22:58
delimit these conversations and kind of keep track of the flow of the messages
1:23:00
keep track of the flow of the messages
1:23:00
keep track of the flow of the messages here now we can go back to the Tik token
1:23:03
here now we can go back to the Tik token
1:23:03
here now we can go back to the Tik token library and here when you scroll to the
1:23:06
library and here when you scroll to the
1:23:06
library and here when you scroll to the bottom they talk about how you can
1:23:08
bottom they talk about how you can
1:23:08
bottom they talk about how you can extend tick token and I can you can
1:23:10
extend tick token and I can you can
1:23:10
extend tick token and I can you can create basically you can Fork uh the um
1:23:13
create basically you can Fork uh the um
1:23:13
create basically you can Fork uh the um CL 100K base tokenizers in gp4 and for
1:23:17
CL 100K base tokenizers in gp4 and for
1:23:17
CL 100K base tokenizers in gp4 and for example you can extend it by adding more
1:23:18
example you can extend it by adding more
1:23:18
example you can extend it by adding more special tokens and these are totally up
1:23:20
special tokens and these are totally up
1:23:20
special tokens and these are totally up to you you can come up with any
1:23:21
to you you can come up with any
1:23:21
to you you can come up with any arbitrary tokens and add them with the
1:23:23
arbitrary tokens and add them with the
1:23:23
arbitrary tokens and add them with the new ID afterwards and the tikken library
1:23:26
new ID afterwards and the tikken library
1:23:26
new ID afterwards and the tikken library will uh correctly swap them out uh when
1:23:29
will uh correctly swap them out uh when
1:23:29
will uh correctly swap them out uh when it sees this in the
1:23:31
it sees this in the
1:23:31
it sees this in the strings now we can also go back to this
1:23:34
strings now we can also go back to this
1:23:34
strings now we can also go back to this file which we've looked at previously
1:23:37
file which we've looked at previously
1:23:37
file which we've looked at previously and I mentioned that the gpt2 in Tik
1:23:39
and I mentioned that the gpt2 in Tik
1:23:39
and I mentioned that the gpt2 in Tik toen open
1:23:41
toen open
1:23:41
toen open I.P we have the vocabulary we have the
1:23:43
I.P we have the vocabulary we have the
1:23:44
I.P we have the vocabulary we have the pattern for splitting and then here we
1:23:46
pattern for splitting and then here we
1:23:46
pattern for splitting and then here we are registering the single special token
1:23:48
are registering the single special token
1:23:48
are registering the single special token in gpd2 which was the end of text token
1:23:50
in gpd2 which was the end of text token
1:23:50
in gpd2 which was the end of text token and we saw that it has this ID
1:23:52
and we saw that it has this ID
1:23:53
and we saw that it has this ID in GPT 4 when they defy this here you
1:23:56
in GPT 4 when they defy this here you
1:23:56
in GPT 4 when they defy this here you see that the pattern has changed as
1:23:57
see that the pattern has changed as
1:23:57
see that the pattern has changed as we've discussed but also the special
1:23:59
we've discussed but also the special
1:23:59
we've discussed but also the special tokens have changed in this tokenizer so
1:24:01
tokens have changed in this tokenizer so
1:24:01
tokens have changed in this tokenizer so we of course have the end of text just
1:24:03
we of course have the end of text just
1:24:03
we of course have the end of text just like in gpd2 but we also see three sorry
1:24:06
like in gpd2 but we also see three sorry
1:24:06
like in gpd2 but we also see three sorry four additional tokens here Thim prefix
1:24:09
four additional tokens here Thim prefix
1:24:09
four additional tokens here Thim prefix middle and suffix what is fim fim is
1:24:12
middle and suffix what is fim fim is
1:24:12
middle and suffix what is fim fim is short for fill in the middle and if
1:24:14
short for fill in the middle and if
1:24:14
short for fill in the middle and if you'd like to learn more about this idea
1:24:16
you'd like to learn more about this idea
1:24:17
you'd like to learn more about this idea it comes from this paper um and I'm not
1:24:19
it comes from this paper um and I'm not
1:24:20
it comes from this paper um and I'm not going to go into detail in this video
1:24:21
going to go into detail in this video
1:24:21
going to go into detail in this video it's beyond this video and then there's
1:24:23
it's beyond this video and then there's
1:24:23
it's beyond this video and then there's one additional uh serve token here so
1:24:27
one additional uh serve token here so
1:24:27
one additional uh serve token here so that's that encoding as well so it's
1:24:29
that's that encoding as well so it's
1:24:29
that's that encoding as well so it's very common basically to train a
1:24:31
very common basically to train a
1:24:31
very common basically to train a language model and then if you'd like uh
1:24:34
language model and then if you'd like uh
1:24:34
language model and then if you'd like uh you can add special tokens now when you
1:24:37
you can add special tokens now when you
1:24:37
you can add special tokens now when you add special tokens you of course have to
1:24:39
add special tokens you of course have to
1:24:39
add special tokens you of course have to um do some model surgery to the
1:24:41
um do some model surgery to the
1:24:41
um do some model surgery to the Transformer and all the parameters
1:24:43
Transformer and all the parameters
1:24:43
Transformer and all the parameters involved in that Transformer because you
1:24:45
involved in that Transformer because you
1:24:45
involved in that Transformer because you are basically adding an integer and you
1:24:47
are basically adding an integer and you
1:24:47
are basically adding an integer and you want to make sure that for example your
1:24:48
want to make sure that for example your
1:24:48
want to make sure that for example your embedding Matrix for the vocabulary
1:24:50
embedding Matrix for the vocabulary
1:24:50
embedding Matrix for the vocabulary tokens has to be extended by adding a
1:24:53
tokens has to be extended by adding a
1:24:53
tokens has to be extended by adding a row and typically this row would be
1:24:54
row and typically this row would be
1:24:54
row and typically this row would be initialized uh with small random numbers
1:24:56
initialized uh with small random numbers
1:24:56
initialized uh with small random numbers or something like that because we need
1:24:58
or something like that because we need
1:24:58
or something like that because we need to have a vector that now stands for
1:25:01
to have a vector that now stands for
1:25:01
to have a vector that now stands for that token in addition to that you have
1:25:03
that token in addition to that you have
1:25:03
that token in addition to that you have to go to the final layer of the
1:25:04
to go to the final layer of the
1:25:04
to go to the final layer of the Transformer and you have to make sure
1:25:05
Transformer and you have to make sure
1:25:05
Transformer and you have to make sure that that projection at the very end
1:25:07
that that projection at the very end
1:25:07
that that projection at the very end into the classifier uh is extended by
1:25:09
into the classifier uh is extended by
1:25:09
into the classifier uh is extended by one as well so basically there's some
1:25:11
one as well so basically there's some
1:25:11
one as well so basically there's some model surgery involved that you have to
1:25:13
model surgery involved that you have to
1:25:13
model surgery involved that you have to couple with the tokenization changes if
1:25:16
couple with the tokenization changes if
1:25:16
couple with the tokenization changes if you are going to add special tokens but
1:25:18
you are going to add special tokens but
1:25:18
you are going to add special tokens but this is a very common operation that
1:25:20
this is a very common operation that
1:25:20
this is a very common operation that people do especially if they'd like to
1:25:21
people do especially if they'd like to
1:25:21
people do especially if they'd like to fine tune the model for example taking
1:25:23
fine tune the model for example taking
1:25:23
fine tune the model for example taking it from a base model to a chat model
1:25:26
it from a base model to a chat model
1:25:26
it from a base model to a chat model like chat
1:25:27
like chat
1:25:27
like chat GPT okay so at this point you should
1:25:29
GPT okay so at this point you should
1:25:29
GPT okay so at this point you should have everything you need in order to
1:25:31
have everything you need in order to
1:25:31
have everything you need in order to build your own gp4 tokenizer now in the
1:25:33
build your own gp4 tokenizer now in the
1:25:33
build your own gp4 tokenizer now in the process of developing this lecture I've
1:25:35
process of developing this lecture I've
1:25:35
process of developing this lecture I've done that and I published the code under
1:25:37
done that and I published the code under
1:25:37
done that and I published the code under this repository
1:25:38
this repository
1:25:38
this repository MBP so MBP looks like this right now as
1:25:42
MBP so MBP looks like this right now as
1:25:42
MBP so MBP looks like this right now as I'm recording but uh the MBP repository
1:25:45
I'm recording but uh the MBP repository
1:25:45
I'm recording but uh the MBP repository will probably change quite a bit because
1:25:46
will probably change quite a bit because
1:25:46
will probably change quite a bit because I intend to continue working on it um in
1:25:49
I intend to continue working on it um in
1:25:49
I intend to continue working on it um in addition to the MBP repository I've
1:25:51
addition to the MBP repository I've
1:25:51
addition to the MBP repository I've published the this uh exercise
1:25:53
published the this uh exercise
1:25:53
published the this uh exercise progression that you can follow so if
1:25:55
progression that you can follow so if
1:25:55
progression that you can follow so if you go to exercise. MD here uh this is
1:25:58
you go to exercise. MD here uh this is
1:25:58
you go to exercise. MD here uh this is sort of me breaking up the task ahead of
1:26:01
sort of me breaking up the task ahead of
1:26:01
sort of me breaking up the task ahead of you into four steps that sort of uh
1:26:03
you into four steps that sort of uh
1:26:03
you into four steps that sort of uh build up to what can be a gp4 tokenizer
1:26:06
build up to what can be a gp4 tokenizer
1:26:06
build up to what can be a gp4 tokenizer and so feel free to follow these steps
1:26:08
and so feel free to follow these steps
1:26:08
and so feel free to follow these steps exactly and follow a little bit of the
1:26:10
exactly and follow a little bit of the
1:26:10
exactly and follow a little bit of the guidance that I've laid out here and
1:26:12
guidance that I've laid out here and
1:26:12
guidance that I've laid out here and anytime you feel stuck just reference
1:26:14
anytime you feel stuck just reference
1:26:14
anytime you feel stuck just reference the MBP repository here so either the
1:26:17
the MBP repository here so either the
1:26:17
the MBP repository here so either the tests could be useful or the MBP
1:26:20
tests could be useful or the MBP
1:26:20
tests could be useful or the MBP repository itself I try to keep the code
1:26:22
repository itself I try to keep the code
1:26:22
repository itself I try to keep the code fairly clean and understandable and so
1:26:26
fairly clean and understandable and so
1:26:26
fairly clean and understandable and so um feel free to reference it whenever um
1:26:28
um feel free to reference it whenever um
1:26:28
um feel free to reference it whenever um you get
1:26:30
you get
1:26:30
you get stuck uh in addition to that basically
1:26:32
stuck uh in addition to that basically
1:26:32
stuck uh in addition to that basically once you write it you should be able to
1:26:34
once you write it you should be able to
1:26:34
once you write it you should be able to reproduce this behavior from Tech token
1:26:36
reproduce this behavior from Tech token
1:26:36
reproduce this behavior from Tech token so getting the gb4 tokenizer you can
1:26:39
so getting the gb4 tokenizer you can
1:26:39
so getting the gb4 tokenizer you can take uh you can encode the string and
1:26:41
take uh you can encode the string and
1:26:41
take uh you can encode the string and you should get these tokens and then you
1:26:43
you should get these tokens and then you
1:26:43
you should get these tokens and then you can encode and decode the exact same
1:26:44
can encode and decode the exact same
1:26:44
can encode and decode the exact same string to recover it and in addition to
1:26:47
string to recover it and in addition to
1:26:47
string to recover it and in addition to all that you should be able to implement
1:26:48
all that you should be able to implement
1:26:48
all that you should be able to implement your own train function uh which Tik
1:26:50
your own train function uh which Tik
1:26:50
your own train function uh which Tik token Library does not provide it's it's
1:26:52
token Library does not provide it's it's
1:26:52
token Library does not provide it's it's again only inference code but you could
1:26:54
again only inference code but you could
1:26:54
again only inference code but you could write your own train MBP does it as well
1:26:57
write your own train MBP does it as well
1:26:57
write your own train MBP does it as well and that will allow you to train your
1:26:59
and that will allow you to train your
1:26:59
and that will allow you to train your own token
1:27:00
own token
1:27:00
own token vocabularies so here are some of the
1:27:02
vocabularies so here are some of the
1:27:02
vocabularies so here are some of the code inside M be mean bpe uh shows the
1:27:06
code inside M be mean bpe uh shows the
1:27:06
code inside M be mean bpe uh shows the token vocabularies that you might obtain
1:27:08
token vocabularies that you might obtain
1:27:08
token vocabularies that you might obtain so on the left uh here we have the GPT 4
1:27:12
so on the left uh here we have the GPT 4
1:27:12
so on the left uh here we have the GPT 4 merges uh so the first 256 are raw
1:27:15
merges uh so the first 256 are raw
1:27:15
merges uh so the first 256 are raw individual bytes and then here I am
1:27:17
individual bytes and then here I am
1:27:17
individual bytes and then here I am visualizing the merges that gp4
1:27:19
visualizing the merges that gp4
1:27:19
visualizing the merges that gp4 performed during its training so the
1:27:21
performed during its training so the
1:27:21
performed during its training so the very first merge that gp4 did was merge
1:27:24
very first merge that gp4 did was merge
1:27:24
very first merge that gp4 did was merge two spaces into a single token for you
1:27:27
two spaces into a single token for you
1:27:27
two spaces into a single token for you know two spaces and that is a token 256
1:27:30
know two spaces and that is a token 256
1:27:30
know two spaces and that is a token 256 and so this is the order in which things
1:27:32
and so this is the order in which things
1:27:32
and so this is the order in which things merged during gb4 training and this is
1:27:34
merged during gb4 training and this is
1:27:34
merged during gb4 training and this is the merge order that um we obtain in MBP
1:27:39
the merge order that um we obtain in MBP
1:27:39
the merge order that um we obtain in MBP by training a tokenizer and in this case
1:27:41
by training a tokenizer and in this case
1:27:41
by training a tokenizer and in this case I trained it on a Wikipedia page of
1:27:43
I trained it on a Wikipedia page of
1:27:43
I trained it on a Wikipedia page of Taylor Swift uh not because I'm a Swifty
1:27:45
Taylor Swift uh not because I'm a Swifty
1:27:45
Taylor Swift uh not because I'm a Swifty but because that is one of the longest
1:27:47
but because that is one of the longest
1:27:47
but because that is one of the longest um Wikipedia Pages apparently that's
1:27:49
um Wikipedia Pages apparently that's
1:27:49
um Wikipedia Pages apparently that's available but she is pretty cool and
1:27:54
available but she is pretty cool and
1:27:54
available but she is pretty cool and um what was I going to say yeah so you
1:27:56
um what was I going to say yeah so you
1:27:56
um what was I going to say yeah so you can compare these two uh vocabularies
1:27:59
can compare these two uh vocabularies
1:27:59
can compare these two uh vocabularies and so as an example um here GPT for
1:28:03
and so as an example um here GPT for
1:28:04
and so as an example um here GPT for merged I in to become in and we've done
1:28:06
merged I in to become in and we've done
1:28:06
merged I in to become in and we've done the exact same thing on this token 259
1:28:09
the exact same thing on this token 259
1:28:10
the exact same thing on this token 259 here space t becomes space t and that
1:28:13
here space t becomes space t and that
1:28:13
here space t becomes space t and that happened for us a little bit later as
1:28:14
happened for us a little bit later as
1:28:14
happened for us a little bit later as well so the difference here is again to
1:28:16
well so the difference here is again to
1:28:16
well so the difference here is again to my understanding only a difference of
1:28:18
my understanding only a difference of
1:28:18
my understanding only a difference of the training set so as an example
1:28:20
the training set so as an example
1:28:20
the training set so as an example because I see a lot of white space I
1:28:22
because I see a lot of white space I
1:28:22
because I see a lot of white space I supect that gp4 probably had a lot of
1:28:23
supect that gp4 probably had a lot of
1:28:23
supect that gp4 probably had a lot of python code in its training set I'm not
1:28:25
python code in its training set I'm not
1:28:25
python code in its training set I'm not sure uh for the
1:28:27
sure uh for the
1:28:27
sure uh for the tokenizer and uh here we see much less
1:28:30
tokenizer and uh here we see much less
1:28:30
tokenizer and uh here we see much less of that of course in the Wikipedia page
1:28:32
of that of course in the Wikipedia page
1:28:32
of that of course in the Wikipedia page so roughly speaking they look the same
1:28:34
so roughly speaking they look the same
1:28:34
so roughly speaking they look the same and they look the same because they're
1:28:35
and they look the same because they're
1:28:35
and they look the same because they're running the same algorithm and when you
1:28:38
running the same algorithm and when you
1:28:38
running the same algorithm and when you train your own you're probably going to
1:28:39
train your own you're probably going to
1:28:39
train your own you're probably going to get something similar depending on what
1:28:41
get something similar depending on what
1:28:41
get something similar depending on what you train it on okay so we are now going
1:28:43
you train it on okay so we are now going
1:28:43
you train it on okay so we are now going to move on from tick token and the way
1:28:45
to move on from tick token and the way
1:28:45
to move on from tick token and the way that open AI tokenizes its strings and
1:28:47
that open AI tokenizes its strings and
1:28:47
that open AI tokenizes its strings and we're going to discuss one more very
1:28:49
we're going to discuss one more very
1:28:49
we're going to discuss one more very commonly used library for working with
1:28:50
commonly used library for working with
1:28:51
commonly used library for working with tokenization inlm
1:28:52
tokenization inlm
1:28:52
tokenization inlm and that is sentence piece so sentence
1:28:55
and that is sentence piece so sentence
1:28:55
and that is sentence piece so sentence piece is very commonly used in language
1:28:58
piece is very commonly used in language
1:28:58
piece is very commonly used in language models because unlike Tik token it can
1:29:00
models because unlike Tik token it can
1:29:00
models because unlike Tik token it can do both training and inference and is
1:29:02
do both training and inference and is
1:29:02
do both training and inference and is quite efficient at both it supports a
1:29:04
quite efficient at both it supports a
1:29:04
quite efficient at both it supports a number of algorithms for training uh
1:29:06
number of algorithms for training uh
1:29:06
number of algorithms for training uh vocabularies but one of them is the B
1:29:09
vocabularies but one of them is the B
1:29:09
vocabularies but one of them is the B pair en coding algorithm that we've been
1:29:10
pair en coding algorithm that we've been
1:29:10
pair en coding algorithm that we've been looking at so it supports it now
1:29:13
looking at so it supports it now
1:29:13
looking at so it supports it now sentence piece is used both by llama and
1:29:15
sentence piece is used both by llama and
1:29:15
sentence piece is used both by llama and mistal series and many other models as
1:29:18
mistal series and many other models as
1:29:18
mistal series and many other models as well it is on GitHub under Google
1:29:20
well it is on GitHub under Google
1:29:20
well it is on GitHub under Google sentence piece
1:29:22
sentence piece
1:29:22
sentence piece and the big difference with sentence
1:29:24
and the big difference with sentence
1:29:24
and the big difference with sentence piece and we're going to look at example
1:29:26
piece and we're going to look at example
1:29:26
piece and we're going to look at example because this is kind of hard and subtle
1:29:27
because this is kind of hard and subtle
1:29:27
because this is kind of hard and subtle to explain is that they think different
1:29:31
to explain is that they think different
1:29:31
to explain is that they think different about the order of operations here so in
1:29:35
about the order of operations here so in
1:29:35
about the order of operations here so in the case of Tik token we first take our
1:29:38
the case of Tik token we first take our
1:29:38
the case of Tik token we first take our code points in the string we encode them
1:29:40
code points in the string we encode them
1:29:41
code points in the string we encode them using mutf to bytes and then we're
1:29:42
using mutf to bytes and then we're
1:29:42
using mutf to bytes and then we're merging bytes it's fairly
1:29:44
merging bytes it's fairly
1:29:44
merging bytes it's fairly straightforward for sentence piece um it
1:29:48
straightforward for sentence piece um it
1:29:48
straightforward for sentence piece um it works directly on the level of the code
1:29:50
works directly on the level of the code
1:29:50
works directly on the level of the code points themselves so so it looks at
1:29:52
points themselves so so it looks at
1:29:52
points themselves so so it looks at whatever code points are available in
1:29:53
whatever code points are available in
1:29:53
whatever code points are available in your training set and then it starts
1:29:55
your training set and then it starts
1:29:55
your training set and then it starts merging those code points and um the bpe
1:29:59
merging those code points and um the bpe
1:29:59
merging those code points and um the bpe is running on the level of code
1:30:01
is running on the level of code
1:30:01
is running on the level of code points and if you happen to run out of
1:30:04
points and if you happen to run out of
1:30:04
points and if you happen to run out of code points so there are maybe some rare
1:30:06
code points so there are maybe some rare
1:30:06
code points so there are maybe some rare uh code points that just don't come up
1:30:08
uh code points that just don't come up
1:30:08
uh code points that just don't come up too often and the Rarity is determined
1:30:09
too often and the Rarity is determined
1:30:09
too often and the Rarity is determined by this character coverage hyper
1:30:11
by this character coverage hyper
1:30:11
by this character coverage hyper parameter then these uh code points will
1:30:14
parameter then these uh code points will
1:30:14
parameter then these uh code points will either get mapped to a special unknown
1:30:16
either get mapped to a special unknown
1:30:16
either get mapped to a special unknown token like ank or if you have the bite
1:30:19
token like ank or if you have the bite
1:30:19
token like ank or if you have the bite foldback option turned on then that will
1:30:22
foldback option turned on then that will
1:30:22
foldback option turned on then that will take those rare Cod points it will
1:30:23
take those rare Cod points it will
1:30:23
take those rare Cod points it will encode them using utf8 and then the
1:30:26
encode them using utf8 and then the
1:30:26
encode them using utf8 and then the individual bytes of that encoding will
1:30:27
individual bytes of that encoding will
1:30:27
individual bytes of that encoding will be translated into tokens and there are
1:30:30
be translated into tokens and there are
1:30:30
be translated into tokens and there are these special bite tokens that basically
1:30:32
these special bite tokens that basically
1:30:32
these special bite tokens that basically get added to the vocabulary so it uses
1:30:35
get added to the vocabulary so it uses
1:30:35
get added to the vocabulary so it uses BP on on the code points and then it
1:30:38
BP on on the code points and then it
1:30:38
BP on on the code points and then it falls back to bytes for rare Cod points
1:30:41
falls back to bytes for rare Cod points
1:30:41
falls back to bytes for rare Cod points um and so that's kind of like difference
1:30:44
um and so that's kind of like difference
1:30:44
um and so that's kind of like difference personally I find the Tik token we
1:30:45
personally I find the Tik token we
1:30:45
personally I find the Tik token we significantly cleaner uh but it's kind
1:30:47
significantly cleaner uh but it's kind
1:30:47
significantly cleaner uh but it's kind of like a subtle but pretty major
1:30:48
of like a subtle but pretty major
1:30:48
of like a subtle but pretty major difference between the way they approach
1:30:50
difference between the way they approach
1:30:50
difference between the way they approach tokenization let's work with with a
1:30:52
tokenization let's work with with a
1:30:52
tokenization let's work with with a concrete example because otherwise this
1:30:53
concrete example because otherwise this
1:30:54
concrete example because otherwise this is kind of hard to um to get your head
1:30:56
is kind of hard to um to get your head
1:30:56
is kind of hard to um to get your head around so let's work with a concrete
1:30:59
around so let's work with a concrete
1:30:59
around so let's work with a concrete example this is how we can import
1:31:01
example this is how we can import
1:31:01
example this is how we can import sentence piece and then here we're going
1:31:03
sentence piece and then here we're going
1:31:03
sentence piece and then here we're going to take I think I took like the
1:31:05
to take I think I took like the
1:31:05
to take I think I took like the description of sentence piece and I just
1:31:06
description of sentence piece and I just
1:31:06
description of sentence piece and I just created like a little toy data set it
1:31:08
created like a little toy data set it
1:31:08
created like a little toy data set it really likes to have a file so I created
1:31:10
really likes to have a file so I created
1:31:10
really likes to have a file so I created a toy. txt file with this
1:31:13
a toy. txt file with this
1:31:13
a toy. txt file with this content now what's kind of a little bit
1:31:15
content now what's kind of a little bit
1:31:15
content now what's kind of a little bit crazy about sentence piece is that
1:31:16
crazy about sentence piece is that
1:31:16
crazy about sentence piece is that there's a ton of options and
1:31:18
there's a ton of options and
1:31:18
there's a ton of options and configurations and the reason this is so
1:31:20
configurations and the reason this is so
1:31:20
configurations and the reason this is so is because sentence piece has been
1:31:22
is because sentence piece has been
1:31:22
is because sentence piece has been around I think for a while and it really
1:31:23
around I think for a while and it really
1:31:23
around I think for a while and it really tries to handle a large diversity of
1:31:25
tries to handle a large diversity of
1:31:25
tries to handle a large diversity of things and um because it's been around I
1:31:28
things and um because it's been around I
1:31:28
things and um because it's been around I think it has quite a bit of accumulated
1:31:30
think it has quite a bit of accumulated
1:31:30
think it has quite a bit of accumulated historical baggage uh as well and so in
1:31:33
historical baggage uh as well and so in
1:31:33
historical baggage uh as well and so in particular there's like a ton of
1:31:35
particular there's like a ton of
1:31:35
particular there's like a ton of configuration arguments this is not even
1:31:36
configuration arguments this is not even
1:31:36
configuration arguments this is not even all of it you can go to here to see all
1:31:39
all of it you can go to here to see all
1:31:39
all of it you can go to here to see all the training
1:31:40
the training
1:31:40
the training options um and uh there's also quite
1:31:44
options um and uh there's also quite
1:31:44
options um and uh there's also quite useful documentation when you look at
1:31:45
useful documentation when you look at
1:31:45
useful documentation when you look at the raw Proto buff uh that is used to
1:31:48
the raw Proto buff uh that is used to
1:31:48
the raw Proto buff uh that is used to represent the trainer spec and so on um
1:31:52
represent the trainer spec and so on um
1:31:52
represent the trainer spec and so on um many of these options are irrelevant to
1:31:54
many of these options are irrelevant to
1:31:54
many of these options are irrelevant to us so maybe to point out one example Das
1:31:56
us so maybe to point out one example Das
1:31:56
us so maybe to point out one example Das Das shrinking Factor uh this shrinking
1:31:59
Das shrinking Factor uh this shrinking
1:31:59
Das shrinking Factor uh this shrinking factor is not used in the B pair en
1:32:01
factor is not used in the B pair en
1:32:01
factor is not used in the B pair en coding algorithm so this is just an
1:32:03
coding algorithm so this is just an
1:32:03
coding algorithm so this is just an argument that is irrelevant to us um it
1:32:05
argument that is irrelevant to us um it
1:32:05
argument that is irrelevant to us um it applies to a different training
1:32:09
algorithm now what I tried to do here is
1:32:11
algorithm now what I tried to do here is
1:32:11
algorithm now what I tried to do here is I tried to set up sentence piece in a
1:32:13
I tried to set up sentence piece in a
1:32:13
I tried to set up sentence piece in a way that is very very similar as far as
1:32:15
way that is very very similar as far as
1:32:15
way that is very very similar as far as I can tell to maybe identical hopefully
1:32:18
I can tell to maybe identical hopefully
1:32:18
I can tell to maybe identical hopefully to the way that llama 2 was strained so
1:32:22
to the way that llama 2 was strained so
1:32:22
to the way that llama 2 was strained so the way they trained their own um their
1:32:25
the way they trained their own um their
1:32:25
the way they trained their own um their own tokenizer and the way I did this was
1:32:27
own tokenizer and the way I did this was
1:32:27
own tokenizer and the way I did this was basically you can take the tokenizer
1:32:28
basically you can take the tokenizer
1:32:28
basically you can take the tokenizer model file that meta released and you
1:32:31
model file that meta released and you
1:32:31
model file that meta released and you can um open it using the Proto protuff
1:32:35
can um open it using the Proto protuff
1:32:35
can um open it using the Proto protuff uh sort of file that you can generate
1:32:38
uh sort of file that you can generate
1:32:38
uh sort of file that you can generate and then you can inspect all the options
1:32:39
and then you can inspect all the options
1:32:39
and then you can inspect all the options and I tried to copy over all the options
1:32:41
and I tried to copy over all the options
1:32:41
and I tried to copy over all the options that looked relevant so here we set up
1:32:43
that looked relevant so here we set up
1:32:43
that looked relevant so here we set up the input it's raw text in this file
1:32:46
the input it's raw text in this file
1:32:46
the input it's raw text in this file here's going to be the output so it's
1:32:48
here's going to be the output so it's
1:32:48
here's going to be the output so it's going to be for talk 400. model and
1:32:50
going to be for talk 400. model and
1:32:50
going to be for talk 400. model and vocab
1:32:52
vocab
1:32:52
vocab we're saying that we're going to use the
1:32:53
we're saying that we're going to use the
1:32:53
we're saying that we're going to use the BP algorithm and we want to Bap size of
1:32:56
BP algorithm and we want to Bap size of
1:32:56
BP algorithm and we want to Bap size of 400 then there's a ton of configurations
1:32:58
400 then there's a ton of configurations
1:32:58
400 then there's a ton of configurations here
1:33:01
for um for basically pre-processing and
1:33:05
for um for basically pre-processing and
1:33:05
for um for basically pre-processing and normalization rules as they're called
1:33:07
normalization rules as they're called
1:33:07
normalization rules as they're called normalization used to be very prevalent
1:33:09
normalization used to be very prevalent
1:33:09
normalization used to be very prevalent I would say before llms in natural
1:33:11
I would say before llms in natural
1:33:11
I would say before llms in natural language processing so in machine
1:33:12
language processing so in machine
1:33:12
language processing so in machine translation and uh text classification
1:33:14
translation and uh text classification
1:33:14
translation and uh text classification and so on you want to normalize and
1:33:16
and so on you want to normalize and
1:33:16
and so on you want to normalize and simplify the text and you want to turn
1:33:17
simplify the text and you want to turn
1:33:18
simplify the text and you want to turn it all lowercase and you want to remove
1:33:19
it all lowercase and you want to remove
1:33:19
it all lowercase and you want to remove all double whites space Etc
1:33:22
all double whites space Etc
1:33:22
all double whites space Etc and in language models we prefer not to
1:33:23
and in language models we prefer not to
1:33:23
and in language models we prefer not to do any of it or at least that is my
1:33:25
do any of it or at least that is my
1:33:25
do any of it or at least that is my preference as a deep learning person you
1:33:26
preference as a deep learning person you
1:33:26
preference as a deep learning person you want to not touch your data you want to
1:33:28
want to not touch your data you want to
1:33:28
want to not touch your data you want to keep the raw data as much as possible um
1:33:31
keep the raw data as much as possible um
1:33:31
keep the raw data as much as possible um in a raw
1:33:33
in a raw
1:33:33
in a raw form so you're basically trying to turn
1:33:35
form so you're basically trying to turn
1:33:35
form so you're basically trying to turn off a lot of this if you can the other
1:33:37
off a lot of this if you can the other
1:33:38
off a lot of this if you can the other thing that sentence piece does is that
1:33:39
thing that sentence piece does is that
1:33:39
thing that sentence piece does is that it has this concept of sentences so
1:33:43
it has this concept of sentences so
1:33:43
it has this concept of sentences so sentence piece it's back it's kind of
1:33:45
sentence piece it's back it's kind of
1:33:45
sentence piece it's back it's kind of like was developed I think early in the
1:33:46
like was developed I think early in the
1:33:46
like was developed I think early in the days where there was um an idea that
1:33:50
days where there was um an idea that
1:33:50
days where there was um an idea that they you're training a tokenizer on a
1:33:51
they you're training a tokenizer on a
1:33:51
they you're training a tokenizer on a bunch of independent sentences so it has
1:33:54
bunch of independent sentences so it has
1:33:54
bunch of independent sentences so it has a lot of like how many sentences you're
1:33:56
a lot of like how many sentences you're
1:33:56
a lot of like how many sentences you're going to train on what is the maximum
1:33:57
going to train on what is the maximum
1:33:58
going to train on what is the maximum sentence length
1:34:00
sentence length
1:34:00
sentence length um shuffling sentences and so for it
1:34:03
um shuffling sentences and so for it
1:34:03
um shuffling sentences and so for it sentences are kind of like the
1:34:04
sentences are kind of like the
1:34:04
sentences are kind of like the individual training examples but again
1:34:06
individual training examples but again
1:34:06
individual training examples but again in the context of llms I find that this
1:34:08
in the context of llms I find that this
1:34:08
in the context of llms I find that this is like a very spous and weird
1:34:10
is like a very spous and weird
1:34:10
is like a very spous and weird distinction like sentences are just like
1:34:13
distinction like sentences are just like
1:34:13
distinction like sentences are just like don't touch the raw data sentences
1:34:15
don't touch the raw data sentences
1:34:15
don't touch the raw data sentences happen to exist but in raw data sets
1:34:18
happen to exist but in raw data sets
1:34:18
happen to exist but in raw data sets there are a lot of like inet like what
1:34:20
there are a lot of like inet like what
1:34:20
there are a lot of like inet like what exactly is a sentence what isn't a
1:34:22
exactly is a sentence what isn't a
1:34:22
exactly is a sentence what isn't a sentence um and so I think like it's
1:34:24
sentence um and so I think like it's
1:34:25
sentence um and so I think like it's really hard to Define what an actual
1:34:26
really hard to Define what an actual
1:34:26
really hard to Define what an actual sentence is if you really like dig into
1:34:28
sentence is if you really like dig into
1:34:28
sentence is if you really like dig into it and there could be different concepts
1:34:30
it and there could be different concepts
1:34:30
it and there could be different concepts of it in different languages or
1:34:32
of it in different languages or
1:34:32
of it in different languages or something like that so why even
1:34:33
something like that so why even
1:34:33
something like that so why even introduce the concept it it doesn't
1:34:35
introduce the concept it it doesn't
1:34:35
introduce the concept it it doesn't honestly make sense to me I would just
1:34:36
honestly make sense to me I would just
1:34:36
honestly make sense to me I would just prefer to treat a file as a giant uh
1:34:39
prefer to treat a file as a giant uh
1:34:39
prefer to treat a file as a giant uh stream of
1:34:40
stream of
1:34:40
stream of bytes it has a lot of treatment around
1:34:42
bytes it has a lot of treatment around
1:34:42
bytes it has a lot of treatment around rare word characters and when I say word
1:34:45
rare word characters and when I say word
1:34:45
rare word characters and when I say word I mean code points we're going to come
1:34:46
I mean code points we're going to come
1:34:46
I mean code points we're going to come back to this in a second and it has a
1:34:48
back to this in a second and it has a
1:34:48
back to this in a second and it has a lot of other rules for um basically
1:34:51
lot of other rules for um basically
1:34:51
lot of other rules for um basically splitting digits splitting white space
1:34:54
splitting digits splitting white space
1:34:54
splitting digits splitting white space and numbers and how you deal with that
1:34:56
and numbers and how you deal with that
1:34:56
and numbers and how you deal with that so these are some kind of like merge
1:34:58
so these are some kind of like merge
1:34:58
so these are some kind of like merge rules so I think this is a little bit
1:35:00
rules so I think this is a little bit
1:35:00
rules so I think this is a little bit equivalent to tick token using the
1:35:02
equivalent to tick token using the
1:35:02
equivalent to tick token using the regular expression to split up
1:35:04
regular expression to split up
1:35:04
regular expression to split up categories there's like kind of
1:35:07
categories there's like kind of
1:35:07
categories there's like kind of equivalence of it if you squint T it in
1:35:09
equivalence of it if you squint T it in
1:35:09
equivalence of it if you squint T it in sentence piece where you can also for
1:35:10
sentence piece where you can also for
1:35:10
sentence piece where you can also for example split up split up the digits uh
1:35:14
example split up split up the digits uh
1:35:14
example split up split up the digits uh and uh so
1:35:15
and uh so
1:35:15
and uh so on there's a few more things here that
1:35:18
on there's a few more things here that
1:35:18
on there's a few more things here that I'll come back to in a bit and then
1:35:19
I'll come back to in a bit and then
1:35:19
I'll come back to in a bit and then there are some special tokens that you
1:35:20
there are some special tokens that you
1:35:20
there are some special tokens that you can indicate and it hardcodes the UN
1:35:23
can indicate and it hardcodes the UN
1:35:23
can indicate and it hardcodes the UN token the beginning of sentence end of
1:35:25
token the beginning of sentence end of
1:35:25
token the beginning of sentence end of sentence and a pad token um and the UN
1:35:29
sentence and a pad token um and the UN
1:35:29
sentence and a pad token um and the UN token must exist for my understanding
1:35:32
token must exist for my understanding
1:35:32
token must exist for my understanding and then some some things so we can
1:35:34
and then some some things so we can
1:35:34
and then some some things so we can train and when when I press train it's
1:35:37
train and when when I press train it's
1:35:37
train and when when I press train it's going to create this file talk 400.
1:35:40
going to create this file talk 400.
1:35:40
going to create this file talk 400. model and talk 400. wab I can then load
1:35:43
model and talk 400. wab I can then load
1:35:43
model and talk 400. wab I can then load the model file and I can inspect the
1:35:45
the model file and I can inspect the
1:35:45
the model file and I can inspect the vocabulary off it and so we trained
1:35:48
vocabulary off it and so we trained
1:35:48
vocabulary off it and so we trained vocab size 400 on this text here and
1:35:53
vocab size 400 on this text here and
1:35:53
vocab size 400 on this text here and these are the individual pieces the
1:35:54
these are the individual pieces the
1:35:55
these are the individual pieces the individual tokens that sentence piece
1:35:56
individual tokens that sentence piece
1:35:56
individual tokens that sentence piece will create so in the beginning we see
1:35:58
will create so in the beginning we see
1:35:58
will create so in the beginning we see that we have the an token uh with the ID
1:36:02
that we have the an token uh with the ID
1:36:02
that we have the an token uh with the ID zero then we have the beginning of
1:36:04
zero then we have the beginning of
1:36:04
zero then we have the beginning of sequence end of sequence one and two and
1:36:07
sequence end of sequence one and two and
1:36:07
sequence end of sequence one and two and then we said that the pad ID is negative
1:36:09
then we said that the pad ID is negative
1:36:09
then we said that the pad ID is negative 1 so we chose not to use it so there's
1:36:12
1 so we chose not to use it so there's
1:36:12
1 so we chose not to use it so there's no pad ID
1:36:13
no pad ID
1:36:13
no pad ID here then these are individual bite
1:36:16
here then these are individual bite
1:36:16
here then these are individual bite tokens so here we saw that bite fallback
1:36:20
tokens so here we saw that bite fallback
1:36:20
tokens so here we saw that bite fallback in llama was turned on so it's true so
1:36:23
in llama was turned on so it's true so
1:36:23
in llama was turned on so it's true so what follows are going to be the 256
1:36:26
what follows are going to be the 256
1:36:26
what follows are going to be the 256 bite
1:36:27
bite
1:36:27
bite tokens and these are their
1:36:31
IDs and then at the bottom after the
1:36:35
IDs and then at the bottom after the
1:36:35
IDs and then at the bottom after the bite tokens come the
1:36:37
bite tokens come the
1:36:37
bite tokens come the merges and these are the parent nodes in
1:36:40
merges and these are the parent nodes in
1:36:40
merges and these are the parent nodes in the merges so we're not seeing the
1:36:42
the merges so we're not seeing the
1:36:42
the merges so we're not seeing the children we're just seeing the parents
1:36:43
children we're just seeing the parents
1:36:43
children we're just seeing the parents and their
1:36:44
and their
1:36:44
and their ID and then after the
1:36:47
ID and then after the
1:36:47
ID and then after the merges comes eventually the individual
1:36:50
merges comes eventually the individual
1:36:50
merges comes eventually the individual tokens and their IDs and so these are
1:36:53
tokens and their IDs and so these are
1:36:53
tokens and their IDs and so these are the individual tokens so these are the
1:36:55
the individual tokens so these are the
1:36:55
the individual tokens so these are the individual code Point tokens if you will
1:36:58
individual code Point tokens if you will
1:36:58
individual code Point tokens if you will and they come at the end so that is the
1:37:00
and they come at the end so that is the
1:37:00
and they come at the end so that is the ordering with which sentence piece sort
1:37:01
ordering with which sentence piece sort
1:37:01
ordering with which sentence piece sort of like represents its vocabularies it
1:37:03
of like represents its vocabularies it
1:37:03
of like represents its vocabularies it starts with special tokens then the bike
1:37:06
starts with special tokens then the bike
1:37:06
starts with special tokens then the bike tokens then the merge tokens and then
1:37:08
tokens then the merge tokens and then
1:37:08
tokens then the merge tokens and then the individual codo tokens and all these
1:37:11
the individual codo tokens and all these
1:37:11
the individual codo tokens and all these raw codepoint to tokens are the ones
1:37:14
raw codepoint to tokens are the ones
1:37:14
raw codepoint to tokens are the ones that it encountered in the training
1:37:16
that it encountered in the training
1:37:16
that it encountered in the training set so those individual code points are
1:37:19
set so those individual code points are
1:37:19
set so those individual code points are all the the entire set of code points
1:37:22
all the the entire set of code points
1:37:22
all the the entire set of code points that occurred
1:37:24
that occurred
1:37:24
that occurred here so those all get put in there and
1:37:27
here so those all get put in there and
1:37:27
here so those all get put in there and then those that are extremely rare as
1:37:29
then those that are extremely rare as
1:37:29
then those that are extremely rare as determined by character coverage so if a
1:37:31
determined by character coverage so if a
1:37:31
determined by character coverage so if a code Point occurred only a single time
1:37:32
code Point occurred only a single time
1:37:32
code Point occurred only a single time out of like a million um sentences or
1:37:35
out of like a million um sentences or
1:37:35
out of like a million um sentences or something like that then it would be
1:37:37
something like that then it would be
1:37:37
something like that then it would be ignored and it would not be added to our
1:37:40
ignored and it would not be added to our
1:37:40
ignored and it would not be added to our uh
1:37:41
uh
1:37:41
uh vocabulary once we have a vocabulary we
1:37:43
vocabulary once we have a vocabulary we
1:37:43
vocabulary once we have a vocabulary we can encode into IDs and we can um sort
1:37:46
can encode into IDs and we can um sort
1:37:46
can encode into IDs and we can um sort of get a
1:37:47
of get a
1:37:47
of get a list and then here I am also decoding
1:37:50
list and then here I am also decoding
1:37:50
list and then here I am also decoding the indiv idual tokens back into little
1:37:54
the indiv idual tokens back into little
1:37:54
the indiv idual tokens back into little pieces as they call it so let's take a
1:37:56
pieces as they call it so let's take a
1:37:56
pieces as they call it so let's take a look at what happened here hello space
1:38:01
look at what happened here hello space
1:38:01
look at what happened here hello space on so these are the token IDs we got
1:38:04
on so these are the token IDs we got
1:38:04
on so these are the token IDs we got back and when we look here uh a few
1:38:07
back and when we look here uh a few
1:38:07
back and when we look here uh a few things sort of uh jump to mind number
1:38:11
things sort of uh jump to mind number
1:38:11
things sort of uh jump to mind number one take a look at these characters the
1:38:14
one take a look at these characters the
1:38:14
one take a look at these characters the Korean characters of course were not
1:38:15
Korean characters of course were not
1:38:15
Korean characters of course were not part of the training set so sentence
1:38:17
part of the training set so sentence
1:38:18
part of the training set so sentence piece is encountering code points that
1:38:19
piece is encountering code points that
1:38:19
piece is encountering code points that it has not seen during training time and
1:38:22
it has not seen during training time and
1:38:22
it has not seen during training time and those code points do not have a token
1:38:24
those code points do not have a token
1:38:24
those code points do not have a token associated with them so suddenly these
1:38:26
associated with them so suddenly these
1:38:26
associated with them so suddenly these are un tokens unknown tokens but because
1:38:30
are un tokens unknown tokens but because
1:38:30
are un tokens unknown tokens but because bite fall back as true instead sentence
1:38:33
bite fall back as true instead sentence
1:38:33
bite fall back as true instead sentence piece falls back to bytes and so it
1:38:36
piece falls back to bytes and so it
1:38:36
piece falls back to bytes and so it takes this it encodes it with utf8 and
1:38:39
takes this it encodes it with utf8 and
1:38:39
takes this it encodes it with utf8 and then it uses these tokens to represent
1:38:43
then it uses these tokens to represent
1:38:43
then it uses these tokens to represent uh those bytes and that's what we are
1:38:45
uh those bytes and that's what we are
1:38:45
uh those bytes and that's what we are getting sort of here this is the utf8 uh
1:38:49
getting sort of here this is the utf8 uh
1:38:49
getting sort of here this is the utf8 uh encoding and in this shifted by three uh
1:38:52
encoding and in this shifted by three uh
1:38:52
encoding and in this shifted by three uh because of these um special tokens here
1:38:56
because of these um special tokens here
1:38:56
because of these um special tokens here that have IDs earlier on so that's what
1:38:58
that have IDs earlier on so that's what
1:38:58
that have IDs earlier on so that's what happened here now one more thing that um
1:39:02
happened here now one more thing that um
1:39:02
happened here now one more thing that um well first before I go on with respect
1:39:05
well first before I go on with respect
1:39:05
well first before I go on with respect to the bitef back let me remove bite
1:39:08
to the bitef back let me remove bite
1:39:08
to the bitef back let me remove bite foldback if this is false what's going
1:39:10
foldback if this is false what's going
1:39:10
foldback if this is false what's going to happen let's
1:39:12
to happen let's
1:39:12
to happen let's retrain so the first thing that happened
1:39:14
retrain so the first thing that happened
1:39:14
retrain so the first thing that happened is all the bite tokens disappeared right
1:39:17
is all the bite tokens disappeared right
1:39:17
is all the bite tokens disappeared right and now we just have the merges and we
1:39:18
and now we just have the merges and we
1:39:19
and now we just have the merges and we have a lot more merges now because we
1:39:20
have a lot more merges now because we
1:39:20
have a lot more merges now because we have a lot more space because we're not
1:39:21
have a lot more space because we're not
1:39:21
have a lot more space because we're not taking up space in the wab size uh with
1:39:25
taking up space in the wab size uh with
1:39:25
taking up space in the wab size uh with all the
1:39:25
all the
1:39:25
all the bytes and now if we encode
1:39:29
bytes and now if we encode
1:39:29
bytes and now if we encode this we get a zero so this entire string
1:39:33
this we get a zero so this entire string
1:39:33
this we get a zero so this entire string here suddenly there's no bitef back so
1:39:35
here suddenly there's no bitef back so
1:39:35
here suddenly there's no bitef back so this is unknown and unknown is an and so
1:39:39
this is unknown and unknown is an and so
1:39:39
this is unknown and unknown is an and so this is zero because the an token is
1:39:42
this is zero because the an token is
1:39:42
this is zero because the an token is token zero and you have to keep in mind
1:39:44
token zero and you have to keep in mind
1:39:44
token zero and you have to keep in mind that this would feed into your uh
1:39:46
that this would feed into your uh
1:39:46
that this would feed into your uh language model so what is a language
1:39:48
language model so what is a language
1:39:48
language model so what is a language model supposed to do when all kinds of
1:39:49
model supposed to do when all kinds of
1:39:49
model supposed to do when all kinds of different things that are unrecognized
1:39:52
different things that are unrecognized
1:39:52
different things that are unrecognized because they're rare just end up mapping
1:39:53
because they're rare just end up mapping
1:39:54
because they're rare just end up mapping into Unk it's not exactly the property
1:39:56
into Unk it's not exactly the property
1:39:56
into Unk it's not exactly the property that you want so that's why I think
1:39:57
that you want so that's why I think
1:39:57
that you want so that's why I think llama correctly uh used by fallback true
1:40:02
llama correctly uh used by fallback true
1:40:02
llama correctly uh used by fallback true uh because we definitely want to feed
1:40:03
uh because we definitely want to feed
1:40:03
uh because we definitely want to feed these um unknown or rare code points
1:40:06
these um unknown or rare code points
1:40:06
these um unknown or rare code points into the model and some uh some manner
1:40:08
into the model and some uh some manner
1:40:08
into the model and some uh some manner the next thing I want to show you is the
1:40:10
the next thing I want to show you is the
1:40:10
the next thing I want to show you is the following notice here when we are
1:40:12
following notice here when we are
1:40:12
following notice here when we are decoding all the individual tokens you
1:40:14
decoding all the individual tokens you
1:40:14
decoding all the individual tokens you see how spaces uh space here ends up
1:40:18
see how spaces uh space here ends up
1:40:18
see how spaces uh space here ends up being this um bold underline I'm not
1:40:21
being this um bold underline I'm not
1:40:21
being this um bold underline I'm not 100% sure by the way why sentence piece
1:40:23
100% sure by the way why sentence piece
1:40:23
100% sure by the way why sentence piece switches whites space into these bold
1:40:25
switches whites space into these bold
1:40:25
switches whites space into these bold underscore characters maybe it's for
1:40:27
underscore characters maybe it's for
1:40:27
underscore characters maybe it's for visualization I'm not 100% sure why that
1:40:29
visualization I'm not 100% sure why that
1:40:29
visualization I'm not 100% sure why that happens uh but notice this why do we
1:40:32
happens uh but notice this why do we
1:40:32
happens uh but notice this why do we have an extra space in the front of
1:40:37
have an extra space in the front of
1:40:37
have an extra space in the front of hello um what where is this coming from
1:40:40
hello um what where is this coming from
1:40:40
hello um what where is this coming from well it's coming from this option
1:40:43
well it's coming from this option
1:40:43
well it's coming from this option here
1:40:45
here
1:40:45
here um add dummy prefix is true and when you
1:40:48
um add dummy prefix is true and when you
1:40:48
um add dummy prefix is true and when you go to the
1:40:49
go to the
1:40:49
go to the documentation add D whites space at the
1:40:51
documentation add D whites space at the
1:40:51
documentation add D whites space at the beginning of text in order to treat
1:40:53
beginning of text in order to treat
1:40:53
beginning of text in order to treat World in world and hello world in the
1:40:55
World in world and hello world in the
1:40:55
World in world and hello world in the exact same way so what this is trying to
1:40:57
exact same way so what this is trying to
1:40:57
exact same way so what this is trying to do is the
1:40:59
do is the
1:40:59
do is the following if we go back to our tick
1:41:02
following if we go back to our tick
1:41:02
following if we go back to our tick tokenizer world as uh token by itself
1:41:06
tokenizer world as uh token by itself
1:41:06
tokenizer world as uh token by itself has a different ID than space world so
1:41:10
has a different ID than space world so
1:41:10
has a different ID than space world so we have this is 1917 but this is 14 Etc
1:41:14
we have this is 1917 but this is 14 Etc
1:41:14
we have this is 1917 but this is 14 Etc so these are two different tokens for
1:41:15
so these are two different tokens for
1:41:16
so these are two different tokens for the language model and the language
1:41:17
the language model and the language
1:41:17
the language model and the language model has to learn from data that they
1:41:18
model has to learn from data that they
1:41:18
model has to learn from data that they are actually kind of like a very similar
1:41:20
are actually kind of like a very similar
1:41:20
are actually kind of like a very similar concept so to the language model in the
1:41:22
concept so to the language model in the
1:41:23
concept so to the language model in the Tik token World um basically words in
1:41:25
Tik token World um basically words in
1:41:26
Tik token World um basically words in the beginning of sentences and words in
1:41:27
the beginning of sentences and words in
1:41:27
the beginning of sentences and words in the middle of sentences actually look
1:41:29
the middle of sentences actually look
1:41:29
the middle of sentences actually look completely different um and it has to
1:41:32
completely different um and it has to
1:41:32
completely different um and it has to learned that they are roughly the same
1:41:34
learned that they are roughly the same
1:41:34
learned that they are roughly the same so this add dami prefix is trying to
1:41:36
so this add dami prefix is trying to
1:41:36
so this add dami prefix is trying to fight that a little bit and the way that
1:41:38
fight that a little bit and the way that
1:41:38
fight that a little bit and the way that works is that it basically
1:41:41
works is that it basically
1:41:41
works is that it basically uh adds a dummy prefix so for as a as a
1:41:46
uh adds a dummy prefix so for as a as a
1:41:46
uh adds a dummy prefix so for as a as a part of pre-processing it will take the
1:41:49
part of pre-processing it will take the
1:41:49
part of pre-processing it will take the string and it will add a space it will
1:41:51
string and it will add a space it will
1:41:51
string and it will add a space it will do this and that's done in an effort to
1:41:54
do this and that's done in an effort to
1:41:54
do this and that's done in an effort to make this world and that world the same
1:41:57
make this world and that world the same
1:41:57
make this world and that world the same they will both be space world so that's
1:42:00
they will both be space world so that's
1:42:00
they will both be space world so that's one other kind of pre-processing option
1:42:02
one other kind of pre-processing option
1:42:02
one other kind of pre-processing option that is turned on and llama 2 also uh
1:42:05
that is turned on and llama 2 also uh
1:42:05
that is turned on and llama 2 also uh uses this option and that's I think
1:42:07
uses this option and that's I think
1:42:07
uses this option and that's I think everything that I want to say for my
1:42:08
everything that I want to say for my
1:42:08
everything that I want to say for my preview of sentence piece and how it is
1:42:10
preview of sentence piece and how it is
1:42:10
preview of sentence piece and how it is different um maybe here what I've done
1:42:13
different um maybe here what I've done
1:42:13
different um maybe here what I've done is I just uh put in the Raw protocol
1:42:16
is I just uh put in the Raw protocol
1:42:16
is I just uh put in the Raw protocol buffer representation basically of the
1:42:19
buffer representation basically of the
1:42:19
buffer representation basically of the tokenizer the too trained so feel free
1:42:22
tokenizer the too trained so feel free
1:42:22
tokenizer the too trained so feel free to sort of Step through this and if you
1:42:24
to sort of Step through this and if you
1:42:24
to sort of Step through this and if you would like uh your tokenization to look
1:42:26
would like uh your tokenization to look
1:42:27
would like uh your tokenization to look identical to that of the meta uh llama 2
1:42:30
identical to that of the meta uh llama 2
1:42:30
identical to that of the meta uh llama 2 then you would be copy pasting these
1:42:31
then you would be copy pasting these
1:42:31
then you would be copy pasting these settings as I tried to do up above and
1:42:34
settings as I tried to do up above and
1:42:34
settings as I tried to do up above and uh yeah that's I think that's it for
1:42:36
uh yeah that's I think that's it for
1:42:36
uh yeah that's I think that's it for this section I think my summary for
1:42:38
this section I think my summary for
1:42:38
this section I think my summary for sentence piece from all of this is
1:42:40
sentence piece from all of this is
1:42:40
sentence piece from all of this is number one I think that there's a lot of
1:42:42
number one I think that there's a lot of
1:42:42
number one I think that there's a lot of historical baggage in sentence piece a
1:42:44
historical baggage in sentence piece a
1:42:44
historical baggage in sentence piece a lot of Concepts that I think are
1:42:45
lot of Concepts that I think are
1:42:45
lot of Concepts that I think are slightly confusing and I think
1:42:47
slightly confusing and I think
1:42:47
slightly confusing and I think potentially um contain foot guns like
1:42:49
potentially um contain foot guns like
1:42:49
potentially um contain foot guns like this concept of a sentence and it's
1:42:50
this concept of a sentence and it's
1:42:50
this concept of a sentence and it's maximum length and stuff like that um
1:42:53
maximum length and stuff like that um
1:42:53
maximum length and stuff like that um otherwise it is fairly commonly used in
1:42:55
otherwise it is fairly commonly used in
1:42:55
otherwise it is fairly commonly used in the industry um because it is efficient
1:42:58
the industry um because it is efficient
1:42:58
the industry um because it is efficient and can do both training and inference
1:43:00
and can do both training and inference
1:43:01
and can do both training and inference uh it has a few quirks like for example
1:43:02
uh it has a few quirks like for example
1:43:02
uh it has a few quirks like for example un token must exist and the way the bite
1:43:05
un token must exist and the way the bite
1:43:05
un token must exist and the way the bite fallbacks are done and so on I don't
1:43:06
fallbacks are done and so on I don't
1:43:06
fallbacks are done and so on I don't find particularly elegant and
1:43:08
find particularly elegant and
1:43:08
find particularly elegant and unfortunately I have to say it's not
1:43:09
unfortunately I have to say it's not
1:43:09
unfortunately I have to say it's not very well documented so it took me a lot
1:43:11
very well documented so it took me a lot
1:43:11
very well documented so it took me a lot of time working with this myself um and
1:43:14
of time working with this myself um and
1:43:14
of time working with this myself um and just visualizing things and trying to
1:43:16
just visualizing things and trying to
1:43:16
just visualizing things and trying to really understand what is happening here
1:43:17
really understand what is happening here
1:43:17
really understand what is happening here because uh the documentation
1:43:19
because uh the documentation
1:43:19
because uh the documentation unfortunately is in my opion not not
1:43:21
unfortunately is in my opion not not
1:43:21
unfortunately is in my opion not not super amazing but it is a very nice repo
1:43:24
super amazing but it is a very nice repo
1:43:24
super amazing but it is a very nice repo that is available to you if you'd like
1:43:26
that is available to you if you'd like
1:43:26
that is available to you if you'd like to train your own tokenizer right now
1:43:28
to train your own tokenizer right now
1:43:28
to train your own tokenizer right now okay let me now switch gears again as
1:43:29
okay let me now switch gears again as
1:43:29
okay let me now switch gears again as we're starting to slowly wrap up here I
1:43:31
we're starting to slowly wrap up here I
1:43:31
we're starting to slowly wrap up here I want to revisit this issue in a bit more
1:43:33
want to revisit this issue in a bit more
1:43:33
want to revisit this issue in a bit more detail of how we should set the vocap
1:43:35
detail of how we should set the vocap
1:43:35
detail of how we should set the vocap size and what are some of the
1:43:36
size and what are some of the
1:43:36
size and what are some of the considerations around it so for this I'd
1:43:39
considerations around it so for this I'd
1:43:39
considerations around it so for this I'd like to go back to the model
1:43:40
like to go back to the model
1:43:40
like to go back to the model architecture that we developed in the
1:43:42
architecture that we developed in the
1:43:42
architecture that we developed in the last video when we built the GPT from
1:43:44
last video when we built the GPT from
1:43:44
last video when we built the GPT from scratch so this here was uh the file
1:43:47
scratch so this here was uh the file
1:43:47
scratch so this here was uh the file that we built in the previous video and
1:43:49
that we built in the previous video and
1:43:49
that we built in the previous video and we defined the Transformer model and and
1:43:51
we defined the Transformer model and and
1:43:51
we defined the Transformer model and and let's specifically look at Bap size and
1:43:52
let's specifically look at Bap size and
1:43:52
let's specifically look at Bap size and where it appears in this file so here we
1:43:55
where it appears in this file so here we
1:43:55
where it appears in this file so here we Define the voap size uh at this time it
1:43:58
Define the voap size uh at this time it
1:43:58
Define the voap size uh at this time it was 65 or something like that extremely
1:43:59
was 65 or something like that extremely
1:43:59
was 65 or something like that extremely small number so this will grow much
1:44:02
small number so this will grow much
1:44:02
small number so this will grow much larger you'll see that Bap size doesn't
1:44:04
larger you'll see that Bap size doesn't
1:44:04
larger you'll see that Bap size doesn't come up too much in most of these layers
1:44:06
come up too much in most of these layers
1:44:06
come up too much in most of these layers the only place that it comes up to is in
1:44:08
the only place that it comes up to is in
1:44:08
the only place that it comes up to is in exactly these two places here so when we
1:44:11
exactly these two places here so when we
1:44:11
exactly these two places here so when we Define the language model there's the
1:44:13
Define the language model there's the
1:44:13
Define the language model there's the token embedding table which is this
1:44:15
token embedding table which is this
1:44:15
token embedding table which is this two-dimensional array where the vocap
1:44:18
two-dimensional array where the vocap
1:44:18
two-dimensional array where the vocap size is basically the number of rows and
1:44:21
size is basically the number of rows and
1:44:21
size is basically the number of rows and uh each vocabulary element each token
1:44:23
uh each vocabulary element each token
1:44:23
uh each vocabulary element each token has a vector that we're going to train
1:44:25
has a vector that we're going to train
1:44:25
has a vector that we're going to train using back propagation that Vector is of
1:44:27
using back propagation that Vector is of
1:44:27
using back propagation that Vector is of size and embed which is number of
1:44:29
size and embed which is number of
1:44:29
size and embed which is number of channels in the Transformer and
1:44:31
channels in the Transformer and
1:44:31
channels in the Transformer and basically as voap size increases this
1:44:33
basically as voap size increases this
1:44:33
basically as voap size increases this embedding table as I mentioned earlier
1:44:35
embedding table as I mentioned earlier
1:44:35
embedding table as I mentioned earlier is going to also grow we're going to be
1:44:36
is going to also grow we're going to be
1:44:37
is going to also grow we're going to be adding rows in addition to that at the
1:44:39
adding rows in addition to that at the
1:44:39
adding rows in addition to that at the end of the Transformer there's this LM
1:44:41
end of the Transformer there's this LM
1:44:41
end of the Transformer there's this LM head layer which is a linear layer and
1:44:44
head layer which is a linear layer and
1:44:44
head layer which is a linear layer and you'll notice that that layer is used at
1:44:46
you'll notice that that layer is used at
1:44:46
you'll notice that that layer is used at the very end to produce the logits uh
1:44:48
the very end to produce the logits uh
1:44:48
the very end to produce the logits uh which become the probabilities for the
1:44:49
which become the probabilities for the
1:44:49
which become the probabilities for the next token in sequence and so
1:44:51
next token in sequence and so
1:44:51
next token in sequence and so intuitively we're trying to produce a
1:44:53
intuitively we're trying to produce a
1:44:53
intuitively we're trying to produce a probability for every single token that
1:44:56
probability for every single token that
1:44:56
probability for every single token that might come next at every point in time
1:44:58
might come next at every point in time
1:44:58
might come next at every point in time of that Transformer and if we have more
1:45:01
of that Transformer and if we have more
1:45:01
of that Transformer and if we have more and more tokens we need to produce more
1:45:02
and more tokens we need to produce more
1:45:02
and more tokens we need to produce more and more probabilities so every single
1:45:04
and more probabilities so every single
1:45:04
and more probabilities so every single token is going to introduce an
1:45:06
token is going to introduce an
1:45:06
token is going to introduce an additional dot product that we have to
1:45:08
additional dot product that we have to
1:45:08
additional dot product that we have to do here in this linear layer for this
1:45:10
do here in this linear layer for this
1:45:10
do here in this linear layer for this final layer in a
1:45:11
final layer in a
1:45:11
final layer in a Transformer so why can't vocap size be
1:45:14
Transformer so why can't vocap size be
1:45:14
Transformer so why can't vocap size be infinite why can't we grow to Infinity
1:45:16
infinite why can't we grow to Infinity
1:45:16
infinite why can't we grow to Infinity well number one your token embedding
1:45:18
well number one your token embedding
1:45:18
well number one your token embedding table is going to grow uh your linear
1:45:21
table is going to grow uh your linear
1:45:21
table is going to grow uh your linear layer is going to grow so we're going to
1:45:23
layer is going to grow so we're going to
1:45:23
layer is going to grow so we're going to be doing a lot more computation here
1:45:25
be doing a lot more computation here
1:45:25
be doing a lot more computation here because this LM head layer will become
1:45:26
because this LM head layer will become
1:45:26
because this LM head layer will become more computational expensive number two
1:45:29
more computational expensive number two
1:45:29
more computational expensive number two because we have more parameters we could
1:45:30
because we have more parameters we could
1:45:30
because we have more parameters we could be worried that we are going to be under
1:45:33
be worried that we are going to be under
1:45:33
be worried that we are going to be under trining some of these
1:45:35
trining some of these
1:45:35
trining some of these parameters so intuitively if you have a
1:45:37
parameters so intuitively if you have a
1:45:37
parameters so intuitively if you have a very large vocabulary size say we have a
1:45:38
very large vocabulary size say we have a
1:45:38
very large vocabulary size say we have a million uh tokens then every one of
1:45:41
million uh tokens then every one of
1:45:41
million uh tokens then every one of these tokens is going to come up more
1:45:42
these tokens is going to come up more
1:45:42
these tokens is going to come up more and more rarely in the training data
1:45:45
and more rarely in the training data
1:45:45
and more rarely in the training data because there's a lot more other tokens
1:45:46
because there's a lot more other tokens
1:45:46
because there's a lot more other tokens all over the place and so we're going to
1:45:48
all over the place and so we're going to
1:45:48
all over the place and so we're going to be seeing fewer and fewer examples uh
1:45:50
be seeing fewer and fewer examples uh
1:45:51
be seeing fewer and fewer examples uh for each individual token and you might
1:45:53
for each individual token and you might
1:45:53
for each individual token and you might be worried that basically the vectors
1:45:54
be worried that basically the vectors
1:45:55
be worried that basically the vectors associated with every token will be
1:45:56
associated with every token will be
1:45:56
associated with every token will be undertrained as a result because they
1:45:58
undertrained as a result because they
1:45:58
undertrained as a result because they just don't come up too often and they
1:45:59
just don't come up too often and they
1:45:59
just don't come up too often and they don't participate in the forward
1:46:00
don't participate in the forward
1:46:00
don't participate in the forward backward pass in addition to that as
1:46:03
backward pass in addition to that as
1:46:03
backward pass in addition to that as your vocab size grows you're going to
1:46:04
your vocab size grows you're going to
1:46:04
your vocab size grows you're going to start shrinking your sequences a lot
1:46:07
start shrinking your sequences a lot
1:46:07
start shrinking your sequences a lot right and that's really nice because
1:46:09
right and that's really nice because
1:46:09
right and that's really nice because that means that we're going to be
1:46:10
that means that we're going to be
1:46:10
that means that we're going to be attending to more and more text so
1:46:11
attending to more and more text so
1:46:12
attending to more and more text so that's nice but also you might be
1:46:13
that's nice but also you might be
1:46:13
that's nice but also you might be worrying that two large of chunks are
1:46:15
worrying that two large of chunks are
1:46:15
worrying that two large of chunks are being squished into single tokens and so
1:46:18
being squished into single tokens and so
1:46:18
being squished into single tokens and so the model just doesn't have as much of
1:46:20
the model just doesn't have as much of
1:46:20
the model just doesn't have as much of time to think per sort of um some number
1:46:25
time to think per sort of um some number
1:46:25
time to think per sort of um some number of characters in the text or you can
1:46:26
of characters in the text or you can
1:46:26
of characters in the text or you can think about it that way right so
1:46:28
think about it that way right so
1:46:28
think about it that way right so basically we're squishing too much
1:46:29
basically we're squishing too much
1:46:29
basically we're squishing too much information into a single token and then
1:46:31
information into a single token and then
1:46:31
information into a single token and then the forward pass of the Transformer is
1:46:33
the forward pass of the Transformer is
1:46:33
the forward pass of the Transformer is not enough to actually process that
1:46:34
not enough to actually process that
1:46:34
not enough to actually process that information appropriately and so these
1:46:36
information appropriately and so these
1:46:36
information appropriately and so these are some of the considerations you're
1:46:37
are some of the considerations you're
1:46:37
are some of the considerations you're thinking about when you're designing the
1:46:38
thinking about when you're designing the
1:46:38
thinking about when you're designing the vocab size as I mentioned this is mostly
1:46:40
vocab size as I mentioned this is mostly
1:46:40
vocab size as I mentioned this is mostly an empirical hyperparameter and it seems
1:46:42
an empirical hyperparameter and it seems
1:46:42
an empirical hyperparameter and it seems like in state-of-the-art architectures
1:46:44
like in state-of-the-art architectures
1:46:44
like in state-of-the-art architectures today this is usually in the high 10,000
1:46:46
today this is usually in the high 10,000
1:46:46
today this is usually in the high 10,000 or somewhere around 100,000 today and
1:46:49
or somewhere around 100,000 today and
1:46:49
or somewhere around 100,000 today and the next consideration I want to briefly
1:46:50
the next consideration I want to briefly
1:46:50
the next consideration I want to briefly talk about is what if we want to take a
1:46:52
talk about is what if we want to take a
1:46:53
talk about is what if we want to take a pre-trained model and we want to extend
1:46:55
pre-trained model and we want to extend
1:46:55
pre-trained model and we want to extend the vocap size and this is done fairly
1:46:57
the vocap size and this is done fairly
1:46:57
the vocap size and this is done fairly commonly actually so for example when
1:46:58
commonly actually so for example when
1:46:58
commonly actually so for example when you're doing fine-tuning for cha GPT um
1:47:02
you're doing fine-tuning for cha GPT um
1:47:02
you're doing fine-tuning for cha GPT um a lot more new special tokens get
1:47:03
a lot more new special tokens get
1:47:03
a lot more new special tokens get introduced on top of the base model to
1:47:05
introduced on top of the base model to
1:47:05
introduced on top of the base model to maintain the metadata and all the
1:47:08
maintain the metadata and all the
1:47:08
maintain the metadata and all the structure of conversation objects
1:47:09
structure of conversation objects
1:47:09
structure of conversation objects between a user and an assistant so that
1:47:11
between a user and an assistant so that
1:47:11
between a user and an assistant so that takes a lot of special tokens you might
1:47:14
takes a lot of special tokens you might
1:47:14
takes a lot of special tokens you might also try to throw in more special tokens
1:47:15
also try to throw in more special tokens
1:47:15
also try to throw in more special tokens for example for using the browser or any
1:47:17
for example for using the browser or any
1:47:17
for example for using the browser or any other tool and so it's very tempting to
1:47:20
other tool and so it's very tempting to
1:47:20
other tool and so it's very tempting to add a lot of tokens for all kinds of
1:47:22
add a lot of tokens for all kinds of
1:47:22
add a lot of tokens for all kinds of special functionality so if you want to
1:47:24
special functionality so if you want to
1:47:24
special functionality so if you want to be adding a token that's totally
1:47:25
be adding a token that's totally
1:47:25
be adding a token that's totally possible Right all we have to do is we
1:47:27
possible Right all we have to do is we
1:47:27
possible Right all we have to do is we have to resize this embedding so we have
1:47:29
have to resize this embedding so we have
1:47:29
have to resize this embedding so we have to add rows we would initialize these uh
1:47:32
to add rows we would initialize these uh
1:47:32
to add rows we would initialize these uh parameters from scratch to be small
1:47:34
parameters from scratch to be small
1:47:34
parameters from scratch to be small random numbers and then we have to
1:47:36
random numbers and then we have to
1:47:36
random numbers and then we have to extend the weight inside this linear uh
1:47:39
extend the weight inside this linear uh
1:47:39
extend the weight inside this linear uh so we have to start making dot products
1:47:41
so we have to start making dot products
1:47:41
so we have to start making dot products um with the associated parameters as
1:47:43
um with the associated parameters as
1:47:43
um with the associated parameters as well to basically calculate the
1:47:44
well to basically calculate the
1:47:44
well to basically calculate the probabilities for these new tokens so
1:47:46
probabilities for these new tokens so
1:47:46
probabilities for these new tokens so both of these are just a resizing
1:47:48
both of these are just a resizing
1:47:48
both of these are just a resizing operation it's a very mild
1:47:50
operation it's a very mild
1:47:50
operation it's a very mild model surgery and can be done fairly
1:47:52
model surgery and can be done fairly
1:47:52
model surgery and can be done fairly easily and it's quite common that
1:47:54
easily and it's quite common that
1:47:54
easily and it's quite common that basically you would freeze the base
1:47:55
basically you would freeze the base
1:47:55
basically you would freeze the base model you introduce these new parameters
1:47:57
model you introduce these new parameters
1:47:57
model you introduce these new parameters and then you only train these new
1:47:58
and then you only train these new
1:47:58
and then you only train these new parameters to introduce new tokens into
1:48:00
parameters to introduce new tokens into
1:48:00
parameters to introduce new tokens into the architecture um and so you can
1:48:03
the architecture um and so you can
1:48:03
the architecture um and so you can freeze arbitrary parts of it or you can
1:48:04
freeze arbitrary parts of it or you can
1:48:04
freeze arbitrary parts of it or you can train arbitrary parts of it and that's
1:48:06
train arbitrary parts of it and that's
1:48:06
train arbitrary parts of it and that's totally up to you but basically minor
1:48:08
totally up to you but basically minor
1:48:08
totally up to you but basically minor surgery required if you'd like to
1:48:10
surgery required if you'd like to
1:48:10
surgery required if you'd like to introduce new tokens and finally I'd
1:48:11
introduce new tokens and finally I'd
1:48:11
introduce new tokens and finally I'd like to mention that actually there's an
1:48:13
like to mention that actually there's an
1:48:13
like to mention that actually there's an entire design space of applications in
1:48:15
entire design space of applications in
1:48:15
entire design space of applications in terms of introducing new tokens into a
1:48:17
terms of introducing new tokens into a
1:48:17
terms of introducing new tokens into a vocabulary that go Way Beyond just
1:48:19
vocabulary that go Way Beyond just
1:48:19
vocabulary that go Way Beyond just adding special tokens and special new
1:48:21
adding special tokens and special new
1:48:21
adding special tokens and special new functionality so just to give you a
1:48:22
functionality so just to give you a
1:48:23
functionality so just to give you a sense of the design space but this could
1:48:24
sense of the design space but this could
1:48:24
sense of the design space but this could be an entire video just by itself uh
1:48:26
be an entire video just by itself uh
1:48:26
be an entire video just by itself uh this is a paper on learning to compress
1:48:28
this is a paper on learning to compress
1:48:28
this is a paper on learning to compress prompts with what they called uh gist
1:48:31
prompts with what they called uh gist
1:48:31
prompts with what they called uh gist tokens and the rough idea is suppose
1:48:33
tokens and the rough idea is suppose
1:48:33
tokens and the rough idea is suppose that you're using language models in a
1:48:34
that you're using language models in a
1:48:34
that you're using language models in a setting that requires very long prompts
1:48:37
setting that requires very long prompts
1:48:37
setting that requires very long prompts while these long prompts just slow
1:48:38
while these long prompts just slow
1:48:38
while these long prompts just slow everything down because you have to
1:48:39
everything down because you have to
1:48:39
everything down because you have to encode them and then you have to use
1:48:41
encode them and then you have to use
1:48:41
encode them and then you have to use them and then you're tending over them
1:48:43
them and then you're tending over them
1:48:43
them and then you're tending over them and it's just um you know heavy to have
1:48:45
and it's just um you know heavy to have
1:48:45
and it's just um you know heavy to have very large prompts so instead what they
1:48:47
very large prompts so instead what they
1:48:47
very large prompts so instead what they do here in this paper is they introduce
1:48:50
do here in this paper is they introduce
1:48:50
do here in this paper is they introduce new tokens and um imagine basically
1:48:54
new tokens and um imagine basically
1:48:54
new tokens and um imagine basically having a few new tokens you put them in
1:48:56
having a few new tokens you put them in
1:48:56
having a few new tokens you put them in a sequence and then you train the model
1:48:59
a sequence and then you train the model
1:48:59
a sequence and then you train the model by distillation so you are keeping the
1:49:01
by distillation so you are keeping the
1:49:01
by distillation so you are keeping the entire model Frozen and you're only
1:49:03
entire model Frozen and you're only
1:49:03
entire model Frozen and you're only training the representations of the new
1:49:04
training the representations of the new
1:49:05
training the representations of the new tokens their embeddings and you're
1:49:06
tokens their embeddings and you're
1:49:06
tokens their embeddings and you're optimizing over the new tokens such that
1:49:09
optimizing over the new tokens such that
1:49:09
optimizing over the new tokens such that the behavior of the language model is
1:49:11
the behavior of the language model is
1:49:11
the behavior of the language model is identical uh to the model that has a
1:49:15
identical uh to the model that has a
1:49:15
identical uh to the model that has a very long prompt that works for you and
1:49:17
very long prompt that works for you and
1:49:17
very long prompt that works for you and so it's a compression technique of
1:49:18
so it's a compression technique of
1:49:19
so it's a compression technique of compressing that very long prompt into
1:49:20
compressing that very long prompt into
1:49:20
compressing that very long prompt into those few new gist tokens and so you can
1:49:23
those few new gist tokens and so you can
1:49:23
those few new gist tokens and so you can train this and then at test time you can
1:49:25
train this and then at test time you can
1:49:25
train this and then at test time you can discard your old prompt and just swap in
1:49:26
discard your old prompt and just swap in
1:49:26
discard your old prompt and just swap in those tokens and they sort of like uh
1:49:28
those tokens and they sort of like uh
1:49:28
those tokens and they sort of like uh stand in for that very long prompt and
1:49:31
stand in for that very long prompt and
1:49:31
stand in for that very long prompt and have an almost identical performance and
1:49:33
have an almost identical performance and
1:49:33
have an almost identical performance and so this is one um technique and a class
1:49:36
so this is one um technique and a class
1:49:36
so this is one um technique and a class of parameter efficient fine-tuning
1:49:37
of parameter efficient fine-tuning
1:49:38
of parameter efficient fine-tuning techniques where most of the model is
1:49:39
techniques where most of the model is
1:49:39
techniques where most of the model is basically fixed and there's no training
1:49:41
basically fixed and there's no training
1:49:41
basically fixed and there's no training of the model weights there's no training
1:49:43
of the model weights there's no training
1:49:43
of the model weights there's no training of Laura or anything like that of new
1:49:45
of Laura or anything like that of new
1:49:45
of Laura or anything like that of new parameters the the parameters that
1:49:47
parameters the the parameters that
1:49:47
parameters the the parameters that you're training are now just the uh
1:49:49
you're training are now just the uh
1:49:49
you're training are now just the uh token embeddings so that's just one
1:49:51
token embeddings so that's just one
1:49:51
token embeddings so that's just one example but this could again be like an
1:49:52
example but this could again be like an
1:49:52
example but this could again be like an entire video but just to give you a
1:49:54
entire video but just to give you a
1:49:54
entire video but just to give you a sense that there's a whole design space
1:49:55
sense that there's a whole design space
1:49:55
sense that there's a whole design space here that is potentially worth exploring
1:49:57
here that is potentially worth exploring
1:49:57
here that is potentially worth exploring in the future the next thing I want to
1:49:59
in the future the next thing I want to
1:49:59
in the future the next thing I want to briefly address is that I think recently
1:50:01
briefly address is that I think recently
1:50:01
briefly address is that I think recently there's a lot of momentum in how you
1:50:03
there's a lot of momentum in how you
1:50:03
there's a lot of momentum in how you actually could construct Transformers
1:50:05
actually could construct Transformers
1:50:05
actually could construct Transformers that can simultaneously process not just
1:50:06
that can simultaneously process not just
1:50:06
that can simultaneously process not just text as the input modality but a lot of
1:50:08
text as the input modality but a lot of
1:50:08
text as the input modality but a lot of other modalities so be it images videos
1:50:11
other modalities so be it images videos
1:50:11
other modalities so be it images videos audio Etc and how do you feed in all
1:50:14
audio Etc and how do you feed in all
1:50:14
audio Etc and how do you feed in all these modalities and potentially predict
1:50:15
these modalities and potentially predict
1:50:16
these modalities and potentially predict these modalities from a Transformer uh
1:50:18
these modalities from a Transformer uh
1:50:18
these modalities from a Transformer uh do you have to change the architecture
1:50:19
do you have to change the architecture
1:50:19
do you have to change the architecture in some fundamental way and I think what
1:50:21
in some fundamental way and I think what
1:50:21
in some fundamental way and I think what a lot of people are starting to converge
1:50:23
a lot of people are starting to converge
1:50:23
a lot of people are starting to converge towards is that you're not changing the
1:50:24
towards is that you're not changing the
1:50:24
towards is that you're not changing the architecture you stick with the
1:50:25
architecture you stick with the
1:50:25
architecture you stick with the Transformer you just kind of tokenize
1:50:27
Transformer you just kind of tokenize
1:50:27
Transformer you just kind of tokenize your input domains and then call the day
1:50:29
your input domains and then call the day
1:50:29
your input domains and then call the day and pretend it's just text tokens and
1:50:31
and pretend it's just text tokens and
1:50:31
and pretend it's just text tokens and just do everything else identical in an
1:50:33
just do everything else identical in an
1:50:33
just do everything else identical in an identical manner so here for example
1:50:36
identical manner so here for example
1:50:36
identical manner so here for example there was a early paper that has nice
1:50:37
there was a early paper that has nice
1:50:37
there was a early paper that has nice graphic for how you can take an image
1:50:39
graphic for how you can take an image
1:50:39
graphic for how you can take an image and you can chunc at it into
1:50:42
and you can chunc at it into
1:50:42
and you can chunc at it into integers um and these sometimes uh so
1:50:45
integers um and these sometimes uh so
1:50:45
integers um and these sometimes uh so these will basically become the tokens
1:50:46
these will basically become the tokens
1:50:46
these will basically become the tokens of images as an example and uh these
1:50:49
of images as an example and uh these
1:50:49
of images as an example and uh these tokens can be uh hard tokens where you
1:50:52
tokens can be uh hard tokens where you
1:50:52
tokens can be uh hard tokens where you force them to be integers they can also
1:50:53
force them to be integers they can also
1:50:53
force them to be integers they can also be soft tokens where you uh sort of
1:50:56
be soft tokens where you uh sort of
1:50:57
be soft tokens where you uh sort of don't require uh these to be discrete
1:51:00
don't require uh these to be discrete
1:51:00
don't require uh these to be discrete but you do Force these representations
1:51:02
but you do Force these representations
1:51:02
but you do Force these representations to go through bottlenecks like in Auto
1:51:04
to go through bottlenecks like in Auto
1:51:04
to go through bottlenecks like in Auto encoders uh also in this paper that came
1:51:06
encoders uh also in this paper that came
1:51:06
encoders uh also in this paper that came out from open a SORA which I think
1:51:08
out from open a SORA which I think
1:51:08
out from open a SORA which I think really um uh blew the mind of many
1:51:11
really um uh blew the mind of many
1:51:11
really um uh blew the mind of many people and inspired a lot of people in
1:51:13
people and inspired a lot of people in
1:51:13
people and inspired a lot of people in terms of what's possible they have a
1:51:15
terms of what's possible they have a
1:51:15
terms of what's possible they have a Graphic here and they talk briefly about
1:51:16
Graphic here and they talk briefly about
1:51:16
Graphic here and they talk briefly about how llms have text tokens Sora has
1:51:20
how llms have text tokens Sora has
1:51:20
how llms have text tokens Sora has visual patches so again they came up
1:51:22
visual patches so again they came up
1:51:22
visual patches so again they came up with a way to chunc a videos into
1:51:24
with a way to chunc a videos into
1:51:24
with a way to chunc a videos into basically tokens when they own
1:51:26
basically tokens when they own
1:51:26
basically tokens when they own vocabularies and then you can either
1:51:28
vocabularies and then you can either
1:51:28
vocabularies and then you can either process discrete tokens say with autog
1:51:30
process discrete tokens say with autog
1:51:30
process discrete tokens say with autog regressive models or even soft tokens
1:51:32
regressive models or even soft tokens
1:51:32
regressive models or even soft tokens with diffusion models and uh all of that
1:51:35
with diffusion models and uh all of that
1:51:35
with diffusion models and uh all of that is sort of uh being actively worked on
1:51:38
is sort of uh being actively worked on
1:51:38
is sort of uh being actively worked on designed on and is beyond the scope of
1:51:39
designed on and is beyond the scope of
1:51:39
designed on and is beyond the scope of this video but just something I wanted
1:51:40
this video but just something I wanted
1:51:40
this video but just something I wanted to mention briefly okay now that we have
1:51:42
to mention briefly okay now that we have
1:51:42
to mention briefly okay now that we have come quite deep into the tokenization
1:51:45
come quite deep into the tokenization
1:51:45
come quite deep into the tokenization algorithm and we understand a lot more
1:51:46
algorithm and we understand a lot more
1:51:46
algorithm and we understand a lot more about how it works let's loop back
1:51:48
about how it works let's loop back
1:51:48
about how it works let's loop back around to the beginning of this video
1:51:50
around to the beginning of this video
1:51:50
around to the beginning of this video and go through some of these bullet
1:51:51
and go through some of these bullet
1:51:51
and go through some of these bullet points and really see why they happen so
1:51:54
points and really see why they happen so
1:51:54
points and really see why they happen so first of all why can't my llm spell
1:51:56
first of all why can't my llm spell
1:51:56
first of all why can't my llm spell words very well or do other spell
1:51:58
words very well or do other spell
1:51:58
words very well or do other spell related
1:52:00
related
1:52:00
related tasks so fundamentally this is because
1:52:02
tasks so fundamentally this is because
1:52:02
tasks so fundamentally this is because as we saw these characters are chunked
1:52:05
as we saw these characters are chunked
1:52:05
as we saw these characters are chunked up into tokens and some of these tokens
1:52:07
up into tokens and some of these tokens
1:52:07
up into tokens and some of these tokens are actually fairly long so as an
1:52:10
are actually fairly long so as an
1:52:10
are actually fairly long so as an example I went to the gp4 vocabulary and
1:52:12
example I went to the gp4 vocabulary and
1:52:12
example I went to the gp4 vocabulary and I looked at uh one of the longer tokens
1:52:15
I looked at uh one of the longer tokens
1:52:15
I looked at uh one of the longer tokens so that default style turns out to be a
1:52:17
so that default style turns out to be a
1:52:17
so that default style turns out to be a single individual token so that's a lot
1:52:19
single individual token so that's a lot
1:52:19
single individual token so that's a lot of characters for a single token so my
1:52:22
of characters for a single token so my
1:52:22
of characters for a single token so my suspicion is that there's just too much
1:52:23
suspicion is that there's just too much
1:52:23
suspicion is that there's just too much crammed into this single token and my
1:52:26
crammed into this single token and my
1:52:26
crammed into this single token and my suspicion was that the model should not
1:52:27
suspicion was that the model should not
1:52:27
suspicion was that the model should not be very good at tasks related to
1:52:30
be very good at tasks related to
1:52:30
be very good at tasks related to spelling of this uh single token so I
1:52:34
spelling of this uh single token so I
1:52:34
spelling of this uh single token so I asked how many letters L are there in
1:52:36
asked how many letters L are there in
1:52:37
asked how many letters L are there in the word default style and of course my
1:52:41
the word default style and of course my
1:52:41
the word default style and of course my prompt is intentionally done that way
1:52:44
prompt is intentionally done that way
1:52:44
prompt is intentionally done that way and you see how default style will be a
1:52:45
and you see how default style will be a
1:52:45
and you see how default style will be a single token so this is what the model
1:52:47
single token so this is what the model
1:52:47
single token so this is what the model sees so my suspicion is that it wouldn't
1:52:49
sees so my suspicion is that it wouldn't
1:52:49
sees so my suspicion is that it wouldn't be very good at this and indeed it is
1:52:51
be very good at this and indeed it is
1:52:51
be very good at this and indeed it is not it doesn't actually know how many
1:52:53
not it doesn't actually know how many
1:52:53
not it doesn't actually know how many L's are in there it thinks there are
1:52:54
L's are in there it thinks there are
1:52:54
L's are in there it thinks there are three and actually there are four if I'm
1:52:56
three and actually there are four if I'm
1:52:57
three and actually there are four if I'm not getting this wrong myself so that
1:52:59
not getting this wrong myself so that
1:52:59
not getting this wrong myself so that didn't go extremely well let's look look
1:53:02
didn't go extremely well let's look look
1:53:02
didn't go extremely well let's look look at another kind of uh character level
1:53:04
at another kind of uh character level
1:53:04
at another kind of uh character level task so for example here I asked uh gp4
1:53:08
task so for example here I asked uh gp4
1:53:08
task so for example here I asked uh gp4 to reverse the string default style and
1:53:11
to reverse the string default style and
1:53:11
to reverse the string default style and they tried to use a code interpreter and
1:53:13
they tried to use a code interpreter and
1:53:13
they tried to use a code interpreter and I stopped it and I said just do it just
1:53:15
I stopped it and I said just do it just
1:53:15
I stopped it and I said just do it just try it and uh it gave me jumble so it
1:53:19
try it and uh it gave me jumble so it
1:53:19
try it and uh it gave me jumble so it doesn't actually really know how to
1:53:21
doesn't actually really know how to
1:53:21
doesn't actually really know how to reverse this string going from right to
1:53:23
reverse this string going from right to
1:53:23
reverse this string going from right to left uh so it gave a wrong result so
1:53:26
left uh so it gave a wrong result so
1:53:26
left uh so it gave a wrong result so again like working with this working
1:53:28
again like working with this working
1:53:28
again like working with this working hypothesis that maybe this is due to the
1:53:29
hypothesis that maybe this is due to the
1:53:30
hypothesis that maybe this is due to the tokenization I tried a different
1:53:31
tokenization I tried a different
1:53:31
tokenization I tried a different approach I said okay let's reverse the
1:53:34
approach I said okay let's reverse the
1:53:34
approach I said okay let's reverse the exact same string but take the following
1:53:36
exact same string but take the following
1:53:36
exact same string but take the following approach step one just print out every
1:53:38
approach step one just print out every
1:53:38
approach step one just print out every single character separated by spaces and
1:53:40
single character separated by spaces and
1:53:40
single character separated by spaces and then as a step two reverse that list and
1:53:43
then as a step two reverse that list and
1:53:43
then as a step two reverse that list and it again Tred to use a tool but when I
1:53:44
it again Tred to use a tool but when I
1:53:44
it again Tred to use a tool but when I stopped it it uh first uh produced all
1:53:47
stopped it it uh first uh produced all
1:53:47
stopped it it uh first uh produced all the characters and that was actually
1:53:48
the characters and that was actually
1:53:48
the characters and that was actually correct and then It reversed them and
1:53:50
correct and then It reversed them and
1:53:50
correct and then It reversed them and that was correct once it had this so
1:53:53
that was correct once it had this so
1:53:53
that was correct once it had this so somehow it can't reverse it directly but
1:53:54
somehow it can't reverse it directly but
1:53:54
somehow it can't reverse it directly but when you go just first uh you know
1:53:57
when you go just first uh you know
1:53:57
when you go just first uh you know listing it out in order it can do that
1:53:59
listing it out in order it can do that
1:53:59
listing it out in order it can do that somehow and then it can once it's uh
1:54:01
somehow and then it can once it's uh
1:54:01
somehow and then it can once it's uh broken up this way this becomes all
1:54:03
broken up this way this becomes all
1:54:03
broken up this way this becomes all these individual characters and so now
1:54:06
these individual characters and so now
1:54:06
these individual characters and so now this is much easier for it to see these
1:54:07
this is much easier for it to see these
1:54:07
this is much easier for it to see these individual tokens and reverse them and
1:54:10
individual tokens and reverse them and
1:54:10
individual tokens and reverse them and print them out so that is kind of
1:54:13
print them out so that is kind of
1:54:13
print them out so that is kind of interesting so let's continue now why
1:54:16
interesting so let's continue now why
1:54:16
interesting so let's continue now why are llms worse at uh non-english langu
1:54:20
are llms worse at uh non-english langu
1:54:20
are llms worse at uh non-english langu and I briefly covered this already but
1:54:22
and I briefly covered this already but
1:54:22
and I briefly covered this already but basically um it's not only that the
1:54:24
basically um it's not only that the
1:54:24
basically um it's not only that the language model sees less non-english
1:54:27
language model sees less non-english
1:54:27
language model sees less non-english data during training of the model
1:54:28
data during training of the model
1:54:28
data during training of the model parameters but also the tokenizer is not
1:54:31
parameters but also the tokenizer is not
1:54:31
parameters but also the tokenizer is not um is not sufficiently trained on
1:54:34
um is not sufficiently trained on
1:54:34
um is not sufficiently trained on non-english data and so here for example
1:54:37
non-english data and so here for example
1:54:37
non-english data and so here for example hello how are you is five tokens and its
1:54:40
hello how are you is five tokens and its
1:54:40
hello how are you is five tokens and its translation is 15 tokens so this is a
1:54:42
translation is 15 tokens so this is a
1:54:42
translation is 15 tokens so this is a three times blow up and so for example
1:54:45
three times blow up and so for example
1:54:45
three times blow up and so for example anang is uh just hello basically in
1:54:48
anang is uh just hello basically in
1:54:48
anang is uh just hello basically in Korean and that end up being three
1:54:50
Korean and that end up being three
1:54:50
Korean and that end up being three tokens I'm actually kind of surprised by
1:54:51
tokens I'm actually kind of surprised by
1:54:51
tokens I'm actually kind of surprised by that because that is a very common
1:54:53
that because that is a very common
1:54:53
that because that is a very common phrase there just the typical greeting
1:54:55
phrase there just the typical greeting
1:54:55
phrase there just the typical greeting of like hello and that ends up being
1:54:56
of like hello and that ends up being
1:54:57
of like hello and that ends up being three tokens whereas our hello is a
1:54:58
three tokens whereas our hello is a
1:54:58
three tokens whereas our hello is a single token and so basically everything
1:55:00
single token and so basically everything
1:55:00
single token and so basically everything is a lot more bloated and diffuse and
1:55:02
is a lot more bloated and diffuse and
1:55:02
is a lot more bloated and diffuse and this is I think partly the reason that
1:55:04
this is I think partly the reason that
1:55:04
this is I think partly the reason that the model Works worse on other
1:55:06
the model Works worse on other
1:55:07
the model Works worse on other languages uh coming back why is LM bad
1:55:10
languages uh coming back why is LM bad
1:55:10
languages uh coming back why is LM bad at simple arithmetic um that has to do
1:55:13
at simple arithmetic um that has to do
1:55:13
at simple arithmetic um that has to do with the tokenization of numbers and so
1:55:17
with the tokenization of numbers and so
1:55:17
with the tokenization of numbers and so um you'll notice that for example
1:55:19
um you'll notice that for example
1:55:19
um you'll notice that for example addition is very sort of
1:55:20
addition is very sort of
1:55:20
addition is very sort of like uh there's an algorithm that is
1:55:23
like uh there's an algorithm that is
1:55:23
like uh there's an algorithm that is like character level for doing addition
1:55:25
like character level for doing addition
1:55:25
like character level for doing addition so for example here we would first add
1:55:27
so for example here we would first add
1:55:27
so for example here we would first add the ones and then the tens and then the
1:55:29
the ones and then the tens and then the
1:55:29
the ones and then the tens and then the hundreds you have to refer to specific
1:55:31
hundreds you have to refer to specific
1:55:31
hundreds you have to refer to specific parts of these digits but uh these
1:55:34
parts of these digits but uh these
1:55:34
parts of these digits but uh these numbers are represented completely
1:55:36
numbers are represented completely
1:55:36
numbers are represented completely arbitrarily based on whatever happened
1:55:37
arbitrarily based on whatever happened
1:55:37
arbitrarily based on whatever happened to merge or not merge during the
1:55:39
to merge or not merge during the
1:55:39
to merge or not merge during the tokenization process there's an entire
1:55:41
tokenization process there's an entire
1:55:41
tokenization process there's an entire blog post about this that I think is
1:55:42
blog post about this that I think is
1:55:42
blog post about this that I think is quite good integer tokenization is
1:55:44
quite good integer tokenization is
1:55:44
quite good integer tokenization is insane and this person basically
1:55:46
insane and this person basically
1:55:46
insane and this person basically systematically explores the tokenization
1:55:48
systematically explores the tokenization
1:55:48
systematically explores the tokenization of numbers in I believe this is gpt2 and
1:55:52
of numbers in I believe this is gpt2 and
1:55:52
of numbers in I believe this is gpt2 and so they notice that for example for the
1:55:53
so they notice that for example for the
1:55:53
so they notice that for example for the for um four-digit numbers you can take a
1:55:57
for um four-digit numbers you can take a
1:55:57
for um four-digit numbers you can take a look at whether it is uh a single token
1:56:00
look at whether it is uh a single token
1:56:00
look at whether it is uh a single token or whether it is two tokens that is a 1
1:56:02
or whether it is two tokens that is a 1
1:56:02
or whether it is two tokens that is a 1 three or a 2 two or a 31 combination and
1:56:04
three or a 2 two or a 31 combination and
1:56:04
three or a 2 two or a 31 combination and so all the different numbers are all the
1:56:06
so all the different numbers are all the
1:56:06
so all the different numbers are all the different combinations and you can
1:56:08
different combinations and you can
1:56:08
different combinations and you can imagine this is all completely
1:56:09
imagine this is all completely
1:56:09
imagine this is all completely arbitrarily so and the model
1:56:11
arbitrarily so and the model
1:56:11
arbitrarily so and the model unfortunately sometimes sees uh four um
1:56:14
unfortunately sometimes sees uh four um
1:56:14
unfortunately sometimes sees uh four um a token for for all four digits
1:56:16
a token for for all four digits
1:56:16
a token for for all four digits sometimes for three sometimes for two
1:56:18
sometimes for three sometimes for two
1:56:18
sometimes for three sometimes for two sometimes for one and it's in an
1:56:19
sometimes for one and it's in an
1:56:20
sometimes for one and it's in an arbitrary uh Manner and so this is
1:56:22
arbitrary uh Manner and so this is
1:56:22
arbitrary uh Manner and so this is definitely a headwind if you will for
1:56:24
definitely a headwind if you will for
1:56:25
definitely a headwind if you will for the language model and it's kind of
1:56:26
the language model and it's kind of
1:56:26
the language model and it's kind of incredible that it can kind of do it and
1:56:27
incredible that it can kind of do it and
1:56:27
incredible that it can kind of do it and deal with it but it's also kind of not
1:56:30
deal with it but it's also kind of not
1:56:30
deal with it but it's also kind of not ideal and so that's why for example we
1:56:31
ideal and so that's why for example we
1:56:32
ideal and so that's why for example we saw that meta when they train the Llama
1:56:34
saw that meta when they train the Llama
1:56:34
saw that meta when they train the Llama 2 algorithm and they use sentence piece
1:56:36
2 algorithm and they use sentence piece
1:56:36
2 algorithm and they use sentence piece they make sure to split up all the um
1:56:39
they make sure to split up all the um
1:56:39
they make sure to split up all the um all the digits as an example for uh
1:56:42
all the digits as an example for uh
1:56:42
all the digits as an example for uh llama 2 and this is partly to improve a
1:56:44
llama 2 and this is partly to improve a
1:56:44
llama 2 and this is partly to improve a simple arithmetic kind of
1:56:46
simple arithmetic kind of
1:56:46
simple arithmetic kind of performance and finally why is gpt2 not
1:56:50
performance and finally why is gpt2 not
1:56:50
performance and finally why is gpt2 not as good in Python again this is partly a
1:56:52
as good in Python again this is partly a
1:56:52
as good in Python again this is partly a modeling issue on in the architecture
1:56:54
modeling issue on in the architecture
1:56:54
modeling issue on in the architecture and the data set and the strength of the
1:56:56
and the data set and the strength of the
1:56:56
and the data set and the strength of the model but it's also partially
1:56:58
model but it's also partially
1:56:58
model but it's also partially tokenization because as we saw here with
1:57:00
tokenization because as we saw here with
1:57:00
tokenization because as we saw here with the simple python example the encoding
1:57:03
the simple python example the encoding
1:57:03
the simple python example the encoding efficiency of the tokenizer for handling
1:57:05
efficiency of the tokenizer for handling
1:57:05
efficiency of the tokenizer for handling spaces in Python is terrible and every
1:57:07
spaces in Python is terrible and every
1:57:07
spaces in Python is terrible and every single space is an individual token and
1:57:09
single space is an individual token and
1:57:09
single space is an individual token and this dramatically reduces the context
1:57:11
this dramatically reduces the context
1:57:11
this dramatically reduces the context length that the model can attend to
1:57:12
length that the model can attend to
1:57:12
length that the model can attend to cross so that's almost like a
1:57:14
cross so that's almost like a
1:57:14
cross so that's almost like a tokenization bug for gpd2 and that was
1:57:16
tokenization bug for gpd2 and that was
1:57:16
tokenization bug for gpd2 and that was later fixed with gp4 okay so here's
1:57:19
later fixed with gp4 okay so here's
1:57:20
later fixed with gp4 okay so here's another fun one my llm abruptly halts
1:57:22
another fun one my llm abruptly halts
1:57:22
another fun one my llm abruptly halts when it sees the string end of text so
1:57:25
when it sees the string end of text so
1:57:25
when it sees the string end of text so here's um here's a very strange Behavior
1:57:28
here's um here's a very strange Behavior
1:57:28
here's um here's a very strange Behavior print a string end of text is what I
1:57:30
print a string end of text is what I
1:57:30
print a string end of text is what I told jt4 and it says could you please
1:57:32
told jt4 and it says could you please
1:57:32
told jt4 and it says could you please specify the string and I'm I'm telling
1:57:35
specify the string and I'm I'm telling
1:57:35
specify the string and I'm I'm telling it give me end of text and it seems like
1:57:37
it give me end of text and it seems like
1:57:37
it give me end of text and it seems like there's an issue it's not seeing end of
1:57:39
there's an issue it's not seeing end of
1:57:39
there's an issue it's not seeing end of text and then I give it end of text is
1:57:41
text and then I give it end of text is
1:57:41
text and then I give it end of text is the string and then here's a string and
1:57:44
the string and then here's a string and
1:57:44
the string and then here's a string and then it just doesn't print it so
1:57:45
then it just doesn't print it so
1:57:45
then it just doesn't print it so obviously something is breaking here
1:57:47
obviously something is breaking here
1:57:47
obviously something is breaking here with respect to the handling of the
1:57:48
with respect to the handling of the
1:57:48
with respect to the handling of the special token and I don't actually know
1:57:50
special token and I don't actually know
1:57:50
special token and I don't actually know what open ey is doing under the hood
1:57:52
what open ey is doing under the hood
1:57:52
what open ey is doing under the hood here and whether they are potentially
1:57:54
here and whether they are potentially
1:57:54
here and whether they are potentially parsing this as an um as an actual token
1:57:58
parsing this as an um as an actual token
1:57:58
parsing this as an um as an actual token instead of this just being uh end of
1:58:01
instead of this just being uh end of
1:58:01
instead of this just being uh end of text um as like individual sort of
1:58:04
text um as like individual sort of
1:58:04
text um as like individual sort of pieces of it without the special token
1:58:06
pieces of it without the special token
1:58:06
pieces of it without the special token handling logic and so it might be that
1:58:09
handling logic and so it might be that
1:58:09
handling logic and so it might be that someone when they're calling do encode
1:58:11
someone when they're calling do encode
1:58:11
someone when they're calling do encode uh they are passing in the allowed
1:58:13
uh they are passing in the allowed
1:58:13
uh they are passing in the allowed special and they are allowing end of
1:58:16
special and they are allowing end of
1:58:16
special and they are allowing end of text as a special character in the user
1:58:18
text as a special character in the user
1:58:18
text as a special character in the user prompt but the user prompt of course is
1:58:20
prompt but the user prompt of course is
1:58:20
prompt but the user prompt of course is is a sort of um attacker controlled text
1:58:23
is a sort of um attacker controlled text
1:58:23
is a sort of um attacker controlled text so you would hope that they don't really
1:58:25
so you would hope that they don't really
1:58:25
so you would hope that they don't really parse or use special tokens or you know
1:58:28
parse or use special tokens or you know
1:58:28
parse or use special tokens or you know from that kind of input but it appears
1:58:30
from that kind of input but it appears
1:58:30
from that kind of input but it appears that there's something definitely going
1:58:31
that there's something definitely going
1:58:31
that there's something definitely going wrong here and um so your knowledge of
1:58:34
wrong here and um so your knowledge of
1:58:34
wrong here and um so your knowledge of these special tokens ends up being in a
1:58:36
these special tokens ends up being in a
1:58:36
these special tokens ends up being in a tax surface potentially and so if you'd
1:58:38
tax surface potentially and so if you'd
1:58:38
tax surface potentially and so if you'd like to confuse llms then just um try to
1:58:42
like to confuse llms then just um try to
1:58:43
like to confuse llms then just um try to give them some special tokens and see if
1:58:44
give them some special tokens and see if
1:58:44
give them some special tokens and see if you're breaking something by chance okay
1:58:46
you're breaking something by chance okay
1:58:46
you're breaking something by chance okay so this next one is a really fun one uh
1:58:49
so this next one is a really fun one uh
1:58:49
so this next one is a really fun one uh the trailing whites space issue so if
1:58:52
the trailing whites space issue so if
1:58:52
the trailing whites space issue so if you come to playground and uh we come
1:58:55
you come to playground and uh we come
1:58:56
you come to playground and uh we come here to GPT 3.5 turbo instruct so this
1:58:58
here to GPT 3.5 turbo instruct so this
1:58:58
here to GPT 3.5 turbo instruct so this is not a chat model this is a completion
1:59:00
is not a chat model this is a completion
1:59:00
is not a chat model this is a completion model so think of it more like it's a
1:59:02
model so think of it more like it's a
1:59:02
model so think of it more like it's a lot more closer to a base model it does
1:59:05
lot more closer to a base model it does
1:59:05
lot more closer to a base model it does completion it will continue the token
1:59:07
completion it will continue the token
1:59:07
completion it will continue the token sequence so here's a tagline for ice
1:59:09
sequence so here's a tagline for ice
1:59:09
sequence so here's a tagline for ice cream shop and we want to continue the
1:59:11
cream shop and we want to continue the
1:59:11
cream shop and we want to continue the sequence and so we can submit and get a
1:59:14
sequence and so we can submit and get a
1:59:14
sequence and so we can submit and get a bunch of tokens okay no problem but now
1:59:18
bunch of tokens okay no problem but now
1:59:18
bunch of tokens okay no problem but now suppose I do this but instead of
1:59:20
suppose I do this but instead of
1:59:20
suppose I do this but instead of pressing submit here I do here's a
1:59:23
pressing submit here I do here's a
1:59:23
pressing submit here I do here's a tagline for ice cream shop space so I
1:59:25
tagline for ice cream shop space so I
1:59:26
tagline for ice cream shop space so I have a space here before I click
1:59:28
have a space here before I click
1:59:28
have a space here before I click submit we get a warning your text ends
1:59:31
submit we get a warning your text ends
1:59:31
submit we get a warning your text ends in a trail Ling space which causes worse
1:59:33
in a trail Ling space which causes worse
1:59:33
in a trail Ling space which causes worse performance due to how API splits text
1:59:35
performance due to how API splits text
1:59:35
performance due to how API splits text into tokens so what's happening here it
1:59:38
into tokens so what's happening here it
1:59:38
into tokens so what's happening here it still gave us a uh sort of completion
1:59:40
still gave us a uh sort of completion
1:59:40
still gave us a uh sort of completion here but let's take a look at what's
1:59:42
here but let's take a look at what's
1:59:42
here but let's take a look at what's happening so here's a tagline for an ice
1:59:44
happening so here's a tagline for an ice
1:59:44
happening so here's a tagline for an ice cream shop and then what does this look
1:59:48
cream shop and then what does this look
1:59:48
cream shop and then what does this look like in the actual actual training data
1:59:50
like in the actual actual training data
1:59:50
like in the actual actual training data suppose you found the completion in the
1:59:52
suppose you found the completion in the
1:59:52
suppose you found the completion in the training document somewhere on the
1:59:53
training document somewhere on the
1:59:53
training document somewhere on the internet and the llm trained on this
1:59:55
internet and the llm trained on this
1:59:55
internet and the llm trained on this data so maybe it's something like oh
1:59:58
data so maybe it's something like oh
1:59:58
data so maybe it's something like oh yeah maybe that's the tagline that's a
2:00:00
yeah maybe that's the tagline that's a
2:00:00
yeah maybe that's the tagline that's a terrible tagline but notice here that
2:00:02
terrible tagline but notice here that
2:00:02
terrible tagline but notice here that when I create o you see that because
2:00:05
when I create o you see that because
2:00:05
when I create o you see that because there's the the space character is
2:00:07
there's the the space character is
2:00:07
there's the the space character is always a prefix to these tokens in GPT
2:00:11
always a prefix to these tokens in GPT
2:00:11
always a prefix to these tokens in GPT so it's not an O token it's a space o
2:00:13
so it's not an O token it's a space o
2:00:13
so it's not an O token it's a space o token the space is part of the O and
2:00:16
token the space is part of the O and
2:00:16
token the space is part of the O and together they are token 8840 that's
2:00:19
together they are token 8840 that's
2:00:19
together they are token 8840 that's that's space o so what's What's
2:00:21
that's space o so what's What's
2:00:21
that's space o so what's What's Happening Here is that when I just have
2:00:24
Happening Here is that when I just have
2:00:24
Happening Here is that when I just have it like this and I let it complete the
2:00:27
it like this and I let it complete the
2:00:27
it like this and I let it complete the next token it can sample the space o
2:00:30
next token it can sample the space o
2:00:30
next token it can sample the space o token but instead if I have this and I
2:00:32
token but instead if I have this and I
2:00:32
token but instead if I have this and I add my space then what I'm doing here
2:00:34
add my space then what I'm doing here
2:00:34
add my space then what I'm doing here when I incode this string is I have
2:00:37
when I incode this string is I have
2:00:37
when I incode this string is I have basically here's a t line for an ice
2:00:39
basically here's a t line for an ice
2:00:39
basically here's a t line for an ice cream uh shop and this space at the very
2:00:41
cream uh shop and this space at the very
2:00:42
cream uh shop and this space at the very end becomes a token
2:00:44
end becomes a token
2:00:44
end becomes a token 220 and so we've added token 220 and
2:00:47
220 and so we've added token 220 and
2:00:47
220 and so we've added token 220 and this token otherwise would be part of
2:00:49
this token otherwise would be part of
2:00:49
this token otherwise would be part of the tagline because if there actually is
2:00:51
the tagline because if there actually is
2:00:51
the tagline because if there actually is a tagline here so space o is the token
2:00:55
a tagline here so space o is the token
2:00:55
a tagline here so space o is the token and so this is suddenly a of
2:00:57
and so this is suddenly a of
2:00:57
and so this is suddenly a of distribution for the model because this
2:00:59
distribution for the model because this
2:00:59
distribution for the model because this space is part of the next token but
2:01:01
space is part of the next token but
2:01:01
space is part of the next token but we're putting it here like this and the
2:01:04
we're putting it here like this and the
2:01:04
we're putting it here like this and the model has seen very very little data of
2:01:07
model has seen very very little data of
2:01:07
model has seen very very little data of actual Space by itself and we're asking
2:01:10
actual Space by itself and we're asking
2:01:10
actual Space by itself and we're asking it to complete the sequence like add in
2:01:11
it to complete the sequence like add in
2:01:11
it to complete the sequence like add in more tokens but the problem is that
2:01:13
more tokens but the problem is that
2:01:13
more tokens but the problem is that we've sort of begun the first token and
2:01:16
we've sort of begun the first token and
2:01:16
we've sort of begun the first token and now it's been split up and now we're out
2:01:18
now it's been split up and now we're out
2:01:18
now it's been split up and now we're out of this distribution and now arbitrary
2:01:20
of this distribution and now arbitrary
2:01:20
of this distribution and now arbitrary bad things happen and it's just a very
2:01:23
bad things happen and it's just a very
2:01:23
bad things happen and it's just a very rare example for it to see something
2:01:24
rare example for it to see something
2:01:24
rare example for it to see something like that and uh that's why we get the
2:01:26
like that and uh that's why we get the
2:01:26
like that and uh that's why we get the warning so the fundamental issue here is
2:01:29
warning so the fundamental issue here is
2:01:29
warning so the fundamental issue here is of course that um the llm is on top of
2:01:32
of course that um the llm is on top of
2:01:32
of course that um the llm is on top of these tokens and these tokens are text
2:01:34
these tokens and these tokens are text
2:01:34
these tokens and these tokens are text chunks they're not characters in a way
2:01:36
chunks they're not characters in a way
2:01:36
chunks they're not characters in a way you and I would think of them they are
2:01:38
you and I would think of them they are
2:01:38
you and I would think of them they are these are the atoms of what the LM is
2:01:40
these are the atoms of what the LM is
2:01:40
these are the atoms of what the LM is seeing and there's a bunch of weird
2:01:41
seeing and there's a bunch of weird
2:01:41
seeing and there's a bunch of weird stuff that comes out of it let's go back
2:01:43
stuff that comes out of it let's go back
2:01:43
stuff that comes out of it let's go back to our default cell style I bet you that
2:01:47
to our default cell style I bet you that
2:01:48
to our default cell style I bet you that the model has never in its training set
2:01:49
the model has never in its training set
2:01:49
the model has never in its training set seen default cell sta without Le in
2:01:54
seen default cell sta without Le in
2:01:54
seen default cell sta without Le in there it's always seen this as a single
2:01:56
there it's always seen this as a single
2:01:56
there it's always seen this as a single group because uh this is some kind of a
2:01:59
group because uh this is some kind of a
2:01:59
group because uh this is some kind of a function in um I'm guess I don't
2:02:01
function in um I'm guess I don't
2:02:02
function in um I'm guess I don't actually know what this is part of this
2:02:03
actually know what this is part of this
2:02:03
actually know what this is part of this is some kind of API but I bet you that
2:02:05
is some kind of API but I bet you that
2:02:05
is some kind of API but I bet you that it's never seen this combination of
2:02:07
it's never seen this combination of
2:02:07
it's never seen this combination of tokens uh in its training data because
2:02:10
tokens uh in its training data because
2:02:10
tokens uh in its training data because or I think it would be extremely rare so
2:02:12
or I think it would be extremely rare so
2:02:12
or I think it would be extremely rare so I took this and I copy pasted it here
2:02:14
I took this and I copy pasted it here
2:02:14
I took this and I copy pasted it here and I had I tried to complete from it
2:02:17
and I had I tried to complete from it
2:02:17
and I had I tried to complete from it and the it immediately gave me a big
2:02:19
and the it immediately gave me a big
2:02:19
and the it immediately gave me a big error and it said the model predicted to
2:02:21
error and it said the model predicted to
2:02:21
error and it said the model predicted to completion that begins with a stop
2:02:22
completion that begins with a stop
2:02:22
completion that begins with a stop sequence resulting in no output consider
2:02:24
sequence resulting in no output consider
2:02:24
sequence resulting in no output consider adjusting your prompt or stop sequences
2:02:26
adjusting your prompt or stop sequences
2:02:26
adjusting your prompt or stop sequences so what happened here when I clicked
2:02:27
so what happened here when I clicked
2:02:27
so what happened here when I clicked submit is that immediately the model
2:02:30
submit is that immediately the model
2:02:30
submit is that immediately the model emitted and sort of like end of text
2:02:32
emitted and sort of like end of text
2:02:32
emitted and sort of like end of text token I think or something like that it
2:02:34
token I think or something like that it
2:02:34
token I think or something like that it basically predicted the stop sequence
2:02:36
basically predicted the stop sequence
2:02:36
basically predicted the stop sequence immediately so it had no completion and
2:02:38
immediately so it had no completion and
2:02:38
immediately so it had no completion and so this is why I'm getting a warning
2:02:40
so this is why I'm getting a warning
2:02:40
so this is why I'm getting a warning again because we're off the data
2:02:42
again because we're off the data
2:02:42
again because we're off the data distribution and the model is just uh
2:02:45
distribution and the model is just uh
2:02:45
distribution and the model is just uh predicting just totally arbitrary things
2:02:47
predicting just totally arbitrary things
2:02:47
predicting just totally arbitrary things it's just really confused basically this
2:02:49
it's just really confused basically this
2:02:49
it's just really confused basically this is uh this is giving it brain damage
2:02:50
is uh this is giving it brain damage
2:02:50
is uh this is giving it brain damage it's never seen this before it's shocked
2:02:53
it's never seen this before it's shocked
2:02:53
it's never seen this before it's shocked and it's predicting end of text or
2:02:54
and it's predicting end of text or
2:02:54
and it's predicting end of text or something I tried it again here and it
2:02:57
something I tried it again here and it
2:02:57
something I tried it again here and it in this case it completed it but then
2:02:59
in this case it completed it but then
2:02:59
in this case it completed it but then for some reason this request May violate
2:03:01
for some reason this request May violate
2:03:01
for some reason this request May violate our usage policies this was
2:03:03
our usage policies this was
2:03:03
our usage policies this was flagged um basically something just like
2:03:06
flagged um basically something just like
2:03:06
flagged um basically something just like goes wrong and there's something like
2:03:07
goes wrong and there's something like
2:03:07
goes wrong and there's something like Jank you can just feel the Jank because
2:03:09
Jank you can just feel the Jank because
2:03:09
Jank you can just feel the Jank because the model is like extremely unhappy with
2:03:11
the model is like extremely unhappy with
2:03:11
the model is like extremely unhappy with just this and it doesn't know how to
2:03:12
just this and it doesn't know how to
2:03:12
just this and it doesn't know how to complete it because it's never occurred
2:03:14
complete it because it's never occurred
2:03:14
complete it because it's never occurred in training set in a training set it
2:03:16
in training set in a training set it
2:03:16
in training set in a training set it always appears like this and becomes a
2:03:18
always appears like this and becomes a
2:03:18
always appears like this and becomes a single token
2:03:20
single token
2:03:20
single token so these kinds of issues where tokens
2:03:21
so these kinds of issues where tokens
2:03:21
so these kinds of issues where tokens are either you sort of like complete the
2:03:24
are either you sort of like complete the
2:03:24
are either you sort of like complete the first character of the next token or you
2:03:26
first character of the next token or you
2:03:26
first character of the next token or you are sort of you have long tokens that
2:03:28
are sort of you have long tokens that
2:03:28
are sort of you have long tokens that you then have just some of the
2:03:29
you then have just some of the
2:03:29
you then have just some of the characters off all of these are kind of
2:03:32
characters off all of these are kind of
2:03:32
characters off all of these are kind of like issues with partial tokens is how I
2:03:35
like issues with partial tokens is how I
2:03:35
like issues with partial tokens is how I would describe it and if you actually
2:03:37
would describe it and if you actually
2:03:37
would describe it and if you actually dig into the T token
2:03:39
dig into the T token
2:03:39
dig into the T token repository go to the rust code and
2:03:41
repository go to the rust code and
2:03:41
repository go to the rust code and search for
2:03:44
search for
2:03:44
search for unstable and you'll see um en code
2:03:47
unstable and you'll see um en code
2:03:47
unstable and you'll see um en code unstable native unstable token tokens
2:03:49
unstable native unstable token tokens
2:03:49
unstable native unstable token tokens and a lot of like special case handling
2:03:51
and a lot of like special case handling
2:03:51
and a lot of like special case handling none of this stuff about unstable tokens
2:03:53
none of this stuff about unstable tokens
2:03:53
none of this stuff about unstable tokens is documented anywhere but there's a ton
2:03:55
is documented anywhere but there's a ton
2:03:55
is documented anywhere but there's a ton of code dealing with unstable tokens and
2:03:58
of code dealing with unstable tokens and
2:03:58
of code dealing with unstable tokens and unstable tokens is exactly kind of like
2:04:00
unstable tokens is exactly kind of like
2:04:00
unstable tokens is exactly kind of like what I'm describing here what you would
2:04:02
what I'm describing here what you would
2:04:02
what I'm describing here what you would like out of a completion API is
2:04:05
like out of a completion API is
2:04:05
like out of a completion API is something a lot more fancy like if we're
2:04:06
something a lot more fancy like if we're
2:04:06
something a lot more fancy like if we're putting in default cell sta if we're
2:04:08
putting in default cell sta if we're
2:04:08
putting in default cell sta if we're asking for the next token sequence we're
2:04:10
asking for the next token sequence we're
2:04:10
asking for the next token sequence we're not actually trying to append the next
2:04:12
not actually trying to append the next
2:04:12
not actually trying to append the next token exactly after this list we're
2:04:14
token exactly after this list we're
2:04:14
token exactly after this list we're actually trying to append we're trying
2:04:16
actually trying to append we're trying
2:04:16
actually trying to append we're trying to consider lots of tokens um
2:04:19
to consider lots of tokens um
2:04:19
to consider lots of tokens um that if we were or I guess like we're
2:04:22
that if we were or I guess like we're
2:04:22
that if we were or I guess like we're trying to search over characters that if
2:04:25
trying to search over characters that if
2:04:25
trying to search over characters that if we retened would be of high probability
2:04:28
we retened would be of high probability
2:04:28
we retened would be of high probability if that makes sense um so that we can
2:04:30
if that makes sense um so that we can
2:04:30
if that makes sense um so that we can actually add a single individual
2:04:32
actually add a single individual
2:04:32
actually add a single individual character uh instead of just like adding
2:04:34
character uh instead of just like adding
2:04:34
character uh instead of just like adding the next full token that comes after
2:04:36
the next full token that comes after
2:04:36
the next full token that comes after this partial token list so I this is
2:04:39
this partial token list so I this is
2:04:39
this partial token list so I this is very tricky to describe and I invite you
2:04:41
very tricky to describe and I invite you
2:04:41
very tricky to describe and I invite you to maybe like look through this it ends
2:04:43
to maybe like look through this it ends
2:04:43
to maybe like look through this it ends up being extremely gnarly and hairy kind
2:04:44
up being extremely gnarly and hairy kind
2:04:44
up being extremely gnarly and hairy kind of topic it and it comes from
2:04:46
of topic it and it comes from
2:04:46
of topic it and it comes from tokenization fundamentally so um maybe I
2:04:49
tokenization fundamentally so um maybe I
2:04:49
tokenization fundamentally so um maybe I can even spend an entire video talking
2:04:50
can even spend an entire video talking
2:04:50
can even spend an entire video talking about unstable tokens sometime in the
2:04:52
about unstable tokens sometime in the
2:04:52
about unstable tokens sometime in the future okay and I'm really saving the
2:04:54
future okay and I'm really saving the
2:04:54
future okay and I'm really saving the best for last my favorite one by far is
2:04:56
best for last my favorite one by far is
2:04:56
best for last my favorite one by far is the solid gold
2:04:59
the solid gold
2:04:59
the solid gold Magikarp and it just okay so this comes
2:05:01
Magikarp and it just okay so this comes
2:05:01
Magikarp and it just okay so this comes from this blog post uh solid gold
2:05:03
from this blog post uh solid gold
2:05:03
from this blog post uh solid gold Magikarp and uh this is um internet
2:05:06
Magikarp and uh this is um internet
2:05:07
Magikarp and uh this is um internet famous now for those of us in llms and
2:05:10
famous now for those of us in llms and
2:05:10
famous now for those of us in llms and basically I I would advise you to uh
2:05:11
basically I I would advise you to uh
2:05:11
basically I I would advise you to uh read this block Post in full but
2:05:13
read this block Post in full but
2:05:13
read this block Post in full but basically what this person was doing is
2:05:16
basically what this person was doing is
2:05:16
basically what this person was doing is this person went to the um
2:05:19
this person went to the um
2:05:19
this person went to the um token embedding stable and clustered the
2:05:22
token embedding stable and clustered the
2:05:22
token embedding stable and clustered the tokens based on their embedding
2:05:24
tokens based on their embedding
2:05:24
tokens based on their embedding representation and this person noticed
2:05:27
representation and this person noticed
2:05:27
representation and this person noticed that there's a cluster of tokens that
2:05:29
that there's a cluster of tokens that
2:05:29
that there's a cluster of tokens that look really strange so there's a cluster
2:05:31
look really strange so there's a cluster
2:05:31
look really strange so there's a cluster here at rot e stream Fame solid gold
2:05:34
here at rot e stream Fame solid gold
2:05:34
here at rot e stream Fame solid gold Magikarp Signet message like really
2:05:35
Magikarp Signet message like really
2:05:36
Magikarp Signet message like really weird tokens in uh basically in this
2:05:39
weird tokens in uh basically in this
2:05:39
weird tokens in uh basically in this embedding cluster and so what are these
2:05:42
embedding cluster and so what are these
2:05:42
embedding cluster and so what are these tokens and where do they even come from
2:05:43
tokens and where do they even come from
2:05:43
tokens and where do they even come from like what is solid gold magikarpet makes
2:05:45
like what is solid gold magikarpet makes
2:05:45
like what is solid gold magikarpet makes no sense and then they found bunch of
2:05:48
no sense and then they found bunch of
2:05:48
no sense and then they found bunch of these
2:05:50
these
2:05:50
these tokens and then they notice that
2:05:52
tokens and then they notice that
2:05:52
tokens and then they notice that actually the plot thickens here because
2:05:53
actually the plot thickens here because
2:05:53
actually the plot thickens here because if you ask the model about these tokens
2:05:56
if you ask the model about these tokens
2:05:56
if you ask the model about these tokens like you ask it uh some very benign
2:05:58
like you ask it uh some very benign
2:05:58
like you ask it uh some very benign question like please can you repeat back
2:06:00
question like please can you repeat back
2:06:00
question like please can you repeat back to me the string sold gold Magikarp uh
2:06:02
to me the string sold gold Magikarp uh
2:06:02
to me the string sold gold Magikarp uh then you get a variety of basically
2:06:04
then you get a variety of basically
2:06:04
then you get a variety of basically totally broken llm Behavior so either
2:06:07
totally broken llm Behavior so either
2:06:07
totally broken llm Behavior so either you get evasion so I'm sorry I can't
2:06:09
you get evasion so I'm sorry I can't
2:06:09
you get evasion so I'm sorry I can't hear you or you get a bunch of
2:06:11
hear you or you get a bunch of
2:06:11
hear you or you get a bunch of hallucinations as a response um you can
2:06:14
hallucinations as a response um you can
2:06:14
hallucinations as a response um you can even get back like insults so you ask it
2:06:17
even get back like insults so you ask it
2:06:17
even get back like insults so you ask it uh about streamer bot it uh tells the
2:06:19
uh about streamer bot it uh tells the
2:06:20
uh about streamer bot it uh tells the and the model actually just calls you
2:06:22
and the model actually just calls you
2:06:22
and the model actually just calls you names uh or it kind of comes up with
2:06:24
names uh or it kind of comes up with
2:06:24
names uh or it kind of comes up with like weird humor like you're actually
2:06:26
like weird humor like you're actually
2:06:26
like weird humor like you're actually breaking the model by asking about these
2:06:28
breaking the model by asking about these
2:06:28
breaking the model by asking about these very simple strings like at Roth and
2:06:30
very simple strings like at Roth and
2:06:30
very simple strings like at Roth and sold gold Magikarp so like what the hell
2:06:32
sold gold Magikarp so like what the hell
2:06:32
sold gold Magikarp so like what the hell is happening and there's a variety of
2:06:34
is happening and there's a variety of
2:06:34
is happening and there's a variety of here documented behaviors uh there's a
2:06:37
here documented behaviors uh there's a
2:06:37
here documented behaviors uh there's a bunch of tokens not just so good
2:06:38
bunch of tokens not just so good
2:06:38
bunch of tokens not just so good Magikarp that have that kind of a
2:06:40
Magikarp that have that kind of a
2:06:40
Magikarp that have that kind of a behavior and so basically there's a
2:06:42
behavior and so basically there's a
2:06:42
behavior and so basically there's a bunch of like trigger words and if you
2:06:44
bunch of like trigger words and if you
2:06:44
bunch of like trigger words and if you ask the model about these trigger words
2:06:46
ask the model about these trigger words
2:06:46
ask the model about these trigger words or you just include them in your prompt
2:06:48
or you just include them in your prompt
2:06:48
or you just include them in your prompt the model goes haywire and has all kinds
2:06:49
the model goes haywire and has all kinds
2:06:50
the model goes haywire and has all kinds of uh really Strange Behaviors including
2:06:52
of uh really Strange Behaviors including
2:06:52
of uh really Strange Behaviors including sort of ones that violate typical safety
2:06:54
sort of ones that violate typical safety
2:06:54
sort of ones that violate typical safety guidelines uh and the alignment of the
2:06:56
guidelines uh and the alignment of the
2:06:57
guidelines uh and the alignment of the model like it's swearing back at you so
2:06:59
model like it's swearing back at you so
2:06:59
model like it's swearing back at you so what is happening here and how can this
2:07:01
what is happening here and how can this
2:07:01
what is happening here and how can this possibly be true well this again comes
2:07:04
possibly be true well this again comes
2:07:04
possibly be true well this again comes down to tokenization so what's happening
2:07:06
down to tokenization so what's happening
2:07:06
down to tokenization so what's happening here is that sold gold Magikarp if you
2:07:08
here is that sold gold Magikarp if you
2:07:08
here is that sold gold Magikarp if you actually dig into it is a Reddit user so
2:07:11
actually dig into it is a Reddit user so
2:07:11
actually dig into it is a Reddit user so there's a u Sol gold
2:07:14
there's a u Sol gold
2:07:14
there's a u Sol gold Magikarp and probably what happened here
2:07:16
Magikarp and probably what happened here
2:07:16
Magikarp and probably what happened here even though I I don't know that this has
2:07:17
even though I I don't know that this has
2:07:18
even though I I don't know that this has been like really definitively explored
2:07:20
been like really definitively explored
2:07:20
been like really definitively explored but what is thought to have happened is
2:07:23
but what is thought to have happened is
2:07:23
but what is thought to have happened is that the tokenization data set was very
2:07:25
that the tokenization data set was very
2:07:25
that the tokenization data set was very different from the training data set for
2:07:27
different from the training data set for
2:07:28
different from the training data set for the actual language model so in the
2:07:29
the actual language model so in the
2:07:29
the actual language model so in the tokenization data set there was a ton of
2:07:31
tokenization data set there was a ton of
2:07:31
tokenization data set there was a ton of redded data potentially where the user
2:07:34
redded data potentially where the user
2:07:34
redded data potentially where the user solid gold Magikarp was mentioned in the
2:07:36
solid gold Magikarp was mentioned in the
2:07:36
solid gold Magikarp was mentioned in the text because solid gold Magikarp was a
2:07:39
text because solid gold Magikarp was a
2:07:39
text because solid gold Magikarp was a very common um sort of uh person who
2:07:41
very common um sort of uh person who
2:07:41
very common um sort of uh person who would post a lot uh this would be a
2:07:43
would post a lot uh this would be a
2:07:43
would post a lot uh this would be a string that occurs many times in a
2:07:45
string that occurs many times in a
2:07:45
string that occurs many times in a tokenization data set because it occurs
2:07:47
tokenization data set because it occurs
2:07:48
tokenization data set because it occurs many times in a tokenization data set
2:07:49
many times in a tokenization data set
2:07:50
many times in a tokenization data set these tokens would end up getting merged
2:07:51
these tokens would end up getting merged
2:07:51
these tokens would end up getting merged to the single individual token for that
2:07:53
to the single individual token for that
2:07:53
to the single individual token for that single Reddit user sold gold Magikarp so
2:07:56
single Reddit user sold gold Magikarp so
2:07:56
single Reddit user sold gold Magikarp so they would have a dedicated token in a
2:07:58
they would have a dedicated token in a
2:07:58
they would have a dedicated token in a vocabulary of was it 50,000 tokens in
2:08:00
vocabulary of was it 50,000 tokens in
2:08:00
vocabulary of was it 50,000 tokens in gpd2 that is devoted to that Reddit user
2:08:04
gpd2 that is devoted to that Reddit user
2:08:04
gpd2 that is devoted to that Reddit user and then what happens is the
2:08:05
and then what happens is the
2:08:05
and then what happens is the tokenization data set has those strings
2:08:08
tokenization data set has those strings
2:08:08
tokenization data set has those strings but then later when you train the model
2:08:10
but then later when you train the model
2:08:10
but then later when you train the model the language model itself um this data
2:08:13
the language model itself um this data
2:08:13
the language model itself um this data from Reddit was not present and so
2:08:16
from Reddit was not present and so
2:08:16
from Reddit was not present and so therefore in the entire training set for
2:08:18
therefore in the entire training set for
2:08:18
therefore in the entire training set for the language model sold gold Magikarp
2:08:21
the language model sold gold Magikarp
2:08:21
the language model sold gold Magikarp never occurs that token never appears in
2:08:24
never occurs that token never appears in
2:08:24
never occurs that token never appears in the training set for the actual language
2:08:25
the training set for the actual language
2:08:25
the training set for the actual language model later so this token never gets
2:08:28
model later so this token never gets
2:08:28
model later so this token never gets activated it's initialized at random in
2:08:31
activated it's initialized at random in
2:08:31
activated it's initialized at random in the beginning of optimization then you
2:08:32
the beginning of optimization then you
2:08:32
the beginning of optimization then you have forward backward passes and updates
2:08:34
have forward backward passes and updates
2:08:34
have forward backward passes and updates to the model and this token is just
2:08:35
to the model and this token is just
2:08:36
to the model and this token is just never updated in the embedding table
2:08:37
never updated in the embedding table
2:08:37
never updated in the embedding table that row Vector never gets sampled it
2:08:39
that row Vector never gets sampled it
2:08:40
that row Vector never gets sampled it never gets used so it never gets trained
2:08:42
never gets used so it never gets trained
2:08:42
never gets used so it never gets trained and it's completely untrained it's kind
2:08:43
and it's completely untrained it's kind
2:08:43
and it's completely untrained it's kind of like unallocated memory in a typical
2:08:46
of like unallocated memory in a typical
2:08:46
of like unallocated memory in a typical binary program written in C or something
2:08:48
binary program written in C or something
2:08:48
binary program written in C or something like that that so it's unallocated
2:08:49
like that that so it's unallocated
2:08:50
like that that so it's unallocated memory and then at test time if you
2:08:51
memory and then at test time if you
2:08:51
memory and then at test time if you evoke this token then you're basically
2:08:54
evoke this token then you're basically
2:08:54
evoke this token then you're basically plucking out a row of the embedding
2:08:55
plucking out a row of the embedding
2:08:55
plucking out a row of the embedding table that is completely untrained and
2:08:57
table that is completely untrained and
2:08:57
table that is completely untrained and that feeds into a Transformer and
2:08:58
that feeds into a Transformer and
2:08:58
that feeds into a Transformer and creates undefined behavior and that's
2:09:00
creates undefined behavior and that's
2:09:00
creates undefined behavior and that's what we're seeing here this completely
2:09:02
what we're seeing here this completely
2:09:02
what we're seeing here this completely undefined never before seen in a
2:09:03
undefined never before seen in a
2:09:03
undefined never before seen in a training behavior and so any of these
2:09:06
training behavior and so any of these
2:09:06
training behavior and so any of these kind of like weird tokens would evoke
2:09:07
kind of like weird tokens would evoke
2:09:08
kind of like weird tokens would evoke this Behavior because fundamentally the
2:09:09
this Behavior because fundamentally the
2:09:09
this Behavior because fundamentally the model is um is uh uh out of sample out
2:09:14
model is um is uh uh out of sample out
2:09:14
model is um is uh uh out of sample out of distribution okay and the very last
2:09:16
of distribution okay and the very last
2:09:16
of distribution okay and the very last thing I wanted to just briefly mention
2:09:18
thing I wanted to just briefly mention
2:09:18
thing I wanted to just briefly mention point out although I think a lot of
2:09:19
point out although I think a lot of
2:09:19
point out although I think a lot of people are quite aware of this is that
2:09:21
people are quite aware of this is that
2:09:21
people are quite aware of this is that different kinds of formats and different
2:09:23
different kinds of formats and different
2:09:23
different kinds of formats and different representations and different languages
2:09:24
representations and different languages
2:09:25
representations and different languages and so on might be more or less
2:09:26
and so on might be more or less
2:09:26
and so on might be more or less efficient with GPD tokenizers uh or any
2:09:29
efficient with GPD tokenizers uh or any
2:09:29
efficient with GPD tokenizers uh or any tokenizers for any other L for that
2:09:31
tokenizers for any other L for that
2:09:31
tokenizers for any other L for that matter so for example Json is actually
2:09:33
matter so for example Json is actually
2:09:33
matter so for example Json is actually really dense in tokens and yaml is a lot
2:09:36
really dense in tokens and yaml is a lot
2:09:36
really dense in tokens and yaml is a lot more efficient in tokens um so for
2:09:39
more efficient in tokens um so for
2:09:39
more efficient in tokens um so for example this are these are the same in
2:09:41
example this are these are the same in
2:09:41
example this are these are the same in Json and in yaml the Json is
2:09:44
Json and in yaml the Json is
2:09:44
Json and in yaml the Json is 116 and the yaml is 99 so quite a bit of
2:09:48
116 and the yaml is 99 so quite a bit of
2:09:48
116 and the yaml is 99 so quite a bit of an Improvement and so in the token
2:09:51
an Improvement and so in the token
2:09:51
an Improvement and so in the token economy where we are paying uh per token
2:09:53
economy where we are paying uh per token
2:09:53
economy where we are paying uh per token in many ways and you are paying in the
2:09:55
in many ways and you are paying in the
2:09:55
in many ways and you are paying in the context length and you're paying in um
2:09:57
context length and you're paying in um
2:09:57
context length and you're paying in um dollar amount for uh the cost of
2:09:59
dollar amount for uh the cost of
2:09:59
dollar amount for uh the cost of processing all this kind of structured
2:10:01
processing all this kind of structured
2:10:01
processing all this kind of structured data when you have to um so prefer to
2:10:03
data when you have to um so prefer to
2:10:03
data when you have to um so prefer to use theal over Json and in general kind
2:10:06
use theal over Json and in general kind
2:10:06
use theal over Json and in general kind of like the tokenization density is
2:10:07
of like the tokenization density is
2:10:07
of like the tokenization density is something that you have to um sort of
2:10:09
something that you have to um sort of
2:10:09
something that you have to um sort of care about and worry about at all times
2:10:11
care about and worry about at all times
2:10:11
care about and worry about at all times and try to find efficient encoding
2:10:13
and try to find efficient encoding
2:10:13
and try to find efficient encoding schemes and spend a lot of time in tick
2:10:15
schemes and spend a lot of time in tick
2:10:15
schemes and spend a lot of time in tick tokenizer and measure the different
2:10:16
tokenizer and measure the different
2:10:16
tokenizer and measure the different token efficiencies of different formats
2:10:18
token efficiencies of different formats
2:10:18
token efficiencies of different formats and settings and so on okay so that
2:10:20
and settings and so on okay so that
2:10:21
and settings and so on okay so that concludes my fairly long video on
2:10:23
concludes my fairly long video on
2:10:23
concludes my fairly long video on tokenization I know it's a try I know
2:10:25
tokenization I know it's a try I know
2:10:25
tokenization I know it's a try I know it's annoying I know it's irritating I
2:10:28
it's annoying I know it's irritating I
2:10:28
it's annoying I know it's irritating I personally really dislike the stage what
2:10:30
personally really dislike the stage what
2:10:30
personally really dislike the stage what I do have to say at this point is don't
2:10:32
I do have to say at this point is don't
2:10:32
I do have to say at this point is don't brush it off there's a lot of foot guns
2:10:34
brush it off there's a lot of foot guns
2:10:34
brush it off there's a lot of foot guns sharp edges here security issues uh AI
2:10:38
sharp edges here security issues uh AI
2:10:38
sharp edges here security issues uh AI safety issues as we saw plugging in
2:10:39
safety issues as we saw plugging in
2:10:39
safety issues as we saw plugging in unallocated memory into uh language
2:10:42
unallocated memory into uh language
2:10:42
unallocated memory into uh language models so um it's worth understanding
2:10:45
models so um it's worth understanding
2:10:45
models so um it's worth understanding this stage um that said I will say that
2:10:48
this stage um that said I will say that
2:10:48
this stage um that said I will say that eternal glory goes to anyone who can get
2:10:50
eternal glory goes to anyone who can get
2:10:50
eternal glory goes to anyone who can get rid of it uh I showed you one possible
2:10:52
rid of it uh I showed you one possible
2:10:52
rid of it uh I showed you one possible paper that tried to uh do that and I
2:10:54
paper that tried to uh do that and I
2:10:54
paper that tried to uh do that and I think I hope a lot more can follow over
2:10:57
think I hope a lot more can follow over
2:10:57
think I hope a lot more can follow over time and my final recommendations for
2:10:59
time and my final recommendations for
2:10:59
time and my final recommendations for the application right now are if you can
2:11:01
the application right now are if you can
2:11:01
the application right now are if you can reuse the GPT 4 tokens and the
2:11:03
reuse the GPT 4 tokens and the
2:11:03
reuse the GPT 4 tokens and the vocabulary uh in your application then
2:11:04
vocabulary uh in your application then
2:11:05
vocabulary uh in your application then that's something you should consider and
2:11:06
that's something you should consider and
2:11:06
that's something you should consider and just use Tech token because it is very
2:11:07
just use Tech token because it is very
2:11:07
just use Tech token because it is very efficient and nice library for inference
2:11:11
efficient and nice library for inference
2:11:11
efficient and nice library for inference for bpe I also really like the bite
2:11:13
for bpe I also really like the bite
2:11:13
for bpe I also really like the bite level BP that uh Tik toen and openi uses
2:11:17
level BP that uh Tik toen and openi uses
2:11:17
level BP that uh Tik toen and openi uses uh if you for some reason want to train
2:11:19
uh if you for some reason want to train
2:11:19
uh if you for some reason want to train your own vocabulary from scratch um then
2:11:22
your own vocabulary from scratch um then
2:11:22
your own vocabulary from scratch um then I would use uh the bpe with sentence
2:11:24
I would use uh the bpe with sentence
2:11:25
I would use uh the bpe with sentence piece um oops as I mentioned I'm not a
2:11:28
piece um oops as I mentioned I'm not a
2:11:28
piece um oops as I mentioned I'm not a huge fan of sentence piece I don't like
2:11:30
huge fan of sentence piece I don't like
2:11:30
huge fan of sentence piece I don't like its uh bite fallback and I don't like
2:11:33
its uh bite fallback and I don't like
2:11:33
its uh bite fallback and I don't like that it's doing BP on unic code code
2:11:35
that it's doing BP on unic code code
2:11:35
that it's doing BP on unic code code points I think it's uh it also has like
2:11:37
points I think it's uh it also has like
2:11:37
points I think it's uh it also has like a million settings and I think there's a
2:11:39
a million settings and I think there's a
2:11:39
a million settings and I think there's a lot of foot gonss here and I think it's
2:11:40
lot of foot gonss here and I think it's
2:11:40
lot of foot gonss here and I think it's really easy to Mis calibrate them and
2:11:42
really easy to Mis calibrate them and
2:11:42
really easy to Mis calibrate them and you end up cropping your sentences or
2:11:43
you end up cropping your sentences or
2:11:43
you end up cropping your sentences or something like that uh because of some
2:11:45
something like that uh because of some
2:11:45
something like that uh because of some type of parameter that you don't fully
2:11:47
type of parameter that you don't fully
2:11:47
type of parameter that you don't fully understand so so be very careful with
2:11:49
understand so so be very careful with
2:11:49
understand so so be very careful with the settings try to copy paste exactly
2:11:51
the settings try to copy paste exactly
2:11:51
the settings try to copy paste exactly maybe where what meta did or basically
2:11:54
maybe where what meta did or basically
2:11:54
maybe where what meta did or basically spend a lot of time looking at all the
2:11:56
spend a lot of time looking at all the
2:11:56
spend a lot of time looking at all the hyper parameters and go through the code
2:11:57
hyper parameters and go through the code
2:11:57
hyper parameters and go through the code of sentence piece and make sure that you
2:11:59
of sentence piece and make sure that you
2:11:59
of sentence piece and make sure that you have this correct um but even if you
2:12:02
have this correct um but even if you
2:12:02
have this correct um but even if you have all the settings correct I still
2:12:03
have all the settings correct I still
2:12:03
have all the settings correct I still think that the algorithm is kind of
2:12:04
think that the algorithm is kind of
2:12:04
think that the algorithm is kind of inferior to what's happening here and
2:12:07
inferior to what's happening here and
2:12:07
inferior to what's happening here and maybe the best if you really need to
2:12:09
maybe the best if you really need to
2:12:09
maybe the best if you really need to train your vocabulary maybe the best
2:12:11
train your vocabulary maybe the best
2:12:11
train your vocabulary maybe the best thing is to just wait for M bpe to
2:12:13
thing is to just wait for M bpe to
2:12:13
thing is to just wait for M bpe to becomes as efficient as possible and uh
2:12:16
becomes as efficient as possible and uh
2:12:16
becomes as efficient as possible and uh that's something that maybe I hope to
2:12:18
that's something that maybe I hope to
2:12:18
that's something that maybe I hope to work on and at some point maybe we can
2:12:20
work on and at some point maybe we can
2:12:20
work on and at some point maybe we can be training basically really what we
2:12:22
be training basically really what we
2:12:22
be training basically really what we want is we want tick token but training
2:12:24
want is we want tick token but training
2:12:24
want is we want tick token but training code and that is the ideal thing that
2:12:27
code and that is the ideal thing that
2:12:27
code and that is the ideal thing that currently does not exist and MBP is um
2:12:31
currently does not exist and MBP is um
2:12:31
currently does not exist and MBP is um is in implementation of it but currently
2:12:33
is in implementation of it but currently
2:12:33
is in implementation of it but currently it's in Python so that's currently what
2:12:35
it's in Python so that's currently what
2:12:35
it's in Python so that's currently what I have to say for uh tokenization there
2:12:38
I have to say for uh tokenization there
2:12:38
I have to say for uh tokenization there might be an advanced video that has even
2:12:40
might be an advanced video that has even
2:12:40
might be an advanced video that has even drier and even more detailed in the
2:12:41
drier and even more detailed in the
2:12:41
drier and even more detailed in the future but for now I think we're going
2:12:43
future but for now I think we're going
2:12:43
future but for now I think we're going to leave things off here and uh I hope
2:12:46
to leave things off here and uh I hope
2:12:46
to leave things off here and uh I hope that was helpful bye
2:12:54
and uh they increase this contact size
2:12:56
and uh they increase this contact size
2:12:56
and uh they increase this contact size from gpt1 of 512 uh to 1024 and GPT 4
2:13:02
from gpt1 of 512 uh to 1024 and GPT 4
2:13:02
from gpt1 of 512 uh to 1024 and GPT 4 two the
2:13:05
two the
2:13:05
two the next okay next I would like us to
2:13:07
next okay next I would like us to
2:13:07
next okay next I would like us to briefly walk through the code from open
2:13:09
briefly walk through the code from open
2:13:09
briefly walk through the code from open AI on the gpt2 encoded
2:13:15
ATP I'm sorry I'm gonna sneeze
2:13:19
ATP I'm sorry I'm gonna sneeze
2:13:19
ATP I'm sorry I'm gonna sneeze and then what's Happening Here
2:13:21
and then what's Happening Here
2:13:21
and then what's Happening Here is this is a spous layer that I will
2:13:24
is this is a spous layer that I will
2:13:24
is this is a spous layer that I will explain in a
2:13:26
explain in a
2:13:26
explain in a bit What's Happening Here
2:13:33
is