But how do AI images and videos actually work? | Guest video by Welch Labs
Transcript
0:03
Over the last few years, AI systems have become
0:06
astonishingly good at turning text prompts into videos.
0:10
At the core of how these models operate is a deep connection to physics.
0:14
This generation of image and video models works using a process known as diffusion,
0:19
which is remarkably equivalent to the Brownian motion we see as particles diffuse,
0:23
but with time run backwards, and in high-dimensional space.
0:28
As we'll see, this connection to physics is much more than a curiosity.
0:31
We get real algorithms out of the physics that we can use to generate images and videos.
0:36
And this perspective will also give us some really
0:39
nice intuitions for how these models work in practice.
0:42
But before we dive into this connection, let's get hands-on with a real diffusion model.
0:47
While the best models are closed source, there are some compelling open source models.
0:52
This video of an astronaut was generated by an open source model called WAN 2.1.
0:57
We can add to our prompt and have our astronaut hold a flag,
1:01
hold a laptop, or hold a meeting.
1:04
If we cut down our prompt to just an astronaut, we get this.
1:08
And if we cut down our prompt to nothing, we interestingly
1:10
still get this video of a woman.
1:13
If we dig into our WAN model's source code, we'll find that the video
1:16
generation process begins with this call to a random number generator.
1:21
Creating a video where the pixel intensity values are chosen randomly.
1:25
Here's what it looks like.
1:27
From here, this pure noise video is passed into a transformer.
1:31
This is the same type of AI model used by large language models, like ChatGPT.
1:36
But instead of outputting text, this transformer
1:39
outputs another video that now looks like this.
1:42
Still mostly noise, but with some hints of structure.
1:45
This new video is added to our pure noise video,
1:48
and then passed back into the model again, producing a third video that looks like this.
1:54
This process is repeated again and again.
1:56
Here's what the video looks like after 5 iterations, 10, 20, 30, 40, and finally 50.
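The loop described above can be sketched in a few lines. This is only a toy under stated assumptions: `toy_model` is a hypothetical stand-in for the transformer (it simply nudges values toward a fixed target), and a short list of numbers stands in for the video. It is not WAN's actual code.

```python
import random

def toy_model(frame, target, steps_left):
    """Hypothetical stand-in for the transformer: returns an update that
    moves each value part of the way toward a fixed target."""
    return [(t - x) / steps_left for x, t in zip(frame, target)]

def generate(target, num_steps=50, seed=0):
    rng = random.Random(seed)
    # Start from pure noise: every "pixel" value is drawn at random.
    frame = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(num_steps):
        update = toy_model(frame, target, num_steps - step)
        # Add the model's output to the current frame, then feed it back in.
        frame = [x + u for x, u in zip(frame, update)]
    return frame

target = [0.2, 0.8, 0.5, 0.1]   # made-up "clean" values for illustration
result = generate(target)
```

A real model never sees the target, of course. It has to infer the update from the noisy input and the prompt.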
2:05
Step by step, our transformer shapes pure noise into incredibly realistic video.
2:11
But what exactly is the connection to Brownian motion here?
2:15
And how is our model able to use text input so expressively
2:19
to shape noise into what our prompt describes?
2:22
In this video, we'll unpack diffusion models in 3 parts.
2:26
First we'll look at a 2021 OpenAI paper and model called CLIP.
2:30
As we'll see, CLIP is really two models, a language model and a vision model,
2:34
that are trained using a clever learning objective that allows them to
2:38
learn this really powerful shared space between words and pictures.
2:43
Experimenting with this space will help us get a feel for
2:46
the high dimensional spaces that diffusion models operate in.
2:50
But learning a shared representation is not enough to generate images.
2:53
From here we'll look at the diffusion process itself.
2:56
At a high level, diffusion models are trained to remove noise from images or videos.
3:02
However, if you dig into the landmark papers in the field,
3:04
you'll find that this naive understanding of diffusion really doesn't hold
3:08
up in practice.
3:09
In this section we'll dig into the connection between
3:12
diffusion models and diffusion processes in physics.
3:15
This connection will help us understand how these models really work in practice and
3:19
give us some powerful theory for dramatically speeding up image and video generation.
3:25
Finally, we'll bring these worlds together and see how approaches
3:28
like CLIP are combined with diffusion models to condition and guide
3:31
the generation process towards the videos we ask for in our prompts.
3:37
2020 was a landmark year for language modeling.
3:40
New results in neural scaling laws and OpenAI's
3:43
GPT-3 showed that bigger really was better.
3:47
Massive models trained on massive datasets had
3:50
capabilities that simply didn't exist in smaller models.
3:53
It didn't take long for researchers to apply similar ideas to images.
3:58
In February 2021, a team at OpenAI released a new model architecture called CLIP,
4:02
trained on a dataset of 400 million image and caption pairs scraped from the internet.
4:08
CLIP is composed of two models, one that processes text and one that processes images.
4:14
The output of each of these models is a vector of length 512,
4:17
and the central idea is that the vectors for a given image and its caption
4:21
should be similar.
4:23
To achieve this, the OpenAI team developed a clever training approach.
4:28
Given a batch of image-caption pairs, for example our batch could contain
4:32
a picture of a cat, a dog, and me, with the captions a photo of a cat,
4:36
a photo of a dog, and a photo of a man, we then pass our three images
4:39
into our image model, and our three captions into our text model.
4:44
We now have three image vectors and three text vectors,
4:47
and we would like the vectors for the matching image-caption pairs to be similar.
4:52
The clever idea from here is to make use of the similarity not
4:55
just between the corresponding images and captions,
4:57
but between all image-caption pairs in the batch when training our models.
5:01
If we arrange our image vectors as the columns of a matrix,
5:04
and our text vectors as the rows, the pairs of vectors along
5:07
the diagonal of our matrix correspond to matching images and captions.
5:11
And all the pairs off-diagonal are non-matching images and captions.
5:16
The CLIP training objective seeks to maximize the similarity between
5:19
corresponding image-caption pairs, while simultaneously minimizing
5:23
the similarity between non-corresponding image-caption pairs.
5:28
The C in CLIP stands for contrastive, because the model learns
5:31
to contrast matching and non-matching image-caption pairs.
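Here is a minimal sketch of a contrastive objective of this kind, assuming a symmetric cross-entropy over the similarity matrix. The function names, the toy 3-dimensional vectors, and the temperature value of 0.07 are assumptions made for illustration, not details taken from the video.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_entropy(logits, target_index):
    # -log softmax(logits)[target_index], computed stably.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target_index]

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    n = len(image_vecs)
    # sim[i][j]: scaled similarity of text i (row) and image j (column).
    sim = [[cosine(t, im) / temperature for im in image_vecs] for t in text_vecs]
    # Each row should pick out its diagonal entry, and so should each column.
    rows = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    cols = sum(cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (rows + cols)

# Made-up embeddings: caption i roughly matches image i.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
texts = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
aligned_loss = contrastive_loss(images, texts)
shuffled_loss = contrastive_loss(images, texts[1:] + texts[:1])
```

The loss is small when the large similarities sit on the diagonal and large when the captions are shuffled, which is the signal that pushes matching pairs together and non-matching pairs apart.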
5:36
The CLIP algorithm measures similarity between
5:38
vectors using a metric called cosine similarity.
5:41
Geometrically, we can think of each of these vectors as
5:44
pointing in some direction in high-dimensional space.
5:47
Cosine similarity measures the cosine of the angle between our vectors in this space.
5:53
So if our text and image vector point in the same direction,
5:56
the angle between our vectors will be zero, resulting in a maximum value for our cosine
6:00
similarity score of 1.
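Written out for two vectors u and v separated by an angle θ (the symbols are chosen here for illustration), cosine similarity is:

```latex
\cos\theta = \frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert}
```

It ranges from -1 for vectors pointing in opposite directions to 1 for vectors pointing the same way, and it ignores the lengths of the vectors.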
6:03
So the image and text models that make up CLIP are trained to maximize the
6:07
alignment of related images and captions in this shared high-dimensional space,
6:11
while minimizing the alignment between unrelated images and captions.
6:16
The learned geometry of this shared vector space,
6:19
known as a latent or embedding space, has some really interesting properties.
6:24
If I take two pictures of myself, one not wearing a hat and one wearing a hat,
6:28
and pass both of these into our CLIP image model,
6:31
we get two vectors in our embedding space.
6:35
Now if I take the vector corresponding to me wearing a hat,
6:38
and subtract the vector of me not wearing a hat,
6:40
we get a new vector in our embedding space.
6:43
Now what text might this new vector correspond to?
6:48
Mathematically we took the difference of me wearing a hat and me not wearing a hat.
6:52
We can search for corresponding text by passing a bunch of different
6:55
words into our text encoder, and for each computing the cosine similarity
6:59
between our newly computed difference vector and the text vector.
7:04
Testing a set of a few hundred common words, the top ranked match with
7:08
a similarity of 0.165 is the word hat, followed by cap and helmet.
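The search itself is simple to sketch. The vectors below are made-up 3-dimensional stand-ins, not real CLIP outputs (which have 512 entries), so the scores will not match the 0.165 quoted above. Only the procedure is the point.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical embeddings, invented for illustration only.
image_with_hat = [0.6, 0.5, 0.7]
image_without_hat = [0.6, 0.5, 0.1]
word_vectors = {
    "hat": [0.1, 0.0, 0.9],
    "dog": [0.9, 0.1, 0.0],
    "tree": [0.0, 0.9, 0.1],
}

# Subtract the two image vectors, then rank words by similarity to the result.
difference = [a - b for a, b in zip(image_with_hat, image_without_hat)]
ranked = sorted(word_vectors,
                key=lambda w: cosine(difference, word_vectors[w]),
                reverse=True)
```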
7:13
This is a remarkable result.
7:16
The learned geometry of CLIP's embedding space allows us to operate
7:20
mathematically on the pure ideas or concepts in our images and text,
7:24
translating the differences in the content of our images,
7:27
like if there's a hat or not, into a literal distance between vectors in
7:32
our embedding space.
7:33
The OpenAI team showed that CLIP could produce very impressive image classification
7:38
results by simply passing an image into our image encoder,
7:41
and then comparing the resulting vector to a set of possible captions,
7:45
one for each label that could be assigned to the image,
7:48
and classifying the image with whatever label resulted in the highest cosine similarity.
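In symbols, with notation assumed here for illustration (x is the image, c_k is the caption written for label k, and f_image and f_text are the two encoders), the classification rule is:

```latex
\hat{k} = \arg\max_{k}\; \cos\bigl(f_{\text{image}}(x),\; f_{\text{text}}(c_k)\bigr)
```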
7:54
So techniques like CLIP give us a powerful shared representation of image and text,
7:59
a kind of vector space of pure ideas.
8:02
However, our CLIP models only go one direction.
8:06
We can only map image and text to our shared embedding space.
8:10
We have no way of generating images and text from our embedding vectors.
8:15
2020 turned out to be a transformative year not only for language modeling.
8:20
A few weeks after the GPT-3 paper came out, a team at Berkeley published a
8:25
paper called Denoising Diffusion Probabilistic Models, now known as DDPM.
8:30
The paper showed for the first time that it was possible to
8:33
generate very high quality images using a diffusion process,
8:37
where pure noise is transformed step by step into realistic images.
8:42
The core idea behind diffusion models is pretty straightforward.
8:46
We take a set of training images and add noise to each
8:49
image step by step until the image is completely destroyed.
8:53
From here we train a neural network to reverse this process.
8:57
When I first learned about diffusion models, I assumed that the
9:00
models would be trained to remove noise a single step at a time.
9:04
Our model would be trained to predict the image in step 1 given the noisier image in step
9:09
2, trained to predict the image in step 2 given the noisier image in step 3, and so on.
9:14
When it came time to generate an image, we would pass pure noise into our model,
9:18
take its output and pass it back into its input again and again,
9:22
and after enough steps we would have a nice image.
9:25
Now, it turns out that this naive approach to
9:28
building a diffusion model really does not work well.
9:31
Virtually no modern models work like this.
9:35
These are the training and image generation algorithms from the Berkeley team's paper.
9:40
The notation is a bit dense, but there are some key details we can pull out
9:43
that will help us understand what it takes to make these models really work.
9:47
The first thing that surprised me is that the team added random noise
9:51
to images not just during training, but also during image generation.
9:55
Algorithm 2 tells us that when generating new images, at each step,
9:59
after our neural network predicts a less noisy image,
10:03
we need to add random noise to this image before passing it back into our model.
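For reference, the update at the heart of Algorithm 2 in the DDPM paper can be written as follows. Here ε_θ is the network's noise prediction, α_t = 1 − β_t comes from the noise schedule, ᾱ_t is the running product of the α values, and σ_t sets the size of the added noise.

```latex
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}
          \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t) \right)
          + \sigma_t z,
\qquad z \sim \mathcal{N}(0, I)
```

The first term is the less noisy estimate, and σ_t z is the fresh random noise. At the very last step z is set to zero.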
10:08
This added noise turns out to matter a lot in practice.
10:12
If we take a popular diffusion model like Stable Diffusion 2 and use the Berkeley team's
10:17
image generation approach, known as DDPM sampling, we can get some really nice images.
10:23
Here's the image we get when prompting the model with this prompt,
10:26
asking for a tree in the desert.
10:28
Now, if we remove the line of code that adds noise at each step
10:32
of the generation process, we end up with a tiny sad blurry tree.
10:37
How is it that adding random noise while generating images leads to better quality,
10:41
sharper images?
10:43
The second thing that surprised me when I encountered the Berkeley team's approach was
10:47
that the team wasn't training models to reverse a single step in the noise addition
10:51
process.
10:52
Instead, the team takes an initial clean image, which they call X0,
10:56
and adds scaled random noise to the image, which they call epsilon.
11:00
And from here, they train the model to predict the
11:03
total noise that was added to the original image.
11:06
So the team is effectively asking the model to skip all the
11:09
intermediate steps and make a prediction about the original image.
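A small sketch shows why predicting the total noise amounts to predicting the original image. It assumes the DDPM-style mixing rule x_t = sqrt(ᾱ)·x0 + sqrt(1 − ᾱ)·ε. The function names and the toy three-value "image" are made up for illustration.

```python
import math
import random

def add_noise(x0, alpha_bar, rng):
    """Jump straight to a noisy sample without simulating intermediate steps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
           for x, e in zip(x0, eps)]
    return x_t, eps

def recover_x0(x_t, eps, alpha_bar):
    """Invert the mixing rule: knowing the noise pins down the clean image."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(x_t, eps)]

def noise_loss(predicted_eps, eps):
    """Mean squared error between predicted and true noise (the training signal)."""
    return sum((p - e) ** 2 for p, e in zip(predicted_eps, eps)) / len(eps)

rng = random.Random(0)
x0 = [0.1, 0.9, 0.4]                      # toy "clean image"
x_t, eps = add_noise(x0, alpha_bar=0.3, rng=rng)
x0_again = recover_x0(x_t, eps, alpha_bar=0.3)
```

During training the network only sees `x_t` and the time step, and its guess for `eps` is scored with `noise_loss`.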
11:14
Intuitively, this learning task seems much more difficult to me
11:17
than just learning to make a noisy image slightly less noisy.
11:21
The Berkeley team's paper and approach was a landmark
11:23
result that put diffusion on the map.
11:26
Why does adding random noise while generating images
11:29
and training the model like this work so well?
11:33
The DDPM paper draws on some fairly complex theory to arrive at these algorithms.
11:38
I'll include a link to a great tutorial in the
11:40
description if you want to dig deeper into the theory.
11:43
Happily, it turns out that there's a different but mathematically equivalent
11:46
way of understanding what diffusion models are really learning that we can
11:50
use to get a visual and intuitive sense for why the DDPM algorithms work so well.
11:55
The key will be thinking of diffusion models as learning a time-varying vector field.
12:00
This perspective also leads to a more general approach called flow-based models,
12:04
which have become very popular recently.
12:07
To see how diffusion models learn this time-varying vector field,
12:11
let's temporarily simplify our learning problem.
12:14
One way to think about an image is as a point in high-dimensional space,
12:18
where the intensity value of each pixel controls the position of the point in each
12:22
dimension.
12:24
If we reduce the size of our images to only two pixels,
12:27
we can visualize the distribution of our images by plotting the pixel intensity
12:31
value of our first pixel on the x-axis of a scatterplot and the pixel intensity of
12:35
our second pixel on the y-axis.
12:38
So an image with a black first pixel and a white second pixel
12:41
would show up at x equals zero and y equals one on our scatterplot.
12:45
And an all-white image would be at one, one, and so on.
12:49
Now, real images have a very specific structure in this high-dimensional space.
12:53
Let's create some structure for our points in our lower
12:56
two-dimensional space for our diffusion model to learn.
12:59
The exact structure we choose doesn't matter too much at this point.
13:02
Let's start with a spiral shape like this.
13:05
The core idea of diffusion models, adding more and more noise to an
13:09
image and then training a neural network to reverse this process,
13:12
looks really interesting from the perspective of our 2D toy data.
13:17
When we add random noise to an image, we're effectively
13:20
changing each pixel's value by a random amount.
13:23
In our toy 2D dataset, where the coordinates of a point correspond
13:27
to that image's pixel intensity values, adding random noise is
13:30
equivalent to taking a step in a randomly chosen direction.
13:34
As we add more and more noise to our image, our point goes on a random walk.
13:38
This process is equivalent to the Brownian motion that drives diffusion
13:42
processes in physics and is where diffusion models get their name.
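A random walk like this takes only a few lines to simulate. The spiral parameterization, step size, and function names below are assumptions chosen for illustration and are not the exact values used in the video.

```python
import math
import random

def spiral_point(s, turns=2.0):
    """Hypothetical spiral: s in [0, 1] maps to a 2D point (a two-pixel 'image')."""
    angle = 2.0 * math.pi * turns * s
    radius = 0.2 + 0.8 * s
    return (radius * math.cos(angle), radius * math.sin(angle))

def random_walk(start, num_steps, step_std, rng):
    """Add independent Gaussian noise to both coordinates at every step."""
    x, y = start
    path = [(x, y)]
    for _ in range(num_steps):
        x += rng.gauss(0.0, step_std)
        y += rng.gauss(0.0, step_std)
        path.append((x, y))
    return path

rng = random.Random(0)
start = spiral_point(0.5)
path = random_walk(start, num_steps=100, step_std=0.1, rng=rng)
```

Plotting `path` gives one diffusion trajectory. Repeating this from many starting points on the spiral gives the cloud of walks the model is trained on.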
13:46
From here, it's pretty wild to think about what we're asking our diffusion model to do.
13:51
Our model will see many different random walks from various starting points in our
13:56
dataset, and we're effectively asking our model to reverse the clock,
13:59
removing noise from our images by letting it play these diffusion processes backwards,
14:04
starting our points from random locations and recovering the original structure of
14:08
our dataset.
14:10
How can our model learn to reverse these random walks?
14:14
If we consider the specific point at the end of this 100-step random walk,
14:18
in our naive diffusion modeling approach, where we ask our model to denoise images a
14:23
single step at a time, this is equivalent to giving our model the coordinates of the
14:28
final 100th point in our walk, and asking our model to predict the coordinates of our
14:32
point at the 99th step.
14:34
Although the direction of our 100th step is chosen randomly,
14:38
there will be some signal in aggregate for our model to learn from here.
14:42
Given enough training points, we expect many diffusion paths to go through
14:46
this neighborhood, and on average our points will be diffusing away from
14:51
our starting spiral, so our model can learn to point back towards our spiral.
14:56
We can now see why the Berkeley team's training objective works so well.
15:01
Instead of training the model to remove noise from images one step at a time,
15:05
which would correspond to predicting the coordinates of the 99th step given the 100th,
15:09
the team instead trained the model to predict the total noise added across the entire
15:13
walk.
15:14
On our plot, this is the vector pointing from our 100th
15:17
step back to the original starting point of the walk.
15:20
It turns out that we can prove that learning to predict the noise added
15:23
in the final step of our walk is mathematically equivalent to learning
15:27
to predict the total noise added, divided by the number of steps taken.
15:32
This means that when our model learns to reverse a single step,
15:35
although our training data is noisy, we expect our model to ultimately learn to
15:40
point back towards x0.
15:42
By instead training our model to directly predict the vector pointing back towards x0,
15:47
we're significantly reducing the variance of our training examples,
15:51
allowing our model to learn much more efficiently,
15:53
without actually changing our underlying learning objective.
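One way to see this equivalence, assuming the simple toy setting where all N steps are independent and have the same Gaussian distribution (real noise schedules scale the steps, so treat this as a sketch):

```latex
\begin{align*}
x_N &= x_0 + \sum_{i=1}^{N} s_i, \qquad s_i \sim \mathcal{N}(0, \sigma^2 I)\ \text{i.i.d.} \\
\mathbb{E}[s_1 \mid x_0, x_N] &= \cdots = \mathbb{E}[s_N \mid x_0, x_N]
  \quad \text{(the steps are interchangeable)} \\
\sum_{i=1}^{N} \mathbb{E}[s_i \mid x_0, x_N] &= x_N - x_0
  \;\;\Longrightarrow\;\;
  \mathbb{E}[s_N \mid x_0, x_N] = \frac{x_N - x_0}{N}
\end{align*}
```

Averaging over the unknown starting point x_0 gives the same relation conditioned on x_N alone, so the best prediction of the last step is the best prediction of the total noise divided by N. The total-noise target just carries far less randomness per training example.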
15:58
So for each point in our space, our model learns the direction
16:01
pointing back towards the original data distribution.
16:05
This is also known as a score function, and the intuition here is that
16:09
the score function points us towards more likely, less noisy data.
16:14
Now, in practice, these learned directions depend
16:17
heavily on how much noise we add to our original data.
16:20
After 100 steps, most of our points are far from their starting points,
16:24
so our model learns to move these points back in the general direction of our spiral.
16:29
However, if we train our model on examples after only one diffusion step,
16:33
we end up with a much more nuanced vector field,
16:36
pointing to the fine structure of our spiral.
16:39
There turns out to be a clever solution to this problem.
16:42
Instead of just passing in the coordinates of our point into our model,
16:46
which we'll write here as a function f, we can also pass in a time
16:50
variable that corresponds to the number of steps taken in our random walk.
16:54
If we set t equal to 1 at our 100th step, then t would equal 0.99 at our 99th step,
17:00
and so on.
17:01
Conditioning our models on time like this turns out to be essential in practice,
17:06
allowing our model to learn coarse vector fields for large values of t,
17:09
and very refined structures as t approaches 0.
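In code, time conditioning just means the training input is a pair (point, t) rather than the point alone. A minimal sketch, with hypothetical names and a 100-step walk assumed as in the example above:

```python
import math
import random

def make_training_example(x0, total_steps, step_std, rng):
    """Build one ((x_t, t), target) pair for a time-conditioned model f(x, t)."""
    n = rng.randint(1, total_steps)      # how many random steps were taken
    t = n / total_steps                  # t = 1.0 at the final step
    # n independent Gaussian steps add up to one Gaussian step of std step_std * sqrt(n).
    scale = step_std * math.sqrt(n)
    x_t = tuple(c + scale * rng.gauss(0.0, 1.0) for c in x0)
    # Target: the vector pointing from the noisy point back to the start.
    target = tuple(a - b for a, b in zip(x0, x_t))
    return (x_t, t), target

rng = random.Random(0)
x0 = (0.3, -0.4)                         # a made-up point on the data
(x_t, t), target = make_training_example(x0, total_steps=100, step_std=0.1, rng=rng)
```

Because t is part of the input, the same network can learn a coarse field for t near 1 and a detailed one for t near 0.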
17:13
After training, we can watch the time evolution of our model.
17:18
We see this really interesting behavior as t approaches 0.4.
17:23
Our learned vector field suddenly transitions,
17:25
from pointing towards the center of the spiral to pointing towards the spiral itself.
17:29
It feels like a phase change.
17:33
We're now in a great position to resolve the final mystery of the DDPM paper.
17:38
How is it that adding random noise at each step while
17:41
generating images leads to better quality, sharper images?
17:45
Let's follow the path of a single point guided by the DDPM image generation algorithm.
17:51
On our 2D dataset, generating an image is equivalent to starting
17:54
at a random location and working our way back to our spiral.
17:59
Starting at a randomly chosen location of x equals minus 1.6 and y equals 1.8,
18:03
our model's vector field points us back towards our spiral.
18:08
Following the DDPM algorithm, we take a small step in the direction returned by our
18:13
model, and add scaled random noise, which effectively moves our point in a random
18:17
direction.
18:19
We'll color the steps driven by our diffusion model in blue, and our random steps in gray.
18:24
Note that the scale of the random step may seem large, but following our DDPM algorithm,
18:28
the size of our random steps will come down as we progress.
18:33
Repeating this process for 64 steps, our particle jumps around
18:36
quite a bit due to both our learned vector field changing and our random noise steps,
18:42
but ultimately lands nicely on our spiral.
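To experiment with this without training a network, here is a toy sketch under several loudly stated assumptions. The "model" is replaced by the exact average of the data given the noisy point, which is computable for a small dataset. The noise is simply added to the data (a "variance exploding" setup) rather than using DDPM's scaling. The spiral, the noise levels, and all names are made up for illustration.

```python
import math
import random

def make_spiral(n=200):
    pts = []
    for i in range(n):
        s = i / (n - 1)
        angle = 4.0 * math.pi * s
        radius = 0.2 + 0.8 * s
        pts.append((radius * math.cos(angle), radius * math.sin(angle)))
    return pts

def posterior_mean(x, sigma, data):
    """Exact E[x0 | x] when x = x0 + sigma * noise and x0 is uniform over data."""
    d2 = [(x[0] - p[0]) ** 2 + (x[1] - p[1]) ** 2 for p in data]
    m = min(d2)  # subtract the minimum for numerical stability
    w = [math.exp(-(d - m) / (2.0 * sigma * sigma)) for d in d2]
    total = sum(w)
    return (sum(wi * p[0] for wi, p in zip(w, data)) / total,
            sum(wi * p[1] for wi, p in zip(w, data)) / total)

def sample(data, sigmas, rng, add_noise=True):
    """Walk from pure noise toward the data through decreasing noise levels."""
    x = (sigmas[0] * rng.gauss(0.0, 1.0), sigmas[0] * rng.gauss(0.0, 1.0))
    for hi, lo in zip(sigmas[:-1], sigmas[1:]):
        mean = posterior_mean(x, hi, data)
        frac = (hi * hi - lo * lo) / (hi * hi)
        # Step part of the way toward the predicted clean point...
        x = (x[0] + frac * (mean[0] - x[0]), x[1] + frac * (mean[1] - x[1]))
        if add_noise:
            # ...then add scaled random noise, which shrinks as the noise level drops.
            std = lo * math.sqrt(frac)
            x = (x[0] + std * rng.gauss(0.0, 1.0), x[1] + std * rng.gauss(0.0, 1.0))
    return x

data = make_spiral()
# 64 steps: noise levels fall geometrically from 2.0 to 0.01, then to 0.
sigmas = [2.0 * (0.01 / 2.0) ** (i / 63) for i in range(64)] + [0.0]
rng = random.Random(0)
points = [sample(data, sigmas, rng) for _ in range(5)]
```

Calling `sample(..., add_noise=False)` skips the random steps, which makes it easy to compare the two behaviours on this toy data.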
18:45
Repeating this process for a point cloud of 256 points,
18:49
our reverse diffusion process starts out looking like absolute chaos,
18:53
but does converge nicely, with most points landing on our spiral.
18:58
Now, what happens if we remove the noise addition steps?
19:03
Running our reverse diffusion process again without the random noise step,
19:07
all of our points quickly move to the center of our spiral,
19:10
and then make their way towards a single inside edge of the spiral.
19:14
This result can help us make sense of why we saw a sad
19:17
blurry tree earlier when we removed this random noise step.
19:21
Instead of capturing our full spiral distribution,
19:24
as we did when we included a noise step, all of our generated points end up close to
19:28
the center or average of our spiral.
19:32
In the space of images, averages look blurry.
19:36
Conceptually, we can imagine different parts of our spiral
19:39
corresponding to different images of trees in the desert.
19:42
And when we remove the random noise steps from our generation process,
19:46
our generated images end up in the center or average of these images,
19:49
which looks like a blurry mess.
19:52
Now, note that the analogy between our toy dataset and
19:55
a high-dimensional image dataset breaks down a bit here.
19:58
If all the points on our spiral correspond to realistic images,
20:02
since our generated points do still end up landing on our 2D spiral,
20:05
we would expect these generated points to still look like real images,
20:09
but likely with less diversity than we would want.
20:13
However, in the high dimensional space of images,
20:16
it appears that our image generation process doesn't quite make
20:19
it to the manifold of realistic images, resulting in a blurry non-realistic image.
20:25
This prediction of the average is not a coincidence.
20:29
It turns out that we can show mathematically that our model
20:31
learns to point to the mean or average of our dataset,
20:34
conditioned on our input point and the time in our diffusion process.
20:39
One way to arrive at this result is to show that given the noise we add in our forward
20:43
process is Gaussian, for sufficiently small step sizes our reverse process will also
20:48
follow a Gaussian distribution, where our model actually learns the mean of this
20:52
distribution.
20:54
Since our model just predicts the mean of our normal distribution,
20:58
to actually sample from this distribution, we need to add zero mean
21:02
Gaussian noise to our model's predicted value,
21:04
which is precisely what the DDPM image generation process does when we
21:08
add random noise after each step.
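The two facts used in this argument can be written compactly. The notation is assumed here for illustration: f is the model, x_t is the noisy input, x_0 is the clean data point, and μ and σ_t are the mean and spread of the reverse step.

```latex
\begin{align*}
f^{*}(x_t, t) &= \arg\min_{f}\; \mathbb{E}\,\lVert f(x_t, t) - x_0 \rVert^{2}
              = \mathbb{E}[\,x_0 \mid x_t\,] \\
p(x_{t-1} \mid x_t) &\approx \mathcal{N}\bigl(\mu(x_t, t),\; \sigma_t^{2} I\bigr)
  \;\;\Longrightarrow\;\;
  x_{t-1} = \mu(x_t, t) + \sigma_t z, \quad z \sim \mathcal{N}(0, I)
\end{align*}
```

The first line is the general fact that a squared-error objective is minimized by the conditional average. The second says that drawing a sample, rather than just reporting the mean, requires adding the noise term.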
21:11
We can see this mean learning behavior most clearly early in our reverse diffusion
21:16
process, when t is close to 1 and our training points are far from our spiral.
21:21
Our model's learned vector field points towards the center or average of our dataset.
21:26
So adding random noise during image generation falls nicely out of theory,
21:29
and in practice prevents all our points from landing near the center or average of
21:33
our dataset.
21:35
The DDPM paper put diffusion models on the map as a viable method of generating images,
21:40
but the diffusion approach did not immediately see widespread adoption.
21:45
A key issue with the DDPM approach at the time was the high compute demands of
21:49
the large number of steps required to generate high quality images,
21:53
since each step required a complete pass through a potentially very large neural network.
21:58
A few months later, a pair of papers from teams at Stanford and Google showed that it's
22:03
remarkably possible to generate high quality images without actually adding random
22:07
noise during the generation process, significantly reducing the number of steps required.
22:13
The DDPM image generation process we've been looking at can be expressed using a
22:17
special type of differential equation known as a stochastic differential equation.
22:22
This first term represents the motion of our point driven by our model's vector field,
22:26
and the second term represents the random motions of our point.
22:30
Adding these terms together, we get the overall motion of our point at each step, dx.
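The equation shown on screen is not captured in the transcript. One standard way to write a reverse-time stochastic differential equation of this kind, in notation borrowed from the score-based modeling literature (an assumption here, not necessarily the video's exact notation), is:

```latex
dx = \bigl[\, f(x, t) - g(t)^{2}\, \nabla_x \log p_t(x) \,\bigr]\, dt \;+\; g(t)\, d\bar{w}
```

The bracketed drift term is where the learned vector field enters, through the score ∇ log p_t(x), and g(t) dw̄ is the random motion.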
22:35
From here, we can consider how the distribution of all of our points evolves over time,
22:40
where the motion of each point is governed by this stochastic differential equation.
22:45
This problem has been well studied in physics.
22:48
Using a key result from statistical mechanics known as the Fokker-Planck equation,
22:52
the Google Brain team showed that there's another differential equation,
22:56
this time an ordinary differential equation with no random component,
23:00
that results in the same exact final distribution of points as our stochastic
23:05
differential equation.
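In the notation of the score-based modeling literature (assumed here, since the on-screen equation is not in the transcript), with drift f, noise scale g, and score ∇ log p_t(x), this ordinary differential equation is usually written as:

```latex
dx = \Bigl[\, f(x, t) - \tfrac{1}{2}\, g(t)^{2}\, \nabla_x \log p_t(x) \,\Bigr]\, dt
```

There is no random term, and the learned score enters with half the weight it has in the corresponding reverse-time stochastic equation.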
23:07
This result gives us a new algorithm for generating images using our model's
23:12
learned vector fields that does not require taking random steps along the way.
23:17
Exactly how our ordinary differential equation maps
23:20
to an image generation algorithm is a bit technical.
23:23
I'll leave a link to a tutorial in the description.
23:26
The key result here though, is that we end up with something that looks very
23:30
similar to our DDPM image generation process,
23:32
but without the random noise addition at each step,
23:35
and with a new scaling for the sizes of steps that we take.
23:39
This approach is generally known as DDIM.
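A single deterministic DDIM update can be sketched as below, using the usual cumulative noise-schedule value ᾱ at the current and the next (less noisy) time. The function name and the toy numbers are assumptions for illustration.

```python
import math

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (no random noise is added)."""
    # Use the predicted noise to estimate the clean sample...
    x0_pred = [(x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
               for x, e in zip(x_t, eps_pred)]
    # ...then re-mix that estimate with the same noise at the lower noise level.
    return [math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1.0 - alpha_bar_prev) * e
            for x0, e in zip(x0_pred, eps_pred)]

# Toy check with made-up numbers: if the noise prediction is exactly right,
# one step lands exactly where the less noisy sample should be.
x0 = [0.3, -0.7]
eps = [1.2, -0.4]
a_t, a_prev = 0.2, 0.6
x_t = [math.sqrt(a_t) * x + math.sqrt(1.0 - a_t) * e for x, e in zip(x0, eps)]
x_prev = ddim_step(x_t, eps, a_t, a_prev)
expected = [math.sqrt(a_prev) * x + math.sqrt(1.0 - a_prev) * e for x, e in zip(x0, eps)]
```

Because nothing random happens inside the step, the same starting noise always produces the same output.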
23:43
The scaling of our step sizes, and especially how these step sizes vary
23:47
throughout a reverse diffusion process, matters a lot in practice.
23:52
When we just removed the random noise steps from our DDPM generation algorithm earlier,
23:57
all of our points ended up near the mean of our data,
24:00
and we saw blurry results for our generated images.
24:03
Switching to our DDIM approach, we now have smaller scaling for our step
24:08
sizes that allow our trajectories to better follow the contour lines of
24:12
our vector field, and land nicely on the correct spiral distribution.
24:18
And applying our DDIM algorithm to our tree in the desert example,
24:21
we're now able to generate nice results.
24:24
Comparing to our original DDPM algorithm that required random steps,
24:29
DDIM remarkably does not require any changes to model training,
24:33
but is able to generate high quality images in significantly fewer steps,
24:37
completely deterministically.
24:40
Note that the theory does not tell us that our individual images or
24:44
points on our spiral will be the same, but instead that our final
24:47
distribution of points or images will be the same,
24:50
regardless of whether we use our stochastic DDPM algorithm or our
24:54
deterministic DDIM algorithm.
24:56
The WAN model we saw earlier uses a generalization of DDIM called flow matching.
25:03
By early 2021, it was clear that diffusion models were capable of generating
25:07
high quality images, and thanks to image generation methods like DDIM,
25:11
it was possible to generate these images without using enormous amounts of compute.
25:17
However, our ability to steer the diffusion process
25:20
using text prompts was still very limited.
25:23
Earlier, we saw how CLIP was able to learn a powerful shared representation
25:27
of images and text by concurrently training image and text encoder models.
25:32
However, these models only go one way, converting text or images into embedding vectors.
25:38
These two problems potentially fit together in a really interesting way.
25:43
Diffusion models are able to potentially reverse the CLIP image encoder,
25:47
generating high quality images, and the output vector of the CLIP text encoder
25:51
could be used to guide our diffusion models toward the images or videos that we want.
25:57
So the high level idea here is that we could pass in a prompt into the CLIP text
26:01
encoder to generate an embedding vector, and use this embedding vector to steer
26:06
the diffusion process towards the image or video of what our prompt describes.
26:11
A team at OpenAI did exactly this in 2022.
26:15
Using image and caption pairs to train a diffusion model to invert the CLIP image encoder.
26:21
Their approach yielded an incredible level of prompt adherence,
26:24
capturing an unprecedented level of detail from the input text.
26:29
The team called their method unCLIP, but their
26:31
model is better known by its commercial name, DALL·E 2.
26:35
But how do we actually use the embedding vectors
26:37
from models like CLIP to steer the diffusion process?
26:41
One option is to simply pass our text vector as another input into our diffusion model,
26:46
and train as we normally would to remove noise.
26:49
If we train our diffusion model using image and caption pairs,
26:53
as the OpenAI team did, the idea here is that the model will learn to
26:56
use the text information to more accurately remove noise from images,
27:00
since it now has more context about the image that it's learning to denoise.
27:05
This technique is called conditioning.
27:07
We used a similar approach earlier, when we conditioned our toy diffusion
27:11
model on the number of time steps elapsed in the diffusion process,
27:14
allowing the model to learn coarse structure for large values of t,
27:18
and finer structures as our training samples get closer to our original spiral.
27:23
Interestingly, there turns out to be a variety of ways
27:26
we can pass in the text vector into our diffusion model.
27:30
Some approaches use a mechanism called cross-attention
27:33
to couple image and text information.
27:35
Other approaches simply add or append the embedded text vector to our diffusion
27:40
model's input, and some approaches pass in text information in multiple ways at once.
27:45
Now it turns out that conditioning alone is not enough to achieve
27:49
the level of prompt adherence that we see in models like DALL·E 2.
27:53
If we take the Stable Diffusion tree in the desert example we've been experimenting with,
27:58
and only condition our model with our text inputs,
28:01
the model no longer gives us everything we ask for.
28:04
We get a shadow in a desert, but no tree.
28:08
Note that Stable Diffusion was developed by a team at Heidelberg University
28:12
around the same time as DALL·E 2, and works in a similar way, but is open source.
28:17
It turns out that there's one more powerful idea that
28:20
we need to effectively steer our diffusion models.
28:23
We can see this idea in action by returning to our toy dataset one last time.
28:28
If our overall spiral corresponds to realistic images,
28:30
then different sections of our spiral may correspond to different types of images.
28:35
Let's say this inner part is images of people, this middle part is images of dogs,
28:40
and this outer part is different images of cats.
28:43
Now let's train the same diffusion model we trained earlier,
28:46
but in addition to passing in our starting coordinates and the
28:49
time of our diffusion process, we'll also pass in the point's class.
28:53
Person, cat, or dog.
28:55
This extra signal should allow our model to steer points to
28:58
the right sections of our spiral, based on each point's class.
29:03
Running our generation process, after assigning person, dog, or cat labels to each point,
29:08
we see that we're able to recover the overall structure of our dataset,
29:11
but the fit is not great, and we see some confusion here between people and dog images.
29:18
Part of the problem here is that we're asking our model to simultaneously learn to point
29:22
to our overall spiral of realistic images, and toward specific classes on our spiral.
29:28
If we consider this cat point for example, it starts off heading
29:31
towards the center of our spiral, and as our class-conditioned vector
29:35
field shifts to point towards a cat region of our spiral,
29:38
our point moves towards this part of the spiral, but it doesn't quite make it.
29:44
The modeling task of generally matching our overall spiral has overpowered
29:47
our model's ability to move our point in the direction of a specific class.
29:53
Now, is there a way to decouple and maybe even control these two factors?
29:58
Remarkably, it turns out that we can.
30:00
The trick is to leverage the differences between the unconditional model that is not
30:04
trained on a specific class, and a model that is conditioned on specific classes.
30:09
We could do this by training two separate models,
30:11
but in practice it's more efficient to just leave out the class information for a
30:15
subset of our training examples.
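One way to picture leaving out the class information for a subset of training examples is the sketch below. The reserved `NULL_CLASS` id, the 10% drop rate, and the label numbering are assumptions chosen for illustration; the video does not specify them.

```python
import numpy as np

NULL_CLASS = 0  # hypothetical id meaning "no class information"

def drop_labels(labels, drop_prob, rng):
    """Replace a random subset of labels with NULL_CLASS.

    Trained this way, a single model sees both conditioned and
    unconditioned examples.
    """
    drop = rng.random(labels.shape[0]) < drop_prob
    return np.where(drop, NULL_CLASS, labels)

rng = np.random.default_rng(0)
labels = rng.integers(1, 4, size=1000)  # assumed ids: 1 = person, 2 = dog, 3 = cat
train_labels = drop_labels(labels, drop_prob=0.1, rng=rng)
print("examples without a class:", int((train_labels == NULL_CLASS).sum()))
```

At generation time, passing `NULL_CLASS` would then request the vector field that points toward the data in general.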
30:18
We now have the option of effectively passing in no class or text
30:21
information into our model, and getting back a vector field that
30:24
points towards our data in general, not towards any specific class.
30:29
We can visualize these two vector fields together.
30:32
Here the gray vectors show where our diffusion model points when we don't pass in any class
30:36
information, and these yellow vectors show where the model points when it is conditioned on the cat
30:40
class.
30:42
For large values of our diffusion time variable, when our training
30:45
data is far from our spiral, our two vector fields basically point
30:48
in the same direction, roughly towards the average of our spiral.
30:53
But as time approaches zero, our vector fields diverge,
30:56
with our cat-conditioned vector field pointing more towards the outer cat
31:00
portion of our spiral.
31:02
Now that we have these two separate directions,
31:04
we can use their differences to push our points more in the direction
31:08
of the class we want.
31:10
Specifically, we take our yellow class-conditioned
31:13
vector and subtract our gray unconditioned vector.
31:16
This gives us a new vector pointing from the tip of our
31:18
unconditioned vector to the tip of our conditioned vector.
31:22
The idea from here is that this direction should point more in the direction of our
31:26
cat examples, now that we've removed the direction generally pointing towards our data.
31:31
We can now amplify this direction by multiplying by a scaling factor, alpha,
31:35
and replace our original conditioned yellow vector with a vector pointing in this new
31:40
direction.
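The subtract-and-amplify step can be sketched in a few lines. This uses one common formulation, assumed here: start from the unconditioned vector and add alpha times the difference. The function name `guided_vector` and the toy 2-D vectors are made up for the example.

```python
import numpy as np

def guided_vector(v_uncond, v_cond, alpha):
    """Amplify the direction from the unconditioned to the conditioned output.

    alpha = 1 gives back v_cond unchanged; alpha > 1 pushes further
    along (v_cond - v_uncond).
    """
    return v_uncond + alpha * (v_cond - v_uncond)

v_uncond = np.array([1.0, 0.0])  # gray: points toward the data in general
v_cond = np.array([1.0, 0.5])    # yellow: points toward the cat region

print(guided_vector(v_uncond, v_cond, alpha=1.0))  # same as v_cond
print(guided_vector(v_uncond, v_cond, alpha=3.0))  # class direction tripled
```

In this toy case only the second component differs between the two vectors, so only that component grows as alpha increases.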
31:41
Let's follow the trajectory of the same cat point we
31:44
saw earlier that didn't quite make it onto our spiral.
31:47
We'll roll back our diffusion time variable and start
31:50
a new green point from the same starting location.
31:53
If we use our new green vectors to guide the diffusion process instead of our original
31:57
yellow vectors, the difference between our gray arrows that point towards the center
32:02
of our spiral and yellow vectors that start pointing us back towards our cat part
32:06
of the spiral is amplified, now guiding our point to land nicely on our spiral.
32:11
This approach is called classifier-free guidance.
32:15
Using our new green vectors to guide a set of cat points,
32:18
we see a nice tight fit to our spiral for this class.
32:22
Switching to our dog class, our unconditional gray vector field stays the same,
32:26
but our dog-conditioned model outputs, shown in magenta,
32:30
now point us more towards the dog part of our spiral.
32:33
And adding guidance amplifies this learned direction.
32:38
Using our guided vectors and running our generation process,
32:41
we see a nice fit for our dog points.
32:44
Finally, we get a third vector field for our people examples
32:47
that again results in nice convergence to our spiral.
32:51
Classifier-free guidance works remarkably well and has become an
32:55
essential part of many modern image and video generation models.
32:59
Earlier, we saw that if we only conditioned our Stable Diffusion model,
33:03
our image would have a desert and a shadow, but no tree that we asked for in the prompt.
33:08
If we add classifier-free guidance to this model,
33:11
once we reach a guidance scale alpha of around 2,
33:14
we start to actually see a tiny tree in our images.
33:17
And the size and detail of our tree improve as we increase our scaling factor, alpha.
33:23
The fact that this works so well is remarkable to me.
33:26
As we use guidance to point our Stable Diffusion model's vector field more in the
33:31
direction of our prompt, our tree literally grows in size and detail in our images.
33:37
Our WAN video generation model takes this guidance approach one step further.
33:41
Instead of subtracting the output of an unconditioned model with no text input,
33:45
the WAN team uses what's known as a negative prompt,
33:48
where they specifically write out all the features they don't want in their video,
33:52
and then subtract the resulting vector from the model's conditioned output
33:56
and amplify the result, steering the diffusion process away from these unwanted features.
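A minimal sketch of the negative-prompt variant, under the assumption that it uses the same subtract-and-amplify arithmetic with the negative-prompt output standing in for the unconditioned one. This is not taken from the WAN source code; the function name and the toy vectors are hypothetical.

```python
import numpy as np

def guide_with_negative_prompt(v_negative, v_cond, alpha):
    """Push the output away from the negative-prompt direction.

    v_negative: model output when conditioned on the negative prompt
    v_cond:     model output when conditioned on the actual prompt
    """
    return v_negative + alpha * (v_cond - v_negative)

v_negative = np.array([0.2, -0.4])  # toward the unwanted features
v_cond = np.array([0.5, 0.3])       # toward the prompt
guided = guide_with_negative_prompt(v_negative, v_cond, alpha=4.0)

# The guided output sits alpha times as far from v_negative as v_cond does.
print(np.linalg.norm(guided - v_negative) / np.linalg.norm(v_cond - v_negative))
```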
34:02
Their standard negative prompt is fascinating,
34:05
including features like extra fingers and walking backwards,
34:08
and interestingly is actually passed into their text encoder in Chinese.
34:13
Here's a video generated using the same astronaut on a horse prompt we used earlier,
34:17
but without the negative prompt.
34:19
It's really interesting to see how the parts of the
34:22
scene get cartoonish and no longer fit together.
34:26
Since the publication of the DDPM paper in the summer of 2020,
34:29
the field has progressed at a blistering pace,
34:32
leading to the incredible text-to-video models that we see today.
34:38
Of all the interesting details that make these models tick,
34:41
the most astounding thing to me is that the pieces fit together at all.
34:46
The fact that we can take a trained text encoder from CLIP or
34:49
elsewhere and use its output to actually steer the diffusion process,
34:53
which itself is highly complex, seems almost too good to be true.
34:59
And on top of that, many of these core ideas can be built from
35:02
relatively simple geometric intuitions that somehow hold in
35:06
the incredibly high-dimensional spaces these models operate in.
35:10
The resulting models feel like a fundamentally new class of machine.
35:15
To create incredibly lifelike and beautiful images and video,
35:18
you no longer need a camera, you don't need to know how to draw or how to paint,
35:23
or how to use animation software.
35:25
All you need is language.
35:29
So this, as you can no doubt tell, was a guest video.
35:32
It comes from Stephen Welch, who runs the channel Welch Labs.
35:35
If somehow you watch this channel and you're not already familiar with Welch Labs,
35:38
you should absolutely go and just watch everything that he's made.
35:42
A while back he made this completely iconic series about imaginary numbers.
35:46
He actually has since turned it into a book, and consistent with everything he makes,
35:50
it's just super high quality, lots of exercises, good stuff like that.
35:54
More recently he's been doing a lot of machine learning content,
35:56
so cannot recommend his stuff highly enough.
35:59
Now the context on why I'm doing guest videos at all is that very
36:02
recently my wife and I had our first baby, which I'm very excited about.
36:05
And I'm not sure what most solo YouTubers do for paternity leave,
36:09
but the way I decided to go about it was to reach out to a few creators whose work I
36:13
really enjoy, and who I'm quite sure you're going to enjoy, and essentially ask, hey,
36:17
what do you feel about me pointing some of the Patreon funds that come towards this
36:21
channel towards you during this time that I'm away,
36:24
and kind of commission pieces to fill the airtime while I'm away.
36:28
The pieces are actually going to be really great.
36:30
I've enjoyed giving some editorial oversight as they're coming in.
36:33
You know, we've got statistical mechanics, we've got machine learning,
36:36
even some modern art.
36:38
It's going to be a good time.
36:39
The next guest video is going to be about a combination of modern art and group theory.
36:43
It's actually very fun.
36:44
And like all the other videos on this channel, if you're a Patreon supporter,
36:47
you can get early views of these ones and provide some feedback before they go live.
36:51
Until then, I hope you thoroughly enjoy binge-watching Welch Labs,
36:54
and again, consider buying the things that he makes.
36:56
There is just as much thought and care put into those as there is into the videos.
37:18
Bye!