Title | Venue | Year | Code | Review |
---|---|---|---|---|
DDPM, Denoising Diffusion Probabilistic Models | NeurIPS | '20 | code | review |
- presented high-quality image samples using diffusion models
- found connections between diffusion models and:
  - variational inference for training Markov chains
  - denoising score matching
  - annealed Langevin dynamics (and energy-based models by extension)
  - autoregressive models
  - progressive lossy compression
Imagine we take an image, add a bit of Gaussian noise to it, and repeat this many times; eventually we'll have an unrecognizable image of static, a sample of pure noise (a toy sketch follows the list below).
- A diffusion model is trained to undo this process.
- Diffusion models are inspired by non-equilibrium thermodynamics.
- Define a Markov chain of diffusion steps to slowly add random noise to data and then learn to reverse the diffusion process to construct desired data samples from the noise.
- Simply put, the forward process gradually adds noise to the original image.
- The image becomes pure Gaussian noise at the final step.
- This process does not involve the model and is fixed throughout.
- $x_0$: target image
- $x_T$: random Gaussian noise
- The task of the model is to remove the noise so that it can restore the original image.
- The goal of each step in restoring the image from Gaussian noise is to turn $x_t$ back into $x_{t-1}$.
- The original image is labeled $x_0$, and the final noise is labeled $x_T$.
- The sequence from $x_1$ to $x_T$ is labeled $x_{1:T}$.
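As a toy illustration of the noising process described above, here is a minimal sketch (not the paper's code; the function name, step count, and noise scale are assumptions) that repeatedly mixes a little Gaussian noise into an image until it becomes pure static:

```python
import numpy as np

def gradually_add_noise(x0, num_steps=1000, noise_scale=0.02, seed=0):
    """Toy sketch: repeatedly mix a small amount of Gaussian noise into an image.

    x0: float array (e.g. H x W x C); after enough steps the result is
    visually indistinguishable from pure noise.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(num_steps):
        # Shrink the current image slightly and add fresh Gaussian noise,
        # mirroring a variance-preserving forward step.
        x = np.sqrt(1.0 - noise_scale) * x + np.sqrt(noise_scale) * rng.standard_normal(x.shape)
    return x

x0 = np.full((64, 64, 3), 0.5)   # a flat gray "image"
xT = gradually_add_noise(x0)     # looks like pure static
```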
Introduction to the ELBO
We want the surrogate distribution $q(Z)$ (typically a normal distribution) to be as close as possible to the true posterior $P(Z|X)$.
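For reference, the standard decomposition of the log-evidence (not written out in the notes above) that connects these two quantities is:

$$\log P(X) = \underbrace{\mathbb{E}_{q(Z)}\!\left[\log \frac{P(X, Z)}{q(Z)}\right]}_{\text{ELBO}} + D_{\mathrm{KL}}\big(q(Z)\,\|\,P(Z|X)\big)$$

Since $\log P(X)$ does not depend on $q$, maximizing the ELBO is equivalent to minimizing the KL divergence.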
- Note that $P(Z|X)$ can be an intractable distribution that we cannot evaluate directly.
- We want the Kullback-Leibler divergence (KLD) between $q(Z)$ and $P(Z|X)$ to be as small as possible.
- In a VAE (Variational Autoencoder), $q_\phi$ is the encoder and $p_\theta$ is the decoder.
- We use the ELBO (Evidence Lower Bound) as the loss function to train the model so that the output gets as close as possible to the input.
- Combining the concept of the Markov chain, we replace the latent variable $Z$ with $x_{1:T}$, and we get:
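This should give the variational bound from the DDPM paper (up to notation):

$$\mathbb{E}\big[-\log p_\theta(x_0)\big] \le \mathbb{E}_q\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right] = \mathbb{E}_q\!\left[-\log p(x_T) - \sum_{t \ge 1} \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\right] =: L$$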
- $x_1, ..., x_T$ are latents of the same dimensionality as the data $x_0 \sim q(x_0)$.
- Schematic diagram:
- Note that the encoding part of the diffusion model is fixed.
- It is the process of adding noise to the original image according to a variance schedule.
- It can be seen as the encoder of a VAE (Variational Autoencoder), but it contains no model parameters and only outputs samples according to a pre-set schedule.
- The forward process is fixed to a Markov chain that gradually adds Gaussian noise to the data according to a variance schedule $\beta_1, ..., \beta_T$.
- The noise added at each step of the forward process is controlled by this fixed variance schedule.
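Concretely, the forward process as defined in the DDPM paper is:

$$q(x_{1:T}|x_0) = \prod_{t=1}^{T} q(x_t|x_{t-1}), \qquad q(x_t|x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I\big)$$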
- Here, after the mathematical derivation (see appendix):
When calculating the KL divergence between $q(x_{t-1}|x_t, x_0)$ and $p_\theta(x_{t-1}|x_t)$:
- Since $q(x_{t-1}|x_t, x_0)$ is the posterior of the forward process, and because the variance schedule is fixed, it is tractable: it follows a normal distribution with mean $\tilde\mu_t$ and variance $\tilde\beta_t$. (proof):
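With $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$, the tractable posterior from the paper is:

$$q(x_{t-1}|x_t, x_0) = \mathcal{N}\big(x_{t-1};\, \tilde\mu_t(x_t, x_0),\, \tilde\beta_t I\big)$$

$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\, x_t, \qquad \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t$$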
- Substitute $\tilde\mu_t$ into the KLD formula:
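Because both Gaussians have fixed (non-learned) variances, the KL term reduces to a squared distance between their means, up to a constant:

$$L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\,\big\|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\big\|^2\right] + C$$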
- Through the nice property derived from reparameterization, we can express both $\mu_\theta$ and $\tilde\mu_t$ in terms of $\epsilon$, $x_t$, and $\alpha$.
- Using the reparameterization trick and $\epsilon$, we can derive $x_t$ in closed form:
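The closed form of this forward marginal is:

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$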
- For simplicity, denote the resulting weighting term as $w_t$.
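Substituting this parameterization turns the per-step loss into a weighted $\epsilon$-prediction error; the weight called $w_t$ here corresponds (up to notation) to the coefficient in the paper:

$$L_{t-1} - C = \mathbb{E}_{x_0, \epsilon}\!\left[\underbrace{\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar\alpha_t)}}_{w_t}\,\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\, t\big)\big\|^2\right]$$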
- This paper proposes ignoring $w_t$, which allows training to focus on the more challenging cases with large noise $\epsilon$.
- This can be understood as the model predicting $\epsilon_{\theta, t}$ at time $t$ from $x_t$ and $t$.
- The loss function measures the gap between the random variable $\epsilon$ and $\epsilon_{\theta, t}$.
- Ignoring $w_t$, we arrive at the simplified training algorithm (a sketch follows below).
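Dropping $w_t$ gives the simplified objective $L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\big[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\big]$. Below is a minimal PyTorch-style sketch of one training step under that objective; `eps_model`, the linear schedule, and the hyperparameter values are illustrative assumptions, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and cumulative product alpha_bar (assumed values)."""
    betas = torch.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    return betas, alpha_bars

def training_step(eps_model, x0, alpha_bars):
    """One training step with the simplified (unweighted) epsilon loss.

    eps_model: any network taking (x_t, t) and returning a tensor shaped like x_t.
    x0: batch of clean images, shape (B, C, H, W).
    """
    B = x0.shape[0]
    T = alpha_bars.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)        # 1. sample a timestep per image
    eps = torch.randn_like(x0)                              # 2. sample Gaussian noise
    a_bar = alpha_bars.to(x0.device)[t].view(B, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps    # 3. closed-form forward sample
    eps_pred = eps_model(x_t, t)                            # 4. predict the noise
    return F.mse_loss(eps_pred, eps)                        # 5. unweighted epsilon loss
```

In a full training loop one would call `loss = training_step(model, batch, alpha_bars)`, then `loss.backward()` and an optimizer step.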