Similar to the setup for VAEs, we begin with a fundamental modeling assumption of $p(x_{0:T}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$, where $p(x_T)$ is some tractable distribution and $x_0$ is based on data. We encode $x_0$ exactly with a sequence of normal distributions $q(x_t \mid x_{t-1})$. Thus, the encoder has no parameters, and the posterior $q(x_{1:T} \mid x_0)$ can be calculated exactly. (No need for an approximate $q_\phi(z \mid x)$ like in VAEs.) We want a denoiser $p_\theta(x_{t-1} \mid x_t)$, which tries to denoise $x_t$ back to $x_{t-1}$. This is the only term that is trained -- i.e. only the decoder of the VAE. Unlike vanilla VAEs, there are multiple levels of noise / latent spaces, so a diffusion model is more similar to a Hierarchical Markovian VAE (HMVAE), i.e. a VAE whose latents form a Markov chain.
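Concretely (a standard consequence of the Markov assumption above, with $x_0$ the data and $x_{1:T}$ the latents), the parameter-free encoder factorizes over the chain,

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}),$$

so the posterior over all latents is available in closed form, and only the reverse-time denoiser $p_\theta(x_{t-1} \mid x_t)$ carries trainable parameters.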
Within the paradigm of Generative Modeling, where we seek the data-generating distribution $p(x)$ given samples $x^{(1)}, \dots, x^{(N)} \sim p(x)$, diffusion models apply invertible operations that eventually push the samples to some simple distribution; sampling can then be done by drawing from that simple distribution and inverting the operations.
Evidence Lower Bound (ELBO)
Suppose there are some latent (random) variables $z$ which may help us model $p(x)$, with an associated joint probability function $p(x, z)$. We can obtain $p(x)$ by marginalizing / integrating out $z$,

$$p(x) = \int p(x, z)\, dz.$$
Integrating out $z$ is difficult in higher dimensions, and we don't have a "ground truth latent encoder" $p(z \mid x)$. For now, I'm not clear on the meaning of this, but it may not be necessary. We can derive the ELBO through the following:

$$\log p(x) = \log \int p(x, z)\, dz = \log \int q_\phi(z \mid x)\, \frac{p(x, z)}{q_\phi(z \mid x)}\, dz = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p(x, z)}{q_\phi(z \mid x)}\right] \geq \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(x, z)}{q_\phi(z \mid x)}\right].$$
Note that the expectation is with respect to the distribution $q_\phi(z \mid x)$, which is taken as the density in the integrand. The $\geq$ comes from Jensen's inequality, where a convex combination of values of a convex function ($-\log$ here) is greater than or equal to the convex function applied to the convex combination. Or in other words, for a convex function, a point on the secant (a convex combination of the endpoints evaluated on the function) lies above the function itself at that same point.
Here, $q_\phi(z \mid x)$ is also the "variational posterior" in Variational Inference.
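As a quick numerical sanity check of the Jensen step (values chosen purely for illustration): take a random variable equal to $1$ or $4$ with probability $\tfrac{1}{2}$ each. Then $-\log\!\left(\tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 4\right) = -\log 2.5 \approx -0.92$, while $\tfrac{1}{2}(-\log 1) + \tfrac{1}{2}(-\log 4) \approx -0.69$, and indeed $-0.69 \geq -0.92$, matching $\mathbb{E}[-\log X] \geq -\log \mathbb{E}[X]$.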
We can decompose the ELBO into two terms: a reconstruction term, and a prior matching term.
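Concretely (a standard rearrangement, assuming the generative joint factors as $p(x, z) = p_\theta(x \mid z)\, p(z)$):

$$\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(x, z)}{q_\phi(z \mid x)}\right] = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]}_{\text{reconstruction}} - \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)}_{\text{prior matching}}.$$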
Sequence of mappings
To map from $x_0$ (data) to $x_T$ (noise), apply many small noising steps $q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$: $x_0 \to x_1 \to \dots \to x_T$. This makes consecutive steps close to one another. This in turn makes the reverse conditional $q(x_{t-1} \mid x_t)$ approximately Gaussian. Or in other words, the inverse of each step is (approximately) a Gaussian. The sequence $\beta_1, \dots, \beta_T$ defines the "variance schedule".
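A minimal sketch of this forward process (assuming a linear $\beta_t$ schedule and the standard closed-form marginal $q(x_t \mid x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\right)$ with $\bar\alpha_t = \prod_{s \le t} (1 - \beta_s)$; the names `betas`, `q_sample_step`, `q_sample` are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear variance schedule beta_1, ..., beta_T
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod_{s <= t} (1 - beta_s)

def q_sample_step(x_prev, t):
    """One noising step: a draw from q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    noise = rng.standard_normal(np.shape(x_prev))
    return np.sqrt(1.0 - betas[t]) * x_prev + np.sqrt(betas[t]) * noise

def q_sample(x0, t):
    """Jump directly to x_t via the closed-form marginal q(x_t | x_0)."""
    noise = rng.standard_normal(np.shape(x0))
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

# Toy 2-D data point pushed most of the way towards N(0, I)
x0 = np.array([1.5, -0.5])
xT = q_sample(x0, t=T - 1)
```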
A "score function" comes in and simplifies, based on Tweedie's Formula.
TODO
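For reference (a standard statement, not worked through here): for a Gaussian-corrupted variable $z \sim \mathcal{N}(\mu_z, \Sigma_z)$, Tweedie's Formula gives the posterior mean of $\mu_z$ in terms of the score,

$$\mathbb{E}[\mu_z \mid z] = z + \Sigma_z\, \nabla_z \log p(z).$$

Applying this to $q(x_t \mid x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\right)$ relates the model's estimate of $x_0$ to the score $\nabla_{x_t} \log p(x_t)$.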
The diffusion process (i.e. the chain $x_0 \to x_1 \to \dots \to x_T$ and its reverse, in the limit of many small steps) can be represented as a stochastic differential equation (SDE) or Langevin equation.
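A minimal sketch of Langevin-type sampling with a known score, to make the connection concrete (toy 1-D Gaussian target; `score`, `langevin_sample`, the step size, and the step count are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x, mu=3.0, sigma=2.0):
    """Closed-form score grad_x log p(x) for a 1-D Gaussian N(mu, sigma^2)."""
    return -(x - mu) / sigma**2

def langevin_sample(n_steps=2000, eps=1e-2, x_init=0.0):
    """Unadjusted Langevin dynamics: x <- x + eps * score(x) + sqrt(2 * eps) * noise."""
    x = x_init
    for _ in range(n_steps):
        x = x + eps * score(x) + np.sqrt(2.0 * eps) * rng.standard_normal()
    return x

samples = np.array([langevin_sample() for _ in range(200)])
print(samples.mean(), samples.std())   # roughly (3, 2) for a small step size
```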
Look more into this from the lens of Stochastic Differential Equations.