Diffusion Models

#todo

Resources

Main Idea

Similar to the setup for VAEs, we begin with a fundamental modeling assumption of a joint distribution $p(x, z)$, where $z$ is a latent variable with a tractable distribution and $x$ comes from the data. We encode exactly with a fixed sequence of normal distributions. Thus the encoder has no parameters, and the encoding distribution $q(z|x)$ is known exactly in closed form (no need to learn an approximate posterior as in VAEs). We want a denoiser $p_\theta(x|z)$, which tries to denoise $z$ back to $x$. This is the term that is trained -- i.e. only the decoder of the VAE. Unlike vanilla VAEs, there are multiple levels of noise / latent spaces, so a diffusion model is more similar to a Hierarchical Markovian chain VAE (HMVAE).
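A minimal sketch of this split in code, assuming a hypothetical `denoiser(x_t, t)` network and the usual $\bar\alpha_t$ notation for the fixed Gaussian encoder; the loss is a simple $x_0$-reconstruction stand-in for the full objective:

```python
import torch

def training_step(denoiser, x0, alpha_bars):
    """One diffusion training step: the "encoder" is fixed Gaussian noising
    (no parameters); only the denoiser p_theta(x | x_t, t) is trained."""
    B = x0.shape[0]
    t = torch.randint(1, len(alpha_bars), (B,))              # random noise level per example
    a_bar = alpha_bars[t].view(B, *([1] * (x0.dim() - 1)))   # broadcast to x0's shape
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps       # exact, parameter-free encoder
    x0_hat = denoiser(x_t, t)                                 # trainable decoder / denoiser
    return torch.mean((x0_hat - x0) ** 2)
```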

Within the paradigm of generative modeling, where we seek the data-generating distribution $p_X$ given samples $x$, diffusion models apply a sequence of invertible operations that eventually push the samples $x$ to some simple distribution; sampling is then done by drawing from that simple distribution and inverting the operations.
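A sketch of that sampling recipe, assuming a hypothetical `reverse_step(x, t)` that applies one learned inversion/denoising step:

```python
import torch

@torch.no_grad()
def sample(reverse_step, shape, T):
    """Draw from the simple distribution, then invert the noising
    operations one step at a time: x_T -> x_{T-1} -> ... -> x_0."""
    x = torch.randn(shape)            # x_T ~ N(0, I), the simple distribution
    for t in reversed(range(1, T + 1)):
        x = reverse_step(x, t)        # hypothetical learned inverse of step t
    return x
```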

Evidence Lower Bound (ELBO)

Suppose there are some latent (random) variables $z$ which may help us model $p_X$, with an associated joint probability density $p(x, z)$. We can obtain $p_X$ by marginalizing / integrating out $z$,

$$p_X(x) = \int p(x, z)\, dz.$$

Or by applying Bayes' theorem,

$$p_X(x) = \frac{p(x, z)}{p(z|x)}.$$

Integrating out $z$ is difficult in higher dimensions, and we don't have access to a "ground truth latent encoder" $p(z|x)$. For now, I'm not clear on the meaning of this, but it may not be necessary. We can derive the ELBO as follows:

$$
\begin{aligned}
\log p_X(x) &= \log \int p(x, z)\, dz \\
&= \log \int \frac{p(x, z)}{q_\phi(z|x)}\, q_\phi(z|x)\, dz \\
&= \log \mathbb{E}_{q_\phi(z|x)}\!\left[\frac{p(x, z)}{q_\phi(z|x)}\right] \\
&\geq \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p(x, z)}{q_\phi(z|x)}\right].
\end{aligned}
$$

Note that the expectation is with respect to the distribution $q_\phi(z|x)$, which is taken as the density in the integrand. The inequality comes from Jensen's inequality: since $\log$ is concave, the log of an expectation is at least the expectation of the log, i.e. $\log \mathbb{E}[Y] \geq \mathbb{E}[\log Y]$. Or in other words, for a concave function, a particular point on the secant (a convex combination of the endpoints evaluated on the function) lies below the function itself at that same point.
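As a quick sanity check of the bound, here is a toy Monte Carlo sketch with $p(z) = \mathcal{N}(0, 1)$ and $p(x|z) = \mathcal{N}(z, 1)$, so that $p_X(x) = \mathcal{N}(0, 2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$ (an illustrative model with arbitrary numbers, not anything from above):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.3                                    # an observed data point

# Toy model: p(z) = N(0,1), p(x|z) = N(z,1)  =>  p(x) = N(0,2)
log_px = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))

def elbo(m, s, n=200_000):
    """Monte Carlo ELBO with variational posterior q(z|x) = N(m, s^2)."""
    z = rng.normal(m, s, size=n)
    log_joint = norm.logpdf(x, z, 1.0) + norm.logpdf(z, 0.0, 1.0)   # log p(x,z)
    log_q = norm.logpdf(z, m, s)                                    # log q(z|x)
    return np.mean(log_joint - log_q)

print(log_px)                      # exact log-evidence
print(elbo(0.0, 1.0))              # strictly below log_px (loose bound)
print(elbo(x / 2, np.sqrt(0.5)))   # matches log_px: q equals the true posterior
```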

Here, $q_\phi(z|x)$ is also the "variational posterior" from variational inference.
We can decompose the ELBO into two terms: a reconstruction term and a prior-matching term.
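Concretely, writing $p(x, z) = p_\theta(x|z)\, p(z)$, the bound splits as (the standard decomposition, using the notation from above):

$$
\mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p(x, z)}{q_\phi(z|x)}\right]
= \underbrace{\mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right]}_{\text{reconstruction}}
- \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z|x) \,\|\, p(z)\right)}_{\text{prior matching}}.
$$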

Sequence of mappings

To map from $X$ to $Z$, apply many small noising steps $q(x_t | x_{t-1}) = \mathcal{N}\!\left(\sqrt{\alpha_t}\, x_{t-1},\, (1 - \alpha_t) I\right)$, which keeps consecutive steps close to one another. Composing them gives a closed-form Gaussian, $q(x_t | x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_t}\, x_0,\, (1 - \bar\alpha_t) I\right)$ with $\bar\alpha_t = \prod_{s \le t} \alpha_s$; because the steps are small, the reverse (inverse) transitions are also well approximated by Gaussians. The sequence $\alpha_t$ defines the "variance schedule".
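A minimal numpy sketch of sampling from this closed-form $q(x_t | x_0)$ (the linear-beta schedule here is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative variance schedule: alpha_t = 1 - beta_t with a linear beta ramp.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)              # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x0, (1 - abar_t) I) in one shot."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = rng.standard_normal(8)                  # a toy "data" vector
print(q_sample(x0, t=10))                    # mild noise
print(q_sample(x0, t=T - 1))                 # nearly standard Gaussian
```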

A "score function" comes in and simplifies, based on Tweedie's Formula.

TODO
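For reference, a sketch of how Tweedie's formula would enter here (using the $\bar\alpha_t$ notation from above): for a Gaussian $x_t \sim \mathcal{N}(\mu_{x_t}, \Sigma_{x_t})$, Tweedie's formula gives the posterior mean

$$
\mathbb{E}[\mu_{x_t} \mid x_t] = x_t + \Sigma_{x_t} \nabla_{x_t} \log p(x_t),
$$

and applying it to $q(x_t | x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_t}\, x_0,\, (1 - \bar\alpha_t) I\right)$ yields

$$
\sqrt{\bar\alpha_t}\; \mathbb{E}[x_0 \mid x_t] = x_t + (1 - \bar\alpha_t)\, \nabla_{x_t} \log p(x_t)
\quad\Longrightarrow\quad
\mathbb{E}[x_0 \mid x_t] = \frac{x_t + (1 - \bar\alpha_t)\, \nabla_{x_t} \log p(x_t)}{\sqrt{\bar\alpha_t}},
$$

so predicting the clean sample and predicting the score are interchangeable.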

The mapping $x_0 \to x_t$ (i.e. $x_0 \to z$) can be represented in continuous time as a stochastic differential equation (a Langevin-type equation).
Look more into this from the lens of Stochastic Differential Equations; a sketch of the standard form is below.
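A sketch of the usual continuous-time form (the variance-preserving SDE, with $\beta(t)$ the continuous noise schedule and $w$ a standard Wiener process):

$$
dx = -\tfrac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)}\, dw,
$$

with a corresponding reverse-time SDE

$$
dx = \left[-\tfrac{1}{2}\beta(t)\, x - \beta(t)\, \nabla_x \log p_t(x)\right] dt + \sqrt{\beta(t)}\, d\bar{w},
$$

so, again, sampling reduces to knowing the score $\nabla_x \log p_t(x)$.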

Interpretation via VAEs

Hierarchical Markovian chain VAEs (HMVAEs)...