Similar to the setup for VAEs, we begin with a fundamental modeling assumption of $p(x_{0:T}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$, where $p(x_T)$ is some tractable distribution and $x_0$ is based on data. We encode $x_0$ exactly with a sequence of normal distributions $q(x_t \mid x_{t-1})$. Thus, the encoder has no parameters, and the posterior $q(x_{1:T} \mid x_0)$ can be calculated exactly. (No need for an approximate $q_\phi(z \mid x)$ like in VAEs.) We want a denoiser $p_\theta(x_{t-1} \mid x_t)$, which tries to denoise $x_t$ back to $x_{t-1}$. This is the only term that is trained -- i.e. only the decoder of the VAE. Unlike vanilla VAEs, there are multiple levels of noise / latent spaces, so a diffusion model is more similar to a Hierarchical Markovian VAE (HMVAE), i.e. a VAE whose latents form a Markov chain.
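Concretely (a standard consequence of the Markov assumption above, with $x_0$ the data and $x_{1:T}$ the latents), the parameter-free encoder factorizes over the chain,

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}),$$

so the posterior over all latents is available in closed form, and only the reverse-time denoiser $p_\theta(x_{t-1} \mid x_t)$ carries trainable parameters.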
Within the paradigm of Generative Modeling, where we seek the data-generating distribution $p(x)$ given samples $x^{(1)}, \dots, x^{(N)} \sim p(x)$, diffusion models apply invertible operations that eventually push the samples to some simple distribution; sampling can then be done by drawing from that simple distribution and inverting the operations.
Evidence Lower Bound (ELBO)
Suppose there are some latent (random) variables $z$ which may help us model $p(x)$, with an associated joint probability function $p(x, z)$. We can obtain $p(x)$ by marginalizing / integrating out $z$,

$$p(x) = \int p(x, z)\, dz.$$
Integrating out $z$ is difficult in higher dimensions, and we don't have a "ground truth latent encoder" $p(z \mid x)$. For now, I'm not clear on the meaning of this, but it may not be necessary. We can derive the ELBO through the following:

$$\log p(x) = \log \int p(x, z)\, dz = \log \int q_\phi(z \mid x)\, \frac{p(x, z)}{q_\phi(z \mid x)}\, dz = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p(x, z)}{q_\phi(z \mid x)}\right] \geq \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(x, z)}{q_\phi(z \mid x)}\right].$$
Note that the expectation is with respect to the distribution $q_\phi(z \mid x)$, which is taken as the density in the integrand. The $\geq$ comes from Jensen's inequality, where a convex combination of values of a convex function ($-\log$ here) is greater than or equal to the convex function applied to the convex combination. Or in other words, for a convex function, a point on the secant (a convex combination of the endpoints evaluated on the function) lies above the function itself at that same point.
Here, $q_\phi(z \mid x)$ is also the "variational posterior" in Variational Inference.
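As a quick numerical sanity check of the Jensen step (values chosen purely for illustration): take a random variable equal to $1$ or $4$ with probability $\tfrac{1}{2}$ each. Then $-\log\!\left(\tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 4\right) = -\log 2.5 \approx -0.92$, while $\tfrac{1}{2}(-\log 1) + \tfrac{1}{2}(-\log 4) \approx -0.69$, and indeed $-0.69 \geq -0.92$, matching $\mathbb{E}[-\log X] \geq -\log \mathbb{E}[X]$.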
We can decompose the ELBO into two terms: a reconstruction term, and a prior matching term.
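Concretely (a standard rearrangement, assuming the generative joint factors as $p(x, z) = p_\theta(x \mid z)\, p(z)$):

$$\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(x, z)}{q_\phi(z \mid x)}\right] = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]}_{\text{reconstruction}} - \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)}_{\text{prior matching}}.$$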
Sequence of mappings
To map from $x_0$ (data) to $x_T$ (noise), apply many small noising steps $q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$: $x_0 \to x_1 \to \dots \to x_T$. This makes consecutive steps close to one another. This in turn makes the reverse conditional $q(x_{t-1} \mid x_t)$ approximately Gaussian. Or in other words, the inverse of each step is (approximately) a Gaussian. The sequence $\beta_1, \dots, \beta_T$ defines the "variance schedule".
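A minimal sketch of this forward process (assuming a linear $\beta_t$ schedule and the standard closed-form marginal $q(x_t \mid x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\right)$ with $\bar\alpha_t = \prod_{s \le t} (1 - \beta_s)$; the names `betas`, `q_sample_step`, `q_sample` are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear variance schedule beta_1, ..., beta_T
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod_{s <= t} (1 - beta_s)

def q_sample_step(x_prev, t):
    """One noising step: a draw from q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    noise = rng.standard_normal(np.shape(x_prev))
    return np.sqrt(1.0 - betas[t]) * x_prev + np.sqrt(betas[t]) * noise

def q_sample(x0, t):
    """Jump directly to x_t via the closed-form marginal q(x_t | x_0)."""
    noise = rng.standard_normal(np.shape(x0))
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

# Toy 2-D data point pushed most of the way towards N(0, I)
x0 = np.array([1.5, -0.5])
xT = q_sample(x0, t=T - 1)
```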
A "score function" comes in and simplifies, based on Tweedie's Formula.
TODO
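For reference (a standard statement, not worked through here): for a Gaussian-corrupted variable $z \sim \mathcal{N}(\mu_z, \Sigma_z)$, Tweedie's Formula gives the posterior mean of $\mu_z$ in terms of the score,

$$\mathbb{E}[\mu_z \mid z] = z + \Sigma_z\, \nabla_z \log p(z).$$

Applying this to $q(x_t \mid x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\right)$ relates the model's estimate of $x_0$ to the score $\nabla_{x_t} \log p(x_t)$.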
The diffusion process (i.e. the chain $x_0 \to x_1 \to \dots \to x_T$ and its reverse, in the limit of many small steps) can be represented as a stochastic differential equation (SDE) or Langevin equation.
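A minimal sketch of Langevin-type sampling with a known score, to make the connection concrete (toy 1-D Gaussian target; `score`, `langevin_sample`, the step size, and the step count are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x, mu=3.0, sigma=2.0):
    """Closed-form score grad_x log p(x) for a 1-D Gaussian N(mu, sigma^2)."""
    return -(x - mu) / sigma**2

def langevin_sample(n_steps=2000, eps=1e-2, x_init=0.0):
    """Unadjusted Langevin dynamics: x <- x + eps * score(x) + sqrt(2 * eps) * noise."""
    x = x_init
    for _ in range(n_steps):
        x = x + eps * score(x) + np.sqrt(2.0 * eps) * rng.standard_normal()
    return x

samples = np.array([langevin_sample() for _ in range(200)])
print(samples.mean(), samples.std())   # roughly (3, 2) for a small step size
```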
Look more into this from the lens of Stochastic Differential Equations.