Higgins, Irina, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. "β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework."
In order to create a generative model for data $x$, we assume there is some latent distribution $p(z)$ on a latent variable $z$, such that we can transform (decode) samples from this distribution to resemble the data. We maximize a lower bound (ELBO) on the likelihood $p(x)$ over the observed data, thus keeping the data fixed as the truth while letting our model vary (like in Bayesian Methods).
Aside
Within the field of Generative Modeling, VAEs extend autoencoders by imposing an organized, probabilistic structure on the latent space (regularization).
Variational Inference is a method to approximate a distribution. It amounts to parameterizing a family of candidate distributions and using some loss function (such as the KL-divergence) to determine which set of parameters best fits the target distribution.
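As a toy illustration of that idea (a sketch, not from the source): the snippet below parameterizes a Gaussian $q = \mathcal{N}(m, s^2)$ and fits it to a fixed target $\mathcal{N}(3, 2^2)$ by minimizing their closed-form KL-divergence. PyTorch, the specific target, and the optimizer settings are assumptions for illustration only.

```python
import math
import torch

mu_p, sigma_p = 3.0, 2.0                       # fixed target p = N(3, 2^2)
m = torch.tensor(0.0, requires_grad=True)      # variational mean of q
log_s = torch.tensor(0.0, requires_grad=True)  # variational log-std of q (log-space keeps s > 0)
opt = torch.optim.Adam([m, log_s], lr=0.05)

for step in range(2000):
    s = log_s.exp()
    # Closed-form KL( N(m, s^2) || N(mu_p, sigma_p^2) )
    kl = math.log(sigma_p) - log_s + (s**2 + (m - mu_p)**2) / (2 * sigma_p**2) - 0.5
    opt.zero_grad()
    kl.backward()
    opt.step()

print(m.item(), log_s.exp().item())  # approaches (3.0, 2.0): q matches the target
```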
Formulation
The prior is $p(z)$, and $p_\theta(x \mid z)$ is the probabilistic decoder, taking input $z$ and returning a distribution over the data $x$. By using the result above, we can theoretically use the prior and the likelihood to calculate the posterior $p(z \mid x)$ as
$$
p(z \mid x) = \frac{p_\theta(x \mid z)\, p(z)}{p(x)} = \frac{p_\theta(x \mid z)\, p(z)}{\int p_\theta(x \mid z)\, p(z)\, dz},
$$
but the integral in the denominator is an integration in whatever dimension $z$ is, making it intractable.
Thus, we use Variational Inference to approximate $p(z \mid x)$ with a distribution $q_\phi(z \mid x)$. Note that the true posterior depends on $x$, so $q$ must be tied to $x$ for this approximation to be accurate. We choose $q_\phi(z \mid x)$ to be Gaussian, with $q_\phi(z \mid x) = \mathcal{N}\!\left(z;\ \mu_\phi(x),\ \operatorname{diag}(\sigma_\phi^2(x))\right)$. Taking input $x$ and outputting a distribution in the latent space, $q_\phi(z \mid x)$ is the probabilistic encoder, aiming to approximate the posterior.
Similarly, we assume that $p_\theta(x \mid z)$ is also Gaussian, with $p_\theta(x \mid z) = \mathcal{N}\!\left(x;\ \mu_\theta(z),\ \sigma^2 I\right)$. This allows us to write the log-likelihood term as a mean-squared error (MSE). Note that the encoder mean/variance $\mu_\phi(x), \sigma_\phi^2(x)$ and the decoder mean $\mu_\theta(z)$ are often parameterized through the standard machinery in Deep Learning: Neural Networks.
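As a concrete sketch (assuming PyTorch), the encoder and decoder below each output the parameters of their Gaussian; the dimensions `x_dim`, `h_dim`, `z_dim` and the single hidden layer are hypothetical placeholders, not a prescribed architecture.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """q_phi(z|x): maps x to the mean and std of a diagonal Gaussian over z."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)          # mu_phi(x)
        self.log_sigma = nn.Linear(h_dim, z_dim)   # log sigma_phi(x), log-space keeps sigma positive

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.log_sigma(h).exp()

class Decoder(nn.Module):
    """p_theta(x|z): maps z to the mean of a Gaussian over x (fixed noise std)."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def forward(self, z):
        return self.net(z)  # mu_theta(z)
```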
Evidence Lower Bound (ELBO)
Begin with
$$
p(x) = \frac{p_\theta(x \mid z)\, p(z)}{p(z \mid x)} = p_\theta(x \mid z)\cdot\frac{p(z)}{q_\phi(z \mid x)}\cdot\frac{q_\phi(z \mid x)}{p(z \mid x)}.
$$
Take the $\log$, then split into three terms:
$$
\log p(x) = \log p_\theta(x \mid z) + \log\frac{p(z)}{q_\phi(z \mid x)} + \log\frac{q_\phi(z \mid x)}{p(z \mid x)}.
$$
Then, take the expectation over $z$ drawn from the approximate posterior, $z \sim q_\phi(z \mid x)$, which turns the fraction terms into KL-divergence terms:
$$
\log p(x) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right) + D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z \mid x)\right).
$$
The left-hand side does not depend on $z$, so its expectation is dropped.
The $D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z \mid x)\right)$ term involves the intractable posterior. This term is also nonnegative, thus dropping it gives the inequality
$$
\log p(x) \ \ge\ \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right) \ =: \ \mathrm{ELBO}.
$$
Note that the ELBO is also equal to $\log p(x) - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z \mid x)\right)$, so the gap in the bound is exactly the divergence between the approximate and true posteriors.
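As a sanity check on the bound (a sketch, not from the source), the snippet below evaluates it in a toy model where everything is analytic: $p(z) = \mathcal{N}(0,1)$ and $p(x \mid z) = \mathcal{N}(z, 1)$, so $p(x) = \mathcal{N}(0, 2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$. Any Gaussian $q$ gives $\mathrm{ELBO} \le \log p(x)$, with equality at the true posterior. The specific toy model and test points are assumptions for illustration.

```python
import numpy as np

def log_px(x):
    return -0.5 * np.log(2 * np.pi * 2.0) - x**2 / 4.0  # log N(x; 0, 2)

def elbo(x, m, s):
    # E_q[log p(x|z)] for p(x|z) = N(x; z, 1) and q(z|x) = N(m, s^2)
    recon = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m)**2 + s**2)
    # Analytic KL( N(m, s^2) || N(0, 1) )
    kl = 0.5 * (m**2 + s**2 - 1.0 - np.log(s**2))
    return recon - kl

x = 1.3
for m, s in [(0.0, 1.0), (0.65, 0.7), (x / 2, np.sqrt(0.5))]:  # last pair is the true posterior
    print(f"m={m:.2f}, s={s:.2f}: ELBO={elbo(x, m, s):.4f} <= log p(x)={log_px(x):.4f}")
```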
Info-VAE or MMD-VAE
A standard VAE forms a loss as a reconstruction term plus a latent-space structure term, such as
$$
\mathcal{L}(x) = -\,\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] + D\!\left(q_\phi(z \mid x)\,\|\,p(z)\right),
$$
where $D$ is taken as the KL-Divergence. In the above, the $D\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)$ term encourages $q_\phi(z \mid x)$ to match $p(z)$, regardless of $x$. So the solution preferred by that term alone would be $q_\phi(z \mid x) = p(z)$ for every $x$, which would imply that the encoder actually contains no information from the data $x$. So while we could tune a weight $\beta$ on the divergence term (as in β-VAE) to find a balancing point between these competing objectives, we can instead loosen the requirement itself. While we want the latent space to have some structure, we want this in aggregate (in expectation over the data), not for each data sample individually.
Motivated by this, we can replace the per-sample latent-space structure term with a term on the average (aggregate) posterior $q_\phi(z) = \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[q_\phi(z \mid x)\right]$. By Jensen's Inequality, if $D$ is convex in the first argument (like the KL-Divergence is),
$$
D\!\left(\mathbb{E}_{x}\!\left[q_\phi(z \mid x)\right]\,\|\,p(z)\right) \ \le\ \mathbb{E}_{x}\!\left[D\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)\right].
$$
Thus, the aforementioned loss is unnecessarily strict for our goal. The new form would be
$$
\mathcal{L}(x) = -\,\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] + D\!\left(q_\phi(z)\,\|\,p(z)\right),
$$
where we approximate the expectation over $x$ by cycling through the training data. The key difference here is that the divergence term depends on the distribution $q_\phi(z)$, i.e. the encoder distribution marginalized over $x$.
In this form, we can't compute $D\!\left(q_\phi(z)\,\|\,p(z)\right)$ analytically like before. Thus, we'd have to estimate it anyways, so we may as well take $D$ as another divergence, such as the Maximum Mean Discrepancy (MMD). To do this, we take a batch of $x_1, \dots, x_B$, compute the encoder outputs (such as $\mu_\phi(x_i)$, $\sigma_\phi(x_i)$), and take one $z_i$ from each of these (reparameterization trick). The prior $p(z)$ is easier to sample from directly. We then use these samples to compute an empirical estimate of the MMD (replace expectations with empirical means). The negative log-likelihood term stays the same.
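Here is a minimal sketch of that empirical MMD estimate, assuming PyTorch and a Gaussian (RBF) kernel; the kernel choice and `bandwidth` value are illustrative assumptions rather than anything prescribed by these notes.

```python
import torch

def rbf_kernel(a, b, bandwidth=1.0):
    # a: (n, d), b: (m, d) -> (n, m) matrix of k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))
    sq_dists = torch.cdist(a, b) ** 2
    return torch.exp(-sq_dists / (2 * bandwidth**2))

def mmd(z_q, z_p, bandwidth=1.0):
    """Empirical (biased) MMD^2 between samples z_q ~ q_phi(z) and z_p ~ p(z)."""
    return (rbf_kernel(z_q, z_q, bandwidth).mean()
            - 2 * rbf_kernel(z_q, z_p, bandwidth).mean()
            + rbf_kernel(z_p, z_p, bandwidth).mean())

# Usage sketch: one z per data point in the batch (reparameterized), plus prior samples.
# mu, sigma = encoder(x_batch)                 # shapes (B, z_dim)
# z_q = mu + sigma * torch.randn_like(sigma)   # samples from the aggregate posterior q_phi(z)
# z_p = torch.randn_like(z_q)                  # samples from the prior N(0, I)
# mmd_term = mmd(z_q, z_p)
```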
From Expectation to a Practical Loss
For the vanilla VAE, we had the sample loss as
$$
\mathcal{L}(x) = -\,\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] + D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right).
$$
The first expectation is approximated empirically with a single sample $z \sim q_\phi(z \mid x)$. This is the basis of the reparameterization trick, which effectively allows sampling from a parameterized distribution without breaking the computational graph for automatic differentiation. Because $q_\phi(z \mid x)$ is Gaussian, we can draw $\epsilon \sim \mathcal{N}(0, I)$ from the standard normal, then scale by the standard deviation and add the mean, giving $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ (these are really the objects the encoder outputs anyways). The KL-divergence has an analytic form when it is between Gaussians, so it simplifies to an expression just in terms of the mean and standard deviation of $q_\phi(z \mid x)$, which are $\mu_\phi(x)$ and $\sigma_\phi(x)$ (diagonal covariance).
So the loss is really
$$
\mathcal{L}(x) \approx -\log p_\theta(x \mid z) + \tfrac{1}{2}\sum_{j=1}^{d}\left(\mu_{\phi,j}^2(x) + \sigma_{\phi,j}^2(x) - 1 - \log \sigma_{\phi,j}^2(x)\right), \qquad z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon.
$$
Further, by assuming that there is additive Gaussian noise on the reconstruction target (not that the data is Gaussian), with mean $\mu_\theta(z)$ and standard deviation $\sigma$, the reconstruction term simplifies to the Mean Squared Error, giving the final form as
$$
\mathcal{L}(x) \approx \frac{1}{2\sigma^2}\left\|x - \mu_\theta(z)\right\|^2 + \tfrac{1}{2}\sum_{j=1}^{d}\left(\mu_{\phi,j}^2(x) + \sigma_{\phi,j}^2(x) - 1 - \log \sigma_{\phi,j}^2(x)\right) + \text{const}.
$$
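A minimal sketch of this per-sample loss under the Gaussian assumptions above (PyTorch assumed); `sigma_x` is a hypothetical knob for the reconstruction noise scale, and the additive constant is dropped.

```python
import torch

def reparameterize(mu, sigma):
    eps = torch.randn_like(sigma)      # eps ~ N(0, I); carries all the randomness
    return mu + sigma * eps            # z ~ N(mu, diag(sigma^2)), differentiable in mu, sigma

def vae_loss(x, x_mean, mu, sigma, sigma_x=1.0):
    # Reconstruction: MSE arising from the Gaussian log-likelihood (up to an additive constant)
    recon = ((x - x_mean) ** 2).sum(dim=-1) / (2 * sigma_x**2)
    # Analytic KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions
    kl = 0.5 * (mu**2 + sigma**2 - 1.0 - torch.log(sigma**2)).sum(dim=-1)
    return (recon + kl).mean()         # batch mean of the per-sample losses
```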
In practice, we use some Unconstrained Optimization method over "batch" estimates of this per-sample loss as a surrogate for the expected loss:
$$
\mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\mathcal{L}(x)\right] \ \approx\ \frac{1}{B}\sum_{i=1}^{B} \mathcal{L}(x_i).
$$
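Stitching the sketches above together, a minimal training loop might look like the following; `loader`, the optimizer choice, learning rate, and epoch count are placeholders, not from the source.

```python
import torch

encoder, decoder = Encoder(), Decoder()          # hypothetical modules from the sketch above
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

for epoch in range(10):
    for x in loader:                              # mini-batches approximate the expectation over the data
        mu, sigma = encoder(x)
        z = reparameterize(mu, sigma)             # one latent sample per data point
        x_mean = decoder(z)
        loss = vae_loss(x, x_mean, mu, sigma)     # batch estimate of the expected per-sample loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```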