Maximum Mean Discrepancy

Resources

Main Idea

As a Divergence, the MMD declares two distributions equal exactly when all of their moments match (for a characteristic kernel, $\mathrm{MMD}(P,Q)=0$ if and only if $P=Q$). A kernel measures similarity between pairs of samples, so if the average similarity between samples drawn from within each distribution matches the average similarity between samples drawn across the two distributions, those distributions are the same as one another.

The MMD is also an Integral Probability Metric. Formally, for distributions P(X) and Q(Y) (X from the generator, Y the data in Generative Modeling), this is

$$\mathrm{MMD}(P,Q)=\sup_{f\in\mathcal{F}}\left(\mathbb{E}_P[f(X)]-\mathbb{E}_Q[f(Y)]\right),$$

with

$$\mathcal{F}=\{f:\|f\|_{\mathcal{H}}\leq 1\}.$$

First, the expectations taken with respect to P(X) and Q(Y) are ultimately integrals, partly explaining the name Integral Probability Metric and reaffirming that the MMD is one of these methods. Next, $\mathcal{F}$ is the class of witness functions (or features), each mapping from the domain of X and Y to $\mathbb{R}$. These functions belong to $\mathcal{H}$, a Reproducing Kernel Hilbert Space (RKHS), which has an associated norm (induced by a particular inner product based on the so-called reproducing kernel $k(\cdot,\cdot)$). Importantly, we don't need to search over the possibly infinite number of functions in $\mathcal{F}$: the supremum is attained at

$$f^{\star}(x)\propto\mathbb{E}_P[k(X,x)]-\mathbb{E}_Q[k(Y,x)].$$
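Replacing the two expectations with empirical means gives a usable estimate of the witness function. A minimal sketch, assuming NumPy and a Gaussian RBF kernel (the function and variable names here are illustrative, not from a library):

```python
import numpy as np

def witness(t, x, y, sigma=1.0):
    """Empirical witness value at query points t: mean kernel similarity
    to the x-sample minus mean kernel similarity to the y-sample."""
    def k(a, b):
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(t, x).mean(axis=1) - k(t, y).mean(axis=1)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=(300, 1))  # samples from P
y = rng.normal(2.0, 1.0, size=(300, 1))  # samples from Q
vals = witness(np.array([[0.0], [2.0]]), x, y)
# vals[0] > 0 where P has more mass; vals[1] < 0 where Q has more mass.
```

The sign of the witness shows which distribution places more mass near a point, which is why it is useful for diagnosing *where* two distributions differ, not just *whether* they do.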

Given samples of X, $\{x_i\}_{i=1}^{m}$, and samples of Y, $\{y_j\}_{j=1}^{n}$, both drawn IID (noting potentially $n\neq m$), the empirical estimate of $\mathrm{MMD}^2$ is

$$\widehat{\mathrm{MMD}}^2(X,Y)=\frac{1}{m(m-1)}\sum_{i\neq j}^{m}k(x_i,x_j)+\frac{1}{n(n-1)}\sum_{i\neq j}^{n}k(y_i,y_j)-\frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}k(x_i,y_j).$$
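This estimator is straightforward to implement: compute the three kernel matrices, drop the diagonals of the within-sample ones, and combine. A sketch assuming NumPy and the Gaussian RBF kernel with a fixed bandwidth (in practice the bandwidth is often set by a heuristic such as the median pairwise distance):

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of a and b."""
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-np.maximum(d2, 0.0) / (2 * sigma**2))

def mmd2_unbiased(x, y, kernel=rbf_kernel):
    """Unbiased estimate of MMD^2 between samples x (m, d) and y (n, d)."""
    m, n = len(x), len(y)
    k_xx, k_yy, k_xy = kernel(x, x), kernel(y, y), kernel(x, y)
    # Exclude the diagonal (i == j) terms from the within-sample sums.
    term_x = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_y = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_x + term_y - 2 * k_xy.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(500, 2)), rng.normal(size=(500, 2)))
diff = mmd2_unbiased(rng.normal(size=(500, 2)),
                     rng.normal(3.0, 1.0, size=(500, 2)))
# `same` should be near zero; `diff` should be clearly positive.
```

Because the estimator is unbiased, `same` can come out slightly negative even though the population $\mathrm{MMD}^2$ is nonnegative.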

There remains a choice of kernel, $k$. One can use the Gaussian RBF Kernel

$$k(x,y)=\exp\!\left(-\frac{1}{2\sigma^2}\|x-y\|^2\right),$$

or the rational quadratic kernel

$$k^{\mathrm{rq}}_{\alpha}(x,y)=\left(1+\frac{\|x-y\|^2}{2\alpha}\right)^{-\alpha}.$$

We can make new kernels by summing existing ones (similar to how a sum of Symmetric Positive Definite matrices is again symmetric positive definite). This allows evaluating a mixture of length scales,

$$k^{\mathrm{rq}}(x,y)=\sum_{\alpha\in A}k^{\mathrm{rq}}_{\alpha}(x,y).$$

Further, we can even mix kernels of different types, such as

$$k(x,y)=k^{\mathrm{rq}}(x,y)+\langle x,y\rangle.$$
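Both constructions can be checked numerically: a sum of valid kernels evaluated on any finite sample set should produce a symmetric positive semidefinite Gram matrix. A sketch assuming NumPy, with an illustrative (not canonical) set of scales $A$:

```python
import numpy as np

def rq_kernel(a, b, alpha):
    """Rational quadratic kernel matrix for a single scale alpha."""
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return (1 + np.maximum(d2, 0.0) / (2 * alpha)) ** (-alpha)

def mixed_kernel(a, b, alphas=(0.2, 0.5, 1.0, 2.0, 5.0)):
    """Sum of RQ kernels over a set of scales, plus a linear kernel term."""
    k = sum(rq_kernel(a, b, alpha) for alpha in alphas)
    return k + a @ b.T  # linear kernel <x, y>

rng = np.random.default_rng(2)
x = rng.normal(size=(4, 3))
K = mixed_kernel(x, x)
sym = np.allclose(K, K.T)        # Gram matrix should be symmetric
eigs = np.linalg.eigvalsh(K)     # and positive semidefinite
```

Mixing scales this way (as done for MMD-based generative models) avoids committing to a single bandwidth, which the Gaussian RBF kernel would otherwise force.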