Gaussian Process Regression (Kriging)

#todo

Resources

Main Idea

Gaussian Process

Recall that a Stochastic Process is the infinite-dimensional analogue of a random vector.
A Gaussian process (GP) imposes a Gaussian structure on this process. Suppose $x$ ranges over the domain of the function (playing the role of the vector index). For any points $x_1, \ldots, x_n$, a GP $f$ with Kernel $k(x, x')$ has function evaluations distributed according to

$$(f(x_1), \ldots, f(x_n)) \sim \mathcal{N}(\mu, K),$$

where $K_{ij} = k(x_i, x_j)$ (the Gram matrix) and $\mu_i = \mu(x_i)$. The kernel can be further specified according to prior information. If we assume $\mu = 0$, then $k$ fully defines the behavior of the process $f$. More specifically, a covariance kernel is Symmetric Positive Semidefinite.

For instance, if $k$ is stationary, it can be written as $k(x_i, x_j) = k(x_i - x_j)$, and the process $f$ is stationary in both the strict and the wide sense because the first two moments are finite.
More restrictively, if $k(x_i, x_j) = k(\lVert x_i - x_j \rVert)$, then the covariance is isotropic and a Radial Basis Function. The structure of $k$ also encodes information about the smoothness of realizations of $f$, for instance whether they are Square Integrable ($L^2$) functions, which also depends on the spatial domain.
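
As a quick illustration of the definition, here is a minimal NumPy sketch (the isotropic squared-exponential kernel, grid, and length scale are arbitrary choices for illustration): build the Gram matrix on a 1-D grid and draw joint samples $(f(x_1), \ldots, f(x_n)) \sim \mathcal{N}(0, K)$.

```python
import numpy as np

def k(xi, xj, length_scale=1.0):
    """Isotropic squared-exponential covariance: depends only on |xi - xj|."""
    return np.exp(-0.5 * (xi - xj) ** 2 / length_scale**2)

x = np.linspace(0.0, 5.0, 100)        # evaluation points x_1, ..., x_n
K = k(x[:, None], x[None, :])         # Gram matrix K_ij = k(x_i, x_j)
K += 1e-10 * np.eye(len(x))           # jitter for numerical positive semidefiniteness

rng = np.random.default_rng(0)
# joint draw (f(x_1), ..., f(x_n)) ~ N(0, K); each row is one realization of f on the grid
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
```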

Weight-space view

Let us take a step back to the discrete case before continuing with the functional version afterwards.

Let the independent training data be $\mathcal{D} = \{(x_i, y_i) \mid i = 1, \ldots, n\} \subset \mathbb{R}^D \times \mathbb{R}$, or, in matrix form, the design matrix $X \in \mathbb{R}^{D \times n}$ and target vector $y \in \mathbb{R}^n$. We are interested in modeling the distribution of $y$ given some $x$ (i.e. the conditional distribution $p(y \mid x)$, not a generative distribution $p(x)$ or $p(x \mid y)$).

The standard Bayesian Linear Regression paradigm poses a model $f$ with parameters $w \in \mathbb{R}^D$ and a Gaussian noise model on the observations:

$$f(x; w) = x^T w, \qquad y = f(x; w) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2).$$

The model and noise assumption define the likelihood, a probability distribution over the data observations, given the model. This is

$$
\begin{aligned}
p(y \mid X; w) &= \prod_{i=1}^n p(y_i \mid x_i; w) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - f(x_i; w))^2}{2\sigma^2}\right) \\
&= \frac{1}{(2\pi\sigma^2)^{n/2}} \prod_{i=1}^n \exp\!\left(-\frac{(y_i - x_i^T w)^2}{2\sigma^2}\right) \\
&= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i^T w)^2\right) \\
&= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left(-\frac{1}{2\sigma^2} \left\lVert \begin{bmatrix} y_1 - x_1^T w \\ y_2 - x_2^T w \\ \vdots \\ y_n - x_n^T w \end{bmatrix} \right\rVert_2^2\right) \\
&= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left(-\frac{1}{2\sigma^2} \lVert y - X^T w \rVert_2^2\right).
\end{aligned}
$$

In summary, this gives the likelihood as a Multivariate Gaussian:

$$p(y \mid X; w) = \mathcal{N}(X^T w, \sigma^2 I).$$
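
A quick numerical sanity check of this factorization on made-up data (the dimensions, noise level, and seed below are arbitrary): the product of the $n$ univariate Gaussian terms matches the multivariate form.

```python
import numpy as np

rng = np.random.default_rng(1)
D, n, sigma = 3, 5, 0.4
X = rng.normal(size=(D, n))                  # design matrix; columns are the inputs x_i
w = rng.normal(size=D)                       # some parameter value
y = X.T @ w + sigma * rng.normal(size=n)     # noisy observations

# product of the n univariate Gaussian likelihood terms
per_point = np.exp(-(y - X.T @ w) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
lik_product = np.prod(per_point)

# multivariate form N(X^T w, sigma^2 I), evaluated at y
resid = y - X.T @ w
lik_joint = np.exp(-(resid @ resid) / (2 * sigma**2)) / (2 * np.pi * sigma**2) ** (n / 2)

assert np.isclose(lik_product, lik_joint)
```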

The Bayesian framework also requires a prior over the parameter values, which is assumed to also be Gaussian and centered at $0$ without loss of generality (we could normalize the data instead); for prior covariance $\Sigma_p$, this is

$$w \sim \mathcal{N}(0, \Sigma_p).$$

Deriving parameter posterior

The posterior (the distribution over the parameters, given the data and observations) is computed according to Bayes Theorem as

$$p(w \mid X, y) = \frac{p(y \mid X; w)\, p(w)}{p(y \mid X)}.$$

The marginal likelihood in the denominator is independent of the parameters $w$, and is computed according to the (usually intractable) integral

$$p(y \mid X) = \int p(w, y \mid X)\, dw = \int p(y \mid X, w)\, p(w)\, dw.$$

While one could conceivably compute this integral here, there is no need: we will determine that the posterior is Gaussian in shape and can normalize it afterwards thanks to that known form. Thus, we continue by noting

$$
\begin{aligned}
p(w \mid X, y) &\propto p(y \mid X; w)\, p(w) \\
&\propto \exp\!\left(-\frac{1}{2\sigma^2} (y - X^T w)^T (y - X^T w)\right) \exp\!\left(-\frac{1}{2} w^T \Sigma_p^{-1} w\right) \\
&= \exp\!\left(-\frac{1}{2\sigma^2} (y - X^T w)^T (y - X^T w) - \frac{1}{2} w^T \Sigma_p^{-1} w\right) \\
&\propto \exp\!\Big( \underbrace{\sigma^{-2} (X y)^T}_{b^T}\, w - \frac{1}{2} w^T \underbrace{\left(\sigma^{-2} X X^T + \Sigma_p^{-1}\right)}_{A}\, w \Big).
\end{aligned}
$$

We have the argument of the exponential in the form

$$b^T w - \frac{1}{2} w^T A w,$$

which we want to write in the Gaussian form by finding some unknown mean $\bar{w}$ such that it matches

$$-\frac{1}{2} (w - \bar{w})^T A (w - \bar{w}).$$

Setting these two equal, we can drop the $\bar{w}^T A \bar{w}$ term, which is constant in $w$, and find

$$\bar{w} = \sigma^{-2} A^{-1} X y.$$

Thus, the posterior is

$$p(w \mid X, y) = \mathcal{N}\Big(\underbrace{\sigma^{-2} A^{-1} X y}_{\bar{w}},\ \underbrace{\left(\sigma^{-2} X X^T + \Sigma_p^{-1}\right)^{-1}}_{A^{-1}}\Big)

Expanding the parameter mean (and mode / Maximum a Posteriori) gives

$$\bar{w} = \sigma^{-2} \left(\sigma^{-2} X X^T + \Sigma_p^{-1}\right)^{-1} X y.$$
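
A minimal sketch of these formulas on synthetic data (the prior covariance, noise level, and data-generating weights below are arbitrary choices): form $A$, then the posterior mean $\bar{w}$ and covariance $A^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
D, n, sigma = 3, 50, 0.3
Sigma_p = np.eye(D)                              # assumed (illustrative) prior covariance
w_true = rng.normal(size=D)
X = rng.normal(size=(D, n))
y = X.T @ w_true + sigma * rng.normal(size=n)

A = X @ X.T / sigma**2 + np.linalg.inv(Sigma_p)  # A = sigma^-2 X X^T + Sigma_p^-1
w_bar = np.linalg.solve(A, X @ y) / sigma**2     # posterior mean (and MAP estimate)
post_cov = np.linalg.inv(A)                      # posterior covariance A^-1
```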

Recall that the normal equations from (linear) Least Squares give the parameters as

$$w_{\text{OLS}} = (X X^T)^{-1} X y,$$

or the version with Tikhonov Regularization (Ridge Regression) of magnitude $\lambda$:

$$w_{\text{Tik}} = (X X^T + \lambda I)^{-1} X y.$$

We can relate the Bayesian version to these by noting that $\lambda I = \Sigma_p^{-1}$ gives the same coefficient prediction (ignoring $\sigma$); in other words, the Gaussian prior regularizes the coefficients in exactly the same way as Tikhonov Regularization (and we can generalize to non-diagonal regularization if needed). Alternatively, if $\Sigma_p^{-1} = 0$, e.g. by taking infinite prior variances, then we recover the least squares case.
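
A small numerical check of this correspondence on made-up data. Keeping $\sigma$ explicit, the exact match is $\Sigma_p^{-1} = (\lambda / \sigma^2) I$, which reduces to $\lambda I = \Sigma_p^{-1}$ when $\sigma$ is set to one (ignored), as stated above.

```python
import numpy as np

rng = np.random.default_rng(3)
D, n, lam, sigma = 4, 30, 0.5, 0.7
X = rng.normal(size=(D, n))
y = rng.normal(size=n)                             # targets need not follow the model here

# Tikhonov / ridge solution
w_ridge = np.linalg.solve(X @ X.T + lam * np.eye(D), X @ y)

# Bayesian posterior mean with Sigma_p^-1 = (lam / sigma^2) * I
Sigma_p_inv = (lam / sigma**2) * np.eye(D)
A = X @ X.T / sigma**2 + Sigma_p_inv
w_bayes = np.linalg.solve(A, X @ y) / sigma**2

assert np.allclose(w_ridge, w_bayes)
```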

Predictive posterior

Future predictions for a new input $x$ involve pushing the posterior distribution through the forward model at that $x$. In other words, denote the predictive quantity as $f_*$, whose true value is $f(x)$. Its distribution is computed as

$$p(f_* \mid x, X, y) = \int p(f_* \mid x, w)\, p(w \mid X, y)\, dw.$$

That is, sample $w$ from the posterior; then, with that value and the input $x$, compute $f_*$, which in this case is a deterministic map. Because $f$ is linear in $w$, the output is also Gaussian, with mean and covariance adjusted according to the corresponding linear transform:

$$p(f_* \mid x, X, y) = \mathcal{N}\!\left(x^T \bar{w},\ x^T A^{-1} x\right).$$

To predict the distribution of the observations, rather than the underlying function, we would also add in our Gaussian noise model (i.e. add $\sigma^2$ to the variance).
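
A sketch of the predictive formulas at a new input, using synthetic data again (repeated so the snippet is self-contained; all constants are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
D, n, sigma = 3, 50, 0.3
Sigma_p = np.eye(D)
w_true = rng.normal(size=D)
X = rng.normal(size=(D, n))
y = X.T @ w_true + sigma * rng.normal(size=n)

A = X @ X.T / sigma**2 + np.linalg.inv(Sigma_p)
w_bar = np.linalg.solve(A, X @ y) / sigma**2

x_new = rng.normal(size=D)                     # new test input x
f_mean = x_new @ w_bar                         # predictive mean  x^T w_bar
f_var = x_new @ np.linalg.solve(A, x_new)      # predictive variance  x^T A^-1 x
y_var = f_var + sigma**2                       # add the noise variance to predict an observation
```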

Feature space

We can generalize to a different class of models by instead "engineering" features of $x$ via a vector-valued map $\phi(x)$, i.e. $\phi: \mathbb{R}^D \to \mathbb{R}^N$. The matrix $\Phi = \Phi(X) \in \mathbb{R}^{N \times n}$ comes from applying $\phi$ to each column of $X$. We now seek $w \in \mathbb{R}^N$, with one parameter for each feature. The new model is nonlinear in $x$:

$$f(x) = \phi(x)^T w.$$

This fundamentally just changes what inputs we build a linear model on, substituting $\Phi$ for $X$ in the analysis. The predictive posterior distribution changes the most (and is the most relevant part, as it depends directly on $x$), giving

$$f_* \mid x, X, y \sim \mathcal{N}\!\left(\sigma^{-2} \phi(x)^T A^{-1} \Phi y,\ \phi(x)^T A^{-1} \phi(x)\right)$$

with $A = \sigma^{-2} \Phi \Phi^T + \Sigma_p^{-1}$. In this case, predictions, even deterministic ones, hinge on inverting the $N \times N$ matrix $A$, which may be difficult for large $N$.
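
A sketch of the feature-space version, using an arbitrary polynomial feature map $\phi(x) = (1, x, x^2)$ for scalar inputs as the illustrative choice (so $D = 1$, $N = 3$):

```python
import numpy as np

def phi(x):
    """Illustrative polynomial feature map for scalar x: phi(x) = (1, x, x^2)."""
    return np.stack([np.ones_like(x), x, x**2])   # shape (N, n)

rng = np.random.default_rng(5)
n, sigma = 40, 0.2
x_train = rng.uniform(-2.0, 2.0, size=n)
y = np.sin(x_train) + sigma * rng.normal(size=n)  # made-up targets

Phi = phi(x_train)                                # N x n feature matrix Phi(X)
N = Phi.shape[0]
Sigma_p = np.eye(N)                               # prior covariance over the N weights

A = Phi @ Phi.T / sigma**2 + np.linalg.inv(Sigma_p)         # the N x N matrix to invert
phi_new = phi(np.array([0.5]))[:, 0]                        # phi(x) at a new test input
f_mean = phi_new @ np.linalg.solve(A, Phi @ y) / sigma**2   # sigma^-2 phi(x)^T A^-1 Phi y
f_var = phi_new @ np.linalg.solve(A, phi_new)               # phi(x)^T A^-1 phi(x)
```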

Kernel Trick

Let $K = \Phi^T \Sigma_p \Phi$, noting $K \in \mathbb{R}^{n \times n}$. We will derive a new form of the predictive posterior:

$$
\begin{aligned}
\sigma^{-2} \Phi (K + \sigma^2 I) &= \sigma^{-2} \Phi \left(\Phi^T \Sigma_p \Phi + \sigma^2 I\right) \\
&= \sigma^{-2} \Phi \Phi^T \Sigma_p \Phi + \Phi \\
&= \left(\sigma^{-2} \Phi \Phi^T \Sigma_p + I\right) \Phi \\
&= \left(\sigma^{-2} \Phi \Phi^T \Sigma_p + I\right) \Sigma_p^{-1} \Sigma_p \Phi \\
&= \left(\sigma^{-2} \Phi \Phi^T \Sigma_p \Sigma_p^{-1} + \Sigma_p^{-1}\right) \Sigma_p \Phi \\
&= \left(\sigma^{-2} \Phi \Phi^T + \Sigma_p^{-1}\right) \Sigma_p \Phi \\
&= A \Sigma_p \Phi.
\end{aligned}
$$

Now, left-multiply by $A^{-1}$ and right-multiply by $(K + \sigma^2 I)^{-1}$ to give

$$
\begin{aligned}
A^{-1} \sigma^{-2} \Phi (K + \sigma^2 I)(K + \sigma^2 I)^{-1} &= A^{-1} A \Sigma_p \Phi (K + \sigma^2 I)^{-1} \\
\sigma^{-2} A^{-1} \Phi &= \Sigma_p \Phi (K + \sigma^2 I)^{-1}.
\end{aligned}
$$

Thus, we can rewrite the mean as

$$\phi(x)^T \sigma^{-2} A^{-1} \Phi y = \phi(x)^T \Sigma_p \Phi (K + \sigma^2 I)^{-1} y.$$

The covariance matrix can be rewritten using the Sherman-Morrison-Woodbury inversion formula applied to $A^{-1} = \left(\Sigma_p^{-1} + \sigma^{-2} \Phi \Phi^T\right)^{-1}$, giving

$$
\begin{aligned}
A^{-1} &= \Sigma_p - \sigma^{-2} \Sigma_p \Phi \left(I + \sigma^{-2} \Phi^T \Sigma_p \Phi\right)^{-1} \Phi^T \Sigma_p \\
&= \Sigma_p - \Sigma_p \Phi (K + \sigma^2 I)^{-1} \Phi^T \Sigma_p.
\end{aligned}
$$

The entire covariance term then follows by plugging this in:

$$\phi(x)^T A^{-1} \phi(x) = \phi(x)^T \Sigma_p \phi(x) - \phi(x)^T \Sigma_p \Phi (K + \sigma^2 I)^{-1} \Phi^T \Sigma_p \phi(x).$$

The main point of this tedious derivation is to place the inverse on $(K + \sigma^2 I)$, which is $n \times n$, rather than on the $N \times N$ matrix $A$ as before. Thus, the cost of computing the posterior predictive (propagating parameter uncertainties) scales independently of the feature / parameter dimension $N$, instead scaling with the data size $n$. This is unusual for Forward UQ, where we typically expect cost to scale with the parameter dimension.
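
A numerical check (same illustrative polynomial features as above) that the kernel-trick expressions, which invert the $n \times n$ matrix $K + \sigma^2 I$, reproduce the weight-space predictive mean and variance, which invert the $N \times N$ matrix $A$:

```python
import numpy as np

def phi(x):
    """Same illustrative polynomial feature map as above."""
    return np.stack([np.ones_like(x), x, x**2])

rng = np.random.default_rng(6)
n, sigma = 40, 0.2
x_train = rng.uniform(-2.0, 2.0, size=n)
y = np.sin(x_train) + sigma * rng.normal(size=n)

Phi = phi(x_train)
N = Phi.shape[0]
Sigma_p = np.eye(N)
phi_new = phi(np.array([0.5]))[:, 0]

# weight-space forms: invert the N x N matrix A
A = Phi @ Phi.T / sigma**2 + np.linalg.inv(Sigma_p)
mean_w = phi_new @ np.linalg.solve(A, Phi @ y) / sigma**2
var_w = phi_new @ np.linalg.solve(A, phi_new)

# kernel-trick forms: invert the n x n matrix K + sigma^2 I
K = Phi.T @ Sigma_p @ Phi
G = K + sigma**2 * np.eye(n)
k_new = Phi.T @ Sigma_p @ phi_new                  # vector with entries k(x_i, x)
mean_k = k_new @ np.linalg.solve(G, y)             # phi(x)^T Sigma_p Phi (K + sigma^2 I)^-1 y
var_k = phi_new @ Sigma_p @ phi_new - k_new @ np.linalg.solve(G, k_new)

assert np.isclose(mean_w, mean_k) and np.isclose(var_w, var_k)
```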

In the final forms of the predictive mean and covariance, the feature space always enters as $\Phi^T \Sigma_p \Phi$ or $\phi(x)^T \Sigma_p \phi(x)$ (or a mix of the two). This motivates the definition of a covariance function or kernel $k(x, x') = \phi(x)^T \Sigma_p \phi(x')$. Recalling that the prior covariance matrix on the parameters, $\Sigma_p$, is positive definite, we can also define $\psi(x) = \Sigma_p^{1/2} \phi(x)$ and write the kernel as an Inner Product (or dot product): $k(x, x') = \psi(x)^T \psi(x') = \psi(x) \cdot \psi(x')$. Fundamentally, the kernel trick relies on this kernel to replace inner products. As we have derived in this section, the substitution avoids explicitly calculating the feature vectors $\phi(x)$ themselves, which can be huge, or even infinite-dimensional.
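
A tiny check of this identity, with a randomly generated positive definite $\Sigma_p$ and arbitrary stand-ins for $\phi(x)$ and $\phi(x')$ (the symmetric square root is computed via an eigendecomposition):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 4
L = rng.normal(size=(N, N))
Sigma_p = L @ L.T + N * np.eye(N)                 # a positive definite prior covariance

# symmetric square root Sigma_p^{1/2} via the eigendecomposition
evals, evecs = np.linalg.eigh(Sigma_p)
Sigma_p_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T

phi_x, phi_xp = rng.normal(size=N), rng.normal(size=N)    # stand-ins for phi(x), phi(x')
k_val = phi_x @ Sigma_p @ phi_xp                          # k(x, x') = phi(x)^T Sigma_p phi(x')
psi_x, psi_xp = Sigma_p_half @ phi_x, Sigma_p_half @ phi_xp
assert np.isclose(k_val, psi_x @ psi_xp)                  # equals psi(x) . psi(x')
```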

Function-space view

Previously we had the model

$$f(x) = \phi(x)^T w,$$

where we treated $w$ as a random vector (in the Bayesian sense). This means that the function $f$ is itself a random function. It is like a standard basis function expansion, mapping coefficients ($w$) to an actual function.

The function-space view seeks instead to model $f(x)$ directly, rather than explicitly using the basis functions $\phi(x)$ and parameters $w$. This is the key jump. Through the kernel trick, we already eliminated the direct influence of $\phi$ and $\Sigma_p$, replacing them with the kernel. With just $f(x)$, we can presumably still do the same things as before. Because $w$ is Gaussian (in both prior and posterior form) and $f$ is a linear combination of $w$, $f$ is also Gaussian. Hence, we define it directly as a Gaussian process (in function space).

To match the case above, we can define $f(x) = \phi(x)^T w$ with prior $w \sim \mathcal{N}(0, \Sigma_p)$, giving the Gaussian process mean and covariance

$$
\begin{aligned}
\mathbb{E}[f(x)] &= \phi(x)^T \mathbb{E}[w] = 0, \\
\mathbb{E}[f(x) f(x')] &= \phi(x)^T \mathbb{E}[w w^T]\, \phi(x') = \phi(x)^T \Sigma_p \phi(x').
\end{aligned}
$$

This converts from the weight-space view to the function-space view.
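
A Monte Carlo sketch of this conversion, reusing the illustrative polynomial features from earlier: sampling $w \sim \mathcal{N}(0, \Sigma_p)$ and forming $f(x) = \phi(x)^T w$ yields function values whose empirical covariance approaches $\phi(x)^T \Sigma_p \phi(x')$.

```python
import numpy as np

def phi(x):
    """Same illustrative polynomial feature map as earlier."""
    return np.stack([np.ones_like(x), x, x**2])

rng = np.random.default_rng(8)
x_grid = np.linspace(-1.0, 1.0, 5)              # a few evaluation points
Phi = phi(x_grid)                               # N x m
N = Phi.shape[0]
Sigma_p = np.eye(N)

W = rng.multivariate_normal(np.zeros(N), Sigma_p, size=200_000)  # draws of w
F = W @ Phi                                     # each row is (f(x_1), ..., f(x_m))
emp_cov = np.cov(F, rowvar=False)               # empirical covariance of the function values
exact_cov = Phi.T @ Sigma_p @ Phi               # phi(x)^T Sigma_p phi(x') on the grid
assert np.allclose(emp_cov, exact_cov, atol=0.1)  # loose Monte Carlo tolerance
```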

Instead, we can start directly in the function-space view, defining $f$ as a Gaussian process, which implicitly induces a corresponding $\phi(x)$ and $w$. For instance, the squared exponential kernel

$$k(x, x') = \exp\!\left(-\frac{1}{2\ell^2} \lVert x - x' \rVert^2\right)$$

corresponds implicitly to infinitely many basis functions ($N = \infty$). Via Mercer's Theorem, any positive definite covariance kernel induces a (possibly infinite-dimensional) basis function expansion. The choice of kernel (including the length scale $\ell$) specifies the distribution over functions.
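
A minimal GP regression sketch working directly in function space with the squared exponential kernel (the data, length scale, and noise level below are arbitrary): the predictive mean is $k_*^T (K + \sigma^2 I)^{-1} y$ and the predictive variance is $k(x, x) - k_*^T (K + \sigma^2 I)^{-1} k_*$, the function-space analogues of the expressions derived above.

```python
import numpy as np

def k(a, b, ell=0.5):
    """Squared exponential kernel with length scale ell, for 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(9)
sigma = 0.1
x_train = rng.uniform(0.0, 5.0, size=20)
y_train = np.sin(x_train) + sigma * rng.normal(size=20)    # made-up observations
x_test = np.linspace(0.0, 5.0, 100)

Ky = k(x_train, x_train) + sigma**2 * np.eye(len(x_train))  # K + sigma^2 I on the training set
K_star = k(x_train, x_test)                                 # cross-covariances k(x_i, x_*)

mean = K_star.T @ np.linalg.solve(Ky, y_train)              # predictive mean at x_test
cov = k(x_test, x_test) - K_star.T @ np.linalg.solve(Ky, K_star)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))             # pointwise predictive standard deviation
```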