Gaussian Process Regression (Kriging)

#todo

Resources

Main Idea

Gaussian Process

Recall that a Stochastic Process is the infinite-dimensional analogue of a random vector.
A Gaussian process (GP) imposes a Gaussian structure on this process. Suppose $x$ ranges over the domain of the function (playing the role of the vector index). For any points $x_1, \ldots, x_n$, a GP $f$ with Kernel $k(x, x')$ has function evaluations distributed according to

$$(f(x_1), \ldots, f(x_n)) \sim \mathcal{N}(\mu, K),$$

where $K_{ij} = k(x_i, x_j)$ (the Gram matrix) and $\mu_i = \mu(x_i)$. The kernel can be further specified according to prior information. If we assume $\mu = 0$, then $k$ fully defines the behavior of the process $f$. More specifically, a covariance kernel is Symmetric Positive Semidefinite.

For instance, if $k$ is stationary, it can be written as $k(x_i, x_j) = k(x_i - x_j)$, and the process $f$ is stationary in both the strict and the wide sense because the first two moments are finite.
More restrictively, if $k(x_i, x_j) = k(\lVert x_i - x_j \rVert)$, then the covariance is isotropic and a Radial Basis Function. The structure of $k$ also encodes information about the smoothness of realizations of $f$, for instance whether they are Square Integrable ($L^2$) functions, which also depends on the spatial domain.
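
As a quick illustration of the definition, here is a minimal NumPy sketch (the isotropic squared-exponential kernel, grid, and length scale are arbitrary choices for illustration): build the Gram matrix on a 1-D grid and draw joint samples $(f(x_1), \ldots, f(x_n)) \sim \mathcal{N}(0, K)$.

```python
import numpy as np

def k(xi, xj, length_scale=1.0):
    """Isotropic squared-exponential covariance: depends only on |xi - xj|."""
    return np.exp(-0.5 * (xi - xj) ** 2 / length_scale**2)

x = np.linspace(0.0, 5.0, 100)        # evaluation points x_1, ..., x_n
K = k(x[:, None], x[None, :])         # Gram matrix K_ij = k(x_i, x_j)
K += 1e-10 * np.eye(len(x))           # jitter for numerical positive semidefiniteness

rng = np.random.default_rng(0)
# joint draw (f(x_1), ..., f(x_n)) ~ N(0, K); each row is one realization of f on the grid
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
```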

Weight-space view

Let us take a step back to the discrete case before continuing with the functional version afterwards.

Let the independent training data be $\mathcal{D} = \{(x_i, y_i) \mid i = 1, \ldots, n\} \subset \mathbb{R}^D \times \mathbb{R}$, or, in matrix form, the design matrix $X \in \mathbb{R}^{D \times n}$ and target vector $y \in \mathbb{R}^n$. We are interested in modeling the distribution of $y$ given some $x$ (i.e. the conditional distribution $p(y \mid x)$, not a generative distribution $p(x)$ or $p(x \mid y)$).

The standard Bayesian Linear Regression paradigm poses a model $f$ with parameters $w \in \mathbb{R}^D$ and a Gaussian noise model on the observations:

$$f(x; w) = x^T w, \qquad y = f(x; w) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2).$$

The model and noise assumption define the likelihood, a probability distribution over the data observations, given the model. This is

$$
\begin{aligned}
p(y \mid X; w) &= \prod_{i=1}^n p(y_i \mid x_i; w) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - f(x_i; w))^2}{2\sigma^2}\right) \\
&= \frac{1}{(2\pi\sigma^2)^{n/2}} \prod_{i=1}^n \exp\!\left(-\frac{(y_i - x_i^T w)^2}{2\sigma^2}\right) \\
&= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i^T w)^2\right) \\
&= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left(-\frac{1}{2\sigma^2} \left\lVert \begin{bmatrix} y_1 - x_1^T w \\ y_2 - x_2^T w \\ \vdots \\ y_n - x_n^T w \end{bmatrix} \right\rVert_2^2\right) \\
&= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left(-\frac{1}{2\sigma^2} \lVert y - X^T w \rVert_2^2\right).
\end{aligned}
$$

In summary, this gives the likelihood as a Multivariate Gaussian:

$$p(y \mid X; w) = \mathcal{N}(X^T w, \sigma^2 I).$$
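
A quick numerical sanity check of this factorization on made-up data (the dimensions, noise level, and seed below are arbitrary): the product of the $n$ univariate Gaussian terms matches the multivariate form.

```python
import numpy as np

rng = np.random.default_rng(1)
D, n, sigma = 3, 5, 0.4
X = rng.normal(size=(D, n))                  # design matrix; columns are the inputs x_i
w = rng.normal(size=D)                       # some parameter value
y = X.T @ w + sigma * rng.normal(size=n)     # noisy observations

# product of the n univariate Gaussian likelihood terms
per_point = np.exp(-(y - X.T @ w) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
lik_product = np.prod(per_point)

# multivariate form N(X^T w, sigma^2 I), evaluated at y
resid = y - X.T @ w
lik_joint = np.exp(-(resid @ resid) / (2 * sigma**2)) / (2 * np.pi * sigma**2) ** (n / 2)

assert np.isclose(lik_product, lik_joint)
```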

The Bayesian framework also requires a prior over the parameter values, which is assumed to also be Gaussian and centered at $0$ without loss of generality (we could normalize the data instead); for prior covariance $\Sigma_p$, this is

$$w \sim \mathcal{N}(0, \Sigma_p).$$

Deriving parameter posterior

The posterior (the distribution over the parameters, given the data and observations) is computed according to Bayes Theorem as

$$p(w \mid X, y) = \frac{p(y \mid X; w)\, p(w)}{p(y \mid X)}.$$

The marginal likelihood in the denominator is independent of the parameters $w$, and is computed according to the (usually intractable) integral

$$p(y \mid X) = \int p(w, y \mid X)\, dw = \int p(y \mid X, w)\, p(w)\, dw.$$

While one could conceivably compute this integral here, there is no need: we will determine that the posterior is Gaussian in shape and can normalize it afterwards thanks to that known form. Thus, we continue by noting

$$
\begin{aligned}
p(w \mid X, y) &\propto p(y \mid X; w)\, p(w) \\
&\propto \exp\!\left(-\frac{1}{2\sigma^2} (y - X^T w)^T (y - X^T w)\right) \exp\!\left(-\frac{1}{2} w^T \Sigma_p^{-1} w\right) \\
&= \exp\!\left(-\frac{1}{2\sigma^2} (y - X^T w)^T (y - X^T w) - \frac{1}{2} w^T \Sigma_p^{-1} w\right) \\
&\propto \exp\!\Big( \underbrace{\sigma^{-2} (X y)^T}_{b^T}\, w - \frac{1}{2} w^T \underbrace{\left(\sigma^{-2} X X^T + \Sigma_p^{-1}\right)}_{A}\, w \Big).
\end{aligned}
$$

We have the argument of the exponential in the form

$$b^T w - \frac{1}{2} w^T A w,$$

which we want to write in the Gaussian form by finding some unknown mean $\bar{w}$ such that it matches

$$-\frac{1}{2} (w - \bar{w})^T A (w - \bar{w}).$$

Setting these two equal, we can drop the $\bar{w}^T A \bar{w}$ term, which is constant in $w$, and find

$$\bar{w} = \sigma^{-2} A^{-1} X y.$$

Thus, the posterior is

$$p(w \mid X, y) = \mathcal{N}\Big(\underbrace{\sigma^{-2} A^{-1} X y}_{\bar{w}},\ \underbrace{\left(\sigma^{-2} X X^T + \Sigma_p^{-1}\right)^{-1}}_{A^{-1}}\Big)

Expanding the parameter mean (and mode / Maximum a Posteriori) gives

$$\bar{w} = \sigma^{-2} \left(\sigma^{-2} X X^T + \Sigma_p^{-1}\right)^{-1} X y.$$
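
A minimal sketch of these formulas on synthetic data (the prior covariance, noise level, and data-generating weights below are arbitrary choices): form $A$, then the posterior mean $\bar{w}$ and covariance $A^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
D, n, sigma = 3, 50, 0.3
Sigma_p = np.eye(D)                              # assumed (illustrative) prior covariance
w_true = rng.normal(size=D)
X = rng.normal(size=(D, n))
y = X.T @ w_true + sigma * rng.normal(size=n)

A = X @ X.T / sigma**2 + np.linalg.inv(Sigma_p)  # A = sigma^-2 X X^T + Sigma_p^-1
w_bar = np.linalg.solve(A, X @ y) / sigma**2     # posterior mean (and MAP estimate)
post_cov = np.linalg.inv(A)                      # posterior covariance A^-1
```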

Recall that the normal equations from (linear) Least Squares give the parameters as

$$w_{\text{OLS}} = (X X^T)^{-1} X y,$$

or the version with Tikhonov Regularization (Ridge Regression) of magnitude $\lambda$:

$$w_{\text{Tik}} = (X X^T + \lambda I)^{-1} X y.$$

We can relate the Bayesian version to these by noting that $\lambda I = \Sigma_p^{-1}$ gives the same coefficient prediction (ignoring $\sigma$); in other words, the Gaussian prior regularizes the coefficients in exactly the same way as Tikhonov Regularization (and we can generalize to non-diagonal regularization if needed). Alternatively, if $\Sigma_p^{-1} = 0$, e.g. by taking infinite prior variances, then we recover the least squares case.
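
A small numerical check of this correspondence on made-up data. Keeping $\sigma$ explicit, the exact match is $\Sigma_p^{-1} = (\lambda / \sigma^2) I$, which reduces to $\lambda I = \Sigma_p^{-1}$ when $\sigma$ is set to one (ignored), as stated above.

```python
import numpy as np

rng = np.random.default_rng(3)
D, n, lam, sigma = 4, 30, 0.5, 0.7
X = rng.normal(size=(D, n))
y = rng.normal(size=n)                             # targets need not follow the model here

# Tikhonov / ridge solution
w_ridge = np.linalg.solve(X @ X.T + lam * np.eye(D), X @ y)

# Bayesian posterior mean with Sigma_p^-1 = (lam / sigma^2) * I
Sigma_p_inv = (lam / sigma**2) * np.eye(D)
A = X @ X.T / sigma**2 + Sigma_p_inv
w_bayes = np.linalg.solve(A, X @ y) / sigma**2

assert np.allclose(w_ridge, w_bayes)
```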

Predictive posterior

Future predictions for a new input $x$ involve pushing the posterior distribution through the forward model at that $x$. In other words, denote the predictive quantity as $f_*$, whose true value is $f(x)$. Its distribution is computed as

$$p(f_* \mid x, X, y) = \int p(f_* \mid x, w)\, p(w \mid X, y)\, dw.$$

That is, sample $w$ from the posterior; then, with that value and the input $x$, compute $f_*$, which in this case is a deterministic map. Because $f$ is linear in $w$, the output is also Gaussian, with mean and covariance adjusted according to the corresponding linear transform:

$$p(f_* \mid x, X, y) = \mathcal{N}\!\left(x^T \bar{w},\ x^T A^{-1} x\right).$$

To predict the distribution of the observations, rather than the underlying function, we would also add in our Gaussian noise model (i.e. add $\sigma^2$ to the variance).
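
A sketch of the predictive formulas at a new input, using synthetic data again (repeated so the snippet is self-contained; all constants are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
D, n, sigma = 3, 50, 0.3
Sigma_p = np.eye(D)
w_true = rng.normal(size=D)
X = rng.normal(size=(D, n))
y = X.T @ w_true + sigma * rng.normal(size=n)

A = X @ X.T / sigma**2 + np.linalg.inv(Sigma_p)
w_bar = np.linalg.solve(A, X @ y) / sigma**2

x_new = rng.normal(size=D)                     # new test input x
f_mean = x_new @ w_bar                         # predictive mean  x^T w_bar
f_var = x_new @ np.linalg.solve(A, x_new)      # predictive variance  x^T A^-1 x
y_var = f_var + sigma**2                       # add the noise variance to predict an observation
```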

Feature space

We can generalize to a different class of models by instead "engineering" features of $x$ via a vector-valued map $\phi(x)$, i.e. $\phi: \mathbb{R}^D \to \mathbb{R}^N$. The matrix $\Phi = \Phi(X) \in \mathbb{R}^{N \times n}$ comes from applying $\phi$ to each column of $X$. We now seek $w \in \mathbb{R}^N$, with one parameter for each feature. The new model is nonlinear in $x$:

$$f(x) = \phi(x)^T w.$$

This fundamentally just changes what inputs we build a linear model on, substituting $\Phi$ for $X$ in the analysis. The predictive posterior distribution changes the most (and is the most relevant part, as it depends directly on $x$), giving

$$f_* \mid x, X, y \sim \mathcal{N}\!\left(\sigma^{-2} \phi(x)^T A^{-1} \Phi y,\ \phi(x)^T A^{-1} \phi(x)\right)$$

with $A = \sigma^{-2} \Phi \Phi^T + \Sigma_p^{-1}$. In this case, predictions, even deterministic ones, hinge on inverting the $N \times N$ matrix $A$, which may be difficult for large $N$.
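
A sketch of the feature-space version, using an arbitrary polynomial feature map $\phi(x) = (1, x, x^2)$ for scalar inputs as the illustrative choice (so $D = 1$, $N = 3$):

```python
import numpy as np

def phi(x):
    """Illustrative polynomial feature map for scalar x: phi(x) = (1, x, x^2)."""
    return np.stack([np.ones_like(x), x, x**2])   # shape (N, n)

rng = np.random.default_rng(5)
n, sigma = 40, 0.2
x_train = rng.uniform(-2.0, 2.0, size=n)
y = np.sin(x_train) + sigma * rng.normal(size=n)  # made-up targets

Phi = phi(x_train)                                # N x n feature matrix Phi(X)
N = Phi.shape[0]
Sigma_p = np.eye(N)                               # prior covariance over the N weights

A = Phi @ Phi.T / sigma**2 + np.linalg.inv(Sigma_p)         # the N x N matrix to invert
phi_new = phi(np.array([0.5]))[:, 0]                        # phi(x) at a new test input
f_mean = phi_new @ np.linalg.solve(A, Phi @ y) / sigma**2   # sigma^-2 phi(x)^T A^-1 Phi y
f_var = phi_new @ np.linalg.solve(A, phi_new)               # phi(x)^T A^-1 phi(x)
```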

Kernel Trick

Let $K = \Phi^T \Sigma_p \Phi$, noting $K \in \mathbb{R}^{n \times n}$. We will derive a new form of the predictive posterior:

$$
\begin{aligned}
\sigma^{-2} \Phi (K + \sigma^2 I) &= \sigma^{-2} \Phi \left(\Phi^T \Sigma_p \Phi + \sigma^2 I\right) \\
&= \sigma^{-2} \Phi \Phi^T \Sigma_p \Phi + \Phi \\
&= \left(\sigma^{-2} \Phi \Phi^T \Sigma_p + I\right) \Phi \\
&= \left(\sigma^{-2} \Phi \Phi^T \Sigma_p + I\right) \Sigma_p^{-1} \Sigma_p \Phi \\
&= \left(\sigma^{-2} \Phi \Phi^T \Sigma_p \Sigma_p^{-1} + \Sigma_p^{-1}\right) \Sigma_p \Phi \\
&= \left(\sigma^{-2} \Phi \Phi^T + \Sigma_p^{-1}\right) \Sigma_p \Phi \\
&= A \Sigma_p \Phi.
\end{aligned}
$$

Now, left-multiply by $A^{-1}$ and right-multiply by $(K + \sigma^2 I)^{-1}$ to give

$$
\begin{aligned}
A^{-1} \sigma^{-2} \Phi (K + \sigma^2 I)(K + \sigma^2 I)^{-1} &= A^{-1} A \Sigma_p \Phi (K + \sigma^2 I)^{-1} \\
\sigma^{-2} A^{-1} \Phi &= \Sigma_p \Phi (K + \sigma^2 I)^{-1}.
\end{aligned}
$$

Thus, we can rewrite the mean as

$$\phi(x)^T \sigma^{-2} A^{-1} \Phi y = \phi(x)^T \Sigma_p \Phi (K + \sigma^2 I)^{-1} y.$$

The covariance matrix can be rewritten using the Sherman-Morrison-Woodbury inversion formula applied to $A^{-1} = \left(\Sigma_p^{-1} + \sigma^{-2} \Phi \Phi^T\right)^{-1}$, giving

$$
\begin{aligned}
A^{-1} &= \Sigma_p - \sigma^{-2} \Sigma_p \Phi \left(I + \sigma^{-2} \Phi^T \Sigma_p \Phi\right)^{-1} \Phi^T \Sigma_p \\
&= \Sigma_p - \Sigma_p \Phi (K + \sigma^2 I)^{-1} \Phi^T \Sigma_p.
\end{aligned}
$$

The entire covariance term then follows by plugging this in:

$$\phi(x)^T A^{-1} \phi(x) = \phi(x)^T \Sigma_p \phi(x) - \phi(x)^T \Sigma_p \Phi (K + \sigma^2 I)^{-1} \Phi^T \Sigma_p \phi(x).$$

The main point of this tedious derivation is to place the inverse on $(K + \sigma^2 I)$, which is $n \times n$, rather than on the $N \times N$ matrix $A$ as before. Thus, the cost of computing the posterior predictive (propagating parameter uncertainties) scales independently of the feature / parameter dimension $N$, instead scaling with the data size $n$. This is unusual for Forward UQ, where we typically expect cost to scale with the parameter dimension.
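
A numerical check (same illustrative polynomial features as above) that the kernel-trick expressions, which invert the $n \times n$ matrix $K + \sigma^2 I$, reproduce the weight-space predictive mean and variance, which invert the $N \times N$ matrix $A$:

```python
import numpy as np

def phi(x):
    """Same illustrative polynomial feature map as above."""
    return np.stack([np.ones_like(x), x, x**2])

rng = np.random.default_rng(6)
n, sigma = 40, 0.2
x_train = rng.uniform(-2.0, 2.0, size=n)
y = np.sin(x_train) + sigma * rng.normal(size=n)

Phi = phi(x_train)
N = Phi.shape[0]
Sigma_p = np.eye(N)
phi_new = phi(np.array([0.5]))[:, 0]

# weight-space forms: invert the N x N matrix A
A = Phi @ Phi.T / sigma**2 + np.linalg.inv(Sigma_p)
mean_w = phi_new @ np.linalg.solve(A, Phi @ y) / sigma**2
var_w = phi_new @ np.linalg.solve(A, phi_new)

# kernel-trick forms: invert the n x n matrix K + sigma^2 I
K = Phi.T @ Sigma_p @ Phi
G = K + sigma**2 * np.eye(n)
k_new = Phi.T @ Sigma_p @ phi_new                  # vector with entries k(x_i, x)
mean_k = k_new @ np.linalg.solve(G, y)             # phi(x)^T Sigma_p Phi (K + sigma^2 I)^-1 y
var_k = phi_new @ Sigma_p @ phi_new - k_new @ np.linalg.solve(G, k_new)

assert np.isclose(mean_w, mean_k) and np.isclose(var_w, var_k)
```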

In the final forms of the predictive mean and covariance, the feature space always enters as $\Phi^T \Sigma_p \Phi$ or $\phi(x)^T \Sigma_p \phi(x)$ (or a mix of the two). This motivates the definition of a covariance function or kernel $k(x, x') = \phi(x)^T \Sigma_p \phi(x')$. Recalling that the prior covariance matrix on the parameters, $\Sigma_p$, is positive definite, we can also define $\psi(x) = \Sigma_p^{1/2} \phi(x)$ and write the kernel as an Inner Product (or dot product): $k(x, x') = \psi(x)^T \psi(x') = \psi(x) \cdot \psi(x')$. Fundamentally, the kernel trick relies on this kernel to replace inner products. As we have derived in this section, the substitution avoids explicitly calculating the feature vectors $\phi(x)$ themselves, which can be huge, or even infinite-dimensional.
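
A tiny check of this identity, with a randomly generated positive definite $\Sigma_p$ and arbitrary stand-ins for $\phi(x)$ and $\phi(x')$ (the symmetric square root is computed via an eigendecomposition):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 4
L = rng.normal(size=(N, N))
Sigma_p = L @ L.T + N * np.eye(N)                 # a positive definite prior covariance

# symmetric square root Sigma_p^{1/2} via the eigendecomposition
evals, evecs = np.linalg.eigh(Sigma_p)
Sigma_p_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T

phi_x, phi_xp = rng.normal(size=N), rng.normal(size=N)    # stand-ins for phi(x), phi(x')
k_val = phi_x @ Sigma_p @ phi_xp                          # k(x, x') = phi(x)^T Sigma_p phi(x')
psi_x, psi_xp = Sigma_p_half @ phi_x, Sigma_p_half @ phi_xp
assert np.isclose(k_val, psi_x @ psi_xp)                  # equals psi(x) . psi(x')
```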

Function-space view

Previously we had the model

$$f(x) = \phi(x)^T w,$$

where we treated $w$ as a random vector (in the Bayesian sense). This means that the function $f$ is itself a random function. It is like a standard basis function expansion, mapping coefficients ($w$) to an actual function.

The function-space view seeks instead to model $f(x)$ directly, rather than explicitly using the basis functions $\phi(x)$ and parameters $w$. This is the key jump. Through the kernel trick, we already eliminated the direct influence of $\phi$ and $\Sigma_p$, replacing them with the kernel. With just $f(x)$, we can presumably still do the same things as before. Because $w$ is Gaussian (in both prior and posterior form) and $f$ is a linear combination of $w$, $f$ is also Gaussian. Hence, we define it directly as a Gaussian process (in function space).

To match the case above, we can define $f(x) = \phi(x)^T w$ with prior $w \sim \mathcal{N}(0, \Sigma_p)$, giving the Gaussian process mean and covariance

$$
\begin{aligned}
\mathbb{E}[f(x)] &= \phi(x)^T \mathbb{E}[w] = 0, \\
\mathbb{E}[f(x) f(x')] &= \phi(x)^T \mathbb{E}[w w^T]\, \phi(x') = \phi(x)^T \Sigma_p \phi(x').
\end{aligned}
$$

This converts from the weight-space view to the function-space view.
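
A Monte Carlo sketch of this conversion, reusing the illustrative polynomial features from earlier: sampling $w \sim \mathcal{N}(0, \Sigma_p)$ and forming $f(x) = \phi(x)^T w$ yields function values whose empirical covariance approaches $\phi(x)^T \Sigma_p \phi(x')$.

```python
import numpy as np

def phi(x):
    """Same illustrative polynomial feature map as earlier."""
    return np.stack([np.ones_like(x), x, x**2])

rng = np.random.default_rng(8)
x_grid = np.linspace(-1.0, 1.0, 5)              # a few evaluation points
Phi = phi(x_grid)                               # N x m
N = Phi.shape[0]
Sigma_p = np.eye(N)

W = rng.multivariate_normal(np.zeros(N), Sigma_p, size=200_000)  # draws of w
F = W @ Phi                                     # each row is (f(x_1), ..., f(x_m))
emp_cov = np.cov(F, rowvar=False)               # empirical covariance of the function values
exact_cov = Phi.T @ Sigma_p @ Phi               # phi(x)^T Sigma_p phi(x') on the grid
assert np.allclose(emp_cov, exact_cov, atol=0.1)  # loose Monte Carlo tolerance
```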

Instead, we can start directly in the function-space view, defining $f$ as a Gaussian process, which implicitly induces a corresponding $\phi(x)$ and $w$. For instance, the squared exponential kernel

$$k(x, x') = \exp\!\left(-\frac{1}{2\ell^2} \lVert x - x' \rVert^2\right)$$

corresponds implicitly to infinitely many basis functions ($N = \infty$). Via Mercer's Theorem, any positive definite covariance kernel induces a (possibly infinite-dimensional) basis function expansion. The choice of kernel (including the length scale $\ell$) specifies the distribution over functions.
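
A minimal GP regression sketch working directly in function space with the squared exponential kernel (the data, length scale, and noise level below are arbitrary): the predictive mean is $k_*^T (K + \sigma^2 I)^{-1} y$ and the predictive variance is $k(x, x) - k_*^T (K + \sigma^2 I)^{-1} k_*$, the function-space analogues of the expressions derived above.

```python
import numpy as np

def k(a, b, ell=0.5):
    """Squared exponential kernel with length scale ell, for 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(9)
sigma = 0.1
x_train = rng.uniform(0.0, 5.0, size=20)
y_train = np.sin(x_train) + sigma * rng.normal(size=20)    # made-up observations
x_test = np.linspace(0.0, 5.0, 100)

Ky = k(x_train, x_train) + sigma**2 * np.eye(len(x_train))  # K + sigma^2 I on the training set
K_star = k(x_train, x_test)                                 # cross-covariances k(x_i, x_*)

mean = K_star.T @ np.linalg.solve(Ky, y_train)              # predictive mean at x_test
cov = k(x_test, x_test) - K_star.T @ np.linalg.solve(Ky, K_star)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))             # pointwise predictive standard deviation
```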