Recall that a Stochastic Process is the infinite-dimensional analogue of a random vector.
A Gaussian process (GP) imposes a Gaussian structure. Suppose $\mathcal{X}$ is the domain of this function (playing the role of the vector index). For any finite set of points $x_1, \dots, x_n \in \mathcal{X}$, a GP with mean function $m$ and Kernel $k$ has function evaluations distributed according to
$$
\begin{pmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{pmatrix} \sim \mathcal{N}(\mu, K),
$$
where $K_{ij} = k(x_i, x_j)$ (the Gram matrix), and $\mu_i = m(x_i)$. According to prior information, this kernel can be further specified. If we assume $m \equiv 0$, then $k$ fully defines the behavior of the process $f$. More specifically, a covariance kernel is Symmetric Positive Semidefinite.
For instance, if $k$ is stationary, it can be written as $k(x, x') = k(x - x')$, and the process is stationary in both the strict and wide sense, since a Gaussian process is fully determined by its first two moments.
More restrictively, if $k(x, x') = k(\lVert x - x' \rVert)$, then the covariance is isotropic and a Radial Basis Function. The structure of $k$ also encodes information about the smoothness of realizations, for instance relating to Square Integrable functions ($L^2$), depending also on the spatial domain.
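As a minimal sketch of this finite-dimensional picture, the snippet below (assuming numpy and a squared exponential kernel, both illustrative choices) builds the Gram matrix on a grid of points and draws a few realizations of a zero-mean GP prior.

```python
import numpy as np

def sq_exp_kernel(x1, x2, length_scale=1.0):
    """Isotropic squared exponential kernel k(x, x') = exp(-|x - x'|^2 / (2 l^2))."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

x = np.linspace(0.0, 5.0, 200)                        # finite set of points in the domain
K = sq_exp_kernel(x, x)                               # Gram matrix K_ij = k(x_i, x_j), symmetric PSD
L = np.linalg.cholesky(K + 1e-10 * np.eye(len(x)))    # small jitter for numerical stability
samples = L @ np.random.randn(len(x), 3)              # three draws of (f(x_1), ..., f(x_n)) ~ N(0, K)
```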
Weight-space view
Let us take a step back to the discrete case before returning to the functional version.
Let the independent training data be $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, or in matrix form the design matrix $X \in \mathbb{R}^{d \times n}$ (inputs as columns) and target vector $y \in \mathbb{R}^n$. We are interested in modeling the distribution of $y$, given some $x$ (i.e. the conditional distribution $p(y \mid x)$, not a generative distribution $p(x, y)$ or $p(x \mid y)$).
The standard Bayesian Linear Regression paradigm poses the model $f(x) = x^\top w$ with parameters $w \in \mathbb{R}^d$ and a Gaussian noise model on the observations:
$$
y_i = f(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma_n^2).
$$
The model and noise assumption define the likelihood, a probability distribution over the data observations, given the model. This is
$$
p(y \mid X, w) = \prod_{i=1}^n p(y_i \mid x_i, w) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left( -\frac{(y_i - x_i^\top w)^2}{2\sigma_n^2} \right).
$$
In summary, this gives the likelihood as a Multivariate Gaussian:
$$
y \mid X, w \sim \mathcal{N}(X^\top w, \sigma_n^2 I).
$$
The Bayesian framework also requires a prior over the parameter values, which is assumed to also be Gaussian, centered at $0$ without loss of generality (we could normalize the data instead), which for covariance $\Sigma_p$ is
$$
w \sim \mathcal{N}(0, \Sigma_p).
$$
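To make the setup concrete, here is a small sketch (the toy dimensions, seed, and values are assumed, not taken from the text) that draws $w$ from the Gaussian prior and generates noisy observations from the linear model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 50                      # input dimension, number of observations
sigma_n = 0.3                     # noise standard deviation
Sigma_p = np.eye(d)               # prior covariance on the weights

w_true = rng.multivariate_normal(np.zeros(d), Sigma_p)  # one draw from the prior w ~ N(0, Sigma_p)
X = rng.normal(size=(d, n))                             # design matrix, inputs as columns
y = X.T @ w_true + sigma_n * rng.normal(size=n)         # y_i = x_i^T w + eps_i
```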
Deriving parameter posterior
The posterior (the distribution over the parameters, given the data and observations) is computed according to Bayes Theorem as
$$
p(w \mid X, y) = \frac{p(y \mid X, w)\, p(w)}{p(y \mid X)}.
$$
The marginal likelihood in the denominator is independent of the parameters $w$, and is computed according to the (usually intractable) integral
$$
p(y \mid X) = \int p(y \mid X, w)\, p(w)\, dw.
$$
While in this case one could conceivably compute this, there is no need, as we can determine that the shape of the posterior is Gaussian and normalize afterwards because of this nice distribution. Thus, we continue by noting
$$
p(w \mid X, y) \propto p(y \mid X, w)\, p(w) \propto \exp\!\left( -\frac{1}{2\sigma_n^2} (y - X^\top w)^\top (y - X^\top w) \right) \exp\!\left( -\frac{1}{2} w^\top \Sigma_p^{-1} w \right).
$$
We have the argument of the exponential as
$$
-\frac{1}{2\sigma_n^2} (y - X^\top w)^\top (y - X^\top w) - \frac{1}{2} w^\top \Sigma_p^{-1} w,
$$
which we want to write in the Gaussian form by finding some unknown mean $\bar{w}$ and precision $A$ such that
$$
-\frac{1}{2} (w - \bar{w})^\top A (w - \bar{w}) + \text{const}.
$$
Setting these two equal, we can drop the term that is constant in $w$, and find
$$
A = \sigma_n^{-2} X X^\top + \Sigma_p^{-1}, \qquad \bar{w} = \sigma_n^{-2} A^{-1} X y,
$$
so that the posterior is
$$
w \mid X, y \sim \mathcal{N}(\bar{w}, A^{-1}).
$$
We can relate this Bayesian version to Ridge Regression by noting that, with $\Sigma_p = \tau^2 I$, the ridge estimate with penalty $\lambda = \sigma_n^2 / \tau^2$ gives the same coefficient prediction as $\bar{w}$ (ignoring the posterior uncertainty); in other words, the Gaussian prior regularizes the coefficients in exactly the same way as Tikhonov Regularization (a non-diagonal $\Sigma_p$ generalizes to non-diagonal regularization). Alternatively, if we have $\Sigma_p^{-1} \to 0$, e.g. by taking infinite variances for the prior, then we recover the least squares case.
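A short numerical sketch of the closed-form posterior (numpy, with an assumed toy data set) also checks the stated equivalence with ridge / Tikhonov regression when $\Sigma_p = \tau^2 I$ and $\lambda = \sigma_n^2 / \tau^2$.

```python
import numpy as np

def posterior_params(X, y, sigma_n, Sigma_p):
    """Gaussian posterior N(w_bar, A^{-1}); X is d x n with inputs as columns."""
    A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
    A_inv = np.linalg.inv(A)
    w_bar = A_inv @ X @ y / sigma_n**2
    return w_bar, A_inv

# Toy data (assumed): with Sigma_p = tau^2 I, the posterior mean should equal the
# ridge estimate (X X^T + lambda I)^{-1} X y with lambda = sigma_n^2 / tau^2.
rng = np.random.default_rng(1)
d, n, sigma_n, tau = 3, 40, 0.2, 1.5
X = rng.normal(size=(d, n))
y = X.T @ rng.normal(size=d) + sigma_n * rng.normal(size=n)

w_bar, _ = posterior_params(X, y, sigma_n, tau**2 * np.eye(d))
w_ridge = np.linalg.solve(X @ X.T + (sigma_n / tau)**2 * np.eye(d), X @ y)
assert np.allclose(w_bar, w_ridge)
```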
Predictive posterior
Future predictions for a new input $x_*$ involve pushing the posterior distribution through the forward model at that $x_*$. In other words, denote the predictive quantity as $f_* = f(x_*)$, which has true value $x_*^\top w$. The distribution is computed as
$$
p(f_* \mid x_*, X, y) = \int p(f_* \mid x_*, w)\, p(w \mid X, y)\, dw.
$$
That is, sample $w$ from the posterior. Then with that value, compute $f_*$ by also using $x_*$, which is a deterministic map in this case. Because $f_* = x_*^\top w$ is linear in $w$, this output is also Gaussian, with mean and covariance adjusted according to the respective transform:
$$
f_* \mid x_*, X, y \sim \mathcal{N}\!\left( \sigma_n^{-2}\, x_*^\top A^{-1} X y,\; x_*^\top A^{-1} x_* \right) = \mathcal{N}\!\left( x_*^\top \bar{w},\; x_*^\top A^{-1} x_* \right).
$$
To predict the distribution of the observations, rather than the underlying function, we would also add in our Gaussian noise model (i.e. add $\sigma_n^2$ to the variance).
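As a sketch, the predictive mean and variance translate directly into a few lines of numpy (the function name and arguments here are illustrative).

```python
import numpy as np

def predictive(x_star, X, y, sigma_n, Sigma_p):
    """Mean and variance of f(x_*) | X, y for Bayesian linear regression."""
    A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
    A_inv = np.linalg.inv(A)
    mean = x_star @ A_inv @ X @ y / sigma_n**2    # x_*^T w_bar
    var = x_star @ A_inv @ x_star                 # x_*^T A^{-1} x_*
    return mean, var

# For the distribution of a noisy observation y_*, add the noise variance:
#   mean, var = predictive(x_star, X, y, sigma_n, Sigma_p); var_y = var + sigma_n**2
```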
Feature space
We can generalize to a different class of models by instead "engineering" features depending on $x$, via the vector-valued map $\phi: \mathbb{R}^d \to \mathbb{R}^N$, i.e. $\phi(x) \in \mathbb{R}^N$. The matrix $\Phi \in \mathbb{R}^{N \times n}$ comes from applying $\phi$ to each column of $X$. We now seek $w \in \mathbb{R}^N$, with one parameter for each feature. The new model is nonlinear in $x$:
$$
f(x) = \phi(x)^\top w.
$$
This is fundamentally just changing what inputs we build a linear model for, substituting $\Phi$ for $X$ (and $\phi(x_*)$ for $x_*$) in the analysis. The predictive posterior distribution changes the most (and is the most relevant, as it is more connected with $x$), giving
$$
f_* \mid x_*, X, y \sim \mathcal{N}\!\left( \sigma_n^{-2}\, \phi(x_*)^\top A^{-1} \Phi y,\; \phi(x_*)^\top A^{-1} \phi(x_*) \right),
$$
with $A = \sigma_n^{-2} \Phi \Phi^\top + \Sigma_p^{-1} \in \mathbb{R}^{N \times N}$. In this case, predictions, even deterministic ones, hinge on inverting the $N \times N$ matrix $A$, which may be difficult for large $N$.
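A sketch with an explicitly chosen (hypothetical) polynomial feature map for scalar inputs illustrates the substitution of $\Phi$ for $X$ and the $N \times N$ inverse it requires.

```python
import numpy as np

def phi(x):
    """Hypothetical cubic polynomial feature map for a scalar input: phi(x) in R^4."""
    return np.array([1.0, x, x**2, x**3])

def feature_predictive(x_star, x_train, y, sigma_n, Sigma_p):
    """Predictive mean/variance after replacing x with phi(x); inverts the N x N matrix A."""
    Phi = np.stack([phi(x) for x in x_train], axis=1)        # N x n
    A = Phi @ Phi.T / sigma_n**2 + np.linalg.inv(Sigma_p)    # N x N
    A_inv = np.linalg.inv(A)
    phi_star = phi(x_star)
    mean = phi_star @ A_inv @ Phi @ y / sigma_n**2
    var = phi_star @ A_inv @ phi_star
    return mean, var
```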
Kernel Trick
Let $K = \Phi^\top \Sigma_p \Phi$, noting $A = \sigma_n^{-2} \Phi \Phi^\top + \Sigma_p^{-1}$. We will derive a new form of the predictive posterior:
$$
\sigma_n^{-2} \Phi \left( K + \sigma_n^2 I \right) = \sigma_n^{-2} \Phi \left( \Phi^\top \Sigma_p \Phi + \sigma_n^2 I \right) = \left( \sigma_n^{-2} \Phi \Phi^\top + \Sigma_p^{-1} \right) \Sigma_p \Phi = A\, \Sigma_p \Phi.
$$
Now, left multiply by $A^{-1}$ and right multiply by $(K + \sigma_n^2 I)^{-1}$ to give
$$
\sigma_n^{-2} A^{-1} \Phi = \Sigma_p \Phi \left( K + \sigma_n^2 I \right)^{-1}.
$$
Thus, we can rewrite the mean as
$$
\sigma_n^{-2}\, \phi(x_*)^\top A^{-1} \Phi y = \phi(x_*)^\top \Sigma_p \Phi \left( K + \sigma_n^2 I \right)^{-1} y.
$$
The covariance matrix can be rewritten using the Sherman-Morrison-Woodbury inversion formula (matrix inversion lemma) for $A^{-1}$, giving
$$
A^{-1} = \left( \sigma_n^{-2} \Phi \Phi^\top + \Sigma_p^{-1} \right)^{-1} = \Sigma_p - \Sigma_p \Phi \left( K + \sigma_n^2 I \right)^{-1} \Phi^\top \Sigma_p.
$$
Then the entire covariance term just plugs this in to give
$$
\phi(x_*)^\top A^{-1} \phi(x_*) = \phi(x_*)^\top \Sigma_p\, \phi(x_*) - \phi(x_*)^\top \Sigma_p \Phi \left( K + \sigma_n^2 I \right)^{-1} \Phi^\top \Sigma_p\, \phi(x_*).
$$
The main point of this tedious derivation is to have the inverse act on $K + \sigma_n^2 I$, which is $n \times n$, rather than the previous version $A$, which is $N \times N$. Thus, the cost of computing the posterior predictive (propagating parameter uncertainties) scales independently of the feature / parameter dimension $N$, instead scaling with the data size $n$. This is uncommon for Forward UQ, where we expect cost to scale with the parameter dimension.
In the final forms for the predictive mean and covariance, the feature space always comes in as $\phi(x_*)^\top \Sigma_p\, \phi(x)$ or $\Phi^\top \Sigma_p \Phi$ (or a mix between the two). This motivates the definition of a covariance function or kernel, $k(x, x') = \phi(x)^\top \Sigma_p\, \phi(x')$. Recalling that the prior covariance matrix on the parameters, $\Sigma_p$, is positive definite, we can also define $\psi(x) = \Sigma_p^{1/2} \phi(x)$ and write the kernel as an Inner Product (or dot product): $k(x, x') = \psi(x) \cdot \psi(x')$. Fundamentally, the kernel trick will rely on this kernel to replace inner products. As we have derived in this section, this substitution avoids explicitly calculating the feature vectors $\phi(x)$ themselves, which can be huge, even infinite-dimensional.
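The same predictive posterior, rewritten to use only the kernel $k(x, x') = \phi(x)^\top \Sigma_p\, \phi(x')$ and a single $n \times n$ solve, looks like the following sketch (reusing the hypothetical feature map from before; the Gram matrix is still built through $\phi$ here purely for illustration, but only kernel evaluations enter the result).

```python
import numpy as np

def kernel_predictive(x_star, x_train, y, sigma_n, Sigma_p, phi):
    """Predictive mean/variance in kernelized form: only an n x n system is solved."""
    Phi = np.stack([phi(x) for x in x_train], axis=1)          # N x n (only used to evaluate k)
    K = Phi.T @ Sigma_p @ Phi                                   # n x n Gram matrix of k
    phi_star = phi(x_star)
    k_star = Phi.T @ Sigma_p @ phi_star                         # k(x_i, x_*) for each datum
    Z = np.linalg.solve(K + sigma_n**2 * np.eye(len(y)), np.column_stack([y, k_star]))
    mean = k_star @ Z[:, 0]                                     # k_*^T (K + s^2 I)^{-1} y
    var = phi_star @ Sigma_p @ phi_star - k_star @ Z[:, 1]      # k(x_*, x_*) - k_*^T (K + s^2 I)^{-1} k_*
    return mean, var
```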
Function-space view
Previously we had the model
$$
f(x) = \phi(x)^\top w,
$$
where we treated $w$ as a random vector (in a Bayesian sense). This then means that the function $f$ is a random function. This is like a standard basis function expansion, going from coefficients ($w$) to an actual function.
The function-space view seeks to instead just directly model $f$, rather than explicitly using the basis functions $\phi$ and parameters $w$. This is the key jump. Through the kernel trick, we already eliminated the direct influence of $\phi$ and $w$, replacing it with the kernel. With just $k$, we can still presumably do the same things as before. Because $w$ is Gaussian (in both prior and posterior form) and $f(x)$ is a linear combination of $w$, $f(x)$ is also Gaussian. Hence, we define it directly as a Gaussian process (in function space).
To match the above case, we can take the prior $w \sim \mathcal{N}(0, \Sigma_p)$, giving the Gaussian process mean and covariance
$$
\mathbb{E}[f(x)] = \phi(x)^\top\, \mathbb{E}[w] = 0, \qquad \operatorname{Cov}[f(x), f(x')] = \phi(x)^\top\, \mathbb{E}[w w^\top]\, \phi(x') = \phi(x)^\top \Sigma_p\, \phi(x') = k(x, x').
$$
This converts from the weight-space view to the function-space view.
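A quick Monte Carlo sketch (with an assumed toy feature map and $\Sigma_p$) checks this conversion: sampling $w$ from the prior and forming $f(x) = \phi(x)^\top w$ reproduces $\phi(x)^\top \Sigma_p\, \phi(x')$ as the covariance of the function values.

```python
import numpy as np

rng = np.random.default_rng(2)
phi = lambda x: np.array([1.0, x, x**2])                 # hypothetical feature map
Sigma_p = np.diag([1.0, 0.5, 0.25])                      # assumed prior covariance
xs = np.array([-1.0, 0.3, 2.0])                          # a few evaluation points

Phi = np.stack([phi(x) for x in xs], axis=1)             # N x len(xs)
W = rng.multivariate_normal(np.zeros(3), Sigma_p, size=100_000)  # samples of w
F = W @ Phi                                              # rows are (f(x_1), f(x_2), f(x_3))
print(np.cov(F, rowvar=False))                           # sample covariance of function values
print(Phi.T @ Sigma_p @ Phi)                             # should be close: phi(x)^T Sigma_p phi(x')
```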
Instead, we can start directly in the function-space view, defining $f$ as a Gaussian process, which implicitly induces a corresponding $\phi$ and $\Sigma_p$. For instance, the squared exponential kernel
$$
k(x, x') = \exp\!\left( -\frac{\lVert x - x' \rVert^2}{2 \ell^2} \right)
$$
corresponds implicitly to infinitely many basis functions $\phi_i$. Via Mercer's Theorem, any positive definite covariance kernel induces a (possibly infinite-dimensional) basis function expansion. The choice of the kernel (including the length scale $\ell$) specifies the distribution over functions.
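Finally, a minimal function-space sketch (numpy, with illustrative training data) performs GP regression with the squared exponential kernel: only kernel evaluations are needed, the predictive formulas coincide with the kernelized weight-space result above, and the length scale $\ell$ controls how wiggly the inferred functions are.

```python
import numpy as np

def sq_exp(x1, x2, ell=1.0):
    """Squared exponential kernel k(x, x') = exp(-|x - x'|^2 / (2 ell^2))."""
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / ell) ** 2)

def gp_posterior(x_star, x_train, y_train, sigma_n=0.1, ell=1.0):
    """GP predictive mean and covariance, working purely with the kernel."""
    K = sq_exp(x_train, x_train, ell) + sigma_n**2 * np.eye(len(x_train))
    K_s = sq_exp(x_train, x_star, ell)                   # n x n_star cross-covariances
    K_ss = sq_exp(x_star, x_star, ell)                   # prior covariance at test points
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha                                 # k_*^T (K + s^2 I)^{-1} y
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)         # posterior covariance
    return mean, cov

x_train = np.array([0.0, 1.0, 2.5, 4.0])                 # toy observations (assumed)
y_train = np.sin(x_train)
mean, cov = gp_posterior(np.linspace(0, 5, 100), x_train, y_train, sigma_n=0.05, ell=0.8)
```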