Brunton, Steven L., Joshua L. Proctor, and J. Nathan Kutz. 2016. “Discovering Governing Equations from Data by Sparse Identification of Nonlinear Dynamical Systems.” Proceedings of the National Academy of Sciences 113 (15). https://doi.org/10.1073/pnas.1517384113.
Rudy, Samuel H., Steven L. Brunton, Joshua L. Proctor, and J. Nathan Kutz. 2017. “Data-Driven Discovery of Partial Differential Equations.” Science Advances 3 (4): e1602614. https://doi.org/10.1126/sciadv.1602614.
Messenger, Daniel A., and David M. Bortz. 2021. “Weak SINDy for Partial Differential Equations.” Journal of Computational Physics 443 (October):110525. https://doi.org/10.1016/J.JCP.2021.110525.
Hirsh, Seth M., David A. Barajas-Solano, and J. Nathan Kutz. 2022. “Sparsifying Priors for Bayesian Uncertainty Quantification in Model Discovery.” Royal Society Open Science 9 (2). https://doi.org/10.1098/rsos.211823.
Consider a dynamical system $\dot{x} = f(x)$, where $x(t) \in \mathbb{R}^n$ is the state and $f$ is the evolution operator. SINDy seeks to discover a Sparse form of $f$ from data. The representation is sparse in some chosen library of candidate functions.
For instance, a library can be composed of polynomials, as
$$\theta(x) = \begin{bmatrix} 1 & x_1 & \cdots & x_n & x_1^2 & x_1 x_2 & \cdots & x_n^d \end{bmatrix},$$
where for a total polynomial degree of $d$, there are
$$p = \binom{n + d}{d}$$
total terms, making this object an element of $\mathbb{R}^{1 \times p}$. For example, $n = 3$ states with degree $d = 2$ give $\binom{5}{2} = 10$ terms.
Then, for the weight matrix $\Xi \in \mathbb{R}^{p \times n}$, the evolution function is given by
$$f(x) \approx \left(\theta(x)\,\Xi\right)^T.$$
Each column of $\Xi$ corresponds to a different entry in $\dot{x}$. E.g. taking $\xi_k$ as column $k$ of $\Xi$, $\dot{x}_k \approx \theta(x)\,\xi_k$.
The method for determining $\Xi$ always relies on measurements of the state, which are given in some data matrix $X$,
$$X = \begin{bmatrix} x^T(t_1) \\ \vdots \\ x^T(t_m) \end{bmatrix} \in \mathbb{R}^{m \times n}.$$
Then, the operator $\theta$ is applied row-wise, giving $\Theta(X) \in \mathbb{R}^{m \times p}$, as
$$\Theta(X) = \begin{bmatrix} \theta\big(x^T(t_1)\big) \\ \vdots \\ \theta\big(x^T(t_m)\big) \end{bmatrix}.$$
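As a concrete illustration (not the authors' reference code), a minimal sketch of building a polynomial $\Theta(X)$ in Python; the helper name `poly_library` is ours:

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_library(X, degree=2):
    """Row-wise polynomial library Theta(X) up to a total degree.

    X : (m, n) array of state snapshots, one row per time sample.
    Returns an (m, p) array whose columns are candidate monomials.
    """
    m, n = X.shape
    columns = [np.ones(m)]                      # constant term
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(n), d):
            columns.append(np.prod(X[:, idx], axis=1))
    return np.column_stack(columns)

# e.g. n = 2 states, degree 2 -> columns [1, x1, x2, x1^2, x1*x2, x2^2]
```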
We can then ask that $\dot{X} \approx \Theta(X)\,\Xi$, where $\dot{X}$ is computed as
$$\dot{X} = \begin{bmatrix} \dot{x}^T(t_1) \\ \vdots \\ \dot{x}^T(t_m) \end{bmatrix} \in \mathbb{R}^{m \times n},$$
typically by numerical differentiation of the measurements.
The training often involves some residual
$$\left\lVert \dot{X} - \Theta(X)\,\Xi \right\rVert_F^2$$
and some sparsity promoting penalty, e.g. from Compressed Sensing, using the $\ell_1$ norm
$$\lambda \left\lVert \Xi \right\rVert_1.$$
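One common way to handle this penalized objective is an off-the-shelf LASSO solver; a minimal sketch using scikit-learn, where the penalty weight `alpha` is an illustrative choice:

```python
from sklearn.linear_model import Lasso

def sparse_fit(Theta, Xdot, alpha=0.1):
    """Minimize ||Xdot - Theta @ Xi||^2 + alpha * ||Xi||_1, column by column."""
    model = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    model.fit(Theta, Xdot)          # Xdot : (m, n) stacked derivative data
    return model.coef_.T            # (p, n) coefficient matrix Xi
```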
Library
Rudy extends SINDy to the PDE framework by adding spatial derivatives of $u$ to the library. Given a state $u(x, t)$, the library function $\theta$ constructs a row of the library, e.g.:
$$\theta(u) = \begin{bmatrix} 1 & u & u^2 & u_x & u\,u_x & u_{xx} & \cdots \end{bmatrix}.$$
To be consistent with notation, we could think of $\theta(u, u_x, u_{xx}, \ldots)$, which makes the reliance on spatial derivatives more clear. Here, the spatial derivatives such as $u_x$ are computed with Finite Differences or Polynomial Interpolation and differentiation. Then, stacking all measurements into $U$, we compute the library matrix via the element-wise application of $\theta$, denoted as $\Theta(U)$. Similarly, we compute $u_t$ for these points, stacking them in the same way as $U_t$. The goal is to represent the dynamics via a linear combination of the library elements. That is, find $\xi$ such that:
$$U_t \approx \Theta(U)\,\xi.$$
Then, the evolution operator is given by $u_t = f(u, u_x, \ldots) \approx \theta(u)\,\xi$.
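A minimal sketch of building such a PDE library from gridded snapshots, using centered differences (`np.gradient`) for the spatial derivatives; the particular set of candidate terms is an illustrative choice:

```python
import numpy as np

def pde_library(U, dx):
    """Library Theta(U) = [1, u, u^2, u_x, u*u_x, u_xx] for 1D snapshots.

    U : (nt, nx) array of u(x, t) samples on an equispaced spatial grid.
    """
    u = U
    ux = np.gradient(u, dx, axis=1)          # centered finite differences
    uxx = np.gradient(ux, dx, axis=1)
    cols = [np.ones_like(u), u, u**2, ux, u * ux, uxx]
    return np.stack([c.ravel() for c in cols], axis=1)   # (nt*nx, 6)
```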
Solving
Solving the Least Squares problem gives many nonzero entries in $\xi$, and it could be ill-conditioned for a few reasons. First, computing the spatial derivatives for $\Theta(U)$ is ill-conditioned. Next, the monomial structure of the matrix might make it more poorly conditioned, similar to the Vandermonde Matrix. Given this, we impose the regularization assumption of a sparse $\xi$, which could be formed into the minimization problem
$$\min_{\xi} \left\lVert U_t - \Theta(U)\,\xi \right\rVert_2^2 + \lambda \left\lVert \xi \right\rVert_0.$$
Yet, the convex relaxation is instead solved, with the $\ell_0$ penalty replaced by $\lambda \lVert \xi \rVert_1$ (LASSO). With both LASSO and sequentially thresholded least squares (STLS) for this Sparse Regression, there is still an issue of ill-conditioning due to correlation of different columns of $\Theta(U)$. However, Tikhonov Regularization with the standard squared $\ell_2$ penalty addresses this problem. This approach can be used within STLS, which is named sequentially thresholded ridge regression (STRidge).
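A minimal sketch of STRidge under these assumptions: alternate a ridge-regularized solve with hard thresholding of small coefficients. The parameter names (`lam`, `tol`) are illustrative, not those of Rudy et al.'s implementation.

```python
import numpy as np

def stridge(Theta, ut, lam=1e-5, tol=0.1, iters=10):
    """Sequentially thresholded ridge regression (STRidge) sketch.

    Theta : (m, p) library matrix, ut : (m,) time-derivative data.
    lam   : ridge (Tikhonov) weight, tol : hard threshold on coefficients.
    """
    p = Theta.shape[1]
    active = np.arange(p)            # indices of surviving library terms
    xi = np.zeros(p)
    for _ in range(iters):
        A = Theta[:, active]
        # ridge-regularized least squares on the active columns
        coef = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ ut)
        keep = np.abs(coef) >= tol   # zero out small coefficients
        xi[:] = 0.0
        xi[active[keep]] = coef[keep]
        if keep.all() or not keep.any():
            break
        active = active[keep]
    return xi
```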
Numerical Differentiation
The other key consideration for these methods is the Numerical Differentiation. For noisy data with noise magnitude $\epsilon$ and grid size $\Delta x$, the finite-difference derivative has error
$$\mathcal{O}\!\left(\frac{\epsilon}{\Delta x}\right),$$
so noise is amplified as the grid is refined.
Convolution with a smoothing Kernel or Tikhonov differentiation tends to remove important features, and this bias causes inaccurate discovery. Instead, the authors fit a polynomial of degree $p$ to at least $p + 1$ points, and the polynomial's analytic derivatives are evaluated. Note that when using exactly $p + 1$ points, this is a finite difference of order $p$. For example, a centered second-order difference requires $3$ points and gives the same computation as fitting a degree-$2$ polynomial and then differentiating. Sometimes, more points are used in total for calculating derivatives than there are added to the system (subsampling by selecting rows).
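A minimal sketch of that local polynomial-fit derivative (fit a low-degree polynomial over a sliding window of noisy samples, then evaluate its analytic derivative); `degree` and `width` are illustrative defaults:

```python
import numpy as np

def poly_derivative(t, u, degree=3, width=5):
    """Estimate du/dt at each sample by fitting a local polynomial.

    t, u  : 1D arrays of times and (noisy) values.
    degree: polynomial degree; width : points on each side of the center.
    """
    du = np.zeros_like(u)
    for j in range(len(t)):
        lo = max(0, j - width)
        hi = min(len(t), j + width + 1)
        coeffs = np.polyfit(t[lo:hi], u[lo:hi], degree)   # least-squares fit
        du[j] = np.polyval(np.polyder(coeffs), t[j])      # analytic derivative
    return du
```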
There can be some further Denoising operation, before numerical differentiation and before the regularized solve. The authors truncate the Singular Value Decomposition.
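A short sketch of that SVD-truncation denoising step, assuming the data is arranged as a space-by-time matrix and an illustrative truncation rank `r`:

```python
import numpy as np

def svd_denoise(U_data, r):
    """Project a space-by-time data matrix onto its leading r singular vectors."""
    U, s, Vt = np.linalg.svd(U_data, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]
```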
Weak SINDy
As opposed to having a residual in terms of derivatives, weak SINDy methods use a Weak Form, testing against test/trial functions and applying Integration by Parts.
Weak SINDy for ODEs
We will consider $\dot{x} = f(x)$ within the framework of SINDy above, having (transposed) measurements stacked as $X \in \mathbb{R}^{m \times n}$. Our goal is to represent $f$ from these measurements. Consider $x_i$ as an arbitrary element of $x$. The problem is then finding $\xi_i$ such that $\dot{x}_i \approx \theta(x)\,\xi_i$.
We can test against some arbitrary function $\phi(t)$, giving
$$\int_{t_1}^{t_m} \phi(t)\,\dot{x}_i(t)\,dt = \int_{t_1}^{t_m} \phi(t)\,\theta\big(x(t)\big)\,\xi_i\,dt.$$
We can apply integration by parts to the left-hand side, now giving
$$\Big[\phi(t)\,x_i(t)\Big]_{t_1}^{t_m} - \int_{t_1}^{t_m} \dot{\phi}(t)\,x_i(t)\,dt = \int_{t_1}^{t_m} \phi(t)\,\theta\big(x(t)\big)\,\xi_i\,dt.$$
Yet, we will require that $\phi(t_1) = \phi(t_m) = 0$, so thus
$$-\int_{t_1}^{t_m} \dot{\phi}(t)\,x_i(t)\,dt = \int_{t_1}^{t_m} \phi(t)\,\theta\big(x(t)\big)\,\xi_i\,dt.$$
Unlike the strong form, we can't just choose specific times for this to hold based on where we have data. Thus, we use the times where we do have data to evaluate the integrals. We will use the composite trapezoid rule, which approximates
$$\int_a^b g(t)\,dt \approx \frac{b - a}{2}\big(g(a) + g(b)\big),$$
or for a mesh of $t_1 < t_2 < \cdots < t_m$,
$$\int_{t_1}^{t_m} g(t)\,dt \approx \sum_{j=1}^{m-1} \frac{t_{j+1} - t_j}{2}\big(g(t_j) + g(t_{j+1})\big).$$
If the mesh is equispaced with size $\Delta t$, this simplifies to the form we will use:
$$\int_{t_1}^{t_m} g(t)\,dt \approx \Delta t \left( \tfrac{1}{2} g(t_1) + \sum_{j=2}^{m-1} g(t_j) + \tfrac{1}{2} g(t_m) \right).$$
Yet, we do not need to use all of the data we have for $x$. Instead, we only use the samples corresponding to when $\phi$ is nonzero.
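To make this concrete, here is a minimal sketch that assembles the weak-form system $G\,\xi_i \approx b$ for one state component on an equispaced grid, using simple polynomial bump test functions and the trapezoid rule above; the test-function family and the names `centers`, `radius` are illustrative choices, not necessarily those of Messenger and Bortz.

```python
import numpy as np

def trapz_weights(m, dt):
    """Composite trapezoid-rule weights on an equispaced grid of m points."""
    w = np.full(m, dt)
    w[0] = w[-1] = dt / 2.0
    return w

def weak_system(t, x_i, Theta, centers, radius):
    """Assemble b_k = -int(phi_k' x_i dt) and G_{k,:} = int(phi_k theta(x) dt).

    t       : (m,) equispaced times, x_i : (m,) samples of one state component.
    Theta   : (m, p) library evaluated at the measured states.
    centers : centers of the test-function supports; radius : support half-width.
    """
    dt = t[1] - t[0]
    rows_b, rows_G = [], []
    for c in centers:
        lo, hi = c - radius, c + radius
        mask = (t > lo) & (t < hi)             # only samples where phi_k != 0
        ts = t[mask]
        if len(ts) < 2:
            continue                           # support too small for this grid
        w = trapz_weights(len(ts), dt)
        # polynomial bump, zero at both endpoints of its support
        phi = (ts - lo) ** 2 * (hi - ts) ** 2
        dphi = 2 * (ts - lo) * (hi - ts) ** 2 - 2 * (ts - lo) ** 2 * (hi - ts)
        rows_b.append(-np.sum(w * dphi * x_i[mask]))
        rows_G.append(np.sum((w * phi)[:, None] * Theta[mask], axis=0))
    return np.array(rows_b), np.array(rows_G)  # then solve G xi_i ~ b sparsely
```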
Weak SINDy for PDEs
TODO
Note, this analysis mostly assumes one spatial dimension, for simplicity. Consider a library $\{\theta_s\}_{s=1}^{p}$ such that
$$\theta_s(u) = \partial_x^{\alpha_s} g_s(u).$$
That is, each library function $\theta_s$ has a function $g_s$ which is differentiated with order $\alpha_s$ (a potential multi-index). It may be harder to represent the same libraries in this form! Beginning with the form of the governing equation, we have
$$u_t = \sum_{s=1}^{p} \xi_s\, \partial_x^{\alpha_s} g_s(u).$$
Then, multiplying by a test function $\phi(x, t)$ and integrating both sides over the entire domain $\Omega \times [0, T]$, and pulling out the sum and $\xi_s$, which does not depend on $x$ or $t$,
$$\int_0^T \!\! \int_\Omega \phi\, u_t \, dx\, dt = \sum_{s=1}^{p} \xi_s \int_0^T \!\! \int_\Omega \phi\, \partial_x^{\alpha_s} g_s(u) \, dx\, dt.$$
Now applying Integration by Parts and assuming sufficient derivatives of $\phi$ are $0$ along the boundary of $\Omega$, and at $t = 0, T$, we can move the derivative operators onto the test function, with an added sign term. This gives
$$-\int_0^T \!\! \int_\Omega \phi_t\, u \, dx\, dt = \sum_{s=1}^{p} \xi_s\, (-1)^{|\alpha_s|} \int_0^T \!\! \int_\Omega \big(\partial_x^{\alpha_s} \phi\big)\, g_s(u) \, dx\, dt.$$
Note, this holds for an arbitrary $\phi$ (subject only to the BC constraint), thus we can talk about it holding for a whole family of test functions $\{\phi_k\}_{k=1}^{K}$. We input the measured data $U$, and can calculate the integrals using numerical integration, knowing the mesh sizes (e.g. a rectangle or trapezoidal approximation of the definite integrals). For $\phi_k$, this gives
$$b_k := -\int_0^T \!\! \int_\Omega (\phi_k)_t\, u \, dx\, dt \approx \sum_{s=1}^{p} \underbrace{(-1)^{|\alpha_s|} \int_0^T \!\! \int_\Omega \big(\partial_x^{\alpha_s} \phi_k\big)\, g_s(u) \, dx\, dt}_{=:\, G_{ks}} \;\xi_s.$$
Repeating this for all the $\phi_k$ yields the matrix system
$$b \approx G\,\xi,$$
where $b \in \mathbb{R}^{K}$, $G \in \mathbb{R}^{K \times p}$, and $\xi \in \mathbb{R}^{p}$. We assume that we can take $K \geq p$, as these are arbitrary trial functions. We follow a similar least-squares / sparsity-promoting solve to get $\xi$, which then fully defines the original (strong form) PDE.
UQ-SINDy
Consider a data or feature matrix $\Theta \in \mathbb{R}^{m \times p}$ (like $\Theta(X)$ above). Each row is a sample of the state at a different time ($t_1, \ldots, t_m$). For a Linear Regression problem, we seek $w$ such that, for noise $\eta$,
$$y = \Theta\, w + \eta.$$
The goal of Bayesian Methods is to find the posterior $p(w \mid y)$. UQ-SINDy uses a model of the form
$$p(w \mid y) = \frac{p(y \mid w)\, p(w)}{p(y)},$$
although the computation of the Posterior is not tractable. The predictive posterior is given similarly as
$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid w)\, p(w \mid y)\, dw.$$
The integral is approximated by taking the expectation of the likelihood evaluated at samples from the posterior.
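A minimal sketch of that Monte Carlo approximation for the predictive mean, assuming posterior samples are already available (e.g. from MCMC); `w_samples` and the linear-in-`Theta` mean structure are illustrative assumptions:

```python
import numpy as np

def predictive_mean(Theta_new, w_samples):
    """Monte Carlo estimate of E[y_new | data] averaged over posterior samples.

    Theta_new : (m_new, p) library rows at the prediction points.
    w_samples : (S, p) coefficient samples drawn from the posterior.
    """
    preds = Theta_new @ w_samples.T   # (m_new, S): one prediction per sample
    return preds.mean(axis=1)
```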
To promote sparsity in $w$ (or $\Xi$), there are a number of priors to choose from. The maximum a posteriori estimate for the Laplace Distribution prior corresponds to Least Squares regression with $\ell_1$-regularization. This works reasonably well, but the Spike and Slab (hierarchical) prior is generally better.
Spike and Slab Prior
Each $w_j$ comes from a hierarchical model:
$$w_j \mid \lambda_j \sim \lambda_j\, \mathcal{N}(0, \sigma_{\text{slab}}^2) + (1 - \lambda_j)\, \delta_0,$$
where
$$\lambda_j \sim \text{Bernoulli}(p_0).$$
Here $p_0$ is the prior probability that $\lambda_j = 1$ and $w_j$ is nonzero, following a 'slab' prior, $\mathcal{N}(0, \sigma_{\text{slab}}^2)$. Otherwise, $\lambda_j = 0$ and $w_j$ follows a 'spike' prior at $0$ (Dirac delta distribution). This can be loosened to just a narrow distribution centered at $0$, like $\mathcal{N}(0, \sigma_{\text{spike}}^2)$ with $\sigma_{\text{spike}} \ll \sigma_{\text{slab}}$. Yet, this is a tough distribution to work with, due to its combinatorial nature.
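A tiny sketch of drawing a coefficient vector from the relaxed (continuous-spike) version of this prior; `p0`, `sigma_spike`, and `sigma_slab` follow the notation above and are illustrative hyperparameters:

```python
import numpy as np

def sample_spike_slab(p, p0=0.1, sigma_spike=1e-3, sigma_slab=1.0, rng=None):
    """Draw one coefficient vector w from the relaxed spike-and-slab prior."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.random(p) < p0                        # lambda_j ~ Bernoulli(p0)
    scale = np.where(lam, sigma_slab, sigma_spike)  # slab if lambda_j = 1, else spike
    return rng.normal(0.0, scale)                   # w_j ~ N(0, scale_j^2)
```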
We can carry on and do Markov Chain Monte Carlo, with a sparsity-promoting prior on the ODE coefficients and integrating the ODE to provide a prediction. E.g. use
$$\hat{x}(t; w) = x(t_1) + \int_{t_1}^{t} \theta\big(\hat{x}(\tau)\big)\, w \, d\tau,$$
and compare against the measurements $x(t_j)$, including our error distribution over the observations. Note, this differs a bit from standard SINDy: if we do maximum likelihood estimation, we're comparing a prediction that involves numerical integration, whereas SINDy generally constructs comparisons through numerical differentiation.
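A minimal sketch of the kind of likelihood this implies: the log posterior integrates the candidate ODE forward and compares the simulated trajectory to the data under an assumed Gaussian error model. `theta_fn` and `log_prior` are placeholder callables, not names from the papers above.

```python
import numpy as np
from scipy.integrate import solve_ivp

def log_posterior(w, t, x_obs, theta_fn, sigma, log_prior):
    """Log posterior for w: simulate dx/dt = theta(x) w and compare to data.

    w        : (p, n) coefficient matrix, x_obs : (m, n) measured states at times t.
    theta_fn : maps a state vector to its (p,) library row.
    """
    rhs = lambda _t, x: theta_fn(x) @ w             # candidate dynamics
    sol = solve_ivp(rhs, (t[0], t[-1]), x_obs[0], t_eval=t)
    if not sol.success or sol.y.shape[1] != len(t):
        return -np.inf                              # blow-up: reject this sample
    resid = sol.y.T - x_obs                         # Gaussian measurement error
    return -0.5 * np.sum(resid ** 2) / sigma ** 2 + log_prior(w)
```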
TODO
What does this look like if we do Variational Inference instead? Does it match my idea for the neural network variant?