SINDy

Resources

Main Idea

Within Ordinary Differential Equation Discovery and Partial Differential Equation Discovery, the Sparse Identification of Nonlinear Dynamics (SINDy) methods assume a sparse structure for the time derivative (e.g. the operator $\mathcal{N}$ below), relying on some library of candidate functions. These methods are claimed to be Interpretable.

ODE case

Consider an Ordinary Differential Equation,

$$\dot{y} = \frac{dy}{dt} = f(y),$$

where $y(t) \in \mathbb{R}^n$ is the state and $f: \mathbb{R}^n \to \mathbb{R}^n$ is the evolution operator. SINDy seeks to discover a Sparse form of $f$ from data. The representation is sparse in some chosen library of candidate functions.

For instance, a library can be composed of polynomial terms, as

$$\Theta(y) = \begin{bmatrix} 1 & y_1 & y_2 & \cdots & y_n & y_1^2 & y_1 y_2 & \cdots \end{bmatrix},$$

where for a total polynomial degree of p, there are

$$P = \frac{(n+p)!}{p!\,n!} = \binom{n+p}{p}$$

total terms, making this object an element of $\mathbb{R}^{1 \times P}$.
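For example, with $n = 3$ states and total degree $p = 2$, this gives $P = \binom{5}{2} = 10$ candidate terms.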
Then, for the weight matrix $\Xi \in \mathbb{R}^{P \times n}$, the evolution function is given by

$$f(y) = \left(\Theta(y)\,\Xi\right)^T.$$

Each column of $\Xi$ corresponds to a different entry in $y$. E.g. taking $\xi_k$ as column $k$ of $\Xi$, $\dot{y}_k = \Theta(y)\,\xi_k$.

The method for determining $\Xi$ always relies on $m$ measurements of the state, which are given in some data matrix $Y \in \mathbb{R}^{m \times n}$,

$$Y = \begin{bmatrix} y(t_1)^T \\ y(t_2)^T \\ \vdots \\ y(t_m)^T \end{bmatrix}.$$

Then, the operator $\Theta$ is applied row-wise, giving $\Theta(Y) \in \mathbb{R}^{m \times P}$, as

$$\Theta(Y) = \begin{bmatrix} \Theta(y(t_1)) \\ \Theta(y(t_2)) \\ \vdots \\ \Theta(y(t_m)) \end{bmatrix}.$$

We can then verify that $\dot{Y} \in \mathbb{R}^{m \times n}$, which is computed as

$$\dot{Y} = \Theta(Y)\,\Xi.$$

The training often involves some residual

$$\left\|\dot{Y} - \Theta(Y)\,\Xi\right\|_2^2,$$

and some sparsity-promoting penalty, e.g. from Compressed Sensing, using the $\ell_1$ norm

$$\|\Xi\|_1.$$

These can be combined to give a Convex Program (or even Quadratic Program) in the unknown coefficients Ξ.
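As a rough sketch (not a reference implementation), the whole pipeline can be written in a few lines of NumPy: build a polynomial library, then apply sequentially thresholded least squares. The helper names, the degree-2 cutoff, and the threshold/iteration values below are illustrative choices.

```python
import numpy as np

def polynomial_library(Y, degree=2):
    """Theta(Y): columns [1, y_1, ..., y_n, y_1^2, y_1 y_2, ...] (built up to degree 2 here)."""
    m, n = Y.shape
    cols = [np.ones(m)] + [Y[:, i] for i in range(n)]
    if degree >= 2:
        cols += [Y[:, i] * Y[:, j] for i in range(n) for j in range(i, n)]
    return np.column_stack(cols)

def stls(Theta, Ydot, threshold=0.1, iters=10):
    """Sequentially thresholded least squares: solve, zero out small coefficients, refit."""
    Xi = np.linalg.lstsq(Theta, Ydot, rcond=None)[0]              # shape (P, n)
    for _ in range(iters):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for k in range(Ydot.shape[1]):                            # refit each state on its active terms
            active = ~small[:, k]
            if active.any():
                Xi[active, k] = np.linalg.lstsq(Theta[:, active], Ydot[:, k], rcond=None)[0]
    return Xi
```

Given measurements $Y$ and numerically differentiated $\dot{Y}$, `stls(polynomial_library(Y), Ydot)` returns a sparse $\Xi$.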

PDE-FIND

Library
Rudy et al. extend SINDy to the PDE framework by adding spatial derivatives of $u$ to the library. Given a state $u$, the library function $\Theta(u)$ constructs a row of the library, e.g.:

$$\Theta(u) = \left[\,1,\; u,\; u^2,\; \ldots,\; u_x,\; u\,u_x,\; \ldots,\; u^3 u_{xxx}\,\right].$$

To be consistent with the earlier notation, we could write $\theta(u, u_x, u_{xx}, \ldots)$, which makes the reliance on spatial derivatives more explicit. Here, the spatial derivatives such as $u_x$ are computed with Finite Differences or Polynomial Interpolation and differentiation. Then, stacking all measurements into $U \in \mathbb{R}^{n_x n_t}$, we compute the library matrix via the element-wise application of $\Theta$, denoted $\Theta(U) \in \mathbb{R}^{n_x n_t \times P}$. Similarly, we compute $u_t$ at these points, stacked in the same way as $U_t$. The goal is to represent the dynamics via a linear combination of the library elements; that is, find $\xi \in \mathbb{R}^P$ such that:

$$U_t = \Theta(U)\,\xi.$$

Then, the operator is given by $\mathcal{N}(u, u_x, u_{xx}, \ldots) = \theta(u, u_x, u_{xx}, \ldots)\,\xi$.
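As a sketch of how the library matrix might be assembled from snapshot data (the term set and the use of `np.gradient` for the spatial derivatives here are illustrative choices, not the paper's):

```python
import numpy as np

def pde_library(U, dx):
    """Theta(U) for snapshots U of shape (Nt, Nx); each space-time point gives one row.
    Spatial derivatives are computed with simple finite differences (np.gradient)."""
    ux = np.gradient(U, dx, axis=1)
    uxx = np.gradient(ux, dx, axis=1)
    terms = {
        "1": np.ones_like(U), "u": U, "u^2": U**2,
        "u_x": ux, "u*u_x": U * ux,
        "u_xx": uxx, "u*u_xx": U * uxx,
    }
    names = list(terms)
    Theta = np.column_stack([terms[k].ravel() for k in names])
    return Theta, names
```

$U_t$ can be flattened the same way, e.g. `np.gradient(U, dt, axis=0).ravel()`, before the sparse solve described next.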

Solving
Solving the plain Least Squares problem gives many nonzero entries in $\xi$, and the problem can be ill-conditioned for a few reasons. First, computing the spatial derivatives for $\Theta$ is itself ill-conditioned. Next, the monomial structure of the matrix can make it more poorly conditioned, similar to the Vandermonde Matrix. Given this, we impose the regularizing assumption of a sparse $\xi$, which can be formed into the minimization problem

$$\operatorname*{arg\,min}_{\xi}\; \left\|U_t - \Theta(U)\,\xi\right\|_2^2 + \lambda \|\xi\|_0.$$

Yet, the convex relaxation is instead solved, with $\|\xi\|_1$. With both LASSO and sequentially thresholded least squares (STLS) for this Sparse Regression, there is still an issue of ill-conditioning due to correlation between different columns of $\Theta$. However, Tikhonov Regularization with the standard squared $\ell_2$ penalty addresses this problem. This approach can be used within STLS, giving sequentially thresholded ridge regression (STRidge).
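A minimal STRidge sketch, assuming a precomputed library $\Theta(U)$ and flattened $U_t$; the ridge parameter, threshold, and iteration count are illustrative, and practical implementations typically also normalize the library columns:

```python
import numpy as np

def stridge(Theta, ut, lam=1e-5, threshold=0.1, iters=10):
    """Sequentially thresholded ridge regression: ridge solve, hard-threshold, refit on survivors."""
    P = Theta.shape[1]
    ridge = lambda A, b: np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)
    xi = ridge(Theta, ut)
    for _ in range(iters):
        active = np.abs(xi) >= threshold
        xi = np.zeros(P)
        if active.any():
            xi[active] = ridge(Theta[:, active], ut)
    return xi
```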

Numerical Differentiation
The other key consideration for these methods is the Numerical Differentiation. For data with noise of magnitude $\varepsilon$ and grid size $\Delta x$, the $d$th derivative has error

$$O\!\left(\frac{\varepsilon}{\Delta x^{d}}\right).$$

See also Finite Differences#Error Analysis.

Convolution with a smoothing Kernel or Tikhonov differentiation tends to remove important features, and this bias causes inaccurate discovery. Instead, the authors fit a polynomial of degree $p$ to at least $p+1$ points and evaluate the polynomial's analytic derivatives. Note that when exactly $p+1$ points are used, this is a finite difference of order $p$: for example, $p=2$ requires 3 points and gives the same computation as fitting a degree-2 polynomial and then differentiating. Sometimes more points are used in total for calculating derivatives than are added to the regression system (subsampling by selecting rows).
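A small sketch of this polynomial-fit differentiation (the degree and stencil width are illustrative; the grid is assumed ordered):

```python
import numpy as np

def poly_derivative(u, x, order=1, degree=3, stencil=7):
    """d-th derivative of noisy samples u(x): fit a degree-`degree` polynomial to `stencil`
    nearby points around each x[i], then evaluate its analytic derivative at x[i]."""
    half = stencil // 2
    du = np.zeros(len(u))
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        coeffs = np.polyfit(x[lo:hi] - x[i], u[lo:hi], degree)   # fit in local coordinates
        du[i] = np.poly1d(coeffs).deriv(order)(0.0)              # analytic derivative at x[i]
    return du
```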

Miscellaneous
This technique can be applied to a Fokker-Planck Equation within Stochastic Differential Equations, by taking $u(x,t)$ as a distribution over the uncertain space $x$. The authors discover the Diffusion Equation from Brownian Motion, treating $u$ as a histogram.

There can be some further Denoising operation before the numerical differentiation and the regularized solve; the authors use a truncated Singular Value Decomposition.

Weak SINDy

As opposed to having a residual in terms of derivatives, weak SINDy methods use a Weak Form, testing against test/trial functions and applying Integration by Parts.

Weak SINDy for ODEs

We will work within the same framework: $y(t) \in \mathbb{R}^n$, with $m$ (transposed) measurements stacked as $Y \in \mathbb{R}^{m \times n}$. Our goal is again to represent $\dot{y} = f(y)$ from these measurements.

We can test against some arbitrary function ϕ(t), giving

$$\int_{t_a}^{t_b} \dot{y}\,\phi\, dt = \int_{t_a}^{t_b} f(y)\,\phi\, dt.$$

We can apply integration by parts to the left-hand side, now giving

$$\big[y\,\phi\big]_{t_a}^{t_b} - \int_{t_a}^{t_b} y\,\dot{\phi}\, dt = \int_{t_a}^{t_b} f(y)\,\phi\, dt.$$

Yet, we will require that $\phi(t_a) = \phi(t_b) = 0$, so that the boundary term vanishes and

$$-\int_{t_a}^{t_b} y\,\dot{\phi}\, dt = \int_{t_a}^{t_b} f(y)\,\phi\, dt.$$

Unlike the strong form, we can't just enforce this at the specific times where we have data. Instead, we use the times where we do have data to evaluate the integrals numerically. We will use the composite trapezoid rule, which approximates

$$\int_a^b g(x)\,dx \approx \frac{g(a) + g(b)}{2}(b - a),$$

or for a mesh of $x_0, \ldots, x_n$,

$$\int_{x_0}^{x_n} g(x)\,dx = \sum_{i=0}^{n-1} \int_{x_i}^{x_{i+1}} g(x)\,dx \approx \frac{1}{2}\sum_{i=0}^{n-1} \big(g(x_{i+1}) + g(x_i)\big)(x_{i+1} - x_i).$$

If the mesh is equispaced with size Δx, this simplifies to the form we will use:

$$\int_{x_0}^{x_n} g(x)\,dx \approx \frac{\Delta x}{2}\sum_{i=0}^{n-1} \big(g(x_{i+1}) + g(x_i)\big) = \Delta x\left(\frac{1}{2}g(x_0) + \sum_{i=1}^{n-1} g(x_i) + \frac{1}{2}g(x_n)\right).$$

Yet, we do not need to use all of the data we have for y(t). Instead, we only use the samples corresponding to when ϕ(t) is nonzero.
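As a minimal sketch of assembling the resulting weak-form system with the trapezoid rule above (the test functions `phis`, their derivatives `dphis`, and the `polynomial_library` helper are assumed inputs, e.g. the one sketched in the ODE section):

```python
import numpy as np

def weak_ode_system(Y, t, phis, dphis, library):
    """Assemble G Xi ≈ B from  -∫ y φ' dt = ∫ Θ(y) Ξ φ dt, one row per test function φ,
    with both integrals evaluated by the equispaced composite trapezoid rule."""
    w = np.full(len(t), t[1] - t[0])    # trapezoid weights; samples where φ = 0 contribute nothing
    w[0] *= 0.5
    w[-1] *= 0.5
    Theta = library(Y)                                       # shape (m, P)
    B = np.stack([-(w * dphi(t)) @ Y for dphi in dphis])     # shape (K, n), left-hand sides
    G = np.stack([(w * phi(t)) @ Theta for phi in phis])     # shape (K, P)
    return B, G        # then solve G Xi ≈ B with a sparse regression (e.g. STLS from earlier)
```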

Weak SINDy for PDEs

TODO

Note, this analysis mostly assumes one spatial dimension, for simplicity. Consider a library such that

$$\Theta(u)_i = \frac{d^{\alpha_i}}{dx^{|\alpha_i|}} f_i(u).$$

That is, each library term is a function $f_i(u)$ which is differentiated with order $\alpha_i$ (potentially a multi-index). It may be harder to represent the same libraries in this form! Beginning with the form of the governing equation, we have

$$u_t = \sum_i \xi_i \frac{d^{\alpha_i}}{dx^{|\alpha_i|}} f_i(u). \tag{1}$$

Then, multiplying by a test function $\psi(x,t)$, integrating both sides over the entire domain, and pulling out the sum and $\xi$ (which do not depend on $x$ or $t$), we get

$$\int_0^T \!\!\int_\Omega \psi\, u_t \, d\Omega\, dt = \sum_i \xi_i \int_0^T \!\!\int_\Omega \psi\, \frac{d^{\alpha_i}}{dx^{|\alpha_i|}} f_i(u)\, d\Omega\, dt.$$

Now applying Integration by Parts, and assuming sufficiently many derivatives of $\psi$ vanish along the boundary of $\Omega$ and at $t = 0, T$, we can move the derivative operators onto the test function, at the cost of a sign factor. This gives

$$-\int_0^T \!\!\int_\Omega \psi_t\, u\, d\Omega\, dt = \sum_i \xi_i (-1)^{|\alpha_i|} \int_0^T \!\!\int_\Omega \left(\frac{d^{\alpha_i}}{dx^{|\alpha_i|}}\psi\right) f_i(u)\, d\Omega\, dt.$$

Note, this holds for an arbitrary $\psi$ (subject only to the boundary-condition constraint), so we can require it to hold for a collection $\psi \in \{\psi_k\}_{k \in [K]}$. We input the measured data $U$ and calculate the integrals via numerical integration, knowing the mesh sizes (e.g. a rectangle or trapezoidal approximation of the definite integrals). For $\psi_k$, this gives

$$\underbrace{-\sum_{m=1}^{N_t}\sum_{n=1}^{N_x} \frac{d}{dt}\psi_k\, U_{mn}\, \Delta x\, \Delta t}_{b_k} \;=\; \sum_i \xi_i \underbrace{(-1)^{|\alpha_i|}\sum_{m=1}^{N_t}\sum_{n=1}^{N_x} \frac{d^{\alpha_i}}{dx^{|\alpha_i|}}\psi_k\, f_i(U_{mn})\, \Delta x\, \Delta t}_{G_{ki}}.$$

Repeating this for all of the $\psi_k$ yields the matrix system

$$b = G\,\xi,$$

where $b \in \mathbb{R}^K$, $G \in \mathbb{R}^{K \times P}$, and $\xi \in \mathbb{R}^P$. We assume that we can take $K \geq P$, as these are arbitrary test functions. We follow a similar least-squares / sparsity-promoting solve to get $\xi$, which then fully defines the original (strong-form) PDE.
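As a rough sketch of assembling $b$ and $G$ in one spatial dimension, assuming separable polynomial-bump test functions $\psi_k(x,t) = \psi_k^x(x)\,\psi_k^t(t)$ and a simple rectangle-rule quadrature (the bump form, its parameters, and the rectangle rule are illustrative choices, not the authors'):

```python
import numpy as np
from numpy.polynomial import Polynomial

def bump(center, radius, q=4):
    """Polynomial bump (1 - s^2)^q with s = (z - center)/radius, supported on |s| < 1.
    Returns a function giving its d-th derivative on a grid (zero outside the support)."""
    s = Polynomial([-center / radius, 1.0 / radius])   # s(z) = (z - center) / radius
    p = (1 - s**2) ** q
    def d_bump(z, d=0):
        vals = p.deriv(d)(z) if d > 0 else p(z)
        return np.where(np.abs(z - center) < radius, vals, 0.0)
    return d_bump

def weak_pde_system(U, x, t, f_list, alphas, psis_x, psis_t):
    """Assemble b_k and G_ki by a rectangle-rule approximation of the weak-form integrals,
    for snapshot data U of shape (Nt, Nx) on grids t and x."""
    dx, dt = x[1] - x[0], t[1] - t[0]
    b = np.zeros(len(psis_x))
    G = np.zeros((len(psis_x), len(f_list)))
    for k, (px, pt) in enumerate(zip(psis_x, psis_t)):
        b[k] = -np.sum(np.outer(pt(t, 1), px(x, 0)) * U) * dx * dt        # -∫∫ ψ_t u
        for i, (f, a) in enumerate(zip(f_list, alphas)):
            G[k, i] = (-1) ** a * np.sum(np.outer(pt(t, 0), px(x, a)) * f(U)) * dx * dt
    return b, G
```

For example, a heat-equation library with the single term $\frac{d^2}{dx^2}u$ would use `f_list = [lambda U: U]` and `alphas = [2]`; the resulting $b = G\xi$ is then handed to the same sparsity-promoting regression as before.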

UQ-SINDy

Consider a data or feature matrix $X$ (like $Y$ above). Each row is a sample of the state at a different time ($y_i^T = x_i^T + \epsilon_i$). For a Linear Regression problem, we seek $\beta$ such that, for noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$,

$$y = \beta^T X + \epsilon.$$

The goal of Bayesian Methods is to find $p(\beta, \sigma \mid X, y)$. SINDy uses a model of the form

$$y_i^T = x_0^T + \int_0^{t_i} \Theta(x(\tau))\,\Xi\, d\tau + \epsilon_i.$$

Through Bayes Theorem,

$$p(\Xi, x_0, \sigma \mid X) \propto \underbrace{p(X \mid \Xi, x_0, \sigma)}_{\text{likelihood}}\;\underbrace{p(\Xi)\,p(x_0)\,p(\sigma)}_{\text{priors}},$$

although the computation of the Posterior is not tractable. The predictive posterior is given similarly as

$$p(x(t) \mid X) = \int \underbrace{p(x(t) \mid \Xi, x_0, \sigma)}_{\text{likelihood}}\;\underbrace{p(\Xi, x_0, \sigma \mid X)}_{\text{posterior}}\, d\Xi\, dx_0\, d\sigma.$$

The integral is approximated by taking the expectation of the likelihood evaluated at samples from the posterior.

To promote sparsity in $\Xi$ or $\beta$, there are a number of priors to choose from. The maximum a posteriori estimate under a Laplace Distribution prior corresponds to Least Squares regression with $\ell_1$-regularization. This works reasonably well, but the Spike and Slab (hierarchical) prior is generally better.

Spike and Slab Prior

Each $\beta_j$ comes from a hierarchical model:

$$\beta_j \mid \lambda_j \sim \mathcal{N}(0, c^2)\,\lambda_j,$$

where

$$\lambda_j \sim \mathrm{Ber}(\pi).$$

Here $\pi$ is the prior probability that $\lambda_j = 1$ and $\beta_j$ is nonzero, following a 'slab' prior, $\mathcal{N}(0, c^2)$. Otherwise, $\lambda_j = 0$ and $\beta_j$ follows a 'spike' prior at 0 (a Dirac delta distribution). This can be relaxed to a narrow distribution centered at 0, like $\mathcal{N}(0, \varepsilon^2)$. Yet, this is a tough distribution to work with, due to its combinatorial nature.
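To make the prior concrete, a tiny sampling sketch (the values of $\pi$ and $c$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_spike_slab(P, pi=0.1, c=1.0, draws=10000):
    """Draws from the spike-and-slab prior: lambda_j ~ Ber(pi); beta_j = 0 if lambda_j = 0,
    otherwise beta_j ~ N(0, c^2) (an exact Dirac spike at zero)."""
    lam = rng.random((draws, P)) < pi            # inclusion indicators lambda_j
    slab = rng.normal(0.0, c, size=(draws, P))   # the 'slab' component
    return np.where(lam, slab, 0.0)

beta = sample_spike_slab(P=10)
print((beta != 0).mean())   # fraction of nonzero coefficients, ≈ pi
```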

We can carry on and do Markov Chain Monte Carlo, with a sparsity-promoting prior on the ODE coefficients and integrating the ODE to provide a prediction. E.g. use

$$\hat{x}_i^T = x_0^T + \int_0^{t_i} \Theta(x(\tau))\,\Xi\, d\tau,$$

and compare against $y_i^T$, including our error distribution over $\epsilon_i$. Note, this differs a bit from standard SINDy: even if we do maximum likelihood estimation, we are comparing against a prediction that involves numerical integration, whereas SINDy generally constructs its comparison through numerical differentiation.

TODO

What does this look like if we do Variational Inference instead? Does it match my idea for the neural network variant?