SINDy

Resources

Main Idea

Within Ordinary Differential Equation Discovery and Partial Differential Equation Discovery, the Sparse Identification of Nonlinear Dynamics (SINDy) methods assume a sparse structure for the time derivative (e.g. the operator $\mathcal{N}$ below), relying on some library of candidate functions. These methods are claimed to be Interpretable.

ODE case

Consider an Ordinary Differential Equation,

$$\dot{y} = \frac{dy}{dt} = f(y),$$

where $y(t) \in \mathbb{R}^n$ is the state and $f: \mathbb{R}^n \to \mathbb{R}^n$ is the evolution operator. SINDy seeks to discover a Sparse form of $f$ from data. The representation is sparse in some chosen library of candidate functions.

For instance, a library can be composed of polynomial terms, as

$$\Theta(y) = \begin{bmatrix} 1 & y_1 & y_2 & \cdots & y_n & y_1^2 & y_1 y_2 & \cdots \end{bmatrix},$$

where for a total polynomial degree of p, there are

$$P = \frac{(n+p)!}{p!\,n!} = \binom{n+p}{p}$$

total terms, making this object an element of $\mathbb{R}^{1 \times P}$.
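For example, with $n = 3$ states and total degree $p = 2$, this gives $P = \binom{5}{2} = 10$ candidate terms.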
Then, for the weight matrix $\Xi \in \mathbb{R}^{P \times n}$, the evolution function is given by

$$f(y) = \left(\Theta(y)\,\Xi\right)^T.$$

Each column of $\Xi$ corresponds to a different entry in $y$. E.g. taking $\xi_k$ as column $k$ of $\Xi$, $\dot{y}_k = \Theta(y)\,\xi_k$.

The method for determining $\Xi$ always relies on $m$ measurements of the state, which are given in some data matrix $Y \in \mathbb{R}^{m \times n}$,

$$Y = \begin{bmatrix} y(t_1)^T \\ y(t_2)^T \\ \vdots \\ y(t_m)^T \end{bmatrix}.$$

Then, the operator $\Theta$ is applied row-wise, giving $\Theta(Y) \in \mathbb{R}^{m \times P}$, as

$$\Theta(Y) = \begin{bmatrix} \Theta(y(t_1)) \\ \Theta(y(t_2)) \\ \vdots \\ \Theta(y(t_m)) \end{bmatrix}.$$

We can then verify that $\dot{Y} \in \mathbb{R}^{m \times n}$, which is computed as

$$\dot{Y} = \Theta(Y)\,\Xi.$$

The training often involves some residual

$$\left\|\dot{Y} - \Theta(Y)\,\Xi\right\|_2^2,$$

and some sparsity-promoting penalty, e.g. from Compressed Sensing, using the $\ell_1$ norm

$$\|\Xi\|_1.$$

These can be combined to give a Convex Program (or even Quadratic Program) in the unknown coefficients Ξ.
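As a rough sketch (not a reference implementation), the whole pipeline can be written in a few lines of NumPy: build a polynomial library, then apply sequentially thresholded least squares. The helper names, the degree-2 cutoff, and the threshold/iteration values below are illustrative choices.

```python
import numpy as np

def polynomial_library(Y, degree=2):
    """Theta(Y): columns [1, y_1, ..., y_n, y_1^2, y_1 y_2, ...] (built up to degree 2 here)."""
    m, n = Y.shape
    cols = [np.ones(m)] + [Y[:, i] for i in range(n)]
    if degree >= 2:
        cols += [Y[:, i] * Y[:, j] for i in range(n) for j in range(i, n)]
    return np.column_stack(cols)

def stls(Theta, Ydot, threshold=0.1, iters=10):
    """Sequentially thresholded least squares: solve, zero out small coefficients, refit."""
    Xi = np.linalg.lstsq(Theta, Ydot, rcond=None)[0]              # shape (P, n)
    for _ in range(iters):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for k in range(Ydot.shape[1]):                            # refit each state on its active terms
            active = ~small[:, k]
            if active.any():
                Xi[active, k] = np.linalg.lstsq(Theta[:, active], Ydot[:, k], rcond=None)[0]
    return Xi
```

Given measurements $Y$ and numerically differentiated $\dot{Y}$, `stls(polynomial_library(Y), Ydot)` returns a sparse $\Xi$.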

PDE-FIND

Library
Rudy et al. extend SINDy to the PDE framework by adding spatial derivatives of $u$ to the library. Given a state $u$, the library function $\Theta(u)$ constructs a row of the library, e.g.:

$$\Theta(u) = \left[\,1,\; u,\; u^2,\; \ldots,\; u_x,\; u\,u_x,\; \ldots,\; u^3 u_{xxx}\,\right].$$

To be consistent with the earlier notation, we could write $\theta(u, u_x, u_{xx}, \ldots)$, which makes the reliance on spatial derivatives more explicit. Here, the spatial derivatives such as $u_x$ are computed with Finite Differences or Polynomial Interpolation and differentiation. Then, stacking all measurements into $U \in \mathbb{R}^{n_x n_t}$, we compute the library matrix via the element-wise application of $\Theta$, denoted $\Theta(U) \in \mathbb{R}^{n_x n_t \times P}$. Similarly, we compute $u_t$ at these points, stacked in the same way as $U_t$. The goal is to represent the dynamics via a linear combination of the library elements; that is, find $\xi \in \mathbb{R}^P$ such that:

$$U_t = \Theta(U)\,\xi.$$

Then, the operator is given by $\mathcal{N}(u, u_x, u_{xx}, \ldots) = \theta(u, u_x, u_{xx}, \ldots)\,\xi$.
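As a sketch of how the library matrix might be assembled from snapshot data (the term set and the use of `np.gradient` for the spatial derivatives here are illustrative choices, not the paper's):

```python
import numpy as np

def pde_library(U, dx):
    """Theta(U) for snapshots U of shape (Nt, Nx); each space-time point gives one row.
    Spatial derivatives are computed with simple finite differences (np.gradient)."""
    ux = np.gradient(U, dx, axis=1)
    uxx = np.gradient(ux, dx, axis=1)
    terms = {
        "1": np.ones_like(U), "u": U, "u^2": U**2,
        "u_x": ux, "u*u_x": U * ux,
        "u_xx": uxx, "u*u_xx": U * uxx,
    }
    names = list(terms)
    Theta = np.column_stack([terms[k].ravel() for k in names])
    return Theta, names
```

$U_t$ can be flattened the same way, e.g. `np.gradient(U, dt, axis=0).ravel()`, before the sparse solve described next.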

Solving
Solving the plain Least Squares problem gives many nonzero entries in $\xi$, and the problem can be ill-conditioned for a few reasons. First, computing the spatial derivatives for $\Theta$ is itself ill-conditioned. Next, the monomial structure of the matrix can make it more poorly conditioned, similar to the Vandermonde Matrix. Given this, we impose the regularizing assumption of a sparse $\xi$, which can be formed into the minimization problem

$$\operatorname*{arg\,min}_{\xi}\; \left\|U_t - \Theta(U)\,\xi\right\|_2^2 + \lambda \|\xi\|_0.$$

Yet, the convex relaxation is instead solved, with $\|\xi\|_1$. With both LASSO and sequentially thresholded least squares (STLS) for this Sparse Regression, there is still an issue of ill-conditioning due to correlation between different columns of $\Theta$. However, Tikhonov Regularization with the standard squared $\ell_2$ penalty addresses this problem. This approach can be used within STLS, giving sequentially thresholded ridge regression (STRidge).
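A minimal STRidge sketch, assuming a precomputed library $\Theta(U)$ and flattened $U_t$; the ridge parameter, threshold, and iteration count are illustrative, and practical implementations typically also normalize the library columns:

```python
import numpy as np

def stridge(Theta, ut, lam=1e-5, threshold=0.1, iters=10):
    """Sequentially thresholded ridge regression: ridge solve, hard-threshold, refit on survivors."""
    P = Theta.shape[1]
    ridge = lambda A, b: np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)
    xi = ridge(Theta, ut)
    for _ in range(iters):
        active = np.abs(xi) >= threshold
        xi = np.zeros(P)
        if active.any():
            xi[active] = ridge(Theta[:, active], ut)
    return xi
```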

Numerical Differentiation
The other key consideration for these methods is the Numerical Differentiation. For data with noise of magnitude $\varepsilon$ and grid size $\Delta x$, the $d$th derivative has error

$$O\!\left(\frac{\varepsilon}{\Delta x^{d}}\right).$$

See also Finite Differences#Error Analysis.

Convolution with a smoothing Kernel or Tikhonov differentiation tends to remove important features, and this bias causes inaccurate discovery. Instead, the authors fit a polynomial of degree $p$ to at least $p+1$ points and evaluate the polynomial's analytic derivatives. Note that when exactly $p+1$ points are used, this is a finite difference of order $p$: for example, $p=2$ requires 3 points and gives the same computation as fitting a degree-2 polynomial and then differentiating. Sometimes more points are used in total for calculating derivatives than are added to the regression system (subsampling by selecting rows).
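A small sketch of this polynomial-fit differentiation (the degree and stencil width are illustrative; the grid is assumed ordered):

```python
import numpy as np

def poly_derivative(u, x, order=1, degree=3, stencil=7):
    """d-th derivative of noisy samples u(x): fit a degree-`degree` polynomial to `stencil`
    nearby points around each x[i], then evaluate its analytic derivative at x[i]."""
    half = stencil // 2
    du = np.zeros(len(u))
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        coeffs = np.polyfit(x[lo:hi] - x[i], u[lo:hi], degree)   # fit in local coordinates
        du[i] = np.poly1d(coeffs).deriv(order)(0.0)              # analytic derivative at x[i]
    return du
```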

Miscellaneous
This technique can be applied to a Fokker-Planck Equation within Stochastic Differential Equations, by taking $u(x,t)$ as a distribution over the uncertain space $x$. The authors discover the Diffusion Equation from Brownian Motion, treating $u$ as a histogram.

There can be some further Denoising operation before the numerical differentiation and the regularized solve; the authors use a truncated Singular Value Decomposition.

Weak SINDy

As opposed to having a residual in terms of derivatives, weak SINDy methods use a Weak Form, testing against test/trial functions and applying Integration by Parts.

Weak SINDy for ODEs

We will work within the same framework: $y(t) \in \mathbb{R}^n$, with $m$ (transposed) measurements stacked as $Y \in \mathbb{R}^{m \times n}$. Our goal is again to represent $\dot{y} = f(y)$ from these measurements.

We can test against some arbitrary function ϕ(t), giving

$$\int_{t_a}^{t_b} \dot{y}\,\phi\, dt = \int_{t_a}^{t_b} f(y)\,\phi\, dt.$$

We can apply integration by parts to the left-hand side, now giving

$$\big[y\,\phi\big]_{t_a}^{t_b} - \int_{t_a}^{t_b} y\,\dot{\phi}\, dt = \int_{t_a}^{t_b} f(y)\,\phi\, dt.$$

Yet, we will require that $\phi(t_a) = \phi(t_b) = 0$, so that the boundary term vanishes and

$$-\int_{t_a}^{t_b} y\,\dot{\phi}\, dt = \int_{t_a}^{t_b} f(y)\,\phi\, dt.$$

Unlike the strong form, we can't just enforce this at the specific times where we have data. Instead, we use the times where we do have data to evaluate the integrals numerically. We will use the composite trapezoid rule, which approximates

$$\int_a^b g(x)\,dx \approx \frac{g(a) + g(b)}{2}(b - a),$$

or for a mesh of $x_0, \ldots, x_n$,

$$\int_{x_0}^{x_n} g(x)\,dx = \sum_{i=0}^{n-1} \int_{x_i}^{x_{i+1}} g(x)\,dx \approx \frac{1}{2}\sum_{i=0}^{n-1} \big(g(x_{i+1}) + g(x_i)\big)(x_{i+1} - x_i).$$

If the mesh is equispaced with size Δx, this simplifies to the form we will use:

$$\int_{x_0}^{x_n} g(x)\,dx \approx \frac{\Delta x}{2}\sum_{i=0}^{n-1} \big(g(x_{i+1}) + g(x_i)\big) = \Delta x\left(\frac{1}{2}g(x_0) + \sum_{i=1}^{n-1} g(x_i) + \frac{1}{2}g(x_n)\right).$$

Yet, we do not need to use all of the data we have for y(t). Instead, we only use the samples corresponding to when ϕ(t) is nonzero.
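As a minimal sketch of assembling the resulting weak-form system with the trapezoid rule above (the test functions `phis`, their derivatives `dphis`, and the `polynomial_library` helper are assumed inputs, e.g. the one sketched in the ODE section):

```python
import numpy as np

def weak_ode_system(Y, t, phis, dphis, library):
    """Assemble G Xi ≈ B from  -∫ y φ' dt = ∫ Θ(y) Ξ φ dt, one row per test function φ,
    with both integrals evaluated by the equispaced composite trapezoid rule."""
    w = np.full(len(t), t[1] - t[0])    # trapezoid weights; samples where φ = 0 contribute nothing
    w[0] *= 0.5
    w[-1] *= 0.5
    Theta = library(Y)                                       # shape (m, P)
    B = np.stack([-(w * dphi(t)) @ Y for dphi in dphis])     # shape (K, n), left-hand sides
    G = np.stack([(w * phi(t)) @ Theta for phi in phis])     # shape (K, P)
    return B, G        # then solve G Xi ≈ B with a sparse regression (e.g. STLS from earlier)
```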

Weak SINDy for PDEs

TODO

Note, this analysis mostly assumes one spatial dimension, for simplicity. Consider a library such that

$$\Theta(u)_i = \frac{d^{\alpha_i}}{dx^{|\alpha_i|}} f_i(u).$$

That is, each library term is a function $f_i(u)$ which is differentiated with order $\alpha_i$ (potentially a multi-index). It may be harder to represent the same libraries in this form! Beginning with the form of the governing equation, we have

$$u_t = \sum_i \xi_i \frac{d^{\alpha_i}}{dx^{|\alpha_i|}} f_i(u). \tag{1}$$

Then, multiplying by a test function $\psi(x,t)$, integrating both sides over the entire domain, and pulling out the sum and $\xi$ (which do not depend on $x$ or $t$), we get

$$\int_0^T \!\!\int_\Omega \psi\, u_t \, d\Omega\, dt = \sum_i \xi_i \int_0^T \!\!\int_\Omega \psi\, \frac{d^{\alpha_i}}{dx^{|\alpha_i|}} f_i(u)\, d\Omega\, dt.$$

Now applying Integration by Parts, and assuming sufficiently many derivatives of $\psi$ vanish along the boundary of $\Omega$ and at $t = 0, T$, we can move the derivative operators onto the test function, at the cost of a sign factor. This gives

$$-\int_0^T \!\!\int_\Omega \psi_t\, u\, d\Omega\, dt = \sum_i \xi_i (-1)^{|\alpha_i|} \int_0^T \!\!\int_\Omega \left(\frac{d^{\alpha_i}}{dx^{|\alpha_i|}}\psi\right) f_i(u)\, d\Omega\, dt.$$

Note, this holds for an arbitrary $\psi$ (subject only to the boundary-condition constraint), so we can require it to hold for a collection $\psi \in \{\psi_k\}_{k \in [K]}$. We input the measured data $U$ and calculate the integrals via numerical integration, knowing the mesh sizes (e.g. a rectangle or trapezoidal approximation of the definite integrals). For $\psi_k$, this gives

$$\underbrace{-\sum_{m=1}^{N_t}\sum_{n=1}^{N_x} \frac{d}{dt}\psi_k\, U_{mn}\, \Delta x\, \Delta t}_{b_k} \;=\; \sum_i \xi_i \underbrace{(-1)^{|\alpha_i|}\sum_{m=1}^{N_t}\sum_{n=1}^{N_x} \frac{d^{\alpha_i}}{dx^{|\alpha_i|}}\psi_k\, f_i(U_{mn})\, \Delta x\, \Delta t}_{G_{ki}}.$$

Repeating this for all of the $\psi_k$ yields the matrix system

$$b = G\,\xi,$$

where $b \in \mathbb{R}^K$, $G \in \mathbb{R}^{K \times P}$, and $\xi \in \mathbb{R}^P$. We assume that we can take $K \geq P$, as these are arbitrary test functions. We follow a similar least-squares / sparsity-promoting solve to get $\xi$, which then fully defines the original (strong-form) PDE.
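As a rough sketch of assembling $b$ and $G$ in one spatial dimension, assuming separable polynomial-bump test functions $\psi_k(x,t) = \psi_k^x(x)\,\psi_k^t(t)$ and a simple rectangle-rule quadrature (the bump form, its parameters, and the rectangle rule are illustrative choices, not the authors'):

```python
import numpy as np
from numpy.polynomial import Polynomial

def bump(center, radius, q=4):
    """Polynomial bump (1 - s^2)^q with s = (z - center)/radius, supported on |s| < 1.
    Returns a function giving its d-th derivative on a grid (zero outside the support)."""
    s = Polynomial([-center / radius, 1.0 / radius])   # s(z) = (z - center) / radius
    p = (1 - s**2) ** q
    def d_bump(z, d=0):
        vals = p.deriv(d)(z) if d > 0 else p(z)
        return np.where(np.abs(z - center) < radius, vals, 0.0)
    return d_bump

def weak_pde_system(U, x, t, f_list, alphas, psis_x, psis_t):
    """Assemble b_k and G_ki by a rectangle-rule approximation of the weak-form integrals,
    for snapshot data U of shape (Nt, Nx) on grids t and x."""
    dx, dt = x[1] - x[0], t[1] - t[0]
    b = np.zeros(len(psis_x))
    G = np.zeros((len(psis_x), len(f_list)))
    for k, (px, pt) in enumerate(zip(psis_x, psis_t)):
        b[k] = -np.sum(np.outer(pt(t, 1), px(x, 0)) * U) * dx * dt        # -∫∫ ψ_t u
        for i, (f, a) in enumerate(zip(f_list, alphas)):
            G[k, i] = (-1) ** a * np.sum(np.outer(pt(t, 0), px(x, a)) * f(U)) * dx * dt
    return b, G
```

For example, a heat-equation library with the single term $\frac{d^2}{dx^2}u$ would use `f_list = [lambda U: U]` and `alphas = [2]`; the resulting $b = G\xi$ is then handed to the same sparsity-promoting regression as before.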

UQ-SINDy

Consider a data or feature matrix $X$ (like $Y$ above). Each row is a sample of the state at a different time ($y_i^T = x_i^T + \epsilon_i$). For a Linear Regression problem, we seek $\beta$ such that, for noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$,

$$y = \beta^T X + \epsilon.$$

The goal of Bayesian Methods is to find $p(\beta, \sigma \mid X, y)$. SINDy uses a model of the form

$$y_i^T = x_0^T + \int_0^{t_i} \Theta(x(\tau))\,\Xi\, d\tau + \epsilon_i.$$

Through Bayes Theorem,

$$p(\Xi, x_0, \sigma \mid X) \propto \underbrace{p(X \mid \Xi, x_0, \sigma)}_{\text{likelihood}}\;\underbrace{p(\Xi)\,p(x_0)\,p(\sigma)}_{\text{priors}},$$

although the computation of the Posterior is not tractable. The predictive posterior is given similarly as

$$p(x(t) \mid X) = \int \underbrace{p(x(t) \mid \Xi, x_0, \sigma)}_{\text{likelihood}}\;\underbrace{p(\Xi, x_0, \sigma \mid X)}_{\text{posterior}}\, d\Xi\, dx_0\, d\sigma.$$

The integral is approximated by taking the expectation of the likelihood evaluated at samples from the posterior.

To promote sparsity in $\Xi$ or $\beta$, there are a number of priors to choose from. The maximum a posteriori estimate under a Laplace Distribution prior corresponds to Least Squares regression with $\ell_1$-regularization. This works reasonably well, but the Spike and Slab (hierarchical) prior is generally better.

Spike and Slab Prior

Each $\beta_j$ comes from a hierarchical model:

$$\beta_j \mid \lambda_j \sim \mathcal{N}(0, c^2)\,\lambda_j,$$

where

$$\lambda_j \sim \mathrm{Ber}(\pi).$$

Here $\pi$ is the prior probability that $\lambda_j = 1$ and $\beta_j$ is nonzero, following a 'slab' prior, $\mathcal{N}(0, c^2)$. Otherwise, $\lambda_j = 0$ and $\beta_j$ follows a 'spike' prior at 0 (a Dirac delta distribution). This can be relaxed to a narrow distribution centered at 0, like $\mathcal{N}(0, \varepsilon^2)$. Yet, this is a tough distribution to work with, due to its combinatorial nature.
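To make the prior concrete, a tiny sampling sketch (the values of $\pi$ and $c$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_spike_slab(P, pi=0.1, c=1.0, draws=10000):
    """Draws from the spike-and-slab prior: lambda_j ~ Ber(pi); beta_j = 0 if lambda_j = 0,
    otherwise beta_j ~ N(0, c^2) (an exact Dirac spike at zero)."""
    lam = rng.random((draws, P)) < pi            # inclusion indicators lambda_j
    slab = rng.normal(0.0, c, size=(draws, P))   # the 'slab' component
    return np.where(lam, slab, 0.0)

beta = sample_spike_slab(P=10)
print((beta != 0).mean())   # fraction of nonzero coefficients, ≈ pi
```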

We can carry on and do Markov Chain Monte Carlo, with a sparsity-promoting prior on the ODE coefficients and integrating the ODE to provide a prediction. E.g. use

$$\hat{x}_i^T = x_0^T + \int_0^{t_i} \Theta(x(\tau))\,\Xi\, d\tau,$$

and compare against $y_i^T$, including our error distribution over $\epsilon_i$. Note, this differs a bit from standard SINDy: even if we do maximum likelihood estimation, we are comparing against a prediction that involves numerical integration, whereas SINDy generally constructs its comparison through numerical differentiation.

TODO

What does this look like if we do Variational Inference instead? Does it match my idea for the neural network variant?