Flow Map Learning

Resources

Main Idea

A flow $f$ maps $f: \mathbb{R} \times X \to X$, such that for some (initial) state $y(t) \in X$,

$$y(t + \Delta t) = f(\Delta t, y(t)).$$

For our purposes, $X \subseteq \mathbb{R}^d$. We often also discretize $y(t)$ as $y_i$, such that $y_i = y(i\,\Delta t)$. Then, keeping the timestep $\Delta t$ implicit in $f$, the flow maps

$$y_{i+1} = f(y_i), \qquad y_i = f(f(\cdots f(y_0))) = f^{(i)}(y_0).$$
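For example, for the scalar linear ODE $\dot{y} = \lambda y$, the exact flow map is $f(\Delta t, y) = e^{\lambda \Delta t}\, y$, and composing the fixed-step map gives

$$y_i = f^{(i)}(y_0) = e^{i \lambda \Delta t}\, y_0.$$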

If $y(t)$ is differentiable in $t$, then from the Fundamental Theorem of Calculus,

$$y(t + \Delta t) = y(t) + \int_t^{t + \Delta t} \frac{dy(s)}{ds}\, ds =: f(\Delta t, y(t)).$$

For an Ordinary Differential Equation, if

$$g(y) = \frac{dy}{dt},$$

and $G(y) = \int g(y)\, dt$, then

$$f(y) = y + G(y).$$

This inspires the common ResNet architecture, where we parameterize $G$ and encode this identity addition as the residual connection.
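As a minimal sketch (the MLP for $G_\theta$ and the layer sizes are illustrative choices, not prescribed here), a residual block realizing $y_{i+1} = y_i + G_\theta(y_i)$ could look like:

```python
import torch
import torch.nn as nn

class ResidualFlowBlock(nn.Module):
    """One flow-map step y_{i+1} = y_i + G_theta(y_i)."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        # G_theta plays the role of the integral term G(y).
        self.G = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, dim),
        )

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # The residual connection encodes the identity term of f(y) = y + G(y).
        return y + self.G(y)
```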

The Method of Lines constructs $g$ by combining a spatial discretization of the PDE's spatial operator $\mathcal{N}$ with the boundary conditions. Neural ODEs and Ordinary Differential Equation Discovery look for $g$, as opposed to $f$ or $G$; in other words, they find a function that must still be integrated in time. Thus, just as Partial Differential Equation Discovery may form $G$ by discretizing in space and then integrating in time, Neural ODEs and ODE discovery skip the discretization in space but still must integrate in time. Flow maps must do neither! Consequently, we can change both the spatial and the temporal discretization for PDE discovery, we can change the temporal discretization for ODE discovery / Neural ODEs, but we often cannot change either for flow map discovery. Interpretability follows the same order, from most to least interpretable, and generalizability may follow a similar trend. See Spectrum of Interpretability.
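To make the distinction concrete, here is a toy sketch in which exact linear dynamics stand in for the learned $g_\theta$ (Neural ODE / ODE discovery) and $f_\theta$ (flow map): a learned right-hand side can be integrated with any time step and any scheme, while a learned flow map only advances by the step it was trained with.

```python
import numpy as np

DT_TRAIN = 0.1  # time step the flow map was (hypothetically) trained at

def g_theta(y):
    """Stand-in for a learned right-hand side dy/dt = g(y); here dy/dt = -y."""
    return -y

def f_theta(y):
    """Stand-in for a learned flow map at the fixed step DT_TRAIN."""
    return np.exp(-DT_TRAIN) * y

def rollout_ode(y0, dt, n_steps):
    """With a learned g we may choose the step size and the integrator."""
    y = y0
    for _ in range(n_steps):
        y = y + dt * g_theta(y)  # forward Euler; RK4 etc. would also work
    return y

def rollout_flow_map(y0, n_steps):
    """With a learned flow map, every call advances by exactly DT_TRAIN."""
    y = y0
    for _ in range(n_steps):
        y = f_theta(y)
    return y

print(rollout_ode(1.0, dt=0.01, n_steps=1000))   # ~ exp(-10)
print(rollout_flow_map(1.0, n_steps=100))        # exp(-10), but only at multiples of DT_TRAIN
```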

Distributional Shift and the Pushforward Trick

Let $p_0(y_0)$ be the distribution of "initial conditions" (these can be arbitrary states, since batches may start at nonzero times). Let $p_k(y_k) = \int_{y_0} p(y_k \mid y_0)\, p_0(y_0)\, dy_0$ be the true distribution of the corresponding states $k$ timesteps later. Suppose we train with the following loss:

$$\mathcal{L}(\theta) = \mathbb{E}_{y_{k+1} \mid y_k,\; y_k \sim p_k} \left[ \left\| f_\theta(y_k) - y_{k+1} \right\|_2^2 \right].$$

Fundamentally, the solver maps $p_k \mapsto f_{\theta\sharp}\, p_k \approx p_{k+1}$, where $f_{\theta\sharp}$ is the pushforward operator for $f_\theta$. Subsequent iterations of the solver use samples from $f_{\theta\sharp}\, p_k$, not $p_{k+1}$, since $f_{\theta\sharp}\, p_k \neq p_{k+1}$. Thus, when using the solver, we are using samples from a different distribution than the one we trained on.
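One way to close part of this gap, following the pushforward trick named in the heading, is to unroll the model a few steps without gradients during training, so the one-step loss is evaluated on inputs drawn from $f_{\theta\sharp}\, p_k$ rather than only from $p_k$. A rough sketch, where `model`, `optimizer`, and the batch tensors are placeholders for whatever setup is actually in use:

```python
import torch

def pushforward_training_step(model, optimizer, y_start, y_target, n_push=1):
    """One gradient step with the pushforward trick (sketch).

    y_start  : batch of states taken from the training trajectories
    y_target : the corresponding true states n_push + 1 steps later
    """
    # Roll forward without gradients: the resulting inputs are samples from the
    # pushforward distribution that the solver actually sees at rollout time.
    with torch.no_grad():
        y_in = y_start
        for _ in range(n_push):
            y_in = model(y_in)

    pred = model(y_in)                     # gradients flow through one step only
    loss = torch.mean((pred - y_target) ** 2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```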

Stochastic FML

Suppose now we have

$$\frac{dy_t}{dt}(\omega) = f(y_t, \omega),$$

where $\omega \in \Omega$ and $\Omega$ is the sample space of a probability space. The solution $y_t := y(\omega, t) : \Omega \times [0, T] \to \mathbb{R}^d$, and the right-hand side $f: \mathbb{R}^d \times \Omega \to \mathbb{R}^d$ is unknown.

We can decompose the search for $f$ into a deterministic part $D$ and a stochastic part $S$. We train $D$ as a standard deterministic flow map, so that it approximates the (conditional) mean. Then we draw samples $z$ from some distribution of a chosen stochastic dimension $n_s$ and feed them as an additional input to $S$. We could train $S$ as a Generative Adversarial Network.
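A sketch of one way to wire this up (the additive combination $D(y) + S(y, z)$, the Gaussian $z$, and the network shapes are assumptions for illustration; the text above only fixes that $D$ models the mean and $S$ consumes a latent $z$):

```python
import torch
import torch.nn as nn

class StochasticFlowMap(nn.Module):
    """Deterministic/stochastic split: y_{i+1} = D(y_i) + S(y_i, z), z ~ N(0, I_ns)."""

    def __init__(self, dim: int, n_s: int, hidden: int = 64):
        super().__init__()
        self.n_s = n_s
        self.D = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))
        self.S = nn.Sequential(nn.Linear(dim + n_s, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        z = torch.randn(y.shape[0], self.n_s, device=y.device)  # stochastic input
        return self.D(y) + self.S(torch.cat([y, z], dim=-1))
```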

Another alternative is to use a VAE. We will train an encoder $E: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{n_s}$ and a decoder $D: \mathbb{R}^d \times \mathbb{R}^{n_s} \to \mathbb{R}^d$. In particular, splitting the data into pairs $\{(y_0^{(i)}, y_1^{(i)})\}$ and writing $z$ for the latent variable,

$$E(y_0, y_1) = z,$$

and

$$D(y_0, z) = y_1.$$

We ideally want the decoder to handle all of the influence of $y_0$; in other words, $z$ should be independent of $y_0$ (but this is the tension that the InfoVAE addresses).
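A sketch of this pairwise encoder/decoder (the Gaussian latent, the layer sizes, and the standard reconstruction-plus-KL loss are illustrative assumptions):

```python
import torch
import torch.nn as nn

class PairwiseVAE(nn.Module):
    """Encoder E(y0, y1) -> z and decoder D(y0, z) -> y1 (sketch)."""

    def __init__(self, dim: int, n_s: int, hidden: int = 64):
        super().__init__()
        # Encoder outputs mean and log-variance of the latent z.
        self.enc = nn.Sequential(nn.Linear(2 * dim, hidden), nn.Tanh(), nn.Linear(hidden, 2 * n_s))
        self.dec = nn.Sequential(nn.Linear(dim + n_s, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, y0, y1):
        mu, logvar = self.enc(torch.cat([y0, y1], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        y1_hat = self.dec(torch.cat([y0, z], dim=-1))
        return y1_hat, mu, logvar

def vae_loss(y1, y1_hat, mu, logvar):
    recon = torch.mean((y1_hat - y1) ** 2)
    # Standard Gaussian KL term; an InfoVAE-style objective would replace or
    # reweight this regularizer to push z toward independence from y0.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```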