Flow Map Learning

Resources

Main Idea

A flow $f$ maps $f: \mathbb{R} \times X \to X$, such that for some (initial) state $y(t) \in X$,

$$y(t + \Delta t) = f(\Delta t, y(t)).$$

For our purposes, $X \subseteq \mathbb{R}^d$. We often also discretize $y(t)$ as $y_i$, such that $y_i = y(i\,\Delta t)$. Then, keeping the timestep $\Delta t$ implicit in $f$, the flow maps

$$y_{i+1} = f(y_i), \qquad y_i = f(f(\cdots f(y_0))) = f^{(i)}(y_0).$$
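For example, for the scalar linear ODE $\dot{y} = \lambda y$, the exact flow map is $f(\Delta t, y) = e^{\lambda \Delta t}\, y$, and composing the fixed-step map gives

$$y_i = f^{(i)}(y_0) = e^{i \lambda \Delta t}\, y_0.$$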

If $y(t)$ is differentiable in $t$, then from the Fundamental Theorem of Calculus,

$$y(t + \Delta t) = y(t) + \int_t^{t + \Delta t} \frac{dy(s)}{ds}\, ds =: f(\Delta t, y(t)).$$

For an Ordinary Differential Equation, if

$$g(y) = \frac{dy}{dt},$$

and $G(y) = \int g(y)\, dt$, then

$$f(y) = y + G(y).$$

This inspires the common ResNet architecture, where we parameterize $G$ and encode this identity addition as the residual connection.
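As a minimal sketch (the MLP for $G_\theta$ and the layer sizes are illustrative choices, not prescribed here), a residual block realizing $y_{i+1} = y_i + G_\theta(y_i)$ could look like:

```python
import torch
import torch.nn as nn

class ResidualFlowBlock(nn.Module):
    """One flow-map step y_{i+1} = y_i + G_theta(y_i)."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        # G_theta plays the role of the integral term G(y).
        self.G = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, dim),
        )

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # The residual connection encodes the identity term of f(y) = y + G(y).
        return y + self.G(y)
```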

The Method of Lines constructs $g$ by combining a spatial discretization of the PDE's spatial operator $\mathcal{N}$ with the boundary conditions. Neural ODEs and Ordinary Differential Equation Discovery look for $g$, as opposed to $f$ or $G$; in other words, they find a function that must still be integrated in time. Thus, just as Partial Differential Equation Discovery may form $G$ by discretizing in space and then integrating in time, Neural ODEs and ODE discovery skip the discretization in space but still must integrate in time. Flow maps must do neither! Consequently, we can change both the spatial and the temporal discretization for PDE discovery, we can change the temporal discretization for ODE discovery / Neural ODEs, but we often cannot change either for flow map discovery. Interpretability follows the same order, from most to least interpretable, and generalizability may follow a similar trend. See Spectrum of Interpretability.
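To make the distinction concrete, here is a toy sketch in which exact linear dynamics stand in for the learned $g_\theta$ (Neural ODE / ODE discovery) and $f_\theta$ (flow map): a learned right-hand side can be integrated with any time step and any scheme, while a learned flow map only advances by the step it was trained with.

```python
import numpy as np

DT_TRAIN = 0.1  # time step the flow map was (hypothetically) trained at

def g_theta(y):
    """Stand-in for a learned right-hand side dy/dt = g(y); here dy/dt = -y."""
    return -y

def f_theta(y):
    """Stand-in for a learned flow map at the fixed step DT_TRAIN."""
    return np.exp(-DT_TRAIN) * y

def rollout_ode(y0, dt, n_steps):
    """With a learned g we may choose the step size and the integrator."""
    y = y0
    for _ in range(n_steps):
        y = y + dt * g_theta(y)  # forward Euler; RK4 etc. would also work
    return y

def rollout_flow_map(y0, n_steps):
    """With a learned flow map, every call advances by exactly DT_TRAIN."""
    y = y0
    for _ in range(n_steps):
        y = f_theta(y)
    return y

print(rollout_ode(1.0, dt=0.01, n_steps=1000))   # ~ exp(-10)
print(rollout_flow_map(1.0, n_steps=100))        # exp(-10), but only at multiples of DT_TRAIN
```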

Distributional Shift and the Pushforward Trick

Let $p_0(y_0)$ be the distribution of "initial conditions" (these can be arbitrary states, since batches may start at nonzero times). Let $p_k(y_k) = \int_{y_0} p(y_k \mid y_0)\, p_0(y_0)\, dy_0$ be the true distribution of the corresponding states $k$ timesteps later. Suppose we train with the following loss:

$$\mathcal{L}(\theta) = \mathbb{E}_{y_{k+1} \mid y_k,\; y_k \sim p_k} \left[ \left\| f_\theta(y_k) - y_{k+1} \right\|_2^2 \right].$$

Fundamentally, the solver maps $p_k \mapsto f_{\theta\sharp}\, p_k \approx p_{k+1}$, where $f_{\theta\sharp}$ is the pushforward operator for $f_\theta$. Subsequent iterations of the solver use samples from $f_{\theta\sharp}\, p_k$, not $p_{k+1}$, since $f_{\theta\sharp}\, p_k \neq p_{k+1}$. Thus, when using the solver, we are using samples from a different distribution than the one we trained on.
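One way to close part of this gap, following the pushforward trick named in the heading, is to unroll the model a few steps without gradients during training, so the one-step loss is evaluated on inputs drawn from $f_{\theta\sharp}\, p_k$ rather than only from $p_k$. A rough sketch, where `model`, `optimizer`, and the batch tensors are placeholders for whatever setup is actually in use:

```python
import torch

def pushforward_training_step(model, optimizer, y_start, y_target, n_push=1):
    """One gradient step with the pushforward trick (sketch).

    y_start  : batch of states taken from the training trajectories
    y_target : the corresponding true states n_push + 1 steps later
    """
    # Roll forward without gradients: the resulting inputs are samples from the
    # pushforward distribution that the solver actually sees at rollout time.
    with torch.no_grad():
        y_in = y_start
        for _ in range(n_push):
            y_in = model(y_in)

    pred = model(y_in)                     # gradients flow through one step only
    loss = torch.mean((pred - y_target) ** 2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```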

Stochastic FML

Suppose now we have

$$\frac{dy_t}{dt}(\omega) = f(y_t, \omega),$$

where $\omega \in \Omega$ and $\Omega$ is the sample space of a probability space. The solution $y_t := y(\omega, t) : \Omega \times [0, T] \to \mathbb{R}^d$, and the right-hand side $f: \mathbb{R}^d \times \Omega \to \mathbb{R}^d$ is unknown.

We can decompose the search for $f$ into a deterministic part $D$ and a stochastic part $S$. We train $D$ as a standard deterministic flow map, so that it approximates the (conditional) mean. Then we draw samples $z$ from some distribution of a chosen stochastic dimension $n_s$ and feed them as an additional input to $S$. We could train $S$ as a Generative Adversarial Network.
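A sketch of one way to wire this up (the additive combination $D(y) + S(y, z)$, the Gaussian $z$, and the network shapes are assumptions for illustration; the text above only fixes that $D$ models the mean and $S$ consumes a latent $z$):

```python
import torch
import torch.nn as nn

class StochasticFlowMap(nn.Module):
    """Deterministic/stochastic split: y_{i+1} = D(y_i) + S(y_i, z), z ~ N(0, I_ns)."""

    def __init__(self, dim: int, n_s: int, hidden: int = 64):
        super().__init__()
        self.n_s = n_s
        self.D = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))
        self.S = nn.Sequential(nn.Linear(dim + n_s, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        z = torch.randn(y.shape[0], self.n_s, device=y.device)  # stochastic input
        return self.D(y) + self.S(torch.cat([y, z], dim=-1))
```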

Another alternative is to use a VAE. We will train an encoder $E: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{n_s}$ and a decoder $D: \mathbb{R}^d \times \mathbb{R}^{n_s} \to \mathbb{R}^d$. In particular, splitting the data into pairs $\{(y_0^{(i)}, y_1^{(i)})\}$ and writing $z$ for the latent variable,

$$E(y_0, y_1) = z,$$

and

$$D(y_0, z) = y_1.$$

We ideally want the decoder to handle all of the influence of $y_0$; in other words, $z$ should be independent of $y_0$ (but this is the tension that the InfoVAE addresses).
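A sketch of this pairwise encoder/decoder (the Gaussian latent, the layer sizes, and the standard reconstruction-plus-KL loss are illustrative assumptions):

```python
import torch
import torch.nn as nn

class PairwiseVAE(nn.Module):
    """Encoder E(y0, y1) -> z and decoder D(y0, z) -> y1 (sketch)."""

    def __init__(self, dim: int, n_s: int, hidden: int = 64):
        super().__init__()
        # Encoder outputs mean and log-variance of the latent z.
        self.enc = nn.Sequential(nn.Linear(2 * dim, hidden), nn.Tanh(), nn.Linear(hidden, 2 * n_s))
        self.dec = nn.Sequential(nn.Linear(dim + n_s, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, y0, y1):
        mu, logvar = self.enc(torch.cat([y0, y1], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        y1_hat = self.dec(torch.cat([y0, z], dim=-1))
        return y1_hat, mu, logvar

def vae_loss(y1, y1_hat, mu, logvar):
    recon = torch.mean((y1_hat - y1) ** 2)
    # Standard Gaussian KL term; an InfoVAE-style objective would replace or
    # reweight this regularizer to push z toward independence from y0.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```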