Adam

Resources

Main Idea

Add momentum and gradient normalization to (Stochastic) Gradient Descent. This gradient normalization can be seen as a special Preconditioner based on the gradient history. These methods are not Descent Methods, even in the deterministic case, as they depend on the history and thus the starting point.
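One way to make the preconditioner reading concrete: plain gradient descent takes $\theta_k = \theta_{k-1} - \eta g_k$, while RMSProp (below) applies a diagonal preconditioner built from the gradient history,

$$\theta_k = \theta_{k-1} - \eta\, P_k^{-1} g_k, \qquad P_k = \operatorname{diag}\!\big(\sqrt{v_k} + \epsilon\big),$$

with $v_k$ the running average of squared gradients defined in the next section.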

Details

Define $g_k = \nabla_\theta L(\theta_{k-1})$.

RMSProp

RMSProp, or root-mean-square propagation, adapts the learning rate of each parameter based on the gradient history. The update is given as

$$\theta_k = \theta_{k-1} - \frac{\eta}{\sqrt{v_k} + \epsilon}\, g_k,$$

where

$$v_k = \gamma v_{k-1} + (1-\gamma)\, g_k^2.$$

So for two steps, this would be

$$v_k = \gamma^2 v_{k-2} + \gamma(1-\gamma)\, g_{k-1}^2 + (1-\gamma)\, g_k^2.$$

E.g. taking $\gamma = 1/2$ gives $g_{k-1}^2$ half the weight of $g_k^2$, while $\gamma = 1$ keeps $v_k = v_{k-1}$, relying entirely on the previous accumulator value and not at all on the current gradient.
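A minimal NumPy sketch of the RMSProp update above (the function name `rmsprop_step` and the default hyperparameter values are illustrative choices, not fixed by the method):

```python
import numpy as np

def rmsprop_step(theta, v, grad, lr=1e-3, gamma=0.9, eps=1e-8):
    """One RMSProp update: v is an exponential moving average of squared
    gradients, which rescales the step size per parameter."""
    v = gamma * v + (1 - gamma) * grad**2
    theta = theta - lr * grad / (np.sqrt(v) + eps)
    return theta, v

# Toy usage: minimize f(theta) = ||theta||^2 / 2, whose gradient is theta.
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
for _ in range(100):
    grad = theta
    theta, v = rmsprop_step(theta, v, grad)
```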

Adam

Adam combines momentum with RMSProp, so two history variables are now tracked:

$$m_k = \beta_1 m_{k-1} + (1-\beta_1)\, g_k, \qquad v_k = \beta_2 v_{k-1} + (1-\beta_2)\, g_k^2.$$

To correct for the bias introduced by initializing $m_0$ and $v_0$ to $0$, take

$$\hat{m}_k = \frac{m_k}{1 - \beta_1^k}, \qquad \hat{v}_k = \frac{v_k}{1 - \beta_2^k}.$$
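For example, at the first step ($k = 1$, with $m_0 = v_0 = 0$) the raw averages are

$$m_1 = (1-\beta_1)\, g_1, \qquad v_1 = (1-\beta_2)\, g_1^2,$$

which are shrunk toward $0$ when $\beta_1, \beta_2$ are close to $1$; dividing by $1-\beta_1$ and $1-\beta_2$ recovers $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$.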

Then, the parameter update is given as

$$\theta_k = \theta_{k-1} - \eta\, \frac{\hat{m}_k}{\sqrt{\hat{v}_k} + \epsilon}.$$
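A minimal NumPy sketch of the full Adam step, combining the moment updates, the bias correction, and the parameter update (the name `adam_step` and the defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ are the commonly used values, assumed here rather than taken from this note):

```python
import numpy as np

def adam_step(theta, m, v, grad, k, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step k (k starts at 1): momentum on the gradient,
    RMSProp-style scaling by squared-gradient history, plus bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**k)   # bias-corrected first moment
    v_hat = v / (1 - beta2**k)   # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage on f(theta) = ||theta||^2 / 2 (gradient = theta).
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for k in range(1, 101):
    grad = theta
    theta, m, v = adam_step(theta, m, v, grad, k)
```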