Adam

Resources

Main Idea

Add momentum and gradient normalization to (Stochastic) Gradient Descent. This gradient normalization can be seen as a special Preconditioner based on the gradient history. These methods are not Descent Methods, even in the deterministic case, as they depend on the history and thus the starting point.
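One way to make the preconditioner reading concrete: plain gradient descent takes $\theta_k = \theta_{k-1} - \eta g_k$, while RMSProp (below) applies a diagonal preconditioner built from the gradient history,

$$\theta_k = \theta_{k-1} - \eta\, P_k^{-1} g_k, \qquad P_k = \operatorname{diag}\!\big(\sqrt{v_k} + \epsilon\big),$$

with $v_k$ the running average of squared gradients defined in the next section.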

Details

Define $g_k = \nabla_\theta L(\theta_{k-1})$.

RMSProp

RMSProp, or root-mean-square propagation, adapts the learning rate of each parameter based on the gradient history. The update is given as

$$\theta_k = \theta_{k-1} - \frac{\eta}{\sqrt{v_k} + \epsilon}\, g_k,$$

where

$$v_k = \gamma v_{k-1} + (1-\gamma)\, g_k^2.$$

So for two steps, this would be

$$v_k = \gamma^2 v_{k-2} + \gamma(1-\gamma)\, g_{k-1}^2 + (1-\gamma)\, g_k^2.$$

E.g. taking $\gamma = 1/2$ gives $g_{k-1}^2$ half the weight of $g_k^2$, while $\gamma = 1$ keeps $v_k = v_{k-1}$, relying entirely on the previous accumulator value and not at all on the current gradient.
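A minimal NumPy sketch of the RMSProp update above (the function name `rmsprop_step` and the default hyperparameter values are illustrative choices, not fixed by the method):

```python
import numpy as np

def rmsprop_step(theta, v, grad, lr=1e-3, gamma=0.9, eps=1e-8):
    """One RMSProp update: v is an exponential moving average of squared
    gradients, which rescales the step size per parameter."""
    v = gamma * v + (1 - gamma) * grad**2
    theta = theta - lr * grad / (np.sqrt(v) + eps)
    return theta, v

# Toy usage: minimize f(theta) = ||theta||^2 / 2, whose gradient is theta.
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
for _ in range(100):
    grad = theta
    theta, v = rmsprop_step(theta, v, grad)
```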

Adam

Adam combines momentum with RMSProp, so two history variables are now tracked:

$$m_k = \beta_1 m_{k-1} + (1-\beta_1)\, g_k, \qquad v_k = \beta_2 v_{k-1} + (1-\beta_2)\, g_k^2.$$

To correct for the bias introduced by initializing $m_0$ and $v_0$ to $0$, take

$$\hat{m}_k = \frac{m_k}{1 - \beta_1^k}, \qquad \hat{v}_k = \frac{v_k}{1 - \beta_2^k}.$$
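For example, at the first step ($k = 1$, with $m_0 = v_0 = 0$) the raw averages are

$$m_1 = (1-\beta_1)\, g_1, \qquad v_1 = (1-\beta_2)\, g_1^2,$$

which are shrunk toward $0$ when $\beta_1, \beta_2$ are close to $1$; dividing by $1-\beta_1$ and $1-\beta_2$ recovers $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$.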

Then, the parameter update is given as

$$\theta_k = \theta_{k-1} - \eta\, \frac{\hat{m}_k}{\sqrt{\hat{v}_k} + \epsilon}.$$
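A minimal NumPy sketch of the full Adam step, combining the moment updates, the bias correction, and the parameter update (the name `adam_step` and the defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ are the commonly used values, assumed here rather than taken from this note):

```python
import numpy as np

def adam_step(theta, m, v, grad, k, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step k (k starts at 1): momentum on the gradient,
    RMSProp-style scaling by squared-gradient history, plus bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**k)   # bias-corrected first moment
    v_hat = v / (1 - beta2**k)   # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage on f(theta) = ||theta||^2 / 2 (gradient = theta).
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for k in range(1, 101):
    grad = theta
    theta, m, v = adam_step(theta, m, v, grad, k)
```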