Adam
Resources
Related
Main Idea
Add momentum and gradient normalization to (Stochastic) Gradient Descent. This gradient normalization can be seen as a special Preconditioner based on the gradient history. These methods are not Descent Methods, even in the deterministic case, as they depend on the history and thus the starting point.
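One way to make the preconditioner view concrete (a sketch, using the notation introduced below: $g_t$ for the gradient and $v_t$ for the running average of its square): the normalized update
$$\theta_{t+1} = \theta_t - \eta\, P_t^{-1} g_t \quad\text{with}\quad P_t = \operatorname{diag}\!\big(\sqrt{v_t} + \epsilon\big)$$
applies a diagonal preconditioner built purely from the gradient history, which is why the iterates depend on where that history started.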
Details
Define $g_t := \nabla_\theta f(\theta_{t-1})$ as the (stochastic) gradient at step $t$, $\eta$ as the step size, and $\epsilon > 0$ as a small constant for numerical stability.
RMSProp
RMSProp (root mean square propagation) adapts the learning rate for each parameter based on the gradient history. The update is given as
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \odot g_t,$$
where
$$v_t = \beta v_{t-1} + (1 - \beta)\, g_t^2,$$
with all operations applied element-wise and $v_0 = 0$. So for two steps, this would be
$$v_2 = \beta (1 - \beta)\, g_1^2 + (1 - \beta)\, g_2^2.$$
E.g. taking $\beta = 0.9$ gives $v_2 = 0.09\, g_1^2 + 0.1\, g_2^2$, so the most recent squared gradient gets the largest weight and older ones decay geometrically.
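A minimal sketch of this update in NumPy (not a production implementation; the names `rmsprop_step`, `lr`, `beta`, and `eps` are illustrative choices, not from the original note):

```python
import numpy as np

def rmsprop_step(theta, grad, v, lr=1e-3, beta=0.9, eps=1e-8):
    """One RMSProp step: v is an exponential moving average of squared gradients."""
    v = beta * v + (1.0 - beta) * grad**2           # v_t = beta * v_{t-1} + (1 - beta) * g_t^2
    theta = theta - lr * grad / (np.sqrt(v) + eps)  # theta_{t+1} = theta_t - eta / (sqrt(v_t) + eps) * g_t
    return theta, v

# Usage on a toy objective f(theta) = ||theta||^2 / 2, whose gradient is theta itself.
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
for _ in range(100):
    grad = theta
    theta, v = rmsprop_step(theta, grad, v, lr=0.1)
print(theta)  # ends up near the minimum at 0
```

Note that with a fixed learning rate the steps behave roughly like sign-sized steps once $v_t$ tracks the current gradient magnitude, so the iterates hover near the minimum rather than converging exactly.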
Adam
Adam combines momentum with RMSProp. Two history variables are now tracked:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2,$$
with $m_0 = v_0 = 0$. To account for the bias of $m_t$ and $v_t$ towards zero (caused by the zero initialization), the bias-corrected estimates
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
are used. Then, the parameter update is given as
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \odot \hat{m}_t.$$
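A minimal sketch of the full Adam step under the same assumptions (NumPy; `adam_step`, `lr`, `beta1`, `beta2`, and `eps` are illustrative names, with defaults set to the commonly used values):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step at iteration t (t starts at 1 so the bias correction is well defined)."""
    m = beta1 * m + (1.0 - beta1) * grad        # momentum: EMA of gradients
    v = beta2 * v + (1.0 - beta2) * grad**2     # RMSProp part: EMA of squared gradients
    m_hat = m / (1.0 - beta1**t)                # correct the bias from zero initialization
    v_hat = v / (1.0 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage on the same toy objective f(theta) = ||theta||^2 / 2.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 201):
    grad = theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
print(theta)  # close to 0
```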