9.4.4 in Boyd, Stephen, and Lieven Vandenberghe. 2004. “Convex Optimization.” http://www.cambridge.org.
Main Idea
Within Unconstrained Optimization (or root finding), Newton's method is a second-order, iterative method that makes use of the Hessian (or, in the root-finding setting, the Jacobian).
Definition
Suppose our objective is given by $f(x)$, with $x \in \mathbb{R}^n$, and $f$ sufficiently continuous (twice continuously differentiable). Newton's method is given by
$$x_{k+1} = x_k - \eta \left[\nabla^2 f(x_k)\right]^{-1} \nabla f(x_k),$$
where $\eta > 0$ is a stepsize parameter. Plain gradient descent can be written in an identical form, by using an identity matrix $I$ in place of the inverse Hessian,
$$x_{k+1} = x_k - \eta I \, \nabla f(x_k).$$
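As a minimal sketch of these two updates (assuming callables `grad` and `hess` for $\nabla f$ and $\nabla^2 f$, which are not given in the text), the Newton step can be implemented as a linear solve rather than an explicit inverse:

```python
import numpy as np

def newton_step(x, grad, hess, eta=1.0):
    """One damped Newton update: x - eta * [hess(x)]^{-1} grad(x).

    Solving the linear system avoids forming the inverse explicitly.
    """
    return x - eta * np.linalg.solve(hess(x), grad(x))

def gradient_step(x, grad, eta=0.1):
    """One plain gradient-descent update: x - eta * I grad(x)."""
    return x - eta * grad(x)
```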
For the sake of example, consider the quadratic objective, where $A \in \mathbb{R}^{n \times n}$ is symmetric and full-rank,
$$f(x) = \tfrac{1}{2} x^\top A x - b^\top x.$$
The gradient is given as
$$\nabla f(x) = A x - b,$$
and the Hessian is
$$\nabla^2 f(x) = A.$$
The optimum is given as $x^* = A^{-1} b$ (but we don't know this in general for other $f$), and the initial guess for the iterative method is given as $x_0$.
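As a concrete sketch of this setup (the particular $A$ and $b$ here are illustrative choices, not from the text), we can check the gradient, Hessian, and optimum numerically:

```python
import numpy as np

# Illustrative quadratic: f(x) = 0.5 * x^T A x - b^T x, with A symmetric positive definite.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

def grad(x):
    # Gradient of the quadratic: A x - b.
    return A @ x - b

def hess(x):
    # Hessian of the quadratic is constant and equal to A.
    return A

# Optimum x* = A^{-1} b (solve the linear system rather than inverting A).
x_star = np.linalg.solve(A, b)
print(x_star, grad(x_star))  # the gradient vanishes at the optimum
```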
For gradient descent,
$$x_{k+1} = x_k - \eta \nabla f(x_k) = x_k - \eta (A x_k - b).$$
We want to discuss what to set $\eta$ to such that $x_{k+1} = x^*$ in a single step; thus, we need
$$x_{k+1} = (I - \eta A)\, x_k + \eta b = A^{-1} b \quad \text{for any } x_k.$$
This requires that $\eta A = I$, which rarely happens, as $A$ is not necessarily even diagonal, let alone a multiple of the identity matrix. Motivated by this, let us introduce another matrix $P$, such that the gradient descent method gives
$$x_{k+1} = x_k - \eta P\, \nabla f(x_k).$$
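A minimal sketch of this preconditioned update as a helper (not something given in the text); taking `P` to be the identity matrix recovers plain gradient descent:

```python
import numpy as np

def preconditioned_gd_step(x, grad, P, eta=1.0):
    """One preconditioned gradient-descent update: x - eta * P @ grad(x)."""
    return x - eta * P @ grad(x)

# P = np.eye(x.shape[0]) gives back plain gradient descent.
```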
Here, $P$ is a Preconditioner for $A$. There is a whole literature on preconditioners for the iterative solution of linear systems. An introductory method is to take $P$ as a diagonal matrix whose elements are the inverse of the diagonal of $A$ (a Jacobi Preconditioner). That is,
$$P = \operatorname{diag}(A)^{-1}, \qquad P_{ii} = \frac{1}{A_{ii}}.$$
In this case, if $A$ is diagonal, then this descent method (with $\eta = 1$) will give $x_1 = x^*$, as $P = A^{-1}$. There are other methods to construct $P$, including the trivial gradient descent method of $P = I$. Although I'm not sure, I suspect that most (gradient-based) unconstrained methods can be written in this framework (such as BFGS, but not trust-region methods). We can consider the best possible choice of the preconditioner:
$$P = \left[\nabla^2 f(x_k)\right]^{-1},$$
which is exactly Newton's method.
In our example, this gives $P = A^{-1}$, and thus, for any $x_0$ (taking $\eta = 1$),
$$x_1 = x_0 - A^{-1} (A x_0 - b) = A^{-1} b = x^*.$$
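As a sketch of this comparison on the illustrative quadratic from before (again, the numbers are made up for illustration), the Jacobi preconditioner only gets part of the way there, while $P = A^{-1}$ lands on $x^*$ in one step:

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 * x^T A x - b^T x (same A, b as above).
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)

x0 = np.array([10.0, -7.0])           # arbitrary starting point
P_jacobi = np.diag(1.0 / np.diag(A))  # Jacobi preconditioner diag(A)^{-1}
P_newton = np.linalg.inv(A)           # "best" preconditioner, the inverse Hessian

step = lambda x, P, eta=1.0: x - eta * P @ grad(x)

print(step(x0, P_jacobi))  # closer to x*, but not exact since A is not diagonal
print(step(x0, P_newton))  # equal to x* after a single step
print(x_star)
```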
However, Newton's method is asymptotically more expensive per iteration, due to the computation of the inverse of the Hessian (or, in practice, a linear solve with it). Thus, other methods essentially seek to use heuristics to approximate the inverse of the Hessian, $\left[\nabla^2 f(x_k)\right]^{-1}$, more and more accurately.
RMSProp and Adam replace the preconditioning matrix with a diagonal matrix built from an exponential moving average of the previously squared gradients (each diagonal entry being the inverse square root of that running average).
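A minimal sketch of an RMSProp-style update in this diagonal-preconditioner view (the hyperparameter names `eta`, `beta`, `eps` and their defaults are standard conventions, stated here as assumptions rather than taken from the text):

```python
import numpy as np

def rmsprop_step(x, g, v, eta=1e-2, beta=0.9, eps=1e-8):
    """One RMSProp-style update with a diagonal preconditioner.

    v tracks an exponential moving average of the squared gradients;
    the effective preconditioner is diag(1 / (sqrt(v) + eps)).
    """
    v = beta * v + (1.0 - beta) * g**2
    x = x - eta * g / (np.sqrt(v) + eps)
    return x, v
```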