Newton's Method

Main Idea

Within Unconstrained Optimization (or root finding), Newton's method is a second-order, iterative optimization method, making use of the Hessian (or, in the root-finding setting, the Jacobian).

Definition

Suppose our objective is given by $L(\theta)$, with $\theta \in \mathbb{R}^n$, and $L$ sufficiently smooth (twice continuously differentiable). Newton's method is given by

$$\theta_{i+1} = \theta_i - \left[\nabla^2 L(\theta_i)\right]^{-1} \nabla L(\theta_i).$$
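As a concrete illustration (not part of the original definition), here is a minimal NumPy sketch of one Newton update, assuming callables `grad` and `hess` for $\nabla L$ and $\nabla^2 L$:

```python
import numpy as np

def newton_step(theta, grad, hess):
    """One Newton update: theta - [∇²L(theta)]⁻¹ ∇L(theta).

    `grad` and `hess` are assumed to be callables returning the gradient
    vector and Hessian matrix at `theta`; we solve the linear system
    rather than explicitly forming the inverse.
    """
    return theta - np.linalg.solve(hess(theta), grad(theta))
```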

Gradient Descent is given by

$$\theta_{i+1} = \theta_i - \rho_i \nabla L(\theta_i),$$

where $\rho_i$ is a stepsize parameter. We can rewrite this equivalently, using an identity matrix $I \in \mathbb{R}^{n \times n}$,

$$\theta_{i+1} = \theta_i - (\rho_i I)\,\nabla L(\theta_i).$$
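For comparison, a sketch of this same update in code, with the step size written explicitly as the preconditioner $\rho_i I$ (illustrative only):

```python
import numpy as np

def gradient_descent_step(theta, grad, rho):
    """One gradient-descent update, written as theta - (rho * I) ∇L(theta)."""
    I = np.eye(theta.shape[0])
    return theta - (rho * I) @ grad(theta)
```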

For the sake of example, consider the quadratic objective, where $K = R^T R$ is full-rank,

$$L(\theta) = \tfrac{1}{2}\,\theta^T \underbrace{R^T R}_{K}\,\theta.$$

The gradient is given as

$$\nabla L(\theta) = \tfrac{1}{2}(K + K^T)\theta = \tfrac{1}{2}\left((R^T R) + (R^T R)^T\right)\theta = K\theta$$

and the Hessian is

$$\nabla^2 L(\theta) = K.$$
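A quick numerical sanity check of the gradient above (a sketch; the random full-rank `R` and the seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
R = rng.standard_normal((n, n))       # assumed full rank
K = R.T @ R                           # K = RᵀR is symmetric (positive definite)

def L(theta):
    return 0.5 * theta @ K @ theta

def grad_L(theta):
    return K @ theta                  # ∇L(θ) = Kθ

def hess_L(theta):
    return K                          # ∇²L(θ) = K

theta = rng.standard_normal(n)
eps = 1e-6
# Central finite differences should match the analytic gradient.
fd_grad = np.array([(L(theta + eps * e) - L(theta - eps * e)) / (2 * eps)
                    for e in np.eye(n)])
assert np.allclose(fd_grad, grad_L(theta), atol=1e-4)
```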

The optimum is given as $\theta^\star = 0$ (but we don't know this in general for other $L(\theta)$), and the initial guess for the iterative method is given as $\theta_0$.

For gradient descent,

$$\theta_1 = \theta_0 - \rho_0 K\theta_0.$$

We want to determine what to set $\rho_0$ to such that $\theta_1 = \theta^\star = 0$; thus,

$$0 = \theta_0 - \rho_0 K\theta_0 = (I - \rho_0 K)\theta_0.$$

This requires (for arbitrary $\theta_0$) that $\rho_0 K = I$, which rarely happens, as $K$ is not necessarily even diagonal, let alone a multiple of the identity matrix. Motivated by this, let us introduce another matrix $T$, such that the gradient descent method gives

$$\theta_1 = \theta_0 - TK\theta_0.$$

Here, $T$ is a Preconditioner for $K$. There is a whole literature on preconditioners for the iterative solution of linear systems. An introductory method is to take $T$ as a diagonal matrix whose elements are the inverses of the diagonal entries of $K$ (a Jacobi Preconditioner). That is,

$$T_{jj} = \frac{1}{K_{jj}}.$$
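A sketch of the Jacobi preconditioner and the corresponding preconditioned update (the function names here are my own):

```python
import numpy as np

def jacobi_preconditioner(K):
    """Diagonal T with T_jj = 1 / K_jj."""
    return np.diag(1.0 / np.diag(K))

def preconditioned_step(theta, T, grad):
    """One preconditioned descent update: theta - T ∇L(theta)."""
    return theta - T @ grad(theta)
```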

In this case, if $K$ is diagonal, then this descent method will give $\theta_1 = \theta^\star$, as $T = K^{-1}$. There are other methods to construct $T$, including the trivial gradient descent method of $T = \rho_0 I$. Although I'm not sure, I suspect that most (gradient-based) unconstrained methods can be written in this framework (such as BFGS, but not trust-region methods). We can consider the best possible choice of the preconditioner:

$$T = \left[\nabla^2 L(\theta_0)\right]^{-1}.$$

In our example, this gives $T = K^{-1}$, and thus, for any $K$,

$$\theta_1 = \theta_0 - TK\theta_0 = \theta_0 - \left[\nabla^2 L(\theta_0)\right]^{-1} K\theta_0 = \theta_0 - K^{-1}K\theta_0 = 0 = \theta^\star.$$
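A self-contained numerical check of this one-step convergence (a sketch; the random `K` and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
R = rng.standard_normal((n, n))
K = R.T @ R                                  # quadratic objective L(θ) = ½ θᵀKθ
theta0 = rng.standard_normal(n)

grad_L = lambda theta: K @ theta             # ∇L(θ) = Kθ
hess_L = lambda theta: K                     # ∇²L(θ) = K

T = np.linalg.inv(hess_L(theta0))            # Newton preconditioner T = K⁻¹
theta1 = theta0 - T @ grad_L(theta0)
assert np.allclose(theta1, np.zeros(n))      # θ1 = θ* = 0 in a single step
```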

However, each iteration of Newton's method is asymptotically more expensive, due to forming and inverting (or solving with) the Hessian, roughly $O(n^3)$ for a dense solve versus $O(n)$ for a plain gradient step. Thus, other methods essentially use heuristics to approximate the inverse of the Hessian ($T$) more and more accurately.

RMSProp and Adam replace the preconditioning matrix $T$ with a diagonal matrix whose entries are the inverse (square roots) of an exponential moving average of the elementwise squared gradients.
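As a rough sketch of that idea (not the exact Adam update, which additionally uses momentum and bias correction; the hyperparameter names `rho`, `beta`, and `eps` are conventional, not from the original):

```python
import numpy as np

def rmsprop_step(theta, grad, v, rho=1e-2, beta=0.9, eps=1e-8):
    """One RMSProp-style update: a diagonal preconditioner built from an
    exponential moving average (EMA) of elementwise squared gradients.

    Equivalent to theta - T g with T = diag(rho / (sqrt(v) + eps)).
    """
    g = grad(theta)
    v = beta * v + (1.0 - beta) * g**2       # EMA of squared gradients
    theta = theta - rho * g / (np.sqrt(v) + eps)
    return theta, v
```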