Newton's Method

Main Idea

Within Unconstrained Optimization (or root finding), Newton's method is a second-order, iterative optimization method, making use of the Hessian (or, in the root-finding setting, the Jacobian).

Definition

Suppose our objective is given by $L(\theta)$, with $\theta \in \mathbb{R}^n$, and $L$ sufficiently smooth (twice continuously differentiable). Newton's method is given by

$$\theta_{i+1} = \theta_i - \left[\nabla^2 L(\theta_i)\right]^{-1} \nabla L(\theta_i).$$
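As a concrete illustration (not part of the original definition), here is a minimal NumPy sketch of one Newton update, assuming callables `grad` and `hess` for $\nabla L$ and $\nabla^2 L$:

```python
import numpy as np

def newton_step(theta, grad, hess):
    """One Newton update: theta - [∇²L(theta)]⁻¹ ∇L(theta).

    `grad` and `hess` are assumed to be callables returning the gradient
    vector and Hessian matrix at `theta`; we solve the linear system
    rather than explicitly forming the inverse.
    """
    return theta - np.linalg.solve(hess(theta), grad(theta))
```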

Gradient Descent is given by

$$\theta_{i+1} = \theta_i - \rho_i \nabla L(\theta_i),$$

where $\rho_i$ is a stepsize parameter. We can rewrite this equivalently, using an identity matrix $I \in \mathbb{R}^{n \times n}$,

$$\theta_{i+1} = \theta_i - (\rho_i I)\,\nabla L(\theta_i).$$
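For comparison, a sketch of this same update in code, with the step size written explicitly as the preconditioner $\rho_i I$ (illustrative only):

```python
import numpy as np

def gradient_descent_step(theta, grad, rho):
    """One gradient-descent update, written as theta - (rho * I) ∇L(theta)."""
    I = np.eye(theta.shape[0])
    return theta - (rho * I) @ grad(theta)
```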

For the sake of example, consider the quadratic objective, where $K = R^T R$ is full-rank,

$$L(\theta) = \tfrac{1}{2}\,\theta^T \underbrace{R^T R}_{K}\,\theta.$$

The gradient is given as

$$\nabla L(\theta) = \tfrac{1}{2}(K + K^T)\theta = \tfrac{1}{2}\left((R^T R) + (R^T R)^T\right)\theta = K\theta$$

and the Hessian is

$$\nabla^2 L(\theta) = K.$$
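A quick numerical sanity check of the gradient above (a sketch; the random full-rank `R` and the seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
R = rng.standard_normal((n, n))       # assumed full rank
K = R.T @ R                           # K = RᵀR is symmetric (positive definite)

def L(theta):
    return 0.5 * theta @ K @ theta

def grad_L(theta):
    return K @ theta                  # ∇L(θ) = Kθ

def hess_L(theta):
    return K                          # ∇²L(θ) = K

theta = rng.standard_normal(n)
eps = 1e-6
# Central finite differences should match the analytic gradient.
fd_grad = np.array([(L(theta + eps * e) - L(theta - eps * e)) / (2 * eps)
                    for e in np.eye(n)])
assert np.allclose(fd_grad, grad_L(theta), atol=1e-4)
```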

The optimum is given as $\theta^\star = 0$ (but we don't know this in general for other $L(\theta)$), and the initial guess for the iterative method is given as $\theta_0$.

For gradient descent,

$$\theta_1 = \theta_0 - \rho_0 K\theta_0.$$

We want to determine what to set $\rho_0$ to such that $\theta_1 = \theta^\star = 0$; thus,

$$0 = \theta_0 - \rho_0 K\theta_0 = (I - \rho_0 K)\theta_0.$$

This requires (for arbitrary $\theta_0$) that $\rho_0 K = I$, which rarely happens, as $K$ is not necessarily even diagonal, let alone a multiple of the identity matrix. Motivated by this, let us introduce another matrix $T$, such that the gradient descent method gives

$$\theta_1 = \theta_0 - TK\theta_0.$$

Here, $T$ is a Preconditioner for $K$. There is a whole literature on preconditioners for the iterative solution of linear systems. An introductory method is to take $T$ as a diagonal matrix whose elements are the inverses of the diagonal entries of $K$ (a Jacobi Preconditioner). That is,

$$T_{jj} = \frac{1}{K_{jj}}.$$
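A sketch of the Jacobi preconditioner and the corresponding preconditioned update (the function names here are my own):

```python
import numpy as np

def jacobi_preconditioner(K):
    """Diagonal T with T_jj = 1 / K_jj."""
    return np.diag(1.0 / np.diag(K))

def preconditioned_step(theta, T, grad):
    """One preconditioned descent update: theta - T ∇L(theta)."""
    return theta - T @ grad(theta)
```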

In this case, if $K$ is diagonal, then this descent method will give $\theta_1 = \theta^\star$, as $T = K^{-1}$. There are other methods to construct $T$, including the trivial gradient descent method of $T = \rho_0 I$. Although I'm not sure, I suspect that most (gradient-based) unconstrained methods can be written in this framework (such as BFGS, but not trust-region methods). We can consider the best possible choice of the preconditioner:

$$T = \left[\nabla^2 L(\theta_0)\right]^{-1}.$$

In our example, this gives $T = K^{-1}$, and thus, for any $K$,

$$\theta_1 = \theta_0 - TK\theta_0 = \theta_0 - \left[\nabla^2 L(\theta_0)\right]^{-1} K\theta_0 = \theta_0 - K^{-1}K\theta_0 = 0 = \theta^\star.$$
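A self-contained numerical check of this one-step convergence (a sketch; the random `K` and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
R = rng.standard_normal((n, n))
K = R.T @ R                                  # quadratic objective L(θ) = ½ θᵀKθ
theta0 = rng.standard_normal(n)

grad_L = lambda theta: K @ theta             # ∇L(θ) = Kθ
hess_L = lambda theta: K                     # ∇²L(θ) = K

T = np.linalg.inv(hess_L(theta0))            # Newton preconditioner T = K⁻¹
theta1 = theta0 - T @ grad_L(theta0)
assert np.allclose(theta1, np.zeros(n))      # θ1 = θ* = 0 in a single step
```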

However, each iteration of Newton's method is asymptotically more expensive, due to forming and inverting (or solving with) the Hessian, roughly $O(n^3)$ for a dense solve versus $O(n)$ for a plain gradient step. Thus, other methods essentially use heuristics to approximate the inverse of the Hessian ($T$) more and more accurately.

RMSProp and Adam replace the preconditioning matrix $T$ with a diagonal matrix whose entries are the inverse (square roots) of an exponential moving average of the elementwise squared gradients.
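As a rough sketch of that idea (not the exact Adam update, which additionally uses momentum and bias correction; the hyperparameter names `rho`, `beta`, and `eps` are conventional, not from the original):

```python
import numpy as np

def rmsprop_step(theta, grad, v, rho=1e-2, beta=0.9, eps=1e-8):
    """One RMSProp-style update: a diagonal preconditioner built from an
    exponential moving average (EMA) of elementwise squared gradients.

    Equivalent to theta - T g with T = diag(rho / (sqrt(v) + eps)).
    """
    g = grad(theta)
    v = beta * v + (1.0 - beta) * g**2       # EMA of squared gradients
    theta = theta - rho * g / (np.sqrt(v) + eps)
    return theta, v
```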