Multi-Objective Optimization

Resources

Main Idea

Within Optimization, there are multiple objective functions, and the goal is formulated as

$$\min_{x \in X} \underbrace{\begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_k(x) \end{bmatrix}}_{f(x)},$$

with $k \geq 2$, and $X$ is the feasible set, typically $X \subseteq \mathbb{R}^n$ (which can incorporate constraints), and $f : X \to \mathbb{R}^k$. Denote $Y = \{f(x) : x \in X\} \subseteq \mathbb{R}^k$.

Details

A feasible solution $x_1 \in X$ is said to Pareto dominate another solution $x_2 \in X$ if

$$\forall i \in [k],\ f_i(x_1) \leq f_i(x_2), \quad \text{and} \quad \exists i \in [k],\ f_i(x_1) < f_i(x_2).$$

That is, $x_1$ dominates $x_2$ if it is at least as good as $x_2$ on every objective function, and strictly better on at least one. A solution $x^* \in X$ is Pareto optimal (or Pareto efficient) if no other solution dominates it:

$$\neg(\exists x \in X : x \text{ dominates } x^*) \iff \forall x \in X : x \text{ does not dominate } x^*.$$

The Pareto front is $X^* = \{x \in X : x \text{ is Pareto optimal}\}$. This is bounded by the nadir objective vector, $z^{\text{nadir}}$, which is a collection of the worst (greatest) objective values achieved by $x \in X^*$. Conversely, the ideal objective vector $z^{\text{ideal}}$ is a collection of the best (least) objective values. Thus, for any $x \in X^*$, $z^{\text{ideal}} \leq f(x) \leq z^{\text{nadir}}$ in a component-wise sense (the lower bound holds even for all $x \in X$; the nadir only bounds the Pareto front).
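
These definitions are easy to check numerically on a finite set of candidate solutions. A minimal sketch, assuming NumPy; the function names and objective values are made up for illustration:

```python
import numpy as np

def dominates(f1, f2):
    """True if objective vector f1 Pareto dominates f2 (minimization)."""
    f1, f2 = np.asarray(f1), np.asarray(f2)
    return bool(np.all(f1 <= f2) and np.any(f1 < f2))

def pareto_front(F):
    """Indices of Pareto-optimal rows of F (each row = f(x) for one x)."""
    F = np.asarray(F)
    return [i for i in range(len(F))
            if not any(dominates(F[j], F[i]) for j in range(len(F)) if j != i)]

# Toy bi-objective values for five candidate solutions (illustrative only).
F = np.array([[1.0, 4.0], [2.0, 2.0], [4.0, 1.0], [3.0, 3.0], [2.0, 5.0]])
front = pareto_front(F)          # rows [3.0, 3.0] and [2.0, 5.0] are dominated
z_ideal = F[front].min(axis=0)   # componentwise best over the front
z_nadir = F[front].max(axis=0)   # componentwise worst over the front
```

Note the quadratic cost in the number of candidates; this is only meant to make the dominance relation concrete.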

Scalarization

The process of scalarization converts f to a scalar-valued function, which is instead optimized. The global criterion scalarization problem considers

$$\min_{x \in X} \|f(x) - z^{\text{ideal}}\|,$$

where the norm can be any Lebesgue Space norm, e.g. $L_1$, $L_2$, $L_\infty$. Note that this depends on the scale of the objective functions, so some uniform, dimensionless scale is recommended.
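
A small sketch of the global criterion with rescaling, using the ideal and nadir points to make the objectives dimensionless. The function name `global_criterion` and all values are illustrative:

```python
import numpy as np

def global_criterion(fx, z_ideal, z_nadir, ord=2):
    """Distance from f(x) to the ideal point, after rescaling each objective
    to [0, 1] via the ideal/nadir range so the norm is dimensionless."""
    fx, z_ideal, z_nadir = map(np.asarray, (fx, z_ideal, z_nadir))
    scaled = (fx - z_ideal) / (z_nadir - z_ideal)
    return np.linalg.norm(scaled, ord=ord)

# With z_ideal = (1, 1), z_nadir = (4, 4), the point f(x) = (2, 2)
# rescales to (1/3, 1/3) before taking the norm.
value = global_criterion([2.0, 2.0], [1.0, 1.0], [4.0, 4.0])
```

Passing `ord=1` or `ord=np.inf` switches the $L_1$ / $L_\infty$ variants via `np.linalg.norm`.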

More generally, a scalarization uses a function $g : \mathbb{R}^k \times \mathbb{R}^{|\theta|} \to \mathbb{R}$ (potentially with parameters $\theta$) to convert to a single-objective problem:

$$\min g(f_1(x), \ldots, f_k(x), \theta), \quad \text{s.t. } x \in X_\theta.$$

Here, we also have $X_\theta \subseteq X$.

Linear Scalarization

$$\min_{x \in X} \sum_{i=1}^k \theta_i f_i(x), \quad \theta_i > 0$$
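
As a toy illustration over a finite candidate set (the values are made up), each strictly positive weight vector selects one Pareto-optimal candidate; sweeping the weights traces out different trade-offs:

```python
import numpy as np

# Each row of F holds (f1(x), f2(x)) for one candidate x (illustrative values).
F = np.array([[1.0, 4.0], [2.0, 2.0], [4.0, 1.0]])
theta = np.array([0.5, 0.5])       # strictly positive weights
scores = F @ theta                 # theta_1 * f1 + theta_2 * f2 per candidate
best = int(np.argmin(scores))      # the balanced candidate (2, 2) wins here
```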

ϵ-Constraint Method

$$\min f_j(x), \quad \text{s.t. } x \in X,\ f_i(x) \leq \varepsilon_i,\ \forall i \in [k] \setminus \{j\}.$$

Here, the parameters $\theta$ correspond to the $\varepsilon_i$. We have applied this approach in Partial Differential Equation Discovery. Given $X = \mathbb{R}^n$, this requires solving a Constrained Optimization problem, as opposed to an Unconstrained Optimization problem for the linear scalarization.
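
A minimal sketch of the $\epsilon$-constraint method, assuming SciPy is available and using a toy bi-objective problem of my choosing (not the PDE-discovery setting):

```python
from scipy.optimize import minimize

# Toy bi-objective problem (values chosen for illustration):
# f1(x) = x^2, f2(x) = (x - 2)^2, with a budget eps on f2.
f1 = lambda x: x[0] ** 2
f2 = lambda x: (x[0] - 2.0) ** 2
eps = 1.0

# Minimize f1 subject to f2(x) <= eps.  SLSQP's "ineq" constraints
# require g(x) >= 0, so we pass g(x) = eps - f2(x).
res = minimize(f1, x0=[1.5], method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda x: eps - f2(x)}])
# The constrained optimum sits on the boundary f2(x) = eps, at x = 1.
```

Varying `eps` traces out different Pareto-optimal points, which is the usual way this scalarization is used.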

Other methods include hypervolume scalarization and Chebyshev scalarization.

Jacobian Descent

Recalling $f : \mathbb{R}^n \to \mathbb{R}^k$, the Jacobian of $f$ is $Jf : \mathbb{R}^n \to \mathbb{R}^{k \times n}$ (the transpose of what stacking gradients side by side would look like). This is a matrix for a given input $x$, with entries $[Jf]_{ij} = \frac{\partial f_i}{\partial x_j}$, or

$$Jf = \begin{bmatrix} \nabla f_1^T \\ \nabla f_2^T \\ \vdots \\ \nabla f_k^T \end{bmatrix}$$

A Taylor expansion about $x$ with perturbation $\Delta x$ gives

$$f(x + \Delta x) = f(x) + [Jf(x)] \Delta x + O(\|\Delta x\|^2).$$

We define an aggregator function $A : \mathbb{R}^{k \times n} \to \mathbb{R}^n$. In other words, $A$ converts the Jacobian to a vector of the same size as the gradient, allowing for an update rule similar to that of Gradient Descent, with $\eta > 0$:

$$x_{j+1} = x_j - \eta A[Jf(x_j)].$$

For example, when $k = 1$, we take $A$ as the transpose operation and recover gradient descent exactly.
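
As a sketch of the update rule, with made-up quadratic objectives and a simple mean aggregator (a weighted aggregator with $w = \frac{1}{k}\mathbf{1}$, so it matches the linear scalarization rather than a conflict-aware method):

```python
import numpy as np

def jacobian_descent(jac, x0, aggregator, lr=0.1, steps=200):
    """Iterate x_{j+1} = x_j - lr * A(Jf(x_j)); aggregator maps (k, n) -> (n,)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * aggregator(jac(x))
    return x

# Made-up objectives on R^2: f1 = ||x - a||^2, f2 = ||x - b||^2.
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
jac = lambda x: np.stack([2.0 * (x - a), 2.0 * (x - b)])  # rows = gradients
mean_agg = lambda J: J.mean(axis=0)  # A(J) = J^T w with w = (1/k, ..., 1/k)
x_star = jacobian_descent(jac, [5.0, 5.0], mean_agg)
# x_star approaches (0.5, 0.5), the minimizer of (f1 + f2) / 2
```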

We want to decrease $f$ across all of its entries. Thus, according to the Taylor expansion, we require

$$f(x + \Delta x) - f(x) \approx (Jf(x)) \underbrace{(-\eta A[Jf(x)])}_{\Delta x} \preceq 0,$$

where $\preceq$ is the relation for the (natural) Partial Order on $\mathbb{R}^k$ ($a \preceq 0 \iff a_i \leq 0,\ \forall i$). Take $u$ as row $i$ of $Jf(x) \in \mathbb{R}^{k \times n}$ (so $u \in \mathbb{R}^n$), and $v$ as $A[Jf(x)]$ (which is just a single column). If

$$u^T v = u^T A[Jf(x)] < 0,$$

then the objective function $f_i$ will increase with this step $\Delta x = -\eta A[Jf(x)]$ (according to the first-order Taylor Expansion). This does not always mean that $x$ Pareto dominates $x + \Delta x$: if another objective function decreases, neither point dominates the other.

Definition

Let $A : \mathbb{R}^{k \times n} \to \mathbb{R}^n$ be an aggregator. For any $J \in \mathbb{R}^{k \times n}$, the aggregator and $J$ are nonconflicting if

$$J A(J) \succeq 0.$$

Preconditioned Gradient Descent (Normalized Steepest Descent)

Let $k = 1$. Consider an $n \times n$ matrix $P$. Normalized steepest descent in the space described by $\|\cdot\|_P$ gives the update as

$$\Delta x = -P^{-1} \nabla f(x).$$

With the same restriction that the step decreases the objective function, we require that

$$\nabla f(x)^T \underbrace{(-P^{-1} \nabla f(x))}_{\Delta x} \leq 0.$$

$P$ cannot depend on $\nabla f(x)$, so this must hold for any $\nabla f(x)$. Thus, we require that $P^{-1}$ is positive semi-definite. The corresponding $A(J) = P^{-1} J^T$ is nonconflicting for all $J \in \mathbb{R}^{1 \times n}$, as $J P^{-1} J^T \geq 0$.

In other words, the nonconflicting requirement on J and A is that of positive semi-definiteness.
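
A quick numeric sanity check (with an arbitrary PSD $P^{-1}$ built as $BB^T$, all values illustrative) that $A(J) = P^{-1} J^T$ is nonconflicting when $k = 1$:

```python
import numpy as np

# Illustrative PSD preconditioner: P^{-1} = B B^T is PSD by construction.
rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
P_inv = B @ B.T

A = lambda J: P_inv @ J.T          # aggregator A(J) = P^{-1} J^T for k = 1
J = rng.normal(size=(1, 3))        # a 1 x n Jacobian (a gradient row)
value = (J @ A(J)).item()          # J A(J) = ||B^T J^T||^2 >= 0
```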

Definition

Let $A : \mathbb{R}^{k \times n} \to \mathbb{R}^n$ be an aggregator. If for all $J \in \mathbb{R}^{k \times n}$ and $c \in \mathbb{R}^k$, $c \geq 0$, the mapping

$$c \mapsto A(\operatorname{diag}(c) J)$$

is linear, then $A$ is linear under scaling.

Here, $c$ is some weighting of the rows of $J$, meaning that if a given row of $J$ changes by a constant factor, its contribution to $\Delta x$ (through $A$) should change similarly.

A weighted aggregator can be written as

$$A(J) = J^T w,$$

for some $w \in \mathbb{R}^k$, for all $J$. This corresponds to the linear scalarization approach.

The Projection of $x \in \mathbb{R}^n$ onto the dual cone of the rows of $J$ is

$$\pi_J(x) = \operatorname*{arg\,min}_{y :\ Jy \geq 0} \|y - x\|_2$$

We can think of the $Jy \geq 0$ constraint as a convex cone in $\mathbb{R}^n$. Adding more rows to $J$ can only make this cone narrower. If $A[Jf(x)]$ lies within this cone for $J = Jf(x)$, then according to the first-order Taylor Expansion, taking the step $\Delta x = -\eta A[Jf(x)]$ should reduce $f$ across all of its entries. The projection gives the nearest point for which this holds, or the current point itself, if it already reduces $f$ across all entries. It's possible for the dual cone to be just $\{0\}$, in which case the projection is always $0$. Intuitively, the dual cone is the set of vectors within 90 degrees of every row of $J$.
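
A straightforward (though not particularly efficient) way to compute $\pi_J$ is to treat it as a small quadratic program. The sketch below assumes NumPy and SciPy; `project_dual_cone` is an illustrative name, not a standard API:

```python
import numpy as np
from scipy.optimize import minimize

def project_dual_cone(J, x):
    """Project x onto {y : Jy >= 0}, the dual cone of the rows of J.
    Solves the QP  min_y ||y - x||^2  s.t.  Jy >= 0  with SLSQP."""
    x = np.asarray(x, dtype=float)
    res = minimize(lambda y: np.sum((y - x) ** 2), x0=x,
                   jac=lambda y: 2.0 * (y - x), method="SLSQP",
                   constraints=[{"type": "ineq", "fun": lambda y: J @ y}])
    return res.x

# With a single row J = [1, 0], the dual cone is the half-space y_1 >= 0,
# so projecting (-2, 3) clips the first coordinate to 0.
J = np.array([[1.0, 0.0]])
y = project_dual_cone(J, np.array([-2.0, 3.0]))
```

For many rows or repeated projections, a dedicated QP solver (active-set or dual methods) would be the more efficient route.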

Question

How can $\pi_J$ be implemented efficiently?