Posts (2) containing the ‘math.DG’ (Differential Geometry) tag:

Solving Linear Programs via Isospectral flows |05 September 2021|
tags: math.OC, math.DS, math.DG

In this post we will look at one of the many remarkable findings by Roger W. Brockett. Consider a Linear Program (LP)

 mathrm{maximize}_{xin X}quad langle c,x ranglequad (1)

parametrized by the compact set X={xin mathbf{R}^n:Axleq b} and a suitable triple (A,b,c). As a solution to (1) can always be found at a vertex of X, a smooth method to solve (1) seems somewhat awkward. We will see that one can construct a so-called isospectral flow that does the job. Here we will follow Dynamical systems that sort lists, diagonalize matrices and solve linear programming problems by Roger W. Brockett (CDC 1988) and the book Optimization and Dynamical Systems by Uwe Helmke and John B. Moore (Springer, 2nd ed., 1996). Let X have k vertices; then one can always find a map T:mathbf{R}^kto mathbf{R}^n mapping the simplex S={xin mathbf{R}_{geq 0}^k:sum_{j}x_j=1} onto X. Indeed, with some abuse of notation, let T be the matrix T=(v_1,dots,v_k), where {v_j}_{j=1}^k are the vertices of X.
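
To make the vertex parametrization concrete, here is a minimal Python sketch (numpy; the unit square and the simplex point are hypothetical choices, not from the references):

    import numpy as np

    # hypothetical polytope X: the unit square in R^2, vertices v_1, ..., v_4
    V = [np.array([0.0, 0.0]), np.array([1.0, 0.0]),
         np.array([1.0, 1.0]), np.array([0.0, 1.0])]
    T = np.column_stack(V)              # T = (v_1, ..., v_k)

    x = np.array([0.1, 0.2, 0.3, 0.4])  # a point on the simplex S (sums to 1)
    print(T @ x)                        # corresponding point in X: [0.5 0.7]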

Before we continue, we need to establish some differential-geometric results. Given the special orthogonal group mathsf{SO}(n)={Qin mathbf{R}^{ntimes n}:Q^{mathsf{T}}Q=I_n}cap mathsf{GL}^+(n,mathbf{R}), the tangent space at Q is given by T_Q mathsf{SO}(n)={QOmega : Omegain mathrm{skew}(n,mathbf{R})}. Note that this is the explicit formulation, which is indeed equivalent to shifting the corresponding Lie algebra. The easiest way to compute this is to look at the kernel of the map defining the underlying manifold.
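
As a quick numerical sanity check of this tangent-space description (a sketch with random placeholder matrices, using scipy's matrix exponential): the curve obtained by multiplying Q with expm(tOmega) stays orthogonal and has velocity QOmega at Q.

    import numpy as np
    from scipy.linalg import expm

    rng = np.random.default_rng(0)
    n = 4
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # a point of the group
    W = rng.standard_normal((n, n)); W = W - W.T      # a skew-symmetric Omega

    t = 1e-6
    R = Q @ expm(t * W)                               # curve through Q
    print(np.linalg.norm(R.T @ R - np.eye(n)))        # ~ 0: still orthogonal
    print(np.linalg.norm((R - Q) / t - Q @ W))        # ~ 0: velocity is Q Omega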

Now, following Brockett, consider the function f:mathsf{SO}(n)to mathbf{R} defined by f:Thetamapsto mathrm{Tr}(QTheta NTheta^{mathsf{T}}) for some Q,Nin mathsf{Sym}(n). This detour is not strictly needed for the full construction, but it provides intuition and allows for more explicit computations. To construct the corresponding gradient flow, recall that the (Riemannian) gradient at Thetain mathsf{SO}(n) is defined via df(Theta)[V]=langle mathrm{grad},f(Theta), Vrangle_{Theta} for all Vin T_{Theta} mathsf{SO}(n). Using the explicit tangent-space representation, we know that V=Theta Omega with Omega = -Omega^{mathsf{T}}. Then, see that by using

 f(Theta+tV) = mathrm{Tr}(QTheta(I_n+tOmega)N(I_n-tOmega)Theta^{mathsf{T}})

we obtain the directional derivative via

 df(Theta)[V]=lim_{tdownarrow 0}frac{f(Theta+tV)-f(Theta)}{t} = langle QTheta N, Theta Omega rangle - langle Theta N Theta^{mathsf{T}} Q Theta, Theta Omega rangle.

This means that the (a) gradient flow becomes (note that the minus sign is missing in the paper)

 dot{Theta} = mathrm{grad},f(Theta) = QTheta N-Theta NTheta^{mathsf{T}}QTheta, quad Theta(0)in mathsf{SO}(n).
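
This directional-derivative expression is easy to sanity-check numerically; a minimal sketch with random symmetric Q and N (placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    Q = rng.standard_normal((n, n)); Q = Q + Q.T      # Q in Sym(n)
    N = rng.standard_normal((n, n)); N = N + N.T      # N in Sym(n)
    Theta, _ = np.linalg.qr(rng.standard_normal((n, n)))
    W = rng.standard_normal((n, n)); W = W - W.T      # skew Omega
    V = Theta @ W                                     # tangent vector at Theta

    f = lambda M: np.trace(Q @ M @ N @ M.T)
    ip = lambda A, B: np.trace(A.T @ B)               # Frobenius inner product

    t = 1e-6
    fd = (f(Theta + t * V) - f(Theta - t * V)) / (2 * t)   # central difference
    an = ip(Q @ Theta @ N, V) - ip(Theta @ N @ Theta.T @ Q @ Theta, V)
    print(fd, an)                                     # the two values agree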

Consider the standard commutator bracket [A,B]=AB-BA and see that for H(t)=Theta(t)^{mathsf{T}}QTheta(t) one obtains from the equation above (there is a typo in the paper)

 dot{H}(t) = [H(t),[H(t),N]],quad H(0)=Theta(0)^{mathsf{T}}QTheta(0)quad (2).
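
To see this, note that the gradient can be written compactly as QTheta N-Theta NTheta^{mathsf{T}}QTheta=Theta(HN-NH)=Theta[H,N], so that

 dot{H} = dot{Theta}^{mathsf{T}}QTheta+Theta^{mathsf{T}}Qdot{Theta} = [H,N]^{mathsf{T}}H+H[H,N] = H[H,N]-[H,N]H = [H,[H,N]],

where we used that [H,N] is skew-symmetric for symmetric H and N.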

Hence, (2) can be seen as a reparametrization of a gradient flow. It turns out that (2) has a variety of remarkable properties. First of all, see that H(t) preserves the eigenvalues of Q, that is, the flow is isospectral. Also, observe the relation between extremizing f and the function g defined via g:H=Theta^{mathsf{T}}QTheta mapsto -frac{1}{2}|N-H|_F^2; indeed, |N-H|_F^2=|N|_F^2+|H|_F^2-2mathrm{Tr}(HN) and |H|_F is constant along (2). The idea for handling LPs is that the limiting H(t) will relate to putting weight one on the correct vertex to get the optimizer; N provides this weighting, as it contains the corresponding costs.
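
Both claims are easy to check numerically. A minimal sketch (scipy; the spectra and the horizon are arbitrary placeholder choices): integrating (2) preserves the spectrum of Q and drives H(t) towards a diagonal matrix.

    import numpy as np
    from scipy.integrate import solve_ivp

    rng = np.random.default_rng(0)
    n = 4
    N = np.diag([4.0, 3.0, 2.0, 1.0])          # distinct eigenvalues
    Q = np.diag([3.0, 1.0, 1.0, 0.0])
    Theta, _ = np.linalg.qr(rng.standard_normal((n, n)))
    H0 = Theta.T @ Q @ Theta                   # H(0) in M(Q)

    def rhs(t, h):
        H = h.reshape(n, n)
        K = H @ N - N @ H                      # [H, N]
        return (H @ K - K @ H).ravel()         # [H, [H, N]]

    sol = solve_ivp(rhs, (0.0, 30.0), H0.ravel(), rtol=1e-9, atol=1e-9)
    H_inf = sol.y[:, -1].reshape(n, n)
    print(np.sort(np.linalg.eigvalsh(H_inf)))  # the eigenvalues of Q
    print(np.round(H_inf, 4))                  # (numerically) diagonal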

In fact, the matrix H can be seen as an element of the set mathsf{M}(Q)={Theta^{mathsf{T}}QTheta:Thetain mathsf{O}(n)}. This set is in fact a C^{infty}-smooth compact manifold, as it is the orbit of Q under the group action sigma:mathsf{O}(n)times mathbf{R}^{ntimes n}to mathbf{R}^{ntimes n}, sigma:(Theta,Q)mapsto Theta^{mathsf{T}}QTheta; one can check that this map satisfies the group-action properties. Hence, to extremize g over mathsf{M}(Q), it is appealing to look at Riemannian optimization tools indeed. When doing so, it is convenient to understand the tangent space of mathsf{M}(Q). Consider the map defining the manifold, h:mathsf{O}(n)to mathsf{M}(Q), h:Theta mapsto Theta^{mathsf{T}}QTheta. Then, by the construction of T_{Theta}mathsf{O}(n), one finds dh(Theta)[Theta Omega]=[H,Omega], so that dh(Theta)[V]=0 yields the relation [H,Omega]=0, for Omegain mathrm{skew}(n,mathbf{R}).

For the moment, let Q=mathrm{diag}(lambda_{1}I_{n_1},dots,lambda_{r}I_{n_r})in mathsf{Sym}(n) such that lambda_{1}>cdots>lambda_{r} and sum_{i}n_i=n. First we consider the convergence of (2). Let N have only distinct eigenvalues; then H_{infty}:=lim_{tto infty}H(t) exists and is diagonal. Using the objective f from before, consider f(H)=mathrm{Tr}(HN) and see that, by using the skew-symmetry of [H,N], one recovers the following:

 frac{d}{dt}mathrm{Tr}(H(t)N) = mathrm{Tr}(N[H,[H,N]]) = -mathrm{Tr}((HN-NH)^2) = |HN-NH|_F^2.

This means the cost monotonically increases and, by compactness, H(t) converges to some point H_{infty}. By construction, this point must satisfy [H_{infty},N]=0. As N has distinct eigenvalues, this can only be true if H_{infty} itself is diagonal.

More can be said about H_{infty}. Let (lambda_1,dots,lambda_n) be the eigenvalues of H(0), that is, the eigenvalues of Q as defined above. Then, as H(t) preserves the eigenvalues of H(0), we must have H_{infty}=pi Q pi^{mathsf{T}} for pi a permutation matrix. This also tells us that there is just a finite number of equilibrium points (a finite number of permutations). We will sometimes write this as H_{infty}=mathrm{diag}(lambda_{pi(1)},dots,lambda_{pi(n)}).

Now as Q is one of those points, when does H(t) converge to Q? To start this investigation, we look at the linearization of (2), which at an equilibrium point H_{infty} becomes (with mu_1,dots,mu_n the eigenvalues of N)

 dot{xi}_{ij} = -(lambda_{pi(i)}-lambda_{pi(j)})(mu_i-mu_j)xi_{ij}

for xiin T_{H_{infty}}mathsf{M}(Q). As we work with matrix-valued vector fields, this might seem like a daunting computation. However, at equilibrium points one does not need a connection and can again use the directional-derivative approach, in combination with the construction of T_{H_{infty}}mathsf{M}(Q), to figure out the linearization. The beauty is that from there one can see that Q is the only asymptotically stable equilibrium point of (2). Differently put, almost all initial conditions H(0)in mathsf{M}(Q) will converge to Q, with the rate captured by the spectral gaps in Q and N. If Q does not have distinct eigenvalues and we do not impose any eigenvalue ordering on N, one sees that an asymptotically stable equilibrium point H_{infty} must have the same eigenvalue ordering as N. This is the sorting property of the isospectral flow, and it is of use for the next and final statement.
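
The linearization itself can also be verified with a central difference; a small sketch at the equilibrium H_{infty}=Q, so that pi is the identity (the spectra are placeholders):

    import numpy as np

    lam = np.array([3.0, 1.0, 0.5])            # eigenvalues of Q = H_infty
    mu = np.array([4.0, 2.0, 1.0])             # eigenvalues of N
    H, N, n = np.diag(lam), np.diag(mu), 3

    def F(H):                                  # the double-bracket field
        K = H @ N - N @ H
        return H @ K - K @ H

    i, j, t = 0, 1, 1e-7
    E = np.zeros((n, n)); E[i, j] = E[j, i] = 1.0   # tangent direction xi
    dF = (F(H + t * E) - F(H - t * E)) / (2 * t)
    print(dF[i, j], -(lam[i] - lam[j]) * (mu[i] - mu[j]))  # both equal -4.0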

Theorem: Consider the LP (1) with langle c, v_i-v_jrangle neq 0 for all ineq j in [k]. Then there exist diagonal matrices Q and N such that (2) converges for almost any H(0)in mathsf{M}(Q) to H_{infty}=mathrm{diag}(d), with the optimizer of (1) being x^{star}=Td.

Proof: Global convergence is prohibited by the topology of mathsf{M}(Q). Let N=mathrm{diag}(T^{mathsf{T}}c) and let Q=mathrm{diag}(1,0,dots,0)in mathsf{Sym}(k). Then, by the sorting property, the isospectral flow converges from almost everywhere to H_{infty}=mathrm{diag}(d), where d_i=1 for the index i maximizing langle c,v_irangle over iin[k] and d_j=0 otherwise, such that x^{star}=Td=v_i.
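
As a worked sketch of this recipe (the square polytope and cost vector are hypothetical): build T from the vertices, set N=mathrm{diag}(T^{mathsf{T}}c) and Q=mathrm{diag}(1,0,dots,0), integrate (2), and read off the optimal vertex.

    import numpy as np
    from scipy.integrate import solve_ivp

    T = np.array([[0.0, 1.0, 1.0, 0.0],    # unit square, vertices as columns
                  [0.0, 0.0, 1.0, 1.0]])
    c = np.array([1.0, 2.0])               # maximize <c, x> over the square
    k = T.shape[1]

    N = np.diag(T.T @ c)                   # vertex costs on the diagonal
    Q = np.diag([1.0] + [0.0] * (k - 1))

    rng = np.random.default_rng(1)
    Theta, _ = np.linalg.qr(rng.standard_normal((k, k)))
    H0 = Theta.T @ Q @ Theta               # generic H(0) in M(Q)

    def rhs(t, h):
        H = h.reshape(k, k)
        K = H @ N - N @ H
        return (H @ K - K @ H).ravel()

    sol = solve_ivp(rhs, (0.0, 60.0), H0.ravel(), rtol=1e-10, atol=1e-10)
    d = np.diag(sol.y[:, -1].reshape(k, k))
    print(np.round(d, 4))                  # ~ (0, 0, 1, 0)
    print(T @ d)                           # ~ x* = (1, 1), the best vertex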

Please consider the references for more on the fascinating structure of (2).

Riemannian Gradient Flow |5 Nov. 2019|
tags: math.DG, math.DS

Previously we looked at dot{x}=f(x), f(x)=big(A-(x^{top}Ax)I_n big)x - and its resulting flow - as the result of mapping dot{x}=Ax to the sphere. However, we saw that for Ain mathcal{S}^n_{++} this flow converges to pm v_1, with Av_1=lambda_{mathrm{max}}(A)v_1, such that for g(x)=|Ax|_2 we have lim_{tto infty}gbig(x(t,x_0)big)=|A|_2 for all x_0in mathbf{S}^{n-1}setminus mathrm{span}{v_2,dots,v_n}. Hence, it is interesting to look at the flow from an optimization point of view: |A|_2=sup_{xin mathbf{S}^{n-1}}|Ax|_2.

A fair question would be: is our flow not simply Riemannian gradient ascent for this problem? As is common with these kinds of problems, such a hypothesis is something you just feel.

Now, using the tools from (CU1994, p.311) we can compute the gradient of g(x) (on the sphere) via mathrm{grad}_{mathbf{S}^2}(g|_{mathbf{S}^2})=mathrm{grad}_{mathbf{R}^3}(g)-mathrm{d}g(xi)xi, where xi is a vector field normal to mathbf{S}^2, e.g., xi=(x,y,z). From there we obtain

   mathrm{grad}_{mathbf{S}^2}(g|_{mathbf{S}^2}) = left[begin{array}{ccc} (1-x^2) & -xy & -xz \\ -xy & (1-y^2) & -yz \\ -xz & -yz & (1-z^2) end{array}right] left[begin{array}{c} partial_x g \\ partial_y g \\ partial_z g end{array}right] = G(x,y,z)nabla g.

To make our life a bit easier, we use the map h:=g^2 instead of g. Moreover, set s=(x,y,z)in mathbf{R}^3. Then it follows that

   mathrm{grad}_{mathbf{S}^2}(h|_{mathbf{S}^2}) = left(I_3-mathrm{diag}(s)left[begin{array}{c} s^{top} \\ s^{top} \\ s^{top} end{array}right]right)2A^{top}As = 2left(A^{top}A-(s^{top}A^{top}As)I_3right)s.

Of course, to make the computation cleaner, we changed g to h, but the relation between mathrm{grad}_{mathbf{S}^2}(h|_{mathbf{S}^2}) and f is beautiful. Somehow, mapping trajectories of dot{x}=Ax, for some Ain mathcal{S}^n_{++}, to the sphere corresponds to (Riemannian) gradient ascent applied to the problem sup_{xin mathbf{S}^{2}}|Ax|_2.
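
This identity is quickly confirmed numerically; a sketch with a random positive definite A (a placeholder):

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((3, 3)); A = A @ A.T + 3.0 * np.eye(3)  # in S^3_++
    s = rng.standard_normal(3); s /= np.linalg.norm(s)              # on S^2

    lhs = (np.eye(3) - np.outer(s, s)) @ (2.0 * A.T @ A @ s)  # G(s) nabla h
    rhs = 2.0 * (A.T @ A @ s - (s @ A.T @ A @ s) * s)         # the flow field
    print(np.linalg.norm(lhs - rhs))                          # ~ 0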

[Figure: sphere example 1]

To do an example, let Ain mathcal{S}^3_{++} be given by

   A = left[begin{array}{ccc} 10 & 2 & 0 \\ 2 & 10 & 2 \\ 0 & 2 & 1 end{array}right].

We can compare dot{x}=f(x) with dot{s}=mathrm{grad}_{mathbf{S}^2}(h|_{mathbf{S}^2}); see the figure above. We see that the gradient flow under the quadratic function takes a clearly ‘‘shorter’’ path.

Now we formalize the previous analysis a bit and show how fast we converge. Assume that the eigenvectors are ordered such that eigenvector q_1 corresponds to the largest eigenvalue of Ain mathcal{S}^n_{++}. Then the solution to dot{x}=big(A-(x^{top}Ax)I_nbig)x is given by

     x(t) = frac{exp({At})x_0}{|exp({At})x_0|_2}.
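
A short sanity check that this closed form solves the flow (scipy; using the example matrix A from above and an arbitrary initial condition):

    import numpy as np
    from scipy.integrate import solve_ivp
    from scipy.linalg import expm

    A = np.array([[10.0, 2.0, 0.0],
                  [2.0, 10.0, 2.0],
                  [0.0, 2.0, 1.0]])
    x0 = np.ones(3) / np.sqrt(3.0)

    rhs = lambda t, x: (A - (x @ A @ x) * np.eye(3)) @ x
    sol = solve_ivp(rhs, (0.0, 2.0), x0, rtol=1e-10, atol=1e-10)

    y = expm(2.0 * A) @ x0                 # closed-form solution at t = 2
    y /= np.linalg.norm(y)
    print(np.linalg.norm(sol.y[:, -1] - y))    # ~ 0: the two solutions match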

Let x_0=sum^{n}_{i=1}c_iq_i be x_0 expressed in eigenvector coordinates, with all q_iin mathbf{S}^{n-1} (normalized) and c_1neq 0. Moreover, assume all eigenvalues are distinct. Then, to measure if x(t) is near q_1, we compute 1-|langle x(t),q_1 rangle|, which is 0 if and only if q_1 is parallel to x(t). To simplify the analysis a bit, we look at 1-langle x(t),q_1 rangle^2=varepsilon, for some perturbation varepsilonin mathbf{R}_{geq 0}; this yields

     frac{c_1^2 exp({2lambda_1 t})}{sum^{n}_{i=1}c_i^2 exp({2lambda_i t})} = 1-varepsilon.

Next, take the (natural) logarithm on both sides:

     2log (|c_1|)+2lambda_1 t - log left( sum^{n}_{i=1}expleft{2lambda_it + 2log(|c_i|)right}right) = log(1-varepsilon).

These log-sum-exp terms are hard to deal with, but we can apply the so-called ‘‘log-sum-exp trick’’:

     log sum_{iin I}exp (x_i) = x^{star}+log sum_{iin I}exp(x_i-x^{star}), quad forall x^{star}in mathbf{R}.
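
In code this trick is exactly what scipy.special.logsumexp implements; a tiny illustration (the numbers are arbitrary):

    import numpy as np
    from scipy.special import logsumexp

    x = np.array([1000.0, 999.0, 998.0])
    naive = np.log(np.sum(np.exp(x)))      # overflows: prints inf
    shifted = x.max() + np.log(np.sum(np.exp(x - x.max())))
    print(naive, shifted, logsumexp(x))    # inf, ~1000.41, ~1000.41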

In our case, we set x^{star}:= 2lambda_1 t + 2log(|c_1|) and obtain

     -log left(sum^{n}_{i=2}exp left{2 left[(lambda_i-lambda_1)t + logleft(frac{|c_i|}{|c_1|} right) right] right}+1 right) = log(1-varepsilon).

We clearly observe that for tto infty the LHS approaches 0 from below, which means that varepsilonto 0 from above, as intended. Of course, we also observe that the method is not completely general: we already assumed distinct eigenvalues, but there is more. We also do not converge when x_0 in mathrm{span}{q_2,dots,q_n}, which is, however, a set of measure 0 on the sphere mathbf{S}^{n-1}.

More interestingly, we see that the convergence rate is largely dictated by the ‘‘spectral gap/eigengap’’ lambda_1-lambda_2. Specifically, to achieve a particular projection error varepsilon, such that 1gg varepsilon>0, we need

     t approx frac{1}{2(lambda_1-lambda_2)}log left( frac{1-varepsilon}{varepsilon}right).
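
Plugging this estimate back into the exact expression confirms it; a small sketch (the spectrum and coefficients are placeholders, with |c_2|=|c_1| so that the neglected log(|c_i|/|c_1|) terms are harmless):

    import numpy as np

    lam = np.array([3.0, 2.0, 1.0])        # lambda_1 > lambda_2 > lambda_3
    c = np.array([0.5, 0.5, 0.5])          # x_0 in eigenbasis coordinates
    eps = 1e-3

    t = np.log((1 - eps) / eps) / (2 * (lam[0] - lam[1]))  # predicted time
    w = c**2 * np.exp(2 * (lam - lam[0]) * t)  # shifted exponentials (stable)
    print(1 - w[0] / w.sum(), eps)             # both ~ 1e-3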

Comparing this to the resulting flow from dot{s}=2left(A^{top}A-(s^{top}A^{top}As)I_3right)s, s(0)in mathbf{S}^2, we see that we have the same flow, but with Amapsto 2A^2. This is interesting, since A and 2A^2 have the same eigenvectors, yet a different (scaled) spectrum. With respect to the convergence rate, we have to compare (lambda_1-lambda_2) and (lambda_1^2-lambda_2^2) for any lambda_1,lambda_2in mathbf{R}_{> 0} with lambda_1>lambda_2 (the additional 2 is not so interesting).

It is obvious what will happen; since lambda_1^2-lambda_2^2=(lambda_1+lambda_2)(lambda_1-lambda_2), the crux is whether the eigenvalues are larger or smaller than 1. Can we immediately extend this to a Newton-type algorithm? Well, this fails (globally) since we work in mathbf{R}^3 instead of purely with mathbf{S}^2. To be concrete, mathrm{det}(G(x,y,z))|_{mathbf{S}^2}=0; we never have 3 degrees of freedom.

Of course, these observations turned out to be far from new, see for example (AMS2008, sec. 4.6).

(AMS2008) P.-A. Absil, R. Mahony and R. Sepulchre: ‘‘Optimization Algorithms on Matrix Manifolds’’, Princeton University Press, 2008.
(CU1994) Constantin Udriste: ‘‘Convex Functions and Optimization Methods on Riemannian Manifolds’’, Kluwer Academic Publishers, 1994.