Riemannian Gradient Flow |5 Nov. 2019|

Previously we looked at dot{x}=f(x), f(x)=big(A-(x^{top}Ax)I_n big)x - and its resulting flow - as the result from mapping dot{x}=Ax to the sphere. However, we saw that for Ain mathcal{S}^n_{++} this flow convergences to pm v_1, with Av_1=lambda_{mathrm{max}}(A)v_1 such that for g(x)=|Ax|_2 we have lim_{tto infty}gbig(x(t,x_0) big)=|A|_2 forall;x_0in mathbf{S}^{n-1}setminus{v_2,dots,v_n}. Hence, it is interesting to look at the flow from an optimization point of view: |A|_2=sup_{xin mathbf{S}^{n-1}}|Ax|_2.

A fair question would be, is our flow not simply Riemannian gradient ascent for this problem? As common with these kind of problems, such a hypothesis is something you just feel.

Now, using the tools from (CU1994, p.311) we can compute the gradient of g(x) (on the sphere) via mathrm{grad}_{mathbf{S}^2}(g|_{mathbf{S}^2})=mathrm{grad}_{mathbf{R}^3}(g)-mathrm{d}g(xi)xi, where xi is a vector field normal to mathbf{S}^2, e.g., xi=(x,y,z). From there we obtain

   mathrm{grad}_{mathbf{S}^2}(g|_{mathbf{S}^2})  = left[   begin{array}{lll} (1-x^2) & -xy  & -xz -xy & (1-y^2) & -yz -xz & -yz & (1-z^2) end{array}right] left[   begin{array}{l}partial_x g partial_y g partial_z g end{array}right]=G(x,y,z)nabla g.

To make our life a bit easier, use instead of g the map h:=g^2. Moreover, set s=(x,y,z)in mathbf{R}^3. Then it follows that

   begin{array}{ll}   mathrm{grad}_{mathbf{S}^2}(h|_{mathbf{S}^2})  		&= left(I_3-mathrm{diag}(s)left[begin{array}{l} 													s^{top}s^{top}s^{top}  													end{array}right]right) 2A^{top}As   						&= 2left(A^{top}A-(s^{top}A^{top}Asright)I_3)s.  end{array}

Of course, to make the computation cleaner, we changed g to h, but the relation between  mathrm{grad}_{mathbf{S}^2}(h|_{mathbf{S}^2}) and f is beautiful. Somehow, mapping trajectories of dot{x}=Ax, for some Ain mathcal{S}^n_{++}, to the sphere corresponds to (Riemannian) gradient ascent applied to the problem sup_{xin S^{2}}|Ax|_2.

Example sphere 1 

To do an example, let Ain mathcal{S}^3_{++} be given by

   A = left[   begin{array}{lll}   10 & 2 & 0    2 & 10 & 2    0 & 2 & 1   end{array}right].

We can compare dot{x}=f(x) with dot{y}=mathrm{grad}_{mathbf{S}^2}(h|_{mathbf{S}^2}). We see that the gradient flow under the quadratic function takes a clearly ‘‘shorter’’ path.

Now, we formalize the previous analysis a bit a show how fast we convergence. Assume that the eigenvectors are ordered such that eigenvector q_1 corresponds to the largest eigenvalue of Ain mathcal{S}^n_{++}. Then, the solution to dot{x}=big(A-(x^{top}Ax)I_nbig)x is given by

     x(t) = frac{exp({At})x_0}{|exp({At})x_0|_2}.

Let x_0=sum^{n}_{i=1}c_iq_i be x_0 expressed in eigenvector coordinates, with all q_iin mathbf{S}^{n-1} (normalized). Moreover, assume all eigenvalues are distinct. Then, to measure if x(t) is near q_1, we compute 1-|langle x(t),q_1 rangle|, which is 0 if and only if q_1 is parallel to x(t). To simplify the analysis a bit, we look at 1-langle x(t),q_1 rangle ^2=varepsilon, for some perturbation varepsilonin mathbf{R}_{geq 0}, this yields

     frac{c_1^2 exp({2lambda_1 t})}{sum^{n}_{i=1}c_i^2 exp({2lambda_i t})} = 1-varepsilon.

Next, take the the (natural) logarithm on both sides:

     2log (|c_1|)+2lambda_1 t - log left( sum^{n}_{i=1}expleft{2lambda_it + 2log(|c_i|)right}right) = log(1-varepsilon).

This log-sum-exp terms are hard to deal with, but we can apply the so-called ‘‘log-sum-exp trick’’:

     log sum_{iin I}exp (x_i) = x^{star}+log sum_{iin I}exp(x_i-x^{star}), quad forall x^{star}in mathbf{R}^n.

In our case, we set x^{star}:= 2lambda_1 t + 2log(|c_1|) and obtain

     -log left(sum^{n}_{i=1}exp left{2 left[(lambda_i-lambda_1)t + logleft(frac{|c_i|}{|c_1|} right) right] right}+1 right) = log(1-varepsilon).

We clearly observe that for tto infty the LHS approaches 0 from below, which means that varepsilonto 0 from above, like intended. Of course, we also observe that the mentioned method is not completely general, we already assume distinct eigenvalues, but there is more. We do also not convergence when x_0 in mathrm{span}{q_2,dots,q_n}, which is however a set of measure 0 on the sphere mathbf{S}^{n-1}.

More interestingly, we see that the convergence rate is largely dictated by the ‘‘spectral gap/eigengap’’ lambda_2-lambda_1. Specifically, to have a particular projection error varepsilon, such that 1gg varepsilon>0, we need

     t approx log left( frac{varepsilon}{1-varepsilon}right)frac{1}{lambda_2-lambda_1}.

Comparing this to the resulting flow from dot{s}=2left(A^{top}A-(s^{top}A^{top}Asright)I_3)s, s(0)in mathbf{S}^2, we see that we have the same flow, but with Amapsto  2 A^2. This is interesting, since A and 2A^2 have the same eigenvectors, yet a different (scaled) spectrum. With respect to the convergence rate, we have to compare (lambda_2-lambda_1) and (lambda_2^2-lambda_1^2) for any lambda_1,lambda_2in mathbf{R}_{> 0} with lambda_1>lambda_2 (the additional 2 is not so interesting).

It is obvious what we will happen, the crux is, is lambda larger or smaller than 1? Can we immediately extend this to a Newton-type algorithm? Well, this fails (globally) since we work in mathbf{R}^3 instead of purely with mathbf{S}^2. To be concrete, mathrm{det}(G(x,y,z))|_{mathbf{S}^2}=0, we never have 3 degrees of freedom.

Of course, these observations turned out to be far from new, see for example (AMS2008, sec. 4.6).

(AMS2008) P.A. Absil, R. Mahony and R. Sepulchre: ‘‘Optimization Algorithms on Matrix Manifolds’’, 2008 Princeton University Press.
(CU1994) Constantin Udriste: ‘‘Convex Functions and Optimization Methods on Riemannian Manifolds’’, 1994 Kluwer Academic Publishers.