Posts (9) containing the 'math.OC’ (Optimization and Control) tag:

Weekend read |May 14, 2023|
tags: math.OC

The ‘‘Control for Societal-scale Challenges: Road Map 2030’’ by the IEEE Control Systems Society (Eds. A. M. Annaswamy, K. H. Johansson, and G. J. Pappas) was published a week ago. This roadmap is a remarkable piece of work, over 250 pages, an outstanding list of authors and coverage of almost anything you can imagine, and more.

If you somehow found your way to this website, then I can only strongly recommend reading this document. Despite many of us being grounded in traditional engineering disciplines, I do agree with the sentiment of this roadmap that the most exciting (future) work is interdisciplinary, this is substantiated by many examples from biology. Better yet, it is stressed that besides familiarizing yourself with the foundations, it is of quintessential importance (and fun I would say) to properly dive into the field where you hope to apply these tools.

‘‘Just because you can formulate and solve an optimization problem does not mean that you have the correct or best cost function.’’ p. 32

Section 4.A on Learning and Data-Driven Control also contains many nice pointers, sometimes alluding to a slight disconnect between practice and theory.

‘‘Sample complexity may only be a coarse measure in this regard, since it is not just the number of samples but the “quality” of the samples that matters.’’ p. 142

The section on safety is also inspiring, just stability will not be enough anymore. However, the most exciting part for me was Chapter 6 on education. The simple goal of making students excited early-on is just great. Also, the aspiration to design the best possible material as a community is more than admirable.

Standing on the shoulders of giants |April 3, 2023|
tags: math.OC

One of the most illuminating activities one can undertake is to go back in time and see how the giants in (y)our field shaped the present. Sadly, several of our giants passed away recently and I want to highlight one of them in particular: Roger W. Brockett.

It is very rare to find elegant and powerful theories in sytems and control that are not related to Brockett somehow, better yet, many powerful theories emerged directly from his work. Let me highlight five topics/directions I find remarkably beautiful:

  • (Lie algebras): As put by Brockett himself, it is only natural to study the link between Lie theory and control theory since the two are intimately connected through differential equations |1.2|. ‘‘Completing this triangle’’ turned out to be rather fruitful, in particular via Frobenius’ theorem. Brockett played a key role in bringing Lie theoretic methods to control theory, a nice example is his 1973 paper on control systems on spheres (in general, his work on bilinear systems was of great interest) |5|.

  • (Differential geometric methods): Together with several others, Brockett was one of the first to see how differential geometric methods allowed for elegantly extending ideas from linear control (linear algebra) to the realm of nonlinear control. A key event is the 1972 conference co-organized with David Mayne |1|. See also |2| for a short note on the history of geometric nonlinear control written by Brockett himself.

  • (Brockett's condition): After pioneering work on (local) nonlinear controllability in the 70s it was observed (by means of low-dimensional counterexamples) that controllability is not sufficient for the existence of a stabilizing continuous feedback. This observation was firmly established in the (1982) 1983 paper by Brockett |3| where he provides his topological necessary condition (Theorem 1 (iii)) for the existence of a stabilizing differentiable feedback, i.e. (x,u)mapsto f(x,u) must be locally onto (the same is true for continuous feedback). Formally speaking, this paper is not the first (see Geometrical Methods of Nonlinear Analysis and this paper by Zabczyk), yet, this paper revolutionized how to study necessary conditions for nonlinear stabilization and inspired an incredible amount of work.

  • (Nonlinear control systems): Although we are still far from the definitive control system (modelling) framework, Brockett largely contributed to a better understanding of structure. Evidently, this neatly intertwines with the previous points on bilinear systems (Lie algebras) and differential geometry, however, let me also mention that the fiber bundle perspective (going beyond Cartesian products) is often attributed to Brockett |4|.

  • (Dynamical systems perspective on optimization): We see nowadays still more and more work on the continuous-time (and system theoretic) viewpoint with respect to optimization algorithms. One can argue that Brockett was also of great importance here. It is not particularly surprising that one can study gradient flows to better understand gradient descent algorithms, however, it turned out that one can understand a fair amount of routines from (numerical) linear algebra through the lens of (continuous-time) dynamical systems. Brockett initiated a significant part of this work with his 1988 paper on the applications of the double bracket equation dot{H}=[H,[H,N]] |6|. For a more complete overview, including a foreword by Brockett, see Optimization and Dynamical Systems.

References (all by Brockett).
|1|: Geometric Methods in System Theory - Proceedings of the NATO Advanced Study Institute held at London, England, August 27-Septernber 7, 1973, ed. with Mayne, D. Reidel Publishing Company (1973).
|1.2|: Chapter: Lie Algebras and Lie groups in Control Theory in |1|.
|2|: The early days of geometric nonlinear control, Automatica (2014).
|3|: Asymptotic stability and feedback stabilization, Differential Geometric Control Theory, ed. with Millman and Sussmann, Birkhäuser (1983).
|4|: Control theory and analytical mechanics, Geometric Control Theory, Lie Groups: History, Frontiers and Applications, Vol. VII, ed. Martin and Hermann, Math Sci Press, (1976).
|5|: Lie Theory and Control Systems Defined on Spheres, SIAM Journal on Applied Mathematics (1973).
|6|: Dynamical systems that sort lists and solve linear programming problems, IEEE CDC (1988).

See also this 2022 interview (video) with John Baillieul and the foreword to this book for more on the person behind the researcher.

Co-observability |6 January 2022|
tags: math.OC

A while ago Prof. Jan H. van Schuppen published his book Control and System Theory of Discrete-Time Stochastic Systems. In this post I would like to highlight one particular concept from the book: (stochastic) co-observability, which is otherwise rarely discussed.

We start with recalling observability. Given a linear-time-invariant system with xin mathbf{R}^n, yin mathbf{R}^p:

 Sigma :left{ begin{array}{ll} x(t+1) &= Ax(t),quad x(0)=x_0  y(t) &= Cx(t)   end{array}right.

one might wonder if the initial state x_0 can be recovered from a sequence of outputs y(0),dots,y(k). (This is of great use in feedback problems.) By observing that y(0)=Cx(0), y(1)=CAx(0), y(2)=CA^2x(0) one is drawn to the observability matrix

 mathcal{O}= left(  begin{array}{l} C  CA  vdots CA^{n-1} end{array} right)

Without going into detectability, when mathcal{O} is full-rank, one can (uniquely) recover x_0 (proof by contradiction). If this is the case, we say that Sigma, or equivalenty the pair (A,C), is observable. Note that using a larger matrix (more data) is redudant by the Cayley-Hamilton theorem (If mathcal{O} would not be full-rank, but by adding CA^n “from below” it would somehow be full-rank, one would contradict the Cayley-Hamilton theorem.). Also note that in practice one does not “invert” mathcal{O} but rather uses a (Luenberger) observer (or a Kalman filter).

Now lets continue with a stochastic (Gaussian) system, can we do something similar? Here it will be even more important to only think about observability matrices as merely tools to assert observability. Let v:Omegatomathbf{R}^v be a zero-mean Gaussian random variable with covariance Q_v and define for some matrices M and N the stochastic system

 Sigma_G :left{ begin{array}{ll} x(t+1) &= Ax(t)+Mv(t),quad x(0)=x_0  y(t) &= Cx(t)+Nv(t)   end{array}right.

We will assume that A is asymptotically (exponentially stable) such that the Lyapunov equation describing the (invariant) state-covariance is defined:

 Q_x = AQ_xA^{mathsf{T}}+MQ_vM^{mathsf{T}}.

Now the support of the state x is the range of Q_x.

A convenient tool to analyze Sigma_G will be the characteristic function of a Gaussian random variable X, defined as varphi(omega)=mathbf{E}[mathrm{exp}(ilangle omega, Xrangle)]. It can be shown that for a Gaussian random variable sim G(mu,Q) varphi(omega)=mathrm{exp}(ilangle omega,murangle - frac{1}{2}langle omega, Qomega rangle). With this notation fixed, we say that Sigma_{G} is stochastically observable on the internal t,dots,t+k if the map

 x(t)mapsto mathbf{E}[mathrm{exp}(ilangle omega,bar{y}rangle|mathcal{F}^{x(t)}],quad bar{y}=(y(t),dots,y(t+k))in mathbf{R}^{pcdot(k+1)}quad forall omega

is injective on the support of x (note the forall omega). The intuition is the same as before, but now we want the state to give rise to an unique (conditional) distribution. At this point is seems rather complicated, but as it turns out, the conditions will be similar to ones from before. We start by writing down explicit expressions for bar{y}, as

 y(t+s) = CA^s x(t) + sum^{t+s-1-i}_{i=0}CA^i M v(t+s-1-i) + N v(t+s)

we find that

 bar{y} = mathcal{O}_k x(t) + mathcal{M}_k bar{v},

for mathcal{O}_k the observability matrix corresponding to the data (length) of bar{y}, mathcal{M}_k a matrix containing all the noise related terms and bar{v} a stacked vector of noises similar to bar{y}. It follows that x(t)mapsto mathbf{E}[mathrm{exp}(ilangle omega,bar{y})|mathcal{F}^{x(t)}] is given by mathrm{exp}(ilangle omega, mathcal{O}_k x(t)rangle - frac{1}{2}langle omega, mathcal{Q}_komega rangle), for mathcal{Q}_k = mathbf{E}[mathcal{M}_kbar{v}bar{v}^{mathsf{T}}mathcal{M}_k^{mathsf{T}}]. Injectivity of this map clearly relates directly to injectivity of x(t)mapsto mathcal{O}_k x(t). As such (taking the support into account), a neat characterization of stochastic observability is that mathrm{ker}(mathcal{O}_kQx)=mathrm{ker}(Q_x).

Then, to introduce the notion of stochastic co-observability we need to introduce the backward representation of a system. We term system representations like Sigma and Sigma_G “forward” representations as tto +infty. Assume that Q_xsucc 0, then see that the forward representation of a system matrix, denoted A^f is given by A^f = mathbf{E}[x(t+1)x(t)^{mathsf{T}}]Q_x^{-1}. In a similar vein, the backward representation is given by A^b=mathbf{E}[x(t-1)x(t)^{mathsf{T}}]Q_x^{-1}. Doing the same for the output matrix C yields C^b=mathbf{E}[y(t-1)x(t)^{mathsf{T}}]Q_x^{-1} and thereby the complete backward system

 Sigma^b_G :left{ begin{array}{ll} x(t-1) &= A^bx(t)+Mv(t),quad x(0)=x_0  y(t-1) &= C^bx(t)+Nv(t)   end{array}right.

Note, to keep M and N fixed, we adjust the distribution of v. Indeed, when Q_x is not full-rank, the translation between forward and backward representations is not well-defined. Initial conditions x_0in mathrm{ker}(Q_x) cannot be recovered.

To introduce co-observability, ignore the noise for the moment and observe that y(t-1)=C^bx(t), y(t-2)=C^bA^bx(t), and so forth. We see that when looking at observability using the backward representation, we can ask if it possible to recover the current state using past outputs. Standard observability looks at past states instead. With this in mind we can define stochastic co-observability on the interval t-k-1,dots,t-1 be demanding that the map

 x(t)mapsto mathbf{E}[mathrm{exp}(ilangle omega,bar{y}^brangle|mathcal{F}^{x(t)}],quad bar{y}^b=(y(t-1),dots,y(t-k-1))in mathbf{R}^{pcdot(k+1)}quad forall omega

is injective on the support of x (note the forall omega). Of course, one needs to make sure that y(t-k-1) is defined. It is no surprise that the conditions for stochastic co-observability will also be similar, but now using the co-observability matrix. What is however remarkable, is that these notions do not always coincide.

Lets look at when this can happen and what this implies. One reason to study these kind of questions is to say something about (minimal) realizations of stochastic processes. Simply put, what is the smallest (as measured by the dimension of the state x) system Sigma_G that gives rise to a certain output process. When observability and co-observability do not agree, this indicates that the representation is not minimal. To get some intuition, we can do an example as adapted from the book. Consider the scalar (forward) Gaussian system

 Sigma^f_G :left{ begin{array}{ll} x(t+1) &= ax(t)+mv(t),quad x(0)=x_0,quad ain (-1,1) setminus{0},quad m=(a^2-1)/a  y(t) &= x(t)+v(t)   end{array}right.

for vsim G(0,1). The system is stochastically observable as cneq 0 and q_x=(1-a)^2/a^2neq 0. Now for stochastic co-observability we see that c^b=mathbf{E}[y(t-1)x(t)]q_x^{-1}]=0, as such the system is not co-observable. What this shows is that y(t) behaves as a Gaussian random variable, no internal dynamics are at play and such a minimal realization is of dimension n=0.

For this and a lot more, have a look at the book!

Fair convex partitioning |26 September 2021|
tags: math.OC

When learning about convex sets, the definitions seem so clean that perhaps you think all is known what could be known about finite-dimensional convex geometry. In this short note we will look at a problem which is still largely open beyond the planar case. This problem is called the fair partitioning problem.

In the 2-dimensional case, the question is the following, given any integer n, can a convex set mathcal{K}subset mathbf{R}^2 be divided into n convex sets all of equal area and perimeter. Differently put, does there exist a fair convex partitioning, see Figure 1.


Figure 1: A partition of mathcal{K} into n=6 fair convex sets.

This problem was affirmatively solved in 2018, see this paper.

As you can see, this work was updated just a few months ago. The general proof is involved, lets see if we can do the case for a compact set and n=2.

First of all, when splitting mathcal{K} (into just two sets) you might think of many different methods to do so. What happens when the line is curved? The answer is that when the line is curved, one of the resulting sets must be non-convex, compare the options in Figure 2.


Figure 2: A partitioning of the square in mathbf{R}^2 into two convex sets can only be done via a straight cut.

This observation is particularly useful as it implies we only need to look at two points on the boundary of mathcal{K} (and the line between them). As mathcal{K} is compact we can always select a cut such that the resulting perimeters of mathcal{K}_1 and mathcal{K}_2 are equal.

Let us assume that the points a and b in Figure 3.(i) are like that. If we start moving them around with equal speed, the resulting perimeters remain fixed. Better yet, as the cut is a straight-line, the volumes (area) of the resulting set mathcal{K}_1 and mathcal{K}_2 change continuously. Now the result follows from the Intermediate Value Theorem and seeing that we can flip the meaning of mathcal{K}_1 and mathcal{K}_2, see Figure 3.(iii).


Figure 3: By moving the points a and b we continuously change mathcal{K}_1 and mathcal{K}_2.

Solving Linear Programs via Isospectral flows |05 September 2021|
tags: math.OC, math.DS, math.DG

In this post we will look at one of the many remarkable findings by Roger W. Brockett. Consider a Linear Program (LP)

 mathrm{maximize}_{xin X}quad langle c,x ranglequad (1)

parametrized by the compact set X={xin mathbf{R}^n:Axleq b} and a suitable triple (A,b,c). As a solution to (1) can always be found to be a vertex of X, a smooth method to solve (1) seems somewhat awkward. We will see that one can construct a so-called isospectral flow that does the job. Here we will follow Dynamical systems that sort lists, diagonalize matrices and solve linear programming problems, by Roger. W. Brockett (CDC 1988) and the book Optimization and Dynamical Systems edited by Uwe Helmke and John B. Moore (Springer 2ed. 1996). Let X have k vertices, then one can always find a map T:mathbf{R}^kto mathbf{R}^n, mapping the simplex S={xin mathbf{R}_{geq 0}^k:sum_{j}x_j=1} onto X. Indeed, with some abuse of notation, let T be a matrix defined as T=(v_1,dots,v_k), for {v_j}_{j=1}^k, the vertices of X.

Before we continue, we need to establish some differential geometric results. Given the Special Orthogonal group mathsf{SO}(n)={Qin mathbf{R}^{ntimes n}:Q^{mathsf{T}}Q=I_n}cap mathsf{GL}^+(n,mathbf{R}), the tangent space is given by T_Q mathsf{SO}(n)={QOmega : Omegain mathrm{skew}(n,mathbf{R})}. Note, this is the explicit formulation, which is indeed equivalent to shifting the corresponding Lie Algebra. The easiest way to compute this is to look at the kernel of the map defining the underlying manifold.

Now, following Brockett, consider the function f:mathsf{SO}(n)to mathbf{R} defined by f:Thetamapsto mathrm{Tr}(QTheta NTheta^{mathsf{T}}) for some Q,Nin mathsf{Sym}(n). This approach is not needed for the full construction, but it allows for more intuition and more computations. To construct the corresponding gradient flow, recall that the (Riemannian) gradient at Thetain mathsf{SO}(n) is defined via df(Theta)[V]=langle mathrm{grad},f(Theta), Vrangle_{Theta} for all Vin T_{Theta} mathsf{SO}(n). Using the explicit tangent space representation, we know that V=Theta Omega with Omega = -Omega^{mathsf{T}}. Then, see that by using

 f(Theta+tV) = mathrm{Tr}(QTheta(I_n+tOmega)N(I_n-tOmega)Theta^{mathsf{T}})

we obtain the gradient via

 df(Theta)[V]=lim_{tdownarrow 0}frac{f(Theta+tV)-f(Theta)}{t} = langle QTheta N, Theta Omega rangle - langle Theta N Theta^{mathsf{T}} Q Theta, Theta Omega rangle.

This means that (the minus is missing in the paper) the (a) gradient flow becomes

 dot{Theta} = mathrm{grad},f(Theta) = QTheta N-Theta NTheta^{mathsf{T}}QTheta, quad Theta(0)in mathsf{SO}(n).

Consider the standard commutator bracket [A,B]=AB-BA and see that for H(t)=Theta(t)^{mathsf{T}}QTheta(t) one obtains from the equation above (typo in the paper)

 dot{H}(t) = [H(t),[H(t),N]],quad H(0)=Theta^{mathsf{T}}QThetaquad (2).

Hence, (2) can be seen as a reparametrization of a gradient flow. It turns out that (2) has a variety of remarkable properties. First of all, see that H(t) preserves the eigenvalues of Q. Also, observe the relation between extremizing f and the function g defined via g:H=Theta^{mathsf{T}}QTheta mapsto -frac{1}{2}|N-H|_F^2. The idea to handle LPs is now that the limiting H(t) will relate to putting weight one the correct vertex to get the optimizer, N gives you this weight as it will contain the corresponding largest costs.

In fact, the matrix H can be seen as an element of the set mathsf{M}(Q)={Theta^{mathsf{T}}QTheta:Thetain mathsf{O}(n)}. This set is in fact a C^{infty}-smooth compact manifold as it can be written as the orbit space corresponding to the group action sigma:mathsf{O}(n)times mathbf{R}^{ntimes n}to mathsf{R}^{ntimes}, sigma:(Theta,Q)mapsto Theta^{mathsf{T}}QTheta, one can check that this map satisfies the group properties. Hence, to extremize g over M(Q), it appears to be appealing to look at Riemannian optimization tools indeed. When doing so, it is convenient to understand the tangent space of mathsf{M}(Q). Consider the map defining the manifold h:mathsf{O}(n)to mathsf{M}(Q), h:Theta mapsto Theta^{mathsf{T}}QTheta. Then by the construction of T_Theta mathsf{O}(n), see that dh(Theta)[V]=0 yields the relation [H,Omega]=0 for any Omegain mathrm{skew}(n,mathbf{R}).

For the moment, let Q=mathrm{diag}(lambda_{1}I_{n_1},dots,lambda_{r}I_{n_r})in mathsf{Sym}(n) such that lambda_{1}>cdots>lambda_{r} and sum_{n_i}n_i=n. First we consider the convergence of (2). Let N have only distinct eigenvalues, then H_{infty}:=lim_{tto infty}H(t) exists and is diagonal. Using the objective f from before, consider f(H)=mathrm{Tr}(HN) and see that by using the skew-symmetry one recovers the following

 begin{array}{lll} frac{d}{dt}mathrm{Tr}(H(t)N) &=& mathrm{Tr}(N [H,[H,N]]) &=& -mathrm{Tr}((HN-NH)^2) &=& |HN-NH|_F^2. end{array}

This means the cost monotonically increases, but by compactness converges to some point H_{infty}. By construction, this point must satisfy [H_{infty},N]=0. As N has distinct eigenvalues, this can only be true if H_{infty} itself is diagonal (due to the distinct eigenvalues).

More can be said about H_{infty}, let (lambda_1,dots,lambda_n) be the eigenvalues of H(0), that is, they correspond to the eigenvalues of Q as defined above. Then as H(t) preserves the eigenvalues of Q (H(0)), we must have H_{infty}=pi Q pi^{mathsf{T}}, for pi a permutation matrix. This is also tells us there is just a finite number of equilibrium points (finite number of permutations). We will write this sometimes as H_{infty}=mathrm{diag}(lambda_{pi(1)},dots,lambda_{pi(n)}).

Now as Q is one of those points, when does H(t) converge to Q? To start this investigation, we look at the linearization of (2), which at an equilibrium point H_{infty} becomes

 dot{xi}_{ij} = -(lambda_{pi(i)}-lambda_{pi(j)})(mu_i-mu_j)xi_{ij}

for xiin T_{H_{infty}}mathsf{M}(Q). As we work with matrix-valued vector fields, this might seems like a duanting computation. However, at equilibrium points one does not need a connection and can again use the directional derivative approach, in combination with the construction of T_{H_{infty}}mathsf{M}(Q), to figure out the linearization. The beauty is that from there one can see that Q is the only asymptotically stable equilibrium point of (2). Differently put, almost all initial conditions H(0)in mathsf{M}(Q) will converge to Q with the rate captured by spectral gaps in Q and N. If Q does not have distinct eigenvalues and we do not impose any eigenvalue ordering on N, one sees that an asymptotically stable equilibrium point H_{infty} must have the same eigenvalue ordering as N. This is the sorting property of the isospectral flow and this is of use for the next and final statement.

Theorem: Consider the LP (1) with langle c, v_i-v_jrangle neq 0 for all i,jin [k], then, there exist diagonal matrices Q and N such that (2) converges for almost any H(0)in mathsf{M}(Q) to H_{infty}=mathrm{diag}(d) with the optimizer of (1) being x^{star}=Td.

Proof: Global convergence is prohibited by the topology of mathsf{M}(Q). Let N=mathrm{diag}(T^{mathsf{T}}c) and let Q=(1,0,dots,0)in mathsf{Sym}(k). Then, the isospectral flow will converge from almost everywhere to H_{infty}=mathrm{diag}(0,dots,0,1,0,dots,0) (only (H_{infty})_{ii}=1), such that x^{star}=v_i.

Please consider the references for more on the fascinating structure of (2).

The Imaginary trick |12 March 2021|
tags: math.OC

Most of us will learn at some point in our life that sqrt{-1} is problematic, as which number multiplied by itself can ever be negative? To overcome this seemingly useless defiency one learns about complex numbers and specifically, the imaginary number i, which is defined to satisfy i^2=-1. At this point you should have asked yourself when on earth is this useful?

In this short post I hope to highlight - from a slightly different angle most people grounded in physics would expect - that the complex numbers are remarkably useful.

Complex numbers, and especially the complex exponential, show up in a variety of contexts, from signal processing to statistics and quantum mechanics. With of course most notably, the Fourier transformation.

We will however look at something completely different. It can be argued that in the late 60s James Lyness and Cleve Moler brought to life a very elegant new approach to numerical differentiation. To introduce this idea, recall that even nowadays the most well-known approach in numerical differentiation is to use some sort of finite-difference method, for example, for any fin C^1(mathbf{R};mathbf{R}) one could use the central-difference method

 partial_x f(x) = frac{f(x+h)-f(x-h)}{2h} + O(h^2)quad (1).

Now one might be tempted to make h extremely small, as then the error must vanish! However, numerically, for a very small h the two function evaluations f(x+h) and f(x-h) will be indistinguishable. So although the error scales as O(h^2) there is some practical lower bound on this error based on the machine precision of your computer. One potential application of numerical derivatives is in the context of zeroth-order (derivative-free) optimization. Say you want to adjust van der Poel his Canyon frame such that he goes even faster, you will not have access to explicit gradients, but you can evaluate the performance of a change in design, for example in a simulator. So what you usually can obtain is a set of function evaluations f(x_0),f(x_1),dots,f(x_K). Given this data, a somewhat obvious approach is to mimick first-order algorithms

 x_{k+1}=x_k - eta_k nabla f(x_k)quad (2)

where eta_k is some stepsize. For example, one could replace nabla f(x_k) in (2) by the central-difference approximation (1). Clearly, if the objective function f is well-behaved and the approximation of nabla f(x_k) is reasonably good, then something must come out? As was remarked before, if your approximation - for example due to numerical cancellation errors - will always have a bias it is not immediate how to construct a high-performance zeroth-order optimization algorithm. Only if there was a way to have a numerical approximation without finite differencing?

Let us assume that f:mathbf{R}to mathbf{R} is sufficiently smooth, let i be the imaginary number and consider the following series

 f(x+ih) = f(x) + partial_x f(x) ih - frac{1}{2}partial_x^2 f(x) h^2 - frac{1}{6}partial_x^3 f(x) ih^3 + O(h^4).

From this expression it follows that

 partial_x f(x) = frac{Imbig(f(x+ih) big)}{h}+O(h^2).

So we see that by passing to the complex domain and projecting the imaginary part back, we obtain a numerical method to construct approximations of derivatives without even the possibility of cancellation errors. This remarkable property makes it a very attractive candidate to be used in zeroth-order optimization algorithms, which is precisely what we investigated in our new pre-print. It turns out that convergence is not only robust, but also very fast!

Optimal coordinates |24 May 2020|
tags: math.LA, math.OC, math.DS

There was a bit of radio silence, but as indicated here, some interesting stuff is on its way.

In this post we highlight this 1976 paper by Mullis and Roberts on ’Synthesis of Minimum Roundoff Noise Fixed Point Digital Filters’.

Let us be given some single-input single-output (SISO) dynamical system sigma in Sigma

 sigma :left{ begin{array}{ll} x(t+1) &= Ax(t) + bu(k) y(t) &= cx(t) + du(k)  end{array}right. .

It is known that the input-output behaviour of any sigma, that is, the map from u(t) to y(t) is invariant under a similarity transform. To be explicit, the tuples (A,b,c,d) and (TAT^{-1},Tb,cT^{-1},d), which correspond to the change of coordinates z(t)=Tx(t) for some Tin mathsf{GL}_n, give rise to the same input-output behaviour. Hence, one can define the equivalence relation sigma sim sigma' by imposing that the input-output maps of sigma and sigma' are equivalent. By doing so we can consider the quotient Sigma / mathsf{GL}_n. However, in practice, given a sigma, is any sigma' such that sigma'sim sigma equally useful? For example, the following A and A' are similar, but clearly, A' is preferred from a numerical point of view:

 A = left[begin{array}{ll} 0.5 & 10^9 0 & 0.5  end{array}right],quad  A' = left[begin{array}{ll} 0.5 & 1 0 & 0.5  end{array}right].

In what follows, we highlight the approach from Mullis and Roberts and conclude by how to optimally select Tin mathsf{GL}_n. Norms will be interpreted in the ell_2 sense, that is, for {x(t)}_{tin mathbf{N}}, |x|^2=sum^{infty}_{t=0}|x(t)|_2. Also, in what follows we assume that x(0)=0 plus that A is stable, which will mean rho(A)<1 and that sigma corresponds to a minimal realization.

The first step is to quantify the error. Assume we have allocated m_i bits at our disposal to present the i^{mathrm{th}} component of x(t). These values for m_i can differ, but we constrain the average by sum^n_{i=1}m_i = nm for some m. Let mull 1 be a 'step-size’, such that our dynamic range of x_i(t) is bounded by pm mu 2^{m_i-1} (these are our possible representations). Next we use the modelling choices from Mullis and Roberts, of course, they are debatable, but still, we will infer some nice insights.

First, to bound the dynamic range, consider solely the effect of an input, that is, define f_i(t) by x(t)=sum^{infty}_{k=1}A^{k-1}b u(t-k), x_i(t)=sum^{infty}_{k=1}f_i(t)u(t-k). Then we will impose the bound pm delta |f_i| on x_i(t). In combination with the step size (quantization), this results in delta^2 |f_i|^2 = (mu 2^{m_i-1})^2. Let u be a sequence of i.i.d. sampled random variables from mathcal{N}(0,1). Then we see that mathrm{var}big(x_i(t)big)= |f_i|^2. Hence, one can think of delta as scaling parameter related to the probability with which this dynamic range bound is true.

Next, assume that all the round-off errors are independent and have a variance of mu^2. Hence, the worst-case variance of computating x_i(t+1) is mu^2(n+1). To evaluate the effect of this error on the output, assume for the moment that u(t) is identically 0. Then, y(t) = cA^t x(0). Similar to f_i(t), define g_i(t) as the i^{mathrm{th}} component of cA^t. As before we can compute the variance, this time of the full output signal, which yields sigma_y^2:=mathrm{var}(y) = mu^2(n+1)sum^n_{i=1}|g_i|^2. Note, these expressions hinge on the indepedence assumption.

Next, define the (infamous) matrices W_c, W_o by

 begin{array}{lll} W^{(c)} &= AW^{(c)}A^{top} + bb^{top} &= sum^{infty}_{k=0} A^k bb^{top} (A^k)^{top},  W^{(o)} &= A^{top}W^{(o)} A + c^{top}c &=sum^{infty}_{k=0} (A^k)^{top}c^{top}c A.  end{array}

If we assume that the realization is not just minimal, but that (A,b) is a controllable pair and that (A,c) is an observable pair, then, W^{(c)}succ 0 and W^{(o)}succ 0. Now the key observation is that W^{(c)}_{ij} = langle f_i, f_j rangle and similarly W^{(o)}_{ij} = langle g_i, g_j rangle. Hence, we can write delta^2 |f_i|^2 = (mu 2^{m-1})^2 as delta^2 K_ii = (mu 2^{m-1})^2 and indeed sigma_y^2 = mu^2(n+1)sum^n_{i=1}W_{ii}. Using these Lyapunov equations we can say goodbye to the infinite-dimensional objects.

To combine these error terms, let say we have apply a coordinate transformation x'=Tx, for some Tin mathsf{GL}_n. Specifically, let T be diagonal and defined by T_{ii}=2delta sqrt{K_{ii}}/(mu 2^m). Then one can find that K'_{ii}=K_{ii}/T_{ii}^2, W'_{ii}=W_{ii}T_{ii}^2. Where the latter expressions allows to express the output error (variance) by

 sigma_y^2 = (n+1)delta^2 sum^n_{i=1}frac{W^{(c)}_{ii}W^{(o)}_{ii}}{(2^{m_i})^2}.

Now we want to minimize sigma_y^2 over all sigma'sim sigma plus some optimal selection of m_i. At this point it looks rather painful. To make it work we first reformulate the problem using the well-known arithmetic-geometric mean inequality^{[1]} for non-negative sequences {a_i}_{iin [n]}:

 frac{1}{n} sum^n_{i=1}a_i geq left( prod^n_{i=1}a_i right)^{1/n}.

This inequality yields

 sigma_y^2 = n(n+1)delta^2 frac{1}{n}sum^n_{i=1}frac{W^{(c)}_{ii}W^{(o)}_{ii}}{(2^{m_i})^2}geq n(n+1)left(frac{delta}{2^m}right)^2left(prod^n_{i=1}{W^{(c)}_{ii}W^{(o)}_{ii}}right)^{1/n}qquad (1).

See that the right term is independent of m_i, hence this is a lower-bound with respect to minimization over m_i. To achieve this (to make the inequality from (1) an equality), we can select

 m_i = m + frac{1}{2} left(log_2 W_{ii}^{(c)}W_{ii}^{(o)} - frac{1}{2}sum^n_{j=1}log_2 W_{ii}^{(c)}W_{ii}^{(o)} right).

Indeed, as remarked in the paper, m_i is not per se an integer. Nevertheless, by this selection we find the clean expression from (1) to minimize over systems equivalent to sigma, that is, over some transformation matrix T. Define a map f:mathcal{S}^n_{succ 0}to (0,1] by

 f(P) = left( frac{mathrm{det}(P)}{prod^n_{i=1}P_{ii}}right)^{1/2}.

It turns out that f(P)=1 if and only if P is diagonal. This follows^{[2]} from Hadamard's inequality. We can use this map to write

 sigma_y^2 = n(n+1) left(frac{delta}{2^m} right)^2 frac{mathrm{det}(W^{(c)}W^{(o)})^{1/n}}{big(f(W^{(c)})f(W^{(o)})big)^{2/n}} geq n(n+1) left(frac{delta}{2^m}right)^2 mathrm{det}(W^{(c)}W^{(o)})^{1/n}.

Since the term mathrm{det}(W^{(c)}W^{(o)}) is invariant under a transformation T, we can only optimize sigma_y^2 over a structural change in realization tuple (A,b,c,d), that is, we need to make W^{(c)} and W^{(o)} simultaneous diagonal! It turns out that this numerically 'optimal’ realization, denoted sigma^{star}, is what is called a principal axis realization.

To compute it, diagonalize W^{(o)}, W^{(o)}=Q_o Lambda_o Q_o^{top} and define T_1^{-1}:=Lambda_o^{1/2}Q_o^{top}. Next, construct the diagonalization T_1^{-1}W^{(c)}T_1^{-top}=Q_{1,c}Lambda_{1,c}Q_{1,c}^{top}. Then our desired transformation is T:=T_1 Q_{1,c}. First, recall that under any Tin mathsf{GL}_n the pair (W^{(c)},W^{(o)}) becomes (T^{-1}W^{(c)}T^{-top},T^{top}W^{(o)}T). Plugging in our map T yields the transformed matrices (Lambda_{1,c},I_n), which are indeed diagonal.


At last, we do a numerical test in Julia. Consider a linear system sigma defined by the tuple (A,b,c,d):

 A = left[begin{array}{ll} 0.8 & 0.001  0 & -0.5 end{array} right],quad b = left[begin{array}{l} 10  0.1 end{array} right],
 c=left[begin{array}{ll} 10 & 0.1 end{array} right],quad d=0.

To simulate numerical problems, we round the maps f(x,u)=Ax+bu and h(x)=cx to the closest integer where u(t) is a noisy sinusoidal input. We compare a perfect (no numerical errors) realization sigma to the rounded, denoted by [cdot], naive realization [sigma] and to the optimized one [sigma^{star}]. We do 100 experiments and show the mean plus the full confidence interval (of the input-output behaviour), the optimized representation is remarkably better. Numerically we observe that for sigma^{star} we have A^{star}_{21}neq 0, which is precisely where the naive A struggles.

[1]: To show this AM-GM inequality, we can use Jensen's inequality for concave functions, that is, mathbf{E}[g(x)]leq g(mathbf{E}[x]). We can use the logarithm as our prototypical concave function on mathbf{R}_{>0} and find logbig(frac{1}{n}sum^n_{i=1}x_ibig)geq frac{1}{n}sum^n_{i=1}log(x_i) = logbig((prod^n_{i=1}x_i)^{1/n} big). Then, the result follows.
[2]: The inequality attributed to Hadamard is slightly more difficult to show. In its general form the statement is that |mathrm{det}(A)|leq prod^n_{i=1}|a_i|_2, for a_i the i^{mathrm{th}} column of A. The inequality becomes an equality when the columns are mutually orthogonal. The intuition is clear if one interprets the determinant as the signed volume spanned by the columns of A, which. In case Ain mathcal{S}^n_{succ 0}, we know that there is L such that A=L^{top}L, hence, by this simple observation it follows that

 mathrm{det}(A)=mathrm{det}(L)^2 leq prod^n_{i=1}|ell_i|_2^2 = prod^n_{i=1}a_{ii}.

Equality holds when the columns are orthogonal, so AA^{top} must be diagonal, but A=A^{top} must also hold, hence, A^2 must be diagonal, and thus A must be diagonal, which is the result we use.

Risky Business in Stochastic Control: Exponential Utility |12 Jan. 2020|
tags: math.OC

One of the most famous problems in linear control theory is that of designing a control law which minimizes some cost being quadratic in input and state. We all know this as the Linear Quadratic Regulator (LQR) problem. There is however one problem (some would call it a blessing) with this formulation once the dynamics contain some zero-mean noise: the control law is independent of this stochasticity. This is easy in proofs, easy in practice, but does it work well? Say, your control law is rather slow, but bringing you back to 0 in the end of time. In the noiseless case, no problem. But now say there is a substantial amount of noise, do you still want such a slow control law? The answer is most likely no since you quickly drift away. The classical LQR formulation does not differentiate between noise intensities and hence can be rightfully called naive as Bertsekas did (Bert76). Unfortunately, this name did not stuck.

Can we do better? Yes. Define the cost function

     gamma_{T-1}(theta) = frac{2}{theta T}log mathbf{E}_{xi}left[mathrm{exp}left(frac{theta}{2}sum^{T-1}_{t=0}x_t^{top}Qx_t + u_t^{top}Ru_t right) right]=frac{2}{theta T}log mathbf{E}_{xi}left[mathrm{exp}left(frac{theta}{2}Psi right) right]

and consider the problem of finding the minimizing policy pi^{star}_{T-1}=u_0,u_1,dots,u_{T-1} in:

     mathcal{J}_{T-1}:=inf_{pi_{T-1}} gamma_{T-1}(theta)quad mathrm{s.t.} quad x_{t+1}=Ax_t+Bu_t+xi_t,quad xi_t {sim} mathcal{N}(0,Sigma_{xi}^{-1}).

Which is precisely the classical LQR problem, but now with the cost wrapped in an exponential utility function parametrized by thetain mathbf{R}. This problem was pioneered by most notably Peter Whittle (Wh90).

Before we consider solving this problem, let us interpret the risk parameter thetain mathbf{R}. For theta >0 we speak of a risk-sensitive formulation while for theta<0 we are risk-seeking. This becomes especially clear when you solve the problem, but a quick way to see this is to consider the approximation of gamma(theta) near theta=0, which yields gammaapprox mathbf{E}[Psi]+frac{1}{4}{theta}mathbf{E}[Psi]^2, so theta>0 relates to pessimism and theta<0 to optimism indeed. Here we skipped a few scaling factors, but the idea remains the same, for a derivation, consider cumulant generating functions and have a look at our problem.

As with standard LQR, we would like to obtain a stabilizing policy and to that end we will be mostly bothered with solving limsup_{Tto infty}mathcal{J}_T. However, instead of immediately trying to solve the infinite horizon average cost Bellman equation it is easier to consider gamma_{T}(theta) for finite T first. Then, when we can prove monotonicty and upper-bound limsup_{Tto infty}mathcal{J}_{T}, the infinite horizon optimal policy is given by lim_{Tto infty}pi_{T}^{star}. The reason being that monotonic sequences which are uniformly bounded converge.

The main technical tool towards finding the optimal policy is the following Lemma similar to one in the Appendix of (Jac73):
Lemma {
Consider a noisy linear dynamical system defined by x_{t+1}=Ax_k+Bu_t+Dxi_t with xi_t {sim} mathcal{N}(0,Sigma_{xi}^{-1}) and let |A| be shorthand notation for mathrm{det}(A). Then, if (Sigma_{xi}-theta D^{top}PD)^{-1}succ 0 holds we have

  mathbf{E}_{xi}left[mathrm{exp}left(frac{theta}{2}x_{k+1}^{top}Px_{k+1} right)|x_kright] = frac{|(Sigma_{xi}-theta D^{top}PD)^{-1}|^{1/2}}{|Sigma_{xi}^{-1}|^{1/2}}mathrm{exp}left(frac{theta}{2}(Ax_k+Bu_k)^{top}widetilde{P}(Ax_k+Bu_k) right)

where widetilde{P} = P +theta  PD(Sigma_{xi}-theta D^{top}PD)^{-1}D^{top}P. }

Proof {
Let z:= mathbf{E}_{xi}left[mathrm{exp}left(frac{theta}{2}x_{t+1}^{top}Px_{t+1} right)|x_tright] and recall that the shorthand notation for mathrm{det}(A) is |A|, then:

  begin{array}{ll} z &= frac{1}{(2pi)^{d/2}|Sigma_{xi}^{-1}|^{1/2}}int_{mathbf{R}^d}  mathrm{exp}left(frac{theta}{2}x_{t+1}^{top}Px_{t+1} right)mathrm{exp}left(-frac{1}{2}xi_t^{top}Sigma_{xi}xi_t right) dxi_t &= frac{1}{(2pi)^{d/2}|Sigma_{xi}^{-1}|^{1/2}}Big{mathrm{exp}left(frac{theta}{2}(Ax_t+Bu_t)^{top}P(Ax_t+Bu_t) right) &quad cdotint_{mathbf{R}^d}  mathrm{exp}left( theta (Ax_t+Bu_t)^{top}PDxi_t - frac{1}{2}xi_t^{top}(Sigma_{xi}-theta D^{top}PD)xi_tright)dxi_tBig}   &=frac{|(Sigma_{xi}-theta D^{top}PD)^{-1}|^{1/2}}{|Sigma_{xi}^{-1}|^{1/2}}mathrm{exp}left(frac{theta}{2}(Ax_t+Bu_t)^{top}widetilde{P}(Ax_t+Bu_t) right)  &quad cdot frac{1}{(2pi)^{d/2}|(Sigma_{xi}-theta D^{top}PD)^{-1}|^{1/2}}int_{mathbf{R}^d} mathrm{exp}left(-frac{1}{2}(xi_t-overline{xi}_t)^{top}(Sigma_{xi}-theta D^{top}PD)(xi_t-overline{xi}_t)right) dxi_t &=frac{|(Sigma_{xi}-theta D^{top}PD)^{-1}|^{1/2}}{|Sigma_{xi}^{-1}|^{1/2}}mathrm{exp}left(frac{theta}{2}(Ax_t+Bu_t)^{top}widetilde{P}(Ax_t+Bu_t) right). end{array}

Here, the first step follows directly from xi being a zero-mean Gaussian. In the second step we plug in x_{t+1}=Ax_t+Bu_t+Dxi_t. Then, in the third step we introduce a variable overline{xi}_t with the goal of making (Sigma_{xi}-theta D^{top}PD) the covariance matrix of a Gaussian with mean overline{xi}_t. We can make this work for

 overline{xi}_t = theta( Sigma_{xi}-theta D^{top}PD)^{-1}D^{top}P(Ax_t+Bu_t)

and additionally

 widetilde{P} = P +theta PD(Sigma_{xi}-theta D^{top}PD)^{-1}D^{top}P.

Using this approach we can integrate the latter part to 1 and end up with the final expression. Note that in this case the random variable xi_t needs to be Gaussian, since the second to last expression in z equals 1 by being a Gaussian probability distribution integrated over its entire domain. }

What is the point of doing this? Let f(x,u)=Ax+Bu+xi and assume that r_tmathrm{exp}big(frac{theta}{2} x^{top}P_txbig) represents the cost-to-go from stage t and state x. Then consider

     r_{t-1}mathrm{exp}left(frac{theta}{2}x^{top}P_{t-1} xright) = inf_{u} left{mathrm{exp}left(frac{theta}{2}(x^{top}Qx+u^{top}Ru)right)mathbf{E}_{xi}left[r_tmathrm{exp}left( frac{theta}{2}f(x,u)^{top}P_{t}f(x,u)right);|;xright]right}.

Note, since we work with a sum within the exponent, we must multiply within the right-hand-side of the Bellman equation. From there it follows that u^{star}_t=-(R+B^{top}widetilde{P}_tB)^{-1}B^{top}widetilde{P}_tAx_t, for

 P_{t-1} = Q +  A^{top}widetilde{P}_tA -  A^{top}widetilde{P}_tB(R+ B^{top}widetilde{P}_tB)^{-1}B^{top}widetilde{P}_tA,
 widetilde{P}_t = P_t +theta P_tD(Sigma_{xi}-theta D^{top}P_tD)^{-1}D^{top}P_t.

The key trick in simplifying your expressions is to apply the logarithm after minimizing over u such that the fraction of determinants becomes a state-independent affine term in the cost. Now, using a matrix inversion lemma and the push-through rule we can remove widetilde{P}_t and construct a map P_{t-1}=f(P_t):

     P_{t-1} = Q + A^{top}P_t left(I_n + (BR^{-1}B^{top}-theta DSigma_{xi}^{-1}D^{top})P_t right)^{-1}A = Q+A^{top}P_tLambda_t^{-1}A.

such that u_t^{star} = -R^{-1}B^{top}P_t Lambda_t A x_t. See below for the derivations, many (if not all) texts skip them, but if you have never applied the push-through rule they are not that obvious.

As was pointed out for the first time by Jacobson (Jac73), these equations are precisely the ones we see in (non-cooperative) Dynamic Game theory for isotropic Sigma_{xi} and appropriately scaled theta geq 0.

Especially with this observation in mind there are many texts which show that lim_{tdownarrow 0}P_t=P is well-defined and finite, which relates to finite cost and a stabilizing control law u^{star}_t=-R^{-1}B^{top}PLambda^{-1}Ax_t. To formalize this, one needs to assume that (A,B,C) is a minimal realization for C defined by C^{top}C=Q. Then you can appeal to texts like (BB95).

Numerical Experiment and Robustness Interpretation
To show what happens, we do a small 2-dimensional example. Here we want to solve the Risk-Sensitive (theta=0.0015) and the Risk-neutral (theta=0) infinite-horizon average cost problem for

     A = left[   begin{array}{ll}   2 & 1    0 & 2    end{array}right],quad B = left[   begin{array}{l}   0    1    end{array}right],quad Sigma_{xi}^{-1} = left[   begin{array}{ll}   10^{-1} & 200^{-1}    200^{-1} & 10    end{array}right],

D=I_2, Q=I_2, R=1. There is clearly a lot of noise, especially on the second signal, which also happens to be critical for controlling the first state. This makes it interesting. We compute K^{star}|_{theta=0}=:K^{star} and K^{star}|_{theta=0.0015}=:K^{star}_{theta}. Given the noise statistics, it would be reasonable to not take the certainty equivalence control law K^{star} since you control the first state (which has little noise on its line) via the second state (which has a lot of noise on its line). Let x^{(i)} be the i^{mathrm{th}} state under K^{star} and x^{(i)}_{theta} the i^{mathrm{th}} state under K^{star}_{theta}.

We see in the plot below (for some arbitrary initial condition) typical behaviour, K^{star}_{theta} does take the noise into account and indeed we see the K^{star}_{theta} induces a smaller variance.

Simulated link 

So, K^{star} is more robust than K^{star} in a particular way. It turns out that this can be neatly explained. To do so, we have to introduce the notion of Relative Entropy (Kullback-Leibler divergence). We will skip a few technical details (see the references for full details). Given a measure mu on mathcal{V}, then for any other measure nu, being absolutely continuous with respect to mu (nu ll mu), define the Relative Entropy as:

     h(nu || mu ) = int_{mathcal{V}}log.  frac{dnu}{dmu} dnu

Now, for any measurable function Psi on mathcal{V}, being bounded from below, it can be shown (see (DP96)) that

     log int e^{Psi}dmu = sup_{nu}left{ int Psi dnu - h(nu || mu) ; : ; h(nu || mu )<infty right}

For the moment, think of Psi as your standard finite-horizon LQR cost with product measure dnu, then we see that an exponential utility results in the understanding that a control law which minimizes log int e^{Psi}dmu is robust against adversarial noise generated by distributions sufficiently close (measured by h) to the reference mu.

Here we skipped over a lot of technical details, but the intuition is beautiful, just changing the utility to the exponential function gives a wealth of deep distributional results we just touched upon in this post.

Simplification steps in u_t^{star}
We only show to get the simpler representation for the input, the approach to obtain P_{k-1}=f(P_k) is very similar.

First, using the matrix inversion lemma:


we rewrite widetilde{P}_t into:

      widetilde{P}_t= (I_n -theta P_t D Sigma_{xi}^{-1}D^{top})^{-1}P_t=X^{-1}P_t.

Note, we factored our P_t since we cannot assume that P_t is invertible. Our next tool is called the push-through rule. Given Qin mathbf{R}^{mtimes n} and Pin mathbf{R}^{ntimes m} we have

     (I_m + QP)^{-1}Q = Q(I_n + PQ)^{-1}.

You can check that indeed Q(I_n+PQ) = (I_m +QP)Q. Now, to continue, plug this expression for widetilde{P}_t into the input expression:

 begin{array}{ll}     -(R+B^{top}widetilde{P}_t B)^{-1}B^{top}widetilde{P}_t A &= -(R+B^{top}X^{-1}P_t B)^{-1}B^{top}X^{-1}P_t A 								    &= -big( (I_m+B^{top}X^{-1}P_t BR^{-1})Rbig)^{-1}B^{top}X^{-1}P_t A   								    &= -R^{-1}(I_m+B^{top}X^{-1}P_t BR^{-1})^{-1}B^{top}X^{-1}P_t A 								    &= -R^{-1}B^{top}(I_n+X^{-1}P_t BR^{-1}B^{top})^{-1}X^{-1}P_t A 								    &= -R^{-1}B^{top}(X+P_t BR^{-1}B^{top})^{-1}P_t A 								    &= -R^{-1}B^{top}P_t Lambda_t A.   end{array}

Indeed, we only used factorizations and the push-through rule to arrive here.

(Bert76) Dimitri P. Bertsekas : ‘‘Dynamic Programming and Stochastic Control’’, 1976 Academic Press.
(Jac73) D. Jacobson : ‘‘Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games’’, 1973 IEEE TAC.
(Wh90) Peter Whittle : ‘‘Risk-sensitive Optimal Control’’, 1990 Wiley.
(BB95) Tamer Basar and Pierre Bernhard : ‘‘H_{infty}-Optimal Control and Related Minimax Design Problems A Dynamic Game Approach’’ 1995 Birkhauser.
(DP96) Paolo Dai Pra, Lorenzo Meneghini and Wolfgang J. Runggaldier :‘‘Connections Between Stochastic Control and Dynamic Games’’ 1996 Math. Control Signals Systems.

Are Theo Jansen his Linkages Optimal? |16 Dec. 2019|
tags: math.OC, other.Mech

Theo Jansen, designer of the infamous ‘‘strandbeesten’’ explains in a relatively recent series of videos that the design of his linkage system is the result of what could be called an early implementation of genetic programming. Specifically, he wanted the profile that the foot follows to be flat on the bottom. He does not elaborate on this too much, but the idea is that in this way his creatures flow over the surface instead of the more bumpy motion we make. Moreover, you can imagine that in this way the center of mass remains more or less at a constant height, which is indeed energy efficient. Now, many papers have been written on understanding his structure, but I find most of them rather vague, so lets try a simple example.

To that end, consider just 1 leg, 1 linkage system as in the figure(s) below. We can write down an explicit expression for p(L,theta)in mathbf{R}^2 where L=(a,b,c,d,e,f,g,h,i,j,k,l,m) and thetain [0,2pi). See below for an explicit derivation. Now, with his motivation in mind, what would be an easy and relevant cost function? An idea is to maximize the ratio p_w/p_h, of course, subject to still having a functioning leg. To that end we consider the cost function (see the schematic below for notation):

     c(L) = logleft(frac{leftlangle e_1,p(L,1/2pi)-p(L,-1/2pi)rightrangle}{leftlangle e_2,p(L,pi)-p(L,0)rightrangle} right).

Of course, this is a heuristic, but the intuition is that this should approximate p_w/p_h. Let L_1 be the set of parameters as given in Jansen his video, then we apply a few steps of (warm-start) gradient ascent: L_{k+1}=L_k + k^{-2}nabla c(L_k), k=1,2,dots.

Simulated link 

In the figure on the left we compare the linkage as given by Jansen (in black) with the result of just two steps of gradient ascent (in red). You clearly see that the trajectory of the foot becomes more shallow. Nevertheless, we see that the profile starts to become curved on the bottom. It is actually interesting to note that |nabla c(L_1)|_2=0.8, which is still far from 0 but already remarkably small for |L_1|_2=160.

Can we conclude anything from here? I would say that we need dynamics, not just kinematics, but more importantly, we need to put the application in the optimization and only Theo Jansen understands the constraints of the beach.

Still, it is fun to see for yourself how this works and like I did you can build a (baby) strandbeest yourself using nothing more than some PVC tubing and a source of heat.

Derivation of p(L,theta):
Essentially, all that you need is the cosine-rule plus some feeling for how to make the equations continuous. To start, let e_1=(1,0) and e_2=(0,1) (just the standard unit vectors), plus let R(theta) be a standard (counter-clockwise) rotation matrix acting on (x,y)in mathbf{R}^2. To explain the main approach, when we want to find the coordinates of for example point p_4, we try to compute the angle theta_2 such that when we rotate the unit vector e_1 with length j emanating from point p_3, we get point p_4. See the schematic picture below.

Then we easily obtain p_1=(-a,-l), p_2=(0,0) and p_3=p_2+R(theta)me_1. Now, to obtain p_4 and p_b compute the length of the diagonal, n=|p_1-p_3|_2 and using the cosine rule beta_1=arccosbig( -(2nj)^{-1}(b^2-n^2-j^2) big), beta_2=arccosbig( -(2nk)^{-1}(c^2-n^2-k^2) big). Then, the trick is to use cos(theta)=langle a,brangle/ (|a|_2|b|_2) to compute theta_4:=theta_2+beta_1. To avoid angles larger than pi, consider an inner product with e_2 and add the remaining 1/2pi; theta_4=arccosbig(n^{-1}langle e_2,(p_1-p_3)rangle big)+1/2pi. From there we get theta_2=theta_4-beta_1, theta_3=theta_2+beta_1+beta_2, p_4=p_3+R(theta_2)je_1, p_5=p_3+R(theta_3)ke_1. The point p_6 is not missing, to check your code it is convenient to compute p_6=p_3+R(theta_4)ne_1 and check that p_6 agrees with p_1 for all theta. In a similar fashion, gamma_1=arccosbig(-(2db)^{-1}(e^2-d^2-b^2) big), delta_1=arccosbig(b^{-1}langle e_1,(p_4-p_1)rangle big), p_7=p_1+R(gamma_1+delta_1)de_1. Then, for p_8, again compute a diagonal q=|p_5-p_7|_2 plus alpha_1=arccosbig(-(2qc)^{-1}(d^2-q^2-c^2) big), alpha_2=arccosbig(c^{-1}langle e_1,(p_1-p_5)rangle big) alpha_3=arccosbig(-(2qg)^{-1}(f^2-q^2-q^2) big), such that for alpha=alpha_1+alpha_2+alpha_3 we have p_8=p_5+R(alpha)ge_1. At last, psi_1=arccosbig(-(2gi)^{-1}(h^2-i^2-g^2) big), such that p(L,theta)=p_5+R(psi)ie_1 for psi=psi_1+alpha.

Schematic of Jansen his linkages