Almost Surely Investigations in Optimal Control and Dynamical Systems,by Wouter Jongeneel.
Hi, in January 2020 I started as a PhD student at EPFL in the Risk Analytics & Optimization group lead by Daniel Kuhn. Optimal coordinates 24 May 2020

At last, we do a numerical test in Julia. Consider a linear system defined by the tuple : To simulate numerical problems, we round the maps and to the closest integer where is a noisy sinusoidal input. We compare a perfect (no numerical errors) realization to the rounded, denoted by , naive realization and to the optimized one . We do 100 experiments and show the mean plus the full confidence interval (of the inputoutput behaviour), the optimized representation is remarkably better. Numerically we observe that for we have , which is precisely where the naive struggles. 
: To show this AMGM inequality, we can use Jensen's inequality for concave functions, that is, . We can use the logarithm as our prototypical concave function on and find . Then, the result follows.
: The inequality attributed to Hadamard is slightly more difficult to show. In its general form the statement is that , for the column of . The inequality becomes an equality when the columns are mutually orthogonal. The intuition is clear if one interprets the determinant as the signed volume spanned by the columns of , which. In case , we know that there is such that , hence, by this simple observation it follows that
Equality holds when the columns are orthogonal, so must be diagonal, but must also hold, hence, must be diagonal, and thus must be diagonal, which is the result we use.
As a follow up to the previous post, we discuss a lower bound, given as exercise P7.1.14 in (GvL13). I have never seen this bound before, especially not with the corresponding proof, so here we go. Given any , we have the following lower bound:
which can be thought of as a sharpening of the bound . In the case of a singular , the bound is trivial , but in the general setting it is less obvious. To prove that this holds we need a few ingredients. First, we need that for any we have . Then, we construct two decompositions of , (1) the Jordan Normal form , and (2) the Singular Value Decomposition . Using the multiplicative property of the determinant, we find that
Hence, since it follows that and thus we have the desired result.
(Update 8 March) It is of course interesting to get insight in how sharp this bound is, in general. To that end we consider random matrices with . When we compute the relative error for samples we obtain the histogram as shown on the left. Clearly, the relative error is not concentrated near . Nevertheless, it is an interesting bound, from a theoretical point of view. 
(GvL13) G.H. Golub and C.F. Van Loan: ‘‘Matrix Computations’’, 2013 John Hopkins University Press.
Lately, linear discretetime control theory has start to appear in areas far beyond the ordinary. As a byproduct, papers start to surface which claim that stability of linear discretetime systems
is characterized by . The confunsion is complete by calling this object the spectral norm of a matrix . Indeed, for fixed coordinates, stability of is not characterized by , but by . If , then, all absolute eigenvalues of are strictly smaller than one, and hence for , . This follows from the Jordan form of in combination with the simple observation that for , .
To continue, define two sets; and . Since we have . Now, the main question is, how much do we lose by approximating with ? Motivation to do so is given by the fact that is a norm and hence a convex function, i.e., when given a convex polytope with vertices , if , then . Note that is not a norm, can be without (consider an uppertriangular matrix with zeros on the main diagonal), the triangle inequality can easily fail as well. For example, let
Then , but , hence fails to hold. A setting where the aforementioned property of might help is Robust Control, say we want to assess if some algorithm rendered a compact convex set , filled with 's, stable. As highlighted before, we could just check if all extreme points of are members of , which might be a small and finite set. Thus, computationally, it appears to be attractive to consider over the generic .
As a form of critique, one might say that is a lot larger than . Towards such an argument, one might recall that . Indeed, if , but . Therefore, it seems like the set of for which considering over is reasonable, is negligibly small. To say a bit more, since we see that we can always find a ball with nonzero volume fully contained in . Hence, is at least locally dense in . So in principle we could try to investigate . For , the sets agree, which degrades asymptotically. However, is this the way to go? Lets say we consider the set . Clearly, the sets and are different, even in volume, but for sufficiently large , should we care? The behaviour they parametrize is effectively the same.
We will stress that by approximating with , regardless of their volumetric difference, we are ignoring a full class of systems and miss out on a nonneglible set of behaviours. To see this, any system described by is contractive in the sense that , while systems in are merely asymptotically stable. They might wander of before they return, i.e., there is no reason why for all we must have . We can do a quick example, consider the matrices
Then , and both and have . We observe that indeed is contractive, for any initial condition on , we move strictly inside the sphere, whereas for , when starting from the same initial condition, we take a detour outside of the sphere before converging to . So although and have the same spectrum, they parametrize different systems.
In our deterministic setting this would mean that we would confine our statespace to a (solid) sphere with radius , instead of . Moreover, in linear optimal control, the resulting closedloop system is usually not contractive. Think of the infamous pendulum on a cart. Being energy efficient has usually nothing to do with taking the shortest, in the Euclidean sense, path.
(update June 06): As suggested by Pedro Zattoni, we would like to add some nuance here. As highlighted in the first 12 minutes of this lecture by Pablo Parrilo, if is stable, then, one can always find a (linear) change of coordinates such that the transformed system matrix is contractive. Although a very neat observation, you have to be careful, since your measurements might not come from this transformed system. So either you assume merely that is stable, or you assume that is contractive, but then you might need to find an additional map, mapping your states to your measurements.
: Recall, . Then, let be some eigenvector of . Now we have . Since this eigenvalue is arbitrary it follows that .
: Let then . This follows from the being a norm.
: Clearly, if , we have . Now, when , does this imply that ? The answer is no, consider
Then, , yet, . For the full set of conditions on such that see this paper by Goldberg and Zwas.
: Recall that . This expression is clearly larger or equal to .
One of the most famous problems in linear control theory is that of designing a control law which minimizes some cost being quadratic in input and state. We all know this as the Linear Quadratic Regulator (LQR) problem. There is however one problem (some would call it a blessing) with this formulation once the dynamics contain some zeromean noise: the control law is independent of this stochasticity. This is easy in proofs, easy in practice, but does it work well? Say, your control law is rather slow, but bringing you back to in the end of time. In the noiseless case, no problem. But now say there is a substantial amount of noise, do you still want such a slow control law? The answer is most likely no since you quickly drift away. The classical LQR formulation does not differentiate between noise intensities and hence can be rightfully called naive as Bertsekas did (Bert76). Unfortunately, this name did not stuck.
Can we do better? Yes. Define the cost function
and consider the problem of finding the minimizing policy in:
Which is precisely the classical LQR problem, but now with the cost wrapped in an exponential utility function parametrized by . This problem was pioneered by most notably Peter Whittle (Wh90).
Before we consider solving this problem, let us interpret the risk parameter . For we speak of a risksensitive formulation while for we are riskseeking. This becomes especially clear when you solve the problem, but a quick way to see this is to consider the approximation of near , which yields , so relates to pessimism and to optimism indeed. Here we skipped a few scaling factors, but the idea remains the same, for a derivation, consider cumulant generating functions and have a look at our problem.
As with standard LQR, we would like to obtain a stabilizing policy and to that end we will be mostly bothered with solving . However, instead of immediately trying to solve the infinite horizon average cost Bellman equation it is easier to consider for finite first. Then, when we can prove monotonicty and upperbound , the infinite horizon optimal policy is given by . The reason being that monotonic sequences which are uniformly bounded converge.
The main technical tool towards finding the optimal policy is the following Lemma similar to one in the Appendix of (Jac73):
Lemma
Consider a noisy linear dynamical system defined by with and let be shorthand notation for . Then, if holds we have
where .
Proof
Let and recall that the shorthand notation for is , then:
Here, the first step follows directly from being a zeromean Gaussian. In the second step we plug in . Then, in the third step we introduce a variable with the goal of making the covariance matrix of a Gaussian with mean . We can make this work for
and additionally
Using this approach we can integrate the latter part to and end up with the final expression. Note that in this case the random variable needs to be Gaussian, since the second to last expression in equals by being a Gaussian probability distribution integrated over its entire domain.
What is the point of doing this? Let and assume that represents the costtogo from stage and state . Then consider
Note, since we work with a sum within the exponent, we must multiply within the righthandside of the Bellman equation. From there it follows that , for
The key trick in simplifying your expressions is to apply the logarithm after minimizing over such that the fraction of determinants becomes a stateindependent affine term in the cost. Now, using a matrix inversion lemma and the pushthrough rule we can remove and construct a map :
such that . See below for the derivations, many (if not all) texts skip them, but if you have never applied the pushthrough rule they are not that obvious.
As was pointed out for the first time by Jacobson (Jac73), these equations are precisely the ones we see in (noncooperative) Dynamic Game theory for isotropic and appropriately scaled .
Especially with this observation in mind there are many texts which show that is welldefined and finite, which relates to finite cost and a stabilizing control law . To formalize this, one needs to assume that is a minimal realization for defined by . Then you can appeal to texts like (BB95).
Numerical Experiment and Robustness Interpretation
To show what happens, we do a small dimensional example. Here we want to solve the RiskSensitive ) and the Riskneutral () infinitehorizon average cost problem for
, , . There is clearly a lot of noise, especially on the second signal, which also happens to be critical for controlling the first state. This makes it interesting. We compute and . Given the noise statistics, it would be reasonable to not take the certainty equivalence control law since you control the first state (which has little noise on its line) via the second state (which has a lot of noise on its line). Let be the state under and the state under .
We see in the plot below (for some arbitrary initial condition) typical behaviour, does take the noise into account and indeed we see the induces a smaller variance.
So, is more robust than in a particular way. It turns out that this can be neatly explained. To do so, we have to introduce the notion of Relative Entropy (KullbackLeibler divergence). We will skip a few technical details (see the references for full details). Given a measure on , then for any other measure , being absolutely continuous with respect to (), define the Relative Entropy as:
Now, for any measurable function on , being bounded from below, it can be shown (see (DP96)) that
For the moment, think of as your standard finitehorizon LQR cost with product measure , then we see that an exponential utility results in the understanding that a control law which minimizes is robust against adversarial noise generated by distributions sufficiently close (measured by ) to the reference .
Here we skipped over a lot of technical details, but the intuition is beautiful, just changing the utility to the exponential function gives a wealth of deep distributional results we just touched upon in this post.
Simplification steps in
We only show to get the simpler representation for the input, the approach to obtain is very similar.
First, using the matrix inversion lemma:
we rewrite into:
Note, we factored our since we cannot assume that is invertible. Our next tool is called the pushthrough rule. Given and we have
You can check that indeed . Now, to continue, plug this expression for into the input expression:
Indeed, we only used factorizations and the pushthrough rule to arrive here.
(Bert76) Dimitri P. Bertsekas : ‘‘Dynamic Programming and Stochastic Control’’, 1976 Academic Press.
(Jac73) D. Jacobson : ‘‘Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games’’, 1973 IEEE TAC.
(Wh90) Peter Whittle : ‘‘Risksensitive Optimal Control’’, 1990 Wiley.
(BB95) Tamer Basar and Pierre Bernhard : ‘‘Optimal Control and Related Minimax Design Problems A Dynamic Game Approach’’ 1995 Birkhauser.
(DP96) Paolo Dai Pra, Lorenzo Meneghini and Wolfgang J. Runggaldier :‘‘Connections Between Stochastic Control and Dynamic Games’’ 1996 Math. Control Signals Systems.
Theo Jansen, designer of the infamous ‘‘strandbeesten’’ explains in a relatively recent series of videos that the design of his linkage system is the result of what could be called an early implementation of genetic programming. Specifically, he wanted the profile that the foot follows to be flat on the bottom. He does not elaborate on this too much, but the idea is that in this way his creatures flow over the surface instead of the more bumpy motion we make. Moreover, you can imagine that in this way the center of mass remains more or less at a constant height, which is indeed energy efficient. Now, many papers have been written on understanding his structure, but I find most of them rather vague, so lets try a simple example.
To that end, consider just 1 leg, 1 linkage system as in the figure(s) below. We can write down an explicit expression for where and . See below for an explicit derivation. Now, with his motivation in mind, what would be an easy and relevant cost function? An idea is to maximize the ratio , of course, subject to still having a functioning leg. To that end we consider the cost function (see the schematic below for notation):
Of course, this is a heuristic, but the intuition is that this should approximate . Let be the set of parameters as given in Jansen his video, then we apply a few steps of (warmstart) gradient ascent: , .
In the figure on the left we compare the linkage as given by Jansen (in black) with the result of just two steps of gradient ascent (in red). You clearly see that the trajectory of the foot becomes more shallow. Nevertheless, we see that the profile starts to become curved on the bottom. It is actually interesting to note that , which is still far from but already remarkably small for . 
Can we conclude anything from here? I would say that we need dynamics, not just kinematics, but more importantly, we need to put the application in the optimization and only Theo Jansen understands the constraints of the beach.
Still, it is fun to see for yourself how this works and like I did you can build a (baby) strandbeest yourself using nothing more than some PVC tubing and a source of heat.
Derivation of :
Essentially, all that you need is the cosinerule plus some feeling for how to make the equations continuous.
To start, let and (just the standard unit vectors), plus let be a standard (counterclockwise) rotation matrix acting on . To explain the main approach, when we want to find the coordinates of for example point , we try to compute the angle such that when we rotate the unit vector with length emanating from point , we get point . See the schematic picture below.
Then we easily obtain , and . Now, to obtain and compute the length of the diagonal, and using the cosine rule , . Then, the trick is to use to compute . To avoid angles larger than , consider an inner product with and add the remaining ; . From there we get , , , . The point is not missing, to check your code it is convenient to compute and check that agrees with for all . In a similar fashion, , , . Then, for , again compute a diagonal plus , , such that for we have . At last, , such that for .
This short note is inspired by the beautiful visualization techniques from Duistermaat and Kolk (DK1999) (themselves apparently inspired by Atiyah).
Let's say we have a dimensional linear discretetime dynamical system , which preserves the volume of the cube under the dynamics, i.e. for any .
Formally put, this means that is part of a certain group, specifically, consider some field and define the Special Linear group by
Now, assume that we are only interested in matrices such that the cube remains bounded under the dynamics, i.e., is bounded. In this scenario we restrict our attention to . To see why this is needed, consider and both with determinant :
If we look at the image of under both these maps (for several iterations), we see that volume is preserved, but also clearly that the set is extending indefinitely in the horizontal direction.
To have a somewhat interesting problem, assume we are given with while it is our task to find a such that , hence, find feedback which not only preserves the volume, but keeps any set of initial conditions (say ) bounded over time.
Towards a solution, we might ask, given any what is the nearest (in some norm) rotation matrix? This turns out to be a classic question, starting from orthogonal matrices, the solution to is given by , where follows from the SVD of , . Differently put, when we use a polar decomposition of , , then the solution is given by . See this note for a quick multiplier proof. An interesting sidenote, problem is welldefined since is compact. To see this, recall that for any we have , hence the set is bounded, plus is the preimage of under , , hence the set is closed as well.
This is great, however, we would like to optimize over instead. To do so, one usually resorts to simply checking the sign  and if necessary  flipping it via finding the closest matrix with a change in determinant sign. We will see that by selecting appropriate coordinates we can find the solution without this checking of signs.
For today we look at and largely follow (DK1999), for such a dimensional matrix, the determinant requirement translates to . Under the following (invertible) linear change of coordinates
becomes , i.e., for any pair the point runs over a circle with radius . Hence, we obtain the diffeomorphism . We can use however a more pretty diffeomorphism of the sphere and an open set in . To that end, use the diffeomorphism:
for (formal way of declaring that should not be any real number) and the pair being part of (the open unitdisk). To see this, recall that the determinant is not a linear operator. Since we have a dimensional example we can explicitly compute the eigenvalues of any (plug into the characteristic equation) and obtain:
At this point, the only demand on the eigenvalues is that . Therefore, when we would consider within the context of a discrete linear dynamical system , is either a saddlepoint, or we speak of a marginally stable system (). We can visualize the set of all marginally stable , called , defined by all satisfying
(DK1999) J.J. Duistermaat and J.A.C. Kolk : ‘‘Lie Groups’’, 1999 Springer.
Previously we looked at ,  and its resulting flow  as the result from mapping to the sphere. However, we saw that for this flow convergences to , with such that for we have . Hence, it is interesting to look at the flow from an optimization point of view: .
A fair question would be, is our flow not simply Riemannian gradient ascent for this problem? As common with these kind of problems, such a hypothesis is something you just feel.
Now, using the tools from (CU1994, p.311) we can compute the gradient of (on the sphere) via , where is a vector field normal to , e.g., . From there we obtain
To make our life a bit easier, use instead of the map . Moreover, set . Then it follows that
Of course, to make the computation cleaner, we changed to , but the relation between and is beautiful. Somehow, mapping trajectories of , for some , to the sphere corresponds to (Riemannian) gradient ascent applied to the problem .
To do an example, let be given by We can compare with . We see that the gradient flow under the quadratic function takes a clearly ‘‘shorter’’ path. 
Now, we formalize the previous analysis a bit a show how fast we convergence. Assume that the eigenvectors are ordered such that eigenvector corresponds to the largest eigenvalue of . Then, the solution to is given by
Let be expressed in eigenvector coordinates, with all (normalized). Moreover, assume all eigenvalues are distinct. Then, to measure if is near , we compute , which is if and only if is parallel to . To simplify the analysis a bit, we look at , for some perturbation , this yields
Next, take the the (natural) logarithm on both sides:
This logsumexp terms are hard to deal with, but we can apply the socalled ‘‘logsumexp trick’’:
In our case, we set and obtain
We clearly observe that for the LHS approaches from below, which means that from above, like intended. Of course, we also observe that the mentioned method is not completely general, we already assume distinct eigenvalues, but there is more. We do also not convergence when , which is however a set of measure on the sphere .
More interestingly, we see that the convergence rate is largely dictated by the ‘‘spectral gap/eigengap’’ . Specifically, to have a particular projection error , such that , we need
Comparing this to the resulting flow from , , we see that we have the same flow, but with . This is interesting, since and have the same eigenvectors, yet a different (scaled) spectrum. With respect to the convergence rate, we have to compare and for any with (the additional is not so interesting).
It is obvious what we will happen, the crux is, is larger or smaller than ? Can we immediately extend this to a Newtontype algorithm? Well, this fails (globally) since we work in instead of purely with . To be concrete, , we never have degrees of freedom.
Of course, these observations turned out to be far from new, see for example (AMS2008, sec. 4.6).
(AMS2008) P.A. Absil, R. Mahony and R. Sepulchre: ‘‘Optimization Algorithms on Matrix Manifolds’’, 2008 Princeton University Press.
(CU1994) Constantin Udriste: ‘‘Convex Functions and Optimization Methods on Riemannian Manifolds’’, 1994 Kluwer Academic Publishers.
Using the ideas from the previous post, we immediately obtain a dynamical system to compute for any (symmetric positive definite matrices). Here we use the vector field , defined by and denote a solution by , where is the initial condition. Moreover, define a map by . Then , where is the eigenvector corresponding to the largest eigenvalue.
We can do a simple example, let with . Indeed, when given 100 , we observe convergence. To apply this in practice we obviously need to discretize the flow. To add to that, if we consider the discretetime version of , we can get something of the form , which is indeed the Power Iteration algorithm, used to compute the largest eigenvalue of . Recall however that this algorithm needs stronger conditions (on ) to work than our flow. 
In this post we start a bit with control of systems living on compact metric spaces. This is interesting since we have to rethink what we want, cannot even occur. Let a great arc (circle) on be parametrized by the normal of a hyperplane through , i.e., , is the axis of rotation for . 
To start, consider a linear dynamical systems , , for which the solution is given by . We can map this solution to the sphere via . The corresponding vector field for is given by
The beauty is the following, to find the equilibria, consider . Of course, is not valid since . However, take any eigenvector of , then is still an eigenvector, plus it lives on . If we plug such a scaled (from now on consider all eigenvectors to live on ) into we get , which is rather cool. The eigenvectors of are (at least) the equilibria of . Note, if , then so is , each eigenvector gives rise to two equilibria. We say at least, since if is nontrivial, then for any , we get .
Let , then what can be said about the stability of the equilibrium points? Here, we must look at , where the dimensional system is locally a dimensional linear system. To do this, we start by computing the standard Jacobian of , which is given by
As said before, we do not have a dimensional system. To that end, we assume with diagonalization (otherwise peform a GramSchmidt procedure) and perform a change of basis . Then if we are interested in the qualitative stability properties of (the eigenvector of ), we can simply look at the eigenvalues of the matrix resulting from removing the row and column in , denoted .
We do two examples. In both figures are the initial conditions indicated with a dot while the equilibrium points are stars.
First, let Clearly, for we have and let be . Then using the procedure from before, we find that is stable (attracting), while are locally unstable in one direction while in the other (along the great arc). Hence, this example gives rise to the two stable poles. 
Consider an unknown dynamical system parametrized by . In this short note we look at a relatively simple, yet neat method to assess stability of via finite calls to a particular oracle. Here, we will follow the notes by Roman Vershynin (RV2011).
Assume we have no knowledge of , but we are able to obtain for any desired . It turns out that we can use this oracle to bound from above.
First, recall that since is symmetric, . Now, the common definition of is . The constraint set is not per se convenient from a computational point of view. Instead, we can look at a compact constraint set: , where is the sphere in . The latter formulation has an advantage, we can discretize these compact constraints sets and tame functions over the individual chunks.
To do this formally, we can use nets, which are convenient in quantitative analysis on compact metric spaces . Let be a net for if . Hence, we measure the amount of balls to cover .
For example, take as our metric space where we want to bound (the cardinality, i.e., the amount of balls). Then, to construct a minimal set of balls, consider an net of balls , with . Now the insight is simply that . Luckily, we understand the volume of dimensional Euclidean balls well such that we can establish
Which is indeed sharper than the bound from Lemma 5.2 in (RV2011).
Next, we recite Lemma 5.4 from (RV2011). Given an net for with we have
This can be shown using the CauchySchwartz inequality. Now, it would be interesting to see how sharp this bound is. To that end, we do a dimensional example (). Note, here we can easily find an optimal grid using polar coordinates. Let the unknown system be parametrized by
such that . In the figures below we compute the upperbound to for several nets . Indeed, for finer nets, we see the upperbound approaching . However, we see that the convergence is slow while being inversely related to the exponential growth in .
Overall, we see that after about a 100 calls to our oracle we can be sure about being exponentially stable. Here we just highlighted an example from (RV2011), at a later moment we will discuss its use in probability theory.
In this first post we will see that the ‘‘almost surely’’ in the title is not only hinting at stochasticity.
For a long time this picture  as taken from Gabriel PeyrÃ© his outstanding twitter feed  has been on a wall in our local wildlife (inc. hedgehogs) shelter. This has lead to a few fun discussions and here we try to explain the picture a bit more.
Let . Now, we want to find the largest ball within . We can find that is given by with . This follows the most easily from geometric intuition.
To prove it, see that has ‘‘similar’’ faces, without loss of generality we take the one in the positive orthant of . Now we fit a dimensional plane (hyperplane) to this face and find the point the closest (in norm) to . This hyperplane can be parametrized by for and . Thus the normal (direction) is given by . To find , observe that for we have . Hence we have . Now, let and define the other unit vectors correspondingly. Then, since for any we can think of a (skinny) hedgehog indeed for . The inscribed ball shrinks to , while the spikes remain. 
Besides being a fun visualization of concentration, this picture means a bit more to me. For years I have been working at the Wildlife shelter in Delft and the moment has come where the lack of financial support from surrounding municipalities is becoming critical. As usual, this problem can be largely attributed to bureaucracy and the exorbitant amount of managers this world has. Nevertheless, the team in Delft and their colleagues around the country do a fantastic job in spreading the word (see Volkskrant, RTLnieuws). Lets hope Delft, and it neighbours, finally see value in protecting the local wildlife we still have.