Posts (10) containing the 'math.OC' (Optimization and Control) tag
Figure 1: A partition of a compact convex set.
The problem of partitioning a convex set into convex pieces of equal area and equal perimeter was affirmatively solved in 2018, see this paper.
As you can see, this work was updated just a few months ago. The general proof is involved, so let's see if we can handle the case of a compact convex set split into just two pieces.
First of all, when splitting the set into just two pieces, you might think of many different ways to do so. What happens when the cut is curved? The answer is that when the cut is curved, one of the resulting sets must be non-convex; compare the options in Figure 2.
Figure 2: A partitioning of the square into two pieces.
This observation is particularly useful as it implies we only need to look at two points on the boundary of the set (and the straight line between them). As the set is compact, we can always select a cut such that the resulting perimeters of the two pieces are equal. Let us assume that the two points in Figure 3.(i) are chosen like that. If we start moving them along the boundary with equal speed, the resulting perimeters remain equal. Better yet, as the cut is a straight line, the volumes (areas) of the two resulting sets change continuously. Now the result follows from the Intermediate Value Theorem together with the observation that we can flip the roles of the two pieces, see Figure 3.(iii).
Figure 3: By moving the two boundary points with equal speed, the perimeters remain equal while the areas change continuously.
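To spell out the Intermediate Value Theorem step, here is the argument in symbols; the notation below ($\gamma$, $P_1$, $P_2$, $g$) is introduced only for this sketch.

Let $\gamma : [0, L) \to \partial P$ parametrize the boundary of the compact convex set $P$ by arclength, with $L$ the total perimeter. Cutting along the chord between $\gamma(s)$ and $\gamma(s + L/2)$ produces two pieces $P_1(s)$ and $P_2(s)$; each is bounded by a boundary arc of length $L/2$ plus the shared chord, so their perimeters agree for every $s$. Define
$$ g(s) = \operatorname{area}\big(P_1(s)\big) - \operatorname{area}\big(P_2(s)\big). $$
Then $g$ is continuous and, since advancing $s$ by $L/2$ merely swaps the labels of the two pieces, $g(s + L/2) = -g(s)$. By the Intermediate Value Theorem there is an $s^\star$ with $g(s^\star) = 0$: a straight cut yielding two convex pieces of equal perimeter and equal area.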
In this post we will look at one of the many remarkable findings by Roger W. Brockett. Consider a Linear Program (LP) parametrized by a compact feasible polytope and a suitable linear cost.
As a solution to such an LP can always be found at a vertex of the feasible set, a smooth method to solve it seems somewhat awkward.
We will see that one can construct a so-called isospectral flow that does the job.
Here we will follow "Dynamical systems that sort lists, diagonalize matrices and solve linear programming problems" by Roger W. Brockett (CDC 1988) and the book "Optimization and Dynamical Systems" by Uwe Helmke and John B. Moore (Springer, 2nd ed. 1996).
Let the feasible polytope have a finite number of vertices; then one can always find a linear map taking the standard simplex onto it. Indeed, with some abuse of notation, this map can be represented by the matrix whose columns are precisely the vertices of the polytope.
Before we continue, we need to establish some differential geometric results. Given the Special Orthogonal group $\mathrm{SO}(n)$, the tangent space at a point $\Theta$ is given by $T_\Theta \mathrm{SO}(n) = \{\Theta\Omega : \Omega^\top = -\Omega\}$. Note, this is the explicit formulation, which is indeed equivalent to shifting (translating) the corresponding Lie algebra.
The easiest way to compute this is to look at the kernel of the differential of the map defining the underlying manifold.
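As a quick sketch of that kernel computation (the symbols $\Theta$, $V$, $\Omega$ are generic): the manifold is cut out by the map $h(\Theta) = \Theta^\top \Theta - I$ (together with $\det\Theta = 1$ selecting the component containing the identity), and the differential of $h$ at $\Theta$ in the direction $V$ is
$$ Dh(\Theta)[V] = \Theta^\top V + V^\top \Theta. $$
Its kernel consists of all $V$ with $\Theta^\top V$ skew-symmetric, that is, all $V = \Theta\Omega$ with $\Omega^\top = -\Omega$, which recovers the explicit formulation above.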
Now, following Brockett, consider the trace-type objective on $\mathrm{SO}(n)$ obtained by conjugating the problem data with the orthogonal variable and pairing it against a fixed diagonal matrix. This approach is not needed for the full construction, but it allows for more intuition and more computations.
To construct the corresponding gradient flow, recall that the (Riemannian) gradient at a point is defined via the metric pairing with all tangent vectors at that point. Using the explicit tangent space representation from above, every tangent vector is the point multiplied by a skew-symmetric matrix. Then, writing out the directional derivative of the objective, one obtains the gradient, and from it (the minus sign is missing in the paper) the (a) gradient flow on $\mathrm{SO}(n)$.
Introducing the standard commutator bracket $[X, Y] = XY - YX$ and the conjugated matrix evolving along the flow, one obtains from the equation above (there is a typo in the paper) the double-bracket equation. Hence, the isospectral flow can be seen as a reparametrization of a gradient flow.
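For reference, written in the notation commonly used by Helmke and Moore (which may differ from the symbols of the original post), the construction reads as follows. Take
$$ f(\Theta) = \operatorname{tr}\!\left(N\,\Theta^\top Q\,\Theta\right), \qquad \Theta \in \mathrm{SO}(n), $$
with $Q = Q^\top$ encoding the problem data and $N$ a fixed diagonal matrix. With respect to the metric $\langle \Theta\Omega_1, \Theta\Omega_2\rangle = \operatorname{tr}(\Omega_1^\top \Omega_2)$ the Riemannian gradient is $\operatorname{grad} f(\Theta) = \Theta\,[\Theta^\top Q\,\Theta,\ N]$, so the (ascent) gradient flow is
$$ \dot{\Theta} = \Theta\left[\Theta^\top Q\,\Theta,\ N\right]. $$
Setting $H = \Theta^\top Q\,\Theta$, a short computation gives the double-bracket, isospectral form
$$ \dot{H} = \big[H, [H, N]\big], \qquad H(0) = Q. $$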
It turns out that this flow has a variety of remarkable properties. First of all, see that it preserves the eigenvalues of its initial condition, hence the name isospectral.
Also, observe the relation between extremizing this trace objective and the LP cost composed with the simplex parametrization. The idea for handling LPs is now that the limit of the flow will relate to putting weight one on the correct vertex to get the optimizer; the limit gives you this weight, as its diagonal will single out the largest cost.
In fact, the matrix evolving along the flow can be seen as an element of the set of all symmetric matrices sharing the spectrum of the initial condition.
This set is in fact a smooth compact manifold, as it can be written as the orbit of the group action of $\mathrm{SO}(n)$ on symmetric matrices by conjugation; one can check that this map satisfies the group-action properties. Hence, to extremize the objective over this set, it appears appealing indeed to look at Riemannian optimization tools. When doing so, it is convenient to understand the tangent space of the orbit. Consider the map defining the manifold, namely conjugation of the initial condition by an orthogonal matrix. Then, differentiating this map along curves in $\mathrm{SO}(n)$ yields the relation that every tangent vector at a point of the orbit is the commutator of that point with a skew-symmetric matrix.
For the moment, fix the diagonal matrix appearing in the objective and start the flow from the symmetric data matrix. First we consider the convergence of the flow.
Let the fixed diagonal matrix have only distinct diagonal entries; then the limit of the flow exists and is diagonal.
Using the objective from before, differentiate it along the flow and see that, by using the skew-symmetry of the commutator, one recovers that this derivative is a squared (Frobenius) norm of the commutator. This means the cost monotonically increases and, by compactness, the flow converges to an equilibrium point. By construction, this point must commute with the fixed diagonal matrix. As that matrix has distinct diagonal entries, this can only be true if the limit itself is diagonal.
More can be said about the limit: its diagonal entries are the eigenvalues of the initial condition. Indeed, as the flow preserves eigenvalues (it is isospectral), the limit must be the diagonal matrix of eigenvalues up to conjugation by a permutation matrix.
This also tells us there is just a finite number of equilibrium points (a finite number of permutations). We will sometimes write the equilibria in this permuted-diagonal form.
Now, as the correctly sorted diagonal matrix is one of those equilibrium points, when does the flow actually converge to it? To start this investigation, we look at the linearization of the flow at an equilibrium point. As we work with matrix-valued vector fields, this might seem like a daunting computation. However, at equilibrium points one does not need a connection and can again use the directional derivative approach, in combination with the tangent space construction above, to figure out the linearization. The beauty is that from there one can see that exactly one equilibrium point is asymptotically stable. Differently put, almost all initial conditions will converge to that equilibrium, with the rate captured by the spectral gaps in the data matrix and in the fixed diagonal matrix. If the data matrix does not have distinct eigenvalues and we do not impose any eigenvalue ordering, one sees that an asymptotically stable equilibrium point must have the same eigenvalue ordering as the fixed diagonal matrix. This is the sorting property of the isospectral flow, and it is of use for the next and final statement.
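A minimal numerical sketch of this sorting behaviour, using the double-bracket form $\dot H = [H,[H,N]]$ and a plain forward-Euler integration (the matrices, step size and horizon below are arbitrary illustrative choices):

```julia
using LinearAlgebra

# Double-bracket (isospectral) flow  dH/dt = [H, [H, N]],
# integrated with a crude forward-Euler scheme.
comm(X, Y) = X * Y - Y * X

function double_bracket_flow(H0, N; dt = 1e-3, steps = 200_000)
    H = copy(H0)
    for _ in 1:steps
        H += dt * comm(H, comm(H, N))
    end
    return H
end

n = 4
Q = Symmetric(randn(n, n))          # random symmetric "data" matrix
N = Diagonal(collect(1.0:n))        # fixed diagonal matrix with distinct entries

Hlim = double_bracket_flow(Matrix(Q), N)

# The limit is (numerically) diagonal, its diagonal carries the eigenvalues of Q,
# and they appear sorted consistently with the ordering of the diagonal of N.
@show round.(diag(Hlim); digits = 4)
@show round.(sort(eigvals(Q)); digits = 4)
```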
Theorem:
Consider the LP and assume the costs of all vertices are distinct. Then there exist diagonal matrices (one built from the vertex costs, one fixing the desired ordering) such that the isospectral flow converges for almost any initial condition to a diagonal limit, and the optimizer of the LP is the vertex singled out by that limit.
Proof:
Global convergence is prohibited by the topology of the underlying manifold. Construct the data matrix from the vertex costs and pick the fixed diagonal matrix with distinct, ordered entries. Then the isospectral flow will converge from almost everywhere to the unique asymptotically stable equilibrium (the sorted diagonal arrangement), from which the vertex attaining the optimal cost is identified.
Please consider the references for more on the fascinating structure of these flows.
Most of us will learn at some point in our life that $\sqrt{-1}$ is problematic, for which number multiplied by itself can ever be negative?
To overcome this seemingly useless deficiency one learns about complex numbers and specifically the imaginary unit $i$, which is defined to satisfy $i^2 = -1$.
At this point you should have asked yourself: when on earth is this useful?
In this short post I hope to highlight - from a slightly different angle than most people grounded in physics would expect - that the complex numbers are remarkably useful.
Complex numbers, and especially the complex exponential, show up in a variety of contexts, from signal processing to statistics and quantum mechanics - most notably, of course, in the Fourier transform.
We will however look at something completely different. It can be argued that in the late 60s James Lyness and Cleve Moler brought to life a very elegant new approach to numerical differentiation. To introduce this idea, recall that even nowadays the most well-known approach to numerical differentiation is to use some sort of finite-difference method; for example, for any $h>0$ one could use the central-difference approximation
$$ f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}. $$
Now one might be tempted to make $h$ extremely small, as then the error must vanish! However, numerically, for a very small $h$ the two function evaluations $f(x+h)$ and $f(x-h)$ will be nearly indistinguishable. So although the truncation error scales as $O(h^2)$, there is some practical lower bound on the overall error based on the machine precision of your computer.
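A quick numerical illustration of this error floor (the test function and step sizes are arbitrary choices):

```julia
# Central differences on f(x) = exp(x) at x = 1: the error first shrinks like h^2,
# but grows again once floating-point cancellation takes over.
f(x) = exp(x)
x, exact = 1.0, exp(1.0)

for h in (1e-1, 1e-3, 1e-5, 1e-8, 1e-12, 1e-15)
    approx = (f(x + h) - f(x - h)) / (2h)
    println("h = ", h, "   |error| = ", abs(approx - exact))
end
```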
One potential application of numerical derivatives is in the context of zeroth-order (derivative-free) optimization.
Say you want to adjust van der Poel's Canyon frame such that he goes even faster: you will not have access to explicit gradients, but you can evaluate the performance of a change in design, for example in a simulator. So what you usually can obtain is a set of function evaluations. Given this data, a somewhat obvious approach is to mimic first-order algorithms, that is, to run gradient descent with some stepsize, but with the true gradient replaced by, for example, the central-difference approximation from above.
Clearly, if the objective function is well-behaved and the approximation of the gradient is reasonably good, then something reasonable should come out.
As was remarked before, if your approximation will always have a bias - for example due to numerical cancellation errors - it is not immediate how to construct a high-performance zeroth-order optimization algorithm. If only there were a way to obtain a numerical approximation without finite differencing...
Let us assume that $f$ is sufficiently smooth, let $i$ be the imaginary unit, and consider the Taylor series
$$ f(x + ih) = f(x) + ih\,f'(x) - \tfrac{h^2}{2} f''(x) - i\,\tfrac{h^3}{6} f'''(x) + \cdots $$
From this expression it follows that
$$ f'(x) = \frac{\operatorname{Im}\, f(x + ih)}{h} + O(h^2). $$
So we see that by passing to the complex domain and projecting the imaginary part back, we obtain a numerical method to construct approximations of derivatives without even the possibility of cancellation errors. This remarkable property makes it a very attractive candidate for use in zeroth-order optimization algorithms, which is precisely what we investigated in our new pre-print. It turns out that convergence is not only robust, but also very fast!
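A minimal sketch of the complex-step derivative in Julia (the function, point and step size are again arbitrary illustrative choices):

```julia
# Complex-step differentiation: f'(x) ≈ Im(f(x + im*h)) / h.
# No subtraction of nearly equal numbers takes place, so h can be taken absurdly small.
f(x) = exp(x) * sin(x)
fprime(x) = exp(x) * (sin(x) + cos(x))          # analytic derivative, for reference

x, h = 1.0, 1e-100
cstep   = imag(f(x + im * h)) / h               # complex-step approximation
central = (f(x + 1e-8) - f(x - 1e-8)) / 2e-8    # central difference, for comparison

println("complex-step error:       ", abs(cstep - fprime(x)))
println("central-difference error: ", abs(central - fprime(x)))
```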
There was a bit of radio silence, but as indicated here, some interesting stuff is on its way.
In this post we highlight this 1976 paper by Mullis and Roberts on ’Synthesis of Minimum Roundoff Noise Fixed Point Digital Filters’.
Let us be given some single-input single-output (SISO) dynamical system in state-space form, say a discrete-time realization $(A, b, c, d)$ with state $x$, input $u$ and output $y$.
It is known that the input-output behaviour of any such realization, that is, the map from $u$ to $y$, is invariant under a similarity transform. To be explicit, the tuples $(A, b, c, d)$ and $(TAT^{-1}, Tb, cT^{-1}, d)$, which correspond to the change of coordinates $x \mapsto Tx$ for some invertible $T$, give rise to the same input-output behaviour.
Hence, one can define an equivalence relation by imposing that the input-output maps of two realizations are equivalent, and by doing so we can consider the corresponding quotient. However, in practice, given a realization, is any similar realization equally useful?
For example, two realizations can be similar while one of them is well scaled and the other is not; clearly, the well-scaled one is preferred from a numerical point of view.
In what follows, we highlight the approach from Mullis and Roberts and conclude with how to optimally select the transformation $T$. Norms will be interpreted in the $\ell_2$ sense, that is, for a signal $u$, $\|u\|^2 = \sum_k u_k^2$. Also, in what follows we assume that the system is stable, which will mean that the eigenvalues of $A$ lie strictly inside the unit disc, and that the realization is minimal.
The first step is to quantify the error. Assume we have allocated a number of bits to represent each component of the state. These amounts can differ per component, but we constrain their average to a fixed budget. Each component is then stored with a 'step size' such that its dynamic range is bounded (these are our possible representations). Next we use the modelling choices from Mullis and Roberts; of course, they are debatable, but still, we will infer some nice insights.
First, to bound the dynamic range, consider solely the effect of an input, that is, the state response from a zero initial condition, and impose a bound on each of its components. In combination with the step size (quantization), this fixes the resolution per component. If the input is a sequence of i.i.d. random variables, this bound holds with a probability governed by a scaling parameter. Hence, one can think of that parameter as a scaling related to the probability with which the dynamic range bound is true.
Next, assume that all the round-off errors are independent and share a common variance.
Hence, the worst-case variance of computing a state update is the sum of the individual contributions. To evaluate the effect of this error on the output, assume for the moment that the input is identically zero, so the output is driven purely by the round-off noise. As before we can compute the variance, this time of the full output signal, by summing the contributions over time and over the state components. Note, these expressions hinge on the independence assumption.
Next, define the (infamous) matrices $W$ and $M$ by
$$ W = \sum_{k \geq 0} A^k b\, b^\top (A^\top)^k, \qquad M = \sum_{k \geq 0} (A^\top)^k c^\top c\, A^k. $$
If we assume that the realization is not just minimal, but that $(A, b)$ is a controllable pair and that $(A, c)$ is an observable pair, then $W \succ 0$ and $M \succ 0$.
Now the key observation is that the squared norms appearing in the dynamic-range bound are precisely the diagonal entries of $W$, and similarly the squared norms appearing in the output-error variance are the diagonal entries of $M$. Moreover, $W$ and $M$ satisfy the Lyapunov equations
$$ W = A W A^\top + b\, b^\top, \qquad M = A^\top M A + c^\top c. $$
Using these Lyapunov equations we can say goodbye to the infinite-dimensional objects.
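A small sketch of how one might compute these two Gramians numerically, solving the discrete Lyapunov equations by vectorization (the example realization is made up; the symbols follow the notation above):

```julia
using LinearAlgebra

# Solve the discrete Lyapunov equation X = A*X*A' + S via vectorization:
# vec(X) = (I - kron(A, A)) \ vec(S).
function dlyap(A, S)
    n = size(A, 1)
    X = reshape((I(n^2) - kron(A, A)) \ vec(S), n, n)
    return (X + X') / 2                  # symmetrize against round-off
end

# A made-up stable SISO realization (eigenvalues of A inside the unit disc).
A = [0.5 0.4; -0.3 0.6]
b = [1.0, 0.0]
c = [0.0 1.0]

W = dlyap(A,  b * b')                    # controllability Gramian
M = dlyap(A', c' * c)                    # observability Gramian
@show diag(W) diag(M)
```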
To combine these error terms, let us say we apply a coordinate transformation, specifically a diagonal scaling whose entries are matched to the per-component dynamic ranges. One can then track how the Gramians $W$ and $M$ transform, and the resulting expressions allow us to express the output error (variance) in terms of their diagonal entries and the bit allocation.
Now we want to minimize this error over all admissible bit allocations plus some optimal selection of the realization. At this point it looks rather painful. To make it work, we first reformulate the problem using the well-known arithmetic-geometric mean inequality for non-negative sequences $x_1,\dots,x_n$:
$$ \frac{1}{n}\sum_{i=1}^{n} x_i \;\geq\; \Big(\prod_{i=1}^{n} x_i\Big)^{1/n}. $$
This inequality yields a bound in which the right-hand term is independent of the individual bit allocation; hence it is a lower bound with respect to minimization over the bit allocation. To achieve this bound (to make the inequality an equality), we can select the number of bits per component according to the corresponding dynamic range. Indeed, as remarked in the paper, this selection is not per se an integer. Nevertheless, by this selection we find a clean expression to minimize over all systems equivalent to the given one, that is, over the transformation matrix $T$. Define the map taking a positive definite matrix to the product of its diagonal entries divided by its determinant. It turns out that this ratio equals one if and only if the matrix is diagonal; this follows from Hadamard's inequality.
We can use this map to rewrite the lower bound. Since one of its factors is invariant under a similarity transformation, we can only optimize over a structural change in the realization tuple; that is, we need to make $W$ and $M$ simultaneously diagonal! It turns out that this numerically 'optimal' realization is what is called a principal axis realization.
To compute it, first diagonalize $W$ and use its eigenvectors and eigenvalues to build a square-root factor. Next, construct the diagonalization of the observability Gramian expressed in these intermediate coordinates. The desired transformation is the composition of these two steps. Recall that under any similarity transformation the Gramians transform congruently (one by the transformation, the other by its inverse transpose); plugging in the constructed map yields transformed Gramians which are indeed diagonal.
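One concrete way to make both Gramians diagonal at once is the balanced-realization construction (a principal-axis-type realization in which the two transformed Gramians even coincide); the sketch below uses that variant and a made-up system, so it illustrates the idea rather than the exact recipe of the paper:

```julia
using LinearAlgebra

# Discrete Lyapunov solver as before.
dlyap(A, S) = (n = size(A, 1); X = reshape((I(n^2) - kron(A, A)) \ vec(S), n, n); (X + X') / 2)

# Balanced realization: both Gramians become equal and diagonal.
function balance_realization(A, b, c)
    W = dlyap(A,  b * b')                     # controllability Gramian
    M = dlyap(A', c' * c)                     # observability Gramian
    R = cholesky(Symmetric(W)).L              # W = R * R'
    E = eigen(Symmetric(R' * M * R))          # R' * M * R = U * Λ * U'
    T = R * E.vectors * Diagonal(E.values .^ (-1 / 4))
    # change of coordinates x_new = T \ x, i.e. (A, b, c) -> (T⁻¹AT, T⁻¹b, cT)
    return T \ A * T, T \ b, c * T, T
end

A = [0.5 0.4; -0.3 0.6];  b = [1.0, 0.0];  c = [0.0 1.0]
At, bt, ct, T = balance_realization(A, b, c)

# Both transformed Gramians are (numerically) diagonal - and here even equal.
Wt = dlyap(At, bt * bt')
Mt = dlyap(At', ct' * ct)
@show round.(Wt; digits = 8) round.(Mt; digits = 8)
```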
At last, we do a numerical test in Julia: take a linear system, simulate numerical problems by rounding the maps, and compare the resulting behaviour of the original realization with that of the principal axis realization.
: To show this AM-GM inequality, we can use Jensen's inequality for concave functions, that is, $\varphi\big(\tfrac{1}{n}\sum_i x_i\big) \geq \tfrac{1}{n}\sum_i \varphi(x_i)$. We can use the logarithm as our prototypical concave function on $(0,\infty)$ and find
$$ \log\Big(\frac{1}{n}\sum_{i=1}^{n} x_i\Big) \;\geq\; \frac{1}{n}\sum_{i=1}^{n} \log x_i \;=\; \log\Big(\prod_{i=1}^{n} x_i\Big)^{1/n}. $$
Then, the result follows by exponentiating both sides.
: The inequality attributed to Hadamard is slightly more difficult to show. In its general form the statement is that $|\det(A)| \leq \prod_i \|a_i\|_2$, for $a_i$ the $i$-th column of $A$. The inequality becomes an equality when the columns are mutually orthogonal. The intuition is clear if one interprets the determinant as the signed volume spanned by the columns of $A$. In case $A \succ 0$, we know that there is a $B$ such that $A = B^\top B$, hence, by this simple observation it follows that
$$ \det(A) = \det(B)^2 \;\leq\; \prod_i \|b_i\|_2^2 \;=\; \prod_i A_{ii}. $$
Equality holds when the columns of $B$ are orthogonal, so $B^\top B$ must be diagonal; but $A = B^\top B$ must also hold, hence $A$ must be diagonal, which is the result we use.
One of the most famous problems in linear control theory is that of designing a control law which minimizes a cost that is quadratic in the input and the state. We all know this as the Linear Quadratic Regulator (LQR) problem.
There is however one problem (some would call it a blessing) with this formulation once the dynamics contain some zero-mean noise: the control law is independent of this stochasticity. This is easy in proofs, easy in practice, but does it work well? Say your control law is rather slow, but brings you back to the origin at the end of time. In the noiseless case, no problem. But now say there is a substantial amount of noise: do you still want such a slow control law? The answer is most likely no, since you quickly drift away. The classical LQR formulation does not differentiate between noise intensities and hence can rightfully be called naive, as Bertsekas did (Bert76). Unfortunately, this name did not stick.
Can we do better? Yes. Define the usual quadratic cost function and consider the problem of finding the policy minimizing the expectation of its exponential.
This is precisely the classical LQR problem, but now with the cost wrapped in an exponential utility function parametrized by a risk parameter.
This problem was pioneered most notably by Peter Whittle (Wh90).
Before we consider solving this problem, let us interpret the risk parameter. For one sign we speak of a risk-sensitive (risk-averse) formulation, while for the other we are risk-seeking. This becomes especially clear when you solve the problem, but a quick way to see it is to consider the expansion of the exponential utility near a vanishing risk parameter, which yields the expected cost plus a term proportional to the risk parameter times the variance of the cost; so one sign relates to pessimism and the other to optimism indeed. Here we skipped a few scaling factors, but the idea remains the same; for a derivation, consider cumulant generating functions and have a look at our problem.
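To make that expansion explicit (writing $\theta$ for the risk parameter and $J$ for the random cost; these labels are chosen here just for the sketch), recall the cumulant expansion
$$ \frac{1}{\theta}\log \mathbb{E}\!\left[e^{\theta J}\right] \;=\; \mathbb{E}[J] \;+\; \frac{\theta}{2}\operatorname{Var}(J) \;+\; O(\theta^2), $$
so for $\theta > 0$ the variance of the cost is penalized on top of its mean (pessimism, risk aversion), for $\theta < 0$ it is rewarded (optimism, risk seeking), and as $\theta \to 0$ we recover the risk-neutral LQR objective.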
As with standard LQR, we would like to obtain a stabilizing policy, and to that end we will be mostly bothered with solving the infinite-horizon problem.
However, instead of immediately trying to solve the infinite-horizon average-cost Bellman equation, it is easier to consider the finite-horizon problem first. Then, when we can prove monotonicity of the corresponding value recursion together with an upper bound, the infinite-horizon optimal policy is obtained in the limit.
The reason being that monotonic sequences which are uniformly bounded converge.
The main technical tool towards finding the optimal policy is the following lemma, similar to one in the appendix of (Jac73):
Lemma
Consider a noisy linear dynamical system driven by zero-mean Gaussian noise with a given covariance, and consider the expectation of the exponential of a quadratic form in the next state. Then, if a certain positive-definiteness condition on the covariance and the quadratic form holds, this expectation is again the exponential of a quadratic form in the current state and input, up to a multiplicative constant given by a ratio of determinants.
Proof
Recall the shorthand notation for the predicted next state and write out the Gaussian expectation as an integral. The first step follows directly from the noise being a zero-mean Gaussian. In the second step we plug in the dynamics. Then, in the third step we introduce a new matrix variable with the goal of recognizing the covariance matrix of a Gaussian with a shifted mean. We can make this work under the aforementioned positive-definiteness condition, and using this approach we can integrate the latter part to one and end up with the final expression.
Note that in this case the random variable needs to be Gaussian, since the second-to-last expression equals one precisely because it is a Gaussian probability density integrated over its entire domain.
What is the point of doing this?
Assume that the exponentiated quadratic represents the cost-to-go from a given stage and state, and consider the corresponding Bellman recursion. Note, since we work with a sum within the exponent, we must multiply within the right-hand side of the Bellman equation.
From there, using the lemma, it follows that the cost-to-go keeps this exponentiated-quadratic form, with the quadratic term propagated by a Riccati-like recursion.
The key trick in simplifying your expressions is to apply the logarithm after minimizing over the input, such that the fraction of determinants becomes a state-independent affine term in the cost.
Now, using a matrix inversion lemma and the push-through rule, we can remove the awkward inverse and construct a map propagating the quadratic term from one stage to the next, such that the recursion is simply a repeated application of this map.
See below for the derivations; many (if not all) texts skip them, but if you have never applied the push-through rule they are not that obvious.
As was pointed out for the first time by Jacobson (Jac73), these equations are precisely the ones we see in (non-cooperative) dynamic game theory for isotropic noise and an appropriately scaled parameter.
Especially with this observation in mind, there are many texts which show that the limiting solution of the recursion is well-defined and finite, which relates to a finite cost and a stabilizing control law. To formalize this, one needs to assume a minimality condition on the realization defined by the cost and the dynamics. Then you can appeal to texts like (BB95).
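As a concrete sketch of the resulting recursion, here is the risk-sensitive Riccati iteration in the form it commonly appears in the literature following Jacobson and Whittle; the symbols A, B, Q, R, Σ, θ are my labels, the example system is made up, and the precise scaling of the risk parameter depends on the convention used:

```julia
using LinearAlgebra

# Risk-sensitive Riccati recursion (iterated towards stationarity):
#   Ptil = (P⁻¹ - θ Σ)⁻¹            (requires I - θ Σ P ≻ 0),
#   K    = (R + B' Ptil B)⁻¹ B' Ptil A,
#   P⁺   = Q + A' Ptil (A - B K).
# For θ → 0 this reduces to the ordinary discrete-time LQR recursion.
function rs_lqr(A, B, Q, R, Σ, θ; iters = 500)
    P, K = copy(Q), zeros(size(B, 2), size(A, 1))
    for _ in 1:iters
        Ptil = inv(inv(P) - θ * Σ)
        K = (R + B' * Ptil * B) \ (B' * Ptil * A)
        P = Q + A' * Ptil * (A - B * K)
    end
    return P, K
end

# Made-up two-state example: little noise on the first state, a lot on the second.
A = [1.0 0.1; 0.0 1.0];  B = reshape([0.0, 0.1], 2, 1)
Q = Matrix(1.0I, 2, 2);  R = fill(0.1, 1, 1)
Σ = Diagonal([0.01, 0.5])

P0, K0 = rs_lqr(A, B, Q, R, Σ, 0.0)     # risk-neutral (classical LQR) gain
P1, K1 = rs_lqr(A, B, Q, R, Σ, 0.1)     # risk-sensitive gain
@show K0 K1
```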
Numerical Experiment and Robustness Interpretation
To show what happens, we do a small two-dimensional example. Here we want to solve the risk-sensitive (nonzero risk parameter) and the risk-neutral (vanishing risk parameter) infinite-horizon average-cost problem for a fixed choice of dynamics, cost and noise covariance.
There is clearly a lot of noise, especially on the second signal, which also happens to be critical for controlling the first state. This makes it interesting.
We compute both feedback gains.
Given the noise statistics, it would be reasonable to not take the certainty-equivalence control law, since you control the first state (which has little noise on its line) via the second state (which has a lot of noise on its line). We then simulate the closed loop under both gains.
We see in the plot below (for some arbitrary initial condition) typical behaviour: the risk-sensitive law does take the noise into account, and indeed we see that it induces a smaller variance.
Plot: closed-loop state trajectories under the risk-neutral and the risk-sensitive control laws.
So, the risk-sensitive law is more robust than the risk-neutral one in a particular way. It turns out that this can be neatly explained. To do so, we have to introduce the notion of relative entropy (Kullback-Leibler divergence). We will skip a few technical details (see the references for full details). Given a probability measure $\mathbb{P}$, then for any other measure $\mathbb{Q}$ which is absolutely continuous with respect to $\mathbb{P}$ ($\mathbb{Q} \ll \mathbb{P}$), define the relative entropy as
$$ D(\mathbb{Q}\,\|\,\mathbb{P}) = \int \log\frac{d\mathbb{Q}}{d\mathbb{P}}\, d\mathbb{Q}. $$
Now, for any measurable function bounded from below, it can be shown (see (DP96)) that the exponential utility under $\mathbb{P}$ equals a supremum over measures $\mathbb{Q}$ of the expected cost minus a multiple of $D(\mathbb{Q}\,\|\,\mathbb{P})$.
For the moment, think of that function as your standard finite-horizon LQR cost and of $\mathbb{P}$ as the nominal (product) noise measure; then we see that an exponential utility results in the understanding that a control law which minimizes it is robust against adversarial noise generated by distributions sufficiently close (as measured by $D$) to the reference $\mathbb{P}$.
Here we skipped over a lot of technical details, but the intuition is beautiful: just changing the utility to the exponential function gives a wealth of deep distributional results, which we only touched upon in this post.
Simplification steps
We only show how to get the simpler representation for the input; the approach to obtain the state recursion is very similar.
First, using the matrix inversion lemma, we rewrite the inverse appearing in the gain. Note, we factored out terms carefully since we cannot assume that the relevant matrix is invertible. Our next tool is called the push-through rule, which lets us move a factor through an identity-plus-product inverse; you can check that it indeed holds by multiplying out both sides. Now, to continue, plug the rewritten expression back into the input expression.
Indeed, we only used factorizations and the push-through rule to arrive here.
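For reference, the two identities used here in their generic form (the symbols below are placeholders, not the system matrices; all indicated inverses are assumed to exist): the matrix inversion lemma
$$ (A + UCV)^{-1} = A^{-1} - A^{-1}U\left(C^{-1} + V A^{-1} U\right)^{-1} V A^{-1}, $$
and the push-through rule
$$ (I + AB)^{-1}A = A\,(I + BA)^{-1}, $$
which follows from $A(I + BA) = (I + AB)A$ by multiplying from the left by $(I + AB)^{-1}$ and from the right by $(I + BA)^{-1}$.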
(Bert76) Dimitri P. Bertsekas : ‘‘Dynamic Programming and Stochastic Control’’, 1976 Academic Press.
(Jac73) D. Jacobson : ‘‘Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games’’, 1973 IEEE TAC.
(Wh90) Peter Whittle : ‘‘Risk-sensitive Optimal Control’’, 1990 Wiley.
(BB95) Tamer Basar and Pierre Bernhard : ‘‘H∞-Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach’’, 1995 Birkhauser.
(DP96) Paolo Dai Pra, Lorenzo Meneghini and Wolfgang J. Runggaldier :‘‘Connections Between Stochastic Control and Dynamic Games’’ 1996 Math. Control Signals Systems.
Theo Jansen, designer of the infamous ‘‘strandbeesten’’, explains in a relatively recent series of videos that the design of his linkage system is the result of what could be called an early implementation of genetic programming. Specifically, he wanted the profile that the foot follows to be flat on the bottom. He does not elaborate on this too much, but the idea is that in this way his creatures flow over the surface instead of the more bumpy motion we make. Moreover, you can imagine that in this way the center of mass remains more or less at a constant height, which is indeed energy efficient. Now, many papers have been written on understanding his structure, but I find most of them rather vague, so let's try a simple example.
To that end, consider just one leg, one linkage system, as in the figure(s) below. We can write down an explicit expression for the position of the foot as a function of the crank angle and the link lengths; see below for an explicit derivation. Now, with his motivation in mind, what would be an easy and relevant cost function? An idea is to maximize how wide and flat the foot trajectory is, of course subject to still having a functioning leg. To that end we consider a heuristic cost function (see the schematic below for notation); the intuition is that it should approximate this width-to-height type ratio.
Let the parameters be as given in Jansen's video; then we apply a few steps of (warm-start) gradient ascent on the link lengths.
In the figure on the left we compare the linkage as given by Jansen (in black) with the result of just two steps of gradient ascent (in red). You clearly see that the trajectory of the foot becomes more shallow. Nevertheless, we see that the profile starts to become curved on the bottom.
Can we conclude anything from here? I would say that we need dynamics, not just kinematics, but more importantly, we need to put the application in the optimization and only Theo Jansen understands the constraints of the beach.
Still, it is fun to see for yourself how this works and, like I did, you can build a (baby) strandbeest yourself using nothing more than some PVC tubing and a source of heat.
Derivation of the foot position:
Essentially, all that you need is the cosine rule plus some feeling for how to make the equations continuous.
To start, let the standard unit vectors span the plane and let a standard (counter-clockwise) rotation matrix act on it. The main approach is this: when we want to find the coordinates of some point of the linkage, we compute the angle such that, when we rotate a unit vector by this angle, scale it by the corresponding link length and let it emanate from an already known point, we land exactly on the point we are after. See the schematic picture below.
The points on the crank follow immediately in this way. To obtain the next points, first compute the length of the diagonal of the corresponding triangle and then use the cosine rule to recover its interior angles. The trick is to use these angles to rotate a known direction onto the desired link. To avoid angles larger than $\pi$, work with an inner product against a fixed reference direction and add the remaining angle contribution explicitly.
One of the points is not missing by accident: to check your code, it is convenient to compute it via two different triangles and verify that both computations agree for all crank angles.
In a similar fashion the remaining joints follow: for each, again compute a diagonal, apply the cosine rule, and rotate accordingly, until at last we obtain the position of the foot.
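As a small illustration of this cosine-rule pattern, here is a sketch that locates a single joint of such a linkage; the link lengths, points and names are made up and are not Jansen's numbers:

```julia
using LinearAlgebra

# Counter-clockwise rotation by angle φ.
rot(φ) = [cos(φ) -sin(φ); sin(φ) cos(φ)]

# Given two known joints p and q and the link lengths r (from p) and s (from q),
# locate the third joint of the triangle via the cosine rule.
# (This picks one of the two mirror-symmetric solutions.)
function third_joint(p, q, r, s)
    d = norm(q - p)                                        # diagonal of the triangle
    α = acos(clamp((r^2 + d^2 - s^2) / (2r * d), -1, 1))   # interior angle at p
    u = (q - p) / d                                        # unit vector along the diagonal
    return p + r * rot(α) * u
end

# Made-up example: crank pivot at the origin, crank of length 1.5 at angle θ,
# a fixed frame point, and two links of lengths 4.0 and 3.5 meeting at the unknown joint.
θ     = 0.7
crank = 1.5 * [cos(θ), sin(θ)]
frame = [-3.0, 0.0]
joint = third_joint(crank, frame, 4.0, 3.5)
@show joint
```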
Schematic of the leg linkage and the angles used in the derivation.