Risky Business in Stochastic Control: Exponential Utility |12 Jan. 2020|

One of the most famous problems in linear control theory is that of designing a control law which minimizes some cost being quadratic in input and state. We all know this as the Linear Quadratic Regulator (LQR) problem. There is however one problem (some would call it a blessing) with this formulation once the dynamics contain some zero-mean noise: the control law is independent of this stochasticity. This is easy in proofs, easy in practice, but does it work well? Say, your control law is rather slow, but bringing you back to

in the end of time. In the noiseless case, no problem. But now say there is a substantial amount of noise, do you still want such a slow control law? The answer is most likely no since you quickly drift away. The classical LQR formulation does not differentiate between noise intensities and hence can be rightfully called naive as Bertsekas did (Bert76). Unfortunately, this name did not stuck.

$gamma_{T-1}(theta) = frac{2}{theta T}log mathbf{E}_{xi}left[mathrm{exp}left(frac{theta}{2}sum^{T-1}_{t=0}x_t^{top}Qx_t + u_t^{top}Ru_t right) right]=frac{2}{theta T}log mathbf{E}_{xi}left[mathrm{exp}left(frac{theta}{2}Psi right) right]$

and consider the problem of finding the minimizing policy $pi^{star}_{T-1}=u_0,u_1,dots,u_{T-1}$ in:

Which is precisely the classical LQR problem, but now with the cost wrapped in an exponential utility function parametrized by thetain mathbf{R}

. This problem was pioneered by most notably Peter Whittle (Wh90).

Before we consider solving this problem, let us interpret the risk parameter thetain mathbf{R}

. For

we speak of a risk-sensitive formulation while for theta<0

we are risk-seeking. This becomes especially clear when you solve the problem, but a quick way to see this is to consider the approximation of gamma(theta)

near

, which yields $gammaapprox mathbf{E}[Psi]+frac{1}{4}{theta}mathbf{E}[Psi]^2$ , so theta>0

relates to pessimism and theta<0

to optimism indeed. Here we skipped a few scaling factors, but the idea remains the same, for a derivation, consider cumulant generating functions and have a look at our problem.

As with standard LQR, we would like to obtain a stabilizing policy and to that end we will be mostly bothered with solving $limsup_{Tto infty}mathcal{J}_T$ . However, instead of immediately trying to solve the infinite horizon average cost Bellman equation it is easier to consider $gamma_{T}(theta)$ for finite

first. Then, when we can prove monotonicty and upper-bound $limsup_{Tto infty}mathcal{J}_{T}$ , the infinite horizon optimal policy is given by $lim_{Tto infty}pi_{T}^{star}$ . The reason being that monotonic sequences which are uniformly bounded converge.

The main technical tool towards finding the optimal policy is the following Lemma similar to one in the Appendix of (Jac73):
Lemma

Consider a noisy linear dynamical system defined by $x_{t+1}=Ax_k+Bu_t+Dxi_t$ with $xi_t {sim} mathcal{N}(0,Sigma_{xi}^{-1})$ and let |A|

be shorthand notation for mathrm{det}(A)

. Then, if $(Sigma_{xi}-theta D^{top}PD)^{-1}succ 0$ holds we have

$mathbf{E}_{xi}left[mathrm{exp}left(frac{theta}{2}x_{k+1}^{top}Px_{k+1} right)|x_kright] = frac{|(Sigma_{xi}-theta D^{top}PD)^{-1}|^{1/2}}{|Sigma_{xi}^{-1}|^{1/2}}mathrm{exp}left(frac{theta}{2}(Ax_k+Bu_k)^{top}widetilde{P}(Ax_k+Bu_k) right)$

Proof

Let $z:= mathbf{E}_{xi}left[mathrm{exp}left(frac{theta}{2}x_{t+1}^{top}Px_{t+1} right)|x_tright]$ and recall that the shorthand notation for mathrm{det}(A)

, then:

$begin{array}{ll} z &= frac{1}{(2pi)^{d/2}|Sigma_{xi}^{-1}|^{1/2}}int_{mathbf{R}^d} mathrm{exp}left(frac{theta}{2}x_{t+1}^{top}Px_{t+1} right)mathrm{exp}left(-frac{1}{2}xi_t^{top}Sigma_{xi}xi_t right) dxi_t &= frac{1}{(2pi)^{d/2}|Sigma_{xi}^{-1}|^{1/2}}Big{mathrm{exp}left(frac{theta}{2}(Ax_t+Bu_t)^{top}P(Ax_t+Bu_t) right) &quad cdotint_{mathbf{R}^d} mathrm{exp}left( theta (Ax_t+Bu_t)^{top}PDxi_t - frac{1}{2}xi_t^{top}(Sigma_{xi}-theta D^{top}PD)xi_tright)dxi_tBig} &=frac{|(Sigma_{xi}-theta D^{top}PD)^{-1}|^{1/2}}{|Sigma_{xi}^{-1}|^{1/2}}mathrm{exp}left(frac{theta}{2}(Ax_t+Bu_t)^{top}widetilde{P}(Ax_t+Bu_t) right) &quad cdot frac{1}{(2pi)^{d/2}|(Sigma_{xi}-theta D^{top}PD)^{-1}|^{1/2}}int_{mathbf{R}^d} mathrm{exp}left(-frac{1}{2}(xi_t-overline{xi}_t)^{top}(Sigma_{xi}-theta D^{top}PD)(xi_t-overline{xi}_t)right) dxi_t &=frac{|(Sigma_{xi}-theta D^{top}PD)^{-1}|^{1/2}}{|Sigma_{xi}^{-1}|^{1/2}}mathrm{exp}left(frac{theta}{2}(Ax_t+Bu_t)^{top}widetilde{P}(Ax_t+Bu_t) right). end{array}$

Here, the first step follows directly from

being a zero-mean Gaussian. In the second step we plug in $x_{t+1}=Ax_t+Bu_t+Dxi_t$ . Then, in the third step we introduce a variable $overline{xi}_t$ with the goal of making $(Sigma_{xi}-theta D^{top}PD)$ the covariance matrix of a Gaussian with mean $overline{xi}_t$ . We can make this work for

Using this approach we can integrate the latter part to

and end up with the final expression. Note that in this case the random variable xi_t

needs to be Gaussian, since the second to last expression in

equals

by being a Gaussian probability distribution integrated over its entire domain.

What is the point of doing this? Let f(x,u)=Ax+Bu+xi

and assume that $r_tmathrm{exp}big(frac{theta}{2} x^{top}P_txbig)$ represents the cost-to-go from stage

and state

. Then consider

$r_{t-1}mathrm{exp}left(frac{theta}{2}x^{top}P_{t-1} xright) = inf_{u} left{mathrm{exp}left(frac{theta}{2}(x^{top}Qx+u^{top}Ru)right)mathbf{E}_{xi}left[r_tmathrm{exp}left( frac{theta}{2}f(x,u)^{top}P_{t}f(x,u)right);|;xright]right}.$

Note, since we work with a sum within the exponent, we must multiply within the right-hand-side of the Bellman equation. From there it follows that $u^{star}_t=-(R+B^{top}widetilde{P}_tB)^{-1}B^{top}widetilde{P}_tAx_t$ , for

The key trick in simplifying your expressions is to apply the logarithm after minimizing over

such that the fraction of determinants becomes a state-independent affine term in the cost. Now, using a matrix inversion lemma and the push-through rule we can remove $widetilde{P}_t$ and construct a map $P_{t-1}=f(P_t)$ :

such that $u_t^{star} = -R^{-1}B^{top}P_t Lambda_t A x_t$ . See below for the derivations, many (if not all) texts skip them, but if you have never applied the push-through rule they are not that obvious.

As was pointed out for the first time by Jacobson (Jac73), these equations are precisely the ones we see in (non-cooperative) Dynamic Game theory for isotropic $Sigma_{xi}$ and appropriately scaled theta geq 0

Especially with this observation in mind there are many texts which show that $lim_{tdownarrow 0}P_t=P$ is well-defined and finite, which relates to finite cost and a stabilizing control law $u^{star}_t=-R^{-1}B^{top}PLambda^{-1}Ax_t$ . To formalize this, one needs to assume that (A,B,C)

is a minimal realization for

defined by $C^{top}C=Q$ . Then you can appeal to texts like (BB95).

Numerical Experiment and Robustness Interpretation
To show what happens, we do a small

-dimensional example. Here we want to solve the Risk-Sensitive (theta=0.0015

) and the Risk-neutral ( theta=0

) infinite-horizon average cost problem for

$A = left[ begin{array}{ll} 2 & 1 0 & 2 end{array}right],quad B = left[ begin{array}{l} 0 1 end{array}right],quad Sigma_{xi}^{-1} = left[ begin{array}{ll} 10^{-1} & 200^{-1} 200^{-1} & 10 end{array}right],$

. There is clearly a lot of noise, especially on the second signal, which also happens to be critical for controlling the first state. This makes it interesting. We compute $K^{star}|_{theta=0}=:K^{star}$ and $K^{star}|_{theta=0.0015}=:K^{star}_{theta}$ . Given the noise statistics, it would be reasonable to not take the certainty equivalence control law $K^{star}$ since you control the first state (which has little noise on its line) via the second state (which has a lot of noise on its line). Let $x^{(i)}$ be the $i^{mathrm{th}}$ state under $K^{star}$ and $x^{(i)}_{theta}$ the $i^{mathrm{th}}$ state under $K^{star}_{theta}$ .

We see in the plot below (for some arbitrary initial condition) typical behaviour, $K^{star}_{theta}$ does take the noise into account and indeed we see the $K^{star}_{theta}$ induces a smaller variance.

So, $K^{star}$ is more robust than $K^{star}$ in a particular way. It turns out that this can be neatly explained. To do so, we have to introduce the notion of Relative Entropy (Kullback-Leibler divergence). We will skip a few technical details (see the references for full details). Given a measure

, then for any other measure

, being absolutely continuous with respect to

(

), define the Relative Entropy as:

Now, for any measurable function Psi

, being bounded from below, it can be shown (see (DP96)) that

For the moment, think of Psi

as your standard finite-horizon LQR cost with product measure dnu

, then we see that an exponential utility results in the understanding that a control law which minimizes $log int e^{Psi}dmu$ is robust against adversarial noise generated by distributions sufficiently close (measured by

) to the reference

Here we skipped over a lot of technical details, but the intuition is beautiful, just changing the utility to the exponential function gives a wealth of deep distributional results we just touched upon in this post.

Simplification steps in $u_t^{star}$
We only show to get the simpler representation for the input, the approach to obtain $P_{k-1}=f(P_k)$ is very similar.

Note, we factored our P_t

since we cannot assume that P_t

is invertible. Our next tool is called the push-through rule. Given $Qin mathbf{R}^{mtimes n}$ and $Pin mathbf{R}^{ntimes m}$ we have

You can check that indeed Q(I_n+PQ) = (I_m +QP)Q

. Now, to continue, plug this expression for $widetilde{P}_t$ into the input expression:

$begin{array}{ll} -(R+B^{top}widetilde{P}_t B)^{-1}B^{top}widetilde{P}_t A &= -(R+B^{top}X^{-1}P_t B)^{-1}B^{top}X^{-1}P_t A &= -big( (I_m+B^{top}X^{-1}P_t BR^{-1})Rbig)^{-1}B^{top}X^{-1}P_t A &= -R^{-1}(I_m+B^{top}X^{-1}P_t BR^{-1})^{-1}B^{top}X^{-1}P_t A &= -R^{-1}B^{top}(I_n+X^{-1}P_t BR^{-1}B^{top})^{-1}X^{-1}P_t A &= -R^{-1}B^{top}(X+P_t BR^{-1}B^{top})^{-1}P_t A &= -R^{-1}B^{top}P_t Lambda_t A. end{array}$

(Bert76) Dimitri P. Bertsekas : ‘‘Dynamic Programming and Stochastic Control’’, 1976 Academic Press.
(Jac73) D. Jacobson : ‘‘Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games’’, 1973 IEEE TAC.
(Wh90) Peter Whittle : ‘‘Risk-sensitive Optimal Control’’, 1990 Wiley.
(BB95) Tamer Basar and Pierre Bernhard : ‘‘ $H_{infty}$ -Optimal Control and Related Minimax Design Problems A Dynamic Game Approach’’ 1995 Birkhauser.
(DP96) Paolo Dai Pra, Lorenzo Meneghini and Wolfgang J. Runggaldier :‘‘Connections Between Stochastic Control and Dynamic Games’’ 1996 Math. Control Signals Systems.