• # Paper 2, Section II, K

During each of $N$ time periods a venture capitalist, Vicky, is presented with an investment opportunity for which the rate of return for that period is a random variable; the rates of return in successive periods are independent identically distributed random variables with distributions concentrated on $[-1, \infty)$. Thus, if $x_{n}$ is Vicky's capital at period $n$, then $x_{n+1}=\left(1-p_{n}\right) x_{n}+p_{n} x_{n}\left(1+R_{n}\right)$, where $p_{n} \in[0,1]$ is the proportion of her capital she chooses to invest at period $n$, and $R_{n}$ is the rate of return for period $n$. Vicky desires to maximize her expected yield over $N$ periods, where the yield is defined as $\left(\frac{x_{N}}{x_{0}}\right)^{\frac{1}{N}}-1$, and $x_{0}$ and $x_{N}$ are respectively her initial and final capital.

(a) Express the problem of finding an optimal policy in a dynamic programming framework.

(b) Show that in each time period, the optimal strategy can be expressed in terms of the quantity $p^{*}$ which solves the optimization problem $\max _{p} \mathbb{E}\left(1+p R_{1}\right)^{1 / N}$. Show that $p^{*}>0$ if $\mathbb{E} R_{1}>0$. [Do not calculate $p^{*}$ explicitly.]

(c) Compare her optimal policy with the policy which maximizes her expected final capital $x_{N}$.
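A numerical sketch of part (b) (not part of the question; the two-point return distribution is an illustrative assumption): grid-search $p^{*}=\arg\max_{p} \mathbb{E}\left(1+p R_{1}\right)^{1/N}$ and check that $p^{*}>0$ when $\mathbb{E} R_{1}>0$.

```python
# Hypothetical return distribution: R = +0.5 w.p. 0.6, R = -0.5 w.p. 0.4,
# so E[R] = 0.1 > 0.  Grid-search p* = argmax_p E[(1 + p R)^(1/N)].
N = 10

def yield_factor(p):
    # E[(1 + p R)^(1/N)] for the two-point distribution above
    return 0.6 * (1 + 0.5 * p) ** (1 / N) + 0.4 * (1 - 0.5 * p) ** (1 / N)

p_star = max((i / 1000 for i in range(1001)), key=yield_factor)
assert p_star > 0        # consistent with E[R] > 0
```

Here the maximiser is interior ($p^{*} \approx 0.44$), and $p^{*}>0$ is as expected since the derivative of the objective at $p=0$ is $\mathbb{E} R_{1}/N>0$.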

• # Paper 3, Section II, K

A particle follows a discrete-time trajectory on $\mathbb{R}$ given by

$x_{t+1}=\left(A x_{t}+u_{t}\right) \xi_{t}+\epsilon_{t}$

for $t=1,2, \ldots, T$. Here $T \geqslant 2$ is a fixed integer, $A$ is a real constant, $x_{t}$ and $u_{t}$ are the position of the particle and control action at time $t$, respectively, and $\left(\xi_{t}, \epsilon_{t}\right)_{t=1}^{T}$ is a sequence of independent random vectors with

$\mathbb{E} \xi_{t}=\mathbb{E} \epsilon_{t}=0, \operatorname{var}\left(\xi_{t}\right)=V_{\xi}>0, \operatorname{var}\left(\epsilon_{t}\right)=V_{\epsilon}>0 \text { and } \operatorname{cov}\left(\xi_{t}, \epsilon_{t}\right)=0$

Find the optimal control, i.e. the control action $u_{t}$, defined as a function of $\left(x_{1}, \ldots, x_{t} ; u_{1}, \ldots, u_{t-1}\right)$, that minimizes

$\sum_{t=1}^{T} x_{t}^{2}+c \sum_{t=1}^{T-1} u_{t}^{2}$

where $c>0$ is given.

On which of $V_{\epsilon}$ and $V_{\xi}$ does the optimal control depend?

Find the limiting form of the optimal control as $T \rightarrow \infty$, and the minimal average cost per unit time.
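A sketch of the backward recursion behind the answer (the parameter values are illustrative assumptions, not from the question). Writing $F_{t}(x)=P_{t} x^{2}+\gamma_{t}$, the fact that $\mathbb{E} \xi_{t}=0$ gives the gain $K_{t}=-A P_{t+1} V_{\xi} /\left(c+P_{t+1} V_{\xi}\right)$, which involves $V_{\xi}$ but not $V_{\epsilon}$; $V_{\epsilon}$ enters only the constant $\gamma_{t}$.

```python
# Illustrative parameters (not from the question)
A, c, V_xi, V_eps, T = 0.9, 1.0, 2.0, 3.0, 50

P = 1.0                  # P_T = 1, from the terminal cost x_T^2
gains = []
for t in range(T - 1, 0, -1):
    gains.append(-A * P * V_xi / (c + P * V_xi))       # gain at time t
    P = 1 + A ** 2 * c * P * V_xi / (c + P * V_xi)     # Riccati-type step

K_limit = gains[-1]      # limiting control as T -> infinity: u_t = K_limit * x_t
avg_cost = P * V_eps     # minimal average cost per unit time, P_bar * V_eps
```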

• # Paper 4, Section II, 29K

A file of $X$ gigabytes (GB) is to be transmitted over a communications link. At each time $t$ the sender can choose a transmission rate $u(t)$ within the range $[0,1]$ GB per second. The charge for transmitting at rate $u(t)$ at time $t$ is $u(t) p(t)$. The function $p$ is fully known at time $t=0$. If it takes a total time $T$ to transmit the file then there is a delay cost of $\gamma T^{2}, \gamma>0$. Thus $u$ and $T$ are to be chosen to minimize

$\int_{0}^{T} u(t) p(t) d t+\gamma T^{2}$

where $u(t) \in[0,1], d x(t) / d t=-u(t), x(0)=X$ and $x(T)=0$. Using Pontryagin's maximum principle, or otherwise, show that a property of the optimal policy is that there exists $p^{*}$ such that $u(t)=1$ if $p(t)<p^{*}$ and $u(t)=0$ if $p(t)>p^{*}$.

Show that the optimal $p^{*}$ and $T$ are related by $p^{*}=p(T)+2 \gamma T$.

Suppose $p(t)=t+1 / t$ and $X=1$. Show that it is optimal to transmit at a constant rate $u(t)=1$ between times $T-1 \leqslant t \leqslant T$, where $T$ is the unique positive solution to the equation

$\frac{1}{(T-1) T}=2 \gamma T+1$
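A quick root-finding check ($\gamma$ is an illustrative choice, not given in the question): on $(1, \infty)$ the left side is decreasing and the right side increasing, so bisection locates the unique root $T>1$.

```python
gamma = 1.0                    # illustrative value

def g(T):
    # difference between the two sides of 1/((T-1)T) = 2*gamma*T + 1
    return 1.0 / ((T - 1.0) * T) - (2.0 * gamma * T + 1.0)

lo, hi = 1.0 + 1e-9, 10.0      # g(lo) > 0, g(hi) < 0, g decreasing on (1, oo)
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if g(mid) > 0:
        lo = mid
    else:
        hi = mid
T_opt = 0.5 * (lo + hi)        # transmission at rate 1 over [T_opt - 1, T_opt]
```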


• # Paper 2, Section II, K

Consider a Markov decision problem with finite state space $X$, value function $F$ and dynamic programming equation $F=\mathcal{L} F$, where

$(\mathcal{L} \phi)(i)=\min _{a \in\{0,1\}}\left\{c(i, a)+\beta \sum_{j \in X} P_{i j}(a) \phi(j)\right\} .$

Suppose $0<\beta<1$, and $|c(i, a)| \leqslant B$ for all $i \in X, a \in\{0,1\}$. Prove there exists a *deterministic stationary Markov* policy that is optimal, explaining what the italicised words mean.

Let $F_{n}=\mathcal{L}^{n} F_{0}$, where $F_{0}=0$, and $M_{n}=\max _{i \in X}\left|F(i)-F_{n}(i)\right|$. Prove that

$M_{n} \leqslant \beta M_{n-1} \leqslant \beta^{n} B /(1-\beta) .$

Deduce that the value iteration algorithm converges to an optimal policy in a finite number of iterations.
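A sketch illustrating the two bounds on a small random MDP (the instance is arbitrary: three states, two actions and the costs are made up for the demonstration):

```python
import random

random.seed(0)
beta, B = 0.9, 1.0
S, A = range(3), range(2)
c = {(i, a): random.uniform(-B, B) for i in S for a in A}   # |c| <= B
P = {}
for i in S:
    for a in A:
        w = [random.random() for _ in S]
        P[i, a] = [x / sum(w) for x in w]                   # transition rows

def L(phi):
    # one step of the dynamic programming operator
    return [min(c[i, a] + beta * sum(P[i, a][j] * phi[j] for j in S) for a in A)
            for i in S]

F = [0.0] * 3
for _ in range(2000):          # approximate the fixed point F = L F
    F = L(F)

Fn, M_prev = [0.0] * 3, None
for n in range(1, 30):
    Fn = L(Fn)
    M = max(abs(F[i] - Fn[i]) for i in S)
    assert M <= beta ** n * B / (1 - beta) + 1e-9   # M_n <= beta^n B/(1-beta)
    if M_prev is not None:
        assert M <= beta * M_prev + 1e-9            # M_n <= beta M_{n-1}
    M_prev = M
```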

• # Paper 3, Section II, K

Consider the system in scalar variables, for $t=1,2, \ldots, h$ :

\begin{aligned} x_{t} &=x_{t-1}+u_{t-1} \\ y_{t} &=x_{t-1}+\eta_{t} \\ \hat{x}_{0} &=x_{0}+\eta_{0} \end{aligned}

where $\hat{x}_{0}$ is given, $y_{t}, u_{t}$ are observed at $t$, but $x_{0}, x_{1}, \ldots$ and $\eta_{0}, \eta_{1}, \ldots$ are unobservable, and $\eta_{0}, \eta_{1}, \ldots$ are independent random variables with mean 0 and variance $v$. Define $\hat{x}_{t-1}$ to be the estimator of $x_{t-1}$ with minimum variance amongst all estimators that are unbiased and linear functions of $W_{t-1}=\left(\hat{x}_{0}, y_{1}, \ldots, y_{t-1}, u_{0}, \ldots, u_{t-2}\right)$. Suppose $\hat{x}_{t-1}=a^{T} W_{t-1}$ and its variance is $V_{t-1}$. After observation at $t$ of $\left(y_{t}, u_{t-1}\right)$, a new unbiased estimator of $x_{t-1}$, linear in $W_{t}$, is expressed

$x_{t-1}^{*}=(1-H) b^{T} W_{t-1}+H y_{t}$

Find $b$ and $H$ to minimize the variance of $x_{t-1}^{*}$. Hence find $\hat{x}_{t}$ in terms of $\hat{x}_{t-1}, y_{t}, u_{t-1}$, $V_{t-1}$ and $v$. Calculate $V_{h}$.

Suppose $\eta_{0}, \eta_{1}, \ldots$ are Gaussian and thus $\hat{x}_{t}=E\left[x_{t} \mid W_{t}\right]$. Consider minimizing $E\left[x_{h}^{2}+\sum_{t=0}^{h-1} u_{t}^{2}\right]$, under the constraint that the control $u_{t}$ can only depend on $W_{t}$. Show that the value function of dynamic programming for this problem can be expressed

$F\left(W_{t}\right)=\Pi_{t} \hat{x}_{t}^{2}+\cdots$

where $F\left(W_{h}\right)=\hat{x}_{h}^{2}+V_{h}$ and $+\cdots$ is independent of $W_{t}$ and linear in $v$.

• # Paper 4, Section II, K

State transversality conditions that can be used with Pontryagin's maximum principle and say when they are helpful.

Given $T$, it is desired to maximize $c_{1} x_{1}(T)+c_{2} x_{2}(T)$, where

\begin{aligned} &\dot{x}_{1}=u_{1}\left(a_{1} x_{1}+a_{2} x_{2}\right), \\ &\dot{x}_{2}=u_{2}\left(a_{1} x_{1}+a_{2} x_{2}\right), \end{aligned}

and $u=\left(u_{1}, u_{2}\right)$ is a time-varying control such that $u_{1} \geqslant 0, u_{2} \geqslant 0$ and $u_{1}+u_{2}=1$. Suppose that $x_{1}(0)$ and $x_{2}(0)$ are positive, and that $0<a_{2}<a_{1}$ and $0<c_{1}<c_{2}$. Find the optimal control at times close to $T$. Show that over $[0, T]$ the optimal control is constant, or makes exactly one switch, the latter happening if and only if

$c_{2} e^{a_{2} T}<c_{1}+\frac{a_{1}}{a_{2}} c_{2}\left(e^{a_{2} T}-1\right)$


• # Paper 2, Section II, 26K

As a function of policy $\pi$ and initial state $x$, let

$F(\pi, x)=E_{\pi}\left[\sum_{t=0}^{\infty} \beta^{t} r\left(x_{t}, u_{t}\right) \mid x_{0}=x\right]$

where $\beta \geqslant 1$ and $r(x, u) \geqslant 0$ for all $x, u$. Suppose that for a specific policy $\pi$, and all $x$,

$F(\pi, x)=\sup _{u}\left\{r(x, u)+\beta E\left[F\left(\pi, x_{1}\right) \mid x_{0}=x, u_{0}=u\right]\right\} .$

Prove that $F(\pi, x) \geqslant F\left(\pi^{\prime}, x\right)$ for all $\pi^{\prime}$ and $x$.

A gambler plays games in which he may bet 1 or 2 pounds, but no more than his present wealth. Suppose he has $x_{t}$ pounds after $t$ games. If he bets $i$ pounds then $x_{t+1}=x_{t}+i$, or $x_{t+1}=x_{t}-i$, with probabilities $p_{i}$ and $1-p_{i}$ respectively. Gambling terminates at the first $\tau$ such that $x_{\tau}=0$ or $x_{\tau}=100$. His final reward is $(9 / 8)^{\tau / 2} x_{\tau}$. Let $\pi$ be the policy of always betting 1 pound. Given $p_{1}=1 / 3$, show that $F(\pi, x) \propto x 2^{x / 2}$.

Is $\pi$ optimal when $p_{2}=1 / 4$ ?
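A numerical check of the claimed form (the recursion below is the one satisfied by $F(\pi, \cdot)$, since one further game multiplies the discount by $(9/8)^{1/2}$ and moves the wealth by $\pm 1$ with probabilities $1/3$ and $2/3$):

```python
# Verify that f(x) = x * 2^(x/2) satisfies
# f(x) = (9/8)^(1/2) * ((1/3) f(x+1) + (2/3) f(x-1)),  with f(0) = 0.
f = lambda x: x * 2 ** (x / 2)
rhs = lambda x: (9 / 8) ** 0.5 * ((1 / 3) * f(x + 1) + (2 / 3) * f(x - 1))
for x in range(1, 100):
    assert abs(f(x) - rhs(x)) <= 1e-9 * f(x)
```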

• # Paper 3, Section II, K

A burglar having wealth $x$ may retire, or go burgling another night, in either of towns 1 or 2 . If he burgles in town $i$ then with probability $p_{i}=1-q_{i}$ he will, independently of previous nights, be caught, imprisoned and lose all his wealth. If he is not caught then his wealth increases by 0 or $2 a_{i}$, each with probability $1 / 2$ and independently of what happens on other nights. Values of $p_{i}$ and $a_{i}$ are the same every night. He wishes to maximize his expected wealth at the point he retires, is imprisoned, or $s$ nights have elapsed.

Using the dynamic programming equation

$F_{s}(x)=\max \left\{x, q_{1} E F_{s-1}\left(x+R_{1}\right), q_{2} E F_{s-1}\left(x+R_{2}\right)\right\}$

with $R_{j}, F_{0}(x)$ appropriately defined, prove that there exists an optimal policy under which he burgles another night if and only if his wealth is less than $x^{*}=\max _{i}\left\{a_{i} q_{i} / p_{i}\right\}$.

Suppose $q_{1}>q_{2}$ and $q_{1} a_{1}>q_{2} a_{2}$. Prove that he should never burgle in town 2 .

[Hint: Suppose $x<x^{*}$, there are $s$ nights to go, and it has been shown that he ought not burgle in town 2 if less than $s$ nights remain. For the case $a_{2}>a_{1}$, separately consider subcases $x+2 a_{2} \geqslant x^{*}$ and $x+2 a_{2}<x^{*}$. An interchange argument may help.]

• # Paper 4, Section II, 25K

Consider the scalar system evolving as

$x_{t}=x_{t-1}+u_{t-1}+\epsilon_{t}, \quad t=1,2, \ldots,$

where $\left\{\epsilon_{t}\right\}_{t=1}^{\infty}$ is a white noise sequence with $E \epsilon_{t}=0$ and $E \epsilon_{t}^{2}=v$. It is desired to choose controls $\left\{u_{t}\right\}_{t=0}^{h-1}$ to minimize $E\left[\sum_{t=0}^{h-1}\left(\frac{1}{2} x_{t}^{2}+u_{t}^{2}\right)+x_{h}^{2}\right]$. Show that for $h=6$ the minimal cost is $x_{0}^{2}+6 v$.
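A sketch of the backward recursion that yields this, in exact arithmetic: with $F_{t}(x)=P_{t} x^{2}+g_{t} v$, minimising $u^{2}+P(x+u)^{2}$ over $u$ gives value $P x^{2}/(1+P)$, so $P_{t}=\frac{1}{2}+P_{t+1}/(1+P_{t+1})$ and $g_{t}=g_{t+1}+P_{t+1}$; the recursion fixes $P_{t} \equiv 1$, and six steps give $g_{0}=6$.

```python
from fractions import Fraction as Fr

P, g = Fr(1), Fr(0)                # terminal values: P_6 = 1, g_6 = 0
for t in range(5, -1, -1):
    g = g + P                      # coefficient of v accumulates P_{t+1}
    P = Fr(1, 2) + P / (1 + P)     # P_t = 1/2 + P_{t+1} / (1 + P_{t+1})
assert P == 1 and g == 6           # minimal cost = x_0^2 + 6 v
```

The same calculation suggests $\phi(x)=x^{2}$ and $\lambda=v$ for the average-cost equation in the next part.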

Find a constant $\lambda$ and a function $\phi$ which solve

$\phi(x)+\lambda=\min _{u}\left[\frac{1}{2} x^{2}+u^{2}+E \phi\left(x+u+\epsilon_{1}\right)\right]$

Let $P$ be the class of those policies for which every $u_{t}$ obeys the constraint $\left(x_{t}+u_{t}\right)^{2} \leqslant(0.9) x_{t}^{2}$. Show that $E_{\pi} \phi\left(x_{t}\right) \leqslant x_{0}^{2}+10 v$, for all $\pi \in P$. Find, and prove optimal, a policy which over all $\pi \in P$ minimizes

$\lim _{h \rightarrow \infty} \frac{1}{h} E_{\pi}\left[\sum_{t=0}^{h-1}\left(\frac{1}{2} x_{t}^{2}+u_{t}^{2}\right)\right]$


• # Paper 2, Section II, J

Describe the elements of a discrete-time stochastic dynamic programming equation for the problem of maximizing the expected sum of non-negative rewards over an infinite horizon. Give an example to show that there may not exist an optimal policy. Prove that if a policy has a value function that satisfies the dynamic programming equation then the policy is optimal.

A squirrel collects nuts for the coming winter. There are plenty of nuts lying around, but each time the squirrel leaves its lair it risks being caught by a predator. Assume that the outcomes of the squirrel's journeys are independent, that it is caught with probability $p$, and that it returns safely with a random weight of nuts, exponentially distributed with parameter $\lambda$. By solving the dynamic programming equation for the value function $F(x)$, find a policy maximizing the expected weight of nuts collected for the winter. Here the state variable $x$ takes values in $\mathbb{R}_{+}$(the weight of nuts so far collected) or $-1$ (a no-return state when the squirrel is caught).

• # Paper 3, Section II, J

A particle follows a discrete-time trajectory on $\mathbb{R}$ given by

$x_{t+1}=A x_{t}+\xi_{t} u_{t}+\epsilon_{t}$

for $t=1,2, \ldots, T$, where $T \geqslant 2$ is a fixed integer, $A$ is a real constant, $x_{t}$ is the position of the particle and $u_{t}$ is the control action at time $t$, and $\left(\xi_{t}, \epsilon_{t}\right)_{t=1}^{T}$ is a sequence of independent random vectors with $\mathbb{E} \xi_{t}=\mathbb{E} \epsilon_{t}=0, \operatorname{var}\left(\xi_{t}\right)=V_{\xi}>0, \operatorname{var}\left(\epsilon_{t}\right)=V_{\epsilon}>0$ and $\operatorname{cov}\left(\xi_{t}, \epsilon_{t}\right)=0$.

Find the closed-loop control, i.e. the control action $u_{t}$ defined as a function of $\left(x_{1}, \ldots, x_{t} ; u_{1}, \ldots, u_{t-1}\right)$, that minimizes

$\sum_{t=1}^{T} x_{t}^{2}+c \sum_{t=1}^{T-1} u_{t}$

where $c>0$ is given. [Note that this function is quadratic in $x$, but linear in $u$.]

Does the closed-loop control depend on $V_{\epsilon}$ or on $V_{\xi}$ ? Deduce the form of the optimal open-loop control.

• # Paper 4, Section II, J

A girl begins swimming from a point $(0,0)$ on the bank of a straight river. She swims at a constant speed $v$ relative to the water. The speed of the downstream current at a distance $y$ from the shore is $c(y)$. Hence her trajectory is described by

$\dot{x}=v \cos \theta+c(y), \quad \dot{y}=v \sin \theta$

where $\theta$ is the angle at which she swims relative to the direction of the current.

She desires to reach a downstream point $(1,0)$ on the same bank as she starts, as quickly as possible. Construct the Hamiltonian for this problem, and describe how Pontryagin's maximum principle can be used to give necessary conditions that must hold on an optimal trajectory. Given that $c(y)$ is positive, increasing and differentiable in $y$, show that on an optimal trajectory

$\frac{d}{d t} \tan (\theta(t))=-c^{\prime}(y(t))$


• # Paper 2, Section II, K

Suppose $\left\{x_{t}\right\}_{t \geqslant 0}$ is a Markov chain. Consider the dynamic programming equation

$F_{s}(x)=\max \left\{r(x), \beta E\left[F_{s-1}\left(x_{1}\right) \mid x_{0}=x\right]\right\}, \quad s=1,2, \ldots,$

with $r(x)>0, \beta \in(0,1)$, and $F_{0}(x)=0$. Prove that:

(i) $F_{s}(x)$ is nondecreasing in $s$;

(ii) $F_{s}(x) \leqslant F(x)$, where $F(x)$ is the value function of an infinite-horizon problem that you should describe;

(iii) $F_{\infty}(x)=\lim _{s \rightarrow \infty} F_{s}(x)=F(x)$.

A coin lands heads with probability $p$. A statistician wishes to choose between: $H_{0}: p=1 / 3$ and $H_{1}: p=2 / 3$, one of which is true. Prior probabilities of $H_{1}$ and $H_{0}$ in the ratio $x: 1$ change after one toss of the coin to ratio $2 x: 1$ (if the toss was a head) or to ratio $x: 2$ (if the toss was a tail). What problem is being addressed by the following dynamic programming equation?

$F(x)=\max \left\{\frac{1}{1+x}, \frac{x}{1+x}, \beta\left[\left(\frac{1}{1+x} \frac{2}{3}+\frac{x}{1+x} \frac{1}{3}\right) F(x / 2)+\left(\frac{1}{1+x} \frac{1}{3}+\frac{x}{1+x} \frac{2}{3}\right) F(2 x)\right]\right\}$

Prove that $G(x)=(1+x) F(x)$ is a convex function of $x$.

By sketching a graph of $G$, describe the form of the optimal policy.

• # Paper 3, Section II, K

A particle follows a discrete-time trajectory in $\mathbb{R}^{2}$ given by

$\left(\begin{array}{l} x_{t+1} \\ y_{t+1} \end{array}\right)=\left(\begin{array}{ll} 1 & 1 \\ 0 & 1 \end{array}\right)\left(\begin{array}{l} x_{t} \\ y_{t} \end{array}\right)+\left(\begin{array}{c} t \\ 1 \end{array}\right) u_{t}+\left(\begin{array}{c} \epsilon_{t} \\ 0 \end{array}\right)$

where $\left\{\epsilon_{t}\right\}$ is a white noise sequence with $E \epsilon_{t}=0$ and $E \epsilon_{t}^{2}=v$. Given $\left(x_{0}, y_{0}\right)$, we wish to choose $\left\{u_{t}\right\}_{t=0}^{9}$ to minimize $C=E\left[x_{10}^{2}+\sum_{t=0}^{9} u_{t}^{2}\right]$.

Show that for some $\left\{a_{t}\right\}$ this problem can be reduced to one of controlling a scalar state $\xi_{t}=x_{t}+a_{t} y_{t}$.

Find, in terms of $x_{0}, y_{0}$, the optimal $u_{0}$. What is the change in minimum $C$ achievable when the system starts in $\left(x_{0}, y_{0}\right)$ as compared to when it starts in $(0,0)$ ?

Consider now a trajectory starting at $\left(x_{-1}, y_{-1}\right)=(11,-1)$. What value of $u_{-1}$ is optimal if we wish to minimize $5 u_{-1}^{2}+C$ ?

• # Paper 4, Section II, K

Given $r, \rho, \mu, T$, all positive, it is desired to choose $u(t)>0$ to maximize

$\mu x(T)+\int_{0}^{T} e^{-\rho t} \log u(t) d t$

subject to $\dot{x}(t)=r x(t)-u(t), x(0)=10$.

Explain what Pontryagin's maximum principle guarantees about a solution to this problem.

Show that no matter whether $x(T)$ is constrained or unconstrained there is a constant $\alpha$ such that the optimal control is of the form $u(t)=\alpha e^{-(\rho-r) t}$. Find an expression for $\alpha$ under the constraint $x(T)=5$.

Show that if $x(T)$ is unconstrained then $\alpha=(1 / \mu) e^{-r T}$.


• # Paper 2, Section II, J

Describe the elements of a generic stochastic dynamic programming equation for the problem of maximizing the expected sum of discounted rewards accrued at times $0,1, \ldots$ What is meant by the *positive case*? What is specially true in this case that is not true in general?

An investor owns a single asset which he may sell once, on any of the days $t=0,1, \ldots$. On day $t$ he will be offered a price $X_{t}$. This value is unknown until day $t$, is independent of all other offers, and a priori it is uniformly distributed on $[0,1]$. Offers remain open, so that on day $t$ he may sell the asset for the best of the offers made on days $0, \ldots, t$. If he sells for $x$ on day $t$ then the reward is $x \beta^{t}$. Show from first principles that if $0<\beta<1$ then there exists $\bar{x}$ such that the expected reward is maximized by selling the first day the offer is at least $\bar{x}$.

For $\beta=4 / 5$, find both $\bar{x}$ and the expected reward under the optimal policy.
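A grid value-iteration sketch for $\beta=4/5$ (grid size and iteration count are arbitrary choices); it recovers $\bar{x}=1/2$, the root in $(0,1)$ of $\bar{x}=\beta \mathbb{E} \max (\bar{x}, U)=\beta\left(1+\bar{x}^{2}\right)/2$, and expected reward $5/8$.

```python
beta, n = 0.8, 1000
xs = [(i + 0.5) / n for i in range(n)]    # midpoint grid for U ~ Uniform[0,1]
F = [0.0] * n
for _ in range(300):
    suf = [0.0] * (n + 1)                 # suffix sums of F
    for j in range(n - 1, -1, -1):
        suf[j] = suf[j + 1] + F[j]
    # E[F(max(x_i, U))] ~ (i * F[i] + sum_{j >= i} F[j]) / n
    F = [max(x, beta * (i * F[i] + suf[i]) / n) for i, x in enumerate(xs)]

x_bar = next(x for i, x in enumerate(xs) if F[i] <= x + 1e-9)  # sell iff offer >= x_bar
expected_reward = sum(F) / n                                   # E[F(X_0)]
```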

Explain what is special about the case $\beta=1$.

• # Paper 3, Section II, J

A state variable $x=\left(x_{1}, x_{2}\right) \in \mathbb{R}^{2}$ is subject to dynamics

\begin{aligned} &\dot{x}_{1}(t)=x_{2}(t) \\ &\dot{x}_{2}(t)=u(t) \end{aligned}

where $u=u(t)$ is a scalar control variable constrained to the interval $[-1,1]$. Given an initial value $x(0)=\left(x_{1}, x_{2}\right)$, let $F\left(x_{1}, x_{2}\right)$ denote the minimal time required to bring the state to $(0,0)$. Prove that

$\max _{u \in[-1,1]}\left\{-x_{2} \frac{\partial F}{\partial x_{1}}-u \frac{\partial F}{\partial x_{2}}-1\right\}=0$

Explain how this equation figures in Pontryagin's maximum principle.

Use Pontryagin's maximum principle to show that, on an optimal trajectory, $u(t)$ only takes the values 1 and $-1$, and that it makes at most one switch between them.

Show that $u(t)=1,0 \leqslant t \leqslant 2$ is optimal when $x(0)=(2,-2)$.

Find the optimal control when $x(0)=(7,-2)$.

• # Paper 4, Section II, J

A factory has a tank of capacity $3 \mathrm{~m}^{3}$ in which it stores chemical waste. Each week the factory produces, independently of other weeks, an amount of waste that is equally likely to be 0,1 , or $2 \mathrm{~m}^{3}$. If the amount of waste exceeds the remaining space in the tank then the excess must be specially handled at a cost of $£ C$ per $\mathrm{m}^{3}$. The tank may be emptied or not at the end of each week. Emptying costs $£ D$, plus a variable cost of $£ \alpha$ for each $\mathrm{m}^{3}$ of its content. It is always emptied when it ends the week full.

It is desired to minimize the average cost per week. Write down equations from which one can determine when it is optimal to empty the tank.

Find the average cost per week of a policy $\pi$, which empties the tank if and only if its content at the end of the week is 2 or $3 \mathrm{~m}^{3}$.

Describe the policy improvement algorithm. Explain why, starting from $\pi$, this algorithm will find an optimal policy in at most three iterations.

Prove that $\pi$ is optimal if and only if $C \geqslant \alpha+(4 / 3) D$.


• # Paper 2, Section II, K

Consider an optimal stopping problem in which the optimality equation takes the form

$F_{t}(x)=\max \left\{r(x), E\left[F_{t+1}\left(x_{t+1}\right)\right]\right\}, \quad t=1, \ldots, N-1,$

$F_{N}(x)=r(x)$, and where $r(x)>0$ for all $x$. Let $S$ denote the stopping set of the one-step-look-ahead rule. Show that if $S$ is closed (in a sense you should explain) then the one-step-look-ahead rule is optimal.

$N$ biased coins are to be tossed successively. The probability that the $i$ th coin toss will show a head is known to be $p_{i}\left(0<p_{i}<1\right)$. At most once, after observing a head, and before tossing the next coin, you may guess that you have just seen the last head (i.e. that all subsequent tosses will show tails). If your guess turns out to be correct then you win $£ 1$.

Suppose that you have not yet guessed 'last head', and the $i$ th toss is a head. Show that it cannot be optimal to guess that this is the last head if

$\frac{p_{i+1}}{q_{i+1}}+\cdots+\frac{p_{N}}{q_{N}}>1$

where $q_{j}=1-p_{j}$.

Suppose that $p_{i}=1 / i$. Show that it is optimal to guess that the last head is the first head (if any) to occur after having tossed at least $i^{*}$ coins, where $i^{*} \approx N / e$ when $N$ is large.
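A quick check of the asymptotics: with $p_{i}=1/i$ we have $p_{j}/q_{j}=1/(j-1)$, so the rule guesses at the first head occurring at a toss $i$ with $\sum_{j=i+1}^{N} 1/(j-1) \leqslant 1$, i.e. $i \gtrsim N/e$.

```python
import math

N = 10_000
tail, i_star = 0.0, None
for i in range(N - 1, 0, -1):    # S(i) = sum_{j=i+1}^{N} p_j/q_j = S(i+1) + 1/i
    tail += 1.0 / i
    if tail > 1.0:
        i_star = i + 1           # smallest i with S(i) <= 1
        break
assert abs(i_star / N - 1 / math.e) < 1e-3    # i* is close to N/e
```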

• # Paper 3, Section II, 28K

An observable scalar state variable evolves as $x_{t+1}=x_{t}+u_{t}, t=0,1, \ldots$ Let controls $u_{0}, u_{1}, \ldots$ be determined by a policy $\pi$ and define

$C_{s}\left(\pi, x_{0}\right)=\sum_{t=0}^{s-1}\left(x_{t}^{2}+2 x_{t} u_{t}+7 u_{t}^{2}\right) \quad \text { and } \quad C_{s}\left(x_{0}\right)=\inf _{\pi} C_{s}\left(\pi, x_{0}\right)$

Show that it is possible to express $C_{s}\left(x_{0}\right)$ in terms of $\Pi_{s}$, which satisfies the recurrence

$\Pi_{s}=\frac{6\left(1+\Pi_{s-1}\right)}{7+\Pi_{s-1}}, \quad s=1,2, \ldots$

with $\Pi_{0}=0$.

Deduce that $C_{\infty}\left(x_{0}\right) \geqslant 2 x_{0}^{2}$. [Here $C_{\infty}\left(x_{0}\right)$ is defined as $\lim _{s \rightarrow \infty} C_{s}\left(x_{0}\right)$.]

By considering the policy $\pi^{*}$ which takes $u_{t}=-(1 / 3)(2 / 3)^{t} x_{0}, t=0,1, \ldots$, show that $C_{\infty}\left(x_{0}\right)=2 x_{0}^{2}$.

Give an alternative description of $\pi^{*}$ in closed-loop form.
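The recurrence can be checked numerically: the fixed point of $\Pi=6(1+\Pi)/(7+\Pi)$ satisfies $\Pi^{2}+\Pi-6=0$, whose positive root is $\Pi=2$, consistent with $C_{\infty}\left(x_{0}\right)=2 x_{0}^{2}$.

```python
# Iterate Pi_s = 6 (1 + Pi_{s-1}) / (7 + Pi_{s-1}) from Pi_0 = 0.
Pi = 0.0
for s in range(1, 200):
    new = 6 * (1 + Pi) / (7 + Pi)
    assert new >= Pi              # sequence increases towards the fixed point
    Pi = new
assert abs(Pi - 2.0) < 1e-9       # positive root of Pi^2 + Pi - 6 = 0
```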

• # Paper 4, Section II, K

Describe the type of optimal control problem that is amenable to analysis using Pontryagin's Maximum Principle.

A firm has the right to extract oil from a well over the interval $[0, T]$. The oil can be sold at price $£ p$ per unit. To extract oil at rate $u$ when the remaining quantity of oil in the well is $x$ incurs cost at rate $£ u^{2} / x$. Thus the problem is one of maximizing

$\int_{0}^{T}\left[p u(t)-\frac{u(t)^{2}}{x(t)}\right] d t$

subject to $d x(t) / d t=-u(t), u(t) \geqslant 0, x(t) \geqslant 0$. Formulate the Hamiltonian for this problem.

Explain why $\lambda(t)$, the adjoint variable, has a boundary condition $\lambda(T)=0$.

Use Pontryagin's Maximum Principle to show that under optimal control

$\lambda(t)=p-\frac{1}{1 / p+(T-t) / 4}$

and

$\frac{d x(t)}{d t}=-\frac{2 p x(t)}{4+p(T-t)}$

Find the oil remaining in the well at time $T$, as a function of $x(0), p$, and $T$.
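Separating variables in the last displayed equation gives $x(T)=x(0)\left(\frac{4}{4+p T}\right)^{2}$; a quick RK4 check (the values of $p, T, x(0)$ are illustrative):

```python
p, T, x0, n = 2.0, 3.0, 10.0, 10_000
h = T / n

def f(t, x):
    # right-hand side of dx/dt = -2 p x / (4 + p (T - t))
    return -2.0 * p * x / (4.0 + p * (T - t))

x, t = x0, 0.0
for _ in range(n):                      # classical 4th-order Runge-Kutta
    k1 = f(t, x)
    k2 = f(t + h / 2, x + h * k1 / 2)
    k3 = f(t + h / 2, x + h * k2 / 2)
    k4 = f(t + h, x + h * k3)
    x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    t += h

closed_form = x0 * (4.0 / (4.0 + p * T)) ** 2
assert abs(x - closed_form) < 1e-6      # numerical and analytic answers agree
```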


• # Paper 2, Section II, J

(a) Suppose that

$\left(\begin{array}{l} X \\ Y \end{array}\right) \sim N\left(\left(\begin{array}{l} \mu_{X} \\ \mu_{Y} \end{array}\right),\left(\begin{array}{ll} V_{X X} & V_{X Y} \\ V_{Y X} & V_{Y Y} \end{array}\right)\right)$

Prove that conditional on $Y=y$, the distribution of $X$ is again multivariate normal, with mean $\mu_{X}+V_{X Y} V_{Y Y}^{-1}\left(y-\mu_{Y}\right)$ and covariance $V_{X X}-V_{X Y} V_{Y Y}^{-1} V_{Y X}$.

(b) The $\mathbb{R}^{d}$-valued process $X$ evolves in discrete time according to the dynamics

$X_{t+1}=A X_{t}+\varepsilon_{t+1}$

where $A$ is a constant $d \times d$ matrix, and $\varepsilon_{t}$ are independent, with common $N\left(0, \Sigma_{\varepsilon}\right)$ distribution. The process $X$ is not observed directly; instead, all that is seen is the process $Y$ defined as

$Y_{t}=C X_{t}+\eta_{t},$

where $\eta_{t}$ are independent of each other and of the $\varepsilon_{t}$, with common $N\left(0, \Sigma_{\eta}\right)$ distribution.

If the observer has the prior distribution $X_{0} \sim N\left(\hat{X}_{0}, V_{0}\right)$ for $X_{0}$, prove that at all later times the distribution of $X_{t}$ conditional on $\mathcal{Y}_{t} \equiv\left(Y_{1}, \ldots, Y_{t}\right)$ is again normally distributed, with mean $\hat{X}_{t}$ and covariance $V_{t}$ which evolve as

\begin{aligned} \hat{X}_{t+1} &=A \hat{X}_{t}+M_{t} C^{T}\left(\Sigma_{\eta}+C M_{t} C^{T}\right)^{-1}\left(Y_{t+1}-C A \hat{X}_{t}\right) \\ V_{t+1} &=M_{t}-M_{t} C^{T}\left(\Sigma_{\eta}+C M_{t} C^{T}\right)^{-1} C M_{t} \end{aligned}

where

$M_{t}=A V_{t} A^{T}+\Sigma_{\varepsilon}$

(c) In the special case where both $X$ and $Y$ are one-dimensional, and $A=C=1$, $\Sigma_{\varepsilon}=0$, find the form of the updating recursion. Show in particular that

$\frac{1}{V_{t+1}}=\frac{1}{V_{t}}+\frac{1}{\Sigma_{\eta}}$

and that

$\frac{\hat{X}_{t+1}}{V_{t+1}}=\frac{\hat{X}_{t}}{V_{t}}+\frac{Y_{t+1}}{\Sigma_{\eta}}$

Hence deduce that, with probability one,

$\lim _{t \rightarrow \infty} \hat{X}_{t}=\lim _{t \rightarrow \infty} t^{-1} \sum_{j=1}^{t} Y_{j}$
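A simulation sketch of part (c) (parameter values are illustrative): the recursions reduce to $1/V_{t+1}=1/V_{t}+1/\Sigma_{\eta}$ and $\hat{X}_{t+1}/V_{t+1}=\hat{X}_{t}/V_{t}+Y_{t+1}/\Sigma_{\eta}$, so $\hat{X}_{t}$ approaches the running mean of the observations.

```python
import random

random.seed(1)
Sigma_eta, X, t_max = 2.0, 1.7, 20_000    # illustrative values
Xhat, V = 0.0, 100.0                      # prior X_0 ~ N(0, 100)
ys = []
for _ in range(t_max):
    y = X + random.gauss(0.0, Sigma_eta ** 0.5)   # Y_t = X + eta_t (A = C = 1)
    ys.append(y)
    V_new = 1.0 / (1.0 / V + 1.0 / Sigma_eta)     # precisions add
    Xhat = V_new * (Xhat / V + y / Sigma_eta)     # precision-weighted update
    V = V_new
assert abs(Xhat - sum(ys) / len(ys)) < 1e-3       # close to the sample mean
```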

• # Paper 3, Section II, J

Consider an infinite-horizon controlled Markov process having per-period costs $c(x, u) \geqslant 0$, where $x \in \mathcal{X}$ is the state of the system, and $u \in \mathcal{U}$ is the control. Costs are discounted at rate $\beta \in(0,1]$, so that the objective to be minimized is

$\mathbb{E}\left[\sum_{t \geqslant 0} \beta^{t} c\left(X_{t}, u_{t}\right) \mid X_{0}=x\right]$

What is meant by a policy $\pi$ for this problem?

Let $\mathcal{L}$ denote the dynamic programming operator

$\mathcal{L} f(x) \equiv \inf _{u \in \mathcal{U}}\left\{c(x, u)+\beta \mathbb{E}\left[f\left(X_{1}\right) \mid X_{0}=x, u_{0}=u\right]\right\}$

Further, let $F$ denote the value of the optimal control problem:

$F(x)=\inf _{\pi} \mathbb{E}^{\pi}\left[\sum_{t \geqslant 0} \beta^{t} c\left(X_{t}, u_{t}\right) \mid X_{0}=x\right]$

where the infimum is taken over all policies $\pi$, and $\mathbb{E}^{\pi}$ denotes expectation under policy $\pi$. Show that the functions $F_{t}$ defined by

$F_{t+1}=\mathcal{L} F_{t} \quad(t \geqslant 0), \quad F_{0} \equiv 0$

increase to a limit $F_{\infty} \in[0, \infty]$. Prove that $F_{\infty} \leqslant F$. Prove that $F=\mathcal{L} F$.

Suppose that $\Phi=\mathcal{L} \Phi \geqslant 0$. Prove that $\Phi \geqslant F$.

[You may assume that there is a function $u_{*}: \mathcal{X} \rightarrow \mathcal{U}$ such that

$\mathcal{L} \Phi(x)=c\left(x, u_{*}(x)\right)+\beta \mathbb{E}\left[\Phi\left(X_{1}\right) \mid X_{0}=x, u_{0}=u_{*}(x)\right]$

though the result remains true without this simplifying assumption.]

• # Paper 4, Section II, J

Dr Seuss' wealth $x_{t}$ at time $t$ evolves as

$\frac{d x}{d t}=r x_{t}+\ell_{t}-c_{t}$

where $r>0$ is the rate of interest earned, $\ell_{t}$ is his intensity of working $(0 \leqslant \ell \leqslant 1)$, and $c_{t}$ is his rate of consumption. His initial wealth $x_{0}>0$ is given, and his objective is to maximize

$\int_{0}^{T} U\left(c_{t}, \ell_{t}\right) d t$

where $U(c, \ell)=c^{\alpha}(1-\ell)^{\beta}$, and $T$ is the (fixed) time his contract expires. The constants $\alpha$ and $\beta$ satisfy the inequalities $0<\alpha<1,0<\beta<1$, and $\alpha+\beta>1$. At all times, $c_{t}$ must be non-negative, and his final wealth $x_{T}$ must be non-negative. Establish the following properties of the optimal solution $\left(x^{*}, c^{*}, \ell^{*}\right)$ :

(i) $\beta c_{t}^{*}=\alpha\left(1-\ell_{t}^{*}\right)$;

(ii) $c_{t}^{*} \propto e^{-\gamma r t}$, where $\gamma \equiv(\beta-1+\alpha)^{-1}$;

(iii) $x_{t}^{*}=A e^{r t}+B e^{-\gamma r t}-r^{-1}$ for some constants $A$ and $B$.

Hence deduce that the optimal wealth is

$x_{t}^{*}=\frac{\left(1-e^{-\gamma r T}\left(1+r x_{0}\right)\right) e^{r t}+\left(\left(1+r x_{0}\right) e^{r T}-1\right) e^{-\gamma r t}}{r\left(e^{r T}-e^{-\gamma r T}\right)}-\frac{1}{r}$


• # Paper 2, Section II, I

In the context of stochastic dynamic programming, explain what is meant by an average-reward optimal policy.

A player has a fair coin and a six-sided die. At each epoch he may choose either to toss the coin or to roll the die. If he tosses the coin and it shows heads then he adds 1 to his total score, and if it shows tails then he adds 0 . If he rolls the die then he adds the number showing. He wins a reward of $£ 1$ whenever his total score is divisible by 3 .

Suppose the player always tosses the coin. Find his average reward per toss.

Still using the above policy, and given that he starts with a total score of $x$, let $F_{s}(x)$ be the expected total reward over the next $s$ epochs. Find the value of

$\lim _{s \rightarrow \infty}\left[F_{s}(x)-F_{s}(0)\right] .$

Use the policy improvement algorithm to find a policy that produces a greater average reward than the policy of only tossing the coin.

Find the average-reward optimal policy.

• # Paper 3, Section II, I

Two scalar systems have dynamics

$x_{t+1}=x_{t}+u_{t}+\epsilon_{t}, \quad y_{t+1}=y_{t}+w_{t}+\eta_{t},$

where $\left\{\epsilon_{t}\right\}$ and $\left\{\eta_{t}\right\}$ are independent sequences of independent and identically distributed random variables of mean 0 and variance 1 . Let

$F(x)=\inf _{\pi} \mathbb{E}\left[\sum_{t=0}^{\infty}\left(x_{t}^{2}+u_{t}^{2}\right)(2 / 3)^{t} \mid x_{0}=x\right]$

where $\pi$ is a policy in which $u_{t}$ depends on only $x_{0}, \ldots, x_{t}$.

Show that $G(x)=P x^{2}+d$ is a solution to the optimality equation satisfied by $F(x)$, for some $P$ and $d$ which you should find.

Find the optimal controls.

State a theorem that justifies $F(x)=G(x)$.

For each of the two cases (a) $\lambda=0$ and (b) $\lambda=1$, find controls $\left\{u_{t}, w_{t}\right\}$ which minimize

$\mathbb{E}\left[\sum_{t=0}^{\infty}\left(x_{t}^{2}+2 \lambda x_{t} y_{t}+y_{t}^{2}+u_{t}^{2}+w_{t}^{2}\right)(2 / 3+\lambda / 12)^{t} \mid x_{0}=x, y_{0}=y\right]$

• # Paper 4, Section II, I

Explain how transversality conditions can be helpful when employing Pontryagin's Maximum Principle to solve an optimal control problem.

A particle in $\mathbb{R}^{2}$ starts at $(0,0.5)$ and follows the dynamics

$\dot{x}=u \sqrt{|y|}, \quad \dot{y}=v \sqrt{|y|}, \quad t \in[0, T],$

where controls $u(t)$ and $v(t)$ are to be chosen subject to $u^{2}(t)+v^{2}(t)=1$.

Using Pontryagin's maximum principle do the following:

(a) Find controls that minimize $-y(1)$;

(b) Suppose we wish to choose $T$ and the controls $u, v$ to minimize $-y(T)+T$ under a constraint $(x(T), y(T))=(1,1)$. By expressing both $d y / d x$ and $d^{2} y / d x^{2}$ in terms of the adjoint variables, show that on an optimal trajectory,

$1+\left(\frac{d y}{d x}\right)^{2}+2 y \frac{d^{2} y}{d x^{2}}=0$


• # 2.II.29I

Consider a stochastic controllable dynamical system $P$ with action-space $A$ and countable state-space $S$. Thus $P=\left(p_{x y}(a): x, y \in S, a \in A\right)$ and $p_{x y}(a)$ denotes the transition probability from $x$ to $y$ when taking action $a$. Suppose that a cost $c(x, a)$ is incurred each time that action $a$ is taken in state $x$, and that this cost is uniformly bounded. Write down the dynamic optimality equation for the problem of minimizing the expected long-run average cost.

State in terms of this equation a general result, which can be used to identify an optimal control and the minimal long-run average cost.

A particle moves randomly on the integers, taking steps of size 1 . Suppose we can choose at each step a control parameter $u \in[\alpha, 1-\alpha]$, where $\alpha \in(0,1 / 2)$ is fixed, which has the effect that the particle moves in the positive direction with probability $u$ and in the negative direction with probability $1-u$. It is desired to maximize the long-run proportion of time $\pi$ spent by the particle at 0 . Show that there is a solution to the optimality equation for this example in which the relative cost function takes the form $\theta(x)=\mu|x|$, for some constant $\mu$.

Determine an optimal control and show that the maximal long-run proportion of time spent at 0 is given by

$\pi=\frac{1-2 \alpha}{2(1-\alpha)} .$

You may assume that it is valid to use an unbounded function $\theta$ in the optimality equation in this example.
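A numerical sketch (the truncation level and the tie-breaking control at 0 are arbitrary choices): under the drift-towards-0 control ($u=\alpha$ for $x \geqslant 0$, $u=1-\alpha$ for $x<0$), the stationary probability of state 0 matches $(1-2 \alpha) /(2(1-\alpha))$.

```python
# Power-iterate the truncated chain on {-K, ..., K}; boundary self-loops make
# it aperiodic and the geometric tails make the truncation error negligible.
alpha, K, steps = 0.3, 60, 5000
states = list(range(-K, K + 1))
pi = {x: 1.0 / len(states) for x in states}
for _ in range(steps):
    new = {x: 0.0 for x in states}
    for x, mass in pi.items():
        up = alpha if x >= 0 else 1.0 - alpha   # P(move +1); at 0 take u = alpha
        for y, q in ((min(x + 1, K), up), (max(x - 1, -K), 1.0 - up)):
            new[y] += mass * q
    pi = new

target = (1 - 2 * alpha) / (2 * (1 - alpha))    # claimed proportion at 0
assert abs(pi[0] - target) < 1e-3
```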

• # 3.II.28I

Let $Q$ be a positive-definite symmetric $m \times m$ matrix. Show that a non-negative quadratic form on $\mathbb{R}^{d} \times \mathbb{R}^{m}$ of the form

$c(x, a)=x^{T} R x+x^{T} S^{T} a+a^{T} S x+a^{T} Q a, \quad x \in \mathbb{R}^{d}, \quad a \in \mathbb{R}^{m}$

is minimized over $a$, for each $x$, with value $x^{T}\left(R-S^{T} Q^{-1} S\right) x$, by taking $a=K x$, where $K=-Q^{-1} S$.

Consider for $k \leqslant n$ the controllable stochastic linear system in $\mathbb{R}^{d}$

$X_{j+1}=A X_{j}+B U_{j}+\varepsilon_{j+1}, \quad j=k, k+1, \ldots, n-1,$

starting from $X_{k}=x$ at time $k$, where the control variables $U_{j}$ take values in $\mathbb{R}^{m}$, and where $\varepsilon_{k+1}, \ldots, \varepsilon_{n}$ are independent, zero-mean random variables, with $\operatorname{var}\left(\varepsilon_{j}\right)=N_{j}$. Here, $A, B$ and $N_{j}$ are, respectively, $d \times d, d \times m$ and $d \times d$ matrices. Assume that a cost $c\left(X_{j}, U_{j}\right)$ is incurred at each time $j=k, \ldots, n-1$ and that a final cost $C\left(X_{n}\right)=X_{n}^{T} \Pi_{0} X_{n}$ is incurred at time $n$. Here, $\Pi_{0}$ is a given non-negative-definite symmetric matrix. It is desired to minimize, over the set of all controls $u$, the total expected cost $V^{u}(k, x)$. Write down the optimality equation for the infimal cost function $V(k, x)$.

Hence, show that $V(k, x)$ has the form

$V(k, x)=x^{T} \Pi_{n-k} x+\gamma_{k}$

for some non-negative-definite symmetric matrix $\Pi_{n-k}$ and some real constant $\gamma_{k}$. Show how to compute the matrix $\Pi_{n-k}$ and constant $\gamma_{k}$ and how to determine an optimal control.

• # 4.II.29I

State Pontryagin's maximum principle for the controllable dynamical system with state-space $\mathbb{R}^{+}$, given by

$\dot{x}_{t}=b\left(t, x_{t}, u_{t}\right), \quad t \geqslant 0,$

where the running costs are given by $c\left(t, x_{t}, u_{t}\right)$, up to an unconstrained terminal time $\tau$ when the state first reaches 0 , and there is a terminal cost $C(\tau)$.

A company pays a variable price $p(t)$ per unit time for electrical power, agreed in advance, which depends on the time of day. The company takes on a job at time $t=0$, which requires a total amount $E$ of electrical energy, but can be processed at a variable level of power consumption $u(t) \in[0,1]$. If the job is completed by time $\tau$, then the company will receive a reward $R(\tau)$. Thus, it is desired to minimize

$\int_{0}^{\tau} u(t) p(t) d t-R(\tau)$

subject to

$\int_{0}^{\tau} u(t) d t=E, \quad u(t) \in[0,1],$