Paper 2, Section II, I

In the context of stochastic dynamic programming, explain what is meant by an average-reward optimal policy.

A player has a fair coin and a six-sided die. At each epoch he may choose either to toss the coin or to roll the die. If he tosses the coin and it shows heads then he adds 1 to his total score, and if it shows tails then he adds 0 . If he rolls the die then he adds the number showing. He wins a reward of $£ 1$ whenever his total score is divisible by 3 .

Suppose the player always tosses the coin. Find his average reward per toss.

Still using the above policy, and given that he starts with a total score of $x$ , let $F_{s}(x)$ be the expected total reward over the next $s$ epochs. Find the value of

\lim _{s \rightarrow \infty}\left[F_{s}(x)-F_{s}(0)\right] .

Use the policy improvement algorithm to find a policy that produces a greater average reward than the policy of only tossing the coin.

Find the average-reward optimal policy.