• # Paper 1, Section II, J

Let $X_{1}, \ldots, X_{n}$ be random variables with joint probability density function $f_{\theta}$ from a statistical model $\left\{f_{\theta}: \theta \in \mathbb{R}\right\}$.

(a) Define the Fisher information $I_{n}(\theta)$. What do we mean when we say that the Fisher information tensorises?

(b) Derive the relationship between the Fisher information and the derivative of the score function in a regular model.

(c) Consider the model defined by $X_{1}=\theta+\varepsilon_{1}$ and

$X_{i}=\theta(1-\sqrt{\gamma})+\sqrt{\gamma} X_{i-1}+\sqrt{1-\gamma} \varepsilon_{i} \quad \text { for } i=2, \ldots, n$

where $\varepsilon_{1}, \ldots, \varepsilon_{n}$ are i.i.d. $N(0,1)$ random variables, and $\gamma \in[0,1)$ is a known constant. Compute the Fisher information $I_{n}(\theta)$. For which values of $\gamma$ does the Fisher information tensorise? State a lower bound on the variance of an unbiased estimator $\hat{\theta}$ in this model.
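A quick numerical sanity check (not part of the question): the likelihood factorises over the Gaussian conditionals $X_i \mid X_{i-1}$, so the score is a sum whose variance can be estimated by Monte Carlo. The closed form compared against below, $I_n(\theta)=1+(n-1)(1-\sqrt{\gamma})/(1+\sqrt{\gamma})$, is our own computation and should be treated as an assumption; the sketch uses only the Python standard library.

```python
import math
import random

def simulate(theta, gamma, n, rng):
    # X_1 = theta + eps_1;  X_i = theta(1-sqrt(g)) + sqrt(g) X_{i-1} + sqrt(1-g^2) eps_i
    g = math.sqrt(gamma)
    x = [theta + rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        x.append(theta * (1 - g) + g * x[-1] + math.sqrt(1 - gamma) * rng.gauss(0.0, 1.0))
    return x

def score(x, theta, gamma):
    # d/dtheta of the log-likelihood, via the Gaussian conditional factorisation
    g = math.sqrt(gamma)
    s = x[0] - theta                                # X_1 ~ N(theta, 1)
    for prev, cur in zip(x, x[1:]):                 # X_i | X_{i-1} ~ N(., 1 - gamma)
        s += (1 - g) * (cur - theta * (1 - g) - g * prev) / (1 - gamma)
    return s

rng = random.Random(0)
theta, gamma, n, reps = 1.0, 0.25, 20, 20000
scores = [score(simulate(theta, gamma, n, rng), theta, gamma) for _ in range(reps)]
mc_info = sum(s * s for s in scores) / reps         # Var of the score estimates I_n
closed_form = 1 + (n - 1) * (1 - math.sqrt(gamma)) / (1 + math.sqrt(gamma))  # assumption
```

At $\gamma=0$ the closed form reduces to $n$, i.e. the information tensorises, which matches the i.i.d. case.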

• # Paper 2, Section II, J

Let $X_{1}, \ldots, X_{n}$ be i.i.d. random observations taking values in $[0,1]$ with a continuous distribution function $F$. Let $\hat{F}_{n}(x)=n^{-1} \sum_{i=1}^{n} \mathbf{1}_{\left\{X_{i} \leqslant x\right\}}$ for each $x \in[0,1]$.

(a) State the Kolmogorov-Smirnov theorem. Explain how this theorem may be used in a goodness-of-fit test for the null hypothesis $H_{0}: F=F_{0}$, with $F_{0}$ continuous.

(b) Suppose you do not have access to the quantiles of the sampling distribution of the Kolmogorov-Smirnov test statistic. However, you are given i.i.d. samples $Z_{1}, \ldots, Z_{n m}$ with distribution function $F_{0}$. Describe a test of $H_{0}: F=F_{0}$ with size exactly $1 /(m+1)$.
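One natural construction for part (b), sketched numerically (an illustration under our assumptions, not the official solution): split the $Z$'s into $m$ blocks of size $n$, compute the Kolmogorov-Smirnov statistic of each block and of the $X$-sample against $F_0$, and reject when the $X$-statistic is strictly the largest. Under $H_0$ the $m+1$ statistics are i.i.d. with a continuous distribution, so the rejection probability is exactly $1/(m+1)$.

```python
import random

def ks_stat(sample, F0):
    # D_n = sup_x |F_hat_n(x) - F0(x)|, attained at the order statistics
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - F0(x), F0(x) - i / n) for i, x in enumerate(xs))

def exact_size_test(x_sample, z_blocks, F0):
    # reject iff the X-statistic strictly exceeds all m block statistics;
    # under H0 the m+1 statistics are i.i.d. continuous, so size = 1/(m+1)
    d_x = ks_stat(x_sample, F0)
    return all(d_x > ks_stat(block, F0) for block in z_blocks)

rng = random.Random(1)
F0 = lambda u: u                       # Uniform[0,1] cdf on [0,1]
n, m, trials = 25, 9, 4000
rejections = 0
for _ in range(trials):
    x = [rng.random() for _ in range(n)]                        # H0 is true here
    z = [[rng.random() for _ in range(n)] for _ in range(m)]    # the nm extra samples
    rejections += exact_size_test(x, z, F0)
emp_size = rejections / trials          # should be close to 1/(m+1) = 0.1
```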

(c) Now suppose that $X_{1}, \ldots, X_{n}$ are i.i.d. taking values in $[0, \infty)$ with probability density function $f$, with $\sup _{x \geqslant 0}\left(|f(x)|+\left|f^{\prime}(x)\right|\right)<1$. Define the density estimator

$\hat{f}_{n}(x)=n^{-2 / 3} \sum_{i=1}^{n} \mathbf{1}\left\{X_{i}-\frac{1}{2 n^{1 / 3}} \leqslant x \leqslant X_{i}+\frac{1}{2 n^{1 / 3}}\right\}, \quad x \geqslant 0 .$

Show that for all $x \geqslant 0$ and all $n \geqslant 1$,

$\mathbb{E}\left[\left(\hat{f}_{n}(x)-f(x)\right)^{2}\right] \leqslant \frac{2}{n^{2 / 3}} .$
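A numerical illustration of the estimator (not required by the question): $\hat f_n$ is a box-kernel (histogram-type) estimator with bandwidth $h=n^{-1/3}$. The Exp(1) density used below has $\sup_{x\geqslant 0}(|f(x)|+|f'(x)|)=2$, so it does not satisfy the question's hypothesis; we only check that the empirical mean-squared error falls below $2n^{-2/3}$ in this particular instance.

```python
import math
import random

def f_hat(x, sample):
    # n^{-2/3} * #{i : X_i - 1/(2 n^{1/3}) <= x <= X_i + 1/(2 n^{1/3})},
    # i.e. a box-kernel estimator with bandwidth h = n^{-1/3}
    n = len(sample)
    h = n ** (-1.0 / 3.0)
    return sum(abs(xi - x) <= h / 2 for xi in sample) / (n * h)

rng = random.Random(2)
x0, n, reps = 1.0, 1000, 1000
truth = math.exp(-x0)                  # f(x) = e^{-x} for Exp(1) data
sq_errs = [(f_hat(x0, [rng.expovariate(1.0) for _ in range(n)]) - truth) ** 2
           for _ in range(reps)]
mse = sum(sq_errs) / reps
bound = 2.0 / n ** (2.0 / 3.0)         # the bound to be proved (its hypothesis
                                       # requires sup(|f| + |f'|) < 1)
```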

• # Paper 3, Section II, J

Let $X_{1}, \ldots, X_{n}$ be i.i.d. $\operatorname{Gamma}(\alpha, \beta)$ random variables for some known $\alpha>0$ and some unknown $\beta>0$. [The gamma distribution has probability density function

$f(x)=\frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x>0$

and its mean and variance are $\alpha / \beta$ and $\alpha / \beta^{2}$, respectively.]

(a) Find the maximum likelihood estimator $\hat{\beta}$ for $\beta$ and derive the distributional limit of $\sqrt{n}(\hat{\beta}-\beta)$. [You may not use the asymptotic normality of the maximum likelihood estimator proved in the course.]

(b) Construct an asymptotic $(1-\gamma)$-level confidence interval for $\beta$ and show that it has the correct (asymptotic) coverage.

(c) Write down all the steps needed to construct a candidate for an asymptotic $(1-\gamma)$-level confidence interval for $\beta$ using the nonparametric bootstrap.
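A sketch of one standard bootstrap recipe (percentile intervals; an illustration under our assumptions, not the official answer). Solving the likelihood equation in part (a) gives $\hat\beta = \alpha/\bar X_n$; the nonparametric bootstrap resamples the data with replacement, recomputes $\hat\beta$, and reads off empirical quantiles of the replicates.

```python
import random

def beta_hat(xs, alpha):
    # MLE from part (a): beta_hat = alpha / (sample mean)
    return alpha * len(xs) / sum(xs)

def bootstrap_ci(xs, alpha, level, B, rng):
    # percentile bootstrap: resample with replacement, recompute beta_hat,
    # and take empirical quantiles of the B bootstrap replicates
    n = len(xs)
    stats = sorted(beta_hat([xs[rng.randrange(n)] for _ in range(n)], alpha)
                   for _ in range(B))
    return stats[int(level / 2 * B)], stats[int((1 - level / 2) * B) - 1]

rng = random.Random(3)
alpha_known, beta_true, n = 2.0, 3.0, 200
xs = [rng.gammavariate(alpha_known, 1.0 / beta_true) for _ in range(n)]  # scale = 1/beta
point_est = beta_hat(xs, alpha_known)
ci = bootstrap_ci(xs, alpha_known, 0.05, 2000, rng)
```

Note that `random.gammavariate` is parameterised by shape and scale, so the rate $\beta$ enters as `1.0 / beta_true`.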

• # Paper 4, Section II, J

Suppose that $X \mid \theta \sim \operatorname{Poisson}(\theta), \theta>0$, and suppose the prior $\pi$ on $\theta$ is a gamma distribution with parameters $\alpha>0$ and $\beta>0$. [Recall that $\pi$ has probability density function

$f(z)=\frac{\beta^{\alpha}}{\Gamma(\alpha)} z^{\alpha-1} e^{-\beta z}, \quad z>0$

and that its mean and variance are $\alpha / \beta$ and $\alpha / \beta^{2}$, respectively.]

(a) Find the $\pi$-Bayes estimator for $\theta$ for the quadratic loss, and derive its quadratic risk function.

(b) Suppose we wish to estimate $\mu=e^{-\theta}=\mathbb{P}_{\theta}(X=0)$. Find the $\pi$-Bayes estimator for $\mu$ for the quadratic loss, and derive its quadratic risk function. [Hint: The moment generating function of a Poisson $(\theta)$ distribution is $M(t)=\exp \left(\theta\left(e^{t}-1\right)\right)$ for $t \in \mathbb{R}$, and that of a Gamma $(\alpha, \beta)$ distribution is $M(t)=(1-t / \beta)^{-\alpha}$ for $t<\beta$.]

(c) State a sufficient condition for an admissible estimator to be minimax, and give a proof of this fact.

(d) For each of the estimators in parts (a) and (b), is it possible to deduce, using the condition in (c), that the estimator is minimax for some values of $\alpha$ and $\beta$? Justify your answer.


• # Paper 1, Section II, J

State and prove the Cramér-Rao inequality for a real-valued parameter $\theta$. [Necessary regularity conditions need not be stated.]

In a general decision problem, define what it means for a decision rule to be minimax.

Let $X_{1}, \ldots, X_{n}$ be i.i.d. from a $N(\theta, 1)$ distribution, where $\theta \in \Theta=[0, \infty)$. Prove carefully that $\bar{X}_{n}=\frac{1}{n} \sum_{i=1}^{n} X_{i}$ is minimax for quadratic risk on $\Theta$.

• # Paper 2, Section II, J

Consider $X_{1}, \ldots, X_{n}$ from a $N\left(\mu, \sigma^{2}\right)$ distribution with parameter $\theta=\left(\mu, \sigma^{2}\right) \in$ $\Theta=\mathbb{R} \times(0, \infty)$. Derive the likelihood ratio test statistic $\Lambda_{n}\left(\Theta, \Theta_{0}\right)$ for the composite hypothesis

$H_{0}: \sigma^{2}=1 \text { vs. } H_{1}: \sigma^{2} \neq 1$

where $\Theta_{0}=\{(\mu, 1): \mu \in \mathbb{R}\}$ is the parameter space constrained by $H_{0}$.

Prove carefully that

$\Lambda_{n}\left(\Theta, \Theta_{0}\right) \rightarrow^{d} \chi_{1}^{2} \quad \text { as } n \rightarrow \infty$

where $\chi_{1}^{2}$ is a Chi-Square distribution with one degree of freedom.

• # Paper 3, Section II, J

Let $\Theta=\mathbb{R}^{p}$, let $\mu>0$ be a probability density function on $\Theta$ and suppose we are given a further auxiliary conditional probability density function $q(\cdot \mid t)>0, t \in \Theta$, on $\Theta$ from which we can generate random draws. Consider a sequence of random variables $\left\{\vartheta_{m}: m \in \mathbb{N}\right\}$ generated as follows:

• For $m \in \mathbb{N}$ and given $\vartheta_{m}$, generate a new draw $s_{m} \sim q\left(\cdot \mid \vartheta_{m}\right)$.

• Define

$\vartheta_{m+1}= \begin{cases}s_{m}, & \text { with probability } \rho\left(\vartheta_{m}, s_{m}\right) \\ \vartheta_{m}, & \text { with probability } 1-\rho\left(\vartheta_{m}, s_{m}\right)\end{cases}$

where $\rho(t, s)=\min \left\{\frac{\mu(s)}{\mu(t)} \frac{q(t \mid s)}{q(s \mid t)}, 1\right\}$.

(i) Show that the Markov chain $\left(\vartheta_{m}\right)$ has invariant measure $\mu$, that is, show that for all (measurable) subsets $B \subset \Theta$ and all $m \in \mathbb{N}$ we have

$\int_{\Theta} \operatorname{Pr}\left(\vartheta_{m+1} \in B \mid \vartheta_{m}=t\right) \mu(t) d t=\int_{B} \mu(\theta) d \theta$

(ii) Now suppose that $\mu$ is the posterior probability density function arising in a statistical model $\{f(\cdot, \theta): \theta \in \Theta\}$ with observations $x$ and a $N\left(0, I_{p}\right)$ prior distribution on $\theta$. Derive a family $\{q(\cdot \mid t): t \in \Theta\}$ such that in the above algorithm the acceptance probability $\rho(t, s)$ is a function of the likelihood ratio $f(x, s) / f(x, t)$, and for which the probability density function $q(\cdot \mid t)$ has covariance matrix $2 \delta I_{p}$ for all $t \in \Theta$.
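A minimal implementation of the chain defined above (an illustrative sketch; part (ii)'s specific proposal family is left to the reader). The toy check targets a standard normal in $p=1$ with a symmetric random-walk proposal, for which $q$ cancels in $\rho$, and verifies that long-run averages match the invariant measure.

```python
import math
import random

def metropolis_hastings(mu, q_sample, q_density, theta0, steps, rng):
    # the chain above: propose s ~ q(.|t), accept with
    # rho(t, s) = min{ (mu(s)/mu(t)) * (q(t|s)/q(s|t)), 1 }
    t = theta0
    chain = []
    for _ in range(steps):
        s = q_sample(t, rng)
        rho = min(mu(s) * q_density(t, s) / (mu(t) * q_density(s, t)), 1.0)
        if rng.random() < rho:
            t = s
        chain.append(t)
    return chain

# toy check in p = 1: target mu proportional to N(0,1), symmetric random walk
mu = lambda z: math.exp(-z * z / 2)                     # unnormalised density suffices
q_sample = lambda t, rng: t + rng.gauss(0.0, 1.0)
q_density = lambda a, b: math.exp(-(a - b) ** 2 / 2)    # symmetric: cancels in rho

rng = random.Random(4)
chain = metropolis_hastings(mu, q_sample, q_density, 3.0, 20000, rng)
tail = chain[5000:]                                     # discard burn-in
mean = sum(tail) / len(tail)
var = sum((z - mean) ** 2 for z in tail) / len(tail)
```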

• # Paper 4, Section II, J

Consider $X_{1}, \ldots, X_{n}$ drawn from a statistical model $\{f(\cdot, \theta): \theta \in \Theta\}, \Theta=\mathbb{R}^{p}$, with non-singular Fisher information matrix $I(\theta)$. For $\theta_{0} \in \Theta, h \in \mathbb{R}^{p}$, define likelihood ratios

$Z_{n}(h)=\log \frac{\prod_{i=1}^{n} f\left(X_{i}, \theta_{0}+h / \sqrt{n}\right)}{\prod_{i=1}^{n} f\left(X_{i}, \theta_{0}\right)}, \quad X_{i} \stackrel{\text { i.i.d. }}{\sim} f\left(\cdot, \theta_{0}\right)$

Next consider the probability density functions $\left(p_{h}: h \in \mathbb{R}^{p}\right)$ of normal distributions $N\left(h, I\left(\theta_{0}\right)^{-1}\right)$ with corresponding likelihood ratios given by

$Z(h)=\log \frac{p_{h}(X)}{p_{0}(X)}, \quad X \sim p_{0} .$

Show that for every fixed $h \in \mathbb{R}^{p}$, the random variables $Z_{n}(h)$ converge in distribution as $n \rightarrow \infty$ to $Z(h) .$

[You may assume suitable regularity conditions of the model $\{f(\cdot, \theta): \theta \in \Theta\}$ without specification, and results on uniform laws of large numbers from lectures can be used without proof.]


• # Paper 1, Section II, J

In a regression problem, for a given $X \in \mathbb{R}^{n \times p}$ fixed, we observe $Y \in \mathbb{R}^{n}$ such that

$Y=X \theta_{0}+\varepsilon$

for an unknown $\theta_{0} \in \mathbb{R}^{p}$ and $\varepsilon$ random such that $\varepsilon \sim \mathcal{N}\left(0, \sigma^{2} I_{n}\right)$ for some known $\sigma^{2}>0$.

(a) When $p \leqslant n$ and $X$ has rank $p$, compute the maximum likelihood estimator $\hat{\theta}_{M L E}$ for $\theta_{0}$. When $p>n$, what issue is there with the likelihood maximisation approach and how many maximisers of the likelihood are there (if any)?

(b) For any $\lambda>0$ fixed, we consider $\hat{\theta}_{\lambda}$ minimising

$\|Y-X \theta\|_{2}^{2}+\lambda\|\theta\|_{2}^{2}$

over $\mathbb{R}^{p}$. Derive an expression for $\hat{\theta}_{\lambda}$ and show it is well defined, i.e., there is a unique minimiser for every $X, Y$ and $\lambda$.

Assume $p \leqslant n$ and that $X$ has rank $p$. Let $\Sigma=X^{\top} X$ and note that $\Sigma=V \Lambda V^{\top}$ for some orthogonal matrix $V$ and some diagonal matrix $\Lambda$ whose diagonal entries satisfy $\Lambda_{1,1} \geqslant \Lambda_{2,2} \geqslant \ldots \geqslant \Lambda_{p, p}$. Assume that the columns of $X$ have mean zero.

(c) Denote the columns of $U=X V$ by $u_{1}, \ldots, u_{p}$. Show that they are sample principal components, i.e., that their pairwise sample correlations are zero and that they have sample variances $n^{-1} \Lambda_{1,1}, \ldots, n^{-1} \Lambda_{p, p}$, respectively. [Hint: the sample covariance between $u_{i}$ and $u_{j}$ is $n^{-1} u_{i}^{\top} u_{j}$.]

(d) Show that

$\hat{Y}_{M L E}=X \hat{\theta}_{M L E}=U \Lambda^{-1} U^{\top} Y .$

Conclude that the prediction $\hat{Y}_{M L E}$ is the closest point to $Y$ within the subspace spanned by the normalised sample principal components of part (c).

(e) Show that

$\hat{Y}_{\lambda}=X \hat{\theta}_{\lambda}=U\left(\Lambda+\lambda I_{p}\right)^{-1} U^{\top} Y$

Assume $\Lambda_{1,1}, \Lambda_{2,2}, \ldots, \Lambda_{q, q} \gg \lambda \gg \Lambda_{q+1, q+1}, \ldots, \Lambda_{p, p}$ for some $1 \leqslant q<p$. Conclude that the prediction $\hat{Y}_{\lambda}$ is approximately the closest point to $Y$ within the subspace spanned by the $q$ normalised sample principal components of part (c) with the greatest variance.
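A tiny numerical illustration of the ridge shrinkage (our own example, not part of the question): with an orthogonal, mean-zero design, $\Sigma=X^{\top}X$ is already diagonal ($V=I_p$, $U=X$), so the ridge solution acts coordinatewise as $u_j^{\top}Y/(\Lambda_{j,j}+\lambda)$, interpolating between the least-squares fit ($\lambda\to 0$) and $0$ ($\lambda\to\infty$).

```python
import random

# orthogonal, mean-zero design: Sigma = X^T X is diagonal, so V = I_2, U = X,
# and the ridge fit acts coordinatewise: theta_hat_lambda[j] = u_j^T Y / (L_jj + lambda)
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]   # columns orthogonal, ||x_j||^2 = 4, mean 0
theta0 = [2.0, -1.0]
rng = random.Random(5)
Y = [sum(X[i][j] * theta0[j] for j in range(2)) + 0.1 * rng.gauss(0.0, 1.0)
     for i in range(4)]

def ridge(X, Y, lam):
    p, n = len(X[0]), len(X)
    return [sum(X[i][j] * Y[i] for i in range(n)) /
            (sum(X[i][j] ** 2 for i in range(n)) + lam) for j in range(p)]

ols = ridge(X, Y, 0.0)       # lambda -> 0 recovers the least-squares fit
shrunk = ridge(X, Y, 4.0)    # lambda = L_jj here, so each coordinate halves exactly
```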

• # Paper 2, Section II, J

(a) We consider the model $\{\operatorname{Poisson}(\theta): \theta \in(0, \infty)\}$ and an i.i.d. sample $X_{1}, \ldots, X_{n}$ from it. Compute the expectation and variance of $X_{1}$ and check they are equal. Find the maximum likelihood estimator $\hat{\theta}_{M L E}$ for $\theta$ and, using its form, derive the limit in distribution of $\sqrt{n}\left(\hat{\theta}_{M L E}-\theta\right)$.

(b) In practice, Poisson-looking data often show overdispersion, i.e., the sample variance is larger than the sample mean. For $\pi \in[0,1]$ and $\lambda \in(0, \infty)$, let $p_{\pi, \lambda}: \mathbb{N}_{0} \rightarrow[0,1]$,

$k \mapsto p_{\pi, \lambda}(k)= \begin{cases}\pi e^{-\lambda} \frac{\lambda^{k}}{k !} & \text { for } k \geqslant 1 \\ (1-\pi)+\pi e^{-\lambda} & \text { for } k=0\end{cases}$

Show that this defines a distribution. Does it model overdispersion? Justify your answer.
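A numerical check of part (b) (illustration only; the closed forms in the final comment are our own algebra): summing the probabilities confirms $p_{\pi,\lambda}$ is a distribution, and its variance exceeds its mean whenever $\lambda(1-\pi)>0$, so the mixture does model overdispersion.

```python
import math

def zip_pmf(k, pi, lam):
    # the mixture from part (b): with prob. 1 - pi force a zero, else Poisson(lam)
    base = pi * math.exp(-lam) * lam ** k / math.factorial(k)
    return base + (1 - pi) if k == 0 else base

pi_, lam, K = 0.6, 3.0, 60      # truncation at K = 60: the Poisson(3) tail is negligible
total = sum(zip_pmf(k, pi_, lam) for k in range(K))
mean = sum(k * zip_pmf(k, pi_, lam) for k in range(K))
var = sum((k - mean) ** 2 * zip_pmf(k, pi_, lam) for k in range(K))
# closed forms (our own algebra): E = pi*lam, Var = pi*lam*(1 + lam*(1 - pi)) >= E
```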

(c) Let $Y_{1}, \ldots, Y_{n}$ be an i.i.d. sample from $p_{\pi, \lambda}$. Assume $\lambda$ is known. Find the maximum likelihood estimator $\hat{\pi}_{M L E}$ for $\pi$.

(d) Furthermore, assume that, for any $\pi \in[0,1], \sqrt{n}\left(\hat{\pi}_{M L E}-\pi\right)$ converges in distribution to a random variable $Z$ as $n \rightarrow \infty$. Suppose we wanted to test the null hypothesis that our data arises from the model in part (a). Before making any further computations, can we necessarily expect $Z$ to follow a normal distribution under the null hypothesis? Explain. Check your answer by computing the appropriate distribution.

[You may use results from the course, provided you state them clearly.]

• # Paper 3, Section II, J

We consider the exponential model $\{f(\cdot, \theta): \theta \in(0, \infty)\}$, where

$f(x, \theta)=\theta e^{-\theta x} \quad \text { for } x \geqslant 0$

We observe an i.i.d. sample $X_{1}, \ldots, X_{n}$ from the model.

(a) Compute the maximum likelihood estimator $\hat{\theta}_{M L E}$ for $\theta$. What is the limit in distribution of $\sqrt{n}\left(\hat{\theta}_{M L E}-\theta\right)$ ?

(b) Consider the Bayesian setting and place a $\operatorname{Gamma}(\alpha, \beta), \alpha, \beta>0$, prior for $\theta$ with density

$\pi(\theta)=\frac{\beta^{\alpha}}{\Gamma(\alpha)} \theta^{\alpha-1} \exp (-\beta \theta) \quad \text { for } \theta>0$

where $\Gamma$ is the Gamma function satisfying $\Gamma(\alpha+1)=\alpha \Gamma(\alpha)$ for all $\alpha>0$. What is the posterior distribution for $\theta$ ? What is the Bayes estimator $\hat{\theta}_{\pi}$ for the squared loss?

(c) Show that the Bayes estimator is consistent. What is the limiting distribution of $\sqrt{n}\left(\hat{\theta}_{\pi}-\theta\right)$ ?

[You may use results from the course, provided you state them clearly.]

• # Paper 4, Section II, J

We consider a statistical model $\{f(\cdot, \theta): \theta \in \Theta\}$.

(a) Define the maximum likelihood estimator (MLE) and the Fisher information $I(\theta) .$

(b) Let $\Theta=\mathbb{R}$ and assume there exist a continuous one-to-one function $\mu: \mathbb{R} \rightarrow \mathbb{R}$ and a real-valued function $h$ such that

$\mathbb{E}_{\theta}[h(X)]=\mu(\theta) \quad \forall \theta \in \mathbb{R}$

(i) For $X_{1}, \ldots, X_{n}$ i.i.d. from the model for some $\theta_{0} \in \mathbb{R}$, give the limit in almost sure sense of

$\hat{\mu}_{n}=\frac{1}{n} \sum_{i=1}^{n} h\left(X_{i}\right)$

Give a consistent estimator $\hat{\theta}_{n}$ of $\theta_{0}$ in terms of $\hat{\mu}_{n}$.

(ii) Assume further that $\mathbb{E}_{\theta_{0}}\left[h(X)^{2}\right]<\infty$, that $\mu$ is continuously differentiable and strictly monotone, and that the statistical model satisfies the usual regularity assumptions. What is the limit in distribution of $\sqrt{n}\left(\hat{\theta}_{n}-\theta_{0}\right)$? Do you necessarily expect $\operatorname{Var}\left(\hat{\theta}_{n}\right) \geqslant\left(n I\left(\theta_{0}\right)\right)^{-1}$ for all $n$? Why?

(iii) Propose an alternative estimator for $\theta_{0}$ with smaller bias than $\hat{\theta}_{n}$ if $B_{n}\left(\theta_{0}\right)=$ $\mathbb{E}_{\theta_{0}}\left[\hat{\theta}_{n}\right]-\theta_{0}=\frac{a}{n}+\frac{b}{n^{2}}+O\left(\frac{1}{n^{3}}\right)$ for some $a, b \in \mathbb{R}$ with $a \neq 0$.

(iv) In addition to all the assumptions in (iii), assume that the MLE for $\theta_{0}$ is of the form

$\hat{\theta}_{M L E}=\frac{1}{n} \sum_{i=1}^{n} h\left(X_{i}\right)$

What is the link between the Fisher information at $\theta_{0}$ and the variance of $h(X)$ ? What does this mean in terms of the precision of the estimator and why?

[You may use results from the course, provided you state them clearly.]


• # Paper 1, Section II, 29K

A scientist wishes to estimate the proportion $\theta \in(0,1)$ of presence of a gene in a population of flies of size $n$. Every fly receives a chromosome from each of its two parents, each carrying the gene $A$ with probability $(1-\theta)$ or the gene $B$ with probability $\theta$, independently. The scientist can observe if each fly has two copies of the gene A (denoted by AA), two copies of the gene $B$ (denoted by BB) or one of each (denoted by AB). We let $n_{\mathrm{AA}}, n_{\mathrm{BB}}$, and $n_{\mathrm{AB}}$ denote the number of each observation among the $n$ flies.

(a) Give the probability of each observation as a function of $\theta$, denoted by $f(X, \theta)$, for all three values $X=\mathrm{AA}, \mathrm{BB}$, or $\mathrm{AB}$.

(b) For a vector $w=\left(w_{\mathrm{AA}}, w_{\mathrm{BB}}, w_{\mathrm{AB}}\right)$, we let $\hat{\theta}_{w}$ denote the estimator defined by

$\hat{\theta}_{w}=w_{\mathrm{AA}} \frac{n_{\mathrm{AA}}}{n}+w_{\mathrm{BB}} \frac{n_{\mathrm{BB}}}{n}+w_{\mathrm{AB}} \frac{n_{\mathrm{AB}}}{n} .$

Find the unique vector $w^{*}$ such that $\hat{\theta}_{w^{*}}$ is unbiased. Show that $\hat{\theta}_{w^{*}}$ is a consistent estimator of $\theta$.

(c) Compute the maximum likelihood estimator of $\theta$ in this model, denoted by $\hat{\theta}_{M L E}$. Find the limiting distribution of $\sqrt{n}\left(\hat{\theta}_{M L E}-\theta\right)$. [You may use results from the course, provided that you state them clearly.]
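A simulation sketch (not part of the question; the estimator formula is our own computation): under the model each fly carries two independent chromosomes, so the natural gene-counting estimate $(n_{\mathrm{AB}}+2n_{\mathrm{BB}})/(2n)$, the fraction of $B$ genes among the $2n$ observed genes, should concentrate around $\theta$.

```python
import random

def simulate_counts(theta, n, rng):
    # each fly receives two independent chromosomes, each B with probability theta
    n_aa = n_bb = n_ab = 0
    for _ in range(n):
        copies = (rng.random() < theta) + (rng.random() < theta)
        if copies == 0:
            n_aa += 1
        elif copies == 2:
            n_bb += 1
        else:
            n_ab += 1
    return n_aa, n_bb, n_ab

def theta_estimate(n_aa, n_bb, n_ab):
    # fraction of B genes among the 2n observed genes (our own computation)
    return (n_ab + 2 * n_bb) / (2 * (n_aa + n_bb + n_ab))

rng = random.Random(6)
theta0, n = 0.3, 100000
counts = simulate_counts(theta0, n, rng)
est = theta_estimate(*counts)
```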

• # Paper 2, Section II, 28K

We consider the model $\left\{\mathcal{N}\left(\theta, I_{p}\right), \theta \in \mathbb{R}^{p}\right\}$ of a Gaussian distribution in dimension $p \geqslant 3$, with unknown mean $\theta$ and known identity covariance matrix $I_{p}$. We estimate $\theta$ based on one observation $X \sim \mathcal{N}\left(\theta, I_{p}\right)$, under the loss function

$\ell(\theta, \delta)=\|\theta-\delta\|_{2}^{2}$

(a) Define the risk of an estimator $\hat{\theta}$. Compute the maximum likelihood estimator $\hat{\theta}_{M L E}$ of $\theta$ and its risk for any $\theta \in \mathbb{R}^{p}$.

(b) Define what an admissible estimator is. Is $\hat{\theta}_{M L E}$ admissible?

(c) For any $c>0$, let $\pi_{c}(\theta)$ be the prior $\mathcal{N}\left(0, c^{2} I_{p}\right)$. Find a Bayes optimal estimator $\hat{\theta}_{c}$ under this prior with the quadratic loss, and compute its Bayes risk.

(d) Show that $\hat{\theta}_{M L E}$ is minimax.

[You may use results from the course provided that you state them clearly.]

• # Paper 3, Section II, K

In the model $\left\{\mathcal{N}\left(\theta, I_{p}\right), \theta \in \mathbb{R}^{p}\right\}$ of a Gaussian distribution in dimension $p$, with unknown mean $\theta$ and known identity covariance matrix $I_{p}$, we estimate $\theta$ based on a sample of i.i.d. observations $X_{1}, \ldots, X_{n}$ drawn from $\mathcal{N}\left(\theta_{0}, I_{p}\right)$.

(a) Define the Fisher information $I\left(\theta_{0}\right)$, and compute it in this model.

(b) We recall that the observed Fisher information $i_{n}(\theta)$ is given by

$i_{n}(\theta)=\frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta} \log f\left(X_{i}, \theta\right) \nabla_{\theta} \log f\left(X_{i}, \theta\right)^{\top}$

Find the limit of $\hat{i}_{n}=i_{n}\left(\hat{\theta}_{M L E}\right)$, where $\hat{\theta}_{M L E}$ is the maximum likelihood estimator of $\theta$ in this model.

(c) Define the Wald statistic $W_{n}(\theta)$ and compute it. Give the limiting distribution of $W_{n}\left(\theta_{0}\right)$ and explain how it can be used to design a confidence interval for $\theta_{0}$.

[You may use results from the course provided that you state them clearly.]

• # Paper 4, Section II, 28K

Let $g: \mathbb{R} \rightarrow \mathbb{R}$ be an unknown function, twice continuously differentiable with $\left|g^{\prime \prime}(x)\right| \leqslant M$ for all $x \in \mathbb{R}$. For some $x_{0} \in \mathbb{R}$, we know the value $g\left(x_{0}\right)$ and we wish to estimate its derivative $g^{\prime}\left(x_{0}\right)$. To do so, we have access to a pseudo-random number generator that gives $U_{1}^{*}, \ldots, U_{N}^{*}$ i.i.d. uniform over $[0,1]$, and a machine that takes input $x_{1}, \ldots, x_{N} \in \mathbb{R}$ and returns $g\left(x_{i}\right)+\varepsilon_{i}$, where the $\varepsilon_{i}$ are i.i.d. $\mathcal{N}\left(0, \sigma^{2}\right)$.

(a) Explain how this setup allows us to generate $N$ independent $X_{i}=x_{0}+h Z_{i}$, where the $Z_{i}$ take value 1 or $-1$ with probability $1 / 2$, for any $h>0$.

(b) We denote by $Y_{i}$ the output $g\left(X_{i}\right)+\varepsilon_{i}$. Show that for some independent $\xi_{i} \in \mathbb{R}$

$Y_{i}-g\left(x_{0}\right)=h Z_{i} g^{\prime}\left(x_{0}\right)+\frac{h^{2}}{2} g^{\prime \prime}\left(\xi_{i}\right)+\varepsilon_{i}$

(c) Using the intuition given by the least-squares estimator, justify the use of the estimator $\hat{g}_{N}$ given by

$\hat{g}_{N}=\frac{1}{N} \sum_{i=1}^{N} \frac{Z_{i}\left(Y_{i}-g\left(x_{0}\right)\right)}{h}$

(d) Show that

$\mathbb{E}\left[\left|\hat{g}_{N}-g^{\prime}\left(x_{0}\right)\right|^{2}\right] \leqslant \frac{h^{2} M^{2}}{4}+\frac{\sigma^{2}}{N h^{2}} .$

Show that for some choice $h_{N}$ of parameter $h$, this implies

$\mathbb{E}\left[\left|\hat{g}_{N}-g^{\prime}\left(x_{0}\right)\right|^{2}\right] \leqslant \frac{\sigma M}{\sqrt{N}}$
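A numerical sketch of the whole construction (illustration under our assumptions): simulate the $Z_i$ from uniform draws, query the noisy machine at $x_0+hZ_i$, form $\hat g_N$, and pick $h$ to balance the two terms of the bound, which gives $h^2 = 2\sigma/(M\sqrt{N})$.

```python
import math
import random

def estimate_derivative(g, x0, sigma, N, h, rng):
    # the part (c) estimator: (1/N) sum_i Z_i (Y_i - g(x0)) / h
    total = 0.0
    for _ in range(N):
        z = 1 if rng.random() < 0.5 else -1        # Z_i built from a uniform draw
        y = g(x0 + h * z) + rng.gauss(0.0, sigma)  # machine output Y_i = g(X_i) + eps_i
        total += z * (y - g(x0)) / h
    return total / N

g = math.sin                        # |g''| <= 1 everywhere, so we may take M = 1
x0, sigma, N, M = 0.5, 0.5, 10000, 1.0
h = math.sqrt(2 * sigma / (M * math.sqrt(N)))  # balances h^2 M^2 / 4 and sigma^2 / (N h^2)
rng = random.Random(7)
est = estimate_derivative(g, x0, sigma, N, h, rng)
true_deriv = math.cos(x0)
```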


• # Paper 1, Section II, 28K

For a positive integer $n$, we want to estimate the parameter $p$ in the binomial statistical model $\{\operatorname{Bin}(n, p), p \in[0,1]\}$, based on an observation $X \sim \operatorname{Bin}(n, p)$.

(a) Compute the maximum likelihood estimator for $p$. Show that the posterior distribution for $p$ under a uniform prior on $[0,1]$ is $\operatorname{Beta}(a, b)$, and specify $a$ and $b$. [The p.d.f. of $\operatorname{Beta}(a, b)$ is given by

$f_{a, b}(p)=\frac{(a+b-1) !}{(a-1) !(b-1) !} p^{a-1}(1-p)^{b-1}.]$

(b) (i) For a loss function $L$, define the risk of an estimator $\hat{p}$ of $p$, and the Bayes risk under a prior $\pi$ for $p$.

(ii) Under the loss function

$L(\hat{p}, p)=\frac{(\hat{p}-p)^{2}}{p(1-p)}$

find a Bayes optimal estimator for the uniform prior. Give its risk as a function of $p$.

(iii) Give a minimax optimal estimator for the loss function $L$ given above. Justify your answer.

• # Paper 2, Section II, 26K

We consider the problem of estimating $\theta$ in the model $\{f(x, \theta): \theta \in(0, \infty)\}$, where

$f(x, \theta)=(1-\alpha)(x-\theta)^{-\alpha} 1\{x \in[\theta, \theta+1]\}$

Here $1\{A\}$ is the indicator of the set $A$, and $\alpha \in(0,1)$ is known. This estimation is based on a sample of $n$ i.i.d. random variables $X_{1}, \ldots, X_{n}$, and we denote by $X_{(1)}<\ldots<X_{(n)}$ the ordered sample.

(a) Compute the mean and the variance of $X_{1}$. Construct an unbiased estimator of $\theta$ taking the form $\tilde{\theta}_{n}=\bar{X}_{n}+c(\alpha)$, where $\bar{X}_{n}=n^{-1} \sum_{i=1}^{n} X_{i}$, specifying $c(\alpha)$.

(b) Show that $\tilde{\theta}_{n}$ is consistent and find the limit in distribution of $\sqrt{n}\left(\tilde{\theta}_{n}-\theta\right)$. Justify your answer, citing theorems that you use.

(c) Find the maximum likelihood estimator $\hat{\theta}_{n}$ of $\theta$. Compute $\mathbf{P}\left(\hat{\theta}_{n}-\theta>t\right)$ for all real $t$. Is $\hat{\theta}_{n}$ unbiased?

(d) For $t>0$, show that $\mathbf{P}\left(n^{\beta}\left(\hat{\theta}_{n}-\theta\right)>t\right)$ has a limit in $(0,1)$ for some $\beta>0$. Give explicitly the value of $\beta$ and the limit. Why should one favour using $\hat{\theta}_{n}$ over $\tilde{\theta}_{n}$ ?

• # Paper 3, Section II, 26K

We consider the problem of estimating an unknown $\theta_{0}$ in a statistical model $\{f(x, \theta), \theta \in \Theta\}$ where $\Theta \subset \mathbb{R}$, based on $n$ i.i.d. observations $X_{1}, \ldots, X_{n}$ whose distribution has p.d.f. $f\left(x, \theta_{0}\right)$.

In all the parts below you may assume that the model satisfies necessary regularity conditions.

(a) Define the score function $S_{n}$ of $\theta$. Prove that $S_{n}\left(\theta_{0}\right)$ has mean 0 .

(b) Define the Fisher Information $I(\theta)$. Show that it can also be expressed as

$I(\theta)=-\mathbb{E}_{\theta}\left[\frac{d^{2}}{d \theta^{2}} \log f\left(X_{1}, \theta\right)\right]$

(c) Define the maximum likelihood estimator $\hat{\theta}_{n}$ of $\theta$. Give without proof the limits of $\hat{\theta}_{n}$ and of $\sqrt{n}\left(\hat{\theta}_{n}-\theta_{0}\right)$ (in a manner which you should specify). [Be as precise as possible when describing a distribution.]

(d) Let $\psi: \Theta \rightarrow \mathbb{R}$ be a continuously differentiable function, and $\tilde{\theta}_{n}$ another estimator of $\theta_{0}$ such that $\left|\hat{\theta}_{n}-\tilde{\theta}_{n}\right| \leqslant 1 / n$ with probability 1 . Give the limits of $\psi\left(\tilde{\theta}_{n}\right)$ and of $\sqrt{n}\left(\psi\left(\tilde{\theta}_{n}\right)-\psi\left(\theta_{0}\right)\right)$ (in a manner which you should specify).

• # Paper 4, Section II, 27K

For the statistical model $\left\{\mathcal{N}_{d}(\theta, \Sigma), \theta \in \mathbb{R}^{d}\right\}$, where $\Sigma$ is a known, positive-definite $d \times d$ matrix, we want to estimate $\theta$ based on $n$ i.i.d. observations $X_{1}, \ldots, X_{n}$ with distribution $\mathcal{N}_{d}(\theta, \Sigma)$.

(a) Derive the maximum likelihood estimator $\hat{\theta}_{n}$ of $\theta$. What is the distribution of $\hat{\theta}_{n}$ ?

(b) For $\alpha \in(0,1)$, construct a confidence region $C_{n}^{\alpha}$ such that $\mathbf{P}_{\theta}\left(\theta \in C_{n}^{\alpha}\right)=1-\alpha$.

(c) For $\Sigma=I_{d}$, compute the maximum likelihood estimator of $\theta$ for the following parameter spaces:

(i) $\Theta=\left\{\theta:\|\theta\|_{2}=1\right\}$.

(ii) $\Theta=\left\{\theta: v^{\top} \theta=0\right\}$ for some unit vector $v \in \mathbb{R}^{d}$.

(d) For $\Sigma=I_{d}$, we want to test the null hypothesis $\Theta_{0}=\{0\}$ (i.e. $\theta=0$) against the composite alternative $\Theta_{1}=\mathbb{R}^{d} \backslash\{0\}$. Compute the likelihood ratio statistic $\Lambda\left(\Theta_{1}, \Theta_{0}\right)$ and give its distribution under the null hypothesis. Compare this result with the statement of Wilks' theorem.


• # Paper 1, Section II, 27J

Derive the maximum likelihood estimator $\hat{\theta}_{n}$ based on independent observations $X_{1}, \ldots, X_{n}$ that are identically distributed as $N(\theta, 1)$, where the unknown parameter $\theta$ lies in the parameter space $\Theta=\mathbb{R}$. Find the limiting distribution of $\sqrt{n}\left(\widehat{\theta}_{n}-\theta\right)$ as $n \rightarrow \infty$.

Now define

$\tilde{\theta}_{n}= \begin{cases}\widehat{\theta}_{n} & \text { whenever }\left|\widehat{\theta}_{n}\right|>n^{-1 / 4} \\ 0 & \text { otherwise, }\end{cases}$

and find the limiting distribution of $\sqrt{n}\left(\tilde{\theta}_{n}-\theta\right)$ as $n \rightarrow \infty$.

Calculate

$\lim _{n \rightarrow \infty} \sup _{\theta \in \Theta} n E_{\theta}\left(T_{n}-\theta\right)^{2}$

for the choices $T_{n}=\widehat{\theta}_{n}$ and $T_{n}=\widetilde{\theta}_{n}$. Based on the above findings, which estimator $T_{n}$ of $\theta$ would you prefer? Explain your answer.

[Throughout, you may use standard facts of stochastic convergence, such as the central limit theorem, provided they are clearly stated.]

• # Paper 2, Section II,

(a) State and prove the Cramér-Rao inequality in a parametric model $\{f(\theta): \theta \in \Theta\}$, where $\Theta \subseteq \mathbb{R}$. [Necessary regularity conditions on the model need not be specified.]

(b) Let $X_{1}, \ldots, X_{n}$ be i.i.d. Poisson random variables with unknown parameter $E X_{1}=\theta>0$. For $\bar{X}_{n}=(1 / n) \sum_{i=1}^{n} X_{i}$ and $S^{2}=(n-1)^{-1} \sum_{i=1}^{n}\left(X_{i}-\bar{X}_{n}\right)^{2}$ define

$T_{\alpha}=\alpha \bar{X}_{n}+(1-\alpha) S^{2}, \quad 0 \leqslant \alpha \leqslant 1$

Show that $\operatorname{Var}_{\theta}\left(T_{\alpha}\right) \geqslant \operatorname{Var}_{\theta}\left(\bar{X}_{n}\right)$ for all values of $\alpha, \theta$.

Now suppose $\tilde{\theta}=\tilde{\theta}\left(X_{1}, \ldots, X_{n}\right)$ is an estimator of $\theta$ with possibly nonzero bias $B(\theta)=E_{\theta} \tilde{\theta}-\theta$. Suppose the function $B$ is monotone increasing on $(0, \infty)$. Prove that the mean-squared errors satisfy

$E_{\theta}(\tilde{\theta}-\theta)^{2} \geqslant E_{\theta}\left(\bar{X}_{n}-\theta\right)^{2} \quad \text { for all } \theta>0$

• # Paper 3, Section II, J

Let $X_{1}, \ldots, X_{n}$ be i.i.d. random variables from a $N(\theta, 1)$ distribution, $\theta \in \mathbb{R}$, and consider a Bayesian model $\theta \sim N\left(0, v^{2}\right)$ for the unknown parameter, where $v>0$ is a fixed constant.

(a) Derive the posterior distribution $\Pi\left(\cdot \mid X_{1}, \ldots, X_{n}\right)$ of $\theta \mid X_{1}, \ldots, X_{n}$.

(b) Construct a credible set $C_{n} \subset \mathbb{R}$ such that

(i) $\Pi\left(C_{n} \mid X_{1}, \ldots, X_{n}\right)=0.95$ for every $n \in \mathbb{N}$, and

(ii) for any $\theta_{0} \in \mathbb{R}$,

$P_{\theta_{0}}^{\mathbb{N}}\left(\theta_{0} \in C_{n}\right) \rightarrow 0.95 \quad \text { as } n \rightarrow \infty,$

where $P_{\theta}^{\mathbb{N}}$ denotes the distribution of the infinite sequence $X_{1}, X_{2}, \ldots$ when drawn independently from a fixed $N(\theta, 1)$ distribution.

[You may use the central limit theorem.]

• # Paper 4, Section II, J

Consider a decision problem with parameter space $\Theta$. Define the concepts of a Bayes decision rule $\delta_{\pi}$ and of a least favourable prior.

Suppose $\pi$ is a prior distribution on $\Theta$ such that the Bayes risk of the Bayes rule equals $\sup _{\theta \in \Theta} R\left(\delta_{\pi}, \theta\right)$, where $R(\delta, \theta)$ is the risk function associated to the decision problem. Prove that $\delta_{\pi}$ is least favourable.

Now consider a random variable $X$ arising from the binomial distribution $\operatorname{Bin}(n, \theta)$, where $\theta \in \Theta=[0,1]$. Construct a least favourable prior for the squared risk $R(\delta, \theta)=E_{\theta}(\delta(X)-\theta)^{2}$. [You may use without proof the fact that the Bayes rule for quadratic risk is given by the posterior mean.]


• # Paper 1, Section II, J

Consider a normally distributed random vector $X \in \mathbb{R}^{p}$ modelled as $X \sim N\left(\theta, I_{p}\right)$ where $\theta \in \mathbb{R}^{p}, I_{p}$ is the $p \times p$ identity matrix, and where $p \geqslant 3$. Define the Stein estimator $\hat{\theta}_{S T E I N}$ of $\theta$.

Prove that $\hat{\theta}_{S T E I N}$ dominates the estimator $\tilde{\theta}=X$ for the risk function induced by quadratic loss

$\ell(a, \theta)=\sum_{i=1}^{p}\left(a_{i}-\theta_{i}\right)^{2}, \quad a \in \mathbb{R}^{p}$

Show, however, that the worst-case risks coincide, that is, show that

$\sup _{\theta \in \mathbb{R}^{p}} E_{\theta} \ell(X, \theta)=\sup _{\theta \in \mathbb{R}^{p}} E_{\theta} \ell\left(\hat{\theta}_{S T E I N}, \theta\right)$

[You may use Stein's lemma without proof, provided it is clearly stated.]
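A Monte Carlo sketch of the domination claim (illustration only, taking the usual James-Stein form $(1-(p-2)/\|X\|_2^2)X$ as an assumption for the definition): at $\theta=0$ the Stein risk falls far below $p$, while $X$ has constant risk $p$. The worst-case equality concerns the supremum over $\theta$, which a fixed-$\theta$ simulation cannot exhibit.

```python
import random

def stein(x):
    # the usual James-Stein shrinkage (1 - (p-2)/||x||^2) x, for p >= 3 (assumption)
    p = len(x)
    factor = 1 - (p - 2) / sum(v * v for v in x)
    return [factor * v for v in x]

def mc_risk(estimator, theta, reps, rng):
    # Monte Carlo estimate of E_theta l(estimator(X), theta) with X ~ N(theta, I_p)
    p = len(theta)
    total = 0.0
    for _ in range(reps):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        a = estimator(x)
        total += sum((a[i] - theta[i]) ** 2 for i in range(p))
    return total / reps

rng = random.Random(8)
p, reps = 10, 20000
theta = [0.0] * p                                    # the gain is largest at theta = 0
risk_stein = mc_risk(stein, theta, reps, rng)
risk_mle = mc_risk(lambda x: x, theta, reps, rng)    # X itself has constant risk p
```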

• # Paper 2, Section II, J

Consider a random variable $X$ arising from the binomial distribution $\operatorname{Bin}(n, \theta)$, $\theta \in \Theta=[0,1]$. Find the maximum likelihood estimator $\hat{\theta}_{M L E}$ and the Fisher information $I(\theta)$ for $\theta \in \Theta$.

Now consider the following priors on $\Theta$ :

(i) a uniform $U([0,1])$ prior on $[0,1]$,

(ii) a prior with density $\pi(\theta)$ proportional to $\sqrt{I(\theta)}$,

(iii) a $\operatorname{Beta}(\sqrt{n} / 2, \sqrt{n} / 2)$ prior.

Find the means $E[\theta \mid X]$ and modes $m_{\theta \mid X}$ of the posterior distributions corresponding to the prior distributions (i)-(iii). Which of these posterior decision rules coincide with $\hat{\theta}_{M L E}$ ? Which one is minimax for quadratic risk? Justify your answers.

[You may use the following properties of the $\operatorname{Beta}(a, b)(a>0, b>0)$ distribution. Its density $f(x ; a, b), x \in[0,1]$, is proportional to $x^{a-1}(1-x)^{b-1}$, its mean is equal to $a /(a+b)$, and its mode is equal to

$\frac{\max (a-1,0)}{\max (a, 1)+\max (b, 1)-2}$

provided either $a>1$ or $b>1$.

You may further use the fact that a unique Bayes rule of constant risk is a unique minimax rule for that risk.]
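As a numerical aside (not part of the question), all three priors are Beta distributions: the uniform is $\operatorname{Beta}(1,1)$, and $\sqrt{I(\theta)} \propto \theta^{-1 / 2}(1-\theta)^{-1 / 2}$ gives $\operatorname{Beta}(1 / 2,1 / 2)$, so the conjugate update $\operatorname{Beta}(a, b) \mapsto \operatorname{Beta}(X+a, n-X+b)$ covers every case. A sketch with the arbitrary data $n=20$, $X=7$:

```python
import numpy as np

def posterior_summaries(n, x, a, b):
    # Conjugacy: Bin(n, theta) likelihood + Beta(a, b) prior -> Beta(x + a, n - x + b) posterior
    pa, pb = x + a, n - x + b
    mean = pa / (pa + pb)
    # Mode formula from the hint above, valid when pa > 1 or pb > 1
    mode = max(pa - 1, 0) / (max(pa, 1) + max(pb, 1) - 2)
    return mean, mode

n, x = 20, 7    # arbitrary data
u = posterior_summaries(n, x, 1.0, 1.0)                        # (i) uniform = Beta(1, 1)
j = posterior_summaries(n, x, 0.5, 0.5)                        # (ii) density prop. to sqrt(I(theta))
m = posterior_summaries(n, x, np.sqrt(n) / 2, np.sqrt(n) / 2)  # (iii) Beta(sqrt(n)/2, sqrt(n)/2)
print(u, j, m)
```

Note that under the uniform prior the posterior mode equals $X / n$, the maximum likelihood estimate.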

comment
• # Paper 3, Section II, J

Define what it means for an estimator $\hat{\theta}$ of an unknown parameter $\theta$ to be consistent.

Let $S_{n}$ be a sequence of random real-valued continuous functions defined on $\mathbb{R}$ such that, as $n \rightarrow \infty, S_{n}(\theta)$ converges to $S(\theta)$ in probability for every $\theta \in \mathbb{R}$, where $S: \mathbb{R} \rightarrow \mathbb{R}$ is non-random. Suppose that for some $\theta_{0} \in \mathbb{R}$ and every $\varepsilon>0$ we have

$S\left(\theta_{0}-\varepsilon\right)<0<S\left(\theta_{0}+\varepsilon\right)$

and that $S_{n}$ has exactly one zero $\hat{\theta}_{n}$ for every $n \in \mathbb{N}$. Show that $\hat{\theta}_{n} \rightarrow \theta_{0}$ as $n \rightarrow \infty$, and deduce from this that the maximum likelihood estimator (MLE) based on observations $X_{1}, \ldots, X_{n}$ from a $N(\theta, 1), \theta \in \mathbb{R}$ model is consistent.

Now consider independent observations $\mathbf{X}_{1}, \ldots, \mathbf{X}_{n}$ of bivariate normal random vectors

$\mathbf{X}_{i}=\left(X_{1 i}, X_{2 i}\right)^{T} \sim N_{2}\left[\left(\mu_{i}, \mu_{i}\right)^{T}, \sigma^{2} I_{2}\right], \quad i=1, \ldots, n,$

where $\mu_{i} \in \mathbb{R}, \sigma>0$ and $I_{2}$ is the $2 \times 2$ identity matrix. Find the MLE $\hat{\mu}=\left(\hat{\mu}_{1}, \ldots, \hat{\mu}_{n}\right)^{T}$ of $\mu=\left(\mu_{1}, \ldots, \mu_{n}\right)^{T}$ and show that the MLE of $\sigma^{2}$ equals

$\hat{\sigma}^{2}=\frac{1}{n} \sum_{i=1}^{n} s_{i}^{2}, \quad s_{i}^{2}=\frac{1}{2}\left[\left(X_{1 i}-\hat{\mu}_{i}\right)^{2}+\left(X_{2 i}-\hat{\mu}_{i}\right)^{2}\right]$

Show that $\hat{\sigma}^{2}$ is not consistent for estimating $\sigma^{2}$. Explain briefly why the MLE fails in this model.

[You may use the Law of Large Numbers without proof.]
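As a numerical aside (not part of the question), the inconsistency is easy to see by simulation: $s_{i}^{2}=\left(X_{1 i}-X_{2 i}\right)^{2} / 4$ has mean $\sigma^{2} / 2$, so the law of large numbers drives $\hat{\sigma}^{2}$ to $\sigma^{2} / 2$ rather than $\sigma^{2}$. A sketch with the arbitrary choices $\sigma=2$ and $n=200{,}000$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 200_000, 2.0                      # arbitrary choices
mu = rng.uniform(-5, 5, size=n)              # one nuisance mean per observation pair
X1 = mu + sigma * rng.standard_normal(n)
X2 = mu + sigma * rng.standard_normal(n)
mu_hat = (X1 + X2) / 2                       # MLE of each mu_i
s2 = 0.5 * ((X1 - mu_hat) ** 2 + (X2 - mu_hat) ** 2)
sigma2_hat = s2.mean()
print(sigma2_hat)       # close to sigma^2 / 2 = 2, not sigma^2 = 4
```

The number of nuisance parameters $\mu_{i}$ grows with $n$, so the usual fixed-dimension asymptotics behind MLE consistency do not apply.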

comment
• # Paper 4, Section II, 24J

Given independent and identically distributed observations $X_{1}, \ldots, X_{n}$ with finite mean $E\left(X_{1}\right)=\mu$ and variance $\operatorname{Var}\left(X_{1}\right)=\sigma^{2}$, explain the notion of a bootstrap sample $X_{1}^{b}, \ldots, X_{n}^{b}$, and discuss how you can use it to construct a confidence interval $C_{n}$ for $\mu$.

Suppose you can operate a random number generator that can simulate independent uniform random variables $U_{1}, \ldots, U_{n}$ on $[0,1]$. How can you use such a random number generator to simulate a bootstrap sample?

Suppose that $\left(F_{n}: n \in \mathbb{N}\right)$ and $F$ are cumulative probability distribution functions defined on the real line, that $F_{n}(t) \rightarrow F(t)$ as $n \rightarrow \infty$ for every $t \in \mathbb{R}$, and that $F$ is continuous on $\mathbb{R}$. Show that, as $n \rightarrow \infty$,

$\sup _{t \in \mathbb{R}}\left|F_{n}(t)-F(t)\right| \rightarrow 0 .$

State (without proof) the theorem about the consistency of the bootstrap of the mean, and use it to give an asymptotic justification of the confidence interval $C_{n}$. That is, prove that as $n \rightarrow \infty, P^{\mathbb{N}}\left(\mu \in C_{n}\right) \rightarrow 1-\alpha$ where $P^{\mathbb{N}}$ is the joint distribution of $X_{1}, X_{2}, \ldots$

[You may use standard facts of stochastic convergence and the Central Limit Theorem without proof.]
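As a numerical aside (not part of the question), one way to use the uniforms is that $\left\lceil n U_{i}\right\rceil$ is uniform on $\{1, \ldots, n\}$, so indexing the data by it produces a bootstrap sample. The sketch below builds a percentile-type interval for $\mu$ (the exponential data, $n=500$ and $B=2000$ resamples are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.exponential(scale=1.0, size=500)     # sample with true mean mu = 1 (arbitrary choice)
n, B, alpha = len(X), 2000, 0.05

# Bootstrap via uniforms: ceil(n * U) is uniform on {1, ..., n}
U = rng.uniform(size=(B, n))
idx = np.maximum(np.ceil(n * U), 1).astype(int) - 1
boot_means = X[idx].mean(axis=1)

# Percentile-type interval for mu from the bootstrap distribution of the centred mean
lo, hi = np.quantile(boot_means - X.mean(), [alpha / 2, 1 - alpha / 2])
C = (X.mean() - hi, X.mean() - lo)
print(C)
```

The endpoints are read off the bootstrap distribution of $\bar{X}_{n}^{b}-\bar{X}_{n}$, which is the construction that the consistency theorem justifies asymptotically.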

comment

• # Paper 1, Section II, J

State without proof the inequality known as the Cramér-Rao lower bound in a parametric model $\{f(\cdot, \theta): \theta \in \Theta\}, \Theta \subseteq \mathbb{R}$. Give an example of a maximum likelihood estimator that attains this lower bound, and justify your answer.

Give an example of a parametric model where the maximum likelihood estimator based on observations $X_{1}, \ldots, X_{n}$ is biased. State without proof an analogue of the Cramér-Rao inequality for biased estimators.

Define the concept of a minimax decision rule, and show that the maximum likelihood estimator $\hat{\theta}_{\mathrm{MLE}}$ based on $X_{1}, \ldots, X_{n}$ in a $N(\theta, 1)$ model is minimax for estimating $\theta \in \Theta=\mathbb{R}$ in quadratic risk.
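As a numerical aside (not part of the question), in the $N(\theta, 1)$ model $I(\theta)=1$ per observation, so the Cramér-Rao bound for unbiased estimators from $n$ observations is $1 / n$, and the MLE (the sample mean) attains it exactly. A quick check with the arbitrary choices $\theta=1.5$, $n=50$:

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 1.5, 50, 100_000    # arbitrary choices

# I(theta) = 1 per observation in N(theta, 1), so the Cramer-Rao bound
# over n observations is 1 / n.
X = rng.normal(theta, 1.0, size=(reps, n))
mle = X.mean(axis=1)                 # the MLE is the sample mean
print(mle.mean(), mle.var())         # approx theta and 1 / n = 0.02
```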

comment
• # Paper 2, Section II, J

In a general decision problem, define the concepts of a Bayes rule and of admissibility. Show that a unique Bayes rule is admissible.

Consider i.i.d. observations $X_{1}, \ldots, X_{n}$ from a $\operatorname{Poisson}(\theta)$, $\theta \in \Theta=(0, \infty)$, model. Can the maximum likelihood estimator $\hat{\theta}_{\mathrm{MLE}}$ of $\theta$ be a Bayes rule for estimating $\theta$ in quadratic risk for any prior distribution on $\theta$ that has a continuous probability density on $(0, \infty)$? Justify your answer.

Now model the $X_{i}$ as i.i.d. copies of $X \mid \theta \sim \operatorname{Poisson}(\theta)$, where $\theta$ is drawn from a prior that is a Gamma distribution with parameters $\alpha>0$ and $\lambda>0$ (given below). Show that the posterior distribution of $\theta \mid X_{1}, \ldots, X_{n}$ is a Gamma distribution and find its parameters. Find the Bayes rule $\hat{\theta}_{BA}$