Paper 1, Section II, J

Principles of Statistics | Part II, 2019

In a regression problem, for a given fixed $X \in \mathbb{R}^{n \times p}$, we observe $Y \in \mathbb{R}^{n}$ such that

$$Y = X \theta_{0} + \varepsilon$$

for an unknown $\theta_{0} \in \mathbb{R}^{p}$ and $\varepsilon$ random such that $\varepsilon \sim \mathcal{N}\left(0, \sigma^{2} I_{n}\right)$ for some known $\sigma^{2} > 0$.

(a) When $p \leqslant n$ and $X$ has rank $p$, compute the maximum likelihood estimator $\hat{\theta}_{MLE}$ for $\theta_{0}$. When $p > n$, what issue is there with the likelihood maximisation approach and how many maximisers of the likelihood are there (if any)?
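
As a sanity check (not part of the question), the following Python sketch fits the model on synthetic data; all dimensions and parameter values below are invented purely for illustration.

```python
# Minimal numerical sketch with made-up synthetic data: for p <= n and X of
# full column rank, the Gaussian likelihood is maximised by the least-squares
# estimator theta_hat = (X^T X)^{-1} X^T Y.
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 3, 1.0                       # illustrative dimensions only
X = rng.normal(size=(n, p))
theta0 = np.array([2.0, -1.0, 0.5])            # "unknown" truth for the simulation
Y = X @ theta0 + sigma * rng.normal(size=n)

theta_mle = np.linalg.solve(X.T @ X, X.T @ Y)  # (X^T X)^{-1} X^T Y
print(theta_mle)

# When p > n, X^T X is singular: every theta with X theta equal to the fitted
# values attains the same likelihood, so the maximiser is no longer unique.
```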

(b) For any $\lambda > 0$ fixed, we consider $\hat{\theta}_{\lambda}$ minimising

$$\|Y - X \theta\|_{2}^{2} + \lambda \|\theta\|_{2}^{2}$$

over $\mathbb{R}^{p}$. Derive an expression for $\hat{\theta}_{\lambda}$ and show it is well defined, i.e., there is a unique minimiser for every $X$, $Y$ and $\lambda$.
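
A short sketch of the resulting ridge estimator, again on synthetic inputs: since $X^{\top} X + \lambda I_{p}$ is positive definite for every $\lambda > 0$, the objective is strictly convex and the linear system below always has a unique solution.

```python
# Ridge estimator of part (b): theta_lambda = (X^T X + lambda I_p)^{-1} X^T Y.
# The matrix X^T X + lambda I_p is positive definite for lambda > 0, so the
# solve below never fails, whatever the rank of X (including p > n).
import numpy as np

def ridge(X, Y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Example with made-up data: the estimate shrinks towards zero as lambda grows.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=50)
print(ridge(X, Y, 0.1))
print(ridge(X, Y, 100.0))
```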

Assume $p \leqslant n$ and that $X$ has rank $p$. Let $\Sigma = X^{\top} X$ and note that $\Sigma = V \Lambda V^{\top}$ for some orthogonal matrix $V$ and some diagonal matrix $\Lambda$ whose diagonal entries satisfy $\Lambda_{1,1} \geqslant \Lambda_{2,2} \geqslant \ldots \geqslant \Lambda_{p,p}$. Assume that the columns of $X$ have mean zero.

(c) Denote the columns of $U = X V$ by $u_{1}, \ldots, u_{p}$. Show that they are sample principal components, i.e., that their pairwise sample correlations are zero and that they have sample variances $n^{-1} \Lambda_{1,1}, \ldots, n^{-1} \Lambda_{p,p}$, respectively. [Hint: the sample covariance between $u_{i}$ and $u_{j}$ is $n^{-1} u_{i}^{\top} u_{j}$.]
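
A numerical check of this claim (synthetic data, columns centred as assumed above): $U^{\top} U = V^{\top} X^{\top} X V = \Lambda$, so the off-diagonal sample covariances vanish and the sample variances are $n^{-1} \Lambda_{i,i}$.

```python
# Check of part (c) on synthetic data: after centring the columns of X,
# the columns of U = X V have zero pairwise sample covariance and sample
# variances Lambda_ii / n, because U^T U = V^T (X^T X) V = Lambda.
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                         # columns of X have mean zero

eigvals, V = np.linalg.eigh(X.T @ X)           # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]              # reorder so Lambda_11 >= ... >= Lambda_pp
Lam, V = eigvals[order], V[:, order]

U = X @ V
print(np.allclose(U.T @ U, np.diag(Lam)))      # True: pairwise correlations are zero
print(U.var(axis=0), Lam / n)                  # sample variances equal Lambda_ii / n
```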

(d) Show that

$$\hat{Y}_{MLE} = X \hat{\theta}_{MLE} = U \Lambda^{-1} U^{\top} Y.$$

Conclude that the prediction $\hat{Y}_{MLE}$ is the closest point to $Y$ within the subspace spanned by the normalised sample principal components of part (c).
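
The identity can be verified numerically on synthetic data: $U \Lambda^{-1} U^{\top} = \sum_{i} e_{i} e_{i}^{\top}$ with $e_{i} = u_{i} / \|u_{i}\|_{2}$, i.e. the orthogonal projector onto the span of the normalised principal components.

```python
# Check of part (d) on synthetic data: X theta_MLE equals U Lambda^{-1} U^T Y,
# which is the orthogonal projection of Y onto the span of the normalised
# principal components e_i = u_i / ||u_i||_2 (note ||u_i||_2^2 = Lambda_ii).
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
Y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

eigvals, V = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
Lam, V = eigvals[order], V[:, order]
U = X @ V

Y_mle = X @ np.linalg.solve(X.T @ X, X.T @ Y)                  # X theta_MLE
E = U / np.linalg.norm(U, axis=0)                              # normalised components
print(np.allclose(Y_mle, U @ np.diag(1.0 / Lam) @ U.T @ Y))    # the displayed identity
print(np.allclose(Y_mle, E @ (E.T @ Y)))                       # = projection onto span(E)
```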

(e) Show that

$$\hat{Y}_{\lambda} = X \hat{\theta}_{\lambda} = U\left(\Lambda + \lambda I_{p}\right)^{-1} U^{\top} Y.$$

Assume $\Lambda_{1,1}, \Lambda_{2,2}, \ldots, \Lambda_{q,q} \gg \lambda \gg \Lambda_{q+1,q+1}, \ldots, \Lambda_{p,p}$ for some $1 \leqslant q < p$. Conclude that the prediction $\hat{Y}_{\lambda}$ is approximately the closest point to $Y$ within the subspace spanned by the $q$ normalised sample principal components of part (c) with the greatest variance.
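
A rough numerical illustration (all scales below are invented so that the eigenvalue gap holds with $q = 1$): the shrinkage factors $\Lambda_{i,i}/(\Lambda_{i,i} + \lambda)$ are close to 1 for the first $q$ components and close to 0 for the rest, so $\hat{Y}_{\lambda}$ nearly coincides with the projection of $Y$ onto the top $q$ normalised components.

```python
# Illustration of part (e) with invented scales: the first eigenvalue is much
# larger than lambda and the remaining ones much smaller, so the ridge
# prediction is close to the projection of Y onto the top q = 1 components.
import numpy as np

rng = np.random.default_rng(3)
n, p, q, lam = 200, 3, 1, 10.0
X = rng.normal(size=(n, p)) * np.array([10.0, 0.1, 0.05])   # well-separated scales
X = X - X.mean(axis=0)
Y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=n)

eigvals, V = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
Lam, V = eigvals[order], V[:, order]
U = X @ V

Y_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
E_q = U[:, :q] / np.linalg.norm(U[:, :q], axis=0)           # top-q normalised components
Y_proj = E_q @ (E_q.T @ Y)
print(np.linalg.norm(Y_ridge - Y_proj) / np.linalg.norm(Y_proj))  # small relative error
```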
