Paper 1, Section II, J

Principles of Statistics | Part II, 2019

In a regression problem, for a given fixed $X \in \mathbb{R}^{n \times p}$, we observe $Y \in \mathbb{R}^{n}$ such that

$$Y = X \theta_{0} + \varepsilon$$

for an unknown $\theta_{0} \in \mathbb{R}^{p}$ and $\varepsilon$ random such that $\varepsilon \sim \mathcal{N}\left(0, \sigma^{2} I_{n}\right)$ for some known $\sigma^{2} > 0$.

(a) When $p \leqslant n$ and $X$ has rank $p$, compute the maximum likelihood estimator $\hat{\theta}_{MLE}$ for $\theta_{0}$. When $p > n$, what issue is there with the likelihood maximisation approach and how many maximisers of the likelihood are there (if any)?
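
As a sanity check (not part of the question), the following Python sketch fits the model on synthetic data; all dimensions and parameter values below are invented purely for illustration.

```python
# Minimal numerical sketch with made-up synthetic data: for p <= n and X of
# full column rank, the Gaussian likelihood is maximised by the least-squares
# estimator theta_hat = (X^T X)^{-1} X^T Y.
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 3, 1.0                       # illustrative dimensions only
X = rng.normal(size=(n, p))
theta0 = np.array([2.0, -1.0, 0.5])            # "unknown" truth for the simulation
Y = X @ theta0 + sigma * rng.normal(size=n)

theta_mle = np.linalg.solve(X.T @ X, X.T @ Y)  # (X^T X)^{-1} X^T Y
print(theta_mle)

# When p > n, X^T X is singular: every theta with X theta equal to the fitted
# values attains the same likelihood, so the maximiser is no longer unique.
```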

(b) For any $\lambda > 0$ fixed, we consider $\hat{\theta}_{\lambda}$ minimising

$$\|Y - X \theta\|_{2}^{2} + \lambda \|\theta\|_{2}^{2}$$

over $\mathbb{R}^{p}$. Derive an expression for $\hat{\theta}_{\lambda}$ and show it is well defined, i.e., there is a unique minimiser for every $X$, $Y$ and $\lambda$.
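
A short sketch of the resulting ridge estimator, again on synthetic inputs: since $X^{\top} X + \lambda I_{p}$ is positive definite for every $\lambda > 0$, the objective is strictly convex and the linear system below always has a unique solution.

```python
# Ridge estimator of part (b): theta_lambda = (X^T X + lambda I_p)^{-1} X^T Y.
# The matrix X^T X + lambda I_p is positive definite for lambda > 0, so the
# solve below never fails, whatever the rank of X (including p > n).
import numpy as np

def ridge(X, Y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Example with made-up data: the estimate shrinks towards zero as lambda grows.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=50)
print(ridge(X, Y, 0.1))
print(ridge(X, Y, 100.0))
```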

Assume $p \leqslant n$ and that $X$ has rank $p$. Let $\Sigma = X^{\top} X$ and note that $\Sigma = V \Lambda V^{\top}$ for some orthogonal matrix $V$ and some diagonal matrix $\Lambda$ whose diagonal entries satisfy $\Lambda_{1,1} \geqslant \Lambda_{2,2} \geqslant \ldots \geqslant \Lambda_{p,p}$. Assume that the columns of $X$ have mean zero.

(c) Denote the columns of $U = X V$ by $u_{1}, \ldots, u_{p}$. Show that they are sample principal components, i.e., that their pairwise sample correlations are zero and that they have sample variances $n^{-1} \Lambda_{1,1}, \ldots, n^{-1} \Lambda_{p,p}$, respectively. [Hint: the sample covariance between $u_{i}$ and $u_{j}$ is $n^{-1} u_{i}^{\top} u_{j}$.]
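
A numerical check of this claim (synthetic data, columns centred as assumed above): $U^{\top} U = V^{\top} X^{\top} X V = \Lambda$, so the off-diagonal sample covariances vanish and the sample variances are $n^{-1} \Lambda_{i,i}$.

```python
# Check of part (c) on synthetic data: after centring the columns of X,
# the columns of U = X V have zero pairwise sample covariance and sample
# variances Lambda_ii / n, because U^T U = V^T (X^T X) V = Lambda.
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                         # columns of X have mean zero

eigvals, V = np.linalg.eigh(X.T @ X)           # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]              # reorder so Lambda_11 >= ... >= Lambda_pp
Lam, V = eigvals[order], V[:, order]

U = X @ V
print(np.allclose(U.T @ U, np.diag(Lam)))      # True: pairwise correlations are zero
print(U.var(axis=0), Lam / n)                  # sample variances equal Lambda_ii / n
```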

(d) Show that

$$\hat{Y}_{MLE} = X \hat{\theta}_{MLE} = U \Lambda^{-1} U^{\top} Y.$$

Conclude that the prediction $\hat{Y}_{MLE}$ is the closest point to $Y$ within the subspace spanned by the normalised sample principal components of part (c).
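
The identity can be verified numerically on synthetic data: $U \Lambda^{-1} U^{\top} = \sum_{i} e_{i} e_{i}^{\top}$ with $e_{i} = u_{i} / \|u_{i}\|_{2}$, i.e. the orthogonal projector onto the span of the normalised principal components.

```python
# Check of part (d) on synthetic data: X theta_MLE equals U Lambda^{-1} U^T Y,
# which is the orthogonal projection of Y onto the span of the normalised
# principal components e_i = u_i / ||u_i||_2 (note ||u_i||_2^2 = Lambda_ii).
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
Y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

eigvals, V = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
Lam, V = eigvals[order], V[:, order]
U = X @ V

Y_mle = X @ np.linalg.solve(X.T @ X, X.T @ Y)                  # X theta_MLE
E = U / np.linalg.norm(U, axis=0)                              # normalised components
print(np.allclose(Y_mle, U @ np.diag(1.0 / Lam) @ U.T @ Y))    # the displayed identity
print(np.allclose(Y_mle, E @ (E.T @ Y)))                       # = projection onto span(E)
```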

(e) Show that

$$\hat{Y}_{\lambda} = X \hat{\theta}_{\lambda} = U\left(\Lambda + \lambda I_{p}\right)^{-1} U^{\top} Y.$$

Assume $\Lambda_{1,1}, \Lambda_{2,2}, \ldots, \Lambda_{q,q} \gg \lambda \gg \Lambda_{q+1,q+1}, \ldots, \Lambda_{p,p}$ for some $1 \leqslant q < p$. Conclude that the prediction $\hat{Y}_{\lambda}$ is approximately the closest point to $Y$ within the subspace spanned by the $q$ normalised sample principal components of part (c) with the greatest variance.
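
A rough numerical illustration (all scales below are invented so that the eigenvalue gap holds with $q = 1$): the shrinkage factors $\Lambda_{i,i}/(\Lambda_{i,i} + \lambda)$ are close to 1 for the first $q$ components and close to 0 for the rest, so $\hat{Y}_{\lambda}$ nearly coincides with the projection of $Y$ onto the top $q$ normalised components.

```python
# Illustration of part (e) with invented scales: the first eigenvalue is much
# larger than lambda and the remaining ones much smaller, so the ridge
# prediction is close to the projection of Y onto the top q = 1 components.
import numpy as np

rng = np.random.default_rng(3)
n, p, q, lam = 200, 3, 1, 10.0
X = rng.normal(size=(n, p)) * np.array([10.0, 0.1, 0.05])   # well-separated scales
X = X - X.mean(axis=0)
Y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=n)

eigvals, V = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
Lam, V = eigvals[order], V[:, order]
U = X @ V

Y_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
E_q = U[:, :q] / np.linalg.norm(U[:, :q], axis=0)           # top-q normalised components
Y_proj = E_q @ (E_q.T @ Y)
print(np.linalg.norm(Y_ridge - Y_proj) / np.linalg.norm(Y_proj))  # small relative error
```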
