# Statistical Modelling

Paper 1, Section I, J

Let $\mu>0$. The probability density function of the inverse Gaussian distribution (with shape parameter equal to 1) is given by

$f(x ; \mu)=\frac{1}{\sqrt{2 \pi x^{3}}} \exp \left[-\frac{(x-\mu)^{2}}{2 \mu^{2} x}\right]$

Show that this is a one-parameter exponential family. What is its natural parameter? Show that this distribution has mean $\mu$ and variance $\mu^{3}$.
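One possible route (a sketch, not a full solution): writing out the log-density,

$\log f(x ; \mu)=-\frac{1}{2} \log \left(2 \pi x^{3}\right)-\frac{1}{2 x}+\frac{1}{\mu}-\frac{x}{2 \mu^{2}}$

so the density has exponential-family form $h(x) \exp \{\theta x-\kappa(\theta)\}$ with natural parameter $\theta=-1 /\left(2 \mu^{2}\right)$ and cumulant function $\kappa(\theta)=-1 / \mu=-\sqrt{-2 \theta}$. The mean and variance then follow from $\kappa^{\prime}(\theta)=(-2 \theta)^{-1 / 2}=\mu$ and $\kappa^{\prime \prime}(\theta)=(-2 \theta)^{-3 / 2}=\mu^{3}$.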

Paper 1, Section II, J

The following data were obtained in a randomised controlled trial for a drug. Due to a manufacturing error, a subset of trial participants received a low dose (LD) instead of a standard dose (SD) of the drug.

(a) Below we analyse the data using Poisson regression:

(i) After introducing necessary notation, write down the Poisson models being fitted above.

(ii) Write down the corresponding multinomial models, then state the key theoretical result (the "Poisson trick") that allows you to fit the multinomial models using Poisson regression. [You do not need to prove this theoretical result.]

(iii) Explain why the number of degrees of freedom in the likelihood ratio test is 2 in the analysis of deviance table. What can you conclude about the drug?

(b) Below is the summary table of the second model:

(i) Drug efficacy is defined as one minus the ratio of the probability of worsening in the treated group to the probability of worsening in the control group. By using a more sophisticated method, a published analysis estimated that the drug efficacy is $90.0\%$ for the LD treatment and $62.1\%$ for the SD treatment. Are these numbers similar to what is obtained by Poisson regression? [Hint: $e^{-1} \approx 0.37$, $e^{-2} \approx 0.14$, and $e^{-3} \approx 0.05$, where $e$ is the base of the natural logarithm.]

(ii) Explain why the information in the summary table is not enough to test the hypothesis that the LD drug and the SD drug have the same efficacy. Then describe how you can test this hypothesis using analysis of deviance in $R$.

Paper 2, Section I, J

Define a generalised linear model for a sample $Y_{1}, \ldots, Y_{n}$ of independent random variables. Define further the concept of the link function. Define the binomial regression model (without the dispersion parameter) with logistic and probit link functions. Which of these is the canonical link function?

Paper 3, Section I, J

Consider the normal linear model $Y \mid X \sim \mathrm{N}\left(X \beta, \sigma^{2} I\right)$, where $X$ is an $n \times p$ design matrix, $Y$ is a vector of responses, $I$ is the $n \times n$ identity matrix, and $\beta, \sigma^{2}$ are unknown parameters.

Derive the maximum likelihood estimator of the pair $\beta$ and $\sigma^{2}$. What is the distribution of the estimator of $\sigma^{2}$ ? Use it to construct a $(1-\alpha)$-level confidence interval of $\sigma^{2}$. [You may use without proof the fact that the "hat matrix" $H=X\left(X^{T} X\right)^{-1} X^{T}$ is a projection matrix.]
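One standard derivation (a sketch): maximising the Gaussian log-likelihood gives

$\hat{\beta}=\left(X^{T} X\right)^{-1} X^{T} Y, \quad \hat{\sigma}^{2}=\frac{1}{n}\|Y-X \hat{\beta}\|^{2}=\frac{1}{n}\|(I-H) Y\|^{2}$

Since $I-H$ is a projection onto an $(n-p)$-dimensional space, $n \hat{\sigma}^{2} / \sigma^{2} \sim \chi_{n-p}^{2}$, which yields the confidence interval

$\left(\frac{n \hat{\sigma}^{2}}{\chi_{n-p}^{2}(1-\alpha / 2)}, \frac{n \hat{\sigma}^{2}}{\chi_{n-p}^{2}(\alpha / 2)}\right)$

where $\chi_{n-p}^{2}(q)$ denotes the $q$-quantile of the $\chi_{n-p}^{2}$ distribution.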

Paper 4, Section I, J

The data frame data contains the daily number of new avian influenza cases in a large poultry farm.

Write down the model being fitted by the $R$ code below. Does the model seem to provide a satisfactory fit to the data? Justify your answer.

The owner of the farm estimated that the size of the epidemic was initially doubling every 7 days. Is that estimate supported by the analysis below? [You may need $\log 2 \approx 0.69$.]
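A sketch of the arithmetic behind the doubling-time claim: if the expected count grows as $e^{bt}$ in days $t$, doubling every 7 days means

$e^{7 b}=2 \quad \Longleftrightarrow \quad b=\frac{\log 2}{7} \approx \frac{0.69}{7} \approx 0.099$

so the owner's estimate is supported if the fitted coefficient of time (on the log scale) is close to $0.099$.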

Paper 4, Section II, J

Let $X$ be an $n \times p$ non-random design matrix and $Y$ be an $n$-vector of random responses. Suppose $Y \sim N\left(\mu, \sigma^{2} I\right)$, where $\mu$ is an unknown vector and $\sigma^{2}>0$ is known.

(a) Let $\lambda \geqslant 0$ be a constant. Consider the ridge regression problem

$\hat{\beta}_{\lambda}=\arg \min _{\beta}\|Y-X \beta\|^{2}+\lambda\|\beta\|^{2} .$

Let $\hat{\mu}_{\lambda}=X \hat{\beta}_{\lambda}$ be the fitted values. Show that $\hat{\mu}_{\lambda}=H_{\lambda} Y$, where

$H_{\lambda}=X\left(X^{T} X+\lambda I\right)^{-1} X^{T}$
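Part (a) reduces to the normal equations (a sketch): differentiating the objective in $\beta$ and setting the gradient to zero gives

$-2 X^{T}(Y-X \beta)+2 \lambda \beta=0 \quad \Longrightarrow \quad \hat{\beta}_{\lambda}=\left(X^{T} X+\lambda I\right)^{-1} X^{T} Y$

so $\hat{\mu}_{\lambda}=X \hat{\beta}_{\lambda}=H_{\lambda} Y$; the objective is strictly convex whenever $X^{T} X+\lambda I$ is positive definite, so this stationary point is the minimiser.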

(b) Show that

$\mathbb{E}\left(\left\|Y-\hat{\mu}_{\lambda}\right\|^{2}\right)=\left\|\left(I-H_{\lambda}\right) \mu\right\|^{2}+\left\{n-2 \operatorname{trace}\left(H_{\lambda}\right)+\operatorname{trace}\left(H_{\lambda}^{2}\right)\right\} \sigma^{2}$
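A sketch of the decomposition behind part (b): writing $Y=\mu+\varepsilon$ with $\varepsilon \sim \mathrm{N}\left(0, \sigma^{2} I\right)$,

$Y-\hat{\mu}_{\lambda}=\left(I-H_{\lambda}\right) \mu+\left(I-H_{\lambda}\right) \varepsilon$

The cross term has zero expectation, and since $H_{\lambda}$ is symmetric, $\mathbb{E}\left\|\left(I-H_{\lambda}\right) \varepsilon\right\|^{2}=\sigma^{2} \operatorname{trace}\left\{\left(I-H_{\lambda}\right)^{2}\right\}=\sigma^{2}\left\{n-2 \operatorname{trace}\left(H_{\lambda}\right)+\operatorname{trace}\left(H_{\lambda}^{2}\right)\right\}$.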

(c) Let $Y^{*}=\mu+\epsilon^{*}$, where $\epsilon^{*} \sim \mathrm{N}\left(0, \sigma^{2} I\right)$ is independent of $Y$. Show that $\left\|Y-\hat{\mu}_{\lambda}\right\|^{2}+2 \sigma^{2} \operatorname{trace}\left(H_{\lambda}\right)$ is an unbiased estimator of $\mathbb{E}\left(\left\|Y^{*}-\hat{\mu}_{\lambda}\right\|^{2}\right)$.

(d) Describe the behaviour (monotonicity and limits) of $\mathbb{E}\left(\left\|Y^{*}-\hat{\mu}_{\lambda}\right\|^{2}\right)$ as a function of $\lambda$ when $p=n$ and $X=I$. What is the minimum value of $\mathbb{E}\left(\left\|Y^{*}-\hat{\mu}_{\lambda}\right\|^{2}\right)$ ?

Paper 1, Section I, J

Consider a generalised linear model with full column rank design matrix $X \in \mathbb{R}^{n \times p}$, output variables $Y=\left(Y_{1}, \ldots, Y_{n}\right) \in \mathbb{R}^{n}$, link function $g$, mean parameters $\mu=\left(\mu_{1}, \ldots, \mu_{n}\right)$ and known dispersion parameters $\sigma_{i}^{2}=a_{i} \sigma^{2}, i=1, \ldots, n$. Denote its variance function by $V$ and recall that $g\left(\mu_{i}\right)=x_{i}^{T} \beta, i=1, \ldots, n$, where $\beta \in \mathbb{R}^{p}$ and $x_{i}^{T}$ is the $i^{\text {th }}$ row of $X$.

(a) Define the score function in terms of the log-likelihood function and the Fisher information matrix, and define the update of the Fisher scoring algorithm.

(b) Let $W \in \mathbb{R}^{n \times n}$ be a diagonal matrix with positive entries. Note that $X^{T} W X$ is invertible. Show that

$\operatorname{argmin}_{b \in \mathbb{R}^{p}}\left\{\sum_{i=1}^{n} W_{i i}\left(Y_{i}-x_{i}^{T} b\right)^{2}\right\}=\left(X^{T} W X\right)^{-1} X^{T} W Y$

[Hint: you may use that $\operatorname{argmin}_{b \in \mathbb{R}^{p}}\left\{\|Y-X b\|^{2}\right\}=\left(X^{T} X\right)^{-1} X^{T} Y$.]
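Part (b) follows from the hint by a change of variables (a sketch): with $\tilde{Y}:=W^{1 / 2} Y$ and $\tilde{X}:=W^{1 / 2} X$,

$\sum_{i=1}^{n} W_{i i}\left(Y_{i}-x_{i}^{T} b\right)^{2}=\|\tilde{Y}-\tilde{X} b\|^{2}$

whose minimiser is $\left(\tilde{X}^{T} \tilde{X}\right)^{-1} \tilde{X}^{T} \tilde{Y}=\left(X^{T} W X\right)^{-1} X^{T} W Y$.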

(c) Recall that the score function and the Fisher information matrix have entries

$\begin{aligned} &U_{j}(\beta)=\sum_{i=1}^{n} \frac{\left(Y_{i}-\mu_{i}\right) X_{i j}}{a_{i} \sigma^{2} V\left(\mu_{i}\right) g^{\prime}\left(\mu_{i}\right)} \quad j=1, \ldots, p \\ &i_{j k}(\beta)=\sum_{i=1}^{n} \frac{X_{i j} X_{i k}}{a_{i} \sigma^{2} V\left(\mu_{i}\right)\left\{g^{\prime}\left(\mu_{i}\right)\right\}^{2}} \quad j, k=1, \ldots, p \end{aligned}$

Justify, performing the necessary calculations and using part (b), why the Fisher scoring algorithm is also known as the iterative reweighted least squares algorithm.

Paper 1, Section II, J

We consider a subset of the data on car insurance claims from Hallin and Ingenbleek (1983). For each customer, the dataset includes total payments made per policy-year, the amount of kilometres driven, the bonus from not having made previous claims, and the brand of the car. The amount of kilometres driven is a factor taking values $1,2,3,4$, or 5, where a car in level $i+1$ has driven a larger number of kilometres than a car in level $i$ for any $i=1,2,3,4$. A statistician from an insurance company fits the following model in $R$.

> model1 <- lm(Paymentperpolicyyr ~ as.numeric(Kilometres) + Brand + Bonus)

(i) Why do you think the statistician transformed the variable Kilometres from a factor to a numerical variable?

(ii) To check the quality of the model, the statistician applies a function to model1 which returns the following figure:

What does the plot represent? Does it suggest that model1 is a good model? Explain. If not, write down a model which the plot suggests could be better.

(iii) The statistician fits the model suggested by the graph and calls it model2. Consider the following abbreviated output:

> summary(model2)

...

Coefficients:

                       Estimate Std. Error t value Pr(>|t|)
(Intercept)            6.514035   0.186339  34.958  < 2e-16 ***
as.numeric(Kilometres) 0.057132   0.032654   1.750  0.08126 .
Brand2                 0.363869   0.186857   1.947  0.05248 .
...
Brand9                 0.125446   0.186857   0.671  0.50254
Bonus                  ...
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7817 on 284 degrees of freedom
...

Using the output, write down a $95 \%$ prediction interval for the ratio between the total payments per policy year for two cars of the same brand and with the same value of Bonus, one of which has a Kilometres value one higher than the other. You may express your answer as a function of quantiles of a common distribution, which you should specify.

(iv) Write down a generalised linear model for Paymentperpolicyyr which may be a better model than model1 and give two reasons. You must specify the link function.

Paper 2, Section I, J

The data frame WCG contains data from a study started in 1960 about heart disease. The study used 3154 adult men, all free of heart disease at the start, and eight and a half years later it recorded into variable chd whether they suffered from heart disease (1 if the respective man did and 0 otherwise) along with their height and average number of cigarettes smoked per day. Consider the $R$ code below and its abbreviated output.

(a) Write down the model fitted by the code above.

(b) Interpret the effect on heart disease of a man smoking an average of two packs of cigarettes per day if each pack contains 20 cigarettes.

(c) Give an alternative latent logistic-variable representation of the model. [Hint: if $F$ is the cumulative distribution function of a logistic random variable, its inverse function is the logit function.]

Paper 3, Section I, J

Suppose we have data $\left(Y_{1}, x_{1}^{T}\right), \ldots,\left(Y_{n}, x_{n}^{T}\right)$, where the $Y_{i}$ are independent conditional on the design matrix $X$ whose rows are the $x_{i}^{T}, i=1, \ldots, n$. Suppose that given $x_{i}$, the true probability density function of $Y_{i}$ is $f_{x_{i}}$, so that the data is generated from an element of a model $\mathcal{F}:=\left\{\left(f_{x_{i}}(\cdot ; \theta)\right)_{i=1}^{n}, \theta \in \Theta\right\}$ for some $\Theta \subseteq \mathbb{R}^{q}$ and $q \in \mathbb{N}$.

(a) Define the log-likelihood function for $\mathcal{F}$, the maximum likelihood estimator of $\theta$ and Akaike's Information Criterion (AIC) for $\mathcal{F}$.

From now on let $\mathcal{F}$ be the normal linear model, i.e. $Y:=\left(Y_{1}, \ldots, Y_{n}\right)^{T}=X \beta+\varepsilon$, where $X \in \mathbb{R}^{n \times p}$ has full column rank and $\varepsilon \sim N_{n}\left(0, \sigma^{2} I\right)$.

(b) Let $\hat{\sigma}^{2}$ denote the maximum likelihood estimator of $\sigma^{2}$. Show that the AIC of $\mathcal{F}$ is

$n\left(1+\log \left(2 \pi \hat{\sigma}^{2}\right)\right)+2(p+1)$

(c) Let $\chi_{n-p}^{2}$ be a chi-squared distribution on $n-p$ degrees of freedom. Using any results from the course, show that the distribution of the AIC of $\mathcal{F}$ is

$n \log \left(\chi_{n-p}^{2}\right)+n\left(\log \left(2 \pi \sigma^{2} / n\right)+1\right)+2(p+1)$

$\left[\right.$ Hint: $\hat{\sigma}^{2}:=n^{-1}\|Y-X \hat{\beta}\|^{2}=n^{-1}\|(I-P) \varepsilon\|^{2}$, where $\hat{\beta}$ is the maximum likelihood estimator of $\beta$ and $P$ is the projection matrix onto the column space of $X$.]
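A sketch of the computation in part (b): at the maximum likelihood estimators the log-likelihood is

$\ell\left(\hat{\beta}, \hat{\sigma}^{2}\right)=-\frac{n}{2} \log \left(2 \pi \hat{\sigma}^{2}\right)-\frac{1}{2 \hat{\sigma}^{2}}\|Y-X \hat{\beta}\|^{2}=-\frac{n}{2} \log \left(2 \pi \hat{\sigma}^{2}\right)-\frac{n}{2}$

so $\mathrm{AIC}=-2 \ell\left(\hat{\beta}, \hat{\sigma}^{2}\right)+2(p+1)=n\left(1+\log \left(2 \pi \hat{\sigma}^{2}\right)\right)+2(p+1)$. For part (c), the hint gives $n \hat{\sigma}^{2}=\sigma^{2} \chi_{n-p}^{2}$ in distribution, and substituting $\hat{\sigma}^{2}=\sigma^{2} \chi_{n-p}^{2} / n$ into the AIC yields the stated expression.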

Paper 4, Section I, J

Suppose you have a data frame with variables response, covar1, and covar2. You run the following commands in $R$.

covar2    0.3755    2.5978    0.145    0.886

...

(a) Consider the following three scenarios:

(i) All the output you have is the abbreviated output of summary (model) above.

(ii) You have the abbreviated output of summary (model) above together with

Residual standard error: 0.8097 on 47 degrees of freedom

Multiple R-squared: 0.8126, Adjusted R-squared: 0.8046

F-statistic: 101.9 on 2 and 47 DF, p-value: < 2.2e-16

(iii) You have the abbreviated output of summary (model) above together with

Residual standard error: 0.9184 on 47 degrees of freedom

Multiple R-squared: 0.000712, Adjusted R-squared: -0.04181

F-statistic: 0.01674 on 2 and 47 DF, p-value: 0.9834

What conclusion can you draw about which variables explain the response in each of the three scenarios? Explain.

(b) Assume now that you have the abbreviated output of summary (model) above together with

anova(lm(response ~ 1), lm(response ~ covar1), model)

What are the values of the entries with a question mark? [You may express your answers as arithmetic expressions if necessary].

Paper 4, Section II, J

(a) Define a generalised linear model (GLM) with design matrix $X \in \mathbb{R}^{n \times p}$, output variables $Y:=\left(Y_{1}, \ldots, Y_{n}\right)^{T}$ and parameters $\mu:=\left(\mu_{1}, \ldots, \mu_{n}\right)^{T}, \beta \in \mathbb{R}^{p}$ and $\sigma_{i}^{2}:=a_{i} \sigma^{2} \in(0, \infty), i=1, \ldots, n$. Derive the moment generating function of $Y$, i.e. give an expression for $\mathbb{E}\left[\exp \left(t^{T} Y\right)\right], t \in \mathbb{R}^{n}$, wherever it is well-defined.

Assume from now on that the GLM satisfies the usual regularity assumptions, $X$ has full column rank, and $\sigma^{2}$ is known and satisfies $1 / \sigma^{2} \in \mathbb{N}$.

(b) Let $\tilde{Y}:=\left(\tilde{Y}_{1}, \ldots, \tilde{Y}_{n / \sigma^{2}}\right)^{T}$ be the output variables of a GLM from the same family as that of part (a) and parameters $\tilde{\mu}:=\left(\tilde{\mu}_{1}, \ldots, \tilde{\mu}_{n / \sigma^{2}}\right)^{T}$ and $\tilde{\sigma}^{2}:=\left(\tilde{\sigma}_{1}^{2}, \ldots, \tilde{\sigma}_{n / \sigma^{2}}^{2}\right)$. Suppose the output variables may be split into $n$ blocks of size $1 / \sigma^{2}$ with constant parameters. To be precise, for each block $i=1, \ldots, n$, if $j \in\left\{(i-1) / \sigma^{2}+1, \ldots, i / \sigma^{2}\right\}$ then

$\tilde{\mu}_{j}=\mu_{i} \quad \text { and } \quad \tilde{\sigma}_{j}^{2}=a_{i}$

with $\mu_{i}=\mu_{i}(\beta)$ and $a_{i}$ defined as in part (a). Let $\bar{Y}:=\left(\bar{Y}_{1}, \ldots, \bar{Y}_{n}\right)^{T}$, where $\bar{Y}_{i}:=\sigma^{2} \sum_{k=1}^{1 / \sigma^{2}} \tilde{Y}_{(i-1) / \sigma^{2}+k}$.

(i) Show that $\bar{Y}$ is equal to $Y$ in distribution. [Hint: you may use without proof that moment generating functions uniquely determine distributions from exponential dispersion families.]

(ii) For any $\tilde{y} \in \mathbb{R}^{n / \sigma^{2}}$, let $\bar{y}=\left(\bar{y}_{1}, \ldots, \bar{y}_{n}\right)^{T}$, where $\bar{y}_{i}:=\sigma^{2} \sum_{k=1}^{1 / \sigma^{2}} \tilde{y}_{(i-1) / \sigma^{2}+k}$. Show that the model function of $\tilde{Y}$ satisfies

$f\left(\tilde{y} ; \tilde{\mu}, \tilde{\sigma}^{2}\right)=g_{1}\left(\bar{y} ; \tilde{\mu}, \tilde{\sigma}^{2}\right) \times g_{2}\left(\tilde{y} ; \tilde{\sigma}^{2}\right)$

for some functions $g_{1}, g_{2}$, and conclude that $\bar{Y}$ is a sufficient statistic for $\beta$ from $\tilde{Y}$.

(iii) For the model and data from part (a), let $\hat{\mu}$ be the maximum likelihood estimator for $\mu$ and let $D(Y ; \mu)$ be the deviance at $\mu$. Using (i) and (ii), show that

$\frac{D(Y ; \hat{\mu})}{\sigma^{2}}={ }^{d} 2 \log \left\{\frac{\sup _{\tilde{\mu}^{\prime} \in \widetilde{\mathcal{M}}_{1}} f\left(\tilde{Y} ; \tilde{\mu}^{\prime}, \tilde{\sigma}^{2}\right)}{\sup _{\tilde{\mu}^{\prime} \in \widetilde{\mathcal{M}}_{0}} f\left(\tilde{Y} ; \tilde{\mu}^{\prime}, \tilde{\sigma}^{2}\right)}\right\}$

where $=^{d}$ means equality in distribution and $\widetilde{\mathcal{M}}_{0}$ and $\widetilde{\mathcal{M}}_{1}$ are nested subspaces of $\mathbb{R}^{n / \sigma^{2}}$ which you should specify. Argue that $\operatorname{dim}\left(\widetilde{\mathcal{M}}_{1}\right)=n$ and $\operatorname{dim}\left(\widetilde{\mathcal{M}}_{0}\right)=p$, and, assuming the usual regularity assumptions, conclude that

$\frac{D(Y ; \hat{\mu})}{\sigma^{2}} \rightarrow^{d} \chi_{n-p}^{2} \quad \text { as } \sigma^{2} \rightarrow 0$

stating the name of the result from class that you use.

Paper 1, Section I, J

The Gamma distribution with shape parameter $\alpha>0$ and scale parameter $\lambda>0$ has probability density function

$f(y ; \alpha, \lambda)=\frac{\lambda^{\alpha}}{\Gamma(\alpha)} y^{\alpha-1} e^{-\lambda y} \quad \text { for } y>0$

Give the definition of an exponential dispersion family and show that the set of Gamma distributions forms one such family. Find the cumulant generating function and derive the mean and variance of the Gamma distribution as a function of $\alpha$ and $\lambda$.
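A sketch of the key computation: the log-density can be written as

$\log f(y ; \alpha, \lambda)=-\lambda y+\alpha \log \lambda+(\alpha-1) \log y-\log \Gamma(\alpha)$

which has exponential dispersion form, and for $t<\lambda$ the cumulant generating function is

$K(t)=\log \mathbb{E}\left[e^{t Y}\right]=-\alpha \log (1-t / \lambda)$

so that $\mathbb{E}[Y]=K^{\prime}(0)=\alpha / \lambda$ and $\operatorname{Var}(Y)=K^{\prime \prime}(0)=\alpha / \lambda^{2}$.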

Paper 1, Section II, J

The ice_cream data frame contains the result of a blind tasting of 90 ice creams, each of which is rated as poor, good, or excellent. It also contains the price of each ice cream classified into three categories. Consider the $R$ code below and its output.

(a) Write down the generalised linear model fitted by the code above.

(b) Prove that the fitted values resulting from the maximum likelihood estimator of the coefficients in this model are identical to those resulting from the maximum likelihood estimator when fitting a Multinomial model which assumes the number of ice creams at each price level is fixed.

(c) Using the output above, perform a goodness-of-fit test at the $1 \%$ level, specifying the null hypothesis, the test statistic, its asymptotic null distribution, any assumptions of the test and the decision from your test.

(d) If we believe that better ice creams are more expensive, what could be a more powerful test against the model fitted above and why?

Paper 2, Section I, J

The cycling data frame contains the results of a study on the effects of cycling to work among 1,000 participants with asthma, a respiratory illness. Half of the participants, chosen uniformly at random, received a monetary incentive to cycle to work, and the other half did not. The variables in the data frame are:

miles: the average number of miles cycled per week

episodes: the number of asthma episodes experienced during the study

incentive: whether or not a monetary incentive to cycle was given

history: the number of asthma episodes in the year preceding the study

Consider the $R$ code below and its abbreviated output.

> lm.1 = lm(episodes ~ miles + history, data=cycling)
> summary(lm.1)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.66937    0.07965   8.404  < 2e-16 ***
miles        -0.04917    0.01839  -2.674  0.00761 **
history       1.48954    0.04818  30.918  < 2e-16 ***

> lm.2 = lm(episodes ~ incentive + history, data=cycling)
> summary(lm.2)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.09539    0.06960   1.371    0.171
incentiveYes  0.91387    0.06504  14.051  < 2e-16 ***
history       1.46806    0.04346  33.782  < 2e-16 ***

> lm.3 = lm(miles ~ incentive + history, data=cycling)
> summary(lm.3)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.47050    0.11682  12.588  < 2e-16 ***
incentiveYes  1.73282    0.10917  15.872  < 2e-16 ***
history       0.47322    0.07294   6.487 1.37e-10 ***

(a) For each of the fitted models, briefly explain what can be inferred about participants with similar histories.

(b) Based on this analysis and the experimental design, is it advisable for a participant with asthma to cycle to work more often? Explain.

Paper 3, Section I, J

(a) For a given model with likelihood $L(\beta), \beta \in \mathbb{R}^{p}$, define the Fisher information matrix in terms of the Hessian of the log-likelihood.

Consider a generalised linear model with design matrix $X \in \mathbb{R}^{n \times p}$, output variables $y \in \mathbb{R}^{n}$, a bijective link function, mean parameters $\mu=\left(\mu_{1}, \ldots, \mu_{n}\right)$ and dispersion parameters $\sigma_{1}^{2}=\ldots=\sigma_{n}^{2}=\sigma^{2}$. Assume $\sigma^{2}$ is known.

(b) State the form of the log-likelihood.

(c) For the canonical link, show that when the parameter $\sigma^{2}$ is known, the Fisher information matrix is equal to

$\sigma^{-2} X^{T} W X$

for a diagonal matrix $W$ depending on the means $\mu$. Identify $W$.
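A sketch of why part (c) holds: for the canonical link $g=\left(b^{\prime}\right)^{-1}$ one has $g^{\prime}(\mu)=1 / V(\mu)$, so in the general GLM formula for the Fisher information the factor $V\left(\mu_{i}\right)\left\{g^{\prime}\left(\mu_{i}\right)\right\}^{2}$ reduces to $1 / V\left(\mu_{i}\right)$, giving

$i_{j k}(\beta)=\sigma^{-2} \sum_{i=1}^{n} X_{i j} X_{i k} V\left(\mu_{i}\right), \quad \text { i.e. } \quad W=\operatorname{diag}\left(V\left(\mu_{1}\right), \ldots, V\left(\mu_{n}\right)\right)$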

Paper 4, Section I, J

In a normal linear model with design matrix $X \in \mathbb{R}^{n \times p}$, output variables $y \in \mathbb{R}^{n}$ and parameters $\beta \in \mathbb{R}^{p}$ and $\sigma^{2}>0$, define a $(1-\alpha)$-level prediction interval for a new observation with input variables $x^{*} \in \mathbb{R}^{p}$. Derive an explicit formula for the interval, proving that it satisfies the properties required by the definition. [You may assume that the maximum likelihood estimator $\hat{\beta}$ is independent of $\sigma^{-2}\|y-X \hat{\beta}\|_{2}^{2}$, which has a $\chi_{n-p}^{2}$ distribution.]
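For reference, the standard interval (a sketch): with $\hat{\sigma}^{2}:=\|y-X \hat{\beta}\|_{2}^{2} /(n-p)$,

$x^{* T} \hat{\beta} \pm t_{n-p}(1-\alpha / 2)\, \hat{\sigma} \sqrt{1+x^{* T}\left(X^{T} X\right)^{-1} x^{*}}$

since $Y^{*}-x^{* T} \hat{\beta} \sim \mathrm{N}\left(0, \sigma^{2}\left(1+x^{* T}\left(X^{T} X\right)^{-1} x^{*}\right)\right)$ independently of $\hat{\sigma}^{2}$.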

Paper 4, Section II, J

A sociologist collects a dataset on friendships among $m$ Cambridge graduates. Let $y_{i, j}=1$ if persons $i$ and $j$ are friends 3 years after graduation, and $y_{i, j}=0$ otherwise. Let $z_{i}$ be a categorical variable for person $i$ 's college, taking values in the set $\{1,2, \ldots, C\}$. Consider logistic regression models,

$\mathbb{P}\left(y_{i, j}=1\right)=\frac{e^{\theta_{i, j}}}{1+e^{\theta_{i, j}}}, \quad 1 \leqslant i<j \leqslant m$

with parameters either

$\theta_{i, j}=\beta_{z_{i}, z_{j}}$; or,

$\theta_{i, j}=\beta_{z_{i}}+\beta_{z_{j}}$; or,

$\theta_{i, j}=\beta_{z_{i}}+\beta_{z_{j}}+\beta_{0} \delta_{z_{i}, z_{j}}$, where $\delta_{z_{i}, z_{j}}=1$ if $z_{i}=z_{j}$ and 0 otherwise.

(a) Write the likelihood of the models.

(b) Show that the three models are nested and specify the order. Suggest a statistic to compare models 1 and 3, give its definition and specify its asymptotic distribution under the null hypothesis, citing any necessary theorems.

(c) Suppose persons $i$ and $j$ are in the same college $k$; consider the number of friendships, $M_{i}$ and $M_{j}$, that each of them has with people in college $\ell \neq k$ ($\ell$ and $k$ fixed). In each of the models above, compare the distribution of these two random variables. Explain why this might lead to a poor quality of fit.

(d) Find a minimal sufficient statistic for model 3. [You may use the following characterisation of a minimal sufficient statistic: let $f(\beta ; y)$ be the likelihood in this model, where $\beta=\left(\beta_{k}\right)_{k=0,1, \ldots, C}$ and $y=\left(y_{i, j}\right)_{i, j=1, \ldots, m} ;$ suppose $T=t(y)$ is a statistic such that $f(\beta ; y) / f\left(\beta ; y^{\prime}\right)$ is constant in $\beta$ if and only if $t(y)=t\left(y^{\prime}\right)$; then, $T$ is a minimal sufficient statistic for $\beta$.]

Paper 1, Section I, J

The data frame Ambulance contains data on the number of ambulance requests from a Cambridgeshire hospital on different days. In addition to the number of ambulance requests on each day, the dataset records whether each day fell in the winter season, on a weekend, or on a bank holiday, as well as the pollution level on each day.

A health researcher fitted two models to the dataset above using $R$. Consider the following code and its output.

$\begin{aligned} & \text { > head (Ambulance) } \\ & \text { Winter Weekend Bank. holiday Pollution. level Ambulance.requests } \\ & 1 \text { Yes Yes No High } 16 \end{aligned}$

$\begin{aligned} & 3 \text { No No No High } 22 \\ & 4 \text { No } \quad \text { Yes } \quad \text { No } \quad \text { Medium } \quad 11 \end{aligned}$

Define the two models fitted by this code and perform a hypothesis test with level $1 \%$ in which one of the models is the null hypothesis and the other is the alternative. State the theorem used in this hypothesis test. You may use the information generated by the following commands.

Paper 1, Section II, J

A clinical study follows a number of patients with an illness. Let $Y_{i} \in[0, \infty)$ be the length of time that patient $i$ lives and $x_{i} \in \mathbb{R}^{p}$ a vector of predictors, for $i \in\{1, \ldots, n\}$. We shall assume that $Y_{1}, \ldots, Y_{n}$ are independent. Let $f_{i}$ and $F_{i}$ be the probability density function and cumulative distribution function, respectively, of $Y_{i}$. The hazard function $h_{i}$ is defined as

$h_{i}(t)=\frac{f_{i}(t)}{1-F_{i}(t)} \quad \text { for } t \geqslant 0 .$

We shall assume that $h_{i}(t)=\lambda(t) \exp \left(\beta^{\top} x_{i}\right)$, where $\beta \in \mathbb{R}^{p}$ is a vector of coefficients and $\lambda(t)$ is some fixed hazard function.

(a) Prove that $F_{i}(t)=1-\exp \left(-\int_{0}^{t} h_{i}(s) d s\right)$.
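A sketch of (a): since $h_{i}(t)=-\frac{\mathrm{d}}{\mathrm{d} t} \log \left(1-F_{i}(t)\right)$, integrating from 0 to $t$ and using $F_{i}(0)=0$ gives

$\int_{0}^{t} h_{i}(s) \mathrm{d} s=-\log \left(1-F_{i}(t)\right)$

which rearranges to the stated identity.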

(b) Using the equation in part (a), write the log-likelihood function for $\beta$ in terms of $\lambda, \beta, x_{i}$ and $Y_{i}$ only.

(c) Show that the maximum likelihood estimate of $\beta$ can be obtained through a surrogate Poisson generalised linear model with an offset.
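A sketch of the connection in (c): writing $\Lambda(t):=\int_{0}^{t} \lambda(s) \mathrm{d} s$, the log-likelihood from part (b) is

$\ell(\beta)=\sum_{i=1}^{n}\left\{\log \lambda\left(Y_{i}\right)+\beta^{\top} x_{i}-e^{\beta^{\top} x_{i}} \Lambda\left(Y_{i}\right)\right\}$

which, up to terms not involving $\beta$, is the log-likelihood of independent Poisson observations $Z_{i}=1$ with mean $\exp \left\{\beta^{\top} x_{i}+\log \Lambda\left(Y_{i}\right)\right\}$: a Poisson GLM with log link and offset $\log \Lambda\left(Y_{i}\right)$.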

Paper 2, Section I, 5J

Consider a linear model $Y=X \beta+\sigma^{2} \varepsilon$ with $\varepsilon \sim N(0, I)$, where the design matrix $X$ is $n$ by $p$. Provide an expression for the $F$-statistic used to test the hypothesis $\beta_{p_{0}+1}=\beta_{p_{0}+2}=\cdots=\beta_{p}=0$ for $p_{0}<p$. Show that it is a monotone function of a log-likelihood ratio statistic.
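For reference (a sketch): with residual sums of squares $\mathrm{RSS}_{0}$ and $\mathrm{RSS}_{1}$ under the null and full models,

$F=\frac{\left(\mathrm{RSS}_{0}-\mathrm{RSS}_{1}\right) /\left(p-p_{0}\right)}{\mathrm{RSS}_{1} /(n-p)}$

while the log-likelihood ratio statistic is $n \log \left(\mathrm{RSS}_{0} / \mathrm{RSS}_{1}\right)=n \log \left\{1+\frac{p-p_{0}}{n-p} F\right\}$, a strictly increasing function of $F$.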

Paper 3, Section I, J

The data frame Cases.of.flu contains a list of cases of flu recorded in 3 London hospitals during each month of 2017. Consider the following $R$ code and its output.

> table(Cases.of.flu)
           Hospital
Month        A   B   C
  April     10  40  27
  August     9  34  19
  December  24 129  81
  February  49 134  74
  January   45 138  78
  July       5  36  35
  June      11  36  22
  March     20  82  41
  May        5  43  23
  November  17  82  62
  October    6  26  19
  September  6  40  21

> Cases.of.flu.table = as.data.frame(table(Cases.of.flu))
> head(Cases.of.flu.table)
     Month Hospital Freq
1    April        A   10
2   August        A    9
3 December        A   24
4 February        A   49
5  January        A   45
6     July        A    5

> mod1 = glm(Freq ~ ., data=Cases.of.flu.table, family=poisson)
> mod1$dev
[1] 28.51836

> levels(Cases.of.flu$Month)

Describe a test for the null hypothesis of independence between the variables Month and Hospital using the deviance statistic. State the assumptions of the test.

Perform the test at the $1 \%$ level for each of the two different models shown above. You may use the table below showing 99 th percentiles of the $\chi_{p}^{2}$ distribution with a range of degrees of freedom $p$. How would you explain the discrepancy between their conclusions?

Paper 4, Section I, J

A scientist is studying the effects of a drug on the weight of mice. Forty mice are divided into two groups, control and treatment. The mice in the treatment group are given the drug, and those in the control group are given water instead. The mice are kept in 8 different cages. The weight of each mouse is monitored for 10 days, and the results of the experiment are recorded in the data frame Weight.data. Consider the following $R$ code and its output.

> head(Weight.data)
  Time   Group Cage Mouse   Weight
1    1 Control    1     1 24.77578
2    2 Control    1     1 24.68766
3    3 Control    1     1 24.79008
4    4 Control    1     1 24.77005
5    5 Control    1     1 24.65092
6    6 Control    1     1 24.38436

> mod1 = lm(Weight ~ Time*Group + Cage, data=Weight.data)
> summary(mod1)

Call:
lm(formula = Weight ~ Time * Group + Cage, data = Weight.data)

Residuals:
     Min       1Q   Median       3Q      Max
-1.36903 -0.33527 -0.01719  0.38807  1.24368

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
Time                -0.006023   0.012616  -0.477  0.63334
GroupTreatment       0.321837   0.121993   2.638  0.00867 **
Cage2               -0.400228   0.095875  -4.174 3.68e-05 ***
Cage3                0.286941   0.102494   2.800  0.00537 **
Cage4                0.007535   0.095875   0.079  0.93740
Cage6                0.124767   0.125530   0.994  0.32087
Cage8               -0.295168   0.125530  -2.351  0.01920 *
Time:GroupTreatment -0.173515   0.017842  -9.725  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5125 on 391 degrees of freedom

Multiple R-squared: 0.5591, Adjusted R-squared: 0.55

F-statistic: 61.97 on 8 and 391 DF, p-value: < 2.2e-16

Which parameters describe the rate of weight loss with time in each group? According to the $\mathrm{R}$ output, is there a statistically significant weight loss with time in the control group?

Three diagnostic plots were generated using $R$ code that is not reproduced here. [The plots are not shown; the surviving axis labels refer to Weight.data$Time[mouse1] and Weight.data$Time[mouse2].]

Based on these plots, should you trust the significance tests shown in the output of the command summary (mod1)? Explain.

Paper 4, Section II, J

Bridge is a card game played by 2 teams of 2 players each. A bridge club records the outcomes of many games between teams formed by its $m$ members. The outcomes are modelled by

$\mathbb{P}(\text { team }\{i, j\} \text { wins against team }\{k, \ell\})=\frac{\exp \left(\beta_{i}+\beta_{j}+\beta_{\{i, j\}}-\beta_{k}-\beta_{\ell}-\beta_{\{k, \ell\}}\right)}{1+\exp \left(\beta_{i}+\beta_{j}+\beta_{\{i, j\}}-\beta_{k}-\beta_{\ell}-\beta_{\{k, \ell\}}\right)},$

where $\beta_{i} \in \mathbb{R}$ is a parameter representing the skill of player $i$, and $\beta_{\{i, j\}} \in \mathbb{R}$ is a parameter representing how well-matched the team formed by $i$ and $j$ is.

(a) Would it make sense to include an intercept in this logistic regression model? Explain your answer.

(b) Suppose that players 1 and 2 always play together as a team. Is there a unique maximum likelihood estimate for the parameters $\beta_{1}, \beta_{2}$ and $\beta_{\{1,2\}}$ ? Explain your answer.

(c) Under the model defined above, derive the asymptotic distribution (including the values of all relevant parameters) for the maximum likelihood estimate of the probability that team $\{i, j\}$ wins a game against team $\{k, \ell\}$. You can state it as a function of the true vector of parameters $\beta$, and the Fisher information matrix $i_{N}(\beta)$ with $N$ games. You may assume that $i_{N}(\beta) / N \rightarrow I(\beta)$ as $N \rightarrow \infty$, and that $\beta$ has a unique maximum likelihood estimate for $N$ large enough.

Paper 1, Section I, J

The dataset ChickWeights records the weight of a group of chickens fed four different diets at a range of time points. We perform the following regressions in $R$.

(i) Which hypothesis test does the following command perform? State the degrees of freedom, and the conclusion of the test.

(ii) Define a diagnostic plot that might suggest the logarithmic transformation of the response in fit2.

(iii) Define the dashed line in the following plot, generated with the command plot(fit3). What does it tell us about the data point 579?

Paper 1, Section II, J

The Cambridge Lawn Tennis Club organises a tournament in which every match consists of 11 games, all of which are played. The player who wins 6 or more games is declared the winner.

For players $a$ and $b$, let $n_{a b}$ be the total number of games they play against each other, and let $y_{a b}$ be the number of these games won by player $a$. Let $\tilde{n}_{a b}$ and $\tilde{y}_{a b}$ be the corresponding number of matches.

A statistician analysed the tournament data using a Binomial Generalised Linear Model (GLM) with outcome $y_{a b}$. The probability $P_{a b}$ that $a$ wins a game against $b$ is modelled by

$\log \left(\frac{P_{a b}}{1-P_{a b}}\right)=\beta_{a}-\beta_{b}, \qquad(*)$

with an appropriate corner point constraint. You are asked to re-analyse the data, but the game-level results have been lost and you only know which player won each match.

We define a new GLM for the outcomes $\tilde{y}_{a b}$ with $\tilde{P}_{a b}=\mathbb{E} \tilde{y}_{a b} / \tilde{n}_{a b}$ and $g\left(\tilde{P}_{a b}\right)=\beta_{a}-\beta_{b}$, where the $\beta_{a}$ are defined in $(*)$. That is, $\beta_{a}-\beta_{b}$ is the log-odds that $a$ wins a game against $b$, not a match.

Derive the form of the new link function $g$. [You may express your answer in terms of a cumulative distribution function.]
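As a sanity check on the setting (a hedged sketch, not the requested derivation): since all 11 games are played, the match-win probability is $\tilde{P}_{a b}=\mathbb{P}\left(\operatorname{Bin}\left(11, P_{a b}\right) \geqslant 6\right)$ with $P_{a b}$ the logistic game-win probability, so $g$ is the inverse of this composite map.

```python
import math

def binom_cdf(k, n, p):
    # P(X <= k) for X ~ Binomial(n, p)
    return sum(math.comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k + 1))

def match_win_prob(log_odds):
    # log_odds = beta_a - beta_b, the log-odds that a wins a *game*;
    # a *match* is won by taking 6 or more of the 11 games
    p = 1.0 / (1.0 + math.exp(-log_odds))
    return 1.0 - binom_cdf(5, 11, p)
```

Evenly matched players give match_win_prob(0.0) = 0.5 by symmetry, and a modest game-level edge is amplified at match level.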

Paper 2, Section I, J

A statistician is interested in the power of a $t$-test with level $5 \%$ in linear regression; that is, the probability of rejecting the null hypothesis $\beta_{0}=0$ with this test under an alternative with $\beta_{0}>0$.

(a) State the distribution of the least-squares estimator $\hat{\beta}_{0}$, and hence state the form of the $t$-test statistic used.

(b) Prove that the power does not depend on the other coefficients $\beta_{j}$ for $j>0$.
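A normal approximation illustrates why (b) holds (a hedged sketch; the exact power involves a noncentral $t$ distribution): the $t$ statistic is roughly $N(\delta, 1)$ with noncentrality $\delta=\beta_{0} /\left(\sigma \sqrt{v_{00}}\right)$, where $v_{00}=\left[\left(X^{\top} X\right)^{-1}\right]_{00}$, and no other coefficient enters.

```python
import math

def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def approx_power(beta0, sigma, v00, z_crit=1.6449):
    # power of the level-5% one-sided t-test, normal approximation:
    # the statistic is roughly N(delta, 1) with
    # delta = beta0 / (sigma * sqrt(v00)); the other beta_j cancel
    # out of both the estimator and its standard error
    delta = beta0 / (sigma * math.sqrt(v00))
    return Phi(delta - z_crit)
```

At $\beta_{0}=0$ the approximation returns the level, about $0.05$, and the power increases monotonically in $\delta$.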

Paper 3, Section I, J

For Fisher's method of Iteratively Reweighted Least-Squares and Newton-Raphson optimisation of the log-likelihood, the vector of parameters $\beta$ is updated using an iteration

$\beta^{(m+1)}=\beta^{(m)}+M\left(\beta^{(m)}\right)^{-1} U\left(\beta^{(m)}\right),$

for a specific function $M$. How is $M$ defined in each method?

Prove that they are identical in a Generalised Linear Model with the canonical link function.
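A one-parameter Poisson example (the log link is canonical) illustrates the claim numerically; the data are illustrative and this sketch is not a substitute for the proof:

```python
import math

# Poisson regression with log mu_i = beta * x_i (canonical link);
# illustrative data, not from any question
x = [0.5, 1.0, 1.5, 2.0]
y = [1, 2, 4, 7]

def observed_info(beta):
    # Newton-Raphson uses M = -l''(beta) = sum_i x_i^2 mu_i;
    # with the canonical link, y_i has dropped out of l''
    return sum(xi * xi * math.exp(beta * xi) for xi in x)

def expected_info(beta):
    # Fisher scoring uses M = E[-l''(beta)]; since l'' is non-random
    # here, taking the expectation changes nothing, and
    # sum_i x_i^2 Var(Y_i) with Var(Y_i) = mu_i gives the same sum
    return sum(xi * xi * math.exp(beta * xi) for xi in x)

# the two methods therefore take identical update steps
for b in (0.0, 0.3, 1.0):
    assert observed_info(b) == expected_info(b)
```

The key point, visible in the comments, is that with a canonical link the second derivative of the log-likelihood does not involve the data, so observed and expected information coincide.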

Paper 4, Section I, J

A Cambridge scientist is testing approaches to slow the spread of a species of moth in certain trees. Two groups of 30 trees were treated with different organic pesticides, and a third group of 30 trees was kept under control conditions. At the end of the summer the trees are classified according to the level of leaf damage, obtaining the following contingency table.

Which of the following Generalised Linear Model fitting commands is appropriate for these data? Why? Describe the model being fit.

Paper 4, Section II, J

The dataset diesel records the number of diesel cars which go through a block of Hills Road in 6 disjoint periods of 30 minutes, between 8AM and 11AM. The measurements are repeated each day for 10 days. Answer the following questions based on the code below, which is shown with partial output.

(a) Can we reject the model fit.1 at a $1 \%$ level? Justify your answer.

(b) What is the difference between the deviance of the models fit.2 and fit.3?

(c) Which of fit.2 and fit.3 would you use to perform variable selection by backward stepwise selection? Why?

(d) How does the final plot differ from what you expect under the model in fit.2? Provide a possible explanation and suggest a better model.

> head(diesel)

period num.cars day

1 1 69 1

2 2 97 1

3 3 103 1

4 4 99 1

5 5 67 1

6 6 91 1

> fit.1 = glm(num.cars ~ period, data=diesel, family=poisson)

> summary(fit.1)

Deviance Residuals:

Min 1Q Median 3Q Max

-4.0188 -1.4837 -0.2117 1.6257 4.5965

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 4.628535 0.029288 158.035 <2e-16 ***

period -0.006073 0.007551 -0.804 0.421

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 262.36 on 59 degrees of freedom

Residual deviance: 261.72 on 58 degrees of freedom

AIC: 651.2

> diesel$period.factor = factor(diesel$period)

> fit.2 = glm(num.cars ~ period.factor, data=diesel, family=poisson)

> summary(fit.2)

Coefficients:

Estimate Std. Error z value Pr(>|z|)
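For part (a), the residual deviance of fit.1 (261.72 on 58 degrees of freedom) can be compared with its approximate $\chi_{58}^{2}$ null distribution. A stdlib-only sketch using the Wilson-Hilferty approximation (in R one would simply call pchisq(261.72, 58, lower.tail=FALSE)):

```python
import math

def chi2_sf(x, df):
    # Wilson-Hilferty approximation: (X/df)^(1/3) is approximately
    # normal with mean 1 - 2/(9 df) and variance 2/(9 df)
    m = 1.0 - 2.0 / (9.0 * df)
    s = math.sqrt(2.0 / (9.0 * df))
    z = ((x / df) ** (1.0 / 3.0) - m) / s
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# residual deviance of fit.1 against its degrees of freedom;
# the p-value is far below 0.01, so the model is rejected
p_value = chi2_sf(261.72, 58)
```

The same function applied at the mean of the distribution, chi2_sf(58, 58), returns a value close to one half, as expected.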

Part II, $2017 \quad$ List of Questions


Paper 1, Section I, K

The body mass index (BMI) of your closest friend is a good predictor of your own BMI. A scientist applies polynomial regression to understand the relationship between these two variables among 200 students in a sixth form college. The $R$ commands

> fit.1 <- lm(BMI ~ poly(friendBMI, 2, raw=T))

> fit.2 <- lm(BMI ~ poly(friendBMI, 3, raw=T))

fit the models $Y=\beta_{0}+\beta_{1} X+\beta_{2} X^{2}+\varepsilon$ and $Y=\beta_{0}+\beta_{1} X+\beta_{2} X^{2}+\beta_{3} X^{3}+\varepsilon$, respectively, with $\varepsilon \sim N\left(0, \sigma^{2}\right)$ in each case.

Setting the parameter raw to FALSE:

> fit.3 <- lm(BMI ~ poly(friendBMI, 2, raw=F))

> fit.4 <- lm(BMI ~ poly(friendBMI, 3, raw=F))

fits the models $Y=\beta_{0}+\beta_{1} P_{1}(X)+\beta_{2} P_{2}(X)+\varepsilon$ and $Y=\beta_{0}+\beta_{1} P_{1}(X)+\beta_{2} P_{2}(X)+$ $\beta_{3} P_{3}(X)+\varepsilon$, with $\varepsilon \sim N\left(0, \sigma^{2}\right)$. The function $P_{i}$ is a polynomial of degree $i$. Furthermore, the design matrix output by the function poly with raw=F satisfies:

> t(poly(friendBMI, 3, raw=F)) %*% poly(friendBMI, 3, raw=F)

1 2 3

1 1.000000e+00 1.288032e-16 3.187554e-17

2 1.288032e-16 1.000000e+00 -6.201636e-17

3 3.187554e-17 -6.201636e-17 1.000000e+00

How does the variance of $\hat{\beta}$ differ in the models fit.2 and fit.4? What about the variance of the fitted values $\hat{Y}=X \hat{\beta}$? Finally, consider the output of the commands

> anova(fit.1, fit.2)

> anova(fit.3, fit.4)

Define the test statistic computed by this function and specify its distribution. Which command yields a higher statistic?
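The point of the orthonormal basis can be checked numerically. The sketch below (illustrative data; pure Python and Gram-Schmidt, whereas R's poly uses a QR decomposition and drops the intercept column) confirms that the raw and orthonormalised bases give identical fitted values, while orthonormal columns make $X^{\top} X=I$ and hence $\operatorname{Var}(\hat{\beta})=\sigma^{2} I$:

```python
import math

# illustrative data (not the BMI data from the question)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [2.1, 2.9, 3.6, 4.8, 6.1, 8.2, 11.0]
n, p = len(x), 4

raw = [[xi**j for j in range(p)] for xi in x]   # columns 1, x, x^2, x^3

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def col(mat, j):
    return [row[j] for row in mat]

# modified Gram-Schmidt: orthonormalise the raw columns
Q = []
for j in range(p):
    v = col(raw, j)
    for q in Q:
        c = dot(q, v)
        v = [vi - c * qi for vi, qi in zip(v, q)]
    nv = math.sqrt(dot(v, v))
    Q.append([vi / nv for vi in v])

# orthonormal design: beta_hat = Q'y and fitted values = Q Q' y
beta_orth = [dot(q, y) for q in Q]
fitted_orth = [sum(beta_orth[j] * Q[j][i] for j in range(p)) for i in range(n)]

# raw design: solve the normal equations (X'X) beta = X'y by
# Gaussian elimination (X'X is symmetric positive definite)
A = [[dot(col(raw, a), col(raw, b)) for b in range(p)] for a in range(p)]
rhs = [dot(col(raw, a), y) for a in range(p)]
for k in range(p):
    for i in range(k + 1, p):
        f = A[i][k] / A[k][k]
        for j in range(k, p):
            A[i][j] -= f * A[k][j]
        rhs[i] -= f * rhs[k]
beta_raw = [0.0] * p
for i in range(p - 1, -1, -1):
    s = rhs[i] - sum(A[i][j] * beta_raw[j] for j in range(i + 1, p))
    beta_raw[i] = s / A[i][i]
fitted_raw = [dot(row, beta_raw) for row in raw]

# both are the projection of y onto the same column span
assert max(abs(a - b) for a, b in zip(fitted_raw, fitted_orth)) < 1e-6
```

The individual coefficients differ between the two parametrisations, but the fitted values, and therefore the residual sums of squares entering the F statistic, do not.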

Paper 1, Section II, K

(a) Let $Y$ be an $n$-vector of responses from the linear model $Y=X \beta+\varepsilon$, with $\beta \in \mathbb{R}^{p}$. The internally studentized residual is defined by

$s_{i}=\frac{Y_{i}-x_{i}^{\top} \hat{\beta}}{\tilde{\sigma} \sqrt{1-p_{i}}},$

where $\hat{\beta}$ is the least squares estimate, $p_{i}$ is the leverage of sample $i$, and

$\tilde{\sigma}^{2}=\frac{\|Y-X \hat{\beta}\|_{2}^{2}}{(n-p)} .$

Prove that the joint distribution of $s=\left(s_{1}, \ldots, s_{n}\right)^{\top}$ is the same in the following two models: (i) $\varepsilon \sim N\left(0, \sigma^{2} I\right)$, and (ii) $\varepsilon \mid \sigma^{2} \sim N\left(0, \sigma^{2} I\right)$, with $\nu / \sigma^{2} \sim \chi_{\nu}^{2}$ (in this model, $\varepsilon_{1}, \ldots, \varepsilon_{n}$ are identically $t_{\nu}$-distributed). [Hint: A random vector $Z$ is spherically symmetric if for any orthogonal matrix $H, H Z \stackrel{d}{=} Z$. If $Z$ is spherically symmetric and a.s. nonzero, then $Z /\|Z\|_{2}$ is a uniform point on the sphere; in addition, any orthogonal projection of $Z$ is also spherically symmetric. A standard normal vector is spherically symmetric.]

(b) A social scientist regresses the income of 120 Cambridge graduates onto 20 answers from a questionnaire given to the participants in their first year. She notices one questionnaire with very unusual answers, which she suspects was due to miscoding. The sample has a leverage of $0.8$. To check whether this sample is an outlier, she computes its externally studentized residual,

$t_{i}=\frac{Y_{i}-x_{i}^{\top} \hat{\beta}}{\tilde{\sigma}_{(i)} \sqrt{1-p_{i}}}=4.57,$

where $\tilde{\sigma}_{(i)}$ is estimated from a fit of all samples except the one in question, $\left(x_{i}, Y_{i}\right)$. Is this a high leverage point? Can she conclude this sample is an outlier at a significance level of $5 \%$ ?

(c) After examining the following plot of residuals against the response, the investigator calculates the externally studentized residual of the participant denoted by the black dot, which is $2.33$. Can she conclude this sample is an outlier with a significance level of $5 \%$ ?

Part II, $2016 \quad$ List of Questions

Paper 2, Section I, K

Define an exponential dispersion family. Prove that the range of the natural parameter, $\Theta$, is an open interval. Derive the mean and variance as a function of the log normalizing constant.

[Hint: Use the convexity of $e^{x}$, i.e. $e^{p x+(1-p) y} \leqslant p e^{x}+(1-p) e^{y}$ for all $p \in[0,1]$.]
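A compact version of the mean-variance computation, written for a family of the form $f(y ; \theta)=\exp \{y \theta-K(\theta)\} f_{0}(y)$ with dispersion taken to be 1 (a sketch of the standard argument; the full derivation must also carry the dispersion parameter):

```latex
\begin{align*}
1 &= \int e^{y\theta - K(\theta)} f_0(y)\,dy
  \;\Longrightarrow\; e^{K(\theta)} = \int e^{y\theta} f_0(y)\,dy, \\
\tfrac{d}{d\theta}&:\;
  K'(\theta)\, e^{K(\theta)} = \int y\, e^{y\theta} f_0(y)\,dy
  \;\Longrightarrow\; \mathbb{E}\,Y = K'(\theta), \\
\tfrac{d^2}{d\theta^2}&:\;
  \bigl(K''(\theta) + K'(\theta)^2\bigr) e^{K(\theta)}
  = \int y^2 e^{y\theta} f_0(y)\,dy
  \;\Longrightarrow\; \operatorname{Var} Y = K''(\theta).
\end{align*}
```

Differentiation under the integral sign is justified on the interior of $\Theta$, which is why openness of $\Theta$ matters.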

Paper 3, Section I, K

The $R$ command

$>\operatorname{boxcox}($ rainfall $\sim$ month+elnino+month:elnino)

performs a Box-Cox transform of the response at several values of the parameter $\lambda$, and produces the following plot:

We fit two linear models and obtain the Q-Q plots for each fit, which are shown below in no particular order:

Define the variable on the $y$-axis in the output of boxcox, and match each Q-Q plot to one of the models.

After choosing the model fit.2, the researcher calculates Cook's distance for the $i$ th sample, which has high leverage, and compares it to the upper $0.01$-point of an $F_{p, n-p}$ distribution, because the design matrix is of size $n \times p$. Provide an interpretation of this comparison in terms of confidence sets for $\hat{\beta}$. Is this confidence statement exact?

Paper 4, Section I, K

(a) Let $Y_{i}=x_{i}^{\top} \beta+\varepsilon_{i}$ where $\varepsilon_{i}$ for $i=1, \ldots, n$ are independent and identically distributed. Let $Z_{i}=I\left(Y_{i}<0\right)$ for $i=1, \ldots, n$, and suppose that these variables follow a binary regression model with the complementary log-log link function $g(\mu)=\log (-\log (1-\mu))$. What is the probability density function of $\varepsilon_{1}$?

(b) The Newton-Raphson algorithm can be applied to compute the MLE, $\hat{\beta}$, in certain GLMs. Starting from $\beta^{(0)}=0$, we let $\beta^{(t+1)}$ be the maximizer of the quadratic approximation of the log-likelihood $\ell(\beta ; Y)$ around $\beta^{(t)}$ :

$\ell(\beta ; Y) \approx \ell\left(\beta^{(t)} ; Y\right)+\left(\beta-\beta^{(t)}\right)^{\top} D \ell\left(\beta^{(t)} ; Y\right)+\frac{1}{2}\left(\beta-\beta^{(t)}\right)^{\top} D^{2} \ell\left(\beta^{(t)} ; Y\right)\left(\beta-\beta^{(t)}\right),$

where $D \ell$ and $D^{2} \ell$ are the gradient and Hessian of the log-likelihood. What is the difference between this algorithm and Iterative Weighted Least Squares? Why might the latter be preferable?

Paper 4, Section II, K

For 31 days after the outbreak of the 2014 Ebola epidemic, the World Health Organization recorded the number of new cases per day in 60 hospitals in West Africa. Researchers are interested in modelling $Y_{i j}$, the number of new Ebola cases in hospital $i$ on day $j \geqslant 2$, as a function of several covariates:

lab: a Boolean factor for whether the hospital has laboratory facilities,

casesBefore: number of cases at the hospital on the previous day,

urban: a Boolean factor indicating an urban area,

country: a factor with three categories, Guinea, Liberia, and Sierra Leone,

numDoctors: number of doctors at the hospital,

tradBurials: a Boolean factor indicating whether traditional burials are common in the region.

Consider the output of the following $R$ code (with some lines omitted):

> fit.1 <- glm(newCases ~ lab+casesBefore+urban+country+numDoctors+tradBurials,
+              data=ebola, family=poisson)

> summary(fit.1)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 0.011298 0.049498 0.228 0.8195

labTRUE 0.094731 0.050322 1.882 0.0598 .

casesBefore 0.324744 0.007752 41.891 <2e-16 ***

urbanTRUE -0.091554 0.088212 -1.038 0.2993

countryLiberia 0.088490 0.034119 2.594 0.0095 **

countrySierra Leone -0.197474 0.036969 -5.342 9.21e-08 ***

numDoctors -0.020819 0.004658 -4.470 7.83e-06 ***

tradBurialsTRUE 0.054296 0.031676 1.714 0.0865 .

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(a) Would you conclude based on the $z$-tests that an urban setting does not affect the rate of infection?

(b) Explain how you would predict the total number of new cases that the researchers will record in Sierra Leone on day 32.
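A hedged sketch of the prediction in (b): the two hospitals below are hypothetical, the coefficients are transcribed from the summary of fit.1, and Guinea is taken as the baseline country level.

```python
import math

# coefficient estimates transcribed from the summary of fit.1
beta = {
    "(Intercept)": 0.011298, "labTRUE": 0.094731, "casesBefore": 0.324744,
    "urbanTRUE": -0.091554, "countryLiberia": 0.088490,
    "countrySierra Leone": -0.197474, "numDoctors": -0.020819,
    "tradBurialsTRUE": 0.054296,
}

def predicted_cases(lab, cases_before, urban, country, num_doctors, trad_burials):
    # linear predictor on the log scale, then exponentiate to get the
    # Poisson mean; Guinea is the baseline country
    eta = beta["(Intercept)"]
    eta += beta["labTRUE"] * lab
    eta += beta["casesBefore"] * cases_before
    eta += beta["urbanTRUE"] * urban
    if country != "Guinea":
        eta += beta["country" + country]
    eta += beta["numDoctors"] * num_doctors
    eta += beta["tradBurialsTRUE"] * trad_burials
    return math.exp(eta)

# total for a country on day 32: sum the per-hospital predicted means,
# plugging each hospital's day-31 count into casesBefore
# (both hospitals below are hypothetical)
total = predicted_cases(1, 5, 0, "Sierra Leone", 10, 0) \
      + predicted_cases(0, 3, 1, "Sierra Leone", 4, 1)
```

The prediction is the sum over the Sierra Leone hospitals of $\exp \left(x_{i}^{\top} \hat{\beta}\right)$, with the observed day-31 counts entering through casesBefore.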

We fit a new model which includes an interaction term, and compute a test statistic using the code:

> fit.2 <- glm(newCases ~ casesBefore+country+country:casesBefore+numDoctors,
+              data=ebola, family=poisson)

> fit.2$deviance - fit.1$deviance

[1] 3.016138

(c) What is the distribution of the statistic computed in the last line?

(d) Under what conditions is the deviance of each model approximately chi-squared?

Paper 1, Section I, J

The outputs $Y_{1}, \ldots, Y_{n}$ of a particular process are positive and are believed to be related to $p$-vectors of covariates $x_{1}, \ldots, x_{n}$ according to the following model:

$\log \left(Y_{i}\right)=\mu+x_{i}^{T} \beta+\varepsilon_{i}$

In this model $\varepsilon_{i}$ are i.i.d. $N\left(0, \sigma^{2}\right)$ random variables where $\sigma>0$ is known. It is not possible to measure the output directly, but we can detect whether the output is greater than or less than or equal to a certain known value $c>0$. If

$Z_{i}= \begin{cases}1 & \text { if } Y_{i}>c \\ 0 & \text { if } Y_{i} \leqslant c\end{cases}$

show that a probit regression model can be used for the data $\left(Z_{i}, x_{i}\right), i=1, \ldots, n$.

How can we recover $\mu$ and $\beta$ from the parameters of the probit regression model?
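A numerical sketch of the reparametrisation (illustrative values; $\sigma$ and $c$ are known): the probit model fitted to $\left(Z_{i}, x_{i}\right)$ has intercept $\gamma_{0}=(\mu-\log c) / \sigma$ and slopes $\gamma=\beta / \sigma$, which can be inverted for $\mu$ and $\beta$.

```python
import math

def Phi(z):
    # standard normal CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# known quantities (illustrative values)
sigma, c = 2.0, 1.5

def prob_z_one(mu, beta, x):
    # P(Z=1) = P(log Y > log c) = Phi((mu + x'beta - log c) / sigma):
    # a probit model with intercept (mu - log c)/sigma and slopes beta/sigma
    eta = mu + sum(b * xi for b, xi in zip(beta, x)) - math.log(c)
    return Phi(eta / sigma)

def recover(gamma0, gamma):
    # invert the reparametrisation: mu = sigma*gamma0 + log c, beta = sigma*gamma
    mu = sigma * gamma0 + math.log(c)
    beta = [sigma * g for g in gamma]
    return mu, beta

mu, beta = 0.7, [0.3, -0.2]
gamma0 = (mu - math.log(c)) / sigma
gamma = [b / sigma for b in beta]
mu2, beta2 = recover(gamma0, gamma)
```

Round-tripping through the probit parameters returns the original $\mu$ and $\beta$ exactly, because $\sigma$ and $c$ are known.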

Paper 1, Section II, J

An experiment is conducted where scientists count the numbers of each of three different strains of fleas that are reproducing in a controlled environment. Varying concentrations of a particular toxin that impairs reproduction are administered to the fleas. The results of the experiment are stored in a data frame fleas in $\mathrm{R}$, whose first few rows are given below.

The full dataset has 80 rows. The first column provides the number of fleas, the second provides the concentration of the toxin and the third specifies the strain of the flea as factors 0,1 or 2 . Strain 0 is the common flea and strains 1 and 2 have been genetically modified in a way thought to increase their ability to reproduce in the presence of the toxin.

Explain and interpret the $\mathrm{R}$ commands and (abbreviated) output below. In particular, you should describe the model being fitted, briefly explain how the standard errors are calculated, and comment on the hypothesis tests being described in the summary.

Explain and motivate the following $\mathrm{R}$ code in the light of the output above. Briefly explain the differences between the models fitted below, and the model corresponding to fit1.

Denote by $M_{1}, M_{2}, M_{3}$ the three models being fitted in sequence above. Explain the hypothesis tests comparing the models to each other that can be performed using the output from the following $R$ code.

> c(fit1$dev, fit2$dev, fit3$dev)

[1] 56.87 56.93 76.98

> qchisq(0.95, df=1)

[1] 3.84

Use these numbers to comment on the most appropriate model for the data.

Paper 2, Section I, J

Let $Y_{1}, \ldots, Y_{n}$ be independent Poisson random variables with means $\mu_{1}, \ldots, \mu_{n}$, where $\log \left(\mu_{i}\right)=\beta x_{i}$ for some known constants $x_{i} \in \mathbb{R}$ and an unknown parameter $\beta$. Find the log-likelihood for $\beta$.

By first computing the first and second derivatives of the log-likelihood for $\beta$, describe the algorithm you would use to find the maximum likelihood estimator $\hat{\beta}$. [Hint: Recall that if $Z \sim \operatorname{Pois}(\mu)$ then

$\mathbb{P}(Z=k)=\frac{\mu^{k} e^{-\mu}}{k !}$

for $k \in\{0,1,2, \ldots\}$.]
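A stdlib sketch of the algorithm being asked for (Newton-Raphson on the scalar log-likelihood, which here coincides with Fisher scoring since the log link is canonical); the data are illustrative:

```python
import math

# illustrative data for the model log mu_i = beta * x_i
x = [1.0, 2.0, 3.0]
y = [1, 2, 3]

def score(beta):
    # l'(beta) = sum_i x_i (y_i - e^{beta x_i})
    return sum(xi * (yi - math.exp(beta * xi)) for xi, yi in zip(x, y))

def neg_hessian(beta):
    # -l''(beta) = sum_i x_i^2 e^{beta x_i} > 0, so l is strictly concave
    return sum(xi * xi * math.exp(beta * xi) for xi in x)

beta = 0.0
for _ in range(50):
    step = score(beta) / neg_hessian(beta)
    beta += step
    if abs(step) < 1e-12:
        break

# at convergence the score vanishes: beta is the MLE
assert abs(score(beta)) < 1e-8
```

Each iteration is $\beta \leftarrow \beta+\ell^{\prime}(\beta) /\left(-\ell^{\prime \prime}(\beta)\right)$; strict concavity of $\ell$ guarantees the update is well defined.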

Paper 3, Section I, J

Data are available on the number of counts (atomic disintegration events that take place within a radiation source) recorded with a Geiger counter at a nuclear plant. The counts were registered at each second over a 30 second period for a short-lived, man-made radioactive compound. The first few rows of the dataset are displayed below.

Describe the model being fitted with the following $\mathrm{R}$ command.

> fit1 <- lm(Counts ~ Time, data=geiger)

Below is a plot against time of the residuals from the model fitted above.

Referring to the plot, suggest how the model could be improved, and write out the $R$ code for fitting this new model. Briefly describe how one could test in $R$ whether the new model is to be preferred over the old model.

Paper 4, Section I, J