# Statistical Modelling

Paper 1, Section I, J

Let $\mu>0$. The probability density function of the inverse Gaussian distribution (with shape parameter equal to 1) is given by

$f(x ; \mu)=\frac{1}{\sqrt{2 \pi x^{3}}} \exp \left[-\frac{(x-\mu)^{2}}{2 \mu^{2} x}\right]$

Show that this is a one-parameter exponential family. What is its natural parameter? Show that this distribution has mean $\mu$ and variance $\mu^{3}$.
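One possible route (a sketch, not a full solution): writing out the log-density,

$\log f(x ; \mu)=-\frac{1}{2} \log \left(2 \pi x^{3}\right)-\frac{1}{2 x}+\frac{1}{\mu}-\frac{x}{2 \mu^{2}}$

so the density has exponential-family form $h(x) \exp \{\theta x-\kappa(\theta)\}$ with natural parameter $\theta=-1 /\left(2 \mu^{2}\right)$ and cumulant function $\kappa(\theta)=-1 / \mu=-\sqrt{-2 \theta}$. The mean and variance then follow from $\kappa^{\prime}(\theta)=(-2 \theta)^{-1 / 2}=\mu$ and $\kappa^{\prime \prime}(\theta)=(-2 \theta)^{-3 / 2}=\mu^{3}$.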

Paper 1, Section II, J

The following data were obtained in a randomised controlled trial for a drug. Due to a manufacturing error, a subset of trial participants received a low dose (LD) instead of a standard dose (SD) of the drug.

(a) Below we analyse the data using Poisson regression:

(i) After introducing necessary notation, write down the Poisson models being fitted above.

(ii) Write down the corresponding multinomial models, then state the key theoretical result (the "Poisson trick") that allows you to fit the multinomial models using Poisson regression. [You do not need to prove this theoretical result.]

(iii) Explain why the number of degrees of freedom in the likelihood ratio test is 2 in the analysis of deviance table. What can you conclude about the drug?

(b) Below is the summary table of the second model:

(i) Drug efficacy is defined as one minus the ratio of the probability of worsening in the treated group to the probability of worsening in the control group. By using a more sophisticated method, a published analysis estimated that the drug efficacy is $90.0\%$ for the LD treatment and $62.1\%$ for the SD treatment. Are these numbers similar to what is obtained by Poisson regression? [Hint: $e^{-1} \approx 0.37$, $e^{-2} \approx 0.14$, and $e^{-3} \approx 0.05$, where $e$ is the base of the natural logarithm.]

(ii) Explain why the information in the summary table is not enough to test the hypothesis that the LD drug and the SD drug have the same efficacy. Then describe how you can test this hypothesis using analysis of deviance in $R$.

Paper 2, Section I, J

Define a generalised linear model for a sample $Y_{1}, \ldots, Y_{n}$ of independent random variables. Define further the concept of the link function. Define the binomial regression model (without the dispersion parameter) with logistic and probit link functions. Which of these is the canonical link function?

Paper 3, Section I, J

Consider the normal linear model $Y \mid X \sim \mathrm{N}\left(X \beta, \sigma^{2} I\right)$, where $X$ is an $n \times p$ design matrix, $Y$ is a vector of responses, $I$ is the $n \times n$ identity matrix, and $\beta, \sigma^{2}$ are unknown parameters.

Derive the maximum likelihood estimator of the pair $\beta$ and $\sigma^{2}$. What is the distribution of the estimator of $\sigma^{2}$ ? Use it to construct a $(1-\alpha)$-level confidence interval of $\sigma^{2}$. [You may use without proof the fact that the "hat matrix" $H=X\left(X^{T} X\right)^{-1} X^{T}$ is a projection matrix.]
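One standard derivation (a sketch): maximising the Gaussian log-likelihood gives

$\hat{\beta}=\left(X^{T} X\right)^{-1} X^{T} Y, \quad \hat{\sigma}^{2}=\frac{1}{n}\|Y-X \hat{\beta}\|^{2}=\frac{1}{n}\|(I-H) Y\|^{2}$

Since $I-H$ is a projection onto an $(n-p)$-dimensional space, $n \hat{\sigma}^{2} / \sigma^{2} \sim \chi_{n-p}^{2}$, which yields the confidence interval

$\left(\frac{n \hat{\sigma}^{2}}{\chi_{n-p}^{2}(1-\alpha / 2)}, \frac{n \hat{\sigma}^{2}}{\chi_{n-p}^{2}(\alpha / 2)}\right)$

where $\chi_{n-p}^{2}(q)$ denotes the $q$-quantile of the $\chi_{n-p}^{2}$ distribution.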

Paper 4, Section I, J

The data frame data contains the daily number of new avian influenza cases in a large poultry farm.

Write down the model being fitted by the $R$ code below. Does the model seem to provide a satisfactory fit to the data? Justify your answer.

The owner of the farm estimated that the size of the epidemic was initially doubling every 7 days. Is that estimate supported by the analysis below? [You may need $\log 2 \approx 0.69$.]
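A sketch of the arithmetic behind the doubling-time claim: if the expected count grows as $e^{bt}$ in days $t$, doubling every 7 days means

$e^{7 b}=2 \quad \Longleftrightarrow \quad b=\frac{\log 2}{7} \approx \frac{0.69}{7} \approx 0.099$

so the owner's estimate is supported if the fitted coefficient of time (on the log scale) is close to $0.099$.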

Paper 4, Section II, J

Let $X$ be an $n \times p$ non-random design matrix and $Y$ be an $n$-vector of random responses. Suppose $Y \sim N\left(\mu, \sigma^{2} I\right)$, where $\mu$ is an unknown vector and $\sigma^{2}>0$ is known.

(a) Let $\lambda \geqslant 0$ be a constant. Consider the ridge regression problem

$\hat{\beta}_{\lambda}=\arg \min _{\beta}\|Y-X \beta\|^{2}+\lambda\|\beta\|^{2} .$

Let $\hat{\mu}_{\lambda}=X \hat{\beta}_{\lambda}$ be the fitted values. Show that $\hat{\mu}_{\lambda}=H_{\lambda} Y$, where

$H_{\lambda}=X\left(X^{T} X+\lambda I\right)^{-1} X^{T}$
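Part (a) reduces to the normal equations (a sketch): differentiating the objective in $\beta$ and setting the gradient to zero gives

$-2 X^{T}(Y-X \beta)+2 \lambda \beta=0 \quad \Longrightarrow \quad \hat{\beta}_{\lambda}=\left(X^{T} X+\lambda I\right)^{-1} X^{T} Y$

so $\hat{\mu}_{\lambda}=X \hat{\beta}_{\lambda}=H_{\lambda} Y$; the objective is strictly convex whenever $X^{T} X+\lambda I$ is positive definite, so this stationary point is the minimiser.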

(b) Show that

$\mathbb{E}\left(\left\|Y-\hat{\mu}_{\lambda}\right\|^{2}\right)=\left\|\left(I-H_{\lambda}\right) \mu\right\|^{2}+\left\{n-2 \operatorname{trace}\left(H_{\lambda}\right)+\operatorname{trace}\left(H_{\lambda}^{2}\right)\right\} \sigma^{2}$
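A sketch of the decomposition behind part (b): writing $Y=\mu+\varepsilon$ with $\varepsilon \sim \mathrm{N}\left(0, \sigma^{2} I\right)$,

$Y-\hat{\mu}_{\lambda}=\left(I-H_{\lambda}\right) \mu+\left(I-H_{\lambda}\right) \varepsilon$

The cross term has zero expectation, and since $H_{\lambda}$ is symmetric, $\mathbb{E}\left\|\left(I-H_{\lambda}\right) \varepsilon\right\|^{2}=\sigma^{2} \operatorname{trace}\left\{\left(I-H_{\lambda}\right)^{2}\right\}=\sigma^{2}\left\{n-2 \operatorname{trace}\left(H_{\lambda}\right)+\operatorname{trace}\left(H_{\lambda}^{2}\right)\right\}$.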

(c) Let $Y^{*}=\mu+\epsilon^{*}$, where $\epsilon^{*} \sim \mathrm{N}\left(0, \sigma^{2} I\right)$ is independent of $Y$. Show that $\left\|Y-\hat{\mu}_{\lambda}\right\|^{2}+2 \sigma^{2} \operatorname{trace}\left(H_{\lambda}\right)$ is an unbiased estimator of $\mathbb{E}\left(\left\|Y^{*}-\hat{\mu}_{\lambda}\right\|^{2}\right)$.

(d) Describe the behaviour (monotonicity and limits) of $\mathbb{E}\left(\left\|Y^{*}-\hat{\mu}_{\lambda}\right\|^{2}\right)$ as a function of $\lambda$ when $p=n$ and $X=I$. What is the minimum value of $\mathbb{E}\left(\left\|Y^{*}-\hat{\mu}_{\lambda}\right\|^{2}\right)$ ?

Paper 1, Section I, J

Consider a generalised linear model with full column rank design matrix $X \in \mathbb{R}^{n \times p}$, output variables $Y=\left(Y_{1}, \ldots, Y_{n}\right) \in \mathbb{R}^{n}$, link function $g$, mean parameters $\mu=\left(\mu_{1}, \ldots, \mu_{n}\right)$ and known dispersion parameters $\sigma_{i}^{2}=a_{i} \sigma^{2}, i=1, \ldots, n$. Denote its variance function by $V$ and recall that $g\left(\mu_{i}\right)=x_{i}^{T} \beta, i=1, \ldots, n$, where $\beta \in \mathbb{R}^{p}$ and $x_{i}^{T}$ is the $i^{\text {th }}$ row of $X$.

(a) Define the score function in terms of the log-likelihood function and the Fisher information matrix, and define the update of the Fisher scoring algorithm.

(b) Let $W \in \mathbb{R}^{n \times n}$ be a diagonal matrix with positive entries. Note that $X^{T} W X$ is invertible. Show that

$\operatorname{argmin}_{b \in \mathbb{R}^{p}}\left\{\sum_{i=1}^{n} W_{i i}\left(Y_{i}-x_{i}^{T} b\right)^{2}\right\}=\left(X^{T} W X\right)^{-1} X^{T} W Y$

[Hint: you may use that $\operatorname{argmin}_{b \in \mathbb{R}^{p}}\left\{\|Y-X b\|^{2}\right\}=\left(X^{T} X\right)^{-1} X^{T} Y$.]
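Part (b) follows from the hint by a change of variables (a sketch): with $\tilde{Y}:=W^{1 / 2} Y$ and $\tilde{X}:=W^{1 / 2} X$,

$\sum_{i=1}^{n} W_{i i}\left(Y_{i}-x_{i}^{T} b\right)^{2}=\|\tilde{Y}-\tilde{X} b\|^{2}$

whose minimiser is $\left(\tilde{X}^{T} \tilde{X}\right)^{-1} \tilde{X}^{T} \tilde{Y}=\left(X^{T} W X\right)^{-1} X^{T} W Y$.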

(c) Recall that the score function and the Fisher information matrix have entries

$\begin{aligned} &U_{j}(\beta)=\sum_{i=1}^{n} \frac{\left(Y_{i}-\mu_{i}\right) X_{i j}}{a_{i} \sigma^{2} V\left(\mu_{i}\right) g^{\prime}\left(\mu_{i}\right)} \quad j=1, \ldots, p \\ &i_{j k}(\beta)=\sum_{i=1}^{n} \frac{X_{i j} X_{i k}}{a_{i} \sigma^{2} V\left(\mu_{i}\right)\left\{g^{\prime}\left(\mu_{i}\right)\right\}^{2}} \quad j, k=1, \ldots, p \end{aligned}$

Justify, performing the necessary calculations and using part (b), why the Fisher scoring algorithm is also known as the iterative reweighted least squares algorithm.

Paper 1, Section II, J

We consider a subset of the data on car insurance claims from Hallin and Ingenbleek (1983). For each customer, the dataset includes total payments made per policy-year, the amount of kilometres driven, the bonus from not having made previous claims, and the brand of the car. The amount of kilometres driven is a factor taking values $1,2,3,4$, or 5, where a car in level $i+1$ has driven a larger number of kilometres than a car in level $i$ for any $i=1,2,3,4$. A statistician from an insurance company fits the following model in $R$.

> model1 <- lm(Paymentperpolicyyr ~ as.numeric(Kilometres) + Brand + Bonus)

(i) Why do you think the statistician transformed the variable Kilometres from a factor to a numerical variable?

(ii) To check the quality of the model, the statistician applies a function to model1 which returns the following figure:

What does the plot represent? Does it suggest that model1 is a good model? Explain. If not, write down a model which the plot suggests could be better.

(iii) The statistician fits the model suggested by the graph and calls it model2. Consider the following abbreviated output:

> summary(model2)

...

Coefficients:

                       Estimate Std. Error t value Pr(>|t|)
(Intercept)            6.514035   0.186339  34.958  < 2e-16 ***
as.numeric(Kilometres) 0.057132   0.032654   1.750  0.08126 .
Brand2                 0.363869   0.186857   1.947  0.05248 .
...
Brand9                 0.125446   0.186857   0.671  0.50254
Bonus                  ...
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7817 on 284 degrees of freedom
...

Using the output, write down a $95 \%$ prediction interval for the ratio between the total payments per policy year for two cars of the same brand and with the same value of Bonus, one of which has a Kilometres value one higher than the other. You may express your answer as a function of quantiles of a common distribution, which you should specify.

(iv) Write down a generalised linear model for Paymentperpolicyyr which may be a better model than model1 and give two reasons. You must specify the link function.

Paper 2, Section I, J

The data frame WCG contains data from a study started in 1960 about heart disease. The study used 3154 adult men, all free of heart disease at the start, and eight and a half years later it recorded into variable chd whether they suffered from heart disease (1 if the respective man did and 0 otherwise) along with their height and average number of cigarettes smoked per day. Consider the $R$ code below and its abbreviated output.

(a) Write down the model fitted by the code above.

(b) Interpret the effect on heart disease of a man smoking an average of two packs of cigarettes per day if each pack contains 20 cigarettes.

(c) Give an alternative latent logistic-variable representation of the model. [Hint: if $F$ is the cumulative distribution function of a logistic random variable, its inverse function is the logit function.]

Paper 3, Section I, J

Suppose we have data $\left(Y_{1}, x_{1}^{T}\right), \ldots,\left(Y_{n}, x_{n}^{T}\right)$, where the $Y_{i}$ are independent conditional on the design matrix $X$ whose rows are the $x_{i}^{T}, i=1, \ldots, n$. Suppose that given $x_{i}$, the true probability density function of $Y_{i}$ is $f_{x_{i}}$, so that the data is generated from an element of a model $\mathcal{F}:=\left\{\left(f_{x_{i}}(\cdot ; \theta)\right)_{i=1}^{n}, \theta \in \Theta\right\}$ for some $\Theta \subseteq \mathbb{R}^{q}$ and $q \in \mathbb{N}$.

(a) Define the log-likelihood function for $\mathcal{F}$, the maximum likelihood estimator of $\theta$ and Akaike's Information Criterion (AIC) for $\mathcal{F}$.

From now on let $\mathcal{F}$ be the normal linear model, i.e. $Y:=\left(Y_{1}, \ldots, Y_{n}\right)^{T}=X \beta+\varepsilon$, where $X \in \mathbb{R}^{n \times p}$ has full column rank and $\varepsilon \sim N_{n}\left(0, \sigma^{2} I\right)$.

(b) Let $\hat{\sigma}^{2}$ denote the maximum likelihood estimator of $\sigma^{2}$. Show that the AIC of $\mathcal{F}$ is

$n\left(1+\log \left(2 \pi \hat{\sigma}^{2}\right)\right)+2(p+1)$

(c) Let $\chi_{n-p}^{2}$ be a chi-squared distribution on $n-p$ degrees of freedom. Using any results from the course, show that the distribution of the AIC of $\mathcal{F}$ is

$n \log \left(\chi_{n-p}^{2}\right)+n\left(\log \left(2 \pi \sigma^{2} / n\right)+1\right)+2(p+1)$

$\left[\right.$ Hint: $\hat{\sigma}^{2}:=n^{-1}\|Y-X \hat{\beta}\|^{2}=n^{-1}\|(I-P) \varepsilon\|^{2}$, where $\hat{\beta}$ is the maximum likelihood estimator of $\beta$ and $P$ is the projection matrix onto the column space of $X$.]
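A sketch of the computation in part (b): at the maximum likelihood estimators the log-likelihood is

$\ell\left(\hat{\beta}, \hat{\sigma}^{2}\right)=-\frac{n}{2} \log \left(2 \pi \hat{\sigma}^{2}\right)-\frac{1}{2 \hat{\sigma}^{2}}\|Y-X \hat{\beta}\|^{2}=-\frac{n}{2} \log \left(2 \pi \hat{\sigma}^{2}\right)-\frac{n}{2}$

so $\mathrm{AIC}=-2 \ell\left(\hat{\beta}, \hat{\sigma}^{2}\right)+2(p+1)=n\left(1+\log \left(2 \pi \hat{\sigma}^{2}\right)\right)+2(p+1)$. For part (c), the hint gives $n \hat{\sigma}^{2}=\sigma^{2} \chi_{n-p}^{2}$ in distribution, and substituting $\hat{\sigma}^{2}=\sigma^{2} \chi_{n-p}^{2} / n$ into the AIC yields the stated expression.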

Paper 4, Section I, J

Suppose you have a data frame with variables response, covar1, and covar2. You run the following commands in $R$.

covar2    0.3755    2.5978    0.145    0.886

...

(a) Consider the following three scenarios:

(i) All the output you have is the abbreviated output of summary (model) above.

(ii) You have the abbreviated output of summary (model) above together with

Residual standard error: 0.8097 on 47 degrees of freedom

Multiple R-squared: 0.8126, Adjusted R-squared: 0.8046

F-statistic: 101.9 on 2 and 47 DF, p-value: < 2.2e-16

(iii) You have the abbreviated output of summary (model) above together with

Residual standard error: 0.9184 on 47 degrees of freedom

Multiple R-squared: 0.000712, Adjusted R-squared: -0.04181

F-statistic: 0.01674 on 2 and 47 DF, p-value: 0.9834

What conclusion can you draw about which variables explain the response in each of the three scenarios? Explain.

(b) Assume now that you have the abbreviated output of summary (model) above together with

anova(lm(response ~ 1), lm(response ~ covar1), model)

What are the values of the entries with a question mark? [You may express your answers as arithmetic expressions if necessary].

Paper 4, Section II, J

(a) Define a generalised linear model (GLM) with design matrix $X \in \mathbb{R}^{n \times p}$, output variables $Y:=\left(Y_{1}, \ldots, Y_{n}\right)^{T}$ and parameters $\mu:=\left(\mu_{1}, \ldots, \mu_{n}\right)^{T}, \beta \in \mathbb{R}^{p}$ and $\sigma_{i}^{2}:=a_{i} \sigma^{2} \in(0, \infty), i=1, \ldots, n$. Derive the moment generating function of $Y$, i.e. give an expression for $\mathbb{E}\left[\exp \left(t^{T} Y\right)\right], t \in \mathbb{R}^{n}$, wherever it is well-defined.

Assume from now on that the GLM satisfies the usual regularity assumptions, $X$ has full column rank, and $\sigma^{2}$ is known and satisfies $1 / \sigma^{2} \in \mathbb{N}$.

(b) Let $\tilde{Y}:=\left(\tilde{Y}_{1}, \ldots, \tilde{Y}_{n / \sigma^{2}}\right)^{T}$ be the output variables of a GLM from the same family as that of part (a) and parameters $\tilde{\mu}:=\left(\tilde{\mu}_{1}, \ldots, \tilde{\mu}_{n / \sigma^{2}}\right)^{T}$ and $\tilde{\sigma}^{2}:=\left(\tilde{\sigma}_{1}^{2}, \ldots, \tilde{\sigma}_{n / \sigma^{2}}^{2}\right)$. Suppose the output variables may be split into $n$ blocks of size $1 / \sigma^{2}$ with constant parameters. To be precise, for each block $i=1, \ldots, n$, if $j \in\left\{(i-1) / \sigma^{2}+1, \ldots, i / \sigma^{2}\right\}$ then

$\tilde{\mu}_{j}=\mu_{i} \quad \text { and } \quad \tilde{\sigma}_{j}^{2}=a_{i}$

with $\mu_{i}=\mu_{i}(\beta)$ and $a_{i}$ defined as in part (a). Let $\bar{Y}:=\left(\bar{Y}_{1}, \ldots, \bar{Y}_{n}\right)^{T}$, where $\bar{Y}_{i}:=\sigma^{2} \sum_{k=1}^{1 / \sigma^{2}} \tilde{Y}_{(i-1) / \sigma^{2}+k}$.

(i) Show that $\bar{Y}$ is equal to $Y$ in distribution. [Hint: you may use without proof that moment generating functions uniquely determine distributions from exponential dispersion families.]

(ii) For any $\tilde{y} \in \mathbb{R}^{n / \sigma^{2}}$, let $\bar{y}=\left(\bar{y}_{1}, \ldots, \bar{y}_{n}\right)^{T}$, where $\bar{y}_{i}:=\sigma^{2} \sum_{k=1}^{1 / \sigma^{2}} \tilde{y}_{(i-1) / \sigma^{2}+k}$. Show that the model function of $\tilde{Y}$ satisfies

$f\left(\tilde{y} ; \tilde{\mu}, \tilde{\sigma}^{2}\right)=g_{1}\left(\bar{y} ; \tilde{\mu}, \tilde{\sigma}^{2}\right) \times g_{2}\left(\tilde{y} ; \tilde{\sigma}^{2}\right)$

for some functions $g_{1}, g_{2}$, and conclude that $\bar{Y}$ is a sufficient statistic for $\beta$ from $\tilde{Y}$.

(iii) For the model and data from part (a), let $\hat{\mu}$ be the maximum likelihood estimator for $\mu$ and let $D(Y ; \mu)$ be the deviance at $\mu$. Using (i) and (ii), show that

$\frac{D(Y ; \hat{\mu})}{\sigma^{2}}={ }^{d} 2 \log \left\{\frac{\sup _{\tilde{\mu}^{\prime} \in \widetilde{\mathcal{M}}_{1}} f\left(\tilde{Y} ; \tilde{\mu}^{\prime}, \tilde{\sigma}^{2}\right)}{\sup _{\tilde{\mu}^{\prime} \in \widetilde{\mathcal{M}}_{0}} f\left(\tilde{Y} ; \tilde{\mu}^{\prime}, \tilde{\sigma}^{2}\right)}\right\}$

where $=^{d}$ means equality in distribution and $\widetilde{\mathcal{M}}_{0}$ and $\widetilde{\mathcal{M}}_{1}$ are nested subspaces of $\mathbb{R}^{n / \sigma^{2}}$ which you should specify. Argue that $\operatorname{dim}\left(\widetilde{\mathcal{M}}_{1}\right)=n$ and $\operatorname{dim}\left(\widetilde{\mathcal{M}}_{0}\right)=p$, and, assuming the usual regularity assumptions, conclude that

$\frac{D(Y ; \hat{\mu})}{\sigma^{2}} \rightarrow^{d} \chi_{n-p}^{2} \quad \text { as } \sigma^{2} \rightarrow 0$

stating the name of the result from class that you use.

Paper 1, Section I, J

The Gamma distribution with shape parameter $\alpha>0$ and scale parameter $\lambda>0$ has probability density function

$f(y ; \alpha, \lambda)=\frac{\lambda^{\alpha}}{\Gamma(\alpha)} y^{\alpha-1} e^{-\lambda y} \quad \text { for } y>0$

Give the definition of an exponential dispersion family and show that the set of Gamma distributions forms one such family. Find the cumulant generating function and derive the mean and variance of the Gamma distribution as a function of $\alpha$ and $\lambda$.
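A sketch of the key computation: the log-density can be written as

$\log f(y ; \alpha, \lambda)=-\lambda y+\alpha \log \lambda+(\alpha-1) \log y-\log \Gamma(\alpha)$

which has exponential dispersion form, and for $t<\lambda$ the cumulant generating function is

$K(t)=\log \mathbb{E}\left[e^{t Y}\right]=-\alpha \log (1-t / \lambda)$

so that $\mathbb{E}[Y]=K^{\prime}(0)=\alpha / \lambda$ and $\operatorname{Var}(Y)=K^{\prime \prime}(0)=\alpha / \lambda^{2}$.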

Paper 1, Section II, J

The ice_cream data frame contains the result of a blind tasting of 90 ice creams, each of which is rated as poor, good, or excellent. It also contains the price of each ice cream classified into three categories. Consider the $R$ code below and its output.

(a) Write down the generalised linear model fitted by the code above.

(b) Prove that the fitted values resulting from the maximum likelihood estimator of the coefficients in this model are identical to those resulting from the maximum likelihood estimator when fitting a Multinomial model which assumes the number of ice creams at each price level is fixed.

(c) Using the output above, perform a goodness-of-fit test at the $1 \%$ level, specifying the null hypothesis, the test statistic, its asymptotic null distribution, any assumptions of the test and the decision from your test.

(d) If we believe that better ice creams are more expensive, what could be a more powerful test against the model fitted above and why?

Paper 2, Section I, J

The cycling data frame contains the results of a study on the effects of cycling to work among 1,000 participants with asthma, a respiratory illness. Half of the participants, chosen uniformly at random, received a monetary incentive to cycle to work, and the other half did not. The variables in the data frame are:

miles: the average number of miles cycled per week

episodes: the number of asthma episodes experienced during the study

incentive: whether or not a monetary incentive to cycle was given

history: the number of asthma episodes in the year preceding the study

Consider the $R$ code below and its abbreviated output.

> lm.1 = lm(episodes ~ miles + history, data=cycling)
> summary(lm.1)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.66937    0.07965   8.404  < 2e-16 ***
miles        -0.04917    0.01839  -2.674  0.00761 **
history       1.48954    0.04818  30.918  < 2e-16 ***

> lm.2 = lm(episodes ~ incentive + history, data=cycling)
> summary(lm.2)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.09539    0.06960   1.371    0.171
incentiveYes  0.91387    0.06504  14.051  < 2e-16 ***
history       1.46806    0.04346  33.782  < 2e-16 ***

> lm.3 = lm(miles ~ incentive + history, data=cycling)
> summary(lm.3)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.47050    0.11682  12.588  < 2e-16 ***
incentiveYes  1.73282    0.10917  15.872  < 2e-16 ***
history       0.47322    0.07294   6.487 1.37e-10 ***

(a) For each of the fitted models, briefly explain what can be inferred about participants with similar histories.

(b) Based on this analysis and the experimental design, is it advisable for a participant with asthma to cycle to work more often? Explain.

Paper 3, Section I, J

(a) For a given model with likelihood $L(\beta), \beta \in \mathbb{R}^{p}$, define the Fisher information matrix in terms of the Hessian of the log-likelihood.

Consider a generalised linear model with design matrix $X \in \mathbb{R}^{n \times p}$, output variables $y \in \mathbb{R}^{n}$, a bijective link function, mean parameters $\mu=\left(\mu_{1}, \ldots, \mu_{n}\right)$ and dispersion parameters $\sigma_{1}^{2}=\ldots=\sigma_{n}^{2}=\sigma^{2}$. Assume $\sigma^{2}$ is known.

(b) State the form of the log-likelihood.

(c) For the canonical link, show that when the parameter $\sigma^{2}$ is known, the Fisher information matrix is equal to

$\sigma^{-2} X^{T} W X$

for a diagonal matrix $W$ depending on the means $\mu$. Identify $W$.
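A sketch of why part (c) holds: for the canonical link $g=\left(b^{\prime}\right)^{-1}$ one has $g^{\prime}(\mu)=1 / V(\mu)$, so in the general GLM formula for the Fisher information the factor $V\left(\mu_{i}\right)\left\{g^{\prime}\left(\mu_{i}\right)\right\}^{2}$ reduces to $1 / V\left(\mu_{i}\right)$, giving

$i_{j k}(\beta)=\sigma^{-2} \sum_{i=1}^{n} X_{i j} X_{i k} V\left(\mu_{i}\right), \quad \text { i.e. } \quad W=\operatorname{diag}\left(V\left(\mu_{1}\right), \ldots, V\left(\mu_{n}\right)\right)$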

Paper 4, Section I, J

In a normal linear model with design matrix $X \in \mathbb{R}^{n \times p}$, output variables $y \in \mathbb{R}^{n}$ and parameters $\beta \in \mathbb{R}^{p}$ and $\sigma^{2}>0$, define a $(1-\alpha)$-level prediction interval for a new observation with input variables $x^{*} \in \mathbb{R}^{p}$. Derive an explicit formula for the interval, proving that it satisfies the properties required by the definition. [You may assume that the maximum likelihood estimator $\hat{\beta}$ is independent of $\sigma^{-2}\|y-X \hat{\beta}\|_{2}^{2}$, which has a $\chi_{n-p}^{2}$ distribution.]
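For reference, the standard interval (a sketch): with $\hat{\sigma}^{2}:=\|y-X \hat{\beta}\|_{2}^{2} /(n-p)$,

$x^{* T} \hat{\beta} \pm t_{n-p}(1-\alpha / 2)\, \hat{\sigma} \sqrt{1+x^{* T}\left(X^{T} X\right)^{-1} x^{*}}$

since $Y^{*}-x^{* T} \hat{\beta} \sim \mathrm{N}\left(0, \sigma^{2}\left(1+x^{* T}\left(X^{T} X\right)^{-1} x^{*}\right)\right)$ independently of $\hat{\sigma}^{2}$.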

Paper 4, Section II, J

A sociologist collects a dataset on friendships among $m$ Cambridge graduates. Let $y_{i, j}=1$ if persons $i$ and $j$ are friends 3 years after graduation, and $y_{i, j}=0$ otherwise. Let $z_{i}$ be a categorical variable for person $i$ 's college, taking values in the set $\{1,2, \ldots, C\}$. Consider logistic regression models,

$\mathbb{P}\left(y_{i, j}=1\right)=\frac{e^{\theta_{i, j}}}{1+e^{\theta_{i, j}}}, \quad 1 \leqslant i<j \leqslant m$

with parameters either

$\theta_{i, j}=\beta_{z_{i}, z_{j}}$; or,

$\theta_{i, j}=\beta_{z_{i}}+\beta_{z_{j}}$; or,

$\theta_{i, j}=\beta_{z_{i}}+\beta_{z_{j}}+\beta_{0} \delta_{z_{i}, z_{j}}$, where $\delta_{z_{i}, z_{j}}=1$ if $z_{i}=z_{j}$ and 0 otherwise.

(a) Write the likelihood of the models.

(b) Show that the three models are nested and specify the order. Suggest a statistic to compare models 1 and 3, give its definition and specify its asymptotic distribution under the null hypothesis, citing any necessary theorems.

(c) Suppose persons $i$ and $j$ are in the same college $k$; consider the number of friendships, $M_{i}$ and $M_{j}$, that each of them has with people in college $\ell \neq k$ ($\ell$ and $k$ fixed). In each of the models above, compare the distribution of these two random variables. Explain why this might lead to a poor quality of fit.

(d) Find a minimal sufficient statistic for model 3. [You may use the following characterisation of a minimal sufficient statistic: let $f(\beta ; y)$ be the likelihood in this model, where $\beta=\left(\beta_{k}\right)_{k=0,1, \ldots, C}$ and $y=\left(y_{i, j}\right)_{i, j=1, \ldots, m} ;$ suppose $T=t(y)$ is a statistic such that $f(\beta ; y) / f\left(\beta ; y^{\prime}\right)$ is constant in $\beta$ if and only if $t(y)=t\left(y^{\prime}\right)$; then, $T$ is a minimal sufficient statistic for $\beta$.]

Paper 1, Section I, J

The data frame Ambulance contains data on the number of ambulance requests from a Cambridgeshire hospital on different days. In addition to the number of ambulance requests on each day, the dataset records whether each day fell in the winter season, on a weekend, or on a bank holiday, as well as the pollution level on each day.

A health researcher fitted two models to the dataset above using $R$. Consider the following code and its output.

$\begin{aligned} & \text { > head (Ambulance) } \\ & \text { Winter Weekend Bank. holiday Pollution. level Ambulance.requests } \\ & 1 \text { Yes Yes No High } 16 \end{aligned}$

$\begin{aligned} & 3 \text { No No No High } 22 \\ & 4 \text { No } \quad \text { Yes } \quad \text { No } \quad \text { Medium } \quad 11 \end{aligned}$

Define the two models fitted by this code and perform a hypothesis test with level $1 \%$ in which one of the models is the null hypothesis and the other is the alternative. State the theorem used in this hypothesis test. You may use the information generated by the following commands.

Paper 1, Section II, J

A clinical study follows a number of patients with an illness. Let $Y_{i} \in[0, \infty)$ be the length of time that patient $i$ lives and $x_{i} \in \mathbb{R}^{p}$ a vector of predictors, for $i \in\{1, \ldots, n\}$. We shall assume that $Y_{1}, \ldots, Y_{n}$ are independent. Let $f_{i}$ and $F_{i}$ be the probability density function and cumulative distribution function, respectively, of $Y_{i}$. The hazard function $h_{i}$ is defined as

$h_{i}(t)=\frac{f_{i}(t)}{1-F_{i}(t)} \quad \text { for } t \geqslant 0 .$

We shall assume that $h_{i}(t)=\lambda(t) \exp \left(\beta^{\top} x_{i}\right)$, where $\beta \in \mathbb{R}^{p}$ is a vector of coefficients and $\lambda(t)$ is some fixed hazard function.

(a) Prove that $F_{i}(t)=1-\exp \left(-\int_{0}^{t} h_{i}(s) d s\right)$.
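A sketch of (a): since $h_{i}(t)=-\frac{\mathrm{d}}{\mathrm{d} t} \log \left(1-F_{i}(t)\right)$, integrating from 0 to $t$ and using $F_{i}(0)=0$ gives

$\int_{0}^{t} h_{i}(s) \mathrm{d} s=-\log \left(1-F_{i}(t)\right)$

which rearranges to the stated identity.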

(b) Using the equation in part (a), write the log-likelihood function for $\beta$ in terms of $\lambda, \beta, x_{i}$ and $Y_{i}$ only.

(c) Show that the maximum likelihood estimate of $\beta$ can be obtained through a surrogate Poisson generalised linear model with an offset.
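A sketch of the connection in (c): writing $\Lambda(t):=\int_{0}^{t} \lambda(s) \mathrm{d} s$, the log-likelihood from part (b) is

$\ell(\beta)=\sum_{i=1}^{n}\left\{\log \lambda\left(Y_{i}\right)+\beta^{\top} x_{i}-e^{\beta^{\top} x_{i}} \Lambda\left(Y_{i}\right)\right\}$

which, up to terms not involving $\beta$, is the log-likelihood of independent Poisson observations $Z_{i}=1$ with mean $\exp \left\{\beta^{\top} x_{i}+\log \Lambda\left(Y_{i}\right)\right\}$: a Poisson GLM with log link and offset $\log \Lambda\left(Y_{i}\right)$.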

Paper 2, Section I, 5J

Consider a linear model $Y=X \beta+\sigma^{2} \varepsilon$ with $\varepsilon \sim N(0, I)$, where the design matrix $X$ is $n$ by $p$. Provide an expression for the $F$-statistic used to test the hypothesis $\beta_{p_{0}+1}=\beta_{p_{0}+2}=\cdots=\beta_{p}=0$ for $p_{0}<p$. Show that it is a monotone function of a log-likelihood ratio statistic.
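For reference (a sketch): with residual sums of squares $\mathrm{RSS}_{0}$ and $\mathrm{RSS}_{1}$ under the null and full models,

$F=\frac{\left(\mathrm{RSS}_{0}-\mathrm{RSS}_{1}\right) /\left(p-p_{0}\right)}{\mathrm{RSS}_{1} /(n-p)}$

while the log-likelihood ratio statistic is $n \log \left(\mathrm{RSS}_{0} / \mathrm{RSS}_{1}\right)=n \log \left\{1+\frac{p-p_{0}}{n-p} F\right\}$, a strictly increasing function of $F$.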

Paper 3, Section I, J

The data frame Cases.of.flu contains a list of cases of flu recorded in 3 London hospitals during each month of 2017. Consider the following $R$ code and its output.

> table(Cases.of.flu)
           Hospital
Month        A   B   C
  April     10  40  27
  August     9  34  19
  December  24 129  81
  February  49 134  74
  January   45 138  78
  July       5  36  35
  June      11  36  22
  March     20  82  41
  May        5  43  23
  November  17  82  62
  October    6  26  19
  September  6  40  21

> Cases.of.flu.table = as.data.frame(table(Cases.of.flu))
> head(Cases.of.flu.table)
     Month Hospital Freq
1    April        A   10
2   August        A    9
3 December        A   24
4 February        A   49
5  January        A   45
6     July        A    5

> mod1 = glm(Freq ~ ., data=Cases.of.flu.table, family=poisson)
> mod1$dev
[1] 28.51836

> levels(Cases.of.flu$Month)

Describe a test for the null hypothesis of independence between the variables Month and Hospital using the deviance statistic. State the assumptions of the test.

Perform the test at the $1 \%$ level for each of the two different models shown above. You may use the table below showing 99 th percentiles of the $\chi_{p}^{2}$ distribution with a range of degrees of freedom $p$. How would you explain the discrepancy between their conclusions?

Paper 4, Section I, J

A scientist is studying the effects of a drug on the weight of mice. Forty mice are divided into two groups, control and treatment. The mice in the treatment group are given the drug, and those in the control group are given water instead. The mice are kept in 8 different cages. The weight of each mouse is monitored for 10 days, and the results of the experiment are recorded in the data frame Weight.data. Consider the following $R$ code and its output.

> head(Weight.data)
  Time   Group Cage Mouse   Weight
1    1 Control    1     1 24.77578
2    2 Control    1     1 24.68766
3    3 Control    1     1 24.79008
4    4 Control    1     1 24.77005
5    5 Control    1     1 24.65092
6    6 Control    1     1 24.38436

> mod1 = lm(Weight ~ Time*Group + Cage, data=Weight.data)
> summary(mod1)

Call:
lm(formula = Weight ~ Time * Group + Cage, data = Weight.data)

Residuals:
     Min       1Q   Median       3Q      Max
-1.36903 -0.33527 -0.01719  0.38807  1.24368

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
Time                -0.006023   0.012616  -0.477  0.63334
GroupTreatment       0.321837   0.121993   2.638  0.00867 **
Cage2               -0.400228   0.095875  -4.174 3.68e-05 ***
Cage3                0.286941   0.102494   2.800  0.00537 **
Cage4                0.007535   0.095875   0.079  0.93740
Cage6                0.124767   0.125530   0.994  0.32087
Cage8               -0.295168   0.125530  -2.351  0.01920 *
Time:GroupTreatment -0.173515   0.017842  -9.725  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5125 on 391 degrees of freedom

Multiple R-squared: 0.5591, Adjusted R-squared: 0.55

F-statistic: 61.97 on 8 and 391 DF, p-value: < 2.2e-16

Which parameters describe the rate of weight loss with time in each group? According to the $\mathrm{R}$ output, is there a statistically significant weight loss with time in the control group?

Three diagnostic plots were generated using $R$ code that is not reproduced here. [The plots are not shown; the surviving axis labels refer to Weight.data$Time[mouse1] and Weight.data$Time[mouse2].]

Based on these plots, should you trust the significance tests shown in the output of the command summary (mod1)? Explain.

Paper 4, Section II, J

Bridge is a card game played by 2 teams of 2 players each. A bridge club records the outcomes of many games between teams formed by its $m$ members. The outcomes are modelled by

$\mathbb{P}(\text { team }\{i, j\} \text { wins against team }\{k, \ell\})=\frac{\exp \left(\beta_{i}+\beta_{j}+\beta_{\{i, j\}}-\beta_{k}-\beta_{\ell}-\beta_{\{k, \ell\}}\right)}{1+\exp \left(\beta_{i}+\beta_{j}+\beta_{\{i, j\}}-\beta_{k}-\beta_{\ell}-\beta_{\{k, \ell\}}\right)},$

where $\beta_{i} \in \mathbb{R}$ is a parameter representing the skill of player $i$, and $\beta_{\{i, j\}} \in \mathbb{R}$ is a parameter representing how well-matched the team formed by $i$ and $j$ is.

(a) Would it make sense to include an intercept in this logistic regression model? Explain your answer.

(b) Suppose that players 1 and 2 always play together as a team. Is there a unique maximum likelihood estimate for the parameters $\beta_{1}, \beta_{2}$ and $\beta_{\{1,2\}}$ ? Explain your answer.

(c) Under the model defined above, derive the asymptotic distribution (including the values of all relevant parameters) for the maximum likelihood estimate of the probability that team $\{i, j\}$ wins a game against team $\{k, \ell\}$. You can state it as a function of the true vector of parameters $\beta$, and the Fisher information matrix $i_{N}(\beta)$ with $N$ games. You may assume that $i_{N}(\beta) / N \rightarrow I(\beta)$ as $N \rightarrow \infty$, and that $\beta$ has a unique maximum likelihood estimate for $N$ large enough.

Paper 1, Section I, J

The dataset ChickWeights records the weight of a group of chickens fed four different diets at a range of time points. We perform the following regressions in $R$.

(i) Which hypothesis test does the following command perform? State the degrees of freedom, and the conclusion of the test.

(ii) Define a diagnostic plot that might suggest the logarithmic transformation of the response in fit2.

(iii) Define the dashed line in the following plot, generated with the command plot(fit3). What does it tell us about the data point 579?

Paper 1, Section II, J

The Cambridge Lawn Tennis Club organises a tournament in which every match consists of 11 games, all of which are played. The player who wins 6 or more games is declared the winner.

For players $a$ and $b$, let $n_{a b}$ be the total number of games they play against each other, and let $y_{a b}$ be the number of these games won by player $a$. Let $\tilde{n}_{a b}$ and $\tilde{y}_{a b}$ be the corresponding number of matches.

A statistician analysed the tournament data using a Binomial Generalised Linear Model (GLM) with outcome $y_{a b}$. The probability $P_{a b}$ that $a$ wins a game against $b$ is modelled by

$\log \left(\frac{P_{a b}}{1-P_{a b}}\right)=\beta_{a}-\beta_{b}, \qquad(*)$

with an appropriate corner point constraint. You are asked to re-analyse the data, but the game-level results have been lost and you only know which player won each match.

We define a new GLM for the outcomes $\tilde{y}_{a b}$ with $\tilde{P}_{a b}=\mathbb{E} \tilde{y}_{a b} / \tilde{n}_{a b}$ and $g\left(\tilde{P}_{a b}\right)=\beta_{a}-\beta_{b}$, where the $\beta_{a}$ are defined in $(*)$. That is, $\beta_{a}-\beta_{b}$ is the log-odds that $a$ wins a game against $b$, not a match.

Derive the form of the new link function $g$. [You may express your answer in terms of a cumulative distribution function.]
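As a sanity check on the setting (a hedged sketch, not the requested derivation): since all 11 games are played, the match-win probability is $\tilde{P}_{a b}=\mathbb{P}\left(\operatorname{Bin}\left(11, P_{a b}\right) \geqslant 6\right)$ with $P_{a b}$ the logistic game-win probability, so $g$ is the inverse of this composite map.

```python
import math

def binom_cdf(k, n, p):
    # P(X <= k) for X ~ Binomial(n, p)
    return sum(math.comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k + 1))

def match_win_prob(log_odds):
    # log_odds = beta_a - beta_b, the log-odds that a wins a *game*;
    # a *match* is won by taking 6 or more of the 11 games
    p = 1.0 / (1.0 + math.exp(-log_odds))
    return 1.0 - binom_cdf(5, 11, p)
```

Evenly matched players give match_win_prob(0.0) = 0.5 by symmetry, and a modest game-level edge is amplified at match level.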

Paper 2, Section I, J

A statistician is interested in the power of a $t$-test with level $5 \%$ in linear regression; that is, the probability of rejecting the null hypothesis $\beta_{0}=0$ with this test under an alternative with $\beta_{0}>0$.

(a) State the distribution of the least-squares estimator $\hat{\beta}_{0}$, and hence state the form of the $t$-test statistic used.

(b) Prove that the power does not depend on the other coefficients $\beta_{j}$ for $j>0$.
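A normal approximation illustrates why (b) holds (a hedged sketch; the exact power involves a noncentral $t$ distribution): the $t$ statistic is roughly $N(\delta, 1)$ with noncentrality $\delta=\beta_{0} /\left(\sigma \sqrt{v_{00}}\right)$, where $v_{00}=\left[\left(X^{\top} X\right)^{-1}\right]_{00}$, and no other coefficient enters.

```python
import math

def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def approx_power(beta0, sigma, v00, z_crit=1.6449):
    # power of the level-5% one-sided t-test, normal approximation:
    # the statistic is roughly N(delta, 1) with
    # delta = beta0 / (sigma * sqrt(v00)); the other beta_j cancel
    # out of both the estimator and its standard error
    delta = beta0 / (sigma * math.sqrt(v00))
    return Phi(delta - z_crit)
```

At $\beta_{0}=0$ the approximation returns the level, about $0.05$, and the power increases monotonically in $\delta$.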

Paper 3, Section I, J

For Fisher's method of Iteratively Reweighted Least-Squares and Newton-Raphson optimisation of the log-likelihood, the vector of parameters $\beta$ is updated using an iteration

$\beta^{(m+1)}=\beta^{(m)}+M\left(\beta^{(m)}\right)^{-1} U\left(\beta^{(m)}\right),$

for a specific function $M$. How is $M$ defined in each method?

Prove that they are identical in a Generalised Linear Model with the canonical link function.
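A one-parameter Poisson example (the log link is canonical) illustrates the claim numerically; the data are illustrative and this sketch is not a substitute for the proof:

```python
import math

# Poisson regression with log mu_i = beta * x_i (canonical link);
# illustrative data, not from any question
x = [0.5, 1.0, 1.5, 2.0]
y = [1, 2, 4, 7]

def observed_info(beta):
    # Newton-Raphson uses M = -l''(beta) = sum_i x_i^2 mu_i;
    # with the canonical link, y_i has dropped out of l''
    return sum(xi * xi * math.exp(beta * xi) for xi in x)

def expected_info(beta):
    # Fisher scoring uses M = E[-l''(beta)]; since l'' is non-random
    # here, taking the expectation changes nothing, and
    # sum_i x_i^2 Var(Y_i) with Var(Y_i) = mu_i gives the same sum
    return sum(xi * xi * math.exp(beta * xi) for xi in x)

# the two methods therefore take identical update steps
for b in (0.0, 0.3, 1.0):
    assert observed_info(b) == expected_info(b)
```

The key point, visible in the comments, is that with a canonical link the second derivative of the log-likelihood does not involve the data, so observed and expected information coincide.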

Paper 4, Section I, J

A Cambridge scientist is testing approaches to slow the spread of a species of moth in certain trees. Two groups of 30 trees were treated with different organic pesticides, and a third group of 30 trees was kept under control conditions. At the end of the summer the trees are classified according to the level of leaf damage, obtaining the following contingency table.

Which of the following Generalised Linear Model fitting commands is appropriate for these data? Why? Describe the model being fit.

Paper 4, Section II, J

The dataset diesel records the number of diesel cars which go through a block of Hills Road in 6 disjoint periods of 30 minutes, between 8AM and 11AM. The measurements are repeated each day for 10 days. Answer the following questions based on the code below, which is shown with partial output.

(a) Can we reject the model fit.1 at a $1 \%$ level? Justify your answer.

(b) What is the difference between the deviance of the models fit.2 and fit.3?

(c) Which of fit.2 and fit.3 would you use to perform variable selection by backward stepwise selection? Why?

(d) How does the final plot differ from what you expect under the model in fit.2? Provide a possible explanation and suggest a better model.

> head(diesel)

period num.cars day

1 1 69 1

2 2 97 1

3 3 103 1

4 4 99 1

5 5 67 1

6 6 91 1

> fit.1 = glm(num.cars ~ period, data=diesel, family=poisson)

> summary(fit.1)

Deviance Residuals:

Min 1Q Median 3Q Max

-4.0188 -1.4837 -0.2117 1.6257 4.5965

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 4.628535 0.029288 158.035 <2e-16 ***

period -0.006073 0.007551 -0.804 0.421

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 262.36 on 59 degrees of freedom

Residual deviance: 261.72 on 58 degrees of freedom

AIC: 651.2

> diesel$period.factor = factor(diesel$period)

> fit.2 = glm(num.cars ~ period.factor, data=diesel, family=poisson)

> summary(fit.2)

Coefficients:

Estimate Std. Error z value Pr(>|z|)
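For part (a), the residual deviance of fit.1 (261.72 on 58 degrees of freedom) can be compared with its approximate $\chi_{58}^{2}$ null distribution. A stdlib-only sketch using the Wilson-Hilferty approximation (in R one would simply call pchisq(261.72, 58, lower.tail=FALSE)):

```python
import math

def chi2_sf(x, df):
    # Wilson-Hilferty approximation: (X/df)^(1/3) is approximately
    # normal with mean 1 - 2/(9 df) and variance 2/(9 df)
    m = 1.0 - 2.0 / (9.0 * df)
    s = math.sqrt(2.0 / (9.0 * df))
    z = ((x / df) ** (1.0 / 3.0) - m) / s
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# residual deviance of fit.1 against its degrees of freedom;
# the p-value is far below 0.01, so the model is rejected
p_value = chi2_sf(261.72, 58)
```

The same function applied at the mean of the distribution, chi2_sf(58, 58), returns a value close to one half, as expected.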

Part II, $2017 \quad$ List of Questions


Paper 1, Section I, K

The body mass index (BMI) of your closest friend is a good predictor of your own BMI. A scientist applies polynomial regression to understand the relationship between these two variables among 200 students in a sixth form college. The $R$ commands

> fit.1 <- lm(BMI ~ poly(friendBMI, 2, raw=T))

> fit.2 <- lm(BMI ~ poly(friendBMI, 3, raw=T))

fit the models $Y=\beta_{0}+\beta_{1} X+\beta_{2} X^{2}+\varepsilon$ and $Y=\beta_{0}+\beta_{1} X+\beta_{2} X^{2}+\beta_{3} X^{3}+\varepsilon$, respectively, with $\varepsilon \sim N\left(0, \sigma^{2}\right)$ in each case.

Setting the parameter raw to FALSE:

> fit.3 <- lm(BMI ~ poly(friendBMI, 2, raw=F))

> fit.4 <- lm(BMI ~ poly(friendBMI, 3, raw=F))

fits the models $Y=\beta_{0}+\beta_{1} P_{1}(X)+\beta_{2} P_{2}(X)+\varepsilon$ and $Y=\beta_{0}+\beta_{1} P_{1}(X)+\beta_{2} P_{2}(X)+$ $\beta_{3} P_{3}(X)+\varepsilon$, with $\varepsilon \sim N\left(0, \sigma^{2}\right)$. The function $P_{i}$ is a polynomial of degree $i$. Furthermore, the design matrix output by the function poly with raw=F satisfies:

> t(poly(friendBMI, 3, raw=F)) %*% poly(friendBMI, 3, raw=F)

1 2 3

1 1.000000e+00 1.288032e-16 3.187554e-17

2 1.288032e-16 1.000000e+00 -6.201636e-17

3 3.187554e-17 -6.201636e-17 1.000000e+00

How does the variance of $\hat{\beta}$ differ in the models fit.2 and fit.4? What about the variance of the fitted values $\hat{Y}=X \hat{\beta}$? Finally, consider the output of the commands

> anova(fit.1, fit.2)

> anova(fit.3, fit.4)

Define the test statistic computed by this function and specify its distribution. Which command yields a higher statistic?
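The point of the orthonormal basis can be checked numerically. The sketch below (illustrative data; pure Python and Gram-Schmidt, whereas R's poly uses a QR decomposition and drops the intercept column) confirms that the raw and orthonormalised bases give identical fitted values, while orthonormal columns make $X^{\top} X=I$ and hence $\operatorname{Var}(\hat{\beta})=\sigma^{2} I$:

```python
import math

# illustrative data (not the BMI data from the question)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [2.1, 2.9, 3.6, 4.8, 6.1, 8.2, 11.0]
n, p = len(x), 4

raw = [[xi**j for j in range(p)] for xi in x]   # columns 1, x, x^2, x^3

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def col(mat, j):
    return [row[j] for row in mat]

# modified Gram-Schmidt: orthonormalise the raw columns
Q = []
for j in range(p):
    v = col(raw, j)
    for q in Q:
        c = dot(q, v)
        v = [vi - c * qi for vi, qi in zip(v, q)]
    nv = math.sqrt(dot(v, v))
    Q.append([vi / nv for vi in v])

# orthonormal design: beta_hat = Q'y and fitted values = Q Q' y
beta_orth = [dot(q, y) for q in Q]
fitted_orth = [sum(beta_orth[j] * Q[j][i] for j in range(p)) for i in range(n)]

# raw design: solve the normal equations (X'X) beta = X'y by
# Gaussian elimination (X'X is symmetric positive definite)
A = [[dot(col(raw, a), col(raw, b)) for b in range(p)] for a in range(p)]
rhs = [dot(col(raw, a), y) for a in range(p)]
for k in range(p):
    for i in range(k + 1, p):
        f = A[i][k] / A[k][k]
        for j in range(k, p):
            A[i][j] -= f * A[k][j]
        rhs[i] -= f * rhs[k]
beta_raw = [0.0] * p
for i in range(p - 1, -1, -1):
    s = rhs[i] - sum(A[i][j] * beta_raw[j] for j in range(i + 1, p))
    beta_raw[i] = s / A[i][i]
fitted_raw = [dot(row, beta_raw) for row in raw]

# both are the projection of y onto the same column span
assert max(abs(a - b) for a, b in zip(fitted_raw, fitted_orth)) < 1e-6
```

The individual coefficients differ between the two parametrisations, but the fitted values, and therefore the residual sums of squares entering the F statistic, do not.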

Paper 1, Section II, K

(a) Let $Y$ be an $n$-vector of responses from the linear model $Y=X \beta+\varepsilon$, with $\beta \in \mathbb{R}^{p}$. The internally studentized residual is defined by

$s_{i}=\frac{Y_{i}-x_{i}^{\top} \hat{\beta}}{\tilde{\sigma} \sqrt{1-p_{i}}},$

where $\hat{\beta}$ is the least squares estimate, $p_{i}$ is the leverage of sample $i$, and

$\tilde{\sigma}^{2}=\frac{\|Y-X \hat{\beta}\|_{2}^{2}}{(n-p)} .$

Prove that the joint distribution of $s=\left(s_{1}, \ldots, s_{n}\right)^{\top}$ is the same in the following two models: (i) $\varepsilon \sim N\left(0, \sigma^{2} I\right)$, and (ii) $\varepsilon \mid \sigma^{2} \sim N\left(0, \sigma^{2} I\right)$, with $\nu / \sigma^{2} \sim \chi_{\nu}^{2}$ (in this model, $\varepsilon_{1}, \ldots, \varepsilon_{n}$ are identically $t_{\nu}$-distributed). [Hint: A random vector $Z$ is spherically symmetric if for any orthogonal matrix $H, H Z \stackrel{d}{=} Z$. If $Z$ is spherically symmetric and a.s. nonzero, then $Z /\|Z\|_{2}$ is a uniform point on the sphere; in addition, any orthogonal projection of $Z$ is also spherically symmetric. A standard normal vector is spherically symmetric.]

(b) A social scientist regresses the income of 120 Cambridge graduates onto 20 answers from a questionnaire given to the participants in their first year. She notices one questionnaire with very unusual answers, which she suspects was due to miscoding. The sample has a leverage of $0.8$. To check whether this sample is an outlier, she computes its externally studentized residual,

$t_{i}=\frac{Y_{i}-x_{i}^{\top} \hat{\beta}}{\tilde{\sigma}_{(i)} \sqrt{1-p_{i}}}=4.57,$

where $\tilde{\sigma}_{(i)}$ is estimated from a fit of all samples except the one in question, $\left(x_{i}, Y_{i}\right)$. Is this a high leverage point? Can she conclude this sample is an outlier at a significance level of $5 \%$ ?

(c) After examining the following plot of residuals against the response, the investigator calculates the externally studentized residual of the participant denoted by the black dot, which is $2.33$. Can she conclude this sample is an outlier with a significance level of $5 \%$ ?

Part II, $2016 \quad$ List of Questions

Paper 2, Section I, K

Define an exponential dispersion family. Prove that the range of the natural parameter, $\Theta$, is an open interval. Derive the mean and variance as a function of the log normalizing constant.

[Hint: Use the convexity of $e^{x}$, i.e. $e^{p x+(1-p) y} \leqslant p e^{x}+(1-p) e^{y}$ for all $p \in[0,1]$.]
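A compact version of the mean-variance computation, written for a family of the form $f(y ; \theta)=\exp \{y \theta-K(\theta)\} f_{0}(y)$ with dispersion taken to be 1 (a sketch of the standard argument; the full derivation must also carry the dispersion parameter):

```latex
\begin{align*}
1 &= \int e^{y\theta - K(\theta)} f_0(y)\,dy
  \;\Longrightarrow\; e^{K(\theta)} = \int e^{y\theta} f_0(y)\,dy, \\
\tfrac{d}{d\theta}&:\;
  K'(\theta)\, e^{K(\theta)} = \int y\, e^{y\theta} f_0(y)\,dy
  \;\Longrightarrow\; \mathbb{E}\,Y = K'(\theta), \\
\tfrac{d^2}{d\theta^2}&:\;
  \bigl(K''(\theta) + K'(\theta)^2\bigr) e^{K(\theta)}
  = \int y^2 e^{y\theta} f_0(y)\,dy
  \;\Longrightarrow\; \operatorname{Var} Y = K''(\theta).
\end{align*}
```

Differentiation under the integral sign is justified on the interior of $\Theta$, which is why openness of $\Theta$ matters.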

Paper 3, Section I, K

The $R$ command

$>\operatorname{boxcox}($ rainfall $\sim$ month+elnino+month:elnino)

performs a Box-Cox transform of the response at several values of the parameter $\lambda$, and produces the following plot:

We fit two linear models and obtain the Q-Q plots for each fit, which are shown below in no particular order:

Define the variable on the $y$-axis in the output of boxcox, and match each Q-Q plot to one of the models.

After choosing the model fit.2, the researcher calculates Cook's distance for the $i$ th sample, which has high leverage, and compares it to the upper $0.01$-point of an $F_{p, n-p}$ distribution, because the design matrix is of size $n \times p$. Provide an interpretation of this comparison in terms of confidence sets for $\hat{\beta}$. Is this confidence statement exact?

Paper 4, Section I, K

(a) Let $Y_{i}=x_{i}^{\top} \beta+\varepsilon_{i}$ where $\varepsilon_{i}$ for $i=1, \ldots, n$ are independent and identically distributed. Let $Z_{i}=I\left(Y_{i}<0\right)$ for $i=1, \ldots, n$, and suppose that these variables follow a binary regression model with the complementary log-log link function $g(\mu)=\log (-\log (1-\mu))$. What is the probability density function of $\varepsilon_{1}$?

(b) The Newton-Raphson algorithm can be applied to compute the MLE, $\hat{\beta}$, in certain GLMs. Starting from $\beta^{(0)}=0$, we let $\beta^{(t+1)}$ be the maximizer of the quadratic approximation of the log-likelihood $\ell(\beta ; Y)$ around $\beta^{(t)}$ :

$\ell(\beta ; Y) \approx \ell\left(\beta^{(t)} ; Y\right)+\left(\beta-\beta^{(t)}\right)^{\top} D \ell\left(\beta^{(t)} ; Y\right)+\frac{1}{2}\left(\beta-\beta^{(t)}\right)^{\top} D^{2} \ell\left(\beta^{(t)} ; Y\right)\left(\beta-\beta^{(t)}\right),$

where $D \ell$ and $D^{2} \ell$ are the gradient and Hessian of the log-likelihood. What is the difference between this algorithm and Iterative Weighted Least Squares? Why might the latter be preferable?

Paper 4, Section II, K

For 31 days after the outbreak of the 2014 Ebola epidemic, the World Health Organization recorded the number of new cases per day in 60 hospitals in West Africa. Researchers are interested in modelling $Y_{i j}$, the number of new Ebola cases in hospital $i$ on day $j \geqslant 2$, as a function of several covariates:

lab: a Boolean factor for whether the hospital has laboratory facilities,

casesBefore: number of cases at the hospital on the previous day,

urban: a Boolean factor indicating an urban area,

country: a factor with three categories, Guinea, Liberia, and Sierra Leone,

numDoctors: number of doctors at the hospital,

tradBurials: a Boolean factor indicating whether traditional burials are common in the region.

Consider the output of the following $R$ code (with some lines omitted):

> fit.1 <- glm(newCases ~ lab+casesBefore+urban+country+numDoctors+tradBurials,
+              data=ebola, family=poisson)

> summary(fit.1)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 0.011298 0.049498 0.228 0.8195

labTRUE 0.094731 0.050322 1.882 0.0598 .

casesBefore 0.324744 0.007752 41.891 <2e-16 ***

urbanTRUE -0.091554 0.088212 -1.038 0.2993

countryLiberia 0.088490 0.034119 2.594 0.0095 **

countrySierra Leone -0.197474 0.036969 -5.342 9.21e-08 ***

numDoctors -0.020819 0.004658 -4.470 7.83e-06 ***

tradBurialsTRUE 0.054296 0.031676 1.714 0.0865 .

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(a) Would you conclude based on the $z$-tests that an urban setting does not affect the rate of infection?

(b) Explain how you would predict the total number of new cases that the researchers will record in Sierra Leone on day 32.
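A hedged sketch of the prediction in (b): the two hospitals below are hypothetical, the coefficients are transcribed from the summary of fit.1, and Guinea is taken as the baseline country level.

```python
import math

# coefficient estimates transcribed from the summary of fit.1
beta = {
    "(Intercept)": 0.011298, "labTRUE": 0.094731, "casesBefore": 0.324744,
    "urbanTRUE": -0.091554, "countryLiberia": 0.088490,
    "countrySierra Leone": -0.197474, "numDoctors": -0.020819,
    "tradBurialsTRUE": 0.054296,
}

def predicted_cases(lab, cases_before, urban, country, num_doctors, trad_burials):
    # linear predictor on the log scale, then exponentiate to get the
    # Poisson mean; Guinea is the baseline country
    eta = beta["(Intercept)"]
    eta += beta["labTRUE"] * lab
    eta += beta["casesBefore"] * cases_before
    eta += beta["urbanTRUE"] * urban
    if country != "Guinea":
        eta += beta["country" + country]
    eta += beta["numDoctors"] * num_doctors
    eta += beta["tradBurialsTRUE"] * trad_burials
    return math.exp(eta)

# total for a country on day 32: sum the per-hospital predicted means,
# plugging each hospital's day-31 count into casesBefore
# (both hospitals below are hypothetical)
total = predicted_cases(1, 5, 0, "Sierra Leone", 10, 0) \
      + predicted_cases(0, 3, 1, "Sierra Leone", 4, 1)
```

The prediction is the sum over the Sierra Leone hospitals of $\exp \left(x_{i}^{\top} \hat{\beta}\right)$, with the observed day-31 counts entering through casesBefore.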

We fit a new model which includes an interaction term, and compute a test statistic using the code:

> fit.2 <- glm(newCases ~ casesBefore+country+country:casesBefore+numDoctors,
+              data=ebola, family=poisson)

> fit.2$deviance - fit.1$deviance

[1] 3.016138

(c) What is the distribution of the statistic computed in the last line?

(d) Under what conditions is the deviance of each model approximately chi-squared?

Paper 1, Section I, J

The outputs $Y_{1}, \ldots, Y_{n}$ of a particular process are positive and are believed to be related to $p$-vectors of covariates $x_{1}, \ldots, x_{n}$ according to the following model:

$\log \left(Y_{i}\right)=\mu+x_{i}^{T} \beta+\varepsilon_{i}$

In this model $\varepsilon_{i}$ are i.i.d. $N\left(0, \sigma^{2}\right)$ random variables where $\sigma>0$ is known. It is not possible to measure the output directly, but we can detect whether the output is greater than or less than or equal to a certain known value $c>0$. If

$Z_{i}= \begin{cases}1 & \text { if } Y_{i}>c \\ 0 & \text { if } Y_{i} \leqslant c\end{cases}$

show that a probit regression model can be used for the data $\left(Z_{i}, x_{i}\right), i=1, \ldots, n$.

How can we recover $\mu$ and $\beta$ from the parameters of the probit regression model?
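A numerical sketch of the reparametrisation (illustrative values; $\sigma$ and $c$ are known): the probit model fitted to $\left(Z_{i}, x_{i}\right)$ has intercept $\gamma_{0}=(\mu-\log c) / \sigma$ and slopes $\gamma=\beta / \sigma$, which can be inverted for $\mu$ and $\beta$.

```python
import math

def Phi(z):
    # standard normal CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# known quantities (illustrative values)
sigma, c = 2.0, 1.5

def prob_z_one(mu, beta, x):
    # P(Z=1) = P(log Y > log c) = Phi((mu + x'beta - log c) / sigma):
    # a probit model with intercept (mu - log c)/sigma and slopes beta/sigma
    eta = mu + sum(b * xi for b, xi in zip(beta, x)) - math.log(c)
    return Phi(eta / sigma)

def recover(gamma0, gamma):
    # invert the reparametrisation: mu = sigma*gamma0 + log c, beta = sigma*gamma
    mu = sigma * gamma0 + math.log(c)
    beta = [sigma * g for g in gamma]
    return mu, beta

mu, beta = 0.7, [0.3, -0.2]
gamma0 = (mu - math.log(c)) / sigma
gamma = [b / sigma for b in beta]
mu2, beta2 = recover(gamma0, gamma)
```

Round-tripping through the probit parameters returns the original $\mu$ and $\beta$ exactly, because $\sigma$ and $c$ are known.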

Paper 1, Section II, J

An experiment is conducted where scientists count the numbers of each of three different strains of fleas that are reproducing in a controlled environment. Varying concentrations of a particular toxin that impairs reproduction are administered to the fleas. The results of the experiment are stored in a data frame fleas in $\mathrm{R}$, whose first few rows are given below.

The full dataset has 80 rows. The first column provides the number of fleas, the second provides the concentration of the toxin and the third specifies the strain of the flea as factors 0,1 or 2 . Strain 0 is the common flea and strains 1 and 2 have been genetically modified in a way thought to increase their ability to reproduce in the presence of the toxin.

Explain and interpret the $\mathrm{R}$ commands and (abbreviated) output below. In particular, you should describe the model being fitted, briefly explain how the standard errors are calculated, and comment on the hypothesis tests being described in the summary.

Explain and motivate the following $\mathrm{R}$ code in the light of the output above. Briefly explain the differences between the models fitted below, and the model corresponding to fit1.

Denote by $M_{1}, M_{2}, M_{3}$ the three models being fitted in sequence above. Explain the hypothesis tests comparing the models to each other that can be performed using the output from the following $R$ code.

> c(fit1$dev, fit2$dev, fit3$dev)

[1] 56.87 56.93 76.98

> qchisq(0.95, df=1)

[1] 3.84

Use these numbers to comment on the most appropriate model for the data.

Paper 2, Section I, J

Let $Y_{1}, \ldots, Y_{n}$ be independent Poisson random variables with means $\mu_{1}, \ldots, \mu_{n}$, where $\log \left(\mu_{i}\right)=\beta x_{i}$ for some known constants $x_{i} \in \mathbb{R}$ and an unknown parameter $\beta$. Find the log-likelihood for $\beta$.

By first computing the first and second derivatives of the log-likelihood for $\beta$, describe the algorithm you would use to find the maximum likelihood estimator $\hat{\beta}$. [Hint: Recall that if $Z \sim \operatorname{Pois}(\mu)$ then

$\mathbb{P}(Z=k)=\frac{\mu^{k} e^{-\mu}}{k !}$

for $k \in\{0,1,2, \ldots\}$.]
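A stdlib sketch of the algorithm being asked for (Newton-Raphson on the scalar log-likelihood, which here coincides with Fisher scoring since the log link is canonical); the data are illustrative:

```python
import math

# illustrative data for the model log mu_i = beta * x_i
x = [1.0, 2.0, 3.0]
y = [1, 2, 3]

def score(beta):
    # l'(beta) = sum_i x_i (y_i - e^{beta x_i})
    return sum(xi * (yi - math.exp(beta * xi)) for xi, yi in zip(x, y))

def neg_hessian(beta):
    # -l''(beta) = sum_i x_i^2 e^{beta x_i} > 0, so l is strictly concave
    return sum(xi * xi * math.exp(beta * xi) for xi in x)

beta = 0.0
for _ in range(50):
    step = score(beta) / neg_hessian(beta)
    beta += step
    if abs(step) < 1e-12:
        break

# at convergence the score vanishes: beta is the MLE
assert abs(score(beta)) < 1e-8
```

Each iteration is $\beta \leftarrow \beta+\ell^{\prime}(\beta) /\left(-\ell^{\prime \prime}(\beta)\right)$; strict concavity of $\ell$ guarantees the update is well defined.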

Paper 3, Section I, J

Data are available on the number of counts (atomic disintegration events that take place within a radiation source) recorded with a Geiger counter at a nuclear plant. The counts were registered at each second over a 30 second period for a short-lived, man-made radioactive compound. The first few rows of the dataset are displayed below.

Describe the model being fitted with the following $\mathrm{R}$ command.

> fit1 <- lm(Counts ~ Time, data=geiger)

Below is a plot against time of the residuals from the model fitted above.

Referring to the plot, suggest how the model could be improved, and write out the $R$ code for fitting this new model. Briefly describe how one could test in $R$ whether the new model is to be preferred over the old model.

Paper 4, Section I, J