MAS223 Statistical Inference and Modelling
Practice questions

Note. Many of the questions below should be done by hand. However, for some of them I ask you to use R. Where R is required for a question, the question is prefaced with "R." immediately after the question number. All the other questions should be done by hand.

1. (Revision of 1st year probability and statistics).
A ‘celebrity psychic’ claims to be able to foresee the outcome of a coin toss. A fair coin is tossed 100 times, with the psychic giving a prediction on each occasion. Let $X$ be the number of correct predictions.
(a) Suggest a statistical model for $X$, and a null hypothesis that could be tested regarding the psychic’s claim.
(b) The psychic correctly predicts the coin toss 55 times out of 100. He states: “If I were guessing, I should only get 50 predictions right, and the probability of getting 55 right would be 0.048. This is a small probability (smaller than 0.05), which is evidence that I am not guessing.”
What is wrong with this argument?
(c) How close would you expect $X$ to be to 50, if the psychic were guessing? Specifically, suggest an interval $(50 - c, 50 + c)$ such that $P\{X \in (50 - c, 50 + c)\} \approx 0.95$. Hint: consider the normal approximation to the binomial distribution.
(d) Given the observation of $X = 55$ and using your answer in (c), conduct a 2-sided hypothesis test of your null hypothesis in (a), of size 0.05. Calculate the corresponding p-value, and comment on whether the experiment has provided evidence of psychic ability.

2. For dependent variable $y$ and independent variable $x$, which of the following are examples of linear models? (In each case, $i \in \{1, \dots, n\}$ and the $\varepsilon_i$ are i.i.d. with distribution $N(0, \sigma^2)$.)
(a) $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$
(b) $y_i = \beta_0 + \beta_1 x_i^{-1} + \varepsilon_i$
(c) $y_i = \beta_0 + \beta_1 x_i^2 e^{3x_i} + \varepsilon_i$
(d) $y_i = \beta_0 (x_0 + \beta_1 x_i) + \varepsilon_i$
(e) $y_i = \beta_0 / (1 + \beta_1 x_i) + \varepsilon_i$

3. Suppose we have data $(x, y) = (0, 2), (1, 5), (2, 4), (3, 8)$.
(a) Write a simple linear regression model for these data and put the model in matrix notation.
(b) Suppose we model the same data using a quadratic relationship, so that $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$ are i.i.d. Write this new model in matrix form.

4. Write the following models in matrix form, where $i \in \{1, \dots, n\}$:
(a) $y_i = \beta_0 + \beta_1 (x_i + x_i^2) + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$ are i.i.d.
(b) $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^3 + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$ are i.i.d.
(c) $y_i = \beta_0 + \beta_x x_i + \beta_{z,1} z_i + \beta_{z,2} z_i^2 + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$ are i.i.d.
(d) $y_i = \beta_0 + \beta_x x_i + \beta_z z_i^3 + \beta_w \log(w_i) + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$ are i.i.d.

5. Suppose we hypothesise that the price, $y_i$, of a house $i$ in Greater London is correlated with three continuous variables: size ($s_i$), proximity to the centre ($z_i$), and crime rate ($c_i$). Write down a linear model to represent these relationships. Put this model in matrix form.

6. Suppose we have data $(x, y) = (0, 2), (3, 4), (5, 5)$, modelled using the simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$ are i.i.d.
(a) Put the model in matrix notation.
(b) Find the least-squares estimator for $\beta_0$ and $\beta_1$.
(c) Plot the best-fit line together with a scatter plot of the data.

7. Repeat Question 6 for each of the following datasets
(a) $(x, y) = (0, 0), (1, 1), (2, 1)$
(b) $(x, y) = (1, 5), (2, 3), (4, 1)$
(c) $(x, y) = (0, 2), (1, 1), (2, 3)$
(d) $(x, y) = (0, 0), (10, 10), (20, 21)$
(e) $(x, y) = (10, 0), (5, 10), (0, 19)$

8. According to the principle of least squares, which one of the two lines $y = 3.6 + 2x$ or $y = -1.1 + 3x$ is a better fit to the data?

    x    3.1    4.8    6.4    7.2
    y    8.2   13.2   16.4   20.5

9. On my son’s 3rd birthday, he was 94cm tall. He was 102cm on his 4th birthday and 109cm tall on his 5th.
(a) Find a least-squares estimate for his mean growth rate between his 3rd and 5th birthdays, using a simple linear regression model.
(b) On his 2nd birthday, he was 84cm. Find a least-squares estimate for his mean growth rate between his 2nd and 5th birthdays, using a simple linear regression model.
(c) You should find that the mean growth rate between 2 and 5 is higher than between 3 and 5. This suggests that the rate is changing over time (i.e. growth is either accelerating or decelerating). As such, we use a quadratic relationship $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$ to model my son’s growth. Use least-squares to fit this curve to the data between ages 2 and 5. What is his deceleration in growth over this period?
(d) Use the quadratic relationship to calculate my son’s expected height at (i) 18 years old, (ii) 22 years old.
(e) Write a paragraph commenting on what this exercise tells you about extrapolating fitted models. Make sure you write your explanation clearly and unambiguously. (This is good practice as I like to ask exam questions that involve writing well-structured paragraphs. There will be more of this in later questions.)

10. Without doing any calculations, give two reasons why $y = 6.8 - 2x$ cannot be the least squares regression line of $y$ on $x$ for the following data.

    x    3.2    4.1    4.9    6.5    7.3
    y    9.2   11.0   14.2   17.4   21.5

11. A runner completes two laps of a course. He is timed over each lap and over both laps together. The recorded times, which are unbiased and equally reliable, are 52, 55 and 108 seconds. Write down a linear model in matrix notation to represent these data, and hence obtain the least squares estimates of the lap times.

12. At a Y-junction in an electrical circuit (see diagram above), the true current flowing along AO towards O must equal the sum of the true currents flowing along OB and OC away from O. The currents are measured (with error) through ammeters of the same accuracy, placed in AO, OB and OC, and observations $y_1$, $y_2$ and $y_3$ recorded. Show that the least squares estimates of the true currents in the branches AO, OB, OC are $(2y_1 + y_2 + y_3)/3$, $(y_1 + 2y_2 - y_3)/3$ and $(y_1 - y_2 + 2y_3)/3$ respectively.

13. Give an estimator for $\sigma^2$ for the model and data from Question 6. Calculate 95% confidence intervals for $\beta_0$ and $\beta_1$. You may use the fact that $t_{1,0.975} \approx 12.71$. Why are the confidence intervals so large and what might you do to try to make them smaller?

14. Question 9(a) described some data $(x, y) = (3, 94), (4, 102), (5, 109)$ modelled using a simple linear regression model. Find an estimator for $\sigma^2$ for this model. Calculate 99% confidence intervals for $\beta_0$ and $\beta_1$. You may use the fact that $t_{1,0.995} \approx 63.66$.

15. R. In the table below $x_i$ is the body weight (in kg) and $y_i$ is the blood volume (in 10 cm$^3$) for each of a random sample of 10 goats ($i = 1, \dots, 10$).

    x_i    34   28   19   41   21   20   21   39   37   23
    y_i   237  210  112  281  150  166  148  245  256  155

A scatter diagram for the data is given below.

[Figure: scatter diagram of $y$ against $x$.]

Suggest a suitable model for these data. Using R, estimate the model parameters, including the error variance. Using your model, estimate the mean blood volume of a goat with body weight 30 kg.

16. R.
Suppose that the amounts of a certain chemical dissolved in a fixed amount of water at various temperatures were determined as follows:

    Amount (gm) y         7   12   21   29   42   50   63
    Temp (degrees C) x   10   20   30   40   50   60   70

Assume that a straight line regression of amount on temperature is appropriate. Find the least squares best fitting line, an estimate of the error variance, and give 95% confidence intervals for the slope and intercept.

17. In Question 7, you fitted a simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$ are i.i.d., to five different datasets. For each of these five datasets,
(i) find the best-fit estimator for $\sigma^2$,
(ii) find 95% confidence intervals for $\beta_0$ and $\beta_1$ (you may use the fact that $t_{1,0.975} \approx 12.71$),
(iii) find a p-value for the hypothesis test of $H_0 : \beta_1 = 0$ against $H_1 : \beta_1 \neq 0$ in the form $P(F_{?,?} > ?)$ where the question marks are to be filled-in,
(iv) draw a scatter-plot of the data and super-impose a line for both the best-fit linear relationship and the best-fit constant relationship.
Don’t use R for any of this!

18. R. For each of the data-sets in Question 17, use R to find the numerical value of each p-value. For each, examine the scatter plots from part (iv). How do these plots relate to the p-values calculated?

19. R. Suppose that the statistic $F$ has, under the null hypothesis, an $F_{\nu_1,\nu_2}$ distribution. Use R to find the p-value when we reject for large $F$ and $(F, \nu_1, \nu_2) =$
(i) (6.487, 2, 15)
(ii) (3.938, 4, 12)
(iii) (2.031, 8, 22)
(iv) (0.645, 3, 8).

20. R. Use an F-test to test the null hypothesis that there is no relationship between temperature and amount for the data in Question 16.

21. For each of the data-sets in Question 7 (also studied in Question 17), find the R-squared values and explain what each tells you about how well a simple linear regression model fits the data.

22. R.
For the data in Question 16 and the simple linear regression model, compute (using R)
(a) the proportion of the total variation in the data explained by the regression fit;
(b) a 95% prediction interval for the expected amount when the temperature is 30 degrees C.

23. R. For each of the data-sets in Question 7 (also studied in Questions 17 and 21) use R to plot a scatterplot for each dataset. On each plot, superimpose the best-fit curve and the 95% prediction interval curves.

24. In 2016, the United Kingdom (UK) voted in a referendum on whether to leave the European Union (EU). In 2017, the UK had a general election where votes for members of parliament in each constituency were cast. It has been postulated that those British voters who voted Leave in 2016 were more likely to have voted for the Conservative Party in 2017. To test this, a simple linear regression model, $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, was fitted in R, where $i$ indexes the 638 constituencies in Britain, $y_i$ is the proportion of people in constituency $i$ who voted Conservative in 2017 (Con2017), and $x_i$ is the proportion of people in constituency $i$ who voted Leave in 2016 (Leave2016). The data were stored in the voter dataframe. Here is some R output:

    > lmvoter<-lm(Con2017~Leave2016,data=voter)
    > summary(lmvoter)

    Call:
    lm(formula = Con2017~Leave2016, data = voter)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.42895 -0.11230  0.02199  0.11750  0.27088

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  0.10773    0.02655   4.058 5.56e-05
    Leave2016    0.60237    0.04986  12.082   <2e-16

    Residual standard error: 0.1434 on 636 degrees of freedom
    Multiple R-squared: 0.1867, Adjusted R-squared: 0.1854
    F-statistic: 146 on 1 and 636 DF, p-value: < 2.2e-16

    > deviance(lmvoter)
    [1] 13.07105
    > deviance(lm(Con2017~1,data=voter))
    [1] 16.07136

(a) Write a paragraph reporting on the following aspects of the results:
• The hypothesis test used to test for a correlation between those who voted Leave in 2016 and those who voted Conservative in 2017
• What you conclude from this test and why
• The statistic that demonstrates how well a simple linear regression model fits the data
• What you can conclude from the value of this statistic
• To what extent the proportion of Leave votes in 2016 is a good predictor of the proportion of Conservative votes in 2017.
Make sure you write your answer in properly-constructed, grammatically-correct sentences.
(b) Calculate the 95% confidence interval for $\beta_1$. You are given that $t_{638,0.975} = 1.963689$, $t_{638,0.95} = 1.647245$, $t_{636,0.975} = 1.963701$, $t_{636,0.95} = 1.647253$.
(c) Suppose you know that 40% of people voted Leave in a particular constituency. What is the expected proportion of Conservative votes in that constituency in 2017?

25. Ten patients are divided at random into three groups. Suppose that each group receives a different treatment for a month and the data below indicate the individuals’ responses to the treatments.

    Group 1: 2, 4
    Group 2: 7, 9, 9, 11
    Group 3: 8, 9, 10, 13

Is there any evidence that mean responses differ between the three groups? You may use R to calculate the p-value, given that you already have it in the form $P(F_{?,?} > ?)$, but don’t use R for anything else in this question!

26. R. In order to investigate differences in mean weight gain due to different feed-stuffs, forty cows are divided at random into 4 groups of 10. Each group received one of the four feed-stuffs for one month.
At the end of the month the weight gain of each cow is recorded and the ANOVA table below is obtained:

    Source       Df   Sum of Sq   Mean Sq       F
    feed-stuff    3     249.864    83.288   8.545
    error        36     350.892     9.747

Is there evidence that feed-stuffs lead to different mean weight gains over one month?

27. For the one-way ANOVA model $y_{i,j} = \mu + \tau_i + \varepsilon_{i,j}$, with $i = 1, 2$ and $j = 1, 2$, if the constraint $\tau_1 = 0$ is applied, prove that $\hat{\mu} = (y_{1,1} + y_{1,2})/2$ and $\hat{\tau}_2 = (y_{2,1} + y_{2,2} - y_{1,1} - y_{1,2})/2$, where $\hat{\mu}$ and $\hat{\tau}_2$ are the least squares estimators of $\mu$ and $\tau_2$.

28. In an investigation into the relationship between annual income and social class, social class is allocated to one of the five ordered categories A, B, C, D, E.
(a) One investigator codes A, B, C, D, E as 1, 2, 3, 4, 5 respectively, and fits the regression $E(Y_i) = \mu_i = \beta_0 + \beta_1 x_i$, where $Y_i$ is the income and $x_i$ the coded social class. Explain the interpretation of $\beta_1$.
(b) Another investigator thinks that A, B, C are more different than C, D, E and suggests using the codes 1, 3, 5, 6, 7 in the above regression. Interpret $\beta_1$ in this case.
(c) A third investigator says that as social class is a categorical variable, one should use indicator variables, and fits the regression $\mu_i = \alpha_0 + \alpha_1 x_{i,1} + \alpha_2 x_{i,2} + \alpha_3 x_{i,3} + \alpha_4 x_{i,4}$ where $x_{i,1}$ is an indicator variable for category A (i.e. $x_{i,1} = 1$ if observation $i$ is from category A, and 0 otherwise), $x_{i,2}$ for category B, $x_{i,3}$ for category C, and $x_{i,4}$ for category D. Interpret the $\alpha$’s in this case.
(d) Draw two graphs: one for the model in part (a) and the other for the model in part (c).
(e) Which model would you use?

29. In an experiment, 90 participants are asked to predict the outcome of 50 coin tosses. The participants are then ranked according to their number of correct predictions, and split into three groups, labeled 1, 2 and 3, with group 1 containing the 30 worst performing participants, and group 3 containing the 30 best performing participants.
Members of group 1 are then put through a training course to improve their forecasting,
and members of group 3 are rewarded with a bonus payment. All 90 participants then predict the outcome
of another 50 coin tosses. Define $y_{ij}$ to be the improvement in the number of correct predictions for member $j$
in group $i$ (number of correct predictions in round 2 − number of correct predictions in round 1).
(a) Using suitable notation, write down the model that has been fitted in R. Explain what your parameters represent, and state the estimates of each parameter.

    > group<-as.factor(c(rep("one",30),rep("two",30),rep("three",30)))
    > lm(y~group-1)

    Call:
    lm(formula = y ~ group - 1)

    Coefficients:
      groupone  groupthree    grouptwo
         3.733      -4.234         0.7

(b) Using the output below, test the null hypothesis of no difference in improvement between the three groups. (The command var(x) in R calculates the sample variance of a vector of observations x.)

    > deviance(lm(y~group))
    [1] 1185.533
    > var(y)
    [1] 24.22022
    > 1-pf(35.594,2,87)
    [1] 5.070389e-12

(c) What assumptions have you made in the previous part, and how would you check them?
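
As a hedged illustration of how the three pieces of R output in Question 29(b) fit together, the sketch below rebuilds an F statistic from the residual deviance and the sample variance alone. It assumes the usual one-way ANOVA F-test, with the total sum of squares recovered as $(n - 1)$ times the sample variance. The variable names (rss_full, rss_null, f_stat and so on) are illustrative and do not come from the question.

```r
# Values copied from the R output shown in Question 29(b).
rss_full <- 1185.533   # deviance(lm(y ~ group)): residual sum of squares with group means
var_y    <- 24.22022   # var(y): sample variance of all the improvements
n        <- 90         # total number of participants
k        <- 3          # number of groups

# Assumption: the null model has one common mean, so its residual sum of
# squares is the total sum of squares about the overall mean, (n - 1) * var(y).
rss_null <- (n - 1) * var_y

# Degrees of freedom for the F-test of "no difference between groups".
df1 <- k - 1
df2 <- n - k

# F statistic: drop in residual sum of squares per extra parameter,
# divided by the residual mean square of the larger model.
f_stat <- ((rss_null - rss_full) / df1) / (rss_full / df2)

# Upper-tail probability; equivalent to 1 - pf(f_stat, df1, df2).
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

cat(sprintf("F = %.3f on %d and %d df, p-value = %.3g\n",
            f_stat, df1, df2, p_value))
```

Under these assumptions the statistic agrees, up to rounding, with the argument 35.594 passed to pf in the output above. The same pattern applies to any pair of nested linear models once both residual sums of squares are known.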