STAT5002: Introduction to Statistics
Instructions:
1. You are required to type up your entire assignment, including any equations. If you are using Word,
you should use the equation editor for any maths notation.
2. Copy and paste relevant R code and outputs while discussing your answer in the text. Do not put
all R code and outputs at the end of the document.
3. Answer all questions in the given order; i.e., 1(a), 1(b), etc. Keep your answers clear, brief, and concise.
4. Convert your assignment to PDF and upload it to the Turnitin assignment
box on Canvas.
5. Data used in this assignment are in the spreadsheet ADataset.xlsx.
6. You MUST write up solutions on your own. Do not discuss the assignment with your classmates.
Students caught cheating will automatically receive a mark of 0 and are subject to disciplinary action.
7. This assignment carries a weight of 8% towards your final mark for STAT5002.
1. MJ, Katie and Sherwin are part of a large class that sat end-of-semester examinations in
English, Computers and Mathematics. The marks in these examinations were approximately normally
distributed. Each exam had a maximum mark of 100. The results for these students are given in
the following table, together with the class means and standard deviations.
Subject       MJ   Katie   Sherwin   Class mean   Class standard deviation
English       50   60      85        60           10
Computers     63   70      72        65           5
Mathematics   65   60      55        55           15
(a) Convert each student’s marks to z-scores.
(b) Relative to the rest of the students: Which is MJ’s worst result? Did MJ get a better result in
Computers or Mathematics?
(c) Which is better: MJ’s mark in Mathematics or Katie’s mark in Computers?
(d) Standardise all the marks to a mean of 50 and a standard deviation of 15. Add each student’s
standardised marks to give them an “aggregate” or total mark out of 300. Which student had
the best “aggregate”?
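The arithmetic behind parts (a) and (d) can be sketched as follows. This is only an illustration in Python (the assignment itself expects R), and the names z_score, rescale, z_scores and aggregates are hypothetical helpers, not part of the assignment. A z-score is (mark - class mean) / class standard deviation, and a mark on the new scale is 50 + 15 × z.

```python
# Marks and class summaries copied from the table in Question 1.
marks = {
    "MJ":      {"English": 50, "Computers": 63, "Mathematics": 65},
    "Katie":   {"English": 60, "Computers": 70, "Mathematics": 60},
    "Sherwin": {"English": 85, "Computers": 72, "Mathematics": 55},
}
class_stats = {  # subject -> (class mean, class standard deviation)
    "English": (60, 10),
    "Computers": (65, 5),
    "Mathematics": (55, 15),
}


def z_score(mark, mean, sd):
    """Number of class standard deviations a mark lies above the class mean."""
    return (mark - mean) / sd


def rescale(z, new_mean=50, new_sd=15):
    """Map a z-score onto a scale with the requested mean and standard deviation."""
    return new_mean + new_sd * z


# z-score of every mark, keyed by student and then by subject.
z_scores = {
    student: {subj: z_score(m, *class_stats[subj]) for subj, m in subjects.items()}
    for student, subjects in marks.items()
}

# Aggregate out of 300: the sum of the three rescaled marks per student.
aggregates = {
    student: sum(rescale(z) for z in zs.values())
    for student, zs in z_scores.items()
}

for student in marks:
    print(student, z_scores[student], round(aggregates[student], 1))
```

Because z-scores are unit-free, they can be compared across subjects and across students, which is what parts (b) and (c) rely on.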
2. The following data were collected in a study of the effect of dissolved sulfur on the surface tension
of liquid copper (Baes and Kellogg, 1953).
X       Y (replicate 1)   Y (replicate 2)
0.034   301               316
0.093   430               422
0.3     593               586
0.4     630               618
0.61    656               642
0.83    740               714
where X = Weight % Sulfur and Y = Decrease in Surface Tension (dynes/cm), two replicates.
(a) Produce a scatter plot of “Decrease in Surface Tension (dynes/cm)” versus “Weight % Sulfur”.
Make sure you label your axes properly and that your graph has an appropriate title. Briefly
describe the nature of the relationship between these two variables.
(b) Assuming that X is transformed to ln(X) and a linear model is considered, which choice of Y
gives better results, Y or ln(Y)? Why? Write down the model of your choice.
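One way to explore part (b) is to fit both candidate models by ordinary least squares and inspect them side by side. The sketch below is a Python illustration only (the assignment expects R), fit_line is a hypothetical helper, and treating the two replicates as 12 independent observations is an assumption. Note that R² values computed on different response scales (Y versus ln(Y)) are only a rough guide; residual plots should be examined as well.

```python
import math

# Data copied from the table in Question 2; each X has two replicate Y values.
x = [0.034, 0.093, 0.3, 0.4, 0.61, 0.83]
y_rep1 = [301, 430, 593, 630, 656, 740]
y_rep2 = [316, 422, 586, 618, 642, 714]

# Stack the replicates into one sample of 12 (X, Y) pairs (an assumption).
xs = x + x
ys = y_rep1 + y_rep2


def fit_line(u, v):
    """Ordinary least squares of v on u; returns (intercept, slope, R^2)."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    s_uu = sum((a - u_bar) ** 2 for a in u)
    s_vv = sum((b - v_bar) ** 2 for b in v)
    s_uv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    slope = s_uv / s_uu
    intercept = v_bar - slope * u_bar
    r_squared = s_uv ** 2 / (s_uu * s_vv)
    return intercept, slope, r_squared


log_x = [math.log(a) for a in xs]
fit_y = fit_line(log_x, ys)                             # Y on ln(X)
fit_log_y = fit_line(log_x, [math.log(b) for b in ys])  # ln(Y) on ln(X)

print("Y     on ln(X): intercept=%.3f slope=%.3f R^2=%.4f" % fit_y)
print("ln(Y) on ln(X): intercept=%.3f slope=%.3f R^2=%.4f" % fit_log_y)
```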
3. The dataset Q3 contains the following information on a sample of n = 36 severely depressed
individuals.
Variable   Description
Eff        Measure of the effectiveness of the treatment
Age        Age (years)
Tmt        Treatment received (A, B or C)
(a) Produce a scatter plot of Eff versus Age. What does it show?
(b) Run a regression of Eff on Age. Write down the fitted regression equation. Does this model
tell us anything about the comparative effectiveness of the three treatments?
(c) Refer to part (b). Calculate the total variation in Eff. What percentage of the total variation
in Eff is explained by the model?
(d) Produce another scatter plot of Eff versus Age but this time with colour coding and different
regression lines for each of the three treatments. Does the treatment appear to interact with
age in explaining the response? Explain why or why not.
(e) Code up dummy variables for treatments A and B as well as an interaction between Age and
each of treatments A and B. For this part of the question, just attach the R code to show how
you create the dummies and interaction terms.
(f) Using Age, the dummies, and the interactions as predictors, perform backward elimination
using the AIC to obtain the best model. Write down the final estimated regression
equation. What percentage of the total variation in Eff is explained by the model?
(g) Use the partial F test to determine which model [the one in part (b) or the one in part (f)] is
better. Include mention of H0 and H1, the observed value of the test statistic, the p-value, a
decision, and a conclusion.
(h) Predict the effectiveness of treatments for the following people:
Patient   Age   Tmt
John      20    A
Mary      56    B
Felix     69    C
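For reference, the partial F test in part (g) compares a reduced model nested inside a full model. The statistic can be written as below; the symbols are notation assumed here rather than taken from the assignment: SSE_R and SSE_F are the residual sums of squares of the reduced and full models, and df_R and df_F are their residual degrees of freedom.

```latex
F = \frac{(SSE_R - SSE_F)/(df_R - df_F)}{SSE_F / df_F},
\qquad
F \sim F_{\,df_R - df_F,\; df_F} \quad \text{under } H_0
```

Here H0 states that all coefficients present in the full model but absent from the reduced model are zero. In R, calling anova() on the two fitted lm objects reports this statistic and its p-value.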
4. The following contingency table shows the number of women who have just given birth, grouped by
whether the baby had a low birth weight and by the race of the mother.
                                Race
Low                             White   Black   Other   Total
No (birth weight > 2500 g)      144     57      136     337
Yes (birth weight ≤ 2500 g)     100     15      36      151
Total                           244     72      172     488
(a) At the 0.05 level of significance, is there evidence of a significant association between the race
of the mother and a baby of low birth weight? Include mention of H0 and H1, the observed
value of the test statistic, the p-value, a decision, and a conclusion.
(b) What are the odds of a white mother having a baby of low birth weight?
(c) What are the odds of a black mother having a baby of low birth weight?
(d) What is the odds ratio for white versus black?
(e) Fit a logistic regression of a baby of normal birth weight on race. Treat black as the base group.
Write down the fitted regression equation.
(f) Refer to part (e). What is the odds ratio for white versus black? Is it the same as your
calculation in part (d)? Explain why or why not.
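The quantities in parts (a) to (d) can be computed directly from the table counts. The following is a minimal Python sketch (the assignment expects R); counts, odds_of_low and the other names are hypothetical. The odds of an event are count(event) / count(no event), and the odds ratio is the ratio of two such odds. The p-value line uses the fact that, for 2 degrees of freedom, the chi-square upper-tail probability is exp(-x / 2).

```python
import math

# Counts copied from the contingency table in Question 4.
# Each entry is (normal birth weight, low birth weight).
counts = {
    "White": (144, 100),
    "Black": (57, 15),
    "Other": (136, 36),
}


def odds_of_low(race):
    """Odds of a low-birth-weight baby: count(low) / count(not low)."""
    normal, low = counts[race]
    return low / normal


odds_white = odds_of_low("White")
odds_black = odds_of_low("Black")
odds_ratio = odds_white / odds_black  # white versus black

# Pearson chi-square statistic for independence of race and low birth weight.
row_totals = [sum(c[i] for c in counts.values()) for i in range(2)]
grand_total = sum(row_totals)
chi_sq = 0.0
for normal_low in counts.values():
    col_total = sum(normal_low)
    for observed, row_total in zip(normal_low, row_totals):
        expected = row_total * col_total / grand_total
        chi_sq += (observed - expected) ** 2 / expected

# df = (2 - 1) * (3 - 1) = 2; for 2 degrees of freedom the chi-square
# upper-tail probability has the closed form exp(-x / 2).
p_value = math.exp(-chi_sq / 2)

print("odds (white):", odds_white)
print("odds (black):", odds_black)
print("odds ratio (white vs black):", odds_ratio)
print("chi-square:", chi_sq, "p-value:", p_value)
```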