7CCMMS61- Statistics for Data Analysis
Logistics
This coursework assignment consists of two questions.
It covers the material taught in weeks 1 to 5. In one of the questions you will
have to use a data set which you can find in the Assessment section, on
KEATS.
Use RStudio to carry out your analysis of the dataset.
Please submit a PDF document with your answers, including an appropriate
description of the method you used to get your results. Where appropriate
include R code snippets, graphs and any tables you produced.
Each question carries a total of 20 marks.
The submission deadline is 5pm Friday 12th November 2021. The submission
box can be found in the Assessment section, on KEATS.
Questions
1. (a) A man was shot and killed while hunting in North Forest. It looks
like an accident, but the police detective in charge of the case is
questioning another hunter. This hunter is a person of interest
(suspect) because he is known to have had a grudge against the
victim. The detective found a pine needle stuck on the suspect’s
coat. The suspect claims to have been hunting in South Forest on the
day of the incident.
North Forest has only Species A pine trees, and South Forest has
only Species B pine trees. Species A pine trees have needle lengths
that are approximately normally distributed with a mean of 5.4 cm
and a standard deviation of 0.4 cm. Species B pine trees have needle
lengths that are approximately normally distributed with a mean of
6.4 cm and a standard deviation of 0.5 cm.
The case goes to court and the prosecution hires you to be the expert
statistician. The two hypotheses you are asked to test are:
H0 : The needle found on the suspect is from a Species B pine tree
Ha : The needle found on the suspect is from a Species A pine tree
i) Sketch the two distributions on the same axis. Make sure to
label the distributions. [2 marks]
ii) What is the direction of the ‘extreme’? Justify your answer.
[2 marks]
iii) Suppose that the pine needle found on the suspect’s coat
measures 5.42 cm. What is the corresponding p-value?
[2 marks]
iv) Assume α = 0.05. Give your decision and state your conclusion
using a well-written sentence that is clearly understood by both
the judge and the jury of this case. [2 marks]
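As a starting point for part (a), the two needle-length models can be drawn and queried in R along the following lines. This is a minimal sketch and not a model answer: the helper name `tail_prob` and the example length of 6.0 cm are illustrative assumptions that do not appear in the question, and the choice of tail is left to you.

```r
# Needle-length models taken from the question (all lengths in cm).
mean_a <- 5.4; sd_a <- 0.4   # Species A (North Forest)
mean_b <- 6.4; sd_b <- 0.5   # Species B (South Forest)

# Draw both normal densities on one set of axes and label them.
curve(dnorm(x, mean_a, sd_a), from = 3.5, to = 8.5,
      xlab = "Needle length (cm)", ylab = "Density")
curve(dnorm(x, mean_b, sd_b), add = TRUE, lty = 2)
legend("topright", legend = c("Species A", "Species B"), lty = c(1, 2))

# Hypothetical helper: tail probability of a normal model at x.
# lower = TRUE gives P(X <= x); lower = FALSE gives P(X > x).
tail_prob <- function(x, mean, sd, lower = TRUE) {
  pnorm(x, mean = mean, sd = sd, lower.tail = lower)
}

# Illustrative value only (not the measurement given in the question).
tail_prob(6.0, mean_b, sd_b, lower = TRUE)
```

Which distribution and which tail to use depends on the hypotheses and on your answer to part ii).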
(b) The file crime.csv includes data for 47 US states. More details
about the data set are included in the file ‘Description of datasets for
coursework 1’, which you can find on KEATS.
Your task is to carry out exploratory data analysis (EDA) in order to
address the following:
i) Describe the distribution of the continuous variables. You
should include all appropriate numerical summaries and
graphical summaries.
Think of the dataset you have, the other variables in that dataset and
whether any of these should be taken into account when producing the
summaries for the distribution of the continuous variables.
ii) Produce the appropriate numerical and graphical summaries
that can help answer the question of whether there has been a
change in crime rate after ten years.
Think carefully of what you want to present and also consider
numerical and graphical summaries to check if there is an association
between the quantitative variables.
Marks will be awarded for clear presentation of results and clear
description of the findings.
[12 marks]
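One possible way to begin the EDA in part (b) is sketched below. It is only a starting point under stated assumptions: the column names of crime.csv are not given here, so the code works on whichever columns R reads as numeric, and the helpers `describe_numeric` and `summarise_change` are hypothetical names introduced for illustration. Check the dataset description on KEATS before deciding which variables to summarise or to group by.

```r
# Numerical summaries for every numeric column of a data frame.
describe_numeric <- function(df) {
  num <- df[sapply(df, is.numeric)]
  t(sapply(num, function(x) c(
    n      = sum(!is.na(x)),
    mean   = mean(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    IQR    = IQR(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE)
  )))
}

# Summary of the change between two paired measurements
# (for example, the same states measured ten years apart).
summarise_change <- function(before, after) {
  d <- after - before
  c(mean_diff = mean(d), sd_diff = sd(d), correlation = cor(before, after))
}

# Runs only if crime.csv is in the working directory.
if (file.exists("crime.csv")) {
  crime <- read.csv("crime.csv")
  str(crime)                        # variable names and types
  print(describe_numeric(crime))    # one row per numeric variable
  num <- crime[sapply(crime, is.numeric)]
  for (v in names(num)) hist(num[[v]], main = v, xlab = v)
  if (ncol(num) > 1) pairs(num)     # scatterplot matrix for associations
}
```

To use `summarise_change`, pass the two crime-rate columns named in the dataset description; a boxplot of the differences or a scatterplot of one against the other would be natural graphical companions.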
2. Demand for tickets for concerts at the O2 arena in London is normally
distributed with mean µ and σ = 21. Based on the ticket demand of the
past seven years, it’s assumed that µ = 37. However, recent data suggest
that µ has increased enough to merit a refurbishment and inclusion of
more seats in the arena. In order to find out, the O2 arena manager came
up with a strategic project. His strategy is as follows: Select 49 concerts at
random and compute the average ticket demand. He is still assuming
that σ = 21. He then decides that he is going to proceed with the
refurbishment if the average ticket demand is greater than 40. If the
sample data suggest that average demand is greater than 40, he will
refurbish the arena. If the data does not suggest that average demand is
greater than 40, he will not refurbish the arena. All numbers are in
thousands. Answer the following:
(a) What is the variable of interest? What is the parameter of interest?
[2 marks]
(b) State the null and the alternative hypotheses. Sketch the
distribution of the variable the O2 manager is interested in, under
the null and the alternative hypothesis. [2 marks]
(c) What is the direction of the “extreme”? Explain your answer.
[2 marks]
(d) Recall that the O2 arena manager decided that he is going to proceed
with the refurbishment if average ticket demand is greater than 40.
i. Calculate the level of significance, α, and shade the area it
represents on the appropriate sketch. [3 marks]
ii. The O2 arena manager now decides to assume that µ = 38
under Ha. Calculate the chance of committing Type II error, β,
and shade the area it represents on the appropriate sketch.
[3 marks]
(e) If the O2 arena manager repeats his experiment two more times (i.e.,
draws 49 more events and calculates the average ticket demand each
time), what is the probability that the null hypothesis will be
rejected at least twice? [3 marks]
(f) Suppose that the O2 arena manager used his sample of 49 events
and calculated the average ticket demand to be 43. Calculate the
p-value and shade the area it represents on the appropriate sketch.
[3 marks]
(g) Based on the evidence the O2 arena manager has, should he go
ahead with the refurbishment? Explain your answer. [2 marks]
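The setup of Question 2 can be explored in R along the following lines. This is a hedged sketch and not a model answer: the helper names `prob_exceeds` and `prob_at_least` are hypothetical, and the values passed to `prob_at_least` in the last line are placeholders chosen purely for illustration.

```r
# Values stated in the question (thousands of tickets).
sigma  <- 21   # assumed population standard deviation
n      <- 49   # number of concerts sampled
mu0    <- 37   # mean demand assumed from the past seven years
cutoff <- 40   # manager's decision threshold for the sample mean

# Standard error of the sample mean: sigma / sqrt(n).
se <- sigma / sqrt(n)
se

# Sampling distribution of the sample mean when mu = mu0,
# with the decision threshold marked.
curve(dnorm(x, mu0, se), from = mu0 - 4 * se, to = mu0 + 4 * se,
      xlab = "Average ticket demand (thousands)", ylab = "Density")
abline(v = cutoff, lty = 2)

# Hypothetical helper: P(sample mean > cutoff) when the true mean is mu.
prob_exceeds <- function(cutoff, mu, se) {
  pnorm(cutoff, mean = mu, sd = se, lower.tail = FALSE)
}

# Hypothetical helper: P(at least k rejections in m independent
# repetitions), where each repetition rejects with probability p.
prob_at_least <- function(k, m, p) {
  pbinom(k - 1, size = m, prob = p, lower.tail = FALSE)
}

# Placeholder inputs only, not the values required by the question.
prob_at_least(k = 1, m = 2, p = 0.5)
```

Because individual demand is stated to be normal, the sample mean is normal too, which is why `dnorm` and `pnorm` are used with the standard error in place of σ.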