STAT802 Advanced Topics in Analytics
Assignment 2
Paper Description: Advanced Topics in Analytics
Paper Code: STAT802
Total Marks: 100
Date: 23 April 2021
Due date: 26 May 2021 at 10am
INSTRUCTIONS:
1. Only documents in Portable Document Format (PDF) will be accepted. You can use, e.g., LaTeX, Word,
knitr or Sweave to create your report, and RStudio as the editor for the source files.
2. Submit the pdf file via Blackboard.
3. Formats other than pdf will be ignored and the author will be asked to re-submit the
assignment. Resubmissions will be subject to the late policy outlined in the study guide
(i.e. 5% per day up to 5 days).
4. The R and SAS code required to complete this assignment, including code that supports
your conclusions and answers, must be embedded in the document in the corresponding answer
as text (not as an image), unless otherwise specified. This code will be marked. Unsolicited
SAS and R scripts submitted separately will not be marked.
5. It is not necessary to copy and paste the question text, just make reference to each question,
e.g., Answer Question 1 (a), Answer Question 1 (b), ... , etc.
6. Read carefully – Answer all the questions as requested. Any material or information
unrelated to the correct answer may result in a significant reduction of marks for that question.
7. Do not forget to fill in and sign the cover sheet which must be the very first page in the pdf.
Use software such as Adobe Acrobat Pro on the Uni computers to include the file at the start
of your document. Do not submit the cover sheet separately.
8. If you need an extension or if your performance has been impacted by some extenuating
circumstances, then you must complete the special consideration form on Blackboard.
9. The comprehension of the questions is part of the assignment.
Grade table:
Question: 1 2 Total
Points: 30 70 100
Score:
23 April 2021 STAT802— Advanced Topics in Analytics Page 1 of 5
Student ID: STAT802— Assignment 2 Semester 1, 2021
QUESTIONS:
1. The data come from the Western Collaborative Group Study, which was carried out in California
in 1960-61 and studied 3,154 middle-aged men to investigate the relationship between behaviour
pattern and the risk of coronary heart disease. These particular data were obtained from
the 40 heaviest men in the study (all weighing at least 225 pounds) and record cholesterol
measurements (mg per 100 ml), and behaviour type on a twofold categorization. In general
terms, type A behaviour is characterized by urgency, aggression and ambition, while type B
behaviour is relaxed, non-competitive and less hurried. The question of interest is whether,
in heavy middle-aged men, cholesterol level is related to behaviour type.
Total for Question 1: 30 marks
Type A behaviour: cholesterol levels
233 291 312 250 246 197 268 224 239 239
254 276 234 181 248 252 202 218 212 325
Type B behaviour: cholesterol levels
344 185 263 246 224 212 188 250 148 169
226 175 242 252 153 183 137 202 194 213
Source: Selvin, S. (1991) Statistical analysis of epidemiological data, New York: Oxford
University Press, Table 2.1
Use the resampling methods studied in class and perform the analyses in SAS to answer
the following questions:
(a) Does the Type A behaviour group differ from the Type B group in terms of average
cholesterol level? Justify your answer. Include the corresponding hypotheses and a histogram
that clearly marks the statistic used for the test. [Marks: 2 SAS code + 5 rest] [7 marks]
(b) If the previous analysis shows a difference in the average cholesterol levels of the Type A
and Type B groups, calculate a 95% bootstrap confidence interval. Interpret
your results. [Marks: 2 SAS code + 5 rest] [7 marks]
(c) What is your advice as an analytics specialist? Justify your answer. Write a brief report with
your findings. [4 marks]
(d) The patients in group A voluntarily took part in a therapy intended to change their
behaviour for the better. Some previous studies were used to convince them. The therapy
is new and has not been tested before. Their initial and final cholesterol levels, i.e., before
and after the therapy, were recorded and are displayed in the table below:
Patient 1 2 3 4 5 6 7 8 9 10
Before 233 291 312 250 246 197 268 224 239 239
After 247 283 282 185 254 172 243 221 234 240
Patient 11 12 13 14 15 16 17 18 19 20
Before 254 276 234 181 248 252 202 218 212 325
After 259 279 215 129 237 234 178 165 188 310
As a data analyst, you have to assess the therapy. What would you conclude? Write a brief
report with your findings. It should answer the following questions: Was the therapy
effective for all the patients in the study? Is the therapy effective? If so, quantify the
difference between the groups. Justify all your conclusions. [Marks: 3 SAS code + 9
rest] [12 marks]
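For orientation only, the two resampling ideas behind parts (a) and (b) can be sketched as follows. This is a minimal Python illustration, not the SAS analysis the question asks for. The function names, the number of resamples, the seed, the use of the difference in means as the test statistic and the percentile form of the bootstrap interval are all assumptions made for the sketch.

```python
import random
import statistics

# Cholesterol levels (mg per 100 ml) copied from the tables above.
type_a = [233, 291, 312, 250, 246, 197, 268, 224, 239, 239,
          254, 276, 234, 181, 248, 252, 202, 218, 212, 325]
type_b = [344, 185, 263, 246, 224, 212, 188, 250, 148, 169,
          226, 175, 242, 252, 153, 183, 137, 202, 194, 213]


def mean_diff(a, b):
    """Test statistic assumed here: mean(a) - mean(b)."""
    return statistics.fmean(a) - statistics.fmean(b)


def permutation_p_value(a, b, n_perm=5000, seed=802):
    """Two-sided permutation p-value for the difference in means."""
    rng = random.Random(seed)
    observed = mean_diff(a, b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the pooled sample
        diff = mean_diff(pooled[:len(a)], pooled[len(a):])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (extreme + 1) / (n_perm + 1)


def bootstrap_percentile_ci(a, b, n_boot=5000, level=0.95, seed=802):
    """Percentile bootstrap interval for mean(a) - mean(b)."""
    rng = random.Random(seed)
    # Resample each group with replacement, keeping the group sizes.
    diffs = sorted(
        mean_diff(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    lower = diffs[int(tail * n_boot)]
    upper = diffs[int((1.0 - tail) * n_boot) - 1]
    return lower, upper


observed = mean_diff(type_a, type_b)
p_value = permutation_p_value(type_a, type_b)
ci_low, ci_high = bootstrap_percentile_ci(type_a, type_b)
print(f"observed difference: {observed:.2f}")
print(f"permutation p-value: {p_value:.4f}")
print(f"95% bootstrap CI: ({ci_low:.2f}, {ci_high:.2f})")
```

The permutation step shuffles the group labels, which is what the null hypothesis of no difference allows. The bootstrap step resamples within each group, because the interval is meant to describe the difference itself. Part (d) uses paired before/after values, so any resampling there would need to work on the within-patient differences rather than on two independent groups.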
2. By the early years of the 18th century, astronomers had fairly well determined the relative
dimensions of the solar system, the relative distances between planetary orbits and between
the planets and the sun. However, they lacked precise information on the absolute dimensions
of the solar system, and were eager to determine even one such distance in miles, in particular
the mean distance from the earth to the sun, from which all others could be found. Actually,
the quantity that 18th century astronomers chose to pursue was the parallax of the sun, the
angle subtended by the earth’s radius, as if viewed and measured from the surface of the sun.
From this angle and available knowledge of the physical dimensions of the earth, the mean
distance from earth to sun (or astronomical unit) could be easily determined.
The astronomer Edmund Halley (1656-1742) is generally credited with having been the first to
suggest that the parallax of the sun could be determined by observing a “transit of Venus,” the
apparent passage of the planet across the face of the sun, as viewed from earth. If observers
were dispatched to all corners of the globe from which this transit would be visible, and they
carefully recorded their positions and the elapsed time of the transit (on the order of 5.5 hours),
then each pair of observers would furnish one determination of the parallax of the sun.
Unfortunately for the implementation of Halley’s plan, transits of Venus are quite rare, owing
to the 3°36′ inclination between the orbits of Venus and Earth. The first recorded transit
of Venus occurred in 1639 and was observed only in England; the next transits were due in
1761 and 1769, with later transits due in 1874, 1882, 2004 and 2012. By 1761, interest in the
forthcoming transit was high, and observations were made at the Cape of Good Hope, and in
Calcutta, Rome, Stockholm and most other European observatories. A good account of the
transits of 1761 and 1769 can be found in Woolf (1959); an excellent discussion of the data
generated by these transits is given by Newcomb (1891).
James Short was a maker of optical instruments, particularly telescopes. He was born in
Edinburgh, Scotland and entered the University of Edinburgh in 1726, where he attended
lectures by the mathematician Colin Maclaurin. Short is perhaps best remembered for his
observations of the transit of Venus on June 6, 1761, one of the two most important astronomical
events of the mid-eighteenth century (the other one was the transit of Venus in 1769).
Short recorded 53 observations and analysed them in a “robust” manner. He took the mean of
all n = 53 determinations (8.61), then rejected all results differing from 8.61 by more than 1.00
and obtained the mean of the remainder (8.55), then rejected all results differing from 8.61 by
more than 0.50 and obtained the mean of the remainder (8.57); finally he took the mean of
8.61, 8.55, 8.57 to obtain the sun’s parallax as 8.58. He applied similar analyses to other data
sets of the same phenomenon.
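Short's procedure, as described above, can be written as a short function. The sketch below is a Python illustration under stated assumptions: the function name `short_estimate` is hypothetical, a value exactly at a cutoff is kept, and the toy list is made up because the observations in parallax.txt are not reproduced here.

```python
import statistics


def short_estimate(values, cutoffs=(1.00, 0.50)):
    """Return (overall mean, trimmed means, final estimate).

    Each trimmed mean keeps only the values within a cutoff of the
    overall mean. The final estimate is the mean of the overall mean
    and the trimmed means, as in the description above.
    """
    overall = statistics.fmean(values)
    trimmed = []
    for cutoff in cutoffs:
        kept = [v for v in values if abs(v - overall) <= cutoff]
        trimmed.append(statistics.fmean(kept))
    final = statistics.fmean([overall] + trimmed)
    return overall, trimmed, final


# Made-up values for illustration only; these are NOT Short's observations.
toy = [8.0, 8.5, 8.6, 8.7, 9.0, 10.2]
overall, trimmed, final = short_estimate(toy)
print(round(overall, 3), [round(t, 3) for t in trimmed], round(final, 3))

# Last step of the procedure with the three means reported in the text.
print(round(statistics.fmean([8.61, 8.55, 8.57]), 2))
```

Both rejection steps are measured from the overall mean, matching the text, which rejects results "differing from 8.61" in each pass. The final line reproduces the reported 8.58 from the three means quoted above.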
The parallax.txt file contains James Short’s measurements of the parallax of the sun (in
seconds of a degree), based on the 1761 transit of Venus. The data were originally published
in Philosophical Transactions of the Royal Society of London. The table with the observations
and descriptive information here are adapted from an article by Stephen Stigler in the Annals
of Statistics. The number of cases is 53 and the true value is 8.798. The parallax of the sun is
the angle subtended by the earth, as seen from the surface of the sun.
The data are available at http://www.randomservices.org/
References:
Newcomb, S. (1891). Discussion of observations of the transits of Venus in 1761 and 1769.
Astronomical Papers 2 259-405, U.S. Nautical Almanac Office.
Stigler, S.M. (1977). Do Robust Estimators Work with Real Data? The Annals of Statistics,
5, 1055-1098.
Woolf, H. (1959). The transits of Venus. Princeton Univ. Press.
Suppose that you are James Short’s data analyst at that time and you are asked to analyse
his data in a Bayesian way.
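Assume that $X_i \mid \mu, \sigma \sim N(\mu, \sigma^2)$ for $i = 1, \ldots, n$, with $\mu \sim N(\mu_0, \sigma_0^2)$ and $\sigma^2 \sim \mathrm{InvGamma}(\alpha, \beta)$.
It can be shown that
\[
\mu \mid \sigma, X_1, \ldots, X_n \sim N\left( \frac{\dfrac{n\bar{x}}{\sigma^2} + \dfrac{\mu_0}{\sigma_0^2}}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_0^2}},\; \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right)^{-1} \right)
\]
\[
\sigma^2 \mid \mu, X_1, \ldots, X_n \sim \mathrm{InvGamma}\left( \frac{n}{2} + \alpha,\; \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^2 + \beta \right).
\]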
Note that small α (shape) and β (scale) values have a small impact on the conditional distribution
of σ².
Total for Question 2: 70 marks
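To show how the two full conditionals above fit together, here is a minimal Gibbs sampling sketch in Python. It illustrates the alternation only and is not a model answer. The function name, starting values, hyperparameter values, chain length, burn-in and seed are all assumptions, and the data are synthetic rather than Short's measurements. The inverse gamma draw is obtained as the reciprocal of a gamma draw with the same shape and with scale equal to one over the inverse gamma scale.

```python
import random
import statistics


def gibbs_normal(x, mu0, sigma0_sq, alpha, beta,
                 n_iter=5000, burn_in=1000, seed=802):
    """Alternate draws from the two full conditionals given above."""
    rng = random.Random(seed)
    n = len(x)
    xbar = statistics.fmean(x)
    # Starting values: sample mean and sample variance (an arbitrary choice).
    mu, sigma_sq = xbar, statistics.variance(x)
    mu_draws, sigma_sq_draws = [], []
    for it in range(n_iter):
        # mu | sigma^2, x ~ N(mean, 1 / precision)
        precision = n / sigma_sq + 1.0 / sigma0_sq
        mean = (n * xbar / sigma_sq + mu0 / sigma0_sq) / precision
        mu = rng.gauss(mean, precision ** -0.5)
        # sigma^2 | mu, x ~ InvGamma(shape, scale), drawn as 1 / Gamma.
        shape = n / 2.0 + alpha
        scale = 0.5 * sum((xi - mu) ** 2 for xi in x) + beta
        sigma_sq = 1.0 / rng.gammavariate(shape, 1.0 / scale)
        if it >= burn_in:  # discard the burn-in iterations
            mu_draws.append(mu)
            sigma_sq_draws.append(sigma_sq)
    return mu_draws, sigma_sq_draws


# Synthetic data for illustration only; these are NOT Short's measurements.
data_rng = random.Random(1)
data = [data_rng.gauss(10.0, 0.5) for _ in range(50)]
mu_draws, sigma_sq_draws = gibbs_normal(
    data, mu0=0.0, sigma0_sq=100.0, alpha=0.01, beta=0.01)
print(f"posterior mean of mu: {statistics.fmean(mu_draws):.3f}")
print(f"posterior mean of sigma^2: {statistics.fmean(sigma_sq_draws):.3f}")
```

The stored draws are what trace plots, effective sample sizes and credible intervals would be computed from. Those diagnostics are left out of the sketch.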
(a) Identify and define the parameters of the model. [5 marks]
(b) Specify the prior distributions. Justify your answer. [10 marks]
(c) Implement a Gibbs sampling algorithm. Comment on the MCMC specifications and the
effective sample sizes (ESS). Generate trace plots of the posterior draws of the parameters and
comment on them. [Marks: 8 code + 12 brief report] [20 marks]
(d) Generate a 95% credible interval for µ. Interpret your results and propose a point estimate
for this parameter. Justify your answer. [10 marks]
(e) We now know that the true value of the parallax of the sun is 8.798. What do you conclude
about Short’s and Bayesian estimates? Justify your answer. Hint: consider the relative
error. [10 marks]
(f) Calculate the 95% confidence interval for µ and interpret it in the context of this problem.
Compare it to the credible interval calculated previously and discuss their difference. [10 marks]
(g) Write down the JAGS model. [5 marks]
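The relative error mentioned in the hint to part (e) can be computed as in the small Python sketch below. The function name is hypothetical, and taking the absolute value of the error is an assumption about how the hint is meant. Only the two figures quoted in this assignment are used: Short's estimate 8.58 and the true value 8.798.

```python
def relative_error(estimate, true_value):
    """Absolute relative error: |estimate - true| / |true|."""
    return abs(estimate - true_value) / abs(true_value)


TRUE_PARALLAX = 8.798   # true value stated in the question
SHORT_ESTIMATE = 8.58   # Short's final estimate quoted in the text

err = relative_error(SHORT_ESTIMATE, TRUE_PARALLAX)
print(f"relative error of Short's estimate: {err:.2%}")
```

For Short's estimate this gives roughly 2.5%. The same function can be applied to any other point estimate of µ to put the two on a common scale.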