Statistics for Research Workers
MAST90007: Statistics for Research Workers
SRW MAST90007 2021 Major assignment
1,500 word assignment
Submission
Submit an electronic copy of the assignment via the LMS.
A reminder: When submitting your assignment, you will be asked to complete the online
plagiarism declaration. This assignment must be your own work.
This assignment contains three (3) questions worth a total of 30 marks. There is some
general advice on the assignment at the end of this document, on page 7.
The overall requirement for this assignment is to carry out and report on data analytics that
address three questions about the data from the Framingham heart study.
You may know about this study from your general knowledge; it is one of the most famous
studies in epidemiology. You can learn about the study from Wikipedia and from these
references:
Levy, D., National Heart Lung and Blood Institute., et al. (1999). 50 years of
discovery: medical milestones from the National Heart, Lung, and Blood Institute's
Framingham Heart Study. Hackensack, N.J., Center for Bio-Medical Communication
Inc.
Mahmood, S. S., Levy, D., Vasan, R. S., & Wang, T. J. (2014). The Framingham Heart
Study and the epidemiology of cardiovascular disease: a historical perspective. The
Lancet, 383(9921), 999-1008.
Oppenheimer, G. M. (2005). Becoming the Framingham study 1947–1950. American
Journal of Public Health, 95(4), 602-610.
You may also find your own useful references. You are not required to read these references
for the purposes of the assignment.
The data file contains some information from long-term follow-up as well as baseline
measures. The file contains records for 5,209 people – all the participants in the original
cohort of the study. The participants were followed up every 2 years. The data file includes
information from baseline, the 2nd examination (one variable), and the 16th examination (30
years after baseline).
The data file includes:
Variable: Description
Age at baseline (years)
Height at baseline (inches)
Weight at baseline (pounds)
Body Mass Index at baseline (kg/m²)
Sex: Female / Male
Diastolic blood pressure at baseline (mmHg)
Systolic blood pressure at baseline (mmHg)
Serum cholesterol (mg/100ml) examination 1: Serum cholesterol (mg/100ml) at baseline; this variable has 2,037 missing values.
Serum cholesterol (mg/100ml) examination 2: Serum cholesterol (mg/100ml) at the 2nd examination; this variable has 626 missing values.
Serum cholesterol (mg/100ml) baseline: Baseline serum cholesterol at examination 1, or, when missing at examination 1, the serum cholesterol at the second examination.
Metropolitan Relative Weight at baseline: A measure of the percentage of actual weight to desirable weight; a measure very similar to BMI.
Smoker at baseline: Smoker / Non-smoker
Number cigarettes smoked per day at baseline
Last examination number: Number of the last examination that the person participated in.
Survived at last examination: 0 = alive at 16th examination; 1 = died prior to 16th examination
Cause of death:
    0 = still alive
    1 = sudden death from coronary heart disease (CHD)
    2 = other coronary heart disease
    3 = stroke (cerebrovascular accident, CVA)
    4 = other cerebral vascular disease
    5 = cancer
    6 = other causes of death
    9 = cause unknown
Examination at which CHD diagnosed, if applicable
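The variable list above gives height in inches and weight in pounds, but Body Mass Index in kg/m². The following is a minimal sketch of the unit conversion that connects them. It assumes the standard definition of BMI (weight in kilograms divided by the square of height in metres); whether the BMI column in the data file was computed in exactly this way is an assumption, and the function name is hypothetical.

```python
# Exact conversion factors (by definition of the international pound and inch).
LB_TO_KG = 0.45359237
IN_TO_M = 0.0254


def bmi_from_imperial(weight_lb: float, height_in: float) -> float:
    """Return BMI in kg/m^2 from weight in pounds and height in inches."""
    if weight_lb <= 0 or height_in <= 0:
        raise ValueError("weight and height must be positive")
    weight_kg = weight_lb * LB_TO_KG
    height_m = height_in * IN_TO_M
    return weight_kg / height_m ** 2


# Hypothetical person, not taken from the data file.
print(round(bmi_from_imperial(150.0, 65.0), 2))
```

Comparing a few values computed this way with the BMI column is a quick way to check that the height and weight columns were imported in the units you expect.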
The data were accessed from:
http://courses.washington.edu/b513/datasets/datasets.php?class=513
The data file is Framingham.xlsx. You can drag and drop this file into Minitab.
When you do this, some of the variable names will be truncated; you will need to edit
them so that they are short but clear.
There are some references to column numbers in the assignment. These numbers will be
correct if you simply drag and drop the Excel file into Minitab; obviously, if you insert
columns yourself in the Minitab file, your column numbers may differ from those given
here.
Question 1 – Baseline data [9 marks]
This question focuses on baseline characteristics and data.
(a) Briefly describe the design of the study to provide context for the analyses you report.
(b) Produce a summary table to describe the following characteristics of the study
participants: age at baseline, height at baseline, weight at baseline and sex.
(c) Consider systolic and diastolic blood pressure at baseline. Produce suitable visual
display(s) to allow a comparison of the distributions of these according to whether or
not an individual was a smoker at baseline. You can exclude those with missing
information about smoking from visual displays using Data Options > Group options.
(d) Carry out appropriate analyses to compare those who were smokers at baseline with
those who were not, for systolic and diastolic blood pressure. Provide one or more
suitable tables that include the summary statistics and inferential statistics.
(e) Discuss and justify any assumptions underlying your choice of analysis.
(f) Write a summary of the analyses you have carried out explaining the results of all the
comparisons you have made. Write the summary for a doctor interested in the
practical application of the study results.
(g) Consider predicting systolic blood pressure at baseline from age and Metropolitan
relative weight at baseline. Provide graphical display(s) to illustrate the distributions
of the explanatory variables. Explain if you would recommend rescaling these
variables for this analysis. If appropriate, rescale the variables. Fit the model and
obtain the parameter estimates for each of the explanatory variables. Explain the
meaning of the parameter estimates for each of these explanatory variables, according
to whether you have recommended rescaling or not. (You do not need to report other
details of the analysis.)
(h) A colleague is also working with the same data file, and says: “This is great! The
sample size is so big, everything is really, really significant; this whole study gives so
many meaningful findings.” Respond to this comment.
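As background to part (g), the sketch below illustrates how rescaling an explanatory variable changes the meaning of the parameter estimates in a least-squares fit. It uses a single predictor for simplicity (part (g) has two), a handful of hypothetical values that are not taken from the Framingham file, and a hypothetical helper `fit_line`. Centring and changing units are shown only as two common forms of rescaling; which form, if any, suits the assignment data is for you to judge.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical values for illustration only.
ages = [30.0, 40.0, 50.0, 60.0]
sbp = [118.0, 126.0, 131.0, 141.0]

age_centred = [a - mean(ages) for a in ages]   # age minus mean age
age_decades = [a / 10.0 for a in ages]         # age in units of 10 years

b0_raw, b1_raw = fit_line(ages, sbp)           # intercept: fitted value at age 0
b0_cen, b1_cen = fit_line(age_centred, sbp)    # intercept: fitted value at mean age
b0_dec, b1_dec = fit_line(age_decades, sbp)    # slope: change per 10 years

print(b0_raw, b1_raw)
print(b0_cen, b1_cen)
print(b0_dec, b1_dec)
```

Centring leaves the slope unchanged and moves the intercept to the fitted value at the mean of the predictor; changing the unit multiplies the slope by the same factor and leaves the intercept unchanged.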
Question 2 – Serum cholesterol at baseline [12 marks]
Serum cholesterol (mg/100ml) at baseline (column 10 in the data file) is defined as serum
cholesterol at examination 1 (the true baseline), or, when missing at examination 1, the
serum cholesterol at the second examination. For many people in the study, serum
cholesterol at both examinations 1 and 2 was available.
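The definition above amounts to a simple fallback rule, sketched here in Python. This is an illustration of the stated rule only, not the code used to build the data file; the function name is hypothetical, missing values are represented by `None` as an assumption, and the example values are made up.

```python
from typing import List, Optional, Sequence


def baseline_cholesterol(
    exam1: Sequence[Optional[float]],
    exam2: Sequence[Optional[float]],
) -> List[Optional[float]]:
    """Use the examination 1 value; fall back to examination 2 when it is missing."""
    if len(exam1) != len(exam2):
        raise ValueError("exam1 and exam2 must have the same length")
    return [e1 if e1 is not None else e2 for e1, e2 in zip(exam1, exam2)]


# Hypothetical values for four participants (None = missing).
exam1 = [220.0, None, 195.0, None]
exam2 = [231.0, 250.0, None, None]

print(baseline_cholesterol(exam1, exam2))
```

A participant with both values missing stays missing, and a participant with both values present keeps the examination 1 value, so the examination 2 value is used only as a substitute.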
(a) Produce an appropriate graph showing the relationship between Serum cholesterol
(mg/100ml) examination 1 and Serum cholesterol (mg/100ml) examination 2.
(b) Describe the relationship between the two variables, and give a suitable summary
statistic.
(c) Fit a linear regression predicting Serum cholesterol (mg/100ml) examination 1 from
Serum cholesterol (mg/100ml) examination 2. Provide an appropriate summary table
and give a plain language explanation of the estimates of the parameters of the model.
(d) Find a 95% prediction interval for Serum cholesterol (mg/100ml) examination 1 when
Serum cholesterol (mg/100ml) examination 2 is 300 (mg/100ml). Explain its meaning.
(e) A colleague asks if using the Serum cholesterol (mg/100ml) examination 2 value itself
as the estimate of Serum cholesterol (mg/100ml) examination 1 is a good idea; for
example, if Serum cholesterol (mg/100ml) examination 2 = 275, predict that Serum
cholesterol (mg/100ml) examination 1 = 275. (This is, in fact, what was done.) Does
this under-estimate, or over-estimate Serum cholesterol (mg/100ml) examination 1,
using the data available? Provide a graph that will help answer this question. (Hint:
Consider adding a Calculated line to show y = x.) Provide an explanation in writing.
(f) Consider improving the prediction of Serum cholesterol (mg/100ml) examination 1.
Explain, in principle, a possible approach. You do not need to implement the
approach.
(g) A key research question is about the relationship of smoking status at baseline and sex
to Serum cholesterol (mg/100ml) baseline (column 10). Describe a suitable statistical
model for answering this question, and explain the effects that will be considered in the
model.
(h) Use Minitab to fit the model that you have specified in part (g). Provide a summary
table of the Analysis of variance, and give a plain language explanation of the meaning
of the P-values associated with each of the explanatory variables. Use concrete terms in
relation to the Framingham study, rather than in abstract form.
(i) State one assumption required for analysing the data using the model you have
suggested. State if the assumption is reasonable and provide relevant evidence.
(j) Provide an appropriate graphical display to summarise the findings in relation to the
model you have fitted in (h).
(k) Find 95% confidence intervals for the effects of sex and smoking status on serum
cholesterol at baseline; use Fisher intervals and provide those that best describe the
results. Provide a suitable report of these confidence intervals, including a plain
language explanation in concrete terms.
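For reference in part (d), the standard textbook form of a 95% prediction interval in simple linear regression is given below. The notation is assumed here rather than taken from the assignment: x is serum cholesterol at examination 2, y is serum cholesterol at examination 1, x_0 is the value at which the prediction is made (300 in part (d)), b_0 and b_1 are the fitted intercept and slope, s is the residual standard deviation, and n is the number of people with both values recorded.

```latex
\hat{y}_0 \;\pm\; t_{0.975,\,n-2}\; s\,
\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}},
\qquad
\hat{y}_0 = b_0 + b_1 x_0,
\qquad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2
```

The leading 1 under the square root reflects the variability of an individual's value around the fitted line; without it the expression is the confidence interval for the mean response at x_0, which is narrower.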
Question 3 – Survival at last examination [9 marks]
Consider Survived at last examination; this is in column 15.
(a) Produce a graph of the data that allows a comparison of Survived at last examination
in terms of sex.
(b) Comment on any differences for sex, based on the graph.
(c) Estimate the difference in proportions (for sex) surviving at the last examination, and
the 95% confidence interval for this difference. Write a plain language explanation of the
results, using concrete terms in relation to the Framingham study.
(d) Carry out a logistic regression analysis of “Survived at last examination” using sex as a
predictor. Write a summary of the results, again suitable for a doctor interested in the
findings.
(e) Subset the Minitab worksheet to exclude those who have survived at examination 16,
so that you have the subset of subjects who died prior to examination 16.
Explore the relationship between cause of death and sex, using a suitable graphical
display. You may consider combining causes of death, if you think this is appropriate.
(Hint: Data > Recode). Provide a suitable graph with a brief written description of the
patterns in the graph.
(f) A colleague wants to consider predicting Survived at last examination from Serum
cholesterol (mg/100ml) examination 1 (column 8). She notes that some of the values are
missing. Your colleague says “I don’t think we need to worry about that as
there will still be plenty of data to carry out an analysis”. Provide a response to this,
explaining any assumptions involved, and include a summary table to describe the
amount of missing data for Serum cholesterol (mg/100ml) examination 1.
(g) At the time that the Framingham study began, diastolic blood pressure was believed to be a
superior measure of blood pressure compared with systolic blood pressure. High levels
of systolic blood pressure were not believed to be important in terms of health
outcomes. Examine the relationship between these two measures of blood pressure at
baseline visually. Provide a plot that represents this relationship.
Consider the summary table providing the results of three logistic regression models
predicting Survived at last examination, shown on the next page.