MATH6173 Statistical Computing — Coursework 1
Instructions for Coursework 1
1. This coursework assignment is worth 50% of the overall mark for the module MATH6173.
2. The deadline is 15:00 on Monday 22nd November 2021 and your completed work must be submitted
electronically via Blackboard.
3. Your submission must consist of exactly one written report including the answers and any tables
and graphs to the questions in Section 1: R Programming Part I, together with one R script file
containing all the R code you used to obtain your results for questions in both Section 1 and Section 2.
In the R script, you must use comments to make clear which question each piece of R code answers.
4. Your written report should be in pdf format. The filename must be your student ID followed by .pdf.
For example, if your student ID is 12345678, then your report must have the filename 12345678.pdf.
5. Your R script file must be named with your student ID followed by .R. For example, if your student ID is
12345678, then your script must have the filename 12345678.R.
6. Your entire R script file should be reproducible and run without any errors.
7. You must present elegant and informative plots in the written report, including the use of appropriate
plotting area, points/lines sizes and colours. Also please include meaningful labels, titles and legends.
8. You must informatively comment your R code including indicating where each question and task begins.
9. You must not load any add-on packages, e.g. by using the functions library or require.
10. The standard Faculty rules on late submissions apply: for every weekday, or part of a weekday, that
the coursework is late, the mark will be reduced by a factor of 10%. No credit will be given for work
which is more than one week late.
11. All coursework must be carried out independently. You are reminded of the University’s Academic
Integrity Policy.
12. In the interests of fairness, queries which relate directly to the coursework must be made via the
MATH6173: Coursework 1 Discussion Forum on Blackboard. This ensures that everybody has access
to the same information.
Total marks available: 100
Section 1: R Programming Part I
Question 1. Two-sample testing
[22 marks]
The dataset in the file diab.txt is originally from the US National Institute of Diabetes and Digestive and
Kidney Diseases and contains measurements from n = 768 female individuals. The data contains a series of
eight measurement variables, and a response variable indicating whether or not a patient has diabetes. The
variables in the dataset are described in the following table.
Variable   Description
preg       Number of times pregnant
gluc       Plasma glucose concentration after 2 hours in an oral glucose tolerance test
bp         Diastolic blood pressure
skin       Triceps skin fold thickness
insulin    2-hour serum insulin
bmi        Body mass index
ped        Diabetes pedigree function
age        Age
out        Binary response variable (0 = no diabetes, 1 = diabetes)
(a) [2 marks] Import the dataset from the file diab.txt into your workspace. Calculate the following
quantities:
• number of patients that have diabetes, and number of patients that do not have diabetes
• median age of patients
• mean body mass index for patients that have diabetes, and mean body mass index for patients that do
not have diabetes
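The summaries listed above can be sketched in base R as follows. This is only an illustration on simulated data: the data frame diab_sim and its values are made up here, and the commented read.table call assumes that diab.txt has a header row, which the text does not state.

```r
# Simulated stand-in for diab.txt (hypothetical values, not the real data)
set.seed(1)
n <- 200
diab_sim <- data.frame(
  age = sample(21:80, n, replace = TRUE),
  bmi = rnorm(n, mean = 32, sd = 7),
  out = rbinom(n, size = 1, prob = 0.35)
)
# With the real file one might start from something like the line below
# (assumption: the file has a header row with the variable names):
# diab <- read.table("diab.txt", header = TRUE)

# Number of patients in each group (0 = no diabetes, 1 = diabetes)
group_counts <- table(diab_sim$out)

# Median age over all patients
median_age <- median(diab_sim$age)

# Mean BMI within each group
mean_bmi_by_group <- tapply(diab_sim$bmi, diab_sim$out, mean)

print(group_counts)
print(median_age)
print(mean_bmi_by_group)
```

Using table and tapply keeps the grouping explicit; logical subsetting such as mean(diab_sim$bmi[diab_sim$out == 1]) gives the same group means.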
(b) [2 marks] Produce side-by-side boxplots showing the distribution of each of the eight
measurements after scaling (subtracting the mean and dividing by the standard deviation).
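As a rough sketch of the scaling step, the base R function scale centres each column and divides it by its sample standard deviation, after which the columns can be drawn on a common axis. The data frame meas below is simulated with only three hypothetical columns; it is not the diabetes data.

```r
set.seed(2)
# Hypothetical measurements on very different scales
meas <- data.frame(
  gluc = rnorm(100, mean = 120, sd = 30),
  bmi  = rnorm(100, mean = 32, sd = 7),
  ped  = rexp(100, rate = 2)
)

# scale() subtracts each column mean and divides by the column standard deviation
meas_scaled <- scale(meas)

# One box per column, all on the same standardised axis
boxplot(meas_scaled,
        main = "Standardised measurements (simulated data)",
        xlab = "Variable", ylab = "Standardised value",
        col = "lightblue")
```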
(c) [2 marks] We want to test whether the mean body mass index (BMI) is the same for
patients with diabetes and patients without diabetes. Use Student's t-test here and assume that the
variances of BMI in both groups are equal. State your null and alternative hypotheses, perform your test
and give the p-value. Do you reject your null hypothesis at the significance level of 0.05?
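A minimal sketch of an equal-variance two-sample t-test in base R is given below, using simulated BMI-like values rather than the real data. The vectors bmi_diab and bmi_nondiab are hypothetical names; the hypotheses being tested are that the two population means are equal against the two-sided alternative that they differ.

```r
set.seed(3)
# Simulated BMI values for two groups (hypothetical, not from diab.txt)
bmi_diab    <- rnorm(60, mean = 35, sd = 7)
bmi_nondiab <- rnorm(90, mean = 30, sd = 7)

# var.equal = TRUE requests the pooled-variance (Student's) t-test
res <- t.test(bmi_diab, bmi_nondiab, var.equal = TRUE)

print(res$statistic)  # t statistic
print(res$parameter)  # degrees of freedom, n1 + n2 - 2
print(res$p.value)    # two-sided p-value

# Decision at the 0.05 significance level
reject <- res$p.value < 0.05
print(reject)
```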
(d) [6 marks] Suppose now we want to write our own code to perform a test similar to that in (c), but without
assuming equal variances. To achieve this, we can use Welch's t-test. Suppose we have two independent groups
of random variables $x_1, \dots, x_{n_1}$ and $y_1, \dots, y_{n_2}$. We define their sample means as:
$$\bar{x} = \frac{1}{n_1} \sum_{i=1}^{n_1} x_i, \qquad \bar{y} = \frac{1}{n_2} \sum_{i=1}^{n_2} y_i,$$
and their sample variances as:
$$\hat{\sigma}_x^2 = \frac{1}{n_1 - 1} \sum_{i=1}^{n_1} (x_i - \bar{x})^2, \qquad \hat{\sigma}_y^2 = \frac{1}{n_2 - 1} \sum_{i=1}^{n_2} (y_i - \bar{y})^2.$$
Then we can define the unbiased variance estimator as:
$$\hat{\sigma}^2 = \frac{\hat{\sigma}_x^2}{n_1} + \frac{\hat{\sigma}_y^2}{n_2}.$$
Therefore, we obtain
$$t = \frac{\bar{x} - \bar{y}}{\hat{\sigma}},$$
which is the so-called Welch's t-test statistic. Under the null hypothesis of equal means, the
distribution of this test statistic can be approximated by a Student's t-distribution with degrees of
freedom
$$\mathrm{df} = \frac{\left(\hat{\sigma}_x^2/n_1 + \hat{\sigma}_y^2/n_2\right)^2}{\dfrac{(\hat{\sigma}_x^2/n_1)^2}{n_1 - 1} + \dfrac{(\hat{\sigma}_y^2/n_2)^2}{n_2 - 1}}.$$
Write your own code to calculate the Welch’s t-test statistic for the same hypothesis testing problem
in (c). Give the value of this test statistic, the degrees of freedom, and the p-value of your test. Do you
reject the null hypothesis at the significance level of 0.05?
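The formulas above translate fairly directly into R. The sketch below is one possible hand-written version, checked against the built-in t.test (which performs Welch's test when var.equal is left at its default of FALSE). The function name welch_test and the simulated vectors x and y are assumptions made for illustration only.

```r
# Hand-written Welch's t-test following the formulas above
# (welch_test is a hypothetical helper name)
welch_test <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  vx <- var(x) / n1   # sigma_x^2 / n1
  vy <- var(y) / n2   # sigma_y^2 / n2
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Approximate degrees of freedom from the formula above
  dof <- (vx + vy)^2 / (vx^2 / (n1 - 1) + vy^2 / (n2 - 1))
  # Two-sided p-value from the approximating t-distribution
  p_value <- 2 * pt(abs(t_stat), df = dof, lower.tail = FALSE)
  list(statistic = t_stat, df = dof, p.value = p_value)
}

set.seed(4)
x <- rnorm(60, mean = 35, sd = 8)  # simulated group 1
y <- rnorm(90, mean = 30, sd = 6)  # simulated group 2

mine <- welch_test(x, y)
ref  <- t.test(x, y)  # built-in Welch test, used only as a cross-check

print(unlist(mine))
print(c(ref$statistic, ref$parameter, p.value = ref$p.value))
```

The p-value is two-sided because the alternative hypothesis is that the means differ in either direction.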
(e) [3 marks] The F-distribution is a continuous probability distribution that arises frequently in probability
theory and statistics. It has two parameters, and we write $X \sim F_{d_1,d_2}$ if a random variable $X$
follows an F-distribution with parameters $d_1$ and $d_2$. You can use the R functions df, pf, qf and rf to
obtain the probability density function, cumulative distribution function, quantile function and random
number generator of the F-distribution, respectively. Read the help files of these functions and produce
the following four plots on one page in a 2 × 2 layout.
• plot 1: probability density function of $F_{8,10}$.
• plot 2: cumulative distribution function of $F_{8,10}$.
• plot 3: a probability histogram of 100 independently generated random numbers from $F_{8,10}$.
• plot 4: empirical density function of the same 100 random numbers as in plot 3.
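One way such a four-panel page might be laid out in base graphics is sketched below. The grid range 0 to 6, the colours and the seed are arbitrary choices, and reading "empirical density function" as a kernel density estimate from density is an assumption.

```r
set.seed(5)
d1 <- 8
d2 <- 10
x_grid   <- seq(0, 6, length.out = 400)  # plotting grid (arbitrary range)
f_sample <- rf(100, df1 = d1, df2 = d2)  # 100 random draws from F(8, 10)

old_par <- par(mfrow = c(2, 2))          # 2 x 2 layout, filled row by row

# Plot 1: probability density function
plot(x_grid, df(x_grid, df1 = d1, df2 = d2), type = "l", col = "blue", lwd = 2,
     main = "Density of F(8, 10)", xlab = "x", ylab = "f(x)")

# Plot 2: cumulative distribution function
plot(x_grid, pf(x_grid, df1 = d1, df2 = d2), type = "l", col = "red", lwd = 2,
     main = "CDF of F(8, 10)", xlab = "x", ylab = "F(x)")

# Plot 3: histogram on the density scale (freq = FALSE)
hist(f_sample, freq = FALSE, col = "grey80",
     main = "Histogram of 100 draws", xlab = "x")

# Plot 4: kernel density estimate of the same draws
plot(density(f_sample), lwd = 2,
     main = "Estimated density of the draws", xlab = "x")

par(old_par)                             # restore the previous layout
```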
(f) [7 marks] Suppose now we want to simultaneously test whether the mean values of all eight measurements
are equal between patients that have diabetes and patients that do not have diabetes. To achieve
this, we can use Hotelling's T-squared test, which is a natural generalization of Student's
t-test to multivariate hypothesis testing. Suppose we have two independent groups of random p-vectors
$\mathbf{x}_1, \dots, \mathbf{x}_{n_1}$ and $\mathbf{y}_1, \dots, \mathbf{y}_{n_2}$ (all the vectors are
p-dimensional). We define their sample mean vectors as:
$$\bar{\mathbf{x}} = \frac{1}{n_1} \sum_{i=1}^{n_1} \mathbf{x}_i, \qquad \bar{\mathbf{y}} = \frac{1}{n_2} \sum_{i=1}^{n_2} \mathbf{y}_i,$$
and their sample covariance matrices as:
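$$\hat{\Sigma}_x = \frac{1}{n_1 - 1} \sum_{i=1}^{n_1} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T, \qquad \hat{\Sigma}_y = \frac{1}{n_2 - 1} \sum_{i=1}^{n_2} (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})^T.$$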
Then we can define the unbiased pooled covariance matrix as:
$$\hat{\Sigma} = \frac{(n_1 - 1)\hat{\Sigma}_x + (n_2 - 1)\hat{\Sigma}_y}{n_1 + n_2 - 2}.$$
Therefore, we obtain
$$T^2 = \frac{n_1 n_2}{n_1 + n_2} (\bar{\mathbf{x}} - \bar{\mathbf{y}})^T \hat{\Sigma}^{-1} (\bar{\mathbf{x}} - \bar{\mathbf{y}}),$$
which is the so-called Hotelling's two-sample T-squared statistic. If we further assume that both
groups consist of observations that are independently and identically drawn from a multivariate normal
distribution $N(\mu, \Sigma)$, then
$$T^2 \sim T^2(p, n_1 + n_2 - 2),$$
where $T^2(p, n_1 + n_2 - 2)$ denotes the so-called Hotelling's T-squared distribution with parameters $p$ and
$n_1 + n_2 - 2$. If a random variable $X$ has a Hotelling's T-squared distribution, i.e., $X \sim T^2(p, m)$, we have
$$\frac{m - p + 1}{pm} X \sim F_{p,\, m-p+1}.$$
Let group 1 be the eight measurements for all the patients with diabetes, and let group 2 be the eight
measurements for all the patients without diabetes. State your null and alternative hypotheses. Write
your own code to calculate Hotelling's two-sample T-squared statistic and derive the p-value of your
test. Do you reject your null hypothesis at the significance level of 0.05?
Hints:
(1) In hypothesis testing, the p-value is defined as the probability of obtaining test statistics at least as
extreme as the results actually observed, under the assumption that the null hypothesis is true.
(2) Observe that, for Hotelling's two-sample T-squared statistic, if the mean difference between the two
groups is small (so that $\bar{\mathbf{x}} - \bar{\mathbf{y}}$ is close to 0), the value of $T^2$ will tend to
be small. Conversely, if the mean difference between the two groups is large, the value of $T^2$ will tend to
be large as well.
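The construction in part (f) can be sketched in R as below, on simulated data with p = 3 instead of the eight real measurements. The helper name hotelling_test and the matrices X and Y are hypothetical. Following hint (2), large values of the statistic count as extreme, so the p-value is taken from the upper tail of the F-distribution. The sketch treats the mean vectors as columns.

```r
# Two-sample Hotelling's T-squared test (hotelling_test is a hypothetical helper name)
# X, Y: matrices or data frames with one observation per row and p columns
hotelling_test <- function(X, Y) {
  X <- as.matrix(X)
  Y <- as.matrix(Y)
  n1 <- nrow(X)
  n2 <- nrow(Y)
  p  <- ncol(X)
  mean_diff <- colMeans(X) - colMeans(Y)
  # Pooled covariance matrix; cov() already divides by n - 1
  S_pooled <- ((n1 - 1) * cov(X) + (n2 - 1) * cov(Y)) / (n1 + n2 - 2)
  # solve(S, d) computes S^{-1} d without forming the inverse explicitly
  T2 <- (n1 * n2 / (n1 + n2)) * drop(t(mean_diff) %*% solve(S_pooled, mean_diff))
  # Convert to an F statistic with (p, m - p + 1) degrees of freedom
  m <- n1 + n2 - 2
  F_stat <- (m - p + 1) / (p * m) * T2
  p_value <- pf(F_stat, df1 = p, df2 = m - p + 1, lower.tail = FALSE)
  list(T2 = T2, F = F_stat, df1 = p, df2 = m - p + 1, p.value = p_value)
}

set.seed(6)
X <- matrix(rnorm(50 * 3, mean = 0.5), ncol = 3)  # simulated group 1
Y <- matrix(rnorm(70 * 3, mean = 0),   ncol = 3)  # simulated group 2
res <- hotelling_test(X, Y)
print(unlist(res))
```

With a single column (p = 1), the statistic reduces to the square of the pooled-variance t statistic, which gives a convenient sanity check against t.test with var.equal = TRUE.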
Question 2. Non-parametric kernel density estimation
[28 marks]
Density estimation is the problem of reconstructing the probability density function (pdf) from a set of given
data points. Namely, we observe $X_1, \dots, X_n$ and want to estimate the unknown underlying pdf, denoted
by $f(x)$, that generated this dataset. This is one of the central topics in (non-parametric) statistical research.
The so-called kernel density estimator (KDE) is one of the most famous methods for this problem. Here we
give a formal definition of the KDE in the following formula:
$$\hat{f}_n(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{X_i - x}{h}\right),$$