MATH3811/MATH3911 - Statistical Inference / Higher Statistical Inference
DEPARTMENT OF STATISTICS
ASSIGNMENT 2
Please add a cover page containing a copy of your ID card. On it, write in your own handwriting:
“I declare that this assignment is my own work, except where acknowledged, and I have read
and understood the University rules regarding Academic Misconduct”, and sign it.
Assignment due: Friday, 16th April 2021, 5 pm at the latest.
MATH3811: Attempt the first four questions. MATH3911: Attempt all questions.
1. Suppose $X_1, X_2, \dots, X_n$ are independent and identically distributed random variables from
$N(\mu, \sigma_0^2)$ (each with density
$f(x; \mu) = \frac{1}{\sqrt{2\pi}\,\sigma_0} e^{-\frac{(x-\mu)^2}{2\sigma_0^2}}$).
Suppose that $\sigma_0$ is known.
a) Argue that the joint density of $X_1, X_2, \dots, X_n$ has monotone likelihood ratio in
$\sum_{i=1}^n X_i$.
b) Derive the UMP unbiased size $\alpha = 0.05$ test $\varphi^*$ of $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$.
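c) Show that the power function of this test is
$E_\mu \varphi^* = 1 - \Phi\left(1.96 - \frac{\sqrt{n}(\mu - \mu_0)}{\sigma_0}\right) + \Phi\left(-1.96 - \frac{\sqrt{n}(\mu - \mu_0)}{\sigma_0}\right)$
with $\Phi$ denoting the cdf of the standard normal distribution.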
d) Set $n = 10$, $\sigma_0 = 2$, $\mu_0 = 3$. Evaluate numerically the power function for $\mu = 1, 2, 3, 4, 5$
and draw its graph on the real axis using R.
e) Calculate the density $f_{X_{(3)}}(x)$ of the third order statistic $X_{(3)}$ under $H_0$. Hence find
numerically $P(X_{(3)} < 2)$. (You could use the integrate function in R.)
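For part d), the power function stated in part c) can be evaluated directly with pnorm. The following is only a minimal sketch: the function name power_fn, the plotting range 0 to 6 and the reference line at 0.05 are illustrative choices, not part of the assignment.

```r
# Power of the two-sided test, coded from the formula in part c).
# power_fn and its argument names are hypothetical, chosen for illustration.
power_fn <- function(mu, mu0 = 3, sigma0 = 2, n = 10, z = 1.96) {
  delta <- sqrt(n) * (mu - mu0) / sigma0   # standardised shift
  1 - pnorm(z - delta) + pnorm(-z - delta)
}

mu_grid <- 1:5
power_at_grid <- power_fn(mu_grid)
print(round(power_at_grid, 4))

# Graph of the power function; the plotting range is an arbitrary choice.
curve(power_fn(x), from = 0, to = 6,
      xlab = expression(mu), ylab = "Power")
points(mu_grid, power_at_grid, pch = 19)
abline(h = 0.05, lty = 2)                  # nominal size of the test
```

At $\mu = \mu_0$ the curve should touch the nominal size, and it should be symmetric about $\mu_0$; both are quick sanity checks on any implementation.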
2. In a sequence of consecutive years $1, 2, \dots, T$, an annual number of high-risk events is
recorded by a bank. The random number $N_t$ of high-risk events in a given year is modelled
via a Poisson($\lambda$) distribution. This gives a sequence of independent counts $n_1, n_2, \dots, n_T$. The
prior on $\lambda$ is Gamma($a, b$) with known $a > 0$, $b > 0$:
$\tau(\lambda) = \frac{\lambda^{a-1} e^{-\lambda/b}}{\Gamma(a)\, b^a}$, $\lambda > 0$.
a) Determine the Bayesian estimator of the intensity $\lambda$ with respect to quadratic loss.
b) Assume $a = 3$, $b = 2$. If the counts within the last seven years were 2, 4, 7, 3, 4, 4, 5, find
the estimate of $\lambda$ for this data.
c) The bank claims that the yearly intensity $\lambda$ is less than 4. Test the bank’s claim via
Bayesian testing with a zero-one loss, using the data from b).
Hint: You may find it helpful to use the R function pgamma in your answer.
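The prior above is written with $b$ as a scale parameter, so in R it corresponds to pgamma(q, shape = a, scale = b) rather than the rate form. The sketch below assumes the standard Poisson-Gamma conjugacy result (which part a) asks you to derive): the posterior is Gamma with shape $a + \sum n_t$ and scale $b/(1 + Tb)$. The variable names are illustrative.

```r
# Prior hyperparameters and data from part b).
a <- 3
b <- 2                                  # scale parameter, as in the prior density
counts <- c(2, 4, 7, 3, 4, 4, 5)
n_years <- length(counts)

# Assumed conjugate update: Gamma prior (shape a, scale b) with Poisson counts.
post_shape <- a + sum(counts)
post_scale <- b / (1 + n_years * b)

# Posterior mean, which is the Bayes estimate under quadratic loss.
post_mean <- post_shape * post_scale

# Posterior probability of the bank's claim, lambda < 4.
p_claim <- pgamma(4, shape = post_shape, scale = post_scale)
print(c(post_mean = post_mean, p_claim = p_claim))
```

Under zero-one loss, the hypothesis with the larger posterior probability is the one retained, so p_claim is compared with 1 - p_claim.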
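3. Important measures in exploratory data analysis are the skewness
$\gamma_1 = \frac{E(X - E(X))^3}{Var(X)^{3/2}}$ and the kurtosis
$\gamma_2 = \frac{E(X - E(X))^4}{Var(X)^2} - 3$. One way of estimating them is by using their empirical
counterparts
$\hat\gamma_1 = \frac{\sqrt{n} \sum_{i=1}^n (X_i - \bar{X})^3}{\left(\sum_{i=1}^n (X_i - \bar{X})^2\right)^{3/2}}$ and
$\hat\gamma_2 = \frac{n \sum_{i=1}^n (X_i - \bar{X})^4}{\left(\sum_{i=1}^n (X_i - \bar{X})^2\right)^2} - 3$, respectively.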
a) Write down two R functions myskewness and mykurtosis to get $\hat\gamma_1$ and $\hat\gamma_2$.
b) Load the library MASS in R and find the data galaxies (this variable describes the
velocities of 82 galaxies taken in the Corona Borealis region (a small constellation in
the northern sky)). Use your functions to estimate $\gamma_1$ and $\gamma_2$ for the variable galaxies.
c) Bootstrap the $\hat\gamma_1$ and $\hat\gamma_2$ estimators by using B = 2000 replicates and report the
resulting 95% confidence intervals using first principles.
d) For a normal population, the theoretical skewness and kurtosis (as defined above) are both equal to zero. If the
95% confidence interval for either skewness or kurtosis excludes zero, normality is
in doubt. What is your conclusion about the normality of the galaxies data? Include
your code, and the output containing the confidence intervals, in your assignment.
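One possible shape for parts a) to c) is sketched below. The two functions follow the empirical formulas stated in the question. The bootstrap uses plain resampling with sample and a percentile interval, which is only one reading of "first principles"; a different interval construction may be intended. The seed value is arbitrary.

```r
library(MASS)  # provides the galaxies data

# Empirical skewness and kurtosis, coded from the formulas in the question.
myskewness <- function(x) {
  d <- x - mean(x)
  sqrt(length(x)) * sum(d^3) / sum(d^2)^(3 / 2)
}
mykurtosis <- function(x) {
  d <- x - mean(x)
  length(x) * sum(d^4) / sum(d^2)^2 - 3
}

est <- c(skew = myskewness(galaxies), kurt = mykurtosis(galaxies))
print(est)

set.seed(1)   # arbitrary seed, used only to make the sketch reproducible
B <- 2000
boot_stats <- replicate(B, {
  xb <- sample(galaxies, replace = TRUE)   # resample with replacement
  c(skew = myskewness(xb), kurt = mykurtosis(xb))
})

# Percentile intervals (an assumed choice of interval type).
ci <- apply(boot_stats, 1, quantile, probs = c(0.025, 0.975))
print(ci)
```

Here boot_stats is a 2 x B matrix with one row per statistic, so apply over rows returns the lower and upper limits for each estimator in a single table.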
4. We simulate an example to demonstrate the strength of the LTS procedure in isolating
outliers. Suppose your student number is XXXXXXX. So that you generate data that is
unique to your student number, include the student number in
the starting seed for random number generation as shown below. After setting the initial
seed, generate pairs of observations $(x_i, y_i)$, $i = 1, 2, \dots, 100$, of which 70% are scattered
around the line $y = x + 2$ and 30% are clustered around (6, 3).
> set.seed(round(log(XXXXXXX)))
> x70 <- runif(70, 0.5, 4)
> e70 <- rnorm(70, mean = 0, sd = 0.2)
> y70 <- 2 + x70 + e70
> x30 <- rnorm(30, mean = 6, sd = 0.5)
> y30 <- rnorm(30, mean = 3, sd = 0.5)
> x <- c(x70, x30)
> y <- c(y70, y30)
> simuldata <- data.frame(x, y)
...
Using the above commands as a starter and the help of R, produce and include in your
assignment the following graphs and comment on your findings.
i) Graph 1. Plot the x, y data to produce a scatterplot. Study the help and examples of
the abline command to superimpose three regression lines: the ordinary least squares line,
the default M-estimate line and the default LTS line using the lqs function. Label the
lines clearly.
ii) Graph 2. Using the instructions from the Computing exercise on robust regression,
you can override the default value using the quantile argument. Modify the default LTS
regression by asking that only 70 residuals be included in the calculation. Redraw the graph
with the new LTS regression line replacing the old one.
iii) Suppose however that you did not know the amount of contamination and used 85
residuals instead of 70 in ii) (i.e., some outliers are still influencing the LTS fit). Try the
LTS estimator again. Does it deliver a good fit? Attach the graph (Graph 3).
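The fitting and plotting steps for Graphs 1 and 2 might be organised along the following lines. This is a sketch only: 1234567 is a placeholder student number, the object names, colours and legend position are arbitrary, and rlm from MASS is assumed to be the intended "default M-estimate".

```r
library(MASS)  # rlm() for M-estimation, lqs() for LTS

set.seed(round(log(1234567)))  # 1234567 is a placeholder, not a real student number
x70 <- runif(70, 0.5, 4)
y70 <- 2 + x70 + rnorm(70, mean = 0, sd = 0.2)
x30 <- rnorm(30, mean = 6, sd = 0.5)
y30 <- rnorm(30, mean = 3, sd = 0.5)
simuldata <- data.frame(x = c(x70, x30), y = c(y70, y30))

fit_ols   <- lm(y ~ x, data = simuldata)                   # ordinary least squares
fit_m     <- rlm(y ~ x, data = simuldata)                  # default M-estimate
fit_lts   <- lqs(y ~ x, data = simuldata, method = "lts")  # default quantile
fit_lts70 <- lqs(y ~ x, data = simuldata, method = "lts",
                 quantile = 70)                            # only 70 residuals used

# Graph 1: scatterplot with the three default fits.
plot(y ~ x, data = simuldata, main = "Graph 1")
abline(fit_ols, col = "red", lty = 1)
abline(fit_m, col = "blue", lty = 2)
abline(fit_lts, col = "darkgreen", lty = 3)
legend("topleft", legend = c("OLS", "M-estimate", "LTS (default)"),
       col = c("red", "blue", "darkgreen"), lty = 1:3)

# Graph 2: same plot, with the LTS line replaced by the quantile = 70 fit.
plot(y ~ x, data = simuldata, main = "Graph 2")
abline(fit_ols, col = "red", lty = 1)
abline(fit_m, col = "blue", lty = 2)
abline(fit_lts70, col = "darkgreen", lty = 3)
legend("topleft", legend = c("OLS", "M-estimate", "LTS (quantile = 70)"),
       col = c("red", "blue", "darkgreen"), lty = 1:3)
```

Graph 3 follows the same pattern with quantile = 85 in the lqs call.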
5. (*) A random sample $X = (X_1, X_2, X_3)$ of size $n = 3$ is taken from a population with
density
$f(x) = 2x$, $x \in [0, 1]$.
a) Evaluate the covariance between $X_{(1)}$ and $X_{(3)}$.
b) Evaluate the correlation between $X_{(1)}$ and $X_{(3)}$.
c) Find the density of the range $R = X_{(3)} - X_{(1)}$ and show that $E(R) = 0.4$ holds.
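A quick numerical cross-check of the value $E(R) = 0.4$ is possible without the density of $R$, since $E(R) = E(X_{(3)}) - E(X_{(1)})$. The sketch below uses the standard marginal densities of the minimum and maximum of $n$ i.i.d. observations; the function names are illustrative, and the check does not replace the derivation asked for in part c).

```r
dens <- function(x) 2 * x   # population density on [0, 1]
cdf  <- function(x) x^2     # corresponding cdf
n <- 3

# Standard marginal densities of the smallest and largest of n i.i.d. draws.
f_min <- function(x) n * (1 - cdf(x))^(n - 1) * dens(x)
f_max <- function(x) n * cdf(x)^(n - 1) * dens(x)

E_min <- integrate(function(x) x * f_min(x), lower = 0, upper = 1)$value
E_max <- integrate(function(x) x * f_max(x), lower = 0, upper = 1)$value

E_range <- E_max - E_min    # expected range E(R)
print(E_range)
```

The result should agree with 0.4 up to numerical integration error.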