MATH401/501 Statistical Fundamentals
1.1 Overview of model-based inference
The basis of much statistical modelling is the assumption of a known stochastic
model for our data. This model will have unknown parameters which we wish to
make inference about.
Data y and model f(y|θ).
If the data are discrete then f is a probability mass function (pmf):

f(y|θ) = P(y|θ).
If the data are continuous then f is a probability density function (pdf), which can
be defined through

∫_A f(y|θ) dy = P(y ∈ A|θ).
More generally, ƒ might be a mixture of mass and density functions, for example.
Initially we will consider simple stochastic models, whereby each data point is
considered as a realisation of a random variable, and these random variables
are assumed to be mutually independent with a common probability distribution.
Such data are often referred to as independent and identically distributed, which
is often abbreviated to IID. The common distribution of our random variable will
come from an appropriate family of distributions, where the family depends on
one or more unknown parameters. Our aim will be to make inference about these
parameters.
1.1.0.1 Example: IID Bernoulli data
We have data y = (0,1,1,1,0), where y_i = 1 indicates that person i suffers from
back ache and y_i = 0 indicates that they do not. We believe that whether or not
each Y_i took the value 0 or 1 did not depend on the values taken by any other
Y_j (j ≠ i). Further, we believe that P(Y_i = 1) = θ, with the same θ for each i and
0 ≤ θ ≤ 1; thus, P(Y_i = 0) = 1 − θ. More succinctly this model can be written as

Y_i ∼ IID Bernoulli(θ),  i = 1, . . . , n.
Then

f(y|θ) = P(Y = (0,1,1,1,0)) = (1 − θ) × θ × θ × θ × (1 − θ) = θ³(1 − θ)².
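As a small numerical illustration (a sketch in Python, not part of the original notes; the function name is ours), the joint pmf θ^{∑y_i}(1 − θ)^{n−∑y_i}, which equals θ³(1 − θ)² for these data, can be evaluated at a few candidate values of θ:

```python
import numpy as np

y = np.array([0, 1, 1, 1, 0])  # the back-ache data

def bernoulli_joint_pmf(y, theta):
    """Joint pmf of IID Bernoulli(theta) data: theta^(#ones) * (1 - theta)^(#zeros)."""
    s = y.sum()
    return theta**s * (1 - theta)**(len(y) - s)

for theta in [0.2, 0.4, 0.6, 0.8]:
    print(f"theta = {theta:.1f}:  f(y|theta) = {bernoulli_joint_pmf(y, theta):.4f}")
```

Evaluating f(y|θ) at different values of θ in this way anticipates the idea of the likelihood, introduced later.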
Important: this module assumes an understanding of probability, including
independence and mass and density functions and cumulative distribution func-
tions (cdfs). It also assumes familiarity with a number of standard distributions:
their purposes, parameters, ranges, relationships, cdfs and mass/density func-
tions. These include: the Bernoulli, binomial, geometric, uniform, Poisson and
negative-binomial discrete distributions and the Gaussian (normal), uniform, beta,
exponential and gamma continuous distributions. If you are not comfortable with
this material then please look in the relevant parts of your Probability I and II
notes (MSci students) or the pre-reading MATH230 notes (MSc students). Except
for the Bernoulli (above) and the Gaussian (shortly), these distributions will not
be explained in any detail.
Later in this module we will expand the class of problems to include those where
for each y additional information x is available. Elements of the vector x are
often referred to as covariates or explanatory variables, and the idea is that some
or all of them might affect the distribution of y. Let us revisit the IID Bernoulli
example.
1.1.0.2 Example: A first logistic regression
Suppose that for each y_i from the IID Bernoulli example we also have information
x_i, the height of person i. We might suspect that the probability that a person
suffers from back ache increases (or decreases) with their height, and so we might
suggest a model of the form:

Y_i ∼ indep. Bernoulli(exp(θ0 + θ1 x_i) / (1 + exp(θ0 + θ1 x_i))),  i = 1, . . . , n,
for some θ0 ∈ (−∞,∞) and θ1 ∈ (−∞,∞). Clearly,
P(Y_i = 0|θ, x_i) = 1 − P(Y_i = 1|θ, x_i) = 1 / (1 + exp(θ0 + θ1 x_i)).
Then, given the parameters and the covariates, the probability of obtaining the
data is:
P(Y = y|θ, x_1, . . . , x_n) = f(y|θ, x)
= P(Y_1 = 0|θ, x_1, . . . , x_5) × P(Y_2 = 1|θ, x_1, . . . , x_5)
× P(Y_3 = 1|θ, x_1, . . . , x_5) × P(Y_4 = 1|θ, x_1, . . . , x_5)
× P(Y_5 = 0|θ, x_1, . . . , x_5)
= P(Y_1 = 0|θ, x_1) × P(Y_2 = 1|θ, x_2) × P(Y_3 = 1|θ, x_3)
× P(Y_4 = 1|θ, x_4) × P(Y_5 = 0|θ, x_5),
which is a messy function of θ and all of the x_i that we will learn how to deal with.
Aside: why is the form for the probability in the above example sensible? (Consider
P(Y_i = 0).)
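To see the structure of this product concretely, here is a sketch (the heights x_1, . . . , x_5 and the values of θ0 and θ1 below are invented purely for illustration) that computes each P(Y_i = 1|θ, x_i) and then the joint probability of the observed pattern (0, 1, 1, 1, 0):

```python
import numpy as np

y = np.array([0, 1, 1, 1, 0])                 # back-ache indicators
x = np.array([1.62, 1.80, 1.75, 1.90, 1.68])  # hypothetical heights in metres
theta0, theta1 = -10.0, 5.0                   # arbitrary illustrative parameter values

# P(Y_i = 1 | theta, x_i) under the logistic model
p = np.exp(theta0 + theta1 * x) / (1 + np.exp(theta0 + theta1 * x))

# Joint probability: product over i of p_i^{y_i} * (1 - p_i)^{1 - y_i}
joint = np.prod(p**y * (1 - p)**(1 - y))
print("P(Y_i = 1|theta, x_i):", np.round(p, 3))
print("P(Y = y|theta, x):    ", joint)
```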
Given the data, y, the model, f, and unknown parameter vector, θ, there are
several tasks we may wish to perform:
1. Obtain a good estimate of θ. This is referred to as a point estimate, and
we will denote it by θ̂.
2. Measure uncertainty in such an estimate. This will be performed through the
use of standard errors and confidence intervals.
3. Test hypotheses about the parameters. For example, for the IID Bernoulli
example, do the data provide evidence to contradict a hypothesis that
θ = 0.4?
4. Choose between competing models for the same data. For example, deciding
between the IID Bernoulli model for the back ache data and the model that
takes height into account: which explains the data better?
Our initial focus will be on tasks 1-3; model choice, task 4, will be introduced through
hypothesis tests (task 3) and is of considerable importance in the second part of the module.
Whenever you come across an example specifying a particular model for a set
of data, it is good practice to ask yourself ‘Is the model appropriate?’, ‘What
assumptions are being made?’, ‘Could I think of a more realistic model?’. For
example, often data are assumed to be independent, when they are not (lack of
independence will be covered in Statistical Fundamentals II). Another common
assumption is that data are Gaussian, yet this is very rarely the case in reality.
Even if you decide to make such simplifying assumptions, you should always think
about the knock-on effects for the inference that you make, and the conclusions
that you draw.
In this module you will learn techniques for tasks 1-4 based upon the likelihood.
This is not the only method for inference, but it is the most popular and there
are sound theoretical reasons for this which we will also discuss. In Statistical
Fundamentals II you will delve further into the theory of likelihood, tackle more
complex data models using likelihood inference and also be introduced to Bayesian
inference which, for very good reasons, is another popular paradigm.
Before we even introduce the likelihood and likelihood-based methods, we first
recap the ideas of point estimates and estimators, confidence intervals, hypothesis
tests and p-values through a simple running example. This example will also
provide the key intuitions behind much of the likelihood theory that we will see
and use, and we will point out the natural generalisations as we proceed.
1.2 Point estimates, confidence intervals and hypothesis tests
In this section we describe point estimates, confidence intervals, hypothesis tests
and p-values and exemplify them in the context of an IID Gaussian model for
data. This is one of the cases where all the quantities can be computed exactly
analytically. Likelihood theory allows approximations to all of these quantities
to be obtained for much more general data models and we will overview these
generalisations at the same time as describing the specifics for the running
example.
First we detail the Gaussian (also known as normal) model.
A random variable Y is said to have a Gaussian distribution, also known as a
normal distribution, with an expectation of μ and a variance of σ² if it has a
density function of

f(y|μ, σ) = (1/(σ√(2π))) exp(−(y − μ)²/(2σ²)),  (−∞ < y < ∞).

This is denoted Y ∼ N(μ, σ²), and we have E[Y] = μ and Var[Y] = σ².
The density function when μ = 4 and σ = 2 appears below.
[Figure: the N(4, 2²) density plotted against y for y between −2 and 10; vertical axis labelled density.]
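A figure like the one above can be reproduced in a few lines of code; the sketch below (not the notes' own code) uses scipy's implementation of the N(4, 2²) density:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

y = np.linspace(-2, 10, 400)
plt.plot(y, norm.pdf(y, loc=4, scale=2))  # density of N(4, 2^2)
plt.xlabel("y")
plt.ylabel("density")
plt.show()
```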
The Gaussian distribution has two key properties that we will use throughout this
section:
1. If Y_1 and Y_2 are independent, Y_1 ∼ N(μ_1, σ_1²) and Y_2 ∼ N(μ_2, σ_2²), then
Y_1 + Y_2 ∼ N(μ_1 + μ_2, σ_1² + σ_2²).
2. If Y ∼ N(μ, σ²) and a ≠ 0 is a constant then aY ∼ N(aμ, a²σ²) and Y − a ∼
N(μ − a, σ²).
1.2.0.1 Running example: IID Gaussian data with known variance
Throughout this section we imagine that our data y = (y1, . . . , yn) arise from
the model:
Y_i ∼ IID N(μ∗, σ²),  i = 1, . . . , n.

We will suppose that μ∗ is unknown but σ² is known.
1.2.1 Point estimates and estimators
A natural estimate of the expectation, μ∗, is the average of the data,
μ̂ = ȳ_n = (1/n) ∑_{i=1}^{n} y_i.
Aside: for Gaussian data, this will also turn out to be the maximum likelihood
estimate; see later.
To examine the properties of the estimate we imagine how it might vary if we had
collected a different set of data from the same model, and done this repeatedly.
We would have a whole collection of ȳ_n values and could draw a histogram of
these to see, for example, the amount of variability in ȳ_n. This is known as
repeated sampling. It is a hypothetical process (i.e., we cannot actually do this;
we simply imagine what would happen if we could) and is the source of all of the
probability-based statistical statements that we will make in this module.
In the limit as the number of hypothetical replicate samples →∞ the histogram
would simply be the density of the random variable:
μ̂ ≡ μ̂(Y) = Ȳ_n = (1/n) ∑_{i=1}^{n} Y_i.
The quantity ȳ_n is an average of some numbers so is itself a number; it is an
estimate for μ. The quantity Ȳ_n is an average of random variables so is itself a
random variable; it is an estimator for μ. Notice that we use the same symbol, μ̂,
for both the estimate and the estimator. This is, unfortunately, standard practice,
and whether or not it refers to an estimate or an estimator must be inferred from
the context. Sometimes, to be explicit we write μ̂(y) and μ̂(Y) which distinguishes
the two and makes it clear that the first is a function of the data and the second a
function of the corresponding random variables.
To understand how much we can trust the specific estimate we obtained from
our data we look at the properties of the estimator. From Gaussian property 1,
∑_{i=1}^{n} Y_i ∼ N(nμ∗, nσ²). Hence, from Gaussian property 2,

Ȳ_n ∼ N(μ∗, σ²/n).
We note two important properties:
Firstly, E[Ȳ_n] = μ∗; i.e., Ȳ_n is an unbiased estimator for μ∗.
Secondly, for any ε > 0, as n → ∞,

P(μ∗ − ε < Ȳ_n ≤ μ∗ + ε) → 1.
In words: pick any small band around the true value, μ∗. The probability that the
estimator Ȳ_n is within that band (i.e., very close to the truth) increases to 1 as the
amount of data increases to ∞. Such an estimator is termed consistent. To see
why this happens here:
P(μ∗ − ε < Ȳ_n ≤ μ∗ + ε) = P(Ȳ_n ≤ μ∗ + ε) − P(Ȳ_n < μ∗ − ε)
= Φ((μ∗ + ε − μ∗)/(σ/√n)) − Φ((μ∗ − ε − μ∗)/(σ/√n))
= Φ(√n ε/σ) − Φ(−√n ε/σ),

where Φ(z) = ∫_{−∞}^{z} φ(t) dt, with φ the N(0,1) density function, is the cumulative
distribution function of the N(0,1) distribution; i.e., P(Z ≤ z) when Z ∼ N(0,1).
As n → ∞ the first term increases to 1 and the second decreases to 0.
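A quick numerical check of this limit (a sketch using scipy's Φ, i.e. norm.cdf, with arbitrary illustrative choices ε = 0.1 and σ = 1):

```python
import numpy as np
from scipy.stats import norm

sigma, eps = 1.0, 0.1  # arbitrary illustrative values
for n in [10, 100, 1000, 10000]:
    prob = norm.cdf(np.sqrt(n) * eps / sigma) - norm.cdf(-np.sqrt(n) * eps / sigma)
    print(f"n = {n:5d}:  P(mu* - eps < Ybar_n <= mu* + eps) = {prob:.4f}")
```

The printed probabilities climb towards 1 as n grows, exactly as the argument above predicts.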
[Figure: the replicate values of ȳ plotted against replicate index (horizontal axis: ȳ; vertical axis: replicate), with a vertical line marking the true value μ.]
We visualise this (above) by fixing μ = 5 and σ = 1. We simulate 10 replicates of
data y_1, . . . , y_n with n = 10 and for each we calculate ȳ_n (shown in blue); we
then repeat the exercise with n = 100 (shown in red) and n = 1000 (shown in green).
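A simulation along these lines can be sketched as follows (not the notes' own code; the seed is arbitrary). For each n it generates 10 replicate data sets from N(5, 1) and records how spread out the resulting values of ȳ_n are:

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
mu, sigma = 5.0, 1.0

for n in [10, 100, 1000]:
    # 10 replicate data sets of size n, and the sample mean of each
    ybars = rng.normal(mu, sigma, size=(10, n)).mean(axis=1)
    print(f"n = {n:4d}:  range of ybar_n across replicates = {ybars.max() - ybars.min():.3f}")
```

The spread of the ȳ_n values shrinks as n increases, mirroring the tightening clusters of points in the figure.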
More generally than the Gaussian setting, given independent data
y_n = (y_1, . . . , y_n) and a parameter θ with a true value of θ∗, an estimate
for θ∗ is any function of the data, θ̂ = θ̂_n(y_n). The corresponding
estimator for θ∗ is the same function applied to the random variables:
θ̂ = θ̂_n(Y_n).

The bias in the estimator is E[θ̂_n(Y_n)] − θ∗, and the estimator is consistent
if for any ε > 0,

P(θ∗ − ε < θ̂_n(Y_n) ≤ θ∗ + ε) → 1 as n → ∞.
This module is concerned with a particular class of estimators called maximum
likelihood estimators. These are consistent and in the limit as n→∞ they are also
unbiased.
Aside: of the two properties discussed so far, consistency is generally considered
to be more important than unbiasedness. For example, in the Gaussian setting
μ̂ = θ̂_n(Y_n) = Y_1 is an unbiased estimator for μ∗, but it becomes no more accurate
as the amount of data increases.
Consistency is the only formal limiting property (as n→∞) that we examine in
this section, so henceforth for simplicity of notation we drop the subscript n.
1.2.2 Confidence intervals
For the IID Gaussian data example, we have an estimate μ̂ = ȳ for a parameter μ
with a true value of μ∗, and we know that μ̂ becomes increasingly accurate as
more and more data are used (n → ∞), but how accurate is it for the particular n
we actually have? It is natural to describe our uncertainty via an interval around
our estimate, for example (ȳ − a, ȳ + a) for some a > 0. It might seem natural to then describe the
interval in terms of the probability that the true parameter is in it. However, just as
when we investigated the accuracy of an estimate by considering the properties
of the estimator under (hypothetical) repeated sampling, so we investigate the
properties of intervals such as (ȳ − a, ȳ + a) under repeated sampling.
To be clear, the statement P(μ∗ ∈ (ȳ − a, ȳ + a)) = 0.9 (say) makes no sense as
μ∗ is a fixed (if unknown) value and ȳ and a are simply numbers, so either μ∗ is
in the interval or it is not.
Instead, we consider the random interval (Ȳ − a, Ȳ + a) and state the probability
that it contains μ∗:

P((Ȳ − a, Ȳ + a) ∋ μ∗) = p.
[Figure: twelve replicate intervals plotted as horizontal lines against replicate number, with a vertical line marking the true value μ∗.]
The idea is illustrated above using 12 repeated samples and a = 0.5. We take
μ∗ = 5, σ = 1 and n = 10 and simulate 12 values of ȳ_n, drawing a horizontal line to
represent each interval of width 2a = 1.0 and a • for ȳ. Most of the intervals do
contain μ∗, but one of them does not.
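The repeated-sampling behaviour of such intervals can also be checked numerically; the sketch below (not the notes' code; the seed and number of replicates are arbitrary) uses the same μ∗ = 5, σ = 1, n = 10 and a = 0.5 and records the proportion of intervals (ȳ − a, ȳ + a) that contain μ∗:

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
mu_star, sigma, n, a = 5.0, 1.0, 10, 0.5
n_reps = 10_000

ybars = rng.normal(mu_star, sigma, size=(n_reps, n)).mean(axis=1)
covered = (ybars - a < mu_star) & (mu_star < ybars + a)
print("Proportion of intervals containing mu*:", covered.mean())
```

With these values a/(σ/√n) = 0.5√10 ≈ 1.58, so the long-run proportion should settle near 2Φ(1.58) − 1 ≈ 0.89; choosing a to achieve a target such as 0.95 is exactly what the rest of this section addresses.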
More generally we have an estimate, θ̂ = θ̂(y), for a parameter θ with
a true value of θ∗.

We consider the random interval (C_lo(Y), C_hi(Y)), which could be
(θ̂(Y) − a, θ̂(Y) + a) for some a > 0 or could be more complex. This is a
100(1 − α)% confidence interval if the probability that it contains the truth is 1 − α:

P((C_lo(Y), C_hi(Y)) ∋ θ∗) = 1 − α.
For the Gaussian example we wish to evaluate

P((Ȳ − a, Ȳ + a) ∋ μ∗).
We define the standard error of the estimator μ̂(Y) = Ȳ to be

SE(μ̂) ≡ √(Var[μ̂(Y)]) = √(Var[Ȳ]) = σ/√n,

and, to simplify notation in calculations, we abbreviate this to s.
Now μ∗ ∈ (Ȳ − a, Ȳ + a) is equivalent to μ∗ > Ȳ − a and μ∗ < Ȳ + a, or
Ȳ < μ∗ + a and Ȳ > μ∗ − a.
This is illustrated in the following diagram.
[Figure: the density of Ȳ_n centred at μ, with the points μ − a and μ + a marked on the horizontal axis.]
Thus

P((Ȳ − a, Ȳ + a) ∋ μ∗) = P(μ∗ − a < Ȳ < μ∗ + a)
= Φ((μ∗ + a − μ∗)/s) − Φ((μ∗ − a − μ∗)/s)
= Φ(a/s) − Φ(−a/s) = 1 − 2Φ(−a/s),

by symmetry. Hence, to choose a such that P((Ȳ − a, Ȳ + a) ∋ μ∗) = 1 − α, for
example, we must set

1 − 2Φ(−a/s) = 1 − α
⇐⇒ 2Φ(−a/s) = α
⇐⇒ −a/s = Φ⁻¹(α/2)
⇐⇒ a = −sΦ⁻¹(α/2).

If, for example, α = 0.05, this gives a ≈ 1.96s = 1.96σ/√n. So,

P((Ȳ − 1.96σ/√n, Ȳ + 1.96σ/√n) ∋ μ∗) ≈ 0.95.
When we know σ and we have data y1, . . . , yn, we therefore create a 95% confi-
dence interval for μ as
(ȳ − 1.96σ/√n, ȳ + 1.96σ/√n) ≡ (μ̂(y) − 1.96σ/√n, μ̂(y) + 1.96σ/√n).
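Putting this into code for a single (here simulated) data set, with σ treated as known: a minimal sketch in which the half-width is obtained as a = −sΦ⁻¹(α/2) via norm.ppf rather than by hard-coding 1.96.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)      # arbitrary seed
sigma, n = 1.0, 10
y = rng.normal(5.0, sigma, size=n)  # simulated data; true mean is 5

alpha = 0.05
s = sigma / np.sqrt(n)              # standard error of ybar
a = -s * norm.ppf(alpha / 2)        # a = -s * Phi^{-1}(alpha/2), roughly 1.96 * s
ybar = y.mean()
print(f"95% confidence interval for mu: ({ybar - a:.3f}, {ybar + a:.3f})")
```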
In the case of IID Gaussian data, the standard deviation of the parameter estimator
is s = σ/√n. However, in more general cases the standard deviation of the
estimator is not the standard deviation of the distribution divided by √n.
More generally we will find that, just as μ̂(Y) ≡ Ȳ ∼ N(μ∗, σ²/n), for a
certain class of estimators (maximum likelihood estimators) and a scalar
parameter:

θ̂(Y) ∼ N(θ∗, s²) (approximately),

where s is the standard deviation of the random variable θ̂(Y), also
known as the standard error of the estimator θ̂, and is the generalisation
of σ/√n. See later.
1.2.3 Hypothesis tests
In the Gaussian running example, a scientist may conjecture a particular value for
μ, μ0, say. They may wish to know whether the data fit with this value. Denoting
the true value of μ by μ∗, they wish to know how the data fit with the conjecture
that μ∗ = μ0. This is set up through a pair of hypotheses.
• The null hypothesis, often denoted H0, is that μ∗ = μ0.
• The alternative hypothesis, often denoted H1, is that μ∗ ≠ μ0.
Sometimes the direction of a discrepancy is important and in such cases the
null hypothesis can be H0 : μ∗ ≤ μ0, with the alternative H1 : μ∗ > μ0 (or the
other way round: H0 : μ∗ ≥ μ0 and H1 : μ∗ < μ0). These are called one-sided
hypotheses; see later.
More generally, let the true value of the parameter be θ∗. A scientist
may wish to decide whether or not θ∗ = θ0, where θ0 is some specific
hypothesised value. Or, for a specific scalar component of θ, they may
wish to decide whether that component is equal to some specific hypothesised
value. An analogous pair of hypotheses to the Gaussian running example is set up.
For the scalar case these can be one-sided just as in the Gaussian example.
The hypotheses do not have equal standing. Typically the null hypothesis provides
a default value and we test to see whether or not the data contradict this default
value. There are two possible outcomes to a test: either we reject the null
hypothesis or we fail to reject the null hypothesis. We never accept the null
hypothesis, for reasons that will be described shortly.
A natural entry into the hypothesis test is via the confidence interval: since

P((θ̂(Y) − 1.96s, θ̂(Y) + 1.96s) ∋ θ∗) ≈ 0.95,

given data y we check whether or not θ0 ∈ (θ̂(y) − 1.96s, θ̂(y) + 1.96s): if it is
then the data do not seem to contradict the hypothesis that θ∗ = θ0 and we would
fail to reject H0; on the other hand, if θ0 ∉ (θ̂(y) − 1.96s, θ̂(y) + 1.96s) then the
data do not seem to fit with H0 and we would reject the hypothesis.
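In code, this confidence-interval route to a test is a simple containment check; the sketch below (illustrative values only, with data simulated so that the true mean is 5) tests H0: μ∗ = μ0 for the Gaussian running example with μ0 = 4.5:

```python
import numpy as np

rng = np.random.default_rng(4)      # arbitrary seed
sigma, n, mu0 = 1.0, 10, 4.5        # mu0 is the hypothesised (null) value
y = rng.normal(5.0, sigma, size=n)  # simulated data; true mean is 5

s = sigma / np.sqrt(n)              # standard error of ybar
lo, hi = y.mean() - 1.96 * s, y.mean() + 1.96 * s
if lo < mu0 < hi:
    print(f"mu0 = {mu0} lies in ({lo:.3f}, {hi:.3f}): fail to reject H0")
else:
    print(f"mu0 = {mu0} is outside ({lo:.3f}, {hi:.3f}): reject H0")
```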
Some mathematical proofs work by contradiction. One makes an assumption
that is believed to be false and shows that if the assumption is true it leads to
a contradiction, hence the assumption must be false. We would like something
similar to show that the null hypothesis is false. We start by assuming that it is
true; however, we cannot usually prove that it is false by obtaining a contradiction,
we can only show that the null hypothesis does not fit well with the data, which
makes us inclined to reject it.
To perform a hypothesis test we create a scalar test statistic, T(y), formulated so
that the larger |T(y)| is, the more the data disagree with the null hypothesis.
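For the Gaussian running example with known σ, one standard choice of such a statistic (a sketch of the general idea, not taken verbatim from these notes) is T(y) = (ȳ − μ0)/(σ/√n), the standardised discrepancy between the estimate and the null value; 2Φ(−|T(y)|) then gives a two-sided p-value:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)      # arbitrary seed
sigma, n, mu0 = 1.0, 10, 4.5        # illustrative values; mu0 is the null value
y = rng.normal(5.0, sigma, size=n)  # simulated data; true mean is 5

T = (y.mean() - mu0) / (sigma / np.sqrt(n))  # standardised discrepancy from H0
p_value = 2 * norm.cdf(-abs(T))              # two-sided p-value
print(f"T(y) = {T:.3f},  p-value = {p_value:.4f}")
```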