MAST30020
Probability for Inference
Lecture Slides∗
∗Written by K. Borovkov, last modified on 1 January 2021 (or even later).
1. Probability Spaces
Random experiment (RE): You should be familiar with the concept from the
second-year probability course you have done. Anyway, it is a
(somewhat vaguely defined) concept representing real-life phenomena that:
• have a “mass character” (i.e. could be repeated many, many times, at least
in theory),
• don’t display “deterministic regularity” (i.e. the outcome of any given trial
is uncertain, to the best of our prior knowledge), but
• possess what’s called “statistical regularity”: the relative frequencies
(nA/n) of events one can observe in the experiment stabilize around some
values ∈ [0, 1] as the # of (“independent”) repetitions of the experiment
grows.
It is that last dot-point that makes Probability Theory possible.
Examples (Ex’s): coin tossing; dice rolling; gender of newborn babies etc.
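[Aside: a minimal Python sketch of this “statistical regularity”: the relative frequency nA/n of heads in simulated fair-coin tossing settles near 1/2 as n grows. The seed and the sample sizes are arbitrary illustrative choices.]

    import random

    random.seed(1)  # fixed seed so the run is reproducible

    n_heads = 0
    for n in range(1, 10_001):
        n_heads += random.random() < 0.5      # one toss of a fair coin
        if n in (10, 100, 1_000, 10_000):
            # relative frequency n_A / n of the event A = {heads}
            print(f"n = {n:>6}: n_A/n = {n_heads / n:.4f}")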
Need to specify the outcome of our RE: how?
In mathematics, we have sets = collections of objects.
• Sample space Ω is the set of all possible outcomes.
Ex. Coin tossing. H/T? What if the coin doesn’t fall on its side? What if
where it landed also matters for us? Several different Ω’s are possible (and
can be used under different circumstances).
• So outcomes are elements ω ∈ Ω (a.k.a. elementary events, sample
points, realizations).
• Events: these are subsets A ⊂ Ω for which probability is defined.
NB: in non-trivial cases, there are subsets A ⊂ Ω for which probability
CANNOT be defined. For example, one can partition Ω = [0, 1] into
countably many “identical” sets A_i (they can be obtained from each other
by translations): Ω = ∑_{i≥1} A_i. Assuming uniform probability distribution
on [0, 1], what would be the probability of A_1?
Sample spaces
• Ex. Finitely many possible outcomes ⇒ only need a finite Ω. Just list all
the outcomes: Ω = {H,T} ≡ {0, 1} for the experiment of tossing a coin
once; Ω = {1, 2, . . . , 6} for rolling a die once.
Now consider an RE where we toss a coin (or roll a die) n = 3 times in a row.
Product spaces come in handy. For two sets A and B, their (Cartesian)
product is defined by

A × B := {(a, b) : a ∈ A, b ∈ B}

(the set of all pairs (a, b) s.t. a ∈ A and b ∈ B), and we put

A^n := A × · · · × A (n copies of A) = {(a_1, a_2, . . . , a_n) : a_j ∈ A, j = 1, 2, . . . , n}.
RE with a sample space Ω_0 is replicated n times ⇒ the sample space of the
composite experiment is Ω = Ω_0^n, with ω = (ω_1, ω_2, . . . , ω_n), ω_j = outcome
of the jth replication of the basic (sub-)RE with the sample space Ω_0.
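[Aside: a quick Python sketch of Ω = Ω_0^3 for three coin tosses, using itertools.product for the Cartesian product.]

    from itertools import product

    omega_0 = ("H", "T")                      # sample space of one toss
    omega = list(product(omega_0, repeat=3))  # Omega = Omega_0^3

    print(len(omega))    # 2**3 = 8 outcomes
    print(omega[0])      # ('H', 'H', 'H'), one sample point (w1, w2, w3)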
• Ex. Keep tossing a coin till H shows up. Will a finite Ω suffice?
There are countably many possible outcomes that can be represented by
points from the set N = {1, 2, . . . } (ω = j if we observed j − 1 tails
followed by heads).†
• Ex. You have a date with your girl/boyfriend. She/he shows up
ω ∈ R+ = [0,∞) minutes late. Here we need Ω = R+.
One often uses the whole real line R or a subinterval thereof. For a
composite RE consisting of n dates, one can use
Ω = R^n := {(x_1, x_2, . . . , x_n) : x_j ∈ R, j = 1, 2, . . . , n}.
†Any set A s.t. there exists a 1–1 mapping f : A → N is said to be countable (denumerable).
Thus, Z = {. . . ,−1, 0, 1, . . . } is countable. So is N², but not [0, 1].
• If there can be any number of dates (at least, in theory: imagine that if
your girl/boyfriend was less than 20′ late for a given date, you decide to
meet again), we’ll need
R^N := {(x_1, x_2, . . . ) : x_j ∈ R, j ∈ N},
the set of real sequences, i.e. all real-valued functions on N.
[Other function spaces are used, too, e.g. C[0, 1] for the so-called
Brownian motion process.]
• This models a situation where the basic RE can be repeated infinitely
many times. Likewise, in the coin tossing RE in which you are tossing a
coin till you get H for the first time, one can use Ω = {T,H}^N.
• BTW: Note that {T,H}^N is uncountable (why?). In the last example, you
could actually use a countable space of outcomes — which one?
• Probabilities will be assigned to sets of outcomes, i.e. subsets of Ω.
• For a given RE, can one assign probabilities to all subsets of Ω?
• If Ω is countable, the answer is YES.
In the general case, the answer is NO.
• It is impossible in the basic RE of choosing a point at random from [0, 1].
• So we MUST restrict ourselves and consider only some of the subsets
of Ω, chosen in such a way that there will be no problems with assigning
probabilities to them — and these subsets will be called “events”.
• Now how to choose them? One needs to be able to manipulate events,
and, quite naturally, such (admissible!) manipulations should
produce events. Let's look at the ways to manipulate events.
Ex. Choose a student from a large class. Want the events that the student:
1) is NOT smoking;
2) is a female AND more than 55 y.o.;
3) was born in Australia OR New Zealand;
4) was born in Australia AND is NOT smoking.
These can be expressed in terms of simpler events using set operations:
1) Let Ω := population of all students in class, A := sub-population of
smokers. Take the complement A^c := {ω ∈ Ω : ω ∉ A}.
2) If B := sub-population of female students, C := those who are > 55 y.o.,
then take the intersection B ∩ C = {ω ∈ Ω : ω ∈ B and ω ∈ C}.
3) If D := students born in Australia, E := students born in New Zealand,
then take the union D ∪ E = {ω ∈ Ω : ω ∈ D or ω ∈ E}.
4) This will be the set difference D \ A := D ∩ A^c.
Note that D ∩ E = ∅, i.e. the events are disjoint (no common outcomes). Note
also that, in the case of disjoint sets, one often uses D + E to denote D ∪ E.
In fact, it is often more convenient to work with functions rather than sets.
How to “replace” sets with functions? Use indicators (indicator functions):
1_A(ω) := 1 if ω ∈ A, and 0 if ω ∉ A.

To the (main) set operations on events, there correspond the following
operations on indicator functions (draw diagrams & check!!):

1_{A^c} = 1 − 1_A,  1_{A∩B} = 1_A 1_B,  1_{A∪B} = max{1_A, 1_B}.

For the symmetric set difference A△B := A \ B + B \ A ≡ (A ∪ B) \ (A ∩ B),

1_{A△B} = |1_A − 1_B|.
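[Aside: a small Python sanity check of these indicator identities on a toy finite Ω; the sets A, B are arbitrary.]

    omega = set(range(10))
    A, B = {1, 2, 3, 4}, {3, 4, 5, 6}

    def ind(S):
        """Indicator 1_S as a dict omega -> {0, 1}."""
        return {w: int(w in S) for w in omega}

    iA, iB = ind(A), ind(B)
    assert ind(omega - A) == {w: 1 - iA[w] for w in omega}       # 1_{A^c}
    assert ind(A & B) == {w: iA[w] * iB[w] for w in omega}       # 1_{A∩B}
    assert ind(A | B) == {w: max(iA[w], iB[w]) for w in omega}   # 1_{A∪B}
    assert ind(A ^ B) == {w: abs(iA[w] - iB[w]) for w in omega}  # 1_{A△B}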
Ex. Express the following events:
Neither A nor B occurred [i.e. 1_A + 1_B = 0].
▶ A^c ∩ B^c = (A ∪ B)^c, de Morgan law.
A and B occurred, but C didn't.
▶ A ∩ B ∩ C^c ≡ (A ∩ B) \ C.
Only one of A_1, A_2, A_3 occurred [i.e. ∑_{j=1}^3 1_{A_j} = 1].
▶ A_1 \ (A_2 ∪ A_3) + A_2 \ (A_1 ∪ A_3) + A_3 \ (A_1 ∪ A_2).
Exactly two out of A_1, . . . , A_5 occurred [i.e. ∑_{j=1}^5 1_{A_j} = 2].
▶ ∑_{i<j} [(A_i ∩ A_j) \ ⋃_{k≠i,j} A_k].
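[Aside: a Python sanity check that the last set expression indeed describes {ω : ∑_{j=1}^5 1_{A_j}(ω) = 2}, with five events generated at random on a toy Ω.]

    import random
    from itertools import combinations

    random.seed(2)
    omega = set(range(20))
    A = [set(random.sample(sorted(omega), 8)) for _ in range(5)]

    # sum over pairs i < j of (A_i ∩ A_j) \ union of the remaining A_k
    expr = set()
    for i, j in combinations(range(5), 2):
        others = set().union(*(A[k] for k in range(5) if k not in (i, j)))
        expr |= (A[i] & A[j]) - others

    assert expr == {w for w in omega if sum(w in Aj for Aj in A) == 2}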
As we said earlier, we want to manipulate events and need the resulting sets
to still be events. Hence the requirement: the class of events on Ω must be closed under
the main set operations, i.e. the complement of an event should still be an
event, and the union of events must still be an event (the same for
intersections, but this is automatic from the de Morgan laws, see below).
To make things work, need a bit more: namely, that we are allowed to take
countable unions (and intersections!). In mathematics, when countable
infinity is allowed/involved, one often uses σ to indicate that.
Def. A family F of subsets of Ω is said to be a σ-algebra on Ω if
(A.1) Ω ∈ F ,
(A.2) A ∈ F ⇒ Ac ∈ F ,
(A.3) A_1, A_2, . . . ∈ F ⇒ ⋃_{n=1}^∞ A_n ∈ F.
In words: the family is closed under complementation and countable union and
intersection. Why the latter? De Morgan + (A.2) + (A.3) + (A.2):

⋂_{n=1}^∞ A_n = [(⋂_{n=1}^∞ A_n)^c]^c = [⋃_{n=1}^∞ A_n^c]^c = [⋃_{n=1}^∞ B_n]^c,  B_n := A_n^c ∈ F.
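[Aside: a finite-family Python check of this de Morgan step; the sets are arbitrary.]

    omega = set(range(8))
    A = [{0, 1, 2, 3}, {1, 2, 3, 4}, {2, 3, 5}]

    lhs = set.intersection(*A)                            # ⋂ A_n
    rhs = omega - set().union(*(omega - An for An in A))  # [⋃ A_n^c]^c
    assert lhs == rhs == {2, 3}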
Ex. (continued)
Infinitely many of the events A_1, A_2, . . . occurred [i.e. ∑_{j=1}^∞ 1_{A_j} = ∞].
Is this actually an event? [Denoted: A_n i.o., for “infinitely often”.]
Why? For instance: will a random walk S_0, S_1, S_2, . . . on Z visit 0 infinitely
many times? Let A_n := {S_n = 0}.
▶ ⋂_{n≥1} ⋃_{k≥n} A_k — and this is an event indeed, using (A.2) + (A.3).
Here: ∩ ←→ ∀, “for all”; ∪ ←→ ∃, “there exists”.
Finitely many of the events A_1, A_2, . . . occurred [i.e. ∑_{j=1}^∞ 1_{A_j} < ∞].
Why? For instance: in a random walk S_0, S_1, S_2, . . . , let A_n := {|S_n/n| > ε}
(for a fixed ε > 0). Related to the “strong” LLN.
▶ ⋃_{n≥1} ⋂_{k≥n} A_k^c — use de Morgan or apply the same logic as above.
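[Aside: a Python illustration of the first random walk example; necessarily finite-horizon, whereas “i.o.” concerns the whole infinite sequence. The simple symmetric walk on Z is recurrent, so P(A_n i.o.) = 1, and the visit count grows (on the order of √N) with the horizon N.]

    import random

    random.seed(3)
    N = 100_000
    s, visits = 0, 0
    for _ in range(N):
        s += random.choice((-1, 1))   # one +-1 step of the walk
        visits += (s == 0)            # did A_n = {S_n = 0} occur?
    print(f"visits to 0 in {N} steps: {visits}")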
One starts modelling an RE by specifying a suitable sample space Ω and then
choosing an appropriate σ-algebra F of subsets of Ω. The elements of this
σ-algebra are called events.
NB: Always ∅ ∈ F: indeed, ∅ = Ω^c, then use (A.1) + (A.2).
NB: So taking A_3 = A_4 = · · · = ∅ in (A.3) yields
A_1, A_2 ∈ F ⇒ A_1 ∪ A_2 ∈ F.
Likewise for any finite union (intersection) of events: still an event. If only
that held instead of (A.3), then F would be called an algebra of sets.
Ex. The trivial σ-algebra: F = {∅,Ω}.
No fun: no uncertain events! All the events we are allowed to look at are: the
impossible event ∅ (it never occurs!) and the certain event Ω (occurs always!).
Ex. The power set P(Ω) := class of all subsets of Ω.
This is often the choice in simple situations with discrete sample spaces.
Prm. Suppose F_n, n = 1, 2, . . . , are σ-algebras on a common sample space Ω.
Is F_1 ∪ F_2 a σ-algebra as well? What about F_1 ∩ F_2? What about ⋂_{n=1}^∞ F_n?
Of course, there are many different possible choices of F . May wish to
consider, say, a σ-algebra containing a given set A ⊂ Ω. The smallest such
σ-algebra is clearly the so-called σ-algebra generated by A:
σ(A) := {∅, A, A^c, Ω}.
Extending this, let G = {A_1, . . . , A_n} be a finite partition of Ω, i.e. the sets
A_i ⊂ Ω are pairwise disjoint, ∑_{i=1}^n A_i = Ω. Then the σ-algebra generated by G,
i.e. the smallest σ-algebra that contains all the sets A_j, is

σ(G) := {∑_{i∈I} A_i : I ⊂ {1, 2, . . . , n}}.
(Clearly a σ-algebra. Why is it the smallest one containing G?)
Similarly for a countable partition of Ω.
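[Aside: a Python sketch enumerating σ(G) for a toy finite partition G as the set of all unions ∑_{i∈I} A_i, I ⊂ {1, . . . , n}.]

    from itertools import chain, combinations

    parts = [frozenset({1, 2}), frozenset({3}), frozenset({4, 5, 6})]

    def sigma_of_partition(parts):
        n = len(parts)
        index_sets = chain.from_iterable(
            combinations(range(n), r) for r in range(n + 1))
        return {frozenset().union(*(parts[i] for i in I)) for I in index_sets}

    F = sigma_of_partition(parts)
    print(len(F))   # 2**3 = 8 events, from the empty set up to Omega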
In the case of the σ-algebra generated by a partition G, it is easy to give a
representation of all the elements of σ(G). One can also introduce the concept
of the σ-algebra generated by an arbitrary given family G of subsets of Ω —
but this is less elementary (what about all possible intersections etc.?).
Thm [1.12] For any family G of subsets of Ω, there exists a unique
σ-algebra, denoted by σ(G) and called the σ-algebra generated by G, s.t.
1) G ⊂ σ(G), and
2) if H is a σ-algebra on Ω and G ⊂ H, then σ(G) ⊂ H.
That is, σ(G) is the smallest σ-algebra on Ω containing G.
▶ How to prove such an assertion?? It's not too difficult. First note that
there are σ-algebras on Ω that contain G: just take P(Ω). So the class of all
σ-algebras on Ω that contain G is non-empty. Now consider the intersection of
all σ-algebras from the class. It will contain G (as each of the σ-algebras
contains it!) and it will be a σ-algebra (as an intersection of such). And it will
be the smallest one with these properties!! (Why?)
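[Aside: on a finite Ω, σ(G) can also be computed by brute force, closing G under complements and unions until nothing new appears (for a finite Ω, countable unions reduce to finite ones). A Python sketch (not how the theorem is proved, but it exhibits the object the proof delivers):]

    def generated_sigma_algebra(omega, G):
        omega = frozenset(omega)
        F = {frozenset(), omega} | {frozenset(A) for A in G}
        while True:
            new = {omega - A for A in F}             # complements
            new |= {A | B for A in F for B in F}     # (finite) unions
            if new <= F:                             # nothing new: done
                return F
            F |= new

    F = generated_sigma_algebra({1, 2, 3, 4}, [{1}, {1, 2}])
    print(sorted(map(sorted, F)))   # the 8 sets of sigma({{1}, {1, 2}})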
An important example of a generated σ-algebra is the class B(R) of Borel
subsets of R (a.k.a. the Borel σ-algebra on R):
B(R) := σ{(a, b] : a, b ∈ R, a < b}.
All “reasonable” subsets of R are Borel (e.g. finite and countable subsets, open
intervals, open and closed sets etc.), but B(R) ≠ P(R)! [Although giving an
example of a set which is not Borel is a challenge!]
This extends to the multivariate case:
B(R^m) := σ{∏_{i=1}^m (a_i, b_i] : a_i, b_i ∈ R, a_i < b_i}.

Here ∏_{i=1}^m (a_i, b_i] is the Cartesian product of intervals (a “brick”).
Equivalently, B(R^m) is generated by open balls. As in the univariate case, all
reasonable subsets of R^m are Borel.
When Ω = R^m, B(R^m) is the default choice of F. For Ω ⊂ R^m, one takes
F = {Ω ∩ A : A ∈ B(R^m)}, the trace of B(R^m) on Ω.
Now it’s time to introduce
Probability, from Latin probabilis “provable,” from probare “to try, to test”
(cf. to prove, to probe), from probus “good”.
Probable cause as a legal term is attested from 1676.
Probably is attested from 1535.
Probability is attested from 1551. [Source: http://www.etymonline.com]
Let (Ω, F) be a sample space endowed with a σ-algebra of its subsets (the
pair is called a measurable space).
Def. A probability on (Ω,F) is a function P : F → R s.t.
(P.1) P(A) ≥ 0, A ∈ F ,
(P.2) P(Ω) = 1,
(P.3) for any pairwise disjoint A_1, A_2, . . . ∈ F,

P(⋃_{j=1}^∞ A_j) = ∑_{j=1}^∞ P(A_j),  “countable additivity”.
Def. The triple (Ω,F ,P) is called a probability space.
NB1: P is referred to as a set function (as its argument assumes “values” that
are sets, its domain being F). NB: ω ∈ Ω and {ω} ⊂ Ω are distinct objects!
Note: P(ω) is NO GOOD, P({ω}) is OK.
NB2: On one and the same measurable space, we can have infinitely many
different probabilities. Ex: tossing a (biased) coin (once). In statistics, we do
consider different probabilities on the same measurable space all the time!
NB3: Properties (P.1) and (P.3) specify what’s called a measure. Adding
(P.2), we get a measure of “total mass one”; one often uses “probability
measure” in that case.
NB4: Why was this def’n adopted? It mimics the properties of relative
frequencies of events! Turns out that measure theory is the most natural
framework for formal treatment of probabilities. Very successful, starting with
being able to establish theoretically all the main “statistical laws” observed in
the real world, and first of all — the LLN and CLT.
Ex. The point mass (degenerate distribution) at (a fixed point) ω ∈ Ω:

ε_ω(A) := 1_A(ω).
NB the difference in interpretation: the LHS is a function of A (for a fixed
outcome ω), whereas the RHS is a function of ω (for a fixed event A).
It models a situation with a deterministic outcome: repeat your RE till you
turn blue, but each time you’ll be seeing one and the same outcome: ω.
Ex. Counting measure on N (or even R ⊃ N):
µ(B) := ∑_{n≥1} ε_n(B),  B ∈ F = P(N).
This is not a probability! The measure counts the number of points in B
(which can be infinite, of course).
Ex. Discrete uniform distribution: suppose Ω is finite, F = P(Ω). If all
outcomes ω ∈ Ω are “equally likely”, then they should have the same
probability. Using notation |B| for cardinality of B, just put
P(A) := |A|/|Ω|, A ∈ F
(this is the so-called “classical probability”).
NB: using a version of the counting measure, this can be re-written as

P(A) = (1/|Ω|) ∑_{ω∈Ω} ε_ω(A).
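[Aside: a tiny Python check of the classical probability for one roll of a die, both directly as |A|/|Ω| and via the normalised sum of point masses ε_ω(A) = 1_A(ω).]

    from fractions import Fraction

    omega = {1, 2, 3, 4, 5, 6}
    A = {2, 4, 6}   # "even score"

    P_classical = Fraction(len(A), len(omega))
    P_point_masses = Fraction(sum(w in A for w in omega), len(omega))
    assert P_classical == P_point_masses == Fraction(1, 2)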
Elementary properties of probability (Thm 1.23):
a) P(∅) = 0.
▶ Taking A_1 = A_2 = · · · = ∅ in (P.3), we have P(∅) = ∑_{n=1}^∞ P(∅), and the
only x ≥ 0 with x = x + x + · · · is x = 0 — bingo!
b) finite additivity: for any pairwise disjoint A_1, . . . , A_n ∈ F,

(PF.3) P(⋃_{j=1}^n A_j) = ∑_{j=1}^n P(A_j).
▶ Take A_{n+1} = A_{n+2} = · · · = ∅ in (P.3) and use a) — bingo!
NB: in the special case A_1 = A, A_2 = A^c, we obtain

P(A^c) = 1 − P(A).
Elementary properties of probability (Thm 1.23): continued.
c) If A ⊂ B (from now on, always assume that A,B, . . . ∈ F), then
P(B \ A) = P(B) − P(A).
▶ Follows from b): take A_1 = A, A_2 = B \ A; then B = A_1 + A_2.
NB: So probability is non-decreasing: P(A) ≤ P(B) for A ⊂ B.
d) For any events A and B,

P(B ∪ A) = P(B) + P(A) − P(B ∩ A)

(the simplest version of the inclusion-exclusion principle), and so always

P(B ∪ A) ≤ P(B) + P(A),  “subadditivity of probability”.
▶ As A ∪ B = A + B \ A_1, where A_1 := A ∩ B ⊂ B, we can use b) and then c):

P(A ∪ B) = P(A) + P(B \ A_1) = P(A) + P(B) − P(A_1),

bingo!
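[Aside: a quick Python check of d) under the classical probability on a die roll; the events A, B are arbitrary.]

    from fractions import Fraction

    omega = {1, 2, 3, 4, 5, 6}
    P = lambda S: Fraction(len(S), len(omega))   # classical probability

    A, B = {1, 2, 3}, {3, 4}
    assert P(A | B) == P(A) + P(B) - P(A & B)    # inclusion-exclusion
    assert P(A | B) <= P(A) + P(B)               # subadditivity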
Subadditivity of prob'ty extends to Boole's ineq'ty (Propn 1.24): for any
A_1, A_2, . . . ∈ F,

P(⋃_{j=1}^∞ A_j) ≤ ∑_{j=1}^∞ P(A_j).
▶ “Disjointification”: let B_1 := A_1, B_2 := A_2 \ A_1, B_3 := A_3 \ (A_1 ∪ A_2) etc.:

B_n := A_n \ ⋃_{j=1}^{n−1} A_j.

Then 1) B_n ⊂ A_n, n ≥ 1; 2) ⋃_{j≤n} A_j = ⋃_{j≤n} B_j, n ≤ ∞ (prove by
induction); and 3) B_1, B_2, . . . are disjoint. Hence, using monotonicity of P,
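[Aside: a Python sketch of the disjointification construction for a finite family, checking properties 1)–3); the sets A_n are arbitrary.]

    from itertools import combinations

    A = [{1, 2, 3}, {2, 3, 4}, {5}, {1, 6}]

    B, seen = [], set()
    for An in A:
        B.append(An - seen)   # B_n = A_n \ (A_1 ∪ ... ∪ A_{n-1})
        seen |= An

    assert all(Bn <= An for Bn, An in zip(B, A))                # 1) B_n ⊂ A_n
    for n in range(1, len(A) + 1):                              # 2) same unions
        assert set().union(*A[:n]) == set().union(*B[:n])
    assert all(not (Bi & Bj) for Bi, Bj in combinations(B, 2))  # 3) disjoint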