STAT 2011 Probability and Estimation Theory
Mathematics and Statistics
Contents
1 Random variables
1.1 Continuous random variables
1.2 Expected values
References
1 Random variables
(Beginning of Lecture 4.1)
(We begin this new section on continuous RVs with an example that motivates their use
as an approximation to discrete RVs. We will then independently define what we mean
by a continuous RV.)
1.1 Continuous random variables
Example 1.1.1. An electronic surveillance monitor is turned on briefly, once every hour,
and has probability 0.905 of working properly, regardless of how long it has remained in
service.
Let RV X be the hour at which the first failure occurs. Then
$$P(X = k) = p_X(k) = (\underbrace{0.905}_{P(\text{no fail})})^{k-1}\,(\underbrace{0.095}_{P(\text{fail})})^{1}; \qquad k = 1, 2, 3, \dots$$
We will compare the probability histogram of X with the exponential curve
$$y = 0.1e^{-0.1x},$$
superimposed on the graph of p_X(k).
Note
$$P(0 \le X \le 4) = P(1 \le X \le 4) = \sum_{k=1}^{4} p_X(k) = 0.32918.$$
Also,
$$\int_0^4 0.1e^{-0.1x}\,dx = -e^{-0.1x}\Big|_0^4 = 1 - e^{-0.4} = 0.32968.$$
Thus the approximation agrees with the true probability to three decimal places.
Figure 3.4.2 in Larsen and Marx (2012) shows the probability histogram, overlaid with
the exponential curve. 2
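As a quick sanity check, the discrete sum and the exponential integral above can be computed directly (a Python sketch; the variable names are this illustration's, not the text's):

```python
import math

# P(1 <= X <= 4) under the geometric pmf p_X(k) = 0.905^(k-1) * 0.095
p_discrete = sum(0.905 ** (k - 1) * 0.095 for k in range(1, 5))

# Area under y = 0.1 * exp(-0.1 x) over [0, 4], in closed form
p_continuous = 1 - math.exp(-0.4)

# The two probabilities agree to about three decimal places
assert abs(p_discrete - p_continuous) < 1e-3
```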
This is Figure 3.4.2 in Larsen and Marx (2012)
Comment. Implicit in the similarity between p_X(k) and 0.1 exp(−0.1x) is the interpretation
of probabilities at points as areas of rectangles over intervals.
Definition 1.1.1 (Continuous probability function). A probability function P on a set
of real numbers S is called continuous if there exists a (probability density) function f(t)
such that for any closed interval [a, b] ⊂ S,
$$P([a, b]) = \int_a^b f(t)\,dt.$$
2
Any function P that satisfies this definition gives
$$P(A) = \int_A f(t)\,dt \qquad \text{for any } A \subset S \text{ for which the integral is defined.}$$
Conversely, suppose a (probability density) function f satisfies
1. f(t) ≥ 0 for all t
2. $\int_{-\infty}^{\infty} f(t)\,dt = 1$
Then
$$P(A) = \int_A f(t)\,dt \qquad \text{for all } A$$
satisfies the probability axioms, and therefore f induces a probability (measure).
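The two density conditions can be checked numerically for the curve 0.1e^{−0.1x} from Example 1.1.1 (a sketch; the grid width and truncation point are arbitrary choices of this illustration):

```python
import math

def f(t):
    # Candidate density from Example 1.1.1's curve: 0.1 * exp(-0.1 t) for t >= 0
    return 0.1 * math.exp(-0.1 * t) if t >= 0 else 0.0

# Condition 1: non-negativity on a grid of test points
assert all(f(t) >= 0 for t in range(-10, 200))

# Condition 2: left Riemann sum over [0, 200]; the tail beyond is negligible
h = 0.001
total = sum(f(i * h) * h for i in range(200_000))
assert abs(total - 1.0) < 1e-3
```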
(Before we formally define pdf’s, we will consider some examples.)
For sample spaces S having an uncountably infinite number of possible outcomes (think
S = R or S = [0, 1]), the function f(t) plays the same role as the function p(s) in discrete
sample spaces.
Example 1.1.2 (Equiprobable on an interval). Let Y ∼ U(a, b), that is, Y has pdf
$$f(t) = \left(\frac{1}{b-a}\right)\mathbf{1}_{[a,b]}(t), \qquad \text{where } \mathbf{1}_{[a,b]}(t) = \begin{cases} 1 & a \le t \le b \\ 0 & \text{otherwise.} \end{cases}$$
Note
$$\int_{-\infty}^{\infty} f(t)\,dt = \int_a^b f(t)\,dt = (b-a)^{-1} \times t\,\Big|_a^b = 1.$$
Let A = [l, u] = [1, 3] ⊂ [0, 10] = S. Then
$$P(A) = \int_l^u \left(\frac{1}{b-a}\right)\mathbf{1}_{[a,b]}(t)\,dt = \frac{u - l}{b - a} = \frac{3 - 1}{10 - 0} = 0.2.$$
2
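The computation P(A) = 0.2 can be reproduced with a midpoint Riemann sum (a sketch; the grid size n is an arbitrary choice):

```python
# P(A) for A = [l, u] = [1, 3] under Y ~ U(a, b) = U(0, 10)
a, b = 0.0, 10.0
l, u = 1.0, 3.0

def f(t):
    # Uniform density: 1/(b - a) on [a, b], 0 otherwise
    return 1.0 / (b - a) if a <= t <= b else 0.0

# Midpoint Riemann sum of f over [l, u]
n = 10_000
h = (u - l) / n
p = sum(f(l + (i + 0.5) * h) * h for i in range(n))
assert abs(p - 0.2) < 1e-9
```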
Example 1.1.3 (Normal or Gaussian distribution). This is the most important of all
continuous probability distributions. Its pdf is given by
$$f(t) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{t-\mu}{\sigma}\right)^2}.$$
Figure 3.4.5 in Larsen and Marx (2012) shows three different normal probability distributions. 2
(Such continuous probability distributions are often used to model observed empirical
histograms, i.e. by fitting f(t) to data. See the exercise in this week's lab.)
This is Figure 3.4.5 in Larsen and Marx (2012):
Definition 1.1.2 (Probability density function). Let Y be a function from a sample space
S to R. The function Y is called a continuous RV if there exists a function f_Y(y) such
that for any a, b ∈ R with a < b,
$$P(a \le Y \le b) = \int_a^b f_Y(y)\,dy.$$
The function fY (y) is called the probability density function (pdf) for Y . 2
As in the discrete case, the cumulative distribution function (cdf) is defined by
$$F_Y(y) := P(Y \le y).$$
Remark. In the continuous case, $F_Y(y) = \int_{-\infty}^{y} f_Y(t)\,dt$.
(There was no Lecture 4.2)
(Beginning of Lecture 4.3)
Theorem 1.1.1. Let F_Y(y) be the cdf of a continuous RV with pdf f_Y(y). Then
$$\frac{d}{dy} F_Y(y) = f_Y(y).$$ 2
(This follows immediately from the Fundamental Theorem of Calculus: Let f be a continuous
real-valued function defined on a closed interval [a, b]. Let F be the function defined,
for all x ∈ [a, b] by
$$F(x) = \int_a^x f(t)\,dt.$$
Then F is uniformly continuous on [a, b] and differentiable on the open interval (a, b) and
$$F'(x) = f(x)$$
for all x ∈ (a, b) so F is an antiderivative of f .)
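Theorem 1.1.1 can also be illustrated numerically: a central-difference quotient of the cdf recovers the pdf. A sketch for the standard normal, using the erf-based form of its cdf (the step size and test points are arbitrary choices):

```python
import math

def norm_cdf(y):
    # Standard-normal cdf expressed through the error function
    return 0.5 * (1 + math.erf(y / math.sqrt(2)))

def norm_pdf(y):
    return math.exp(-0.5 * y * y) / math.sqrt(2 * math.pi)

# Central-difference quotient of F_Y approximates f_Y
h = 1e-5
for y in (-1.5, 0.0, 0.7, 2.0):
    deriv = (norm_cdf(y + h) - norm_cdf(y - h)) / (2 * h)
    assert abs(deriv - norm_pdf(y)) < 1e-8
```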
Theorem 1.1.2. Let Y be a continuous RV with cdf F_Y(y). Then,
a. P(Y > y) = 1 − P(Y ≤ y) = 1 − F_Y(y)
b. P(r < Y ≤ s) = F_Y(s) − F_Y(r)
c. lim_{y→∞} F_Y(y) = 1
d. lim_{y→−∞} F_Y(y) = 0 2
Proof of Theorem 1.1.2
a. P(Y > y) = 1 − P(Y ≤ y) since the event (Y > y) is the complement of the event
(Y ≤ y).
b. Similar to a., and follows directly from the Fundamental Theorem of Calculus.
c. Let the sequence {y_n} be strictly increasing with lim_{n→∞} y_n = ∞. If lim_{n→∞} F_Y(y_n) = 1
for every such sequence {y_n}, then lim_{y→∞} F_Y(y) = 1. Consider the disjoint
sets A_1 = (Y ≤ y_1) and A_n = (y_{n−1} < Y ≤ y_n) for n = 2, 3, .... Then
$$F_Y(y_n) = P(\cup_{i=1}^{n} A_i) = \sum_{i=1}^{n} P(A_i).$$
Also, the sample space S = ∪_{i=1}^{∞} A_i, thus by Axiom 1 of probability (Lecture 1.2),
$$1 = P(S) = \sum_{i=1}^{\infty} P(A_i).$$
Putting these equalities together gives
$$1 = \sum_{i=1}^{\infty} P(A_i) = \lim_{n\to\infty} \sum_{i=1}^{n} P(A_i) = \lim_{n\to\infty} F_Y(y_n).$$
d. It follows from c., applied to the RV −Y, that lim_{t→∞} F_{−Y}(t) = 1. Then note that
F_Y(y) = P(Y ≤ y) = 1 − P(Y > y) = 1 − P(−Y < −y), and P(−Y < −y) ≥ F_{−Y}(−y − 1) → 1
as y → −∞, so F_Y(y) → 0.
2
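Properties a.–d. can be verified concretely for an exponential cdf F(y) = 1 − e^{−y} (a sketch; the rate-1 choice and the test points are arbitrary):

```python
import math

def F(y):
    # Exponential cdf F(y) = 1 - e^{-y} for y >= 0 (rate 1)
    return 1 - math.exp(-y) if y >= 0 else 0.0

# a. P(Y > y) = 1 - F(y); for this cdf, P(Y > y) = e^{-y}
y = 1.3
assert abs((1 - F(y)) - math.exp(-y)) < 1e-12

# b. P(r < Y <= s) = F(s) - F(r), against the closed-form integral of e^{-t}
r, s = 0.5, 2.0
assert abs((F(s) - F(r)) - (math.exp(-r) - math.exp(-s))) < 1e-12

# c./d. the cdf tends to 1 and 0 in the tails
assert F(50.0) > 1 - 1e-12
assert F(-50.0) == 0.0
```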
Example 1.1.4. Let $f_Y(y) = 4y^3\,\mathbf{1}_{[0,1]}(y)$. Show that f_Y(y) is a pdf and
find P(1/4 ≤ Y ≤ 3/4).
Clearly f_Y(y) ≥ 0 and
$$\int_0^1 4y^3\,dy = y^4\Big|_0^1 = 1.$$
Thus f_Y(y) is a density. Further,
$$P(1/4 \le Y \le 3/4) = y^4\Big|_{1/4}^{3/4} = \frac{81}{256} - \frac{1}{256} = \frac{80}{256} = \frac{5}{16}.$$
2
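A numeric check of this density (a sketch; the midpoint-rule grid is an arbitrary choice):

```python
def f(y):
    # f_Y(y) = 4 y^3 on [0, 1], 0 otherwise
    return 4 * y ** 3 if 0 <= y <= 1 else 0.0

def riemann(lo, hi, n=100_000):
    # Midpoint rule for the integral of f over [lo, hi]
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) * h for i in range(n))

assert abs(riemann(0.0, 1.0) - 1.0) < 1e-6       # normalisation
p = riemann(0.25, 0.75)
assert abs(p - (0.75 ** 4 - 0.25 ** 4)) < 1e-6   # y^4 at the endpoints: 5/16
```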
1.2 Expected values
Rule (Integration by parts). Let $\frac{d}{dx} F(x) = f(x)$ and $\frac{d}{dx} G(x) = g(x)$. Then for any
constants a, b ∈ R,
$$\int_a^b F(x)g(x)\,dx = F(x)G(x)\Big|_a^b - \int_a^b f(x)G(x)\,dx.$$
(In short: $\int Fg = FG - \int fG$.) 2
Example 1.2.1. Let F(x) = x, $g(x) = e^{-x}$, a = 0 and b = ∞. Then
$$\int_0^{\infty} x e^{-x}\,dx = x(-e^{-x})\Big|_0^{\infty} - \int_0^{\infty} (1)(-e^{-x})\,dx = 0 - 0 + \int_0^{\infty} e^{-x}\,dx = -e^{-x}\Big|_0^{\infty} = 0 + 1 = 1.$$
2
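The same integral can be approximated numerically, truncating the infinite upper limit where the tail is negligible (a sketch; the cutoff 50 and the grid size are arbitrary choices):

```python
import math

def g(x):
    # The integrand x * e^{-x}
    return x * math.exp(-x)

# Midpoint Riemann sum over [0, 50]; the tail beyond 50 is about 1e-20
n, hi = 500_000, 50.0
h = hi / n
total = sum(g((i + 0.5) * h) * h for i in range(n))
assert abs(total - 1.0) < 1e-4
```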
Probability density functions (pdf's) provide a global overview of a random variable's
behaviour.
If X is
discrete: p_X(k) = P(X = k) for all k (the pdf p_X defines probabilities)
continuous: $P(X \in A) = \int_A f_X(x)\,dx$ (the pdf f_X induces probabilities)
The expected value is a single-number summary of a RV: it measures the "average" value
the RV attains, and thus describes the central tendency (point of balance) of the pdf.
Example 1.2.2. The following is Figure 3.5.1 in Larsen and Marx (2012):
Note that
µ_X may or may not be observable;
µ_Y is typically observable, although examples exist where, for µ_Y ∈ A = [a, b] with a < b,
P(A) = 0.
2
Example 1.2.3 (Roulette). The following picture is from Wikimedia (retrieved on 1/2/2018)
and shows the American roulette board.
There are n = 38 possible numbers, 18 of which are odd and 18 even (zero does not count
as even), and two are neither odd nor even.
Place $1 on Odd. Let RV X denote the winnings. Then,
$$X = \begin{cases} 1 & \text{with } p_X(1) = P(X = 1) = \frac{18}{38} = \frac{9}{19} \\ -1 & \text{with } p_X(-1) = P(X = -1) = \frac{20}{38} = \frac{10}{19} \end{cases}$$
Thus,
$$\text{``Expected'' winnings} = E(X) = (\$1)\,\frac{9}{19} + (-\$1)\,\frac{10}{19} = -\frac{\$1}{19} \approx -5 \text{ cents.}$$
The following is Figure 3.5.2 in Larsen and Marx (2012):
2
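The roulette expectation can be computed exactly with rational arithmetic (a sketch using Python's fractions module):

```python
from fractions import Fraction

# Winnings from a $1 bet on Odd: +1 with prob 18/38, -1 with prob 20/38
pmf = {1: Fraction(18, 38), -1: Fraction(20, 38)}
ev = sum(k * p for k, p in pmf.items())

assert ev == Fraction(-1, 19)          # an expected loss of $1/19
assert abs(float(ev) + 0.0526) < 1e-3  # roughly -5 cents per bet
```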
Definition 1.2.1 (Expected value of a discrete RV X). Let discrete RV X have pdf
p_X(k). The expected value of X, denoted by E(X) (or sometimes µ or µ_X), is given by
$$E(X) = \sum_{\text{all } k} k \cdot p_X(k).$$ 2
(The above equation shows that the pX(k)’s are weights when taking a weighted average
of the k’s.)
Definition 1.2.2 (Expected value of a continuous RV Y). Let continuous RV Y have
pdf f_Y(y). Then
$$E(Y) = \int_{-\infty}^{\infty} y \cdot f_Y(y)\,dy.$$ 2
Comment. We assume that both the sum and the integral in the two definitions above
converge absolutely, that is,
$$\sum_{\text{all } k} |k|\,p_X(k) < \infty \qquad \text{and} \qquad \int_{-\infty}^{\infty} |y|\,f_Y(y)\,dy < \infty.$$
If not, we say the RV has no finite expectation.
Example 1.2.4 (Equally likely outcomes). Here p_X(k) = 1/n for all k ∈ S with #S = n.
Then,
$$E(X) \stackrel{\text{Def}}{=} \sum_{\text{all } k} k\,p_X(k) = \sum_{\text{all } k} k\,\frac{1}{n} = \frac{1}{n} \sum_{\text{all } k} k.$$
Thus E(X) is the arithmetic mean of all values in S. 2
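For a concrete instance, a fair die has S = {1, ..., 6} and E(X) = 3.5, the arithmetic mean (a sketch in exact rational arithmetic; the die is this illustration's choice):

```python
from fractions import Fraction

# A fair die: S = {1, ..., 6}, each outcome with probability 1/6
S = range(1, 7)
ev = sum(Fraction(k, len(S)) for k in S)

# E(X) equals the arithmetic mean of the values in S
assert ev == Fraction(sum(S), len(S)) == Fraction(7, 2)
```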
Theorem 1.2.1. Suppose X is a binomial RV with parameters n and p, X ∼ Bin(n, p).
Then E(X) = np. 2
(np makes intuitive sense: consider a repeated coin toss; each success has probability p
and there are n trials.)
The proof of Theorem 1.2.1 introduces an important technique that is very useful for
evaluating rather complicated sums: identify $\sum_{\text{all } k} p_Z(k)$ for some discrete
random variable Z; such a sum, of course, equals 1.
Proof of Theorem 1.2.1
$$E(X) \stackrel{\text{Def}}{=} \sum_{k=0}^{n} k \left\{ \binom{n}{k} p^k (1-p)^{n-k} \right\}$$
$$= 0 + \sum_{k=1}^{n} k \left\{ \frac{n!}{k!(n-k)!}\, p\, p^{k-1} (1-p)^{n-k} \right\}$$
$$= np \sum_{k=1}^{n} \left\{ \frac{(n-1)!}{(k-1)!(n-k)!}\, p^{k-1} (1-p)^{n-k} \right\}$$
$$= np \sum_{k=1}^{n} \left\{ \binom{n-1}{k-1} p^{k-1} (1-p)^{n-k} \right\}$$
Substituting m = n − 1 and j = k − 1,
$$= np \underbrace{\sum_{j=0}^{m} \binom{m}{j} p^j (1-p)^{m-j}}_{=1} = np,$$
because the probabilities p_Z(j) of Z ∼ Bin(m, p) sum to 1.
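The identity E(X) = np can be confirmed by evaluating the defining sum directly (a sketch; the parameter values n = 12, p = 0.3 are arbitrary choices):

```python
import math

# Evaluate E(X) = sum k * C(n,k) p^k (1-p)^(n-k) and compare with np
n, p = 12, 0.3
ev = sum(k * math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1))
assert abs(ev - n * p) < 1e-12
```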
Example 1.2.5 (Hypergeometric). An urn contains 9 chips, r = 5 red and w = 4 white.
Let X denote the number of red chips in a draw of three (without replacement). Thus,
$$P(X = k) = p_X(k) = \frac{\binom{5}{k}\binom{4}{3-k}}{\binom{9}{3}}; \qquad k = 0, 1, 2, 3.$$
Thus,
$$E(X) = \sum_{k=0}^{3} k\,p_X(k) = 0 + (1)\,\frac{30}{84} + (2)\,\frac{40}{84} + (3)\,\frac{10}{84} = \frac{140}{84} = \frac{5}{3}.$$
2
The proof of the following theorem is one of the tutorial problems.
Theorem 1.2.2. If X is a hypergeometric RV with parameters r, w and n, that is, with
pdf
$$p_X(k) = P(X = k) = \frac{\binom{r}{k}\binom{w}{n-k}}{\binom{r+w}{n}},$$
then
$$E(X) = \frac{rn}{r + w}.$$ 2
Example (continued). Using this theorem, we confirm that
$$E(X) = \frac{rn}{r + w} = \frac{(5)(3)}{9} = \frac{15}{9} = \frac{5}{3}.$$
2
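Both the direct sum and the closed form rn/(r + w) can be evaluated for the urn of Example 1.2.5 (a sketch; math.comb supplies the binomial coefficients):

```python
from math import comb

# The urn of Example 1.2.5: r = 5 red, w = 4 white, n = 3 draws
r, w, n = 5, 4, 3
ev = sum(k * comb(r, k) * comb(w, n - k) / comb(r + w, n) for k in range(n + 1))

assert abs(ev - 5 / 3) < 1e-12            # the direct sum
assert abs(ev - r * n / (r + w)) < 1e-12  # the closed form rn/(r+w)
```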
Comment. Let $p = \frac{r}{r+w}$ = "proportion of red among all". Then,
$$E(X) \stackrel{\text{Thm}}{=} \frac{rn}{r + w} = pn,$$
which has the same structure as for Z ∼ Bin(n, p).
Example 1.2.6. Let RV Y be the distance a molecule in a gas travels before it collides
with another molecule. Model Y by the exponential pdf
$$f_Y(y) = \frac{1}{\mu}\, e^{-y/\mu}, \qquad y \ge 0,$$
with µ a positive constant, known as the mean free path.
$$E(Y) \stackrel{\text{Def}}{=} \int_{-\infty}^{\infty} y\, \frac{1}{\mu}\, e^{-y/\mu}\, \mathbf{1}_{[0,\infty)}(y)\,dy = \int_0^{\infty} y\, \frac{1}{\mu}\, e^{-y/\mu}\,dy.$$
Let w(y) = y/µ, thus dw/dy = 1/µ and dy = µ dw. Then,
$$E(Y) = \int_0^{\infty} w e^{-w}\, \mu\,dw = \mu \int_0^{\infty} w e^{-w}\,dw.$$
Integration by parts gives (as seen at the start of the lecture)
$$E(Y) = \mu \underbrace{\Big[ -we^{-w} - e^{-w} \Big]_0^{\infty}}_{=1} = \mu.$$ 2
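The result E(Y) = µ can be checked with a truncated midpoint Riemann sum (a sketch; µ = 2.5, the cutoff 50µ, and the grid size are arbitrary choices of this illustration):

```python
import math

# E(Y) for f_Y(y) = (1/mu) e^{-y/mu}, via a midpoint sum truncated at 50*mu
mu = 2.5

def integrand(y):
    return y * (1 / mu) * math.exp(-y / mu)

n, hi = 400_000, 50 * mu
h = hi / n
ev = sum(integrand((i + 0.5) * h) * h for i in range(n))
assert abs(ev - mu) < 1e-3
```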
References
Larsen RJ, Marx ML (2012). An Introduction to Mathematical Statistics and Its Applications,
5th Edition. Boston: Pearson, Sections 3.4 and 3.5.