Computing for Data Science
or Stochastic Computation and Algorithms
Course notes for STAT 413
Mathematical & Statistical Sciences
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Contents
Preface
1 Random Number Generation
1.1 Probability Integral Transform
1.1.1 CDFs with computable inverses
1.1.2 CDFs without computable inverses
1.1.3 CDFs with discrete distributions
1.2 Other Transformations
1.2.1 Box-Muller Transform
1.3 Acceptance-Rejection Sampling
1.3.1 Acceptance-Rejection-Squeeze
1.3.2 Box-Muller without Trigonometry
1.4 Ratio of Uniforms
1.4.1 Gamma Random Variables
1.5 Points in Geometric Objects
1.5.1 Simplexes and the Dirichlet Distribution
1.5.2 Spheres
2 Volume Computation and Integration
2.1 Volume Computation
2.1.1 Sampling from a regular grid
2.1.2 Uniform Monte Carlo
2.2 Integration
2.2.1 Deterministic Approaches
2.2.2 Monte Carlo Integration
2.3 Importance Sampling
2.4 Stratified Sampling
3 Stochastic Optimization
3.1 Basic Stochastic Search
3.2 Stochastic Gradient Descent
A Appendix
A.1 Change of Variables
Preface
Never really seized by the need to calculate, he was more
apt to be aware of pattern than of brute numeration.
Ratner’s Star
Don DeLillo (1976)
This course came into being only a few years ago. At its birth, it was called
Statistical Computing. Then, it was rebranded as Computing for Data Science. If I
were to give it a more descriptive name, I would call it Stochastic Computation and
Algorithms. The main focus of these notes is how to calculate a desired number with
the help of randomization. In Chapter 1, we first discuss how to generate random
numbers using a deterministic computer. It has been said that “It may seem perverse
to use a computer, that most precise and deterministic of all machines conceived by
the human mind, to produce ‘random’ numbers” (Press et al., 2007). Nevertheless,
we require random numbers, for some definition of random, in order to proceed with
stochastic algorithms. In Chapter 2, we consider how to use stochastic algorithms
to compute volumes and integrate functions. In Chapter 3, we consider how to use
stochastic algorithms to optimize a function.
There are three textbooks I used as inspiration for these course notes: George
Fishman’s Monte Carlo: concepts, algorithms, and applications (Fishman, 2013);
the famous Numerical Recipes by Press, Teukolsky, Vetterling, and Flannery (Press
et al., 2007); and James Spall’s Introduction to stochastic search and optimization
(Spall, 2005).
Adam B Kashlak
Edmonton, Canada
Fall 2022
Chapter 1
Random Number Generation
Introduction
Type ?set.seed into the R command prompt and start reading under the “Details”
heading. It is easy to take random number generation for granted, but an incredible
amount of work has gone into making sure that your simulations in R or Python or
other languages are as random as possible given a deterministic computer. Many
of the implemented methods within R or Python or other computing languages are
based on algorithms such as Linear Feedback Shift Registers (LFSR) and Linear
Congruential Generators (LCG). More details on these methods can be found in
Chapter 7 of Fishman (2013) and Chapter 7 of Press et al. (2007). “Random”
numbers generated on a computer are often called pseudorandom numbers, as they
are deterministic but satisfy many properties of a set of truly random numbers.
For this chapter, we assume that it is possible to generate independent uniformly
random binary variables. That is, for any n ∈ N, we can generate B1, . . . , Bn such
that P(Bi = 1) = P(Bi = 0) = 0.5 and such that Bi and Bj are independent for all
i ≠ j. At the time of writing, pure randomness has not been achieved, and one has
to be cautious of the fact that such a sequence of Bi's is necessarily deterministic,
but sufficiently chaotic to be considered as random.
From uniformly random bits come uniformly random variates on the unit interval
[0, 1]. We will denote U1, . . . , Un iid∼ Uniform[0, 1] to be n independent uniform
random variates. This implies that the cumulative distribution function (CDF) for
Ui is
\[ P(U_i \le u) = \begin{cases} 0, & u \le 0 \\ u, & u \in (0, 1) \\ 1, & u \ge 1 \end{cases} \]
and Ui ⊥⊥ Uj for i ≠ j (the symbol ⊥⊥ is often used to indicate independence
between two random variables). Generating such uniform variates can be done in R via the
function runif( ). This function directly calls compiled C code that implements
the chosen random generator from set.seed( ).
Note that the events U = 0 and U = 1 have probability zero of occurring. In
fact, in the C code underlying the R code for random number generation, if the
pseudorandom uniform variate is exactly 0 or 1, the code adjusts it very slightly up
or down, respectively, so that runif( ) never actually returns 0 or 1.
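As a quick illustration (a minimal sketch; the seed value 413 and the sample size are arbitrary choices), the following R snippet shows that the generator behind runif( ) is reproducible and hence deterministic:

# Set the state of the pseudorandom number generator for reproducibility
set.seed(413)
u <- runif(5)            # five iid Uniform[0,1] variates
print(u)

# Re-setting the seed reproduces exactly the same "random" numbers
set.seed(413)
identical(u, runif(5))   # TRUE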
Remark 1.0.1. In probability & measure theory, there is a theorem proving the
existence of a countably infinite collection of independent random variables on R.
The technique of the proof uses precisely the same approach as the computer does.
That is: (1) generate independent Bernoulli random variables; (2) use those to
construct independent uniform random variables; (3) use the uniforms to construct
a general collection of independent random variables. See the STAT 571 notes for
more details on this proof.
1.1 Probability Integral Transform
Thus far, we have our collection of iid uniform random variates U1, . . . , Un. However,
more interesting and diverse distributions are often desirable. We will use U1, . . . , Un
to construct new random variables with arbitrary probability distributions. The first
tool we will make use of is the probability integral transform.
Theorem 1.1.1 (Probability Integral Transform). Let F : R → [0, 1] be a strictly
increasing continuous function such that limx→−∞ F (x) = 0 and limx→∞ F (x) = 1.
Then, F−1 : [0, 1]→ R exists and if U ∼ Uniform [0, 1], then
Z = F−1(U)
has F as its cumulative distribution function.
Proof. We want the CDF for Z = F−1(U). That is,
P (Z ≤ t) = P (F−1(U) ≤ t)
= P (U ≤ F (t)) = F (t).
Alternatively, this theorem implies that if Z has continuous increasing CDF F ,
then F (Z) is uniform on [0, 1]. Note that this theorem can be extended to CDFs
with discontinuities by defining the inverse CDF as follows:
F−1(u) = inf{t ∈ R : F (t) ≥ u}
for u ∈ [0, 1]. A jump discontinuity in a CDF implies a point mass exists in the
distribution. Only a countable number of these are possible.
1.1.1 CDFs with computable inverses
If the function F has an inverse that exists and furthermore if that inverse can be
expressed in a closed form analytically, then generating random variables from F
is as easy as evaluating the inverse F−1. However, just because generating random
variables from F is easy to code does not imply that this approach is computationally
efficient. More on this will be discussed in Section 1.2.
Example 1.1.1 (Exponential Distribution). The exponential distribution is criti-
cally important in queuing processes, Markov processes, and other models where a
“memoryless” waiting time is required. Thus, being able to simulate exponential
random variates is very desirable.
The exponential distribution with rate parameter λ > 0 has PDF and CDF, respectively,
\[ f(x) = \lambda e^{-\lambda x} \quad \text{and} \quad F(x) = 1 - e^{-\lambda x}. \]
Hence, F⁻¹(u) = −λ⁻¹ log(1 − u). However, if U ∼ Uniform[0, 1] then 1 − U ∼
Uniform[0, 1] as well. Thus, to generate exponential random variates with rate λ, we can
compute
\[ X = -\frac{1}{\lambda}\log(U). \]
The parameter λ is referred to as the rate parameter as it can be interpreted as
the reciprocal of the expected waiting time. That is, EX = 1/λ. Hence, if we are
modelling, say, the time until the next lightning strike occurs in a thunderstorm or
the next atom radioactively decays, a larger λ means a faster rate of occurrences.
Note that while this is an easy way to generate exponential random variates, it
is not the way that R does it with the rexp() function in the base stats package.
This will be discussed in a subsequent section.
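To make the inverse-CDF recipe concrete, here is a minimal R sketch (the sample size, rate, and the helper name rexp_inverse are illustrative choices; rexp() is used only as a reference for comparison):

# Exponential variates via the probability integral transform
rexp_inverse <- function(n, rate = 1) {
  u <- runif(n)        # U ~ Uniform[0,1]
  -log(u) / rate       # X = -log(U)/lambda ~ Exponential(rate)
}

set.seed(413)
x <- rexp_inverse(1e5, rate = 2)
mean(x)                    # should be close to 1/rate = 0.5
mean(rexp(1e5, rate = 2))  # compare with R's built-in generator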
Example 1.1.2 (Cauchy Distribution). The Cauchy distribution is a Student's t-
distribution with only 1 degree of freedom. While it has a bell-curve shape similar to
the normal distribution, none of the moments of the Cauchy distribution are defined;
e.g., it does not have a finite mean.
A random variable X is Cauchy if it has PDF and CDF, respectively,
\[ f(x) = \frac{1}{\pi(1 + x^2)} \quad \text{and} \quad F(x) = \frac{1}{\pi}\arctan(x) + \frac{1}{2}. \]
Thus, we can write X = F⁻¹(U) = tan(π(U − 1/2)) to generate Cauchy random
variates X given U ∼ Uniform[0, 1].
The R source code uses this approach for generating Cauchy random variates via
rcauchy(). For the more general t-distribution, one can note that if Z ∼ N(0, 1),
V ∼ χ²(ν), and Z ⊥⊥ V, then X = Z/√(V/ν) ∼ t(ν). Thus, the R code in rt.c
generates a standard normal random variate Z and a chi-squared variate V in order
to return a t random variate, which will be Cauchy if ν = 1.
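A hedged R sketch of this inverse-CDF Cauchy generator (the sample size and the name rcauchy_inverse are illustrative; rcauchy() is shown only for comparison):

# Cauchy variates via the probability integral transform
rcauchy_inverse <- function(n) {
  u <- runif(n)
  tan(pi * (u - 0.5))
}

set.seed(413)
z <- rcauchy_inverse(1e4)
median(z)              # the median should be near 0
median(rcauchy(1e4))   # compare with R's built-in generator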
1.1.2 CDFs without computable inverses
It often is the case that F⁻¹ exists, but no closed-form analytic expression is known.
The most obvious example of this is the normal or Gaussian distribution. The CDF
is written in terms of an integral,
\[ F(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt, \]
which does not have a closed form.
However, such an issue does not completely ruin our ability to use the probability
integral transform. Indeed, if we wish to compute x = F−1(u) for some u ∈ [0, 1],
this is equivalent to finding an x such that 0 = F (x) − u. Hence, we have a root
finding problem, which can be solved by many algorithms. Most notably, we can
use the Newton-Raphson method, which is detailed in Algorithm 1.
The Newton-Raphson Method
The Newton-Raphson algorithm can be used to find the root of F (x)− u assuming
that the PDF f(x) = F ′(x) exists and that both f(x) and F (x) can be evaluated on
a computer. The idea behind the Newton-Raphson algorithm is to start at a point
x0 in the domain D of F . Then, a tangent line can be drawn passing through the
point (x0, F(x0) − u). Let x1 be the root of this tangent line; then
\[ f(x_0) = \frac{F(x_0) - u}{x_0 - x_1}, \]
and solving for x1 gives
\[ x_1 = x_0 - \frac{F(x_0) - u}{f(x_0)}. \]
The algorithm can be iterated by repeating the above with x1 in place of x0 to get
a new root x2. This may continue until convergence is achieved.
It can be shown using Taylor's Theorem that if F has a continuous second
derivative, then the Newton-Raphson algorithm converges quadratically. Indeed,
denoting the true root by x∞, we have that
\[ |x_{i+1} - x_\infty| = C_i (x_i - x_\infty)^2 \]
for some constant Ci. But this requires some additional conditions, such as
f(x∞) = F′(x∞) ≠ 0. Otherwise, convergence can be hindered.
Furthermore, the choice of starting point is very important for quick convergence.
Also, certain functions will cause this algorithm to fail. For example, try applying
Newton-Raphson to F(x) = √x for x ≥ 0 and F(x) = −√(−x) for x < 0 with an
initial value of x0 = 1 and see what happens.
Algorithm 1 The Newton-Raphson Algorithm
Input:
F : D → R, an invertible differentiable function
f : D → R+, the derivative f = F ′
x0 ∈ D, an initial value in the domain of F
τ ∈ R+, a threshold for convergence
Algorithm:
For i ≥ 1
xi = xi−1 − F (xi−1)/f(xi−1)
if |xi − xi−1| < τ
stop
return xi
Output:
xi such that F (xi) ≈ 0
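A minimal R sketch of Algorithm 1 applied to inverting a CDF (the names newton_inverse, cdf, and pdf, the tolerance, and the test value u = 0.975 are all illustrative choices; qnorm() is used only to check the answer):

# Newton-Raphson to solve F(x) - u = 0, i.e. to compute x = F^{-1}(u)
newton_inverse <- function(u, cdf, pdf, x0 = 0, tol = 1e-10, maxit = 100) {
  x <- x0
  for (i in seq_len(maxit)) {
    x_new <- x - (cdf(x) - u) / pdf(x)   # Newton update
    if (abs(x_new - x) < tol) break      # stop once successive iterates agree
    x <- x_new
  }
  x_new
}

# Invert the standard normal CDF at u = 0.975
newton_inverse(0.975, cdf = pnorm, pdf = dnorm)   # roughly 1.959964
qnorm(0.975)                                      # reference value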
Remark 1.1.3 (Halley’s Method). Many other methods of root finding exist. One
method known as Halley’s method can be achieved by applying Newton-Raphson to
the function F(x)/√|F′(x)|, assuming that for any x such that F(x) = 0, we have
that F′(x) ≠ 0. In Halley's method, the update step
\[ x_{i+1} = x_i - \frac{F(x_i)}{F'(x_i)} \]
is replaced by
\[ x_{i+1} = x_i - \frac{2F(x_i)F'(x_i)}{2F'(x_i)^2 - F(x_i)F''(x_i)}. \]
This approach comes with its own assumptions and concerns, which will be left to
other courses to investigate. See “Householder Methods” for even more.
The Bisection Method
In the context of random variate generation, the existence of a CDF usually implies
the existence of a PDF. However, given a function F to find a root, we may not
have access to the derivative f . In that case, we require a gradient-free root finding
method. Many of these exist. In this section, we will highlight a simple algorithm
applicable when it is known that the root of F lies in the finite interval [a, b] and
F (a) < 0 and F (b) > 0.2 This is the Bisection algorithm detailed in Section 2.3
of Fishman (2013) and written out in Algorithm 2. It is effectively a binary search
algorithm.
2 Or alternatively, F (a) > 0 > F (b). Just as long as the signs are different.
Much like a binary search algorithm, the Bisection algorithm is guaranteed to
stop after at most ⌈log₂((b − a)/τ₁)⌉ steps. Hence, the tighter the interval [a, b]
is, the quicker the algorithm will converge; so, as with the starting value for
Newton-Raphson, the closer you start to the true root, the faster the algorithm will terminate.
Example 1.1.4 (Beta Distribution). The beta distribution commonly arises as a
Bayesian prior or posterior distribution when a parameter of interest lies in the
unit interval. If X ∼ Beta(α, β), then the PDF is
\[ f(x) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1}(1 - x)^{\beta - 1} \]
for x ∈ [0, 1], where B(α, β) = Γ(α)Γ(β)/Γ(α + β) is the beta function and Γ( ) is the
gamma function.3 The CDF is known as the regularized incomplete beta function,
\[ F(x) = \frac{1}{B(\alpha, \beta)} \int_0^x t^{\alpha - 1}(1 - t)^{\beta - 1}\, dt. \]
Of note, when α = β = 1/2, the distribution is known as the arcsine distribution as
its CDF is
\[ F(x) = \frac{2}{\pi} \arcsin(\sqrt{x}) \]
for x ∈ [0, 1]. Finding a root x of F(x) − u for some u ∈ [0, 1] can be done with
the Bisection algorithm, as in the sketch following Algorithm 2 below.
1.1.3 CDFs with discrete distributions
There are many discrete probability distributions that we may wish to simulate from.
These include the binomial, Poisson, geometric, and hypergeometric distributions.
In such a case, we have a discrete support D = {x0, x1, x2, . . .} and a CDF F such
that
\[ 0 = F(x_0) < F(x_1) < \cdots < F(x_i) < F(x_{i+1}) < \cdots \le 1. \]
If the support is finite with, say, |D| = m, then F(xm) = 1. Otherwise, F(xi) → 1
as i → ∞.4
One easy approach (coding-wise) to generating random variates from a discrete
distribution is to first compute the ordered values F(xi) ∈ [0, 1]. Then, generate a
random U ∼ Uniform[0, 1] and find the smallest i such that U ≤ F(xi). In practice,
the C code written to generate, say, Poisson or binomial random variables for R is
immensely complex, needing to consider various extreme cases as well as numerical
stability.
3For n ∈ N, Γ(n) = (n − 1)!, and for general x ∈ R+, Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt.
4We can also extend the support of F to −∞ if desired. However, most discrete distributions
of interest have support contained in Z+.
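A minimal R sketch of this table-lookup approach (the target distribution, a Poisson(3) truncated to {0, . . . , 20}, and the helper name r_discrete are illustrative choices; findInterval() performs the search for the smallest i with U ≤ F(xi)):

# Sample from a discrete distribution by inverting its CDF
r_discrete <- function(n, x, p) {
  # x: support points, p: probabilities summing to 1
  Fx <- cumsum(p)            # ordered CDF values F(x_i)
  Fx[length(Fx)] <- 1        # guard against floating-point round-off
  u <- runif(n)
  # findInterval(u, Fx) counts the F(x_i) values at or below u, so adding 1
  # gives the smallest index with u <= F(x_i) (ties occur with probability 0)
  x[findInterval(u, Fx) + 1]
}

set.seed(413)
x <- 0:20
p <- dpois(x, lambda = 3); p <- p / sum(p)   # truncated Poisson(3) probabilities
draws <- r_discrete(1e5, x, p)
mean(draws)   # should be close to 3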
Algorithm 2 The Bisection Algorithm
Input:
F : D → R, a function whose root is to be found
a, b ∈ D such that F (a) < 0 < F (b), interval to search
τ1, τ2 ∈ R+, thresholds for convergence
Algorithm:
Set x0 = a and x1 = b
For i ≥ 1
Compute z = (x0 + x1)/2
Compute F (z)
if F (z) < 0
Set x0 = z
else if F (z) ≥ 0
Set x1 = z
if |x1 − x0| < τ1 or |F (z)| < τ2
stop
return z
Output:
z such that F (z) ≈ 0
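As a companion to Algorithm 2, here is a minimal R sketch that inverts the arcsine CDF from Example 1.1.4 (the tolerances, the test value u = 0.3, and the names bisect_inverse and F_arcsine are illustrative choices):

# Bisection to solve F(x) - u = 0 on [a, b], assuming F(a) - u < 0 < F(b) - u
bisect_inverse <- function(u, cdf, a, b, tol1 = 1e-10, tol2 = 1e-12) {
  x0 <- a; x1 <- b
  repeat {
    z <- (x0 + x1) / 2
    fz <- cdf(z) - u
    if (fz < 0) x0 <- z else x1 <- z       # keep the half containing the root
    if (abs(x1 - x0) < tol1 || abs(fz) < tol2) break
  }
  z
}

# Invert the arcsine CDF F(x) = (2/pi) * asin(sqrt(x)) at u = 0.3
F_arcsine <- function(x) (2 / pi) * asin(sqrt(x))
x <- bisect_inverse(0.3, F_arcsine, a = 0, b = 1)
F_arcsine(x)   # should be approximately 0.3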
1.2 Other Transformations
The probability integral transform described in the previous section takes a uniform
random variate U and transforms it into a new random variate with some CDF F .
In this section, we will consider other ways to transform one random variate into
another.
Example 1.2.1 (Student’s t distribution). The t distribution is of critical impor-
tance in statistical hypothesis testing under the assumption of Gaussian errors. How-
ever, we may also wish to generate t random variates. The t distribution with ν
degrees of freedom has PDF
f(x) = C(1 + x2/ν)−(ν+1)/2
for some normalizing constant C and a CDF in terms of something called a Gaussian
Hypergeometric Function. Suffice it to say, we don't want to try to write down F, let
alone invert it. Luckily, as noted above in Example 1.1.2 on the Cauchy distribution,
a t random variate X with ν degrees of freedom can be defined as X = Z/√(V/ν),
where Z ∼ N(0, 1), V ∼ χ²(ν), and Z ⊥⊥ V. This is precisely how R generates t random
variates.
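A hedged R sketch of this construction (the degrees of freedom, sample size, and the name r_t_ratio are arbitrary; rt() is R's built-in generator, shown for reference):

# t variates with nu degrees of freedom from a normal and a chi-squared variate
r_t_ratio <- function(n, nu) {
  z <- rnorm(n)             # Z ~ N(0,1)
  v <- rchisq(n, df = nu)   # V ~ chi-squared(nu), independent of Z
  z / sqrt(v / nu)
}

set.seed(413)
x <- r_t_ratio(1e5, nu = 5)
var(x)                 # for nu = 5 the variance is nu/(nu - 2) = 5/3
var(rt(1e5, df = 5))   # compare with R's built-in generator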
Example 1.2.2 (Exponential distribution). As discussed in Example 1.1.1, expo-
nential random variates with rate parameter λ > 0 can be generated by using the
probability integral transform
\[ X = -\frac{1}{\lambda}\log(U) \]
for U ∼ Uniform[0, 1]. However, more efficient methods exist.
In Fishman (2013), Section 3.10, it is noted that more efficient methods of generation
exist. In particular, let X0 be a random variate from a truncated exponential
distribution. That is, for x ∈ [0, log 2], X0 has PDF and CDF
\[ f_{X_0}(x) = 2e^{-x} \quad \text{and} \quad F_{X_0}(x) = 2 - 2e^{-x}. \]
Next, let K ∼ Geometric(1/2), which is supported on the non-negative integers such
that
\[ p_K(k) = P(K = k) = (1/2)^{k+1}, \]
and such that X0 ⊥⊥ K. Then, we can write X = X0 + K log 2, which is Exponential(1).
Indeed, any x ∈ R+ can be uniquely written as x = x0 + k log 2 for k ∈ Z+ and
x0 ∈ [0, log 2). Denoting the convolution product by ⋆, the PDF is
\[ f_X(x) = (f_{X_0} \star p_K)(x) = P(K = k)\, f_{X_0}(x_0) = 2^{-k-1} \cdot 2e^{-x_0} = \exp(-k \log 2 - x_0) = e^{-x}. \]
Meanwhile, the current R source code uses an algorithm published by Ahrens & Di-
eter in 1972 in a paper called “Computer methods for sampling from the exponential
and normal distributions.”
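A hedged R sketch of this decomposition (purely illustrative; as noted above, R's own rexp() uses the Ahrens & Dieter algorithm instead):

# Exponential(1) variates as X0 + K*log(2), where X0 is truncated exponential
# on [0, log 2] and K ~ Geometric(1/2) on {0, 1, 2, ...}
r_exp_decomp <- function(n) {
  u <- runif(n)
  x0 <- -log(1 - u / 2)        # invert F_X0(x) = 2 - 2*exp(-x) on [0, log 2]
  k <- rgeom(n, prob = 1/2)    # P(K = k) = (1/2)^(k+1) for k = 0, 1, 2, ...
  x0 + k * log(2)
}

set.seed(413)
mean(r_exp_decomp(1e5))   # should be close to 1, the Exponential(1) mean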
1.2.1 Box-Muller Transform
The Gaussian or Normal distribution is perhaps the most widely used probabil-
ity distribution. While its shape is an elegant bell curve, the CDF has no closed
form, making the use of tools like the Probability Integral Transform tedious to
implement. Luckily, the Box-Muller transform gives us an alternative approach.
We say that X ∼ N(µ, σ²) if
\[ f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \]
for µ ∈ R and σ² > 0. Strictly speaking, σ² = 0 allows for a degenerate Gaussian
distribution that reduces to a point mass at the mean µ. It can be shown that
if Z ∼ N (0, 1), which is referred to as a standard normal distribution, that X =
µ+σZ ∼ N (µ, σ2). Hence, we only need to be concerned with generating standard
normal random variates Z.
Theorem 1.2.1 (Box-Muller Transform, 1958). If U ∼ Uniform[0, 1] and V ∼
Exponential(1) with U ⊥⊥ V, then
\[ X = \sqrt{2V}\cos(2\pi U) \quad \text{and} \quad Y = \sqrt{2V}\sin(2\pi U) \]
are independent N(0, 1) random variables.
Proof. This theorem is proven via a change of variables. Note that
\[ V = \frac{1}{2}(X^2 + Y^2) \quad \text{and} \quad U = \frac{1}{2\pi}\arctan(Y/X). \]
Then, defining the functions v(x, y) = (x² + y²)/2 and u(x, y) = arctan(y/x)/2π
gives partial derivatives
\[ \frac{\partial v}{\partial x} = x, \qquad \frac{\partial v}{\partial y} = y, \]
\[ \frac{\partial u}{\partial x} = \frac{1}{2\pi} \cdot \frac{1}{1 + y^2/x^2} \cdot \frac{-y}{x^2} = \frac{-y}{4\pi v(x, y)}, \qquad \frac{\partial u}{\partial y} = \frac{1}{2\pi} \cdot \frac{1}{1 + y^2/x^2} \cdot \frac{1}{x} = \frac{x}{4\pi v(x, y)}. \]
Then, by the change of variables formula, we have that
\[ f_{X,Y}(x, y) = f_{U,V}(u, v) \left| \frac{\partial u}{\partial x}\frac{\partial v}{\partial y} - \frac{\partial u}{\partial y}\frac{\partial v}{\partial x} \right| = \mathbf{1}[u(x, y) \in [0, 1]]\, e^{-v(x, y)} \left| \frac{-y^2}{4\pi v(x, y)} - \frac{x^2}{4\pi v(x, y)} \right| = \frac{1}{2\pi} \exp\left\{ -\frac{x^2 + y^2}{2} \right\}. \]
As this is the Gaussian PDF on R² with mean zero and covariance I₂, we have our
conclusion.
The Box-Muller transform reduces the task of generating a normal random
variate to the tasks of generating a uniform and an exponential random variate.
However, as is typical in these notes, more sophisticated and optimized generation
methods exist in the R source code.
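A minimal R sketch of the Box-Muller transform (the sample size and the name r_norm_boxmuller are illustrative; rnorm() remains the production choice):

# Box-Muller: turn a Uniform[0,1] and an Exponential(1) variate into two
# independent standard normal variates
r_norm_boxmuller <- function(n) {
  u <- runif(n)    # U ~ Uniform[0,1]
  v <- rexp(n)     # V ~ Exponential(1)
  x <- sqrt(2 * v) * cos(2 * pi * u)
  y <- sqrt(2 * v) * sin(2 * pi * u)
  c(x, y)          # 2n independent N(0,1) variates
}

set.seed(413)
z <- r_norm_boxmuller(5e4)
c(mean(z), var(z))   # should be close to (0, 1)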
1.3 Acceptance-Rejection Sampling
Acceptance-Rejection sampling, sometimes just called Rejection sampling, is a pow-
erful technique for sampling from complex distributions. In short, we wish to sample
from some probability density f(x), but it is hard to do so. The solution is to find a
proposal or candidate distribution h(x) that is close to f and easy to sample from.
Then, we sample from h instead and reject any samples that don’t look like they
came from f . This is made more precise in the following theorem attributed to John
Von Neumann.
Note that the acceptance-rejection method is a critical step in the Metropolis-
Hastings algorithm for performing Markov Chain Monte Carlo. However, MCMC
is beyond the scope of this course. In these notes, we restrict ourselves to independent sampling
rather than invoking Markov chains.
Remark 1.3.1. In the following theorem, we use | to denote conditioning. That
is, we define a random variable X to equal another random variable Z after forcing
a condition to hold. For example, you could consider a random 20-sided die roll
conditioned on the number displayed being less than 8.
In general, for two real-valued random variables X and Y, we can write the
conditional density function as
\[ f_{Y|X}(y|x) = \frac{f(x, y)}{f_X(x)}, \]
where f(x, y) is the joint density function.
Theorem 1.3.1 (Von Neumann, 1951). Let f(x) be a pdf with support on [a, b] that
can be written as f(x) = c g(x) h(x), where h(x) ≥ 0 with ∫_a^b h(x) dx = 1,
g(x) ∈ [0, 1], and c ≥ 1 is some constant. If Z ∼ h(z) and U ∼ Uniform[0, 1] with
Z ⊥⊥ U, then X = Z | {U ≤ g(Z)} has pdf f(x).
Proof. Since U and Z are independent, the joint pdf is simply f(u, z) = 1[u ∈ [0, 1]] h(z).
Consequently, the conditional pdf for X satisfies
\[ h(z \mid U \le g(Z))\, P(U \le g(Z)) = \int_0^{g(z)} f(u, z)\, du, \]
so that
\[ h(z \mid U \le g(Z)) = \frac{\int_0^{g(z)} f(u, z)\, du}{P(U \le g(Z))}. \]
The numerator becomes ∫_0^{g(z)} f(u, z) du = g(z)h(z). The denominator becomes
\[ P(U \le g(Z)) = \iint_{[0,1]\times[a,b]} \mathbf{1}[u \le g(z)]\, du\, h(z)\, dz = \int_a^b \int_0^1 \mathbf{1}[u \le g(z)]\, du\, h(z)\, dz = \int_a^b g(z)h(z)\, dz. \]
As f(x) is a pdf,
\[ 1 = \int_a^b f(x)\, dx = c \int_a^b g(x)h(x)\, dx. \]
Thus, ∫_a^b g(x)h(x) dx = c⁻¹, and consequently, h(z | U ≤ g(Z)) = c g(z)h(z) = f(z).
Remark 1.3.2. While many choices of c and g(x) are possible, the acceptance-
rejection method is most efficient when c is close to 1. This is because we accept a
Algorithm 3 The Acceptance-Rejection Algorithm
Input:
g(x) ∈ [0, 1]
h(x), a pdf to sample from
Algorithm:
Repeat
Generate z from pdf h(z)
Generate u from Uniform [0, 1]
if u ≤ g(z)
stop
return x = z
Output:
x from pdf f(x) = cg(x)h(x).
randomly generated variate X when U ≤ g(Z), which occurs with probability 1/c as
per the above proof.
Hence, given a pdf of interest f and a candidate distribution h, c is chosen to be
\[ c = \sup_x \{ f(x)/h(x) \} \]
and g is chosen accordingly.
Furthermore, the number of proposals until the event U ≤ g(Z) occurs has a geometric
distribution with success probability 1/c. Hence, the expected number of iterations
until acceptance is achieved is c.
Example 1.3.3 (Half-Normal Distribution). This example is taken from Example
3.1 of Fishman (2013). The half-normal distribution is the standard normal but
only supported on the non-negative reals. That is,
\[ f(x) = \sqrt{\frac{2}{\pi}}\, e^{-x^2/2} = \sqrt{\frac{2e}{\pi}}\, e^{-(x-1)^2/2}\, e^{-x}. \]
Thus, taking c = √(2e/π), g(x) = exp[−(x − 1)²/2], and h(x) = exp(−x), we can
sample from the Exponential(1) distribution and apply Algorithm 3 to generate
half-normal random variates.
Furthermore, this allows for an alternative way to sample from the normal dis-
tribution. That is, use Algorithm 3 to generate X as half-normal. Then, generate
U ∼ Uniform[0, 1] and set
\[ X \leftarrow \begin{cases} X & \text{if } U \ge 1/2 \\ -X & \text{if } U < 1/2. \end{cases} \]
Algorithm 4 The Acceptance-Rejection-Squeeze Algorithm
Input:
g, gL, gU ∈ [0, 1] such that gL ≤ g ≤ gU
h(x), a pdf to sample from
Algorithm:
Repeat
Generate z from pdf h(z)
Generate u from Uniform [0, 1]
if u ≤ gL(z) pretest with gL
stop
else if u ≤ gU (z) pretest with gU
if u ≤ g(z) only evaluate g if pretests fail
stop
return x = z
Output:
x from pdf f(x) = cg(x)h(x).
In this setting, we require one Uniform and one Exponential random variate until
a sample is “accepted”. Then, one final Uniform random variate is required to set
the sign, i.e. positive or negative.
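A hedged R sketch of Algorithm 3 for the half-normal target (written as an explicit loop for clarity rather than speed; the sample size and the name r_halfnormal_ar are illustrative choices):

# Acceptance-rejection sampler for the half-normal distribution using
# h(x) = exp(-x) (an Exponential(1) proposal) and g(x) = exp(-(x-1)^2/2)
r_halfnormal_ar <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) {
    repeat {
      z <- rexp(1)                      # proposal from h
      u <- runif(1)
      if (u <= exp(-(z - 1)^2 / 2)) {   # accept with probability g(z)
        out[i] <- z
        break
      }
    }
  }
  out
}

set.seed(413)
x <- r_halfnormal_ar(1e4)
mean(x)          # half-normal mean is sqrt(2/pi), about 0.798
sqrt(2 / pi)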
1.3.1 Acceptance-Rejection-Squeeze
Sometimes the function g(x) may be slow to compute. However, if we can construct
an envelope gL(x) ≤ g(x) ≤ gU (x) where gL, gU are “simpler” functions, then we
can use gL and gU to perform pretests on whether we should accept or reject.
The most natural choices for gL and gU come from power series. For example,
1 − x ≤ e−x ≤ 1 − x + x2/2 for x ≥ 0. Hence, instead of merely evaluating
exp[−(x− 1)2/2] in the previous example for the half-normal distribution, we could
choose
\[ g_L(x) = 1 - \frac{(x-1)^2}{2} \quad \text{and} \quad g_U(x) = 1 - \frac{(x-1)^2}{2} + \frac{(x-1)^4}{8}. \]
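A hedged R sketch of the squeeze pretests for the half-normal example (purely illustrative; in R, exp() is cheap, so the savings here are conceptual rather than practical):

# Acceptance-rejection with squeeze pretests for the half-normal target
r_halfnormal_squeeze <- function(n) {
  gL <- function(x) 1 - (x - 1)^2 / 2
  gU <- function(x) 1 - (x - 1)^2 / 2 + (x - 1)^4 / 8
  g  <- function(x) exp(-(x - 1)^2 / 2)
  out <- numeric(n)
  for (i in seq_len(n)) {
    repeat {
      z <- rexp(1); u <- runif(1)
      if (u <= gL(z)) { out[i] <- z; break }               # cheap accept
      if (u <= gU(z) && u <= g(z)) { out[i] <- z; break }  # evaluate g only if needed
    }
  }
  out
}

set.seed(413)
mean(r_halfnormal_squeeze(1e4))   # about sqrt(2/pi) = 0.798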
1.3.2 Box-Muller without Trigonometry
Instead of generating two independent uniform deviates to perform the Box-Muller
transform (i.e. sampling within the square [0, 1]× [0, 1]), we can instead sample from
within the unit circle using an acceptance-rejection step to generate normal random
variates without needing to evaluate sines and cosines.
Indeed, we can generate U, V ∼ Uniform[−1, 1] until U² + V² ≤ 1, so that the point
(U, V) lies within the unit circle. This means that the ordered pair (U, V) has a
uniformly random angle between 0 and 2π and a squared distance to the origin,
U² + V², that is Uniform[0, 1]. After that is achieved, we claim that
\[ \sqrt{-2\log(U^2 + V^2)}\, \frac{V}{\sqrt{U^2 + V^2}} \sim N(0, 1). \]
This is because U² + V² is Uniform[0, 1], thus making −log(U² + V²) ∼ Exponential(1).
Furthermore, the cosine and sine of the angle of the ordered pair (U, V) can be computed
as U/√(U² + V²) and V/√(U² + V²), respectively. Thus, we can compare with the
original Box-Muller transform
\[ x = \sqrt{2v}\cos(2\pi u) \quad \text{and} \quad y = \sqrt{2v}\sin(2\pi u) \]
for u uniform and v exponential in this case.
The same trick can be applied to generate Cauchy random variates. Recall from
Example 1.1.2 that for U ∼ Uniform[0, 1], X = tan[π(U − 1/2)] is Cauchy. If
we prefer not to evaluate the tangent, we can instead sample from the upper half
of the unit circle. That is, generate U ∼ Uniform[−1, 1] and V ∼ Uniform[0, 1]
until U² + V² ≤ 1. Then, the tangent of the angle that the ordered pair (U, V)
makes with the vertical axis is simply U/V ∼ Cauchy.
See sections 7.3.4 and 7.3.7 of Press et al. (2007) for detailed code for these methods.
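A hedged R sketch of this polar approach for normal variates (written with an explicit rejection loop for clarity; the acceptance probability is π/4, and the name r_norm_polar is an illustrative choice):

# Box-Muller without trigonometry (the polar method)
r_norm_polar <- function(n) {
  out <- numeric(n)
  i <- 1
  while (i <= n) {
    u <- runif(1, -1, 1)
    v <- runif(1, -1, 1)
    s <- u^2 + v^2
    if (s <= 1 && s > 0) {                  # keep only points inside the unit circle
      out[i] <- u * sqrt(-2 * log(s) / s)   # v * sqrt(-2*log(s)/s) is a second N(0,1)
      i <- i + 1
    }
  }
  out
}

set.seed(413)
z <- r_norm_polar(1e4)
c(mean(z), var(z))   # should be close to (0, 1)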
1.4 Ratio of Uniforms
In this section, we generalize the previously discussed idea of generating a uniform
element of the unit circle to simulate from more complex univariate distributions.
This Ratio-of-Uniforms method is attributed to Kinderman & Monahan (Kinderman
and Monahan, 1977). The idea is to define a region in R2 (e.g. the unit circle), then
generate a uniformly random point (U, V ) from that region, and then return U/V ,
which will have the desired distribution. Once again, we have a theorem to make
this more precise.