STAT 428 NONLINEAR OPTIMIZATION
13.1 RECAP & INTRODUCTION
13.2 THE ROOT-FINDING PROBLEM
13.3 NONLINEAR OPTIMIZATION
13.1 RECAP & INTRODUCTION
THE PROBLEM OF PARAMETER ESTIMATION
A parametric statistical model describes the chances of different outcomes of
a random variable, given the model parameters.
The parameters (θ) are unknown, and one wants to infer something about them
based on the observed data (x).
We have already seen a few examples:
- Linear regression - estimating regression coefficients (βs) based on observed Xs and Ys.
- Investment model - estimating θ, which governs the probability that a stock is a winner on a given day.
- Change point model - estimating k (the change point) and the before- and after-change Poisson rates (µ, λ).
RECAP: BAYESIAN ESTIMATION
The latter two were done under the Bayesian framework, where we
- used MCMC to draw samples of the parameter from the posterior distribution,
f(θ | x) ∝ L(θ; x) f_θ(θ),
- obtained a point estimate of the parameter based on the posterior samples. For example, the EAP estimator is given by the posterior mean:
θ̂_EAP = E_{θ|x}(θ).
This is estimated using Monte Carlo integration on the post burn-in samples:
θ̂_EAP = (1 / (m − b)) ∑_{j=b+1}^{m} θ^(j),
where m is the total number of draws and b is the burn-in length.
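As a concrete sketch in R (theta_draws below is only a placeholder for posterior draws from an earlier MCMC run, and the values of m and b are assumptions for illustration), the post burn-in average can be computed directly:

# Hypothetical posterior draws; in practice these come from your MCMC sampler
m <- 10000                                  # total number of draws
b <- 1000                                   # burn-in length
theta_draws <- rnorm(m)                     # placeholder stand-in for real MCMC output
theta_EAP <- mean(theta_draws[(b + 1):m])   # Monte Carlo estimate of E(theta | x)
theta_EAP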
PARAMETER ESTIMATION AS AN OPTIMIZATION
PROBLEM
On other occasions, parameter estimation might be formulated as an
optimization problem.
For example:
- MAP (Bayesian posterior mode):
θ̂_MAP = argmax_θ f(θ | x).
- MLE:
θ̂_MLE = argmax_θ L(θ; x).
Question: How do we find the value of θ achieving the maximum likelihood (or
posterior probability)?
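As a preview, here is a minimal sketch of finding an MLE numerically with R's built-in optimize(); the exponential sample and the search interval below are made up purely for illustration:

set.seed(428)
x <- rexp(50, rate = 2)                  # hypothetical data
loglik <- function(theta) sum(dexp(x, rate = theta, log = TRUE))
# search for the maximizer of the log-likelihood over a plausible interval
optimize(loglik, interval = c(0.01, 10), maximum = TRUE)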
NONLINEAR OPTIMIZATION
Given the observed data (x), the likelihood function,
L(θ; x) = f_{X|θ}(x | θ),
is a nonlinear function of θ.
Finding its maximum thus involves nonlinear optimization.
For an arbitrary smooth function f(x), its extrema (minima/maxima) occur at values of x such that
f′(x) = 0.
- Finding an analytic root of f′(x) = 0 isn't always straightforward (a closed-form solution doesn't even necessarily exist).
- Numerical methods for solving nonlinear equations may be used to find approximate solutions.
THIS WEEK’S LECTURES
We will start this week’s lectures by introducing a few numerical methods for
solving nonlinear equations:
- Bisection
- Newton-Raphson
- Brent's method
Then we’ll discuss nonlinear optimization.
- Closely related to root-finding: finding the optima of a smooth function is equivalent to finding the root(s) of its derivative.
- The methods are for generic functions. We'll see applications in MLE.
Optimization itself is worth another class: the problem becomes much more challenging when the variables are high-dimensional or when there are linear or nonlinear constraints. We will only cover the basics.
EXISTENCE AND UNIQUENESS OF A ROOT
Keep in mind that there doesn’t always exist a unique solution to a nonlinear
equation:
- e^(−x) − x = 0 has a unique solution.
- e^x + 1 = 0 has no solution.
- sin(x) = 0 has infinitely many solutions.
By the same logic, a unique optimum may not exist:
- e.g., a multimodal distribution has multiple local maxima.
- Most optimization methods are designed to find a local maximum, which may not be the global maximum.
- If the global maximum is desired, it is advisable to try different starting points and see if they all produce the same result (see the sketch after this list).
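For instance, R's built-in uniroot() returns whichever root lies inside the bracket it is given, so for sin(x) = 0 the answer depends entirely on the (arbitrarily chosen) bracket:

# sin(x) = 0 has infinitely many roots; the result depends on the bracket
uniroot(sin, interval = c(2, 4))$root   # root near pi
uniroot(sin, interval = c(5, 7))$root   # root near 2 * pi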
13.2 THE ROOT-FINDING PROBLEM
13.2.1 Bisection Method
BISECTION METHOD
Without loss of generality, we’ll consider the problem of finding the solution to
f(x) = 0.
(Anything non-zero on the RHS can be moved to the LHS, e.g., x² = 2 is the same as x² − 2 = 0.)
Suppose:
- f(x) is continuous on [a, b], and
- f(a) and f(b) have opposite signs.
By the intermediate value theorem, there must exist c ∈ (a, b) satisfying
f(c) = 0.
BISECTION: PROCEDURES
The bisection method begins with an initial bracket [a, b] and iteratively halves the bracket until a solution has been isolated as accurately as desired, i.e., until a pre-specified tolerance level (tol) is reached.
Specifically:
- Step 1: Compute the midpoint x = (a + b)/2 and check the sign of f(x).
- Step 2: If f(a) and f(x) have opposite signs, replace [a, b] with [a, x]. Otherwise, replace [a, b] with [x, b].
Iterate Steps 1 and 2 until (b − a) ≤ tol.
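A minimal sketch of this procedure in R (the function name bisect and its arguments are our own, not from the textbook) might look like this:

# Bare-bones bisection: assumes f(a) and f(b) have opposite signs
bisect <- function(f, a, b, tol = 1e-4, ...) {
  if (f(a, ...) * f(b, ...) > 0) stop("f(a) and f(b) must have opposite signs")
  while ((b - a) > tol) {
    x <- (a + b) / 2
    if (f(x, ...) == 0) return(x)                       # hit the root exactly
    if (f(a, ...) * f(x, ...) < 0) b <- x else a <- x   # keep the half containing the root
  }
  (a + b) / 2                                           # midpoint of the final bracket
}

For example, bisect(function(x) exp(-x) - x, 0, 1) recovers the unique root of e^(−x) − x = 0, which is about 0.567.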
Example
Suppose we want to find the solutions to
a² + y² + 2ay/(n − 1) = n − 2,
where a is a constant and n > 2 is an integer.
Note that this is equivalent to solving the quadratic equation
f(y) = y² + 2ay/(n − 1) + a² − n + 2 = 0.
Here's a function for f(y):
f <- function(y, a, n) {
  y^2 + (2 * a * y) / (n - 1) + a^2 - n + 2
}
Finding the Root w/ Bisection
We'll numerically solve f(y) = 0 when a = .5 and n = 20, using bisection with the starting interval [b0, b1] = [0, 5n] and tolerance .0001. (Note: the book uses .Machine$double.eps^.25, which is around .00012.)
a <- .5
n <- 20
b0 <- 0
b1 <- 5*n
eps <- .0001 # tolerance
Note that the book expresses the error in terms of the difference between f((b0 + b1)/2) and 0, rather than the length of the interval (b1 − b0). Let's be consistent with the book and use the former.
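A sketch of the resulting iteration, written to match the stopping rule just described (this is our own loop; the book's code may differ in detail):

# Bisection loop: stop when |f(midpoint)| falls below the tolerance eps
it <- 0
y <- (b0 + b1) / 2
while (abs(f(y, a, n)) > eps) {
  it <- it + 1
  if (f(b0, a, n) * f(y, a, n) < 0) {
    b1 <- y   # root lies in [b0, y]
  } else {
    b0 <- y   # root lies in [y, b1]
  }
  y <- (b0 + b1) / 2
}
c(iterations = it, root = y, f_at_root = f(y, a, n))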