Machine Learning and Data Mining
COMP9417 Machine Learning and Data Mining
minimize_x  f(x)   subject to x ∈ X
• The problem is to find some x, called a design point: a p-dimensional vector of values for p design variables.
• These values are manipulated by an optimization algorithm to find a
minimizer, a solution x∗ that minimizes the objective function f .
• A solution must be an element of the feasible set X, and may be subject to additional constraints; when such constraints are present the problem is called constrained optimization.
• Optimization is widely studied and applied in many fields such as
engineering, science, economics, . . .
1See: Kochenderfer and Wheeler (2019).
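As a minimal sketch of this formulation (the objective function, bounds, and starting point below are illustrative assumptions, not from the lecture), such a problem can be posed and solved numerically, e.g. with SciPy:

```python
from scipy.optimize import minimize

# Hypothetical objective over a 2-dimensional design point x = (x1, x2).
def f(x):
    return (x[0] - 1.0) ** 2 + (x[1] + 0.5) ** 2

# Feasible set X expressed here as simple box constraints: 0 <= x1, x2 <= 2.
bounds = [(0.0, 2.0), (0.0, 2.0)]

res = minimize(f, x0=[0.0, 0.0], bounds=bounds)
print(res.x)  # minimizer x*, approximately (1.0, 0.0) for this objective
```

Here the unconstrained minimizer (1.0, −0.5) lies outside the feasible set, so the constrained minimizer sits on the boundary of X at (1.0, 0.0).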
Optimization by gradient descent
[Figure: hand-drawn sketch of an objective function f(x) over a range of design points x.]
Optimization by gradient descent
Optimization
The formulation is general, for constrained or unconstrained optimization,
since we can always replace a problem where we want to maximize the
objective function f :
maximize_x  f(x)   subject to x ∈ X

by an equivalent minimization expression:

minimize_x  −f(x)   subject to x ∈ X
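A minimal illustration of this equivalence (the objective below is an assumption made purely for illustration): a library minimizer can be used to maximize f simply by handing it −f.

```python
from scipy.optimize import minimize_scalar

# Hypothetical objective we want to MAXIMIZE: a downward parabola peaking at x = 2.
def f(x):
    return -(x - 2.0) ** 2 + 3.0

# Standard solvers minimize, so maximize f by minimizing -f.
res = minimize_scalar(lambda x: -f(x))

print(res.x)     # ~2.0, the maximizer of f
print(-res.fun)  # ~3.0, the maximum value of f
```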
Optimization by gradient descent
Optimization
Usually, we would like an optimization algorithm to quickly reach an answer that is close to the true optimum.
There are many possible approaches to optimization. In machine learning, methods based on finding the derivative, or gradient, are widely used.
• typically, need to minimize a function
• e.g., error or loss
• optimization is known as gradient descent or steepest descent
• sometimes, need to maximize a function
• e.g., probability or likelihood
• optimization is known as gradient ascent or steepest ascent
This requires the function to be differentiable.
Optimization by gradient descent
[Figure: sketch of a function f(x) showing a global minimum, a local minimum, and an inflection point.]

At a minimum x*, the derivative satisfies f'(x*) = 0 (but not only at a minimum). This is the first-order necessary condition (FONC); the second-order necessary condition (SONC) is f''(x*) ≥ 0.
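As an illustrative worked example (added here, not from the slides), checking the conditions on a simple quadratic:

```latex
f(x) = (x-1)^2, \qquad
f'(x) = 2(x-1) = 0 \;\Rightarrow\; x^* = 1 \quad \text{(FONC)}, \qquad
f''(x^*) = 2 \ge 0 \quad \text{(SONC)}
```

Both necessary conditions hold at x* = 1, which is indeed the global minimizer. By contrast, for f(x) = x³ both conditions also hold at x = 0, yet x = 0 is only an inflection point; this is why the conditions are necessary but not sufficient.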
Optimization by gradient descent
What is an optimization algorithm ?
• many kinds of optimization algorithm
• depends on objective function, constraints, data
• approaches based on derivatives or gradients are widely used
• derivatives provide the direction of the search for a minimizer
• may be obtained analytically or estimated numerically (see the sketch after this list)
• can also use automatic differentiation
• not all approaches require derivatives . . .
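A minimal sketch of estimating a derivative numerically with a central finite difference (the example function and the step size h are illustrative assumptions):

```python
def numerical_derivative(f, x, h=1e-5):
    """Estimate df/dx at x using a central finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Example: the derivative of x^2 at x = 3 is 6.
print(numerical_derivative(lambda x: x ** 2, 3.0))  # ~6.0
```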
Optimization by gradient descent
Optimization
A general iterative algorithm² to optimise some function f:
1. start with initial point x = x0
2. select a search direction g, usually chosen to decrease f(x)
3. select a step length η
4. set s = ηg
5. set x = x + s
6. go to step 2, unless convergence criteria are met
For example, this could be used to minimize a real-valued function f : Rⁿ → R.
Note: convergence criteria will be problem-specific.
2See: Ripley (1996).
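A minimal sketch of this loop in code, assuming f is differentiable and its gradient is supplied; the fixed step length and the simple convergence test are illustrative choices, not part of the lecture:

```python
import numpy as np

def gradient_descent(f, grad_f, x0, eta=0.1, tol=1e-8, max_iter=10_000):
    """Minimize f by repeatedly stepping in the direction of steepest descent."""
    x = np.asarray(x0, dtype=float)          # 1. start with initial point x = x0
    for _ in range(max_iter):
        g = -grad_f(x)                       # 2. search direction that decreases f
        s = eta * g                          # 3.-4. step = step length * direction
        x_new = x + s                        # 5. move to the new point
        if np.linalg.norm(x_new - x) < tol:  # 6. problem-specific convergence test
            return x_new
        x = x_new
    return x

# Example: f(x) = (x1 - 1)^2 + (x2 + 2)^2 has its minimizer at (1, -2).
f = lambda x: (x[0] - 1) ** 2 + (x[1] + 2) ** 2
grad_f = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] + 2)])
print(gradient_descent(f, grad_f, x0=[0.0, 0.0]))  # ~[ 1., -2.]
```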
Optimization by gradient descent
Least-Squares as Loss Minimization
• consider MSE as a loss function
• this is what we want to minimize in OLS regression
• finding the least-squares solution is in effect finding the values of a and b that minimize (1/n) ∑_{i=1}^{n} (y_i − ŷ_i)², where ŷ_i = a + b x_i
• we saw before that the minimum value can be obtained analytically
by the usual process of differentiating and equating to 0
• an alternative is to apply an iterative gradient descent approach
• with MSE as a loss function, given data (x_1, y_1), . . . , (x_n, y_n):

Loss(θ) = (1/n) ∑_{i=1}^{n} (y_i − f_θ(x_i))²
• note that Loss is a function of θ, and ŷ_i = f_θ(x_i), so here θ consists of the parameters a and b
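A minimal sketch of this loss for the univariate model ŷ = a + b x, with θ = (a, b); the tiny dataset is a made-up illustration:

```python
import numpy as np

def mse_loss(theta, x, y):
    """Mean squared error of the linear model y_hat = a + b * x."""
    a, b = theta
    y_hat = a + b * x
    return np.mean((y - y_hat) ** 2)

# Made-up data lying roughly on y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.2])

print(mse_loss((1.0, 2.0), x, y))  # small: close to the underlying parameters
print(mse_loss((0.0, 0.0), x, y))  # much larger loss
```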
Optimization by gradient descent
Least-Squares as Loss Minimization
• the gradient descent alternative to the analytical approach is to take (small) steps that decrease the value of the function to be minimised, stopping when we reach a minimum
• recall that at a point the gradient vector points in the direction of
greatest increase of a function. So, the opposite direction to the
gradient vector gives the direction of greatest decrease
• for univariate linear regression, suppose g_b, g_a give the gradient; then the iterative steps are (see the code sketch below):
• b_{i+1} = b_i − η × g_b
• a_{i+1} = a_i − η × g_a
• stop when b_{i+1} ≈ b_i and a_{i+1} ≈ a_i
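A minimal sketch of these updates; the gradients of the MSE loss are g_a = −(2/n) ∑ (y_i − ŷ_i) and g_b = −(2/n) ∑ x_i (y_i − ŷ_i), and the step size, tolerance, and data below are illustrative assumptions:

```python
import numpy as np

def fit_univariate_lr(x, y, eta=0.01, tol=1e-9, max_iter=100_000):
    """Gradient descent on the MSE loss of the model y_hat = a + b * x."""
    a, b = 0.0, 0.0
    for _ in range(max_iter):
        residual = y - (a + b * x)
        g_a = -2.0 * np.mean(residual)      # dLoss/da
        g_b = -2.0 * np.mean(x * residual)  # dLoss/db
        a_new, b_new = a - eta * g_a, b - eta * g_b
        if abs(a_new - a) < tol and abs(b_new - b) < tol:  # a_{i+1} ~ a_i, b_{i+1} ~ b_i
            return a_new, b_new
        a, b = a_new, b_new
    return a, b

# Made-up data lying roughly on y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
print(fit_univariate_lr(x, y))  # close to (a, b) = (1, 2)
```

The updates subtract η times the gradient because the gradient points uphill; stepping against it decreases the loss.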
Optimization by gradient descent
[Figure: hand-drawn sketch of gradient descent iterates x_i, x_{i+1} on a loss curve, illustrating the effect of a large step size.]
Multivariate linear regression
Multivariate linear regression
Many variables
• Often, we are interested in modelling the relationship of Y to several other variables
• In observational studies, the value of Y may be affected by the values of several variables. For example, carcinogenicity may be gender-specific. A regression model that ignores gender may find carcinogenicity to be related to some surrogate variable (height, for example)³
• Including more variables can give a narrower confidence interval on
the prediction being made
• However, more complex models are not always better
3A surrogate variable is a variable for which we have data that replaces one that
cannot be observed, or is otherwise not in the dataset.
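As an illustrative sketch of the gender example above (the synthetic data and effect sizes are assumptions made purely for illustration), including the extra variable in the design matrix lets ordinary least squares separate its effect from the surrogate:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data: the outcome depends on gender, while height is merely
# correlated with gender and so could act as a surrogate variable.
gender = rng.integers(0, 2, size=n)                # 0/1 indicator
height = 165 + 10 * gender + rng.normal(0, 5, n)   # correlated with gender
y = 2.0 + 3.0 * gender + rng.normal(0, 1, n)       # outcome driven by gender

# Design matrix: intercept column plus one column per variable.
X = np.column_stack([np.ones(n), height, gender])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # [intercept, height, gender]; the height coefficient is near 0
```

Dropping the gender column from X and refitting would instead load that effect onto height, the surrogate.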
Multivariate linear regression
Multivariate linear model