Machine Learning
CMPT 726 / 419
Announcements
The deadline for the academic dishonesty amnesty application is this coming Sunday, March 20th.
If you committed an academic offence and do not fill out the form, the standard penalty will apply without exception: at least a zero (−100%) for a single offence, and an F in the course for multiple offences.
Recap
Nesterov’s Accelerated Gradient (NAG)
Gradient Descent with Momentum:
    $\vec\theta^{(0)} \leftarrow$ random vector
    $\Delta\vec\theta = \vec 0$
    for $t = 1, 2, 3, \ldots$
        $\Delta\vec\theta \leftarrow \alpha\,\Delta\vec\theta - \gamma_t \frac{\partial L}{\partial \vec\theta}\big(\vec\theta^{(t-1)}\big)$
        $\vec\theta^{(t)} \leftarrow \vec\theta^{(t-1)} + \Delta\vec\theta$
        if algorithm has converged
            return $\vec\theta^{(t)}$
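As a concrete, non-authoritative sketch, the momentum update above fits in a few lines of NumPy; the loss, its gradient, the step size, and the momentum coefficient below are placeholder choices for illustration.

```python
import numpy as np

def gd_momentum(grad, theta0, alpha=0.9, gamma=0.1, n_steps=500, tol=1e-8):
    """Gradient descent with momentum, following the pseudocode above.

    grad(theta) returns dL/dtheta; alpha is the momentum coefficient and gamma a
    constant step size (the pseudocode allows a schedule gamma_t).
    """
    theta = np.asarray(theta0, dtype=float)
    delta = np.zeros_like(theta)               # accumulated update ("velocity")
    for t in range(1, n_steps + 1):
        g = grad(theta)
        delta = alpha * delta - gamma * g      # blend inertia with the new gradient
        theta = theta + delta
        if np.linalg.norm(g) < tol:            # simple convergence check
            break
    return theta

# Example on f(x, y) = 4x^2 + (y + 3)^2 / 15, the function used later in these slides:
grad_f = lambda th: np.array([8.0 * th[0], 2.0 * (th[1] + 3.0) / 15.0])
print(gd_momentum(grad_f, np.array([5.0, 5.0])))   # approaches the minimizer (0, -3)
```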
Nesterov’s Accelerated Gradient:
    $\vec\theta^{(0)} \leftarrow$ random vector
    $\Delta\vec\theta = \vec 0$
    for $t = 1, 2, 3, \ldots$
        $\Delta\vec\theta \leftarrow \alpha\,\Delta\vec\theta - \gamma_t \frac{\partial L}{\partial \vec\theta}\big(\vec\theta^{(t-1)} + \alpha\,\Delta\vec\theta\big)$
        $\vec\theta^{(t)} \leftarrow \vec\theta^{(t-1)} + \Delta\vec\theta$
        if algorithm has converged
            return $\vec\theta^{(t)}$
Lookahead: account for the inertia when choosing where to compute the gradient.
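A matching sketch of NAG, differing from the momentum sketch above only in where the gradient is evaluated; hyperparameter values are again illustrative assumptions.

```python
import numpy as np

def nag(grad, theta0, alpha=0.9, gamma=0.1, n_steps=500, tol=1e-8):
    """Nesterov's Accelerated Gradient, following the pseudocode above."""
    theta = np.asarray(theta0, dtype=float)
    delta = np.zeros_like(theta)
    for t in range(1, n_steps + 1):
        g = grad(theta + alpha * delta)      # lookahead: account for the inertia
        delta = alpha * delta - gamma * g    # same velocity update as momentum
        theta = theta + delta
        if np.linalg.norm(g) < tol:          # simple convergence check
            break
    return theta

# e.g. nag(lambda th: np.array([8.0 * th[0], 2.0 * (th[1] + 3.0) / 15.0]),
#          np.array([5.0, 5.0]))   # approaches (0, -3)
```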
Adaptive Gradient Methods
Recall the function $f(x, y) = 4x^2 + \frac{(y + 3)^2}{15}$.
The eigenvalues of the Hessian (which in this case are the diagonal entries) have very different magnitudes.
Gradient descent converges to the optimal parameters slowly.
$\nabla^2 f(x, y) = \begin{pmatrix} \frac{\partial^2 f}{\partial x\,\partial x}(x, y) & \frac{\partial^2 f}{\partial x\,\partial y}(x, y) \\ \frac{\partial^2 f}{\partial y\,\partial x}(x, y) & \frac{\partial^2 f}{\partial y\,\partial y}(x, y) \end{pmatrix} = \begin{pmatrix} 8 & 0 \\ 0 & \frac{2}{15} \end{pmatrix} \succ 0$
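A quick numerical check of this claim: since $f$ is quadratic, its Hessian is constant, and the ratio of its largest to smallest eigenvalue (the condition number) is 60, which is what makes plain gradient descent slow here.

```python
import numpy as np

# Hessian of f(x, y) = 4x^2 + (y + 3)^2 / 15 (constant because f is quadratic).
H = np.array([[8.0, 0.0],
              [0.0, 2.0 / 15.0]])

eigvals = np.linalg.eigvalsh(H)        # [2/15, 8] in ascending order
print(eigvals)
print(eigvals.max() / eigvals.min())   # condition number = 60
```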
Adaptive Gradient Methods
Recall the function $f(x, y) = 4x^2 + \frac{(y + 3)^2}{15}$, whose Hessian is $\begin{pmatrix} 8 & 0 \\ 0 & \frac{2}{15} \end{pmatrix} \succ 0$.
Idea: try to make the coefficient of each term equal to 1.
Let $x' = 2x$ and $y' = \frac{y}{\sqrt{15}}$, so $x = \frac{x'}{2}$ and $y = \sqrt{15}\,y'$.
This change of variables is a “reparameterization”.
Adaptive Gradient Methods
Let $x' = 2x$ and $y' = \frac{y}{\sqrt{15}}$, so $x = \frac{x'}{2}$ and $y = \sqrt{15}\,y'$. Then
$f(x, y) = 4x^2 + \frac{(y + 3)^2}{15}$
$\quad = 4\left(\frac{x'}{2}\right)^2 + \frac{\left(\sqrt{15}\,y' + 3\right)^2}{15}$
$\quad = 4\left(\frac{x'}{2}\right)^2 + \frac{\left(\sqrt{15}\left(y' + \frac{3}{\sqrt{15}}\right)\right)^2}{15}$
$\quad = 4\,\frac{x'^2}{4} + \frac{15\left(y' + \frac{3}{\sqrt{15}}\right)^2}{15}$
$\quad = x'^2 + \left(y' + \frac{3}{\sqrt{15}}\right)^2$
Gradient descent converges quickly after reparameterization!
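A small sketch comparing plain gradient descent in the original and reparameterized coordinates; the starting point, step size, and step count are arbitrary illustrative choices.

```python
import numpy as np

def gd(grad, theta0, gamma=0.1, n_steps=50):
    """Plain gradient descent with a fixed step size (illustration only)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - gamma * grad(theta)
    return theta

# Original coordinates: f(x, y) = 4x^2 + (y + 3)^2 / 15, gradient (8x, 2(y + 3)/15).
grad_orig = lambda th: np.array([8.0 * th[0], 2.0 * (th[1] + 3.0) / 15.0])
# Reparameterized: x'^2 + (y' + 3/sqrt(15))^2, gradient (2x', 2(y' + 3/sqrt(15))).
grad_reparam = lambda th: np.array([2.0 * th[0], 2.0 * (th[1] + 3.0 / np.sqrt(15.0))])

start = np.array([5.0, 5.0])
print(gd(grad_orig, start))                  # y is still far from -3 after 50 steps
print(gd(grad_reparam, [2.0 * start[0], start[1] / np.sqrt(15.0)]))
# ~ (0, -3/sqrt(15)), the optimum in the new coordinates
```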
Adaptive Gradient Methods
In general, suppose we reparameterize $\vec\theta' = A^{-1}\vec\theta$, so $\vec\theta = A\vec\theta'$. We can then write $L(\vec\theta)$ as $L(A\vec\theta')$, with
$\frac{\partial L}{\partial \vec\theta'} = \frac{\partial (A\vec\theta')}{\partial \vec\theta'}\,\frac{\partial L}{\partial (A\vec\theta')} = \frac{\partial (A\vec\theta')}{\partial \vec\theta'}\,\frac{\partial L}{\partial \vec\theta} = A^\top \frac{\partial L}{\partial \vec\theta}, \qquad \text{so} \qquad \frac{\partial L}{\partial \vec\theta} = (A^\top)^{-1}\,\frac{\partial L}{\partial \vec\theta'}.$
Challenge: we don’t know which $A$ to choose to make gradient descent converge quickly.
Adaptive gradient methods pick $A$ adaptively based on past gradients, effectively reparameterizing on-the-fly.
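A quick numerical sanity check of the chain-rule identity above; the matrix $A$, the quadratic loss, and all variable names here are placeholder assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.normal(size=(n, n))                      # some reparameterization matrix
theta_prime = rng.normal(size=n)

# Toy quadratic loss in the original coordinates and its gradient.
M = rng.normal(size=(n, n))
Q = M @ M.T                                      # positive semi-definite
L = lambda th: 0.5 * th @ Q @ th
grad_L = lambda th: Q @ th

# Gradient w.r.t. theta' via the chain rule ...
chain_rule = A.T @ grad_L(A @ theta_prime)

# ... and via central finite differences on L(A theta').
eps = 1e-6
fd = np.array([(L(A @ (theta_prime + eps * e)) - L(A @ (theta_prime - eps * e))) / (2 * eps)
               for e in np.eye(n)])

print(np.allclose(chain_rule, fd, atol=1e-5))    # True: dL/dtheta' = A^T dL/dtheta
```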
AdaGrad
Gradient Descent:
    $\vec\theta^{(0)} \leftarrow$ random vector
    for $t = 1, 2, 3, \ldots$
        $\vec\theta^{(t)} \leftarrow \vec\theta^{(t-1)} - \gamma_t \frac{\partial L}{\partial \vec\theta}\big(\vec\theta^{(t-1)}\big)$
        if algorithm has converged
            return $\vec\theta^{(t)}$

AdaGrad (short for “adaptive gradient”):
    $\vec\theta^{(0)} \leftarrow$ random vector
    for $t = 1, 2, 3, \ldots$
        $D \leftarrow \mathrm{diag}\left(\sum_{t'=1}^{t-1}\left(\frac{\partial L}{\partial \theta_i}\big(\vec\theta^{(t')}\big)\right)^2\right)_{i=1}^{n}$
        $\vec\theta^{(t)} \leftarrow \vec\theta^{(t-1)} - \gamma_t D^{-1} \frac{\partial L}{\partial \vec\theta}\big(\vec\theta^{(t-1)}\big)$
        if algorithm has converged
            return $\vec\theta^{(t)}$
The matrix $D$ is called a “preconditioner”.
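A sketch of the commonly implemented form of AdaGrad, which scales each coordinate by the inverse square root of the accumulated squared gradients and adds a small $\epsilon$ for numerical stability (the square root and $\epsilon$ are part of the usual formulation; the example problem and hyperparameters below are illustrative assumptions).

```python
import numpy as np

def adagrad(grad, theta0, gamma=0.5, n_steps=500, eps=1e-8, tol=1e-8):
    """AdaGrad in its common form: per-coordinate steps shrink with the square
    root of the accumulated squared gradients (eps avoids division by zero)."""
    theta = np.asarray(theta0, dtype=float)
    accum = np.zeros_like(theta)                 # running sum of squared gradients
    for t in range(1, n_steps + 1):
        g = grad(theta)
        accum += g ** 2
        theta = theta - gamma * g / (np.sqrt(accum) + eps)   # elementwise preconditioning
        if np.linalg.norm(g) < tol:
            break
    return theta

# Same ill-conditioned example: f(x, y) = 4x^2 + (y + 3)^2 / 15.
grad_f = lambda th: np.array([8.0 * th[0], 2.0 * (th[1] + 3.0) / 15.0])
print(adagrad(grad_f, [5.0, 5.0]))   # approaches (0, -3) despite the conditioning
```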
Adam
Gradient Descent with Momentum:
    $\vec\theta^{(0)} \leftarrow$ random vector
    $\Delta\vec\theta = \vec 0$
    for $t = 1, 2, 3, \ldots$
        $\Delta\vec\theta \leftarrow \alpha\,\Delta\vec\theta - \gamma_t \frac{\partial L}{\partial \vec\theta}\big(\vec\theta^{(t-1)}\big)$
        $\vec\theta^{(t)} \leftarrow \vec\theta^{(t-1)} + \Delta\vec\theta$
        if algorithm has converged
            return $\vec\theta^{(t)}$

Adam (short for “adaptive moment estimation”):
    $\vec\theta^{(0)} \leftarrow$ random vector
    $\vec m = \vec 0$, $D = 0$
    for $t = 1, 2, 3, \ldots$
        $\vec m \leftarrow \alpha\,\vec m + (1 - \alpha)\,\frac{\partial L}{\partial \vec\theta}\big(\vec\theta^{(t-1)}\big)$
        $D \leftarrow \beta D + (1 - \beta)\,\mathrm{diag}\left(\left(\frac{\partial L}{\partial \theta_i}\big(\vec\theta^{(t-1)}\big)\right)^2\right)_{i=1}^{n}$
        $\vec\theta^{(t)} \leftarrow \vec\theta^{(t-1)} - \gamma_t \left(\left(\frac{1}{1 - \beta^t}\,D\right)^{1/2} + \epsilon I\right)^{-1} \left(\frac{1}{1 - \alpha^t}\,\vec m\right)$
        if algorithm has converged
            return $\vec\theta^{(t)}$
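A NumPy sketch of the Adam update above, storing only the diagonal of $D$ as a vector; the hyperparameter values are the usual illustrative defaults, and $\alpha$, $\beta$ here correspond to what other sources call $\beta_1$, $\beta_2$.

```python
import numpy as np

def adam(grad, theta0, gamma=0.1, alpha=0.9, beta=0.999,
         eps=1e-8, n_steps=1000, tol=1e-8):
    """Adam, following the pseudocode above: exponential moving averages of the
    gradient (m) and of the squared gradient (d = diag of D), with bias correction."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)          # first-moment estimate
    d = np.zeros_like(theta)          # second-moment estimate (diagonal of D)
    for t in range(1, n_steps + 1):
        g = grad(theta)
        m = alpha * m + (1 - alpha) * g
        d = beta * d + (1 - beta) * g ** 2
        m_hat = m / (1 - alpha ** t)  # bias-corrected first moment
        d_hat = d / (1 - beta ** t)   # bias-corrected second moment
        theta = theta - gamma * m_hat / (np.sqrt(d_hat) + eps)
        if np.linalg.norm(g) < tol:
            break
    return theta

grad_f = lambda th: np.array([8.0 * th[0], 2.0 * (th[1] + 3.0) / 15.0])
print(adam(grad_f, [5.0, 5.0]))   # approaches (0, -3)
```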
Optimization
Non-Diagonal Preconditioners
Previous methods use a diagonal preconditioner, which corresponds to choosing a diagonal reparameterization matrix $A$.
If $A$ is diagonal, $\vec\theta = A\vec\theta' = \begin{pmatrix} a_{1,1}\theta'_1 \\ \vdots \\ a_{n,n}\theta'_n \end{pmatrix}$, so the reparameterization can only scale along each axis.
In some cases, this is less than ideal.
Non-Diagonal Preconditioners
Example: rotate the function by 45 degrees, i.e. apply a 45-degree rotation to the $(x, y)$ coordinates.
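To see why a diagonal preconditioner falls short here, a small sketch, assuming the transformation is the standard 45° rotation of the plane (the exact matrix is not shown above): the rotated function’s Hessian picks up off-diagonal entries, so no axis-aligned rescaling can remove its ill-conditioning.

```python
import numpy as np

# Hessian of the original f(x, y) = 4x^2 + (y + 3)^2 / 15 (constant, diagonal).
H = np.array([[8.0, 0.0],
              [0.0, 2.0 / 15.0]])

# Standard 45-degree rotation matrix (assumed form of the transformation).
c = np.cos(np.pi / 4)
R = np.array([[c, -c],
              [c,  c]])

# Hessian of the rotated function g(u) = f(R u) is R^T H R: no longer diagonal,
# so a diagonal reparameterization (axis-wise scaling) cannot equalize its curvature.
H_rot = R.T @ H @ R
print(H_rot)
print(np.linalg.cond(H_rot))   # condition number is still ~60
```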