CAP 6617 Advanced Machine Learning
Homework 1
1. Property of quadratically regularized multinomial logistic regression. Multiclass logistic regression has the form
\[
p(y \mid x) = \frac{\exp(\phi^\top \theta_y)}{\sum_{c=1}^{k} \exp(\phi^\top \theta_c)}.
\]
An interesting observation is that if a common vector $\tilde\theta$ is added to every $\theta_c$, the posterior probability of every data sample is unaffected, i.e.,
\[
p(y \mid x) = \frac{\exp(\phi^\top \theta_y)}{\sum_{c=1}^{k} \exp(\phi^\top \theta_c)}
= \frac{\exp(\phi^\top (\theta_y + \tilde\theta))}{\sum_{c=1}^{k} \exp(\phi^\top (\theta_c + \tilde\theta))}.
\]
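The identity can be verified directly: the common factor $\exp(\phi^\top \tilde\theta)$ appears in the numerator and in every term of the denominator, so it cancels,
\[
\frac{\exp(\phi^\top (\theta_y + \tilde\theta))}{\sum_{c=1}^{k} \exp(\phi^\top (\theta_c + \tilde\theta))}
= \frac{\exp(\phi^\top \theta_y)\,\exp(\phi^\top \tilde\theta)}{\exp(\phi^\top \tilde\theta)\sum_{c=1}^{k} \exp(\phi^\top \theta_c)}
= \frac{\exp(\phi^\top \theta_y)}{\sum_{c=1}^{k} \exp(\phi^\top \theta_c)}.
\]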
This also means that the unregularized version of multinomial logistic regression does not have
a unique solution. Now consider the quadratically regularized multinomial logistic regression
\[
\underset{\Theta}{\text{minimize}} \quad \frac{1}{n} \sum_{i=1}^{n} \left( \log\Big( \sum_{c=1}^{k} \exp(\phi_i^\top \theta_c) \Big) - \phi_i^\top \theta_{y_i} \right) + \lambda \sum_{c=1}^{k} \|\theta_c\|^2.
\]
Show that at the optimum we have $\sum_{c=1}^{k} \theta_{cj} = 0$ for all $j = 1, \ldots, m$. This shows that the shift ambiguity is removed by the quadratic regularization. What does it mean when $k = 2$?
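As a purely numerical sanity check of this property (not a substitute for the proof), one could fit the regularized model with CVX in MATLAB and inspect the class-wise sums of the coefficients. The sketch below is illustrative only; Phi (an n-by-m feature matrix), y (labels in 1..k), and lambda are hypothetical names assumed to be defined.

% Hedged sketch: fit the quadratically regularized multinomial logistic
% regression with CVX, then check that coefficients sum to ~0 across classes.
[n, m] = size(Phi);
k = max(y);
cvx_begin quiet
    variable Theta(m, k)                       % column c is theta_c
    obj = 0;
    for i = 1:n
        obj = obj + log_sum_exp(Phi(i,:) * Theta) - Phi(i,:) * Theta(:, y(i));
    end
    minimize( obj / n + lambda * sum(sum_square(Theta)) )
cvx_end
max(abs(sum(Theta, 2)))                        % should be close to zero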
2. A regression model with unknown measurement nonlinearity. Suppose we know the input-output relationship of a regression model is
\[
y_i = f(\phi_i^\top \theta + v_i), \quad i = 1, \ldots, n.
\]
Here $\phi_i \in \mathbf{R}^m$ are known, $v_i$ are i.i.d. $\mathcal{N}(0, \sigma^2)$ random noises, and $f : \mathbf{R} \to \mathbf{R}$ is an unknown monotonic increasing function, known to satisfy
\[
\alpha \le f'(u) \le \beta
\]
for all $u$. (Here $\alpha$ and $\beta$ are known positive constants, with $\alpha < \beta$.) We want to find a maximum likelihood estimate of $\theta$ and $f$, given $y_i$. (We also know $\phi_i$, $\sigma$, $\alpha$, and $\beta$.)
This sounds like an infinite-dimensional problem, since one of the parameters we are estimating is a function. However, we actually only need to know the $n$ numbers $z_i = f^{-1}(y_i)$, $i = 1, \ldots, n$. So by estimating $f$ we really mean estimating the $n$ numbers $z_1, \ldots, z_n$. (These numbers are not arbitrary; they must be consistent with the prior information $\alpha \le f'(u) \le \beta$ for all $u$.)
Explain how to find a maximum likelihood estimate of $\theta$ and $f$ (i.e., $z_1, \ldots, z_n$) using convex optimization. You can assume the measurements are numbered so that the $y_i$ are sorted in nondecreasing order, i.e., $y_1 \le y_2 \le \cdots \le y_n$.
Hint. The constraint $\alpha \le f'(u) \le \beta$, when $0 < \alpha < \beta$, implies
\[
\frac{1}{\beta} \le \frac{d}{du} f^{-1}(u) \le \frac{1}{\alpha}.
\]
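One way to see this implication is the standard inverse-function derivative identity,
\[
\frac{d}{du} f^{-1}(u) = \frac{1}{f'\!\big(f^{-1}(u)\big)},
\qquad \text{so } \alpha \le f'(\cdot) \le \beta \text{ gives } \frac{1}{\beta} \le \frac{d}{du} f^{-1}(u) \le \frac{1}{\alpha}.
\]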
Therefore
\[
\frac{1}{\beta} |y_i - y_j| \le |z_i - z_j| \le \frac{1}{\alpha} |y_i - y_j|,
\]
for all $i, j$. Conversely, if these inequalities hold, then there is a function $f$ satisfying $\alpha \le f'(u) \le \beta$ with $f^{-1}(y_i) = z_i$. (Actually, this is true only in the limit, but we're being informal here.)
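For illustration, one plausible CVX setup consistent with the hint is sketched below. It assumes the Gaussian noise model reduces maximum likelihood to a least-squares objective in the residuals $z_i - \phi_i^\top \theta$ (with $\sigma$ only scaling the log-likelihood), and that, because the $y_i$ are sorted, constraints on consecutive pairs suffice. The names Phi, yv, alph, and bet are hypothetical (chosen to avoid MATLAB built-ins); treat this as a sketch rather than the required answer.

% Hedged sketch of one possible convex formulation (see assumptions above).
[n, m] = size(Phi);
dy = yv(2:n) - yv(1:n-1);                      % nonnegative since yv is sorted
cvx_begin quiet
    variables theta(m) z(n)
    minimize( sum_square(z - Phi * theta) )    % least squares in the residuals
    subject to
        dy / bet <= z(2:n) - z(1:n-1);         % lower bound on consecutive gaps
        z(2:n) - z(1:n-1) <= dy / alph;        % upper bound on consecutive gaps
cvx_end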
3. We test the performance of three multi-class classification methods on the iris data set
http://archive.ics.uci.edu/ml/datasets/Iris. There are 150 samples of 3 classes of
iris, 50 samples per class. For each class we use the first 40 samples for training, and the
last 10 samples for testing. The goal is to build a linear model of the 4 features (together
with a constant term) to predict the class. All models are trained by solving the following
optimization problem
\[
\underset{\theta}{\text{minimize}} \quad \sum_{i=1}^{n} \ell(\phi_i^\top \theta, y_i),
\]
where the feature representation is
\[
\phi_i = \begin{bmatrix} 1 \\ \text{sepal length in cm} \\ \text{sepal width in cm} \\ \text{petal length in cm} \\ \text{petal width in cm} \end{bmatrix}
\]
and the loss functions are
• least squares loss
• logistic loss
• hinge loss
Write down each of the formulations explicitly as a convex optimization problem, and solve them using the software of your choice. For example, in MATLAB, the least squares loss can be directly solved by the command Phi\Y for some properly defined Phi and Y; for the latter two, you can use the cvx package found on Prof. Boyd's website https://web.stanford.edu/~boyd/software.html. Report their prediction accuracy on the test set.
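As an illustration only, here is a minimal MATLAB/CVX sketch of a one-vs-rest reading of the hinge-loss case (the problem statement leaves the multiclass handling open, so this is an assumption; Phi_train, y_train, Phi_test, and y_test are hypothetical names for data already split as described above). The logistic loss can be set up analogously, and the least squares loss directly with Phi\Y as noted.

% Hedged sketch: one-vs-rest hinge-loss classifiers on the iris features.
% Assumes Phi_train (120 x 5, first column all ones), y_train (values in 1..3),
% Phi_test (30 x 5), and y_test have already been built from the data set.
k = 3;
scores = zeros(size(Phi_test, 1), k);
for c = 1:k
    s = 2 * (y_train == c) - 1;                % +1 for class c, -1 otherwise
    cvx_begin quiet
        variable theta(5)
        minimize( sum(pos(1 - s .* (Phi_train * theta))) )   % hinge loss; pos(u) = max(u, 0)
    cvx_end
    scores(:, c) = Phi_test * theta;           % decision value for class c
end
[~, y_hat] = max(scores, [], 2);               % predict the highest-scoring class
accuracy = mean(y_hat == y_test)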