Introduction to Machine Learning Exercise sheet 1
Exercise 1: Car Price Prediction
Imagine you work at a second-hand car dealer and are tasked with finding for-sale vehicles your company can
acquire at a reasonable price. You decide to address this challenge in a data-driven manner and develop a model
that predicts adequate market prices (in EUR) from vehicles’ properties.
a) Characterize the task at hand: supervised or unsupervised? Regression or classification? Learning to explain
or learning to predict? Justify your answers.
b) How would you set up your data? Name potential features along with their respective data type and state the
target variable.
c) Assume now that you have data on vehicles’ age (days), mileage (km), and price (EUR). Explicitly define the
feature space X and target space Y.
d) You choose to use a linear model (LM) for this task. For this, you assume the targets to be conditionally
independent given the features, i.e., $y^{(i)} \mid x^{(i)} \perp y^{(j)} \mid x^{(j)}$ for all $i, j \in \{1, 2, \dots, n\}$, $i \neq j$, with sample size $n$. The
LM models the target as a linear function of the features with Gaussian error term: $y = X\theta + \epsilon$,
$\epsilon \sim \mathcal{N}(0, \mathrm{diag}(\sigma^2))$, $\sigma > 0$.
State the hypothesis space for the corresponding model class. For this, assume the parameter vector $\theta$ to include
the intercept coefficient.
e) Which parameters need to be learned? Define the corresponding parameter space $\Theta$.
f) State the loss function for the i-th observation using L2 loss.
g) In classical statistics, you would estimate the parameters via maximum likelihood estimation (MLE). The
likelihood for the LM is given by:
$$\mathcal{L}(\theta \mid x) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}\left(y^{(i)} - \theta^\top x^{(i)}\right)^2\right)$$
Describe how you can make use of the likelihood in empirical risk minimization (ERM) and write down the
resulting empirical risk.
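As a sketch of the connection (dropping constants that do not depend on $\theta$): the negative log-likelihood of the Gaussian LM is
$$-\log \mathcal{L}(\theta \mid x) = \frac{n}{2}\log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2}\sum_{i=1}^{n} \left(y^{(i)} - \theta^\top x^{(i)}\right)^2,$$
so minimizing it over $\theta$ is equivalent to ERM with a loss proportional to the squared residual.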
h) Now you need to optimize this risk to find the best parameters, and hence the best model, via empirical risk
minimization. State the optimization problem formally and list the necessary steps to solve it.
Congratulations, you just designed your first machine learning project!
Introduction to Machine Learning Exercise sheet 2
Exercise 1: HRO in mlr3
Throughout the lecture, we will frequently use the R package mlr3 and its descendants, providing an integrated
ecosystem for all common machine learning tasks. Let's recap the HRO principle and see how it is reflected in
mlr3, getting an overview of the most important objects and their usage along the way.
a) How are the key concepts (i.e., hypothesis space, risk and optimization) you learned about in the lecture videos
implemented in mlr3?
b) Have a look at mlr3::tsk("iris"). What attributes does this task object store?
c) Pick an mlr3 learner of your choice. What are the different settings for this learner?
(Hint: use mlr3::mlr_learners$keys() to see all available learners.)
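A minimal R sketch for b) and c); the learner choice (classif.rpart) is just an example:
library(mlr3)
# b) inspect the iris task: target, features, number of observations
task <- tsk("iris")
task$target_names
task$feature_names
task$nrow
# c) list available learners, pick one, and look at its settings
mlr_learners$keys()
learner <- lrn("classif.rpart")
learner$param_set        # hyperparameters and their defaults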
Exercise 2: Loss Functions for Regression Tasks
In this exercise, we will examine loss functions for regression tasks somewhat more in depth.
[Figure: scatter plot of the linear regression task, y (0 to 30) against x (2.5 to 10.0); the new outlier point is shown in orange.]
a) Consider the above linear regression task. How will the model parameters be affected by adding the new outlier
point (orange) if you use
i) L1 loss
ii) L2 loss
in the empirical risk? (You do not need to actually compute the parameter values.)
[Figure: the Huber loss plotted over residuals ranging from −10 to 10, with loss values between 0 and roughly 30.]
b) The second plot visualizes another loss function popular in regression tasks, the so-called Huber loss (depending
on $\epsilon > 0$; here: $\epsilon = 5$). Describe how the Huber loss deals with residuals as compared to L1 and L2 loss.
Can you guess its definition?
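If you want to check your guess empirically, a small R sketch that plots one candidate definition of the Huber loss (treat the formula as an assumption to verify against the plot, not as given):
# candidate Huber loss: quadratic for small residuals, linear beyond epsilon
huber <- function(r, eps = 5) {
  ifelse(abs(r) <= eps, 0.5 * r^2, eps * (abs(r) - 0.5 * eps))
}
r <- seq(-10, 10, by = 0.1)
plot(r, huber(r), type = "l", xlab = "residual", ylab = "loss")
lines(r, abs(r), lty = 2)        # L1 loss
lines(r, 0.5 * r^2, lty = 3)     # (scaled) L2 loss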
c) Derive the least-squares estimator, i.e., the solution to the linear model when using L2 loss, analytically via
$\hat{\theta} = \arg\min_{\theta \in \Theta} \|y - X\theta\|_2^2$.
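A rough outline of the derivation (intermediate steps left to you): setting the gradient to zero,
$$\nabla_\theta \|y - X\theta\|_2^2 = -2X^\top(y - X\theta) \overset{!}{=} 0 \;\Longrightarrow\; X^\top X\,\hat{\theta} = X^\top y \;\Longrightarrow\; \hat{\theta} = \left(X^\top X\right)^{-1} X^\top y,$$
assuming $X^\top X$ is invertible.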
Exercise 3: Polynomial Regression
Assume the following (noisy) data-generating process from which we have observed 50 realizations:
$$y = 3 + 5 \cdot \sin(0.4\pi x) + \epsilon$$
with $\epsilon \sim \mathcal{N}(0, 1)$.
[Figure: the 50 observed realizations, with x ranging from −2 to 2 and y roughly from −10 to 0.]
a) We decide to model the data with a cubic polynomial (including intercept term). State the corresponding
hypothesis space.
b) Demonstrate that this hypothesis space is simply a parameterized family of curves by plotting in R curves for
3 different models belonging to the considered model class.
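One possible R sketch for b); the coefficient values are arbitrary examples:
# three arbitrary members of the cubic-polynomial hypothesis space
f1 <- function(x) 1 + 2 * x - 0.5 * x^2 + 0.1 * x^3
f2 <- function(x) -2 + x + x^2 - 0.5 * x^3
f3 <- function(x) -3 * x + 0.2 * x^2 + 0.3 * x^3
curve(f1, from = -2, to = 2, ylim = c(-10, 10), ylab = "f(x)")
curve(f2, add = TRUE, lty = 2)
curve(f3, add = TRUE, lty = 3)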
c) State the empirical risk w.r.t. $\theta$ for a member of the hypothesis space. Use L2 loss and be as explicit as possible.
d) We can minimize this risk using gradient descent. In order to make this somewhat easier, we will denote the
transformed feature matrix, containing x to the power from 0 to 3, by $\tilde{X}$, such that we can express our model
by $\tilde{X}\theta$ (note that the model is still linear in its parameters, even if X has been transformed in a non-linear
manner!). Derive the gradient of the empirical risk w.r.t. $\theta$.
e) Using the result from d), state the calculation to update the current parameter $\theta^{[t]}$.
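A compact R sketch of the resulting update loop, taking the empirical risk as the plain sum of squared residuals (step size, starting value, and number of iterations are arbitrary choices; x and y are assumed to hold the 50 observations):
X_tilde <- cbind(1, x, x^2, x^3)    # x raised to the powers 0 to 3
theta <- rep(0, 4)                  # arbitrary starting point
alpha <- 1e-4                       # step size (needs tuning)
for (t in seq_len(5000)) {
  grad <- 2 * t(X_tilde) %*% (X_tilde %*% theta - y)   # gradient of the L2 risk
  theta <- theta - alpha * as.vector(grad)             # update of theta[t]
}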
f) You will not be able to fit the data perfectly with a cubic polynomial. Describe the advantages and disadvantages
that a more flexible model class would have. Would you opt for a more flexible learner?
Exercise 4: Predicting abalone
We want to predict the age of an abalone using its longest shell measurement and its weight.
See https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/ for more details.
# data location assumed to be the standard UCI abalone file
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
abalone <- read.table(url, sep = ",", row.names = NULL)
colnames(abalone) <- c(
  "sex", "longest_shell", "diameter", "height", "whole_weight",
  "shucked_weight", "visceral_weight", "shell_weight", "rings")
abalone <- abalone[, c("longest_shell", "whole_weight", "rings")]
a) Plot LongestShell and WholeWeight on the x- and y-axis, respectively, and color points according to Rings.
Using mlr3:
b) Create an mlr3 task for the abalone data.
c) Define a linear regression learner (for this you will need to load the mlr3learners extension package first)
and use it to train a linear model on the abalone data.
d) Compare the fitted and observed targets visually.
(Hint: use autoplot().)
e) Assess the model’s training loss in terms of MAE.
(Hint: losses are retrieved by calling score(), which accepts different mlr3 measures, on the prediction object.)
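A possible end-to-end sketch with mlr3, assuming the abalone data frame from above (plotting and measure choices are one option among several):
library(mlr3)
library(mlr3learners)
library(mlr3viz)
library(ggplot2)
# a) visualize the raw data
ggplot(abalone, aes(x = longest_shell, y = whole_weight, color = rings)) +
  geom_point()
# b) regression task with rings as target
task <- as_task_regr(abalone, target = "rings", id = "abalone")
# c) linear regression learner, trained on the full data
learner <- lrn("regr.lm")
learner$train(task)
# d) compare fitted and observed targets
prediction <- learner$predict(task)
autoplot(prediction)
# e) training loss in terms of MAE
prediction$score(msr("regr.mae"))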
Introduction to Machine Learning Exercise sheet 3
Exercise 1: Logistic Regression Basics
a) What is the relationship between the softmax function
$$\pi_k(x \mid \theta) = \frac{\exp(\theta_k^\top x)}{\sum_{j=1}^{g} \exp(\theta_j^\top x)}, \quad k \in \{1, \dots, g\},$$
and the logistic function
$$\pi(x \mid \theta) = \frac{1}{1 + \exp(-\theta^\top x)}$$
for g = 2 (binary classification)?
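One possible route (sketch): for g = 2, divide numerator and denominator of the softmax by $\exp(\theta_1^\top x)$:
$$\pi_1(x \mid \theta) = \frac{\exp(\theta_1^\top x)}{\exp(\theta_1^\top x) + \exp(\theta_2^\top x)} = \frac{1}{1 + \exp\left(-(\theta_1 - \theta_2)^\top x\right)},$$
which is the logistic function with parameter $\theta := \theta_1 - \theta_2$.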
b) The likelihood function for a multinomially distributed target variable with g target classes is given by¹
$$\mathcal{L}_i(\theta) = P\left(y^{(i)} \mid x^{(i)}, \theta_1, \theta_2, \dots, \theta_g\right) = \prod_{j=1}^{g} \pi_j\left(x^{(i)} \mid \theta\right)^{\mathbb{I}(y^{(i)} = j)},$$
where the posterior class probabilities $\pi_1(x^{(i)} \mid \theta), \pi_2(x^{(i)} \mid \theta), \dots, \pi_g(x^{(i)} \mid \theta)$ are modeled with softmax
regression. Derive the likelihood function for n independent observations.
c) We have already addressed the connection that holds between maximum likelihood estimation and empirical
risk minimization. Transform the joint likelihood function into an empirical risk function.
Hints:
By following the maximum likelihood principle, we should look for parameters $\theta_1, \theta_2, \dots, \theta_g$ that maximize
the likelihood function.
The expressions $\prod_i \mathcal{L}_i$ and $\log \prod_i \mathcal{L}_i$, if defined, are maximized by the same parameters.
Minimizing a scalar function multiplied with −1 is equivalent to maximizing the original function.
State the associated risk function.
d) Write down the discriminant functions of multiclass logistic regression resulting from this minimization objective.
How do we arrive at the final prediction?
e) State the parameter space $\Theta$ and corresponding hypothesis space $\mathcal{H}$ for the multiclass case.
Exercise 2: Decision Boundaries & Thresholds in Logistic Regression
In logistic regression (binary case), we estimate the probability $P(y = 1 \mid x, \theta) = \pi(x \mid \theta)$. In order to decide about
the class of an observation, we set $\hat{y} = 1$ iff $\hat{\pi}(x \mid \theta) \ge \alpha$ for some $\alpha \in (0, 1)$.
a) Show that the decision boundary of the logistic classifier is a (linear!) hyperplane.
Hint: derive the value of $\theta^\top x$ (depending on $\alpha$) starting from which you predict y = 1 rather than y = 0.
¹ While this might look somewhat complicated, it is actually just a very concise way to express the multinomial likelihood: for
each observation, all factors but the one corresponding to the true class $j_0$ will be 1 (due to the 0 exponent), so the result is simply
$\pi_{j_0}(x^{(i)} \mid \theta)$.
b) Below you see the logistic function for a binary classification problem with two input features for different values
of $\theta = (\theta_1, \theta_2)$ (plots 1-3) as well as $\alpha$ (plot 4). What can you deduce for the values of $\theta_1$, $\theta_2$ and $\alpha$? What are
the implications for classification in the different scenarios?
[Figure: four panels, Plot (1) through Plot (4).]
c) Derive the equation for the decision boundary hyperplane if we choose $\alpha = 0.5$.
d) Explain when it might be sensible to set $\alpha$ to 0.5.
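As a hint for a) and c), one possible chain of equivalences (sketch):
$$\hat{\pi}(x \mid \theta) = \frac{1}{1 + \exp(-\theta^\top x)} \ge \alpha \iff \theta^\top x \ge \log\frac{\alpha}{1 - \alpha},$$
so the decision boundary is the hyperplane $\theta^\top x = \log\frac{\alpha}{1 - \alpha}$, which reduces to $\theta^\top x = 0$ for $\alpha = 0.5$.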
Introduction to Machine Learning Exercise sheet 4
Exercise 1: Naive Bayes
You are given the following table with the target variable Banana:
ID Color Form Origin Banana
1 yellow oblong imported yes
2 yellow round domestic no
3 yellow oblong imported no
4 brown oblong imported yes
5 brown round domestic no
6 green round imported yes
7 green oblong domestic no
8 red round imported no
a) We want to use a Naive Bayes classifier to predict whether a new fruit is a Banana or not. Estimate the posterior
probability $\hat{\pi}(x_*)$ for a new observation $x_* = (\text{yellow}, \text{round}, \text{imported})$. How would you classify the object?
b) Assume you have an additional feature Length that measures the length in cm. Describe in 1-2 sentences how
you would handle this numeric feature with Naive Bayes.
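If you want to verify your hand calculation from a), a small R sketch using e1071 (the package choice is just one option; note that the new observation's factor levels must match the training levels):
library(e1071)
fruits <- data.frame(
  color  = c("yellow", "yellow", "yellow", "brown", "brown", "green", "green", "red"),
  form   = c("oblong", "round", "oblong", "oblong", "round", "round", "oblong", "round"),
  origin = c("imported", "domestic", "imported", "imported", "domestic",
             "imported", "domestic", "imported"),
  banana = c("yes", "no", "no", "yes", "no", "yes", "no", "no"),
  stringsAsFactors = TRUE
)
model <- naiveBayes(banana ~ ., data = fruits)
new_obs <- data.frame(
  color  = factor("yellow",   levels = levels(fruits$color)),
  form   = factor("round",    levels = levels(fruits$form)),
  origin = factor("imported", levels = levels(fruits$origin))
)
predict(model, new_obs, type = "raw")   # posterior probabilities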
Exercise 2: Discriminant Analysis
[Figure: scatter plot of the n = 200 observations, with x ranging from 0 to 8 and y from 2.0 to 4.0.]
The above plot shows $\mathcal{D} = \left(\left(x^{(1)}, y^{(1)}\right), \dots, \left(x^{(n)}, y^{(n)}\right)\right)$, a data set with n = 200 observations of a continuous
target variable y and a continuous, 1-dimensional feature variable x. In the following, we aim at predicting y with
a machine learning model that takes x as input.
a) To prepare the data for classification, we categorize the target variable y in 3 classes and call the transformed
target variable z, as follows:
$$z^{(i)} = \begin{cases} 1, & y^{(i)} \in (-\infty, 2.5] \\ 2, & y^{(i)} \in (2.5, 3.5] \\ 3, & y^{(i)} \in (3.5, \infty) \end{cases}$$
Now we can apply quadratic discriminant analysis (QDA):
i) Estimate the class means $\mu_k = \mathbb{E}(x \mid z = k)$ for each of the three classes $k \in \{1, 2, 3\}$ visually from the plot.
Do not overcomplicate this, a rough estimate is sufficient here.
ii) Make a plot that visualizes the different estimated densities per class.
iii) How would your plot from ii) change if we used linear discriminant analysis (LDA) instead of QDA? Explain
your answer.
iv) Why is QDA preferable over LDA for this data?
b) Given are two new observations $x_{*1} = 10$ and $x_{*2} = 7$. State the prediction for QDA and explain how you
arrive there.
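For ii), a generic R sketch; the means and standard deviations below are placeholders only (not read off the plot) and should be replaced by your own estimates from i):
means <- c(2, 4, 6)       # hypothetical class means of x, one per class
sds   <- c(0.5, 1, 1.5)   # hypothetical class standard deviations (class-specific for QDA)
x_grid <- seq(0, 8, length.out = 200)
plot(x_grid, dnorm(x_grid, means[1], sds[1]), type = "l", xlab = "x", ylab = "density")
lines(x_grid, dnorm(x_grid, means[2], sds[2]), lty = 2)
lines(x_grid, dnorm(x_grid, means[3], sds[3]), lty = 3)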
Exercise 3: Decision Boundaries for mlr3 Learners
We will now visualize how well different learners classify the three-class mlbench::mlbench.cassini data set.
Generate 1000 points from cassini, perturb the x.2 dimension with Gaussian noise (mean 0, standard deviation
0.5), and consider the classifiers already introduced in the lecture:
LDA,
QDA, and
Naive Bayes.
Plot the learners' decision boundaries. Can you spot differences in separation ability?
(Note that logistic regression cannot handle more than two classes and is therefore not listed here.)
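A sketch of how this could look with mlr3; drawing decision boundaries via mlr3viz::plot_learner_prediction is one convenient route:
library(mlbench)
library(mlr3)
library(mlr3learners)
library(mlr3viz)
set.seed(42)
cassini <- mlbench.cassini(1000)
dat <- data.frame(x.1 = cassini$x[, 1], x.2 = cassini$x[, 2], class = cassini$classes)
dat$x.2 <- dat$x.2 + rnorm(nrow(dat), mean = 0, sd = 0.5)   # perturb second dimension
task <- as_task_classif(dat, target = "class", id = "cassini")
for (key in c("classif.lda", "classif.qda", "classif.naive_bayes")) {
  print(plot_learner_prediction(lrn(key, predict_type = "prob"), task))
}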
Introduction to Machine Learning Exercise sheet 5
https://slds-lmu.github.io/i2ml/ WS 2021/2022
Exercise 1: Evaluating regression learners
Imagine you work for a data science start-up and sell turn-key statistical models. Based on a set of training
data, you develop a regression model to predict a customer’s legal expenses from the average monthly number of
indictments brought against their firm.
a) Due to the financial sensitivity of the situation, you opt for a very flexible learner that fits the customer’s data
(ntrain = 50 observations) well, and end up with a degree-21 polynomial (blue, solid). Your colleague is skeptical
and argues for a much simpler linear learner (gray, dashed). Which of the models will have a lower empirical
risk if standard L2 loss is used?
[Figure: the training data with both fits; x-axis: average number of indictments per month (10 to 15), y-axis: legal expenses in million EUR (5 to 11).]
b) Why might evaluation based on training error not be a good idea here?
c) Evaluate both learners on the following test data (ntest = 10), using
i) mean squared error (MSE), and
ii) mean absolute error (MAE).
State your performance assessment and explain potential differences.
(Hint: use R if you don’t feel like computing a degree-21 polynomial regression by hand.)
set.seed(123)
x_train <- seq(10, 15, length.out = 50)
y_train <- 10 + 3 * sin(0.15 * pi * x_train) + rnorm(length(x_train), sd = 0.5)
data_train <- data.frame(x = x_train, y = y_train)
set.seed(321)
x_test <- seq(10, 15, length.out = 10)
y_test <- 10 + 3 * sin(0.15 * pi * x_test) + rnorm(length(x_test), sd = 0.5)
data_test <- data.frame(x = x_test, y = y_test)
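One way to carry out c) in R; fitting the degree-21 polynomial via poly() is just one convenient route:
mod_poly <- lm(y ~ poly(x, 21), data = data_train)   # flexible learner
mod_lin  <- lm(y ~ x, data = data_train)             # simple linear learner
pred_poly <- predict(mod_poly, newdata = data_test)
pred_lin  <- predict(mod_lin,  newdata = data_test)
mse <- function(y, yhat) mean((y - yhat)^2)
mae <- function(y, yhat) mean(abs(y - yhat))
c(mse_poly = mse(data_test$y, pred_poly), mse_lin = mse(data_test$y, pred_lin))
c(mae_poly = mae(data_test$y, pred_poly), mae_lin = mae(data_test$y, pred_lin))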
Exercise 2: Importance of train-test split
We consider the BostonHousing data for which we would like to predict the nitric oxides concentration (nox) from
the distance to a number of firms (dis).
library(mlbench)
data(BostonHousing)
data_pollution <- data.frame(dis = BostonHousing$dis, nox = BostonHousing$nox)
data_pollution <- data_pollution[order(data_pollution$dis), ]
head(data_pollution)
## dis nox
## 373 1.1296 0.668
## 375 1.1370 0.668
## 372 1.1691 0.631
## 374 1.1742 0.668
## 407 1.1781 0.659
## 371 1.2024 0.631
ggplot2::ggplot(data_pollution, ggplot2::aes(x = dis, y = nox)) +
ggplot2::geom_point() +
ggplot2::theme_classic()
[Figure: scatter plot of nox (0.4 to 0.8) against dis (roughly 1 to 12.5).]
a) Use the first ten observations as training data to compute a linear model with mlr3 and evaluate the performance
of your learner on the remaining data using MSE.
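A sketch of a) with mlr3 (task construction via as_task_regr; rows 1 to 10 are the first ten observations of the ordered data):
library(mlr3)
library(mlr3learners)
task    <- as_task_regr(data_pollution, target = "nox", id = "pollution")
learner <- lrn("regr.lm")
learner$train(task, row_ids = 1:10)                     # first ten observations
prediction <- learner$predict(task, row_ids = 11:task$nrow)
prediction$score(msr("regr.mse"))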
b) What might be disadvantageous about the train-test split in a)?
c) Now, sample your training observations from the data set at random. Use a share of 0.1 through 0.9, in 0.1
steps, of observations for training and repeat this procedure ten times. Afterwards, plot the resulting test errors
(in terms of MSE) in a suitable manner.
(Hint: rsmp is a convenient function for splitting data - you will want to choose the "holdout" strategy.
Afterwards, resample can be used to repeatedly fit the learner.)
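A possible sketch for c), reusing the task and learner from a) and looping over training shares and repetitions (the final plot is only one option, e.g., boxplots per ratio):
set.seed(123)
results <- data.frame()
for (ratio in seq(0.1, 0.9, by = 0.1)) {
  for (rep in 1:10) {
    rr <- resample(task, learner, rsmp("holdout", ratio = ratio))
    results <- rbind(results, data.frame(
      ratio = ratio, rep = rep, mse = unname(rr$aggregate(msr("regr.mse")))))
  }
}
ggplot2::ggplot(results, ggplot2::aes(x = factor(ratio), y = mse)) +
  ggplot2::geom_boxplot()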
d) Interpret the findings from c).
Introduction to Machine Learning Exercise sheet 6
Exercise 1: Overfitting & underfitting
Assume a polynomial regression model with a continuous target variable y and a continuous, p-dimensional feature
vector x and polynomials of degree d, i.e.,
$$f\left(x^{(i)}\right) = \sum_{j=1}^{p} \sum_{k=0}^{d} \theta_{j,k} \left(x_j^{(i)}\right)^k,$$
and $y^{(i)} = f\left(x^{(i)}\right) + \epsilon^{(i)}$, where the $\epsilon^{(i)}$ are iid with $\mathrm{Var}\left(\epsilon^{(i)}\right) = \sigma^2$ for all $i \in \{1, \dots, n\}$.
a) For each of the following situations, indicate whether we would generally expect the performance of a flexible
polynomial learner (high d) to be better or worse than an inflexible one (low d). Justify your answer.
(i) The sample size n is extremely large, and the number of features p is small.
(ii) The number of features p is extremely large, and the number of observations n is small.
(iii) The true relationship between the features and the response is highly non-linear.
(iv) The variance of the error terms, $\sigma^2$, is extremely high.
b) Are overfitting and underfitting properties of a learner or of a fixed model? Explain your answer.
c) Should we aim to completely avoid both overfitting and underfitting?
Exercise 2: Resampling strategies
a) Why would we apply resampling rather than a single holdout split?
b) Using mlr3, classify the german credit data into solvent and insolvent debtors using logistic regression.
Compute the training error w.r.t. MCE.
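A minimal sketch: german_credit ships as a built-in mlr3 task, classif.log_reg comes from mlr3learners, and classif.ce is the misclassification error:
library(mlr3)
library(mlr3learners)
task    <- tsk("german_credit")
learner <- lrn("classif.log_reg")
learner$train(task)
prediction <- learner$predict(task)     # predictions on the training data
prediction$score(msr("classif.ce"))     # training error (MCE)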