Scope: Sessions 1 to 5
How and what to submit
A. Submit a Jupyter Notebook named COM4509-6509_Assignment_1_UCard_XXXXXXXXX.ipynb where
XXXXXXXXX refers to your UCard number.
B. Upload the notebook file to MOLE before the deadline above.
C. NO DATA UPLOAD: Please do not upload the data files used. We have a copy already.
Assessment Criteria
Being able to express an objective function and its gradients in matrix form.
Being able to use numpy and pandas to preprocess a dataset.
Being able to use numpy to build a machine learning pipeline for supervised learning.
Late submissions
We follow the Department's guidelines on late submissions, i.e., a deduction of 5% of the mark for each working
day the work is late after the deadline. NO late submission will be marked more than one week after the deadline
because we will release a solution by then.
Use of unfair means
"Any form of unfair means is treated as a serious academic offence and action may be taken under the
Discipline Regulations." (from the MSc Handbook).
Regularisation for Linear Regression
Regularisation is a technique commonly used in Machine Learning to prevent overfitting. It consists of adding
terms to the objective function such that the optimisation procedure avoids solutions that merely memorise the
training data. Popular techniques for regularisation in Supervised Learning include Lasso Regression, Ridge
Regression and the Elastic Net.
In this Assignment, you will be looking at Ridge Regression and devising equations to optimise the objective
function in Ridge Regression using two methods: a closed-form derivation and the update rules for stochastic
gradient descent. You will then use those update rules to make predictions on an Air Quality dataset.
Ridge Regression
Let us start with a data set for training $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where the vector $\mathbf{y} = [y_1, \dots, y_N]^\top$ and $\mathbf{X}$ is the design
matrix from Lab 3, this is,

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^\top \\ \vdots \\ \mathbf{x}_N^\top \end{bmatrix}.$$

Our predictive model is going to be a linear model

$$f(\mathbf{x}_i) = \mathbf{w}^\top \mathbf{x}_i,$$

where $\mathbf{w} \in \mathbb{R}^{D}$.

The objective function we are going to use has the following form

$$E(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \alpha\, \mathbf{w}^\top \mathbf{w},$$

where $\alpha > 0$ is known as the regularisation parameter.
The first term on the right-hand side (rhs) of the expression for $E(\mathbf{w})$ is very similar to the least-squares
objective function we have seen before, for example in Lab 3. The only difference is the factor $\frac{1}{N}$ that we use
to normalise the objective with respect to the number of observations in the dataset.
The first term on the rhs is what we call the "fitting" term, whereas the second term in the expression is the
regularisation term. Given $\alpha$, the two terms in the expression have different purposes. The first term is looking
for a value of $\mathbf{w}$ that drives the squared errors to zero. While doing this, $\mathbf{w}$ can take any value and lead to a
solution that is only good for the training data but perhaps not for the test data. The second term is
regularising the behaviour of the first term by driving the entries of $\mathbf{w}$ towards zero. By doing this, it restricts the possible
set of values that $\mathbf{w}$ might take according to the first term. The value that we use for $\alpha$ allows a compromise
between a value of $\mathbf{w}$ that exactly fits the data (first term) and a value of $\mathbf{w}$ that does not grow too much (second
term).
This type of regularisation has different names: ridge regression, Tikhonov regularisation or $\ell_2$-norm
regularisation.
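As a concrete (non-assessed) illustration, the objective above can be evaluated numerically with numpy. The function and variable names below are illustrative choices, not part of the assignment code.

```python
import numpy as np

def ridge_objective(w, X, y, alpha):
    """Mean squared error plus an L2 penalty on the weights.

    X: (N, D) design matrix, y: (N,) targets, w: (D,) weights,
    alpha: regularisation parameter. All names are illustrative.
    """
    residuals = y - X @ w          # fitting term uses all N observations
    fit = np.mean(residuals ** 2)  # normalised by the number of observations
    penalty = alpha * (w @ w)      # drives the entries of w towards zero
    return fit + penalty

# Tiny synthetic check: with w = 0 the penalty vanishes and the
# objective reduces to the mean of y**2.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
print(ridge_objective(np.zeros(2), X, y, alpha=0.1))  # 2.5
```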
Question 1: $E(\mathbf{w})$ in matrix form (2 marks)
Write the expression for $E(\mathbf{w})$ in matrix form. Include ALL the steps necessary to reach the expression.
Question 1 Answer
Write your answer to the question in this box.
Optimising the objective function with respect to $\mathbf{w}$
There are two ways we can optimise the objective function with respect to $\mathbf{w}$. The first one leads to a closed-form
expression for $\mathbf{w}$ and the second one uses an iterative optimisation procedure that updates the value of
$\mathbf{w}$ at each iteration by using the gradient of the objective function with respect to $\mathbf{w}$,

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta\, \nabla E\!\left(\mathbf{w}^{(t)}\right),$$

where $\eta$ is the learning rate parameter and $\nabla E(\mathbf{w})$ is the gradient of the objective function.
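To make the iterative procedure concrete, a generic gradient-descent loop is sketched below. The gradient function is deliberately left abstract so that it does not pre-empt the derivation asked for in Question 2; all names are illustrative.

```python
import numpy as np

def gradient_descent(grad_fn, w0, eta=0.01, num_iters=100):
    """Generic gradient descent: w <- w - eta * grad(w) at each iteration.

    grad_fn: callable returning the gradient of the objective at w,
    w0: initial weights, eta: learning rate. Names are illustrative.
    """
    w = w0.copy()
    for _ in range(num_iters):
        w = w - eta * grad_fn(w)
    return w

# Minimising the toy objective f(w) = ||w||^2, whose gradient is 2w,
# should drive w towards the zero vector.
w_final = gradient_descent(lambda w: 2 * w, np.array([1.0, -2.0]),
                           eta=0.1, num_iters=200)
print(np.allclose(w_final, 0.0, atol=1e-6))  # True
```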
Question 2: Derivative of $E(\mathbf{w})$ wrt $\mathbf{w}$ (2 marks)
Find the closed-form expression for $\mathbf{w}$ by taking the derivative of $E(\mathbf{w})$ with respect to $\mathbf{w}$, equating it to zero
and solving for $\mathbf{w}$. Write the expression in matrix form.
Also, write down the specific update rule for $\mathbf{w}$ by using the equation above.
Question 2 Answer
Write your answer to the question in this box.
Using ridge regression to predict air quality
Our dataset comes from a popular machine learning repository that hosts open source datasets for educational
and research purposes, the UCI Machine Learning Repository. We are
going to use ridge regression for predicting air quality. The description of the dataset can be found here.
In [ ]:
In [ ]:
We can see some of the rows in the dataset
In [ ]:
The target variable corresponds to the CO(GT) variable in the first column. The following columns correspond
to the variables in the feature vectors, e.g., PT08.S1(CO) is $x_1$, up until AH, which is $x_{12}$. The original dataset
also has date and time columns that we are not going to use in this assignment.
Before designing our predictive model, we need to think about three stages: the preprocessing stage, the
training stage and the validation stage. The three stages are interconnected and it is important to remember
that the testing data that we use for validation has to be set aside before preprocessing. Any preprocessing that
you do has to be done only on the training data and several key statistics need to be saved for the test stage.
import zipfile

# Extract the dataset archive into the current directory
with zipfile.ZipFile('./AirQualityUCI.zip', 'r') as zf:
    for name in zf.namelist():
        zf.extract(name, '.')

# The .csv version of the file has some typing issues, so we use the Excel version
import pandas as pd

air_quality = pd.read_excel('./AirQualityUCI.xlsx', usecols=range(2, 15))
air_quality.sample(5)
Separating the dataset into training and test before any preprocessing has happened helps us to recreate the
real-world scenario where we will deploy our system and for which the data will come without any
preprocessing.
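A minimal sketch of that principle, using synthetic data (all names and numbers here are illustrative, not the assignment data): any statistics, such as per-feature means and standard deviations, are computed from the training split only and reused unchanged at test time.

```python
import numpy as np

# Synthetic stand-in for a dataset already split 70/30.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(70, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(30, 3))

train_mean = X_train.mean(axis=0)  # saved for the test stage
train_std = X_train.std(axis=0)    # saved for the test stage

X_train_norm = (X_train - train_mean) / train_std
# The test split is transformed with the TRAINING statistics,
# never with its own mean and standard deviation.
X_test_norm = (X_test - train_mean) / train_std
```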
We are going to use hold-out validation for testing our predictive model so we need to separate the dataset into
a training set and a test set.
Question 3: Splitting the dataset (1 mark)
Split the dataset into a training set and a test set. The training set should have 70% of the total observations
and the test set the remaining 30%. For the random selection, make sure that you use a random seed that
corresponds to the last five digits of your student UCard. Make sure that you comment your code.
Question 3 Answer
In [ ]:
Preprocessing the data
The dataset has missing values tagged with a -200 value. Before doing any work with the training data, we want
to make sure that we deal properly with the missing values.
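As an illustrative (non-assessed) sketch of how a sentinel value can be treated as missing in pandas, using a tiny made-up frame rather than the assignment data:

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame standing in for the training split.
df = pd.DataFrame({'CO(GT)': [2.6, -200, 1.3],
                   'PT08.S1(CO)': [1360.0, 1292.0, -200.0]})

# Replace the -200 sentinel with a proper missing value, then
# count how many missing values each column contains.
df = df.replace(-200, np.nan)
print(df.isna().sum())
```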
Question 4: Missing values (3 marks)
- Make some exploratory analysis of the number of missing values per column in the training data.
- Remove the rows for which the target feature has missing values. We are doing supervised learning, so we need all our data observations to have known target values.
- Remove features with more than 20% missing values. For all the other features with missing values, use the mean value of the non-missing values for imputation.
Question 4 Answer
In [ ]:
Question 5: Normalising the training data (2 marks)
Now that you have removed the missing data, we need to normalise the input vectors.
- Explain in a sentence why you need to normalise the input features for this dataset.
- Normalise the training data by subtracting the mean value of each feature and dividing the result by the standard deviation of each feature. Keep the mean values and standard deviations; you will need them at test time.