COMP9417 - Machine Learning
Homework 1: Gradient Descent & Friends
Introduction
In this homework, you will be required to manually implement (stochastic) gradient descent in Python to learn the parameters of a linear regression model. You will make use of the publicly available 'real estate.csv' dataset, which you can download directly from the course Moodle page. The dataset contains 414 real estate records, each of which contains the following features:
• transactiondate: date of transaction
• age: age of property
• nearestMRT: distance of property to the nearest MRT (mass rapid transit) station
• nConvenience: number of convenience stores in nearby locations
• latitude
• longitude
The target variable is the property price. The goal is to learn to predict property prices as a function of a
subset of the above features.
Question 1. (Pre-processing)
A number of pre-processing steps are required before we attempt to fit any models to the data.
(a) Remove any rows of the data that contain a missing ('NA') value. List the indices of the removed data points. Then, delete all features from the dataset apart from: age, nearestMRT and nConvenience.
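The following sketch shows one possible way to carry out these steps with pandas; the column names below, including 'price' for the target, are assumptions based on the feature list above and may need adjusting to match the actual csv headers.

import pandas as pd

# Load the raw data (file available from the course Moodle page).
df = pd.read_csv('real estate.csv')

# Indices of rows containing at least one missing value, to be reported.
na_rows = df[df.isna().any(axis=1)].index.tolist()
print("Removed rows:", na_rows)

# Drop those rows, then keep only the three features and the target.
df = df.dropna()
df = df[['age', 'nearestMRT', 'nConvenience', 'price']]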
(b) An important pre-processing step is feature normalisation. Feature normalisation involves rescaling the features so that they all have similar scales. This is also important for algorithms like gradient descent, as it helps to ensure convergence. One common technique is called min-max normalisation, in which each feature is scaled to the range [0, 1]. To do this, for each feature we must find the minimum and maximum values in the available sample, and then use the following formula to make the transformation:
\[
x_{\text{new}} = \frac{x - \min(x)}{\max(x) - \min(x)}.
\]
After applying this normalisation, the minimum value of your feature will be 0, and the maximum
will be 1. For each of the features, provide the mean value over your dataset.
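A minimal sketch of this transformation, assuming the DataFrame df from the sketch above (one possible implementation, not the only acceptable one):

# Min-max normalise each feature column to the range [0, 1].
features = ['age', 'nearestMRT', 'nConvenience']
for col in features:
    col_min, col_max = df[col].min(), df[col].max()
    df[col] = (df[col] - col_min) / (col_max - col_min)

# Mean of each normalised feature, to be reported.
print(df[features].mean())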
Question 2. (Train and Test sets)
Now that the data is pre-processed, we will create train and test sets to build and then evaluate our
models on. Use the first half of the observations (in the same order as in the original csv file, excluding those you removed in the previous question) to create the training set, and the remaining half for the test set. Print out the first and last rows of both your training and test sets.
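One possible way to construct this split, assuming the cleaned and normalised DataFrame df from the earlier sketches:

# Split the data in half, preserving the original csv order.
n_train = len(df) // 2
train = df.iloc[:n_train]
test = df.iloc[n_train:]

# First and last rows of each set.
print(train.iloc[[0, -1]])
print(test.iloc[[0, -1]])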
Question 3. (Loss Function)
Consider the loss function
\[
L_c(x, y) = \sqrt{\frac{1}{c^2}(x - y)^2 + 1} - 1,
\]
where c ∈ R is a hyper-parameter. Consider the (simple) linear model
\[
\hat{y}^{(i)} = w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} + w_3 x_3^{(i)}, \qquad i = 1, \dots, n.
\]
We can write this more succinctly by letting $w = (w_0, w_1, w_2, w_3)^T$ and $X^{(i)} = (1, x_1^{(i)}, x_2^{(i)}, x_3^{(i)})^T$, so that $\hat{y}^{(i)} = w^T X^{(i)}$. The mean loss achieved by our model ($w$) on a given dataset of $n$ observations is then
\[
L_c(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} L_c\left(y^{(i)}, \hat{y}^{(i)}\right) = \frac{1}{n} \sum_{i=1}^{n} \left[ \sqrt{\frac{1}{c^2}\left(y^{(i)} - \langle w, X^{(i)} \rangle\right)^2 + 1} - 1 \right].
\]
Compute the following derivatives:
\[
\frac{\partial L_c\left(y^{(i)}, \hat{y}^{(i)}\right)}{\partial w_k}, \qquad k = 0, 1, 2, 3.
\]
You must show your working for full marks.
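As a sanity check on the definitions above (not part of the required working), a small sketch that evaluates the mean loss for a given weight vector; here X is assumed to be an n × 4 NumPy array whose first column is all ones, and y an array of n targets.

import numpy as np

def mean_loss(w, X, y, c):
    # y^(i) - <w, X^(i)> for each observation i
    residuals = y - X @ w
    # Mean of L_c over the dataset.
    return np.mean(np.sqrt((residuals / c) ** 2 + 1) - 1)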
Question 4. (Gradient Descent Pseudocode)
For some value of c > 0, we have
\[
\frac{\partial L_c\left(y^{(i)}, \hat{y}^{(i)}\right)}{\partial w_k} = \frac{x_k^{(i)} \left( \langle w^{(t)}, X^{(i)} \rangle - y^{(i)} \right)}{2 \sqrt{\left( \langle w^{(t)}, X^{(i)} \rangle - y^{(i)} \right)^2 + 4}}, \qquad k = 0, 1, 2, 3,
\]
where $x_0^{(i)} = 1$ corresponds to the intercept term.
Using this result, write down the gradient descent updates for $w_0, w_1, w_2, w_3$ (using pseudocode), assuming a step size of η. Note that in gradient descent we consider the loss over the entire dataset, not just at a single observation. Further, provide pseudocode for stochastic gradient descent updates. Here, $w^{(t)}$ denotes the weight vector at the t-th iteration of gradient descent.
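To make the note above concrete: since the mean loss is an average of per-observation losses, the full-batch gradient used in GD is the corresponding average of the per-observation derivatives,
\[
\frac{\partial L_c(y, \hat{y})}{\partial w_k} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial L_c\left(y^{(i)}, \hat{y}^{(i)}\right)}{\partial w_k},
\]
whereas SGD uses a single term of this sum per update.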
Question 5. (Gradient Descent Implementation)
In this section, you will implement gradient descent from scratch on the dataset constructed in Questions 1 and 2, using the gradients computed in Question 3 and the pseudocode in Question 4.
(a) Initialise your weight vector to $w^{(0)} = [1, 1, 1, 1]^T$. Consider step sizes
η ∈ {10, 5, 2, 1, 0.5, 0.25, 0.1, 0.05, 0.01}
(a total of 9 step sizes). For each step size, generate a plot of the loss achieved at each iteration of gradient descent. You should use 400 iterations in total (which will require you to loop over your training data in order). Generate a 3×3 grid of plots showing performance for each step size. Add a screenshot of the Python code written for this question in your report (as well as pasting the code into your solutions.py file). The following code may help with plotting:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(3, 3, figsize=(10, 10))
nIter = 400
alphas = [10, 5, 2, 1, 0.5, 0.25, 0.1, 0.05, 0.01]
for i, axi in enumerate(ax.flat):
    # losses is a list of 9 elements; each element is an array of length
    # nIter storing the loss at each iteration for that particular step size
    axi.plot(losses[i])
    axi.set_title(f"step size: {alphas[i]}")  # plot titles
plt.tight_layout()  # plot formatting
plt.show()
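For the gradient descent loop itself, one possible structure (a sketch only, assuming c = 2 so that the Question 4 formula applies, with X_train an n × 4 array with a leading column of ones and y_train the targets, as in the earlier sketches):

import numpy as np

def batch_gradient(w, X, y):
    # Average of the per-observation gradients from Question 4 (c = 2).
    r = X @ w - y                       # residuals <w, X^(i)> - y^(i)
    return X.T @ (r / (2 * np.sqrt(r ** 2 + 4))) / len(y)

# X_train / y_train are assumed to hold the training set from Question 2.
w = np.ones(4)                          # w^(0) = [1, 1, 1, 1]^T
eta = 0.1                               # one of the candidate step sizes
for t in range(400):
    w = w - eta * batch_gradient(w, X_train, y_train)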
(b) From your results in the previous part, choose an appropriate step size (and state your choice), and
explain why you made this choice.
(c) For this part, take η = 0.3, re-run GD under the same parameter initialisations and number of iterations. On a single plot, plot the progression of each of the four weights over the iterations. Print out the final weight vector. Finally, run your model on the train and test sets, and print the achieved losses.
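As with the plotting snippet in part (a), the following sketch may help; it assumes the weight vector at every iteration has been recorded in a NumPy array W of shape (nIter + 1, 4), which is an implementation choice rather than a requirement.

import matplotlib.pyplot as plt

# W[t] holds the weight vector after t iterations; W[0] is the initialisation.
for k in range(4):
    plt.plot(W[:, k], label=f"w{k}")
plt.xlabel("iteration")
plt.ylabel("weight value")
plt.legend()
plt.show()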
Question 6. (Stochastic Gradient Descent Implementation)
We will now re-run the analysis in Question 5, but using stochastic gradient descent (SGD) instead. In SGD, we update the weights after seeing a single observation.
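For illustration only, a minimal sketch of this update scheme, analogous to the GD sketch in Question 5 and again assuming c = 2 with X and y as NumPy arrays; your own implementation may be structured differently.

import numpy as np

def grad_single(w, x_i, y_i):
    # Gradient of L_c (c = 2) at one observation, as given in Question 4.
    r = x_i @ w - y_i                          # <w, X^(i)> - y^(i)
    return x_i * r / (2 * np.sqrt(r ** 2 + 4))

def sgd(w0, X, y, eta, n_epochs):
    # One weight update per observation, in the stored data order.
    w = w0.copy()
    for _ in range(n_epochs):
        for i in range(len(y)):
            w = w - eta * grad_single(w, X[i], y[i])
    return w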
(a) Use the same settings as in 5(a). For each step size, generate a plot of the loss achieved at each iteration of stochastic gradient descent, which is to be run for 6 epochs. An epoch is one pass over the entire training dataset, so in total there should be 6 × ntrain updates, where ntrain is the size of your training data. Be sure to run these updates in the same ordering that the data is stored. Generate a 3 × 3 grid of plots showing performance for each step size. Add a screenshot of the Python code written for this question in your report (as well as pasting the code into your solutions.py file).
(b) From your results in the previous part, choose an appropriate step size (and state your choice), and
explain why you made this choice.
(c) Take η = 0.4 and re-run SGD under the same parameter initialisations and number of epochs. On a single plot, plot the progression of each of the four weights over the iterations. Print your final model. Finally, run your model on the train and test sets, and record the achieved losses.
Question 7. (Results Analysis)
In a few lines, comment on your results in Questions 5 and 6. Explain the importance of the step-size
in both GD and SGD. Explain why one might have a preference for GD or SGD. Explain why the GD
paths look much smoother than the SGD paths.
Points Allocation
There are a total of 10 marks, allocated as follows:
• Question 1: 1 mark (split 0.5/0.5)
• Question 2: 0.5 marks
• Question 3: 1.5 marks
• Question 4: 1 mark
• Question 5: 2.5 marks (split 1/0.5/1)
• Question 6: 2.5 marks (split 1/0.5/1)
• Question 7: 1 mark