COMS30034: Machine learning coursework
1 Introduction
This coursework is designed for you to apply some of the methods that you
have learned during our Machine Learning unit and that are also commonly
applied in practice. Given that this is your only assessment for this unit the
coursework is designed to be relatively open-ended with some guidelines, so
that you can demonstrate your knowledge of what was taught – both in the
labs and in the lectures.
Figure 1: Samples from the MNIST dataset.
2 Tasks
In this coursework, we will focus on the classical hand-written MNIST
dataset and the California housing regression dataset. We recommend that
you first get a basic implementation working, start writing your report with
plots of results across all four topics, and then gradually improve them. Where
suitable you should discuss your results in light of the concepts covered in
the lectures (e.g. curse of dimensionality, overfitting, etc.).
2.1 Analysing MNIST
To gain a deeper understanding of a particular dataset it is often a good
strategy to analyse it using unsupervised methods. Use only the MNIST
dataset for this task.
2.1.1 PCA
Run PCA on the MNIST dataset. How much variance does each principal
component explain? Plot the two components that explain the most variance.
Interpret and discuss your results.
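The PCA step above can be sketched as follows. This is a minimal example, not a model answer: it uses Scikit-learn's bundled load_digits dataset (8x8 images) as a small offline stand-in, whereas the coursework requires the full MNIST dataset (e.g. via fetch_openml("mnist_784")).

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# load_digits (8x8 digit images) is a small stand-in here;
# for the coursework, load the full MNIST dataset instead.
X, y = load_digits(return_X_y=True)

pca = PCA()                      # keep all components
X_proj = pca.fit_transform(X)

# Fraction of the total variance explained by each principal component
print(pca.explained_variance_ratio_[:5])

# Projection onto the two components that explain the most variance,
# suitable for a 2D scatter plot coloured by digit class
X_2d = X_proj[:, :2]
```

Plotting pca.explained_variance_ratio_ (or its cumulative sum) against the component index gives the variance plot asked for above.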
2.1.2 K-Means
Apply K-means (with K = 10) using the first two components from the PCA
analysis above. Plot your clusters in 2D and relate them to the digit classes.
What does each cluster correspond to? How good is the match between a
given cluster and a specific digit? Interpret and discuss your results.
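A possible starting point for this step is sketched below, again using load_digits as an offline stand-in for MNIST. The per-cluster "majority digit" summary is one simple (assumed, not prescribed) way to quantify how well clusters match digit classes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# load_digits as a small stand-in for MNIST
X, y = load_digits(return_X_y=True)

# Project onto the first two principal components, as in the PCA task
X_2d = PCA(n_components=2).fit_transform(X)

# K = 10, matching the number of digit classes
km = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = km.fit_predict(X_2d)

# Majority digit per cluster: a simple way to relate clusters to classes
for k in range(10):
    counts = np.bincount(y[labels == k], minlength=10)
    print(f"cluster {k}: digit {counts.argmax()}, "
          f"purity {counts.max() / counts.sum():.2f}")
```

Scatter-plotting X_2d coloured once by labels and once by y makes the cluster-to-digit comparison visual.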
2.2 Classifiers
Building on what you learnt in the labs, here you are asked to contrast
two types of classifiers: ANNs and SVMs. Using the libraries from the labs
you only need to run two classifiers, discussing the advantages and
disadvantages of each relative to the other. You should make sure to control for overfitting.
Use only the MNIST dataset for this task.
california-housing-dataset
2
2.2.1 ANNs
Train an ANN and plot the training and validation learning curves. Does the
model overfit? What are your results on the test dataset? Interpret and
discuss your results. How do they compare with SVMs (plot decision boundaries
and the number of training epochs to help you compare)? And why? How
do the hyperparameters (e.g. learning rate) impact performance?
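A minimal ANN baseline along these lines is sketched below using Scikit-learn's MLPClassifier. The hidden-layer size and learning rate are illustrative choices to tune, not prescribed values, and load_digits again stands in for MNIST; early_stopping holds out part of the training data so that both a training and a validation curve are available.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# load_digits as an offline stand-in for MNIST; pixels scaled to [0, 1]
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X / 16.0, y, test_size=0.3, random_state=0)

# Hidden-layer size and learning rate are illustrative hyperparameters;
# early_stopping holds out 10% of the training data for validation
ann = MLPClassifier(hidden_layer_sizes=(64,), learning_rate_init=1e-3,
                    early_stopping=True, validation_fraction=0.1,
                    max_iter=300, random_state=0)
ann.fit(X_train, y_train)

print("test accuracy:", ann.score(X_test, y_test))
# ann.loss_curve_ (training loss per epoch) and ann.validation_scores_
# (validation accuracy per epoch) give the two learning curves to plot
```

Comparing ann.loss_curve_ against ann.validation_scores_ is one direct way to check for overfitting.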
2.2.2 SVMs
Train an SVM (with a kernel of your choice) and perform the same analyses as for
the ANN. You may need to subsample the dataset if SVM training takes too
long. Interpret and discuss your results. Does the model overfit? How do
they compare with ANNs (plot decision boundaries and the number of training
epochs to help you compare)? And why? How does the type of kernel (e.g.
linear, RBF, etc.) impact performance?
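A corresponding SVM baseline is sketched below. The RBF kernel and C value are illustrative starting points rather than recommended settings; swapping in kernel="linear" is one way to begin the kernel comparison asked for above. load_digits again stands in for MNIST.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# load_digits as an offline stand-in for MNIST; pixels scaled to [0, 1]
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X / 16.0, y, test_size=0.3, random_state=0)

# RBF kernel with default regularisation as one reasonable starting point;
# if training is too slow on full MNIST, subsample e.g. X_train[:10000]
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)

print("test accuracy:", svm.score(X_test, y_test))
```

Training the same SVC with different kernels and comparing test accuracy (and fitting time) gives the kernel analysis directly.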
2.3 Bayesian linear regression with PyMC3
In this task you are required to use PyMC3 to perform Bayesian linear
regression on the California housing dataset, which is easily available via the
sklearn.datasets.fetch_california_housing function. The goal with this dataset
is to predict the median house value in a ‘block’ in California. A block is
a small geographical area with a population of between 600 and 3000 people.
Each datapoint in this dataset corresponds to a block. Consult the
scikit-learn documentation for details of the predictor variables.
As always with Bayesian analysis it is up to you to choose your prior
distributions. Be sure to justify your choice of priors in your report. What
do the results produced by PyMC3 tell you about what influences house
value in California? Is it necessary and/or useful to transform the data in
some way before running MCMC?
2.4 Trees and ensembles
Here, you will implement both of the following steps.
2.4.1 CART Decision Trees
This part extends the work from lab 7 to the California housing
regression task. First, run a decision tree regressor for the California housing
dataset, and contrast this with your previous Bayesian linear regression
method. For this you can use the DecisionTreeRegressor class from Scikit-learn.
Then, analyse the effect of the hyperparameters of the decision tree,
such as the maximum depth of the tree or the maximum number of features to
consider at each split. Look at the constructor of the DecisionTreeRegressor
class to see what hyperparameters you can set.
In your report, include the following (you may also wish to add further
analysis of your own):
1. A brief explanation of the CART decision tree method.
2. A comparison of the results with those of Bayesian linear regression.
3. Plot the relationship between a hyperparameter and the performance
of the model.
4. Optimise the hyperparameter on a validation set.
5. Explain the effects you see when setting hyperparameters such as the
maximum tree depth or maximum number of features.
6. Visualise the tree and use the visualisation to describe the most important
features and how the tree makes decisions.
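The depth sweep in points 3-5 can be sketched as below. This uses the bundled load_diabetes regression dataset as an offline stand-in; for the coursework, substitute fetch_california_housing. The chosen depth values are arbitrary illustrations.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# load_diabetes as a bundled offline stand-in; for the coursework use
# sklearn.datasets.fetch_california_housing instead
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Sweep max_depth to expose under- and overfitting (None = fully grown)
for depth in (2, 5, None):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train R2={tree.score(X_train, y_train):.2f}, "
          f"test R2={tree.score(X_test, y_test):.2f}")

# sklearn.tree.plot_tree(tree) draws the fitted tree for point 6
```

The fully grown tree typically fits the training set almost perfectly while its test score drops, which is the overfitting pattern to discuss.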
2.4.2 Boosting
This part adapts the Adaboost method from lab 7 to the MNIST classification
task. First, implement and test Adaboost on the MNIST dataset. Use
a decision tree as your base classifier. You can use either your implementation
from lab 7 or the AdaBoostClassifier class from Scikit-learn. Next,
analyse key hyperparameters: the depth of the decision tree and the number of
estimators (base classifiers) in the ensemble.
In the report, please include the following points:
1. Explain the intuition behind the Adaboost approach (you don’t need
to provide all the steps or all the equations).
2. Describe your results and compare them to those from the ANN and
SVM.
3. Plot the relationship between the maximum depth of the decision tree
and the performance of the ensemble.
4. Explain the effects of other hyperparameters, such as the number of
estimators and learning rate.
5. What do you think is a good choice for the number of estimators on
this dataset?
6. How much does the ensemble improve performance over a single base
classifier?
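The single-tree-versus-ensemble comparison in point 6 can be sketched as below, with load_digits standing in for MNIST. AdaBoostClassifier's default base estimator is a depth-1 tree (a stump), so n_estimators and learning_rate are the hyperparameters left to vary; the values here are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# load_digits as an offline stand-in for MNIST
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Single shallow tree (a stump) as the baseline
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X_train, y_train)

# Boosted ensemble of stumps (the default base estimator);
# n_estimators and learning_rate are illustrative values to tune
ada = AdaBoostClassifier(n_estimators=100, learning_rate=1.0,
                         random_state=0)
ada.fit(X_train, y_train)

print("single stump:", stump.score(X_test, y_test))
print("ensemble:    ", ada.score(X_test, y_test))
```

The gap between the two scores quantifies how much boosting improves over a single base classifier on this data.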
3 Implementation
You are expected to build on the skills you have learned during the labs.
Therefore, you should use the Python libraries used during the labs, namely
Scikit-learn and PyMC3. You can use other libraries, but we won’t be able
to provide support on those.
4 Assessment criteria
Your coursework will be evaluated based on a submitted report, containing
the appropriate discussion and results. The aim of this report is to demonstrate
your understanding of the methods you used and the results that you
have obtained. Note: in the report it is important that you briefly describe
the methods used.
The report should be no more than 10 pages long, using a font size no
smaller than 11 point. Note that your report should prioritise quality over
quantity, so do not feel that you have to use 10 pages if they are not needed.
If you wish to use a LaTeX template, you can use the basic report template or
the Coling 2020 template. Submission: on Blackboard (under Assessment,
Coursework) with a pdf file (as cw_userid.pdf) for the report together
with your code (e.g. the Jupyter Notebooks you have used;
as cw_userid.zip). Note that your code is not going to be used for marking,
only to validate your work.
To gain high marks your report will need to demonstrate clearly a thorough
understanding of the tasks and the methods used, backed up by a clear
explanation (including figures) of your results and analysis. The structure of
the report and what is included in it is your decision, and you should aim to
write it in a professional and objective manner so that it addresses the issues
mentioned above. In particular you need to explain clearly the following
elements:
1. Analyse the MNIST dataset using K-means and PCA (25%)
2. Apply and discuss the results of a classifier on MNIST (ANN and
SVM) (25%)
3. Bayesian linear regression on the California housing dataset (25%)
4. Implement decision trees and boosting (Section 2.4) and contrast them
with the previous methods in terms of performance and interpretability
(25%)
Suggestion for discussion points: after describing each method you use, con-
sider what you would expect the results to look like, e.g., how you think
K-means clusters will relate to the digits data. Then use your results to
validate your hypothesis and look for patterns of errors that each method
makes. Can you explain why each method makes certain types of error?
Deadline: The deadline for submission is 13:00 (1pm) on 16th of August.
Students should submit all required materials to the “Assessment, submission
and feedback” section of Blackboard - it is essential that this is done on the
Blackboard page related to the “With Coursework” variant of the unit.
Working in pairs: You are encouraged to work in pairs or small groups,
but both the report and the code need to be your own pieces of work.
5 Support provided
Lecturers will be available to provide support on an ad-hoc basis. Please
contact us during weekdays 9am-5pm and we will endeavour to reply as soon
as possible. You can send us questions directly, or request a meeting. The
lecturers are available on Teams or email:
• Rui: [email protected]
• James: [email protected]
• Edwin: [email protected]
6 Further clarifications
• Feel free to use the lab materials as a starting point.
• To make the best use of space you should use matplotlib subplots and
use a single plot to make comparisons (e.g. training and validation
learning curves).
• You should use Python with Jupyter Notebook and the libraries that
we used during the labs (e.g. Scikit-learn and PyMC3).
• Academic offences: Academic offences (including submission of work
that is not your own, falsification of data/evidence, or the use of materials
without appropriate referencing) are all taken very seriously by
the University. Suspected offences will be dealt with in accordance
with the University’s policies and procedures. If an academic offence is
suspected in your work, you will be asked to attend an interview with
senior members of the school, where you will be given the opportunity
to defend your work. The plagiarism panel are able to apply a range
of penalties, depending on the severity of the offence. These include: a
requirement to resubmit work, capping of grades, and the award of no
mark for an element of assessment.
• Extenuating circumstances: If the completion of your assignment
has been significantly disrupted by serious health conditions, personal
problems, periods of quarantine, or other similar issues, you may be
able to apply for consideration of extenuating circumstances (in accordance
with the normal university policy and processes). Students
should apply for consideration of extenuating circumstances as soon
as possible when the problem occurs, using the following online form.
You should note, however, that extensions are not possible for optional
unit assignments. If your application for extenuating circumstances
is successful, it is most likely that you will be required to retake the
assessment of the unit at the next available opportunity.
7 Marking guidelines
7.1 Outstanding project (80+)
+ mastery of advanced methods in all aspects;
+ truly impressive outcome, novelty, with strong research elements – close
to publication quality;
+ synthesis in an original way using ideas from the unit but also from the
literature;
+ outstanding presentation of work, with very clear description of the
methods and results;
+ excellent use of plots to support the interpretations;
+ evidence of outstanding unique and individual contributions.
7.2 First class project (70+)
+ excellent outcome in all aspects;
+ evidence of excellent use and deep understanding of a wide range of
techniques;
+ study, originality and synthesis clearly beyond the minimum requirements
set out in the coursework description;
+ excellent presentation of work, with very clear description of the methods
and results;
+ very good use of plots to support the interpretations;
+ evidence of excellent contributions or insights into the methods tested.
7.3 Merit project (60+)
+ very good outcome with complete solutions for all the required aspects
of the assignment;
+ evidence of very good use and strong understanding of a range of techniques;
+ study, comprehension and synthesis fully meet or exceed the requirements
set out in the coursework description;
+ very good presentation of work, with clear description of the methods
and results;
+ good use of plots to support the interpretations;
+ evidence of critical analysis and judgement of the methods tested.
7.4 Good project (50+)
+ good outcome but some parts of the assignment not fully completed;
+ evidence of good use and understanding of standard techniques;
+ some grasp of issues and concepts underlying the techniques;
+ adequate presentation of work, including a description of the methods
and results;
+ some good use of plots to support the interpretations but with some
notable shortcomings;
+ evidence of understanding and appropriate use of techniques.
7.5 Passing project (40+)
+ limited outcome with basic, partial solutions to all 4 main topics;
+ limited understanding as demonstrated through discussion and plots;
+ poor presentation of results.