Practical 1: Drug Consumption
CS5014 Machine Learning
40% of the coursework grade
Aims
The aim of this practical is to gain experience in applying machine learning methodology to
a real dataset. The focus is on good understanding and justification of steps. A successful
submission will demonstrate the understanding of:
• how to load, clean, and process a dataset;
• how to train a standard algorithm;
• how to report and interpret the results; and
• how to write clear, concise, and re-usable research code.
Dataset
The dataset for this practical is provided on studres in the directory named data. It contains a
file named drug_consumption.data in the CSV format. The file contains anonymised data
pertaining to drug consumption of 1885 individuals. You can read the dataset description to
familiarise yourself with the dataset.
In this practical, we will predict nicotine consumption from 12 personality attributes (columns
2-13). You will notice that these input attributes are numerical in value, but that the target
variable (nicotine, column 30) is categorical, with 7 different classes, from CL0 (Never Used) to CL6
(Used in Last Day). Please take time to understand what each column represents.
Task
You will create a machine learning model that can predict nicotine consumption (one of the 7
available classes) from the 12 personality attributes in the described dataset. You will create
two deliverables: the source code for your solution, and a brief report which answers specific
questions about your solution. You should follow the steps outlined in this spec and use the
questions to guide your progress. Both code and the report are important: you must specifically
answer each question listed below and will be evaluated on how well your answers
demonstrate understanding of the topics covered in lectures.
In this practical, you will be marked based on your use of logistic regression. There are
no extra points for trying more advanced algorithms or tasks. If you experiment with other
classifiers or other tasks (e.g. predicting consumption of other drugs), please separate this work
into new Python files and clearly identify them to ease marking.
Part 1: Data Processing
Start by loading the dataset using Pandas. You may want to drop or clean some of the values,
change the encoding, or apply scaling. You will also need to separate the dataset into a training
and testing set. In your report you should clearly answer the following questions about your
data processing:
(a) How did you load and clean the data, and why?
(b) How did you split the data into training and test sets, and why?
(c) How did you process the data including encoding, conversion, and scaling, and why?
(d) How did you ensure that there is no data leakage?
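The idea behind questions (b) to (d) can be illustrated with a minimal sketch. It is written in plain Python so that the mechanics are visible; it is not the required solution, and in practice the equivalent Pandas and scikit-learn utilities would normally be used. The function names (stratified_split, fit_scaler, apply_scaler) and the toy data are hypothetical. The key point is that the scaling statistics are computed from the training rows only and then reused, unchanged, on the test rows.

```python
import random
import statistics
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with similar class proportions in both."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def fit_scaler(rows):
    """Learn the per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]  # guard constant columns
    return means, stds


def apply_scaler(rows, means, stds):
    """Standardise rows using statistics learned elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy data (made up): two features, labels in the dataset's CL0..CL6 style.
X = [[0.1, 10.0], [0.4, 12.0], [0.3, 11.0], [0.9, 30.0], [0.8, 28.0],
     [0.7, 31.0], [0.2, 9.0], [0.6, 29.0], [0.5, 13.0], [1.0, 27.0]]
y_raw = ["CL0", "CL0", "CL0", "CL6", "CL6", "CL6", "CL0", "CL6", "CL0", "CL6"]
y = [int(label[2:]) for label in y_raw]  # "CL6" -> 6

train_idx, test_idx = stratified_split(y, test_fraction=0.2, seed=0)
X_train = [X[i] for i in train_idx]
X_test = [X[i] for i in test_idx]

# Fit on the training split only, then apply the same statistics to both splits.
means, stds = fit_scaler(X_train)
X_train_scaled = apply_scaler(X_train, means, stds)
X_test_scaled = apply_scaler(X_test, means, stds)
```

Computing the mean and standard deviation over the whole dataset before splitting would let information from the test rows influence the training inputs, which is one common form of data leakage.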
Part 2: Training
After loading the data, you should train a logistic regression classifier to predict the output
from inputs. You should use scikit-learn's LogisticRegression class for this. Make sure you are
familiar with all the parameters offered by this implementation, and what they mean. In this
part, make sure to set the penalty parameter to 'none' to get the basic, unregularised version
of logistic regression. In your report you should clearly explain how you performed each
of the following tasks during the training process:
(a) Train using penalty='none' and class_weight=None. What is the best and worst
classification accuracy you can expect and why?
(b) Explain each of the following parameters used by LogisticRegression in your own words:
penalty, tol, max_iter.
(c) Train using balanced class weights (setting class_weight='balanced'). What does
this do and why is it useful?
(d) LogisticRegression provides three functions to obtain classification results. Explain the
relationship between the vectors returned by predict(), decision_function(), and
predict_proba().
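Two of the questions above lend themselves to a small numeric sketch. The first function reproduces the weighting heuristic that the scikit-learn documentation gives for class_weight='balanced', namely n_samples / (n_classes * count of each class). The second assumes the multinomial (softmax) formulation of logistic regression, in which the probabilities are the softmax of the per-class decision scores and the predicted class is the one with the largest score; other multi-class schemes normalise differently. The helper names and the numbers are made up for illustration.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c): rarer classes weigh more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in sorted(counts.items())}


def softmax(scores):
    """Turn one row of class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up labels: class 0 is three times as common as class 1.
y_train = [0, 0, 0, 0, 0, 0, 1, 1]
weights = balanced_class_weights(y_train)

# Made-up decision scores for one sample over three classes.
scores = [2.0, 0.5, -1.0]
probabilities = softmax(scores)
predicted = max(range(len(scores)), key=lambda k: probabilities[k])
```

Because softmax preserves the ordering of the scores, the class with the highest score is also the class with the highest probability, so both views lead to the same prediction.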
Part 3: Evaluation
After successfully training your model on the training data, you should evaluate your model
on the testing data. It is fine to use built-in sklearn functionality like accuracy_score, but
you will have to understand what such functions do. In your report, you must clearly explain
the following:
(a) What is the classification accuracy of your model and how is it calculated? Give the
formula.
(b) What is the balanced accuracy of your model and how is it calculated? Give the formula.
(c) Show the confusion matrix for your classifier for both unbalanced (2a) and balanced (2c)
cases. Discuss any differences.
(d) Show the precision and recall of your algorithm for each class, as well as the micro and
macro averages. Explain the difference between the micro and macro averages.
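To make the quantities in this part concrete, the following sketch computes them by hand from a confusion matrix. It is an illustration with made-up labels, not a replacement for the sklearn functions, and the helper names (confusion_matrix, summarise) are hypothetical. It assumes integer class labels 0..n_classes-1 and that every class appears at least once among the true labels, so balanced accuracy is simply the mean of the per-class recalls.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i][j] counts samples of true class i that were predicted as class j."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def safe_div(a, b):
    """Division that returns 0.0 when the denominator is zero."""
    return a / b if b else 0.0


def summarise(cm):
    n = len(cm)
    total = sum(map(sum, cm))
    tp = [cm[k][k] for k in range(n)]                         # diagonal
    actual = [sum(cm[k]) for k in range(n)]                   # row sums: TP + FN
    predicted = [sum(row[k] for row in cm) for k in range(n)] # column sums: TP + FP
    precision = [safe_div(tp[k], predicted[k]) for k in range(n)]
    recall = [safe_div(tp[k], actual[k]) for k in range(n)]
    return {
        "accuracy": sum(tp) / total,            # correct / all samples
        "balanced_accuracy": sum(recall) / n,   # mean per-class recall
        "precision": precision,
        "recall": recall,
        "macro_precision": sum(precision) / n,  # average the per-class scores
        "macro_recall": sum(recall) / n,
        "micro_precision": safe_div(sum(tp), sum(predicted)),  # pool the counts
        "micro_recall": safe_div(sum(tp), sum(actual)),
    }


# Made-up labels for three classes.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
report = summarise(cm)
```

Macro averaging gives every class the same weight regardless of its size, whereas micro averaging pools the counts first, so in single-label multi-class problems the micro-averaged precision and recall both equal the plain accuracy.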
Part 4: Advanced Tasks
Once you have successfully completed Parts 1-3, you can try some advanced tasks listed below.
These are required for a grade of 17 and higher, but they cannot make up for poor performance in
previous tasks.
(a) Set the penalty parameter in LogisticRegression to 'l2'. Give the equation of the
cost function that LogisticRegression uses as a result. Derive the gradient of this
l2-regularised cost.
(b) Implement a 2nd degree polynomial expansion on the dataset. Explain how many
dimensions this produces and why.
(c) Compare the results of regularised and unregularised classifiers on the expanded data
and explain any differences.
(d) Extend your solution to provide Precision-Recall plots for each class of nicotine
consumption. You may need to independently explore advanced scikit-learn functionality
for this.
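For task (b), the number of output dimensions can be checked by enumerating the monomials directly. The sketch below is a plain-Python illustration with hypothetical helper names (polynomial_terms, expand); it assumes the expansion contains every product of input features up to the given degree, optionally including a constant bias term. Whether a library implementation includes the bias column depends on its settings, so the count should be checked against the tool actually used.

```python
from itertools import combinations_with_replacement
from math import comb


def polynomial_terms(n_features, degree, include_bias=True):
    """List each monomial as a tuple of feature indices, e.g. (0, 3) is x0*x3."""
    start = 0 if include_bias else 1
    return [
        term
        for d in range(start, degree + 1)
        for term in combinations_with_replacement(range(n_features), d)
    ]


def expand(row, terms):
    """Evaluate every monomial for one sample; the empty tuple gives the bias 1.0."""
    out = []
    for term in terms:
        value = 1.0
        for i in term:
            value *= row[i]
        out.append(value)
    return out


# The 12 personality attributes, expanded to degree 2.
n_with_bias = len(polynomial_terms(n_features=12, degree=2))
n_without_bias = len(polynomial_terms(12, 2, include_bias=False))

# Closed form: the number of monomials of degree <= d in n variables is C(n + d, d).
closed_form = comb(12 + 2, 2)

# Small worked example with two features: 1, x0, x1, x0^2, x0*x1, x1^2.
example = expand([2.0, 3.0], polynomial_terms(2, 2))
```

The count splits into 1 bias term, 12 linear terms, 12 squared terms, and 66 pairwise products.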
Code Quality
Your code will evolve as you tackle individual parts of this practical. At the end, you will have
code that produced all your results. This is research code so you should focus on the code
quality aspects that support research. Your code should be:
(a) correct,
(b) clean and understandable,
(c) concise and elegant, and
(d) repeatable and easy to modify.
You will not need to write a lot of code and should avoid overcomplicating. Focus on how
easy it is for someone else to take your code, understand it, reproduce your results, and make
modifications to support further experiments. Our marking will be based on how well your
code meets these criteria. We encourage you to keep these factors in mind from the beginning,
but it is also OK to focus on correctness first and clean up the code later.
Submission
Hand in via MMS, by the deadline of 9pm on Friday of Week 6:
• The source code of your application which works in the Python3 virtual environment
set up in the school labs. This must be in the form of human-readable .py files, not the
binary .ipynb notebook format!
• A brief report in the PDF format. The report must contain sections which correspond to
the four parts described in this specification and it must address each of the questions
associated with each part.
Create a single .zip file containing all of these and submit this to MMS. Do not include the
dataset, your python virtual environment, or git repository.
Marking
This practical will be marked according to the guidelines at
https://info.cs.st-andrews.ac.uk/student-handbook/learning-teaching/feedback.html
It will be based on the quality of your answers to the questions and the quality of your code.
The report is the most important part of the submission – your answers should be brief, but
they have to demonstrate understanding of the underlying algorithms. The code will be
evaluated based on the criteria listed under “Code Quality”.
Some examples of submissions in various bands are:
• A basic implementation in the 11–13 grade band will complete Parts 1-3, but with
significant weaknesses. Examples include a messy implementation, unexplained differences
between the code and the description in the report, or incorrect or incomplete answers to
questions in Parts 1-3.
• An implementation in the 14–16 range should complete all parts of the basic
specification comprising Parts 1-3, including answers to all associated questions. The code
should be of good quality, and the answers should be mostly correct and insightful, and
demonstrate understanding of lecture materials.
• An implementation in the 17–18 range must include a high-quality solution to Parts
1-3, and some work on Part 4. Excellent answers to all attempted questions are strictly
required for this grade band.
• A grade of 19 and higher requires an excellent solution to all four parts with
exceptionally clear code and insightful answers to questions which evidence deep understanding
and independent study.
Note that the goal is solid machine learning methodology and understanding rather than a collection
of extensions – a good scientific approach and analysis are difficult, whereas running many
different scikit-learn algorithms on the same data is easy. Also note that we will not focus
on software engineering practice and advanced Python techniques when marking, but your
code should be sensibly organised, commented, and easy to follow, as described above.