SEHS4696 Machine Learning for Data Mining
Individual Assignment (30%)
Objective
This assignment aims to use different supervised machine learning algorithms to
develop predictive models and evaluate and compare their performance.
The dataset
A reduced version of an epidemiology dataset of positive and negative COVID-19
cases from Mexico is used in this assignment. The dataset was originally collected by
the General Directorate of Epidemiology, Secretariat of Health in Mexico. The dataset
contains the lab Reverse Transcription Polymerase Chain Reaction (RT-PCR) testing
results for COVID-19 cases in Mexico. The dataset consists of 263,007 records and
originally had 41 features. For the purpose of this assignment, the number of
features is reduced to 11: 10 input features plus 1 binary outcome, the “Result”
of the RT-PCR test (i.e., the label), as follows:
Feature        Description                  Data type
Age            ≥ 0                          integer
Sex            0 = female, 1 = male         integer
Pneumonia      0 = negative, 1 = positive   integer
Diabetes       0 = negative, 1 = positive   integer
Asthma         0 = negative, 1 = positive   integer
Hypertension   0 = negative, 1 = positive   integer
CVDs           0 = negative, 1 = positive   integer
Obesity        0 = negative, 1 = positive   integer
CKDs           0 = negative, 1 = positive   integer
Tobacco        0 = negative, 1 = positive   integer
Result         0 = negative, 1 = positive   integer
Its first 5 records are as follows:
Age Sex Pneumonia Diabetes Ashma Hypertension CVDs Obesity CKDs Tabacco Result
74 0 0 1 0 1 0 1 0 0 0
71 1 0 1 0 1 0 1 0 1 0
50 0 1 0 0 0 0 0 0 0 1
25 1 0 0 0 0 0 1 0 0 1
28 1 0 0 0 0 0 0 0 0 0
This reduced dataset “mexico_covid19.csv” is available from Moodle.
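As a minimal sketch of getting the data into a notebook, the snippet below parses the five sample records shown above and separates the 10 input features from the label. It assumes pandas is available, that the real file is comma-separated, and that its header uses the same spellings as the sample records; the helper name split_features_label is hypothetical.

```python
import io

import pandas as pd

# The five sample records from the brief, written as comma-separated text.
# Assumption: the real file uses this header spelling and a comma delimiter.
SAMPLE_CSV = """Age,Sex,Pneumonia,Diabetes,Ashma,Hypertension,CVDs,Obesity,CKDs,Tabacco,Result
74,0,0,1,0,1,0,1,0,0,0
71,1,0,1,0,1,0,1,0,1,0
50,0,1,0,0,0,0,0,0,0,1
25,1,0,0,0,0,0,1,0,0,1
28,1,0,0,0,0,0,0,0,0,0
"""


def split_features_label(df, label="Result"):
    """Return (X, y): every column except `label` as inputs, `label` as target."""
    if label not in df.columns:
        raise KeyError(f"label column {label!r} not found in {list(df.columns)}")
    X = df.drop(columns=[label])
    y = df[label].astype(int)
    return X, y


# For the full dataset this would be pd.read_csv("mexico_covid19.csv").
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
X, y = split_features_label(df)
print(X.shape, y.tolist())
```

Selecting the inputs as "everything except the label" avoids hard-coding the individual column names, which may be spelled differently in the actual file.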
Things to do
1. Develop models for predicting COVID-19 infection in Python using the above
labelled dataset “mexico_covid19.csv” with the following supervised machine
learning algorithms:
(i) logistic regression
(ii) support vector machines
(iii) k-nearest neighbors (kNN)
(iv) naïve Bayes
(v) decision trees
(vi) random forests
(vii) artificial neural networks
(60 marks)
2. Evaluate and compare the performance of these algorithms on this specific
prediction problem using metrics such as accuracy, sensitivity (also called
recall), specificity, precision and F1-score.
(40 marks)
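One possible way to set up the seven algorithms from item 1 is sketched below, assuming scikit-learn. The specific classes (LinearSVC, GaussianNB, MLPClassifier), the hyperparameters, the helper names build_models and fit_and_predict, and the synthetic stand-in data from make_classification are all assumptions made for illustration; none of them is prescribed by this brief.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier


def scaled(estimator):
    """Standardize features before a scale-sensitive estimator."""
    return make_pipeline(StandardScaler(), estimator)


def build_models(random_state=0):
    """Return one untrained model per algorithm in item 1 (illustrative settings)."""
    return {
        "logistic regression": scaled(LogisticRegression(max_iter=1000)),
        # Linear SVM; SVC(kernel="rbf") is the kernel alternative.
        "support vector machine": scaled(LinearSVC()),
        "k-nearest neighbors": scaled(KNeighborsClassifier(n_neighbors=5)),
        "naive Bayes": GaussianNB(),
        "decision tree": DecisionTreeClassifier(random_state=random_state),
        "random forest": RandomForestClassifier(
            n_estimators=100, random_state=random_state
        ),
        "neural network": scaled(
            MLPClassifier(
                hidden_layer_sizes=(16,), max_iter=500, random_state=random_state
            )
        ),
    }


def fit_and_predict(models, X_train, y_train, X_test):
    """Fit every model on the training split and predict the test split."""
    predictions = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions[name] = model.predict(X_test)
    return predictions


# Synthetic stand-in so the sketch runs without the CSV; swap in the real X, y.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
models = build_models()
predictions = fit_and_predict(models, X_train, y_train, X_test)
```

A linear SVM is used here because kernel SVM training time grows at least quadratically with the number of samples, which can be impractical on 263,007 records without subsampling. Holding out a stratified test split keeps the class balance comparable between the training and evaluation data.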
You are required to use Jupyter Notebook or Google Colab to do this assignment.
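The metrics named in item 2 all follow from the four confusion-matrix counts, which the pure-Python sketch below makes explicit. The function names and the example predictions are hypothetical; the true labels are the Result values of the five sample records, and label 1 is treated as the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Count (tp, fp, tn, fn) for binary labels where 1 = positive."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a metric's denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true, y_pred):
    """Compute the evaluation metrics from the confusion-matrix counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)  # identical to recall
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": safe_div(tn, tn + fp),
        "precision": precision,
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
    }


y_true = [0, 0, 1, 1, 0]  # Result column of the five sample records
y_pred = [0, 1, 1, 0, 0]  # hypothetical predictions, for illustration only
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In practice scikit-learn's accuracy_score, precision_score, recall_score, f1_score and confusion_matrix give the same quantities. It has no dedicated specificity function, but specificity equals the recall of the negative class, e.g. recall_score with pos_label=0.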
Things to submit
1. A report containing (1) and (2) above
2. Your notebook (.ipynb file)