COMP4318/5318 Assignment 1 Overview
This assignment involves two core parts. Part 1 focuses on a classification task: identifying cases of heart disease in the Cardiovascular Disease Dataset. Part 2 focuses on a regression problem on the Concrete Dataset.
Your tasks involve applying pre-processing techniques and writing functions to implement and evaluate models for each part. After providing these functions, you will perform grid search procedures to find the best hyperparameter combinations for some of these models.
Each part has been broken down into 4 sections as follows:
1. Pre-processing techniques
2. Classification/Regression methods
3. Ensemble methods
4. Hyperparameter tuning
Academic integrity
While the University is aware that the vast majority of students and staff act ethically and honestly, it is opposed to and will not tolerate academic integrity breaches and will treat all allegations seriously.
Part 1: Classification
In Part 1 of this assignment, you will explore different classification methods on a modified version of a real dataset, the Cardiovascular Disease Dataset. The dataset has a balanced distribution of the target class, Cardiovascular disease. Its features are as follows:
Age - positive int (days)
Height - positive int (cm)
Weight - positive float (kg)
Systolic blood pressure - positive int
Diastolic blood pressure - positive int
Gender - categorical [F, M]
Cholesterol - categorical [normal, above normal, well above normal]
Glucose - categorical [normal, above normal, well above normal]
Smoking - categorical [No, Yes]
Alcohol intake - categorical [No, Yes]
Physical activity - categorical [No, Yes]
Cardiovascular disease - categorical [No, Yes]
Your tasks for this part involve applying pre-processing techniques and writing classification functions that can be applied to this dataset using stratified 10-fold cross-validation.
After providing these functions, your next task is to design and test two functions over a range of different hyperparameters, evaluating their performance with stratified 10-fold cross-validation for bagging and with a separate validation set for AdaBoost. You should use the provided cvKFold object in all functions that perform cross-validation (a sketch of how such an object can be constructed appears below).
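For concreteness, a stratified 10-fold splitter of the kind the scaffold provides can be constructed with scikit-learn as in the minimal sketch below; the shuffle and random_state arguments shown here are assumptions, and the scaffold's own definition takes precedence.

    from sklearn.model_selection import StratifiedKFold

    # Stratified 10-fold splitter; the shuffle/random_state values here are
    # assumptions; use whatever the scaffold specifies
    cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)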
Although it is not always necessary to wrap your code in functions when using Jupyter Notebooks, doing so allows us to test your implementations. Wherever relevant, pass a random_state argument as instructed below to control for randomness between runs and ensure your results are reproducible. Further instructions can be found in the scaffold notebook.
Random state clarification:
There are multiple possible ways to ensure randomness is controlled between runs; however, please note there are some differences between the instructions for Part 1 and Part 2.
In Part 1, whenever there is a need to control for random events in a model, a random_state=0
argument should be passed to the constructor of the model inside your function, and it will not be
passed again during testing.
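As an illustration of this pattern (the helper name and model choice below are hypothetical, purely for demonstration):

    from sklearn.ensemble import AdaBoostClassifier

    def make_adaboost(**kwargs):  # hypothetical helper, for illustration only
        # random_state is hard-coded inside the function; it will not be
        # passed in again when the function is tested
        return AdaBoostClassifier(random_state=0, **kwargs)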
Part 1 extra instructions
If you fail a hidden test or running your code results in an error, ensure you have handled other possible datasets within the specified parameters and that your functions can handle being called with different combinations of hyperparameters. You do not need to pass all hidden tests before moving on to the next section.
Data:
The filename to be loaded is 'cardio_diseases.csv'. You can find it in the file browser.
The first function load_data should handle setting any invalid values to np.nan and converting categorical strings to numbers as outlined in the scaffold (a rough sketch appears after this list). The second function process_data should process these missing values as described in the scaffold. Note the instructions in the scaffold regarding testing your code on other datasets.
You can assume any valid categorical attributes in other datasets will take on values [Yes, No], [Male, Female], and [Normal, Above normal, Well above normal] like the given dataset. However, they may occur in different columns, have different names, or not be present at all.
Ensure your output array has a dtype of float64
Set the variables currently set to None by calling your functions. The values of these variables will be checked along with your functions to ensure you can use them in the rest of the notebook.
All non-negative values (including 0) should be considered as valid numerical data
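As a rough illustration of these load_data requirements only (the mapping values, column handling, and overall structure below are assumptions; the scaffold defines the actual encoding and signatures):

    import numpy as np
    import pandas as pd

    # Illustrative string-to-number mapping; the scaffold's encoding takes precedence
    VALUE_MAP = {
        'No': 0, 'Yes': 1,
        'F': 0, 'M': 1, 'Female': 0, 'Male': 1,
        'Normal': 0, 'Above normal': 1, 'Well above normal': 2,
    }

    def load_data(filename):
        df = pd.read_csv(filename)
        # Map known categorical strings to numbers
        df = df.replace(VALUE_MAP)
        # Any remaining non-numeric entries become np.nan
        df = df.apply(pd.to_numeric, errors='coerce')
        arr = df.to_numpy(dtype=np.float64)
        # Negative numbers are invalid; non-negative values (including 0) are valid
        arr[arr < 0] = np.nan
        return arr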
Evaluating methods with cross validation:
For the cross validation functions, use the skeletons provided, so that algorithm hyperparameters such as the number of neighbors and the power of the Minkowski distance can optionally be passed in as arguments and accessed as a dictionary using the **kwargs syntax. Note that you can also unpack a dictionary of arguments into a function call (such as an sklearn constructor) using the **kwargs syntax; you may need to search for documentation on **kwargs if unfamiliar. A sketch appears after this list.
Your functions should create an instance of the model with the correct hyperparameters set and run 10-fold cross-validation (using the fold object set up above and your preprocessed data X and y) to evaluate it. Each function should return the model instance with the correct hyperparameters and the average cross-validation accuracy obtained (please do not round this return value).
Once defined, you can treat cvKFold as a global variable and use it in all cells below.
Where the scaffold asks you to test your functions, please do so in the same cell and produce output as below. Follow the specified formats for the outputs, rounded to 2 decimal places.
For KNN, the expected output format is the same as for the other methods:
Cross validation score: x.xx
You can assume there will not be any cases where attributes shared by both bagging and
logistic regression are passed.
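To make the **kwargs pattern and the expected return values concrete, here is a minimal sketch of one such evaluation function (the function name and the example hyperparameter values are hypothetical; the scaffold fixes the real signatures):

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    def kNNClassifier(X, y, **kwargs):  # hypothetical name, for illustration
        # Forward any hyperparameters (e.g. n_neighbors, p) to the constructor
        model = KNeighborsClassifier(**kwargs)
        # cvKFold is the global stratified 10-fold object defined earlier
        scores = cross_val_score(model, X, y, cv=cvKFold)
        # Return the model and the unrounded mean accuracy
        return model, scores.mean()

    model, acc = kNNClassifier(X, y, n_neighbors=5, p=2)
    print(f"Cross validation score: {acc:.2f}")  # rounded for display only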
Grid search:
The train_val_test_split function should stratify the training/validation/test sets.
You can ignore this comment in the scaffold:
    # TODO: uncomment this code to create the initial train test split
    X_train_all, X_test, y_train_all, y_test = train_test_split(X, y, random_state=0)
Please read the scaffold instructions carefully for the AdaBoost grid search. You are not performing a cross-validation grid search, but a grid search using a separate validation set; we covered this procedure briefly in the lectures. At each step you calculate a validation score rather than performing cross-validation.
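For orientation only, the sketch below shows one way such a validation-set grid search can be structured; the validation fraction and the parameter grid are assumptions, and the scaffold's instructions take precedence.

    from itertools import product
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split

    # Stratified split of the remaining training data into train and validation
    # (the 0.25 validation fraction is an assumption)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_all, y_train_all, test_size=0.25,
        stratify=y_train_all, random_state=0)

    # Illustrative grid; the scaffold specifies the actual values to search
    grid = {'n_estimators': [10, 50, 100], 'learning_rate': [0.1, 1.0]}

    best_score, best_params = -1.0, None
    for n_est, lr in product(grid['n_estimators'], grid['learning_rate']):
        model = AdaBoostClassifier(n_estimators=n_est, learning_rate=lr,
                                   random_state=0)
        model.fit(X_train, y_train)
        # Score once on the held-out validation set instead of cross-validating
        score = model.score(X_val, y_val)
        if score > best_score:
            best_score = score
            best_params = {'n_estimators': n_est, 'learning_rate': lr}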