XJTLU Entrepreneur College (Taicang) Cover Sheet
Module code and Title: DTS304TC Machine Learning
School Title: School of AI and Advanced Computing
Assignment Title: Assessment Task 1
Submission Deadline: 23:59, 24th March (Sunday), 2024 (China Time, GMT+8)
Final Word Count: N/A
If you agree to let the university use your work anonymously for teaching and learning purposes, please type "yes" here:
I certify that I have read and understood the University's Policy for dealing with Plagiarism, Collusion and the Fabrication of Data (available on Learning Mall Online). With reference to this policy I certify that:
⚫ My work does not contain any instances of plagiarism and/or collusion.
⚫ My work does not contain any fabricated data.
By uploading my assignment onto Learning Mall Online, I formally declare that all of the above information is true to the best of my knowledge and belief.
An external test set without ground-truth labels has also been provided. Your classifier's performance will be evaluated on this set, underscoring the importance of building a model with strong generalization capability.
The competencies you develop during this practical project are not only essential for successfully
completing this assessment but are also highly valuable for your future pursuits in the field of data science.
Throughout this project, you are encouraged to utilize code that was covered during our Lab sessions, as
well as other online resources for assistance. Please ensure that you provide proper citations and links to
any external resources you employ in your work. However, the use of Generative AI for content
generation (such as ChatGPT) is not permitted for any assessed coursework in this module.
Submission deadline: TBD
Percentage in final mark: 50%
Learning outcomes assessed:
A. Demonstrate a solid understanding of the theoretical issues related to problems that machine learning
algorithms try to address.
B. Demonstrate understanding of properties of existing ML algorithms and new ones.
C. Apply ML algorithms for specific problems.
Individual/Group: Individual
Length: The assessment comprises 4 questions worth a total of 100 marks. The submitted file must be in PDF format.
Late policy: 5% of the total marks available for the assessment shall be deducted from the assessment mark for each working day after the submission date, up to a maximum of five working days.
Risks:
• Please read the coursework instructions and requirements carefully. Not following these instructions
and requirements may result in loss of marks.
• The formal procedure for submitting coursework at XJTLU is strictly followed. Submission link on
Learning Mall will be provided in due course. The submission timestamp on Learning Mall will be
used to check late submission.
Question 1: Coding Exercise - Heart Disease Classification with Machine Learning (50 Marks)
In this coding assessment, you are presented with the challenge of analyzing a dataset that contains
patient demographics and health indicators to predict heart disease classifications. This entails solving a
multi-class classification problem with five distinct categories, incorporating both categorical and
numerical attributes.
Your initial task is to demonstrate proficiency in encoding categorical features and imputing missing
values to prepare the dataset for training a basic classifier. Beyond these foundational techniques, you are
invited to showcase your advanced skills. This may include hyperparameter tuning using sophisticated
algorithms like the Asynchronous Successive Halving Algorithm (ASHA). You are also encouraged to
implement strategies for outlier detection and handling, model ensembling, and addressing class
imbalance to enhance your model's performance.
Project Steps:
a) Feature Preprocessing (8 Marks)
⚫ You are required to demonstrate four key preprocessing steps: loading the dataset, encoding
categorical features, handling missing values, and dividing the dataset into training, validation,
and test sets.
⚫ It is crucial to consistently apply the same feature preprocessing steps—including encoding
categorical features, handling missing values, and any other additional preprocessing or custom
modifications you implement—across the training, validation, internal testing, and the externally
provided testing datasets. For efficient processing, consider using the
sklearn.pipeline and sklearn.preprocessing modules.
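The four preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the required solution: the column names ("age", "sex", "chest_pain", "target") and the synthetic stand-in data are assumptions, so substitute the actual columns of the heart-disease CSV.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1) Load the dataset (here: a tiny synthetic stand-in for the real CSV).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(30, 80, 200).astype(float),
    "sex": rng.choice(["M", "F"], 200),
    "chest_pain": rng.choice(["typical", "atypical", "none"], 200),
    "target": rng.integers(0, 5, 200),                  # five classes
})
df.loc[df.sample(frac=0.1, random_state=0).index, "age"] = np.nan  # inject missing values

X, y = df.drop(columns="target"), df["target"]

# 2) + 3) Encode categoricals and impute missing values inside one transformer,
# so the identical steps can be reused on every split and on the external set.
numeric = ["age"]
categorical = ["sex", "chest_pain"]
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

# 4) Split into train / validation / test (60 / 20 / 20).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Fit on the training split only; transform every split with the same fitted object.
Xt_train = preprocess.fit_transform(X_train)
Xt_val = preprocess.transform(X_val)
Xt_test = preprocess.transform(X_test)
```

Fitting the transformer on the training split and merely transforming the other splits is what keeps the preprocessing consistent across all four datasets, as the bullet above requires.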
b) Training Classifiers (10 Marks)
⚫ Train a logistic regression classifier with parameters tuned using grid search, and a random
forest classifier with parameters tuned using the Asynchronous Successive Halving Algorithm
(ASHA) via the ray[tune] library. Optimize the composite score defined below during the
hyperparameter tuning process.
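For the grid-search half of this bullet, a hedged sketch follows. The parameter grid, toy dataset, and helper name `composite` are illustrative assumptions; only `GridSearchCV`, `make_scorer`, and the metric functions are the actual scikit-learn API.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

def composite(gt, pred):
    # The assignment's target metric: mean of macro-F1 and accuracy.
    return 0.5 * (f1_score(gt, pred, average="macro") + accuracy_score(gt, pred))

# Toy 5-class stand-in for the preprocessed training data.
X_train, y_train = make_classification(n_samples=300, n_classes=5,
                                       n_informative=8, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},   # illustrative grid only
    scoring=make_scorer(composite),              # tune against the composite score
    cv=3,
)
grid.fit(X_train, y_train)
best_params, best_score = grid.best_params_, grid.best_score_
```

Wrapping the metric with `make_scorer` is what lets the search rank hyperparameters by the composite score rather than plain accuracy.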
⚫ You should aim to optimize a composite score, which is the average of the classification accuracy
and the macro-averaged F1 score. This objective encourages a balance between achieving high
accuracy overall and ensuring that the classifier performs well across all classes in a balanced
manner, which is especially important in multi-class classification scenarios where class
imbalance might be a concern.
To clarify, your optimization goal is to maximize a composite score defined as follows:
composite_score = 0.5 * (f1_score(gt, pred, average='macro') + accuracy_score(gt, pred))
Here, f1_score and accuracy_score are the scikit-learn functions of the same names, with
f1_score computed using the 'macro' average so that all classes are weighted equally.
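As a quick sanity check, the metric can be computed by hand on a toy prediction (the labels below are made up for illustration):

```python
from sklearn.metrics import accuracy_score, f1_score

gt   = [0, 1, 2, 2, 1, 0]
pred = [0, 1, 2, 1, 1, 0]

# Accuracy is 5/6; macro-F1 averages per-class F1 of 1.0, 0.8, and 2/3.
score = 0.5 * (f1_score(gt, pred, average="macro") + accuracy_score(gt, pred))
print(round(score, 4))  # → 0.8278
```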
⚫ Ensure that you perform model adjustments, including hyperparameter tuning, on the validation
set rather than the testing set to promote the best generalization of your model.
⚫ We have included an illustrative example of how to implement ASHA with the ray[tune]
library. Please refer to the notebook DTS304TC_ASHA_with_Ray_Tune_Example.ipynb located
in our project data folder for details.
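The notebook shows the real ray[tune] usage; purely to illustrate the principle ASHA builds on, here is a plain-Python sketch of (synchronous) successive halving, which ASHA extends by promoting trials asynchronously. The objective function and config fields are made up.

```python
import random

random.seed(0)

def toy_score(config, budget):
    # Stand-in for "train this config with the given budget and return a
    # validation score": estimates get less noisy as the budget grows.
    return config["quality"] + random.gauss(0, 1.0 / budget)

# Start many random configurations on a small budget...
configs = [{"quality": random.random()} for _ in range(16)]
budget = 1
while len(configs) > 1:
    scored = sorted(configs, key=lambda c: toy_score(c, budget), reverse=True)
    configs = scored[: len(scored) // 2]   # ...keep only the top half...
    budget *= 2                            # ...and promote survivors to a larger budget.
best = configs[0]
```

The payoff is that poor configurations are discarded after cheap, low-budget evaluations, so most of the compute is spent refining the promising ones.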
c) Additional Tweaking and External Test Set Benchmark (19 Marks)
⚫ You are encouraged to explore a variety of advanced techniques to improve your model's
predictive power.
1. Utilizing different classifiers, for example, XGBoost.
2. Implementing methods for outlier detection and treatment.
3. Creating model ensembles with varied validation splits.
4. Addressing issues of class imbalance.
5. Applying feature engineering strategies, such as creating composite attributes.
6. Implementing alternative validation splitting strategies, like cross-validation or stratified
sampling, to enhance model tuning.
7. Additional innovative and valid methods not previously discussed.
You will be awarded 3 marks for successfully applying any one of these methods. Should you
incorporate two or more of the aforementioned techniques, a total of 6 marks will be awarded.
Please include code comments that explain how you have implemented these additional
techniques. Your code and accompanying commentary should explicitly state the rationale behind
the selection of these supplementary strategies, as well as the underlying principles guiding your
implementation. Moreover, it should detail any changes in performance, including improvements,
if any, resulting from the application of these strategies. An additional 4 marks will be awarded
for a clear and comprehensive explanation. To facilitate a streamlined review and grading
process, please ensure that your comments and relevant code are placed within a separate code
block in your Jupyter notebook, in a manner that is readily accessible for our evaluation.
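As one example of the techniques listed above, technique 6 (stratified cross-validation) might be sketched as follows. The class weights, model, and dataset are illustrative assumptions; the point is that stratification preserves the class proportions in every fold, which matters when the five heart-disease classes are imbalanced.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold

# Imbalanced 5-class toy data standing in for the preprocessed features.
X, y = make_classification(n_samples=400, n_classes=5, n_informative=8,
                           weights=[0.4, 0.3, 0.15, 0.1, 0.05], random_state=0)

scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                          random_state=0).split(X, y):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[val_idx])
    # Score each fold with the assignment's composite metric.
    scores.append(0.5 * (f1_score(y[val_idx], pred, average="macro")
                         + accuracy_score(y[val_idx], pred)))
mean_score = float(np.mean(scores))
```

Averaging the composite score over the stratified folds gives a steadier estimate for model tuning than a single validation split.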
⚫ Additionally, use the entire dataset, together with the previously determined optimal
hyperparameters and classification pipeline, to retrain your top-performing classifier. Then
apply this model to the features in 'dts304tc_a1_heart_disease_dataset_external_test.csv',
which lacks true labels, to produce predictions. Save these in a table with two columns: the
first for patient IDs and the second for the predicted classification labels. Export this table
to a file named external_test_results_[your_student_id].csv and submit it for evaluation. In
the external evaluation conducted by us, your scores will be benchmarked against the models
developed by your classmates. You will receive four marks for successfully completing the
prescribed retraining and submission process; a further five marks will be assigned according
to your classifier's benchmark ranking relative to your peers' models.
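The retrain-and-export step might look like the sketch below. The model, the synthetic stand-in data, the "patient_id"/"prediction" column names, and the student ID in the filename are all assumptions; match them to your tuned pipeline and the real external CSV.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins: X_full/y_full for the whole labelled dataset, X_ext for the
# unlabelled features loaded from the external test CSV.
X_full, y_full = make_classification(n_samples=300, n_classes=5,
                                     n_informative=8, random_state=0)
X_ext = make_classification(n_samples=50, n_classes=5, n_informative=8,
                            random_state=1)[0]

final_clf = RandomForestClassifier(n_estimators=100, random_state=0)  # your tuned model
final_clf.fit(X_full, y_full)                                         # retrain on everything

out = pd.DataFrame({
    "patient_id": np.arange(len(X_ext)),          # replace with the real patient IDs
    "prediction": final_clf.predict(X_ext),
})
out.to_csv("external_test_results_1234567.csv", index=False)  # use your own student ID
```

Note the model is refit on the full labelled dataset (not just the training split) before predicting on the external features, exactly as the bullet above requires.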