INFS4203/7203 Project
The assignment aims to assess your ability to apply data mining techniques to solve real-world problems.
This is an individual task, and completion should be based on your own design.
For this assignment, you will individually complete a project proposal and implement it to develop data
mining models applicable to test data. You can choose either:
• A data-oriented project, or
• A competition-oriented project.
To complete the project, you need to submit a comprehensive proposal in Phase 1, clearly describing the
data pre-processing, tuning, model training, and evaluation techniques you plan to apply. Based on this
proposal, in Phase 2, you will submit a project implementation and a report on the final test results.
Track 1: Data-oriented project
In this data-oriented project, the dataset train.csv is designed to simulate real-world scenarios and
reflect the complexities found in natural data. It originates from the CIFAR-10 dataset, but we have
deliberately introduced realistic challenges such as missing values, widely varying feature scales, and
outliers. To create the dataset, we used a neural network to extract features from the original
CIFAR-10 data and then modified the resulting features. The dataset therefore resembles naturally
occurring data and offers an opportunity to develop robust solutions applicable to real-world data
analysis.
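As an illustration of how the three challenges named above might be handled for a single numerical column, here is a minimal sketch using only the Python standard library. The specific choices (median imputation, clipping at 1.5 times the interquartile range, and min-max scaling) are assumptions made for this example and are not prescribed by the assignment; the function names and sample values are hypothetical.

```python
import statistics

def fit_numeric(values):
    """Learn the median and IQR-based clipping bounds from one training column.

    Missing entries are represented as None and ignored while fitting.
    """
    present = [v for v in values if v is not None]
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    return {"median": median, "low": q1 - 1.5 * iqr, "high": q3 + 1.5 * iqr}

def transform_numeric(values, params):
    """Impute missing values, clip outliers, then scale into [0, 1]."""
    low, high = params["low"], params["high"]
    span = high - low
    out = []
    for v in values:
        if v is None:
            v = params["median"]      # median imputation (an assumed choice)
        v = min(max(v, low), high)    # clip values outside the IQR fences
        out.append((v - low) / span if span > 0 else 0.0)
    return out

# Invented values: one missing entry and one obvious outlier.
train_col = [1.0, 2.0, None, 3.0, 4.0, 100.0]
params = fit_numeric(train_col)
scaled = transform_numeric([1.0, None, 100.0], params)
```

The parameters are fitted on training values only and then reused for any later data, so that statistics from validation or test data do not leak into pre-processing.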
In the "train.csv" file, each row after the first (header) row represents a single data point, and there are a
total of 2180 data points in this dataset. The first 100 columns (Num_Col_0 to Num_Col_99) contain
numerical features, while the subsequent 28 columns (Cat_Col_100 to Cat_Col_127) contain nominal
features. The last column "Label" indicates the corresponding label for each data point. The dataset
includes a total of 10 classes, which are denoted by numbers 0 to 9, representing the following categories:
airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck, respectively.
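The column layout described above can be recovered from the header names alone. The following is a small sketch, assuming the header uses exactly the prefixes Num_Col_ and Cat_Col_ and a column named Label as stated; the helper names and the two sample rows are invented for illustration, and an in-memory string stands in for the real file.

```python
import csv
import io

# Class names in label order 0..9, as listed in the specification.
CLASS_NAMES = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]

def read_rows(handle):
    """Read a CSV file object into (header, list of row dicts)."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    return reader.fieldnames, rows

def group_columns(header):
    """Split header names into numerical, nominal, and label columns."""
    num_cols = [c for c in header if c.startswith("Num_Col_")]
    cat_cols = [c for c in header if c.startswith("Cat_Col_")]
    label_col = "Label" if "Label" in header else None
    return num_cols, cat_cols, label_col

# Tiny stand-in for train.csv (values invented; note the empty cell,
# which is how a missing value may appear when read as text).
sample = io.StringIO(
    "Num_Col_0,Num_Col_1,Cat_Col_100,Label\n"
    "0.5,,2,3\n"
    "120.0,0.1,0,9\n"
)
header, rows = read_rows(sample)
num_cols, cat_cols, label_col = group_columns(header)
label_name = CLASS_NAMES[int(rows[0][label_col])]
```

To run this against the real data, replace the in-memory sample with an opened file, for example `open("train.csv", newline="")`, where the path is an assumption about where the file is stored.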
The main objective is to use the provided labeled data to develop a classifier that accurately assigns
unseen data points to one of the ten classes. The classifier's performance will be evaluated by the
teaching team using the test data released in Week 9, for which the ground truth labels are accessible
only to the teaching team.
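Because the test labels are not available to students, performance has to be estimated from the training data. One possible approach is stratified k-fold cross-validation, sketched below with the standard library only. The choice of k = 5, the fixed seed, and the use of plain accuracy are assumptions for this example rather than requirements stated here; the function names and the synthetic labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds with similar class proportions."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # Deal the shuffled indices of this class round-robin across folds.
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic labels: 10 classes with 10 samples each.
labels = [i % 10 for i in range(100)]
folds = stratified_folds(labels, k=5)
```

Each fold serves once as the validation set while a model is trained on the remaining folds; averaging the per-fold scores gives an estimate of performance on unseen data.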
Phase 1: project proposal (15 marks)
In the first phase of the project, you are required to submit a proposal by 16:00 on 15th September 2023.
The proposal should outline your overall plan for the project, including details about the learning process
and the timeline for Phase 2. You do not need to submit any code or report any training, validation,
or test results in the Phase 1 proposal. An abstract is also not
required in the proposal. The focus should be on providing a clear and comprehensive outline of your
approach and timeline for the project in Phase 2.