DTS001TC
Task 1 (50 marks)
You are given a dataset of Boston Housing. This dataset contains information related to housing prices in the Boston area, consisting of 12 numerical features and one target variable (the median price of owner- occupied housing). The introduction to the specific numerical features can be found in the Appendix.docx file. You need to design and create your visualization and model based on the dataset. The visualization will show the impact of different features on the median price of owner-occupied housing (which is in the column in red font in the .xlsx file), while the model needs to consider multiple features to predict the median price of owner-occupied housing. Here are task specifications:
Target for visualization: You are asked to use excel to create a visualization that complete the following tasks
l Clean and preprocess the original dataset (9 marks).
l Show the impact of the Average number of rooms per dwelling on the median price of owner- occupied housing through appropriate tables and charts (5 marks).
l Show the impact of the Percentage of lower status of the population on the median price of owner-occupied housing through appropriate tables and charts (5 marks).
l Show the impact of the Proportion of residential land zoned for lots over 25,000 sq.ft. on the median price of owner-occupied housing through appropriate tables and charts (5 marks).
l Show the impact of the Per capital crime rate by town on the median price of owner-occupied housing through appropriate tables and charts (5 marks).
Target for model: You are asked to use excel to construct a model that can predict the median price of owner-occupied housing (which is in the column in red font in the .xlsx file) based on the different
features of these houses. Your model needs to complete the following tasks
l Choose the appropriate independent variable for the appropriate model (8 marks).
l Strive for low Mean Square Error (MSE) as much as possible (8 marks).
l Use the trained model to output the median price of owner-occupied housing, and present a
table showing the Mean Squared Error (MSE) corresponding to different predicted value ranges (5 marks).
The submitted Excel file should include:
l The original dataset
l The dataset after data preprocessing
l All visualized tables and charts
l Summary output of the constructed model
Detailed Requirements:
l The formulas and functions used in data preprocessing needs to be retained in your .xlsx file. You need to demonstrate through formulas how the processed data was transformed step by step.
l Visual charts and tables need to be generated by Excel and remain in an editable state in your .xlsx file. Screenshots will not be accepted.
Additional notes:
l The use of add-ins that have not been mentioned in lecture is allowed, but it is necessary to refer the source and ensure that the add-ins is publicly available.
l It is allowed to use newly constructed features during the model constructing, but these features must be based on the original dataset, and the process of constructing the new features needs to be retained.
Task 2 (50 marks)
In this task, you need to write a report based on your visualization and modeling results.
Target for report: You are asked to write a report (in PDF) to analyze your visualization results and evaluate your model's prediction result, the report should consist of following contents:
l Analysis of each visualization table and chart (16 marks).
l Conclude which feature has the greatest impact on the median price of owner-occupied housing and provide corresponding evidence (6 marks).
l Evaluation of the fitness of the model (5 marks).
l Evaluation of the predicted results of different predicted value ranges of the model (5 marks).
l Elaborate on the potential of your predictions in commercial applications (10 marks).
l Discuss the limitations of the model and potential directions for improvement (8 marks).
The formatting requirements in the report:
l Font: Times new roman
l Font Size: 12pt
l Page limitation: 1
l Line Spacing: single space
l Spacing Before: 0pt
l Spacing After: 12pt
Notes:
l Newly created features can be included in the discussion
l You can evaluate your model by comparing different models
l Discussions on directions for improvement can include discussions on improving the dataset
l You may get marks deducted if your report has more than 1 page
Marking Criteria
Tasks |
100 |
Components |
Description |
Maximum Credit |
Mark |
Task 1 |
50 |
Data Preprocessing [9 marks] |
Missing value handling |
3 |
|
Outlier handling |
3 |
|
|||
Text Data handling |
3 |
|
|||
Data Visualization [20 marks] |
Visualization for Average number of rooms per dwelling |
5 |
|
||
Visualization for Percentage of lower status of the population |
5 |
|
|||
Visualization for Proportion of residential land zoned for lots over 25,000 sq.ft. |
5 |
|
|||
Visualization for Per capital crime rate by town |
5 |
|
|||
Model Construction [21 marks] |
Model Choice |
8 |
|
||
Prediction MSE |
8 |
|
|||
Predicted Result Table |
5 |
|
|||
Task 2 |
50 |
Visualization Analysis [22 marks] |
Analysis of pivot table |
8 |
|
Analysis of pivot chart |
8 |
|
|||
Analysis of the impact level on different features |
6 |
|
|||
Model Evaluation [20 marks] |
Evaluation of the fitness of the model |
5 |
|
||
Evaluation of the predicted results of different predicted value ranges of the model |
5 |
|
|
|
|
Potential of commercial applications |
10 |
|
Discussion [8 marks] |
Limitations |
4 |
|
||
Future improvement directions |
4 |
|
|||
Late Submission? |
oYes oNo |
Days late |
|||
|
|||||
Final Marks |
|