MIS41270 Data Management and Mining
Individual Assignment
Section 2
This assignment is broken into three sections.
• This is an individual assignment, and you must answer the questions by yourself without
external help.
• Students’ submissions must represent their own work.
• There are three sections to the assignment.
• Each submission will consist of a video and should be submitted via Brightspace.
• Each video should last about 3 minutes and no longer than 5 minutes.
• The questions for each section of the assignment will be released:
o on Tuesday 27th April at 07:00am (Irish time).
o on Thursday 29th April at 07:00am (Irish time).
o on Tuesday 4th May at 07:00am (Irish time).
• You will have 24 hours to submit your video. (It should not take you 24 hours; this is just to
allow for the fact that not all students are in the same time zone.)
• Each section carries equal weighting and is worth a total of 33.3% of the overall marks for
this assignment.
Section 2 – Answer all questions
Submission Details:
• The questions for this section will be released on Thursday 29th April at 07:00am (Irish
time).
• Your video submission must be submitted by Friday 30th April at 07:00am (Irish time).
• This section is worth 33.3% of the overall marks for this assessment. Each question below
has equal weighting.
Scenario – Churn Analysis
You are working as the head of Analytics and AI for a telecommunications company. The company is
having issues with customer retention. Customer Churn (customers leaving the company) has been
increasing steadily over time. The head of customer retention has asked for assistance from your
team in deriving a strategy to reverse this current trend. Following the kickstart workshop, you and
your team decide that the best place to start is to build a prediction model. You create an analytics base table (ABT) that
contains several input features and a target feature. The team developed three predictive models
using three different methods: a decision tree, a logistic regression model, and a random forest. The
questions in this assignment are based on the validation results from their work.
Question 1 – Validation Statistics
To validate each model the team created three datasets: a training set, a validation set, and a test
set. For each model created, the team generated a set of validation statistics for each dataset. The
validation statistics based on the training and test datasets are shown below. In your own words,
review the results below, referring to:
• An explanation of the statistics created and how they should be interpreted
• For each model, a comparison of the validation statistics based on the training set and test
set
• A comparison of the validation statistics based on the test set between each model
• Any other details that you deem relevant
Model: Decision Tree
Sample: Training (50%)
Predicted     0    1   All
Target
0          1783    3  1786
1            93  127   220
All        1876  130  2006
Overall Accuracy 0.9521
Accuracy Class (0) 1.000
Accuracy Class (1) 0.58
Sample: Testing (30%)
Predicted     0    1   All
Target
0          1058   12  1070
1            75   60   135
All        1133   72  1205
Overall Accuracy 0.928
Accuracy Class (0) 0.99
Accuracy Class (1) 0.44
Model: Logistic Regression
Sample: Training (50%)
Predicted     0    1   All
Target
0          1784    2  1786
1           103  117   220
All        1887  119  2006
Overall Accuracy 0.947
Accuracy Class (0) 1
Accuracy Class (1) 0.53
Sample: Testing (30%)
Predicted     0    1   All
Target
0          1068    2  1070
1            62   73   135
All        1130   75  1205
Overall Accuracy 0.947
Accuracy Class (0) 1
Accuracy Class (1) 0.54
Model: Random Forest
Sample: Training (50%)
Predicted     0    1   All
Target
0          1786    0  1786
1           136   84   220
All        1922   84  2006
Overall Accuracy 0.932
Accuracy Class (0) 1
Accuracy Class (1) 0.38
Sample: Testing (30%)
Predicted     0    1   All
Target
0          1070    0  1070
1            97   38   135
All        1167   38  1205
Overall Accuracy 0.919
Accuracy Class (0) 1
Accuracy Class (1) 0.28
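As a rough illustration of how the statistics above relate to the confusion matrices, the sketch below recomputes them from the test-set counts. It assumes that "Accuracy Class (k)" means the share of actual class-k customers that the model predicted correctly, which is consistent with the reported figures. The names `validation_stats` and `test_counts` are illustrative only and are not part of the team's work.

```python
def validation_stats(tn, fp, fn, tp):
    """Return overall and per-class accuracy for a 2x2 confusion matrix.

    tn, fp: actual class 0 predicted as 0 / as 1
    fn, tp: actual class 1 predicted as 0 / as 1
    """
    total = tn + fp + fn + tp
    return {
        "overall": (tn + tp) / total,   # share of all customers classified correctly
        "class_0": tn / (tn + fp),      # share of non-churners classified correctly
        "class_1": tp / (fn + tp),      # share of churners classified correctly
    }


# Test-set counts copied from the tables above, in the order (tn, fp, fn, tp).
test_counts = {
    "Decision Tree": (1058, 12, 75, 60),
    "Logistic Regression": (1068, 2, 62, 73),
    "Random Forest": (1070, 0, 97, 38),
}

results = {name: validation_stats(*counts) for name, counts in test_counts.items()}

for name, stats in results.items():
    line = "  ".join(f"{key}={value:.3f}" for key, value in stats.items())
    print(f"{name:20s} {line}")
```

Because only about 11% of the test customers churned (135 of 1205), a model that predicts class 0 for everyone would already score roughly 0.89 overall, so the class (1) figure is the more informative one to compare across models.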
Question 2 – ROC
For each model created, the team generated a set of visualisations for each dataset. ROC charts for
each model created based on the test set are shown below. In your own words, review the charts
below, referring to:
• An explanation of an ROC chart and how it should be interpreted
• A comparison of the charts created for each model
• A discussion of the advantage of using an ROC chart
• Any other details that you deem relevant
Model: Decision Tree
Sample: Testing (30%)
Area under the ROC: 0.852
Model: Logistic Regression
Sample: Testing (30%)
Area under the ROC: 0.936
Model: Random Forest
Sample: Testing (30%)
Area under the ROC: 0.932
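For readers who want to see where an area-under-the-ROC figure comes from, the following is a minimal sketch in plain Python. It sweeps the decision threshold over a set of predicted scores, records the false positive rate and true positive rate at each threshold, and integrates the resulting curve with the trapezoidal rule. The labels and scores are hypothetical values made up for illustration; they are not the team's data and do not reproduce the areas reported above. The helper names `roc_points` and `auc` are also illustrative.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted as churn
    # Visit each distinct score from highest to lowest; every customer scoring
    # at or above the threshold is predicted as churn (class 1).
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve, by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = churned) and model scores, for illustration only.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

curve = roc_points(y_true, scores)
area = auc(curve)
print(f"AUC = {area:.3f}")
```

An area of 0.5 corresponds to ranking customers at random and 1.0 to a perfect ranking, which is why the measure does not depend on choosing any single classification threshold.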
Question 3 – Choosing a Model
The team has been asked to recommend which model should be deployed (the decision tree, logistic
regression model, or random forest model). They have been told that the explainability of the
model chosen is important to the retention team that will use the model. Please make a
recommendation on what model should be deployed given the information outlined in Questions 1
and 2.