Applied Statistical Modelling
ST404: Applied Statistical Modelling
Assignment 3: Iranian Churn Data
3.0 ASSIGNMENT WEIGHTING
Assignment 3 counts for 35% of the module mark.
3.1 DEADLINE
1:00pm Tuesday 2nd May 2023, to be submitted electronically via Moodle in PDF format.
3.2 PROBLEM OUTLINE
The aim of this assignment is to analyse a subset of a dataset available on the UCI website
maintained by (Dua and Graff 2019). The dataset concerns the CHURN of telecom
customers from Iranian companies and was donated by (Jafari-Marandi 2020). An individual
CHURNS if they are a paying customer who fails to renew their contract (typically because
they switch to another provider, in this case another mobile company).
The link to the data on the UCI website seems to no longer exist, but the data are also
available at (Jafari 2020). There are a number of papers that are listed in these sources that
have used these data including (Jafari-Marandi, Denton, et al. 2020), (Keramati, et al. 2014),
and (Keramati and Ardabili 2011). You may wish to scrutinise these papers to give you
further background.
(Jafari-Marandi 2020) states that:
“The dataset is randomly collected from an Iranian telecom company’s database over a
period of 12 months. A total of 3150 rows of data, each representing a customer, bear
information for 13 columns. The attributes that are in this dataset are call failures, frequency
of SMS, number of complaints, number of distinct calls, subscription length, age group, the
charge amount, type of service, seconds of use, status, frequency of use, and Customer
Value.”
Main question: is it possible to predict customers who will
CHURN and explain why they do?
3.3 DATA AVAILABILITY
The data are available on Moodle as an R data frame called TeleChurn.Rdata.
Details of the variables and their coding are as follows:
Variable Name          Type                   Detail
CallFailure            numeric                number of call failures
Complains              binary                 0 = No complaint, 1 = Complaint
SubscriptionLength     numeric                total months of subscription
ChargeAmount           ordinal attribute      0 lowest amount, 9 highest amount
SecondsOfUse           numeric                total seconds of calls
FrequencyOfUse         numeric                total number of calls
FrequencyOfSMS         numeric                total number of text messages
DistinctCalledNumbers  numeric                total number of distinct phone calls
AgeGroup*              ordinal attribute      1 younger age, 5 older age
TariffPlan             binary                 1 = Pay as you go, 2 = contractual
Status                 binary                 1 = active, 2 = non-active
CustomerValue          numeric                The calculated value of the customer
Churn                  binary - Class label   1 = churn, 0 = non-churn
*AgeGroup: the original data set also had an age variable taking only five values, with the
following mapping: 1 = 15, 2 = 25, 3 = 30, 4 = 45 and 5 = 55, whilst (Keramati and Ardabili 2011)
suggest the age groups are <15, 15-30, 30-45, 45-60 and 60-75 respectively.
This variable is to be used to assess the model and should not be used as an explanatory
variable.
Table 1: Details of the Variables
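The coding in Table 1 can be made explicit in R before any modelling. The following is a minimal sketch on a small hypothetical data frame called toy, not on the Moodle data. It assumes the variables arrive as plain numeric codes, which may not be the case in TeleChurn.Rdata, and the factor labels are simply copied from the table. Storing ChargeAmount as an ordered factor is one option among several; keeping it numeric is an alternative you may prefer.

```r
# Toy rows using the codes from Table 1; the real data frame is not read here.
toy <- data.frame(
  ChargeAmount = c(0, 3, 9),
  TariffPlan   = c(1, 2, 1),
  Status       = c(1, 1, 2),
  Churn        = c(0, 0, 1)
)

# Binary variables as labelled factors, with labels taken from Table 1.
toy$TariffPlan <- factor(toy$TariffPlan, levels = c(1, 2),
                         labels = c("Pay as you go", "Contractual"))
toy$Status     <- factor(toy$Status, levels = c(1, 2),
                         labels = c("Active", "Non-active"))
toy$Churn      <- factor(toy$Churn, levels = c(0, 1),
                         labels = c("Non-churn", "Churn"))

# Ordinal variable as an ordered factor (an assumption; numeric is an alternative).
toy$ChargeAmount <- factor(toy$ChargeAmount, levels = 0:9, ordered = TRUE)

# Inspect the resulting column types.
str(toy)
```

Running str() on your own data frame after loading it will show whether any such recoding is needed at all.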
3.4 ANALYSIS REQUIRED
You are to conduct an analysis of this dataset in R. An outline of the steps you should take
in your analysis is given below.
1) Given the size of the data, it is reasonable to divide your observations into a training and
validation set. The proportion in which you do so should be justified and you should
ensure you do so in a random fashion. It will be useful to set a random number seed
(set.seed(xxx), where xxx is some integer value) so that you always produce the
same sub-samples should you need to re-run any code.
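As an illustration of the splitting step, the following is a minimal sketch in R. The data frame churn_demo is simulated as a stand-in for the Moodle data (only the 3150 rows come from the description above). The seed 404 and the 70/30 proportion are assumptions chosen for the example, not recommendations; you still need to justify your own proportion.

```r
# Simulated stand-in for the assignment data; replace with the real data frame.
set.seed(404)                       # arbitrary seed so the split is reproducible
n <- 3150
churn_demo <- data.frame(
  SecondsOfUse = rexp(n, rate = 1 / 4000),
  Complains    = rbinom(n, 1, 0.08)
)
churn_demo$Churn <- rbinom(
  n, 1, plogis(-2 + 2 * churn_demo$Complains - churn_demo$SecondsOfUse / 4000)
)

# Draw the training rows at random without replacement.
train_prop <- 0.7                   # assumed proportion for illustration only
train_idx  <- sample(seq_len(n), size = round(train_prop * n))

# Rows not selected for training form the validation set.
train <- churn_demo[train_idx, ]
valid <- churn_demo[-train_idx, ]

# Check the sizes and the churn rate in each part.
c(train = nrow(train), valid = nrow(valid))
c(train = mean(train$Churn), valid = mean(valid$Churn))
```

Comparing the churn rate in the two parts is a quick check that the random split has not produced badly unbalanced sub-samples.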
2) Begin with an exploratory analysis of the data. Using appropriate numerical, tabular or
graphical summaries, describe the distribution of the variables and investigate potential
relationships. You should start with the initial basic plots. However, you should then
make use of empirical logistic function and conditional density plots as discussed in
lectures. You are also advised to read (Sheather 2009) which is available from the
library both as a hard copy and electronically. (It is the first book on the Book list for this
module, and the section referred to is only four pages.) Code was provided in lectures
that should allow you to complete this quite quickly. Remember you have limited time
and limited room in your report so whilst you should be thorough in your EDA “behind
the scenes”, carefully pick only a few relevant examples to include in your report.
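For orientation only, the two kinds of plot mentioned above can be sketched as follows. This is not the code provided in lectures. It uses simulated variables (seconds and churn are hypothetical stand-ins for SecondsOfUse and Churn), bins the predictor into deciles, and adds 0.5 to the counts so that the empirical logit stays finite in every bin. The number of bins and the 0.5 adjustment are assumptions. The conditional density plot uses cdplot() from base R graphics.

```r
# Simulated stand-ins for one numeric predictor and the binary response.
set.seed(1)
n       <- 3150
seconds <- rexp(n, rate = 1 / 4000)
churn   <- rbinom(n, 1, plogis(-0.5 - seconds / 3000))

# Group the predictor into ten bins using its deciles.
breaks <- quantile(seconds, probs = seq(0, 1, by = 0.1))
bins   <- cut(seconds, breaks = breaks, include.lowest = TRUE)

# Empirical logit in each bin, with 0.5 added to avoid log(0).
events    <- tapply(churn, bins, sum)
totals    <- tapply(churn, bins, length)
emp_logit <- log((events + 0.5) / (totals - events + 0.5))
bin_mean  <- tapply(seconds, bins, mean)

# A roughly linear pattern suggests the predictor can enter the model untransformed.
plot(bin_mean, emp_logit,
     xlab = "Mean seconds of use in bin", ylab = "Empirical logit of churn")

# Conditional density plot: estimated P(churn | seconds) across the range of seconds.
churn_f <- factor(churn, levels = c(0, 1), labels = c("non-churn", "churn"))
cdplot(churn_f ~ seconds, xlab = "Seconds of use", ylab = "Churn")
```

Curvature in the empirical logit plot is one indication that a transformation of the predictor may be worth trying in the modelling step.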
3) Use logistic regression to investigate the relationship between the dependent variable,
Churn, and the explanatory variables. You should attempt this in a number of ways:
a) Find an initial model:
i) Fit a model that contains as a minimum all main effects. You should aim to find a
model that will then be the starting point to reduce the model further in parts 3)b)
and 3)c) below. Below are listed some ideas that you might consider in
developing this model.
(1) You may wish to experiment with transformations of the variables as a result
of your findings in 2) above and to improve this initial model.
(2) You may wish to perform some residual and influential analyses to improve
your model and determine which version of the explanatory variables to use.
You do not need to present details of this model validation, but state if any of
this altered your choice of how each variable should be used in the model.
You will need to present some of these analyses for one of your models in
question 6) below.
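The ideas listed for the initial model can be sketched in R as follows. This is a minimal illustration on simulated data, not a recommended model: the data frame demo, its three explanatory variables, the coefficients used to simulate Churn and the choice of log1p() as a transformation are all assumptions made for the example. The functions glm(), update(), AIC(), residuals() and cooks.distance() are from base R.

```r
# Simulated stand-in with a few of the variable names from Table 1.
set.seed(2023)
n <- 500
demo <- data.frame(
  CallFailure  = rpois(n, 7),
  Complains    = factor(rbinom(n, 1, 0.1)),
  SecondsOfUse = rexp(n, rate = 1 / 4000)
)
eta <- -1 + 2 * (demo$Complains == "1") - demo$SecondsOfUse / 4000 +
  0.05 * demo$CallFailure
demo$Churn <- rbinom(n, 1, plogis(eta))

# Main-effects logistic regression as a starting point.
fit_main <- glm(Churn ~ CallFailure + Complains + SecondsOfUse,
                family = binomial, data = demo)

# Same model with one predictor transformed, for comparison by AIC.
fit_log <- update(fit_main, . ~ . - SecondsOfUse + log1p(SecondsOfUse))
AIC(fit_main, fit_log)

# Residual and influence summaries for the main-effects fit.
dev_res <- residuals(fit_main, type = "deviance")
cook    <- cooks.distance(fit_main)
head(sort(cook, decreasing = TRUE))   # observations with the largest influence

summary(fit_main)
```

On the real data the same pattern applies to the training set only, with the full set of explanatory variables in place of the three used here.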