CHE1148H - Data Mining in Engineering
1 Supervised learning
With the features built in Assignment #2, you are now asked to build a model that predicts
clients' responses to a promotion campaign using three MLlib algorithms. This is a typical
classification problem in the retail industry, but the same formulation appears in areas such
as fraud detection, marketing and manufacturing.
The clients' responses are stored in the Retail_Data_Response.csv file from Kaggle. The
responses are binary: 0 for clients who responded negatively to the promotional campaign
and 1 for clients who responded positively to it.
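Before modeling, it can help to sanity-check the labels. The snippet below is a minimal sketch, assuming the file has been uploaded to DBFS (the path shown is hypothetical) and that the label column is named response; adjust both to match your upload.

```python
# Minimal sketch: load the Kaggle response file and check the class balance.
# Assumptions: hypothetical DBFS path, label column named "response".
response_df = (spark.read
               .option("header", True)
               .option("inferSchema", True)
               .csv("/FileStore/tables/Retail_Data_Response.csv"))

# Count how many clients responded (1) vs. did not respond (0).
response_df.groupBy("response").count().show()
```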
You will explore solving the classification problem with two different sets of features (i.e.
annual and monthly) and three different algorithms, as shown in the figure below.
[Figure: the retail response classification problem is solved with two feature sets (annual and monthly), and each feature set is modeled with three algorithms: Logistic Regression with L1 regularization, a Decision Tree, and Random Forests.]
1.1 Import the monthly and annual data and join
In Assignment #2, you created five different feature families that capture annual and monthly
aggregations. Here, you will model the retail problem with two approaches: one using the
annual features and one using the monthly features. Therefore, you need to create the joined
tables based on the following logic:
| Table | Annual features outputs              | Monthly features outputs  |
|-------|--------------------------------------|---------------------------|
| #1    | annual_features.xlsx                 | mth_rolling_features.xlsx |
| #2    | annual_day_of_week_counts_pivot.xlsx | mth_day_counts.xlsx       |
| #3    | days_since_last_txn.xlsx             |                           |
| #4    | Retail_Data_Response.csv             | Retail_Data_Response.csv  |
In both the annual and the monthly features approach, you need to join at the end with table
#4, the clients' responses. This table simply contains each client's binary response to our
marketing effort, as described above; it is the output (also called the label or target)
that makes this a supervised learning problem.
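As a sketch of this join logic (not a required implementation), the snippet below assembles the annual modeling table. It assumes the Assignment #2 outputs were exported as .xlsx files at hypothetical DBFS paths, that openpyxl is available for pandas.read_excel, and that all tables share a customer_id key; adjust names and paths to your own outputs.

```python
import pandas as pd

def read_xlsx(path):
    # Read an Assignment #2 Excel output into a Spark DataFrame via pandas.
    return spark.createDataFrame(pd.read_excel(path))

# Hypothetical paths; replace with wherever your Assignment #2 outputs live.
annual_features = read_xlsx("/dbfs/FileStore/tables/annual_features.xlsx")
annual_dow      = read_xlsx("/dbfs/FileStore/tables/annual_day_of_week_counts_pivot.xlsx")
days_since_txn  = read_xlsx("/dbfs/FileStore/tables/days_since_last_txn.xlsx")

# Table #4: the binary responses, i.e. the supervised-learning label.
responses = (spark.read
             .option("header", True)
             .option("inferSchema", True)
             .csv("/FileStore/tables/Retail_Data_Response.csv"))

# Join the feature tables first, then attach the label last so that every
# row in the modeling table has a response.
annual_joined = (annual_features
                 .join(annual_dow, on="customer_id", how="inner")
                 .join(days_since_txn, on="customer_id", how="inner")
                 .join(responses, on="customer_id", how="inner"))
```

The monthly table is built the same way from mth_rolling_features.xlsx and mth_day_counts.xlsx.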
1.2 Steps for each method (15 points)
Important note 1: When you set up a new cluster in Databricks, make sure that you select
a runtime version that supports ML (any 10.x ML runtime should work).
Important note 2: The Learning Spark book's GitHub repository has many notebooks in Chapter
10 that are useful for building ML pipelines.
1. Separate the inputs X and the output y into two data frames.
2. Split the data into train and test sets. Use a test size of 2/3 and set the random state
(seed) equal to 1147 for consistency (i.e. the course code value). Use the following names
for consistency:

|       | Annual                         | Monthly                          |
|-------|--------------------------------|----------------------------------|
| Train | X_train_annual, y_train_annual | X_train_monthly, y_train_monthly |
| Test  | X_test_annual, y_test_annual   | X_test_monthly, y_test_monthly   |
3. Pre-process the data (if necessary for the method).
4. Fit the training dataset and optimize the hyperparameters of the method (the first sketch after this list shows one possible setup).
5. Plot the coefficient values or feature importances.
6. Plot the probability distribution for the test set.
7. Plot the confusion matrix and ROC curves for the train and test sets, and calculate precision and recall (the second sketch after this list shows one way to compute these).
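One possible way to realize steps 1-4 with Spark ML is sketched below for the L1-regularized logistic regression; it is illustrative, not the required solution. It assumes the annual_joined DataFrame and the customer_id/response column names from the earlier sketches, and the regularization grid is an arbitrary choice. Spark ML keeps X and y in one DataFrame, so the X/y naming convention from step 2 applies if you additionally select the feature and label columns into separate DataFrames. The same pattern works for DecisionTreeClassifier and RandomForestClassifier.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Steps 1-2: split with a 2/3 test size and the course seed 1147.
feature_cols = [c for c in annual_joined.columns if c not in ("customer_id", "response")]
train_annual, test_annual = annual_joined.randomSplit([1/3, 2/3], seed=1147)

# Step 3: assemble and scale the inputs (scaling matters for the L1 penalty).
assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")

# Step 4: L1-regularized logistic regression (elasticNetParam=1.0 gives pure L1),
# with the regularization strength tuned by 3-fold cross-validation.
lr = LogisticRegression(featuresCol="features", labelCol="response", elasticNetParam=1.0)
pipeline = Pipeline(stages=[assembler, scaler, lr])
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.001, 0.01, 0.1])
        .build())
evaluator = BinaryClassificationEvaluator(labelCol="response")  # areaUnderROC by default
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3, seed=1147)
cv_model = cv.fit(train_annual)
```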
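For steps 5-7, one hedged sketch of the evaluation, assuming cv_model, test_annual and feature_cols come from the tuning sketch above and that the label column is named response:

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.functions import vector_to_array

pred_test = cv_model.transform(test_annual)

# Step 5: coefficient values of the tuned L1 model (the pipeline's last stage).
best_lr = cv_model.bestModel.stages[-1]
coef_by_feature = list(zip(feature_cols, best_lr.coefficients))

# Step 6: distribution of the predicted probability P(response = 1) on the test set.
probs = pred_test.select(vector_to_array("probability")[1].alias("p_response_1"))
probs.toPandas().hist(bins=50)  # or use display() in Databricks

# Step 7: confusion matrix, precision/recall from its counts, and AUC as an ROC summary.
pred_test.groupBy("response", "prediction").count().show()
tp = pred_test.filter("response = 1 AND prediction = 1.0").count()
fp = pred_test.filter("response = 0 AND prediction = 1.0").count()
fn = pred_test.filter("response = 1 AND prediction = 0.0").count()
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
test_auc = BinaryClassificationEvaluator(labelCol="response").evaluate(pred_test)
```

For the full ROC curve (rather than just the AUC) you can collect the probabilities and labels to pandas and use sklearn's roc_curve; for the tree-based models, bestModel.stages[-1].featureImportances plays the role of the coefficients in step 5. Repeat the same evaluation on the train split and on the monthly features to fill out the comparison in Section 1.3.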
1.3 Comparison of methods (5 points)
Compare the two feature engineering approaches (annual and monthly) and the three modeling
approaches (L1 logistic regression, decision tree, random forests) in terms of the outcomes
of steps 5-7. Which combination of feature engineering and modeling approach do you select
as the best to deploy in a production environment, and why? Tabularize your findings from
steps 5-7 to summarize the results and support your decision (see how to organize
information with tables in Markdown).
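One possible skeleton for that summary table is shown below; the cells are placeholders to be filled with your own numbers and observations from steps 5-7.

| Features | Model                    | Test AUC | Precision | Recall | Key features (step 5) |
|----------|--------------------------|----------|-----------|--------|-----------------------|
| Annual   | Logistic regression (L1) | ...      | ...       | ...    | ...                   |
| Annual   | Decision tree            | ...      | ...       | ...    | ...                   |
| Annual   | Random forest            | ...      | ...       | ...    | ...                   |
| Monthly  | Logistic regression (L1) | ...      | ...       | ...    | ...                   |
| Monthly  | Decision tree            | ...      | ...       | ...    | ...                   |
| Monthly  | Random forest            | ...      | ...       | ...    | ...                   |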