Customer Analytics (Practice) Final Exam – 70 minutes
NO COMMUNICATION WITH OTHERS IS PERMITTED
INSTRUCTIONS
• No interpersonal communication.
• To answer questions, make assumptions if necessary.
• Fill in your answers in the “EXAM_ANSWERS.docx” template. Do not exceed the allotted
number of lines.
• Save your work continuously. Make sure you upload the correct file and that the upload is
successful.
• Submit the file with answers via the “Final exam” link in the assignments tab in Blackboard.
This link expires 2 minutes after the due time. If it has expired, email your submission to the
instructor. Late submissions face a per-minute point penalty.
Customer Analytics (Practice) Final Exam – 70 minutes BU.450.760.K5
1. [8 points] Consider the following sample corpus from Yelp. Each row (review) is a
document. Assume the list of stopwords = c(“so”, “or”, “when”, “and”, “the”) and that
non-words include white space, punctuation, numbers, and symbols (e.g. $).
[1] All the food is great here. But the best thing they have is their wings. Their wings are simply fantastic!!
[2] This place is truly a Yinzer's dream!! \"Pittsburgh Dad\" would love this place n'at!!
[3] Wing sauce is like water. Pretty much a lot of butter and some hot sauce (franks red hot maybe).
[4] The whole wings are good size and crispy, but for $1 a wing the sauce could be better.
[5] The fish sandwich is good and is a large portion, sides are decent.
(1) [4 points] After removing non-words and stopwords, what is the TF-IDF score of the
term “good” in doc 4?
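# terms in doc 4 = 14
\[ \mathrm{TF} = \frac{\text{frequency of ``good'' in doc}}{\text{\# terms in doc}} = \frac{1}{14} \]
\[ \mathrm{IDF} = \log_2\!\left(\frac{\text{\# doc in the corpus}}{\text{\# doc that include the term}}\right) = \log_2\!\left(\frac{5}{2}\right) \]
\[ \text{TF-IDF} = \mathrm{TF} \times \mathrm{IDF} = \frac{1}{14}\,\log_2\!\left(\frac{5}{2}\right) \approx 0.0944 \]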
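The arithmetic above can be checked with a short script. This is a minimal sketch, not the course's reference solution: the `tokenize` helper and its regular expression are assumptions about how "non-words" are stripped, and the base-2 logarithm is chosen because it reproduces the 0.0944 figure in the answer.

```python
import math
import re

# Stopword list as given in the question.
STOPWORDS = {"so", "or", "when", "and", "the"}

corpus = [
    "All the food is great here. But the best thing they have is their wings. Their wings are simply fantastic!!",
    "This place is truly a Yinzer's dream!! \"Pittsburgh Dad\" would love this place n'at!!",
    "Wing sauce is like water. Pretty much a lot of butter and some hot sauce (franks red hot maybe).",
    "The whole wings are good size and crispy, but for $1 a wing the sauce could be better.",
    "The fish sandwich is good and is a large portion, sides are decent.",
]


def tokenize(text):
    """Lowercase, keep alphabetic tokens (apostrophes allowed), drop stopwords.

    Assumed pre-processing: numbers, punctuation, and symbols such as $ are discarded.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]


docs = [tokenize(d) for d in corpus]


def tf_idf(term, doc_index):
    """TF-IDF with TF = count / doc length and IDF = log2(N / document frequency)."""
    doc = docs[doc_index]
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)  # number of docs containing the term
    idf = math.log2(len(docs) / df)
    return tf * idf


score = tf_idf("good", 3)  # doc 4 is at index 3
print(len(docs[3]), round(score, 4))  # prints: 14 0.0944
```

The helper raises a `ZeroDivisionError` for a term that appears in no document, which is acceptable for a check like this but would need handling in real use.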
(2) [4 points] If we use this corpus to predict restaurants’ survival rate, is there a “wide X”
problem? Why or why not? Please explain.
The “wide X” problem refers to the case where the number of variables (i.e. coefficients to
estimate) >> number of observations. In this case, there is potentially a “wide X” problem
because we have far more terms (~ 60-70 terms even after pre-processing) than
observations (= 5).
We can address the wide X problem by removing some sparse terms from the document-term
matrix and evaluating model performance in a validation sample.
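To make the comparison between terms and observations concrete, the sketch below builds a small document-term matrix and drops sparse terms. The tokenizer and the `MIN_DOCS` threshold are hypothetical choices for illustration; the exact vocabulary size depends on how the text is pre-processed.

```python
import re
from collections import Counter

STOPWORDS = {"so", "or", "when", "and", "the"}

corpus = [
    "All the food is great here. But the best thing they have is their wings. Their wings are simply fantastic!!",
    "This place is truly a Yinzer's dream!! \"Pittsburgh Dad\" would love this place n'at!!",
    "Wing sauce is like water. Pretty much a lot of butter and some hot sauce (franks red hot maybe).",
    "The whole wings are good size and crispy, but for $1 a wing the sauce could be better.",
    "The fish sandwich is good and is a large portion, sides are decent.",
]


def tokenize(text):
    """Assumed pre-processing: lowercase, alphabetic tokens only, stopwords removed."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]


docs = [tokenize(d) for d in corpus]

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))
vocabulary = sorted(doc_freq)

n_obs = len(docs)          # observations (rows of X)
n_terms = len(vocabulary)  # candidate predictors (columns of X)

# Hypothetical sparsity rule: keep terms that occur in at least MIN_DOCS documents.
MIN_DOCS = 2
kept_terms = [t for t in vocabulary if doc_freq[t] >= MIN_DOCS]

# Document-term matrix restricted to the kept terms (rows = docs, columns = terms).
dtm = [[doc.count(t) for t in kept_terms] for doc in docs]

print(f"observations: {n_obs}, terms before pruning: {n_terms}, after: {len(kept_terms)}")
```

Pruning shrinks the matrix but does not guarantee fewer columns than rows, which is why performance in a validation sample still has to be checked.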
2. [8 points] Suppose that you work for the marketing division of the athletic apparel
company Reebok. You are now discussing the allocation of advertising dollars. In a
meeting, the chart shown below is presented. This chart describes the relationship between
the number of times Facebook users see an ad for Reebok shoes (horizontal axis) and the
probability that users will purchase a pair of Reebok shoes after clicking on the link
(vertical axis). Your colleague presents this figure in a meeting, arguing that it provides
“undisputable evidence that advertising on Facebook pays-off” and that the company
should “probably increase the number of advertising dollars in this platform.” Do you
agree? Explain your argument.
• The statement based on the graph is misleading because correlation is not causation.
• Many factors other than exposure to Facebook ads could produce the correlation between
purchase probability and the number of times a user saw an ad. For
example, people who are loyal to Reebok might follow many Reebok-related Facebook
accounts and are thus likely to see more ads on average. If more ads are allocated to people who
would purchase with a high probability anyway, the ads will seem very effective when one looks
only at the correlation.
• This is the very definition of “selection bias” and the reason why we need to run an A/B test to
estimate the true “causal effect” of advertising on conversion (purchase probability).
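The selection-bias argument can be illustrated with a small simulation. Everything here is hypothetical: the loyalty share, purchase probabilities, and exposure rules are made-up numbers, and ads are given zero causal effect by construction so that any gap in the observational data comes purely from who gets exposed.

```python
import random

random.seed(0)

N_USERS = 20_000
P_LOYAL = 0.3        # hypothetical share of loyal customers
P_BUY_LOYAL = 0.40   # hypothetical purchase probability for loyal users
P_BUY_OTHER = 0.05   # hypothetical purchase probability for everyone else


def purchase_rate_gap(exposure_rule):
    """Return P(buy | high ad exposure) - P(buy | low ad exposure)."""
    buys = {True: 0, False: 0}
    counts = {True: 0, False: 0}
    for _ in range(N_USERS):
        loyal = random.random() < P_LOYAL
        high_exposure = exposure_rule(loyal)
        # Purchases depend only on loyalty: ads have NO causal effect here.
        bought = random.random() < (P_BUY_LOYAL if loyal else P_BUY_OTHER)
        counts[high_exposure] += 1
        buys[high_exposure] += bought
    return buys[True] / counts[True] - buys[False] / counts[False]


def targeted(loyal):
    """Observational setting: loyal users are far more likely to see many ads."""
    return random.random() < (0.8 if loyal else 0.2)


def randomized(loyal):
    """A/B test: exposure is assigned by coin flip, independent of loyalty."""
    return random.random() < 0.5


naive_gap = purchase_rate_gap(targeted)
ab_gap = purchase_rate_gap(randomized)
print(f"observational gap: {naive_gap:.3f}")
print(f"randomized (A/B) gap: {ab_gap:.3f}")
```

Under these assumptions the observational gap is clearly positive even though the ads do nothing, while the randomized gap is close to zero, which is the pattern the answer describes.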
3. [8 points] The blue line in the graph below represents the outcomes of a set of units
affected by a shock (i.e., an “as if” natural experiment) that unfolded in week 60 of the dataset
at hand. To evaluate the impact of this shock on treated units, we would like to implement a
diff-in-diff analysis, which requires us to select a control series. Our data contains two
candidate control series, controls #1 and #2, respectively shown by the red and green lines.
Which of these two controls would you select to implement the diff-in-diff analysis? Justify
your answer.
We should choose control #1. Based on the data before week 60 (i.e., prior to the event), it
follows a trend parallel to that of the treated group.
In contrast, the outcome trend of candidate control #2 is almost opposite to that of the
treated group, so any post-event gap between the two series could not be attributed solely to the
treatment effect.
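One rough way to screen candidate controls is to compare pre-event slopes. The sketch below uses made-up series (the exam's actual data are not reproduced here), and the helper names `slope` and `pre_trend_gap` are hypothetical; a visual check of the pre-period remains the usual first step.

```python
def slope(values):
    """Ordinary least-squares slope of values against the time index 0..n-1."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


def pre_trend_gap(treated, control, event_time):
    """Absolute difference in pre-event slopes; smaller means closer to parallel."""
    return abs(slope(treated[:event_time]) - slope(control[:event_time]))


# Hypothetical series; the event occurs at index 5.
EVENT = 5
treated = [10, 11, 12, 13, 14, 20, 21]
candidates = {
    "control_1": [5, 6, 7, 8, 9, 10, 11],     # different level, same pre-trend
    "control_2": [14, 13, 12, 11, 10, 9, 8],  # opposite pre-trend
}

gaps = {name: pre_trend_gap(treated, series, EVENT) for name, series in candidates.items()}
best = min(gaps, key=gaps.get)
print(gaps, best)
```

Note that control_1 sits at a different level from the treated series; diff-in-diff tolerates level differences, since the assumption concerns trends, not levels.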
4. [6 points] The figure below represents the stylized impact of a shock (i.e., a natural
experiment) on a set of treated units (in red). Blue markers represent the outcomes that are
observed for an (adequate) control.
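What are the implied values of the parameters \(\alpha, \beta, \gamma, \delta\) in the diff-in-diff equation below?
\[ y_{it} = \alpha + \beta\,\mathrm{Post}_t + \gamma\,\mathrm{Treated}_i + \delta\,(\mathrm{Treated}_i \times \mathrm{Post}_t) + \varepsilon_{it} \]
\[ \alpha = 33, \quad \beta = 15 - 33 = -18, \quad \gamma = 22 - 33 = -11, \quad \delta = (38 - 22) - (15 - 33) = 16 + 18 = 34 \]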
(Note that, absent the event, the outcome of the treated group would have been expected to fall,
following the trend of the control group.)
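The four values follow from the four group-by-period means. In the sketch below, the assignment of 33, 15, 22, and 38 to cells (control before/after, treated before/after) is inferred from the arithmetic in the answer, since the figure itself is not reproduced; the function name `did_parameters` is hypothetical.

```python
def did_parameters(control_pre, control_post, treated_pre, treated_post):
    """Recover the saturated diff-in-diff coefficients from the four cell means."""
    intercept = control_pre                      # control group, before the event
    post_effect = control_post - control_pre     # common time change
    treated_effect = treated_pre - control_pre   # baseline gap between groups
    # Interaction: change in treated minus change in control.
    interaction = (treated_post - treated_pre) - (control_post - control_pre)
    return intercept, post_effect, treated_effect, interaction


params = did_parameters(control_pre=33, control_post=15, treated_pre=22, treated_post=38)

# Counterfactual for the treated group absent the event: its baseline plus the control's change.
counterfactual = 22 + params[1]

print(params, counterfactual)  # prints: (33, -18, -11, 34) 4
```

The counterfactual of 4 makes the note above explicit: the treated outcome would have been expected to fall from 22 to 4, so the observed 38 implies an effect of 34.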