Customer Analytics (Practice) Final Exam – 70 minutes BU.450.760.K5
1. [8 points] Consider the following sample corpus from Yelp. Each row (review) is a
document. Assume the list of stopwords = c(“so”, “or”, “when”, “and”, “the”) and that
non-words include white space, punctuation, numbers, and symbols (e.g., $).
[1] All the food is great here. But the best thing they have is their wings. Their wings are simply fantastic!!
[2] This place is truly a Yinzer's dream!! \"Pittsburgh Dad\" would love this place n'at!!
[3] Wing sauce is like water. Pretty much a lot of butter and some hot sauce (franks red hot maybe).
[4] The whole wings are good size and crispy, but for $1 a wing the sauce could be better.
[5] The fish sandwich is good and is a large portion, sides are decent.
(1) [4 points] After removing non-words and stopwords, what is the TF-IDF score of the
term “good” in doc 4?
(2) [4 points] If we use this corpus to predict restaurants’ survival rate, is there a “wide X”
problem? Why or why not? Please explain.
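The preprocessing and scoring steps in this question can be sketched in Python as follows. The exam does not fix a TF-IDF variant, so this sketch assumes term frequency = (count of the term in the document) / (number of tokens left in the document) and inverse document frequency = ln(N / number of documents containing the term). It also assumes lowercasing and a simple letters-only tokenizer. Other conventions (raw counts, log base 2 or 10, smoothed IDF) give different numbers. The names `tokenize` and `tf_idf` are illustrative, not part of any library.

```python
import math
import re

# The five Yelp reviews from the question, in order (doc 1 .. doc 5).
docs = [
    "All the food is great here. But the best thing they have is their wings. Their wings are simply fantastic!!",
    "This place is truly a Yinzer's dream!! \"Pittsburgh Dad\" would love this place n'at!!",
    "Wing sauce is like water. Pretty much a lot of butter and some hot sauce (franks red hot maybe).",
    "The whole wings are good size and crispy, but for $1 a wing the sauce could be better.",
    "The fish sandwich is good and is a large portion, sides are decent.",
]
stopwords = {"so", "or", "when", "and", "the"}


def tokenize(text):
    # Assumption: lowercase, keep only runs of letters (this drops white space,
    # punctuation, numbers, and symbols), then remove the stopwords.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]


tokens = [tokenize(d) for d in docs]


def tf_idf(term, doc_index):
    # Assumed variant: relative term frequency times natural-log IDF.
    doc = tokens[doc_index]
    tf = doc.count(term) / len(doc)
    df = sum(1 for t in tokens if term in t)
    idf = math.log(len(tokens) / df)
    return tf * idf


score = tf_idf("good", 3)  # doc 4 has index 3
vocab = {w for t in tokens for w in t}

print(f"TF-IDF of 'good' in doc 4: {score:.4f}")
print(f"documents (rows): {len(docs)}, distinct terms (columns): {len(vocab)}")
```

Under these assumptions, doc 4 keeps 14 tokens and "good" appears in 2 of the 5 documents, so the score is (1/14) × ln(5/2). The last line compares the number of documents with the number of distinct terms, which is the comparison the "wide X" question turns on.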
2. [8 points] Suppose that you work for the marketing division of the athletic apparel
company Reebok. You are now discussing the allocation of advertising dollars. In a
meeting, the chart shown below is presented. This chart describes the relationship between
the number of times Facebook users see an ad for Reebok shoes (horizontal axis) and the
probability that users will purchase a pair of Reebok shoes after clicking on the link
(vertical axis). Your colleague presents this figure in a meeting, arguing that it provides
“undisputable evidence that advertising on Facebook pays-off” and that the company
should “probably increase the number of advertising dollars in this platform.” Do you
agree? Explain your argument.
NO COMMUNICATION WITH OTHERS IS PERMITTED
3. [8 points] The blue line in the graph below represents the outcomes of a set of units
affected by a shock (i.e., “as if” natural experiment) that unfolded in week 60 of the dataset
at hand. To evaluate the impact of this shock on treated units, we would like to implement a
diff-in-diff analysis, which requires us to select a control series. Our data contains two
candidate control series, controls #1 and #2, respectively shown by the red and green lines.
Which of these two controls would you select to implement the diff-in-diff analysis? Justify
your answer.
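For reference, the diff-in-diff calculation itself can be sketched as below. The series are hypothetical toy numbers, not values read from the graph, and the function names `did_estimate` and `pre_gap_spread` are illustrative. The sketch assumes the estimate is the change in the treated mean minus the change in the control mean around the shock, and it uses the spread of the treated-minus-control gap before the shock as a rough check of the parallel-trends assumption.

```python
from statistics import mean


def did_estimate(treated, control, shock_index):
    # (treated post - treated pre) - (control post - control pre)
    t_change = mean(treated[shock_index:]) - mean(treated[:shock_index])
    c_change = mean(control[shock_index:]) - mean(control[:shock_index])
    return t_change - c_change


def pre_gap_spread(treated, control, shock_index):
    # Range of the treated-minus-control gap before the shock.
    # A value of 0 means the two series moved exactly in parallel.
    gaps = [t - c for t, c in zip(treated[:shock_index], control[:shock_index])]
    return max(gaps) - min(gaps)


# Hypothetical toy data: the shock hits at index 4 and lifts the treated
# series by 4 units relative to its pre-shock trend.
treated = [10, 11, 12, 13, 18, 19, 20, 21]
parallel_control = [5, 6, 7, 8, 9, 10, 11, 12]       # same pre-shock slope
diverging_control = [5, 7, 9, 11, 13, 15, 17, 19]    # steeper pre-shock slope

for name, control in [("parallel", parallel_control), ("diverging", diverging_control)]:
    print(name,
          "pre-shock gap spread:", pre_gap_spread(treated, control, 4),
          "DiD estimate:", did_estimate(treated, control, 4))
```

In this toy example the control that tracks the treated series before the shock recovers the built-in effect of 4, while the control with a different pre-shock slope yields an estimate of 0.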