CS2035
Assignment
a) The population of the US in 1990 was estimated at 249.62 million. Verify that the
counties present in this data set cover more than 99% of the US population.
b) The state names are abbreviated with only two letters in the data set. With the
help of the file US-states-abbrev.txt, add a variable that displays the full name
of the state.
c) Generate a table that displays, for every state, the median value of the percentage
of votes at the county level, for each party. The table should look like this (use the
full name for states):
state      democrat   republican   Perot
Alabama    42.9       46.3         10.5
Arizona    …          …            …
…          …          …            …
Wyoming    …          …            …
d) Create a function plot_linreg(T, party, var) that takes three inputs: the table T
of the imported data set, a string party that names the variable holding the percentage of
votes received by a party, and a string var that names a
demographic variable of the data set. The function creates a scatter plot of the
two variables party and var together with the linear regression line. Indicate in the
plot title the values of the intercept, the slope and the unadjusted R^2. For example, the call
plot_linreg(T,"democrats","black") should produce a figure similar to Figure 1.
e) Using plot_linreg(), create a figure with 15 subplots. The subplots will be
organized in three rows, one for each political party, and five columns. The columns
will represent the following variables: crime, income, college, white and
black.
f) Explore the dataset and describe (preferably with a figure) something you find
interesting, that has not been covered in this exercise (it could be trends,
unexpected distribution, etc.).
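
As a starting point for part d), the three quantities that go in the plot title can be computed directly from the ordinary least-squares formulas. The sketch below is written in Python with only the standard library. The helper names linreg_summary and title_text are hypothetical, the title format is an assumption, and the plotting itself is left out.

```python
from statistics import mean


def linreg_summary(x, y):
    """Return (intercept, slope, r2) for the least-squares line y = intercept + slope * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    # Sums of squares and cross-products around the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each take at least two distinct values")
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Unadjusted R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r2 = 1 - ss_res / syy
    return intercept, slope, r2


def title_text(intercept, slope, r2):
    """Format the three values for a plot title (the format is an assumption)."""
    return f"intercept = {intercept:.2f}, slope = {slope:.2f}, R^2 = {r2:.2f}"
```

Here x would hold the demographic variable var and y the vote percentage party, one value per county.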
University of Western Ontario CS2035
Exercise 2 – Basketball Player Doping?
A basketball league is worried that one of its players is taking athletic
performance-enhancing drugs (“doping”). The league suspects that the player may have
started doping around the 40th game last season. The league is providing you with the
player’s points-per-game (PPG) for every match of last season in order to establish if the
apparent increase in this player’s PPG can be caused by chance alone. See the file named
basketball-ppg.csv.
a) Plot the time series of the player’s PPG as a function of the game number (game 1
is the first of the season, game 2 the second, and so on). Mark with a dashed
vertical line the time t∗ when the player is suspected to start doping.
b) Check visually that the distributions of PPGs before and after t∗ are approximately
normally distributed.
c) Check normality with the more formal Kolmogorov-Smirnov test.
d) Based on your findings about the normality of the PPGs, explain why a Z-test can be
used to test whether the distribution of PPGs after t∗ differs from the one before t∗. State the
Null Hypothesis.
e) Determine, at the 1% significance level, whether the change in the player’s
PPGs before and after t∗ is due to chance alone. Does your analysis support the league’s suspicion?
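
The arithmetic behind parts d) and e) can be sketched as a two-sample Z-test on the means. The Python sketch below uses only the standard library. The helper names split_at and two_sample_z_test are hypothetical. It assumes a two-sided alternative, sample variances used in place of the unknown population variances, and that game t∗ itself belongs to the "after" group; none of these choices is stated in the exercise.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def split_at(ppg, t_star):
    """Split a season into games 1..t_star-1 and games t_star..end (game numbers start at 1).

    Putting game t_star in the second group is an assumption.
    """
    return ppg[: t_star - 1], ppg[t_star - 1:]


def two_sample_z_test(before, after, alpha=0.01):
    """Two-sided Z-test of H0: the mean PPG is the same before and after.

    Returns (z, p_value, reject_h0). Each sample needs at least two values.
    """
    # Standard error of the difference of means, using sample variances.
    se = sqrt(variance(before) / len(before) + variance(after) / len(after))
    z = (mean(after) - mean(before)) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha
```

A call such as two_sample_z_test(*split_at(ppg, 40)) would then return the statistic, the p-value and the decision at the 1% level, where ppg is the list of points per game read from the data file.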
Exercise 3 – Sales Analysis
A large company would like to have a brief analysis regarding the sales of its new
division, Great Products Inc., that manufactures and sells electronic components. The
company has extracted from its main databases the sales records for Great Products Inc.
and has sent you the following files:
• db_cust_country.csv: The global list of their customers’ unique identification
numbers (not only customers of Great Products) and their countries of origin.
• db_cust_orders.csv: Sales orders fulfilled by Great Products Inc., showing the
order ID and the customer’s ID.
• db_order_ref.csv: The information that links a sales order ID with the reference
ID of the item sold as well as the quantity shipped.
• db_ref_price.csv: The unit price, in dollars, of an item given its reference ID.
a) Merge the information from all four files such that you end up with a table that
contains only the customers from Great Products Inc., their customer ID, the ID of
the order (the transaction), the country of origin of the customer, the reference ID
of the item they purchased, the quantity purchased, and the unit price of that item.
b) Display a breakdown by country of the total revenues generated by Great Products
Inc.
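
The merge in part a) amounts to a chain of inner joins keyed on order ID, customer ID and reference ID, and the breakdown in part b) is a grouped sum of quantity times unit price. The Python sketch below illustrates this on in-memory data with only the standard library. The function name revenue_by_country, the argument layouts and the dictionary keys are all hypothetical, since the column names of the four files are not given here. It also assumes that each order ID belongs to exactly one customer.

```python
from collections import defaultdict


def revenue_by_country(cust_country, cust_orders, order_ref, ref_price):
    """Join the four tables and total the revenue per country.

    cust_country: dict mapping customer ID -> country (global customer list)
    cust_orders:  list of (order ID, customer ID) pairs
    order_ref:    list of (order ID, reference ID, quantity) triples
    ref_price:    dict mapping reference ID -> unit price in dollars
    Returns (merged_rows, totals), where totals maps country -> revenue.
    """
    order_to_customer = dict(cust_orders)  # assumes one customer per order
    merged = []
    for order_id, ref_id, quantity in order_ref:
        customer_id = order_to_customer.get(order_id)
        # Inner-join semantics: skip rows with no match in another table.
        if customer_id not in cust_country or ref_id not in ref_price:
            continue
        merged.append({
            "customer_id": customer_id,
            "order_id": order_id,
            "country": cust_country[customer_id],
            "ref_id": ref_id,
            "quantity": quantity,
            "unit_price": ref_price[ref_id],
        })
    totals = defaultdict(float)
    for row in merged:
        totals[row["country"]] += row["quantity"] * row["unit_price"]
    return merged, dict(totals)
```

Because the join starts from the order lines, customers in the global list who never ordered from Great Products Inc. do not appear in the merged rows.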