EC116 SP Assignment
Instructions
Please read the following instructions carefully before you start!
1. There are five datasets available on Moodle. The last digit of your registration
number¹ determines which dataset you will use for this assignment.
• Students with reg number ending in 0 or 5: Use dataset1
• Students with reg number ending in 1 or 6: Use dataset2
• Students with reg number ending in 2 or 7: Use dataset3
• Students with reg number ending in 3 or 8: Use dataset4
• Students with reg number ending in 4 or 9: Use dataset5
For example, if your student registration number is 1400251, then you need to use
dataset2. If your student registration number is 1920349, you will use dataset5.
Please use your OWN dataset. Otherwise, you will get a ZERO mark.
2. You MUST name your R code file (for submission on FASER) using the dataset and
your student registration number, with an underscore in between.
For example, if your registration number is 1400251, you are using dataset2. Therefore,
you need to name your R code file as: dataset2_1400251.
3. You MUST make sure that your script runs without errors (do not worry about the
working directory; we will change it while marking). If your code does not run and
produce results, you will get a ZERO mark.
¹ This is the 7-digit student registration number.
4. You ONLY need to submit your R code file on FASER. In your code, please make
sure to
• Specify the question number (Question 1, Question 2, etc.) and part
letters (a, b, c, etc.)
• Include your interpretations and descriptive answers as comments in your
code.
You can either use # comments or print the answer in the Console. An example of what
your R code file should look like is provided below. Any additional materials will not be
marked.
5. Use the 5% significance level as the threshold in all questions where a
significance test is required.
6. Do NOT wait until the last minute to submit your assignment on FASER, as the
system might be overloaded.
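
As an illustration of instructions 1, 2 and 4, here is a minimal sketch of how a script could be laid out. The helper variables (reg_number, dataset_id, file_stem) and the placeholder interpretation text are assumptions made for this sketch and are not part of the assignment.

```r
# Hypothetical header: derive the dataset and file name from the registration number.
reg_number <- "1400251"                        # example number used in the instructions
last_digit <- as.integer(substr(reg_number, nchar(reg_number), nchar(reg_number)))
dataset_id <- last_digit %% 5 + 1              # 0/5 -> 1, 1/6 -> 2, 2/7 -> 3, 3/8 -> 4, 4/9 -> 5
file_stem  <- paste0("dataset", dataset_id, "_", reg_number)
print(file_stem)

# Question 1 ------------------------------------------------------------
# Part (a)
# ... code for part (a) goes here ...

# Part (b)
# ... code for part (b) goes here ...
# Interpretation: written as a comment like this one, or printed to the Console:
print("Q1(b): placeholder interpretation text")
```

For the example registration number 1400251, the last digit is 1, so the sketch selects dataset2 and builds the name dataset2_1400251.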
Brief Description of the Dataset
This is a cross-sectional dataset originating from the US National Medical Expenditure
Survey (NMES) conducted in 1987 and 1988. The data are a subsample of individuals
(ages 66 and over) with information on Medicare (a public insurance program
providing substantial protection against health-care costs) coverage.
In particular, the dataset consists of 2400 individuals and contains the following
variables:
• visits: Number of physician office visits.
• chronic: Number of chronic conditions.
• health: Factor indicating self-perceived health status. Levels are “poor”,
“average” (reference category), and “excellent”.
• age: Age in years.
• male: A dummy variable indicating gender. 1 for “Male” and 0 for “Female”.
• school: Number of years of education.
• income: Family income.
• insurance: Factor indicating whether the individual is covered by private
insurance. Answers are “Yes” or “No”.
• medicaid: Factor indicating whether the individual is covered by Medicaid. Answers are “Yes” or “No”.
Question 1 [30 Marks]
(a) [5 marks] Set up your working directory and import the dataset assigned to your
reg number (see the instructions on which dataset to use and how to name your R
code file).
(b) [7 marks] In order to better understand your data, it is always good to start with
some descriptive statistics!
- Compute the summary statistics for the number of physician office visits,
the number of chronic conditions and the family income (Hint: use function
“summary” to compute summary statistics).
- Draw the histograms for the number of physician office visits, the family
income, age and the number of years of education with frequency on the y-axis.
Make sure that your histograms are self-explanatory, i.e. include appropriate
graph titles and axis titles.
(c) [8 marks] Find the correlation coefficient matrix among the number of physician
office visits, the number of chronic conditions and the family income. In the
correlation coefficient matrix, round all the numbers to two decimal places and add
proper names to the rows and the columns. Interpret your results.
(d) [5 marks] Draw a scatter plot with the number of chronic conditions on the x-axis and
the number of physician office visits on the y-axis. Label the x- and y-axes accordingly
and set the title of the plot as your student registration number.
(e) [5 marks] Fit a linear regression line in BLUE on the scatter plot in part (d).
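
The descriptive steps in this question can be sketched on simulated stand-in data. This is only an illustration of the base R functions involved (summary, hist, cor, round, plot, abline). The data frame toy and all of its values are made up for this sketch and are not the NMES data.

```r
# Simulated stand-in data; the values are invented and are NOT the assignment data.
set.seed(116)
n <- 200
toy <- data.frame(chronic = rpois(n, 1.5),       # hypothetical chronic-condition counts
                  income  = rlnorm(n, 1, 0.5))   # hypothetical positive incomes
toy$visits <- rpois(n, 3 + 1.2 * toy$chronic)    # hypothetical visit counts

# Summary statistics for selected columns
vars <- c("visits", "chronic", "income")
print(summary(toy[, vars]))

# Histogram with frequency (counts) on the y-axis, plus a title and axis labels
hist(toy$visits, freq = TRUE,
     main = "Physician office visits (toy data)",
     xlab = "Number of visits", ylab = "Frequency")

# Correlation matrix rounded to two decimal places, with readable row/column names
cor_mat <- round(cor(toy[, vars]), 2)
dimnames(cor_mat) <- list(c("Visits", "Chronic", "Income"),
                          c("Visits", "Chronic", "Income"))
print(cor_mat)

# Scatter plot with a fitted regression line drawn in blue
plot(toy$chronic, toy$visits,
     xlab = "Number of chronic conditions",
     ylab = "Number of physician office visits",
     main = "1400251")                           # example registration number as title
abline(lm(visits ~ chronic, data = toy), col = "blue")
```

Calling abline on a fitted lm object draws the least-squares line on the existing plot, so it has to come after the plot call.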
Question 2 [30 Marks]
(a) [5 marks] Regress the number of physician office visits on the number of chronic
conditions, i.e. estimate the following linear regression model:
$$\text{visits}_i = \alpha_0 + \alpha_1 \cdot \text{chronic}_i + \varepsilon_i$$
Save the model as m1 and display the regression results (Hint: Use the “stargazer”
package).
(b) [10 marks] Interpret the estimated intercept (α0) and slope coefficient (α1) in the
model. Is the number of chronic conditions a significant predictor of the number of
visits? Explain your answer briefly.
(c) [5 marks] What is the R-squared (R²) of the model? What does it mean?
(d) [10 marks] Now add the log of the family income as a regressor in the regression
model and save the estimation results as m2, i.e. estimate the following model:
$$\text{visits}_i = \alpha_0 + \alpha_1 \cdot \text{chronic}_i + \alpha_2 \cdot \log(\text{income}_i) + \varepsilon_i$$
(Hint: You need to take the log of income by using the log function in R.)
- According to this new model, does family income significantly predict the
number of physician office visits?
- Can you make a comparison between regression models m1 and m2? Which
model would you prefer? Explain briefly.
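
The mechanics of fitting and comparing two nested models can be sketched as follows, again on simulated stand-in data. The data frame toy, the column logincome and all values are assumptions for illustration only. The stargazer call is guarded so that the sketch still runs when the package is not installed.

```r
# Simulated stand-in data; the values are invented and are NOT the assignment data.
set.seed(116)
n <- 200
toy <- data.frame(chronic = rpois(n, 1.5),
                  income  = rlnorm(n, 1, 0.5))   # strictly positive, so log() is safe here
toy$visits <- rpois(n, 3 + 1.2 * toy$chronic)

# Simple regression of visits on chronic
m1 <- lm(visits ~ chronic, data = toy)

# log() gives -Inf for zero and NaN for negative inputs, so check real income values first
toy$logincome <- log(toy$income)
m2 <- lm(visits ~ chronic + logincome, data = toy)

# Side-by-side table if stargazer is available, otherwise the base R summaries
if (requireNamespace("stargazer", quietly = TRUE)) {
  stargazer::stargazer(m1, m2, type = "text")
} else {
  print(summary(m1))
  print(summary(m2))
}

# p-values of the coefficients in m2 (compare against the 0.05 threshold)
print(coef(summary(m2))[, "Pr(>|t|)"])

# R-squared never falls when a regressor is added; adjusted R-squared can
r2     <- c(m1 = summary(m1)$r.squared,     m2 = summary(m2)$r.squared)
adj_r2 <- c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
print(round(rbind(r2, adj_r2), 4))
```

Because plain R² cannot decrease when a regressor is added, adjusted R² is one common basis for comparing models of different sizes.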
Question 3 [25 Marks]
(a) [10 marks] Convert the character variables (insurance and medicaid) into dummy
variables. Then estimate the following model and save your estimation results as
m3.
$$\text{visits}_i = \alpha_0 + \alpha_1 \cdot \text{chronic}_i + \alpha_2 \cdot \text{male}_i + \alpha_3 \cdot \text{insurance}_i + \alpha_4 \cdot \text{medicaid}_i + \alpha_5 \cdot \text{school}_i + \varepsilon_i$$
Display your results along with the models in Q2 (m1 and m2) using the “stargazer”
package.
(Note: If you fail to convert the character variables into dummy variables, run
the regression model without insurance and medicaid, but doing this will NOT give
you full marks.)
(b) [5 marks] From (a), which is the best model for estimating the number of physician
office visits? Explain your answer briefly (Hint: Pay attention to R²).
(c) [5 marks] Does medicaid encourage people to visit their physicians? Explain your
answer briefly. If yes, what is the effect of having medicaid on the number of
physician office visits? (Hint: Pay attention to the coefficient of medicaid).
(d) [5 marks] Based on model m3, what is the predicted number of physician office
visits for an individual described as follows:
- a male, i.e. $\text{male}_i = 1$
- without private insurance, i.e. $\text{insurance}_i = 0$
- without medicaid, i.e. $\text{medicaid}_i = 0$
- with only 1 chronic condition, i.e. $\text{chronic}_i = 1$
- with 10 years of schooling, i.e. $\text{school}_i = 10$
Show your calculation.
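
A minimal sketch of the two techniques used in this question, recoding a Yes/No variable as a 0/1 dummy and computing a fitted value by hand, on simulated stand-in data. The data frame toy, the column names insurance_d and medicaid_d, and all values are assumptions for illustration, not the assignment data or its results.

```r
# Simulated stand-in data; the values are invented and are NOT the assignment data.
set.seed(116)
n <- 200
toy <- data.frame(
  chronic   = rpois(n, 1.5),
  male      = rbinom(n, 1, 0.4),
  school    = sample(0:18, n, replace = TRUE),
  insurance = sample(c("Yes", "No"), n, replace = TRUE),
  medicaid  = sample(c("Yes", "No"), n, replace = TRUE)
)
toy$visits <- rpois(n, 3 + 1.2 * toy$chronic)

# "Yes" -> 1, "No" -> 0. The comparison works for character and factor columns.
# Check the actual spelling in real data first, e.g. with unique(toy$insurance).
toy$insurance_d <- as.integer(toy$insurance == "Yes")
toy$medicaid_d  <- as.integer(toy$medicaid == "Yes")

m3 <- lm(visits ~ chronic + male + insurance_d + medicaid_d + school, data = toy)

# Prediction by hand: intercept plus each coefficient times the chosen value
b <- coef(m3)
by_hand <- b["(Intercept)"] + b["chronic"] * 1 + b["male"] * 1 +
  b["insurance_d"] * 0 + b["medicaid_d"] * 0 + b["school"] * 10

# The same prediction via predict(), as a cross-check
new_person <- data.frame(chronic = 1, male = 1, insurance_d = 0,
                         medicaid_d = 0, school = 10)
via_predict <- predict(m3, newdata = new_person)

print(c(by_hand = unname(by_hand), via_predict = unname(via_predict)))
```

The two numbers should agree, which makes predict() a convenient check on a hand calculation.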
Question 4 [15 Marks]
(a) [10 marks] There is a categorical variable “health” in the dataset and one of your
colleagues thinks that people in better health will make fewer physician
visits. Can you test whether your colleague’s claim is correct? In other words, do
people in poor health visit physicians more than those in excellent
health? In order to answer this question, start by creating dummy
variables for people with poor health and also for people with average health. Then
add these variables into model m3, run the regression and save your new model as
m4. Display your results using the “stargazer” package and explain your answer
briefly.
(b) [5 marks] By using model m4, run two regressions separately for males and
females and display your results. Make sure you clearly indicate which model is for
males in the table (Hint: Use the “stargazer” package).
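
A minimal sketch of dummy coding a three-level factor and running the same regression on two subsamples, using simulated stand-in data. The data frame toy, the model names (m4_toy, m4_male, m4_female) and all values are assumptions for illustration. The regressor set is shortened compared with m4 to keep the sketch small.

```r
# Simulated stand-in data; the values are invented and are NOT the assignment data.
set.seed(116)
n <- 300
toy <- data.frame(
  chronic = rpois(n, 1.5),
  male    = rbinom(n, 1, 0.4),
  health  = sample(c("poor", "average", "excellent"), n, replace = TRUE)
)
toy$visits <- rpois(n, 3 + 1.2 * toy$chronic + 2 * (toy$health == "poor"))

# Two dummies for three levels; "excellent" is the omitted baseline, so the
# coefficient on poor compares poor health with excellent health.
toy$poor    <- as.integer(toy$health == "poor")
toy$average <- as.integer(toy$health == "average")

m4_toy <- lm(visits ~ chronic + male + poor + average, data = toy)

# Separate regressions by gender. male is constant within each subsample,
# so it is dropped from the formula.
m4_male   <- lm(visits ~ chronic + poor + average, data = subset(toy, male == 1))
m4_female <- lm(visits ~ chronic + poor + average, data = subset(toy, male == 0))

# Labelled columns if stargazer is available, otherwise the base R summaries
if (requireNamespace("stargazer", quietly = TRUE)) {
  stargazer::stargazer(m4_male, m4_female, type = "text",
                       column.labels = c("Males", "Females"))
} else {
  print(summary(m4_male))
  print(summary(m4_female))
}
```

With “excellent” as the baseline, a positive and significant coefficient on the poor dummy would support the colleague's claim. In this toy data that effect is built into the simulation.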