Causal Analysis in Data Science
POLS0012 Causal Analysis in Data Science
Please follow all designated Department of Political Science submission guidelines. These
may be different to those of your home department. You must submit one copy of your essay
via Turnitin.
The datasets for the essay can be found in the ‘Final Essay Part A Materials’ folder in the
‘Assessment Information and Materials’ section of Moodle.
The word limit for both Parts A and B is 3,000 words in total, excluding your R script
appendix (see below). You can divide the word limit as you like between the two parts, and
you will not be penalised for using more of your words in part B.
This is an assessed piece of coursework for the POLS0012 module; collaboration and/or
discussion with anyone is strictly prohibited. The rules for plagiarism apply and any cases of
suspected plagiarism of published work or the work of classmates will be taken very
seriously.
You may open the datasets and work on the essay questions at any time up until the
submission date. There is no limit on the number of times you may open the data files. Be
sure to save your data files and R script file.
You should include a copy of your R script as an appendix to your essay. FAILURE TO
INCLUDE THE R SCRIPT WILL INCUR A 10 POINT PENALTY. Note that your R
script file should be neatly presented and easy to follow, including comments indicating the
question being addressed. The essay answers should not contain any code.
Any tables or figures must be included within your answers to the essay, not in the code
appendix.
You may assume the methods you have used (e.g. a difference in means) are understood by
the reader and do not need definitions, but you do need to say which techniques you have
used and why.
As this is an assessed piece of work, you may not email/ask the course tutors for help with
the essay questions.
PART A: QUANTITATIVE QUESTIONS
QUESTION 1: The Long-Term Impact of the Slave Trade [25 Points]
This part uses data from the following paper, available in the assessment folder on Moodle:
Nathan Nunn (2008). “The Long-Term Effects of Africa’s Slave Trades.” Quarterly Journal
of Economics 123 (1): 139-176.
To help answer this question, first read the paper. Part of your task is to replicate and extend some
of Nunn’s results, which he produces using instrumental variables. If you are unable to exactly
reproduce Nunn’s results, report your best effort to do so. Whether or not you can exactly replicate
the paper’s findings, ensure that both your write-up and your R script clearly indicate how you
obtained your results. The dataset for this question is nunn.Rda. It contains these variables:
Variable name        Variable description
country              Country name
ln_realgdp2000       Log real per capita GDP in 2000, also called “ln y” in the paper
ln_export_area       Log total number of slaves exported, divided by land area
atlantic_dist        Sailing distance to nearest destination of Atlantic slave trade
indian_dist          Sailing distance to nearest destination of Indian slave trade
saharan_dist         Overland distance to nearest port of export for Saharan slave trade
redsea_dist          Overland distance to nearest port of export for Red Sea slave trade
colonial_power       Name of colonizer, if any, prior to independence
equator_dist         Distance from equator
longitude            Longitude
rain_min             Minimum monthly rainfall
humid_max            Average maximum humidity
low_temp             Average minimum temperature
ln_coastline_area    Log coastline divided by land area
low_distance         =1 if situated at a low distance from a major slave destination
high_slavery         =1 if country had a high level of slave exports
Note that low_distance and high_slavery do not feature in Nunn’s paper: they have been created for
this question. The other variables are identical to those used in the paper. You will also need to load
the AER package for this question.
Answer the following questions:
a) Run a simplified version of the instrumental variables analysis that Nunn uses in his paper,
using the Wald estimator and binary variables. Use the single binary instrumental variable
low_distance, the binary treatment variable high_slavery, and the outcome variable
ln_realgdp2000. You should:
i. Explain, in this case, what type of country is a complier and what type of country is
an always-taker.
ii. Calculate and report the proportion of compliers and the intent-to-treat effect.
iii. Use your answers from (ii) to calculate and report the Complier Average Causal
Effect (CACE) of high slavery on GDP.
iv. Use an appropriate method to calculate and report the p-value for this CACE
estimate.
v. Briefly interpret your results.
b) Turning now to the analysis that Nunn conducts in his paper, replicate the first-stage results
from the first column of Table IV on p.162. Report your results.
Note: You only need to produce the four coefficients and four standard errors.
c) Are the instruments in this paper subject to the weak instrument problem? What consequences
does this have, if any, for our interpretation of the results? Explain your answer, providing
evidence from the data.
d) Do you think that the instruments in this example satisfy the exclusion restriction assumption?
Briefly explain your answer.
e) Replicate the second-stage coefficients and standard errors for ln(exports/area) in columns (1),
(2) and (3) of Table IV on p.162 of the paper. Report your results and briefly interpret each of
the three estimated LATEs.
f) Why do you think that Nunn estimated the additional models in columns (2) and (3) of Table
IV that include coloniser fixed effects and geographic controls?
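For reference, the mechanics of the Wald estimator with a binary instrument and a binary treatment can be sketched on simulated data. This is only an illustration under stated assumptions: the variables z, d and y, the complier shares and the effect size are all invented, there are no defiers by construction, and nothing here is taken from nunn.Rda or from Nunn's paper.

```r
# Hypothetical simulated data; none of these values come from nunn.Rda.
set.seed(1)
n <- 1000
z <- rbinom(n, 1, 0.5)                       # binary instrument
# Assumed compliance types (no defiers, i.e. monotonicity holds by construction)
type <- sample(c("complier", "always", "never"), n, replace = TRUE,
               prob = c(0.5, 0.2, 0.3))
# Treatment received: always-takers take it, never-takers do not,
# compliers follow the instrument
d <- ifelse(type == "always", 1, ifelse(type == "complier", z, 0))
y <- 2 - 1.5 * d + rnorm(n)                  # outcome with an assumed effect of -1.5

# Intent-to-treat effect: difference in mean outcome by instrument value
itt <- mean(y[z == 1]) - mean(y[z == 0])
# First stage: difference in treatment uptake, the estimated share of compliers
prop_compliers <- mean(d[z == 1]) - mean(d[z == 0])
# Wald estimator of the complier average causal effect
cace <- itt / prop_compliers

# The same point estimate from a manual two-stage least squares fit
d_hat <- fitted(lm(d ~ z))
tsls <- unname(coef(lm(y ~ d_hat))["d_hat"])
```

With a single binary instrument the ratio and the two-stage point estimate coincide. The standard errors reported by the manual second-stage lm() are not valid for inference; a dedicated routine such as ivreg() in the AER package, called as ivreg(y ~ d | z), computes appropriate ones.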
QUESTION 2: A Simulated Experiment [25 Points]
This question analyses a simulated experimental dataset contained in the file “2022essay_q2.Rda”.
It includes 100 units and the following six variables:
Variable name   Variable description
y1              Potential outcome under treatment
y0              Potential outcome under control
R               Reporting status under treatment and control (=1 if reports data, 0 otherwise)
x               A baseline covariate
d               Treatment assignment (=1 if in treatment group, 0 if in control group)
outcome         Observed outcome (assuming no attrition occurred)
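As a reminder of how variables of this kind usually relate to one another, the sketch below builds a separate toy dataset. It assumes the standard potential outcomes set-up, in which the observed outcome equals y1 for treated units and y0 for control units. The variable names mirror the table, but every value, the sample design and the constant effect of 2 are invented for illustration and say nothing about the contents of the essay dataset.

```r
# Hypothetical simulated data; the names mirror the table but the values are invented.
set.seed(2)
n  <- 100
x  <- rnorm(n)                               # baseline covariate
y0 <- 1 + x + rnorm(n)                       # potential outcome under control
y1 <- y0 + 2                                 # potential outcome under treatment (assumed constant effect)
d  <- sample(rep(c(0, 1), each = n / 2))     # complete random assignment, 50 units per arm

# Observed outcome: y1 for treated units, y0 for control units
outcome <- d * y1 + (1 - d) * y0

# Average treatment effect over all units; requires both potential outcomes,
# which are only ever available together in simulated data
true_ate <- mean(y1 - y0)
# Difference in means; uses only what an experimenter would actually observe
diff_means <- mean(outcome[d == 1]) - mean(outcome[d == 0])
```

The two quantities generally differ in any single randomisation, because the difference in means depends on which units happened to be assigned to each group.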
For parts (a) to (d), we will assume that no attrition occurred. That means using all 100 units for our
calculations. Answer the following questions:
a) Assuming no attrition, is it likely that randomisation failed in this experiment? Provide evidence
from the dataset for your answer.
b) Again assuming no attrition, calculate and report:
i. The true average treatment effect in terms of potential outcomes
ii. The observed average treatment effect from the experiment