FEM11149 - Introduction to Data
In this final assignment you are not going to be given step-by-step instructions.
You are expected to know which techniques are needed to clean and subset data (if needed),
run models and their respective diagnostics.

What you should hand in
• A PDF file, generated using R Markdown, containing
  – A business report (max 4 pages) with your analysis
  – The report should follow the guidelines specified on Canvas, except for the code appendix (see below)
• A *.rmd file, used to generate the PDF file above

No attachments to the PDF file are allowed.
Deadline: 24/10, 23:59.

Final Assignment: investigating income inequality in Brazil

Congratulations! Because of your hard work for Gelukshuisje, you are now a data scientist known worldwide. The United Nations Office in Brazil is looking for a consultant for a data science job, and your former intern from Gelukshuisje, Ahsia, recommended you.

You are trusted with sample data from the last Brazilian National Census (2010). The dataset Brazil_data_census contains 4500 rows. Rows correspond to Brazilian cities, which are divided among the 26 states (estados/UF) plus the Federal District, where Brasília is located. In the 30 columns,
besides the identification of the city, you find several numerical indicators, ranging from total population to the proportion of houses with electricity.

The outcome of interest is the income appropriated by the richest 10% divided by the income of the poorest 40% within a city, identified as R1040 in the dataset (hereafter referred to as the 10/40 ratio). This compares the per capita income of the richest decile with that of the poorest two fifths and gives a notion of inequality. Column descriptions are in the file Brazilian_census_databook.xlsx.

The plan

You need to investigate what explains the 10/40 ratio. For that, you will build and compare two regression models:
• The first model will be a penalized regression using LASSO;
• The second model is a regression using PCA scores as explanatory variables, that is, principal component scores are used instead of the original variables to explain your outcome of interest in a linear regression model.

Minimum Requirements

You need to set aside a sample of 10 municipalities, for which you will perform an out-of-sample prediction exercise to compare the results from both models.

For both models, use the best practices you saw in the lectures. Specifically for PCA, in addition to three simple criteria, please use a permutation test to select the meaningful number of principal components. Moreover, apply a bootstrap procedure to Kaiser’s rule, i.e. test whether the variance explained by each component is significantly larger than 1. Use the results of your analysis to name and interpret the selected components.

Note that these are partial requirements and are not sufficient for a full grade. Everything should be explained and interpreted. Single results without interpretation will not be considered. Be aware that different criteria for selecting the principal components may lead to different conclusions; the final decision is up to you and needs to be justified.
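
The permutation test mentioned in the minimum requirements can be pictured with a small sketch. It is written in Python with NumPy purely for illustration (your submission must still be an R Markdown file), it runs on synthetic data rather than Brazil_data_census, and the function names, the number of permutations, and the 5% level are all assumptions, not choices prescribed by the course.

```python
import numpy as np

def correlation_eigenvalues(X):
    """Eigenvalues of the correlation matrix of X, largest first."""
    corr = np.corrcoef(X, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

def pca_permutation_test(X, n_perm=500, seed=0):
    """Permutation p-value for each eigenvalue (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    observed = correlation_eigenvalues(X)
    exceed = np.zeros_like(observed)
    for _ in range(n_perm):
        # Shuffle every column independently: the marginal distributions
        # are kept, but the correlations between columns are destroyed.
        X_perm = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(X.shape[1])]
        )
        exceed += correlation_eigenvalues(X_perm) >= observed
    # Share of permuted eigenvalues at least as large as the observed one.
    p_values = (exceed + 1) / (n_perm + 1)
    return observed, p_values

# Synthetic stand-in data (NOT the census file): two latent factors
# drive six indicators, so two components should stand out.
rng = np.random.default_rng(42)
factors = rng.normal(size=(300, 2))
loadings = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1]], dtype=float)
X = factors @ loadings + 0.5 * rng.normal(size=(300, 6))

observed, p_values = pca_permutation_test(X)

# Keep the leading components up to the first non-significant one.
alpha = 0.05  # assumed significance level
significant = p_values < alpha
n_keep = len(significant) if significant.all() else int(np.argmin(significant))
print("eigenvalues:", np.round(observed, 2))
print("p-values:   ", np.round(p_values, 3))
print("components kept:", n_keep)
```

Working on the correlation matrix means every variable is standardized, which is also what makes Kaiser's threshold of 1 meaningful. Whatever the test suggests still has to be weighed against the other criteria and justified in the report.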
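
One possible reading of "apply a bootstrap procedure to Kaiser’s rule" is sketched below: resample the rows with replacement, recompute the eigenvalues of the correlation matrix each time, and check whether the lower end of a percentile interval lies above 1. The sketch is in Python with NumPy for illustration only (the assignment itself is to be done in R), uses synthetic data, and the helper name, the 1000 replications, and the two-sided 95% interval are assumptions; a one-sided 5% test would use the 5th percentile instead.

```python
import numpy as np

def bootstrap_eigenvalues(X, n_boot=1000, seed=0):
    """Eigenvalues of the correlation matrix for each bootstrap resample
    (hypothetical helper; rows are resampled with replacement)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    boot = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample row indices
        corr = np.corrcoef(X[idx], rowvar=False)
        boot[b] = np.sort(np.linalg.eigvalsh(corr))[::-1]
    return boot

# Synthetic stand-in data (NOT the census file): two latent factors
# drive six indicators.
rng = np.random.default_rng(42)
factors = rng.normal(size=(300, 2))
loadings = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1]], dtype=float)
X = factors @ loadings + 0.5 * rng.normal(size=(300, 6))

boot = bootstrap_eigenvalues(X)
lower = np.percentile(boot, 2.5, axis=0)   # lower end of a 95% interval
upper = np.percentile(boot, 97.5, axis=0)  # upper end of a 95% interval

# A component passes the bootstrapped Kaiser rule when even the lower
# bound of its eigenvalue interval is above 1.
kaiser_keep = lower > 1.0
for i, (lo, hi, keep) in enumerate(zip(lower, upper, kaiser_keep), start=1):
    print(f"PC{i}: [{lo:.2f}, {hi:.2f}] keep={bool(keep)}")
```

An eigenvalue whose interval straddles 1 is exactly the borderline case the plain Kaiser rule hides, so it deserves an explicit comment in the report.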
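
The out-of-sample exercise with 10 held-out municipalities can be sketched as follows for the PCA-score regression. This is a minimal illustration in Python with NumPy on synthetic data, not the required R workflow: the variable names, the choice of k = 2 components, and RMSE as the comparison metric are assumptions. The LASSO predictions for the same 10 rows would be scored with the same function so that the two models are compared on equal terms.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data (NOT the census file).
n, p, n_test, k = 200, 5, 10, 2   # k = assumed number of components
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 0.0, 0.0, 0.3]) + rng.normal(scale=0.5, size=n)

# 1. Set aside 10 rows before any modelling decision is made.
test_idx = rng.choice(n, size=n_test, replace=False)
train_mask = np.ones(n, dtype=bool)
train_mask[test_idx] = False
X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[test_idx], y[test_idx]

# 2. Standardize with TRAINING means and standard deviations only,
#    so no information from the held-out rows leaks into the model.
mu = X_train.mean(axis=0)
sd = X_train.std(axis=0, ddof=1)
Z_train = (X_train - mu) / sd
Z_test = (X_test - mu) / sd

# 3. PCA on the training data via SVD; columns of W are the loadings.
_, _, Vt = np.linalg.svd(Z_train, full_matrices=False)
W = Vt[:k].T
S_train = Z_train @ W             # training scores
S_test = Z_test @ W               # held-out rows projected on the same axes

# 4. Linear regression of the outcome on the scores (with intercept).
A_train = np.column_stack([np.ones(len(S_train)), S_train])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
y_pred = np.column_stack([np.ones(n_test), S_test]) @ coef

def rmse(actual, predicted):
    """Root mean squared prediction error."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

pcr_rmse = rmse(y_test, y_pred)
print("held-out RMSE of the PCA regression:", round(pcr_rmse, 3))
```

With only 10 held-out cities the error estimate is noisy, so a difference between the two models should be interpreted with caution.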