STAT6118 Complex Survey Data Analysis
Assignment
- This assignment is worth 100% of the overall mark, which is 100, for STAT6118. The maximum mark per question as well as the total mark for each Task are given in bold within brackets. The presentation and clarity of your report are important. Marks can be deducted from the final mark for poor presentation and lack of clarity.
- The deadline for submission is 16.00 on Wednesday 15 May 2024.
- Your report must not exceed 15 pages including relevant tables, figures, and formulae. Appendices do not count towards the page limit. Appendices should be avoided, but can be used to provide computer outputs or code. The page limit is strict and is easily sufficient to receive full credit. Any pages beyond the page limit will not be marked.
- You must submit one electronic copy of your report as a single PDF document via the STAT6118 Blackboard website using the TurnItIn link. Scanned coursework assignments are not allowed. Make sure that you have 4 sections called Task 1, Task 2, Task 3, and Task 4. The subsections 1a, 1b, 1c, 1d, 1e, 2a, 2b, 2c, 2d, 2e, 3a, 3b, 3c, 3d, 4 should also be clearly labelled.
- Standard University policies and procedures will be followed for late submission, extensions and academic integrity (see the Module Outline for details).
- Remember that the University places the highest importance on maintaining academic integrity and expects all students to do the same. Hence, it is very important that you carefully read Section 5 (Academic Integrity and Referencing) of the module outline (available on Blackboard).
- It is the policy of the Department of Social Statistics that coursework is anonymous; therefore only your Student ID Number should appear on the Submission Form. To maintain anonymity, please do not put your name on any part of your submission.
- You can use R or STATA for the first 3 Tasks. For Task 4, you can use any statistical software.
Data set for Tasks 1–3
The data file called ESSr6 ES.dta is an extract from the Spanish European Social Survey (ESS) round 6 (2012), which is about personal wellbeing and democracy. The ESS is a multinational survey conducted across Europe since 2001. The data set is publicly available at the official ESS website. The data set contains all individuals who responded to the survey.
The sampling design is a stratified two-stage sampling design.
Stratification is based on the cross-classification of region of residence (18 regions) and population size groups (4 groups). Of the 72 theoretical strata, 63 are non-empty in the data set. Stratification is given by the variable stratify in the data set.
Primary sampling units (PSUs) are selected with probabilities proportional to the population aged 15+ and are identified by the variable psu in the data set.
Secondary sampling units (SSUs) are individuals, who are selected within each PSU and identified by the variable idno.
Achieved sample sizes:

| Items | National |
|---|---|
| Strata | 63 |
| PSUs | 422 |
| Individuals | 1,889 |
A description of the variables in the data set can be found in the Excel file “ESSr6 ES variables.xls” available on the Blackboard page of the module.
You should use R functions with the prefix svy or the STATA svy procedures wherever you take account of the sampling design in your analysis. Use the variable pspwght as the survey weight. You may come across a single-unit strata problem when you consider the sampling design. You can deal with this problem by using the options function in R or the singleunit option of svyset in STATA. For this assignment, treat such strata as if they were census strata, that is, as if the corresponding PSUs were selected with probability 1. In this way, such strata do not contribute to the variance estimates.
In R: run the following code once before doing any further analysis with the data:
options(survey.lonely.psu = "certainty")
In STATA:
svyset ... singleunit(certainty)
where the sampling design characteristics are specified in place of the dots (...).
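As a rough illustration of the R route, the sketch below declares a stratified two-stage design with the `survey` package and checks that a single-PSU stratum no longer stops variance estimation. It uses a small simulated data frame, not the ESS extract: the values are made up, and the outcome name `y` is hypothetical. Only the variable names stratify, psu and pspwght are taken from the description above.

```r
library(survey)

# Treat single-PSU strata as certainty strata (they add nothing to the variance).
options(survey.lonely.psu = "certainty")

# Toy data with made-up values; stratum "C" deliberately contains only one PSU.
set.seed(1)
toy <- data.frame(
  stratify = rep(c("A", "A", "B", "B", "C"), each = 4),
  psu      = rep(1:5, each = 4),
  pspwght  = runif(20, 0.5, 2),
  y        = rbinom(20, 1, 0.7)   # hypothetical binary outcome
)

# Declare strata, PSUs (clusters) and survey weights.
des <- svydesign(ids = ~psu, strata = ~stratify, weights = ~pspwght,
                 data = toy)

# Design-based estimate of a mean (here a proportion) with its standard error.
est <- svymean(~y, des)
print(est)
```

Without the `options()` call, the variance step would be expected to fail because stratum "C" has a single PSU.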
Task 1 (Testing) [25 marks]
For this task, you need to take the sampling design into account. You should explain how you took account of the sampling design, at least the first time you mention it (i.e. in part 1a).
In parts 1c–1d, for each statistical test you apply, please make sure that you clearly specify the null hypothesis, the type of statistical test you applied, the value of your test statistic, the statistical distribution against which it is being compared, the p-value (or critical value), and your conclusion. Use α = 0.05 as the significance level.
You should also briefly describe your R or STATA code.
1a) Estimate the proportion of people in the target population of Spain who voted in the last election. Use the variable voted_lastelection and the variable pspwght as the weight variable. Give an analytic expression of the estimator you used. [4 marks]
1b) Estimate the standard error of this proportion. What is the 95% confidence interval of this proportion? What is the assumption behind the confidence interval you used? [5 marks]
1c) Is this proportion of people who voted significantly different from a hypothesised proportion of 0.75? [4 marks]
1d) Estimate the difference in means of total time of watching TV (tvtot) between the youngest and the oldest age groups (agecat = 1 and agecat = 4, respectively). Is this difference statistically significant? [6 marks]
1e) Apply the Wald-F test and the adjusted Wald-F test for testing independence between legal marital status, marcat, and the vote status in the last election, voted_lastelection. Please make sure that you clearly specify the null hypothesis, the values of your test statistics, the degrees of freedom, the p-value, and your conclusion. Is the adjustment of the Wald-F statistic needed in this particular case? Explain why, or why not. [6 marks]
Task 2 (Linear regression) [30 marks]
For this task, satisfaction with government (stfgov) variable will be used as the outcome (dependent) variable. For the purpose of this task, this variable will be treated as a continuous numerical variable although it has discrete values ranging between 0 and 10.
For parts 2a)–2c), you should perform a multiparameter Wald test. This can be applied by using the regTermTest function in the survey package in R and the test command in STATA. An example of an application of this test is given as follows. Note that there is no sampling design involved in the example given below. Let y be the outcome variable, x be a numerical variable, z be a categorical variable with five categories, and data be the data of interest. Suppose we wish to test the significance of z.
In R:
fitlm <- lm(y ~ x + z, data)
regTermTest(fitlm, ~z)
In STATA:
xi: regress y x i.z
test _Iz_2 _Iz_3 _Iz_4 _Iz_5
where each number after _Iz_ refers to a non-reference category of the variable z. For example, _Iz_2 is the dummy variable temporarily generated by STATA for the second category of the variable z. Including all the categories except the reference category in the test command ensures that the variable is tested as a whole, rather than one individual category at a time.
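The same call pattern carries over when a sampling design is involved. The sketch below is only an illustration on simulated data: the names `toy`, `strat`, `psu` and `w` are hypothetical, and the design is assumed to be stratified with clustered PSUs. It fits the model with `svyglm` and tests the factor z as a whole with `regTermTest`.

```r
library(survey)

# Simulated data: 10 strata, 4 PSUs per stratum, 5 individuals per PSU.
set.seed(2)
n <- 200
toy <- data.frame(
  strat = rep(1:10, each = 20),
  psu   = rep(1:40, each = 5),
  w     = runif(n, 0.5, 2),
  x     = rnorm(n),
  z     = factor(sample(1:5, n, replace = TRUE))
)
toy$y <- 1 + 0.5 * toy$x + rnorm(n)

des <- svydesign(ids = ~psu, strata = ~strat, weights = ~w, data = toy)

# Design-based linear regression.
fit <- svyglm(y ~ x + z, design = des)

# Joint Wald test that all four dummy coefficients of z are zero.
wald_z <- regTermTest(fit, ~z)
print(wald_z)
```

The returned object stores the F statistic, the numerator and denominator degrees of freedom, and the p-value in `wald_z$Ftest`, `wald_z$df`, `wald_z$ddf` and `wald_z$p`, which is convenient when building a results table by hand.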
For parts 2a)–2c), DO NOT directly copy and paste your test outputs from R or STATA. Instead, please present your results in a table including the predictor name, the value of the F-test statistic, the degrees of freedom and the p-value. Otherwise, you will be penalised. Use α = 0.05 as the significance level in your tests.
DO NOT directly copy and paste model outputs from R or STATA. Instead, please present the results of your final model in part 2c) and your model results in part 2d) in a table including the predictor name with corresponding categories, parameter estimate, standard error estimate, t-value and p-value. Otherwise, you will be penalised.
You should also briefly describe your R or STATA code.
2a) Consider four possible predictors of the dependent variable satisfaction with the government (stfgov): a left-to-right political scale collapsed into three categories (lr3cat), gender (gndr), self-rated health status (health), and age (agea). Fit a set of bivariate linear regression models where stfgov is predicted by each of these predictors (one at a time).
Use an aggregated approach that takes the sampling design into account in your fit. You should treat lr3cat, gndr, and health as factor (categorical) variables. Apply a Wald test to test the significance of the variables. Give a general expression for the null hypothesis that is appropriate when testing regression parameters. Explain the symbols you used. Which variables are statistically significant? [7 marks]
2b) Fit the preliminary linear regression model with all the predictors that you have found significant in part 2a). Create a variable of squared age from the age variable (agea) and call this new variable agesq. Add agesq to your preliminary model and test its significance. Should it be retained in your model? [4 marks]
2c) Starting from the model with all the predictors you have retained in part 2b), test all two-way interactions, one at a time. If you have both agea and agesq in your model, you should treat both predictors together as the age variable when adding the interaction term between age and any other predictor in your model. Should the interaction terms be retained in your final model? Present your model results in a table. [7 marks]
2d) Fit a linear regression model for the variable stfgov with all the predictors you have retained in your final model in part 2c) without taking the sampling design into account. Present your model results in a table. What kind of approach is this one? Describe this approach by clearly expressing its key characteristics and the key assumptions behind it. Give one possible disadvantage of this approach. [6 marks]
2e) Compare the results of your models in parts 2c) and 2d) in terms of parameter estimates, standard error estimates and significance of model parameters. Are there any differences? [6 marks]
Task 3 (Logistic regression) [20 marks]
For this task, the binary variable high satisfaction with life (high_sat) will be used as the outcome (dependent) variable.
For parts 3a)–3b), DO NOT directly copy and paste your test outputs from R or STATA. Instead, please present your results in a table including the predictor name, the value of the F-test statistic, the degrees of freedom and the p-value. Otherwise, you will be penalised. Use α = 0.05 as the significance level in your tests.
DO NOT directly copy and paste model outputs from R or STATA. Instead, please present the results of your final model in part 3b) and your model results in part 3d) in a table including the predictor name with corresponding categories, parameter estimate, standard error estimate, t-value and p-value. Otherwise, you will be penalised.
You should also briefly describe your R or STATA code.
3a) Consider five possible predictors of the dependent variable high satisfaction with life (high_sat): age in four categories (agecat), gender (gndr), an indicator variable of very good health (good_health), marital status in three categories (marcat), and education level in four categories (educat). Taking the sampling design into account, examine the bivariate relationships between the outcome variable high_sat and the possible predictors by performing a test of association for each predictor. Use the Rao-Scott F-test (i.e. the second-order Rao-Scott adjusted Pearson chi-squared test statistic converted into an F-test statistic).
You should treat all predictors, except good_health, as factor (categorical) variables. Which variables are significantly associated with high_sat? [5 marks]
3b) Fit the preliminary logistic regression model with all the predictors that you have found to be significantly associated with high_sat in part 3a).
Use an aggregated approach that takes the sampling design into account in your fit. You should treat all predictors, except good_health, as factor (categorical) variables. Test each variable by performing a (multiparameter) Wald test (see Task 2 for a description of the application of this test in R or STATA). Which variables should be retained in your final model? Present your model results in a table. [5 marks]
3c) Briefly describe the aggregated approach that you were asked to use in part 3b). What is an alternative name for this approach? What are the key characteristics of this approach? Give one possible disadvantage of this approach. [5 marks]
3d) Re-fit the model you have chosen as your final model in part 3b) without allowing for the sampling design. Present your model results in a table. Compare the results of your model against those from the one in part 3b) in terms of parameter estimates, standard error estimates and significance of model parameters. Are there any differences? [5 marks]
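For the Rao-Scott F-test mentioned in part 3a), one possible route in R is `svychisq` from the `survey` package, whose `statistic = "F"` option gives the second-order Rao-Scott correction referred to an F distribution. The sketch below runs on simulated data only; the variable names (`a`, `b`, `strat`, `psu`, `w`) are hypothetical placeholders and the design is an assumption made for illustration.

```r
library(survey)

# Simulated data: 10 strata, 4 PSUs per stratum, 5 individuals per PSU.
set.seed(3)
n <- 200
toy <- data.frame(
  strat = rep(1:10, each = 20),
  psu   = rep(1:40, each = 5),
  w     = runif(n, 0.5, 2),
  a     = factor(sample(c("low", "mid", "high"), n, replace = TRUE)),
  b     = factor(sample(c("no", "yes"), n, replace = TRUE))
)
des <- svydesign(ids = ~psu, strata = ~strat, weights = ~w, data = toy)

# Second-order Rao-Scott test of association, F version.
rs_f <- svychisq(~a + b, design = des, statistic = "F")
print(rs_f)

# The same function also offers Wald-type tests of independence.
wald <- svychisq(~a + b, design = des, statistic = "Wald")
adjw <- svychisq(~a + b, design = des, statistic = "adjWald")
```

Each result is a standard `htest` object, so the statistic, degrees of freedom and p-value can be read from `$statistic`, `$parameter` and `$p.value`.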
Task 4 (Nonresponse) [25 marks]
The data set DataCPS.CSV is extracted from the September 1976 Current Population Survey in the USA. The units are individual persons. We assume that stratified simple random sampling has been used. The population size is N = 46 049. This does not correspond to any sub-population of the USA. It should be viewed as a fictitious population for the purpose of this assignment. The variables in this data set are as follows.
stratum: The stratum label. There are 3 geographical strata
area: Identifier for the compact geographic areas
person: Identifier for the person
age: Age of the person in years
agecat: Age categories created based on age: 1 = 19 years and under; 2 = 20-24; 3 = 25-34; 4 = 35-64; 5 = 65 years and over
race: 1 = non-black; 2 = black
sex: 1 = male; 2 = female
hour: Usual number of hours worked per week
wage: Usual amount of weekly wages (in 1976 US $). Contains missing values, labelled as NA
The stratum sizes are 12 279 for stratum 1, 18 420 for stratum 2 and 15 350 for stratum 3. Some population counts are given in the following table.
| Age | Non-black, Male | Non-black, Female | Black, Male | Black, Female | Total |
|---|---|---|---|---|---|
| ≤ 19 | 801 | 1 700 | 296 | 184 | 2 981 |
| 20 - 24 | 2 459 | 1 980 | 864 | 1 377 | 6 680 |
| 25 - 34 | 12 313 | 3 133 | 497 | 137 | 16 080 |
| 35 - 64 | 9 349 | 6 396 | 810 | 2 624 | 19 179 |
| ≥ 65 | 503 | 365 | 167 | 94 | 1 129 |
| Total | 25 425 | 13 574 | 2 634 | 4 416 | 46 049 |
Your aim is to estimate the population average of the variable “wage”. Note that the variable “wage” contains missing values. In your estimation, you need to address the following.
- Calculate the sampling design weights, and provide a formula for your calculation
- Adjust the design weights by taking account of nonresponse
- The assumptions about the response mechanism must be clearly stated and justified
- You should describe and justify the approach you adopted
- Provide your weighted estimate for the population average of the variable “wage”
Any statistical package can be used.
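To make the mechanics of the weighting steps concrete, here is a minimal base R sketch on a made-up ten-row sample (not DataCPS.CSV). It assumes stratified simple random sampling, so the design weight in stratum h is N_h / n_h. It then applies a weighting-class adjustment, with classes defined by sex purely for illustration. This is one possible approach among several, and it rests on the assumption that response is random within each class.

```r
# Stratum population sizes N_h as given in the text.
N_h <- c("1" = 12279, "2" = 18420, "3" = 15350)

# Toy sample with made-up values; NA marks a nonrespondent on wage.
toy <- data.frame(
  stratum = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  sex     = c(1, 2, 1, 2, 1, 1, 2, 2, 1, 2),
  wage    = c(200, NA, 250, 180, NA, 300, 210, 190, 260, NA)
)

# Design weight under stratified SRS: d = N_h / n_h.
n_h   <- table(toy$stratum)
key   <- as.character(toy$stratum)
toy$d <- as.numeric(N_h[key]) / as.numeric(n_h[key])

# Weighting-class adjustment: within each class, respondents' weights are
# inflated by (sum of d over the full sample) / (sum of d over respondents).
toy$resp <- !is.na(toy$wage)
tot_all  <- tapply(toy$d, toy$sex, sum)
tot_resp <- tapply(toy$d * toy$resp, toy$sex, sum)
adj      <- tot_all / tot_resp
toy$w    <- ifelse(toy$resp, toy$d * adj[as.character(toy$sex)], 0)

# Weighted mean of wage over respondents, using the adjusted weights.
mean_wage <- sum(toy$w * toy$wage, na.rm = TRUE) / sum(toy$w)
print(mean_wage)
```

A useful check is that both the design weights and the adjusted weights sum to N = 46 049 in this sketch. Other choices, such as classes built from the age, race and sex counts in the table above, follow the same pattern.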