Data Analysis
First Computing Assignment
One Predictor Linear Regression
Introduction
This assignment is due on Thursday, April 4. This report is worth 100 points. Please remember that there is a second project coming, so you should finish the first project as soon as possible. Please submit your project on the Class Brightspace as instructed below. Please submit your report of Project 1 (both parts) in one PDF file. Each student has one chance to resubmit the report before the deadline. Detailed submission information is given below.
Project 1 has two parts. There are three files for this project. Two are for part A, and one is for part B. The files are labeled with the last six digits of your Stony Brook ID number.
Part A
Part A is worth 40 points. The model for the Part A assignment is a first data and statistical processing task that a newly hired statistician might be given. Your report should address the issues that your future supervisor would want to know about: the number of observations, the fraction of missing data in the independent variable and in the dependent variable, and the imputation of missing data.
The two files for part A each contain a column for subject ID and a column for either the dependent variable value or the independent variable value. Your first task is to sort the two files by subject ID and merge them. You should not just use “cut and paste” to merge your data.
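As an illustration of a programmatic merge, here is a minimal sketch in Python using only the standard library. The column names `id`, `x`, and `y`, the use of `NA` or a blank cell to mark a missing value, and the sample file contents are all assumptions made for this example and are not taken from the assignment files. In R, `merge()` with `all = TRUE` serves a similar purpose.

```python
import csv
import io

def read_column(text, value_name):
    """Read a two-column CSV (subject ID, value) into {id: float or None}."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        raw = row[value_name].strip()
        # Assumption: blank cells and "NA" mark missing values.
        table[row["id"]] = float(raw) if raw not in ("", "NA") else None
    return table

def merge_by_id(iv, dv):
    """Outer-join two {id: value} tables and return rows sorted by subject ID."""
    # IDs are compared as strings here; convert them to int first if
    # the real IDs are numeric and of varying length.
    ids = sorted(set(iv) | set(dv))
    return [(sid, iv.get(sid), dv.get(sid)) for sid in ids]

# Hypothetical file contents, for illustration only.
iv_text = "id,x\n103,1.5\n101,2.0\n102,NA\n"
dv_text = "id,y\n101,4.1\n104,3.3\n103,\n"

merged = merge_by_id(read_column(iv_text, "x"), read_column(dv_text, "y"))
for row in merged:
    print(row)
```

An outer join is used so that a subject ID appearing in only one file is kept, with `None` in the other column, rather than silently dropped.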
Second, you are expected to deal with missing data. Your report should contain the count of subject IDs that had at least one independent variable value or dependent variable value. It should also include the count of subject IDs that had an independent variable value, the count that had a dependent variable value, and the count that had both an independent and a dependent variable value.
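The counts described above could be tabulated along the following lines. This is a sketch only: it assumes the merged data is a list of `(subject_id, x, y)` tuples with `None` marking a missing value, and it assumes the missing fractions are taken relative to all subject IDs in the merged table. The function name and the sample rows are hypothetical.

```python
def missing_data_counts(rows):
    """Summarize missingness for rows of (subject_id, x, y); None marks missing."""
    rows = list(rows)
    n_ids = len(rows)
    with_x = sum(1 for _, x, y in rows if x is not None)
    with_y = sum(1 for _, x, y in rows if y is not None)
    with_both = sum(1 for _, x, y in rows if x is not None and y is not None)
    with_either = sum(1 for _, x, y in rows if x is not None or y is not None)
    return {
        "ids": n_ids,
        "with_x": with_x,
        "with_y": with_y,
        "with_both": with_both,
        "with_either": with_either,
        # Assumption: fractions are relative to all subject IDs in the merged table.
        "frac_missing_x": 1 - with_x / n_ids,
        "frac_missing_y": 1 - with_y / n_ids,
    }

# Hypothetical merged rows, for illustration only.
rows = [
    ("101", 2.0, 4.1),
    ("102", None, None),
    ("103", 1.5, None),
    ("104", None, 3.3),
]
counts = missing_data_counts(rows)
for name, value in counts.items():
    print(name, value)
```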
Your second task is to impute the missing values. There are many procedures for handling missing data. Often a statistical package has imputation algorithms built in. For example, R has a package called mice (MICE) that has several options. You may not choose listwise deletion or mean imputation (or its equivalent, median imputation). Specify your choice in your report.
Often, the choice of imputation method has little effect on the results if the fraction of missing data is 30% or less.
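One method that is neither listwise deletion nor mean imputation is stochastic regression imputation: fit a line on the complete cases, then fill each missing value with the predicted value plus random noise whose standard deviation matches the residuals. The sketch below is one possible illustration and is not the procedure the assignment prescribes. The function names, the sample rows, and the decision to leave rows with both values missing untouched are assumptions.

```python
import math
import random

def fit_line(xs, ys):
    """Least squares fit of ys on xs; returns (intercept, slope, residual_sd)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, math.sqrt(sse / (n - 2))

def impute(rows, seed=0):
    """Fill a missing x or y by regression on the other variable plus noise."""
    rng = random.Random(seed)  # fixed seed so the imputation is reproducible
    complete = [(x, y) for _, x, y in rows if x is not None and y is not None]
    xs, ys = zip(*complete)
    a_y, b_y, s_y = fit_line(xs, ys)  # model for y given x
    a_x, b_x, s_x = fit_line(ys, xs)  # model for x given y
    filled = []
    for sid, x, y in rows:
        if x is not None and y is None:
            y = a_y + b_y * x + rng.gauss(0.0, s_y)
        elif x is None and y is not None:
            x = a_x + b_x * y + rng.gauss(0.0, s_x)
        # Rows with both values missing carry no information and are left as-is.
        filled.append((sid, x, y))
    return filled

# Hypothetical merged rows, for illustration only.
rows = [
    ("1", 1.0, 2.1), ("2", 2.0, 3.9), ("3", 3.0, 6.2), ("4", 4.0, 7.8),
    ("5", 5.0, None), ("6", None, 12.0), ("7", None, None),
]
filled = impute(rows, seed=1)
for row in filled:
    print(row)
```

Adding noise keeps the imputed points from lying exactly on the fitted line, which would otherwise understate the residual variance in the later regression.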
Part B
Part B is worth 60 points. The data file for part B contains one line for each subject ID. The line will contain the subject ID, the value of the independent variable, and the value of the dependent variable. A transformation of either IV or DV or both may be required. You should read the textbook (Chapter 11.1) for suggestions on fitting a model.
An approximate lack of fit (LOF) test should be applied. It is your responsibility to find repeated (or near repeated) independent variable values. That is, you will have very few exact repeats of an independent variable value. You should bin near repeated data into one level. For example, suppose that x1 = 1.01, x2 = 1.02, x3 = 1.03 and y1 = 2, y2 = 3, y3 = 4. While there are no exactly repeated x values, you could bin these points into one group of nearly repeated points. That is, choose the average x-value as the value of x after binning. Then your binned data would be x1 = 1.02, x2 = 1.02, x3 = 1.02 and y1 = 2, y2 = 3, y3 = 4. Now perform a LOF test on the data set after binning all near repeated values. There is software in R that performs an approximate lack of fit test.
Often a transformation does not improve the apparent extent to which the data satisfy the assumptions of Chapter 11. Please compare the r-squared of the data as given to the r-squared of the data after you transform it. Also, please check the residual plot of the data. It may be helpful to apply these checks to the data in part A.
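The binning step and the lack of fit F statistic can be sketched as follows. The bin `width` is a hypothetical tuning parameter that you would choose by inspecting your own data, and the function names are illustrative. The sketch returns only the F statistic and its degrees of freedom; the p-value would come from an F distribution with (c - 2, n - c) degrees of freedom, where c is the number of distinct binned x levels and n is the number of observations.

```python
def bin_near_repeats(xs, width):
    """Replace each run of x values lying within `width` of the run's smallest
    value by the run's mean. Returns the binned values in the original order."""
    if not xs:
        return []
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    binned = list(xs)

    def flush(group):
        mean = sum(xs[i] for i in group) / len(group)
        for i in group:
            binned[i] = mean

    group = [order[0]]
    for i in order[1:]:
        if xs[i] - xs[group[0]] <= width:
            group.append(i)
        else:
            flush(group)
            group = [i]
    flush(group)
    return binned

def lack_of_fit(xs, ys):
    """Return (F, df_lack_of_fit, df_pure_error) for a straight-line model."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    c = len(groups)
    if c < 3 or n == c:
        raise ValueError("need at least 3 x levels and at least one replicated level")
    # Pure error: variation of y around its own group mean.
    ss_pe = sum(sum((y - sum(g) / len(g)) ** 2 for y in g) for g in groups.values())
    # Residual sum of squares from the least squares line.
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    ss_lof = sse - ss_pe
    return (ss_lof / (c - 2)) / (ss_pe / (n - c)), c - 2, n - c

# The binning example from the text: three near repeats collapse to their mean.
print(bin_near_repeats([1.01, 1.02, 1.03], width=0.05))

# Hypothetical data with near-repeated x values, for illustration only.
x_raw = [0.99, 1.01, 1.99, 2.01, 2.99, 3.01]
y_obs = [1.0, 3.0, 5.0, 7.0, 3.0, 5.0]
f_stat, df_lof, df_pe = lack_of_fit(bin_near_repeats(x_raw, width=0.05), y_obs)
print(f_stat, df_lof, df_pe)
```

A large F statistic relative to the reference distribution suggests that a straight line does not describe the group means well, which is one signal that a transformation may be worth trying.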
Report
You must submit a one-page report on Part A and a one-page report on Part B, both in one single PDF file. Each report should have four sections.
1. Introduction. The introduction should contain a statement of the problem and the objective of the paper. Some of the questions that you should answer are: What is the objective of your effort? What are your research questions? What is the background of this work? The introduction is easy: your problem is to recover the function that was used to generate the dependent variable value based on the value of the independent variable.
2. Methods. The second section should describe your methodology. Specifically, how were the files merged? What was the program used to perform the statistical analysis? What were the statistical techniques used? Did you use linear regression? Did you use additional procedures such as an approximate lack of fit test? How much missing data was present in the data? What procedure did you use to deal with missing data?
3. Results. The third section should contain your results: What fraction of the variation of the dependent variable was explained? What was the analysis of variance table? What was the fitted function? What was the confidence interval for the slope? What was the conclusion to the test of the null hypothesis that the slope was zero?
4. Conclusions and Discussion. The fourth section should be conclusions and discussion. This section should focus on “big picture” issues. Was there an association between the variables? How important was it? That is, what was the r-squared value? What is your fitted function? You may submit a longer appendix of computer work and programs.
You are allowed an appendix to your report and there is no page limit on the appendix. If you include a table or figure, you must discuss it. Tables and figures should be numbered and titled.
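The quantities requested in the Results section (fitted function, r-squared, analysis of variance table, slope confidence interval, and the test that the slope is zero) all follow from a few sums of squares. The sketch below is a minimal illustration under stated assumptions: the function name and sample data are hypothetical, and the critical value `t_crit` must be supplied by you from a t table with n - 2 degrees of freedom (3.182 is the two-sided 95% value for 3 degrees of freedom).

```python
import math

def simple_regression(xs, ys, t_crit):
    """One-predictor least squares summary; t_crit comes from a t table (n - 2 df)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sst = sum((y - my) ** 2 for y in ys)
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    ssr = sst - sse
    mse = sse / (n - 2)
    se_slope = math.sqrt(mse / sxx)
    return {
        "intercept": intercept,
        "slope": slope,
        "r_squared": ssr / sst,
        # ANOVA rows as (degrees of freedom, sum of squares).
        "anova": {
            "regression": (1, ssr),
            "error": (n - 2, sse),
            "total": (n - 1, sst),
        },
        "f": ssr / mse,
        "t": slope / se_slope,  # test statistic for H0: slope = 0
        "slope_ci": (slope - t_crit * se_slope, slope + t_crit * se_slope),
    }

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = simple_regression(xs, ys, t_crit=3.182)
for name, value in fit.items():
    print(name, value)
```

To compare a transformation, the same function can be called on the transformed values (for example, the logarithm of a positive dependent variable) and the two `r_squared` values set side by side.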