Multivariate Analysis and Data Mining
STAT 326/426: Multivariate Analysis and Data Mining
Final Exam
I hereby attest that this is my original work and that I have not discussed the contents of
this exam nor asked for or received help from anyone other than the instructor.
2. Use this page as a cover sheet for your exam with the declaration above signed.
3. You need to upload the exam together with this signed cover page in a single file on Canvas.
4. This exam consists of 6 problems worth a total of 162 points for both classes.
5. A copy of the exam, in a Word file, is also available under “Files/Final Exam” in Canvas. You can use it to
type or write your solutions.
6. Show your work clearly and in detail to receive full credit. Presentation will be part of your grade.
7. When R is used, it is ideal to include your code and output within each part of your solutions. But if you
are writing the solutions by hand on a separate piece of paper, then include all your code and output in
an appendix at the end. In this latter case, you must clearly label which code and output belong to which
part of which problem.
8. Data are posted under “Files/Final Exam” in Canvas. They are given in txt files and the column names
are in the files. Remember to use read.table("filename", header=T) when you load the data into R.
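As a minimal sketch of the loading step in item 8, the snippet below writes a small stand-in file and reads it back with read.table. The temporary file, its column names (x1, x2, group), and its values are hypothetical placeholders, not the contents of the posted exam data; with a real file, pass its name to read.table instead.

```r
# Hypothetical stand-in for one of the posted txt files: the first line
# holds the column names, the remaining lines hold the observations.
tmp <- tempfile(fileext = ".txt")
writeLines(c("x1 x2 group",
             "3.10 520 1",
             "2.85 470 2"), tmp)

# header = TRUE tells read.table to take the column names from the first line.
dat <- read.table(tmp, header = TRUE)

# Quick checks that the file was read as intended.
str(dat)    # column names and types
head(dat)   # first few rows
```

Spelling out header = TRUE rather than header = T is slightly safer, because T is an ordinary variable in R and can be reassigned.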
Good luck!
1.(27 points) Consider the three-dimensional random vector X = (X1, X2, X3)’ where
X1 = residential tenure (number of years lived in a neighborhood)
X2 = feelings towards racially integrated neighborhoods (measured on a 100-point scale
from -50 to +50, with -50 implying strong negative feelings and +50 implying
strong positive feelings about racial integration), and
X3 = education level (number of years)
Assume that X ~ N3 (μ, ∑), where and .
a.(5 pt) Derive the joint distribution of .
b.(2 pt) For defined in part (a), are they independent? Why or why not?
c.(10 pt) Specify the conditional distribution of X2 given X1 = x1 and X3 = x3.
d.(5 pt) Find the probability that a person with a residential tenure of 8 years and an education level of 11 years will have a feeling score (towards racially integrated neighborhoods) greater than 20.
e.(5 pt) Determine the constant-density ellipsoid that encloses 90% of the probability for X. You do NOT need to present the ellipsoid in quadratic form, nor do you need to draw it.
2. (2 points) Let be two univariate normal random variables. Is it generally true
that are jointly normally distributed?
3. (33 points total) A pharmaceutical company claims to have developed a drug that simultaneously
reduces cholesterol levels in the human body and causes weight loss. A random sample of n = 30
individuals from a population of patients is obtained, and each individual’s change in cholesterol
level (X1) and body weight (X2) are recorded after taking the new drug for a year.
Assume X1 and X2 jointly follow a bivariate normal distribution. The sample mean vector and the sample
variance-covariance matrix are given as and .
Let μ1 and μ2 be the average cholesterol level change and the average weight change, respectively,
for the population of patients after taking the new drug for a year.
a.(8 pt) Construct a 95% T² confidence region for μ = (μ1, μ2)’. Note: Presenting the region as a quadratic
form is sufficient; you do NOT need to draw the confidence region.
b.(10 pt) At a level of significance of 0.05, can you conclude that the new drug has an impact (in either direction) on the population mean cholesterol level and/or body weight? State your hypotheses, test statistic to be used, rejection region, your work towards a statistical decision and finally, your conclusion.
c.(4 pt) Can you use the confidence region that you determined in part (a) to conduct the test in part (b)? What final conclusion can you draw if you do so? Explain your answers in detail.
d. (6 pt) Construct the 95% simultaneous T² confidence intervals for μ1 and μ2 by hand.
e. (2 pt) Interpret your results in part (d).
f.(3 pt) Discuss the advantage(s) and disadvantage(s) of the 95% simultaneous T² confidence intervals for
μ1 and μ2 that you constructed in part (d) compared with the 95% individual t intervals for μ1 and μ2. You do NOT need to construct the individual t intervals; only discuss them.
4. (26 points) Let X = (X1, X2)’ be a 2-dimensional random vector with covariance matrix .
a.(8 pt) Find the principal components Y1 and Y2 based on ∑. If R is used for the calculation, show the R code and output below.
b. (4 pt) For the first principal component that you derived in part (a), do both variables X1 and X2 appear to be important? Why or why not? Explain your reasoning.
c.(4 pt) Calculate the covariance between the principal components Y1 and Y2 that you derived in
part (a). Is it what it is supposed to be? Explain. Show the R code and output below if R is used for
the calculation.
d.(2 pt) Find the proportion of total population variance due to the first principal component.
e. (2 pt) Determine m, a reasonable number of principal components that can describe the data. Justify your choice.
f. (4 pt) Determine the correlation coefficients between Y1 and Xk, k = 1, 2.
g.(2 pt) Are your answers in part (f) consistent with your answer in part (b)? Explain.
5. (30 points) In this problem, you will perform principal components analysis (PCA) on national track records in different events for men in 54 countries. The data are a few years old and may not include new records. The data file “Men.txt” can be found under “Files/Final Exam” in Canvas.
Note: Except for part (e), all the calculations need to be done by using R.
a.(3 pt) Determine whether to do PCA on the covariance or correlation matrix. Justify your choice.
b.(4 pt) Describe the linear relationship between the two variables X100m and X800m. What evidence supports your description?
c.(2 pt) Use R to perform PCA. Show only the R code below.
d.(4 pt) Give the first principal component function based on your work in part (c). Show the R code and relevant output below.
e.(2 pt) Calculate by hand the first principal component score for Australia based on
your answer in part (d).
f.(4 pt) A single line of R code will present the first principal component scores for all countries in the data set
based on your work in part (c). What is this line of code? Give the first principal component
score for Australia from this line of code. Is this score the same as your answer in part (e)?
Show the R code and relevant output below.
g.(2 pt) What is the proportion of total sample variance due to the second principal component? Show the R code and output below.
h.(2 pt) Determine m, a reasonable number of principal components that can describe these data and justify your choice. Do NOT use a scree plot.
i.(3 pt) Use R to draw a scree plot to determine m, a reasonable number of principal components that can describe the data. Justify your choice. Show the R code and output below.
j.(2 pt) Which nation has the highest first principal component score? What score is it?
k.(2 pt) Which nation has the lowest first principal component score? What score is it?
Does this ranking correspond with your intuitive notion of athletic excellence for this country?
6. (44 points) The admission officer of a business school has used undergraduate grade point average (GPA) and graduate management aptitude test (GMAT) scores to help decide which applicants should be admitted to the school’s graduate program. The data “Admission.txt” gives the GPA, GMAT score and the coded admission status (1 for admitted and 2 for non-admitted) for 59 recent applicants. The goal is to develop a rule for separating the applicants between the admitted and the non-admitted groups based on their GPA and GMAT scores. The admission officer can then use this rule to predict the admission status for a new applicant. Answer questions (a) – (o) below.
The data file “Admission.txt” can be found under “Files/Final Exam” in Canvas.
a. (10 pt) Use a method to assess the bivariate normality assumption for the joint distribution of GPA and GMAT for the admitted group and the non-admitted group, respectively. The methods used for the two groups can be different. (Usually, an appropriate data transformation can be applied if the normality assumption is suspect, but you do not need to transform the data for this problem.) Show the R code and output below.
b.(4 pt) What is the linear discriminant function given by formula (11-19)? R can be used for the calculation. Show the R code and output below if R is used.
c. (4 pt) Assume equal costs and let the prior probability, 𝑝1, of admitted applicants be 0.2 and the prior probability, 𝑝2, of non-admitted applicants be 0.8. Set up the classification regions R1 and R2 based on the classification rule (11-18).
d. (2 pt) What are the assumptions for applying the classification rule (11-18)?
e.(2 pt) Construct the classification regions R1 and R2 based on Fisher’s linear discriminant analysis.
f.(2 pt) What are the assumptions for applying Fisher’s classification rule?
g.(2 pt) Classify the admission status for an applicant who has a GPA of 3.21 and a GMAT score of 498, using the classification rule that you established in part (c).
h.(2 pt) Classify the admission status for an applicant who has a GPA of 3.21 and a GMAT score of 498, using the classification rule that you established in part (e).
i.(3 pt) Assume the assumption(s) are met. Use a built-in function (in the MASS library) in R and the entire data set to carry out Fisher’s linear discriminant analysis. What are the resulting coefficients for the first discriminant function? Show the R code and output below.
j.(2 pt) Use R to predict the admission status for all 59 applicants in the data set. Show the R code below. Also give the R code that will allow you to see the predicted status of these 59 applicants.
k.(2 pt) Use R to construct the confusion matrix for the model fitted in part (i). Show the R code and output below.
l.(2 pt) Calculate the apparent error rate using the confusion matrix that you constructed in part (k).
m.(2 pt) Use R to find the observations in the data that are misclassified into the non-admitted group when they actually belong to the admitted group. Show the R code and output below.
n.(2 pt) Use R and the leave-one-out cross-validation procedure (called Lachenbruch’s holdout procedure in the book) to carry out Fisher’s linear discriminant analysis. Show the R code below for doing this.
o.(3 pt) Use R and the fitted model in part (n) to estimate the expected actual error rate (you can find the confusion matrix first). Show the R code and output below.