For this assignment you will write an R program to complete the tasks given below. You will hand in two files for this assignment.
A file with your R program.
A PDF/DOC file with your code and output.
Use the following files:
R Data Set: HMEQ_Scrubbed.csv (in the zip file attached).
The Data Dictionary in the zip file.
Note: The HMEQ_Scrubbed.csv file is a simple scrubbed file from the previous week's homework. If you did more advanced scrubbing of the data last week, you may use your own data file instead. You might get better accuracy! If you decide to use your own version of HMEQ_Scrubbed.csv, please hand it in along with the other deliverables. This assignment is an extension of the Week 6 assignment. The difference is that this assignment will now incorporate PCA and tSNE analysis.
Use only the input variables. Do not use either of the target variables.
Use only the continuous variables. Do not use any of the flag variables.
Select at least 4 of the continuous variables. It would be preferable if there were a theme to the variables selected.
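The selection rules above can be sketched as follows. This is a minimal illustration, not the required solution: the data frame is simulated so the code runs without the zip file, and every column name and prefix (TARGET_, M_, FLAG_, IMP_) is an assumption about how the scrubbed file is laid out. Check the Data Dictionary for the real names.

```r
# Simulated stand-in for HMEQ_Scrubbed.csv. All column names are assumptions.
# With the real file you would use: df <- read.csv("HMEQ_Scrubbed.csv")
set.seed(1)
n <- 200
df <- data.frame(
  TARGET_BAD_FLAG = rbinom(n, 1, 0.2),          # target (must not be used)
  TARGET_LOSS_AMT = round(runif(n, 0, 20000)),  # target (must not be used)
  LOAN            = round(runif(n, 5000, 50000)),
  IMP_MORTDUE     = round(runif(n, 20000, 150000)),
  M_MORTDUE       = rbinom(n, 1, 0.10),         # missing-value flag
  IMP_VALUE       = round(runif(n, 40000, 250000)),
  M_VALUE         = rbinom(n, 1, 0.05),         # missing-value flag
  IMP_DEBTINC     = runif(n, 10, 45),
  IMP_CLNO        = rpois(n, 20)
)

# Drop target and flag columns by (assumed) naming convention.
is_target <- grepl("^TARGET_", names(df))
is_flag   <- grepl("^(M_|FLAG_)", names(df))
inputs    <- df[, !is_target & !is_flag]

# Pick at least four continuous inputs that share a theme,
# here "size of the borrower's financial obligations".
pca_vars <- c("LOAN", "IMP_MORTDUE", "IMP_VALUE", "IMP_DEBTINC")
df_pca   <- inputs[, pca_vars]
str(df_pca)
```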
Do a Principal Component Analysis (PCA) on the continuous variables.
Display the Scree Plot of the PCA.
Using the Scree Plot, determine how many Principal Components you wish to use. Note, you must use at least two. You may decide to use more. Justify your decision. Note that there is no wrong answer. You will be graded on your reasoning, not your decision.
Print the weights of the Principal Components. Use the weights to tell a story on what the Principal Components represent.
Create a scatter plot of the first two Principal Components. Do not color the dots. Leave them black.
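The PCA tasks above might look like the sketch below, which uses base R's prcomp. The data are simulated and the column names are assumptions, so the printed weights will not match what you get from HMEQ_Scrubbed.csv. Scaling the variables is a choice made here because the assumed columns are on very different scales; the methods shown in the lectures may differ.

```r
# Simulated continuous inputs. Column names are assumptions.
set.seed(42)
n    <- 200
size <- rnorm(n)  # shared factor so the first three variables are correlated
df_pca <- data.frame(
  LOAN        = 20000  + 6000  * size + rnorm(n, sd = 3000),
  IMP_MORTDUE = 70000  + 20000 * size + rnorm(n, sd = 10000),
  IMP_VALUE   = 100000 + 30000 * size + rnorm(n, sd = 12000),
  IMP_DEBTINC = 33 + rnorm(n, sd = 8)
)

# PCA on centered and scaled variables.
pca <- prcomp(df_pca, center = TRUE, scale. = TRUE)

# Scree plot: variance of each component, in order.
screeplot(pca, type = "lines", main = "Scree plot")

# Proportion of variance explained, to support the choice of component count.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(cumsum(var_explained), 3))

# Weights (loadings) of each variable on each component.
print(round(pca$rotation, 3))

# Scatter plot of the first two components, dots left black.
pcs <- as.data.frame(pca$x[, 1:2])
plot(pcs$PC1, pcs$PC2, pch = 16, col = "black",
     xlab = "PC1", ylab = "PC2", main = "First two principal components")
```

Reading down a column of the weight matrix shows which variables move together on that component, which is the starting point for the story the assignment asks for.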
Use the principal components from Step 2 for this step.
Using the methods presented in the lectures, complete a KMeans cluster analysis for N=1 to at least N=10. Feel free to take the number higher.
Print a scree plot of the clusters and determine how many clusters would be optimum. Justify your decision.
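One common way to build the cluster scree plot is to record the total within-cluster sum of squares for each candidate number of clusters and look for an elbow. The sketch below assumes that approach and uses simulated component scores in place of your own. The lecture method may use a different measure.

```r
# Simulated stand-in for the principal component scores from the PCA step.
set.seed(1)
pcs <- data.frame(
  PC1 = c(rnorm(100, mean = -2), rnorm(100, mean = 2)),
  PC2 = c(rnorm(100, mean = -1), rnorm(100, mean = 1))
)

max_k <- 10            # feel free to take this higher
wss   <- numeric(max_k)
for (k in 1:max_k) {
  set.seed(1)          # same starting seed for every k, for reproducibility
  fit    <- kmeans(pcs, centers = k, nstart = 10, iter.max = 50)
  wss[k] <- fit$tot.withinss
}

# Scree plot of the clusters: look for the point where the curve flattens.
plot(1:max_k, wss, type = "b", pch = 16,
     xlab = "Number of clusters", ylab = "Total within-cluster sum of squares",
     main = "KMeans scree plot")
```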
Using the number of clusters from Step 3, perform a cluster analysis using the principal components from Step 2.
Print the number of records in each cluster.
Print the cluster center points for each cluster.
Convert the KMeans clusters into "flexclust" clusters.
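Fitting the final model and printing its sizes and centers could be sketched as below. The value k = 3 is only a placeholder for whatever number you justified from the scree plot, and the component scores are simulated.

```r
# Simulated stand-in for the principal component scores.
set.seed(1)
pcs <- data.frame(
  PC1 = c(rnorm(70, -3), rnorm(70, 0), rnorm(60, 3)),
  PC2 = c(rnorm(70, 1),  rnorm(70, -2), rnorm(60, 1))
)

k  <- 3  # placeholder: use the number of clusters you chose
km <- kmeans(pcs, centers = k, nstart = 10, iter.max = 50)

# Number of records in each cluster.
print(km$size)

# Center point of each cluster, in principal component coordinates.
print(km$centers)
```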
Print the barplot of the cluster. Describe the clusters from the barplot.
Score the training data using the flexclust clusters. In other words, determine which cluster they are in.
Create a scatter plot of the first two Principal Components. Color the plot by the cluster membership.
Add a legend to the plot.
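The flexclust tasks above can be sketched with as.kcca, barplot, and predict from the flexclust package, which must be installed separately (install.packages("flexclust")). The component scores and k = 3 are simulated placeholders, and passing the data as a matrix is a defensive choice made here rather than a requirement.

```r
library(flexclust)

# Simulated stand-in for the principal component scores.
set.seed(1)
pcs <- data.frame(
  PC1 = c(rnorm(70, -3), rnorm(70, 0), rnorm(60, 3)),
  PC2 = c(rnorm(70, 1),  rnorm(70, -2), rnorm(60, 1))
)
pc_mat <- as.matrix(pcs)

k  <- 3  # placeholder for your chosen number of clusters
km <- kmeans(pc_mat, centers = k, nstart = 10, iter.max = 50)

# Convert the KMeans result into a flexclust (kcca) object.
kf <- as.kcca(km, pc_mat)

# Barplot of the cluster centers, one panel per cluster.
barplot(kf)

# Score the training data: which cluster does each record fall in?
clus <- predict(kf, newdata = pc_mat)
print(table(clus))

# Scatter plot of the first two components, colored by cluster, with legend.
plot(pcs$PC1, pcs$PC2, col = clus, pch = 16,
     xlab = "PC1", ylab = "PC2", main = "Clusters on the first two PCs")
legend("topleft", legend = paste("Cluster", 1:k), col = 1:k, pch = 16)
```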
Determine if the clusters predict loan default.
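One simple way to check whether the clusters relate to default is to compare the default rate across clusters. The sketch below assumes the default target is a 0/1 column named TARGET_BAD_FLAG (a hypothetical name) and simulates both it and the cluster labels. The chi-squared test is an optional extra, not something the assignment requires.

```r
# Simulated cluster labels and default flag. Both are stand-ins.
set.seed(1)
n       <- 300
cluster <- sample(1:3, n, replace = TRUE)
p_bad   <- c(0.10, 0.20, 0.40)[cluster]   # simulated default risk per cluster
TARGET_BAD_FLAG <- rbinom(n, 1, p_bad)

# Counts of good and bad loans in each cluster.
tab <- table(cluster = cluster, TARGET_BAD_FLAG = TARGET_BAD_FLAG)
print(tab)

# Default rate per cluster: large differences suggest the clusters carry
# information about default even though the target was never used to build them.
rates <- tapply(TARGET_BAD_FLAG, cluster, mean)
print(round(rates, 3))

# Optional: test whether cluster and default are independent.
print(chisq.test(tab))
```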
Using the original data from Step 2, predict cluster membership using a Decision Tree.
Display the Decision Tree.
Using the Decision Tree plot, describe or tell a story of each cluster. Comment on whether the clusters make sense.
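The decision tree step might be sketched as below, using the rpart package. The data are simulated, the column names are assumptions, and km$cluster stands in for the flexclust cluster scores. The tree is displayed with base plot and text to avoid extra dependencies; a package such as rpart.plot can draw a more readable version if you have it installed.

```r
library(rpart)

# Simulated original (pre-PCA) continuous inputs. Column names are assumptions.
set.seed(7)
n    <- 300
size <- rnorm(n)
df_orig <- data.frame(
  LOAN        = 20000  + 6000  * size + rnorm(n, sd = 3000),
  IMP_MORTDUE = 70000  + 20000 * size + rnorm(n, sd = 10000),
  IMP_VALUE   = 100000 + 30000 * size + rnorm(n, sd = 12000),
  IMP_DEBTINC = 33 + rnorm(n, sd = 8)
)

# Recreate cluster labels from the first two principal components.
pca <- prcomp(df_orig, center = TRUE, scale. = TRUE)
km  <- kmeans(pca$x[, 1:2], centers = 3, nstart = 10, iter.max = 50)

# Predict cluster membership from the original variables.
df_tree         <- df_orig
df_tree$CLUSTER <- factor(km$cluster)
tree <- rpart(CLUSTER ~ ., data = df_tree, method = "class",
              control = rpart.control(maxdepth = 3))
print(tree)

# Display the tree. Each leaf shows the predicted cluster and class counts.
plot(tree, uniform = TRUE, margin = 0.1, main = "Cluster membership tree")
text(tree, use.n = TRUE, cex = 0.8)
```

Because the splits are expressed in the original units (loan amount, property value, and so on), each path from root to leaf reads as a plain-language description of a cluster.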