CSC597/CSC687 Topics in Computer Science
Assignment
For this assignment two datasets will be used: the Parkinson’s UPDRS dataset and the
Wisconsin breast cancer dataset, from the UCI Machine Learning Repository. The datasets
are the same ones that were used for the previous assignments and they are available for
download under the “Assignments” section.
• Each dataset has two files associated with it: a “data” file, which contains the actual
values for the different variables, and a “names” file, which contains a description of
the dataset.
• The Parkinson’s UPDRS dataset will be used for regression methods and includes two
possible output variables: motor_UPDRS and total_UPDRS. These should not be
included as part of the input variables.
• The Wisconsin breast cancer dataset contains one output variable, the class, and will
thus be used for classification techniques.
Some useful notes involving missing data:
• Data may need to be cleaned before applying any statistical learning algorithm. Given
that the sample size of each dataset is large enough, the rows with missing data should be
removed.
• When the data is first loaded into R, we can indicate how missing values are coded.
• Missing values can be inspected by using the is.na() function.
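The notes above could be put together roughly as follows. This is only a minimal sketch: the inline text stands in for a real “data” file, the column names are made up, and the use of "?" as the missing-value code is an assumption that should be checked against the “names” file.

```r
# Tiny inline stand-in for a "data" file (hypothetical columns and values).
# Assumption: missing values are coded as "?"; check the "names" file.
raw_text <- "id,x1,x2,class
1,5,1,2
2,?,4,4
3,3,?,2
4,8,7,4"

# na.strings tells R which strings to convert to NA while loading.
dat <- read.csv(text = raw_text, na.strings = "?")

# is.na() returns a logical matrix; colSums() counts the NAs per column.
missing_per_column <- colSums(is.na(dat))
print(missing_per_column)

# na.omit() drops every row that contains at least one NA.
clean <- na.omit(dat)
print(nrow(clean))
```

For a real file, the same na.strings argument can be passed to read.csv() or read.table() together with the file path.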
This assignment will cover tree-based and ensemble methods. Divide the datasets into 80%
for training and 20% for testing.
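One possible way to obtain the 80%/20% split is to sample row indices at random. The sketch below uses a made-up data frame in place of the cleaned dataset, and the seed value is arbitrary (it only makes the split reproducible).

```r
set.seed(1)  # arbitrary seed, used only so the split can be reproduced

# Stand-in data frame (hypothetical); replace with the cleaned dataset.
dat <- data.frame(x = rnorm(100), y = rnorm(100))

n <- nrow(dat)

# Draw 80% of the row indices, without replacement, for training.
train_idx <- sample(n, size = round(0.8 * n))

train <- dat[train_idx, ]   # 80% of the rows
test  <- dat[-train_idx, ]  # the remaining 20%
```

The same train_idx should be reused for every model fitted to a given dataset so that all test errors are computed on the same held-out rows.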
1. Regression problem: build regression models to predict total_UPDRS (reminder: do not
include motor_UPDRS as input to the model) and measure their performance utilizing the
test set.
a. Build a regression model using bagging (m=p).
i. How does the model perform?
b. Build a regression model using random forests (m=√p).
i. How does the model perform?
c. Build a regression model using boosting.
i. How does the model perform?
d. Compare and comment on the error obtained with each approach. Which model
seems to perform the best?
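As a rough illustration of the workflow in question 1, the sketch below fits the three ensembles on synthetic data and compares their test mean squared error. Everything specific in it is an assumption: the predictor names and the simulated relationship are invented, the randomForest and gbm packages are one common choice among several, sqrt(p) is used as a typical value of m below p, and the tuning values (ntree, n.trees, interaction.depth, shrinkage) are arbitrary starting points rather than recommended settings.

```r
library(randomForest)  # bagging and random forests
library(gbm)           # boosting

set.seed(1)  # arbitrary seed for reproducibility

# Synthetic stand-in for the Parkinson's data (hypothetical predictors).
n <- 500
dat <- data.frame(
  age     = runif(n, 40, 85),
  jitter  = runif(n),
  shimmer = runif(n),
  hnr     = rnorm(n, mean = 20, sd = 4)
)
dat$total_UPDRS <- 10 + 0.3 * dat$age + 15 * dat$jitter^2 + rnorm(n, sd = 2)
dat$motor_UPDRS <- 0.75 * dat$total_UPDRS + rnorm(n)

# motor_UPDRS is an output variable, so it is removed from the inputs.
dat <- dat[, names(dat) != "motor_UPDRS"]

# 80%/20% split.
train_idx <- sample(n, size = round(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

p <- ncol(train) - 1  # number of input variables

# Test mean squared error for a vector of predictions.
test_mse <- function(pred) mean((test$total_UPDRS - pred)^2)

# Bagging: a random forest that considers all p predictors at each split.
bag <- randomForest(total_UPDRS ~ ., data = train, mtry = p, ntree = 500)

# Random forest: only a subset of m < p predictors at each split.
m  <- max(1, floor(sqrt(p)))
rf <- randomForest(total_UPDRS ~ ., data = train, mtry = m, ntree = 500)

# Boosting: many shallow trees fitted sequentially with a small learning rate.
boost <- gbm(total_UPDRS ~ ., data = train, distribution = "gaussian",
             n.trees = 1000, interaction.depth = 2, shrinkage = 0.05)

mse <- c(
  bagging       = test_mse(predict(bag, newdata = test)),
  random_forest = test_mse(predict(rf, newdata = test)),
  boosting      = test_mse(predict(boost, newdata = test, n.trees = 1000))
)
print(mse)
```

Because all three models are scored on the same test rows, the values in mse can be compared directly; on the real data the ranking may well differ from what this synthetic example produces.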
2. Classification problem: build classification models to predict the class variable and measure
their performance utilizing the test set.
a. Build a classification model using decision trees.
i. How does the model perform?
b. Prune the tree obtained in “a”.
i. Use cross-validation to determine the optimal level of tree complexity.
ii. How does the model perform?
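The steps in question 2 might look roughly like the sketch below. It uses the rpart package, which is only one option (the tree package offers a similar workflow), and it runs on synthetic data with invented predictor names and class labels. rpart performs cross-validation while the tree is grown, so the pruning level is chosen here by taking the complexity parameter with the lowest cross-validated error.

```r
library(rpart)

set.seed(1)  # arbitrary seed for reproducibility

# Synthetic stand-in for the breast cancer data (hypothetical predictors).
n <- 400
dat <- data.frame(
  clump_thickness = sample(1:10, n, replace = TRUE),
  cell_size       = sample(1:10, n, replace = TRUE),
  bare_nuclei     = sample(1:10, n, replace = TRUE)
)
score <- dat$clump_thickness + 2 * dat$cell_size + rnorm(n, sd = 3)
dat$class <- factor(ifelse(score > 17, "malignant", "benign"))

# 80%/20% split.
train_idx <- sample(n, size = round(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Grow a deliberately large tree (cp = 0) with 10-fold cross-validation.
full_tree <- rpart(class ~ ., data = train, method = "class",
                   control = rpart.control(cp = 0, xval = 10))

# cptable has one row per candidate subtree; "xerror" is the CV error.
cp_table <- full_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune back to the subtree with the lowest cross-validated error.
pruned_tree <- prune(full_tree, cp = best_cp)

# Test-set accuracy for a fitted tree.
accuracy <- function(model) {
  mean(predict(model, newdata = test, type = "class") == test$class)
}
acc <- c(full = accuracy(full_tree), pruned = accuracy(pruned_tree))
print(acc)

# Confusion matrix for the pruned tree on the test set.
print(table(predicted = predict(pruned_tree, newdata = test, type = "class"),
            actual    = test$class))
```

Calling printcp(full_tree) or plotcp(full_tree) shows the same cross-validation table in a form that is easier to inspect when deciding on the level of tree complexity.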
Graduate students should address all the questions, whereas undergraduate students are only
required to address questions 1.b, 1.c and 2.a.