COMP2501 Introduction to Data Science and Engineering
Assignment
For each of the following questions, write an R program (in text format) to perform the tasks and include screen
captures of your output. The exceptions are Q1(iv) and Q2(ii), where you should answer the question directly.
Write your answers in a Word document and submit it to Moodle.
(30%) Question 1.
Load the dataset “iris”, and perform the following tasks:
i) Print the structure of the iris dataset.
Sample output:
ii) Calculate the average petal width for each species.
Sample output:
iii) Calculate the average sepal length for each species.
Sample output:
iv) How many flowers in the iris dataset have a petal width greater than 1.5 cm?
v) Create a boxplot of petal lengths for each species using the ggplot2 package.
Sample output:
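As a starting point, the calls involved in Question 1 might look like the sketch below. It is only one possible approach, not a model answer. It assumes the built-in iris data frame with its standard column names (Petal.Width, Petal.Length, Sepal.Length, Species), uses base R aggregate() for the averages, and draws the boxplot only if ggplot2 is installed. The variable names avg_by_species and n_wide are made up for illustration.

```r
# Minimal sketch for Question 1, using the built-in iris data frame.
data(iris)

# (i) Structure of the dataset
str(iris)

# (ii), (iii) Mean petal width and mean sepal length per species
avg_by_species <- aggregate(cbind(Petal.Width, Sepal.Length) ~ Species,
                            data = iris, FUN = mean)
print(avg_by_species)

# (iv) Count the flowers whose petal width exceeds 1.5 cm
n_wide <- sum(iris$Petal.Width > 1.5)
print(n_wide)

# (v) Boxplot of petal length per species (skipped if ggplot2 is missing)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(iris, aes(x = Species, y = Petal.Length)) +
    geom_boxplot()
  print(p)
}
```

A dplyr pipeline with group_by() and summarise() would serve equally well for parts (ii) and (iii).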
(70%) Question 2.
The following question concerns the “Adult” data set in the UC Irvine Machine Learning Repository. Perform the
following tasks:
i) Consider the data file adult.data. Define the column names for the dataset. Strip leading and trailing white
space, and treat the question mark character “?” as a missing value. Use read.csv() to load the data file.
Print the first few rows of the data.
Sample output:
ii) Are there any missing values in the adult.data file?
iii) Calculate the percentage of individuals with income greater than $50,000 for each education level.
Sample output:
iv) Print the table in (iii) based on the percentage in descending order.
Sample output:
Sample output:
vi) Consider the data for the countries “Canada” and “Germany”. Show the average of hours_per_week for
these two countries.
Sample output:
vii) Make a scatterplot of hours_per_week versus age for the countries Canada, Germany, India, Japan and
Mexico. Use color to represent different countries.
Sample output:
viii) Similar to (vii), but consider only the Bachelors and Masters data. Use facet_grid() to plot the graph.
Sample output:
ix) Consider only the United-States data. Use geom_line() to plot the average hours_per_week for each age
against age.
Sample output:
x) Make a ridge plot. Consider the United-States Female data and ignore records with NA values in
occupation. Plot marital_status versus hours_per_week.
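The loading and summarising steps in Question 2 could be sketched roughly as follows. This is a hedged illustration, not a model answer. The column names are an assumption (underscored forms of the 15 attributes in the data set, chosen to match hours_per_week and marital_status above). The four data rows are hypothetical, written in the file's comma-plus-space layout so that the sketch runs without adult.data. With the real file, pass "adult.data" as the first argument to read.csv() instead of text = sample_rows.

```r
# Assumed column names (underscored attribute names; not given in the sheet).
adult_cols <- c("age", "workclass", "fnlwgt", "education", "education_num",
                "marital_status", "occupation", "relationship", "race", "sex",
                "capital_gain", "capital_loss", "hours_per_week",
                "native_country", "income")

# Hypothetical rows for illustration only; they are NOT taken from adult.data.
sample_rows <- c(
  "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, Canada, <=50K",
  "45, Private, 120000, Masters, 14, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 50, Germany, >50K",
  "30, Private, 98000, Bachelors, 13, Divorced, Sales, Unmarried, White, Female, 0, 0, 35, Canada, >50K",
  "52, ?, 150000, HS-grad, 9, Widowed, ?, Not-in-family, Black, Female, 0, 0, 20, United-States, <=50K"
)

# (i) strip.white removes the space after each comma; "?" is then read as NA.
adult <- read.csv(text = sample_rows, header = FALSE, col.names = adult_cols,
                  strip.white = TRUE, na.strings = "?")
print(head(adult))

# (ii) Number of missing values in each column
print(colSums(is.na(adult)))

# (iii) Percentage of individuals earning more than 50K per education level
pct_high <- tapply(adult$income == ">50K", adult$education, mean) * 100
print(pct_high)

# (iv) The same percentages in descending order
print(sort(pct_high, decreasing = TRUE))

# (vi) Average hours_per_week for Canada and Germany
two <- subset(adult, native_country %in% c("Canada", "Germany"))
avg_hours <- aggregate(hours_per_week ~ native_country, data = two, FUN = mean)
print(avg_hours)

# (vii) Scatterplot coloured by country (skipped if ggplot2 is missing)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(two, aes(x = age, y = hours_per_week, colour = native_country)) +
    geom_point()
  print(p)
}
```

The remaining plots follow the same pattern of filtering with subset() first and then adding the relevant ggplot2 layer.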