In this assessment, students will demonstrate their ability to apply and innovate data wrangling techniques learned in the first four weeks of the "Foundations of Data Analysis" course (unit learning outcome 1). The primary goal is to enhance the quality of a provided dataset, making it ready for subsequent analysis. This task will challenge students to think critically about data preparation processes and encourage the creation of custom solutions to data quality issues.
Objectives
· Apply fundamental data wrangling techniques using the R programming language to clean and organize a dataset.
· Innovate on existing data wrangling methods to address unique challenges in data quality.
· Assess the effectiveness of applied data wrangling techniques in enhancing data quality.
Topics Covered
· Week 1: Introduction to R
· Week 2: Control Structures in R
· Week 3: Data Manipulation using dplyr
· Week 4: Data Transformation using tidyr
Dataset
A real-world dataset will be provided, containing common data quality issues such as missing values, duplicates, inconsistent formatting, and erroneous entries. The dataset will be complex enough to require a combination of the techniques discussed in the first four weeks of the course.
This assessment is designed to evaluate your understanding and application of R programming skills in data manipulation and transformation to enhance data quality (unit learning outcome 1). The assessment is divided into two sections: Section A, which involves practical coding tasks, and Section B, in which you have to watch a business use case and write conditional statements.
Instructions
1. Combine all the R code from Sections A and B into a single R Quarto file, and submit only this one file on Moodle.
2. For the subjective questions, please include your answers within the same R Quarto file; do not submit them in a separate document.
3. Exact answers, including code, sample data, and explanations obtained from ChatGPT, will not be accepted and may result in a score of 0.
4. Please be specific in your coding and in your explanations of outputs; avoid general or vague descriptions.
5. Identical answers from classmates will not be tolerated, so be mindful of plagiarism checks on Moodle.
6. Refrain from requesting direct solutions from lecturers or sessional staff; you may ask for clarifications regarding the questions instead.
SECTION A
All the questions in Section A are based on the dataset “bigbasket.csv”.
E-commerce, or electronic commerce, refers to the process of buying and selling products through online platforms or the Internet. It utilizes various technologies, including mobile commerce, electronic funds transfer, supply chain management, Internet marketing, online transaction processing, electronic data interchange (EDI), inventory management systems, and automated data collection systems. The growth of e-commerce is largely fueled by advancements in the semiconductor industry, making it the largest segment within the electronics sector. BigBasket is the leading online grocery supermarket in India. Launched around 2011, the company has been continuously expanding its operations. Despite facing competition from new entrants like Blinkit, BigBasket has maintained its strong position in the market, thanks to its growing customer base and the ongoing shift toward online shopping.
1. List down 4 data manipulation strategies that can be applied to this dataset to produce clean data for analysis. Justify how the strategies will help in performing data analysis for business decision making. Include strategies such as i) handling missing values, ii) handling data redundancy, iii) handling outliers in prices, and iv) creating a new column. (4 marks)
2. Perform data manipulation using the 4 strategies listed in (1) by following the order from (i) to (iv). Use the cleaned dataset produced in (i) for (ii), from (ii) to (iii) and from (iii) to (iv). Show the R code and justify the outputs. Print the header if the output has more than 10 rows. (4 marks)
3. Use the cleaned dataset produced at the end of question 2 to answer this question. Assume that you have been assigned as a Business Analyst in BigBasket supermarket. Your first project is to perform a price variance analysis using the provided dataset. You have to identify the top 100 products with the highest discount rate. Which category does the majority of the top 100 products fall under? Why do you think this category needs the highest discount rate compared to others? (5 marks)
4. As a business analyst, which product will you correlate with the category you found in (3) to increase the sales of the supermarket? Identify it and explain why. (4 marks)
5. Use the dataset produced in question 2. Write a conditional statement using the “rating” variable to classify the products into three categories: i) high, ii) average, and iii) poor, and store the information in a new column in the same dataset. (3 marks)
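For orientation only, the chained cleaning steps in question 2 and the rating classification in question 5 might be sketched as below. This is a minimal illustration on a toy data frame, not a model answer: the column names (product, sale_price, market_price), the 1.5 × IQR upper fence for price outliers, the discount formula, and the rating cut-offs of 4 and 3 are all assumptions that do not come from this brief or from bigbasket.csv.

```r
library(dplyr)

# Toy stand-in for bigbasket.csv. Column names and values are hypothetical.
products <- data.frame(
  product      = c("Tea", "Tea", "Rice", "Soap", "Oil", "Salt"),
  sale_price   = c(90, 90, 450, 30, 5000, 18),
  market_price = c(100, 100, 600, 40, 5200, 20),
  rating       = c(4.5, 4.5, 3.2, NA, 2.1, 2.5)
)

# (i) Missing values: drop rows with no rating (one of several possible choices).
step1 <- products %>% filter(!is.na(rating))

# (ii) Redundancy: remove exact duplicate rows from the output of (i).
step2 <- step1 %>% distinct()

# (iii) Price outliers: keep rows at or below an assumed upper fence, Q3 + 1.5 * IQR.
upper_fence <- quantile(step2$sale_price, 0.75, names = FALSE) +
  1.5 * IQR(step2$sale_price)
step3 <- step2 %>% filter(sale_price <= upper_fence)

# (iv) New column: discount as a percentage of the market price (assumed definition).
step4 <- step3 %>%
  mutate(discount_pct = 100 * (market_price - sale_price) / market_price)

# Question 5 style classification with assumed cut-offs of 4 and 3.
step4 <- step4 %>%
  mutate(rating_class = case_when(
    rating >= 4 ~ "high",
    rating >= 3 ~ "average",
    TRUE        ~ "poor"
  ))

print(step4)
```

Each step reads the object produced by the previous one, which mirrors the required order from (i) to (iv). Any real submission would need its own justified choices for missing-value handling, outlier rules, and thresholds.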
SECTION B
Watch the video on the business strategy of ZARA and answer the following questions. Click the image to watch the video!
Decoding ZARA's Billion Dollar Business STRATEGY : Fashion Business Case Study (youtube.com)
1. List down three business strategies used by ZARA to be the number one fashion store worldwide. (3 marks)
2. Assume that you are hired as a business analyst in ZARA to analyze their E-Commerce data. Your first task is to create a data frame and perform basic analysis using conditional statements. (7 marks)
Follow the instructions below:
a) Go to this website https://www.zara.com/my/
b) Create a data frame with 5 columns, including productID (create integers from 1 – 20 by yourself), product name, category, subcategory, and price using the information from the website. You must create 20 observations.
c) Write a conditional statement using the price of the products in your data frame as an independent variable and subcategory as your target variable. Identify the subcategories at the highest and lowest prices using the conditional statement.
d) Write a conditional statement to identify the products in Q1 and Q3 of the price range and explain the output.
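As a rough illustration of the kind of conditional logic parts (c) and (d) describe, the sketch below uses base R on a small hypothetical data frame. The subcategories and prices are invented placeholders rather than values from the ZARA website, only five rows and three columns are shown instead of the required 20 observations and 5 columns, and reading "products in Q1 and Q3" as "at or below the first quartile" and "at or above the third quartile" is one possible interpretation, not the only one.

```r
# Hypothetical rows; not taken from the ZARA website.
items <- data.frame(
  productID   = 1:5,
  subcategory = c("Shirts", "Jeans", "Coats", "Socks", "Dresses"),
  price       = c(129, 199, 499, 39, 259)
)

# Part (c) style: flag the rows holding the highest and lowest price.
items$price_flag <- ifelse(
  items$price == max(items$price), "highest",
  ifelse(items$price == min(items$price), "lowest", "middle")
)

# Part (d) style: compare each price with the first and third quartiles.
q1 <- quantile(items$price, 0.25, names = FALSE)
q3 <- quantile(items$price, 0.75, names = FALSE)
items$quartile_band <- ifelse(
  items$price <= q1, "at or below Q1",
  ifelse(items$price >= q3, "at or above Q3", "between Q1 and Q3")
)

print(items)
```

Nested ifelse() calls keep the example in base R; dplyr's case_when() would express the same branching if the tidyverse style from Week 3 is preferred.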