Introduction to Data Science and Analytics
MIE 1624 Introduction to Data Science and Analytics
Assignment
Submit via Quercus
Background:
For this assignment, your task is to analyze the provided dataset and answer the questions listed below. You
will then write a 3-page report to present the results of your analysis. In your report, make use of visual aids
to effectively convey your findings. The format of your visualizations (i.e., with tables/plots/etc.) is up to
you, but ensure they clearly communicate the results of your analysis. In your report, explain how you
arrive at the answers to the questions and justify why your answers are reasonable for the given
data/question. You must interpret your final results in the context of the dataset for your problem.
Background:
In this assignment, we will work on the “2022 Kaggle Machine Learning & Data Science Survey”
dataset.
The purpose of this challenge was to “tell a data story about a subset of the data science community
represented in this survey, through a combination of both narrative text and data exploration.”
The dataset provided (kaggle_survey_2022_responses.csv) contains the survey results published by Kaggle.
It holds the responses of 23997 participants in 296 columns, which represent the survey questions. Not
every participant answered every question, and the responses contain various data types. In the dataset,
column ‘Q29’ (“What is your current yearly compensation (approximate $USD)?”) contains the ordinal
categorical target variable. The original data (kaggle_survey_2022_responses.csv) has been transformed
into clean_kaggle_data_2022.csv using the code given in KaggleSalary_DataSet.ipynb. You should work
with the clean dataset (clean_kaggle_data_2022.csv) for Assignment 2; this is the file to read in your
notebook. In the clean dataset, rows with null salary values have been dropped. In addition, two columns
(‘Q29_Encoded’ and ‘Q29_buckets’) have been added at the end. Column ‘Q29_buckets’ (the target
variable for Assignment 2) was obtained by combining some of the salary buckets in column ‘Q29’.
Column ‘Q29_Encoded’ was obtained by label encoding column ‘Q29_buckets’.
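The relationship between an ordered bucket column and its integer encoding can be sketched as follows. This is only an illustration: the bucket labels and the helper encode_buckets are hypothetical, and the actual labels and codes should be checked in clean_kaggle_data_2022.csv.

```python
# Hypothetical bucket labels, listed from lowest to highest salary.
# The real labels in 'Q29_buckets' may differ; inspect the column to confirm.
ORDERED_BUCKETS = ["0-9,999", "10,000-19,999", "20,000-29,999", "30,000-39,999"]


def encode_buckets(buckets, ordered_labels):
    """Map each bucket label to its rank (0 = lowest) in ordered_labels."""
    rank = {label: i for i, label in enumerate(ordered_labels)}
    return [rank[b] for b in buckets]


# A few made-up responses, encoded so that a larger code means a higher bucket.
sample = ["10,000-19,999", "0-9,999", "30,000-39,999", "10,000-19,999"]
encoded = encode_buckets(sample, ORDERED_BUCKETS)
print(encoded)
```

An explicit ordered mapping like this keeps the integer codes in salary order, which is the property an ordinal model relies on.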
The purpose of this assignment is to train, validate, and tune multi-class ordinal classification models that
can predict a survey respondent’s current yearly compensation bucket, based on a set of survey responses
by a data scientist.
Classification is a supervised machine learning approach used to assign a discrete value to one variable
given the values of others. Many types of machine learning models can be used for classification
problems, such as logistic regression, decision trees, kNN, SVM, random forest, gradient-boosted
decision trees, and neural networks. In this assignment, you are required to implement the
ordinal logistic regression algorithm, but feel free to experiment with other algorithms.
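One common way to build an ordinal classifier, which may or may not match the skeleton in the course template, is to train K-1 binary classifiers that each estimate P(y > k) and then combine their outputs into class probabilities. The sketch below shows only the combination step; the function name ordinal_class_probs and the example probabilities are hypothetical.

```python
def ordinal_class_probs(prob_greater):
    """Convert P(y > k) for k = 0..K-2 into P(y = k) for k = 0..K-1.

    prob_greater[k] is assumed to be the output of the k-th binary
    classifier, i.e. the estimated probability that the class is above k.
    """
    n_classes = len(prob_greater) + 1
    probs = []
    for k in range(n_classes):
        upper = 1.0 if k == 0 else prob_greater[k - 1]          # P(y > k-1)
        lower = 0.0 if k == n_classes - 1 else prob_greater[k]  # P(y > k)
        # Clip at zero: independently trained classifiers may not be monotone.
        probs.append(max(upper - lower, 0.0))
    total = sum(probs)
    return [p / total for p in probs]


# Made-up outputs of three binary classifiers for one respondent (K = 4).
class_probs = ordinal_class_probs([0.9, 0.6, 0.2])
predicted_class = max(range(len(class_probs)), key=class_probs.__getitem__)
print(class_probs, predicted_class)
```

The predicted bucket is the class with the largest combined probability. Clipping and renormalizing is one simple way to handle inconsistent binary outputs, not the only one.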
For the purposes of this assignment, any subset of data can be used for data exploration and for classification
purposes. For example, you may focus only on one country, exclude features, or engineer new features. If
a subset of data is chosen, it must contain at least 5000 training examples. You must justify and explain
why you are selecting a subset of the data, and how it may affect the model.
Data is often split into training and testing data. The training data is typically further divided to create
validation sets, either by just splitting, if enough data exists, or by using cross-validation within the training
set. The model can be iteratively improved by tuning the hyperparameters or by feature selection.
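A minimal sketch of cross-validation within the training set is shown below, using only the Python standard library. The helper k_fold_indices and the sample sizes are hypothetical; in practice a library routine such as scikit-learn's KFold or StratifiedKFold would normally be used instead.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, val_idx in enumerate(folds):
        # Training indices are all folds except the current validation fold.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


# Ten training examples split into five folds.
splits = list(k_fold_indices(n_samples=10, n_folds=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Each hyperparameter setting would be scored on every validation fold and the scores averaged, with the held-out test data left untouched until the final evaluation.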
You may get started with this assignment using assignment2_template.ipynb. The template contains some
basic code that may be helpful, namely code for reading the dataset and a skeleton for implementing
ordinal logistic regression. Note that the file must be renamed appropriately before submission (see
later sections for details).
Learning objectives:
1. Understand how to clean and prepare data for machine learning, including working with multiple
data types, incomplete data, and categorical data. Perform data standardization/normalization, if
necessary, prior to modeling.
2. Understand how to apply machine learning algorithms (ordinal logistic regression) to the task of
classification.
3. Improve the skills and competencies required to compare the performance of classification algorithms,
including applying performance measurements and visualizing comparisons.
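The standardization mentioned in the first objective can be sketched as below. The key assumption is that the mean and standard deviation are computed on the training data only and then reused for the test data; the function names and values are hypothetical.

```python
import statistics


def fit_standardizer(train_column):
    """Return (mean, std) of a training column; std falls back to 1.0 if zero."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)  # population standard deviation
    if std == 0:
        std = 1.0  # constant column: avoid division by zero
    return mean, std


def standardize(column, mean, std):
    """Apply z-score scaling using previously fitted statistics."""
    return [(x - mean) / std for x in column]


# Made-up numeric feature values.
train_vals = [2.0, 4.0, 6.0, 8.0]
test_vals = [5.0, 10.0]

mean, std = fit_standardizer(train_vals)
train_scaled = standardize(train_vals, mean, std)
test_scaled = standardize(test_vals, mean, std)  # reuses the training statistics
print(train_scaled, test_scaled)
```

Fitting the statistics on the training split alone keeps information from the test data out of the model.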