CS/INFO 5304 Assignment 1
Credit: 94 points
Grade: 20% of final grade
Submission: Files that need to be submitted are listed at the end of each question.
Please submit the assignment on Gradescope
Tip: Read the document fully before you begin working on the assignment
Question 1: Extract, Transform, Load (ETL) (34 points)
Often, we are given different datasets that represent the same population, but each dataset
contains different information about this population. ETL pipelines can be used to extract each
data source, transform the data sources so we can map one dataset onto the other, and then
load them into a single data source that we can use for analytics.
In this question, we will create an ETL pipeline to combine two raw datasets from a sampled
population, and estimate the value of a parameter that exists in one dataset, but not the other.
The datasets we are looking at come from the U.S. Centers for Disease Control and Prevention (CDC). The first
dataset is called the Behavioral Risk Factor Surveillance System, or BRFSS, and the second
dataset is called the National Health Interview Survey, or NHIS. Within this question you will be
asked to perform the following data analysis:
● Extract the raw datasets into Spark SQL DataFrames
● Read the guidebooks for each of these datasets to understand how to map similar
features to each other
● Perform an exact join on these features, and approximate a disease prevalence statistic
within the US population
For this question, you are required to use Spark and will heavily use the Spark SQL API. You
may use any other Python functions/objects when performing map/reduce on sections of the
DataFrame. We recommend reading this entire section before beginning coding, so you can
get an idea about the direction of the problem.
Step 1: Load data (3 points)
Download the data from Google Drive (a smaller dataset than the real one). There should be two
data files, called brfss_input.json and nhis_input.csv. We have also provided starter code
for you, in p1.py. Complete the function called create_dataframe that takes in a path to a
file, the type of file to load (e.g. “csv”), and a Spark session, and returns the Spark DataFrame for
that file.
Step 2: Make a test example (6 points)
When working with large datasets, it is often helpful to create a small subset as a test example
that can be used to validate whether our code works within a short runtime.
Analyze the columns of the BRFSS and NHIS data you downloaded, and identify what columns
represent similar information (for example, is there a column in BRFSS, and a respective
column in NHIS, that represent a person’s age?). To do this, you will need the codebooks for
each dataset, which can be found here:
● 2017 BRFSS codebook
● 2017 Sample Adult File NHIS codebook (Here is a copy on Drive in case you have any issues.)
After analyzing the columns, prepare three files that can be used as a test case:
● Download the 5-row BRFSS “dummy” dataset and the 5-row NHIS “dummy” dataset
from Google Drive. They are named test_xxx.
● Manually create a joined_test.csv file that represents the expected output of your
program with the above dummy input. Only exact matches should be kept, and all nulls
should be dropped. This file should essentially have:
a. All the BRFSS columns
b. The column(s) of the NHIS dataset that are not initially within the BRFSS dataset
You may use multiple columns to join. Read the codebooks carefully so you know what each
value represents.