Gradient Descent
(20 points)
GitHub Classroom Invitation Link
1 Description
In this assignment you will implement Batch Gradient Descent to fit a line to a two-dimensional
data set. You will implement a set of Spark jobs that learn the parameters of such a line from the
New York City Taxi trip reports from the year 2013. The dataset was released under FOIL (the
Freedom of Information Law) and made public by Chris Whong. See Assignment 1 for details about this data set.
We would like to train a linear model between travel distance in miles and fare amount (the
money that is paid to the taxi).
2 Taxi Data Set - Same data set as Assignment 1
This is the same data set as used for Assignment 1; please have a look at the table description
there.
The data set is in Comma Separated Values (CSV) format. When you read a line and split it
on the comma character ",", you will get a string array of length 17. With indices starting from
zero, for this assignment we need index 5, trip distance (trip distance in miles), and index 11,
fare amount (fare amount in dollars), as stated in the following table.
index 5   (this is our X-axis)   trip distance   trip distance in miles
index 11  (this is our Y-axis)   fare amount     fare amount in dollars
Table 1: Taxi Data Set fields
Data Clean-up Step
• Remove all taxi rides that are shorter than 2 minutes or longer than 1 hour.
• Remove all taxi rides whose "fare amount" is less than 3 dollars or more than 200 dollars.
• Remove all taxi rides whose "trip distance" is less than 1 mile or more than 50 miles.
• Remove all taxi rides whose "tolls amount" is less than 3 dollars.
You can also preprocess the data and store it in your own cluster storage.
3 Obtaining the Dataset
Small data set (93 MB compressed, 384 MB uncompressed), for implementation and testing purposes
(roughly 2 million taxi trips). It is available on Amazon S3:
https://s3.amazonaws.com/metcs777/taxi-data-sorted-small.csv.bz2
You can download or access the data sets using the following internal URLs:
Google Cloud
Small Data Set gs://metcs777/taxi-data-sorted-small.csv.bz2
Large Data Set gs://metcs777/taxi-data-sorted-large.csv.bz2
Table 2: Data set on Google Cloud Storage - URLs
Amazon AWS
Small Data Set s3://metcs777/taxi-data-sorted-small.csv.bz2
Large Data Set s3://metcs777/taxi-data-sorted-large.csv.bz2
Table 3: Data set on Amazon AWS - URLs
4 Assignment Tasks
4.1 Task 1 : Simple Linear Regression (4 points)
We want to fit a simple line to our data (distance, "fare amount") and use it to predict "fare amount"
from the travel distance.
Consider the Simple Linear Regression model given in equation (1). What are the regression
coefficients for your model?

    y = m·x + b                                              (1)

The solutions for the slope m and the y-intercept b are calculated from equations (2) and (3):

    m = (n Σ xᵢ·yᵢ − Σ xᵢ · Σ yᵢ) / (n Σ xᵢ² − (Σ xᵢ)²)     (2)

    b = (Σ yᵢ − m · Σ xᵢ) / n                                (3)
Implement a PySpark job that calculates the exact answers for the parameters m and b, where
m is the slope of the line and b is its y-intercept.
Run your implementation on the large data set and report the computation time of your Spark
job for this task. You can find the completion time of your job in the cloud system's console.
Note on Task 1: Depending on your implementation, executing this task on the large data set
can take a long time; for example, on a cluster with 12 cores in total it can take more than 40
minutes of computation time.
4.2 Task 2 - Find the Parameters using Gradient Descent (8 Points)
In this task, you should implement batch gradient descent to find the optimal parameters for our
Simple Linear Regression model.
• You should load the data into Spark cluster memory as an RDD or a DataFrame.
• Start with all parameters set to 0.1 and do 100 iterations.
The cost function will then be the mean squared error:

    C(m, b) = (1/n) Σᵢ (yᵢ − (m·xᵢ + b))²
Here is a list of important setup parameters:
• Initialize all of your model parameters with zeros.
• Set your initial learning rate to learningRate = 0.0001 and change it if needed.
• You can implement the bold driver technique to improve your learning rate.
• The maximum number of iterations should be 100, num_iteration = 100.
Run your implementation on the large data set and report the computation time for your Spark
Job for this task. Compare the computation time with the previous tasks.
• Print out the cost in each iteration
• Print out the model parameters in each iteration
Note: You might write gradient-descent iteration code in PySpark that works perfectly on your
laptop but does not run on the clusters (AWS/Google Cloud). The main reason is that on your
laptop it runs in a single process, while on a cluster it runs across multiple shared-nothing
processes. You need to reduce across all jobs/processes in order to update the parameters;
otherwise each process will have its own copy of the variables.
4.3 Task 3 - Fit Multiple Linear Regression using Gradient Descent (8 Points)
We would like to learn a linear model with 4 variables to predict the total paid amount of taxi
rides. The following table describes the variables that we want to use.
index 4   (1st independent variable)   trip time in secs   duration of the trip
index 5   (2nd independent variable)   trip distance       trip distance in miles
index 11  (3rd independent variable)   fare amount         fare amount in dollars
index 12  (4th independent variable)   tolls amount        bridge and tunnel tolls in dollars
index 16  (dependent variable, y)      total amount        total paid amount in dollars
Table 4: Taxi Data Set fields
• Initialize all parameters with 0.1.
• Set your learning rate to learningRate = 0.001.
• The maximum number of iterations should be 100, num_iteration = 100.
• Use vectorization for this task. We will not accept solutions with duplicated per-feature
code; your implementation must be vectorized.
• Implement the "Bold Driver" technique to dynamically change the learning rate. (2 of the
8 points)
• Print out the cost in each iteration
• Print out the model parameters in each iteration
5 Important Considerations