Final Project Assignment
The aim of this assignment is to give you a chance to exercise your skills at prediction using Python. You have been sent an email with a link to data collected on a random sample from some population of Wikipedia pages, which you will use to develop prediction models for three different web page attributes. Each student is provided with their own data, drawn from a Wikipedia page population unique to that student, and this comes in the form of two files:
• A training set, which is a pickled pandas data frame with 200,000 rows and 44 columns. Each row corresponds to a distinct Wikipedia page/url drawn at random from a certain population of Wikipedia pages. The columns are
– URLID in column 0, which gives a unique identifier for each url. You will not be able to determine the url from the URLID or the rest of the data. (It would be a waste of time to try, so the only information you have about this url is what is provided in the dataset itself.)
– 40 feature/predictor variable columns in columns 1,...,40, each associated with a particular word (the word is in the header). For each url/Wikipedia page, a word column gives the number of times that word appears in the associated page.
– Three response variables in columns 41, 42 and 43
∗ length = the length of the page, defined as the total number of characters in the page
∗ date = the last date when the page was edited
∗ word present = a binary variable indicating whether at least one of 5 possible words (using a word list of 5 words specific to each student and not among the 40 feature words) appears in the page
• A test set, which is also a pickled pandas data frame, with 50,000 rows but only 41 columns, since the response variables (length, date, word present) are not available to you. The rows of the test dataset also correspond to distinct url/pages drawn from the same Wikipedia url/page population as the training dataset (with no pages in common with the training set pages). The response variables have been removed, so the columns that are available are
– URLID in column 0
– the same 40 feature/predictor variable columns corresponding to word counts for the same 40 words as in the training set
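As an illustration of the column layout described above, here is a minimal sketch that separates a data frame into URLID, the 40 word-count features and whatever response columns follow. The helper name split_columns, the word0..word39 headers, the underscore in word_present and the file name in the final comment are assumptions made for the example; the real headers and file names come with your own data. A tiny synthetic frame stands in for the real 200,000-row training set.

```python
import numpy as np
import pandas as pd

def split_columns(df, n_features=40):
    """Split a frame laid out as [URLID, 40 word counts, responses...]."""
    ids = df.iloc[:, 0]                    # column 0: URLID
    X = df.iloc[:, 1:1 + n_features]       # columns 1..40: word counts
    y = df.iloc[:, 1 + n_features:]        # columns 41..43 (empty for the test set)
    return ids, X, y

# Synthetic stand-in with the same layout as the training set (made-up values).
rng = np.random.default_rng(0)
words = [f"word{i}" for i in range(40)]    # hypothetical headers
demo = pd.DataFrame(rng.poisson(1.0, size=(5, 40)), columns=words)
demo.insert(0, "URLID", np.arange(5))
demo["length"] = rng.integers(500, 5000, size=5)
demo["date"] = pd.to_datetime(
    ["2023-01-05", "2021-07-01", "2023-11-30", "2019-03-12", "2022-12-31"]
)
demo["word_present"] = [1, 0, 0, 1, 0]

ids, X, y = split_columns(demo)

# With the real data, something like:
#   train = pd.read_pickle("train.pkl")   # hypothetical file name
```

The same helper applied to the test set would return an empty frame for y, since the test set has only 41 columns.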
Your goal is to use the training data to
• predict the length variable for pages in the test dataset
• predict the mean absolute error you expect to achieve in your predictions of length in the test dataset
• predict word present for pages in the test dataset, attempting to make the false positive rate as close as you can to .05, and to make the true positive rate as high as you possibly can
• predict your true positive rate for word present in the test dataset
• predict, for each page in the test dataset, whether the last date when the page was edited was in 2023 (so create a 0/1-valued variable referred to as “edited 2023”), attempting to make the false positive rate as close as you can to .05, and to make the true positive rate as high as you possibly can
• predict your true positive rate for edited 2023 in the test dataset
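One possible way to approach the two binary goals above is sketched here, under stated assumptions: the 0/1 response for 2023 is derived from the date column (assuming pd.to_datetime can parse it), and a classifier's scores are thresholded at the 0.95 quantile of the scores of true negatives, so that roughly 5% of negatives are flagged. The function names and the synthetic scores are illustrative only; any model that produces a score or probability could be plugged in.

```python
import numpy as np
import pandas as pd

def threshold_for_fpr(scores, labels, target_fpr=0.05):
    """Score cutoff that roughly target_fpr of the true negatives exceed."""
    negatives = scores[labels == 0]
    return np.quantile(negatives, 1.0 - target_fpr)

def fpr_tpr(scores, labels, threshold):
    """False/true positive rates of the rule 'predict 1 if score > threshold'."""
    pred = scores > threshold
    return pred[labels == 0].mean(), pred[labels == 1].mean()

# Deriving the 0/1 "edited 2023" response from dates (made-up example dates).
dates = pd.Series(["2023-01-05", "2021-07-01", "2023-11-30"])
edited_2023 = (pd.to_datetime(dates).dt.year == 2023).astype(int)

# Synthetic scores standing in for a model's output: positives score higher.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=40_000)
scores = rng.normal(loc=1.5 * labels, scale=1.0)

# Choose the cutoff on one half, then measure the rates on the other half,
# so the reported rates are not tuned on the data used to pick the cutoff.
fit, hold = slice(0, 20_000), slice(20_000, None)
thr = threshold_for_fpr(scores[fit], labels[fit])
fpr, tpr = fpr_tpr(scores[hold], labels[hold], thr)
```

The held-out tpr is the kind of number the last goal asks you to predict. If the scores take only a few distinct values, ties at the cutoff can push the realized false positive rate away from .05.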
Since I have the response variable values (length, word present, date) for the pages in your test dataset, I can determine the performance of your predictions. Since you do not have those variables, you will need to set aside some data in your training set or use cross-validation to estimate the performance of your prediction models.
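A minimal sketch of the cross-validation idea for the length prediction follows. It assumes an ordinary least-squares fit as a placeholder model (not a recommended choice) and synthetic data in place of the real word counts; kfold_mae is a hypothetical helper name.

```python
import numpy as np

def kfold_mae(X, y, k=5, seed=0):
    """Estimate out-of-sample mean absolute error with k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    maes = []
    for i in range(k):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        # Placeholder model: least squares with an intercept column.
        A = np.column_stack([np.ones(len(trn)), X[trn]])
        coef, *_ = np.linalg.lstsq(A, y[trn], rcond=None)
        B = np.column_stack([np.ones(len(val)), X[val]])
        maes.append(np.mean(np.abs(y[val] - B @ coef)))
    return float(np.mean(maes)), float(np.std(maes))

# Synthetic stand-in: 40 count features and a noisy linear response.
rng = np.random.default_rng(1)
X = rng.poisson(2.0, size=(1000, 40)).astype(float)
w = rng.normal(size=40)
y = 100.0 + 10.0 * (X @ w) + rng.normal(scale=50.0, size=1000)

est_mae, fold_sd = kfold_mae(X, y)
```

Each fold's error is computed on rows the model did not see during fitting, which is what makes the average a reasonable guess at test-set performance. The spread across folds gives a rough sense of how uncertain that guess is.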
There are 3 different parts of this assignment, each requiring a submission:
• Part 1 (30 points) - a Jupyter notebook containing
– a description (in words, no code) of the steps you followed to arrive at your predictions and your estimates of prediction quality - including a description of any separation of your training data into training and testing data, the method you used for imputation, and the methods you tried for making predictions (e.g. regression, logistic regression, ...) - followed by
– the code you used in your calculations
• Part 2 (60 points) - a CSV file with your predictions - this file should consist of exactly 4 columns with
– a header row with URLID, length, word present, edited 2023
– 50,000 additional rows
– every URLID in your test dataset appearing in the URLID column - not altered in any way!
– no missing values
– data type for the length column should be integer or float
– data type for the word present column should be either integer (0 or 1), float (0. or 1.) or Boolean (False/True)
– data type for the edited 2023 column should be either integer (0 or 1), float (0. or 1.) or Boolean (False/True)
• Part 3 (30 points) - estimates of the following, provided in a form:
– what do you predict the mean absolute error of your length predictions to be?
– what do you predict the true positive rate for your word present predictions to be?
– what do you predict the true positive rate for your edited 2023 predictions to be?
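Before submitting the Part 2 file, it may help to check it against the specification listed above. The following is a sketch of such a check, not an official validator: check_submission is a hypothetical helper, and the column names are taken literally from the header row given above (adjust them if your handout spells them differently, for example with underscores). The demo uses 4 made-up rows instead of 50,000 and round-trips through an in-memory CSV.

```python
import io
import numpy as np
import pandas as pd

EXPECTED_COLUMNS = ["URLID", "length", "word present", "edited 2023"]

def check_submission(sub, test_ids, n_rows=50_000):
    """Return a list of problems found; an empty list means all checks passed."""
    if list(sub.columns) != EXPECTED_COLUMNS:
        return ["header does not match the expected 4 columns"]
    problems = []
    if len(sub) != n_rows:
        problems.append(f"expected {n_rows} rows, found {len(sub)}")
    if not sub["URLID"].is_unique or set(sub["URLID"]) != set(test_ids):
        problems.append("URLID column does not match the test set URLIDs")
    if sub.isna().any().any():
        problems.append("missing values present")
    if not pd.api.types.is_numeric_dtype(sub["length"]):
        problems.append("length is not integer or float")
    for col in EXPECTED_COLUMNS[2:]:
        # Accepts 0/1, 0./1. and False/True; anything else is flagged.
        if not sub[col].astype(float).isin([0.0, 1.0]).all():
            problems.append(f"{col} has values other than 0/1")
    return problems

# Demo with made-up values.
test_ids = [101, 102, 103, 104]
sub = pd.DataFrame({
    "URLID": test_ids,
    "length": [1200.0, 845.5, 23010.0, 400.0],
    "word present": [0, 1, 0, 0],
    "edited 2023": [True, False, False, True],
})
buf = io.StringIO()
sub.to_csv(buf, index=False)          # index=False keeps exactly 4 columns
buf.seek(0)
problems = check_submission(pd.read_csv(buf), test_ids, n_rows=4)
```

Reading the file back before checking it mirrors what a grader would see, including any dtype changes introduced by writing to CSV.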
Your score in this assignment will be based on
• Part 1 (30 points)
– evidence of how much effort you put into the assignment (how many different methods did you try?)
– how well did you document what you did?
– was your method for predicting the quality of your performance prone to over-fitting?
• Part 2 (60 points)
– how good are your predictions of length, word present, edited 2023? I will make my own predictions using your training data and I will compare
∗ your length mean absolute error to what I obtained in my predictions
∗ your true positive rate to what I obtained for the binary variables (assuming you managed to appropriately control the false positive rate)
– how well did you meet specifications - did you get your false positive rate in predictions of the binary variables close to .05 (again, compared to how well I was able to do this)