Phase 1: Warmup – Python Exercises (20 marks, worth 20% of subject grade)
In this phase, you will practice your Python wrangling skills with a publicly available dataset. The dataset is obtained through the TMDB (The Movie DB) API. It contains information on movies featured in the Full MovieLens Dataset and released on or before July 2017. The main features of the Movies Metadata file include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.
You will be working with the following dataset in this phase:
Movies_tmdb.csv: A set of movie records (approx. 45,000) released on or before July 2017. Note that this dataset is quite large, so during development you may find it beneficial to first test your code on a smaller sample of the data.
The libraries to use are Pandas and Matplotlib. You will need to write Python 3 code and work with the Series and DataFrames discussed in workshop week 2 and the data cleaning and basic visualisations covered in workshop weeks 3-4. If you use other packages, you must explain in your code why they are necessary.
List here all the Python libraries you will be using. Also load the dataset (.csv) into a DataFrame object.
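As a rough illustration of the loading step and of the earlier advice to develop on a smaller sample, the sketch below reads only the first few rows through the `nrows` parameter of `pd.read_csv`. The helper name `load_movies` and the in-memory CSV text are invented for illustration so that the snippet runs without the real file; the rows shown are not taken from the dataset.

```python
import io

import pandas as pd

# Invented stand-in for Movies_tmdb.csv so the sketch runs anywhere.
SAMPLE_CSV = io.StringIO(
    "adult,popularity,runtime,title\n"
    "False,21.9,81.0,Toy Story\n"
    "False,17.0,104.0,Jumanji\n"
    "False,11.7,101.0,Grumpier Old Men\n"
)


def load_movies(source, nrows=None):
    """Read movie records; nrows limits the read to the first rows only."""
    return pd.read_csv(source, nrows=nrows)


# With the real file this could be load_movies("Movies_tmdb.csv", nrows=1000).
movies_sample = load_movies(SAMPLE_CSV, nrows=2)
print(movies_sample.shape)
```

Passing `nrows=None` reads the whole file, so the same helper can serve both the quick development run and the final run.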
In [ ]:
1.1 Print the number of movies, the number of attributes/columns, the column names and the datatypes. The output of this step should look like the following: (2 Marks)
*** Q1.1
Number of movies: #
Number of attributes/columns: #
Column names: #
Column datatypes: #
***
where each # is the value or string you find.
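One possible way to produce a report in this shape is sketched below. It relies on `DataFrame.shape`, `DataFrame.columns` and `DataFrame.dtypes`; the function name `describe_frame`, the exact line layout and the tiny sample frame are assumptions made for illustration only.

```python
import pandas as pd

# Invented rows used only to exercise the report.
movies = pd.DataFrame(
    {
        "title": ["Toy Story", "Jumanji", "Heat"],
        "runtime": [81.0, 104.0, 170.0],
    }
)


def describe_frame(df):
    """Build the Q1.1 report as a single string."""
    n_rows, n_cols = df.shape
    lines = [
        "*** Q1.1",
        f"Number of movies: {n_rows}",
        f"Number of attributes/columns: {n_cols}",
        f"Column names: {list(df.columns)}",
        f"Column datatypes: {df.dtypes.astype(str).to_dict()}",
        "***",
    ]
    return "\n".join(lines)


report = describe_frame(movies)
print(report)
```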
In [ ]:
### answer Q1.1
1.2 In this assignment, we won't be using all the features (i.e. columns) included in the csv file, so create a new dataframe with the following columns: (1 Mark)
You must keep the order of the columns as provided above. The output of this question should print the first TWO rows (i.e. movies) of the newly created dataframe in the following format:
*** Q1.2
The first two rows from the filtered movies dataframe are: #
#
***
where each # represents one movie row.
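The selection step can be sketched as follows. Indexing a DataFrame with a list of names returns the columns in the order of that list, which is what keeps the required order. The list `wanted` is a placeholder, not the column list from the question, and the sample rows are invented.

```python
import pandas as pd

# Invented rows; the real dataframe comes from Movies_tmdb.csv.
movies = pd.DataFrame(
    {
        "title": ["Toy Story", "Jumanji", "Heat"],
        "adult": [False, False, False],
        "runtime": [81.0, 104.0, 170.0],
        "popularity": [21.9, 17.0, 17.9],
    }
)

# Placeholder list: substitute the columns named in the question, in order.
wanted = ["popularity", "title", "runtime"]

# .copy() makes the result independent of the original dataframe.
filtered = movies[wanted].copy()

print("*** Q1.2")
print("The first two rows from the filtered movies dataframe are:")
print(filtered.head(2))
print("***")
```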
In [ ]:
### answer Q1.2
2.1 Most of the columns in the movies dataframe have the object datatype. Convert the "popularity" column to the float64 datatype, the "title" column to string and the "adult" column to boolean. (1 Mark)
The output of this step should print the datatypes of all columns in the movies dataframe after the conversion. You should follow the following format:
***
Q2.1 Datatypes after conversion: #
***
where # should be the datatypes of the dataframe columns. Note: You don't have to create a new dataframe for this question, instead you can use the same dataframe which you created in Q1.2.
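A minimal sketch of the three conversions, under some assumptions: values that cannot be parsed are turned into missing values rather than raising an error, "string" is taken to mean pandas' dedicated `"string"` dtype, and "boolean" is taken to mean the nullable `"boolean"` dtype so that unparseable flags can stay missing. The sample rows, including the deliberately malformed ones, are invented.

```python
import pandas as pd

# Invented rows; dtype=object mimics columns that were read as generic objects.
movies = pd.DataFrame(
    {
        "adult": ["False", "True", "not a flag"],
        "popularity": ["21.9", "17.0", "oops"],
        "title": ["Toy Story", "Jumanji", "Heat"],
    },
    dtype=object,
)

# errors="coerce" turns unparseable entries into NaN instead of raising.
movies["popularity"] = pd.to_numeric(
    movies["popularity"], errors="coerce"
).astype("float64")

# Assumption: "string" refers to pandas' dedicated text dtype.
movies["title"] = movies["title"].astype("string")

# astype(bool) would mark every non-empty string as True, so map explicitly;
# anything outside the mapping becomes a missing value.
movies["adult"] = (
    movies["adult"].map({"True": True, "False": False}).astype("boolean")
)

print("***")
print("Q2.1 Datatypes after conversion:")
print(movies.dtypes)
print("***")
```

Mapping the text flags explicitly avoids a common pitfall: calling `astype(bool)` on strings treats `"False"` as true because the string is non-empty.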
In [ ]:
### answer Q2.1
2.2 Now, we will deal with the missing values as a preprocessing step before performing any further analysis. First, print the total number of missing values for each column separately. Then print the percentage of movies with incomplete data in any of their attributes (i.e. missing values). Note: A movie is considered an incomplete record if it has a missing value in at least one of its features. (2 Marks)
Note: a missing value might be 0, NaN, or an empty cell.
***
Q2.2 Number of missing values per attribute:
col_1: x
col_2: x
…
col_n: x
***
% of movies with incomplete data: #
***
Replace col_1, col_2 … col_n with the column names, x with the calculated values, and # with the calculated percentage.
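The counting logic could look roughly like the sketch below, which builds one boolean mask covering the three kinds of missing value mentioned in the note (NaN, empty cell, 0). The helper `missing_mask` is hypothetical, and the decision to leave boolean columns out of the zero check (so that `False` is not counted as missing) is an assumption, as is the invented sample data.

```python
import numpy as np
import pandas as pd

# Invented rows covering each kind of "missing": NaN, empty string, zero.
movies = pd.DataFrame(
    {
        "title": ["A", "", "C", "D"],
        "runtime": [90.0, 0.0, np.nan, 120.0],
        "budget": [100, 0, 50, 70],
    }
)


def missing_mask(df):
    """Return a boolean frame that is True where a cell counts as missing."""
    mask = df.isna()
    for name in df.columns:
        col = df[name]
        if pd.api.types.is_bool_dtype(col):
            continue  # assumption: False is a real value, not a missing one
        if pd.api.types.is_numeric_dtype(col):
            mask[name] = mask[name] | (col == 0)
        else:
            mask[name] = mask[name] | (col.astype(str).str.strip() == "")
    return mask


mask = missing_mask(movies)
per_column = mask.sum()                        # missing cells per column
pct_incomplete = mask.any(axis=1).mean() * 100  # rows with any missing cell

print("***")
print("Q2.2 Number of missing values per attribute:")
for name, count in per_column.items():
    print(f"{name}: {count}")
print("***")
print(f"% of movies with incomplete data: {pct_incomplete}")
print("***")
```

Taking `any(axis=1)` before averaging counts each movie once, however many of its attributes are missing.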
In [ ]:
### answer Q2.2
2.3 Write code that adds a new column called "runtime_non_missing" to the movies dataframe. The values in the new column should be copied from the "runtime" column, with all missing values replaced by the average of the non-missing values in that column. (2 Marks)
The output of this question should print the average calculated value in the following format:
***
Q2.3 Missing values in 'runtime' column are replaced with: #
***
where # is the calculated value.
Do you think it will be better to replace the missing values in the "runtime" column with the median instead of the average? Yes/No – Why?
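A small sketch of mean imputation, with the median computed alongside for comparison. It assumes that a runtime of 0 counts as missing, in line with the earlier note on missing values; if only NaN should count, the `replace` step can be dropped. The runtimes are invented, with one deliberately extreme value.

```python
import numpy as np
import pandas as pd

# Invented runtimes: one NaN, one zero, one extreme value.
movies = pd.DataFrame({"runtime": [90.0, 100.0, np.nan, 110.0, 0.0, 900.0]})

# Assumption: a runtime of 0 counts as missing, so turn it into NaN first.
runtime = movies["runtime"].replace(0, np.nan)

mean_runtime = runtime.mean()      # mean() skips NaN by default
median_runtime = runtime.median()  # computed only for comparison

# Copy of "runtime" with every missing entry replaced by the mean.
movies["runtime_non_missing"] = runtime.fillna(mean_runtime)

print("***")
print(f"Q2.3 Missing values in 'runtime' column are replaced with: {mean_runtime}")
print("***")
print(f"Median of the same values, for comparison: {median_runtime}")
```

In this invented sample the single extreme runtime pulls the mean well away from the median. Whether the real "runtime" column behaves the same way is something to check on the actual data before answering the question.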
In [ ]:
### answer Q2.3