CSE 6242 / CX 4242: Data and Visual Analytics
• Q1.py: The completed Python file
• nodes.csv: The csv file containing nodes
• edges.csv: The csv file containing edges
Follow the instructions found in Q1.py to complete the Graph class, the TMDbAPIUtils class, and the one
global function. The Graph class will serve as a re-usable way to represent and write out your collected
graph data. The TMDbAPIUtils class will be used to work with the TMDb API for data retrieval.
Tasks and point breakdown
1. [10 pts] Implementation of the Graph class according to the instructions in Q1.py.
a. The graph is undirected, thus {a, b} and {b, a} refer to the same undirected edge in the graph;
keep only either {a, b} or {b, a} in the Graph object. A node’s degree is the number of (undirected)
edges incident on it. In-degrees and out-degrees are not defined for undirected graphs.
2. [10 pts] Implementation of the TMDbAPIUtils class according to instructions in Q1.py. Use version 3 of
the TMDb API to download data about actors and their co-actors. To use the API:
a. Create a TMDb account and follow the instructions in this document to obtain an API key.
b. Be sure to use the key, not the token. The key is the shorter of the two.
c. Refer to the TMDB API Documentation as you work on this question.
3. [20 pts] Producing correct nodes.csv and edges.csv.
a. If an actor's name has comma characters (“,”), remove those characters before writing that name
into the CSV files.
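The undirected-edge rule in task 1a and the comma rule in task 3a can be sketched as follows. This is only an illustration under stated assumptions, not the Q1.py skeleton: the helper names canonical_edge, add_edge, degree, and clean_name are hypothetical, the sample IDs and name are made up, and the real method signatures are whatever Q1.py specifies.

```python
def canonical_edge(a, b):
    """Order the endpoints so that (a, b) and (b, a) map to the same tuple."""
    return (a, b) if a <= b else (b, a)


def add_edge(edges, a, b):
    """Add an undirected edge to a set; the reversed duplicate is a no-op."""
    edges.add(canonical_edge(a, b))


def degree(edges, node):
    """Count the undirected edges incident on the given node."""
    return sum(1 for a, b in edges if node in (a, b))


def clean_name(name):
    """Remove comma characters before the name is written to a CSV file."""
    return name.replace(",", "")


# Hypothetical node IDs, used only for illustration.
edges = set()
add_edge(edges, "2975", "1100")
add_edge(edges, "1100", "2975")   # same undirected edge, stored only once
add_edge(edges, "1100", "3300")

edge_count = len(edges)
degree_of_1100 = degree(edges, "1100")
cleaned = clean_name("Sammy Davis, Jr.")
```

Storing each edge in one canonical order means a membership check alone is enough to reject the reversed duplicate.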
Version 1
Q2 [35 points] SQLite
SQLite is a lightweight, serverless, embedded database that can easily handle multiple gigabytes of data. It
is one of the world’s most popular embedded database systems. It is convenient to share data stored in an
SQLite database — just one cross-platform file that does not need to be parsed explicitly (unlike CSV files,
which must be parsed). You can find instructions to install SQLite here. In this question, you will construct a
TMDb database in SQLite, partition it, and combine information within tables to answer questions.
You will modify the given Q2.py file by adding SQL statements to it. We suggest testing your SQL locally on
your computer using interactive tools to speed up testing and debugging, such as DB Browser for SQLite.
Technology
• SQLite release 3.37.2
• Python 3.10.x
Allowed Libraries
Do not modify import statements. Everything you need to complete this question
has been imported for you. Do not use other libraries for this question.
Max runtime
10 minutes. Submissions exceeding this will receive zero credit.
Deliverables
• Q2.py: Modified file containing all the SQL statements you have used to answer
parts a - h in the proper sequence.
IMPORTANT NOTES:
• If the final output asks for a decimal column, format it to two decimal places using printf(). Do NOT use the
ROUND() function, as in rare cases, it works differently on different platforms. If you need to sort that
column, be sure you sort it using the actual decimal value and not the string returned by printf.
• A sample class has been provided to show example SQL statements; you can turn off this output by
changing the global variable SHOW from True to False.
• In this question, you must only use INNER JOIN when performing a join between two tables, except for
part g. Other types of joins may result in incorrect results.
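The first note above can be illustrated with a small sketch using Python's built-in sqlite3 module. The table demo and its rows are hypothetical and exist only to show why sorting on the printf() string gives a different order than sorting on the underlying number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (name TEXT, value REAL)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [("a", 9.5), ("b", 10.25), ("c", 100.0)],
)

# Format with printf() for display, but sort on the numeric column itself.
numeric_order = conn.execute(
    "SELECT name, printf('%.2f', value) FROM demo ORDER BY value DESC"
).fetchall()

# Sorting on the formatted string compares text character by character,
# so '9.50' sorts ahead of '100.00'.
string_order = conn.execute(
    "SELECT name, printf('%.2f', value) FROM demo "
    "ORDER BY printf('%.2f', value) DESC"
).fetchall()

print(numeric_order)  # [('c', '100.00'), ('b', '10.25'), ('a', '9.50')]
print(string_order)   # [('a', '9.50'), ('c', '100.00'), ('b', '10.25')]
```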
Tasks and point breakdown
1. [9 points] Create tables and import data.
a. [2 points] Create two tables (via two separate methods, part_ai_1 and part_ai_2, in Q2.py) named
movies and movie_cast with columns having the indicated data types:
i. movies
1. id (integer)
2. title (text)
3. score (real)
ii. movie_cast
1. movie_id (integer)
2. cast_id (integer)
3. cast_name (text)
4. birthday (text)
5. popularity (real)
b. [2 points] Import the provided movies.csv file into the movies table and movie_cast.csv into the
movie_cast table.
i. Write Python code that imports the .csv files into the individual tables. This will include
looping through the file and using the ‘INSERT INTO’ SQL command. Make sure you use
paths relative to the Q2 directory.
c. [5 points] Vertical Database Partitioning. Database partitioning is an important technique that
divides large tables into smaller tables, which may help speed up queries. Create a new table
cast_bio from the movie_cast table. Be sure that the values are unique when inserting into
the new cast_bio table. Read this page for an example of vertical database partitioning.
i. cast_bio
1. cast_id (integer)
2. cast_name (text)
3. birthday (text)
4. popularity (real)
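One possible shape for the import loop and the vertical partition is sketched below. It is an assumption-laden illustration rather than the expected Q2.py solution: the rows are made-up stand-ins for movie_cast.csv, the file is assumed to have no header row, and an in-memory string replaces the real relative file path.

```python
import csv
import io
import sqlite3

# Hypothetical sample rows standing in for movie_cast.csv (assumed headerless).
sample_csv = io.StringIO(
    "1,10,Actor A,1970-01-01,12.5\n"
    "2,10,Actor A,1970-01-01,12.5\n"
    "2,11,Actor B,1980-02-02,3.0\n"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movie_cast (movie_id INTEGER, cast_id INTEGER, "
    "cast_name TEXT, birthday TEXT, popularity REAL)"
)

# Loop through the CSV rows and insert each one with a parameterized INSERT INTO.
for row in csv.reader(sample_csv):
    conn.execute("INSERT INTO movie_cast VALUES (?, ?, ?, ?, ?)", row)

# Vertical partition: move the per-person columns into their own table.
conn.execute(
    "CREATE TABLE cast_bio (cast_id INTEGER, cast_name TEXT, "
    "birthday TEXT, popularity REAL)"
)
# SELECT DISTINCT keeps one row per cast member even if they appear in many movies.
conn.execute(
    "INSERT INTO cast_bio "
    "SELECT DISTINCT cast_id, cast_name, birthday, popularity FROM movie_cast"
)

bio_rows = conn.execute("SELECT * FROM cast_bio ORDER BY cast_id").fetchall()
```

Here Actor A appears in two movies but ends up as a single row in cast_bio. The declared column types let SQLite convert the CSV text values to integers and reals on insert.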
2. [1 point] Create indexes. Create the following indexes. Indexes increase data retrieval speed; though the
speed improvement may be negligible for this small database, it is significant for larger databases.
a. movie_index for the id column in movies table
b. cast_index for the cast_id column in movie_cast table
c. cast_bio_index for the cast_id column in cast_bio table
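As a minimal sketch of index creation, the example below builds the first index on an empty movies table and confirms it was registered. The in-memory database is an assumption made for illustration; the other two indexes would follow the same pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (id INTEGER, title TEXT, score REAL)")

# CREATE INDEX <index name> ON <table> (<column>)
conn.execute("CREATE INDEX movie_index ON movies (id)")

# sqlite_master lists every index defined in the database.
index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'movies'"
    )
]

# EXPLAIN QUERY PLAN shows whether a lookup on id would use the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT title FROM movies WHERE id = 1"
).fetchall()
```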
3. [3 points] Calculate a proportion. Find the proportion of movies with a score between 7 and 20 (both limits
inclusive). The proportion should be calculated as a percentage.
a. Output format and example value:
7.70
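The percentage calculation can be sketched on toy data as follows. The four scores are hypothetical, and the query is one possible formulation rather than the required one. Multiplying by 100.0 forces floating-point arithmetic, since dividing one integer count by another would truncate in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (id INTEGER, title TEXT, score REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [(1, "A", 5.0), (2, "B", 7.0), (3, "C", 20.0), (4, "D", 25.0)],
)

# BETWEEN is inclusive on both ends; the comparison evaluates to 1 or 0,
# so SUM() counts the matching rows.
proportion = conn.execute(
    "SELECT printf('%.2f', 100.0 * SUM(score BETWEEN 7 AND 20) / COUNT(*)) "
    "FROM movies"
).fetchone()[0]

print(proportion)  # 50.00
```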
4. [4 points] Find the most prolific actors. Among cast members with a popularity > 10, list the 5 with the
highest number of movie appearances. Sort the results by the number of appearances in descending order,
then by cast_name in alphabetical order.
a. Output format and example row values (cast_name,appearance_count):
Harrison Ford,2
5. [4 points] List the 5 highest-scoring movies. In the case of a tie, prioritize movies with fewer cast members.
Sort the result by score in descending order, then by number of cast members in ascending order, then
by movie name in alphabetical order.
a. Output format and example values (movie_title,score,cast_count):
Star Wars: Holiday Special,75.01,12
Games,58.49,33
6. [4 points] Get high scoring actors. Find the top ten cast members who have the highest average movie
scores. Sort the output by average_score in descending order, then by cast_name alphabetically.
a. Exclude movies with score < 25 before calculating average_score.
b. Include only cast members who have appeared in three or more movies with score >= 25.
i. Output format and example value (cast_id,cast_name,average_score):
8822,Julia Roberts,53.00
7. [2 points] Creating views. Create a view (virtual table) called good_collaboration that lists pairs of
actors who have had a good collaboration as defined here. Each row in the view describes one pair of
actors who appeared in at least 2 movies together AND the average score of these movies is >= 40.