COMP7105 Advanced topics in data science
Assignment 1
Data Preprocessing and Object Similarity
The objective of the first assignment part is to familiarize students with data cleaning and preprocessing
tasks, as well as similarity computation between multidimensional objects.
You are going to use the COIL 1999 Competition Data. The data contains measurements of river chemical
concentrations and algae densities.
The description of the data can be found here: https://kdd.ics.uci.edu/databases/coil/coil.data.html
Each line of the file corresponds to one sample analysis from a European river. The attribute values are
separated by commas. The first attribute (nominal) is the season when the sample was taken, the second
attribute (ordinal) is the river size and the third one (ordinal) is its flow velocity. The remaining attributes
(interval-scaled) are chemical concentration measures and distributions of different kinds of algae.
Step 1: data loading, cleaning, and transformation
Read the data from the file into your program and create one object for each line of the file. The identifier
of each object should correspond to the line number (start counting from 0). For example, the data of the
first line should be the data of object with identifier 0.
You will notice two problems while reading the data. First, on some lines (e.g., the 20th line) the 7th and 8th
attribute values are merged into a single value (i.e., not separated by a comma). This problem could be due to
typing or OCR errors by the data producer. Correct this problem by (a) keeping the first five digits after the
first decimal point (.) as the decimal part of the first number and (b) keeping the three digits immediately
before the second decimal point as the integer part of the second number, dropping any digits in between.
For example, 2.822008777.59961 on the 20th line should become the values 2.82200 and 777.59961.
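The splitting rule above can be sketched with a regular expression. This is only one possible approach, not a reference solution; the function name split_merged and the assumption that a merged token always contains exactly two decimal points are illustrative.

```python
import re

# Assumed shape of a merged token:
#   <integer part>.<5 decimals><stray digits><3-digit integer part>.<decimals>
_MERGED = re.compile(r"^(\d+\.\d{5})\d*?(\d{3}\.\d+)$")

def split_merged(token):
    """Return [token] if it is a normal value, or the two recovered values."""
    token = token.strip()
    if token.count(".") != 2:
        return [token]  # ordinary value (or a missing-value marker)
    m = _MERGED.match(token)
    if m is None:
        raise ValueError(f"unexpected merged token: {token!r}")
    return [m.group(1), m.group(2)]

print(split_merged("2.822008777.59961"))
```

Applied to every comma-separated token of a line, this restores the expected number of attributes before any numeric conversion takes place.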
The second problem is that some values are missing (denoted by XXXXXXX).
When creating your objects, fix all problems and transform all values to numerical ones. Transform missing
values to a special numerical value (-1) and count the missing values. Convert nominal values to integer
identifiers (starting from 0), e.g., spring = 0, summer = 1, etc. Convert ordinal values to integers starting
from 0 (i.e., small = 0, medium = 1, large = 2). In addition, for each attribute keep track of (a) the type of the
attribute (0=nominal, 1=ordinal, 2=interval-scaled) and (b) the minimum and maximum values of the attribute
in the data. Print this information. For example, the program output should include:
types of attributes: [0, 1, 1, 2, ...]
minimum values per attribute: [0, 0, 0, 5.6, ...]
maximum values per attribute: [3, 2, 2, 9.7, ...]
total number of missing values: ...
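A minimal sketch of the encoding and bookkeeping in Step 1 follows. Several things here are assumptions rather than part of the assignment text: the label spellings (padded with trailing underscores, as in large_ and high__), the season order after summer, the low/medium/high labels for flow velocity, the helper names encode and summarize, and the choice to exclude missing values when computing minima and maxima. Merged tokens are assumed to have been split already.

```python
MISSING = -1.0
SEASONS = {"spring": 0, "summer": 1, "autumn": 2, "winter": 3}  # order after summer is assumed
SIZES = {"small": 0, "medium": 1, "large": 2}
SPEEDS = {"low": 0, "medium": 1, "high": 2}  # assumed velocity labels

def encode(tokens):
    """Turn one row of string tokens into numbers; missing values become -1."""
    maps = [SEASONS, SIZES, SPEEDS]
    row = []
    for i, tok in enumerate(tokens):
        tok = tok.strip()
        if set(tok) == {"X"}:            # e.g. XXXXXXX
            row.append(MISSING)
        elif i < 3:                      # nominal / ordinal labels
            row.append(maps[i][tok.rstrip("_")])
        else:                            # interval-scaled measurements
            row.append(float(tok))
    return row

def summarize(rows):
    """Attribute types, per-attribute min/max over known values, missing count."""
    n = len(rows[0])
    types = [0, 1, 1] + [2] * (n - 3)
    mins, maxs = [], []
    for j in range(n):
        known = [r[j] for r in rows if r[j] != MISSING]
        mins.append(min(known))
        maxs.append(max(known))
    missing = sum(v == MISSING for r in rows for v in r)
    return types, mins, maxs, missing

rows = [encode(["winter", "large_", "high__", "8.10000", "XXXXXXX"]),
        encode(["spring", "small_", "low___", "5.60000", "2.50000"])]
types, mins, maxs, missing = summarize(rows)
print("types of attributes:", types)
print("minimum values per attribute:", mins)
print("maximum values per attribute:", maxs)
print("total number of missing values:", missing)
```

Using -1 as the missing marker relies on no real measurement being equal to -1, which is why the marker is skipped when collecting the minimum and maximum.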
Step 2: computation of similarity
Define a function which takes as input two objects and computes their similarity by averaging their
similarities over all attributes. Use slide 61 of lecture “01 data” to see how you can compute the similarity for
each individual attribute. For each ordinal and interval attribute, consider 0 as the smallest possible
dissimilarity and the maximum minus the minimum value as the maximum dissimilarity. To compute the
similarity between two interval attribute values, use the formula 1-dissimilarity/max_dissimilarity; this will
give you a value between 0 and 1. When computing the similarity between two objects, you should ignore
all attributes where at least one of the two objects has a missing value.
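As a rough illustration, the averaged similarity could look like the sketch below. It assumes that nominal attributes score 1 on an exact match and 0 otherwise, and that ordinal attributes use the same 1-dissimilarity/max_dissimilarity formula as interval attributes; check both against the lecture slide. The parameter names and the return value of 0.0 when no attribute is comparable are also assumptions.

```python
MISSING = -1.0

def similarity(x, y, types, mins, maxs):
    """Average per-attribute similarity, skipping attributes with a missing value."""
    sims = []
    for a, b, t, lo, hi in zip(x, y, types, mins, maxs):
        if a == MISSING or b == MISSING:
            continue
        if t == 0:                       # nominal: match or no match (assumed)
            sims.append(1.0 if a == b else 0.0)
        else:                            # ordinal or interval
            max_dissim = hi - lo
            if max_dissim == 0:          # constant attribute: values are identical
                sims.append(1.0)
            else:
                sims.append(1.0 - abs(a - b) / max_dissim)
    return sum(sims) / len(sims) if sims else 0.0

types = [0, 1, 2, 2]
mins = [0, 0, 0.0, 1.0]
maxs = [3, 2, 10.0, 5.0]
x = [3, 2, 5.0, MISSING]
y = [3, 0, 7.5, 2.0]
print(similarity(x, y, types, mins, maxs))
```

In this toy example the fourth attribute is skipped because x has no value for it, so the result is the mean of 1, 0, and 0.75.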
Use your function to:
a) Find the pair of objects with the highest similarity; print the pair of objects, their attribute values,
and their similarity
b) Find the pair of objects with the lowest similarity; print the pair of objects, their attribute values,
and their similarity
c) Find the most similar object to the following query object:
winter,large_,high__,8.10000,7.50000,140.00000,1.00000,60.00000,100.00000,140.00000,31.00000,1.00000,10.00000,3.00000,1.00000,0.00000,0.00000,5.00000
Print the object and its similarity to the query object.
For example, we should see in the program output:
maximum similarity: 0.XX; pair with maximum similarity: [x, y]
object x = [...]
object y = [...]
minimum similarity: 0.XX; pair with minimum similarity: [x, y]
object x = [...]
object y = [...]
highest similarity to input object: 0.XX; object with highest similarity: x
object x = [...]
What do you observe about the difference between the maximum and minimum similarity? Please comment
on this.
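The three searches requested in Step 2 (and again in Step 3, with a distance in place of a similarity) amount to scanning all unordered pairs and scanning all objects against the query. A hedged sketch, in which the helper names extreme_pairs and nearest and the two-argument score callback are illustrative choices:

```python
from itertools import combinations

def extreme_pairs(objects, score):
    """Return ((max_score, i, j), (min_score, i, j)) over all pairs with i < j."""
    best = worst = None
    for i, j in combinations(range(len(objects)), 2):
        s = score(objects[i], objects[j])
        if best is None or s > best[0]:
            best = (s, i, j)
        if worst is None or s < worst[0]:
            worst = (s, i, j)
    return best, worst

def nearest(objects, query, score, largest=True):
    """Identifier and score of the object scoring highest (or lowest) against query."""
    pick = max if largest else min
    i = pick(range(len(objects)), key=lambda k: score(objects[k], query))
    return i, score(objects[i], query)

# Toy usage with a stand-in score (negated absolute difference of the first value).
objects = [[0], [1], [5]]
score = lambda a, b: -abs(a[0] - b[0])
best, worst = extreme_pairs(objects, score)
print("maximum:", best[0], "pair:", [best[1], best[2]])
print("minimum:", worst[0], "pair:", [worst[1], worst[2]])
print(nearest(objects, [4], score))
```

With ties, this sketch keeps the first pair encountered. The scan is quadratic in the number of objects, which is one reason the running time of the pairwise search is worth keeping an eye on.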
Step 3: computation of Euclidean distance
Define a function which takes as input two objects and computes their Euclidean distance, considering only
interval attributes, whose values are normalized. To compute the Euclidean distance between objects x and
y, ignore their first three attributes. Then, for each remaining attribute, convert the original object value to a
value in [0,1] by (a) subtracting the smallest value in this dimension and (b) dividing by the difference between
the largest and smallest values in this dimension (see min-max normalization in the course slides). After
converting the values, compute the squared differences between the values, sum up all the squared
differences, and then compute the square root (see slide 62 of lecture “01 data”). When computing the
Euclidean distance between two objects, you should ignore all attributes where at least one of the two objects
has a missing value.
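The normalization and distance computation can be sketched as follows. The parameter names, the first_interval default of 3, and the decision to skip an attribute whose minimum equals its maximum are assumptions made for illustration.

```python
import math

MISSING = -1.0

def euclidean(x, y, mins, maxs, first_interval=3):
    """Euclidean distance over min-max normalized interval attributes."""
    total = 0.0
    for k in range(first_interval, len(x)):
        a, b = x[k], y[k]
        if a == MISSING or b == MISSING:
            continue                     # ignore attributes with a missing value
        span = maxs[k] - mins[k]
        if span == 0:
            continue                     # constant attribute contributes nothing
        na = (a - mins[k]) / span        # min-max normalization
        nb = (b - mins[k]) / span
        total += (na - nb) ** 2
    return math.sqrt(total)

mins = [0, 0, 0, 0.0, 10.0]
maxs = [3, 2, 2, 10.0, 20.0]
x = [0, 0, 0, 0.0, 10.0]
y = [3, 2, 2, 10.0, 20.0]
print(euclidean(x, y, mins, maxs))
```

Because skipped attributes add nothing to the sum, pairs with many missing values tend to look closer than they otherwise would, which may be worth mentioning when commenting on the results.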
Use your function to:
a) Find the pair of objects with the largest distance; print the pair of objects, their attribute values,
and their Euclidean distance
b) Find the pair of objects with the smallest distance; print the pair of objects, their attribute values,
and their Euclidean distance
c) Find the object with the smallest distance to the following query object:
winter,large_,high__,8.10000,7.50000,140.00000,1.00000,60.00000,100.00000,140.00000,31.00000,1.00000,10.00000,3.00000,1.00000,0.00000,0.00000,5.00000
Print the object and its distance to the query object.
For example, we should see in the program output:
maximum distance: X.XX; pair with maximum distance: [x, y]
object x = [...]
object y = [...]
minimum distance: 0.XX; pair with minimum distance: [x, y]
object x = [...]
object y = [...]
smallest distance to input object: 0.XX; object with smallest distance: x
object x = [...]
What do you observe about the difference between the largest and smallest distance? Please comment on
this.
Submission requirements
You should submit your program. You should also write a short report about your implementation, in case
it is hard for us to understand your program. Your report should include the output of your program. Your
report should also include your comments about the differences between the smallest/largest similarity and
distance and any other observations that you have. Submit the report together with your code on the course
website. Your code should compile without problems and you should include basic instructions on
how to compile and use it. Your program can be written in your preferred programming language (e.g., C,
C++, Java, Python, etc.). Your program must run within reasonable time.
Please submit your assignment (one ZIP file) to Moodle on or before 5:00pm, Feb 21st, 2022. Make sure
all contents are readable.