Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
AI506: DATA MINING AND SEARCH
Homework 1: Locality Sensitive Hashing
With the advent of the age of online journalism, it has become convenient to get access to a abundant amount
of news and articles. However, plagiarism also occurs frequently due to this ease of access and publication.
Although plagiarism is a serious problem, it is not a trivial task to detect the plagiarism among a vast amount
of documents. Locality Sensitive Hashing (LSH) is an algorithm that is well suited to handle this problem.
It can find similar pairs of documents in O(n) time.
In this assignment, you will find similar pairs of documents in the 20-news group dataset.
The dataset is a collection of articles consists of 20 different news-
groups. You will use the cosine similarity between the documents and find similar pairs.
Consider two vectors x and y. The cosine similarity between the two vectors is computed by the angle
θ between them. Suppose we pick random hyperplanes h1 and h2 whose normal vectors are v1 and v2,
respectively, as in Figure 1. While x and y are on different sides of h1, they are on same side of h2. That is,
the dot products v1 · x and v1 · y have different signs, while the signs of v2 · x and v2 · y are the same. Then,
what is the probability that the randomly chosen hyperplane is like h1 rather than h2? Since angles for the
line that is the intersection of the random hyperplane and the plane of x and y are equally likely, the random
hyperplane is likely to be like h1 with probability θ/180. Refer to
for details.
Figure 1: Two vectors x and y make an angel θ.
1
1 Random Hyperplanes and Cosine Similarity [50 points]
The starter code is provided for you to implement the algorithm. Fill in the blanks in the code marked as
programming assignments.
1.1 Shingling [15 points]
In this section, you will implement the shingling algorithm to convert the document into the characteristic
matrix. You will use a word as a token for your shingle, not a character. However, since storing the whole
characteristic matrix in the form of a dense matrix is expensive in terms of space, your implementation
should store the characteristic matrix in the form of a dictionary.
Note that implementation of shingling is divided into 2-steps just for the readability of the algorithm.
1. (10 points) Implement the get_shingles to get 5-shingles from the documents. You should also
take care of your implementation’s computational efficiency. It should take less than 20 seconds to
2. (5 points) Implement the build_doc_to_shingle_dictionary to build the dictionary that
maps each document to the list of [5-shingle, count of the shingle] in the document. It should take less
than 5 minutes to build the dictionary.
1.2 Random Hyperplanes [35 points]
In this section, you will take a time to understand the random hyperplanes and cosine similarity.
1.2.1 (Written) Computing the signatures [10 points]
First, to review the algorithm, calculate by hand the cosine signatures using a simple characteristic matrix.
Element S1 S2 S3 S4 S5
0 0 0 1 1 2
1 1 0 1 1 2
2 1 1 1 0 0
3 0 1 0 0 0
4 0 0 0 2 1
5 0 0 3 0 0
Characteristic matrix
1. (5 points) Calculate the cosine similarity signatures of each column of the matrix given above using
the following 4 random hyperplanes. Assume 0 is positive. Note that each cosine similarity signature
is either +1 or −1.
h1 :
0.1
−1.8
0.7
−0.4
−0.4
0.9
h2 :
0.8
−0.3
−0.6
1.4
0.1
−0.9
h3 :
1.1
0.7
−0.7
−0.5
−0.8
0.3
h4 :
0.2
−1.3
0.3
−0.4
1.0
−0.5
2. (5 points) Calculate the true and estimated cosine similarities of the 10 pairs of the columns.
2
1.2.2 Implementation [25 points]
Now, implement the algorithm to convert the characteristic matrix into the signatures.
1. (5 points) Implement the cosine_similarity to compute the cosine similarity of two given
vectors.
2. (20 points) Implement the make_signature to create the signature for the documents. It should
take less than 15 minutes to compute the signature.
2 Locality Sensitive Hashing (LSH) [50 points]
The starter code is provided for you to implement the LSH algorithm and analyze it. Fill in the blanks in the
code marked as Programming assignments.
2.1 Locality Sensitive Hashing [20 points]
In this section, you will implement the LSH algorithm. It receives signature matrix and b (i.e., the number
of bands) and r (i.e., the number of rows in each band) as input parameters, and return the candidate pairs
of the similar documents.
1. (20 points) Implement the lsh function to find all candidate pairs of the similar documents using LSH
algorithm.
Notes: Use python’s dictionary to make your hash table, where each column is hashed into a bucket. Convert
each column vector (within a band) into the tuple and use it as a key of the dictionary.
2.2 Analysis [30 points]
In this section, you will analyze the LSH algorithm in terms of query speed and the various measures of
relevance. In this assignment, we fixed M (i.e. the number of hyperplanes) into 256, b to be the divisor of M :
b ∈ {2, 4, 8, 16}, and s = 0.8 (i.e. the similarity threshold for checking condition positives).
1. (10 points) Implement the query_analysis function to compute the query time, precision, recall,
and F1 score as b and r change.
2. (10 points) Attach your query time graph. How does the query speed change as b changes? Explain
the results in terms of the computational complexity.
3. (10 points) Attach your precision, recall, and f1-score graphs. How does each measure change as b
changes? Which b gives the best result in terms of each measure? Explain the results in detail.
3
3 How to submit your assignment
3.1 Submission
1. Download the attached file. It contains the skeleton codes.
2. Fill in the skeleton codes.
3. Modify the name of the file and zip it into hw1-[your student id].tar.gz, which should contain the
following files:
• hw1.ipynb: this file should contain your source code.
• report.pdf: this file should contain your answers to the questions. Also, please attach your im-
plementations and if you have a test message, please attach the results as well. For the written
assignments, we prefer you to write down the equations in word or latex.
• readme.txt: this file should contain the names of any individuals from whom you received help,
and the nature of the help that you received. That includes help from friends, classmates, lab
TAs, course staff members, etc. In this file, you are also welcome to write any comments that
can help us grade your assignment better, your evaluation of this assignment, and your ideas.