Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
Computer Science 320SC
Programming Assignment 6
Requirements
This 6th assignment lets you get familiar with randomized algorithm design and development. It is
worth 5% of your total course marks. We would like you to implement a Bloom Filter to simulate a
email spam filtering.
An excessive number of submissions (over 10) for a particular problem will accrue a 20% penalty per
that problem if you eventually solve it. Therefore, please write a bruteforce algorithm and test
your randomized algorithm with your own generated inputs at scale before submitting to
the automated marker.
We only accept Python programs that use built-in packages (i.e. packages that do not require pip
install).
1 Email spam filtering
1.1 Problem description
You are implementing a Bloom filter for email spam filtering for the University of Auckland. Given
that you know the set A of n trusted email addresses, e.g.
[email protected]. If the email
comes from one of these trusted addresses, it is not a spam. Otherwise, it will be a spam. Since the
server memory is very limited, you can not keep the list of trusted addresses on its memory (if an email
address requires 25 bytes on average and you have 1 billion trusted emails, it would take 25 GB to store
A in the memory).
Due to the limited resource of automarker, instead of using a string, we use a random integer in [0 . . . 2
31]
to present an email address. Also, to simpify the task, please use the following code to generate the list
of n integers that represent for n email addresses. Note that you have to use random.seed(320) to
ensure that you generate the same set of integers as automarker.
You would need to write a Python script to implement a Bloom filter given the setting of m = 8n bits and
different numbers of hash functions k = {2, 4, 6, 8}. The used hash function is: f(x) = (a·x+b) mod m
where a and b are generated using the following code:
import random
random.seed(320)
n = 100
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
for i in range(n):
x = random.randint(0, 2**31)
# Construct several Bloom filters here
2
# Testing here
for line in sys.stdin:
y = int(line)
# Check y is in the set S here
Note that you must not call x = random.randint(0, 2**31) several times since it will generate different sets S.
Given k = i, you will use i hash functions, from f1 to fi(x) = (a[i − 1] · x + b[i − 1]) mod m. For
example, given k = 2, you will use
f1(x) = (a[0] · x + b[0]) mod m = (x + 8) mod m
and
f2(x) = (a[1] · x + b[1]) mod m = (2x + 7) mod m.
1.2 Test case description
Your test will be a sequence of 2000 integers, each value per line corresponding to the email address,
containing non-spam and spam emails. Your output will be 4 integers, each per line, that count the
number of detected spam emails for each value of k = {2, 4, 6, 8}.
There are 4 test cases.
1. A trial test case of n = 100 has 0 mark.
2. A test case of n = 1, 000 has 2 marks.
3. A test case of n = 10, 000 and has 2 marks.
4. A test case of n = 100, 000 and has 1 mark.