IERG4300 / ESTR4300 / IEMS5709 Fall Homework 2
The solution will be posted right after the deadline, so no late homework will be accepted!

Every student MUST include the following statement, together with his/her signature, in the submitted homework:

I declare that the assignment submitted on the Elearning system is original except for source material explicitly acknowledged, and that the same or related material has not been previously submitted for another course. I also acknowledge that I am aware of University policy and regulations on honesty in academic work, and of the disciplinary guidelines and procedures applicable to breaches of such policy and regulations.

Submission notice:
● Submit your homework via the elearning system.

General homework policies:
A student may discuss the problems with others. However, the work a student turns in must be created COMPLETELY by oneself ALONE. A student may not share ANY written work or pictures, nor may one copy answers from any source other than one's own brain.

Each student MUST LIST on the homework paper the name of every person he/she has discussed or worked with. If the answer includes content from any other source, the student MUST STATE THE SOURCE. Failure to do so is cheating and will result in sanctions. Copying answers from someone else is cheating even if one lists their name(s) on the homework.

If there is information you need to solve a problem but the information is not stated in the problem, try to find the data somewhere. If you cannot find it, state what data you need, make a reasonable estimate of its value, and justify any assumptions you make. You will be graded not only on whether your answer is correct but also on whether you have done an intelligent analysis.

Question 1 [20 marks]: Frequent Itemsets
Consider running the PCY algorithm on a data set with 500 million baskets to count frequent pairs of items. Suppose each basket contains n items and there are d distinct item-pairs amongst all of the baskets.
Consider the following setup during the first pass of PCY: besides keeping the counters for every singleton itemset observed during the first pass, we can still afford to store in main memory 300 million integers, each of which serves as a bucket counter. Assume further that d is much larger than the total number of buckets available, i.e., d >> 300 million.

(a) [10 marks] As a function of n and/or d, what is the minimum support threshold s we can allow if the average count for a bucket should be no more than 40% of the threshold?

(b) [10 marks] Suppose that A, B, C, D, E, and F are all the items under consideration. For a particular support threshold, the maximal frequent itemsets are {A, B, D} and {C, E}. What are all the other frequent itemsets?

Q2 [80 marks + 20 Bonus marks]: Finding frequent itemsets
In this problem, we use the Shakespeare data set. The original data set has been pre-processed as follows:
● Apply a sliding window of 40 words on each work. All the 40 words in one window make up a basket.
● Delete duplicate words in one basket, then filter out some common words.
● You can download the pre-processed data from the following link.
● Each line of this input is a tab-separated list of words that corresponds to one basket.

The threshold for a frequent pair is defined as s = 0.005, where the relative frequency of a pair = occurrence of pair (i, j) / total number of baskets. For Q2(a), (b), (c), and (d), if the number of frequent pairs is larger than 40, please only submit the top 40 pairs (if any). Your results should consist of the frequent pairs and their corresponding counts.
● You are allowed to use the Linux command sort to post-process your results.

(a) [25 marks] Implement the A-Priori algorithm to find frequent pairs on a single machine
Refer to the lecture slides, Pages 30–31, to implement the A-Priori algorithm. You do not need to use the MapReduce framework for this sub-question. You can run this job on one single AWS/GoogleCloud machine.
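For reference, the two-pass A-Priori idea from the slides (pass 1 counts singletons; pass 2 counts only pairs whose members are both frequent) can be sketched as follows. This is a minimal illustration, not a complete solution: the in-memory data structures and the way baskets are supplied are assumptions, and for the assignment you would read the tab-separated baskets from the downloaded file and set min_count from s = 0.005 times the total number of baskets.

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, min_count):
    """Two-pass A-Priori for frequent pairs.

    baskets: iterable of lists of words (one list per basket).
    min_count: absolute support threshold (s * number of baskets).
    """
    # Pass 1: count singleton items and keep only the frequent ones.
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {i for i, c in item_counts.items() if c >= min_count}

    # Pass 2: count only candidate pairs, i.e., pairs of two frequent items.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {pair: c for pair, c in pair_counts.items() if c >= min_count}

# Illustrative usage with toy baskets (the real input is one
# tab-separated basket per line of the pre-processed data file):
baskets = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["d"]]
print(apriori_pairs(baskets, 2))
```

Note that iterating over the basket list twice is what makes this "two passes"; if you stream the file instead of holding it in memory, you would re-open it for the second pass.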
Note that dicvmc4.ie.cuhk.edu.hk is only a client for our DIC cluster. Please do NOT run the job on this machine.

(b) [30 marks] Implement the SON algorithm on MapReduce to find frequent pairs
Implement the SON algorithm under the MapReduce framework to find the frequent pairs. Note that your code should be scalable; in other words, your code should allow multiple mappers or reducers in both jobs. You need to implement two MapReduce jobs:
● The first MapReduce job should use the A-Priori algorithm to find the candidate pairs, i.e., those which are frequent in at least one input file.
● The second MapReduce job counts only the candidate frequent pairs.

Tips:
● In the second MapReduce job, each mapper will load all the candidate pairs. You can pass them as a supplementary file.

Streamline and performance comparison:
● Wrap the two MapReduce rounds into a single executable by putting the commands you typed before into a shell script.
● Compare the overall execution times of (a) and (b).
● Output the command you use to submit the Hadoop job.

You can use the IE Data-Intensive Cluster (DIC) or any other Hadoop cluster (e.g., the AWS/GoogleCloud cluster built in HW#0) on a cloud computing platform of your choice to do this problem.

(c) [25 marks] SON on MapReduce to find frequent triplets
The threshold for frequent triplets is defined as s = 0.0025, where the relative frequency of a triplet (i, j, k) = occurrence of triplet (i, j, k) / total number of baskets. If the number of frequent triplets is larger than 20, please only submit the top 20 triplets (if any).

Tip:
● In case of memory errors, you may need to use multiple mappers/reducers (e.g., 20+).

(d) [20 Bonus marks] Use the PCY algorithm to filter the candidate pairs in the SON algorithm
Implement the SON algorithm under the MapReduce framework, and use the PCY algorithm to filter the candidate pairs in the first MapReduce job. You can use the following Python hash function.
    HashFunction = hash(word_1 + word_2) % 100000

For example, the word pair ('Monday', 'Tuesday') can be hashed as follows:

    HashFunction = hash('Monday' + 'Tuesday') % 100000

Streamline and performance comparison:
● Wrap the two MapReduce rounds into a single executable by putting the commands you typed before into a shell script.
● Compare the overall execution times of (a), (b), and (d).
● Output the command you use to submit the Hadoop job.

Part (d) is an optional (bonus) part for IERG4300 and IEMS5709, but mandatory for ESTR4300.

Submission requirement:
● You need to submit BOTH your code and your results. Please place the relevant code and the results in a SINGLE PDF file.
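A practical caveat on the hash function in part (d): in Python 3, `hash()` on strings is salted per process (controlled by the PYTHONHASHSEED environment variable), so two mappers running in different processes may send the same pair to different buckets unless the seed is fixed, e.g. by exporting PYTHONHASHSEED=0 for every task. A small sketch of the PCY first-pass bucketing, with the bucket count of 100000 taken from the assignment and the helper name `pair_bucket` being an illustrative choice:

```python
# PCY first pass: hash every pair observed in a basket into one of
# NUM_BUCKETS counters; pairs landing only in light buckets are pruned.
NUM_BUCKETS = 100000  # bucket count given in the assignment

def pair_bucket(word_1, word_2):
    """Bucket index for a word pair, per the assignment's
    HashFunction = hash(word_1 + word_2) % 100000."""
    return hash(word_1 + word_2) % NUM_BUCKETS

# Illustrative first-pass counting over one basket:
bucket_counts = [0] * NUM_BUCKETS
bucket_counts[pair_bucket("Monday", "Tuesday")] += 1
```

Within one process the function is deterministic, which the sketch relies on; across processes, fixing PYTHONHASHSEED (or using a seed-independent hash such as one from hashlib) keeps the mappers' buckets consistent.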