IERG4300/ IEMS5709 Web-Scale Information Analytics
Frequent Itemsets and Association Rule Mining
(Some slides are used with the author's permission; all copyrights belong to the original author of the material.)

Association Rule Discovery

Supermarket shelf management – the market-basket model:
• Goal: Identify items that are bought together by sufficiently many customers.
• Approach: Process the sales data collected with barcode scanners to find dependencies among items.
• A classic rule:
  - If one buys diapers and milk, then he is likely to buy beer.
  - Don't be surprised if you find six-packs next to diapers!

  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk

  Rules discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}

The Market-Basket Model

• A large set of items
  - e.g., things sold in a supermarket
• A large set of baskets, each of which is a small subset of the items
  - e.g., the things one customer buys on one day
• A general many-many mapping (association) between two kinds of things
  - But we ask about connections among "items", not "baskets"

Association Rules: Approach

• Given a set of baskets
• We want to discover association rules
  - People who bought {x,y,z} tend to buy {v,w}
    - e.g., Amazon!
• 2-step approach:
  - 1) Find frequent itemsets
  - 2) Generate association rules
• Input: the baskets (as in the table above); output: the rules discovered, e.g., {Milk} --> {Coke}, {Diaper, Milk} --> {Beer}

Applications – (1)

• Items = products; baskets = sets of products someone bought in one trip to the store
• Real market baskets: chain stores keep TBs of data about what customers buy together
  - Tells how typical customers navigate stores, lets them position tempting items
  - Suggests tie-in "tricks", e.g., run a sale on diapers and raise the price of beer
  - High support is needed, or no $$'s
• Amazon's "people who bought X also bought Y"

Applications – (2)

• Baskets = sentences; items = documents containing those sentences
  - Items that appear together too often could represent plagiarism
  - Notice that items do not have to be "in" baskets
• Baskets = patients; items = drugs & side-effects
  - Has been used to detect combinations of drugs that result in particular side-effects
  - But requires an extension: the absence of an item needs to be observed as well as its presence

Outline

First: definitions
• Frequent itemsets
• Association rules: confidence, support, interestingness
Then: algorithms for finding frequent itemsets
• Finding frequent pairs
• Apriori algorithm
• PCY algorithm + refinements

Frequent Itemsets

• Simplest question: find sets of items that appear together "frequently" in baskets
• Support for itemset I: the number of baskets containing all items in I
  - Often expressed as a fraction of the total number of baskets
• Given a support threshold s, the sets of items that appear in at least s baskets are called frequent itemsets
• Example: in the five baskets above, the support of {Beer, Bread} = 2

Example: Frequent Itemsets

• Items = {milk, coke, pepsi, beer, juice}
• Minimum support = 3 baskets

  B1 = {m, c, b}     B2 = {m, p, j}       B3 = {m, b}       B4 = {c, j}
  B5 = {m, p, b}     B6 = {m, c, b, j}    B7 = {c, b, j}    B8 = {b, c}

• Frequent itemsets: {m}, {c}, {b}, {j}, {m,b}, {b,c}, {c,j}
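These definitions are easy to check by brute force on the toy example. The sketch below is a minimal illustration under my own naming (not code from the lecture): it counts the support of every size-k subset of the eight baskets; the basket contents and the threshold of 3 come from the example above.

```python
# Minimal brute-force sketch (illustrative, not the lecture's code):
# 8 baskets, minimum support = 3 baskets.
from collections import Counter
from itertools import combinations

baskets = [
    {'m', 'c', 'b'}, {'m', 'p', 'j'}, {'m', 'b'}, {'c', 'j'},
    {'m', 'p', 'b'}, {'m', 'c', 'b', 'j'}, {'c', 'b', 'j'}, {'b', 'c'},
]
min_support = 3  # absolute count, i.e., number of baskets

def frequent_itemsets_of_size(k):
    """Count every size-k subset of every basket; keep those with support >= min_support."""
    counts = Counter()
    for basket in baskets:
        for itemset in combinations(sorted(basket), k):
            counts[itemset] += 1
    return {itemset: count for itemset, count in counts.items() if count >= min_support}

print(frequent_itemsets_of_size(1))  # {m}: 5, {c}: 5, {b}: 6, {j}: 4
print(frequent_itemsets_of_size(2))  # {b,m}: 4, {b,c}: 4, {c,j}: 3
```

Its output matches the list above: the frequent singletons are {m}, {c}, {b}, {j} and the frequent pairs are {m,b}, {b,c}, {c,j}; a triple such as {m,c,b} occurs in only two baskets and is therefore not frequent.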
Association Rules

• Association rules: if-then rules about the contents of baskets
• {i1, i2, …, ik} → j means: "if a basket contains all of i1, …, ik, then it is likely to contain j"
• In practice there are many rules; we want to find the significant/interesting ones!
• The confidence of this association rule is the probability of j given I = {i1, …, ik}:

  conf(I → j) = support(I ∪ {j}) / support(I) = Pr[j | I]

  Note: support(I ∪ {j}) = the number (or fraction) of baskets that contain BOTH I and j

Interesting Association Rules

• Not all high-confidence rules are interesting
  - The rule X → milk may have high confidence for many itemsets X, because milk is simply purchased very often (independently of X), so the confidence will be high
• Interest of an association rule I → j: the difference between its confidence and the fraction of baskets that contain j

  Interest(I → j) = conf(I → j) − Pr[j] = Pr[j | I] − Pr[j]

  - Interesting rules are those with high positive or negative interest values

Example: Confidence and Interest

  B1 = {m, c, b}     B2 = {m, p, j}       B3 = {m, b}       B4 = {c, j}
  B5 = {m, p, b}     B6 = {m, c, b, j}    B7 = {c, b, j}    B8 = {b, c}

• Association rule: {m, b} → c
  - Confidence = 2/4 = 0.5
  - Interest = |0.5 − 5/8| = 1/8
    - Item c appears in 5/8 of the baskets
    - The rule is not very interesting!

Finding Association Rules

• Problem: find all association rules with support ≥ s and confidence ≥ c
  - Note: the support of an association rule is the support of the set of items on the left side
• Hard part: finding the frequent itemsets!
  - If {i1, i2, …, ik} → j has high support and confidence, then both {i1, i2, …, ik} and {i1, i2, …, ik, j} will be "frequent"
  - (Recall: conf(I → j) = support(I ∪ {j}) / support(I))

Mining Association Rules

• Step 1: Find all frequent itemsets I
  - (we will explain this next)
• Step 2: Rule generation
  - For every subset A of I, generate a rule A → I \ A
    - Since I is frequent, A is also frequent
    - Variant 1: single pass to compute the rule confidence
      - conf(A,B → C,D) = supp(A,B,C,D) / supp(A,B)
    - Variant 2:
      - Observation (proved below): if A,B,C → D is below the confidence threshold, so is A,B → C,D
      - Can generate "bigger" rules from smaller ones!
  - Output the rules above the confidence threshold

Mining Association Rules (cont'd)

• Claim: if A,B,C → D is below the confidence threshold, so is A,B → C,D
• Why? Since Supp(ABC) ≤ Supp(AB):

  Conf(ABC → D) = Supp(ABCD) / Supp(ABC) ≥ Supp(ABCD) / Supp(AB) = Conf(AB → CD)

• Thus, IF Conf(AB → CD) ≥ threshold THEN Conf(ABC → D) ≥ threshold as well; equivalently, IF Conf(ABC → D) < threshold THEN Conf(AB → CD) is also below the threshold.
• This means we can first check Conf(AB → CD); if it is above the threshold, we can simply generate additional rules, e.g., ABC → D, ABD → C.
  => Can generate "bigger" rules from smaller ones!
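To make the definitions concrete, here is a small Python sketch (the helper names are my own, not the lecture's reference implementation) that computes confidence and interest directly from the formulas above and then performs Step 2, generating rules A → I \ A from a frequent itemset and keeping those above a confidence threshold.

```python
# Small sketch of the confidence/interest definitions and of Step 2 (rule generation);
# illustrative names only.
from itertools import combinations

baskets = [
    {'m', 'c', 'b'}, {'m', 'p', 'j'}, {'m', 'b'}, {'c', 'j'},
    {'m', 'p', 'b'}, {'m', 'c', 'b', 'j'}, {'c', 'b', 'j'}, {'b', 'c'},
]

def support(itemset):
    """Number of baskets that contain every item of `itemset`."""
    return sum(1 for b in baskets if set(itemset) <= b)

def confidence(lhs, rhs):
    """conf(I -> j) = support(I u {j}) / support(I)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

def interest(lhs, rhs):
    """Interest(I -> j) = conf(I -> j) - Pr[j]."""
    return confidence(lhs, rhs) - support(rhs) / len(baskets)

print(confidence({'m', 'b'}, {'c'}))  # 0.5, as in the example above
print(interest({'m', 'b'}, {'c'}))    # -0.125, i.e. |interest| = 1/8

def rules_from_itemset(itemset, min_conf):
    """Step 2: generate A -> I \\ A for every non-empty proper subset A of a frequent itemset I."""
    items = sorted(itemset)
    for r in range(1, len(items)):
        for lhs in combinations(items, r):
            rhs = set(items) - set(lhs)
            c = confidence(set(lhs), rhs)
            if c >= min_conf:
                yield set(lhs), rhs, c

for pair in ({'m', 'b'}, {'b', 'c'}, {'c', 'j'}):       # the frequent pairs found earlier
    for lhs, rhs, c in rules_from_itemset(pair, 0.75):
        print(lhs, '->', rhs, 'conf =', round(c, 2))    # m->b (0.8), c->b (0.8), j->c (0.75)
```

A fuller implementation would also exploit the observation above, checking a "smaller" rule such as A,B → C,D first and only then expanding it to "bigger" rules such as A,B,C → D; this sketch simply enumerates every subset.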
Example

  B1 = {m, a, b}     B2 = {m, p, j}       B3 = {m, a, b, n}    B4 = {a, j}
  B5 = {m, p, b}     B6 = {m, a, b, j}    B7 = {a, b, j}       B8 = {b, a}

• Min support s = 3, confidence c = 0.75
• 1) Frequent itemsets:
  - {b,m}  {a,b}  {a,m}  {a,j}  {m,a,b}
• 2) Generate rules:
  - b → m: c = 4/6
  - b → a: c = 5/6
  - b,a → m: c = 3/5
  - m → b: c = 4/5
  - …
  - b,m → a: c = 3/4
  - b → a,m: c = 3/6

A Compact Way to Store/Track Frequent Itemsets

You only need to store the so-called maximal frequent itemsets.

• Definition: a frequent itemset for which NO immediate superset is frequent
• Nice properties:
  - All subsets of a maximal frequent itemset are frequent, AND
  - Every frequent itemset must be a subset of some maximal frequent itemset
  => By enumerating ALL subsets of all maximal frequent itemsets, you will NOT miss any frequent itemset!
     Also, every subset you get this way is a frequent itemset!

Example: Maximal Frequent Itemsets (s = 3)

  Itemset  Count  Maximal?
  A        4      No
  B        5      No
  C        3      No
  AB       4      Yes
  AC       2      No
  BC       3      Yes
  ABC      2      No

  - A, B, and C are frequent, but each has a frequent superset (AB or BC), so none of them is maximal.
  - AB and BC are frequent, and their only superset, ABC, is not frequent, so both are maximal.
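As a quick illustration of the maximality check (again a sketch with made-up names, not the lecture's code), the snippet below takes the support counts from the table above and keeps only the frequent itemsets that have no frequent immediate superset.

```python
# Illustrative sketch: keep only the maximal frequent itemsets.
# The counts are the ones from the example table above (support threshold s = 3).
support_counts = {
    # frozenset('AB') is the itemset {A, B}, etc.
    frozenset('A'): 4, frozenset('B'): 5, frozenset('C'): 3,
    frozenset('AB'): 4, frozenset('AC'): 2, frozenset('BC'): 3,
    frozenset('ABC'): 2,
}
s = 3
frequent = {I for I, count in support_counts.items() if count >= s}
all_items = set().union(*frequent)

def is_maximal(itemset):
    """Maximal = frequent, and no immediate superset (one extra item) is frequent."""
    return all(itemset | {x} not in frequent for x in all_items - itemset)

maximal = {I for I in frequent if is_maximal(I)}
print(sorted(''.join(sorted(I)) for I in maximal))  # ['AB', 'BC']
```

It reports AB and BC as maximal, and enumerating all subsets of AB and BC recovers exactly the frequent itemsets A, B, C, AB, BC, as the properties above promise.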
Finding Frequent Itemsets

Itemsets: Computation Model

• Back to finding frequent itemsets
• Typically, data is kept in flat files rather than in a database system:
  - Stored on disk
  - Stored basket-by-basket
  - Baskets are small, but we have many baskets and many items
    - Expand baskets into pairs, triples, etc. as you read the baskets
    - Use k nested loops to generate all sets of size k
• File layout: items are positive integers, and boundaries between baskets are -1
• Note: we want to find frequent itemsets. To find them, we have to count them. To count them, we have to generate them.

Computation Model

• The true cost of mining disk-resident data is usually the number of disk I/Os
• In practice, association-rule algorithms read the data in passes – all baskets are read in turn
• We measure the cost by the number of passes an algorithm makes over the data

Main-Memory Bottleneck

• For many frequent-itemset algorithms, main memory is the critical resource
  - As we read baskets, we need to count something, e.g., occurrences of pairs of items
  - The number of different things we can count is limited by main memory
  - Swapping counts in/out is a disaster (why?)

Finding Frequent Pairs

• The hardest problem often turns out to be finding the frequent pairs of items {i1, i2}
  - Why? Often frequent pairs are common while frequent triples are rare
    - Why? The probability of being frequent drops exponentially with size, while the number of sets grows more slowly with size
• Let's first concentrate on pairs, then extend to larger sets
• The approach:
  - We always need to generate all the itemsets
  - But we would only like to count/keep track of those itemsets that in the end turn out to be frequent

Naïve Algorithm

• Naïve approach to finding frequent pairs
• Read the file once, counting in main memory the occurrences of each pair:
  - From each basket of n items, generate its n(n-1)/2 pairs by two nested loops
• Fails if (#items)^2 exceeds main memory
  - Remember: #items can be 100K (Wal-Mart) or 10B (Web pages)
    - Suppose 10^5 items and counts are 4-byte integers
    - Number of pairs of items: 10^5(10^5 - 1)/2 ≈ 5*10^9
    - Therefore, 2*10^10 bytes (20 gigabytes) of memory are needed

Counting Pairs in Memory

Two approaches:
• Approach 1: Count all pairs using a (triangular) matrix
• Approach 2: Keep a table of triples [i, j, c] = "the count of the pair of items {i, j} is c"
  - If integers and item IDs are 4 bytes each, we need approximately 12 bytes per pair with count > 0
  - Plus some additional overhead for the hash table
Note:
• Approach 1 only requires 4 bytes per pair
• Approach 2 uses 12 bytes per pair (but only for pairs with count > 0)

Comparing the Two Approaches

• Triangular matrix: 4 bytes per pair (every possible pair)
• Triples: 12 bytes per occurring pair (only pairs with count > 0)

Triangular Matrix Approach

• n = total number of items
• Count the pair of items {i, j} only if i < j
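To tie the pieces together, here is a hedged sketch (my own choice of names, toy data, and index formula; the slide's exact layout formula is cut off above) that reads baskets from the flat-file format described earlier (positive integer item IDs, baskets separated by -1), generates each basket's pairs with two nested loops, and counts them both ways: a one-dimensional triangular array with one slot per possible pair {i, j} with i < j, and a table of [i, j, c] triples kept in a dictionary.

```python
# Sketch of the two pair-counting approaches above (illustrative names; the
# triangular-index formula is one common choice, not necessarily the lecture's).
from collections import defaultdict

def read_baskets(stream_of_ints):
    """Yield one basket (list of item IDs) at a time; -1 marks a basket boundary."""
    basket = []
    for x in stream_of_ints:
        if x == -1:
            if basket:
                yield basket
            basket = []
        else:
            basket.append(x)
    if basket:
        yield basket

n = 5  # total number of distinct items; item IDs are 0 .. n-1 in this toy example

# Approach 1: triangular matrix stored as a flat array, one count slot per possible pair.
def tri_index(i, j):
    """Map the pair {i, j} with i < j to a unique slot in a 1-D array of n(n-1)/2 counts."""
    assert 0 <= i < j < n
    return i * n - i * (i + 1) // 2 + (j - i - 1)

tri_counts = [0] * (n * (n - 1) // 2)

# Approach 2: a table of triples [i, j, c], here a dict keyed by (i, j) holding c.
pair_counts = defaultdict(int)

stream = [0, 1, 3, -1, 1, 2, -1, 0, 1, 2, 3, -1]  # three toy baskets in the flat-file layout
for basket in read_baskets(stream):
    items = sorted(basket)
    for a in range(len(items)):               # two nested loops generate the
        for b in range(a + 1, len(items)):    # n(n-1)/2 pairs of each basket
            i, j = items[a], items[b]
            tri_counts[tri_index(i, j)] += 1  # a slot exists for every possible pair
            pair_counts[(i, j)] += 1          # an entry exists only for pairs that occur

print(tri_counts[tri_index(0, 1)], pair_counts[(0, 1)])  # both print 2
```

The triangular array reserves a 4-byte slot for all n(n-1)/2 pairs whether or not they occur, while the triples approach stores only the pairs that actually appear, at roughly 12 bytes each plus hash-table overhead, which is exactly the trade-off described above.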