FP-Growth:
1. FP-Tree Construction (Data Compression)
The algorithm builds a highly compressed data structure
called the Frequent Pattern Tree (FP-Tree) by scanning the transaction
database twice.
First
Pass: Counting Frequencies
1.
Count
Item Support: Scan
the entire transaction database to count the support for every single item.
2.
Filter
Infrequent Items:
Discard all items that do not meet the user-specified minimum support (minsup).
3.
Create
Header Table: Sort
the remaining frequent items in descending order of their support counts. This
sorted list, along with pointers to the tree nodes, forms the Header Table.
Second
Pass: Building the FP-Tree
1.
Process
Transactions: Scan
the transaction database again. For each transaction, remove all infrequent
items.
2.
Order
Items: Sort the items
within each transaction according to the global order established in the Header
Table (descending support count).
3.
Insert
into Tree: Insert the
sorted items of each transaction into the FP-Tree.
o Nodes are created for items. The count
of a node is incremented for every transaction that shares that item prefix.
o If a transaction shares a prefix with
an existing path in the tree, only the count of the existing nodes is
incremented. New nodes are only created for items beyond the shared prefix.
4.
Update
Header Pointers: A
link is drawn from the Header Table to the first occurrence of that item in the
tree, and subsequent occurrences are linked to each other using node links.
This forms a chain connecting all nodes containing the same item.
2. FP-Tree Mining (Pattern Growth)
After the FP-Tree is built, the algorithm mines the
frequent itemsets using a divide-and-conquer strategy, working bottom-up from
the Header Table. The principle is: find all frequent itemsets ending with a
specific item i and combine them.
The mining process is recursive:
1.
Iterate
Header Table: Start
with the least frequent item i in the Header
Table. (Working from the least frequent item first improves efficiency.)
2.
Find
Conditional Pattern Bases (CPBs): For the current item i, trace
all paths in the FP-Tree that end at a node containing i,
using the node links in the Header Table.
o The Conditional Pattern Base (CPB)
for item i is the set of prefix paths of i. The count of each path in the CPB is equal to the count of
the i node at the end of that path.
3.
Build
Conditional FP-Trees (CFPTs):
o Treat the CPB as a new, smaller
transaction database.
o Sum the counts for each item within
the CPB.
o Filter out items in the CPB that do
not meet the minimum support threshold within that base.
o Construct a new, smaller Conditional
FP-Tree (CFPT) based on the remaining items in the filtered CPB.
4.
Generate
Frequent Itemsets:
o Any frequent path in the CFPT combined
with the item i forms a frequent itemset.
o If the CFPT consists of a single path,
all combinations of items on that path plus i are
frequent.
5.
Recursion: Recursively apply this mining
procedure (steps 2-4) to the newly formed Conditional FP-Tree until the tree is
empty or contains only a single path.
By recursively projecting the tree onto conditional
pattern bases, FP-Growth effectively finds all frequent itemsets without
generating and counting explicit candidates, making it significantly faster
than Apriori, especially for dense and large datasets.
The FP-Growth algorithm works by transforming the
transaction data into a compressed, tree-like structure, the FP-Tree,
and then recursively mining that tree. This avoids the repeated database scans
and costly candidate generation that plagues Apriori.
Here is the step-by-step working procedure using a sample
database.
Example Database and Minimum Support
We will use the following transaction database (T) and set a minimum support threshold (minsup).
|
TID |
Items Purchased |
|
T1 |
A, B, C |
|
T2 |
B, C, D |
|
T3 |
A, C, D |
|
T4 |
A, B, D |
|
T5 |
A, B, C, D, E |
|
T6 |
B, E |
Minimum Support (minsup): 3 out
of 6 transactions (50%)
Step 1: FP-Tree Construction (Two Passes)
Pass 1:
Count Frequencies and Order Items
1.
Count
Support: Calculate
the count for every item in T.
o A: 4, B: 4, C: 3, D: 4, E: 2
2.
Filter
Infrequent Items:
Item E (count 2) is below minsup (3), so it is
discarded.
3.
Create
Header Table and Order:
Sort the remaining frequent items in descending order of their support counts.
Items with equal counts (A, B, D) are ordered alphabetically.
|
Item |
Support |
Global Rank (Order) |
|
A |
4 |
1 |
|
B |
4 |
2 |
|
D |
4 |
3 |
|
C |
3 |
4 |
The Canonical Order for items is ⟨A,B,D,C⟩.
Pass 2:
Build the FP-Tree
Scan T again. For each
transaction, remove infrequent items (E) and sort the rest according to the
Canonical Order ⟨A,B,D,C⟩. Then, insert the sorted transaction
into the tree.
|
TID |
Original Transaction |
Sorted, Filtered Transaction |
Path in FP-Tree |
|
T1 |
A, B, C |
⟨A,B,C⟩ |
A(1) → B(1) → C(1) |
|
T2 |
B, C, D |
⟨B,D,C⟩ |
Root
→ B(1) → D(1) → C(1) |
|
T3 |
A, C, D |
⟨A,D,C⟩ |
Root
→ A(2) →
D(1) → C(1) |
|
T4 |
A, B, D |
⟨A,B,D⟩ |
Root
→ A(3) →
B(2) → D(1) |
|
T5 |
A, B, C, D, E |
⟨A,B,D,C⟩ |
Root
→ A(4) →
B(3) → D(2)
→ C(1) |
|
T6 |
B, E |
⟨B⟩ |
Root
→ B(4) |
Step 2: FP-Tree Mining (Pattern Growth)
We mine the tree recursively, starting from the least
frequent item in the Header Table: C, D, B, A.
A.
Mining Item 'C'
1.
Find
Paths: Trace all
paths ending in 'C' using the Header Table links:
o Path 1: ⟨A:4,B:3,D:2,C:1⟩ (Count: 1)
o Path 2: ⟨A:3,D:1,C:1⟩ (Count: 1)
o Path 3: ⟨A:1,B:1,C:1⟩ (Count: 1)
o Path 4: ⟨B:4,D:1,C:1⟩ (Count: 1)
2.
Conditional
Pattern Base (CPB):
The prefix paths (excluding 'C') with their respective counts:
o ⟨A,B,D⟩: 1
o ⟨A,D⟩: 1
o ⟨A,B⟩: 1
o ⟨B,D⟩: 1
3.
Conditional
FP-Tree (CFPT): Count
support in the CPB:
o A: 1+1+1=3
(Frequent)
o B: 1+1+1=3
(Frequent)
o D: 1+1=2 (Infrequent → Discard)
The
filtered CPB is: ⟨A,B⟩:1, ⟨A⟩:1, ⟨A,B⟩:1, ⟨B⟩:1.
The
resulting CFPT for C contains the single path: A:3 → B:3
4.
Frequent
Itemsets for 'C':
Find all combinations from the CFPT, and append 'C'.
o {C} (Known: Count 3)
o {A,C}, {B,C}, {A,B,C}
B.
Mining Item 'D'
1.
Find
Paths: Paths ending
in 'D':
o Path 1: ⟨A:4,B:3,D:2⟩ (Count: 2)
o Path 2: ⟨A:3,D:1⟩ (Count: 1)
o Path 3: ⟨B:4,D:1⟩ (Count: 1)
2.
CPB:
o ⟨A,B⟩: 2
o ⟨A⟩: 1
o ⟨B⟩: 1
3.
CFPT: Count support in the CPB:
o A: 2+1=3 (Frequent)
o B: 2+1=3 (Frequent)
The
resulting CFPT for D contains the path: A:3 → B:2, and a separate path B:1.
4.
Frequent
Itemsets for 'D':
o {D} (Known: Count 4)
o {A,D}, {B,D}, {A,B,D}
C.
Mining Item 'B'
1.
Find
Paths: Paths ending
in 'B' (excluding those already used for C and D):
o Path 1: ⟨A⟩ (Count: 1)
o Path 2: ⟨A⟩ (Count: 2)
o Path 3: ⟨A⟩ (Count: 1)
2.
CPB: ⟨A⟩: 4
3.
CFPT: Contains only A:4
4.
Frequent
Itemsets for 'B':
o {B} (Known: Count 4)
o {A,B}
D.
Mining Item 'A'
The paths for 'A' (the highest-ranked item) are only the
root, so its frequent itemset is just {A}.
Final Result
By combining the itemsets generated in each step, we
obtain the complete set of frequent itemsets:
·
1-Itemsets: {A}, {B}, {C}, {D}
·
2-Itemsets: {A,B}, {A,C}, {A,D}, {B,C}, {B,D}
·
3-Itemsets: {A,B,C}, {A,B,D}