Many data mining algorithms, including the traditional Apriori, assume that the entire dataset can fit into a computer's main memory (RAM). However, with today's massive datasets (terabytes or more), this is often impossible. The challenge is how to process data that's too big to load all at once. Solutions involve:
· External-memory algorithms: These are designed to work with data stored on disk, reading and processing it in smaller, manageable chunks (a minimal sketch of this approach follows the list).
· Memory-mapped files: This technique maps a file directly into the computer's virtual memory space, letting the operating system handle the loading and unloading of data pages as needed, effectively treating a disk file as if it were in RAM.
· Specialized data structures: More memory-efficient data structures can help, but for truly large datasets the problem fundamentally remains one of minimizing data transfers between disk and memory.
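To make the first idea concrete, here is a minimal external-memory sketch in Python: it streams a transaction file in fixed-size chunks with pandas, so only the running item counts, not the raw data, ever live in RAM. The file name transactions.csv and its layout (one transaction per row, with an items column of comma-separated product IDs) are assumptions for illustration.

```python
import pandas as pd
from collections import Counter

item_counts = Counter()

# Read the (hypothetical) transactions.csv in chunks of 100,000 rows,
# so the full file is never loaded into memory at once.
for chunk in pd.read_csv("transactions.csv", chunksize=100_000):
    for basket in chunk["items"]:
        # De-duplicate items within a transaction before counting support.
        item_counts.update(set(basket.split(",")))

# item_counts now holds each item's support; only these counts,
# not the raw transactions, are memory-resident.
```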
From Apriori's Iteration to FP-Growth's Compression: A Frequent Itemset Mining Evolution
The search for frequent itemsets (groups of products often bought together) is the cornerstone of association rule mining. While the classic Apriori Algorithm provided the initial breakthrough, its performance limitations spurred the development of more efficient methods, most notably the Frequent Pattern (FP)-Growth algorithm.
The Apriori Bottleneck: The Cost of Iteration
The Apriori algorithm successfully discovers all frequent itemsets by leveraging the Apriori property (the downward closure property): every subset of a frequent itemset must itself be frequent, so if {milk, bread} is infrequent, any superset such as {milk, bread, butter} can be pruned without counting. However, it faces two significant bottlenecks:
1. Multiple Database Scans: Apriori is an iterative, level-wise algorithm. To find the frequent k-itemsets (Fk), it must generate candidate itemsets (Ck) and then scan the entire transaction database (T) to count their support. If the largest frequent itemset has size K, the algorithm makes at most K passes over the data.
2. Massive Candidate Generation: In each iteration, Apriori generates a large number of candidate itemsets (Ck) that might be frequent. Even with the pruning step, generating and testing these candidates consumes substantial time and memory, especially when the item universe is large.
While Apriori is robust and scalable to large datasets, the repeated database scans and the combinatorial cost of candidate generation make it markedly inefficient, particularly on dense datasets.
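To illustrate the bottleneck, here is a compact Python sketch of the level-wise loop (a minimal teaching version, not an optimized implementation): each pass joins frequent (k-1)-itemsets into candidates Ck, prunes them with the Apriori property, and then rescans every transaction to count support.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search: each level needs a full pass over the data."""
    transactions = [frozenset(t) for t in transactions]

    # First scan: frequent 1-itemsets (F1).
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_support}
    result = {s: counts[s] for s in frequent}

    k = 2
    while frequent:
        # Join step: merge pairs of frequent (k-1)-itemsets into k-candidates.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        # Count step: yet another full scan of the whole database.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        frequent = {c for c, n in counts.items() if n >= min_support}
        result.update({c: counts[c] for c in frequent})
        k += 1
    return result
```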
FP-Growth: Mining Without Candidate Generation
The FP-Growth algorithm was designed specifically to overcome Apriori's weaknesses by eliminating the need for explicit candidate generation. It uses a divide-and-conquer approach based on compressing the transaction data into a specialized structure called the FP-Tree.
1. Data Compression: The FP-Tree
Instead of repeatedly scanning the original transactions, FP-Growth first scans the database once to collect item support counts. It then performs a second scan to build a highly condensed prefix-tree structure called the Frequent Pattern Tree (FP-Tree).
· The FP-Tree stores frequent items in a compressed format, sharing common prefixes among transactions. This compression drastically reduces the dataset size, often making it small enough to remain memory-resident.
· Infrequent items are discarded immediately, further shrinking the tree and optimizing the mining process.
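The following sketch shows how such a tree can be built in Python (FPNode and build_fp_tree are illustrative names, not from a specific library): the first scan counts supports, and the second inserts each transaction's frequent items in descending-support order so that transactions with common prefixes share nodes. A header table links all nodes carrying a given item, which the mining phase uses later.

```python
from collections import defaultdict

class FPNode:
    """One FP-Tree node: an item, its count, and parent/children links."""
    def __init__(self, item, parent):
        self.item = item
        self.count = 1
        self.parent = parent
        self.children = {}                     # item -> FPNode

def build_fp_tree(transactions, min_support):
    # Scan 1: count the support of every item.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i for i, c in counts.items() if c >= min_support}

    # Scan 2: insert each transaction, keeping only its frequent items,
    # ordered by descending support so common prefixes share nodes.
    root = FPNode(None, None)
    header = defaultdict(list)                 # item -> nodes carrying it
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            else:
                child.count += 1
            node = child
    return root, header

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"}]
root, header = build_fp_tree(transactions, min_support=2)
```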
2. Mining via Projection
Once the FP-Tree is built, the algorithm mines the itemsets by decomposing the tree into smaller, conditional databases (called "projections"), one for each frequent item.
· This approach avoids the iterative, level-wise generation and counting of candidates (Ck) that characterizes Apriori.
· The entire mining process runs without ever touching the original transaction database again, saving numerous I/O operations and CPU cycles.
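As a sketch of the projection step, the function below (building on the FPNode/header structures from the previous sketch; again an illustrative name) extracts the conditional pattern base for one item by following each of its nodes up to the root. FP-Growth then builds a smaller conditional FP-Tree from these prefix paths and recurses, growing longer patterns item by item.

```python
def conditional_pattern_base(item, header):
    """Collect (prefix_path, count) pairs for every occurrence of `item`."""
    base = []
    for node in header[item]:
        path = []
        parent = node.parent
        # Walk upward to the root (whose item is None), recording the prefix.
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            # The whole prefix path occurs with this node's count.
            base.append((list(reversed(path)), node.count))
    return base
```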
Comparison Summary
| Feature | Apriori Algorithm | FP-Growth Algorithm |
| --- | --- | --- |
| Strategy | Level-wise search; iterative | Divide-and-conquer; recursive projection |
| Candidate Sets | Explicitly generated (Ck) and tested | None generated |
| Database Scans | Multiple (up to K passes) | Two (one for counts, one for tree building) |
| Data Structure | Hash trees for candidate counting | FP-Tree (compressed prefix tree) |
| Performance | Can be slow due to large candidate sets | Generally faster due to compression and no candidate overhead |
The FP-Growth (Frequent Pattern Growth) algorithm is an efficient method for finding frequent itemsets without the overhead of generating candidate sets, which is characteristic of the Apriori algorithm. Its working procedure involves two main phases: FP-Tree Construction and FP-Tree Mining (via Conditional Pattern Bases).
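For readers who want to try the algorithm without implementing it, here is a small usage example with the third-party mlxtend library (an assumption: mlxtend must be installed, e.g. via pip install mlxtend), whose fpgrowth function operates on a one-hot-encoded DataFrame:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["bread", "milk"],
                ["bread", "butter"],
                ["bread", "milk", "butter"],
                ["milk", "butter"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

# Frequent itemsets appearing in at least 50% of the transactions.
print(fpgrowth(df, min_support=0.5, use_colnames=True))
```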