The Apriori Algorithm and its Core Property
The Apriori Algorithm is one of the best-known algorithms for association rule mining. It works in two main steps:
Pass 1: Find all frequent itemsets (itemsets whose support meets minsup).
Pass 2: Use the frequent itemsets to generate association rules that satisfy minconf.
The Apriori Property (Downward Closure)
The key to the Apriori algorithm's efficiency is the Apriori property (or downward closure property): any subset of a frequent itemset must also be a frequent itemset.
This property allows the algorithm to perform an iterative, level-wise search. It starts by finding all frequent 1-itemsets (F_1), then uses F_1 to generate and test candidates for F_2, and so on.
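As a concrete illustration of the first level, the following Python sketch computes F_1 from a toy transaction database (the transactions, item names, and minsup value are made up for illustration):

```python
from collections import Counter

# Toy transaction database (illustrative data only).
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
]
minsup = 2  # minimum support as an absolute transaction count

# F_1: count every individual item, keep those meeting minsup.
item_counts = Counter(item for t in transactions for item in t)
F1 = {frozenset([item]) for item, count in item_counts.items() if count >= minsup}
print(sorted(next(iter(s)) for s in F1))  # ['A', 'B', 'C']
```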
The logical consequence (contrapositive) of the Apriori property is that if an itemset is infrequent (its support is less than minsup), then all of its supersets must also be infrequent: every transaction that contains the superset also contains the subset, so the superset's support can never exceed the subset's. This is the basis for the prune step in candidate generation, which significantly reduces the number of itemsets that need to be counted.
Simplified Visual of the Downward Closure Property
This structure illustrates the relationship between frequent itemsets based on the Apriori property:

| Itemset Level | Itemsets | Frequent/Infrequent | Downward Closure Principle |
|---|---|---|---|
| 3-Itemset | {A, B, C} | → Frequent | All subsets must also be frequent. |
| 2-Itemset | {A, B}, {A, C}, {B, C} | → Frequent | |
| 1-Itemset | {A}, {B}, {C} | → Frequent | |
| 3-Itemset | {X, Y, Z} | → Infrequent | No need to check any of its supersets (such as {X, Y, Z, W}); they must also be infrequent. |
In the Apriori algorithm, if any subset of a candidate itemset of size k is not in F_k-1 (the set of frequent itemsets of size k-1), the candidate itemset is removed in the prune step.
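This prune check can be written down directly. Below is a minimal Python sketch; the function name and the data representation (frozensets for itemsets, a set of frozensets for F_k-1) are illustrative assumptions:

```python
from itertools import combinations

def has_infrequent_subset(candidate, F_prev):
    """True if some (k-1)-subset of the k-itemset `candidate`
    is missing from F_prev, the frequent (k-1)-itemsets."""
    k = len(candidate)
    return any(
        frozenset(subset) not in F_prev
        for subset in combinations(candidate, k - 1)
    )

# Example: {B, C} is not frequent, so candidate {A, B, C} is pruned.
F2 = {frozenset({"A", "B"}), frozenset({"A", "C"})}
print(has_infrequent_subset(frozenset({"A", "B", "C"}), F2))  # True
```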
Steps of Apriori Algorithm
The Apriori Algorithm is the classical algorithm for Association Rule Mining. It works in two distinct steps: finding frequent itemsets and generating association rules.
The algorithm's efficiency relies on the Apriori Property (or Downward Closure Property): any subset of a frequent itemset must also be frequent.
Step 1: Finding All Frequent Itemsets
This phase uses a level-wise search to find all itemsets whose support is greater than or equal to the user-specified Minimum Support (minsup).
| Step | Action | Explanation |
|---|---|---|
| 1.1: Initialization | Find all frequent 1-itemsets (F_1). | Scan the database to count the support of every single item. Discard any item with a count below minsup. The remaining items form F_1. |
| 1.2: Iteration (k = 2, 3, …) | Repeat the cycle of candidate generation, pruning, and support counting until no new frequent itemsets are found. | |
| 1.3: Candidate Generation (C_k) | Generate candidate k-itemsets (C_k) from F_k-1. | Perform a join operation: combine any two frequent itemsets from F_k-1 that share the first k-2 items to form a k-item candidate. |
| 1.4: Pruning (C_k to C'_k) | Prune the candidate set C_k to create C'_k. | Apply the Apriori Property: if any (k-1)-sized subset of a candidate k-itemset in C_k is not found in F_k-1, that candidate cannot be frequent and is removed. |
| 1.5: Support Counting (F_k) | Find the frequent k-itemsets F_k. | Scan the transaction database (T) to count the actual support of the remaining candidates in C'_k. All candidates whose count meets minsup are added to F_k. |
| 1.6: Termination | Stop when the set F_k is empty. | The union of all F_k found (F = F_1 ∪ F_2 ∪ …) is the complete set of frequent itemsets. |
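Putting steps 1.1 through 1.6 together, the following Python sketch illustrates the frequent-itemset phase. It is a simplified, in-memory illustration (transactions as Python sets, minsup as an absolute count, function and variable names invented for this example), not a production implementation:

```python
from collections import Counter
from itertools import combinations

def apriori_frequent_itemsets(transactions, minsup):
    """Level-wise search for all frequent itemsets (Step 1).
    transactions: list of sets; minsup: absolute support count."""
    # 1.1 Initialization: frequent 1-itemsets and their counts.
    item_counts = Counter(item for t in transactions for item in t)
    F = {frozenset([i]): c for i, c in item_counts.items() if c >= minsup}
    all_frequent = dict(F)
    k = 2
    while F:
        # 1.3 Candidate generation: join F_{k-1} with itself,
        # combining itemsets that share their first k-2 items.
        prev = sorted(sorted(s) for s in F)
        Ck = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                if prev[i][:k - 2] == prev[j][:k - 2]:
                    Ck.add(frozenset(prev[i]) | frozenset(prev[j]))
        # 1.4 Pruning: drop candidates with an infrequent (k-1)-subset.
        Ck = {c for c in Ck
              if all(frozenset(s) in F for s in combinations(c, k - 1))}
        # 1.5 Support counting: one database scan for the survivors.
        counts = Counter()
        for t in transactions:
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        F = {c: n for c, n in counts.items() if n >= minsup}
        all_frequent.update(F)
        k += 1
    # 1.6 Termination: F_k is empty; return the union of all F_k.
    return all_frequent
```

For instance, calling apriori_frequent_itemsets with the four-transaction toy database from the earlier F_1 sketch and minsup = 2 returns every 1- and 2-itemset with its count, and terminates at k = 3 because {A, B, C} occurs in only one transaction.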
Step 2: Generating Association Rules
In this phase, all frequent itemsets (F) found in Step 1 are used to generate high-confidence rules that satisfy the user-specified Minimum Confidence (minconf).
- Iterate Over Frequent Itemsets: For every frequent itemset X ∈ F (where |X| >= 2), generate all possible non-empty subsets of X.
- Generate Rules: For every non-empty proper subset A ⊂ X, form an association rule R: A → (X - A). The set A is the antecedent, and B = (X - A) is the consequent.
- Check Confidence: Calculate the confidence of rule R using the support counts recorded in Step 1 (no need to rescan the database): Confidence(A → B) = Support(A ∪ B) / Support(A) = Support(X) / Support(A).
- Output Rules: If Confidence(A → B) >= minconf, the rule is considered strong and is output, as sketched below.
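Continuing the illustrative sketches above, Step 2 can be expressed as a short loop over the recorded support counts; generate_rules and its dict-of-frozensets input format are assumptions carried over from the Step 1 sketch:

```python
from itertools import combinations

def generate_rules(frequent, minconf):
    """Step 2: emit strong rules A -> (X - A) using only the
    support counts recorded in Step 1 (no database rescan)."""
    rules = []
    for X, support_X in frequent.items():
        if len(X) < 2:
            continue
        # Every non-empty proper subset A of X is a candidate antecedent.
        for r in range(1, len(X)):
            for A in map(frozenset, combinations(X, r)):
                # Confidence(A -> B) = Support(X) / Support(A); A is
                # frequent by downward closure, so its count is recorded.
                confidence = support_X / frequent[A]
                if confidence >= minconf:
                    rules.append((A, X - A, confidence))
    return rules

# Usage with the earlier sketch:
# freq = apriori_frequent_itemsets(transactions, minsup=2)
# for A, B, conf in generate_rules(freq, minconf=0.7):
#     print(set(A), "->", set(B), round(conf, 2))
```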
This process ensures that every generated rule is both frequent (because it is derived from a frequent itemset X) and reliable (because it meets the minconf threshold).