SON Algorithm (Savasere–Omiecinski–Navathe Algorithm)
The SON algorithm uses a divide-and-conquer strategy. The large dataset is divided into several chunks (partitions), each small enough to fit into memory, and each chunk is mined independently using any in-memory frequent-itemset algorithm (such as Apriori). Any itemset that is globally frequent must be frequent in at least one chunk, so the union of the local results is a safe candidate set. Although SON is an algorithmic concept, it is commonly implemented in Big Data frameworks, especially using MapReduce.
Why It Fits the Limited-Pass Category
· Pass 1: Each partition is scanned and mined locally; the local frequent itemsets from all partitions are collected.
· Pass 2: All candidate itemsets are verified by scanning the whole database once.
Only two passes over the entire dataset are required.
Purpose:
To find frequent itemsets in extremely large
datasets where the data cannot fit into memory and multiple scans are too
expensive.
Core Idea:
Break the dataset into smaller partitions, mine frequent itemsets locally, then combine the local results and verify them against the entire dataset.
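The core idea can be sketched end to end in a few lines of Python (a hypothetical, brute-force sketch: enumeration up to `max_len` stands in for a real Apriori implementation, and all names are my own):

```python
from itertools import combinations

def local_frequent(partition, min_frac, max_len=3):
    """Pass 1 helper: mine one in-memory partition (brute-force Apriori stand-in)."""
    threshold = min_frac * len(partition)
    counts = {}
    for basket in partition:
        for k in range(1, max_len + 1):
            for items in combinations(sorted(basket), k):
                counts[items] = counts.get(items, 0) + 1
    return {items for items, c in counts.items() if c >= threshold}

def son(baskets, min_frac, num_partitions=4):
    size = -(-len(baskets) // num_partitions)      # ceiling division
    partitions = [baskets[i:i + size] for i in range(0, len(baskets), size)]
    # Pass 1: the union of locally frequent itemsets forms the candidate list.
    candidates = set()
    for p in partitions:
        candidates |= local_frequent(p, min_frac)
    # Pass 2: one scan of the full dataset counts only those candidates.
    counts = {c: 0 for c in candidates}
    for basket in baskets:
        for c in candidates:
            if set(basket).issuperset(c):
                counts[c] += 1
    return {c for c, n in counts.items() if n >= min_frac * len(baskets)}
```

With `min_frac = 0.5` and ten small baskets, `son` returns exactly the itemsets appearing in at least half of them; the two-pass structure is visible in the two loops over `partitions` and `baskets`.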
Key Technologies Used in SON Algorithm
Below are the technologies where SON is used in practice:
1. MapReduce Framework
SON was originally designed to run on MapReduce.
Pass 1 – Map phase: Each mapper receives a partition, runs Apriori/FP-Growth locally, and emits its local frequent itemsets.
Pass 1 – Reduce phase: Combines the local frequent itemsets from all mappers and deduplicates them to create the global candidate list.
Pass 2 – Map phase: Each mapper scans its portion of the full dataset and counts only the candidate itemsets.
Pass 2 – Reduce phase: Sums the partial counts and keeps the itemsets that meet the global support threshold.
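The four phases above can be sketched as one plain function each (a hypothetical simulation; a real Hadoop job would wire these up through its own Mapper/Reducer classes, and the function names are my own):

```python
from collections import defaultdict
from itertools import combinations

# Pass 1, map: mine one partition, emit (itemset, 1) for each locally frequent set.
def pass1_map(partition, local_threshold):
    counts = defaultdict(int)
    for basket in partition:
        for k in (1, 2):                       # singletons and pairs suffice here
            for items in combinations(sorted(basket), k):
                counts[items] += 1
    return [(items, 1) for items, c in counts.items() if c >= local_threshold]

# Pass 1, reduce: deduplicate into the global candidate list.
def pass1_reduce(emitted):
    return sorted({items for items, _ in emitted})

# Pass 2, map: scan the assigned partition, counting only the candidates.
def pass2_map(partition, candidates):
    return [(c, sum(1 for b in partition if set(b) >= set(c))) for c in candidates]

# Pass 2, reduce: sum partial counts, keep itemsets meeting global support s.
def pass2_reduce(emitted, s):
    totals = defaultdict(int)
    for items, n in emitted:
        totals[items] += n
    return {items for items, n in totals.items() if n >= s}
```

Chaining them over a list of partitions reproduces the two MapReduce jobs in miniature.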
2. Apache Hadoop
· Distributed file storage (HDFS)
· MapReduce jobs run in parallel
· Handles petabytes of data
· SON fits naturally into Hadoop's chunk-based processing model
3. Apache Spark
SON can also be implemented in Spark using:
· In-memory local mining of each partition (e.g., mapPartitions)
· map + reduce operations for merging candidates and global counting
Spark is typically faster than Hadoop MapReduce because:
· Local computations stay in memory
· Iterative algorithms (like Apriori) benefit significantly
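A plain-Python stand-in for the two Spark stages (hypothetical: in real PySpark, Pass 1 would be `rdd.mapPartitions(...).distinct()`, Pass 2 `rdd.flatMap(...).reduceByKey(add)`, with the candidate list shipped to workers via `sc.broadcast`; the names below are my own):

```python
from itertools import combinations

def mine_partition(baskets, local_threshold):
    """Pass 1: return this partition's locally frequent itemsets (pairs max)."""
    counts = {}
    for b in baskets:
        for k in (1, 2):
            for items in combinations(sorted(b), k):
                counts[items] = counts.get(items, 0) + 1
    return [i for i, c in counts.items() if c >= local_threshold]

def count_in_basket(basket, candidates):
    """Pass 2: emit (itemset, 1) for each candidate contained in the basket."""
    return [(c, 1) for c in candidates if set(basket) >= set(c)]
```

Summing the emitted pairs per itemset and filtering by the global threshold is exactly what `reduceByKey` followed by `filter` would do on a cluster.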
4. Distributed File Storage Systems
SON assumes the data is stored in distributed blocks (for example, in HDFS). Such systems allow each worker to read only its required partitions.
5. Parallel Computing Concepts
SON relies on:
· Data parallelism: partitions are processed independently
· Aggregation: partial results are combined into a global answer
Detailed Step-by-Step Description of SON Algorithm
The SON algorithm works in two major passes over
the full dataset, but internally runs Apriori or FP-Growth on smaller
partitions.
Pass 1: Local Frequent Itemset Generation
(Partition-wise Mining)
Step 1.1 – Partitioning
The large dataset D is divided
into k equal-sized partitions.
Example: If dataset D = 100 million transactions → divide
into 100 partitions of 1 million each.
Step 1.2 – Local Support Threshold
For a global support threshold of s transactions, the local threshold for a partition becomes:
s_local = s × (size of partition / size of dataset)
If support is instead expressed as a fraction, the same fraction is simply applied to each partition, which for equal-sized partitions amounts to the same thing and keeps the implementation simple.
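The scaling formula in Step 1.2, computed with the numbers from the worked example at the end of this document (the function name is my own):

```python
# Convert a global count threshold into a per-partition count threshold.
def local_threshold(global_count, partition_size, dataset_size):
    return global_count * partition_size / dataset_size

s = 30 * 10_000 // 100                      # 30% of 10,000 = 3,000 transactions
print(local_threshold(s, 5_000, 10_000))    # -> 1500.0
```

For k equal partitions this is just s / k, matching the "30% of 5,000 = 1,500" computation in the example.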
Step 1.3 – Mine Each Partition Independently
For every partition:
1. Load the partition into memory
2. Run Apriori/FP-Growth locally
3. Generate the local frequent itemsets
If an itemset is globally frequent, it must meet the scaled local threshold in at least one partition.
Step 1.4 – Merge All Local Frequent Itemsets
· Combine the frequent itemsets from all partitions
· Remove duplicates
· The result becomes the candidate frequent itemset list
After Pass 1, the algorithm only knows candidate itemsets, not actual global frequent itemsets.
Pass 2: Global Verification of Candidates
Step 2.1 – Count Candidate Itemsets in the Full Dataset
· The full dataset is scanned once more
· Only the candidate itemsets are counted
· This drastically reduces the computational cost
Step 2.2 – Check Against the Global Support Threshold
If support ≥ s, the itemset is globally frequent.
Step 2.3 – Output the Final Frequent Itemsets
✔ 2-pass algorithm
✔ Memory efficient
✔ Big-data scalable
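Pass 2 in isolation amounts to a single counting scan over the full dataset; a minimal sketch (the function name is my own):

```python
def verify_candidates(baskets, candidates, s):
    """Pass 2: one full scan, counting only the Pass-1 candidate itemsets."""
    counts = {c: 0 for c in candidates}
    for basket in baskets:
        items = set(basket)
        for c in candidates:
            if items >= set(c):     # candidate is contained in this basket
                counts[c] += 1
    return {c for c, n in counts.items() if n >= s}
```

Because only candidates are counted, the per-basket work is proportional to the candidate list, not to all possible itemsets.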
Conclusion
SON relies on a key counting property of support: if an itemset is globally frequent, it must be frequent (at the scaled threshold) in at least one local partition; otherwise its per-partition counts could not sum to the global threshold.
This guarantees that:
· No globally frequent itemset will be missed (no false negatives)
· The union of the local frequent itemsets is a safe overestimate of the answer; Pass 2 removes the false positives
Thus SON is an example of a candidate-generation-based distributed algorithm.
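The no-false-negative guarantee follows from a one-line counting argument in contrapositive form, using the scaled local threshold from Step 1.2 (D is the full dataset, D_1, …, D_k its partitions, s the global count threshold):

```latex
\text{If } \mathrm{count}_{D_i}(I) < s\cdot\frac{|D_i|}{|D|} \text{ for every partition } D_i, \text{ then }
\mathrm{count}_D(I) \;=\; \sum_{i=1}^{k} \mathrm{count}_{D_i}(I)
\;<\; s \sum_{i=1}^{k} \frac{|D_i|}{|D|} \;=\; s .
```

So any itemset that fails the local threshold everywhere cannot be globally frequent, which is exactly why the merged local results form a complete candidate list.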
Advantages of SON Algorithm
1. Highly Scalable
· Works on massive datasets
· Well suited to distributed processing
· Handles terabytes to petabytes
2. Memory Efficient
· Each partition fits into memory
· No need to load the entire dataset at once
3. Parallel Processing
· Each partition is processed independently
· Speed increases near-linearly with the number of machines
4. Reduced Candidate Explosion
· Local frequent itemsets are far fewer than all possible itemsets
· This significantly reduces the global candidate space
5. Robust in Big-Data Environments
· Works well with Hadoop, Spark, and other MapReduce-style frameworks
· Automatic fault tolerance (lost partitions are recomputed)
· Load-balanced and parallel-friendly
Disadvantages of SON Algorithm
1. Requires Two Full Passes Over the Data
· Expensive for extremely massive datasets
· High I/O cost
2. Load-Balancing Issues
If partitions are not uniform:
· Some may take much longer than others
· Computational imbalance occurs
3. Local Frequent Itemsets May Be Large
If partitions are very large:
· Local mining may still be expensive
· Memory overflow is possible
4. Not Suitable for Streaming Data
SON requires multiple passes over a static dataset; streaming models allow only a single pass, so they require online algorithms like DGIM.
5. Complexity Increases with Itemset Size
For long itemsets or dense datasets:
· Apriori on the local partitions is still slow
· Candidate generation becomes costly
Example
Assume a global minimum support of 30% on a dataset of 10,000 transactions, i.e., an itemset must appear in at least 3,000 transactions.
Divide the data into two partitions:
· Partition A: 5,000 transactions
· Partition B: 5,000 transactions
Local minimum support = 30% of 5,000 = 1,500.
Pass 1 (local mining):
· Partition A finds frequent itemsets: {A}, {B}, {C}, {D}, {A,C}, {B,D}
· Partition B finds frequent itemsets: {A}, {C}, {A,C}
Merge the local frequent itemsets → Candidates:
{A}, {B}, {C}, {D}, {A,C}, {B,D}
Pass 2 (global counting):
Scan the entire dataset (threshold = 3,000):
· {A} appears 3,300 times → frequent
· {C} appears 3,200 times → frequent
· {A,C} appears 3,100 times → frequent
· {B} appears 1,200 times → not frequent
· {B,D} appears 900 times → not frequent
· {D} appears 850 times → not frequent
Final frequent itemsets:
{A}, {C}, {A,C}
SON avoids huge candidate explosion by working locally,
thus making it efficient for big data systems like Hadoop and Spark.
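As a sanity check, the Pass-2 decision rule can be run on illustrative global counts (assumed here) against the 3,000-transaction threshold:

```python
# Pass-2 decision rule on illustrative global counts.
threshold = 30 * 10_000 // 100            # 30% of 10,000 = 3,000 transactions
counts = {('A',): 3300, ('B',): 1200, ('C',): 3200,
          ('D',): 850, ('A', 'C'): 3100, ('B', 'D'): 900}
frequent = {items for items, n in counts.items() if n >= threshold}
print(sorted(frequent))   # -> [('A',), ('A', 'C'), ('C',)]
```

Note that the counts respect monotonicity ({A,C} cannot appear more often than {A} or {C}), as any real dataset's counts must.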