SON Algorithm (Savasere–Omiecinski–Navathe Algorithm)


The SON algorithm uses the concept of divide-and-conquer. The large dataset is divided into several chunks that fit into memory, and each chunk is processed independently using any frequent-itemset algorithm (such as Apriori). Because every globally frequent itemset must be frequent in at least one chunk, the union of the local results is a complete candidate set that is then verified against the full data. Although SON is an algorithmic concept, it is commonly implemented in Big Data frameworks, especially using MapReduce.

Why It Fits the Limited-Pass Category

·         Pass 1: Each partition is scanned locally, and the local frequent itemsets from all partitions are collected.

·         Pass 2: All candidate itemsets are verified by scanning the whole database once.

Only two passes over the entire dataset are required.

Purpose:
To find frequent itemsets in extremely large datasets where the data cannot fit into memory and multiple scans are too expensive.

Core Idea:

Break the dataset into smaller partitions, mine frequent items locally, then combine and verify in the entire dataset.
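The two passes can be sketched end-to-end in a few lines of Python. This is a minimal single-machine illustration under simplifying assumptions: the "local mining" step just counts itemsets of size 1 and 2 (a stand-in for Apriori/FP-Growth), and `partitions` is any list of in-memory chunks of baskets.

```python
from itertools import combinations
from collections import Counter

def local_frequent(partition, min_frac, max_size=2):
    """Stand-in for Apriori: count all itemsets up to max_size in one partition."""
    threshold = min_frac * len(partition)
    counts = Counter()
    for basket in partition:
        for k in range(1, max_size + 1):
            for itemset in combinations(sorted(basket), k):
                counts[itemset] += 1
    return {s for s, c in counts.items() if c >= threshold}

def son(partitions, min_frac, max_size=2):
    # Pass 1: mine each partition locally, then union the results into candidates.
    candidates = set()
    for part in partitions:
        candidates |= local_frequent(part, min_frac, max_size)
    # Pass 2: one scan of the full dataset counts only the candidates.
    total = sum(len(p) for p in partitions)
    counts = Counter()
    for part in partitions:
        for basket in part:
            items = set(basket)
            for cand in candidates:
                if set(cand) <= items:
                    counts[cand] += 1
    return {s for s in candidates if counts[s] >= min_frac * total}
```

For example, `son([[{"a", "b"}, {"a", "c"}], [{"a", "c"}, {"b", "c"}]], 0.5)` returns the three singletons plus the pair `("a", "c")`.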

Key Technologies Used in SON Algorithm

Below are the technologies where SON is practically used:

1.      MapReduce Framework

SON was originally designed to run on MapReduce.

 

Pass 1

·         Map phase: Each mapper receives a partition, runs Apriori/FP-Growth locally, and emits its local frequent itemsets.

·         Reduce phase: Combines the local frequent itemsets from all mappers and deduplicates them to form the global candidate list.

Pass 2

·         Map phase: Each mapper scans its share of the full dataset and counts only the candidate itemsets.

·         Reduce phase: Sums the partial counts and outputs the itemsets that meet the global support threshold.
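The two MapReduce jobs can be emulated in plain Python with explicit map and reduce functions. This is an illustrative sketch, not real Hadoop code: the basket data, item names, and the simple pair-counting "miner" standing in for Apriori are all made up for the example.

```python
from itertools import combinations
from collections import Counter

# Toy "mapper inputs": two partitions of baskets (illustrative data).
partitions = [
    [{"milk", "bread"}, {"milk", "eggs"}, {"milk", "bread"}],
    [{"bread", "eggs"}, {"milk", "bread"}, {"bread"}],
]
GLOBAL_FRACTION = 0.5
TOTAL = sum(len(p) for p in partitions)

def map1(partition):
    """Job 1 mapper: mine singletons and pairs locally (stand-in for Apriori)."""
    threshold = GLOBAL_FRACTION * len(partition)
    counts = Counter()
    for basket in partition:
        for k in (1, 2):
            counts.update(combinations(sorted(basket), k))
    return [(itemset, 1) for itemset, c in counts.items() if c >= threshold]

def reduce1(emitted):
    """Job 1 reducer: deduplicate into the global candidate list."""
    return {itemset for itemset, _ in emitted}

def map2(partition, candidates):
    """Job 2 mapper: scan baskets, emit partial counts for candidates only."""
    out = Counter()
    for basket in partition:
        for cand in candidates:
            if set(cand) <= basket:
                out[cand] += 1
    return list(out.items())

def reduce2(emitted):
    """Job 2 reducer: sum partial counts and apply the global threshold."""
    totals = Counter()
    for itemset, c in emitted:
        totals[itemset] += c
    return {s for s, c in totals.items() if c >= GLOBAL_FRACTION * TOTAL}

candidates = reduce1([kv for p in partitions for kv in map1(p)])
frequent = reduce2([kv for p in partitions for kv in map2(p, candidates)])
```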

2.      Apache Hadoop

·         Distributed file storage (HDFS)

·         MapReduce jobs run in parallel

·         Handles petabytes of data

·         SON fits perfectly into Hadoop’s chunk-based processing model

 

3.      Apache Spark

SON can also be implemented using:

·         RDD partitions

·         In-memory local mining

·         Map and reduce operations

Spark is faster because:

·         Local computations are in-memory

·         Iterative algorithms (like Apriori) benefit significantly
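On Spark, the two passes map naturally onto `mapPartitions` (local mining), `distinct`/`collect` (candidate list), and `flatMap` plus `reduceByKey` (global counting). The sketch below simulates that pipeline with plain Python lists and generators so it runs without a cluster; the data and function names are illustrative.

```python
from itertools import combinations, chain
from collections import Counter

# Simulated RDD: a list of partitions, each a list of baskets (made-up data).
rdd = [
    [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}],
    [{"b", "c"}, {"a", "b"}, {"a", "b", "c"}],
]
s = 0.5  # global support fraction
n = sum(len(p) for p in rdd)

def mine_partition(baskets):
    """Body of a Pass 1 mapPartitions: yield locally frequent itemsets."""
    counts = Counter(chain.from_iterable(
        combinations(sorted(b), k) for b in baskets for k in (1, 2)))
    for itemset, c in counts.items():
        if c >= s * len(baskets):
            yield itemset

# Pass 1: mapPartitions + distinct (here: a set union over generator output).
candidates = set(chain.from_iterable(mine_partition(p) for p in rdd))

# Pass 2: flatMap every basket to (candidate, 1) pairs, then reduceByKey (sum).
totals = Counter()
for part in rdd:
    for basket in part:
        for cand in candidates:
            if set(cand) <= basket:
                totals[cand] += 1
frequent = {c for c in candidates if totals[c] >= s * n}
```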

 

4.       Distributed File Storage Systems

SON assumes data is stored in distributed blocks:

·         HDFS

·         Amazon S3

·         Google Cloud Storage

·         Azure Blob Storage

These systems allow reading only the required partitions.

 

5.      Parallel Computing Concepts

SON relies on:

·         Partition parallelism

·         Local computation

·         Distributed counting

·         Fault tolerance

 

Detailed Step-by-Step Description of SON Algorithm

The SON algorithm works in two major passes over the full dataset, but internally runs Apriori or FP-Growth on smaller partitions.

 

Pass 1: Local Frequent Itemset Generation (Partition-wise Mining)

Step 1.1 – Partitioning

The large dataset D is divided into k equal-sized partitions.

Example: If dataset D = 100 million transactions → divide into 100 partitions of 1 million each.


Step 1.2 – Local Support Threshold

For a global support threshold s (expressed as a count over the full dataset), the local threshold for a partition becomes:

s_local = s × (partition size / dataset size)

In most versions, however, the same fractional support is simply applied to each partition to keep the implementation simple.
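As a quick numeric check of the scaling (the dataset and partition sizes are made up):

```python
# Illustrative numbers: 30% global support on a dataset of
# 10,000 transactions, split into 10 partitions of 1,000 each.
global_fraction = 0.30
dataset_size = 10_000
partition_size = 1_000

s = global_fraction * dataset_size               # global count threshold: 3,000
s_local = s * (partition_size / dataset_size)    # local count threshold: 300
```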


Step 1.3 – Mine Each Partition Independently

For every partition:

1.      Load partition into memory

2.      Run Apriori/FP-Growth locally

3.      Generate local frequent itemsets

If an itemset is globally frequent, it must be locally frequent in at least one partition; otherwise its count would fall below the scaled threshold in every partition, and the total could not reach the global threshold.


Step 1.4 – Merge All Local Frequent Itemsets

·         Combine frequent sets from all partitions

·         Remove duplicates

·         This becomes the candidate frequent itemset list

After Pass 1, the algorithm only knows candidate itemsets, not actual global frequent itemsets.
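In code, Step 1.4 is a single set union, which removes duplicates automatically. A tiny sketch with made-up local results:

```python
# Locally frequent itemsets from three hypothetical partitions.
local_results = [
    {("A",), ("B",), ("A", "B")},
    {("A",), ("C",)},
    {("A",), ("A", "B"), ("C",)},
]
# Set union merges and deduplicates in one step.
candidates = set().union(*local_results)
print(sorted(candidates))  # [('A',), ('A', 'B'), ('B',), ('C',)]
```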

Pass 2: Global Verification of Candidates

Step 2.1 – Count Candidate Itemsets in Full Dataset

·         The full dataset is scanned again

·         Only the candidate itemsets are counted

·         This drastically reduces computational cost


Step 2.2 – Check Against Global Support Threshold

If support ≥ s, the itemset is globally frequent; otherwise it is discarded as a false positive from Pass 1.


Step 2.3 – Output Final Frequent Itemsets

The itemsets that pass the global threshold are output as the final result. SON is therefore:

·         A 2-pass algorithm

·         Memory efficient

·         Big-data scalable

Conclusion

SON relies on a key property of support counts: if an itemset is globally frequent, it must be frequent in at least one local partition, because missing the scaled threshold in every partition would keep its total count below the global threshold.

This guarantees that:

·         No globally frequent itemset will be missed

·         Local frequent itemsets are safe overestimates

Thus SON is an example of a candidate generation-based distributed algorithm.
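The "no false negatives" guarantee is simple arithmetic: if an itemset misses the scaled local threshold in every one of the k partitions, its partition counts sum to less than k times that threshold, i.e., less than the global threshold. A quick check with made-up numbers:

```python
# Illustrative numbers: global threshold 3,000 (30% of 10,000 transactions),
# four partitions of 2,500 transactions each -> scaled local threshold 750.
local_threshold = 750
num_partitions = 4

# Suppose an itemset is NOT locally frequent anywhere: every count < 750.
partition_counts = [749, 700, 680, 749]
assert all(c < local_threshold for c in partition_counts)

total = sum(partition_counts)
# Its global count is forced below 4 * 750 = 3,000, so it cannot be
# globally frequent -- hence no frequent itemset is ever missed in Pass 1.
assert total < num_partitions * local_threshold
```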

 

Advantages of SON Algorithm

1. Highly Scalable

·         Works on massive datasets

·         Perfect for distributed processing

·         Handles terabytes/petabytes

2. Memory Efficient

·         Each partition fits into memory

·         No need to load the entire dataset at once

3. Parallel Processing

·         Each partition is processed independently

·         Speed scales nearly linearly with the number of machines

4. Reduced Candidate Explosion

·         Local frequent itemsets are much smaller

·         Reduces global candidate space significantly

5. Robust in Big-Data Environments

·         Works well with Hadoop, Spark, MapReduce

·         Auto fault-tolerance (recompute lost partitions)

·         Load-balanced and parallel-friendly

 

Disadvantages of SON Algorithm

1. Requires Two Full Passes Over Data

·         Expensive for extremely massive datasets

·         High I/O cost

2. Load Balancing Issues

If partitions are not uniform:

·         Some may take longer

·         Computational imbalance occurs

3. Local Frequent Itemsets May Be Large

If partitions are very large:

·         Local mining may still be expensive

·         Memory overflow possible

4. Not Suitable for Streaming Data

SON requires:

·         A static dataset

·         Multiple passes

·         Offline mining

Streaming models require online algorithms like DGIM.

5. Complexity Increases With Itemset Size

For long itemsets or dense datasets:

·         Apriori on local partitions is still slow

·         Candidate generation becomes costly

 

Example

Assume the global minimum support = 30% and the dataset has 10,000 transactions, so an itemset must appear in at least 3,000 transactions to be globally frequent.

Divide data into two partitions:

·         Partition A: 5,000 transactions

·         Partition B: 5,000 transactions

Local minimum support = 30% of 5,000 = 1,500.

Pass 1 (local mining):

·         Partition A finds frequent itemsets:

{A}{B}{C}{D}{A,C}{B,D}

(By downward closure, {C} and {D} must also be locally frequent in A, since {A,C} and {B,D} are.)

·         Partition B finds frequent itemsets:
{A}{C}{A,C}

Merge local frequent itemsets → Candidates:
{A}{B}{C}{D}{A,C}{B,D}

Pass 2 (global counting):

Scan the entire dataset (global threshold count = 30% of 10,000 = 3,000):

·         {A} appears 3,300 → frequent

·         {C} appears 3,200 → frequent

·         {A,C} appears 3,100 → frequent

·         {B} appears 1,200 → not frequent

·         {B,D} appears 900 → not frequent

·         {D} appears 850 → not frequent

Final frequent itemsets:

{A}{C}{A,C}

SON avoids huge candidate explosion by working locally, thus making it efficient for big data systems like Hadoop and Spark.
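The Pass 2 decision is just a threshold test per candidate. A minimal sketch using the example's candidate list with illustrative global counts (threshold = 30% of 10,000 = 3,000):

```python
# Candidates merged from Pass 1, with illustrative global counts.
support_threshold = 3000  # 30% of 10,000 transactions
global_counts = {
    ("A",): 3300,
    ("B",): 1200,
    ("C",): 3200,
    ("D",): 850,
    ("A", "C"): 3100,
    ("B", "D"): 900,
}
frequent = {s for s, c in global_counts.items() if c >= support_threshold}
print(sorted(frequent))  # [('A',), ('A', 'C'), ('C',)]
```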