Web Behavior Analysis — Case Study (Clustering & Pattern Mining)
This case study explains a step-by-step business approach to web behavior analysis using unsupervised techniques: **grid-based subspace clustering (CLIQUE, “CLustering In QUEst”)** and pattern mining on session & event data to produce actionable **user behavior insights** for targeted experience and marketing. We include a small sample dataset for manual walk-through, the math behind the approach, a reproducible Python implementation (CLIQUE-style grid clustering + frequent pattern extraction), output tables, visualizations and interpretation.
Note: CLIQUE (Clustering In QUEst) is a well-known grid-based subspace clustering algorithm designed to discover dense subspaces of high-dimensional data and form clusters from adjacent dense grid cells. CLIQUE is often used when interesting clusters exist in different subspaces rather than the full feature set.
1. Business Problem (concise)
An e-commerce product team wants to understand session-level web behavior to: 1) identify distinct user segments (e.g., explorers, buyers, bargain-searchers), 2) discover frequent navigation patterns (e.g., product→reviews→cart), and 3) take targeted actions (personalized banners, optimized funnels). The dataset is clickstream / session-aggregated features; the goal is unsupervised discovery because labeled behavior classes are not available.
2. Why a CLIQUE-style approach + pattern mining?
Web behavior often lives in subspaces: clusters may exist in combinations of features (e.g., high time-on-site & many product page views) but not across every feature. Grid-based subspace algorithms (like CLIQUE) discretize each dimension into intervals, flag dense cells, and merge adjacent dense cells to form clusters, enabling discovery of clusters in different subspaces automatically. After clustering, frequent pattern mining (e.g., Apriori / FP-Growth) on page-sequence or event-bucket data unearths common navigation paths inside each cluster, producing actionable insights for targeted interventions.
3. Sample dataset (toy) — Session-level features
Save this small CSV as web_sessions_sample.csv. Each row is a session (user visit). For simplicity we provide five sample sessions to walk through a manual grid calculation; later we provide code that works on larger synthetic data.
| SessionID | PagesViewed | SessionDuration_sec | ProductsViewed | AddedToCart | Bounce (0/1) |
|---|---|---|---|---|---|
| S1 | 2 | 45 | 1 | 0 | 1 |
| S2 | 10 | 320 | 5 | 1 | 0 |
| S3 | 8 | 210 | 3 | 0 | 0 |
| S4 | 1 | 30 | 0 | 0 | 1 |
| S5 | 12 | 400 | 7 | 1 | 0 |
Interpretation: S1 and S4 are short bounces; S2, S3 and S5 are longer sessions with product interest, and two of them (S2, S5) add to cart. Real data should contain thousands of sessions and additional features (referrer, device, geo, time-of-day, funnel-stage events).
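The table above can be turned into the CSV file the text refers to. The following is a minimal sketch using only the standard library; the `Bounce` column name stands in for the `Bounce (0/1)` header, and `load_sessions` is a hypothetical helper introduced here for illustration.

```python
import csv
import io

# The five toy sessions from the table, as CSV text.
CSV_TEXT = """SessionID,PagesViewed,SessionDuration_sec,ProductsViewed,AddedToCart,Bounce
S1,2,45,1,0,1
S2,10,320,5,1,0
S3,8,210,3,0,0
S4,1,30,0,0,1
S5,12,400,7,1,0
"""

def load_sessions(text):
    """Parse CSV text into a list of dicts, converting numeric columns to int."""
    sessions = []
    for row in csv.DictReader(io.StringIO(text)):
        sessions.append({k: (v if k == "SessionID" else int(v)) for k, v in row.items()})
    return sessions

sessions = load_sessions(CSV_TEXT)
print(sessions[0])

# To save the file for later steps:
# open("web_sessions_sample.csv", "w").write(CSV_TEXT)
```

Keeping the rows as plain dictionaries is enough for the manual walk-through; for larger data, `pandas.read_csv` on the saved file is the more natural choice.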
4. CLIQUE-style grid & density — equations
bin_index = ⌊ (x - min_f) / width_f ⌋
den(G) = |{ x ∈ dataset : x falls into G }|
G is dense if den(G) ≥ τ (density threshold)
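Here min_f and width_f are the minimum and the bin width of feature f. CLIQUE finds 1D dense intervals first, then joins them into k-dimensional candidate cells and tests their density. This bottom-up search works because a cell that is dense in k dimensions is also dense in each of its (k-1)-dimensional projections, so any candidate with a sparse projection can be pruned. Adjacent dense grid cells are then merged to build clusters.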
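The bin-index and density definitions above, together with the bottom-up step from 1D intervals to 2D candidates, can be sketched in a few lines. This is an illustrative sketch, not the full CLIQUE algorithm: it assumes equal-width bins, three intervals per feature (`XI`), an absolute count threshold (`TAU`), and it stops at two dimensions. The data are the five toy sessions from section 3.

```python
from collections import Counter
from itertools import combinations
import math

# Toy sessions: (PagesViewed, SessionDuration_sec, ProductsViewed)
data = [(2, 45, 1), (10, 320, 5), (8, 210, 3), (1, 30, 0), (12, 400, 7)]
XI = 3   # intervals per dimension (assumed for this sketch)
TAU = 2  # density threshold as an absolute count

n_dims = len(data[0])
mins = [min(p[d] for p in data) for d in range(n_dims)]
maxs = [max(p[d] for p in data) for d in range(n_dims)]
widths = [(maxs[d] - mins[d]) / XI for d in range(n_dims)]

def bin_index(x, d):
    """floor((x - min_f) / width_f), with the maximum value clamped into the last bin."""
    return min(int(math.floor((x - mins[d]) / widths[d])), XI - 1)

binned = [tuple(bin_index(p[d], d) for d in range(n_dims)) for p in data]

def density(unit):
    """Number of points that fall inside every (dim, bin) interval of the unit."""
    return sum(all(b[d] == i for d, i in unit) for b in binned)

# Step 1: dense 1D units, each stored as a one-element tuple ((dim, bin),)
counts_1d = Counter(((d, b[d]),) for b in binned for d in range(n_dims))
dense_1d = {u for u, c in counts_1d.items() if c >= TAU}

# Step 2: 2D candidates are pairs of dense 1D units from different dimensions
candidates_2d = [u + v for u, v in combinations(sorted(dense_1d), 2)
                 if u[0][0] != v[0][0]]
dense_2d = {u: density(u) for u in candidates_2d if density(u) >= TAU}

print("dense 1D units:", sorted(dense_1d))
print("dense 2D units:", dense_2d)
```

Only pairs built from dense 1D units are ever counted, which is the pruning that keeps the search tractable as the number of features grows.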
5. Manual step-by-step grid calculation (tiny example)
We illustrate a 2D grid over PagesViewed and SessionDuration_sec for the five sessions above. Choose simple bins:
- PagesViewed bins: [0–3], [4–8], [9–15]
- SessionDuration bins (sec): [0–60], [61–240], [241–500]
Map each session to a 2D cell (PagesBin, DurationBin):
- S1 -> Pages 2 -> bin1 (0–3), Duration 45 -> binA (0–60) => cell (bin1, binA)
- S2 -> Pages 10 -> bin3 (9–15), Duration 320 -> binC (241–500) => cell (bin3, binC)
- S3 -> Pages 8 -> bin2 (4–8), Duration 210 -> binB (61–240) => cell (bin2, binB)
- S4 -> Pages 1 -> bin1 (0–3), Duration 30 -> binA (0–60) => cell (bin1, binA)
- S5 -> Pages 12 -> bin3 (9–15), Duration 400 -> binC (241–500) => cell (bin3, binC)
Count densities per cell:
| Cell | Members | Count |
|---|---|---|
| (bin1,binA) | S1,S4 | 2 |
| (bin2,binB) | S3 | 1 |
| (bin3,binC) | S2,S5 | 2 |
With a density threshold τ = 2, the dense cells are (bin1,binA) and (bin3,binC). Adjacent dense cells would be merged into clusters — here they are separate, giving two clusters: short bounces and engaged shoppers. Sessions in sparse cells (e.g., S3) are either noise or further analyzed in 1D subspaces (maybe PagesViewed alone) to find patterns.
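The manual calculation above can be checked with a short script. This is a sketch using only the standard library; the bin edges and τ = 2 come from the walk-through, while the `adjacent` and `label` helpers are names introduced here for illustration.

```python
from bisect import bisect_left
from collections import defaultdict

# Toy sessions: SessionID -> (PagesViewed, SessionDuration_sec)
sessions = {"S1": (2, 45), "S2": (10, 320), "S3": (8, 210), "S4": (1, 30), "S5": (12, 400)}

PAGE_UPPER = [3, 8, 15]          # inclusive upper edges of bin1, bin2, bin3
DURATION_UPPER = [60, 240, 500]  # inclusive upper edges of binA, binB, binC
PAGE_LABELS = ["bin1", "bin2", "bin3"]
DURATION_LABELS = ["binA", "binB", "binC"]
TAU = 2

# Map each session to an index cell (page_bin, duration_bin) and collect members.
cells = defaultdict(list)
for sid, (pages, duration) in sessions.items():
    cell = (bisect_left(PAGE_UPPER, pages), bisect_left(DURATION_UPPER, duration))
    cells[cell].append(sid)

dense = {cell: members for cell, members in cells.items() if len(members) >= TAU}

def adjacent(a, b):
    """Two distinct cells are adjacent if every bin index differs by at most 1."""
    return a != b and all(abs(i - j) <= 1 for i, j in zip(a, b))

def label(cell):
    """Translate an index cell into the bin names used in the walk-through."""
    return (PAGE_LABELS[cell[0]], DURATION_LABELS[cell[1]])

for cell, members in sorted(cells.items()):
    print(label(cell), members, "dense" if cell in dense else "sparse")
```

With these inputs the two dense cells are not adjacent, so they remain two separate clusters, matching the table.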
6. Python implementation (CLIQUE-style grid clustering + frequent pattern mining)
The code below is self-contained. It synthesizes a larger session dataset, runs grid-based clustering (a simple CLIQUE-like implementation: discretize dimensions, find dense cells, merge adjacent dense cells), extracts cluster assignments, and runs Apriori frequent itemset mining (via mlxtend) on simplified page-category itemsets for each cluster to surface common patterns. Unlike full CLIQUE, it grids one fixed three-feature space rather than searching over subspaces. Replace the synthetic data with your real session CSV and adjust parameters (bins, τ).
# Requirements: pandas, numpy, matplotlib, scikit-learn, mlxtend (for frequent_patterns)
# Install missing packages if needed: pip install scikit-learn mlxtend
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer
from itertools import product
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import apriori, association_rules
# 1) Synthetic session dataset (n=600)
np.random.seed(42)
n = 600
# Features (simulate realistic web session stats)
pages = np.random.poisson(4, n) + np.random.choice([0,1,2], size=n, p=[0.7,0.2,0.1])
duration = np.clip(np.random.exponential(scale=180, size=n).astype(int), 5, 2000) # seconds
products_viewed = np.random.poisson(2, n)
added_to_cart = (np.random.rand(n) < (0.1 + 0.05*(pages>6))).astype(int)
bounce = (np.random.rand(n) < (0.3 - 0.05*(pages>3))).astype(int)
df = pd.DataFrame({
    'SessionID': [f'S{10000+i}' for i in range(n)],
    'PagesViewed': pages,
    'SessionDuration': duration,
    'ProductsViewed': products_viewed,
    'AddedToCart': added_to_cart,
    'Bounce': bounce
})
# 2) Select numeric features for grid-based subspace clustering
features = ['PagesViewed', 'SessionDuration', 'ProductsViewed']
# 3) Discretize each feature into bins (KBinsDiscretizer with uniform strategy)
n_bins = [4, 4, 3] # example bin counts per feature
kbd = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='uniform')
X_bins = kbd.fit_transform(df[features])
# 4) Build grid cell keys and count densities
cell_keys = [tuple(row.astype(int)) for row in X_bins]
df['cell'] = cell_keys
cell_counts = df['cell'].value_counts().to_dict()
# 5) Density threshold (tau) — e.g., min 12 sessions per cell for dense
tau = 12
dense_cells = {cell:ct for cell,ct in cell_counts.items() if ct >= tau}
# 6) Simple adjacency (cells adjacent if differ by at most 1 on each dimension) -> merge
from collections import deque
def neighbors(cell):
    # yields neighboring cell keys including itself
    ranges = [range(c-1, c+2) for c in cell]
    for nb in product(*ranges):
        yield tuple(nb)
visited = set()
clusters = {}
cluster_id = 0
for cell in dense_cells:
    if cell in visited:
        continue
    # BFS over adjacent dense cells
    q = deque([cell])
    visited.add(cell)
    clusters[cluster_id] = []
    while q:
        cur = q.popleft()
        clusters[cluster_id].append(cur)
        for nb in neighbors(cur):
            if nb in dense_cells and nb not in visited:
                visited.add(nb)
                q.append(nb)
    cluster_id += 1
# 7) Assign sessions to clusters based on cell membership
def assign_cluster(cell):
    for cid, cells in clusters.items():
        if cell in cells:
            return cid
    return -1 # noise / unclustered
df['cluster'] = df['cell'].apply(assign_cluster)
# 8) Cluster summary table
summary = df.groupby('cluster').agg(
    Count=('SessionID','count'),
    MeanPages=('PagesViewed','mean'),
    MeanDuration=('SessionDuration','mean'),
    AddToCartRate=('AddedToCart','mean'),
    BounceRate=('Bounce','mean')
).reset_index().sort_values('cluster')
print('Cluster summary (cluster=-1 means unclustered/noise):')
print(summary.to_string(index=False))
# 9) Pattern mining: create simplified page-bucket sequences (simulated) and mine per cluster
# For demo, simulate page categories visited in each session as itemsets
page_categories = ['home','search','listing','product','reviews','cart','checkout']
def simulate_path(pages):
    # simple random path generator biased by pages count
    path = []
    if np.random.rand() < 0.9: path.append('home')
    for i in range(max(1,pages)):
        path.append(np.random.choice(['search','listing','product','reviews'], p=[0.2,0.45,0.3,0.05]))
        if np.random.rand() < 0.15: path.append('cart')
        if np.random.rand() < 0.05: path.append('checkout')
    return list(dict.fromkeys(path)) # unique preserve order for simplicity
df['pages_sequence'] = df['PagesViewed'].apply(simulate_path)
# Convert to one-hot itemset per session for Apriori
# Build boolean presence matrix per session
rows = []
for seq in df['pages_sequence']:
    rows.append({cat: (cat in seq) for cat in page_categories})
items_df = pd.DataFrame(rows)
# For each cluster, run apriori
for cid in sorted(df['cluster'].unique()):
    if cid == -1:
        continue
    subset = items_df[df['cluster']==cid]
    if len(subset) < 20:
        continue
    freq = apriori(subset, min_support=0.08, use_colnames=True)
    rules = association_rules(freq, metric="confidence", min_threshold=0.5)
    print(f"Cluster {cid} frequent itemsets (support>=0.08):")
    print(freq.sort_values('support', ascending=False).head(10).to_string(index=False))
    if not rules.empty:
        print("Top rules:")
        print(rules[['antecedents','consequents','support','confidence','lift']].head(5).to_string(index=False))
# 10) Visualization: scatter of PagesViewed vs SessionDuration colored by cluster
plt.figure(figsize=(8,5))
palette = { -1: '#cccccc' }
for i in range(cluster_id):
    palette[i] = plt.cm.tab10(i % 10)
colors = df['cluster'].map(palette)
plt.scatter(df['PagesViewed'], df['SessionDuration'], c=colors, alpha=0.6, s=20)
plt.xlabel('Pages Viewed')
plt.ylabel('Session Duration (sec)')
plt.title('Grid-based CLIQUE-style Clusters on Sessions')
plt.grid(alpha=0.2)
plt.tight_layout()
plt.show()
7. Example output (interpreting sample run)
After running the code you will see a cluster summary table similar to the below (values illustrative — your run will vary):
| Cluster | Count | MeanPages | MeanDuration | AddToCartRate | BounceRate |
|---|---|---|---|---|---|
| 0 | 145 | 10.2 | 420.5 | 0.28 | 0.05 |
| 1 | 98 | 3.1 | 95.2 | 0.06 | 0.40 |
| -1 | 357 | 4.2 | 172.1 | 0.09 | 0.28 |
Interpretation: Cluster 0 = engaged shoppers (many pages, long sessions, higher add-to-cart rate). Cluster 1 = brief, bounce-prone sessions (few pages, short duration, high bounce rate). Unclustered sessions (-1) fall in sparse cells and are a mix of behaviors; tune the bin counts or τ to capture more of them in clusters.
8. Visualization & interpretation
The scatter plot (PagesViewed vs SessionDuration) colored by cluster helps stakeholders see where high-value sessions sit. Combine this visual with frequent-pattern outputs (e.g., {product, reviews} -> {cart}) to derive targeted actions:
- Cluster 0 (engaged shoppers): show personalized product bundles, experiment with cart nudges and product recommendations to increase conversion rate further.
- Cluster 1 (short bounces): test faster landing page variants, simplified CTAs, or targeted search improvements to reduce bounce and guide to relevant products.
- Frequent patterns: if pattern mining shows pages commonly visited before abandonment (e.g., product→reviews→exit), add prominent CTAs on review pages or remove friction from the cart flow.
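Itemset mining ignores page order, so an ordered pattern such as product→reviews→exit needs a sequence-aware count. A minimal sketch of one way to do this is to count the last pages of every session that ended without converting. The `paths` data and the `exit_ngrams` helper below are hypothetical and purely illustrative.

```python
from collections import Counter

# Hypothetical ordered page paths, one per session (illustrative, not real data).
paths = [
    ["home", "listing", "product", "reviews"],
    ["home", "search", "product", "reviews"],
    ["home", "listing", "product", "cart", "checkout"],
    ["search", "product", "reviews"],
    ["home", "listing"],
]

def exit_ngrams(paths, n=2, converted_page="checkout"):
    """Count the last n pages of every session that ended without converting."""
    counts = Counter()
    for path in paths:
        if converted_page in path or len(path) < n:
            continue  # skip converted sessions and paths shorter than n
        counts[tuple(path[-n:]) + ("exit",)] += 1
    return counts

top = exit_ngrams(paths).most_common(3)
for pattern, count in top:
    print(" -> ".join(pattern), count)
```

Running this per cluster, rather than on all sessions at once, keeps the exit patterns tied to the segment that a given intervention targets.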
9. Practical notes & tuning checklist
- Bin selection: number and strategy of bins per feature strongly affect grid density — experiment with uniform, quantile or k-means-based binning.
- Density threshold τ: choose it relative to dataset size. A cell holding 0.5% of sessions is only 3 sessions in a 600-session sample (likely noise) but 5,000 sessions in a million-session log (likely a real segment). Use validation or domain rules.
- Subspace support: full CLIQUE searches for dense subspaces automatically, whereas the simplified code in section 6 grids one fixed three-feature space. Inspect 1D and 2D dense intervals to understand which features form a segment.
- Merging adjacency: adjacency in high-dim grids can be expensive; implement efficient neighbor enumeration or leverage sparse indices for large datasets.
- Pattern mining: run Apriori/FP-Growth per cluster on event sequences or bucketed page-categories to get interpretable navigation rules.
- Evaluation: track business KPIs (conversion, revenue per session, retention lift) from A/B tests that apply cluster-specific interventions.
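The first two checklist items, bin strategy and a size-relative τ, can be explored with a short experiment. This sketch uses NumPy only and assumes exponentially distributed durations (as in the synthetic data of section 6), four bins, and a hypothetical `TAU_FRACTION` of 2%; none of these values are recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
duration = rng.exponential(scale=180, size=600)  # skewed, like session durations
N_BINS = 4
TAU_FRACTION = 0.02  # assumed: a cell is dense if it holds at least 2% of sessions

def bin_counts(x, edges):
    """Assign each value to a bin given its edges and return the per-bin counts."""
    # Use interior edges only, so the minimum and maximum land in the first and last bin.
    idx = np.digitize(x, edges[1:-1], right=False)
    return np.bincount(idx, minlength=len(edges) - 1)

uniform_edges = np.linspace(duration.min(), duration.max(), N_BINS + 1)
quantile_edges = np.quantile(duration, np.linspace(0, 1, N_BINS + 1))

# Convert the fraction into a count for this dataset size.
tau = max(1, int(round(TAU_FRACTION * len(duration))))

uniform_counts = bin_counts(duration, uniform_edges)
quantile_counts = bin_counts(duration, quantile_edges)

print("tau (count):", tau)
print("uniform bins :", uniform_counts, "dense:", uniform_counts >= tau)
print("quantile bins:", quantile_counts, "dense:", quantile_counts >= tau)
```

On skewed features, equal-width bins put most sessions into the first interval and leave the tail nearly empty, while quantile bins equalize occupancy by construction. Which behavior is preferable depends on whether the long tail is itself a segment of interest.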
10. Closing — actionable summary
This case study shows how a CLIQUE-style grid subspace clustering + pattern mining pipeline can reveal meaningful web behavior segments and the common navigation patterns inside them. The approach is particularly useful when user behavior clusters reside in different combinations of features (subspaces). Use the cluster assignments to power targeted experiences — personalized banners, funnel optimization, email campaigns — and validate impact through controlled experiments.