Web Behavior Analysis — Projected Clustering Case Study (Step-by-Step)
This case study demonstrates a business-focused pipeline for web behavior analysis using projected clustering (a subspace-aware clustering approach). The goal is to discover user behavior segments that exist only in certain feature combinations (subspaces), extract a pattern and profile for each segment, and produce actionable insights for product, UX, and marketing teams. The post includes a small sample dataset (for manual steps), the equations used, a reproducible Python implementation (a heuristic projected clustering), an example output table with visualizations, and interpretation.
1. Business problem
An analytics/product team wants to identify distinct session-level user behavior groups from web logs so they can tailor experiences and messaging. Typical patterns (e.g., “long explorers”, “quick buyers”, “compare-and-leave”) may live only in particular subspaces (e.g., pages viewed + products viewed but not device type). Projected clustering discovers clusters together with the subspace of features that best explain each cluster.
2. What is projected clustering (brief)
Projected clustering seeks clusters that are tight in a subset of dimensions (a subspace) rather than across the full feature set. That is, each cluster Ck is described by a subset of features Sk, and points in Ck are near each other when measured only on Sk. Algorithms such as PROCLUS, SUBCLU, and the CLIQUE family formalize this; here we present an engineering-friendly heuristic that is easy to implement, explain, and iterate on.
3. Key equations & definitions
Within-cluster variance of feature j in cluster k (Nk = |Ck|):
σ²j,k = (1 / Nk) ∑i∈Ck (xi,j − μj,k)²
Global variance of feature j:
σ²j,global = (1 / N) ∑i (xi,j − μj,global)²
Relevance of feature j for cluster k:
rj,k = 1 − (σ²j,k / σ²j,global)
(Higher r ⇒ feature j is tight in cluster k relative to global variance; r ∈ (−∞, 1], typical range [0, 1])
Projected distance of point x to center μk, measured only on the subspace Sk:
dSk(x, μk) = √( ∑j∈Sk (xj − μj,k)² )
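To make the relevance score and the projected distance concrete, here is a minimal NumPy sketch on made-up numbers (three sessions, two features; the 0.5 threshold is illustrative, not prescribed by the formulas):

```python
import numpy as np

# Toy illustration of the relevance score r and projected distance d_Sk.
# Values are invented for demonstration; the real sample data comes in section 4.
X = np.array([[2.0, 40.0], [1.0, 30.0], [10.0, 420.0]])  # rows: sessions, cols: features
cluster = X[:2]                       # pretend the first two sessions form a cluster

global_var = X.var(axis=0)            # population variance (ddof=0), matching the 1/N formula
within_var = cluster.var(axis=0)
r = 1.0 - within_var / global_var     # relevance per feature: close to 1 => feature is tight

mu = cluster.mean(axis=0)
S = np.where(r >= 0.5)[0]             # subspace: features above an illustrative threshold
d = np.sqrt(np.sum((X[2, S] - mu[S]) ** 2))  # projected distance of the third session
print(np.round(r, 3), S.tolist(), round(float(d), 2))
```

Both features are tight for the two-session cluster, so the subspace keeps both dimensions and the third session lands far away in projected distance.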
4. Small sample dataset (manual calculation)
We use five example sessions with four features to manually illustrate subspace selection and cluster assignment. Save as sessions_small.csv to follow along.
| SessionID | PagesViewed | SessionDuration_sec | ProductsViewed | AddedToCart |
|---|---|---|---|---|
| S1 | 2 | 40 | 0 | 0 |
| S2 | 1 | 30 | 0 | 0 |
| S3 | 10 | 420 | 6 | 1 |
| S4 | 12 | 600 | 8 | 1 |
| S5 | 9 | 300 | 3 | 0 |
Intuition: S1–S2 are short bounces, S3–S4 are engaged buyers, S5 is intermediate. We expect clusters to be tight on different feature subsets (e.g., AddedToCart+ProductsViewed for buyers; PagesViewed+Duration for engaged explorers).
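If you want to follow along in code rather than by hand, the table above can be written out as sessions_small.csv with a few lines of pandas:

```python
import pandas as pd

# The five sample sessions from the table above, saved as sessions_small.csv.
df_small = pd.DataFrame({
    'SessionID': ['S1', 'S2', 'S3', 'S4', 'S5'],
    'PagesViewed': [2, 1, 10, 12, 9],
    'SessionDuration_sec': [40, 30, 420, 600, 300],
    'ProductsViewed': [0, 0, 6, 8, 3],
    'AddedToCart': [0, 0, 1, 1, 0],
})
df_small.to_csv('sessions_small.csv', index=False)
print(df_small.shape)  # (5, 5)
```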
5. Manual step-by-step projected clustering (illustrative)
- Initial clustering (seed): run KMeans on standardized features for K=2 (or pick initial seeds by inspection). Suppose initial clusters are: C0 = {S1,S2}, C1 = {S3,S4,S5}.
- Compute within-cluster variance σ²j,k for each feature j and cluster k. (Compute means μ and then variances.) Example: for C0 (S1,S2), PagesViewed mean = 1.5, variance = 0.25; for C1 (S3,S4,S5), PagesViewed mean ≈ 10.33, variance ≈ 1.56.
- Compute global variances σ²j,global.
- Compute relevance scores rj,k = 1 − σ²j,k / σ²j,global. For cluster C0, features PagesViewed and Duration will have high r (tight cluster on those features). For C1, ProductsViewed and AddedToCart will have high r.
- Select per-cluster subspaces Sk by thresholding r (e.g., r ≥ 0.5). Suppose S0 = {PagesViewed, SessionDuration} and S1 = {ProductsViewed, AddedToCart}.
- Reassign points by projected distance: for each session compute dS0(x, μ0) and dS1(x, μ1) and assign it to the cluster with the smaller projected distance. This may move S5 into C0 or keep it in C1, depending on the distances.
- Iterate: recompute means/variances, update relevance, update subspaces and reassign until assignments stabilize (convergence).
This toy walk-through shows how projected clustering identifies different feature subsets per cluster and assigns points based on distances measured only on those subspaces — making clusters more interpretable and robust when many features are noisy or irrelevant for a given group.
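One pass of the walk-through above can be checked in a few lines of NumPy: compute per-feature means, within-cluster variances, and relevance scores for the seed clusters C0 = {S1, S2} and C1 = {S3, S4, S5} on the five sample sessions.

```python
import numpy as np

# Columns: PagesViewed, SessionDuration_sec, ProductsViewed, AddedToCart
X = np.array([
    [2, 40, 0, 0],    # S1
    [1, 30, 0, 0],    # S2
    [10, 420, 6, 1],  # S3
    [12, 600, 8, 1],  # S4
    [9, 300, 3, 0],   # S5
], dtype=float)

global_var = X.var(axis=0)  # population variance (1/N), as in the equations
for name, idx in [('C0', [0, 1]), ('C1', [2, 3, 4])]:
    within = X[idx].var(axis=0)
    r = 1.0 - within / global_var
    print(name, 'means:', np.round(X[idx].mean(axis=0), 2),
          'relevance:', np.round(r, 3))
```

This reproduces the numbers from the manual steps: C0 has PagesViewed mean 1.5 with variance 0.25, and C1 has PagesViewed mean ≈ 10.33 with variance ≈ 1.56, giving both clusters high relevance on PagesViewed relative to the global variance.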
6. Reproducible Python implementation (heuristic projected clustering)
The code below implements a practical, easy-to-run projected-clustering heuristic suitable for session-level web features. It:
- creates a synthetic session dataset,
- standardizes features,
- initializes clusters with KMeans,
- computes per-cluster relevance scores,
- selects subspaces per cluster (threshold on relevance),
- reassigns points by projected distance and iterates until convergence,
- outputs cluster summaries, per-cluster subspaces and visualizations,
- and runs a simple pattern summary (most common pages/events) per cluster.
```python
# Requirements: numpy, pandas, scikit-learn, matplotlib
# pip install numpy pandas scikit-learn matplotlib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# 1) Synthesize a session dataset (n=500)
np.random.seed(0)
n = 500
pages = np.random.poisson(4, n) + np.random.choice([0, 1, 2], size=n, p=[0.7, 0.2, 0.1])
duration = np.clip(np.random.exponential(scale=200, size=n).astype(int), 5, 3000)
products = np.random.poisson(2, n)
added = (np.random.rand(n) < (0.08 + 0.06 * (pages > 6))).astype(int)
bounce = (np.random.rand(n) < (0.3 - 0.05 * (pages > 3))).astype(int)
df = pd.DataFrame({
    'SessionID': [f'S{1000+i}' for i in range(n)],
    'PagesViewed': pages,
    'SessionDuration': duration,
    'ProductsViewed': products,
    'AddedToCart': added,
    'Bounce': bounce
})

# 2) Features to use (numeric)
features = ['PagesViewed', 'SessionDuration', 'ProductsViewed', 'AddedToCart']
X_orig = df[features].astype(float).values

# 3) Standardize features
scaler = StandardScaler()
X = scaler.fit_transform(X_orig)

# 4) Projected clustering heuristic
def projected_clustering(X, K=3, rel_thresh=0.5, max_iter=20, random_state=0):
    n, d = X.shape
    # 4a) initial clustering (KMeans seed)
    km = KMeans(n_clusters=K, random_state=random_state, n_init=10)
    labels = km.fit_predict(X)
    # iterate: update subspaces, then reassign by projected distance
    for it in range(max_iter):
        prev_labels = labels.copy()
        clusters = {k: np.where(labels == k)[0] for k in range(K)}
        centers = np.array([X[clusters[k]].mean(axis=0) if len(clusters[k]) > 0
                            else np.zeros(d) for k in range(K)])
        # compute per-cluster variances and global variances
        global_var = X.var(axis=0, ddof=0)
        # avoid division by zero
        global_var[global_var == 0] = 1e-6
        relevance = np.zeros((K, d))
        for k in range(K):
            if len(clusters[k]) == 0:
                relevance[k, :] = 0.0
                continue
            within_var = X[clusters[k]].var(axis=0, ddof=0)
            relevance[k, :] = 1.0 - (within_var / global_var)  # higher => feature tight in cluster
        # select subspace S_k by thresholding relevance
        subspaces = [np.where(relevance[k] >= rel_thresh)[0].tolist() for k in range(K)]
        # ensure at least one dimension per cluster (pick best feature if empty)
        for k in range(K):
            if len(subspaces[k]) == 0:
                subspaces[k] = [int(np.argmax(relevance[k]))]
        # reassign points by projected distance
        dists = np.full((n, K), np.inf)
        for k in range(K):
            idxs = subspaces[k]
            # Euclidean distance in the selected dimensions only
            diff = X[:, idxs] - centers[k, idxs]
            dists[:, k] = np.sqrt(np.sum(diff ** 2, axis=1))
        # new labels: cluster with smallest projected distance
        labels = np.argmin(dists, axis=1)
        # convergence check
        if np.array_equal(labels, prev_labels):
            break
    # final cluster info
    clusters = {k: np.where(labels == k)[0] for k in range(K)}
    return labels, subspaces, relevance, clusters, centers

# run projected clustering
K = 3
labels, subspaces, relevance, clusters, centers = projected_clustering(
    X, K=K, rel_thresh=0.45, max_iter=15, random_state=42)
df['cluster'] = labels

# 5) Produce cluster summary table and show selected subspaces
summary = df.groupby('cluster').agg(
    Count=('SessionID', 'count'),
    MeanPages=('PagesViewed', 'mean'),
    MeanDuration=('SessionDuration', 'mean'),
    MeanProducts=('ProductsViewed', 'mean'),
    AddToCartRate=('AddedToCart', 'mean'),
    BounceRate=('Bounce', 'mean')
).reset_index()
print("Cluster summary:")
print(summary.to_string(index=False))
print("\nPer-cluster selected subspaces (feature indices):")
for k in range(K):
    print(f"Cluster {k}: subspace cols = {subspaces[k]} -> features = {[features[i] for i in subspaces[k]]}")
    print(f"  Relevance scores = {np.round(relevance[k], 3)}")

# 6) Visualization: scatter (PagesViewed vs SessionDuration) colored by cluster
plt.figure(figsize=(8, 5))
colors = plt.cm.tab10(labels % 10)
plt.scatter(df['PagesViewed'], df['SessionDuration'], c=colors, alpha=0.6, s=20)
# plot cluster centers projected to those axes (inverse-transform centers for plotting)
centers_orig = scaler.inverse_transform(centers)
plt.scatter(centers_orig[:, 0], centers_orig[:, 1], c='black', s=120, marker='X')
plt.xlabel('Pages Viewed')
plt.ylabel('Session Duration (sec)')
plt.title('Projected Clustering: Sessions (Pages vs Duration)')
plt.grid(alpha=0.2)
plt.tight_layout()
plt.show()

# 7) Additional: show per-cluster top patterns (simulated page-buckets)
page_cats = ['home', 'search', 'listing', 'product', 'reviews', 'cart', 'checkout']

def sim_seq(pages):
    # simulate a session's sequence of page categories
    seq = []
    if np.random.rand() < 0.95:
        seq.append('home')
    for _ in range(max(1, int(np.round(pages)))):
        seq.append(np.random.choice(['search', 'listing', 'product', 'reviews'],
                                    p=[0.2, 0.45, 0.3, 0.05]))
    if np.random.rand() < 0.12:
        seq.append('cart')
    if np.random.rand() < 0.04:
        seq.append('checkout')
    return list(dict.fromkeys(seq))  # dedupe while preserving first-seen order

df['seq'] = df['PagesViewed'].apply(sim_seq)

# simple category frequency per cluster
for k in range(K):
    subset = df[df['cluster'] == k]
    cats = [c for seq in subset['seq'] for c in seq]  # flatten categories
    top = pd.Series(cats).value_counts().head(6)
    print(f"\nCluster {k} top page categories (count):")
    print(top.to_string())
```
7. Example output table (sample run)
When you run the code above you will get a cluster summary similar to this (numbers illustrative — values will vary):
| Cluster | Count | MeanPages | MeanDuration | MeanProducts | AddToCartRate | BounceRate |
|---|---|---|---|---|---|---|
| 0 | 142 | 1.8 | 55.2 | 0.6 | 0.04 | 0.48 |
| 1 | 178 | 10.5 | 420.8 | 5.3 | 0.34 | 0.06 |
| 2 | 180 | 4.2 | 150.3 | 1.4 | 0.10 | 0.22 |
Interpretation (example): Cluster 0 = short visits / bounces; Cluster 1 = engaged shoppers (high pages, long duration, high add-to-cart); Cluster 2 = browsers with moderate depth. The projected clustering algorithm also outputs subspaces for each cluster (e.g., Cluster 1's subspace might be {ProductsViewed, AddedToCart}, indicating these features best describe that group).
8. Visualizations and interpretation
The scatter plot (PagesViewed vs SessionDuration) colored by projected-cluster labels shows where clusters concentrate. Key outputs to present to stakeholders:
- Cluster summary table (counts, means, conversion/add-to-cart rates) for prioritizing interventions.
- Per-cluster subspace (list of features selected because they are tight for that cluster). This gives direct interpretation — e.g., “Cluster 1 is characterized mainly by ProductsViewed & AddedToCart”.
- Top page categories / sequences per cluster to build micro-experiments (e.g., show product bundles to Cluster 1, or simplified landing pages to Cluster 0).
The power of projected clustering: each cluster has its own defining features, so interventions can be precise instead of one-size-fits-all.
9. Practical tips, tuning & limitations
- Relevance threshold (r threshold): choose based on data scale and desired sparsity of subspaces (0.4–0.6 is a sensible starting point).
- Number of clusters K: try K via silhouette on projected distances or business constraints; projected clustering often yields more interpretable clusters even with larger K.
- Feature engineering: include RFM features, channel, device, campaign parameters, page-buckets and derived ratios (e.g., products/pages) — subspace selection will pick what matters per cluster.
- Outliers & noise: projected clustering can isolate noise into its own cluster or mark as unassigned; consider flagging low-membership clusters as noise.
- Scalability: the heuristic here scales well for medium datasets; for very high-dimensional or very large datasets use specialized implementations (PROCLUS, SUBCLU) or dimensionality reduction + subspace search.
- Validation: evaluate cluster business value with A/B tests (e.g., targeted campaign vs control) rather than relying only on internal metrics.
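As a starting point for the K-selection tip above, here is a small sketch of a K sweep. As a simplification it scores KMeans labels with the standard full-space silhouette on synthetic blob data; for a closer match to the heuristic you would compute silhouette on the projected distances instead, and the data generator (make_blobs) is just a stand-in for your own session features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for standardized session features (4 features, 4 true groups).
X_raw, _ = make_blobs(n_samples=300, centers=4, n_features=4, random_state=7)
X = StandardScaler().fit_transform(X_raw)

# Sweep K and keep the silhouette score for each candidate.
scores = {}
for K in range(2, 7):
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
    scores[K] = silhouette_score(X, labels)

best_K = max(scores, key=scores.get)
print({k: round(v, 3) for k, v in scores.items()}, 'best K =', best_K)
```

In practice, treat the silhouette curve as one input alongside business constraints (how many segments the team can actually act on), not as the final answer.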
10. Actionable recommendations (example)
- Cluster 1 (engaged shoppers): serve personalized bundles, show urgency nudges on product pages, test expedited shipping offers to increase conversion.
- Cluster 0 (bounces): simplify landing experience, A/B test fewer choices and faster CTAs, adjust paid campaign targeting.
- Cluster 2 (browsers): surface recommendations and social proof, test “save for later” nudges to convert browsing into purchases.
11. Closing
Projected clustering finds clusters described by their own best subspaces — ideal for web behavior analysis where different user groups behave distinctly across different metrics. The heuristic above is practical, interpretable and easy to implement in Python. Replace the synthetic data with your own session logs, tune relevance threshold and K, and present the per-cluster subspace + top patterns to product and marketing teams to design targeted, high-impact experiments.
— End of Projected Clustering Case Study —