Web Behavior Analysis — Projected Clustering Case Study (Step-by-Step)
This case study demonstrates a business-focused pipeline for web behavior analysis using projected clustering (a subspace-aware clustering approach). The goal is to discover user behavior segments that exist only in certain feature combinations (subspaces), extract a pattern and profile for each segment, and produce actionable insights for product, UX, and marketing teams. The post includes a small sample dataset (for manual steps), the equations used, a reproducible Python implementation (a heuristic projected clustering), an example output table with visualizations, and interpretation.
1. Business problem
An analytics/product team wants to identify distinct session-level user behavior groups from web logs so they can tailor experiences and messaging. Typical patterns (e.g., “long explorers”, “quick buyers”, “compare-and-leave”) may live only in particular subspaces (e.g., pages viewed + products viewed but not device type). Projected clustering discovers clusters together with the subspace of features that best explain each cluster.
2. What is projected clustering (brief)
Projected clustering seeks clusters that are tight in a subset of dimensions (a subspace) rather than across the full feature set. That is, each cluster Ck is described by a subset of features Sk, and points in Ck are near each other when measured only on Sk. Algorithms such as PROCLUS, SUBCLU, and the CLIQUE family formalize this; here we present an engineering-friendly heuristic that is easy to implement, explain, and iterate on.
3. Key equations & definitions
Within-cluster variance of feature j in cluster k (Nk = |Ck|):
σ²j,k = (1 / Nk) ∑i∈Ck (xi,j − μj,k)²
Global variance of feature j:
σ²j,global = (1 / N) ∑i (xi,j − μj,global)²
Relevance of feature j for cluster k:
rj,k = 1 − (σ²j,k / σ²j,global)
(Higher r ⇒ feature j is tight in cluster k relative to global variance; r ∈ (−∞, 1], typical range [0, 1])
Projected distance of point x to center μk, measured only on the subspace Sk:
dSk(x, μk) = √( ∑j∈Sk (xj − μj,k)² )
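To make the relevance score and the projected distance concrete, here is a minimal NumPy sketch on made-up numbers (three sessions, two features; the 0.5 threshold is illustrative, not prescribed by the formulas):

```python
import numpy as np

# Toy illustration of the relevance score r and projected distance d_Sk.
# Values are invented for demonstration; the real sample data comes in section 4.
X = np.array([[2.0, 40.0], [1.0, 30.0], [10.0, 420.0]])  # rows: sessions, cols: features
cluster = X[:2]                       # pretend the first two sessions form a cluster

global_var = X.var(axis=0)            # population variance (ddof=0), matching the 1/N formula
within_var = cluster.var(axis=0)
r = 1.0 - within_var / global_var     # relevance per feature: close to 1 => feature is tight

mu = cluster.mean(axis=0)
S = np.where(r >= 0.5)[0]             # subspace: features above an illustrative threshold
d = np.sqrt(np.sum((X[2, S] - mu[S]) ** 2))  # projected distance of the third session
print(np.round(r, 3), S.tolist(), round(float(d), 2))
```

Both features are tight for the two-session cluster, so the subspace keeps both dimensions and the third session lands far away in projected distance.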
4. Small sample dataset (manual calculation)
We use five example sessions with four features to manually illustrate subspace selection and cluster assignment. Save as sessions_small.csv to follow along.
| SessionID | PagesViewed | SessionDuration_sec | ProductsViewed | AddedToCart |
|---|---|---|---|---|
| S1 | 2 | 40 | 0 | 0 |
| S2 | 1 | 30 | 0 | 0 |
| S3 | 10 | 420 | 6 | 1 |
| S4 | 12 | 600 | 8 | 1 |
| S5 | 9 | 300 | 3 | 0 |
Intuition: S1–S2 are short bounces, S3–S4 are engaged buyers, S5 is intermediate. We expect clusters to be tight on different feature subsets (e.g., AddedToCart+ProductsViewed for buyers; PagesViewed+Duration for engaged explorers).
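If you want to follow along in code rather than by hand, the table above can be written out as sessions_small.csv with a few lines of pandas:

```python
import pandas as pd

# The five sample sessions from the table above, saved as sessions_small.csv.
df_small = pd.DataFrame({
    'SessionID': ['S1', 'S2', 'S3', 'S4', 'S5'],
    'PagesViewed': [2, 1, 10, 12, 9],
    'SessionDuration_sec': [40, 30, 420, 600, 300],
    'ProductsViewed': [0, 0, 6, 8, 3],
    'AddedToCart': [0, 0, 1, 1, 0],
})
df_small.to_csv('sessions_small.csv', index=False)
print(df_small.shape)  # (5, 5)
```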
5. Manual step-by-step projected clustering (illustrative)
- Initial clustering (seed): run KMeans on standardized features for K=2 (or pick initial seeds by inspection). Suppose initial clusters are: C0 = {S1,S2}, C1 = {S3,S4,S5}.
- Compute within-cluster variance σ²j,k for each feature j and cluster k. (Compute means μ and then variances.) Example: for C0 (S1,S2), PagesViewed mean = 1.5, variance = 0.25; for C1 (S3,S4,S5), PagesViewed mean ≈ 10.33, variance ≈ 1.56.
- Compute global variances σ²j,global.
- Compute relevance scores rj,k = 1 − σ²j,k / σ²j,global. For cluster C0, features PagesViewed and Duration will have high r (tight cluster on those features). For C1, ProductsViewed and AddedToCart will have high r.
- Select per-cluster subspaces Sk by thresholding r (e.g., r ≥ 0.5). Suppose S0 = {PagesViewed, SessionDuration} and S1 = {ProductsViewed, AddedToCart}.
- Reassign points by projected distance: for each session compute dS0(x, μ0) and dS1(x, μ1) and assign it to the cluster with the smaller projected distance. This may move S5 into C0 or keep it in C1, depending on the distances.
- Iterate: recompute means/variances, update relevance, update subspaces and reassign until assignments stabilize (convergence).
This toy walk-through shows how projected clustering identifies different feature subsets per cluster and assigns points based on distances measured only on those subspaces — making clusters more interpretable and robust when many features are noisy or irrelevant for a given group.
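One pass of the walk-through above can be checked in a few lines of NumPy: compute per-feature means, within-cluster variances, and relevance scores for the seed clusters C0 = {S1, S2} and C1 = {S3, S4, S5} on the five sample sessions.

```python
import numpy as np

# Columns: PagesViewed, SessionDuration_sec, ProductsViewed, AddedToCart
X = np.array([
    [2, 40, 0, 0],    # S1
    [1, 30, 0, 0],    # S2
    [10, 420, 6, 1],  # S3
    [12, 600, 8, 1],  # S4
    [9, 300, 3, 0],   # S5
], dtype=float)

global_var = X.var(axis=0)  # population variance (1/N), as in the equations
for name, idx in [('C0', [0, 1]), ('C1', [2, 3, 4])]:
    within = X[idx].var(axis=0)
    r = 1.0 - within / global_var
    print(name, 'means:', np.round(X[idx].mean(axis=0), 2),
          'relevance:', np.round(r, 3))
```

This reproduces the numbers from the manual steps: C0 has PagesViewed mean 1.5 with variance 0.25, and C1 has PagesViewed mean ≈ 10.33 with variance ≈ 1.56, giving both clusters high relevance on PagesViewed relative to the global variance.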
6. Reproducible Python implementation (heuristic projected clustering)
The code below implements a practical, easy-to-run projected-clustering heuristic suitable for session-level web features. It:
- creates a synthetic session dataset,
- standardizes features,
- initializes clusters with KMeans,
- computes per-cluster relevance scores,
- selects subspaces per cluster (threshold on relevance),
- reassigns points by projected distance and iterates until convergence,
- outputs cluster summaries, per-cluster subspaces and visualizations,
- and runs a simple pattern summary (most common pages/events) per cluster.
```python
# Requirements: numpy, pandas, scikit-learn, matplotlib
# pip install numpy pandas scikit-learn matplotlib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# 1) Synthesize a session dataset (n=500)
np.random.seed(0)
n = 500
pages = np.random.poisson(4, n) + np.random.choice([0, 1, 2], size=n, p=[0.7, 0.2, 0.1])
duration = np.clip(np.random.exponential(scale=200, size=n).astype(int), 5, 3000)
products = np.random.poisson(2, n)
added = (np.random.rand(n) < (0.08 + 0.06 * (pages > 6))).astype(int)
bounce = (np.random.rand(n) < (0.3 - 0.05 * (pages > 3))).astype(int)
df = pd.DataFrame({
    'SessionID': [f'S{1000+i}' for i in range(n)],
    'PagesViewed': pages,
    'SessionDuration': duration,
    'ProductsViewed': products,
    'AddedToCart': added,
    'Bounce': bounce
})

# 2) Features to use (numeric)
features = ['PagesViewed', 'SessionDuration', 'ProductsViewed', 'AddedToCart']
X_orig = df[features].astype(float).values

# 3) Standardize features
scaler = StandardScaler()
X = scaler.fit_transform(X_orig)

# 4) Projected clustering heuristic
def projected_clustering(X, K=3, rel_thresh=0.5, max_iter=20, random_state=0):
    n, d = X.shape
    # 4a) initial clustering (KMeans seed)
    km = KMeans(n_clusters=K, random_state=random_state, n_init=10)
    labels = km.fit_predict(X)
    # iterate: update subspaces, then reassign by projected distance
    for it in range(max_iter):
        prev_labels = labels.copy()
        clusters = {k: np.where(labels == k)[0] for k in range(K)}
        centers = np.array([X[clusters[k]].mean(axis=0) if len(clusters[k]) > 0
                            else np.zeros(d) for k in range(K)])
        # compute per-cluster variances and global variances
        global_var = X.var(axis=0, ddof=0)
        # avoid division by zero
        global_var[global_var == 0] = 1e-6
        relevance = np.zeros((K, d))
        for k in range(K):
            if len(clusters[k]) == 0:
                relevance[k, :] = 0.0
                continue
            within_var = X[clusters[k]].var(axis=0, ddof=0)
            relevance[k, :] = 1.0 - (within_var / global_var)  # higher => feature tight in cluster
        # select subspace S_k by thresholding relevance
        subspaces = [np.where(relevance[k] >= rel_thresh)[0].tolist() for k in range(K)]
        # ensure at least one dimension per cluster (pick best feature if empty)
        for k in range(K):
            if len(subspaces[k]) == 0:
                subspaces[k] = [int(np.argmax(relevance[k]))]
        # reassign points by projected distance
        dists = np.full((n, K), np.inf)
        for k in range(K):
            idxs = subspaces[k]
            # Euclidean distance in the selected dimensions only
            diff = X[:, idxs] - centers[k, idxs]
            dists[:, k] = np.sqrt(np.sum(diff ** 2, axis=1))
        # new labels: cluster with smallest projected distance
        labels = np.argmin(dists, axis=1)
        # convergence check
        if np.array_equal(labels, prev_labels):
            break
    # final cluster info
    clusters = {k: np.where(labels == k)[0] for k in range(K)}
    return labels, subspaces, relevance, clusters, centers

# run projected clustering
K = 3
labels, subspaces, relevance, clusters, centers = projected_clustering(
    X, K=K, rel_thresh=0.45, max_iter=15, random_state=42)
df['cluster'] = labels

# 5) Produce cluster summary table and show selected subspaces
summary = df.groupby('cluster').agg(
    Count=('SessionID', 'count'),
    MeanPages=('PagesViewed', 'mean'),
    MeanDuration=('SessionDuration', 'mean'),
    MeanProducts=('ProductsViewed', 'mean'),
    AddToCartRate=('AddedToCart', 'mean'),
    BounceRate=('Bounce', 'mean')
).reset_index()
print("Cluster summary:")
print(summary.to_string(index=False))
print("\nPer-cluster selected subspaces (feature indices):")
for k in range(K):
    print(f"Cluster {k}: subspace cols = {subspaces[k]} -> features = {[features[i] for i in subspaces[k]]}")
    print(f"  Relevance scores = {np.round(relevance[k], 3)}")

# 6) Visualization: scatter (PagesViewed vs SessionDuration) colored by cluster
plt.figure(figsize=(8, 5))
colors = plt.cm.tab10(labels % 10)
plt.scatter(df['PagesViewed'], df['SessionDuration'], c=colors, alpha=0.6, s=20)
# plot cluster centers projected to those axes (inverse-transform centers for plotting)
centers_orig = scaler.inverse_transform(centers)
plt.scatter(centers_orig[:, 0], centers_orig[:, 1], c='black', s=120, marker='X')
plt.xlabel('Pages Viewed')
plt.ylabel('Session Duration (sec)')
plt.title('Projected Clustering: Sessions (Pages vs Duration)')
plt.grid(alpha=0.2)
plt.tight_layout()
plt.show()

# 7) Additional: show per-cluster top patterns (simulated page-buckets)
page_cats = ['home', 'search', 'listing', 'product', 'reviews', 'cart', 'checkout']

def sim_seq(pages):
    # simulate a session's sequence of page categories
    seq = []
    if np.random.rand() < 0.95:
        seq.append('home')
    for _ in range(max(1, int(np.round(pages)))):
        seq.append(np.random.choice(['search', 'listing', 'product', 'reviews'],
                                    p=[0.2, 0.45, 0.3, 0.05]))
    if np.random.rand() < 0.12:
        seq.append('cart')
    if np.random.rand() < 0.04:
        seq.append('checkout')
    return list(dict.fromkeys(seq))  # dedupe while preserving first-seen order

df['seq'] = df['PagesViewed'].apply(sim_seq)

# simple category frequency per cluster
for k in range(K):
    subset = df[df['cluster'] == k]
    cats = [c for seq in subset['seq'] for c in seq]  # flatten categories
    top = pd.Series(cats).value_counts().head(6)
    print(f"\nCluster {k} top page categories (count):")
    print(top.to_string())
```
7. Example output table (sample run)
When you run the code above you will get a cluster summary similar to this (numbers illustrative — values will vary):
| Cluster | Count | MeanPages | MeanDuration | MeanProducts | AddToCartRate | BounceRate |
|---|---|---|---|---|---|---|
| 0 | 142 | 1.8 | 55.2 | 0.6 | 0.04 | 0.48 |
| 1 | 178 | 10.5 | 420.8 | 5.3 | 0.34 | 0.06 |
| 2 | 180 | 4.2 | 150.3 | 1.4 | 0.10 | 0.22 |
Interpretation (example): Cluster 0 = short visits / bounces; Cluster 1 = engaged shoppers (high pages, long duration, high add-to-cart); Cluster 2 = browsers with moderate depth. The projected clustering algorithm also outputs subspaces for each cluster (e.g., Cluster 1's subspace might be {ProductsViewed, AddedToCart}, indicating these features best describe that group).
8. Visualizations and interpretation
The scatter plot (PagesViewed vs SessionDuration) colored by projected-cluster labels shows where clusters concentrate. Key outputs to present to stakeholders:
- Cluster summary table (counts, means, conversion/add-to-cart rates) for prioritizing interventions.
- Per-cluster subspace (list of features selected because they are tight for that cluster). This gives direct interpretation — e.g., “Cluster 1 is characterized mainly by ProductsViewed & AddedToCart”.
- Top page categories / sequences per cluster to build micro-experiments (e.g., show product bundles to Cluster 1, or simplified landing pages to Cluster 0).
The power of projected clustering: each cluster has its own defining features, so interventions can be precise instead of one-size-fits-all.
9. Practical tips, tuning & limitations
- Relevance threshold (r threshold): choose based on data scale and desired sparsity of subspaces (0.4–0.6 is a sensible starting point).
- Number of clusters K: try K via silhouette on projected distances or business constraints; projected clustering often yields more interpretable clusters even with larger K.
- Feature engineering: include RFM features, channel, device, campaign parameters, page-buckets and derived ratios (e.g., products/pages) — subspace selection will pick what matters per cluster.
- Outliers & noise: projected clustering can isolate noise into its own cluster or mark as unassigned; consider flagging low-membership clusters as noise.
- Scalability: the heuristic here scales well for medium datasets; for very high-dimensional or very large datasets use specialized implementations (PROCLUS, SUBCLU) or dimensionality reduction + subspace search.
- Validation: evaluate cluster business value with A/B tests (e.g., targeted campaign vs control) rather than relying only on internal metrics.
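As a starting point for the K-selection tip above, here is a small sketch of a K sweep. As a simplification it scores KMeans labels with the standard full-space silhouette on synthetic blob data; for a closer match to the heuristic you would compute silhouette on the projected distances instead, and the data generator (make_blobs) is just a stand-in for your own session features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for standardized session features (4 features, 4 true groups).
X_raw, _ = make_blobs(n_samples=300, centers=4, n_features=4, random_state=7)
X = StandardScaler().fit_transform(X_raw)

# Sweep K and keep the silhouette score for each candidate.
scores = {}
for K in range(2, 7):
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
    scores[K] = silhouette_score(X, labels)

best_K = max(scores, key=scores.get)
print({k: round(v, 3) for k, v in scores.items()}, 'best K =', best_K)
```

In practice, treat the silhouette curve as one input alongside business constraints (how many segments the team can actually act on), not as the final answer.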
10. Actionable recommendations (example)
- Cluster 1 (engaged shoppers): serve personalized bundles, show urgency nudges on product pages, test expedited shipping offers to increase conversion.
- Cluster 0 (bounces): simplify landing experience, A/B test fewer choices and faster CTAs, adjust paid campaign targeting.
- Cluster 2 (browsers): surface recommendations and social proof, test “save for later” nudges to convert browsing into purchases.
11. Closing
Projected clustering finds clusters described by their own best subspaces — ideal for web behavior analysis where different user groups behave distinctly across different metrics. The heuristic above is practical, interpretable and easy to implement in Python. Replace the synthetic data with your own session logs, tune relevance threshold and K, and present the per-cluster subspace + top patterns to product and marketing teams to design targeted, high-impact experiments.
— End of Projected Clustering Case Study —