Market Segmentation — Customer Segmentation with K-Means (Step-by-Step)
This case study demonstrates a practical, engineering approach to customer segmentation using an unsupervised algorithm — K-Means clustering. We show a small sample dataset, a manual step-by-step K-Means calculation for clarity, the full Python pipeline, output tables, equations, and a visualization with interpretation. The goal is to convert raw customer data into actionable segments for targeted marketing and logistics.
1. Problem Framing
We have transactional and profile data for customers and want to group them into distinct segments (clusters) so that marketing, pricing, and inventory decisions can be tailored to each segment. K-Means is appropriate when you expect compact spherical clusters and want a fast, interpretable segmentation. The algorithm assigns each customer to the nearest centroid and iteratively updates centroids to minimize within-cluster sum of squared distances.
2. Sample Dataset (small, for manual computation)
Below is a small toy dataset of six customers with two features for visualization and manual calculation: Annual Spend (USD) and Avg. Visits per Month. Save this as customers_small.csv if you want to follow along.
| CustomerID | AnnualSpend (USD) | Visits_per_Month |
|---|---|---|
| C1 | 1200 | 1.0 |
| C2 | 1500 | 1.2 |
| C3 | 300 | 4.5 |
| C4 | 400 | 4.0 |
| C5 | 2200 | 0.8 |
| C6 | 350 | 3.8 |
This toy dataset intentionally mixes high-value low-frequency customers and low-value high-frequency customers so clusters are intuitive.
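If you want to follow along in code, the table above can be written out as customers_small.csv with a short snippet (a sketch; the column names simply mirror the table headers):

```python
import pandas as pd

# Toy dataset from the table above
df_small = pd.DataFrame({
    'CustomerID': ['C1', 'C2', 'C3', 'C4', 'C5', 'C6'],
    'AnnualSpend': [1200, 1500, 300, 400, 2200, 350],
    'Visits_per_Month': [1.0, 1.2, 4.5, 4.0, 0.8, 3.8],
})
df_small.to_csv('customers_small.csv', index=False)
```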
3. K-Means — Mathematical Equations
Euclidean distance between point x and centroid μ (2D example):

d(x, μ) = √((x₁ − μ₁)² + (x₂ − μ₂)²)

Centroid update (for cluster k with assigned points xᵢ ∈ Cₖ):

μₖ = (1 / |Cₖ|) · Σ_{xᵢ ∈ Cₖ} xᵢ

K-Means objective (minimize):

J = Σₖ Σ_{xᵢ ∈ Cₖ} ‖xᵢ − μₖ‖²
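These formulas translate directly into NumPy. As a quick sanity check, here is the distance between customer C2 and a centroid placed at C1 from the toy data (values taken from the table above):

```python
import numpy as np

x = np.array([1500.0, 1.2])    # customer C2
mu = np.array([1200.0, 1.0])   # centroid at C1

# d(x, mu) = sqrt(sum_j (x_j - mu_j)^2)
d = np.sqrt(np.sum((x - mu) ** 2))
print(round(d, 3))  # ≈ 300.0
```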
4. Manual Step-by-Step K-Means Calculation (K = 2)
We initialize K = 2 clusters. For transparency we choose two data points as the initial centroids:
- Initial μ₁ = C1 = (1200, 1.0)
- Initial μ₂ = C3 = (300, 4.5)
Iteration 1 — Assign points to nearest centroid
Compute Euclidean distances (rounded):
Distances to μ₁ = (1200, 1.0):
- C1: d = 0
- C2 = (1500, 1.2): d = √(300² + 0.2²) ≈ 300.000
- C3 = (300, 4.5): d = √(900² + 3.5²) ≈ 900.007
- C4 = (400, 4.0): d = √(800² + 3²) ≈ 800.006
- C5 = (2200, 0.8): d = √(1000² + 0.2²) ≈ 1000.000
- C6 = (350, 3.8): d = √(850² + 2.8²) ≈ 850.005

Distances to μ₂ = (300, 4.5):
- C1: d ≈ 900.007
- C2: d = √(1200² + 3.3²) ≈ 1200.005
- C3: d = 0
- C4: d = √(100² + 0.5²) ≈ 100.001
- C5: d = √(1900² + 3.7²) ≈ 1900.004
- C6: d = √(50² + 0.7²) ≈ 50.005
Assign each point to the nearest centroid:
- Cluster 1 (near μ₁): C1, C2, C5
- Cluster 2 (near μ₂): C3, C4, C6
Centroid update:
Compute new centroids by averaging coordinates in each cluster.
μ₁_new = mean of C1(1200,1.0), C2(1500,1.2), C5(2200,0.8)
= ( (1200+1500+2200)/3 , (1.0+1.2+0.8)/3 )
= ( 4900/3 ≈ 1633.33 , 3.0/3 = 1.0 )
μ₂_new = mean of C3(300,4.5), C4(400,4.0), C6(350,3.8)
= ( (300+400+350)/3 , (4.5+4.0+3.8)/3 )
= ( 1050/3 = 350 , 12.3/3 = 4.1 )
Iteration 2 — Reassign points using new centroids
Quick logic: μ₁_new ≈ (1633.3, 1.0) and μ₂_new = (350, 4.1). Recomputing distances, every point stays with its current centroid: the high spenders (C1, C2, C5) remain closest to μ₁ and the frequent low-spend customers (C3, C4, C6) remain closest to μ₂. Since the assignments no longer change, the algorithm has converged.
This small example demonstrates how K-Means groups customers by proximity in feature space. In practice features are scaled (standardized) before K-Means so numeric scales (e.g., dollars vs visits) don't dominate distance.
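The manual iteration above can be verified with a short from-scratch loop (a sketch of the assign/update steps, seeded with the same initial centroids C1 and C3 as the walkthrough):

```python
import numpy as np

X = np.array([
    [1200, 1.0], [1500, 1.2], [300, 4.5],   # C1, C2, C3
    [400, 4.0], [2200, 0.8], [350, 3.8],    # C4, C5, C6
], dtype=float)

centroids = X[[0, 2]].copy()  # initialize at C1 and C3, as in the walkthrough
for _ in range(10):
    # Assignment step: nearest centroid by Euclidean distance
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid becomes the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break  # assignments stable -> converged
    centroids = new_centroids

print(labels)     # [0 0 1 1 0 1] -> {C1, C2, C5} and {C3, C4, C6}
print(centroids)  # ≈ [[1633.33, 1.0], [350.0, 4.1]]
```

The loop reproduces the hand computation: it stops after the second pass because reassignment leaves every label unchanged.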
5. Full Python Pipeline (reproducible)
Below is a complete Python example. It builds a larger synthetic customer dataset, standardizes features, runs K-Means, shows the cluster centers, an output table, and produces a scatter plot with clusters.
```python
# Requirements: pandas, numpy, scikit-learn, matplotlib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# 1) Synthetic dataset (replace with your own CSV)
np.random.seed(0)
n = 300

# Two main synthetic segments: high-value low visits, low-value high visits
hv_spend = np.random.normal(loc=2000, scale=400, size=int(n*0.35))
hv_visits = np.random.normal(loc=0.8, scale=0.3, size=int(n*0.35))
lv_spend = np.random.normal(loc=400, scale=120, size=int(n*0.55))
lv_visits = np.random.normal(loc=4.0, scale=0.8, size=int(n*0.55))

# Add mid segment / noise
mid_spend = np.random.normal(loc=900, scale=200, size=int(n*0.10))
mid_visits = np.random.normal(loc=2.2, scale=0.6, size=int(n*0.10))

spend = np.concatenate([hv_spend, lv_spend, mid_spend])
visits = np.concatenate([hv_visits, lv_visits, mid_visits])
df = pd.DataFrame({
    'CustomerID': [f'CU{1000+i}' for i in range(len(spend))],
    'AnnualSpend': np.clip(spend, a_min=20, a_max=None),
    'VisitsPerMonth': np.clip(visits, a_min=0.1, a_max=None)
})

# 2) Preprocessing: standardize features so dollars don't dominate the distance
features = ['AnnualSpend', 'VisitsPerMonth']
scaler = StandardScaler()
X = scaler.fit_transform(df[features])

# 3) Apply KMeans (choose K=3)
k = 3
kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)
kmeans.fit(X)
labels = kmeans.labels_
centers = kmeans.cluster_centers_

# 4) Attach labels and inverse-transform centers for interpretability
df['Cluster'] = labels
centers_orig = scaler.inverse_transform(centers)

# 5) Output cluster summary table
summary = df.groupby('Cluster').agg(
    Count=('CustomerID', 'count'),
    MeanAnnualSpend=('AnnualSpend', 'mean'),
    MeanVisitsPerMonth=('VisitsPerMonth', 'mean')
).reset_index()
print(summary)

print('\nCluster centers (original scale):')
for i, c in enumerate(centers_orig):
    print(f'Cluster {i}: AnnualSpend={c[0]:.1f}, VisitsPerMonth={c[1]:.2f}')

# 6) Visualization
plt.figure(figsize=(8, 6))
plt.scatter(df['AnnualSpend'], df['VisitsPerMonth'], c=df['Cluster'], cmap='tab10', alpha=0.7)
plt.scatter(centers_orig[:, 0], centers_orig[:, 1], c='black', s=150, marker='X')
plt.xlabel('Annual Spend (USD)')
plt.ylabel('Visits per Month')
plt.title('Customer Segments (K=3) — Annual Spend vs Visits')
plt.grid(alpha=0.2)
plt.show()
```
6. Sample Output Table (example)
Running the code above yields a cluster summary table similar to this (numbers illustrative — your run will vary due to randomness):
| Cluster | Count | Mean AnnualSpend (USD) | Mean Visits/Month |
|---|---|---|---|
| 0 | 106 | 355.4 | 3.9 |
| 1 | 105 | 2094.8 | 0.9 |
| 2 | 29 | 910.2 | 2.3 |
Interpretation: Cluster 1 contains high-value, low-frequency customers (premium spenders); Cluster 0 contains low-value, high-frequency customers (frequent shoppers with small baskets); Cluster 2 is a mid segment. These labels guide different business actions.
7. Visualization and Interpretation
Use the scatter plot (Annual Spend on X, Visits per Month on Y) to visualize clusters. Centroids are shown with a black 'X'.
Interpretation guide:
- High spend / Low visits: Candidates for loyalty/premium programs, personalized high-value offers, and retention investments.
- Low spend / High visits: Candidates for frequency-based promotions, subscriptions, or cross-sell of higher-margin items.
- Mid segment: Potentially convertible to high-value with targeted upsell campaigns or to high frequency by convenience offers.
8. Practical Engineering Notes
- Feature scaling: always scale numeric features (StandardScaler or MinMaxScaler) before K-Means, because Euclidean distance is sensitive to feature scale.
- Choosing K: use the elbow method, silhouette score, or domain knowledge.
- Initialization: use multiple initializations (n_init) or k-means++ to avoid poor local minima.
- Interpretability: examine centroid values on the original scale and profile the top customers per cluster.
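The K-selection step can be sketched by sweeping K and recording both inertia (for the elbow) and silhouette score on standardized features. The synthetic data below is an illustrative stand-in for spend and visits, not the pipeline dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Two well-separated synthetic segments (stand-ins for spend and visits)
X_raw = np.vstack([
    rng.normal(loc=[2000, 0.8], scale=[300, 0.3], size=(100, 2)),
    rng.normal(loc=[400, 4.0], scale=[100, 0.6], size=(100, 2)),
])
X = StandardScaler().fit_transform(X_raw)

scores = {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = silhouette_score(X, km.labels_)
    print(k, round(km.inertia_, 1), round(scores[k], 3))
# Choose the K where inertia stops dropping sharply (elbow) and silhouette peaks
```

On data with two genuinely separated segments like this, the silhouette score should peak at K = 2; on real customer data the curve is usually flatter, which is why domain knowledge breaks ties.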
9. Step-by-Step Checklist (for practitioners)
- Define segmentation objective and business actions for each segment.
- Prepare data: select features, handle missing values, encode categoricals if included.
- Scale numeric features.
- Run K-Means for a range of K values; evaluate with silhouette or elbow method.
- Inspect centroids in original scale and label clusters descriptively.
- Validate segments against business KPIs (conversion, churn, margin).
- Deploy segmentation: use cluster assignments in downstream targeting, personalization, inventory planning.
- Monitor drift and retrain periodically (monthly/quarterly) or when behaviour changes.
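For the deployment step, new customers can be assigned to existing segments by reusing the fitted scaler and model. A minimal sketch, assuming the same two features as above (the toy numbers stand in for historical data):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Fit on historical data (toy stand-in)
hist = pd.DataFrame({
    'AnnualSpend': [1200, 1500, 300, 400, 2200, 350],
    'VisitsPerMonth': [1.0, 1.2, 4.5, 4.0, 0.8, 3.8],
})
scaler = StandardScaler().fit(hist)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaler.transform(hist))

# Score new customers with the SAME fitted scaler, then predict segments
new = pd.DataFrame({'AnnualSpend': [1800, 380], 'VisitsPerMonth': [0.9, 4.2]})
segments = model.predict(scaler.transform(new))
print(segments)  # one label per new customer; label-to-segment mapping depends on the fit
```

The key design point is persisting scaler and model together: scoring new data with a refit scaler would shift the feature space and silently corrupt assignments.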
10. Closing
This blog-ready case study provides a compact but thorough guide to customer segmentation with K-Means. You received a small manual example to understand the algorithm mechanics, a reproducible Python pipeline for real data, an output table template, equations, and practical interpretation. Replace the synthetic dataset with your customer CSV, select features that matter (e.g., recency, frequency, monetary value, product affinity), and follow the checklist to produce actionable segments for marketing or operations.
— End of case study —