Goal: Use unsupervised learning (K-Means) on RFM features to uncover actionable customer segments from the UCI Online Retail dataset.
Business Impact: Customer segmentation drives personalized marketing, improves ROI, and enhances customer retention & engagement.
This notebook demonstrates a complete end-to-end ML pipeline that:
- Cleans & standardizes messy retail transactions
- Constructs RFM (Recency, Frequency, Monetary) features
- Scales features & applies K-Means clustering
- Determines optimal k via Elbow & Silhouette methods
- Produces 4 actionable segments + 3 outlier micro-segments
- Visualizes clusters for business storytelling
📊 Key Outcomes:
- Raw rows: 525,461 → Cleaned: 406,309 (77.32% retained)
- Unique customers: 4,285
- Clustered customers (after outlier removal): 3,809
- Optimal clusters (k): 4
- Actionable Segments: Retain, Reward, Re-Engage, Nurture + Outlier cohorts (Pamper, Upsell, Delight)
- Source: UCI ML Repository — Online Retail II
- Period: 2009-12-01 → 2010-12-09
- Countries: 40 (UK dominates with 485,852 rows)
- Schema (8 cols): Invoice, StockCode, Description, Quantity, InvoiceDate, Price, Customer ID, Country
Steps applied:
- Invoice filtering → keep only 6-digit numeric invoices (^\d{6}$).
- StockCode filtering → exclude admin/test codes (keep only SKU-like).
- Drop rows with missing Customer IDs.
- Remove zero-priced lines (28 rows removed).
- Final cleaned dataset: 406,309 rows (77.32% of raw).
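The cleaning steps above can be sketched in pandas as follows. This is a minimal illustration, not the notebook's exact code: the function name `clean_transactions` and the sample rows are hypothetical, and the StockCode pattern (five digits plus optional trailing letters) is an assumed reading of "SKU-like".

```python
import pandas as pd

def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows that pass all four cleaning rules."""
    invoice = df["Invoice"].astype(str)
    stock = df["StockCode"].astype(str)
    keep = (
        invoice.str.match(r"^\d{6}$")              # 6-digit numeric invoices only
        # Assumption: "SKU-like" = five digits, optionally followed by letters.
        & stock.str.match(r"^\d{5}[a-zA-Z]*$")
        & df["Customer ID"].notna()                # drop missing customer IDs
        & (df["Price"] != 0)                       # drop zero-priced lines
    )
    return df.loc[keep].reset_index(drop=True)

# Hypothetical sample: one row fails each rule, two rows pass.
raw = pd.DataFrame({
    "Invoice":     ["489434", "C489449", "489435", "489436", "489437", "489438"],
    "StockCode":   ["85048",  "22087",   "POST",   "21232",  "22041",  "79323P"],
    "Quantity":    [12,       -12,       1,        24,       6,        3],
    "Price":       [6.95,     2.95,      18.0,     1.25,     0.0,      2.5],
    "Customer ID": [13085.0,  16321.0,   13085.0,  None,     15362.0,  13078.0],
})
cleaned = clean_transactions(raw)
print(f"{len(cleaned)} of {len(raw)} rows retained")
```

Building one boolean mask and filtering once keeps the rules easy to audit, and per-rule row counts can be read off each mask component if needed.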
📌 Distribution Check:
📌 With Outliers Highlighted:
📌 After Major Outliers Separated:
Features per customer:
- Recency (R): Days since last purchase
- Frequency (F): Number of invoices
- Monetary (M): Total spend
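One way to derive these three features from the cleaned transactions is a single group-by per customer. The sketch below makes two assumptions that the text does not confirm: line revenue is `Quantity * Price`, and recency is measured from the latest invoice date in the data. The helper name `build_rfm` and the sample rows are hypothetical.

```python
import pandas as pd

def build_rfm(df: pd.DataFrame, snapshot=None) -> pd.DataFrame:
    """Aggregate transaction lines into one R/F/M row per customer."""
    df = df.copy()
    df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
    # Assumption: spend per line is quantity times unit price.
    df["LineTotal"] = df["Quantity"] * df["Price"]
    # Assumption: the reference date is the last invoice date in the data.
    if snapshot is None:
        snapshot = df["InvoiceDate"].max()
    rfm = df.groupby("Customer ID").agg(
        LastPurchase=("InvoiceDate", "max"),
        Frequency=("Invoice", "nunique"),   # number of distinct invoices
        Monetary=("LineTotal", "sum"),      # total spend
    )
    rfm["Recency"] = (snapshot - rfm["LastPurchase"]).dt.days
    return rfm[["Recency", "Frequency", "Monetary"]].reset_index()

# Hypothetical sample: customer 1 has two invoices, customer 2 has one.
tx = pd.DataFrame({
    "Invoice":     ["100001", "100001", "100002", "100003"],
    "InvoiceDate": ["2010-01-01", "2010-01-01", "2010-03-01", "2010-03-11"],
    "Quantity":    [2, 1, 1, 4],
    "Price":       [5.0, 3.0, 10.0, 2.5],
    "Customer ID": [1, 1, 1, 2],
})
rfm = build_rfm(tx)
print(rfm)
```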
Summary (non-outliers, 3,809 customers):
- Recency: mean = 97d, median = 58d
- Frequency: mean = 2.86, median = 2
- Monetary: mean = 885.5, median = 588.1
- Algorithm: KMeans(n_clusters=4, random_state=42, max_iter=1000)
- Scaling: StandardScaler on [R, F, M]
- Model Selection:
📌 Elbow + Silhouette Diagnostic:
Optimal K = 4
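A rough sketch of how an Elbow + Silhouette scan might look with scikit-learn is shown below. It runs on synthetic blobs rather than the real RFM table, the helper name `scan_k` is hypothetical, and `n_init=10` is an added assumption so that results do not depend on the scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def scan_k(X, k_values, random_state=42, max_iter=1000):
    """Return (k, inertia, silhouette) for each candidate k on scaled data."""
    X_scaled = StandardScaler().fit_transform(X)
    results = []
    for k in k_values:
        model = KMeans(n_clusters=k, random_state=random_state,
                       max_iter=max_iter, n_init=10)
        labels = model.fit_predict(X_scaled)
        results.append((k, model.inertia_, silhouette_score(X_scaled, labels)))
    return results

# Synthetic stand-in for [R, F, M]: four well-separated groups of 50 points.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [10, 10, 0], [0, 10, 10], [10, 0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])

results = scan_k(X, range(2, 7))
for k, inertia, sil in results:
    print(f"k={k}  inertia={inertia:.1f}  silhouette={sil:.3f}")
best_k = max(results, key=lambda r: r[2])[0]
```

Inertia always tends to fall as k grows, so the elbow is read as the point where the drop flattens, while the silhouette score gives an independent check that peaks at well-separated solutions.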
- Cluster 0 — Retain (High value, recent buyers)
  - Playbook: VIP care, exclusive perks, proactive engagement.
- Cluster 1 — Re-Engage (Low frequency & spend, lapsed)
  - Playbook: Win-back offers, personalized reactivation emails.
- Cluster 2 — Nurture (Lowest activity, often new)
  - Playbook: Onboarding, education, low-friction deals.
- Cluster 3 — Reward (Loyal, consistent buyers)
  - Playbook: Loyalty rewards, referral incentives, bundles.
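The sketch below shows one way to fit the final model and attach these segment names. K-Means label numbers are arbitrary and can change with the data or the seed, so the id-to-name mapping here simply mirrors the list above and should be re-checked against the per-cluster means before use. The function names, the random sample data, and `n_init=10` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Mapping taken from the segment list; valid only for the fit it came from.
SEGMENT_NAMES = {0: "Retain", 1: "Re-Engage", 2: "Nurture", 3: "Reward"}
FEATURES = ["Recency", "Frequency", "Monetary"]

def assign_segments(rfm: pd.DataFrame, random_state: int = 42) -> pd.DataFrame:
    """Scale R/F/M, fit K-Means with k=4, and add Cluster and Segment columns."""
    scaled = StandardScaler().fit_transform(rfm[FEATURES])
    model = KMeans(n_clusters=4, random_state=random_state,
                   max_iter=1000, n_init=10)
    out = rfm.copy()
    out["Cluster"] = model.fit_predict(scaled)
    out["Segment"] = out["Cluster"].map(SEGMENT_NAMES)
    return out

def profile(segmented: pd.DataFrame) -> pd.DataFrame:
    """Mean R/F/M per cluster, used to sanity-check the names."""
    return segmented.groupby("Cluster")[FEATURES].mean()

# Hypothetical RFM table with 40 random customers.
rng = np.random.default_rng(1)
sample = pd.DataFrame({
    "Recency": rng.integers(1, 366, size=40),
    "Frequency": rng.integers(1, 11, size=40),
    "Monetary": rng.uniform(50, 2000, size=40),
})
segmented = assign_segments(sample)
cluster_profile = profile(segmented)
print(cluster_profile.round(1))
```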
📌 Cluster Visualization (Raw):
📌 Cluster Visualization (Scaled):
📌 KMeans Cluster Separation:
📌 Violin Plots (Segment Profiles):
- Cluster −1 — Pamper → High spenders → bespoke offers, concierge.
- Cluster −2 — Upsell → Frequent shoppers → subscriptions, bundles.
- Cluster −3 — Delight → Elite (high R + F + M) → premium tiers, surprise perks.
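The rule used to flag outliers is not stated here, so the following is only a sketch under a common assumption: a customer is an outlier when Monetary or Frequency exceeds the upper 1.5×IQR fence. The labels follow the list above (−1 for spend only, −2 for frequency only, −3 for both); the helper names and sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def upper_fence(s: pd.Series, k: float = 1.5) -> float:
    """Upper Tukey fence: Q3 + k * IQR (an assumed outlier rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return q3 + k * (q3 - q1)

def label_outliers(rfm: pd.DataFrame) -> pd.DataFrame:
    """Add OutlierLabel: -1 spend, -2 frequency, -3 both, 0 not an outlier."""
    m_out = rfm["Monetary"] > upper_fence(rfm["Monetary"])
    f_out = rfm["Frequency"] > upper_fence(rfm["Frequency"])
    out = rfm.copy()
    # np.select takes the first matching condition, so "both" is checked first.
    # A value of 0 only means "not an outlier"; it is not K-Means cluster 0.
    out["OutlierLabel"] = np.select(
        [m_out & f_out, f_out, m_out], [-3, -2, -1], default=0
    )
    return out

# Hypothetical sample with one customer of each outlier type.
rfm_sample = pd.DataFrame({
    "Monetary":  [100, 120, 110, 130, 90, 105, 115, 125, 5000, 95, 6000, 100],
    "Frequency": [2, 3, 2, 1, 2, 3, 2, 2, 2, 30, 40, 2],
})
labelled = label_outliers(rfm_sample)
core = labelled[labelled["OutlierLabel"] == 0]      # rows passed on to K-Means
outliers = labelled[labelled["OutlierLabel"] < 0]   # handled as micro-segments
print(outliers)
```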
- Retention & Loyalty → Retain & Reward cohorts
- Reactivation Campaigns → Target Re-Engage group
- Acquisition Funnel → Nurture new/dormant customers
- Premium Strategy → Outlier groups (Pamper, Upsell, Delight)
- Python 3.10+, Jupyter Notebook
- pandas → data manipulation
- scikit-learn → KMeans, scaling, silhouette
- matplotlib & seaborn → visualization
- openpyxl → Excel I/O
✔️ Built a scalable, reproducible ML pipeline from raw retail logs
✔️ Applied RFM analysis + K-Means for customer segmentation
✔️ Delivered business-ready insights & playbooks tied to ROI levers
✔️ Integrated robust data cleaning, outlier handling, and model diagnostics
✔️ Produced clear visualizations & segment storytelling for stakeholders







