Steffin12-git/Kmeans-online-retail

Online Retail — K-Means Customer Segmentation



Goal: Use unsupervised learning (K-Means) on RFM features to uncover actionable customer segments from the UCI Online Retail II dataset.

Business Impact: Customer segmentation supports personalized marketing and can improve ROI, customer retention, and engagement.


🚀 Executive Summary

This notebook demonstrates a complete end-to-end ML pipeline that:

  • Cleans & standardizes messy retail transactions
  • Constructs RFM (Recency, Frequency, Monetary) features
  • Scales features & applies K-Means clustering
  • Determines optimal k via Elbow & Silhouette methods
  • Produces 4 actionable segments + 3 outlier micro-segments
  • Visualizes clusters for business storytelling

📊 Key Outcomes:

  • Raw rows: 525,461 → Cleaned: 406,309 (77.32% retained)
  • Unique customers: 4,285
  • Clustered customers (after outlier removal): 3,809
  • Optimal clusters (k): 4
  • Actionable Segments: Retain, Reward, Re-Engage, Nurture + Outlier cohorts (Pamper, Upsell, Delight)

📂 Dataset

  • Source: UCI ML Repository — Online Retail II
  • Period: 2009-12-01 → 2010-12-09
  • Countries: 40 (UK dominates with 485,852 rows)
  • Schema (8 cols): Invoice, StockCode, Description, Quantity, InvoiceDate, Price, Customer ID, Country

🧹 Data Cleaning

Steps applied:

  1. Invoice filtering → keep only 6-digit numeric invoices (^\d{6}$).
  2. StockCode filtering → exclude admin/test codes (keep only SKU-like).
  3. Drop rows with missing Customer IDs.
  4. Remove zero-priced lines (28 rows removed).
  5. Final cleaned dataset: 406,309 rows (77.32% of raw).
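
The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `clean_transactions` is hypothetical, and the "SKU-like" pattern (five digits with optional trailing letters) is an assumption, since the README does not state the exact rule used for StockCode.

```python
import pandas as pd


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four filtering steps to a raw Online Retail II frame."""
    out = df.copy()

    # 1. Keep only invoices that are exactly six digits (drops e.g. "C..." cancellations).
    out["Invoice"] = out["Invoice"].astype(str)
    out = out[out["Invoice"].str.match(r"^\d{6}$")]

    # 2. Keep SKU-like stock codes.
    #    ASSUMPTION: "SKU-like" means five digits plus optional trailing letters.
    out["StockCode"] = out["StockCode"].astype(str)
    out = out[out["StockCode"].str.match(r"^\d{5}[A-Za-z]*$")]

    # 3. Drop rows with a missing customer ID.
    out = out.dropna(subset=["Customer ID"])

    # 4. Remove zero-priced lines.
    out = out[out["Price"] != 0]

    return out.reset_index(drop=True)


# Tiny made-up frame, only to exercise each rule once.
raw = pd.DataFrame({
    "Invoice": ["489434", "C489449", "489435", "489436", "489437"],
    "StockCode": ["85048", "22087", "POST", "21232", "22064"],
    "Quantity": [12, -12, 1, 24, 6],
    "Price": [6.95, 6.75, 18.0, 1.25, 0.0],
    "Customer ID": [13085.0, 16321.0, 13085.0, None, 13078.0],
})
cleaned = clean_transactions(raw)
print(f"{len(cleaned)} of {len(raw)} rows retained ({len(cleaned) / len(raw):.2%})")
```

The same retained-share calculation on the real data gives the figure quoted above: 406,309 / 525,461 ≈ 77.32%.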

📌 Distribution Check:

Distribution of RFM

📌 With Outliers Highlighted:

Boxplots with Outliers

📌 After Major Outliers Separated:

Boxplots after Outlier Removal


📊 RFM Feature Engineering

Features per customer:

  • Recency (R): Days since last purchase
  • Frequency (F): Number of invoices
  • Monetary (M): Total spend
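
One way to derive these three features from cleaned transaction lines is a single `groupby`. The sketch below makes two assumptions not stated in the README: recency is measured against the latest invoice date in the data, and line spend is `Quantity * Price`. The function name `build_rfm` and the `LineTotal` column are hypothetical.

```python
import pandas as pd


def build_rfm(df: pd.DataFrame, snapshot=None) -> pd.DataFrame:
    """Aggregate transaction lines into one Recency/Frequency/Monetary row per customer."""
    df = df.copy()
    df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
    # ASSUMPTION: spend per line is quantity times unit price.
    df["LineTotal"] = df["Quantity"] * df["Price"]

    # ASSUMPTION: the reference date is the most recent invoice in the data.
    if snapshot is None:
        snapshot = df["InvoiceDate"].max()

    rfm = df.groupby("Customer ID").agg(
        LastPurchase=("InvoiceDate", "max"),
        Frequency=("Invoice", "nunique"),   # number of distinct invoices
        Monetary=("LineTotal", "sum"),      # total spend
    )
    rfm["Recency"] = (snapshot - rfm["LastPurchase"]).dt.days
    return rfm[["Recency", "Frequency", "Monetary"]].reset_index()


# Made-up transactions for two customers.
tx = pd.DataFrame({
    "Invoice": ["100001", "100001", "100002", "100003"],
    "InvoiceDate": ["2010-01-01", "2010-01-01", "2010-03-01", "2010-03-11"],
    "Quantity": [2, 1, 4, 3],
    "Price": [5.0, 3.0, 2.5, 1.0],
    "Customer ID": [1, 1, 1, 2],
})
rfm = build_rfm(tx)
print(rfm)
```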

Summary (non-outliers, 3,809 customers):

  • Recency: mean = 97d, median = 58d
  • Frequency: mean = 2.86, median = 2
  • Monetary: mean = 885.5, median = 588.1

🤖 Modeling

  • Algorithm: KMeans(n_clusters=4, random_state=42, max_iter=1000)
  • Scaling: StandardScaler on [R, F, M]
  • Model Selection: Elbow (inertia) and Silhouette diagnostics across candidate values of k

📌 Elbow + Silhouette Diagnostic:

KMeans Inertia and Silhouette

Optimal K = 4
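
The elbow and silhouette diagnostics can be reproduced with a short loop over candidate k values. The sketch below runs on synthetic, well-separated blobs rather than the real RFM table, so its output says nothing about the actual dataset. The helper name `scan_k`, the candidate range 2 to 8, and `n_init=10` are assumptions; the README only states `n_clusters`, `random_state`, and `max_iter`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def scan_k(X_scaled, k_values, random_state=42, max_iter=1000):
    """Fit K-Means for each candidate k; return inertia and silhouette per k."""
    inertia, silhouette = {}, {}
    for k in k_values:
        # n_init=10 is an assumption; the README does not state it.
        model = KMeans(n_clusters=k, random_state=random_state,
                       max_iter=max_iter, n_init=10)
        labels = model.fit_predict(X_scaled)
        inertia[k] = float(model.inertia_)                          # elbow curve
        silhouette[k] = float(silhouette_score(X_scaled, labels))   # higher is better
    return inertia, silhouette


# Synthetic stand-in for the [R, F, M] matrix: four well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])

# Standardize so that no single feature dominates the Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)

inertia, silhouette = scan_k(X_scaled, range(2, 9))
best_k = max(silhouette, key=silhouette.get)
print("best k by silhouette:", best_k)
```

On real RFM data the two curves rarely agree this cleanly, so the elbow position and the silhouette peak are usually read together, as in the diagnostic plot above.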


🧩 Segmentation Results

🔑 Core Clusters

  1. Cluster 0 — Retain (High value, recent buyers)

    • Playbook: VIP care, exclusive perks, proactive engagement.
  2. Cluster 1 — Re-Engage (Low frequency & spend, lapsed)

    • Playbook: Win-back offers, personalized reactivation emails.
  3. Cluster 2 — Nurture (Lowest activity, often new)

    • Playbook: Onboarding, education, low-friction deals.
  4. Cluster 3 — Reward (Loyal, consistent buyers)

    • Playbook: Loyalty rewards, referral incentives, bundles.
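
Segment names like these are typically assigned by inspecting each cluster's centroid in the original units. The sketch below shows one way to get that view by inverting the scaler on the fitted cluster centers. It uses random demo data, the helper name `cluster_profiles` is hypothetical, and `n_init=10` is an assumption. Cluster numbering depends on the fit, so a mapping from label to name has to be checked against the profiles each time.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

FEATURES = ["Recency", "Frequency", "Monetary"]


def cluster_profiles(rfm: pd.DataFrame, k: int = 4) -> pd.DataFrame:
    """Fit K-Means on scaled RFM and return centroids in original units plus sizes."""
    scaler = StandardScaler()
    X = scaler.fit_transform(rfm[FEATURES].to_numpy())
    # n_init=10 is an assumption; the other arguments follow the README.
    model = KMeans(n_clusters=k, random_state=42, max_iter=1000, n_init=10).fit(X)

    # Undo the scaling so centroids read as days, invoice counts, and spend.
    centroids = pd.DataFrame(
        scaler.inverse_transform(model.cluster_centers_), columns=FEATURES
    )
    centroids["Customers"] = np.bincount(model.labels_, minlength=k)
    centroids.index.name = "Cluster"
    return centroids


# Random demo data, NOT the real customer table.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Recency": rng.integers(1, 365, size=200),
    "Frequency": rng.integers(1, 10, size=200),
    "Monetary": rng.gamma(2.0, 400.0, size=200),
})
profiles = cluster_profiles(demo)
print(profiles.round(1))
```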

📌 Cluster Visualization (Raw):

3D Scatter Raw

📌 Cluster Visualization (Scaled):

3D Scatter Scaled

📌 KMeans Cluster Separation:

KMeans Clustered Scatter

📌 Violin Plots (Segment Profiles):

Violin Plot RFM by Cluster


🎯 Outlier Micro-Segments

  • Cluster −1 — Pamper → High spenders → bespoke offers, concierge.
  • Cluster −2 — Upsell → Frequent shoppers → subscriptions, bundles.
  • Cluster −3 — Delight → Elite (high R + F + M) → premium tiers, surprise perks.
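
The README does not spell out how outliers are separated. A common choice that matches the boxplots is Tukey's rule (values above Q3 + 1.5 × IQR). The sketch below applies that rule under the following assumptions: Monetary-only outliers become Pamper, Frequency-only outliers become Upsell, and customers who are outliers on both become Delight. The function names and the `"core"` placeholder for non-outliers are hypothetical.

```python
import numpy as np
import pandas as pd

# Negative codes follow the README's numbering of the outlier cohorts.
OUTLIER_CODES = {"Pamper": -1, "Upsell": -2, "Delight": -3}


def upper_fence(s: pd.Series, k: float = 1.5) -> float:
    """Tukey upper fence: Q3 + k * IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return q3 + k * (q3 - q1)


def tag_outliers(rfm: pd.DataFrame) -> pd.Series:
    """Return a segment name per customer; 'core' marks rows left for K-Means."""
    m_out = rfm["Monetary"] > upper_fence(rfm["Monetary"])
    f_out = rfm["Frequency"] > upper_fence(rfm["Frequency"])
    # ASSUMPTION: the mapping of outlier type to cohort is inferred, not confirmed.
    names = np.select(
        [m_out & f_out, m_out, f_out],
        ["Delight", "Pamper", "Upsell"],
        default="core",
    )
    return pd.Series(names, index=rfm.index, name="Segment")


# Made-up customers: index 9 spends a lot, 10 buys often, 11 does both.
demo = pd.DataFrame({
    "Frequency": [1, 2, 2, 3, 1, 2, 3, 2, 2, 2, 30, 25],
    "Monetary": [100, 120, 110, 130, 90, 105, 115, 125, 95, 5000, 140, 6000],
})
demo["Segment"] = tag_outliers(demo)
core = demo[demo["Segment"] == "core"]   # these rows would go on to K-Means
print(demo["Segment"].value_counts())
```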

💼 Business Applications

  • Retention & Loyalty → Retain & Reward cohorts
  • Reactivation Campaigns → Target Re-Engage group
  • Acquisition Funnel → Nurture new/dormant customers
  • Premium Strategy → Outlier groups (Pamper, Upsell, Delight)

⚙️ Tech Stack

  • Python 3.10+, Jupyter Notebook
  • pandas → data manipulation
  • scikit-learn → KMeans, scaling, silhouette
  • matplotlib & seaborn → visualization
  • openpyxl → Excel I/O

🌟 Highlights

✔️ Built a scalable, reproducible ML pipeline from raw retail logs

✔️ Applied RFM analysis + K-Means for customer segmentation

✔️ Delivered business-ready insights & playbooks tied to ROI levers

✔️ Integrated robust data cleaning, outlier handling, and model diagnostics

✔️ Produced clear visualizations & segment storytelling for stakeholders

About

Unsupervised customer segmentation on the UCI Online Retail II dataset using RFM features and KMeans. Includes full pipeline: data cleaning, feature engineering, outlier handling, model selection (Elbow & Silhouette), and actionable segment insights for targeted marketing.
