
8-Data Mining Prediction - Comprehensive K-Means clustering tutorial with step-by-step Python implementation, including normalization, elbow and silhouette methods, cluster visualizations and applications in image segmentation, Monkey analysis, and other clustering tasks.


Quantum-Software-Development/8-DataMining-KMeans-Non-Hierarchical-Clustering



[🇧🇷 Português] [🇺🇸 English]





Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Daniel Rodrigues da Silva, Doctor in Mathematics



















Tip

This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.

Access Data Mining Main Repository



Theory Overview


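K-Means is a non-hierarchical (partitional) clustering algorithm. It partitions the data into $k$ clusters by minimizing the within-cluster variance, that is, the sum of squared distances between each point and the centroid of the cluster it is assigned to.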
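The iteration behind this definition can be sketched in a few lines of NumPy. This is a minimal illustration of the classic Lloyd-style loop (assign points, then recompute centroids), not the implementation scikit-learn uses; the function name `kmeans_lloyd` and the toy data are assumptions made for the example.

```python
import numpy as np

def nearest_centroid(X, centroids):
    """Index of the closest centroid for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_lloyd(X, k, n_iter=100, seed=0):
    """Hypothetical helper: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct points drawn from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        labels = nearest_centroid(X, centroids)
        # Update step: each centroid moves to the mean of its points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and within-cluster sum of squares
    labels = nearest_centroid(X, centroids)
    wcss = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, wcss

# Toy data (assumed): two well-separated blobs in 2D
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
toy_labels, toy_centroids, toy_wcss = kmeans_lloyd(X_toy, k=2)
print(toy_centroids.round(2), round(float(toy_wcss), 2))
```

Each pass can only keep or lower the within-cluster sum of squares, which is why the loop stops once the centroids no longer move.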


Advantages

  • Simple and efficient
  • Scales well with large datasets
  • Fast convergence

Disadvantages

  • Requires the number of clusters $k$ to be chosen in advance
  • Sensitive to the initial centroids, which can cause it to converge to a local minimum
  • Assumes roughly spherical clusters of similar size
  • Sensitive to outliers
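The sensitivity to initial centroids can be observed directly. The sketch below, on synthetic data from `make_blobs` (an assumption, not the repository's dataset), runs K-Means once per seed with purely random initialization and compares the resulting WCSS values with a run that uses k-means++ seeding and several restarts.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data used only for illustration
X_demo, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.0, random_state=0)

# One random initialization per run: the result may differ from seed to seed
single_run_inertias = [
    KMeans(n_clusters=5, init='random', n_init=1, random_state=seed).fit(X_demo).inertia_
    for seed in range(10)
]

# k-means++ seeding with n_init restarts keeps the run with the lowest WCSS
best_model = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=0).fit(X_demo)

print('single-run WCSS range:', min(single_run_inertias), max(single_run_inertias))
print('k-means++ with 10 restarts:', best_model.inertia_)
```

If the single-run values spread out, some seeds ended in a worse local minimum; restarts are a common way to reduce that risk, though they do not guarantee the global optimum.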

Elbow Method

The Elbow Method plots the Within-Cluster Sum of Squares (WCSS) against the number of clusters $k$. WCSS always decreases as $k$ grows, so the suggested $k$ is the "elbow" of the curve: the point after which adding another cluster no longer reduces WCSS substantially.
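For reference, WCSS can be written as $\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$, where $\mu_j$ is the centroid of cluster $C_j$. The short check below, on synthetic data (an assumption), computes this sum by hand and compares it with the `inertia_` attribute that scikit-learn exposes after fitting.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data used only for illustration
X_wcss, _ = make_blobs(n_samples=200, centers=3, random_state=1)
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_wcss)

# Squared distance from every point to the centroid of its own cluster
assigned_centroids = model.cluster_centers_[model.labels_]
manual_wcss = ((X_wcss - assigned_centroids) ** 2).sum()

print(manual_wcss, model.inertia_)
```

The two numbers should agree up to floating-point error, which is why `inertia_` can be used as the WCSS value in an elbow plot.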



Clustering Algorithms Overview

  • K-Means (partitional, non-hierarchical)
  • Hierarchical Clustering (agglomerative and divisive)
  • DBSCAN (density-based)
  • K-Medoids and others
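As a rough illustration of how these families differ in use, the sketch below fits three of them with scikit-learn on the synthetic two-moons dataset (an assumption chosen because it is not spherical). K-Medoids is left out because it is not part of core scikit-learn. The `eps` and `min_samples` values are illustrative guesses for this toy data only.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: a shape K-Means is not designed for
X_moons, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

labels_by_algo = {
    # Partitional: k must be given
    'KMeans': KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons),
    # Hierarchical (agglomerative): the tree is cut at 2 clusters
    'Agglomerative': AgglomerativeClustering(n_clusters=2).fit_predict(X_moons),
    # Density-based: no k; points labeled -1 are treated as noise
    'DBSCAN': DBSCAN(eps=0.3, min_samples=5).fit_predict(X_moons),
}

for name, labels in labels_by_algo.items():
    print(name, sorted(set(labels.tolist())))
```

Only the K-Means and agglomerative calls take the number of clusters as input; DBSCAN infers it from density, so its label set depends on the chosen parameters.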



Step-by-Step K-Means Implementation


Libraries Import and Dataset Loading



# Import necessary libraries
import pandas as pd  # for data handling
import matplotlib.pyplot as plt  # for plotting
import seaborn as sns  # enhanced visualization
from sklearn.cluster import KMeans  # KMeans algorithm
from sklearn.preprocessing import MinMaxScaler  # normalization

# Load dataset
df = pd.read_csv('clientes-shopping.csv')
print(df.head())  # Show first rows



Data Preprocessing and Normalization

# Drop columns not used for clustering: CustomerID, Gender, Age
df_cluster = df.drop(['CustomerID', 'Gender', 'Age'], axis=1)

# Normalize the remaining features to the [0, 1] range with MinMaxScaler
scaler = MinMaxScaler()
df_norm = pd.DataFrame(scaler.fit_transform(df_cluster), columns=df_cluster.columns)

print(df_norm.head())  # Preview normalized data
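Min-max normalization rescales each column with $(x - x_{min}) / (x_{max} - x_{min})$, so every feature ends up in the range 0 to 1 and contributes comparably to the Euclidean distances K-Means relies on. The sketch below applies the formula by hand to a small made-up table (the column names `income` and `score` and their values are hypothetical) and checks it against `MinMaxScaler`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values standing in for two numeric columns
toy = pd.DataFrame({
    'income': [15.0, 40.0, 65.0, 140.0],
    'score': [10.0, 50.0, 90.0, 30.0],
})

# Manual min-max scaling, column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

# Same operation through scikit-learn
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

print(manual.round(3))
print(np.allclose(manual.values, scaled.values))
```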



Scatter Plot of Raw Data

# Scatter plot of Annual Income vs Spending Score, colored by Gender
sns.set_style('dark')
# 'turquoise' alone is a color, not a seaborn palette name;
# 'dark:turquoise' builds a sequential palette ending in that color
palette = sns.color_palette('dark:turquoise', 3)

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)', hue='Gender', palette=palette)
plt.title('Annual Income vs Spending Score by Gender')
plt.show()



Elbow Method to Determine Optimal K


# Elbow Method to find the ideal number of clusters k
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(df_norm)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS for this k

plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), wcss, marker='o', color='turquoise')
plt.title('Elbow Method')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.show()
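The repository description also mentions the silhouette method, which is often used alongside the elbow plot when the elbow is ambiguous. The sketch below is a hedged illustration on synthetic data (not the shopping dataset): it computes the mean silhouette score for several values of k and reports the k with the highest score. The silhouette is undefined for a single cluster, so the loop starts at 2.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data used only for illustration
X_sil, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

silhouette_by_k = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_sil)
    # Mean silhouette over all points; values lie between -1 and 1
    silhouette_by_k[k] = silhouette_score(X_sil, labels)

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(silhouette_by_k)
print('k with highest mean silhouette:', best_k)
```

On the real data, the same loop could be run on `df_norm`; whether it agrees with the elbow plot is something to check rather than assume.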



Running K-Means with $k=5$


# Apply KMeans with k=5 and random seed 42
kmeans_model = KMeans(n_clusters=5, random_state=42)
clusters = kmeans_model.fit_predict(df_norm)

# Add cluster labels to the original dataframe
df['Cluster'] = clusters

print(df[['Annual Income (k$)', 'Spending Score (1-100)', 'Cluster']].head())
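Because the model is fitted on normalized features, its centroids are expressed in the 0 to 1 range. One way to read them in the original units is `inverse_transform` on the fitted scaler. The sketch below uses synthetic values as a stand-in for the two clustering columns (the data and the `demo_` names are assumptions, not the real dataset).

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the two clustering columns (not the real dataset)
values, _ = make_blobs(n_samples=200, centers=5, random_state=42)
features = pd.DataFrame(values, columns=['Annual Income (k$)', 'Spending Score (1-100)'])

demo_scaler = MinMaxScaler()
features_norm = demo_scaler.fit_transform(features)
demo_model = KMeans(n_clusters=5, n_init=10, random_state=42).fit(features_norm)

# Centroids live in the normalized space; map them back to the original units
centroids_original = pd.DataFrame(
    demo_scaler.inverse_transform(demo_model.cluster_centers_),
    columns=features.columns,
)
print(centroids_original.round(2))
```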



Scatter Plot with Clusters (Dark Mode, Turquoise Palette)


# Scatter plot with clusters in different colors (turquoise palette)
# 'turquoise' alone is not a valid seaborn palette name, so a
# 'dark:turquoise' sequential palette is used instead
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)',
                hue='Cluster', palette='dark:turquoise', legend='full')
plt.title('KMeans Clusters (k=5)')
plt.show()



Scatter Plot with Gender Legend and Cluster Styles


# Scatter plot with Gender as the legend and points styled by cluster
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)',
                hue='Gender', style='Cluster', palette=palette)
plt.title('Annual Income vs Spending Score by Gender and Cluster')
plt.show()



Cluster Statistics
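A common way to summarize clusters is to group the dataframe by the cluster label and aggregate each feature. The sketch below is one possible approach, shown on a small made-up table that mimics the columns used above (the rows are hypothetical, not results from the dataset).

```python
import pandas as pd

# Hypothetical rows mimicking the dataframe after the 'Cluster' column is added
demo = pd.DataFrame({
    'Annual Income (k$)': [15, 16, 80, 85, 54, 60],
    'Spending Score (1-100)': [39, 81, 10, 90, 50, 48],
    'Cluster': [0, 1, 2, 3, 4, 4],
})

# Mean, minimum and maximum of each feature per cluster
cluster_stats = (
    demo.groupby('Cluster')[['Annual Income (k$)', 'Spending Score (1-100)']]
    .agg(['mean', 'min', 'max'])
)

# Number of rows assigned to each cluster
cluster_sizes = demo['Cluster'].value_counts().sort_index()

print(cluster_stats)
print(cluster_sizes)
```

Applied to the real `df`, the same `groupby` call would give per-cluster income and spending profiles.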


II - Creating Figures - K-Means Clustering























1. Castro, L. N. & Ferrari, D. G. (2016). Introdução à mineração de dados: conceitos básicos, algoritmos e aplicações. Saraiva.

2. Ferreira, A. C. P. L. et al. (2024). Inteligência Artificial - Uma Abordagem de Aprendizado de Máquina. 2nd Ed. LTC.

3. Larson & Farber (2015). Estatística Aplicada. Pearson.








Copyright 2025 Quantum Software Development. Code released under the MIT License.
