8- Data Mining / K-Means: Non-Hierarchical Clustering

Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Dr. Daniel Rodrigues da Silva (Doctor in Mathematics)

Important

⚠️ Heads Up

Projects and deliverables may be made publicly available whenever possible.
The course emphasizes practical, hands-on experience with real datasets to simulate professional consulting scenarios in the fields of Data Analysis and Data Mining for partner organizations and institutions affiliated with the university.
All activities comply with the academic and ethical guidelines of PUC-SP.
Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.

Tip

This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.

☞ Access Data Mining Main Repository

Theory Overview

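K-Means is a non-hierarchical (partitional) clustering algorithm that splits the data into k clusters. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points, which minimizes the within-cluster variance (the sum of squared distances from each point to its centroid).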
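The two alternating steps can be sketched directly in NumPy. This is a minimal illustration, not the scikit-learn implementation used later in this README; the function name `kmeans_lloyd`, the random initialization from data points, and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iter=100, seed=0):
    """Minimal K-Means (Lloyd's algorithm): returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy blobs (hypothetical data, not the course dataset).
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
centroids, labels = kmeans_lloyd(X_toy, k=2)
print(centroids)
```

On data this well separated, the two centroids should settle near (0, 0) and (5, 5) regardless of which points are drawn as the starting centroids.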

Advantages

Simple and efficient
Scales well with large datasets
Fast convergence

Disadvantages

Requires predefined number of clusters (k)
Sensitive to the initial centroids; a poor start can converge to a local minimum
Assumes spherical clusters of similar sizes
Sensitive to outliers
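The sensitivity to initial centroids can be reproduced on a tiny example. The sketch below assumes four hand-picked points and two hand-picked starting configurations (all hypothetical); it passes them to scikit-learn's `KMeans` through the `init` parameter with `n_init=1`, so only that single start is tried.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four corners of a wide rectangle: the natural split is left vs. right.
X_rect = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

good_init = np.array([[0.0, 0.5], [10.0, 0.5]])  # one centroid per side
bad_init = np.array([[5.0, 0.0], [5.0, 1.0]])    # splits bottom vs. top

# n_init=1 because the starting centroids are given explicitly.
good = KMeans(n_clusters=2, init=good_init, n_init=1).fit(X_rect)
bad = KMeans(n_clusters=2, init=bad_init, n_init=1).fit(X_rect)

print('WCSS with good start:', good.inertia_)
print('WCSS with bad start: ', bad.inertia_)
```

With the bad start, each centroid already sits at the mean of the points assigned to it, so the algorithm stops immediately even though the left/right split has a far lower WCSS. Running several initializations (`n_init` greater than 1) and keeping the best result is the usual mitigation.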

Elbow Method

The Elbow Method plots the Within-Cluster Sum of Squares (WCSS) versus number of clusters (k). The optimal (k) is indicated by the 'elbow' point where adding another cluster does not significantly reduce WCSS.
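For reference, the quantity on the vertical axis can be written out explicitly. The notation is an assumption introduced here: C_j is the set of points in cluster j and \mu_j is its centroid. scikit-learn exposes this value as the `inertia_` attribute of a fitted `KMeans` model.

```latex
\mathrm{WCSS}(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

WCSS never increases as k grows (with k equal to the number of points it reaches zero), which is why the elbow, rather than the minimum, is used to pick k.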

Clustering Algorithms Overview

K-Means (partitional, non-hierarchical)
Hierarchical Clustering (agglomerative and divisive)
DBSCAN (density-based)
K-Medoids and others
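As a rough illustration of how these families differ in use, the sketch below fits three of them on the same synthetic blobs. The data, `eps`, and `min_samples` values are assumptions chosen for the example; K-Medoids is left out because it is not part of scikit-learn itself.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data with three compact groups (hypothetical, for illustration).
X_demo, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

labels_by_algo = {
    # Partitional: k must be given up front.
    'K-Means': KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo),
    # Hierarchical (agglomerative): merges clusters bottom-up until 3 remain.
    'Agglomerative': AgglomerativeClustering(n_clusters=3).fit_predict(X_demo),
    # Density-based: no k; label -1 marks points treated as noise.
    'DBSCAN': DBSCAN(eps=0.5, min_samples=5).fit_predict(X_demo),
}

for name, labels in labels_by_algo.items():
    n_found = len(set(labels.tolist()) - {-1})
    print(f'{name}: {n_found} clusters')
```

K-Means and agglomerative clustering return exactly the requested number of clusters, while the number found by DBSCAN depends on `eps` and `min_samples`.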

Step-by-Step K-Means Implementation

Libraries Import and Dataset Loading

# Import necessary libraries
import pandas as pd  # data handling
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # enhanced visualization
from sklearn.cluster import KMeans  # K-Means algorithm
from sklearn.preprocessing import MinMaxScaler  # min-max normalization

# Load the dataset
df = pd.read_csv('clientes-shopping.csv')
print(df.head())  # show the first rows

Data Preprocessing and Normalization

# Drop the columns not used for clustering (CustomerID, Gender, Age)
df_cluster = df.drop(['CustomerID', 'Gender', 'Age'], axis=1)

# Normalize the remaining features to the [0, 1] range with MinMaxScaler
scaler = MinMaxScaler()
df_norm = pd.DataFrame(scaler.fit_transform(df_cluster), columns=df_cluster.columns)

print(df_norm.head())  # preview the normalized data
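To make the normalization step concrete: with default settings, MinMaxScaler rescales each column as (x - min) / (max - min). The sketch below checks this by hand on a small hypothetical table; the values are invented and only the column names follow the dataset used in this README.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows; not taken from clientes-shopping.csv.
toy = pd.DataFrame({
    'Annual Income (k$)': [15, 40, 70, 137],
    'Spending Score (1-100)': [39, 81, 6, 99],
})

# Manual min-max scaling, column by column.
manual = (toy - toy.min()) / (toy.max() - toy.min())

# Same transformation through scikit-learn.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

print(np.allclose(manual, scaled))
```

Scaling matters for K-Means because it relies on Euclidean distance: without it, the feature with the larger numeric range would dominate the cluster assignment.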

Scatter Plot of Raw Data

# Scatter plot of Annual Income vs Spending Score, colored by Gender
sns.set_style('dark')
# 'turquoise' alone is not a seaborn palette name; 'dark:turquoise' builds a
# sequential palette that ends in that color
palette = sns.color_palette('dark:turquoise', 3)

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)', hue='Gender', palette=palette)
plt.title('Annual Income vs Spending Score by Gender')
plt.show()

Elbow Method to Determine Optimal K

# Elbow Method to find the ideal number of clusters k
wcss = []
for k in range(1, 11):
    # n_init is set explicitly because its default changed across scikit-learn versions
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(df_norm)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS of the fitted model

plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), wcss, marker='o', color='turquoise')
plt.title('Elbow Method')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.show()

Running K-Means with $(k=5)$

# Apply KMeans with k=5 and random seed 42
kmeans_model = KMeans(n_clusters=5, n_init=10, random_state=42)
clusters = kmeans_model.fit_predict(df_norm)

# Add the cluster labels to the original dataframe
df['Cluster'] = clusters

print(df[['Annual Income (k$)', 'Spending Score (1-100)', 'Cluster']].head())
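Because the model is fitted on normalized data, its centroids live in the [0, 1] space. A possible way to read them in the original units is to pass them back through the scaler's `inverse_transform`. The sketch below assumes synthetic stand-in data (random integers, not the course dataset) and its own local variable names.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the two clustering features.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Annual Income (k$)': rng.integers(15, 138, size=200),
    'Spending Score (1-100)': rng.integers(1, 101, size=200),
})

toy_scaler = MinMaxScaler()
toy_norm = toy_scaler.fit_transform(toy)
model = KMeans(n_clusters=5, n_init=10, random_state=42).fit(toy_norm)

# Map the centroids from the normalized space back to income and score units.
centers = pd.DataFrame(
    toy_scaler.inverse_transform(model.cluster_centers_),
    columns=toy.columns,
)
print(centers.round(1))
```

Each row of `centers` is one cluster's centroid, which gives a quick profile such as "high income, low spending" without reading normalized values.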

Scatter Plot with Clusters (Dark Mode, Turquoise Palette)

# Scatter plot with clusters in different colors (turquoise palette)
plt.figure(figsize=(8, 6))
# 'dark:turquoise' replaces 'turquoise', which is not a valid seaborn palette name
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)',
                hue='Cluster', palette='dark:turquoise', legend='full')
plt.title('KMeans Clusters (k=5)')
plt.show()

Scatter Plot with Gender Legend and Cluster Styles

# Scatter plot with Gender as the legend and points styled by cluster
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)',
                hue='Gender', style='Cluster', palette=palette)
plt.title('Annual Income vs Spending Score by Gender and Cluster')
plt.show()

Cluster Statistics
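One plausible way to summarize the clusters is a pandas `groupby` on the `Cluster` column. The sketch below is an assumption about what such a summary could look like, not the repository's own code; it uses a hypothetical six-row table in place of `df`.

```python
import pandas as pd

# Hypothetical miniature of df after the 'Cluster' column has been added.
toy = pd.DataFrame({
    'Annual Income (k$)': [15, 16, 78, 80, 54, 57],
    'Spending Score (1-100)': [39, 81, 17, 90, 48, 52],
    'Cluster': [0, 1, 2, 3, 4, 4],
})

# Size, mean, and range of each feature within each cluster.
stats = toy.groupby('Cluster').agg(['count', 'mean', 'min', 'max'])
print(stats)
```

The result has one row per cluster and a two-level column index (feature, statistic), so a single value can be read with `stats.loc[4, ('Annual Income (k$)', 'mean')]`.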

II - Creating Figures - K-Means Clustering

Copyright 2025 Quantum Software Development. Code released under the MIT License.
