Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Daniel Rodrigues da Silva (Doctor in Mathematics)
Important
- Projects and deliverables may be made publicly available whenever possible.
- The course emphasizes practical, hands-on experience with real datasets to simulate professional consulting scenarios in the fields of Data Analysis and Data Mining for partner organizations and institutions affiliated with the university.
- All activities comply with the academic and ethical guidelines of PUC-SP.
- Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.
Tip
This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.
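K-Means is a non-hierarchical (partitional) clustering algorithm that splits the data into k clusters so as to minimize the within-cluster variance, i.e. the sum of squared distances from each point to the centroid of its cluster.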
Advantages:
- Simple and efficient
- Scales well with large datasets
- Fast convergence
Disadvantages:
- Requires the number of clusters k to be chosen in advance
- Sensitive to the initial centroids, so it can converge to a local minimum
- Assumes roughly spherical clusters of similar size
- Sensitive to outliers
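The assign-and-update loop behind K-Means, and its sensitivity to the initial centroids, can be sketched in a few lines of NumPy. This is a minimal illustration of Lloyd's algorithm on synthetic data, not the scikit-learn implementation used later in this README; the function names `nearest_centroid` and `lloyd_kmeans` and the toy blobs are assumptions made for the example.

```python
import numpy as np

def nearest_centroid(X, centroids):
    """Return, for each point, the index of its closest centroid."""
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with a naive random initialization."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = nearest_centroid(X, centroids)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    labels = nearest_centroid(X, centroids)
    wcss = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, wcss

# Synthetic data: three compact blobs (made-up values for illustration).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.3, size=(50, 2)),
])

# Different seeds may end in different local minima, so keep the best run.
runs = [lloyd_kmeans(X, k=3, seed=s) for s in range(10)]
best_labels, best_centroids, best_wcss = min(runs, key=lambda r: r[2])
print("WCSS per run:", [round(r[2], 2) for r in runs])
print("Best WCSS:", round(best_wcss, 2))
```

If the per-run values differ, the larger ones correspond to runs that stopped in a local minimum. Repeating the algorithm from several starting points and keeping the lowest WCSS is the usual remedy.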
The Elbow Method plots the Within-Cluster Sum of Squares (WCSS) against the number of clusters k. WCSS generally decreases as k grows, so the suggested k is the 'elbow' of the curve: the point beyond which adding another cluster yields only a marginal reduction in WCSS.
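For reference, WCSS can be written out explicitly. The notation below is an assumption introduced here, not taken from the course material: C_j is the set of points assigned to cluster j and \mu_j is its centroid.

```latex
\mathrm{WCSS}(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

In scikit-learn this quantity is exposed as the `inertia_` attribute of a fitted `KMeans` model, which is what the Elbow Method code in this README collects.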
Main families of clustering algorithms:
- K-Means (partitional, non-hierarchical)
- Hierarchical Clustering (agglomerative and divisive)
- DBSCAN (density-based)
- K-Medoids (partitional, uses actual data points as cluster centers) and others
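To see how these families can behave differently, the sketch below runs three of them on scikit-learn's synthetic two-moons data, where the clusters are curved rather than spherical. The dataset, the parameter values (`eps=0.2`, `min_samples=5`, single linkage) and the use of the adjusted Rand index as a score are assumptions chosen for illustration, not part of the course material. K-Medoids is left out because it is not part of scikit-learn itself.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: a shape that violates the spherical-cluster assumption.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

models = {
    "K-Means": KMeans(n_clusters=2, n_init=10, random_state=42),
    "Agglomerative (single linkage)": AgglomerativeClustering(n_clusters=2, linkage="single"),
    "DBSCAN": DBSCAN(eps=0.2, min_samples=5),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    # Adjusted Rand index: 1.0 means perfect agreement with the true moons.
    scores[name] = adjusted_rand_score(y_true, labels)
    print(f"{name}: ARI = {scores[name]:.2f}")
```

On data like this, a density-based method can follow the curved shapes, while K-Means tends to cut the moons with a straight boundary. On compact, well-separated groups the ranking may look quite different.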
# Import necessary libraries
import pandas as pd  # data handling
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # enhanced visualization
from sklearn.cluster import KMeans  # KMeans algorithm
from sklearn.preprocessing import MinMaxScaler  # normalization
# Load dataset
df = pd.read_csv('clientes-shopping.csv')
print(df.head())  # show first rows

# Drop columns not used for clustering: CustomerID, Gender, Age
df_cluster = df.drop(['CustomerID', 'Gender', 'Age'], axis=1)
# Normalize the features to the [0, 1] range with MinMaxScaler
scaler = MinMaxScaler()
df_norm = pd.DataFrame(scaler.fit_transform(df_cluster), columns=df_cluster.columns)
print(df_norm.head())  # preview normalized data
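The normalization step can be checked by hand. The sketch below applies the min-max formula x' = (x - min) / (max - min) column by column and compares the result with `MinMaxScaler`. The four rows of toy values are made up for illustration and only borrow the column names used in this README.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy values standing in for the two clustering features.
toy = pd.DataFrame({
    "Annual Income (k$)": [15, 40, 75, 137],
    "Spending Score (1-100)": [39, 81, 6, 99],
})

# Min-max scaling by hand, applied to each column independently.
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(toy)

print(manual.round(3))
print("Matches MinMaxScaler:", np.allclose(manual.to_numpy(), scaled))
```

Scaling matters here because K-Means relies on Euclidean distances: without it, the feature with the larger numeric range would dominate the cluster assignment.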
# Scatter plot of Annual Income vs Spending Score, colored by Gender
sns.set_style('dark')
# 'turquoise' is a single color, not a palette name, so build a dark-to-turquoise blend
palette = sns.color_palette('dark:turquoise', df['Gender'].nunique())
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)', hue='Gender', palette=palette)
plt.title('Annual Income vs Spending Score by Gender')
plt.show()
# Elbow Method to find the ideal number of clusters k
wcss = []
for k in range(1, 11):
    # n_init is set explicitly so the number of restarts does not depend on the scikit-learn version
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(df_norm)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS of the fitted model

plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), wcss, marker='o', color='turquoise')
plt.title('Elbow Method')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.show()
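Reading the elbow off a plot is a visual judgment, so a numeric cross-check can help. The sketch below uses the silhouette score, a technique not covered in the text above and added here only as a complement. It runs on synthetic blobs from `make_blobs` rather than the shopping data, and the blob parameters are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the normalized features (made-up data).
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.6, random_state=42)

silhouette = {}
for k in range(2, 11):  # the silhouette score is undefined for a single cluster
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    silhouette[k] = silhouette_score(X, labels)

best_k = max(silhouette, key=silhouette.get)
print({k: round(s, 3) for k, s in silhouette.items()})
print("k with the highest silhouette score:", best_k)
```

Higher silhouette values indicate tighter, better-separated clusters. When the elbow and the silhouette score point to the same k, the choice is easier to defend; when they disagree, domain knowledge usually decides.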
# Apply KMeans with k=5 and random seed 42
kmeans_model = KMeans(n_clusters=5, n_init=10, random_state=42)
clusters = kmeans_model.fit_predict(df_norm)

# Add cluster labels to the original dataframe
df['Cluster'] = clusters
print(df[['Annual Income (k$)', 'Spending Score (1-100)', 'Cluster']].head())
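Cluster labels are easier to interpret once the centroids are expressed in the original units. The sketch below maps them back with `scaler.inverse_transform` and summarizes each cluster with a `groupby`. It runs on randomly generated stand-in data, so the numbers it prints say nothing about the real shopping dataset; the names `toy`, `centroids` and `profile` are introduced only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the shopping data (values are made up for illustration).
rng = np.random.default_rng(42)
features = ["Annual Income (k$)", "Spending Score (1-100)"]
toy = pd.DataFrame({
    features[0]: rng.integers(15, 138, size=200),
    features[1]: rng.integers(1, 100, size=200),
})

scaler = MinMaxScaler()
toy_norm = scaler.fit_transform(toy[features])
model = KMeans(n_clusters=5, n_init=10, random_state=42).fit(toy_norm)
toy["Cluster"] = model.labels_

# Centroids live in the normalized [0, 1] space; map them back to the original units.
centroids = pd.DataFrame(scaler.inverse_transform(model.cluster_centers_), columns=features)
print(centroids.round(1))

# Per-cluster feature means and sizes, computed directly on the original scale.
profile = toy.groupby("Cluster")[features].mean()
profile["n_customers"] = toy["Cluster"].value_counts()
print(profile.round(1))
```

A table like `profile` is typically what turns cluster numbers into segment descriptions, for example a group with high income and low spending score.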
# Scatter plot with clusters in different colors (dark-to-turquoise blend palette)
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)',
                hue='Cluster', palette='dark:turquoise', legend='full')
plt.title('KMeans Clusters (k=5)')
plt.show()
# Scatter plot with Gender as the legend and points styled by cluster
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)',
                hue='Gender', style='Cluster', palette=palette)
plt.title('Annual Income vs Spending Score by Gender and Cluster')
plt.show()
Copyright 2025 Quantum Software Development. Code released under the MIT License.