This project involves a comprehensive analysis of a dataset with the following columns:
- age: Age of the individual.
- workclass: Type of employment (e.g., Private, Self-Employed).
- fnlwgt: Final weight (an estimate of the number of people in the population the observation represents).
- education: Highest level of education attained.
- educational-num: Ordinal number encoding the education level.
- marital-status: Marital status (e.g., Married, Single).
- occupation: Occupation of the individual.
- relationship: Relationship within the household (e.g., Husband, Not-in-family).
- race: Race of the individual.
- gender: Gender of the individual.
- capital-gain: Capital gains.
- capital-loss: Capital losses.
- hours-per-week: Hours worked per week.
- native-country: Country of origin.
- income: Income category (<=50K or >50K).
All columns contain 48,842 non-null entries, ensuring a complete dataset for analysis.
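The initial inspection steps can be sketched as follows. The `read_csv` call is commented out because the filename is an assumption; a tiny synthetic DataFrame with the same schema stands in so the calls are runnable:

```python
import pandas as pd

# Hypothetical loading step -- "adult.csv" is an assumed filename.
# df = pd.read_csv("adult.csv")

# Tiny synthetic stand-in with a subset of the real columns, for illustration.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["Private", "Self-emp-not-inc", "Private"],
    "hours-per-week": [40, 13, 40],
    "income": ["<=50K", "<=50K", ">50K"],
})

print(df.head())          # preview the first few rows
print(df.isnull().sum())  # per-column missing-value counts
print(df.describe())      # summary statistics for numerical features
```

On the full dataset, `df.isnull().sum()` returning all zeros is what confirms the "48,842 non-null entries" claim above.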
- Display Dataset: Preview the first few rows of the dataset.
- Missing Values: Check for missing values and handle them appropriately.
- Summary Statistics: Summarize the statistics for numerical features.
- Numerical Features: Plot histograms and boxplots.
- Categorical Features: Plot bar charts.
- Numerical Features vs Income: Create boxplots and violin plots.
- Categorical Features vs Income: Create bar plots and count plots.
- Numerical Interactions: Use pair plots or correlation heatmaps.
- Categorical Interactions: Analyze interactions with the target variable.
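The univariate and bivariate plots listed above can be sketched with pandas and matplotlib. The mini-DataFrame here is synthetic and only illustrates the plotting calls; the output filename is arbitrary:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic stand-in for a few numeric columns plus the target.
df = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "hours-per-week": [40, 13, 40, 40, 40],
    "capital-gain": [2174, 0, 0, 0, 0],
    "income": ["<=50K", "<=50K", "<=50K", "<=50K", ">50K"],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(df["age"], bins=5)                     # univariate distribution
axes[0].set_title("age")
df.boxplot(column="age", by="income", ax=axes[1])   # numerical feature vs income
corr = df.select_dtypes("number").corr()            # numerical interactions
im = axes[2].imshow(corr, cmap="coolwarm")
axes[2].set_xticks(range(len(corr)), corr.columns, rotation=45)
axes[2].set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im, ax=axes[2])
fig.tight_layout()
fig.savefig("eda_overview.png")
```

For categorical features, the same pattern applies with `df["workclass"].value_counts().plot.bar()` or a grouped count per income class.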
- Create New Features: Explore the creation of new features or modification of existing ones.
- Standardize Features: Ensure all numerical features have zero mean and unit variance.
- Dimensionality Reduction: Perform PCA to reduce the number of dimensions.
- Explained Variance: Analyze the explained variance ratio to determine the number of components to retain.
- Explained Variance Plot: Plot the explained variance ratio for each principal component.
- Scatter Plots: Create scatter plots of the first two or three principal components.
- Principal Components: Analyze the loadings of the original features on the principal components.
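The standardization and PCA steps above might look like the following sketch. The data matrix is random stand-in data (the real input would be the six numeric columns), and the 95% variance threshold is an illustrative choice, not a prescription:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the dataset's numeric columns (age, fnlwgt,
# educational-num, capital-gain, capital-loss, hours-per-week).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))

X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance
pca = PCA().fit(X_std)

# Cumulative explained variance guides how many components to retain;
# 95% is an example cutoff.
cum = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cum, 0.95) + 1)
print(pca.explained_variance_ratio_.round(3), n_components)

# Loadings: each row of pca.components_ expresses a principal component
# as a linear combination of the original (standardized) features.
Z = pca.transform(X_std)[:, :2]  # first two PCs, e.g. for a scatter plot
```

Plotting `pca.explained_variance_ratio_` as a bar chart (a scree plot) and `Z[:, 0]` against `Z[:, 1]` colored by income covers the two visualization bullets.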
- Encode Categorical Features: Use one-hot encoding or similar techniques.
- Data Splitting: Split the data into training and testing sets.
- Explore Models: Test various models like Logistic Regression, Decision Trees, Random Forests, and SVM.
- Train Models: Train the models on the training data and use cross-validation for hyperparameter tuning.
- Evaluate Performance: Use metrics such as accuracy, precision, recall, F1-score, and AUC-ROC to evaluate models.
- Compare Models: Select the best model based on performance metrics.
- Feature Importances: Analyze feature importances or coefficients.
- Validation: Test the model on a separate validation set or use k-fold cross-validation to ensure generalization.
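Putting the modeling steps together, one possible scikit-learn sketch follows. The data and label rule are synthetic toys, and logistic regression stands in for any of the candidate models; swapping in a `RandomForestClassifier` or `SVC` only changes the final pipeline step:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in: two numeric features, one categorical, toy income label.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "hours-per-week": rng.integers(10, 60, n),
    "workclass": rng.choice(["Private", "Self-emp", "Gov"], n),
})
y = (df["age"] + df["hours-per-week"] > 85).astype(int)  # illustrative rule

# One-hot encode categoricals, scale numerics, then classify -- all in one
# pipeline so cross-validation refits the preprocessing on each fold.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "hours-per-week"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
])
model = Pipeline([("prep", pre), ("clf", LogisticRegression(max_iter=1000))])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
model.fit(X_train, y_train)
print(scores.mean().round(3), model.score(X_test, y_test).round(3))
```

For the evaluation bullet, `sklearn.metrics.classification_report` and `roc_auc_score` on `X_test` give precision, recall, F1, and AUC-ROC in one pass; feature importances come from `model.named_steps["clf"].coef_` (or `feature_importances_` for tree models).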
In this project, we conducted a thorough exploratory data analysis to understand the structure and distribution of the dataset, examining both individual features and their relationships with the target variable, income. Feature engineering was performed to enhance the dataset, and PCA was used to reduce dimensionality and visualize the data in fewer dimensions. Finally, we trained and evaluated several classification models to predict income, using cross-validation and comprehensive performance metrics to ensure the models generalize.