This project involves a comprehensive analysis of a dataset with the following columns:
- age: Age of the individual.
- workclass: Type of employment (e.g., Private, Self-Employed).
- fnlwgt: Final weight (an estimate of the number of people in the population the observation represents).
- education: Highest level of education attained.
- educational-num: Ordinal number encoding the education level.
- marital-status: Marital status (e.g., Married, Single).
- occupation: Occupation of the individual.
- relationship: Relationship within the household (e.g., Husband, Not-in-family).
- race: Race of the individual.
- gender: Gender of the individual.
- capital-gain: Capital gains.
- capital-loss: Capital losses.
- hours-per-week: Hours worked per week.
- native-country: Country of origin.
- income: Income category (<=50K or >50K).
All columns contain 48,842 non-null entries, ensuring a complete dataset for analysis.
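The initial inspection steps can be sketched as follows. The `read_csv` call is commented out because the filename is an assumption; a tiny synthetic DataFrame with the same schema stands in so the calls are runnable:

```python
import pandas as pd

# Hypothetical loading step -- "adult.csv" is an assumed filename.
# df = pd.read_csv("adult.csv")

# Tiny synthetic stand-in with a subset of the real columns, for illustration.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["Private", "Self-emp-not-inc", "Private"],
    "hours-per-week": [40, 13, 40],
    "income": ["<=50K", "<=50K", ">50K"],
})

print(df.head())          # preview the first few rows
print(df.isnull().sum())  # per-column missing-value counts
print(df.describe())      # summary statistics for numerical features
```

On the full dataset, `df.isnull().sum()` returning all zeros is what confirms the "48,842 non-null entries" claim above.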
- Display Dataset: Preview the first few rows of the dataset.
- Missing Values: Check for missing values and handle them appropriately.
- Summary Statistics: Summarize the statistics for numerical features.
- Numerical Features: Plot histograms and boxplots.
- Categorical Features: Plot bar charts.
- Numerical Features vs Income: Create boxplots and violin plots.
- Categorical Features vs Income: Create bar plots and count plots.
- Numerical Interactions: Use pair plots or correlation heatmaps.
- Categorical Interactions: Analyze interactions with the target variable.
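The univariate and bivariate plots listed above can be sketched with pandas and matplotlib. The mini-DataFrame here is synthetic and only illustrates the plotting calls; the output filename is arbitrary:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic stand-in for a few numeric columns plus the target.
df = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "hours-per-week": [40, 13, 40, 40, 40],
    "capital-gain": [2174, 0, 0, 0, 0],
    "income": ["<=50K", "<=50K", "<=50K", "<=50K", ">50K"],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(df["age"], bins=5)                     # univariate distribution
axes[0].set_title("age")
df.boxplot(column="age", by="income", ax=axes[1])   # numerical feature vs income
corr = df.select_dtypes("number").corr()            # numerical interactions
im = axes[2].imshow(corr, cmap="coolwarm")
axes[2].set_xticks(range(len(corr)), corr.columns, rotation=45)
axes[2].set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im, ax=axes[2])
fig.tight_layout()
fig.savefig("eda_overview.png")
```

For categorical features, the same pattern applies with `df["workclass"].value_counts().plot.bar()` or a grouped count per income class.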
- Create New Features: Explore the creation of new features or modification of existing ones.
- Standardize Features: Ensure all numerical features have zero mean and unit variance.
- Dimensionality Reduction: Perform PCA to reduce the number of dimensions.
- Explained Variance: Analyze the explained variance ratio to determine the number of components to retain.
- Explained Variance Plot: Plot the explained variance ratio for each principal component.
- Scatter Plots: Create scatter plots of the first two or three principal components.
- Principal Components: Analyze the loadings of the original features on the principal components.
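The standardization and PCA steps above might look like the following sketch. The data matrix is random stand-in data (the real input would be the six numeric columns), and the 95% variance threshold is an illustrative choice, not a prescription:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the dataset's numeric columns (age, fnlwgt,
# educational-num, capital-gain, capital-loss, hours-per-week).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))

X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance
pca = PCA().fit(X_std)

# Cumulative explained variance guides how many components to retain;
# 95% is an example cutoff.
cum = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cum, 0.95) + 1)
print(pca.explained_variance_ratio_.round(3), n_components)

# Loadings: each row of pca.components_ expresses a principal component
# as a linear combination of the original (standardized) features.
Z = pca.transform(X_std)[:, :2]  # first two PCs, e.g. for a scatter plot
```

Plotting `pca.explained_variance_ratio_` as a bar chart (a scree plot) and `Z[:, 0]` against `Z[:, 1]` colored by income covers the two visualization bullets.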
- Encode Categorical Features: Use one-hot encoding or similar techniques.
- Data Splitting: Split the data into training and testing sets.
- Explore Models: Test various models like Logistic Regression, Decision Trees, Random Forests, and SVM.
- Train Models: Train the models on the training data and use cross-validation for hyperparameter tuning.
- Evaluate Performance: Use metrics such as accuracy, precision, recall, F1-score, and AUC-ROC to evaluate models.
- Compare Models: Select the best model based on performance metrics.
- Feature Importances: Analyze feature importances or coefficients.
- Validation: Test the model on a separate validation set or use k-fold cross-validation to ensure generalization.
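Putting the modeling steps together, one possible scikit-learn sketch follows. The data and label rule are synthetic toys, and logistic regression stands in for any of the candidate models; swapping in a `RandomForestClassifier` or `SVC` only changes the final pipeline step:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in: two numeric features, one categorical, toy income label.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "hours-per-week": rng.integers(10, 60, n),
    "workclass": rng.choice(["Private", "Self-emp", "Gov"], n),
})
y = (df["age"] + df["hours-per-week"] > 85).astype(int)  # illustrative rule

# One-hot encode categoricals, scale numerics, then classify -- all in one
# pipeline so cross-validation refits the preprocessing on each fold.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "hours-per-week"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
])
model = Pipeline([("prep", pre), ("clf", LogisticRegression(max_iter=1000))])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
model.fit(X_train, y_train)
print(scores.mean().round(3), model.score(X_test, y_test).round(3))
```

For the evaluation bullet, `sklearn.metrics.classification_report` and `roc_auc_score` on `X_test` give precision, recall, F1, and AUC-ROC in one pass; feature importances come from `model.named_steps["clf"].coef_` (or `feature_importances_` for tree models).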
In this project, we conducted a thorough exploratory data analysis to understand the structure and distribution of the dataset, examining both individual features and their relationships with the target variable, income. Feature engineering was performed to enhance the dataset, and PCA was used to reduce dimensionality and visualize the data in fewer dimensions. Finally, we trained and evaluated several classification models to predict income, using cross-validation and comprehensive performance metrics to ensure the models generalize.