A complete machine learning pipeline for predicting car prices using multiple regression models.
Project Title: Car Price Prediction Pipeline

Objective: Develop a robust regression model to estimate car prices using vehicle features.

Methodology: The project implements a full ML pipeline including data cleaning, feature engineering (Target/One-Hot encoding), and a comparative analysis of 5 algorithms (Linear Regression, Logistic Regression, KNN, Decision Tree, K-Means).

Outcome: A trained model capable of predicting prices with evaluated accuracy metrics (RMSE, R²).

This project implements a comprehensive ML workflow that:
- Preprocesses and cleans automotive data
- Engineers features from raw inputs
- Applies multiple encoding strategies for categorical variables
- Trains and compares 5 different machine learning models
- Provides detailed evaluation metrics and visualizations
| Model | Type | Description |
|---|---|---|
| Linear Regression | Regression | Baseline linear model for price prediction |
| Logistic Regression | Classification | Classifies cars into Low/High price categories |
| K-Nearest Neighbors | Regression | Instance-based learning for price prediction |
| Decision Tree | Regression | Tree-based model for capturing non-linear patterns |
| K-Means | Clustering | Unsupervised clustering to find natural groupings |
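
The five models in the table map naturally onto scikit-learn estimators. The sketch below is illustrative only: the synthetic data, the median threshold used to form the Low/High classes, and every hyperparameter (`max_iter`, `n_clusters`, `n_init`) are assumptions, not values taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared feature matrix and price target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
price = 15000 + 4000 * X[:, 0] - 2500 * X[:, 1] + rng.normal(scale=500, size=200)

# Assumption: "High" means above the median price; the notebook may use another cut-off.
is_high = (price > np.median(price)).astype(int)

# Regressors predict the price directly.
regressors = {
    "Linear Regression": LinearRegression(),
    "K-Nearest Neighbors": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}
for model in regressors.values():
    model.fit(X, price)

# The classifier predicts the Low/High label instead of the price.
classifier = LogisticRegression(max_iter=1000).fit(X, is_high)

# The clustering model ignores the target entirely.
clusterer = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
```

Because the classifier and the clustering model do not output a price, they cannot be ranked by RMSE alongside the three regressors.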
- Python 3.8+
- pip package manager
pip install pandas numpy scikit-learn matplotlib seaborn xgboost

The project requires a CSV file named `car_price_prediction.csv` with the following columns:
| Column | Type | Description |
|---|---|---|
| ID | int | Unique identifier |
| Price | int | Target variable (USD) |
| Levy | str | Tax levy ('-' or numeric) |
| Manufacturer | str | Car manufacturer |
| Model | str | Car model |
| Prod. year | int | Production year |
| Category | str | Vehicle category |
| Leather interior | str | Yes/No |
| Fuel type | str | Fuel type |
| Engine volume | str | Engine size (e.g., "2.0 Turbo") |
| Mileage | str | Mileage with units (e.g., "100000 km") |
| Cylinders | float | Number of cylinders |
| Gear box type | str | Transmission type |
| Drive wheels | str | Drive type (FWD/RWD/AWD) |
| Doors | str | Number of doors |
| Wheel | str | Steering wheel position |
| Color | str | Car color |
| Airbags | int | Number of airbags |
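
A quick schema check before running the notebook can catch a wrong or truncated file early. This is a minimal sketch: the helper name `missing_columns` is hypothetical, and the column list is copied from the table above.

```python
# Column names copied from the dataset table.
EXPECTED_COLUMNS = [
    "ID", "Price", "Levy", "Manufacturer", "Model", "Prod. year",
    "Category", "Leather interior", "Fuel type", "Engine volume",
    "Mileage", "Cylinders", "Gear box type", "Drive wheels",
    "Doors", "Wheel", "Color", "Airbags",
]


def missing_columns(columns):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]
```

With pandas installed, it could be called as `missing_columns(pd.read_csv("car_price_prediction.csv").columns)`; an empty list means every expected column is present.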
- Place your `car_price_prediction.csv` file in the project directory
- Open `car_price_prediction.ipynb` in Jupyter Notebook or VS Code
- Run all cells sequentially
# Or run via command line
jupyter notebook car_price_prediction.ipynb

car_price_prediction/
├── car_price_prediction.ipynb # Main notebook
├── car_price_prediction.csv # Dataset (not included)
└── README.md # This file
- Clean `Levy` column (convert '-' to 0)
- Extract numeric values from `Engine volume` (remove "Turbo")
- Parse `Mileage` (remove "km" suffix)
- Calculate vehicle `Age` from production year
- Remove duplicates and outliers (IQR method)
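
The cleaning steps above can be sketched with pandas as follows. The function name `clean` is hypothetical, and two details are assumptions because the README does not specify them: the reference year for `Age` is passed in as `current_year`, and the IQR filter is applied to `Price` only.

```python
import pandas as pd


def clean(df: pd.DataFrame, current_year: int, iqr_multiplier: float = 1.5) -> pd.DataFrame:
    """Apply the listed preprocessing steps and return a cleaned copy."""
    out = df.copy()

    # Levy: '-' becomes 0, everything else is parsed as a number.
    out["Levy"] = pd.to_numeric(out["Levy"].replace("-", "0"), errors="coerce").fillna(0)

    # Engine volume: "2.0 Turbo" -> 2.0
    engine = out["Engine volume"].astype(str).str.replace("Turbo", "", regex=False).str.strip()
    out["Engine volume"] = pd.to_numeric(engine, errors="coerce")

    # Mileage: "100000 km" -> 100000
    mileage = out["Mileage"].astype(str).str.replace("km", "", regex=False).str.strip()
    out["Mileage"] = pd.to_numeric(mileage, errors="coerce")

    # Age relative to the supplied reference year (assumption).
    out["Age"] = current_year - out["Prod. year"]

    out = out.drop_duplicates()

    # IQR rule on Price (assumption): keep rows within [Q1 - k*IQR, Q3 + k*IQR].
    q1 = out["Price"].quantile(0.25)
    q3 = out["Price"].quantile(0.75)
    iqr = q3 - q1
    keep = out["Price"].between(q1 - iqr_multiplier * iqr, q3 + iqr_multiplier * iqr)
    return out[keep].reset_index(drop=True)
```

Dropping duplicates after the string columns are parsed means that rows differing only in formatting (for example stray whitespace) are also treated as duplicates.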
- One-Hot Encoding: Leather interior, Gear box type, Drive wheels, Wheel
- Target Encoding: Fuel type, Model, Airbags, Cylinders, Manufacturer
- Label Encoding: Category, Color
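
Two of the encodings listed above can be illustrated on a toy frame. This sketch assumes plain mean target encoding with no smoothing, fitted on the training rows only and falling back to the global mean for unseen categories; the notebook may use a different variant. The helper names are hypothetical.

```python
import pandas as pd


def fit_target_encoding(train: pd.DataFrame, column: str, target: str = "Price"):
    """Return the per-category target means and the global mean as a fallback."""
    means = train.groupby(column)[target].mean()
    return means, train[target].mean()


def apply_target_encoding(df: pd.DataFrame, column: str, means: pd.Series, fallback: float) -> pd.Series:
    """Replace each category with its training mean; unseen categories get the fallback."""
    return df[column].map(means).fillna(fallback)


train = pd.DataFrame({
    "Fuel type": ["Petrol", "Petrol", "Diesel"],
    "Leather interior": ["Yes", "No", "Yes"],
    "Price": [10000, 20000, 40000],
})
test = pd.DataFrame({"Fuel type": ["Petrol", "Hybrid"]})

# Target encoding: statistics come from the training rows only.
means, fallback = fit_target_encoding(train, "Fuel type")
train_fuel = apply_target_encoding(train, "Fuel type", means, fallback)
test_fuel = apply_target_encoding(test, "Fuel type", means, fallback)

# One-hot encoding: one indicator column per category value.
one_hot = pd.get_dummies(train, columns=["Leather interior"])
```

Fitting the means on the training split alone keeps test-set prices out of the features, which would otherwise inflate the evaluation scores.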
- 80/20 train-test split
- StandardScaler for numerical features
- Hyperparameters configurable via the `Config` class
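
A minimal sketch of the split-and-scale step, assuming the scaler is fitted on the training portion only and then reused for the test portion. The function name `split_and_scale` is hypothetical; the defaults mirror the 80/20 split and the `RANDOM_STATE` value shown in `Config`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_and_scale(X, y, test_size=0.2, random_state=42):
    """Split into train/test sets, then standardize features using training statistics."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training rows
    X_test_scaled = scaler.transform(X_test)        # reuse them for the test rows
    return X_train_scaled, X_test_scaled, y_train, y_test


# Toy data: 100 rows, 2 numerical features.
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.arange(100, dtype=float)
X_train, X_test, y_train, y_test = split_and_scale(X, y)
```

After scaling, each training feature has zero mean and unit variance; the test features are only approximately standardized because they use the training statistics.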
- RMSE (Root Mean Squared Error)
- MAE (Mean Absolute Error)
- R² (Coefficient of Determination)
- MAPE (Mean Absolute Percentage Error)
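
The four metrics can be computed directly from their definitions. The sketch below uses NumPy only; the function name `regression_metrics` is hypothetical, and MAPE is reported as a percentage, which is an assumption about the notebook's convention.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, R² and MAPE (in percent) for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    mae = float(np.mean(np.abs(residuals)))

    # R² = 1 - SS_res / SS_tot
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    # MAPE is undefined when an actual value is 0.
    mape = float(np.mean(np.abs(residuals / y_true)) * 100)

    return {"RMSE": rmse, "MAE": mae, "R2": r2, "MAPE": mape}


metrics = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
```

RMSE and MAE are in the same unit as the target (USD here), while R² and MAPE are unit-free, so the latter two are easier to compare across datasets.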
Modify the Config class in the notebook to customize:
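class Config:
    TARGET = 'Price'        # column to predict
    TEST_SIZE = 0.2         # fraction of rows held out for testing
    RANDOM_STATE = 42       # seed for reproducible results
    OUTLIER_REMOVAL = True  # toggle IQR-based outlier removal
    IQR_MULTIPLIER = 1.5    # k in the IQR rule: [Q1 - k*IQR, Q3 + k*IQR]

After running the notebook, you'll see: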
- Model comparison table sorted by RMSE
- Bar charts comparing RMSE, MAE, and R² across models
- Actual vs. Predicted scatter plot for the best model
This project is open source and available for educational purposes.