🏡 California Housing Price Prediction Model

A data science project built to help property industry clients understand and predict residential house prices — using real-world census data, smart data preparation, and a series of increasingly powerful machine learning models.

📌 Table of Contents

Project Background
What Problem Are We Solving?
Project Workflow Overview
The Dataset
Step 1 — Exploratory Data Analysis (EDA)
Step 2 — Data Imputation (Filling in the Gaps)
Step 3 — Feature Engineering
Step 4 — Encoding Categorical Data
Step 5 — Outlier Handling
Step 6 — Model Training & Results
Model Performance Summary
Business Value & Takeaways
Tech Stack

📋 Project Background

This project was developed in response to a brief from a client operating in the property and real estate sector. The client provided a dataset containing housing census information from California, and the goal was to build a model capable of predicting the median house value of a given district based on a range of measurable factors.

The ability to estimate property values accurately — even at a neighbourhood or district level — has significant commercial value: from advising buyers and sellers, to supporting investment decisions, pricing strategies, and portfolio risk assessments.

❓ What Problem Are We Solving?

"Given what we know about a neighbourhood — its location, income levels, housing age, and population density — can we reliably predict what the median house price will be?"

A property business might want to answer questions like:

Which areas are likely to see higher property values?
Is a property priced fairly relative to its neighbourhood characteristics?
What factors drive house prices up or down the most?
Can we automate valuations at scale to reduce reliance on manual appraisals?

This model addresses all of these by learning patterns from historical data and producing a predicted median house value for any district.

🗺 Project Workflow Overview

The project follows a structured data science pipeline. Each stage feeds into the next:

flowchart TD
    A([📥 Load Raw Data\n20,640 housing records]) --> B([🔍 Exploratory Data Analysis\nUnderstand distributions & correlations])
    B --> C([🩹 Data Imputation\nFill 207 missing bedroom values using KNN])
    C --> D([🔧 Feature Engineering\nCreate smarter, more predictive variables])
    D --> E([🏷️ Encode Categories\nConvert ocean_proximity to numbers])
    E --> F([✂️ Outlier Handling\nRemove capped & extreme values])
    F --> G([🤖 Model Training\nTrain 3 different algorithms])
    G --> H([📊 Evaluate Performance\nMAE · MSE · RMSE · R²])
    H --> I([✅ Best Model Selected\nRandom Forest — R² = 0.75])

    style A fill:#e8f4f8,stroke:#2980b9
    style I fill:#e8f8e8,stroke:#27ae60

📦 The Dataset

The dataset was sourced from the 1990 California Housing Census and contains 20,640 records, each representing a census block group (a small geographic district). It includes the following information:

Column	What It Means
`longitude` / `latitude`	Geographic coordinates of the district
`housing_median_age`	Median age of homes in the district
`total_rooms`	Total number of rooms across all homes
`total_bedrooms`	Total number of bedrooms (207 missing values)
`population`	Number of people in the district
`households`	Number of households
`median_income`	Median household income (scaled units)
`median_house_value`	Our target — the value we want to predict
`ocean_proximity`	Categorical label: how close the district is to the ocean

🔍 Step 1 — Exploratory Data Analysis (EDA)

Before building anything, we need to understand the data. EDA is the process of visualising and summarising the dataset to spot patterns, anomalies, and relationships.

What We Looked At

Distribution of each variable — a histogram for every column tells us:

Are values skewed or normally spread?
Are there suspicious spikes at a single value? (e.g. housing_median_age is capped at 52 and median_house_value is capped at $500,001, both artificial limits in the data)
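One quick way to confirm a cap without reading a histogram is to measure how many rows sit exactly at a column's maximum. The sketch below uses a small made-up DataFrame, and `share_at_max` is a hypothetical helper, not a function from the project.

```python
import pandas as pd

# Hypothetical toy rows; the real values come from the census file.
df = pd.DataFrame({
    "housing_median_age": [12, 30, 52, 52, 52, 41, 52, 18],
    "median_house_value": [120000, 500001, 310000, 500001,
                           95000, 500001, 220000, 180000],
})

def share_at_max(series: pd.Series) -> float:
    """Fraction of rows sitting exactly at the column's maximum."""
    return float((series == series.max()).mean())

for column in df.columns:
    print(f"{column}: max={df[column].max()}, "
          f"share at max={share_at_max(df[column]):.3f}")
```

A large share of rows at the maximum is a hint that the value was clipped during data collection rather than observed.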

Correlation heatmap — a grid showing how strongly each variable relates to every other:

graph LR
    subgraph Strong_Positive["🟢 Positively Correlated with House Value"]
        MI[median_income\n+0.69]
    end
    subgraph Weak["🟡 Weakly Correlated"]
        HMA[housing_median_age\n+0.11]
        TR[total_rooms\n+0.13]
    end
    subgraph Negative["🔴 Negatively Correlated"]
        LAT[latitude\n-0.14]
    end
    MI --> HV[🏠 median_house_value]
    HMA --> HV
    TR --> HV
    LAT --> HV

Key Insight

median_income is by far the strongest single predictor of house value — districts where people earn more tend to have much higher property prices. This is intuitive and gives us early confidence that the model has meaningful signal to learn from.

Also notable: longitude and latitude have a −0.92 correlation with each other, meaning the two columns carry largely overlapping geographic information (California runs diagonally from north-west to south-east). We later combine them into a single variable.
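The ranking behind a heatmap like this can be reproduced with `DataFrame.corr()`. The sketch below runs on synthetic data generated for illustration (the coefficients and noise level are assumptions), so the printed correlations will not match the figures quoted above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for two of the census columns (assumed relationship).
income = rng.uniform(1, 10, n)
age = rng.uniform(1, 52, n)
value = 40000 * income + rng.normal(0, 60000, n)

df = pd.DataFrame({
    "median_income": income,
    "housing_median_age": age,
    "median_house_value": value,
})

# Pearson correlation of every feature with the target, strongest first.
corr_with_target = (
    df.corr()["median_house_value"]
    .drop("median_house_value")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

The full `df.corr()` matrix is what a Seaborn heatmap would visualise; sorting a single column of it is often enough to pick out the strongest predictors.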

🩹 Step 2 — Data Imputation (Filling in the Gaps)

The Problem

The total_bedrooms column had 207 missing values, roughly 1% of the dataset. Dropping those rows entirely would lose real data. Filling them with a single column-wide average would ignore how much districts differ in size.

The Solution: KNN Imputation

We used a technique called K-Nearest Neighbours (KNN) Imputation. Instead of guessing with a blanket average, this method:

flowchart LR
    A[Row with\nmissing bedroom value] --> B[Find 3 most similar\nrows in the dataset\nbased on other columns]
    B --> C[Average their\nbedroom values]
    C --> D[Fill in the\nmissing value]

    style B fill:#fff3cd,stroke:#f39c12
    style D fill:#d4edda,stroke:#27ae60

Why this matters for the business: Imputing intelligently rather than crudely means we keep a fuller, more accurate dataset. Better data in = better predictions out.
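A minimal sketch of the idea with scikit-learn's `KNNImputer` and 3 neighbours follows. The rows are invented for illustration, and the source does not say whether features were scaled before imputation, so this example leaves them unscaled.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical districts; the last one is missing its bedroom count.
df = pd.DataFrame({
    "total_rooms":    [1000.0, 1100.0, 1050.0, 5000.0, 5200.0, 1020.0],
    "households":     [300.0, 320.0, 310.0, 1500.0, 1550.0, 305.0],
    "total_bedrooms": [200.0, 220.0, 210.0, 1000.0, 1040.0, np.nan],
})

# Each missing value is replaced by the mean of that column across the
# 3 rows that are closest on the columns both rows have filled in.
imputer = KNNImputer(n_neighbors=3)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputed.loc[5, "total_bedrooms"])
```

Here the three small districts are the nearest neighbours, so the gap is filled from their bedroom counts rather than from an average dragged upward by the two large districts.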

🔧 Step 3 — Feature Engineering

Raw columns don't always tell the most useful story. Feature engineering is the process of creating new variables from existing ones that are more meaningful to the model.

Why Raw Totals Are Misleading

A district with 10,000 total rooms sounds large, but if it has 5,000 households, that's only 2 rooms per household. A district with just 500 total rooms and 50 households averages 10 rooms each. The raw total is misleading without context.

New Features Created

graph TD
    TR[total_rooms] --> RPH["rooms_per_household\n= total_rooms ÷ households\n💡 Average living space per home"]
    TB[total_bedrooms] --> BPR["bedrooms_per_room\n= total_bedrooms ÷ total_rooms\n💡 How much of the space is bedrooms"]
    POP[population] --> PPH["population_per_household\n= population ÷ households\n💡 How crowded is the average home"]
    LON[longitude] --> COORD["coordinates\n= longitude ÷ latitude\n💡 Combined location signal"]
    LAT[latitude] --> COORD

    style RPH fill:#e8f4f8,stroke:#2980b9
    style BPR fill:#e8f4f8,stroke:#2980b9
    style PPH fill:#e8f4f8,stroke:#2980b9
    style COORD fill:#e8f4f8,stroke:#2980b9

Columns Removed After Engineering

Once the engineered features were created, the original raw columns they were derived from were dropped to reduce noise and redundancy:

latitude, longitude, total_bedrooms, total_rooms, households, population

Why this matters: Cleaner, more meaningful features help the model focus on the patterns that actually drive house prices — not noisy raw totals.
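The ratios described in this step can be sketched in pandas as follows. The two rows are made-up districts that mirror the 2-rooms versus 10-rooms example, and the column names follow the dataset table.

```python
import pandas as pd

# Two hypothetical districts: one large, one small.
df = pd.DataFrame({
    "longitude": [-122.23, -118.30],
    "latitude": [37.88, 34.05],
    "total_rooms": [10000.0, 500.0],
    "total_bedrooms": [2000.0, 100.0],
    "population": [15000.0, 120.0],
    "households": [5000.0, 50.0],
})

# Per-household and per-room ratios replace the raw totals.
df["rooms_per_household"] = df["total_rooms"] / df["households"]
df["bedrooms_per_room"] = df["total_bedrooms"] / df["total_rooms"]
df["population_per_household"] = df["population"] / df["households"]
df["coordinates"] = df["longitude"] / df["latitude"]

# Drop the raw columns once the derived features exist.
features = df.drop(columns=[
    "latitude", "longitude", "total_bedrooms",
    "total_rooms", "households", "population",
])
print(features)
```

The large district ends up with the smaller `rooms_per_household`, which is the signal the raw `total_rooms` column hides.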

🏷️ Step 4 — Encoding Categorical Data

Machine learning models work with numbers, not text. The ocean_proximity column contained 5 text categories:

Category	Count
`<1H OCEAN`	9,136
`INLAND`	6,551
`NEAR OCEAN`	2,658
`NEAR BAY`	2,290
`ISLAND`	5

One-Hot Encoding

We used a method called one-hot encoding, which converts each category into its own yes/no (1/0) column:

ocean_proximity_<1H OCEAN  | ocean_proximity_INLAND | ocean_proximity_ISLAND | ...
          0                |           0            |           0            |  ...
          1                |           0            |           0            |  ...

This avoids implying any false ordering between categories (e.g. that "ISLAND" is greater than "INLAND").
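One common way to produce such columns is `pandas.get_dummies`. The source does not say which encoder was used, so treat this as one possible sketch rather than the project's exact code.

```python
import pandas as pd

# A handful of hypothetical rows covering all five categories.
df = pd.DataFrame({
    "ocean_proximity": ["INLAND", "<1H OCEAN", "NEAR BAY",
                        "ISLAND", "NEAR OCEAN", "INLAND"],
})

# One 0/1 column per category, named <column>_<category>.
encoded = pd.get_dummies(df, columns=["ocean_proximity"], dtype=int)
print(encoded)
```

Every row has exactly one 1 across the five new columns, so no category is treated as numerically larger than another.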

✂️ Step 5 — Outlier Handling

Why Outliers Are a Problem

Outliers are extreme data points that can skew the model's learning. In this dataset, two artificial caps existed in the raw data:

housing_median_age: capped at 52 years (any district whose median home age exceeded 52 was recorded as 52)
median_house_value: capped at $500,001 (any district whose median value was higher was recorded as $500,001)

These aren't real values — they're data collection artefacts. Including them teaches the model the wrong lesson.

Two-Stage Removal

flowchart TD
    A[18,388 rows after\ninitial cap removal] --> B{IQR Filter\nmedian_house_value}
    B --> C{IQR Filter\nhousing_median_age}
    C --> D[Cleaner dataset\nfewer extreme values]

    subgraph IQR["📐 How IQR Works"]
        E[Calculate the\nmiddle 50% of values\nQ1 to Q3]
        F[Anything more than\n1.5× that range\nbelow Q1 or above Q3\nis flagged as an outlier]
        E --> F
    end

    style D fill:#d4edda,stroke:#27ae60

Why two stages? Cap removal handles known data artefacts. IQR handles genuinely anomalous values that could unfairly skew the model.

Impact on model accuracy: Removing outliers improved R² from 0.52 → 0.60 — a meaningful jump in predictive power just from cleaner data.
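The two stages can be sketched as a pair of small filters. `drop_capped` and `iqr_filter` are hypothetical helper names, the rows are invented, and dropping rows at both caps is an assumption about how the cap removal was done.

```python
import pandas as pd

def drop_capped(df: pd.DataFrame, column: str, cap: float) -> pd.DataFrame:
    """Stage 1: remove rows recorded at (or above) a known cap."""
    return df[df[column] < cap]

def iqr_filter(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Stage 2: keep rows within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]

# Hypothetical districts, including one at each cap and one extreme value.
df = pd.DataFrame({
    "housing_median_age": [10, 20, 25, 30, 35, 52, 15, 28],
    "median_house_value": [140000, 150000, 160000, 170000,
                           180000, 200000, 500001, 480000],
})

no_caps = drop_capped(df, "median_house_value", 500001)
no_caps = drop_capped(no_caps, "housing_median_age", 52)

cleaned = iqr_filter(no_caps, "median_house_value")
cleaned = iqr_filter(cleaned, "housing_median_age")
print(len(df), len(no_caps), len(cleaned))
```

Note that the IQR bounds are recomputed on whatever rows survive the previous filter, so the order of the two IQR passes can change which rows are kept.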

🤖 Step 6 — Model Training & Results

Three different algorithms were trained and compared. Think of each as a different strategy for learning the relationship between the input features and house prices.

The Train/Test Split

Before training, the data was split:

pie title Data Split
    "Training Data (80%)" : 80
    "Test Data (20%)" : 20

The model only ever sees the training data during learning. The test data is held back and used to simulate real-world predictions — checking whether the model can generalise beyond what it was trained on.
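An 80/20 split like this is typically done with scikit-learn's `train_test_split`. The arrays below are placeholders, and `random_state=42` is an arbitrary choice for reproducibility, not a value taken from the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix (100 rows, 2 columns) and target.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Hold back 20% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```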

Model 1 — Linear Regression (Baseline)

What it does: Draws the best possible straight line through the data. Simple, fast, and interpretable.

Analogy: Like fitting a ruler to a scatter of dots — it finds the single straight line that comes closest to all points.

Metric	Before Outlier Removal	After Outlier Removal
MAE	$54,486	$41,326
RMSE	$79,707	$54,665
R²	0.52	0.60

Limitation: House prices don't follow a perfectly straight line. Many factors interact in complex, non-linear ways (e.g. income has a disproportionately large effect at higher levels). A straight line can only capture so much.
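A minimal baseline of this kind, fitted on synthetic data with a single income feature, might look like the sketch below. The slope, intercept and noise level are assumptions chosen for illustration, so the metrics it prints are unrelated to the table above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: value rises linearly with income, plus noise.
income = rng.uniform(1, 10, size=400)
value = 45000 * income + 20000 + rng.normal(0, 30000, size=400)
X = income.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, value, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
print(f"MAE={mae:,.0f}  RMSE={rmse:,.0f}  R2={r2:.2f}")
```

The fitted `model.coef_` is directly readable as "extra dollars of house value per unit of income", which is the interpretability advantage of a linear baseline.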

Model 2 — Polynomial Regression (Adding Curves)

What it does: Extends linear regression by allowing the model to learn curved relationships. A degree-2 polynomial can fit a parabola rather than just a straight line.

Analogy: Instead of fitting a ruler, you're fitting a curved piece of flexible wire.

Metric	Result
MAE	$38,220
RMSE	$55,874
R²	0.58

Observation: Interestingly, polynomial regression did not improve on the cleaned linear model overall. MAE fell ($41,326 → $38,220), but RMSE rose slightly ($54,665 → $55,874) and R² dipped from 0.60 to 0.58. This suggests that the data's non-linearity is more complex than a simple curve, which motivates the third model.
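To show what a degree-2 expansion adds, the sketch below fits both a straight line and a polynomial pipeline to synthetic data that is deliberately parabolic. The data is an assumption made for illustration; on the real housing features the gain was much smaller, as the results above show.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

# Synthetic curved relationship: y depends on x squared.
x = rng.uniform(-3, 3, size=300)
y = x**2 + rng.normal(0, 0.3, size=300)
X = x.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

linear = LinearRegression().fit(X_train, y_train)
poly = make_pipeline(
    PolynomialFeatures(degree=2), LinearRegression()
).fit(X_train, y_train)

linear_r2 = r2_score(y_test, linear.predict(X_test))
poly_r2 = r2_score(y_test, poly.predict(X_test))
print(f"linear R2={linear_r2:.2f}  polynomial R2={poly_r2:.2f}")

# With two input columns a and b, degree 2 expands to 1, a, b, a^2, ab, b^2.
n_expanded = PolynomialFeatures(degree=2).fit_transform(np.ones((1, 2))).shape[1]
print(n_expanded)
```

The expansion grows quickly with the number of input columns, which is one reason a polynomial model can fit noise instead of signal on wider datasets.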

Model 3 — Random Forest Regressor (Best Performer) ✅

What it does: Trains a large number of independent decision trees (100 in this case), each learning slightly different patterns from random subsets of the data. The final prediction is the average across all trees.

Analogy: Instead of asking one expert, you ask 100 different analysts who each studied different parts of the data — and you take the consensus answer. No single analyst's blind spot dominates the outcome.

graph TD
    Input[New District Data] --> T1[Tree 1\nPredicts $280,000]
    Input --> T2[Tree 2\nPredicts $310,000]
    Input --> T3[Tree 3\nPredicts $295,000]
    Input --> TN[... 97 more trees ...]
    T1 --> AVG[🎯 Average Prediction\n≈ $295,000]
    T2 --> AVG
    T3 --> AVG
    TN --> AVG

    style AVG fill:#d4edda,stroke:#27ae60

Metric	Result
MAE	$30,409
RMSE	$43,455
R²	0.75
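The averaging step can be checked directly in scikit-learn: a fitted `RandomForestRegressor` exposes its individual trees through `estimators_`. The sketch below uses synthetic features and an assumed target formula, with 100 trees as described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic features and a target with an interaction term (assumed).
X = rng.uniform(0, 10, size=(300, 3))
y = 30000 * X[:, 0] + 5000 * X[:, 1] * X[:, 2] + rng.normal(0, 10000, 300)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Ask every tree for its own prediction on one district.
new_district = X[:1]
per_tree = np.array(
    [tree.predict(new_district)[0] for tree in forest.estimators_]
)
forest_pred = forest.predict(new_district)[0]

print(f"trees range from {per_tree.min():,.0f} to {per_tree.max():,.0f}")
print(f"mean of trees = {per_tree.mean():,.0f}, forest = {forest_pred:,.0f}")
```

The spread between the lowest and highest tree gives a rough sense of how much the individual trees disagree about a given district.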

📊 Model Performance Summary

Model	MAE	RMSE	R²	Notes
Linear Regression (raw data)	$54,486	$79,707	0.52	Starting baseline
Linear Regression (cleaned)	$41,326	$54,665	0.60	+8 pts from cleaning alone
Polynomial Regression	$38,220	$55,874	0.58	Lower MAE, but slightly worse RMSE and R² than cleaned linear
Random Forest	$30,409	$43,455	0.75	Best model

How to Read These Metrics

R²  (R-squared)  → How much of the price variation our model explains.
                   0 = no better than always predicting the average price, 1 = perfect. 0.75 = explains 75% of variation.

MAE (Mean Absolute Error) → On average, how far off is each prediction?
                            $30,409 means we're typically within ~$30K of the real price.

RMSE (Root Mean Squared Error) → Like MAE but penalises large errors more.
                                  Useful for spotting if occasional big misses are a problem.
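The four metrics can be computed by hand in a few lines of NumPy, which makes their definitions concrete. The prices below are invented for illustration.

```python
import numpy as np

# Four hypothetical districts: actual vs predicted median value.
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_pred = np.array([210000.0, 140000.0, 270000.0, 250000.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # average size of a miss
mse = np.mean(errors**2)        # squaring weights big misses more heavily
rmse = np.sqrt(mse)             # back in dollars

# R² compares the model's squared error with that of always
# predicting the mean of the actual values.
ss_res = np.sum(errors**2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:,.0f}  MSE={mse:,.0f}  RMSE={rmse:,.0f}  R2={r2:.3f}")
```

In this toy case a single $30,000 miss pulls RMSE well above MAE, which is the pattern to look for when occasional large errors matter.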

💼 Business Value & Takeaways

What This Model Delivers

Automated valuation at scale — rather than manually appraising each district, the model can instantly generate price estimates for thousands of areas using available data.
Income is the #1 driver — median_income is the strongest predictor. This validates the intuition that purchasing power in a neighbourhood largely determines property values, and it's a signal property businesses should weight heavily in their own analysis.
Location matters, but not in isolation — proximity to the ocean and geographic coordinates are meaningful signals, but they don't override income effects.
Data quality pays off: cleaning alone lifted the linear baseline from R² 0.52 → 0.60 before any change of algorithm, and the switch to Random Forest took it from 0.60 → 0.75. Clean data is the foundation that the stronger model builds on.
The model explains 75% of house price variation — the remaining 25% is likely driven by factors not in this dataset (property condition, school quality, local amenities, market timing), which is a natural next step for improvement.

Potential Next Steps for the Client

Incorporate additional data sources (e.g. school ratings, crime statistics, proximity to transport)
Retrain on more recent data, since this dataset is from the 1990 census
Build a lightweight scoring tool that agents or analysts can use in practice
Explore gradient boosting models (XGBoost, LightGBM) which often outperform Random Forest on tabular data

🛠 Tech Stack

Tool	Purpose
`Python`	Core programming language
`Pandas`	Data loading, manipulation, and cleaning
`NumPy`	Numerical operations
`Scikit-learn`	Imputation, scaling, model training, evaluation
`Matplotlib / Seaborn`	Data visualisation
`Jupyter Notebook`	Interactive development environment

Built as part of a client-scoped data science project. Dataset: California Housing Census (1990). Goal: Predict district-level median house values to support property market decision-making. Feel free to reach out (entry-level Data Scientist): Jabulani.Ndlovu@gmail.com
