# 📘 End-to-End Insurance Risk Analytics & Predictive Modeling

A complete, modular, production-ready machine learning pipeline for
insurance analytics.

---

## Project Overview

This project implements a **fully modular end-to-end ML pipeline** for
insurance risk analytics and predictive modeling. It supports real-world
insurance business applications such as:

- Analyzing historical policies, claims, and exposures
- Performing EDA and anomaly detection
- Conducting hypothesis tests to validate key risk drivers
- Building models for claim probability, claim severity, and premium optimization
- End-to-end reproducible ML pipeline with CI/CD support
- Integrated reporting, logging, and versioning
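The hypothesis-testing step (e.g. "does claim frequency differ across customer segments?") can be sketched with a chi-square test of independence. The data, segment labels, and column names below are synthetic placeholders, not the project's actual schema:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic example: segment is generated independently of claim occurrence.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "segment": rng.choice(["A", "B", "C"], size=1000),
    "has_claim": rng.choice([0, 1], size=1000, p=[0.9, 0.1]),
})

# Contingency table of segment vs. claim occurrence
table = pd.crosstab(df["segment"], df["has_claim"])
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.3f}, dof={dof}")
```

A small p-value would be evidence that claim frequency varies by segment; on real data the same pattern applies with the actual risk-driver columns.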

---

## Business Objective

**AlphaCare Insurance Solutions (ACIS)** aims to:

- Identify **low-risk customer segments**
- Optimize **premium pricing** while maximizing profitability
- Understand **factors contributing to claims**
- Support **actuarial and underwriting decisions**
- Enhance customer retention with targeted strategies

---

## Full Project Folder Structure

```
End-to-End-Insurance-Risk-Analytics-Predictive-Modeling/
├── .github/
│   └── workflows/                 # CI/CD pipelines (tests, linting, DVC)
├── configs/
│   ├── data.yaml                  # Dataset configuration
│   ├── dvc_remote.yaml            # DVC remote configuration
│   ├── logs.yaml                  # Logging settings
│   └── modeling.yaml              # ML model configurations
├── data/
│   ├── raw/                       # Original data (untouched)
│   └── processed/                 # Cleaned & feature-engineered data
├── docs/                          # Documentation & reports
├── notebooks/
│   ├── analysis/
│   │   ├── hypothesis_tests.ipynb
│   │   └── model_building.ipynb
│   └── exploration/
│       ├── data_overview.ipynb
│       └── eda.ipynb
├── scripts/
│   ├── __init__.py
│   ├── clean_data.py
│   ├── run_eda_pipeline.py
│   ├── run_hypothesis_tests.py
│   └── train_models.py
├── src/
│   └── insurance_analytics/
│       ├── __init__.py
│       ├── core/
│       │   ├── __init__.py
│       │   ├── config.py
│       │   ├── logger.py
│       │   ├── registry.py
│       │   └── scheduler.py
│       ├── eda/
│       │   ├── __init__.py
│       │   ├── exploration.py
│       │   └── visualization.py
│       ├── models/
│       │   ├── __init__.py
│       │   ├── evaluation.py
│       │   ├── interpretability.py
│       │   ├── linear_regression.py
│       │   ├── random_forest.py
│       │   └── xgboost_model.py
│       ├── preprocessing/
│       │   ├── __init__.py
│       │   ├── cleaner.py
│       │   └── feature_engineering.py
│       ├── utils/
│       │   ├── __init__.py
│       │   ├── io_utils.py
│       │   ├── metrics.py
│       │   ├── project_root.py
│       │   ├── system.py
│       │   └── validation.py
│       └── viz/
│           ├── __init__.py
│           └── plots.py
├── tests/
│   ├── integration/
│   │   ├── __init__.py
│   │   ├── test_dvc_integration.py
│   │   ├── test_eda_pipeline.py
│   │   ├── test_full_pipeline.py
│   │   └── test_model_pipeline.py
│   └── unit/
│       ├── __init__.py
│       ├── test_cleaners.py
│       ├── test_features.py
│       ├── test_hypothesis.py
│       ├── test_loaders.py
│       ├── test_models.py
│       └── test_registry.py
├── .gitignore
├── README.md
└── requirements.txt
```

---

## How to Run the Project

### Create a Virtual Environment

```bash
python -m venv venv
venv\Scripts\activate # Windows
source venv/bin/activate # macOS/Linux
```

### Install Dependencies
```bash
pip install -r requirements.txt
```

### Run Data Cleaning

```bash
python scripts/clean_data.py
```
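The README does not show what `clean_data.py` does internally; the sketch below illustrates the kind of steps such a script typically performs. The `premium` column and the cleaning rules are hypothetical, not the project's actual schema:

```python
import pandas as pd

def clean_policies(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pass: dedupe, coerce types, drop bad rows."""
    out = df.drop_duplicates()
    # Coerce premium to numeric; unparseable values become NaN and are dropped
    out = out.assign(premium=pd.to_numeric(out["premium"], errors="coerce"))
    out = out.dropna(subset=["premium"])
    # Negative premiums are treated as data-entry errors
    return out[out["premium"] >= 0]

raw = pd.DataFrame({"premium": ["100", "bad", "-5", "250", "250"]})
clean = clean_policies(raw)
print(clean)  # rows with premiums 100.0 and 250.0 remain
```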

### Run EDA Pipeline

```bash
python scripts/run_eda_pipeline.py
```

### Train Machine Learning Models

```bash
python scripts/train_models.py
```
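The internals of `train_models.py` are not shown here; as a rough sketch, a claim-severity regressor could be fit like this. The data is synthetic and the feature semantics are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))  # stand-ins for e.g. vehicle age, sum insured, ...
y = 1000 + 300 * X[:, 0] + rng.normal(scale=50, size=500)  # claim amount

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"MAE: {mae:.1f}")
```

The repository's `models/` package also includes linear regression and XGBoost variants; the fit/evaluate pattern is the same.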

### Use Jupyter Notebooks

```bash
jupyter notebook
```

---

## Key Features

- ✔ Modular ML architecture
- ✔ Clear data/configs/scripts separation
- ✔ DVC versioning
- ✔ CI-ready workflows
- ✔ Logging & validation utilities
- ✔ Interpretability (SHAP, feature importance)
- ✔ Reproducible experiments
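The interpretability feature can be illustrated with scikit-learn's built-in feature importances; a SHAP analysis would follow the same fit-then-explain pattern. The data and feature names below are synthetic, with the target constructed to depend only on `sum_insured`:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
X = pd.DataFrame({
    "vehicle_age": rng.uniform(0, 20, 300),
    "sum_insured": rng.uniform(1e4, 1e5, 300),
    "driver_age": rng.uniform(18, 80, 300),
})
# Target depends only on sum_insured, so it should dominate the ranking
y = 0.02 * X["sum_insured"] + rng.normal(scale=50, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
ranking = pd.Series(model.feature_importances_, index=X.columns)
print(ranking.sort_values(ascending=False))
```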

---

## Reports

- `docs/interim_report.md`: insights gathered during project development
- `docs/final_report.md`: final results, visualizations, and model performance

---

## Testing

```bash
pytest
```
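A unit test in `tests/unit/` would look roughly like the sketch below. The actual cleaner API is not shown in this README, so a stand-in function is defined inline:

```python
import pandas as pd

def drop_negative_premiums(df: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for a cleaner step: remove rows with negative premiums."""
    return df[df["premium"] >= 0].reset_index(drop=True)

def test_drop_negative_premiums():
    df = pd.DataFrame({"premium": [100.0, -5.0, 250.0]})
    out = drop_negative_premiums(df)
    assert list(out["premium"]) == [100.0, 250.0]

test_drop_negative_premiums()
```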

---

## Version Control

## Configuration

`configs/data.yaml` (added in this change):

```yaml
data:
  data_dir: "data"
  raw_dir: "data/raw"
  processed_dir: "data/processed"

logs:
  logs_dir: "logs"

reports:
  reports_dir: "reports"
  plots_dir: "reports/plots"

models:
  models_dir: "src/insurance_analytics/models"

artifacts:
  artifacts_dir: "artifacts"
```
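The repository's `core/config.py` presumably wraps config loading, but its exact API is not shown here; a plain PyYAML loader covers the essentials:

```python
import yaml

# Inline sample standing in for configs/data.yaml; in a real run you would
# load the file, e.g. yaml.safe_load(open("configs/data.yaml"))
SAMPLE = """
data:
  raw_dir: data/raw
  processed_dir: data/processed
logs:
  logs_dir: logs
"""

cfg = yaml.safe_load(SAMPLE)
print(cfg["data"]["raw_dir"])
```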