Pratyusha108/COVID-19-MORTALITY-RATE-PREDICTION-ANALYSIS-USING-PYTHON

Global COVID-19 Mortality & Outcome Prediction (Python)

End-to-end data analytics and machine learning pipeline analyzing global COVID-19 trends, mortality risk, and patient outcomes


📌 Project Overview

This project applies data analytics and machine learning to global COVID-19 data to understand the spread, severity, and outcomes of the pandemic across countries and regions.

Using Python, the project integrates:

  • Exploratory Data Analysis (EDA)
  • ETL (Extract–Transform–Load) pipelines
  • Statistical analysis
  • Supervised machine learning models
  • Geographic and time-series visualizations

The analysis is designed to reflect real-world public-health and data-science workflows, emphasizing interpretability, scalability, and analytical rigor.


🌍 Business & Public Health Context

During global health crises, policymakers and healthcare systems rely on data to:

  • Monitor disease spread
  • Predict mortality risk
  • Allocate healthcare resources
  • Assess recovery trends
  • Compare regional impacts

This project demonstrates how data analytics and machine learning can support evidence-based decision-making during large-scale public health emergencies.


🎯 Project Objectives

Analytical Goals

  • Analyze global COVID-19 trends over time
  • Identify relationships between confirmed cases, deaths, recoveries, and active cases
  • Engineer meaningful health metrics (fatality & recovery rates)
  • Predict COVID-19 mortality using machine learning
  • Classify outbreak outcomes (recovery-dominant vs death-dominant)

Technical Goals

  • Build a clean, reusable ETL pipeline
  • Apply regression and classification models
  • Evaluate model performance using multiple metrics
  • Visualize global and temporal patterns effectively

📊 Dataset Summary

  • Roughly 49,000 records across multiple countries
  • Daily global COVID-19 reporting (2020)
  • Publicly available open data sources

Key Features

  • Province / State
  • Country / Region
  • WHO Region
  • Latitude & Longitude
  • Date
  • Confirmed Cases
  • Deaths
  • Recovered
  • Active Cases

Derived metrics:

  • Case Fatality Rate
  • Recovery Rate
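
The two derived metrics can be sketched as simple ratios against confirmed cases. This is a minimal illustration, not the project's exact code: the column names (`Confirmed`, `Deaths`, `Recovered`), the helper name `add_rate_metrics`, and the choice to express the rates as percentages are all assumptions.

```python
import numpy as np
import pandas as pd


def add_rate_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with fatality and recovery rates in percent.

    Hypothetical helper; column names are assumed, not taken from the project.
    """
    out = df.copy()
    # Turn zero confirmed counts into NaN so the division gives NaN, not inf.
    confirmed = out["Confirmed"].replace(0, np.nan)
    out["Case Fatality Rate"] = out["Deaths"] / confirmed * 100
    out["Recovery Rate"] = out["Recovered"] / confirmed * 100
    return out


# Tiny made-up sample: one normal row and one zero-case row.
sample = pd.DataFrame(
    {"Confirmed": [200, 0], "Deaths": [10, 0], "Recovered": [50, 0]}
)
rates = add_rate_metrics(sample)
```

Rows with zero confirmed cases come out as missing values rather than infinities, so they can be dropped or filled explicitly later.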

🔄 Data Engineering (ETL Pipeline)

Extract

  • Loaded raw CSV data into Pandas DataFrames

Transform

  • Converted date fields for time-series analysis
  • Handled missing geographic values
  • Removed duplicate records
  • Filtered out records with zero confirmed cases
  • Engineered health indicators:
    • Case Fatality Rate
    • Recovery Rate

Load

  • Saved transformed datasets for reproducible downstream analysis

This ETL workflow ensures data quality, consistency, and reusability.
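
The extract, transform, and load steps above could look roughly like the following sketch. The function name `run_etl`, the column names, and the decision to fill missing provinces with `"Unknown"` are assumptions for illustration; the project may handle missing geographic values differently.

```python
import io

import pandas as pd


def run_etl(source, output_path=None) -> pd.DataFrame:
    """Hypothetical ETL sketch: read a CSV, clean it, optionally save it."""
    # Extract: load the raw CSV into a DataFrame.
    df = pd.read_csv(source)

    # Transform: parse dates, fill missing geography, drop duplicates,
    # and remove rows that report zero confirmed cases.
    df["Date"] = pd.to_datetime(df["Date"])
    df["Province/State"] = df["Province/State"].fillna("Unknown")  # assumed fill value
    df = df.drop_duplicates()
    df = df[df["Confirmed"] > 0].copy()

    # Engineer the health indicators (as percentages, by assumption).
    df["Case Fatality Rate"] = df["Deaths"] / df["Confirmed"] * 100
    df["Recovery Rate"] = df["Recovered"] / df["Confirmed"] * 100

    # Load: persist the cleaned data when a destination is given.
    if output_path is not None:
        df.to_csv(output_path, index=False)
    return df


# Made-up rows: a duplicate, a zero-case record, and a missing province.
raw_csv = io.StringIO(
    "Province/State,Country/Region,Date,Confirmed,Deaths,Recovered\n"
    ",Italy,2020-03-01,100,5,20\n"
    ",Italy,2020-03-01,100,5,20\n"
    ",Italy,2020-01-22,0,0,0\n"
    "Hubei,China,2020-03-01,400,20,100\n"
)
clean = run_etl(raw_csv)
```

Filtering zero-case rows before computing the rates avoids division by zero, which is one reason to keep the steps in this order.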


🔍 Exploratory Data Analysis (EDA)

Key analytical steps:

  • Statistical summaries to detect outliers and skewness
  • Global geospatial visualizations using latitude/longitude
  • WHO-region-based comparisons
  • Time-series trend analysis for confirmed, active, recovered, and death cases
  • Correlation analysis between confirmed cases and mortality

Visualizations include:

  • Global scatter maps
  • Time-series line plots
  • Regional bar charts
  • Regression plots for severity analysis

🤖 Machine Learning Models

1️⃣ Mortality Prediction (Regression)

Target: Predict number of deaths

Models implemented:

  • Random Forest Regressor
  • Support Vector Machine (SVM) regressor

Evaluation metrics:

  • Mean Squared Error (MSE)
  • R² Score
  • Feature importance analysis

✔ Random Forest showed strong predictive performance, and its feature importances make the model easy to interpret
✔ Feature importance ranked active and confirmed cases as the strongest predictors of deaths
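
A minimal sketch of how the two regressors might be compared with scikit-learn follows. The feature set (`Confirmed`, `Recovered`, `Active`), the hyperparameters, the 80/20 split, and the synthetic data are all assumptions; the source does not state which inputs or settings the project uses.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

FEATURES = ["Confirmed", "Recovered", "Active"]  # assumed feature set


def compare_regressors(df: pd.DataFrame, random_state: int = 42):
    """Fit both models on the same split and report MSE, R2, and importances."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[FEATURES], df["Deaths"], test_size=0.2, random_state=random_state
    )
    models = {
        "random_forest": RandomForestRegressor(
            n_estimators=100, random_state=random_state
        ),
        # SVR is sensitive to feature scale, so standardize the inputs first.
        "svr": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        results[name] = {
            "mse": mean_squared_error(y_test, predictions),
            "r2": r2_score(y_test, predictions),
        }
    importances = pd.Series(
        models["random_forest"].feature_importances_, index=FEATURES
    ).sort_values(ascending=False)
    return results, importances


# Synthetic stand-in data, not the project's dataset.
rng = np.random.default_rng(0)
confirmed = rng.integers(100, 10_000, size=300).astype(float)
deaths = np.clip(confirmed * 0.03 + rng.normal(0, 5, size=300), 0, None)
recovered = confirmed * rng.uniform(0.2, 0.6, size=300)
synthetic = pd.DataFrame(
    {
        "Confirmed": confirmed,
        "Recovered": recovered,
        "Active": confirmed - deaths - recovered,
        "Deaths": deaths,
    }
)
results, importances = compare_regressors(synthetic)
```

One thing worth checking before trusting a high R² with these inputs: when active cases are computed as confirmed minus deaths minus recovered, the features can reconstruct the target almost exactly.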


2️⃣ Outbreak Outcome Prediction (Classification)

Binary classification:

  • Recovery-Dominant (1)
  • Death-Dominant (0)

Model:

  • Random Forest Classifier

Metrics:

  • Precision
  • Recall
  • F1-Score
  • Confusion Matrix

✔ Achieved high overall accuracy
✔ Strong performance in identifying recovery-dominant scenarios
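
The classification setup could be sketched as below. The labeling rule (recovery-dominant when recoveries exceed deaths), the feature set, the hyperparameters, and the synthetic data are assumptions made for illustration; the source does not spell out how the two classes are defined.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

CLASS_NAMES = ["Death-Dominant", "Recovery-Dominant"]  # labels 0 and 1


def label_outcome(df: pd.DataFrame) -> pd.Series:
    """Assumed rule: 1 when recoveries exceed deaths, otherwise 0."""
    return (df["Recovered"] > df["Deaths"]).astype(int)


def evaluate_outcome_classifier(df: pd.DataFrame, random_state: int = 42):
    """Train a Random Forest and return its report and confusion matrix."""
    X = df[["Confirmed", "Active"]]  # assumed features
    y = label_outcome(df)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state
    )
    clf = RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=random_state
    )
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    report = classification_report(
        y_test,
        predictions,
        labels=[0, 1],
        target_names=CLASS_NAMES,
        output_dict=True,
        zero_division=0,
    )
    matrix = confusion_matrix(y_test, predictions, labels=[0, 1])
    return report, matrix


# Synthetic stand-in data, not the project's dataset.
rng = np.random.default_rng(1)
n_rows = 400
confirmed = rng.integers(10, 5_000, size=n_rows)
recovered = (confirmed * rng.uniform(0.0, 0.5, size=n_rows)).astype(int)
deaths = (confirmed * rng.uniform(0.0, 0.1, size=n_rows)).astype(int)
outbreaks = pd.DataFrame(
    {
        "Confirmed": confirmed,
        "Recovered": recovered,
        "Deaths": deaths,
        "Active": confirmed - recovered - deaths,
    }
)
report, matrix = evaluate_outcome_classifier(outbreaks)
```

Rows of the confusion matrix are true classes and columns are predicted classes, in the order death-dominant then recovery-dominant. Per-class precision, recall, and F1 matter here because the classes are likely imbalanced, so overall accuracy alone can hide weak performance on the smaller class.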


📈 Key Insights

  • Active case volume is a strong predictor of mortality
  • Confirmed cases strongly correlate with deaths
  • Recovery trends vary significantly by WHO region
  • Severity peaks midway through the 2020 timeline rather than being spread evenly across it
  • Tree-based models outperformed kernel-based methods on this dataset

🛠️ Tech Stack

  • Python
  • Pandas, NumPy
  • Matplotlib, Seaborn
  • Scikit-learn
  • Jupyter Notebook
  • Basemap (geospatial visualization)
