End-to-end data analytics and machine learning pipeline analyzing global COVID-19 trends, mortality risk, and patient outcomes
This project performs a comprehensive data analytics and machine learning analysis on global COVID-19 data to understand the spread, severity, and outcomes of the pandemic across countries and regions.
Using Python, the project integrates:
- Exploratory Data Analysis (EDA)
- ETL (Extract, Transform, Load) pipelines
- Statistical analysis
- Supervised machine learning models
- Geographic and time-series visualizations
The analysis is designed to reflect real-world public-health and data-science workflows, emphasizing interpretability, scalability, and analytical rigor.
During global health crises, policymakers and healthcare systems rely on data to:
- Monitor disease spread
- Predict mortality risk
- Allocate healthcare resources
- Assess recovery trends
- Compare regional impacts
This project demonstrates how data analytics and machine learning can support evidence-based decision-making during large-scale public health emergencies.
Project objectives:
- Analyze global COVID-19 trends over time
- Identify relationships between confirmed cases, deaths, recoveries, and active cases
- Engineer meaningful health metrics (fatality & recovery rates)
- Predict COVID-19 mortality using machine learning
- Classify outbreak outcomes (recovery-dominant vs death-dominant)
- Build a clean, reusable ETL pipeline
- Apply regression and classification models
- Evaluate model performance using multiple metrics
- Visualize global and temporal patterns effectively
Dataset overview:
- ~49,000 records across multiple countries
- Daily global COVID-19 reporting (2020)
- Publicly available open data sources
Key fields:
- Province / State
- Country / Region
- WHO Region
- Latitude & Longitude
- Date
- Confirmed Cases
- Deaths
- Recovered
- Active Cases
Derived metrics:
- Case Fatality Rate
- Recovery Rate
ETL steps:
- Loaded raw CSV data into Pandas DataFrames
- Converted date fields for time-series analysis
- Handled missing geographic values
- Removed duplicate records
- Filtered invalid zero-case records
- Engineered health indicators:
  - Case Fatality Rate
  - Recovery Rate
- Saved transformed datasets for reproducible downstream analysis
This ETL workflow ensures data quality, consistency, and reusability.
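The transform stage above can be sketched as follows. The inline DataFrame is an illustrative stand-in for the raw CSV extract, and the `Active = Confirmed - Deaths - Recovered` derivation is an assumption about how the active-case column is built:

```python
import pandas as pd

# Hypothetical raw extract; the real pipeline reads the project's CSV files.
raw = pd.DataFrame({
    "Country/Region": ["A", "A", "B", "B", "B"],
    "Date": ["2020-03-01", "2020-03-01", "2020-03-01", "2020-03-02", "2020-03-03"],
    "Confirmed": [100, 100, 0, 50, 80],
    "Deaths": [5, 5, 0, 2, 4],
    "Recovered": [20, 20, 0, 10, 30],
})

def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["Date"] = pd.to_datetime(df["Date"])   # enable time-series operations
    df = df.drop_duplicates()                 # remove duplicate records
    df = df[df["Confirmed"] > 0]              # filter invalid zero-case records
    # Derived health indicators (active-case formula is an assumption)
    df["Active"] = df["Confirmed"] - df["Deaths"] - df["Recovered"]
    df["Case Fatality Rate"] = df["Deaths"] / df["Confirmed"] * 100
    df["Recovery Rate"] = df["Recovered"] / df["Confirmed"] * 100
    return df.reset_index(drop=True)

clean = transform(raw)
print(len(clean))  # duplicate and zero-case rows have been dropped
```

In the real workflow the result would then be written out (e.g. with `to_csv`) for reproducible downstream analysis.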
Key analytical steps:
- Statistical summaries to detect outliers and skewness
- Global geospatial visualizations using latitude/longitude
- WHO-region-based comparisons
- Time-series trend analysis for confirmed, active, recovered, and death cases
- Correlation analysis between confirmed cases and mortality
Visualizations include:
- Global scatter maps
- Time-series line plots
- Regional bar charts
- Regression plots for severity analysis
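The correlation step above can be sketched with toy daily totals (the numbers are illustrative, not the project's real data):

```python
import pandas as pd

# Toy daily global totals standing in for the aggregated dataset (assumption).
daily = pd.DataFrame({
    "Date": pd.date_range("2020-03-01", periods=5),
    "Confirmed": [100, 250, 600, 1200, 2000],
    "Deaths": [2, 6, 15, 33, 55],
    "Recovered": [10, 40, 120, 300, 600],
})
daily["Active"] = daily["Confirmed"] - daily["Deaths"] - daily["Recovered"]

# Pearson correlation between confirmed cases and mortality
corr = daily["Confirmed"].corr(daily["Deaths"])
print(round(corr, 3))
```

The same DataFrame feeds the time-series line plots directly (`ax.plot(daily["Date"], daily["Confirmed"])` with Matplotlib), and the latitude/longitude columns of the full dataset drive the scatter maps.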
Regression target: predict the number of deaths
Models implemented:
- Random Forest Regressor
- Support Vector Machine (SVM)
Evaluation:
- Mean Squared Error (MSE)
- R² Score
- Feature importance analysis
✔ Random Forest demonstrated strong predictive performance and interpretability
✔ Feature importance revealed active and confirmed cases as key mortality drivers
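A minimal sketch of the regression setup, using synthetic features in place of the real dataset (the feature construction and coefficients below are assumptions made purely for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (Confirmed, Active, Recovered) features.
rng = np.random.default_rng(0)
n = 500
confirmed = rng.integers(100, 100_000, n).astype(float)
active = confirmed * rng.uniform(0.2, 0.6, n)
recovered = confirmed - active
# Mortality loosely driven by confirmed and active cases (assumption).
deaths = 0.02 * confirmed + 0.01 * active + rng.normal(0, 50, n)

X = np.column_stack([confirmed, active, recovered])
X_train, X_test, y_train, y_test = train_test_split(X, deaths, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, pred))
print("R2:", r2_score(y_test, pred))
print("Feature importances:", model.feature_importances_)
```

Swapping in `sklearn.svm.SVR` for the SVM comparison requires only changing the estimator line; the metric calls stay the same.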
Binary classification of outbreak outcomes:
- Recovery-Dominant (1)
- Death-Dominant (0)
Model:
- Random Forest Classifier
Metrics:
- Precision
- Recall
- F1-Score
- Confusion Matrix
✔ Achieved high overall accuracy
✔ Strong performance in identifying recovery-dominant scenarios
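The classification setup can be sketched the same way; the two-feature synthetic data and the recovery-vs-death labeling rule here are assumptions for illustration, not the project's actual feature set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic recovered/death counts (illustrative only).
rng = np.random.default_rng(1)
n = 400
recovered = rng.uniform(0, 1000, n)
deaths = rng.uniform(0, 1000, n)
X = np.column_stack([recovered, deaths])
y = (recovered > deaths).astype(int)  # 1 = recovery-dominant, 0 = death-dominant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

print(confusion_matrix(y_test, clf.predict(X_test)))
print(classification_report(y_test, clf.predict(X_test)))
```

`classification_report` bundles the precision, recall, and F1 metrics listed above into one per-class summary.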
Key findings:
- Active case volume is a strong predictor of mortality
- Confirmed cases strongly correlate with deaths
- Recovery trends vary significantly by WHO region
- Pandemic severity peaks mid-timeline rather than rising uniformly
- Tree-based models outperform kernel-based methods for this dataset
Tech stack:
- Python
- Pandas, NumPy
- Matplotlib, Seaborn
- Scikit-learn
- Jupyter Notebook
- Basemap (geospatial visualization)