A declarative ETL pipeline built with Databricks DLT (Delta Live Tables), following the Medallion Architecture.
It transforms raw sales data from multiple regions into a production-ready data warehouse with:
- Automated data quality checks
- Change data capture (SCD Type 1 & 2)
- Incremental processing
- Historical tracking of dimension changes
Raw Data → Bronze (Ingest) → Silver (Clean) → Gold (Analytics)
- Bronze: Raw data ingestion with basic validation (sketched below)
- Silver: Cleaned, enriched data with upserts
- Gold: Dimensional model ready for BI tools
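As a rough illustration of the Bronze layer, here is a minimal sketch of an ingestion table; the table name, landing path, file format, and column names are assumptions for illustration, not the exact code in `source_code/bronze/`:

```python
import dlt
from pyspark.sql import functions as F

# Hypothetical landing location for the raw east-region sales files.
RAW_SALES_EAST_PATH = "/Volumes/dlt/source/raw/sales_east/"

@dlt.table(comment="Bronze: raw sales_east records ingested incrementally via Auto Loader")
@dlt.expect("sale_id_present", "sale_id IS NOT NULL")  # basic validation: flag (but keep) rows missing the key
def bronze_sales_east():
    return (
        spark.readStream.format("cloudFiles")          # Auto Loader = incremental file ingestion
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load(RAW_SALES_EAST_PATH)
        .withColumn("_ingested_at", F.current_timestamp())
    )
```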
```text
source_code/
├── bronze/   # Raw data ingestion
├── silver/   # Data cleaning & enrichment
└── gold/     # Dimensional model & business views
```
- Set up a Databricks workspace (the Free tier works)
- Enable the Lakeflow pipeline editor under Settings > Developer
- Create the catalog `dlt` with the schema `source`
- Load the source data (sales_east, sales_west, products, customers) - see the setup sketch after this list
- Create a DLT pipeline and upload the code files
- Run the pipeline - that's it!
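A minimal one-time setup sketch, run from a regular notebook before starting the pipeline. The file path is hypothetical; adjust it to wherever you upload the raw files, and note that creating a catalog requires the appropriate Unity Catalog permissions:

```python
# Create the catalog and schema the pipeline reads from.
spark.sql("CREATE CATALOG IF NOT EXISTS dlt")
spark.sql("CREATE SCHEMA IF NOT EXISTS dlt.source")

# Register the four source datasets as tables in dlt.source.
for name in ["sales_east", "sales_west", "products", "customers"]:
    (spark.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(f"/Volumes/dlt/source/landing/{name}.csv")  # hypothetical upload location
        .write.mode("overwrite")
        .saveAsTable(f"dlt.source.{name}"))
```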
- Zero-config CDC: Automatic handling of inserts, updates, deletes
- Data quality gates: Bad data stopped at source
- Smart incremental loads: Only processes what changed
- SCD Type 2: Full history preserved automatically (quality gates and SCD handling are sketched after this list)
- One-click deployment: Declarative approach = less code
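A sketch of how the quality gates and CDC/SCD handling look in DLT. Table, key, and column names (sale_id, sale_date, amount, products_clean, updated_at) are illustrative assumptions, not the repo's exact schema:

```python
import dlt
from pyspark.sql import functions as F

@dlt.view(comment="Cleansed CDC feed combining both regional bronze tables")
@dlt.expect_or_drop("sale_id_present", "sale_id IS NOT NULL")  # quality gate: bad rows never reach Silver
@dlt.expect_or_drop("positive_amount", "amount > 0")
def sales_updates():
    return dlt.read_stream("bronze_sales_east").unionByName(dlt.read_stream("bronze_sales_west"))

# Silver upserts: keep only the latest version of each sale (SCD Type 1).
dlt.create_streaming_table("silver_sales")
dlt.apply_changes(
    target="silver_sales",
    source="sales_updates",
    keys=["sale_id"],
    sequence_by=F.col("sale_date"),
    stored_as_scd_type=1,
)

# Dimension with full history (SCD Type 2): every change to a product is kept as a new row.
dlt.create_streaming_table("dim_products")
dlt.apply_changes(
    target="dim_products",
    source="products_clean",        # assumed upstream cleansed products view
    keys=["product_id"],
    sequence_by=F.col("updated_at"),
    stored_as_scd_type=2,
)
```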
After running, you'll have:
- `dim_products` & `dim_customers` (with full history)
- `fact_sales` (optimized for queries)
- `business_view_sales` (ready for dashboards; sketched below)
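For example, the business view could be defined roughly like this. This is a sketch only: the join keys are assumptions, and the `__START_AT`/`__END_AT` filters rely on the default SCD Type 2 columns that `apply_changes` produces:

```python
import dlt

@dlt.table(comment="Gold: sales enriched with current product and customer attributes, ready for dashboards")
def business_view_sales():
    current_products = (
        dlt.read("dim_products")
        .where("__END_AT IS NULL")             # keep only the current SCD2 version of each product
        .drop("__START_AT", "__END_AT")
    )
    current_customers = (
        dlt.read("dim_customers")
        .where("__END_AT IS NULL")
        .drop("__START_AT", "__END_AT")
    )
    return (
        dlt.read("fact_sales")
        .join(current_products, "product_id")   # illustrative join keys
        .join(current_customers, "customer_id")
    )
```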
- Databricks Delta Live Tables
- PySpark
- Delta Lake
- Medallion Architecture
Built following a YouTube tutorial on modern data engineering with Databricks.
