This project is a production-minded demo deployed on Azure: a containerized ETL pipeline that fetches daily stock market data, calculates technical indicators (MACD, RSI, Bollinger Bands), and visualizes results through an interactive dashboard. It showcases cloud deployment fluency, automated orchestration with Apache Airflow, and cost-aware architecture decisions suitable for real-world data teams. The entire stack—database, scheduler, web UI, and frontend—runs on Azure Container Instances with a one-command redeploy (after initial setup) and indicative, region-dependent costs (often under ~$2/day when running 24/7; ~$0.01/day when stopped). Residual ACR and storage charges may apply even when stopped. You can verify the working system in 3 steps (see Quick Start) or review live screenshots below.
Key outcomes:
✅ End-to-end data pipeline on Azure (ingest → transform → load → visualize)
✅ Automated daily scheduling with retry logic and error isolation
✅ Fully containerized and reproducible (Docker + Azure Container Registry)
✅ A concise production-upgrade path (managed database, Key Vault, private networking)
- End-to-end Azure data pipeline: Automated data ingestion from yFinance API → transformation (pandas/NumPy calculations) → PostgreSQL storage → Streamlit visualization
- Parallel orchestration: Apache Airflow with dynamic task mapping runs tickers in parallel on a single node; parallelism depends on allocated vCPU/RAM.
- Production DevOps practices: Multi-stage Docker builds, Azure Container Registry, infrastructure-as-code (YAML descriptors), init container patterns
- Cost awareness: Ephemeral storage for dev/demo (indicative, region-dependent ~$45-65/mo when running; ~$0.30/mo when stopped; ACR/storage costs may persist); clear migration path to managed services
- Reproducibility: Three bash scripts deploy the entire stack from zero to running URLs in ~8 minutes (indicative, machine-dependent).

Automated daily refresh: Parallel processing of multiple stock tickers with automatic retry on failure. Each ticker runs independently—one failure doesn't stop the pipeline.
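For context, retry behavior like this is usually set through Airflow task defaults; the snippet below is a generic illustration with assumed values, not the project's actual DAG configuration.

```python
# Generic illustration of Airflow retry defaults (values are assumptions, not the project's).
from datetime import timedelta

default_args = {
    "retries": 2,                         # re-run a failed per-ticker task automatically
    "retry_delay": timedelta(minutes=5),  # wait between attempts
}
```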

Interactive market insights: Candlestick charts with overlay indicators (EMA, MACD, RSI, Bollinger Bands). Users select tickers, date ranges, and toggle indicators interactively.
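For reference, a chart like this takes only a few lines of Streamlit + Plotly; the snippet below is a simplified sketch with synthetic data (column names and widgets are assumptions, not the project's frontend code).

```python
# Minimal sketch: Plotly candlestick in Streamlit with synthetic data
# (the real frontend queries PostgreSQL; column names here are assumptions).
import pandas as pd
import plotly.graph_objects as go
import streamlit as st

df = pd.DataFrame({
    "date":  pd.date_range("2024-01-01", periods=5),
    "open":  [100, 102, 101, 103, 105],
    "high":  [103, 104, 104, 106, 107],
    "low":   [99, 100, 100, 102, 104],
    "close": [102, 101, 103, 105, 106],
})

ticker = st.selectbox("Ticker", ["AAPL", "MSFT", "NVDA"])
fig = go.Figure(go.Candlestick(
    x=df["date"], open=df["open"], high=df["high"],
    low=df["low"], close=df["close"], name=ticker,
))
st.plotly_chart(fig, use_container_width=True)
```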

Multi-container orchestration: Five application containers (Airflow scheduler, Airflow webserver, database, ETL, frontend) and three init containers (sql-init, airflow-init, pipeline-data-init) managed as a single unit with shared networking and public endpoints.
- Azure CLI
- Docker
- Active Azure subscription with Contributor role
First-time setup (before deployment): Copy env.example to .env and populate Azure subscription ID, region, and credentials. Generate required keys:
```bash
# Airflow Fernet key
python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
# Airflow secret key
python3 -c "import secrets; print(secrets.token_urlsafe(32))"
```

1. Authenticate and set up Azure resources

```bash
az login
cd azure
bash resource_setup.sh   # Creates resource group + container registry (~2-5 min)
```

2. Build and push container images

```bash
bash build_and_push.sh   # Builds 3 images, pushes to Azure (~4-8 min)
```

3. Deploy to Azure Container Instances

```bash
bash deploy.sh           # Launches multi-container group (~3-6 min)
# Output:
# Deployment Complete!
# Airflow UI: http://XXX.XXX.XXX.XXX:8080 (user: admin, password: from .env)
# Streamlit Dashboard: http://XXX.XXX.XXX.XXX:8501
```

After deployment succeeds, run the setup_airflow_variables.py script:

```bash
az container exec -g rg-stock-pipeline -n stock-pipeline-group --container-name airflow-webserver --exec-command "python /opt/app/setup_airflow_variables.py"
```
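A variable-seeding script of this kind typically copies values from the environment into Airflow Variables so DAGs can read them; the snippet below is a hedged guess at the shape, not the actual contents of setup_airflow_variables.py.

```python
# Hedged sketch of seeding Airflow Variables from the environment
# (illustrative only; the actual setup_airflow_variables.py may differ).
import os

from airflow.models import Variable

for key in ("TICKERS", "START_DATE", "INTERVAL"):
    Variable.set(key, os.environ[key])  # DAGs can then read Variable.get("TICKERS")
```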
Architecture:
- Data ingestion: Airflow scheduler triggers daily DAG runs; Python tasks fetch stock data from the yFinance API using the open-source `yfinance` library
- Transformation: Calculates technical indicators (exponential moving averages, MACD, RSI, Bollinger Bands) using pandas DataFrame operations (see the sketch after this list)
- Storage: Loads processed data into PostgreSQL with incremental upsert logic (only new/changed rows); primary key on `(ticker, date)`
- Orchestration: Three DAG options (sequential for simplicity, dynamic with parallel task mapping for production, and a bootstrap DAG for the one-time historical load)
- Visualization: Streamlit frontend queries PostgreSQL and renders interactive Plotly candlestick charts with real-time filtering
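To make the ingest and transform steps concrete, the sketch below shows the general shape of these calculations with `yfinance` and pandas; window lengths, function names, and column names are illustrative assumptions rather than the project's extract/transform modules.

```python
# Illustrative sketch of the fetch + indicator math
# (window lengths, names, and columns are assumptions, not the project's code).
import pandas as pd
import yfinance as yf

def fetch_daily(ticker: str, start: str = "2019-01-01") -> pd.DataFrame:
    """Download daily OHLCV bars and normalize column names to lowercase."""
    df = yf.Ticker(ticker).history(start=start, interval="1d")
    return df.rename(columns=str.lower)

def add_indicators(df: pd.DataFrame) -> pd.DataFrame:
    close = df["close"]

    # MACD: difference of 12/26-period EMAs, with a 9-period signal line
    ema_fast = close.ewm(span=12, adjust=False).mean()
    ema_slow = close.ewm(span=26, adjust=False).mean()
    df["macd"] = ema_fast - ema_slow
    df["macd_signal"] = df["macd"].ewm(span=9, adjust=False).mean()

    # RSI: 14-period rolling-average variant
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    df["rsi"] = 100 - 100 / (1 + gain / loss)

    # Bollinger Bands: 20-period SMA +/- 2 standard deviations
    sma = close.rolling(20).mean()
    std = close.rolling(20).std()
    df["bb_upper"] = sma + 2 * std
    df["bb_lower"] = sma - 2 * std
    return df
```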
Deployment model: All components run as Docker containers in a single Azure Container Instances group with shared localhost networking. Public IP exposes Airflow (port 8080) and Streamlit (port 8501); PostgreSQL remains internal.
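Because every container in the group shares the same network namespace, in-group services reach PostgreSQL over 127.0.0.1; the snippet below is a hedged connection sketch (the exact connection code in the project may differ).

```python
# Hedged sketch: connecting to the in-group PostgreSQL over shared localhost networking
# (connection details are assumptions; the project's code may differ).
import os
import psycopg2

conn = psycopg2.connect(
    host="localhost",  # same container group, shared network namespace
    port=5432,
    dbname=os.environ["POSTGRES_DB"],
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
)
```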
Full technical details: Appendix for Engineers
Azure Container Instances (vs. Kubernetes):
Value: Fastest path from local Docker Compose to cloud deployment; no Kubernetes complexity, no cluster management overhead. Ideal for demonstrating cloud fluency without over-engineering a portfolio project.
Trade-off: Limited auto-scaling vs. Azure Container Apps; acceptable for batch ETL workloads.
Personal note: This is a production-minded demo: I used ACI to minimize setup time and make the pipeline easy to review. For production, I would move to Azure Container Apps and Managed PostgreSQL to gain managed ingress, scale-to-zero patterns, durable storage, and finer lifecycle control than ACI’s container-group model.
Containerized PostgreSQL inside the group (no managed database):
Value: Self-contained deployment with zero external dependencies; entire stack stops/starts as a unit.
Trade-off: Data stored in ephemeral volumes (lost on restart); production workloads need Azure Database for PostgreSQL with automated backups.
Airflow LocalExecutor:
Value: Simpler than CeleryExecutor (no Redis/RabbitMQ); supports parallel task execution for multi-ticker processing.
Trade-off: Parallelism depends on allocated CPU/RAM; suitable for small-to-medium ticker sets on a single node.
Public IP endpoints (no ingress layer):
Value: Direct access without ingress controller or load balancer setup; reduces demo complexity.
Trade-off: No encryption or WAF; production requires Azure Front Door with custom domain + TLS certificates.
envsubst-templated secrets (no Key Vault):
Value: Simple templating with envsubst; no Key Vault provisioning delays.
Trade-off: Secrets in generated YAML files; production requires Azure Key Vault + Managed Identity integration.
Security note:
Templating writes secrets into the generated YAML used for deployment (e.g., aci-stock-pipeline-generated.yaml). Do not commit generated files or store them in shared locations.
| Component | Today (Demo) | Next (Production) | Why Upgrade |
|---|---|---|---|
| Compute | Azure Container Instances (manual restart) | Azure Container Apps (auto-scale to zero) | Cost savings during idle periods; built-in health probes and ingress |
| Database | Containerized Postgres (ephemeral storage) | Azure Database for PostgreSQL Flexible Server | Automated backups, high availability, monitoring, 99.99% SLA |
| Storage | emptyDir volumes (lost on restart) | Azure Files + Blob Storage | Persistent logs, long-term data retention, disaster recovery |
| Secrets | Environment variables in YAML | Azure Key Vault + Managed Identity | Encryption at rest, audit logs, no secrets in deployment descriptors |
- Move DB to Managed Postgres
- Adopt Key Vault + Managed Identity
- Migrate compute to ACA (scale-to-zero)
| Skill | Evidence | Where to Look |
|---|---|---|
| Cloud deployment on Azure | Multi-container orchestration with ACI, private container registry (ACR), resource provisioning scripts | Screenshot: Azure Portal, azure/resource_setup.sh |
| Containerized ETL pipelines | Dockerfile-based builds with multi-stage optimization, environment-based configuration | Dockerfile.app, Dockerfile.airflow, app/src/*.py |
| Workflow orchestration | Apache Airflow with TaskFlow API, dynamic task mapping, incremental processing logic | Screenshot: Airflow Graph, dags/stock_pipeline_dynamic_dag.py |
| Data transformation | Pandas/NumPy calculations (technical indicators), SQL upsert patterns, three-layer validation | app/src/transform.py, app/src/utils.py (validation functions) |
| Interactive dashboards | Streamlit with Plotly charts, real-time filtering, responsive UI design | Screenshot: Dashboard, frontend/app.py |
| Cost awareness | Resource sizing decisions, ephemeral vs. persistent storage trade-offs, stop/start cost optimization | Why These Choices Matter, Demo vs Production |
| Reproducible infrastructure | Bash automation scripts, YAML descriptors with templating, .env configuration pattern | azure/deploy.sh, azure/aci-stock-pipeline.yaml |
```bash
# Airflow scheduler (task execution logs)
az container logs --resource-group rg-stock-pipeline --name stock-pipeline-group --container-name airflow-scheduler --tail 100
# Streamlit frontend
az container logs --resource-group rg-stock-pipeline --name stock-pipeline-group --container-name frontend --tail 50
```

| Symptom | Solution |
|---|---|
| No data in dashboard | Run bootstrap DAG via Airflow UI: DAGs → bootstrap_stock_data_dag → Trigger. Wait ~5 min for initial data load. |
| Airflow UI shows 502 | Wait 2-3 minutes for containers to initialize. Check webserver logs: az container logs ... --container-name airflow-webserver |
| Container group stuck in "Creating" | Verify ACR credentials: az acr credential show --name acrstockpipeline. Retry deployment: bash deploy.sh |
```bash
# Stop containers (minimal storage cost: ~$0.01/day)
az container stop --resource-group rg-stock-pipeline --name stock-pipeline-group
# Delete everything (zero charges)
az group delete --name rg-stock-pipeline --yes --no-wait
```

Note: Deleting the resource group removes ACR and eliminates residual charges. If you keep ACR for later, storage charges may continue.
| Service | Configuration | Monthly Cost (24/7) | Notes |
|---|---|---|---|
| Azure Container Registry | Basic SKU, 2 GB images | ~$5 | Private registry for Docker images |
| ACI Container Group | 2.75 vCPU, 6.5 GB RAM total | ~$45-60 (indicative, region-dependent) | 5 application containers, 3 init containers, pay-per-second billing |
| Outbound bandwidth | yfinance API calls | $0 | Minimal data transfer (~100 MB/day) |
| Total | – | ~$50-65/mo | Indicative and region-dependent; even when stopped, ACR/storage may incur minimal residual costs |
Every DAG run includes three-layer validation:
- Row count check: Ensures each ticker has loaded data (fails pipeline if zero rows)
- Date continuity check: Detects gaps in historical data using SQL `LAG()` window functions (illustrated below)
- Ticker whitelist validation: Prevents unauthorized symbols from being processed
Results logged to Airflow UI with structured JSON output (see validation task logs).
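To illustrate the date-continuity check, a `LAG()`-based gap query can look roughly like the one below; the table and column names follow the schema described in this README, but the project's actual query may differ.

```python
# Rough illustration of a LAG()-based date-continuity check
# (table/column names follow the schema above; the project's exact query may differ).
GAP_CHECK_SQL = """
WITH ordered AS (
    SELECT ticker,
           date,
           LAG(date) OVER (PARTITION BY ticker ORDER BY date) AS prev_date
    FROM price_metrics
)
SELECT ticker, prev_date, date, date - prev_date AS gap_days
FROM ordered
WHERE date - prev_date > 4  -- more than a long weekend suggests missing trading days
ORDER BY ticker, date;
"""
```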
- ✅ Incremental processing: Only fetches new data since last run; automatic backfill if gaps > 20 days
- ✅ Error isolation: Dynamic DAG with per-ticker task mapping—one ticker failure doesn't stop others
- ✅ SQL injection protection: Parameterized queries and table name validation throughout
- ✅ Idempotent operations: Upsert logic with `ON CONFLICT` clauses; safe to re-run on the same dates (see the sketch below)
- ✅ Comprehensive logging: Structured logs in Airflow UI with task-level granularity
- ✅ Environment-based config: No hardcoded credentials; `.env` + Airflow Variables pattern
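The idempotent upsert hinges on the `(ticker, date)` primary key; below is a hedged sketch of that pattern with psycopg2 (the column list and helper names are assumptions, not the project's load code).

```python
# Hedged sketch of the upsert pattern on the (ticker, date) primary key
# (column list and the use of psycopg2.extras.execute_values are assumptions).
from psycopg2.extras import execute_values

UPSERT_SQL = """
INSERT INTO price_metrics (ticker, date, open, high, low, close, macd, rsi)
VALUES %s
ON CONFLICT (ticker, date) DO UPDATE SET
    open  = EXCLUDED.open,
    high  = EXCLUDED.high,
    low   = EXCLUDED.low,
    close = EXCLUDED.close,
    macd  = EXCLUDED.macd,
    rsi   = EXCLUDED.rsi;
"""

def upsert_rows(conn, rows):
    """Idempotent load: re-running on the same dates overwrites instead of duplicating."""
    with conn.cursor() as cur:
        execute_values(cur, UPSERT_SQL, rows)
    conn.commit()
```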
- Full technical documentation (for engineering deep-dive: init containers, DAG internals, YAML anatomy)
- Root README (local development setup, testing instructions)
- Azure Container Instances Docs
- Airflow TaskFlow API Guide
Dual ETL Architecture (Embedded + Standalone)
The deployment uses two ETL execution paths for flexibility:
Embedded ETL (inside the Airflow images):
- Location: `/opt/app` inside the Airflow webserver/scheduler containers
- Usage: All DAGs import from `src.extract`, `src.transform`, `src.load`
- Benefits: Native Airflow integration, proper logging to the UI, zero exec overhead

Standalone ETL (dedicated container):
- Container name: `etl`
- Usage: `az container exec ... --container-name etl --exec-command "python run_pipeline.py"`
- Benefits: Debug the pipeline independently, run ad-hoc data loads, iterate quickly without DAG changes
Typical workflow: Develop with standalone ETL → deploy via DAG execution → troubleshoot with standalone ETL
Three DAG Orchestration Strategies
Sequential DAG:
- Use case: Simple deployments with 3-5 tickers
- Flow: `extract_all → transform_all → load_all → cleanup → validate`
- Trade-off: One failure stops the entire pipeline; lower parallelism

Dynamic DAG (`stock_pipeline_dynamic_dag`):
- Use case: Production workloads with 10+ tickers
- Flow: Dynamic task mapping with the `.expand()` operator; each ticker gets independent task instances (see the sketch below)
- Benefits: Error isolation, per-ticker incremental ranges, automatic backfill detection, SQL injection protection

Bootstrap DAG (`bootstrap_stock_data_dag`):
- Use case: Initial database population
- Flow: Fetches full history (2019-01-01 to yesterday) for all tickers
- Trigger: Manual only (prevents accidental re-runs)
Recommendation: Run bootstrap_stock_data_dag once, then enable stock_pipeline_dynamic_dag for daily operations.
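For readers unfamiliar with dynamic task mapping, the sketch below shows the general `.expand()` pattern in TaskFlow style; it is a simplified illustration, not the project's stock_pipeline_dynamic_dag.py.

```python
# Simplified TaskFlow sketch of per-ticker dynamic task mapping
# (not the project's DAG; task bodies and names are placeholders).
import pendulum
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def stock_pipeline_dynamic_sketch():

    @task
    def list_tickers() -> list[str]:
        return ["AAPL", "MSFT", "NVDA"]  # in the real DAG this would come from configuration

    @task(retries=2)
    def process_ticker(ticker: str) -> str:
        # extract -> transform -> upsert for a single ticker;
        # a failure here fails only this mapped task instance
        return ticker

    process_ticker.expand(ticker=list_tickers())  # one mapped task instance per ticker

stock_pipeline_dynamic_sketch()
```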
Container Initialization Flow
Init-like containers (ACI starts all containers together; app containers wait on readiness checks and a file signal before launching services):
- `sql-init` (Postgres 17.5): Creates the `price_metrics` table schema with indexes
- `pipeline-data-init` (Alpine): Creates the `/tmp/pipeline_data/` directory and signals readiness
- `airflow-init` (Airflow image): Runs `airflow db migrate` to set up the metadata database
App containers (start after readiness checks and file signal):
- `db`: PostgreSQL server (ports internal to the container group)
- `airflow-webserver`: Airflow UI on port 8080 (exposed publicly)
- `airflow-scheduler`: Task execution engine (internal)
- `etl`: Standalone pipeline executor (manual use only)
- `frontend`: Streamlit dashboard on port 8501 (exposed publicly)
All share localhost networking and emptyDir volumes for pipeline artifacts.
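Because ACI starts every container in the group at once, the "wait on a file signal" step can be as simple as the loop below; this is an illustrative sketch (the signal path is an assumption), not the project's actual entrypoint code.

```python
# Illustrative sketch of the file-signal wait pattern (not the project's actual entrypoint).
import pathlib
import time

SIGNAL_FILE = pathlib.Path("/tmp/pipeline_data/.ready")  # path is an assumption

def wait_for_signal(timeout_s: int = 300) -> None:
    """Block until an init container has written the readiness marker."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if SIGNAL_FILE.exists():
            return
        time.sleep(2)
    raise TimeoutError(f"readiness signal {SIGNAL_FILE} never appeared")
```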
Environment Variables Reference
Required variables in .env:
```bash
# Azure Resources
AZURE_SUBSCRIPTION_ID=12345678-...     # Find via: az account show
AZURE_LOCATION=eastus                  # Use: az account list-locations -o table
AZURE_RESOURCE_GROUP=rg-stock-pipeline
AZURE_ACR_NAME=acrstockpipeline        # Globally unique, 5-50 alphanumeric

# Database
POSTGRES_USER=stockuser
POSTGRES_PASSWORD=SecureP@ss123!       # 16+ chars recommended
POSTGRES_DB=stockdb

# Airflow
AIRFLOW_USER=airflow
AIRFLOW_PASSWORD=AirflowPass456!
AIRFLOW_DB=airflow
AIRFLOW_ADMIN_USERNAME=admin           # Web UI login
AIRFLOW_ADMIN_PASSWORD=AdminPass789!
[email protected]
AIRFLOW_FERNET_KEY=<base64-key>        # Generate via cryptography.fernet
AIRFLOW_SECRET_KEY=<random-string>     # Generate via secrets.token_urlsafe(32)

# ETL Configuration
TICKERS=AAPL,MSFT,NVDA                 # Comma-separated symbols
START_DATE=2019-01-01                  # Historical start date
INTERVAL=1d                            # yFinance interval (1d, 1wk, 1mo)
```

Security note: Never commit .env to version control. The provided env.example contains placeholders only.
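At runtime the containers read these values from the environment; as a simple illustration (not the project's exact configuration code), the ETL settings can be parsed like this:

```python
# Simple illustration of reading the ETL configuration from the environment
# (the project's actual config handling may differ).
import os

tickers = [t.strip() for t in os.environ.get("TICKERS", "AAPL").split(",") if t.strip()]
start_date = os.environ.get("START_DATE", "2019-01-01")
interval = os.environ.get("INTERVAL", "1d")  # passed through to yfinance
```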
Deployment Script Internals
resource_setup.sh:
- Validates required `.env` variables
- Registers Azure resource providers: `Microsoft.ContainerInstance`, `Microsoft.ContainerRegistry`
- Creates the resource group (idempotent via `az group create`)
- Creates ACR with the Basic SKU and admin auth enabled
build_and_push.sh:
- Retrieves ACR credentials dynamically via `az acr credential show`
- Builds three Docker images from the root directory:
  - `stock-pipeline-airflow:latest` (Dockerfile.airflow)
  - `stock-pipeline-etl:latest` (Dockerfile.app)
  - `stock-pipeline-frontend:latest` (Dockerfile.frontend)
- Pushes them to ACR using `docker push`
deploy.sh:
- Exports ACR credentials and all `.env` variables
- Runs `envsubst` to template-substitute `aci-stock-pipeline.yaml` into `aci-stock-pipeline-generated.yaml`
- Deploys via `az container create --file`
- Retrieves and displays the public IP and service URLs
Questions? For detailed technical discussions, see the full deployment guide.
