This project is a production-minded demo deployed on Azure: a containerized ETL pipeline that fetches daily stock market data, calculates technical indicators (MACD, RSI, Bollinger Bands), and visualizes results through an interactive dashboard. It showcases cloud deployment fluency, automated orchestration with Apache Airflow, and cost-aware architecture decisions suitable for real-world data teams. The entire stack—database, scheduler, web UI, and frontend—runs on Azure Container Instances with a one-command redeploy (after initial setup) and indicative, region-dependent costs (often under ~$2/day when running 24/7; ~$0.01/day when stopped). Residual ACR and storage charges may apply even when stopped. You can verify the working system in 3 steps (see Quick Start) or review live screenshots below.
Key outcomes:
✅ End-to-end data pipeline on Azure (ingest → transform → load → visualize)
✅ Automated daily scheduling with retry logic and error isolation
✅ Fully containerized and reproducible (Docker + Azure Container Registry)
✅ A concise production-upgrade path (managed database, Key Vault, private networking)
- End-to-end Azure data pipeline: Automated data ingestion from yFinance API → transformation (pandas/NumPy calculations) → PostgreSQL storage → Streamlit visualization
- Parallel orchestration: Apache Airflow with dynamic task mapping runs tickers in parallel on a single node; parallelism depends on allocated vCPU/RAM.
- Production DevOps practices: Multi-stage Docker builds, Azure Container Registry, infrastructure-as-code (YAML descriptors), init container patterns
- Cost awareness: Ephemeral storage for dev/demo (indicative, region-dependent ~$45-65/mo when running; ~$0.30/mo when stopped; ACR/storage costs may persist); clear migration path to managed services
- Reproducibility: Three bash scripts deploy the entire stack from zero to running URLs in ~8 minutes (indicative, machine-dependent).

Automated daily refresh: Parallel processing of multiple stock tickers with automatic retry on failure. Each ticker runs independently—one failure doesn't stop the pipeline.
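For context, retry behavior like this is usually set through Airflow task defaults; the snippet below is a generic illustration with assumed values, not the project's actual DAG configuration.

```python
# Generic illustration of Airflow retry defaults (values are assumptions, not the project's).
from datetime import timedelta

default_args = {
    "retries": 2,                         # re-run a failed per-ticker task automatically
    "retry_delay": timedelta(minutes=5),  # wait between attempts
}
```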

Interactive market insights: Candlestick charts with overlay indicators (EMA, MACD, RSI, Bollinger Bands). Users select tickers, date ranges, and toggle indicators interactively.
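For reference, a chart like this takes only a few lines of Streamlit + Plotly; the snippet below is a simplified sketch with synthetic data (column names and widgets are assumptions, not the project's frontend code).

```python
# Minimal sketch: Plotly candlestick in Streamlit with synthetic data
# (the real frontend queries PostgreSQL; column names here are assumptions).
import pandas as pd
import plotly.graph_objects as go
import streamlit as st

df = pd.DataFrame({
    "date":  pd.date_range("2024-01-01", periods=5),
    "open":  [100, 102, 101, 103, 105],
    "high":  [103, 104, 104, 106, 107],
    "low":   [99, 100, 100, 102, 104],
    "close": [102, 101, 103, 105, 106],
})

ticker = st.selectbox("Ticker", ["AAPL", "MSFT", "NVDA"])
fig = go.Figure(go.Candlestick(
    x=df["date"], open=df["open"], high=df["high"],
    low=df["low"], close=df["close"], name=ticker,
))
st.plotly_chart(fig, use_container_width=True)
```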

Multi-container orchestration: Five application containers (Airflow scheduler, Airflow webserver, database, ETL, frontend) and three init containers (sql-init, airflow-init, pipeline-data-init) managed as a single unit with shared networking and public endpoints.
- Azure CLI
- Docker
- Active Azure subscription with Contributor role
First-time setup (before deployment): Copy env.example to .env and populate Azure subscription ID, region, and credentials. Generate required keys:
```bash
# Airflow Fernet key
python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
# Airflow secret key
python3 -c "import secrets; print(secrets.token_urlsafe(32))"
```

1. Authenticate and set up Azure resources

```bash
az login
cd azure
bash resource_setup.sh   # Creates resource group + container registry (~2-5 min)
```

2. Build and push container images

```bash
bash build_and_push.sh   # Builds 3 images, pushes to Azure (~4-8 min)
```

3. Deploy to Azure Container Instances

```bash
bash deploy.sh           # Launches multi-container group (~3-6 min)
# Output:
# Deployment Complete!
# Airflow UI: http://XXX.XXX.XXX.XXX:8080 (user: admin, password: from .env)
# Streamlit Dashboard: http://XXX.XXX.XXX.XXX:8501
```

After deployment succeeds, run the setup_airflow_variables.py script:

```bash
az container exec -g rg-stock-pipeline -n stock-pipeline-group --container-name airflow-webserver --exec-command "python /opt/app/setup_airflow_variables.py"
```
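A variable-seeding script of this kind typically copies values from the environment into Airflow Variables so DAGs can read them; the snippet below is a hedged guess at the shape, not the actual contents of setup_airflow_variables.py.

```python
# Hedged sketch of seeding Airflow Variables from the environment
# (illustrative only; the actual setup_airflow_variables.py may differ).
import os

from airflow.models import Variable

for key in ("TICKERS", "START_DATE", "INTERVAL"):
    Variable.set(key, os.environ[key])  # DAGs can then read Variable.get("TICKERS")
```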
Architecture:
- Data ingestion: Airflow scheduler triggers daily DAG runs; Python tasks fetch stock data from the yFinance API using the open-source `yfinance` library
- Transformation: Calculates technical indicators (exponential moving averages, MACD, RSI, Bollinger Bands) using pandas DataFrame operations (see the sketch after this list)
- Storage: Loads processed data into PostgreSQL with incremental upsert logic (only new/changed rows); primary key on `(ticker, date)`
- Orchestration: Three DAG options (sequential for simplicity, dynamic with parallel task mapping for production, and a bootstrap DAG for the one-time historical load)
- Visualization: Streamlit frontend queries PostgreSQL and renders interactive Plotly candlestick charts with real-time filtering
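To make the ingest and transform steps concrete, the sketch below shows the general shape of these calculations with `yfinance` and pandas; window lengths, function names, and column names are illustrative assumptions rather than the project's extract/transform modules.

```python
# Illustrative sketch of the fetch + indicator math
# (window lengths, names, and columns are assumptions, not the project's code).
import pandas as pd
import yfinance as yf

def fetch_daily(ticker: str, start: str = "2019-01-01") -> pd.DataFrame:
    """Download daily OHLCV bars and normalize column names to lowercase."""
    df = yf.Ticker(ticker).history(start=start, interval="1d")
    return df.rename(columns=str.lower)

def add_indicators(df: pd.DataFrame) -> pd.DataFrame:
    close = df["close"]

    # MACD: difference of 12/26-period EMAs, with a 9-period signal line
    ema_fast = close.ewm(span=12, adjust=False).mean()
    ema_slow = close.ewm(span=26, adjust=False).mean()
    df["macd"] = ema_fast - ema_slow
    df["macd_signal"] = df["macd"].ewm(span=9, adjust=False).mean()

    # RSI: 14-period rolling-average variant
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    df["rsi"] = 100 - 100 / (1 + gain / loss)

    # Bollinger Bands: 20-period SMA +/- 2 standard deviations
    sma = close.rolling(20).mean()
    std = close.rolling(20).std()
    df["bb_upper"] = sma + 2 * std
    df["bb_lower"] = sma - 2 * std
    return df
```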
Deployment model: All components run as Docker containers in a single Azure Container Instances group with shared localhost networking. Public IP exposes Airflow (port 8080) and Streamlit (port 8501); PostgreSQL remains internal.
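Because every container in the group shares the same network namespace, in-group services reach PostgreSQL over 127.0.0.1; the snippet below is a hedged connection sketch (the exact connection code in the project may differ).

```python
# Hedged sketch: connecting to the in-group PostgreSQL over shared localhost networking
# (connection details are assumptions; the project's code may differ).
import os
import psycopg2

conn = psycopg2.connect(
    host="localhost",  # same container group, shared network namespace
    port=5432,
    dbname=os.environ["POSTGRES_DB"],
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
)
```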
Full technical details: Appendix for Engineers
Azure Container Instances (vs. Kubernetes):
Value: Fastest path from local Docker Compose to cloud deployment; no Kubernetes complexity, no cluster management overhead. Ideal for demonstrating cloud fluency without over-engineering a portfolio project.
Trade-off: Limited auto-scaling vs. Azure Container Apps; acceptable for batch ETL workloads.
Personal note: This is a production-minded demo: I used ACI to minimize setup time and make the pipeline easy to review. For production, I would move to Azure Container Apps and Managed PostgreSQL to gain managed ingress, scale-to-zero patterns, durable storage, and finer lifecycle control than ACI’s container-group model.
Containerized PostgreSQL inside the group (no managed database):
Value: Self-contained deployment with zero external dependencies; entire stack stops/starts as a unit.
Trade-off: Data stored in ephemeral volumes (lost on restart); production workloads need Azure Database for PostgreSQL with automated backups.
Airflow LocalExecutor:
Value: Simpler than CeleryExecutor (no Redis/RabbitMQ); supports parallel task execution for multi-ticker processing.
Trade-off: Parallelism depends on allocated CPU/RAM; suitable for small-to-medium ticker sets on a single node.
Public IP endpoints (no ingress layer):
Value: Direct access without ingress controller or load balancer setup; reduces demo complexity.
Trade-off: No encryption or WAF; production requires Azure Front Door with custom domain + TLS certificates.
envsubst-templated secrets (no Key Vault):
Value: Simple templating with envsubst; no Key Vault provisioning delays.
Trade-off: Secrets in generated YAML files; production requires Azure Key Vault + Managed Identity integration.
Security note:
Templating writes secrets into the generated YAML used for deployment (e.g., aci-stock-pipeline-generated.yaml). Do not commit generated files or store them in shared locations.
| Component | Today (Demo) | Next (Production) | Why Upgrade |
|---|---|---|---|
| Compute | Azure Container Instances (manual restart) | Azure Container Apps (auto-scale to zero) | Cost savings during idle periods; built-in health probes and ingress |
| Database | Containerized Postgres (ephemeral storage) | Azure Database for PostgreSQL Flexible Server | Automated backups, high availability, monitoring, 99.99% SLA |
| Storage | emptyDir volumes (lost on restart) | Azure Files + Blob Storage | Persistent logs, long-term data retention, disaster recovery |
| Secrets | Environment variables in YAML | Azure Key Vault + Managed Identity | Encryption at rest, audit logs, no secrets in deployment descriptors |
- Move DB to Managed Postgres
- Adopt Key Vault + Managed Identity
- Migrate compute to ACA (scale-to-zero)
| Skill | Evidence | Where to Look |
|---|---|---|
| Cloud deployment on Azure | Multi-container orchestration with ACI, private container registry (ACR), resource provisioning scripts | Screenshot: Azure Portal, azure/resource_setup.sh |
| Containerized ETL pipelines | Dockerfile-based builds with multi-stage optimization, environment-based configuration | Dockerfile.app, Dockerfile.airflow, app/src/*.py |
| Workflow orchestration | Apache Airflow with TaskFlow API, dynamic task mapping, incremental processing logic | Screenshot: Airflow Graph, dags/stock_pipeline_dynamic_dag.py |
| Data transformation | Pandas/NumPy calculations (technical indicators), SQL upsert patterns, three-layer validation | app/src/transform.py, app/src/utils.py (validation functions) |
| Interactive dashboards | Streamlit with Plotly charts, real-time filtering, responsive UI design | Screenshot: Dashboard, frontend/app.py |
| Cost awareness | Resource sizing decisions, ephemeral vs. persistent storage trade-offs, stop/start cost optimization | Why These Choices Matter, Demo vs Production |
| Reproducible infrastructure | Bash automation scripts, YAML descriptors with templating, .env configuration pattern | azure/deploy.sh, azure/aci-stock-pipeline.yaml |
```bash
# Airflow scheduler (task execution logs)
az container logs --resource-group rg-stock-pipeline --name stock-pipeline-group --container-name airflow-scheduler --tail 100
# Streamlit frontend
az container logs --resource-group rg-stock-pipeline --name stock-pipeline-group --container-name frontend --tail 50
```

| Symptom | Solution |
|---|---|
| No data in dashboard | Run bootstrap DAG via Airflow UI: DAGs → bootstrap_stock_data_dag → Trigger. Wait ~5 min for initial data load. |
| Airflow UI shows 502 | Wait 2-3 minutes for containers to initialize. Check webserver logs: az container logs ... --container-name airflow-webserver |
| Container group stuck in "Creating" | Verify ACR credentials: az acr credential show --name acrstockpipeline. Retry deployment: bash deploy.sh |
```bash
# Stop containers (minimal storage cost: ~$0.01/day)
az container stop --resource-group rg-stock-pipeline --name stock-pipeline-group
# Delete everything (zero charges)
az group delete --name rg-stock-pipeline --yes --no-wait
```

Note: Deleting the resource group removes ACR and eliminates residual charges. If you keep ACR for later, storage charges may continue.
| Service | Configuration | Monthly Cost (24/7) | Notes |
|---|---|---|---|
| Azure Container Registry | Basic SKU, 2 GB images | ~$5 | Private registry for Docker images |
| ACI Container Group | 2.75 vCPU, 6.5 GB RAM total | ~$45-60 (indicative, region-dependent) | 5 application containers, 3 init containers, pay-per-second billing |
| Outbound bandwidth | yfinance API calls | $0 | Minimal data transfer (~100 MB/day) |
| Total | – | ~$50-65/mo | Indicative and region-dependent; even when stopped, ACR/storage may incur minimal residual costs |
Every DAG run includes three-layer validation:
- Row count check: Ensures each ticker has loaded data (fails pipeline if zero rows)
- Date continuity check: Detects gaps in historical data using SQL `LAG()` window functions (illustrated below)
- Ticker whitelist validation: Prevents unauthorized symbols from being processed
Results logged to Airflow UI with structured JSON output (see validation task logs).
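To illustrate the date-continuity check, a `LAG()`-based gap query can look roughly like the one below; the table and column names follow the schema described in this README, but the project's actual query may differ.

```python
# Rough illustration of a LAG()-based date-continuity check
# (table/column names follow the schema above; the project's exact query may differ).
GAP_CHECK_SQL = """
WITH ordered AS (
    SELECT ticker,
           date,
           LAG(date) OVER (PARTITION BY ticker ORDER BY date) AS prev_date
    FROM price_metrics
)
SELECT ticker, prev_date, date, date - prev_date AS gap_days
FROM ordered
WHERE date - prev_date > 4  -- more than a long weekend suggests missing trading days
ORDER BY ticker, date;
"""
```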
- ✅ Incremental processing: Only fetches new data since last run; automatic backfill if gaps > 20 days
- ✅ Error isolation: Dynamic DAG with per-ticker task mapping—one ticker failure doesn't stop others
- ✅ SQL injection protection: Parameterized queries and table name validation throughout
- ✅ Idempotent operations: Upsert logic with `ON CONFLICT` clauses; safe to re-run on the same dates (see the sketch below)
- ✅ Comprehensive logging: Structured logs in Airflow UI with task-level granularity
- ✅ Environment-based config: No hardcoded credentials; `.env` + Airflow Variables pattern
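The idempotent upsert hinges on the `(ticker, date)` primary key; below is a hedged sketch of that pattern with psycopg2 (the column list and helper names are assumptions, not the project's load code).

```python
# Hedged sketch of the upsert pattern on the (ticker, date) primary key
# (column list and the use of psycopg2.extras.execute_values are assumptions).
from psycopg2.extras import execute_values

UPSERT_SQL = """
INSERT INTO price_metrics (ticker, date, open, high, low, close, macd, rsi)
VALUES %s
ON CONFLICT (ticker, date) DO UPDATE SET
    open  = EXCLUDED.open,
    high  = EXCLUDED.high,
    low   = EXCLUDED.low,
    close = EXCLUDED.close,
    macd  = EXCLUDED.macd,
    rsi   = EXCLUDED.rsi;
"""

def upsert_rows(conn, rows):
    """Idempotent load: re-running on the same dates overwrites instead of duplicating."""
    with conn.cursor() as cur:
        execute_values(cur, UPSERT_SQL, rows)
    conn.commit()
```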
- Full technical documentation (for engineering deep-dive: init containers, DAG internals, YAML anatomy)
- Root README (local development setup, testing instructions)
- Azure Container Instances Docs
- Airflow TaskFlow API Guide
Dual ETL Architecture (Embedded + Standalone)
The deployment uses two ETL execution paths for flexibility:
Embedded ETL (inside the Airflow images):
- Location: `/opt/app` inside the Airflow webserver/scheduler containers
- Usage: All DAGs import from `src.extract`, `src.transform`, `src.load`
- Benefits: Native Airflow integration, proper logging to the UI, zero exec overhead

Standalone ETL (dedicated container):
- Container name: `etl`
- Usage: `az container exec ... --container-name etl --exec-command "python run_pipeline.py"`
- Benefits: Debug the pipeline independently, run ad-hoc data loads, iterate quickly without DAG changes
Typical workflow: Develop with standalone ETL → deploy via DAG execution → troubleshoot with standalone ETL
Three DAG Orchestration Strategies
Sequential DAG:
- Use case: Simple deployments with 3-5 tickers
- Flow: `extract_all → transform_all → load_all → cleanup → validate`
- Trade-off: One failure stops the entire pipeline; lower parallelism

Dynamic DAG (`stock_pipeline_dynamic_dag`):
- Use case: Production workloads with 10+ tickers
- Flow: Dynamic task mapping with the `.expand()` operator; each ticker gets independent task instances (see the sketch below)
- Benefits: Error isolation, per-ticker incremental ranges, automatic backfill detection, SQL injection protection

Bootstrap DAG (`bootstrap_stock_data_dag`):
- Use case: Initial database population
- Flow: Fetches full history (2019-01-01 to yesterday) for all tickers
- Trigger: Manual only (prevents accidental re-runs)
Recommendation: Run bootstrap_stock_data_dag once, then enable stock_pipeline_dynamic_dag for daily operations.
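For readers unfamiliar with dynamic task mapping, the sketch below shows the general `.expand()` pattern in TaskFlow style; it is a simplified illustration, not the project's stock_pipeline_dynamic_dag.py.

```python
# Simplified TaskFlow sketch of per-ticker dynamic task mapping
# (not the project's DAG; task bodies and names are placeholders).
import pendulum
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def stock_pipeline_dynamic_sketch():

    @task
    def list_tickers() -> list[str]:
        return ["AAPL", "MSFT", "NVDA"]  # in the real DAG this would come from configuration

    @task(retries=2)
    def process_ticker(ticker: str) -> str:
        # extract -> transform -> upsert for a single ticker;
        # a failure here fails only this mapped task instance
        return ticker

    process_ticker.expand(ticker=list_tickers())  # one mapped task instance per ticker

stock_pipeline_dynamic_sketch()
```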
Container Initialization Flow
Init-like containers (ACI starts all containers together; app containers wait on readiness checks and a file signal before launching services):
- `sql-init` (Postgres 17.5): Creates the `price_metrics` table schema with indexes
- `pipeline-data-init` (Alpine): Creates the `/tmp/pipeline_data/` directory and signals readiness
- `airflow-init` (Airflow image): Runs `airflow db migrate` to set up the metadata database
App containers (start after readiness checks and file signal):
- `db`: PostgreSQL server (ports internal to the container group)
- `airflow-webserver`: Airflow UI on port 8080 (exposed publicly)
- `airflow-scheduler`: Task execution engine (internal)
- `etl`: Standalone pipeline executor (manual use only)
- `frontend`: Streamlit dashboard on port 8501 (exposed publicly)
All share localhost networking and emptyDir volumes for pipeline artifacts.
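Because ACI starts every container in the group at once, the "wait on a file signal" step can be as simple as the loop below; this is an illustrative sketch (the signal path is an assumption), not the project's actual entrypoint code.

```python
# Illustrative sketch of the file-signal wait pattern (not the project's actual entrypoint).
import pathlib
import time

SIGNAL_FILE = pathlib.Path("/tmp/pipeline_data/.ready")  # path is an assumption

def wait_for_signal(timeout_s: int = 300) -> None:
    """Block until an init container has written the readiness marker."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if SIGNAL_FILE.exists():
            return
        time.sleep(2)
    raise TimeoutError(f"readiness signal {SIGNAL_FILE} never appeared")
```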
Environment Variables Reference
Required variables in .env:
```bash
# Azure Resources
AZURE_SUBSCRIPTION_ID=12345678-...     # Find via: az account show
AZURE_LOCATION=eastus                  # Use: az account list-locations -o table
AZURE_RESOURCE_GROUP=rg-stock-pipeline
AZURE_ACR_NAME=acrstockpipeline        # Globally unique, 5-50 alphanumeric

# Database
POSTGRES_USER=stockuser
POSTGRES_PASSWORD=SecureP@ss123!       # 16+ chars recommended
POSTGRES_DB=stockdb

# Airflow
AIRFLOW_USER=airflow
AIRFLOW_PASSWORD=AirflowPass456!
AIRFLOW_DB=airflow
AIRFLOW_ADMIN_USERNAME=admin           # Web UI login
AIRFLOW_ADMIN_PASSWORD=AdminPass789!
[email protected]
AIRFLOW_FERNET_KEY=<base64-key>        # Generate via cryptography.fernet
AIRFLOW_SECRET_KEY=<random-string>     # Generate via secrets.token_urlsafe(32)

# ETL Configuration
TICKERS=AAPL,MSFT,NVDA                 # Comma-separated symbols
START_DATE=2019-01-01                  # Historical start date
INTERVAL=1d                            # yFinance interval (1d, 1wk, 1mo)
```

Security note: Never commit .env to version control. The provided env.example contains placeholders only.
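At runtime the containers read these values from the environment; as a simple illustration (not the project's exact configuration code), the ETL settings can be parsed like this:

```python
# Simple illustration of reading the ETL configuration from the environment
# (the project's actual config handling may differ).
import os

tickers = [t.strip() for t in os.environ.get("TICKERS", "AAPL").split(",") if t.strip()]
start_date = os.environ.get("START_DATE", "2019-01-01")
interval = os.environ.get("INTERVAL", "1d")  # passed through to yfinance
```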
Deployment Script Internals
resource_setup.sh:
- Validates required `.env` variables
- Registers Azure resource providers: `Microsoft.ContainerInstance`, `Microsoft.ContainerRegistry`
- Creates the resource group (idempotent via `az group create`)
- Creates ACR with the Basic SKU and admin auth enabled
build_and_push.sh:
- Retrieves ACR credentials dynamically via `az acr credential show`
- Builds three Docker images from the root directory:
  - `stock-pipeline-airflow:latest` (Dockerfile.airflow)
  - `stock-pipeline-etl:latest` (Dockerfile.app)
  - `stock-pipeline-frontend:latest` (Dockerfile.frontend)
- Pushes them to ACR using `docker push`
deploy.sh:
- Exports ACR credentials and all `.env` variables
- Runs `envsubst` to template-substitute `aci-stock-pipeline.yaml` into `aci-stock-pipeline-generated.yaml`
- Deploys via `az container create --file`
- Retrieves and displays the public IP and service URLs
Questions? For detailed technical discussions, see the full deployment guide.
