gordonmurray/flink_paimon_duckdb_rill

Real-Time Analytics Pipeline with Flink, Paimon, and Rill

A complete streaming analytics stack that captures MySQL changes via CDC, stores them in Apache Paimon format, and visualizes them with Rill dashboards.

🚀 What You Get

  • Real-Time CDC: Captures every MySQL change using Flink CDC
  • Lake Storage: Stores data in Apache Paimon format on S3-compatible storage
  • Live Dashboard: Rill analytics with automated catalog management
  • Automated Fixes: Sidecar container handles DuckDB catalog prefix issues
  • One Command Start: Everything runs with docker compose up

πŸ—οΈ Architecture

MySQL β†’ Flink CDC β†’ Apache Paimon β†’ MinIO β†’ Rill Dashboard
  ↑                                           ↓
  Manual inserts                          Analytics

Components:

  • MySQL/MariaDB: Source database with sample product data
  • Apache Flink: Real-time CDC processing engine
  • Apache Paimon: Lake storage format optimized for streaming
  • MinIO: S3-compatible object storage
  • Rill: Modern analytics dashboard with DuckDB engine
  • Rill Patcher: Automated sidecar handling catalog prefix issues

⚡ Quick Start

Prerequisites

  • Docker and Docker Compose
  • 8GB+ RAM recommended
  • Ports 3000, 3306, 8081, 9000-9001 available

1. Clone and Start

git clone <your-repo>
cd flink_paimon_duckdb_rill
docker compose up -d

2. Initialize the CDC Pipeline

./setup_cdc.sh

3. Open the Dashboard

Navigate to: http://localhost:3000

The dashboard will show live data with automatic 60-second refresh.

🧪 Test Real-Time Updates

Add new products to see live updates:

# Add some products
docker exec mariadb mysql -u root -prootpassword -e "
INSERT INTO mydatabase.products (name, price) VALUES
('New Product 1', 99.99),
('New Product 2', 199.99);"

# Check MySQL count
docker exec mariadb mysql -u root -prootpassword -e "SELECT COUNT(*) FROM mydatabase.products;"

# Wait 60 seconds for dashboard to refresh
# You'll see the updated count automatically!

🔧 How It Works

CDC Pipeline

  1. MySQL Changes: Any INSERT/UPDATE/DELETE in MySQL is captured
  2. Flink Processing: Flink CDC reads the MySQL binlog in real-time
  3. Paimon Storage: Changes are written to Paimon tables in MinIO
  4. Rill Dashboard: Visualizes data with 60-second refresh cycle

The Catalog Prefix Solution

DuckDB creates random catalog prefixes (e.g., main8514e79c) on startup. Our rill-patcher sidecar:

  1. Waits for Rill to start
  2. Discovers the current catalog alias via SQL
  3. Patches the model file with the correct prefix
  4. Refreshes data every 60 seconds
  5. Re-patches if Rill restarts with a new prefix
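
The patching step itself boils down to a string rewrite. Here is a minimal sketch of that logic in Python; the regex, function name, and alias format are illustrative assumptions, not code taken from rill-patcher.sh:

```python
import re

# DuckDB's alias looks like "main" plus a random hex suffix, e.g. main8514e79c.
# This pattern matches any such stale prefix in a model file.
CATALOG_RE = re.compile(r"\bmain[0-9a-f]{8}\b")

def patch_model(sql_text: str, current_alias: str) -> str:
    """Rewrite every stale catalog prefix to the alias DuckDB is using now."""
    return CATALOG_RE.sub(current_alias, sql_text)

stale = "SELECT * FROM main8514e79c.paimon_products"
print(patch_model(stale, "maindeadbeef"))
# → SELECT * FROM maindeadbeef.paimon_products
```

The real sidecar runs this kind of rewrite in a loop, re-discovering the alias each cycle so a Rill restart is handled automatically.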

Why Apache Paimon?

  • Optimized for streaming updates with ACID guarantees
  • Supports both batch and streaming workloads
  • Compatible with multiple query engines
  • Efficient storage with automatic compaction

📊 Monitoring

Service Health Checks

# Check all containers
docker ps

# Monitor CDC job
curl -s http://localhost:8081/jobs | jq

# Test Rill Dashboard API
curl -s "http://localhost:3000/v1/instances/default/query" \
  -H "Content-Type: application/json" \
  -d '{"sql":"SELECT COUNT(*) FROM paimon_products"}'

# View Paimon files in MinIO
docker exec minio mc ls --recursive local/warehouse/

Data Flow Verification

# MySQL data
docker exec mariadb mysql -u root -prootpassword -e "SELECT COUNT(*) FROM mydatabase.products;"

# MinIO storage
docker exec minio mc ls --recursive local/warehouse/cdc_db.db/products_sink/

# Rill dashboard count
curl -s "http://localhost:3000/v1/instances/default/query" \
  -H "Content-Type: application/json" \
  -d '{"sql":"SELECT COUNT(*) FROM paimon_products"}' | jq '.data[0]'
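
The query endpoint's JSON can be unpacked in a few lines of Python. The response shape below is an assumption inferred from the `jq '.data[0]'` filter above, not the official Rill schema:

```python
import json

# Hypothetical response body, shaped to match the jq filter used above.
raw = '{"data": [{"count(*)": 42}]}'

def first_row(body: str) -> dict:
    """Return the first result row from a Rill query response."""
    return json.loads(body)["data"][0]

print(first_row(raw))  # → {'count(*)': 42}
```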

πŸ› οΈ Development

Project Structure

├── docker-compose.yml          # Complete stack definition
├── conf/
│   └── flink-conf.yaml         # Flink configuration
├── rill/
│   ├── connectors/             # DuckDB S3 configuration
│   ├── models/                 # SQL model definitions
│   ├── metrics/                # Metrics definitions
│   └── dashboards/             # Dashboard configs
├── rill-patcher.sh             # Automated catalog management
├── duckdb/
│   └── test_s3.py              # DuckDB query examples
├── sql/
│   ├── init.sql                # MySQL initial data
│   └── setup_paimon_cdc.sql    # CDC pipeline setup
└── setup_cdc.sh                # CDC initialization script

Key Configuration Files

Flink Config (conf/flink-conf.yaml):

  • Configures Flink job manager and task manager
  • Sets checkpointing intervals
  • Defines S3/MinIO credentials
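
A sketch of what that file covers, using standard Flink configuration keys; the values are illustrative only, not copied from conf/flink-conf.yaml:

```yaml
# Illustrative values -- see conf/flink-conf.yaml in the repo for the real ones.
jobmanager.rpc.address: jobmanager
taskmanager.numberOfTaskSlots: 2
execution.checkpointing.interval: 10s
s3.endpoint: http://minio:9000
s3.path.style.access: true
s3.access-key: minioadmin
s3.secret-key: minioadmin
```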

CDC Setup (sql/setup_paimon_cdc.sql):

  • Creates Paimon catalog
  • Defines source MySQL table
  • Creates sink Paimon table
  • Starts CDC pipeline
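
In Flink SQL terms, those four steps look roughly like the sketch below. Column definitions and the warehouse path are assumptions based on the sample data; the real statements live in sql/setup_paimon_cdc.sql:

```sql
-- 1. Create the Paimon catalog backed by MinIO (warehouse path is illustrative).
CREATE CATALOG paimon_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 's3://warehouse/'
);

-- 2. Define the MySQL CDC source table.
CREATE TEMPORARY TABLE products_source (
    id INT,
    name STRING,
    price DECIMAL(10, 2),
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'mariadb',
    'port' = '3306',
    'username' = 'root',
    'password' = 'rootpassword',
    'database-name' = 'mydatabase',
    'table-name' = 'products'
);

-- 3. Create the Paimon sink table.
CREATE TABLE IF NOT EXISTS paimon_catalog.cdc_db.products_sink (
    id INT,
    name STRING,
    price DECIMAL(10, 2),
    PRIMARY KEY (id) NOT ENFORCED
);

-- 4. Start the CDC pipeline.
INSERT INTO paimon_catalog.cdc_db.products_sink
SELECT * FROM products_source;
```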

🚨 Troubleshooting

Common Issues

CDC Pipeline not starting

# Check if the job started:
curl -s http://localhost:8081/jobs | jq

# If not, run setup again:
./setup_cdc.sh

No data in MinIO

# Check Flink job status
curl -s http://localhost:8081/jobs

# Restart CDC setup
./setup_cdc.sh

Verify data flow

# Check Flink job metrics
curl -s http://localhost:8081/jobs/<job-id>/metrics

# List Paimon files
docker exec minio mc ls local/warehouse/cdc_db.db/

Clean Restart

# Complete reset
docker compose down -v
docker compose up -d
./setup_cdc.sh
# Wait 2-3 minutes for full initialization

Built with: Apache Flink • Apache Paimon • Rill • DuckDB
