Production-ready distributed data ingestion from Google BigQuery to Harper.
About Harper: Harper is a distributed application platform that unifies database, cache, and application server.
Quick Deploy: Launch this component on Harper Fabric - no credit card required, free tier available.
- ✅ Multi-table support - Sync multiple BigQuery tables simultaneously with independent settings
- ✅ Column selection - Reduce costs by fetching only needed columns from BigQuery
- ✅ Horizontal scalability - Linear throughput increase with cluster size
- ✅ Adaptive batch sizing - Automatically adjusts based on sync lag (initial/catchup/steady phases)
- ✅ Failure recovery - Local checkpoints enable independent node recovery
- ✅ Exponential backoff - Smart retry logic with jitter for transient BigQuery errors
- ✅ Production-ready - Battle-tested with comprehensive logging for Grafana observability
- ✅ No coordination overhead - Each node independently determines its workload via modulo partitioning
Install Harper and add this plugin:
```bash
# Install Harper
# See https://docs.harperdb.io/docs/getting-started/quickstart

# Clone this plugin
cd your-harper-project
npm install @harperdb/bigquery-ingestor
```

Edit `config.yaml` with your BigQuery connection and tables:
```yaml
bigquery:
  projectId: your-gcp-project
  credentials: service-account-key.json
  location: US

  tables:
    # Example: Sync user events
    - id: user_events
      dataset: production
      table: user_events
      timestampColumn: event_time
      columns: [event_time, user_id, event_type, properties]
      targetTable: UserEvents
      sync:
        initialBatchSize: 10000
        catchupBatchSize: 1000
        steadyBatchSize: 500

    # Example: Sync application logs
    - id: app_logs
      dataset: production
      table: application_logs
      timestampColumn: timestamp
      columns: ['*'] # Fetch all columns
      targetTable: AppLogs
      sync:
        initialBatchSize: 5000
        catchupBatchSize: 500
        steadyBatchSize: 100
```

Key Configuration Points:

- `timestampColumn` - The timestamp field used for incremental sync (must be monotonically increasing)
- `columns` - Array of column names to fetch, or `['*']` for all columns
- `targetTable` - The Harper table where data will be synced
- Each BigQuery table must sync to a different Harper table (see Configuration for details)
Start Harper with the plugin:
```bash
harper dev .
```

The plugin will:
- Calculate worker IDs from cluster instances and thread counts
- Determine this worker's partition assignments
- Begin syncing data from BigQuery
- Create checkpoints for recovery
- Continuously poll for new data
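The worker-ID calculation above can be sketched roughly as follows; the function name and input shapes are illustrative assumptions, not the plugin's actual internals:

```javascript
// Hypothetical sketch: derive a deterministic worker ID by enumerating every
// hostname-workerIndex pair in the cluster, sorting for a stable order, and
// taking this worker's position in that order.
function computeWorkerId(instances, myHostname, myWorkerIndex) {
  // instances: [{ hostname, threadCount }, ...] for every cluster instance
  const workers = [];
  for (const { hostname, threadCount } of instances) {
    for (let i = 0; i < threadCount; i++) {
      workers.push(`${hostname}-${i}`);
    }
  }
  workers.sort(); // deterministic ordering, identical on every node
  return {
    workerId: workers.indexOf(`${myHostname}-${myWorkerIndex}`),
    clusterSize: workers.length,
  };
}
```

Because every node computes the same sorted list, each worker can find its own ID and the total cluster size without any coordination.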
Monitor sync status via the REST API:
```bash
curl http://localhost:9926/SyncControl
```

Each Harper worker in the cluster:
- Calculates worker ID from cluster instances and thread counts per instance (deterministic ordering by hostname-workerIndex)
- Determines cluster size from total workers across all instances
- Partitions the workload using modulo: pulls only records where `hash(timestamp) % clusterSize == workerID`
- Syncs independently with local checkpoints
- Relies on Harper replication for data distribution across the cluster
Key Benefits:
- No coordination between nodes (no distributed locks, no leader election)
- Linear scalability - add nodes to increase throughput proportionally
- Independent failure recovery - nodes restart without affecting others
- Deterministic partitioning - same timestamp always routes to same node
The plugin automatically adjusts batch sizes based on sync lag:
- Initial phase (lag > 1 hour): Large batches for fast catch-up
- Catchup phase (lag 10 min - 1 hour): Medium batches to close the gap
- Steady phase (lag < 10 min): Small batches for low latency
Poll intervals also adapt - faster during catch-up, slower when near real-time.
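A rough sketch of the phase selection, using the lag thresholds above (`selectPhase` and `batchSizeFor` are illustrative names, not the plugin's API):

```javascript
// Pick the sync phase from the current lag, per the documented thresholds.
function selectPhase(lagMs) {
  const HOUR = 60 * 60 * 1000;
  const TEN_MIN = 10 * 60 * 1000;
  if (lagMs > HOUR) return 'initial';    // large batches for fast catch-up
  if (lagMs > TEN_MIN) return 'catchup'; // medium batches to close the gap
  return 'steady';                       // small batches for low latency
}

// Map the phase back to the per-table batch sizes from config.yaml.
function batchSizeFor(lagMs, sync) {
  const sizes = {
    initial: sync.initialBatchSize,
    catchup: sync.catchupBatchSize,
    steady: sync.steadyBatchSize,
  };
  return sizes[selectPhase(lagMs)];
}
```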
Sync multiple BigQuery tables to Harper simultaneously:
```yaml
bigquery:
  projectId: your-project
  credentials: service-account-key.json
  location: US

  # Optional retry configuration
  maxRetries: 5
  initialRetryDelay: 1000 # milliseconds

  tables:
    - id: orders
      dataset: ecommerce
      table: orders
      timestampColumn: created_at
      columns: [created_at, order_id, customer_id, total, status]
      targetTable: Orders
      sync:
        initialBatchSize: 10000
        catchupBatchSize: 1000
        steadyBatchSize: 500
        pollInterval: 30000 # 30 seconds

    - id: payments
      dataset: ecommerce
      table: payments
      timestampColumn: payment_time
      columns: ['*']
      targetTable: Payments
      sync:
        initialBatchSize: 5000
        catchupBatchSize: 500
        steadyBatchSize: 100
        pollInterval: 60000 # 60 seconds
```

Important Constraints:
Each BigQuery table MUST sync to a different Harper table. Multiple BigQuery tables syncing to the same Harper table will cause:
- Record ID collisions and data overwrites
- Validation failures (can only validate one source)
- Checkpoint confusion (different sync states)
- Schema conflicts (mixed field sets)
If you need to combine data from multiple BigQuery tables, sync them to separate Harper tables and join at query time.
Reduce BigQuery costs by fetching only needed columns:
```yaml
tables:
  - id: large_table
    dataset: analytics
    table: events
    timestampColumn: event_time
    # Only fetch these 5 columns instead of all 50+
    columns: [event_time, user_id, event_type, page_url, session_id]
    targetTable: Events
```

Rules:

- `timestampColumn` MUST be included in the columns list
- Use `['*']` to fetch all columns (the default if omitted)
- Column selection reduces network transfer and query costs
The plugin implements exponential backoff with jitter for transient BigQuery errors:
```yaml
bigquery:
  maxRetries: 5           # Maximum retry attempts (default: 5)
  initialRetryDelay: 1000 # Initial delay in ms, doubles each retry (default: 1000)
```

Retry Behavior:
- Retryable errors: Rate limits, quota exceeded, internal errors, 503, 429
- Non-retryable errors: Invalid queries, permissions, schema mismatches - fail immediately
- Backoff strategy: Initial delay × 2^attempt with random jitter, capped at 30 seconds
- Logging: Warnings on retry attempts, errors on final failure
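The documented backoff strategy could look roughly like this (helper names and the error-classification details are assumptions, not the plugin's actual code):

```javascript
// Exponential backoff with full jitter, capped at 30 seconds.
function retryDelay(attempt, initialDelay = 1000, capMs = 30000) {
  const base = Math.min(initialDelay * 2 ** attempt, capMs);
  // Full jitter: pick uniformly in [0, base) to spread retries apart.
  return Math.random() * base;
}

// Retry a BigQuery call on transient errors only; fail fast otherwise.
async function withRetry(fn, { maxRetries = 5, initialDelay = 1000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Illustrative classification: treat rate limits and server errors
      // as retryable; everything else (bad query, permissions) fails now.
      const retryable = [429, 500, 503].includes(err.code);
      if (!retryable || attempt >= maxRetries) throw err;
      await new Promise((r) => setTimeout(r, retryDelay(attempt, initialDelay)));
    }
  }
}
```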
For backward compatibility, the plugin still supports the single-table format:
```yaml
bigquery:
  projectId: your-project
  dataset: your_dataset
  table: your_table
  timestampColumn: timestamp
  credentials: service-account-key.json
  location: US
  columns: ['*']
```

This is automatically converted to a multi-table configuration internally, with `targetTable: BigQueryData`.
BigQuery records are stored as-is in Harper tables:
```graphql
type YourTable @table {
  id: ID! @primaryKey             # Generated from timestamp + hash
  _syncedAt: String @createdTime  # When the record was synced
  # All your BigQuery columns appear here at the top level
}
```

Example stored record:

```json
{
  "id": "a1b2c3d4e5f6g7h8",
  "_syncedAt": "2025-12-15T20:00:00Z",
  "event_time": "2025-12-15T19:59:00Z",
  "user_id": "user_12345",
  "event_type": "page_view",
  "page_url": "/products/widget",
  "session_id": "sess_abc123"
}
```

All BigQuery fields are directly queryable at the top level, without nested paths.
Want to test the plugin before connecting your own data? Use our maritime data synthesizer to generate realistic vessel tracking data.
The synthesizer creates production-like workloads with:
- 100,000+ vessels with realistic movement patterns
- Multiple related tables (positions, events, metadata)
- Global scale with 29 major ports worldwide
- Physics-based navigation
- Automatic retention management
Quick Start:
```bash
# Generate test data (writes TO BigQuery)
npx maritime-data-synthesizer initialize realistic

# Start the plugin (reads FROM BigQuery)
harper dev .
```

Documentation:
- 5-Minute Quick Start - Start generating data immediately
- Maritime Synthesizer Guide - Comprehensive documentation
- System Overview - How plugin + synthesizer work together
- Why Maritime Data? - Rationale for vessel tracking data
The maritime synthesizer provides a realistic, production-grade test environment. Vessel tracking data mirrors common BigQuery workloads:
- ✅ High volume continuous flow - 144K+ records/day sustained
- ✅ Temporal ordering constraints - Chronological data with late arrivals
- ✅ Complex schema - Geospatial coords, metadata, multi-table relationships
- ✅ Production use case - Matches IoT streams, event tracking, time-series data
Perfect for testing sync performance, multi-table coordination, and distributed workload partitioning before production deployment.
The plugin provides cluster-wide sync control via REST API. All commands replicate across nodes automatically.
Available Commands:
```bash
# Get current status (GET)
curl http://localhost:9926/SyncControl \
  -u admin:HarperRocks!

# Start sync across the entire cluster (POST)
curl -X POST http://localhost:9926/SyncControl \
  -u admin:HarperRocks! \
  -H "Content-Type: application/json" \
  -d '{"action": "start"}'

# Stop sync across the entire cluster (POST)
curl -X POST http://localhost:9926/SyncControl \
  -u admin:HarperRocks! \
  -H "Content-Type: application/json" \
  -d '{"action": "stop"}'

# Run validation across the cluster (POST)
curl -X POST http://localhost:9926/SyncControl \
  -u admin:HarperRocks! \
  -H "Content-Type: application/json" \
  -d '{"action": "validate"}'
```

Status Response Format:
```json
{
  "global": {
    "command": "start",
    "commandedAt": "2025-12-16T20:30:00Z",
    "commandedBy": "node1-0",
    "version": 42
  },
  "worker": {
    "nodeId": "node1-0",
    "running": true,
    "tables": [
      { "tableId": "vessel_positions", "running": true, "phase": "steady" },
      { "tableId": "port_events", "running": true, "phase": "catchup" }
    ],
    "failedEngines": []
  },
  "uptime": 3600,
  "version": "2.0.0"
}
```

- `global`: Cluster-wide sync command state (replicated across all nodes via HarperDB)
- `worker`: This specific worker thread's status
- `nodeId`: Identifies the worker as `hostname-workerIndex`
- `tables`: Status per sync engine (one per configured table)
- `failedEngines`: Any engines that failed to start
Run validation to verify data integrity across the cluster:
```bash
curl -X POST http://localhost:9926/SyncControl \
  -u admin:HarperRocks! \
  -H "Content-Type: application/json" \
  -d '{"action": "validate"}'
```

Validation performs three checks per table:

1. Progress Check - Verifies sync is advancing and checks for stalled workers
   - Status: `healthy`, `lagging`, `severely_lagging`, `stalled`, `no_checkpoint`
2. Smoke Test - Confirms recent data (last 5 minutes) is queryable
   - Status: `healthy`, `no_recent_data`, `query_failed`, `table_not_found`
3. Spot Check - Validates data integrity bidirectionally
   - Checks that Harper records exist in BigQuery (detects phantom records)
   - Checks that BigQuery records exist in Harper (detects missing records)
   - Status: `healthy`, `issues_found`, `no_data`, `check_failed`
View validation results:
```bash
# Get recent validation audits
curl http://localhost:9926/SyncAudit/ \
  -u admin:HarperRocks!
```

Each validation run creates audit records with:

- `timestamp` - When the validation ran
- `nodeId` - Which worker performed the validation
- `status` - Overall status: `healthy`, `issues_detected`, or `error`
- `checkResults` - JSON with detailed results per table and check
```bash
# REST API
curl http://localhost:9926/SyncCheckpoint/ \
  -u admin:HarperRocks!
```

Or query via SQL:

```sql
-- Check sync progress per node
SELECT * FROM SyncCheckpoint ORDER BY nodeId;

-- Calculate current lag
SELECT
  nodeId,
  lastTimestamp,
  (UNIX_TIMESTAMP(NOW()) - UNIX_TIMESTAMP(lastTimestamp)) AS lag_seconds,
  phase
FROM SyncCheckpoint;
```

All synced tables are accessible via the REST API:
```bash
# Query vessel positions (note trailing slash)
curl http://localhost:9926/VesselPositions/ \
  -u admin:HarperRocks!

# Query port events
curl http://localhost:9926/PortEvents/ \
  -u admin:HarperRocks!

# Query vessel metadata
curl http://localhost:9926/VesselMetadata/ \
  -u admin:HarperRocks!
```

Important: REST endpoints require a trailing slash (`/TableName/`) to return data arrays. Without the trailing slash, you get table metadata instead of records.
A comprehensive Postman collection is included for testing all endpoints:
```bash
# Import into Postman
bigquery-ingestor_postman.json
```

Collection includes:
- Cluster Control - Start, stop, validate commands
- Status Monitoring - Check sync status and worker health
- Data Verification - Query PortEvents, VesselMetadata, VesselPositions
- Checkpoint Inspection - View sync progress per node
- Audit Review - Check validation results
Authentication: Uses Basic Auth with default credentials (admin / HarperRocks!). Update the collection variables if using different credentials.
```sql
-- Query your synced data
SELECT * FROM UserEvents
WHERE event_time > '2025-12-01T00:00:00Z'
ORDER BY event_time DESC
LIMIT 100;

-- Check sync latency
SELECT
  id,
  event_time AS source_time,
  _syncedAt AS synced_time,
  TIMESTAMPDIFF(SECOND, event_time, _syncedAt) AS latency_seconds
FROM UserEvents
ORDER BY _syncedAt DESC
LIMIT 10;
```

Symptoms: Node shows as running but no new records appear.
Checks:
- Verify BigQuery credentials are valid
- Check network connectivity to BigQuery API
- Query the checkpoint table for errors: `SELECT * FROM SyncCheckpoint WHERE nodeId = 'your-node'`
- Check logs for permission errors or API failures
Symptoms: Sync lag increasing over time
Solutions:
- Increase batch sizes in config for faster catch-up
- Add more nodes to the cluster for horizontal scaling
- Benchmark your workload - use the benchmarking tool to determine optimal cluster sizing
- Reduce columns - fetch only needed columns to reduce network transfer
Symptoms: Validation shows missing records
Causes:
- Partition key collisions (rare, hash-based)
- Some nodes not running or stuck
- Checkpoint corruption
Resolution:
- Check that all nodes are running: `SELECT DISTINCT nodeId FROM SyncCheckpoint`
- Review checkpoint timestamps for anomalies
- Check validation logs for specific issues: `SELECT * FROM SyncAudit WHERE status = 'failed'`
Symptoms: Repeated API failures in logs
Common Issues:
| Error | Cause | Solution |
|---|---|---|
| `403 Permission Denied` | Service account lacks permissions | Add `bigquery.jobs.create` and `bigquery.tables.getData` |
| `429 Too Many Requests` | Rate limit exceeded | Reduce batch sizes or poll frequency |
| `503 Service Unavailable` | Temporary BigQuery outage | The plugin automatically retries with backoff |
| Invalid query | Schema mismatch | Verify `timestampColumn` exists and has the correct type |
Symptoms: Plugin fails to start or sync doesn't begin
Common Mistakes:
- `timestampColumn` not in columns list - The timestamp column must be included in the `columns` array
- Multiple tables → same `targetTable` - Each BigQuery table needs a unique Harper table
- Invalid credentials path - Ensure the service account key file exists at the specified path
- Missing `location` - Defaults to `US`, but must match your BigQuery dataset's location
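A minimal sketch of startup checks that would catch the mistakes above (`validateConfig` is a hypothetical helper, not the plugin's API):

```javascript
// Check one table entry: the timestamp column must be fetched unless
// the wildcard ['*'] is used.
function validateTableConfig(t) {
  const errors = [];
  const cols = t.columns ?? ['*'];
  if (cols[0] !== '*' && !cols.includes(t.timestampColumn)) {
    errors.push(`table "${t.id}": timestampColumn "${t.timestampColumn}" missing from columns`);
  }
  return errors;
}

// Check the whole config: every BigQuery table needs a unique targetTable.
function validateConfig(tables) {
  const errors = tables.flatMap(validateTableConfig);
  const targets = tables.map((t) => t.targetTable);
  if (new Set(targets).size !== targets.length) {
    errors.push('multiple BigQuery tables map to the same targetTable');
  }
  return errors;
}
```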
To determine the optimal cluster size and configuration for your specific use case, use the BigQuery Ingestor Benchmarking Tool (coming soon). The tool will:
- Measure actual throughput with your data volume and record sizes
- Test different batch size configurations
- Recommend optimal cluster sizing based on your target latency
- Identify storage and network bottlenecks specific to your workload
Until the benchmarking tool is available, start with the batch size recommendations below and monitor your sync lag to determine if scaling is needed.
Note: Harper doesn't autoscale. Add/remove nodes manually via Fabric UI or self-hosted configuration. Cluster size changes require workload rebalancing (see Limitations).
Learn more about Harper's storage architecture
Adjust based on your workload:
| Record Size | Network | Initial Batch | Catchup Batch | Steady Batch |
|---|---|---|---|---|
| Small (<1KB) | Fast | 10000 | 1000 | 500 |
| Medium (1-10KB) | Fast | 5000 | 500 | 100 |
| Large (>10KB) | Fast | 1000 | 100 | 50 |
| Any | Slow | Reduce by 50% | Reduce by 50% | Reduce by 50% |
Your service account needs:
- `bigquery.jobs.create` - Create query jobs
- `bigquery.tables.getData` - Read table data
- Column selection - Fetch only needed columns to reduce query costs
- Polling intervals - Adjust `pollInterval` based on latency requirements
- Batch sizes - Larger batches = fewer queries = lower costs (but higher latency)
- Partitioned tables - Use timestamp partitioning in BigQuery for faster queries
- Stable cluster topology - Adding/removing nodes requires workload rebalancing (v3.0 will add dynamic rebalancing)
- Monotonic timestamps - Timestamp column must be monotonically increasing for correct partitioning
- Schema evolution - Adding columns works, but removing/renaming requires manual intervention
- One direction - Plugin syncs FROM BigQuery TO Harper (not bidirectional)
See ROADMAP.md for future plans.
- Dynamic rebalancing for autoscaling
- Enhanced monitoring dashboards
- Dynamic Harper table creation via Operations API
We welcome contributions! See CONTRIBUTING.md for guidelines.
Areas we'd love help with:
- Production deployment documentation
- Integration tests
- Performance benchmarks
- Video tutorials
- More real-world configuration examples
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]
Apache 2.0 - See LICENSE
Get Started: Deploy on Harper Fabric - free tier available, no credit card required.
Learn More: Harper Documentation | harperdb.io