Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
34 commits
Select commit Hold shift + click to select a range
74a069d
Add SyncControlState table for distributed control
irjudson Dec 16, 2025
959a677
Add SyncControlManager class structure
irjudson Dec 16, 2025
05a145a
Implement state loading in SyncControlManager
irjudson Dec 16, 2025
6db267a
Implement subscription loop with error recovery
irjudson Dec 16, 2025
f42d249
Implement command processing with concurrency guard
irjudson Dec 16, 2025
7a8f202
Add engine control methods
irjudson Dec 16, 2025
cd032fd
Fix code review issues in Task 6
irjudson Dec 16, 2025
de229cb
Integrate SyncControlManager into handleApplication
irjudson Dec 16, 2025
a382c48
Update SyncControl resource for cluster-wide control
irjudson Dec 16, 2025
2ec915f
Document new cluster-wide status format in README
irjudson Dec 16, 2025
1118586
Add Postman collection for distributed sync control testing
irjudson Dec 17, 2025
94222e4
Add safety checks for uninitialized controlManager
irjudson Dec 17, 2025
a56fda1
Rename package from harper-bigquery-sync to bigquery-ingestor
irjudson Dec 17, 2025
8bf869b
Configure basic auth at collection level in Postman
irjudson Dec 17, 2025
fb4a678
Fix cluster discovery to use workerCount instead of server.nodes
irjudson Dec 17, 2025
043abb5
Add multi-node cluster discovery with debug logging
irjudson Dec 17, 2025
da35f6a
Fix cluster discovery to include current node
irjudson Dec 17, 2025
df9af78
Fix prettier formatting for long log lines
irjudson Dec 17, 2025
4d6f0f7
Convert commandedAt to ISO string for database storage
irjudson Dec 17, 2025
d953972
Fix subscription to use global tables instead of this.tables
irjudson Dec 17, 2025
abd9a67
Add globals import to sync-control-manager.js
irjudson Dec 17, 2025
2155b30
Remove JSON.stringify from globals debug logging
irjudson Dec 17, 2025
4c9fce2
Add logger availability checks to globals.js
irjudson Dec 17, 2025
074388d
Add safe logger wrapper to config-loader.js for CLI compatibility
irjudson Dec 17, 2025
88b7010
Fix CLI to generate 30 days of historical data by default
irjudson Dec 18, 2025
ff20b54
Implement clear and clean commands for multi-table synthesizer
irjudson Dec 18, 2025
23dcb86
Make clear and cleanup commands free tier compatible
irjudson Dec 18, 2025
3f30974
Fix ID generation to use strings instead of numbers
irjudson Dec 18, 2025
380ee20
Add detailed query logging to debug 0 results issue
irjudson Dec 18, 2025
63713f7
Add detailed record logging to debug missing fields issue
irjudson Dec 18, 2025
9492b67
Fix timestamp generation to use microsecond precision for even MOD pa…
irjudson Dec 19, 2025
d914a9c
Use put (upsert) instead of create to handle duplicate IDs
irjudson Dec 19, 2025
a7dd12e
Fix validation service cluster discovery and update API documentation
irjudson Dec 22, 2025
f5a98c8
Fix code formatting issues
irjudson Dec 22, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -211,7 +211,7 @@ The BigQuery plugin integrates with HarperDB. When modifying plugin code:
- `src/validation.js` - Data validation and auditing
- `src/query-builder.js` - SQL query construction
- `src/config-loader.js` - Configuration parsing (single/multi-table)
- `schema/harper-bigquery-sync.graphql` - GraphQL schema
- `schema/bigquery-ingestor.graphql` - GraphQL schema
- `test/` - Unit tests for all core functionality

## Synthesizer Development
Expand Down
149 changes: 136 additions & 13 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ Install Harper and add this plugin:

# Clone this plugin
cd your-harper-project
npm install @harperdb/harper-bigquery-sync
npm install @harperdb/bigquery-ingestor
```

### 2. Configure (Your Data)
Expand Down Expand Up @@ -307,22 +307,115 @@ Perfect for testing sync performance, multi-table coordination, and distributed

## Monitoring & Operations

### Check Sync Status
### Distributed Sync Control

Query the REST API:
The plugin provides cluster-wide sync control via REST API. All commands replicate across nodes automatically.

**Available Commands:**

```bash
# Get current status
curl http://localhost:9926/SyncControl
# Get current status (GET)
curl http://localhost:9926/SyncControl \
-u admin:HarperRocks!

# Control sync
# Start sync across entire cluster (POST)
curl -X POST http://localhost:9926/SyncControl \
-u admin:HarperRocks! \
-H "Content-Type: application/json" \
-d '{"action": "start"}' # or "stop"
-d '{"action": "start"}'

# Stop sync across entire cluster (POST)
curl -X POST http://localhost:9926/SyncControl \
-u admin:HarperRocks! \
-H "Content-Type: application/json" \
-d '{"action": "stop"}'

# Run validation across cluster (POST)
curl -X POST http://localhost:9926/SyncControl \
-u admin:HarperRocks! \
-H "Content-Type: application/json" \
-d '{"action": "validate"}'
```

**Status Response Format:**

```json
{
"global": {
"command": "start",
"commandedAt": "2025-12-16T20:30:00Z",
"commandedBy": "node1-0",
"version": 42
},
"worker": {
"nodeId": "node1-0",
"running": true,
"tables": [
{ "tableId": "vessel_positions", "running": true, "phase": "steady" },
{ "tableId": "port_events", "running": true, "phase": "catchup" }
],
"failedEngines": []
},
"uptime": 3600,
"version": "2.0.0"
}
```

- **global**: Cluster-wide sync command state (replicated across all nodes via HarperDB)
- **worker**: This specific worker thread's status
- **nodeId**: Identifies worker as `hostname-workerIndex`
- **tables**: Status per sync engine (one per configured table)
- **failedEngines**: Any engines that failed to start

### Data Validation

Run validation to verify data integrity across the cluster:

```bash
curl -X POST http://localhost:9926/SyncControl \
-u admin:HarperRocks! \
-H "Content-Type: application/json" \
-d '{"action": "validate"}'
```

**Validation performs three checks per table:**

1. **Progress Check** - Verifies sync is advancing, checks for stalled workers
- Status: `healthy`, `lagging`, `severely_lagging`, `stalled`, `no_checkpoint`

2. **Smoke Test** - Confirms recent data (last 5 minutes) is queryable
- Status: `healthy`, `no_recent_data`, `query_failed`, `table_not_found`

3. **Spot Check** - Validates data integrity bidirectionally
- Checks if Harper records exist in BigQuery (detects phantom records)
- Checks if BigQuery records exist in Harper (detects missing records)
- Status: `healthy`, `issues_found`, `no_data`, `check_failed`

**View validation results:**

```bash
# Get recent validation audits
curl http://localhost:9926/SyncAudit/ \
-u admin:HarperRocks!
```

Each validation run creates audit records with:

- `timestamp` - When validation ran
- `nodeId` - Which worker performed the validation
- `status` - Overall status: `healthy`, `issues_detected`, or `error`
- `checkResults` - JSON with detailed results per table and check

### View Checkpoints

```bash
# REST API
curl http://localhost:9926/SyncCheckpoint/ \
-u admin:HarperRocks!
```

Or query via SQL:

```sql
-- Check sync progress per node
SELECT * FROM SyncCheckpoint ORDER BY nodeId;
Expand All @@ -336,15 +429,45 @@ SELECT
FROM SyncCheckpoint;
```

### View Validation Results
### Access Synced Data

```sql
-- Check recent audits
SELECT * FROM SyncAudit
WHERE timestamp > NOW() - INTERVAL '1 hour'
ORDER BY timestamp DESC;
All synced tables are accessible via REST API:

```bash
# Query vessel positions (note trailing slash)
curl http://localhost:9926/VesselPositions/ \
-u admin:HarperRocks!

# Query port events
curl http://localhost:9926/PortEvents/ \
-u admin:HarperRocks!

# Query vessel metadata
curl http://localhost:9926/VesselMetadata/ \
-u admin:HarperRocks!
```

**Important:** REST endpoints require a trailing slash (`/TableName/`) to return data arrays. Without the trailing slash, you get table metadata instead of records.

### Postman Collection

A comprehensive Postman collection is included for testing all endpoints:

```bash
# Import into Postman
bigquery-ingestor_postman.json
```

**Collection includes:**

- **Cluster Control** - Start, stop, validate commands
- **Status Monitoring** - Check sync status and worker health
- **Data Verification** - Query PortEvents, VesselMetadata, VesselPositions
- **Checkpoint Inspection** - View sync progress per node
- **Audit Review** - Check validation results

**Authentication:** Uses Basic Auth with default credentials (admin / HarperRocks!). Update the collection variables if using different credentials.

### Query Synced Data

```sql
Expand Down
20 changes: 10 additions & 10 deletions ROADMAP.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,9 +11,9 @@ We've shipped v2.0 with comprehensive multi-table support, column selection, and
- ✅ **Multi-table support** - Sync multiple BigQuery tables simultaneously with independent settings
- ✅ **Column selection** - Reduce costs by fetching only needed columns from BigQuery
- ✅ **Per-table configuration** - Independent batch sizes, sync intervals, and strategies per table
- ✅ **Exponential backoff retry logic** - Smart retry with jitter for transient BigQuery errors ([#3](https://github.com/HarperFast/harper-bigquery-sync/issues/3))
- ✅ **Comprehensive logging** - Structured logging throughout codebase for Grafana observability ([#11](https://github.com/HarperFast/harper-bigquery-sync/issues/11))
- ✅ **Optional streaming insert API** - Configurable streaming inserts for production deployments ([#8](https://github.com/HarperFast/harper-bigquery-sync/issues/8))
- ✅ **Exponential backoff retry logic** - Smart retry with jitter for transient BigQuery errors ([#3](https://github.com/HarperFast/bigquery-ingestor/issues/3))
- ✅ **Comprehensive logging** - Structured logging throughout codebase for Grafana observability ([#11](https://github.com/HarperFast/bigquery-ingestor/issues/11))
- ✅ **Optional streaming insert API** - Configurable streaming inserts for production deployments ([#8](https://github.com/HarperFast/bigquery-ingestor/issues/8))
- ✅ **Multi-table validation** - Independent validation and monitoring per table
- ✅ **Backward compatibility** - Single-table format still supported

Expand All @@ -26,15 +26,15 @@ We've shipped v2.0 with comprehensive multi-table support, column selection, and

### Maritime Data Synthesizer

- ✅ **Multi-table orchestrator** - Generate realistic data for multiple related tables ([#6](https://github.com/HarperFast/harper-bigquery-sync/issues/6))
- ✅ **Multi-table orchestrator** - Generate realistic data for multiple related tables ([#6](https://github.com/HarperFast/bigquery-ingestor/issues/6))
- ✅ **Rolling window mode** - Automatic data window maintenance and backfill
- ✅ **100K+ vessel simulation** - Realistic maritime tracking data at global scale
- ✅ **Physics-based movement** - Realistic navigation patterns
- ✅ **Automatic retention** - Configurable rolling window with cleanup

### Project Quality

- ✅ **Memory leak fixes** - Journey tracking system optimized ([#5](https://github.com/HarperFast/harper-bigquery-sync/issues/5))
- ✅ **Memory leak fixes** - Journey tracking system optimized ([#5](https://github.com/HarperFast/bigquery-ingestor/issues/5))
- ✅ **CI/CD pipeline** - Automated lint, test, and format checks
- ✅ **Comprehensive documentation** - User guides, API docs, design documents
- ✅ **Project history** - Development milestones and evolution tracking
Expand All @@ -44,15 +44,15 @@ We've shipped v2.0 with comprehensive multi-table support, column selection, and

### Multi-Threaded Ingestion

- [ ] **Multi-threaded ingestion per node** ([#9](https://github.com/HarperFast/harper-bigquery-sync/issues/9))
- [ ] **Multi-threaded ingestion per node** ([#9](https://github.com/HarperFast/bigquery-ingestor/issues/9))
- Better CPU utilization on multi-core nodes
- Code already supports durable thread identity via `hostname-workerIndex`
- Thread-level checkpointing for fine-grained recovery
- Automatic thread scaling based on lag

### Dynamic Rebalancing

- [ ] **Dynamic rebalancing for autoscaling** ([#10](https://github.com/HarperFast/harper-bigquery-sync/issues/10))
- [ ] **Dynamic rebalancing for autoscaling** ([#10](https://github.com/HarperFast/bigquery-ingestor/issues/10))
- Detect topology changes → pause → recalculate → resume
- Graceful node additions/removals without manual intervention
- Zero-downtime scaling capabilities
Expand All @@ -69,7 +69,7 @@ We've shipped v2.0 with comprehensive multi-table support, column selection, and

### Dynamic Schema Management

- [ ] **Dynamic Harper table creation via Operations API** ([#7](https://github.com/HarperFast/harper-bigquery-sync/issues/7))
- [ ] **Dynamic Harper table creation via Operations API** ([#7](https://github.com/HarperFast/bigquery-ingestor/issues/7))
- Currently requires manual schema.graphql definition
- Could dynamically create tables based on BigQuery schema at runtime
- Enables automatic table creation from BigQuery metadata
Expand All @@ -82,7 +82,7 @@ These are potential enhancements without specific commitments:

### Production Operations

- [ ] **Production deployment documentation** ([#4](https://github.com/HarperFast/harper-bigquery-sync/issues/4))
- [ ] **Production deployment documentation** ([#4](https://github.com/HarperFast/bigquery-ingestor/issues/4))
- Fabric deployment guide with one-click setup
- Self-hosted installation for on-premise clusters
- Monitoring dashboards (Grafana/CloudWatch templates)
Expand Down Expand Up @@ -143,7 +143,7 @@ These are potential enhancements without specific commitments:

Want to help build v3.0 or tackle future considerations? See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

Check out [open issues on GitHub](https://github.com/HarperFast/harper-bigquery-sync/issues) for specific tasks you can pick up.
Check out [open issues on GitHub](https://github.com/HarperFast/bigquery-ingestor/issues) for specific tasks you can pick up.

---

Expand Down
Loading